
Meta’s plain Maverick AI model falls behind in benchmark rankings

Meta’s plain Maverick AI model underperforms against rivals after benchmark controversy, raising concerns over testing fairness.

Earlier this week, you might have noticed that Meta found itself in hot water over how it benchmarked its Maverick AI model. The tech giant had submitted an experimental version of the model, one that hadn’t been released to the public, to a popular AI benchmark platform called LM Arena. The goal was to show off high performance, but the move didn’t sit well with many in the community.

As a result, LM Arena’s team issued an apology and quickly updated their policies. They also decided to score only the original, unmodified version of Meta’s Maverick going forward. As it turns out, that version doesn’t perform very well compared to its competitors.

Meta’s vanilla Maverick doesn’t impress

When LM Arena ranked the model’s basic version, “Llama-4-Maverick-17B-128E-Instruct”, it ended up below several other major AI models. These included OpenAI’s GPT-4o, Anthropic’s Claude 3.5 Sonnet, and Google’s Gemini 1.5 Pro. Most of those models have already been around for some time, which makes Maverick’s lower performance even more noticeable.

So, why did the experimental version do so well while the standard one lagged behind? Meta explained that the version they originally submitted, “Llama-4-Maverick-03-26-Experimental”, was tuned specifically to work well in conversations. This fine-tuning for “conversationality” likely helped the model perform better on LM Arena.

That’s because LM Arena relies on human reviewers to compare model responses and decide which is better. A model made to chat smoothly is more likely to be chosen in this kind of comparison, even if it’s not smarter or more useful in other ways.
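To make that concrete, arena-style leaderboards typically turn those head-to-head human votes into Elo-style ratings, the same maths used to rank chess players. The sketch below shows how quickly a few wins can move a rating; the starting rating and K-factor are illustrative assumptions, not LM Arena’s actual parameters.

```python
# Minimal Elo-style rating sketch for pairwise model "battles".
# Illustrative assumptions: the starting rating of 1000 and K-factor of 32
# are placeholders, not LM Arena's actual parameters.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_ratings(rating_a: float, rating_b: float, a_won: bool,
                   k: float = 32.0) -> tuple[float, float]:
    """Update both ratings after one human vote (ties ignored here)."""
    e_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    return rating_a + k * (score_a - e_a), rating_b + k * (e_a - score_a)

# Example: two models start level; the chattier one wins three votes in a row.
maverick, rival = 1000.0, 1000.0
for _ in range(3):
    maverick, rival = update_ratings(maverick, rival, a_won=True)
print(round(maverick), round(rival))  # 1044 956
```

Because a win on style counts exactly the same as a win on substance, a model tuned to chat pleasantly can climb the ladder quickly, which is precisely the dynamic critics are pointing at.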

Benchmark games raise questions

This situation has sparked a debate about how benchmarks are used in artificial intelligence. LM Arena, while popular, isn’t always considered a reliable way to measure how good a model is. It’s useful for some things, like comparing how well different models hold a conversation, but it doesn’t give a full picture of how they perform in real-world situations.

There’s also concern about fairness. When a company fine-tunes a model to score higher on a benchmark, it might not reflect how the model behaves in your hands as a developer or user. It can lead to false impressions and make it hard to judge whether a model is useful for your needs.

In short, scoring high on a single benchmark doesn’t always mean a model is better. It just means it’s better at that specific test.

Meta responds and looks ahead

In response to the controversy, a Meta spokesperson said that the company is always testing different versions of its models. In their words, the version that performed well was “a chat-optimised version we experimented with that also performs well on LM Arena.”

Meta has now released the open-source version of Llama 4, which includes the plain Maverick model. The company looks forward to seeing how developers customise the model to suit their needs.
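For developers who want to try that, a minimal sketch of running one chat turn against the open-weights checkpoint with Hugging Face Transformers might look like the following. The repository id is assumed from the model name Meta published, access to the weights is gated, and the exact model class for this architecture may differ, so treat this as the shape of the workflow rather than a recipe.

```python
# Hypothetical sketch: one chat turn with the plain Maverick checkpoint via
# Hugging Face Transformers. The repo id is assumed from the published model
# name; gated access, hardware needs, and loading options will vary.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-4-Maverick-17B-128E-Instruct"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Format a single user turn with the model's own chat template.
messages = [{"role": "user",
             "content": "In one sentence, what does an AI benchmark measure?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```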

“We’re excited to see what they will build and look forward to their ongoing feedback,” the spokesperson said.

Now that the original version is out in the wild, it’s up to the wider AI community to test it and see whether it can shine in areas beyond friendly chats.

