Meta’s plain Maverick AI model falls behind in benchmark rankings

Meta’s plain Maverick AI model underperforms against rivals after benchmark controversy, raising concerns over testing fairness.

Earlier this week, you might have noticed that Meta found itself in hot water over how it used its Maverick AI model. The tech giant had submitted an experimental version of the model, one that hasn’t been released to the public, to a popular AI benchmark platform called LM Arena. The goal was to show off high performance, but the move didn’t sit well with many in the community.

As a result, LM Arena’s team issued an apology and quickly updated their policies. They also decided to score only the original, unmodified version of Meta’s Maverick going forward. As it turns out, that version doesn’t perform very well compared to its competitors.

Meta’s vanilla Maverick doesn’t impress

When LM Arena ranked the model’s basic version, “Llama-4-Maverick-17B-128E-Instruct”, it ended up below several other major AI models. These included OpenAI’s GPT-4o, Anthropic’s Claude 3.5 Sonnet, and Google’s Gemini 1.5 Pro. Most of those models have already been around for some time, which makes Maverick’s lower performance even more noticeable.

So, why did the experimental version do so well while the standard one lagged behind? Meta explained that the version they originally submitted—“Llama-4-Maverick-03-26-Experimental”—was tuned specifically to work well in conversations. This fine-tuning for “conversationality” likely helped the model perform better on LM Arena.

That’s because LM Arena relies on human reviewers to compare model responses and decide which is better. A model made to chat smoothly is more likely to be chosen in this kind of comparison, even if it’s not smarter or more useful in other ways.
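To see why a chat-tuned model can climb this kind of leaderboard, here is a minimal sketch of a pairwise-preference rating system in Python. This is not LM Arena’s actual scoring code; the Elo-style update, the model names and the sequence of votes below are illustrative assumptions only.

# Minimal sketch of a pairwise-vote leaderboard. The Elo-style update is an
# illustrative assumption, not LM Arena's real implementation; the models and
# votes are hypothetical.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under an Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_ratings(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Shift both ratings after one human preference vote."""
    expected_a = expected_score(rating_a, rating_b)
    actual_a = 1.0 if a_won else 0.0
    delta = k * (actual_a - expected_a)
    return rating_a + delta, rating_b - delta

# Hypothetical run: a chat-optimised model that wins most head-to-head votes
# gains rating on almost every comparison, whatever its other capabilities.
chatty, rival = 1500.0, 1500.0
for vote_for_chatty in [True, True, True, False, True]:
    chatty, rival = update_ratings(chatty, rival, vote_for_chatty)
print(round(chatty), round(rival))

Because each vote only records which reply a human preferred, a model that consistently reads as friendlier keeps accumulating rating, regardless of how it performs on reasoning, coding or other tasks.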

Benchmark games raise questions

This situation has sparked a debate about how benchmarks are used in artificial intelligence. LM Arena, while popular, isn’t always considered a reliable way to measure how good a model is. It’s useful for some things, like comparing how well different models hold a conversation, but it doesn’t give a full picture of how they perform in real-world situations.

There’s also concern about fairness. When a company fine-tunes a model to score higher on a benchmark, the results might not reflect how the model behaves in your hands as a developer or user. That gap can create false impressions and make it hard to judge whether a model actually fits your needs.

In short, scoring high on a single benchmark doesn’t always mean a model is better. It just means it’s better at that specific test.

Meta responds and looks ahead

In response to the controversy, a Meta spokesperson said that the company is always testing different versions of its models. In their words, the version that performed well was “a chat-optimised version we experimented with that also performs well on LM Arena.”

Meta has now released the open-source version of Llama 4, which includes the plain Maverick model. The company looks forward to seeing how developers customise the model to suit their needs.

“We’re excited to see what they will build and look forward to their ongoing feedback,” the spokesperson said.

Now that the original version is out in the wild, it’s up to the wider AI community to test it and see whether it can shine in areas beyond friendly chats.
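For developers who want to run that test themselves, a minimal sketch of loading the open-weights release with the Hugging Face transformers library might look like the lines below. The repository id is an assumption inferred from the model name quoted above; the exact id, licence terms and hardware requirements should be confirmed against Meta’s official release.

# Hypothetical local test of the open-weights Maverick release.
# The repository id is assumed, not confirmed; adjust it to whatever
# Meta publishes, and expect multi-GPU hardware for a model this size.
from transformers import pipeline

MODEL_ID = "meta-llama/Llama-4-Maverick-17B-128E-Instruct"  # assumed repo id

generator = pipeline("text-generation", model=MODEL_ID, device_map="auto")

prompt = "In two sentences, explain what a pairwise preference benchmark measures."
result = generator(prompt, max_new_tokens=80)
print(result[0]["generated_text"])

A quick trial like this only probes casual generation; judging the plain Maverick fairly would also mean running it on task-specific evaluations beyond conversational quality.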
