Meta’s plain Maverick AI model falls behind in benchmark rankings

Meta’s plain Maverick AI model underperforms against rivals after benchmark controversy, raising concerns over testing fairness.

Earlier this week, you might have noticed that Meta found itself in hot water over how it used its Maverick AI model. The tech giant had submitted an experimental version of the model, one that hasn’t been released to the public, to a popular AI benchmark platform called LM Arena. The goal was to show off high performance, but the move didn’t sit well with many in the community.

As a result, LM Arena’s team issued an apology and quickly updated their policies. They also decided to score only the original, unmodified version of Meta’s Maverick going forward. As it turns out, that version doesn’t perform very well compared to its competitors.

Meta’s vanilla Maverick doesn’t impress

When LM Arena ranked the model's basic version, "Llama-4-Maverick-17B-128E-Instruct", it ended up below several other major AI models. These included OpenAI's GPT-4o, Anthropic's Claude 3.5 Sonnet, and Google's Gemini 1.5 Pro. Most of those models have already been around for some time, which makes Maverick's lower performance even more noticeable.

So, why did the experimental version do so well while the standard one lagged behind? Meta explained that the version they originally submitted—“Llama-4-Maverick-03-26-Experimental”—was tuned specifically to work well in conversations. This fine-tuning for “conversationality” likely helped the model perform better on LM Arena.

That’s because LM Arena relies on human reviewers to compare model responses and decide which is better. A model made to chat smoothly is more likely to be chosen in this kind of comparison, even if it’s not smarter or more useful in other ways.
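To make that dynamic concrete, here is a minimal sketch of how a preference-vote leaderboard can be scored: each human vote is treated as a head-to-head "battle", and an Elo-style update nudges the preferred model's rating up and the other's down. The model names, starting ratings, and K-factor below are illustrative assumptions, not LM Arena's actual implementation, whose published methodology differs in detail.

```python
# Minimal Elo-style sketch of scoring pairwise "battles" on a preference leaderboard.
# Names, starting ratings, and K-factor are illustrative, not LM Arena's real setup.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A is preferred over model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, a_wins: bool, k: float = 32.0) -> tuple[float, float]:
    """Shift both ratings toward the observed human vote."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_wins else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1 - s_a) - (1 - e_a))

ratings = {"chat-tuned-model": 1000.0, "rival-model": 1000.0}

# Simulated votes: True means the first (chat-tuned) model was preferred.
votes = [True, True, True, False, True]
for first_wins in votes:
    ratings["chat-tuned-model"], ratings["rival-model"] = update(
        ratings["chat-tuned-model"], ratings["rival-model"], first_wins
    )

print(ratings)  # the model that wins most side-by-side votes climbs quickly
```

A model tuned to win these side-by-side comparisons rises up such a board quickly, even if nothing else about it has improved, which is exactly the effect described above.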

Benchmark games raise questions

This situation has sparked a debate about how benchmarks are used in artificial intelligence. LM Arena, while popular, isn’t always considered a reliable way to measure how good a model is. It’s useful for some things, like comparing how well different models hold a conversation, but it doesn’t give a full picture of how they perform in real-world situations.

There's also concern about fairness. When a company fine-tunes a model specifically to score higher on a benchmark, the resulting score may not reflect how the model behaves in your hands as a developer or user. That can create false impressions and make it hard to judge whether a model actually suits your needs.

In short, scoring high on a single benchmark doesn’t always mean a model is better. It just means it’s better at that specific test.

Meta responds and looks ahead

In response to the controversy, a Meta spokesperson said that the company is always testing different versions of its models. In their words, the version that performed well was "a chat-optimised version we experimented with that also performs well on LM Arena."

Meta has now released the open-source version of Llama 4, which includes the plain Maverick model. The company looks forward to seeing how developers customise the model to suit their needs.

"We're excited to see what they will build and look forward to their ongoing feedback," the spokesperson said.

Now that the original version is out in the wild, it's up to the wider AI community to test it and see whether it can shine in areas beyond friendly chats.
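For developers who want to run that test themselves, the sketch below shows one way to load the released weights with Hugging Face transformers. The repository ID is assumed to mirror the model name used in the rankings; running it requires approved access to the weights, a recent transformers release, and substantial GPU memory, since Maverick is a large mixture-of-experts model.

```python
# Minimal sketch of trying the released Maverick weights with Hugging Face transformers.
# The repo ID below is an assumption based on the model name in the rankings; gated
# access on Hugging Face and multiple high-memory GPUs are likely required.
from transformers import pipeline

chat = pipeline(
    "text-generation",
    model="meta-llama/Llama-4-Maverick-17B-128E-Instruct",  # assumed repository ID
    device_map="auto",   # spread the mixture-of-experts weights across available GPUs
    torch_dtype="auto",
)

messages = [
    {"role": "user", "content": "Summarise the trade-offs of benchmark-tuned models."}
]
print(chat(messages, max_new_tokens=200)[0]["generated_text"])
```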
