
Meta’s plain Maverick AI model falls behind in benchmark rankings

Meta’s plain Maverick AI model underperforms against rivals after benchmark controversy, raising concerns over testing fairness.

Earlier this week, you might have noticed that Meta found itself in hot water over how it used its Maverick AI model. The tech giant had submitted an experimental version of the model, one that had not been released to the public, to a popular AI benchmark platform called LM Arena. The goal was to show off high performance, but the move didn’t sit well with many in the community.

As a result, LM Arena’s team issued an apology and quickly updated their policies. They also decided to score only the original, unmodified version of Meta’s Maverick going forward. As it turns out, that version doesn’t perform very well compared to its competitors.

Meta’s vanilla Maverick doesn’t impress

When LM Arena ranked the model’s basic version, “Llama-4-Maverick-17B-128E-Instruct”, it ended up below several other major AI models, including OpenAI’s GPT-4o, Anthropic’s Claude 3.5 Sonnet, and Google’s Gemini 1.5 Pro. Most of those models have been available for some time, which makes Maverick’s lower ranking even more noticeable.

So, why did the experimental version do so well while the standard one lagged behind? Meta explained that the version it originally submitted, “Llama-4-Maverick-03-26-Experimental”, was tuned specifically to work well in conversations. This fine-tuning for “conversationality” likely helped the model perform better on LM Arena.

That’s because LM Arena relies on human reviewers to compare model responses and decide which is better. A model made to chat smoothly is more likely to be chosen in this kind of comparison, even if it’s not smarter or more useful in other ways.
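The article doesn’t spell out how those individual votes become a leaderboard, but arena-style rankings are commonly built from exactly this kind of head-to-head preference data, often with an Elo-style rating. The sketch below is a simplified illustration of that idea, not LM Arena’s actual scoring code; the model names and the K constant are hypothetical.

```python
# A minimal Elo-style sketch of how pairwise human votes can become a
# leaderboard. Illustrative only: this is not LM Arena's real pipeline,
# and the model names and K value below are hypothetical.

K = 32  # update step size, a common Elo default


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def record_vote(ratings: dict, winner: str, loser: str) -> None:
    """Apply one head-to-head human vote to the ratings table."""
    exp = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += K * (1.0 - exp)
    ratings[loser] -= K * (1.0 - exp)


# Two hypothetical entrants starting from the same baseline rating.
ratings = {"model-chatty": 1000.0, "model-plain": 1000.0}

# If reviewers consistently prefer the chattier answers, that model's
# rating climbs even though the votes say nothing about other skills.
for _ in range(20):
    record_vote(ratings, winner="model-chatty", loser="model-plain")

print(ratings)  # model-chatty ends up well above model-plain
```

Because the rating only tracks which answer reviewers preferred, a model tuned to win the vote climbs the board regardless of how it fares on reasoning or coding tasks, which is exactly the concern raised here.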

Benchmark games raise questions

This situation has sparked a debate about how benchmarks are used in artificial intelligence. LM Arena, while popular, isn’t always considered a reliable way to measure how good a model is. It’s useful for some things, like comparing how well different models hold a conversation, but it doesn’t give a full picture of how they perform in real-world situations.

There’s also concern about fairness. When a company fine-tunes a model to score higher on a benchmark, it might not reflect how the model behaves in your hands as a developer or user. It can lead to false impressions and make it hard to judge whether a model is useful for your needs.

In short, scoring high on a single benchmark doesn’t always mean a model is better. It just means it’s better at that specific test.

Meta responds and looks ahead

In response to the controversy, a Meta spokesperson said that the company is always testing different versions of its models. In their words, the version that performed well was “a chat-optimised version we experimented with that also performs well on LM Arena.”

Meta has now released the open-source version of Llama 4, which includes the plain Maverick model. The company looks forward to seeing how developers customise the model to suit their needs.

“We’re excited to see what they will build and look forward to their ongoing feedback,” the spokesperson said.

Now that the original version is out in the wild, it’s up to the wider AI community to test it and see whether it can shine in areas beyond friendly chats.

