Wednesday, 21 May 2025

OpenAI’s latest reasoning AI models are more prone to making mistakes

OpenAI’s new o3 and o4-mini AI models perform better in some areas but hallucinate more often than their predecessors, raising concerns.

You might expect newer AI models to make fewer mistakes, but OpenAI’s latest releases, o3 and o4-mini, are proving otherwise. These two “reasoning models,” launched recently by the maker of ChatGPT, hallucinate more often than many of the company’s older systems. In artificial intelligence, a “hallucination” is when a model confidently gives an answer that sounds right but is made up.

This development has caught many experts by surprise. Historically, each new model in OpenAI’s lineup has hallucinated slightly less than its predecessor. But o3 and o4-mini break that trend. According to OpenAI’s internal tests, they hallucinate more than previous reasoning models such as o1, o1-mini, and o3-mini, and more than general-purpose models such as GPT-4o.

What’s worrying is that OpenAI doesn’t yet understand why this is happening.

More answers, more problems

In a technical report, OpenAI admitted that “more research is needed” to understand why hallucinations are on the rise with these new reasoning models. While o3 and o4-mini do better in some areas, such as coding and maths, they also tend to make more claims overall. That means they make more correct claims but also more incorrect or imagined ones.

For example, on PersonQA, an in-house benchmark created by OpenAI to test how well models know facts about people, o3 hallucinated 33% of the time. That is more than double the rate of o1 (16%) and o3-mini (14.8%). The smaller o4-mini fared even worse, hallucinating 48% of the time on the same test.
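To make the comparison concrete, the short Python sketch below divides each new model's PersonQA hallucination rate by each older model's rate. It uses only the four percentages quoted above; the `rates` dictionary and the `ratio` helper are illustrative names, not part of any OpenAI tooling.

```python
# PersonQA hallucination rates as quoted in this article, in percent.
rates = {"o1": 16.0, "o3-mini": 14.8, "o3": 33.0, "o4-mini": 48.0}


def ratio(model: str, baseline: str) -> float:
    """Return how many times higher the model's rate is than the baseline's."""
    return rates[model] / rates[baseline]


# Compare each new model against each older reasoning model.
for model in ("o3", "o4-mini"):
    for baseline in ("o1", "o3-mini"):
        print(f"{model} vs {baseline}: {ratio(model, baseline):.2f}x")
```

On these figures, o3 hallucinates a little over twice as often as either older model, and o4-mini about three times as often.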

Third-party researchers are also raising concerns. A nonprofit AI lab, Transluce, reported that o3 made up actions it claimed to have taken. In one case, o3 said it ran code on a 2021 MacBook Pro outside of ChatGPT, something it simply can’t do. Neil Chowdhury, a Transluce researcher and former OpenAI employee, suggested that the type of training used for the o-series models may make this problem worse instead of reducing it.

Sarah Schwettmann, Transluce’s co-founder, said o3’s higher hallucination rate could limit its usefulness. And Kian Katanforoosh, a Stanford professor and CEO of the tech upskilling company Workera, said his team has noticed o3 making up website links. The model often provides links that don’t lead anywhere when clicked.

Why hallucinations matter for real-world use

While some argue that hallucinations help AI models be more creative, they can be a major risk in professional settings. Imagine a law firm using AI to draft contracts: if the AI inserts false information, it could have serious consequences.

One possible solution being explored is giving AI models access to web search. OpenAI’s GPT-4o, when paired with web search, reached 90% accuracy on another benchmark called SimpleQA. Adding search tools to reasoning models like o3 and o4-mini might help reduce hallucinations, though this comes with trade-offs, such as sharing your prompts with third-party services.

The growing problem of hallucinations comes as the AI world shifts its focus to reasoning models. These models offer strong performance without needing massive computing power. But as this new wave of AI develops, it’s clear that better reasoning doesn’t always mean better accuracy.

“We’re constantly researching how to reduce hallucinations and improve model reliability,” said Niko Felix, a spokesperson for OpenAI. “It’s a major priority for us moving forward.”

As AI continues to evolve, the balance between intelligence and reliability remains a tricky challenge that researchers are racing to solve.
