NVIDIA Blackwell redefines AI inference performance with record-breaking InferenceMAX results

NVIDIA Blackwell leads the new InferenceMAX benchmarks with unmatched AI performance, 15x ROI, and record-breaking efficiency.

NVIDIA’s Blackwell platform has topped the new SemiAnalysis InferenceMAX v1 benchmarks, demonstrating record performance and industry-leading efficiency in real-world AI workloads. The results highlight the platform’s ability to deliver superior return on investment (ROI) and dramatically reduce total cost of ownership (TCO), reinforcing its position as the preferred choice for large-scale AI inference.

InferenceMAX v1 is the first independent benchmark to measure the total cost of compute across a wide range of models and scenarios, reflecting the real-world demands of modern AI. As applications evolve from single-response outputs to complex, multi-step reasoning, the economics of inference are becoming critical. Blackwell’s strong performance underscores how NVIDIA’s full-stack approach meets these demands, combining advanced hardware with continuous software optimisation.

A key result from the benchmarks shows that a US$5 million investment in an NVIDIA GB200 NVL72 system can generate US$75 million in DeepSeek R1 (DSR1) token revenue, representing a 15-fold ROI. NVIDIA’s B200 system also achieves a cost of just two cents per million tokens on the open-source gpt-oss model, a fivefold reduction in cost per token within two months.
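The headline figures above reduce to simple arithmetic, sketched below for illustration; the function names are hypothetical and the benchmark's own revenue model is more detailed than this.

```python
def roi_multiple(investment_usd: float, revenue_usd: float) -> float:
    """Return revenue expressed as a multiple of the initial investment."""
    return revenue_usd / investment_usd

def cost_per_million_tokens(total_cost_usd: float, tokens_served: int) -> float:
    """Cost in US dollars per one million tokens served."""
    return total_cost_usd / (tokens_served / 1_000_000)

# GB200 NVL72 figures quoted in the benchmark results:
print(roi_multiple(5_000_000, 75_000_000))  # 15.0, the quoted 15-fold ROI

# Two cents per million tokens implies 50 million tokens per US dollar:
print(1.0 / 0.02)  # 50.0
```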

“Inference is where AI delivers value every day,” said Ian Buck, vice president of hyperscale and high-performance computing at NVIDIA. “These results show that NVIDIA’s full-stack approach gives customers the performance and efficiency they need to deploy AI at scale.”

Full-stack innovation drives performance gains

NVIDIA’s leadership in inference performance is built on deep collaboration with the open-source community and continuous hardware-software co-design. Partnerships with OpenAI, Meta, and DeepSeek AI ensure that major models like gpt-oss 120B, Llama 3.3 70B, and DeepSeek R1 are optimised for NVIDIA’s infrastructure, enabling organisations to run the latest models more efficiently.

The company’s TensorRT-LLM v1.0 library represents a major advance in performance, using parallelisation techniques and NVIDIA NVLink Switch’s 1,800 GB/s bidirectional bandwidth to accelerate inference. New techniques such as speculative decoding in the gpt-oss-120b-Eagle3-v2 model further boost efficiency, tripling throughput to 30,000 tokens per second per GPU and achieving 100 tokens per second per user.

For large-scale models such as Llama 3.3 70B, which require extensive computational resources, the NVIDIA Blackwell B200 sets new performance records in the InferenceMAX v1 benchmarks. It delivers more than 10,000 tokens per second per GPU and 50 tokens per second per user, quadrupling throughput compared with the previous-generation H200 GPU.

Efficiency reshapes AI economics

As AI deployments scale, efficiency metrics like tokens per watt, cost per million tokens, and tokens per second per user are becoming just as important as raw throughput. The Blackwell architecture delivers 10 times more throughput per megawatt compared with its predecessor, directly translating into increased token revenue and improved operational efficiency.

This efficiency also cuts the cost per million tokens by a factor of 15, significantly lowering operating costs and enabling wider adoption of AI technologies across industries. The InferenceMAX benchmarks use the Pareto frontier to demonstrate how Blackwell balances cost, power consumption, throughput, and responsiveness to maximise ROI across varied production workloads.
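A Pareto frontier keeps only the operating points that no other point beats on every metric at once. The sketch below illustrates the idea for two of the metrics named above; the data points are hypothetical, not InferenceMAX results.

```python
# Hypothetical (cost_per_million_tokens_usd, tokens_per_sec_per_user) operating
# points; lower cost and higher per-user speed are both better.
points = [(0.02, 40), (0.05, 100), (0.03, 60), (0.04, 55), (0.10, 110)]

def pareto_frontier(points):
    """Keep points not dominated by another (cheaper-or-equal AND faster-or-equal)."""
    frontier = []
    for cost, speed in points:
        dominated = any(
            c <= cost and s >= speed and (c, s) != (cost, speed)
            for c, s in points
        )
        if not dominated:
            frontier.append((cost, speed))
    return sorted(frontier)

# (0.04, 55) is dominated by (0.03, 60): cheaper and faster, so it is dropped.
print(pareto_frontier(points))
```

Plotting cost against responsiveness this way shows the trade-off curve a deployment can actually choose from, which is why the benchmark reports frontiers rather than single peak numbers.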

While some systems achieve peak performance in specific scenarios, NVIDIA’s holistic approach ensures sustained efficiency and value in real-world environments. This comprehensive optimisation is essential for enterprises shifting from pilot projects to full-scale AI “factories” — infrastructures designed to transform data into tokens, insights, and decisions in real time.

Blackwell’s performance is underpinned by its advanced features, including the NVFP4 low-precision format for improved efficiency without sacrificing accuracy, fifth-generation NVLink for connecting up to 72 GPUs as a unified processor, and the NVLink Switch, which enables high concurrency through advanced parallelisation algorithms. Combined with continuous software updates, these innovations have more than doubled Blackwell’s performance since its initial launch.

With a vast ecosystem of hundreds of millions of GPUs deployed globally, over seven million CUDA developers, and contributions to more than 1,000 open-source projects, NVIDIA’s platform is designed to scale and evolve alongside the AI industry. Its Think SMART framework further supports enterprises in optimising cost per token, managing latency service-level agreements, and adapting to dynamic workloads.

As benchmarks like InferenceMAX continue to evolve, they will remain critical tools for organisations looking to make informed infrastructure decisions. NVIDIA’s results show that performance, efficiency, and ROI are not competing goals — they can be achieved together through a full-stack approach, setting a new standard for the future of AI inference.
