NVIDIA Blackwell redefines AI inference performance with record-breaking InferenceMAX results

NVIDIA Blackwell leads the new InferenceMAX benchmarks with unmatched AI performance, 15x ROI, and record-breaking efficiency.

NVIDIA’s Blackwell platform has topped the new SemiAnalysis InferenceMAX v1 benchmarks, demonstrating record performance and industry-leading efficiency in real-world AI workloads. The results highlight the platform’s ability to deliver superior return on investment (ROI) and dramatically reduce total cost of ownership (TCO), reinforcing its position as the preferred choice for large-scale AI inference.

InferenceMAX v1 is the first independent benchmark to measure the total cost of compute across a wide range of models and scenarios, reflecting the real-world demands of modern AI. As applications evolve from single-response outputs to complex, multi-step reasoning, the economics of inference are becoming critical. Blackwell’s strong performance underscores how NVIDIA’s full-stack approach meets these demands, combining advanced hardware with continuous software optimisation.

A key result from the benchmarks shows that a US$5 million investment in an NVIDIA GB200 NVL72 system can generate US$75 million in DeepSeek R1 (DSR1) token revenue, representing a 15-fold ROI. NVIDIA’s B200 system also achieves a cost of just two cents per million tokens on the open-source gpt-oss model, a fivefold reduction in cost per token within two months.
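To make the arithmetic behind these headline figures concrete, the sketch below recomputes the ROI multiple and shows what the quoted per-token price implies for a serving budget. Only the investment, revenue, and price figures come from the article; the serving-budget example is a hypothetical illustration, not a benchmark result.

```python
# Illustrative arithmetic only: the investment, revenue, and price figures are
# quoted in the article; everything derived from them is a back-of-envelope
# calculation, not data from the InferenceMAX v1 report.

capex_usd = 5_000_000           # quoted GB200 NVL72 system investment
token_revenue_usd = 75_000_000  # quoted DSR1 token revenue

roi_multiple = token_revenue_usd / capex_usd
print(f"ROI multiple: {roi_multiple:.0f}x")  # -> 15x

# Cost per million tokens quoted for gpt-oss on B200: two cents.
cost_per_million_tokens_usd = 0.02

# Hypothetical: how many tokens a fixed serving budget buys at that price.
serving_budget_usd = 1_000
tokens_served = serving_budget_usd / cost_per_million_tokens_usd * 1_000_000
print(f"Tokens per ${serving_budget_usd} at $0.02/M tokens: {tokens_served:,.0f}")
```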

“Inference is where AI delivers value every day,” said Ian Buck, vice president of hyperscale and high-performance computing at NVIDIA. “These results show that NVIDIA’s full-stack approach gives customers the performance and efficiency they need to deploy AI at scale.”

Full-stack innovation drives performance gains

NVIDIA’s leadership in inference performance is built on deep collaboration with the open-source community and continuous hardware-software co-design. Partnerships with OpenAI, Meta, and DeepSeek AI ensure that major models like gpt-oss 120B, Llama 3.3 70B, and DeepSeek R1 are optimised for NVIDIA’s infrastructure, enabling organisations to run the latest models more efficiently.

The company’s TensorRT LLM v1.0 library represents a major advance in performance, using parallelisation techniques and the NVIDIA NVLink Switch’s 1,800 GB/s of bidirectional bandwidth to accelerate inference. New techniques such as speculative decoding in the gpt-oss-120b-Eagle3-v2 model further boost efficiency, tripling throughput to 30,000 tokens per second per GPU and reaching 100 tokens per second per user.
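Speculative decoding, in general, lets a small draft model propose several tokens ahead while the large target model verifies them, so more than one token can be produced per expensive forward step. The sketch below illustrates the greedy variant with toy stand-in functions; it is not the TensorRT LLM or Eagle3 implementation, and both "models" are invented purely for illustration.

```python
# Minimal greedy speculative decoding over a toy integer vocabulary.
# draft_next / target_next are hypothetical stand-ins for a small draft model
# and the large target model. Real systems verify all draft positions in one
# batched target forward pass, which is where the throughput gain comes from.

def draft_next(context):
    # Cheap, imperfect "model": usually repeats the last token.
    return context[-1] if len(context) % 3 else (context[-1] + 1) % 10

def target_next(context):
    # Expensive "ground truth" model: deterministic rule on the context sum.
    return sum(context) % 10

def speculative_step(context, k=4):
    """Propose k draft tokens, accept the prefix the target agrees with,
    then append one token from the target itself."""
    draft, ctx = [], list(context)
    for _ in range(k):
        tok = draft_next(ctx)
        draft.append(tok)
        ctx.append(tok)

    accepted, ctx = [], list(context)
    for tok in draft:
        if target_next(ctx) == tok:    # target agrees with the draft token
            accepted.append(tok)
            ctx.append(tok)
        else:
            break
    accepted.append(target_next(ctx))  # target always contributes one token
    return accepted

context = [1, 2, 3]
for _ in range(5):
    new_tokens = speculative_step(context)
    context.extend(new_tokens)
    print(f"generated {len(new_tokens)} token(s): {new_tokens}")
```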

For large-scale models such as Llama 3.3 70B, which require extensive computational resources, the NVIDIA Blackwell B200 sets new performance records in the InferenceMAX v1 benchmarks. It delivers more than 10,000 tokens per second per GPU and 50 tokens per second per user, quadrupling throughput compared with the previous-generation H200 GPU.

Efficiency reshapes AI economics

As AI deployments scale, efficiency metrics like tokens per watt, cost per million tokens, and tokens per second per user are becoming just as important as raw throughput. The Blackwell architecture delivers 10 times more throughput per megawatt compared with its predecessor, directly translating into increased token revenue and improved operational efficiency.
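As a rough illustration of how these efficiency metrics relate, the snippet below converts aggregate throughput and power draw into tokens per watt and throughput per megawatt. All input numbers are hypothetical placeholders, not InferenceMAX measurements.

```python
# Hypothetical inputs chosen only to show the unit conversions.
aggregate_throughput_tps = 1_000_000   # tokens per second across a deployment
power_draw_watts = 120_000             # total power draw of that deployment

tokens_per_watt = aggregate_throughput_tps / power_draw_watts
throughput_per_megawatt = aggregate_throughput_tps / (power_draw_watts / 1_000_000)

print(f"tokens/s per watt:     {tokens_per_watt:.2f}")
print(f"tokens/s per megawatt: {throughput_per_megawatt:,.0f}")
```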

This efficiency also reduces the cost per million tokens by 15 times, significantly lowering operating costs and enabling wider adoption of AI technologies across industries. The InferenceMAX benchmarks use the Pareto frontier to demonstrate how Blackwell balances cost, power consumption, throughput, and responsiveness to maximise ROI across varied production workloads.
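The Pareto frontier mentioned here is the set of operating points that cannot be improved on one axis, such as per-GPU throughput, without giving something up on another, such as per-user responsiveness. The sketch below computes such a frontier over synthetic operating points; the data is invented purely to show the mechanics, not drawn from the benchmark.

```python
# Synthetic operating points: (tokens/s per user, tokens/s per GPU).
# Each point stands for one hypothetical serving configuration (batch size,
# parallelism, etc.); the values are made up for illustration.
points = [
    (10, 12_000), (25, 10_500), (50, 9_000), (75, 6_500),
    (100, 4_000), (40, 7_000), (60, 5_000), (20, 9_500),
]

def pareto_frontier(pts):
    """Keep points not dominated by any other point (dominated = some other
    point is at least as good on both axes and strictly better on one)."""
    frontier = []
    for p in pts:
        dominated = any(
            q != p and q[0] >= p[0] and q[1] >= p[1] and (q[0] > p[0] or q[1] > p[1])
            for q in pts
        )
        if not dominated:
            frontier.append(p)
    return sorted(frontier)

for interactivity, throughput in pareto_frontier(points):
    print(f"{interactivity:>4} tok/s/user  ->  {throughput:>6,} tok/s/GPU")
```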

While some systems achieve peak performance in specific scenarios, NVIDIA’s holistic approach ensures sustained efficiency and value in real-world environments. This comprehensive optimisation is essential for enterprises shifting from pilot projects to full-scale AI “factories” — infrastructures designed to transform data into tokens, insights, and decisions in real time.

Blackwell’s performance is underpinned by its advanced features, including the NVFP4 low-precision format for improved efficiency without sacrificing accuracy, fifth-generation NVLink for connecting up to 72 GPUs as a unified processor, and the NVLink Switch, which enables high concurrency through advanced parallelisation algorithms. Combined with continuous software updates, these innovations have more than doubled Blackwell’s performance since its initial launch.
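NVFP4 is a block-scaled 4-bit floating-point format, and the general idea behind such formats is that a small group of values shares one scale factor while each value is snapped to a coarse 4-bit grid. The sketch below shows that idea in simplified form; the grid and block size are assumptions chosen for clarity, not the actual NVFP4 encoding.

```python
# Simplified block-scaled 4-bit quantisation in the spirit of formats like
# NVFP4: each block shares one scale, each value is snapped to a small grid.
# The grid and block size are illustrative assumptions, not the NVFP4 spec.
import random

GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # 4-bit-style magnitudes
BLOCK = 16                                        # values per shared scale

def quantize_block(values):
    """Return (scale, quantised values) for one block."""
    max_abs = max(abs(v) for v in values) or 1.0
    scale = max_abs / GRID[-1]  # map the largest magnitude onto the grid top
    quantised = []
    for v in values:
        mag = min(GRID, key=lambda g: abs(abs(v) / scale - g))
        quantised.append(mag * scale * (1 if v >= 0 else -1))
    return scale, quantised

random.seed(0)
data = [random.gauss(0, 1) for _ in range(BLOCK)]
scale, approx = quantize_block(data)

max_err = max(abs(a - b) for a, b in zip(data, approx))
print(f"block scale: {scale:.4f}, worst absolute error: {max_err:.4f}")
```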

With a vast ecosystem of hundreds of millions of GPUs deployed globally, over seven million CUDA developers, and contributions to more than 1,000 open-source projects, NVIDIA’s platform is designed to scale and evolve alongside the AI industry. Its Think SMART framework further supports enterprises in optimising cost per token, managing latency service-level agreements, and adapting to dynamic workloads.

As benchmarks like InferenceMAX continue to evolve, they will remain critical tools for organisations looking to make informed infrastructure decisions. NVIDIA’s results show that performance, efficiency, and ROI are not competing goals — they can be achieved together through a full-stack approach, setting a new standard for the future of AI inference.
