NVIDIA’s Blackwell platform has topped the new SemiAnalysis InferenceMAX v1 benchmarks, demonstrating record performance and industry-leading efficiency in real-world AI workloads. The results highlight the platform’s ability to deliver superior return on investment (ROI) and dramatically reduce total cost of ownership (TCO), reinforcing its position as the preferred choice for large-scale AI inference.
InferenceMAX v1 is the first independent benchmark to measure the total cost of compute across a wide range of models and scenarios, reflecting the real-world demands of modern AI. As applications evolve from single-response outputs to complex, multi-step reasoning, the economics of inference are becoming critical. Blackwell’s strong performance underscores how NVIDIA’s full-stack approach meets these demands, combining advanced hardware with continuous software optimisation.
A key result from the benchmarks shows that a US$5 million investment in an NVIDIA GB200 NVL72 system can generate US$75 million in DeepSeek R1 (DSR1) token revenue, representing a 15-fold ROI. NVIDIA’s B200 system also achieves a cost of just two cents per million tokens on the open-source gpt-oss model, a fivefold reduction in cost per token within two months.
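As a back-of-the-envelope check, these headline figures reduce to simple division. The short sketch below uses only the dollar amounts quoted above; everything else is arithmetic.

    # Back-of-the-envelope check of the benchmark's headline economics,
    # using only the figures cited in the article.
    investment_usd = 5_000_000        # cited GB200 NVL72 system cost
    token_revenue_usd = 75_000_000    # cited DSR1 token revenue
    print(f"ROI multiple: {token_revenue_usd / investment_usd:.0f}x")  # -> 15x

    # Two cents per million tokens on gpt-oss works out to:
    cost_per_token_usd = 0.02 / 1_000_000
    print(f"Cost per token: ${cost_per_token_usd:.8f}")  # -> $0.00000002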
“Inference is where AI delivers value every day,” said Ian Buck, vice president of hyperscale and high-performance computing at NVIDIA. “These results show that NVIDIA’s full-stack approach gives customers the performance and efficiency they need to deploy AI at scale.”
Full-stack innovation drives performance gains
NVIDIA’s leadership in inference performance is built on deep collaboration with the open-source community and continuous hardware-software co-design. Partnerships with OpenAI, Meta, and DeepSeek AI ensure that major models like gpt-oss 120B, Llama 3.3 70B, and DeepSeek R1 are optimised for NVIDIA’s infrastructure, enabling organisations to run the latest models more efficiently.
The company’s TensorRT-LLM v1.0 library represents a major advance in performance, using parallelisation techniques and the NVIDIA NVLink Switch’s 1,800 GB/s of bidirectional bandwidth to accelerate inference. New techniques such as speculative decoding in the gpt-oss-120b-Eagle3-v2 model further boost efficiency, tripling throughput to 30,000 tokens per second per GPU while sustaining 100 tokens per second per user.
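Speculative decoding itself is a general technique: a small draft model cheaply proposes several tokens, and the large target model verifies them, so multiple tokens can be accepted per expensive forward pass. The toy greedy sketch below illustrates the idea; draft_next and target_next are hypothetical stand-ins for the two models, and real systems verify all draft tokens in a single batched pass rather than one call per token.

    # A toy sketch of greedy speculative decoding. `draft_next` and
    # `target_next` are hypothetical callables mapping a token sequence
    # to the greedily chosen next token.
    def speculative_decode(prompt, draft_next, target_next, k=4, max_new=64):
        tokens = list(prompt)
        target_len = len(prompt) + max_new
        while len(tokens) < target_len:
            # 1. The cheap draft model proposes k candidate tokens.
            draft = []
            for _ in range(k):
                draft.append(draft_next(tokens + draft))
            # 2. The target model verifies them: keep the longest agreed
            #    prefix, then take the target's own token at the first
            #    mismatch, so output matches plain greedy decoding exactly.
            accepted = []
            for i in range(k):
                expected = target_next(tokens + accepted)
                if draft[i] == expected:
                    accepted.append(draft[i])
                else:
                    accepted.append(expected)
                    break
            else:
                # All k drafts accepted: the target grants one bonus token.
                accepted.append(target_next(tokens + accepted))
            tokens.extend(accepted)
        return tokens[:target_len]

Because the target model corrects the first mismatch, the output is identical to plain greedy decoding of the target model alone; the speedup in production systems comes from verifying all draft tokens in one batched forward pass.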
For large-scale models such as Llama 3.3 70B, which require extensive computational resources, the NVIDIA Blackwell B200 sets new performance records in the InferenceMAX v1 benchmarks. It delivers more than 10,000 tokens per second per GPU and 50 tokens per second per user, quadrupling throughput compared with the previous-generation H200 GPU.
Efficiency reshapes AI economics
As AI deployments scale, efficiency metrics like tokens per watt, cost per million tokens, and tokens per second per user are becoming just as important as raw throughput. The Blackwell architecture delivers 10 times more throughput per megawatt compared with its predecessor, directly translating into increased token revenue and improved operational efficiency.
This efficiency also cuts the cost per million tokens by a factor of 15, significantly lowering operating costs and enabling wider adoption of AI technologies across industries. The InferenceMAX benchmarks use the Pareto frontier to demonstrate how Blackwell balances cost, power consumption, throughput, and responsiveness to maximise ROI across varied production workloads.
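The Pareto-frontier view is straightforward to reproduce: each deployment configuration is scored on competing axes, such as throughput per GPU versus per-user interactivity, and only configurations not dominated on both axes are kept. The sketch below uses hypothetical numbers purely for illustration.

    # A minimal Pareto-frontier sketch over hypothetical configurations,
    # each a (tokens_per_sec_per_gpu, tokens_per_sec_per_user) pair.
    # A point is kept if no other point is at least as good on both axes
    # and strictly better on one.
    def pareto_frontier(points):
        frontier = []
        for p in points:
            dominated = any(
                q != p and q[0] >= p[0] and q[1] >= p[1]
                and (q[0] > p[0] or q[1] > p[1])
                for q in points
            )
            if not dominated:
                frontier.append(p)
        return sorted(frontier)

    configs = [(12000, 20), (10000, 50), (6000, 80), (9000, 40), (3000, 100)]
    print(pareto_frontier(configs))
    # -> [(3000, 100), (6000, 80), (10000, 50), (12000, 20)]
    # (9000, 40) is dropped: (10000, 50) beats it on both axes.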
While some systems achieve peak performance in specific scenarios, NVIDIA’s holistic approach ensures sustained efficiency and value in real-world environments. This comprehensive optimisation is essential for enterprises shifting from pilot projects to full-scale AI “factories” — infrastructures designed to transform data into tokens, insights, and decisions in real time.
Blackwell’s performance is underpinned by its advanced features, including the NVFP4 low-precision format for improved efficiency without sacrificing accuracy, fifth-generation NVLink for connecting up to 72 GPUs as a unified processor, and the NVLink Switch, which enables high concurrency through advanced parallelisation algorithms. Combined with continuous software updates, these innovations have more than doubled Blackwell’s performance since its initial launch.
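The intuition behind NVFP4 can be illustrated with a simplified simulation of block-scaled 4-bit quantisation: each small block of values shares one scale factor, and individual values are snapped to the handful of magnitudes a 4-bit E2M1 float can represent. The sketch below keeps the per-block scale in full precision for clarity; the actual format stores scales in FP8 and differs in further encoding details.

    import numpy as np

    # E2M1 (4-bit float) magnitudes; with a sign bit this gives the full grid.
    E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
    GRID = np.concatenate([-E2M1[::-1], E2M1])

    def quantise_dequantise(x, block=16):
        """Simulate block-scaled 4-bit quantisation, then decode back."""
        out = np.empty_like(x)
        for start in range(0, len(x), block):
            chunk = x[start:start + block]
            scale = np.abs(chunk).max() / 6.0  # map the block max to +/-6
            if scale == 0.0:
                scale = 1.0
            # Snap each scaled value to the nearest representable grid point.
            idx = np.abs(chunk[:, None] / scale - GRID).argmin(axis=1)
            out[start:start + block] = GRID[idx] * scale
        return out

    x = np.random.randn(64).astype(np.float32)
    print("max abs error:", np.abs(x - quantise_dequantise(x)).max())

Sharing a scale per small block, rather than per tensor, is what lets such a narrow format preserve accuracy: the handful of representable values is re-centred on each block's local dynamic range.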
With a vast ecosystem of hundreds of millions of GPUs deployed globally, over seven million CUDA developers, and contributions to more than 1,000 open-source projects, NVIDIA’s platform is designed to scale and evolve alongside the AI industry. Its Think SMART framework further supports enterprises in optimising cost per token, managing latency service-level agreements, and adapting to dynamic workloads.
As benchmarks like InferenceMAX continue to evolve, they will remain critical tools for organisations looking to make informed infrastructure decisions. NVIDIA’s results show that performance, efficiency, and ROI are not competing goals — they can be achieved together through a full-stack approach, setting a new standard for the future of AI inference.