NVIDIA’s Blackwell Ultra architecture has set new standards in AI performance, achieving record results in the latest MLPerf Inference v5.1 benchmarks. The NVIDIA GB300 NVL72 rack-scale system, powered by Blackwell Ultra, delivered the highest throughput on the new DeepSeek-R1 reasoning benchmark, with up to 45% higher inference throughput than previous-generation GB200 NVL72 systems.
Inference performance is vital to the economics of AI infrastructure. Higher throughput means more tokens served per second, which boosts revenue, lowers total cost of ownership (TCO) and raises overall productivity. NVIDIA’s latest results underscore its ongoing push at the limits of AI factory performance.
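To make that economics argument concrete, here is a back-of-the-envelope sketch. All numbers are hypothetical placeholders chosen only for illustration, not NVIDIA or MLPerf figures; the point is simply that at a fixed price per token, revenue per system scales linearly with throughput, so an uplift like the 45% above flows straight through.

```python
# Illustrative arithmetic only: the throughput and price below are
# hypothetical placeholders, not NVIDIA or MLPerf figures.

throughput_tok_s = 40_000        # hypothetical tokens/second for one system
price_per_m_tok = 0.50           # hypothetical $ per million generated tokens

revenue_per_hour = throughput_tok_s * 3600 / 1e6 * price_per_m_tok
print(f"baseline: ${revenue_per_hour:.2f}/hour")          # $72.00/hour

# At a fixed price and infrastructure cost, a 45% throughput uplift
# flows directly into revenue and hence into TCO per token.
print(f"with +45% throughput: ${revenue_per_hour * 1.45:.2f}/hour")  # $104.40/hour
```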
Enhanced architecture and full-stack optimisation
The Blackwell Ultra architecture builds on the foundation of its predecessor with significant improvements. Each GPU now offers 1.5 times the NVFP4 AI compute of Blackwell and twice its attention-layer acceleration. It also features up to 288GB of HBM3e memory per GPU, providing greater capacity for large-scale AI workloads.
A key factor in these results is NVIDIA’s full-stack co-design approach, integrating hardware and software innovations. Blackwell and Blackwell Ultra incorporate hardware acceleration for NVFP4, NVIDIA’s custom 4-bit floating point format, which delivers better accuracy than other FP4 formats and results comparable to higher-precision formats. NVIDIA TensorRT Model Optimizer quantised models such as DeepSeek-R1, Llama 3.1 405B, and Llama 2 70B to NVFP4, and together with the open-source TensorRT-LLM library, enabled higher performance without sacrificing accuracy.
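NVIDIA’s public documentation describes NVFP4 as 4-bit E2M1 values scaled per 16-element block, with the block scale itself stored in FP8; that fine-grained scaling is what lets a 4-bit format track higher-precision accuracy. The sketch below fake-quantises one block in that style in plain NumPy. It is illustrative only, not NVIDIA’s implementation, and it omits the FP8 encoding of the block scale.

```python
import numpy as np

# The eight magnitudes representable in FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def fake_quantize_nvfp4_block(x: np.ndarray) -> np.ndarray:
    """Fake-quantize one 16-element block in the NVFP4 style: scale the
    block so its largest magnitude maps to 6.0 (the E2M1 maximum), round
    every element to the nearest representable E2M1 value, then rescale.
    Real NVFP4 additionally stores the block scale itself in FP8 (E4M3),
    which this sketch omits for clarity."""
    amax = np.abs(x).max()
    if amax == 0.0:
        return x.copy()
    scale = amax / 6.0                      # one scale per 16-element block
    scaled = x / scale
    # For each element, pick the nearest signed grid point.
    candidates = np.sign(scaled)[:, None] * E2M1_GRID[None, :]
    idx = np.abs(scaled[:, None] - candidates).argmin(axis=1)
    nearest = candidates[np.arange(len(x)), idx]
    return (nearest * scale).astype(x.dtype)

block = np.random.randn(16).astype(np.float32)
err = np.abs(block - fake_quantize_nvfp4_block(block)).max()
print(f"max abs quantisation error in this block: {err:.4f}")
```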
The company also highlighted disaggregated serving, a technique that separates context processing (prefill) from token generation (decode) for large language models, so each phase can be provisioned and parallelised for its own compute profile. This method was crucial to record-breaking results on the Llama 3.1 405B Interactive benchmark, nearly doubling performance per GPU compared with traditional serving approaches.
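A minimal conceptual sketch of the split is below. All names are hypothetical and this is not NVIDIA’s implementation; it only shows the shape of the idea, where a compute-heavy prefill stage hands a KV cache off to a latency-sensitive decode stage that could live in a separate worker pool.

```python
import asyncio

async def context_worker(prompt: str) -> dict:
    """Process the whole prompt once and hand off the resulting KV cache."""
    await asyncio.sleep(0.05)             # stand-in for heavy prefill compute
    return {"prompt": prompt, "kv_cache": f"kv[{len(prompt)} chars]"}

async def generation_worker(state: dict, max_tokens: int = 4) -> list:
    """Generate tokens one at a time from the transferred KV cache."""
    tokens = []
    for i in range(max_tokens):
        await asyncio.sleep(0.01)         # stand-in for per-token decode
        tokens.append(f"tok{i}")
    return tokens

async def serve(prompt: str) -> list:
    state = await context_worker(prompt)      # runs on the context pool
    return await generation_worker(state)     # runs on the generation pool

print(asyncio.run(serve("Why separate prefill from decode?")))
```

The design point this illustrates: because prefill throughput and decode latency stress hardware differently, separating them lets each pool be scaled and tuned independently rather than forcing one set of GPUs to juggle both phases.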
Broad industry adoption and availability
NVIDIA’s ecosystem of partners also contributed to the strong benchmark results. Submissions came from leading cloud providers and server makers including Azure, Broadcom, Cisco, CoreWeave, Dell Technologies, HPE, Lenovo, Oracle, Supermicro, and the University of Florida. These results demonstrate that the market-leading performance of NVIDIA’s AI platform is accessible across a wide range of systems and services.
The company also made its first benchmark submissions using NVIDIA Dynamo, its open-source distributed inference framework that supports techniques such as the disaggregated serving described above. Organisations deploying AI applications can now leverage these advances through major cloud platforms and server vendors, benefiting from reduced TCO and higher returns on investment.
NVIDIA’s record-breaking results in MLPerf Inference v5.1 reaffirm its leadership in AI computing. By delivering stronger performance and efficiency, Blackwell Ultra provides a compelling platform for enterprises building and scaling next-generation AI applications.