Leading inference providers cut AI costs by up to 10x with open source models on NVIDIA Blackwell
Inference providers are cutting AI token costs by up to 10x by combining open source models with NVIDIA Blackwell infrastructure.
Artificial intelligence systems across healthcare, gaming, customer service and agentic applications are increasingly judged by a single unit of output: the token. Every medical note generated, game response produced, or customer query resolved is broken down into tokens that must be processed during inference. As demand for AI-driven interactions grows, organisations face a central question: can they afford to scale token usage without costs rising faster than the value it delivers?
Recent research from MIT suggests that infrastructure and algorithmic improvements are already shifting this balance. The study found that inference costs for frontier-level AI performance are falling by up to 10x each year due to efficiency gains. These reductions are not driven by a single breakthrough but by a combination of open source models, hardware optimisation and tightly integrated software stacks. Together, these factors are reshaping what providers describe as tokenomics: the cost structure governing how many tokens can be processed for a given investment.
A useful comparison is a high-speed printing press. If output increases far faster than the cost of ink, power and machinery, the cost of each page falls. In the same way, when token output grows more quickly than infrastructure spending, the cost per token drops. This dynamic is now being seen among leading inference providers such as Baseten, DeepInfra, Fireworks AI and Together AI, which report cost-per-token reductions of up to 10x by running open source models on the NVIDIA Blackwell platform rather than the earlier Hopper generation.
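The arithmetic behind this dynamic is simple enough to sketch. The Python snippet below illustrates how cost per million tokens falls whenever throughput grows faster than the hourly cost of the hardware producing it; all prices and throughput figures are hypothetical, chosen only to demonstrate the mechanism, not vendor-reported numbers.

```python
# Illustrative tokenomics: cost per token falls when token throughput
# grows faster than infrastructure spend. All figures are hypothetical.

def cost_per_million_tokens(gpu_hour_cost_usd: float, tokens_per_second: float) -> float:
    """Cost to generate one million tokens on a single accelerator."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hour_cost_usd / tokens_per_hour * 1_000_000

# Assumed figures: the newer GPU costs ~1.6x more per hour but delivers
# ~8x the token throughput, so cost per token drops ~5x.
older_gen = cost_per_million_tokens(gpu_hour_cost_usd=2.50, tokens_per_second=1_000)
newer_gen = cost_per_million_tokens(gpu_hour_cost_usd=4.00, tokens_per_second=8_000)

print(f"older gen: ${older_gen:.3f} per 1M tokens")
print(f"newer gen: ${newer_gen:.3f} per 1M tokens")
print(f"cost reduction: {older_gen / newer_gen:.1f}x")
```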
Healthcare and gaming show early gains from lower token costs
In healthcare, AI is often deployed to reduce the administrative burden on clinicians. Tasks such as medical coding, documentation and insurance processing consume significant time and can limit patient interaction. Sully.ai addresses this by developing AI “employees” that handle routine clinical workflows. As usage increased, the company found that closed source models introduced unpredictable latency, rising inference costs and limited control over updates.
Sully.ai moved to open source models deployed through Baseten’s Model API on NVIDIA Blackwell GPUs. By using the low-precision NVFP4 format alongside TensorRT-LLM and the NVIDIA Dynamo inference framework, Baseten achieved higher throughput per dollar than on Hopper-based systems. Sully.ai reported that inference costs fell by 90%, equivalent to a 10x reduction, while response times for critical workflows improved by 65%. The company estimates that more than 30 million minutes have been returned to physicians by reducing time spent on data entry.
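Part of NVFP4’s gain comes from how little memory quantised weights occupy, which shrinks the data each GPU must move per generated token. As a rough sketch, assuming NVFP4’s published layout of 4-bit values with an FP8 scale shared by each 16-element block (roughly 4.5 effective bits per weight), the weight footprint of a hypothetical 70-billion-parameter model compares as follows:

```python
# Back-of-envelope weight footprint for a hypothetical 70B-parameter model.
# NVFP4 stores 4-bit values plus an FP8 scale shared by each 16-element
# block, i.e. roughly 4.5 effective bits per weight. Estimates only.

PARAMS = 70e9  # assumed parameter count, not a specific model

bits_per_weight = {
    "FP16/BF16": 16.0,
    "FP8": 8.0,
    "NVFP4": 4.0 + 8.0 / 16,  # 4-bit value + amortised FP8 block scale
}

for fmt, bits in bits_per_weight.items():
    gib = PARAMS * bits / 8 / 2**30
    print(f"{fmt:>9}: ~{gib:.0f} GiB of weights")
```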
In gaming, similar cost pressures emerge as engagement scales. Latitude’s AI Dungeon and its upcoming role-playing platform, Voyage, rely on large language models to respond to every player action. Each interaction triggers an inference request, meaning costs rise directly with player activity. At the same time, response times must remain fast to preserve immersion.
Latitude runs large open source models on DeepInfra’s inference platform powered by NVIDIA Blackwell. For a mixture-of-experts model, DeepInfra reduced the cost per million tokens from 20 cents on Hopper to 10 cents on Blackwell. Using Blackwell’s native NVFP4 precision cut the cost further to 5 cents, delivering a 4x reduction overall while maintaining model accuracy. This allowed Latitude to deploy more capable models without compromising player experience, even during traffic spikes.
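Those per-token prices compound quickly at game scale. The short sketch below applies them to an assumed volume of one billion tokens per day; only the per-million-token prices come from DeepInfra’s reported figures, the traffic volume is illustrative.

```python
# Reported per-million-token prices for Latitude's MoE model on DeepInfra,
# applied to a hypothetical volume of 1 billion tokens per day.
prices_usd_per_million = {
    "Hopper": 0.20,
    "Blackwell": 0.10,
    "Blackwell + NVFP4": 0.05,
}
TOKENS_PER_DAY = 1e9  # assumed traffic, not a reported figure

for platform, price in prices_usd_per_million.items():
    daily_cost = TOKENS_PER_DAY / 1e6 * price
    print(f"{platform:>17}: ${price:.2f}/1M tokens -> ${daily_cost:,.0f}/day")

reduction = prices_usd_per_million["Hopper"] / prices_usd_per_million["Blackwell + NVFP4"]
print(f"Overall reduction: {reduction:.0f}x")
```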
Agentic AI and customer service push scale further
More complex agentic systems place even heavier demands on inference infrastructure. Sentient Labs focuses on open source reasoning systems that coordinate multiple specialised agents. Its first application, Sentient Chat, integrates more than a dozen agents, meaning a single user query can trigger a cascade of autonomous interactions and significant compute overhead.
To manage this, Sentient adopted Fireworks AI’s inference platform on NVIDIA Blackwell. With a Blackwell-optimised stack, the company achieved 25–50% better cost efficiency compared with its previous Hopper-based deployment. Higher throughput per GPU enabled the platform to support a viral launch, handling 1.8 million waitlisted users in 24 hours and processing 5.6 million queries in a single week while maintaining low latency.
In customer service, voice-based AI places particularly strict requirements on responsiveness. Even small delays can disrupt conversations and reduce trust. Decagon builds AI agents for enterprise support, with voice as its most demanding channel. The company required sub-second responses under unpredictable traffic conditions and a cost structure that could support continuous operation.
Decagon runs its multi-model voice stack on NVIDIA Blackwell GPUs through Together AI. Optimisations included speculative decoding, caching of repeated conversation elements and automatic scaling to absorb traffic surges. As a result, response times stayed under 400 milliseconds even when processing thousands of tokens per query. By combining open source and in-house models with Blackwell hardware and Together AI’s inference stack, Decagon cut the cost per voice interaction by 6x compared with proprietary closed source models.
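Speculative decoding is the main latency lever here: a small draft model proposes several tokens cheaply, and the large target model verifies them all in a single batched pass, so accepted tokens cost one large-model step instead of several. Together AI’s production implementation is not public; the sketch below shows only the simplified greedy variant, with `draft_next` and `target_argmax_batch` as hypothetical stand-ins for real model calls.

```python
# Minimal sketch of greedy speculative decoding: a cheap draft model
# proposes K tokens, the large target model scores them in one batched
# forward pass, and the longest agreeing prefix is accepted.

from typing import Callable, List

def speculative_step(
    prefix: List[int],
    draft_next: Callable[[List[int]], int],  # hypothetical draft model call
    target_argmax_batch: Callable[[List[int], List[int]], List[int]],  # hypothetical target model call
    k: int = 4,
) -> List[int]:
    # 1. Draft K candidate tokens autoregressively with the cheap model.
    draft = []
    ctx = list(prefix)
    for _ in range(k):
        token = draft_next(ctx)
        draft.append(token)
        ctx.append(token)

    # 2. One target-model pass scores all K positions at once, returning
    #    the target's greedy choice after each draft prefix.
    target = target_argmax_batch(prefix, draft)

    # 3. Accept draft tokens while they match the target; on the first
    #    mismatch, keep the target's token instead. At least one token is
    #    emitted per step, so latency never regresses.
    out = []
    for d, t in zip(draft, target):
        if d == t:
            out.append(d)
        else:
            out.append(t)
            break
    return out
```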
Extreme co-design reshapes AI economics
Across these use cases, a common theme is the impact of hardware and software co-design. NVIDIA Blackwell’s architecture is designed to optimise compute, networking and inference software together, allowing token output to scale faster than infrastructure cost. The NVIDIA GB200 NVL72 system extends this approach, delivering up to a 10x reduction in cost per token for reasoning mixture-of-experts models compared with Hopper.
This trajectory is set to continue with the NVIDIA Rubin platform, which integrates six new chips into a single AI system and is expected to deliver a further 10x increase in performance and a 10x reduction in token costs over Blackwell. For enterprises, these developments suggest that AI scaling is becoming less constrained by economics and more defined by how effectively organisations can integrate open models with optimised infrastructure.