AI scaling constraints emerge as capacity limits drive production failures
Datadog finds 5% of AI requests fail in production, with capacity limits emerging as a key barrier to scaling AI systems reliably.
Operational constraints are becoming a primary barrier to scaling AI systems, according to new data from Datadog’s State of AI Engineering 2026 report. Based on anonymised usage data from thousands of organisations running AI in production, the findings point to system complexity and infrastructure limits, rather than model capability, as the main source of failure.
Around 5% of AI model requests fail in production environments, with nearly 60% of those failures linked to capacity limits. These issues translate into slow responses, errors, and disrupted application performance, particularly as organisations increase usage and system complexity.
Multi-model and agent complexity raise operational overhead
The report finds that 69% of companies now operate three or more AI models within production systems. This multi-model approach reflects a move towards combining capabilities across providers, but it also increases the number of dependencies and potential points of failure.
OpenAI remains the most widely used provider, with a 63% share. Adoption of Google Gemini and Anthropic Claude has also increased, rising by 20 and 23 percentage points respectively. The broader mix of providers introduces additional orchestration and routing requirements, which can complicate system design.
At the same time, agent framework adoption has doubled year-on-year. These frameworks enable more advanced workflows and automation but add further layers to production systems. As more components are introduced, the challenge shifts from building functionality to maintaining reliability under load.
The volume of data processed by AI systems is also increasing. The report notes that the average number of tokens per request has more than doubled for median-use teams and quadrupled for heavy users. Larger inputs place greater strain on infrastructure and increase the likelihood of bottlenecks when capacity is constrained.
Capacity limits drive failures as systems scale
Failures are increasingly tied to how systems are structured rather than what models can achieve. Fragmented workflows, inefficient routing, and repeated retries contribute to instability as workloads scale. These design issues amplify the impact of infrastructure constraints, particularly when systems are pushed towards peak capacity.
“AI is starting to look a lot like the early days of cloud,” said Yanbing Li, Chief Product Officer at Datadog. “The cloud made systems programmable but much more complex to manage. AI is now doing the same thing to the application layer. The companies that win won’t just build better models – they’ll build operational control around them. In this new era, AI observability becomes as essential as cloud observability was a decade ago.”
The comparison highlights a shift in focus from model performance to operational management. As with early cloud adoption, increased flexibility introduces new layers of complexity that require dedicated monitoring and control.
Observability becomes central to production reliability
The report points to observability as a key requirement for managing AI systems at scale. Visibility across infrastructure, model behaviour, and agent workflows is necessary to identify bottlenecks and maintain performance under varying loads.
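The kind of per-call visibility described here can be sketched as a thin wrapper that records latency, success, and token volume for every model call. The class and attribute names (`ModelMonitor`, `usage_tokens`) are assumptions for illustration, not an API from Datadog or the report; a production system would ship these metrics to an observability backend rather than keep them in memory.

```python
import time
from dataclasses import dataclass, field

@dataclass
class CallRecord:
    model: str
    latency_s: float
    ok: bool
    tokens: int

@dataclass
class ModelMonitor:
    """Minimal per-model telemetry: latency, error rate, token volume."""
    records: list = field(default_factory=list)

    def observe(self, model, fn, *args, **kwargs):
        """Run a model call and record its outcome either way."""
        start = time.perf_counter()
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.records.append(
                CallRecord(model, time.perf_counter() - start, False, 0))
            raise
        # 'usage_tokens' is an assumed field name on the response object.
        tokens = getattr(result, "usage_tokens", 0)
        self.records.append(
            CallRecord(model, time.perf_counter() - start, True, tokens))
        return result

    def error_rate(self, model):
        rs = [r for r in self.records if r.model == model]
        return sum(not r.ok for r in rs) / len(rs) if rs else 0.0
```

Even this toy version makes the report's argument concrete: without per-call records, a team cannot tell whether a rising failure rate comes from one model, one workflow, or capacity pressure across the board.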
“The next wave of agent failures won’t be about what agents can’t do but what teams can’t observe,” said Guillermo Rauch, CEO at Vercel. “We built agentic infrastructure at Vercel because agents need the same production feedback loops as great software. Unlike traditional software, agents have control flow driven by the LLM itself, making observability not just useful, but essential.”
Datadog positions this as a shift in how AI systems are operated. “Innovation alone isn’t enough,” Li added. “To scale AI with confidence, organizations need real-time visibility across the entire stack – from GPU utilization to model behavior to agent workflows. Visibility and operational control are what allow teams to move fast without sacrificing reliability or governance. At scale, how you operate AI may matter more than the models you choose.”
The findings indicate that as AI deployment expands, system reliability is increasingly determined by infrastructure capacity, workflow design, and monitoring capabilities rather than model selection alone.