Scaling Autonomous Agents in Production

Beyond the Prototype

Every engineering team can build an agent that works in a demo. The challenge — the real, grinding, unsexy challenge — is making it work at scale, reliably, in production, when real money and real users are on the line.

This is what we've learned deploying autonomous agents that handle thousands of concurrent workflows.

The Architecture

Our production agent systems follow a layered architecture:

Layer 1: Request Ingestion

Every agent workflow starts with a request. This could be a webhook from an external system, a user action, or a scheduled trigger. The ingestion layer does three things:

Validates the request schema
Deduplicates against recent requests (webhook providers love sending duplicates)
Enqueues the request with a priority score
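
A minimal sketch of those three steps, using an in-memory stand-in for the real queue. The `Ingestor` class, the `REQUIRED_FIELDS` schema, and the five-minute dedup window are assumptions made for illustration, not details of the system described here.

```python
import hashlib
import heapq
import json
import time

REQUIRED_FIELDS = {"workflow_type", "payload"}  # hypothetical schema
DEDUP_WINDOW_SECONDS = 300                      # assumed window


class Ingestor:
    """In-memory stand-in for the ingestion layer."""

    def __init__(self):
        self._seen = {}     # fingerprint -> time first seen
        self._queue = []    # heap of (-priority, sequence, request)
        self._sequence = 0  # tie-breaker so equal priorities stay FIFO

    def ingest(self, request, priority, now=None):
        """Return True if the request was enqueued, False if it was a duplicate."""
        now = time.time() if now is None else now

        # 1. Validate the request schema.
        missing = REQUIRED_FIELDS - set(request)
        if missing:
            raise ValueError(f"missing fields: {sorted(missing)}")

        # 2. Deduplicate against recent requests using a content fingerprint.
        body = json.dumps(request, sort_keys=True).encode()
        fingerprint = hashlib.sha256(body).hexdigest()
        first_seen = self._seen.get(fingerprint)
        if first_seen is not None and now - first_seen < DEDUP_WINDOW_SECONDS:
            return False
        self._seen[fingerprint] = now

        # 3. Enqueue with a priority score (higher score is served first).
        heapq.heappush(self._queue, (-priority, self._sequence, request))
        self._sequence += 1
        return True

    def pop(self):
        """Remove and return the highest-priority request."""
        return heapq.heappop(self._queue)[2]
```

Fingerprinting the whole body is the simplest choice. If the webhook provider sends a delivery ID, deduplicating on that ID is usually more robust.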

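We use PostgreSQL with SELECT ... FOR UPDATE SKIP LOCKED for our work queue. It's not as trendy as Kafka or RabbitMQ, but it guarantees that each queued job is claimed by only one worker at a time, with very little operational overhead beyond running Postgres itself. That is not quite exactly-once processing: a job's database writes are exactly-once only when they commit in the same transaction that claims the job, and calls to external systems still need to be idempotent. For our scale (thousands of concurrent workflows, not millions), it's the right tool.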
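
A sketch of what a claim query can look like. The `jobs` table and its columns (`status`, `priority`, `created_at`, `claimed_at`, `payload`) are hypothetical. The function assumes any DB-API 2.0 connection to PostgreSQL, such as one from psycopg.

```python
# Hypothetical schema: jobs(id, status, priority, created_at, claimed_at, payload)
CLAIM_SQL = """
UPDATE jobs
SET status = 'running', claimed_at = now()
WHERE id = (
    SELECT id
    FROM jobs
    WHERE status = 'queued'
    ORDER BY priority DESC, created_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id, payload
"""


def claim_next_job(conn):
    """Claim one queued job and return its row, or None if nothing is queued.

    SKIP LOCKED makes concurrent workers skip rows another worker has
    already locked, so two workers never claim the same job.
    """
    cur = conn.cursor()
    try:
        cur.execute(CLAIM_SQL)
        row = cur.fetchone()
        conn.commit()
        return row
    except Exception:
        conn.rollback()
        raise
    finally:
        cur.close()
```

Because this version commits the claim before the work is done, a worker that crashes mid-task leaves a job stuck in 'running'. A periodic sweep that requeues jobs with an old `claimed_at` is one common way to handle that, at the cost of occasionally running a job twice.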

Layer 2: Planning

The planner receives a validated request and decomposes it into a directed acyclic graph (DAG) of tasks. Each task has:

A type (LLM call, tool invocation, data fetch, human review)
Dependencies on other tasks
A timeout and retry policy
Success criteria that the evaluator will check

The planner itself is an LLM call, but it's heavily constrained. We use structured output (JSON schema) to ensure the plan is machine-parseable, and we validate the plan against a set of rules before executing it.
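
As an illustration of rule-based plan validation, the sketch below checks task types, dependency references, and acyclicity. The `Task` fields mirror the list above, but the field names, default values, and the specific rules are assumptions.

```python
from dataclasses import dataclass, field
from graphlib import CycleError, TopologicalSorter

ALLOWED_TYPES = {"llm_call", "tool_invocation", "data_fetch", "human_review"}


@dataclass
class Task:
    id: str
    type: str
    depends_on: list = field(default_factory=list)
    timeout_seconds: float = 30.0
    max_retries: int = 2
    success_criteria: str = ""


def validate_plan(tasks):
    """Return a list of rule violations; an empty list means the plan is valid."""
    errors = []
    ids = [t.id for t in tasks]
    if len(ids) != len(set(ids)):
        errors.append("duplicate task ids")
    known = set(ids)
    for t in tasks:
        if t.type not in ALLOWED_TYPES:
            errors.append(f"{t.id}: unknown type {t.type!r}")
        for dep in t.depends_on:
            if dep not in known:
                errors.append(f"{t.id}: unknown dependency {dep!r}")
    # A plan must be a DAG: reject it if the dependency graph has a cycle.
    try:
        TopologicalSorter({t.id: t.depends_on for t in tasks}).prepare()
    except CycleError:
        errors.append("dependency cycle")
    return errors
```

Returning every violation, rather than stopping at the first, gives the planner more to work with if the plan is sent back for correction.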

Critical rule: no plan can exceed a cost budget. The planner estimates the cost of each task, and if the total exceeds the budget for that workflow type, it either simplifies the plan or rejects the request.
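
One way to express the budget rule in code. Dropping optional tasks is only one possible simplification strategy and is an assumption here, as are the function name and the cost figures in the usage example.

```python
def check_budget(estimated_costs, budget_usd):
    """Decide what to do with a plan given per-task cost estimates.

    estimated_costs maps task id -> (estimated cost in USD, is_required).
    Returns (decision, task ids to keep), where decision is
    "accept", "simplify", or "reject".
    """
    total = sum(cost for cost, _ in estimated_costs.values())
    if total <= budget_usd:
        return "accept", sorted(estimated_costs)

    # Over budget: try a simpler plan that keeps only the required tasks.
    required = {tid: cost for tid, (cost, req) in estimated_costs.items() if req}
    if required and sum(required.values()) <= budget_usd:
        return "simplify", sorted(required)

    # Even the required tasks do not fit, so the request is rejected.
    return "reject", []


costs = {"fetch": (0.01, True), "summarize": (0.20, True), "polish": (0.50, False)}
decision, kept = check_budget(costs, budget_usd=0.30)
# decision == "simplify", kept == ["fetch", "summarize"]
```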

Layer 3: Execution

The executor processes the task DAG in topological order, running independent tasks in parallel. Each task runs in an isolated context with:

Its own timeout (default 30 seconds for LLM calls, 60 seconds for tool invocations)
Its own error handler (retry, replan, or escalate)
Structured logging of inputs, outputs, and timing
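
A single-process sketch of DAG execution with per-task timeouts, using `asyncio` and the standard library's `graphlib`. The `run_dag` function and its argument shapes are hypothetical. A distributed executor would hand ready tasks to workers instead of awaiting them locally.

```python
import asyncio
from graphlib import TopologicalSorter


async def run_dag(tasks, deps, timeouts):
    """Run tasks in dependency order, independent ones in parallel.

    tasks:    task id -> async callable that takes the dict of results so far
    deps:     task id -> set of task ids it depends on (list every task,
              using an empty set for tasks with no dependencies)
    timeouts: task id -> timeout in seconds
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if deps is not a DAG
    results = {}

    async def run_one(task_id):
        # Each task gets its own timeout; exceeding it raises TimeoutError.
        value = await asyncio.wait_for(tasks[task_id](results), timeouts[task_id])
        return task_id, value

    while sorter.is_active():
        ready = sorter.get_ready()  # every task whose dependencies are done
        finished = await asyncio.gather(*(run_one(t) for t in ready))
        for task_id, value in finished:
            results[task_id] = value
            sorter.done(task_id)
    return results
```

This version runs tasks in waves: it waits for every ready task before starting the next batch. That is simpler than starting each task the moment its own dependencies finish, at some cost in parallelism.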

We run executors as stateless workers on auto-scaling infrastructure. Each worker pulls a task from the queue, executes it, writes the result, and pulls the next one. This gives us horizontal scaling with no coordination overhead.

Layer 4: Evaluation

After each task completes, the evaluator checks the output against the success criteria defined in the plan. If evaluation fails:

Soft failure: the output is suboptimal but usable → log a warning, continue
Hard failure: the output is wrong → trigger replanning with the failure context
Critical failure: the system is in an unexpected state → halt and escalate to human
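
The three failure levels map naturally onto a small dispatch table. The names below (`Verdict`, `NEXT_ACTION`, `next_action`) are illustrative assumptions.

```python
from enum import Enum


class Verdict(Enum):
    PASS = "pass"
    SOFT_FAILURE = "soft_failure"          # suboptimal but usable
    HARD_FAILURE = "hard_failure"          # output is wrong
    CRITICAL_FAILURE = "critical_failure"  # system in an unexpected state


NEXT_ACTION = {
    Verdict.PASS: "continue",
    Verdict.SOFT_FAILURE: "continue",
    Verdict.HARD_FAILURE: "replan",
    Verdict.CRITICAL_FAILURE: "halt_and_escalate",
}


def next_action(verdict, task_id, warnings):
    """Map an evaluator verdict to the executor's next step."""
    if verdict is Verdict.SOFT_FAILURE:
        # Soft failures continue, but leave a warning behind.
        warnings.append(f"{task_id}: output usable but suboptimal")
    return NEXT_ACTION[verdict]
```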

The evaluator is also an LLM call, but a cheaper one. We use a smaller, faster model for evaluation than for the work being evaluated, and reserve the most expensive models for the tasks that need them.

Layer 5: Observability

Every layer emits structured traces. We can reconstruct the full history of any workflow: what was planned, what was executed, what was evaluated, and what decisions were made at each step.
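
A minimal sketch of what a structured trace record and a history lookup might look like. The field names and helper functions are assumptions. A production system would more likely use an established tracing library than hand-rolled dictionaries.

```python
import json
import time
import uuid


def make_span(workflow_id, layer, event, **attributes):
    """Build one structured trace record (field names are assumptions)."""
    return {
        "span_id": uuid.uuid4().hex,
        "workflow_id": workflow_id,
        "layer": layer,    # e.g. "planning", "execution", "evaluation"
        "event": event,
        "timestamp": time.time(),
        "attributes": attributes,
    }


def to_log_line(span):
    """Serialize a span as one JSON line for a log pipeline."""
    return json.dumps(span, sort_keys=True)


def workflow_history(spans, workflow_id):
    """Reconstruct one workflow's timeline, oldest first."""
    own = [s for s in spans if s["workflow_id"] == workflow_id]
    return sorted(own, key=lambda s: s["timestamp"])
```

Recording the exact prompt and model response in `attributes` is what makes a non-reproducible run inspectable after the fact.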

This isn't optional. Without observability, debugging production agents is close to impossible. You usually can't reproduce an issue locally, because the behavior depends on the specific LLM responses, and those are non-deterministic.

Error Recovery Patterns

The most important thing we've learned about production agents: errors are not exceptional. They are the normal case.

LLM calls fail. APIs return unexpected responses. Rate limits get hit. Timeouts fire. In a system that makes dozens of external calls per workflow, something will go wrong in almost every run.

Our error recovery follows a hierarchy:

Pattern 1: Retry with Backoff

For transient errors (rate limits, network timeouts, 5xx responses), we retry with exponential backoff. Simple, boring, effective.
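
A sketch of retry with exponential backoff. The `TransientError` class and the default parameters are assumptions. The random jitter is an addition not mentioned above, included because it keeps many workers from retrying in lockstep.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for rate limits, network timeouts, and 5xx responses."""


def retry_with_backoff(call, max_attempts=5, base_delay=0.5, max_delay=30.0,
                       sleep=time.sleep):
    """Call `call()` until it succeeds or attempts run out.

    Only TransientError is retried; any other exception propagates at once.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller replan or escalate
            # Delay ceiling doubles each attempt: 0.5s, 1s, 2s, ... up to max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))  # "full jitter" variant
```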

Pattern 2: Replan

For semantic errors (the LLM generated an invalid tool call, or the tool returned unexpected data), we don't retry the same thing. We send the error context back to the planner and ask for an alternative approach.

This is the key insight: replanning is not retrying. Retrying assumes the same action will succeed next time. Replanning assumes the action was wrong and generates a different path.
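
The difference shows up clearly in code: a retry loop calls the same action again, while a replanning loop feeds each failure back to the planner. The sketch below assumes hypothetical `plan` and `execute` callables with the interfaces described in the docstring.

```python
def run_with_replanning(plan, execute, goal, max_replans=2):
    """Plan, execute, and on failure ask the planner for a different approach.

    plan(goal, failures) -> action     (hypothetical planner interface)
    execute(action) -> (ok, detail)    (hypothetical executor interface)
    """
    failures = []
    for _ in range(max_replans + 1):
        # The planner sees every earlier failure, so it can avoid repeating it.
        action = plan(goal, failures)
        ok, detail = execute(action)
        if ok:
            return action, detail
        failures.append({"action": action, "error": detail})
    raise RuntimeError(f"no working plan after {max_replans} replans: {failures}")
```

Capping the number of replans matters as much as capping retries. Without a limit, a planner that keeps proposing failing alternatives can loop indefinitely.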

Pattern 3: Degrade Gracefully

For non-critical tasks, we allow graceful degradation. If a task fails and it's not on the critical path, we mark it as skipped and continue. The final output might be less complete, but it's still useful.

Pattern 4: Escalate

For critical failures that the system can't recover from, we escalate to a human. But we don't just send an error message — we send the full context: what was attempted, why it failed, what alternatives were considered, and what the human needs to decide.
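
Taken together, the four patterns form a decision ladder. The sketch below is one possible encoding: the `Failure` fields, the error kinds, and the limits are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Failure:
    task_id: str
    kind: str    # "transient", "semantic", or "unexpected_state"
    error: str
    on_critical_path: bool = True
    attempts: int = 1
    alternatives_tried: list = field(default_factory=list)


def choose_recovery(failure, max_retries=3, max_replans=2):
    """Walk the hierarchy: retry, then replan, then degrade, then escalate."""
    if failure.kind == "transient" and failure.attempts <= max_retries:
        return "retry"
    if failure.kind == "semantic" and len(failure.alternatives_tried) < max_replans:
        return "replan"
    if failure.kind != "unexpected_state" and not failure.on_critical_path:
        return "skip"
    return "escalate"


def escalation_payload(failure, decision_needed):
    """Bundle the context a human needs, not just an error string."""
    return {
        "attempted": failure.task_id,
        "why_it_failed": failure.error,
        "alternatives_considered": list(failure.alternatives_tried),
        "decision_needed": decision_needed,
    }
```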

Cost Management at Scale

At thousands of concurrent workflows, LLM costs add up. Here's how we manage them:

Model Routing

Not every task needs the most expensive model. We route tasks to models based on complexity:

Classification, extraction, formatting → small model (fast, cheap)
Analysis, reasoning, creative generation → large model (slow, expensive)
Evaluation, validation → medium model (balanced)
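
In its simplest form, routing is a lookup table. The task-kind strings and the tier names below are placeholders rather than real model identifiers. Falling back to the large tier for unknown task kinds is an assumption that favors quality over cost.

```python
# Task kind -> model tier, following the mapping above.
ROUTES = {
    "classification": "small",
    "extraction": "small",
    "formatting": "small",
    "analysis": "large",
    "reasoning": "large",
    "creative_generation": "large",
    "evaluation": "medium",
    "validation": "medium",
}


def route(task_kind, default="large"):
    """Pick a model tier for a task; unknown kinds fall back to `default`."""
    return ROUTES.get(task_kind, default)
```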

This alone reduced our LLM costs by 60%.

Response Caching

Many agent workflows include steps that should produce the same output whenever they get the same input: formatting a template, extracting fields from a known schema, classifying into fixed categories. We cache these aggressively.

Our cache hit rate is around 35%, which means about 35% of the requests that reach the cache are answered without making an LLM call at all.
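
A sketch of a response cache keyed on everything that can change the output. The class and its in-memory dictionary are illustrative assumptions. A shared store with expiry would be a more realistic backing for multiple workers.

```python
import hashlib
import json


class ResponseCache:
    """In-memory cache for LLM responses, with hit-rate accounting."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def make_key(model, prompt, params):
        # Model, prompt, and sampling parameters all belong in the key.
        raw = json.dumps(
            {"model": model, "prompt": prompt, "params": params}, sort_keys=True
        )
        return hashlib.sha256(raw.encode()).hexdigest()

    def get_or_call(self, model, prompt, params, call_model):
        key = self.make_key(model, prompt, params)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        response = call_model(model, prompt, params)
        self._store[key] = response
        return response

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Caching is only safe for steps where a repeated answer is acceptable, so creative or sampling-heavy tasks would normally bypass it.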

Budget Guardrails

Every workflow type has a cost budget. If a workflow is approaching its budget, the system simplifies remaining tasks or terminates early with a partial result.

We've never had a runaway cost incident because the guardrails are enforced at the infrastructure layer, not the application layer. The executor physically cannot make more LLM calls once the budget is exhausted.
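
The idea can be illustrated with a client wrapper that refuses calls once the budget is spent. This in-process version is only a sketch with hypothetical names. An infrastructure-level guardrail would put the same check in a gateway or proxy that application code cannot route around.

```python
class BudgetExceeded(Exception):
    """Raised instead of making a call that would exceed the workflow budget."""


class BudgetedModelClient:
    """Wraps a model-calling function and enforces a per-workflow budget."""

    def __init__(self, call_model, budget_usd):
        self._call_model = call_model
        self.budget_usd = budget_usd
        self.spent_usd = 0.0

    @property
    def remaining_usd(self):
        return self.budget_usd - self.spent_usd

    def complete(self, prompt, estimated_cost_usd):
        # Check before calling: an over-budget request never reaches the model.
        if estimated_cost_usd > self.remaining_usd:
            raise BudgetExceeded(
                f"need ${estimated_cost_usd:.2f}, only ${self.remaining_usd:.2f} left"
            )
        self.spent_usd += estimated_cost_usd
        return self._call_model(prompt)
```

Charging the estimate up front is conservative. A fuller version would reconcile against the actual cost reported after each call.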

Lessons Learned

Postgres is enough. We started with a complex event-driven architecture. We simplified to Postgres queues and never looked back.

Replanning beats retrying. When an agent fails, don't do the same thing again. Do a different thing.

Observability is not optional. You will debug production issues. Make it possible.

Cost controls must be infrastructure-level. Application-level cost controls get bypassed. Infrastructure-level controls don't.

Humans are part of the system. Design the escalation path as carefully as you design the happy path.

What's Next

We're currently working on predictive scaling — using historical workflow patterns to pre-warm executors before demand spikes. And we're experimenting with multi-model consensus — running critical tasks on multiple models and taking the majority answer — to improve reliability on high-stakes workflows.
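
Majority voting itself is simple to sketch. The function below assumes the answers are already normalized to comparable values such as labels, which is the hard part for free-form output. Returning `None` when there is no strict majority is a design assumption that leaves the tie-break to the caller.

```python
from collections import Counter


def majority_answer(answers):
    """Return the answer given by a strict majority of models, else None."""
    if not answers:
        return None
    answer, count = Counter(answers).most_common(1)[0]
    return answer if count > len(answers) / 2 else None
```

A `None` result is a natural trigger for escalation, since disagreement between models is itself a signal that the task is ambiguous.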

The frontier of agent engineering isn't about making smarter agents. It's about making more reliable, more observable, and more cost-effective systems. That's where we're focused.