Scaling an AI Agent to 300k+ Users on Kubernetes
TL;DR
Scaling a social-commerce AI agent to 300k+ users meant queueing, autoscaling on Kubernetes, and tight per-message cost control. This post covers the architecture that survived that scale; it extends the BikroyBuddy case study.
The shape of the load
Social commerce traffic is spiky. It is quiet at 4 AM and peaks around 8 PM and during sales events. Naive static provisioning either overpays during the quiet hours or chokes during the peaks.
Architecture at 300k+
| Layer | Tech | Why |
|---|---|---|
| Ingress | NGINX Ingress on EKS | Standard, observable |
| Message queue | Redis Streams + AWS SQS | Backpressure absorption |
| Workers | Go services on EKS | Autoscaled on queue depth |
| LLM router | openrouter-free-infer | Cost control by routing |
| Data | PostgreSQL (RDS), Redis | Cached hot path |
| Observability | Prometheus + Grafana + OpenTelemetry | Per-conversation traces |
The post "Microservices as One Engineer" covers the lower-scale version of this stack.
The decisive moves
1. Queue before compute
Every inbound message hits Redis first. The HTTP gateway just enqueues; it never blocks on LLM calls. This decouples ingestion latency from inference latency: gateway latency stays under 50 ms even when LLM calls slow down.
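A minimal sketch of the enqueue-only gateway, under stated assumptions: a buffered Go channel stands in for Redis Streams/SQS, and the `Message` type, the `X-Message-ID` header, and the 202/429 status choices are hypothetical rather than taken from the production service.

```go
package main

import (
	"errors"
	"io"
	"net/http"
)

// Message is a hypothetical inbound chat message.
type Message struct {
	ID   string
	Body string
}

// ErrQueueFull signals backpressure to the caller.
var ErrQueueFull = errors.New("queue full")

// Gateway only enqueues. The buffered channel is a stand-in for the real queue.
type Gateway struct {
	queue chan Message
}

func NewGateway(capacity int) *Gateway {
	return &Gateway{queue: make(chan Message, capacity)}
}

// Enqueue never blocks: it accepts the message or reports backpressure.
func (g *Gateway) Enqueue(m Message) error {
	select {
	case g.queue <- m:
		return nil
	default:
		return ErrQueueFull
	}
}

// Dequeue is what a worker would call; it returns false when the queue is empty.
func (g *Gateway) Dequeue() (Message, bool) {
	select {
	case m := <-g.queue:
		return m, true
	default:
		return Message{}, false
	}
}

// ServeHTTP acknowledges the message without waiting for any LLM call.
func (g *Gateway) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	body, err := io.ReadAll(io.LimitReader(r.Body, 64<<10))
	if err != nil {
		http.Error(w, "bad body", http.StatusBadRequest)
		return
	}
	msg := Message{ID: r.Header.Get("X-Message-ID"), Body: string(body)}
	if err := g.Enqueue(msg); err != nil {
		http.Error(w, "busy", http.StatusTooManyRequests)
		return
	}
	w.WriteHeader(http.StatusAccepted)
}
```

Because the handler does nothing but a non-blocking send, its latency is independent of how long the workers take to answer.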
2. Autoscale on queue depth
We do not autoscale on CPU; we autoscale on queue depth per worker. When the queue grows faster than the workers can drain it, Kubernetes scales out. When it shrinks, Kubernetes scales back in.
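The scaling rule can be sketched as a small function. This is an illustration only: the target depth per worker and the min/max bounds are assumed values, and a real HPA additionally applies a tolerance band and stabilization windows that are not modeled here.

```go
package main

// desiredWorkers returns how many workers are needed so that each one
// holds at most targetPerWorker queued messages, clamped to [minW, maxW].
func desiredWorkers(queueDepth, targetPerWorker, minW, maxW int) int {
	if targetPerWorker <= 0 {
		return minW // invalid target: fall back to the floor
	}
	// Integer ceiling of queueDepth / targetPerWorker.
	n := (queueDepth + targetPerWorker - 1) / targetPerWorker
	if n < minW {
		return minW
	}
	if n > maxW {
		return maxW
	}
	return n
}
```

For example, 950 queued messages with a target of 100 per worker asks for 10 workers, while an empty queue falls back to the configured minimum.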
3. Cost-aware LLM routing
Most messages are routine ("how much is the red shirt?"). We route ~80% to cheap or free LLMs via openrouter-free-infer. Premium LLMs handle the ambiguous 20%. This single decision cuts the LLM bill by 4-5x.
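A rough sketch of the routing decision and the arithmetic behind the 4-5x figure. The ambiguity score, its threshold, and the relative prices are assumptions for illustration; this is not the openrouter-free-infer API.

```go
package main

// Tier names a class of model; both values are illustrative.
type Tier string

const (
	TierCheap   Tier = "cheap"
	TierPremium Tier = "premium"
)

// route sends a message to the premium tier only when a (hypothetical)
// ambiguity score reaches the threshold.
func route(ambiguity, threshold float64) Tier {
	if ambiguity >= threshold {
		return TierPremium
	}
	return TierCheap
}

// blendedCost is the average per-message cost when cheapShare of the
// traffic goes to the cheap tier and the rest to the premium tier.
func blendedCost(cheapShare, cheapCost, premiumCost float64) float64 {
	return cheapShare*cheapCost + (1-cheapShare)*premiumCost
}
```

With 80% of traffic on a free tier, the blended cost is 0.2 of the premium price, a 5x reduction. If the cheap tier instead costs 5% of the premium price, the blended cost is 0.8 × 0.05 + 0.2 = 0.24, roughly a 4.2x reduction, which brackets the 4-5x range above.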
4. Per-conversation idempotency
Every message has an idempotency key. Re-deliveries from upstream (Meta, WhatsApp) are deduped at the gateway. This sounds obvious; missing it caused the worst incident of the year.
5. Bounded conversation context
We never send the full conversation history to the LLM. A Manager sanitizes context and passes only what matters. Without this, token cost balloons with conversation length.
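The Manager's selection logic is not described here, so the sketch below substitutes a simpler technique: a recency window under a token budget. The `Turn` type and the four-characters-per-token estimate are assumptions for illustration.

```go
package main

// Turn is one message in a conversation (hypothetical shape).
type Turn struct {
	Role string
	Text string
}

// estimateTokens is a crude stand-in for a real tokenizer.
func estimateTokens(s string) int {
	return len(s)/4 + 1
}

// boundContext keeps the most recent turns whose combined estimated
// token count fits within maxTokens, preserving their original order.
func boundContext(history []Turn, maxTokens int) []Turn {
	used := 0
	start := len(history)
	for i := len(history) - 1; i >= 0; i-- {
		cost := estimateTokens(history[i].Text)
		if used+cost > maxTokens {
			break
		}
		used += cost
		start = i
	}
	return history[start:]
}
```

Whatever the selection rule, the point is the same: the prompt size is capped by a budget rather than growing with conversation length.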
What broke at scale
DB connection pools. 100 workers × 10 conns each = 1000 conns on a Postgres instance configured for 200. Easy mistake; explicit pool sizing fixed it.
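The sizing arithmetic can be made explicit. In this sketch the reserved headroom (for migrations, admin sessions, and so on) is an assumed parameter, and the function names are hypothetical.

```go
package main

// totalConns is the worst-case connection count if every worker
// fills its pool.
func totalConns(workers, connsPerWorker int) int {
	return workers * connsPerWorker
}

// perWorkerConns divides the database connection limit, minus reserved
// headroom, evenly across workers. It returns 0 when the budget cannot
// give every worker at least one connection, which signals that a
// connection pooler or fewer workers are needed.
func perWorkerConns(maxConns, reserved, workers int) int {
	if workers <= 0 {
		return 0
	}
	n := (maxConns - reserved) / workers
	if n < 1 {
		return 0
	}
	return n
}
```

With a limit of 200 and 20 connections held back, 100 workers get one connection each. In Go's `database/sql`, the resulting cap can be applied per worker with `(*sql.DB).SetMaxOpenConns`.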
Idempotency edge cases. Two retries within the same second produced two responses. Fixed by hardening the dedupe window.
Cost spikes during traffic surges. Even with cost routing, peak hours could 5x the daily spend. We added per-merchant rate limits with backpressure.
Why Kubernetes here
For BikroyBuddy at 5k users, ECS Fargate was the right call. At 300k+, Kubernetes pays off through its autoscaling primitives (notably the HPA) and its deployment tooling. The crossover is somewhere around 50k MAU for chat-heavy products.
What I'd do differently
Start with idempotency from day one, not day 200. Every retry-related bug is preventable with discipline early.
FAQ
Q: Why not Lambda? A: Cold starts plus per-invoke billing add up. For sustained traffic with backpressure, containers win.
Q: How do you keep costs predictable at scale? A: Per-merchant rate limits, model routing, and aggressive caching. The biggest single move is LLM routing.
Q: What's the next scaling milestone? A: 1M+ users requires a different data layer — sharded Postgres or a switch to columnar storage for analytical loads.
Written by Shihab Shahriar Antor — Founder of Shahriar Labs. See my projects or hire me.