Kubernetes
Scaling
AI Agent
Architecture

Scaling an AI Agent to 300k+ Users on Kubernetes

Shihab Shahriar Antor
9 min read

TL;DR

Scaling a social-commerce AI agent to 300k+ users meant queueing, autoscaling on Kubernetes, and tight cost control. Here is the architecture that survived.

Scaling a social-commerce AI agent to 300k+ users meant queueing, autoscaling on Kubernetes, and tight per-message cost control. This post is what survived the scale; the path here is the BikroyBuddy case study extended.

The shape of the load

Social commerce traffic is spiky. Quiet at 4 AM, peaking at 8 PM and during sales events. A naive provisioning approach overpays for the quiet times or chokes during peaks.

Architecture at 300k+

LayerTechWhy
IngressNGINX Ingress on EKSStandard, observable
Message queueRedis Streams + AWS SQSBackpressure absorption
WorkersGo services on EKSAutoscaled on queue depth
LLM routeropenrouter-free-inferCost control by routing
DataPostgreSQL (RDS), RedisCached hot path
ObservabilityPrometheus + Grafana + OpenTelemetryPer-conversation traces

Microservices as One Engineer covers the lower-scale version of this stack.

The decisive moves

1. Queue before compute

Every inbound message hits Redis first. The HTTP gateway just enqueues; it never blocks on LLM calls. This decouples ingestion latency from inference latency — gateway latency stays under 50 ms even when LLMs slow down.

2. Autoscale on queue depth

We do not autoscale on CPU. We autoscale on queue depth per worker. When the queue grows faster than workers can drain, K8s scales out. When it shrinks, we scale in.

3. Cost-aware LLM routing

Most messages are routine ("how much is the red shirt?"). We route ~80% to cheap or free LLMs via openrouter-free-infer. Premium LLMs handle the ambiguous 20%. This single decision cuts the LLM bill by 4-5x.

4. Per-conversation idempotency

Every message has an idempotency key. Re-deliveries from upstream (Meta, WhatsApp) are deduped at the gateway. This sounds obvious; missing it caused the worst incident of the year.

5. Bounded conversation context

We never send the full conversation history to the LLM. A Manager sanitizes context and passes only what matters. Without this, token cost balloons with conversation length.

What broke at scale

DB connection pools. 100 workers × 10 conns each = 1000 conns on a Postgres instance configured for 200. Easy mistake; explicit pool sizing fixed it.

Idempotency edge cases. Two retries within the same second produced two responses. Fixed by hardening the dedupe window.

Cost spikes during traffic surges. Even with cost routing, peak hours could 5x the daily spend. We added per-merchant rate limits with backpressure.

Why Kubernetes here

For BikroyBuddy at 5k users, ECS Fargate was the right call. At 300k+, k8s pays off — the autoscaling primitives, the HPA, the deployment tooling. The crossover is somewhere around 50k MAU for chat-heavy products.

What I'd do differently

Start with idempotency from day one, not day 200. Every retry-related bug is preventable with discipline early.

FAQ

Q: Why not Lambda? A: Cold starts plus per-invoke billing add up. For sustained traffic with backpressure, containers win.

Q: How do you keep costs predictable at scale? A: Per-merchant rate limits, model routing, and aggressive caching. The biggest single move is LLM routing.

Q: What's the next scaling milestone? A: 1M+ users requires a different data layer — sharded Postgres or a switch to columnar storage for analytical loads.


Written by Shihab Shahriar Antor — Founder of Shahriar Labs. See my projects or hire me.

Written by

Shihab Shahriar Antor — AI Engineer & Founder of Shahriar Labs. Creator of LetX, QuantumSketch, and more.

Share this mission log