The 429 Problem

You launch your AI wrapper. It hits the front page of Hacker News. Suddenly, your LLM provider sends back a wall of HTTP 429 Too Many Requests errors. Your app crashes, users leave, and your launch is ruined.

Rate limits are inevitable. But downtime isn't.

At Shahriar Labs, we realized that relying on a single LLM provider is a single point of failure. We needed a load balancer for LLMs. That is why we open-sourced freelm.

Building Resilient AI

freelm (available on npm and PyPI) isn't just an API wrapper. It is a robust gateway designed around Circuit Breaking and Automatic Failover.

1. Interleaved Failover Routing

When you provide freelm with multiple keys (e.g., OpenRouter, Gemini, and NIM), it doesn't just exhaust one provider before moving to the next.

It uses an interleaved routing strategy: it first tries the best model from your primary provider. If that fails, it tries the best model from your secondary provider. This keeps response quality high during a failover, rather than degrading to a weaker fallback model on the primary provider.
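The ordering described above can be sketched in a few lines. This is an illustration, not freelm's actual source: the interleave function, the input shape, and the model names are all hypothetical, and it assumes each provider lists its models best-first.

```javascript
// Hypothetical input: providers in priority order, each with models ranked best-first.
// Returns the order in which (provider, model) pairs would be attempted.
function interleave(providers) {
  const order = [];
  const depth = Math.max(0, ...providers.map((p) => p.models.length));
  // Take every provider's rank-0 model first, then every rank-1 model, and so on.
  for (let rank = 0; rank < depth; rank++) {
    for (const p of providers) {
      if (rank < p.models.length) {
        order.push({ provider: p.name, model: p.models[rank] });
      }
    }
  }
  return order;
}

// Placeholder model names, for illustration only.
const attempts = interleave([
  { name: "openrouter", models: ["or-large", "or-small"] },
  { name: "gemini", models: ["gem-large", "gem-small"] },
  { name: "nim", models: ["nim-large"] },
]);

console.log(attempts.map((a) => `${a.provider}/${a.model}`).join(" -> "));
```

With this ordering, every provider's strongest model is tried before any provider's second choice.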

2. Intelligent Circuit Breakers

When an API key returns a 429 Rate Limited or a 5xx Server Error:

freelm immediately opens the circuit for that specific key.
The key is placed on a cooldown timer (e.g., 60 seconds).
Subsequent requests bypass this key entirely, which avoids wasted round trips and further penalties from the provider.
Once the cooldown expires, the circuit enters a "half-open" state to test if the provider is healthy again.
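To make the state transitions concrete, here is a minimal per-key circuit breaker. It is a sketch of the general pattern under stated assumptions, not freelm's internals: the KeyCircuit class, its method names, and the injectable clock are hypothetical, and the 60-second default simply mirrors the example cooldown above.

```javascript
// Hypothetical per-key circuit breaker: closed -> open -> half-open -> closed.
class KeyCircuit {
  constructor({ cooldownMs = 60_000, now = Date.now } = {}) {
    this.cooldownMs = cooldownMs;
    this.now = now; // injectable clock, so the behavior can be tested
    this.state = "closed";
    this.openedAt = 0;
  }

  // Should the router try this key right now?
  canAttempt() {
    if (this.state === "open" && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = "half-open"; // cooldown over: let a trial request through
    }
    return this.state !== "open";
  }

  // Report the HTTP status of the last request made with this key.
  record(status) {
    if (status === 429 || (status >= 500 && status <= 599)) {
      this.state = "open"; // a failed trial request restarts the cooldown
      this.openedAt = this.now();
    } else {
      this.state = "closed"; // the provider answered, so resume normal use
    }
  }
}

// Usage with a fake clock (milliseconds).
let fakeTime = 0;
const circuit = new KeyCircuit({ cooldownMs: 60_000, now: () => fakeTime });
circuit.record(429);
const duringCooldown = circuit.canAttempt(); // key is skipped
fakeTime = 60_000;
const afterCooldown = circuit.canAttempt(); // trial request allowed
console.log(duringCooldown, afterCooldown, circuit.state);
```

A stricter variant would allow only one in-flight trial request while half-open; this sketch leaves that out for brevity.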

3. Quota Awareness

Even better than failing over is predicting a failure. freelm implements a Token Bucket algorithm internally. It knows the default RPM (Requests Per Minute) limits of the free tiers it supports.

If freelm sees that your Groq key is about to hit its 30 RPM limit, it proactively routes the 31st request in that minute to Gemini, before Groq even has a chance to reject it.
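The Groq example can be reproduced with a small token bucket. Treat this as a sketch rather than freelm's implementation: TokenBucket and pickProvider are hypothetical names, the clock is injected for determinism, and the Gemini limit below is a placeholder rather than a documented quota. Only the 30 RPM figure comes from the example above.

```javascript
// Hypothetical token bucket: holds up to `rpm` tokens, refilled continuously.
class TokenBucket {
  constructor({ rpm, now = Date.now }) {
    this.capacity = rpm;
    this.refillPerMs = rpm / 60_000;
    this.now = now;
    this.tokens = rpm; // start full
    this.updatedAt = now();
  }

  // Consume one token if available; returns false when the key is out of budget.
  tryTake() {
    const t = this.now();
    const refill = (t - this.updatedAt) * this.refillPerMs;
    this.tokens = Math.min(this.capacity, this.tokens + refill);
    this.updatedAt = t;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

// Pick the first provider (in priority order) that still has budget.
function pickProvider(buckets) {
  for (const [name, bucket] of buckets) {
    if (bucket.tryTake()) return name;
  }
  return null; // every provider is out of budget
}

let clockMs = 0;
const buckets = new Map([
  ["groq", new TokenBucket({ rpm: 30, now: () => clockMs })],
  ["gemini", new TokenBucket({ rpm: 10, now: () => clockMs })], // placeholder limit
]);

// Fire 31 requests at the same instant and record where each one is routed.
const picks = [];
for (let i = 0; i < 31; i++) picks.push(pickProvider(buckets));
console.log(picks[29], picks[30]);
```

The first 30 requests drain the Groq bucket and the 31st is routed to Gemini without Groq ever returning a 429. A token bucket permits short bursts, so it only approximates a provider's own per-minute accounting.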

The Developer Experience

The best part about this architecture? You don't have to write a single try/except or try/catch block to handle failover yourself.

// Node.js Example
import { OpenAI } from 'freelm/compat';

const client = new OpenAI(); // Automatically reads your environment variables

// If OpenRouter is down, this will transparently hit Gemini instead.
const completion = await client.chat.completions.create({
  model: "auto",
  messages: [{ role: "user", content: "Why is failover important?" }],
});

console.log(completion.choices[0].message.content);

Conclusion

Stop letting rate limits dictate your uptime. By pooling multiple free providers using freelm, you can achieve enterprise-grade reliability on a zero-dollar budget.
