Handling LLM Rate Limits and Failovers Automatically
TL;DR
Rate limits (HTTP 429) are the silent killers of AI applications. Learn how freelm implements intelligent circuit breaking and provider failover to achieve 99.9% uptime on free tiers.
The 429 Problem
You launch your AI wrapper. It hits the front page of Hacker News. Suddenly, your LLM provider sends back a wall of HTTP 429 Too Many Requests errors. Your app crashes, users leave, and your launch is ruined.
Rate limits are inevitable. But downtime isn't.
At Shahriar Labs, we realized that relying on a single LLM provider is a single point of failure. We needed a load balancer for LLMs. That is why we open-sourced freelm.
Building Resilient AI
freelm (available on npm and PyPI) isn't just an API wrapper. It is a robust gateway designed around Circuit Breaking and Automatic Failover.
1. Interleaved Failover Routing
When you provide freelm with multiple keys (e.g., OpenRouter, Gemini, and NIM), it doesn't just exhaust one provider before moving to the next.
It uses an interleaved routing strategy: it first attempts the best model from your primary provider. If that fails, it tries the best model from your secondary provider instead of dropping to a weaker fallback model on the primary. You keep getting high-quality responses even during a failover.
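The ordering described above can be sketched in a few lines. This is an illustration only, not freelm's actual internals: the function names, the provider object shape, and the placeholder model names are all assumptions made for the example.

```javascript
// Hypothetical sketch of interleaved routing; not freelm's real code.
// `providers` is ordered by priority, and each provider lists its models best-first.
function interleave(providers) {
  const order = [];
  const depth = Math.max(0, ...providers.map((p) => p.models.length));
  // Walk rank by rank: every provider's best model comes before anyone's second-best.
  for (let rank = 0; rank < depth; rank++) {
    for (const p of providers) {
      if (rank < p.models.length) {
        order.push({ provider: p.name, model: p.models[rank] });
      }
    }
  }
  return order;
}

// Try each candidate in interleaved order. `call` is a caller-supplied async
// function that sends the request and throws on failure (assumed for this sketch).
async function completeWithFailover(providers, call) {
  const errors = [];
  for (const candidate of interleave(providers)) {
    try {
      return await call(candidate);
    } catch (err) {
      errors.push(err); // remember the failure and move on to the next candidate
    }
  }
  throw new AggregateError(errors, "all providers failed");
}

// Placeholder provider and model names, purely illustrative.
const exampleProviders = [
  { name: "primary", models: ["primary-best", "primary-fallback"] },
  { name: "secondary", models: ["secondary-best"] },
];
console.log(interleave(exampleProviders).map((c) => c.model));
```

With this ordering, a failure on the primary's best model is retried on the secondary's best model before the primary's fallback is considered.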
2. Intelligent Circuit Breakers
When an API key returns a 429 Too Many Requests or a 5xx server error:
- freelm immediately opens the circuit for that specific key.
- The key is placed on a cooldown timer (e.g., 60 seconds).
- Subsequent requests bypass this key entirely, avoiding wasted time and avoiding further penalization from the provider.
- Once the cooldown expires, the circuit enters a "half-open" state to test if the provider is healthy again.
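The closed, open, and half-open cycle in the list above can be modeled as a small per-key state machine. The sketch below is a minimal illustration under stated assumptions: the class name `KeyCircuit`, its methods, and the injectable clock are hypothetical and are not taken from freelm's source.

```javascript
// Hypothetical per-key circuit breaker; not freelm's real implementation.
class KeyCircuit {
  constructor(cooldownMs = 60_000, now = () => Date.now()) {
    this.cooldownMs = cooldownMs;
    this.now = now;        // injectable clock, which keeps the sketch easy to test
    this.openedAt = null;  // null means the circuit is closed
    this.probing = false;  // true while a half-open trial request is in flight
  }

  state() {
    if (this.openedAt === null) return "closed";
    return this.now() - this.openedAt >= this.cooldownMs ? "half-open" : "open";
  }

  // Should the router send a request through this key right now?
  canAttempt() {
    const s = this.state();
    if (s === "closed") return true;
    if (s === "half-open" && !this.probing) {
      this.probing = true; // allow exactly one trial request
      return true;
    }
    return false;
  }

  // Report the HTTP status of the latest response for this key.
  record(status) {
    this.probing = false;
    if (status === 429 || (status >= 500 && status <= 599)) {
      this.openedAt = this.now(); // open (or re-open) and restart the cooldown
    } else {
      this.openedAt = null; // any other response closes the circuit
    }
  }
}
```

A router would call `canAttempt()` before using a key and `record(status)` after each response. Allowing a single probe in the half-open state means a provider that is still unhealthy costs one request, not a burst of them.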
3. Quota Awareness
Even better than failing over is predicting a failure. freelm implements a Token Bucket algorithm internally. It knows the default RPM (Requests Per Minute) limits of the free tiers it supports.
If freelm sees that your Groq key is about to hit its 30 RPM limit, it proactively routes the 31st request in that minute to Gemini, before Groq even has a chance to reject it.
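To make the idea concrete, here is a minimal token bucket sketch with a router that skips any provider whose bucket is empty. It is an assumption-laden illustration, not freelm's code: `TokenBucket`, `pickProvider`, and the injectable clock are hypothetical names, and the Gemini limit used in the example is a placeholder, not a documented quota.

```javascript
// Hypothetical token bucket sized to a provider's requests-per-minute limit.
class TokenBucket {
  constructor(rpm, now = () => Date.now()) {
    this.capacity = rpm;
    this.tokens = rpm;               // start with a full bucket
    this.refillPerMs = rpm / 60_000; // tokens regained per millisecond
    this.now = now;
    this.last = now();
  }

  // Take one token if available; return false when the budget is spent.
  tryTake() {
    const t = this.now();
    const refilled = this.tokens + (t - this.last) * this.refillPerMs;
    this.tokens = Math.min(this.capacity, refilled);
    this.last = t;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

// Pick the first provider, in priority order, that still has budget.
// `buckets` is a Map of provider name to TokenBucket.
function pickProvider(buckets) {
  for (const [name, bucket] of buckets) {
    if (bucket.tryTake()) return name;
  }
  return null; // every provider is out of budget right now
}

// 30 RPM for Groq comes from the text above; the Gemini value is a placeholder.
const exampleBuckets = new Map([
  ["groq", new TokenBucket(30)],
  ["gemini", new TokenBucket(10)],
]);
console.log(pickProvider(exampleBuckets));
```

Because the check happens locally, the request that would have exceeded the limit never reaches the exhausted provider, so no 429 is returned for it.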
The Developer Experience
The best part about this architecture? You don't have to write a single try/except or try/catch block.
// Node.js Example
import { OpenAI } from 'freelm/compat';

const client = new OpenAI(); // Automatically reads your environment variables

// If OpenRouter is down, this will transparently hit Gemini instead.
const completion = await client.chat.completions.create({
  model: "auto",
  messages: [{ role: "user", content: "Why is failover important?" }],
});

console.log(completion.choices[0].message.content);
Conclusion
Stop letting rate limits dictate your uptime. By pooling multiple free providers using freelm, you can achieve enterprise-grade reliability on a zero-dollar budget.
Written by
Shihab Shahriar Antor — AI Engineer & Founder of Shahriar Labs. Creator of LetX, QuantumSketch, and more.