freelm Documentation

freelm is a free, always-up LLM client for Python and Node.js/TypeScript. It pools six free-tier providers — OpenRouter, Google AI Studio (Gemini), NVIDIA NIM, Groq, Cerebras, and Mistral — behind one OpenAI-compatible call with automatic key rotation, cross-provider failover, circuit breaking, and live model discovery.


Installation

Install freelm from the package manager for your language. Python 3.8+ and Node.js 18+ are supported. The Node.js package has zero runtime dependencies and uses the built-in fetch API.

Python (3.8+)

bash

pip install freelm

Node.js / TypeScript (18+)

bash

npm install freelm

Environment Variables

Set whichever keys you have; freelm auto-detects them. To supply multiple keys for one provider, separate them with commas. freelm rotates through them round-robin within that provider, which gives additional rate-limit headroom.

Provider	Key Variables	Tier Variable
OpenRouter	OPENROUTER_API_KEY / FREELM_OPENROUTER_KEYS	FREELM_OPENROUTER_TIER (free | credit)
Google AI Studio	GEMINI_API_KEY / GOOGLE_API_KEY / FREELM_GOOGLE_KEYS	FREELM_GOOGLE_TIER (free | tier1)
NVIDIA NIM	NVIDIA_API_KEY / NIM_API_KEY / FREELM_NIM_KEYS	FREELM_NIM_TIER (free)
Groq	GROQ_API_KEY / FREELM_GROQ_KEYS	FREELM_GROQ_TIER (free)
Cerebras	CEREBRAS_API_KEY / FREELM_CEREBRAS_KEYS	FREELM_CEREBRAS_TIER (free)
Mistral	MISTRAL_API_KEY / FREELM_MISTRAL_KEYS	FREELM_MISTRAL_TIER (free)
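
The comma-separated convention can be illustrated in a few lines of plain Python. This is a sketch of the idea only, not freelm's internals: parse_keys and KeyRing are hypothetical names, and the key values are placeholders.

```python
import itertools
import os
from typing import List


def parse_keys(raw: str) -> List[str]:
    """Split a comma-separated key string, dropping blanks and whitespace."""
    return [k.strip() for k in raw.split(",") if k.strip()]


class KeyRing:
    """Hands out keys in round-robin order (hypothetical helper)."""

    def __init__(self, keys: List[str]) -> None:
        if not keys:
            raise ValueError("at least one key is required")
        self._cycle = itertools.cycle(keys)

    def next_key(self) -> str:
        return next(self._cycle)


# Placeholder values; in practice the variable is set in your shell.
os.environ["FREELM_GROQ_KEYS"] = "gsk_aaa, gsk_bbb,gsk_ccc"

ring = KeyRing(parse_keys(os.environ["FREELM_GROQ_KEYS"]))
picked = [ring.next_key() for _ in range(4)]
print(picked)  # the fourth pick wraps around to the first key
```

Each key carries its own rate limit, so rotating through three keys gives roughly three times the request budget of one, assuming the provider meters per key.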

Quick Start

The fastest path: call from_env() / fromEnv(). It reads all supported keys from your environment and builds the provider pool automatically.

Python

import freelm

llm = freelm.FreeLLM.from_env()
print(llm.text("Explain black holes in one sentence."))

Node.js / TypeScript

TypeScript

import { FreeLLM } from "freelm";

const llm = FreeLLM.fromEnv();
console.log(await llm.text("Explain black holes in one sentence."));

Explicit provider config (Python)

For full control over which providers are used, their priority, and routing strategy:

Python

from freelm import FreeLLM, OpenRouter, GoogleAIStudio, NIM, Groq, Cerebras

llm = FreeLLM(
    providers=[
        OpenRouter("sk-or-...", priority=0),       # tried first
        GoogleAIStudio("AIza...", priority=1),
        Groq("gsk_...", priority=2),
        Cerebras("csk-...", priority=3),
        NIM("nvapi-...", priority=4),
    ],
    strategy="quota_aware",   # priority | round_robin | quota_aware | latency
)

resp = llm.chat(
    [{"role": "user", "content": "Write a haiku about failover."}],
    model="chat:fast",
)
print(resp.text, "via", resp.provider)

Explicit provider config (Node.js)

TypeScript

import { FreeLLM, OpenRouter, GoogleAIStudio, Groq, Cerebras, Mistral, NIM } from "freelm";

const llm = new FreeLLM(
  [
    new OpenRouter("sk-or-..."),
    new GoogleAIStudio("AIza..."),
    new Groq("gsk_..."),
    new Cerebras("csk-..."),
    new Mistral("..."),
    new NIM("nvapi-..."),
  ],
  { strategy: "quota_aware" }
);

const r = await llm.chat(
  [{ role: "user", content: "Write a haiku about failover." }],
  { model: "chat:fast" }
);
console.log(r.text, "via", r.provider);

Streaming

Token streaming works across all providers through the same failover mechanism. freelm fails over between providers before the first token. Once tokens start flowing, it stays on that provider for the rest of the response — no mid-stream switching.

Python (sync)

Python

import freelm

llm = freelm.FreeLLM.from_env()
for chunk in llm.stream("Write a haiku about failover."):
    print(chunk, end="", flush=True)

Python (async)

Python

import asyncio
from freelm import AsyncFreeLLM

async def main() -> None:
    async with AsyncFreeLLM.from_env() as llm:
        async for chunk in llm.astream("Stream me some tokens"):
            print(chunk, end="", flush=True)

asyncio.run(main())

Node.js

TypeScript

import { FreeLLM } from "freelm";

const llm = FreeLLM.fromEnv();
for await (const chunk of llm.stream("Stream me some tokens")) {
  process.stdout.write(chunk);
}
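
The rule "fail over before the first token, then stay put" can be sketched as a plain Python generator. This is an illustration under stated assumptions, not freelm's implementation: stream_with_failover, broken, and healthy are hypothetical names, and every error before the first chunk is treated as retryable for simplicity.

```python
from typing import Callable, Iterator, List


def stream_with_failover(sources: List[Callable[[], Iterator[str]]]) -> Iterator[str]:
    """Yield chunks from the first source that produces a first chunk.

    Failures before the first chunk move on to the next source; failures
    after it propagate, because the stream is already committed.
    """
    errors: List[Exception] = []
    for open_stream in sources:
        try:
            stream = open_stream()
            first = next(stream)  # may raise before any token arrives
        except StopIteration:
            return  # source opened and finished with an empty response
        except Exception as exc:  # assumption: every early error is retryable
            errors.append(exc)
            continue
        yield first
        yield from stream  # committed: no switching from here on
        return
    raise RuntimeError(f"all {len(errors)} sources failed before the first token")


def broken() -> Iterator[str]:
    raise ConnectionError("rate limited")
    yield  # unreachable; makes this function a generator


def healthy() -> Iterator[str]:
    yield from ["fail", "over", "!"]


chunks = list(stream_with_failover([broken, healthy]))
print(chunks)
```

Committing after the first chunk avoids stitching together output from two different models, at the cost of surfacing any mid-stream error to the caller.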

Drop-in OpenAI Shim

Change one import line. Your existing OpenAI SDK code works unchanged — backed by free providers instead of a paid API.

Python

# Before:  from openai import OpenAI
# After:
from freelm.compat import OpenAI

client = OpenAI()   # backed by FreeLLM.from_env()
r = client.chat.completions.create(
    model="auto",   # virtual model: picks best available
    messages=[{"role": "user", "content": "hi"}],
)
print(r.choices[0].message.content)

Node.js / TypeScript

TypeScript

// Before:  import OpenAI from "openai";
// After:
import { OpenAI } from "freelm/compat";

const client = new OpenAI();   // backed by FreeLLM.fromEnv()
const r = await client.chat.completions.create({
  model: "auto",
  messages: [{ role: "user", content: "hi" }],
});
console.log(r.choices[0].message.content);

Virtual Models

Model names differ across providers. Use intent-based aliases and freelm resolves each to a concrete model. Free model IDs change constantly, so freelm discovers them live from each provider's /models endpoint and caches results to ~/.cache/freelm.

Alias	Meaning
auto / chat	Any available chat model (registry order)
chat:large / large	A larger, stronger model
chat:fast / fast	A fast, cheap model (e.g. Groq or Cerebras)
chat:small / small	The smallest available model
vendor/model-id	Exact passthrough — use precisely this model ID
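
The alias table can be read as a small lookup routine. The sketch below is one possible interpretation, not freelm's resolver: the Model fields, the tags, the size-based meaning of "small", and the registry entries are all made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Model:
    id: str                 # "vendor/model-id"
    tags: Tuple[str, ...]   # hypothetical tags such as "fast" or "large"
    params_b: float         # hypothetical size in billions of parameters


# Made-up registry entries, for illustration only.
REGISTRY: List[Model] = [
    Model("vendor-a/big-70b", ("large",), 70.0),
    Model("vendor-b/quick-8b", ("fast",), 8.0),
    Model("vendor-c/tiny-1b", ("fast", "small"), 1.0),
]


def resolve(alias: str, registry: List[Model] = REGISTRY) -> str:
    """Map an intent-based alias to a concrete model ID."""
    if "/" in alias:
        return alias  # exact passthrough
    name = alias.split(":", 1)[1] if alias.startswith("chat:") else alias
    if name in ("auto", "chat"):
        return registry[0].id  # first entry in registry order
    if name == "small":
        return min(registry, key=lambda m: m.params_b).id
    for model in registry:  # "large" or "fast": first model carrying the tag
        if name in model.tags:
            return model.id
    raise KeyError(f"no model matches alias {alias!r}")


for alias in ("auto", "chat:fast", "small", "vendor-x/custom"):
    print(alias, "->", resolve(alias))
```

Because aliases are resolved against whatever the registry currently holds, the same call keeps working when a provider retires or renames a free model.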

List currently available free models

Python

from freelm import list_free_models

for m in list_free_models()[:5]:
    print(m.id, m.tags, m.ctx)

Node.js

import { listFreeModels } from "freelm";

const models = await listFreeModels();
for (const m of models.slice(0, 5)) {
  console.log(m.id, m.tags);
}

Routing Strategies

Set the strategy parameter when constructing FreeLLM; the default is priority. Regardless of strategy, failover is always interleaved across providers, so no single provider monopolizes retries.
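
One way to read "interleaved" is that retry candidates alternate between providers instead of exhausting one provider's keys first. The sketch below assumes that reading; interleave and the key labels are hypothetical, and freelm's actual ordering may differ.

```python
from itertools import zip_longest
from typing import Dict, List, Tuple


def interleave(keys_by_provider: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Order (provider, key) candidates so that providers take turns."""
    columns = [
        [(provider, key) for key in keys]
        for provider, keys in keys_by_provider.items()
    ]
    ordered: List[Tuple[str, str]] = []
    # Each round takes one candidate per provider; short columns are padded with None.
    for round_ in zip_longest(*columns):
        ordered.extend(c for c in round_ if c is not None)
    return ordered


candidates = interleave({
    "groq": ["g1", "g2", "g3"],
    "cerebras": ["c1"],
    "mistral": ["m1", "m2"],
})
for provider, key in candidates:
    print(provider, key)
```

With this ordering, a provider-wide outage costs one failed attempt per round rather than one per key.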

priority

Providers are tried in ascending order of their priority integer (0 first). Deterministic and predictable.

round_robin

Rotates which provider goes first each call, spreading load evenly across all keys.

quota_aware

Ranks candidates by remaining quota headroom; keys nearing their daily limits score lower.

latency

Prefers the provider with the lowest observed round-trip latency, tracked as an exponentially weighted moving average (EWMA).
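
The four strategies differ only in how they rank providers before the first attempt. A minimal sketch follows, assuming hypothetical state fields (quota_left, ewma_latency_ms) and a hypothetical smoothing factor alpha; none of these names are confirmed freelm APIs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProviderState:
    name: str
    priority: int           # lower is tried first
    quota_left: float       # hypothetical fraction of daily quota remaining, 0..1
    ewma_latency_ms: float  # smoothed round-trip latency


def update_ewma(previous: float, sample: float, alpha: float = 0.2) -> float:
    """Standard EWMA update; alpha is a hypothetical smoothing factor."""
    return alpha * sample + (1 - alpha) * previous


def order(providers: List[ProviderState], strategy: str, call_index: int = 0) -> List[str]:
    """Return provider names in the order a strategy would try them."""
    if strategy == "priority":
        ranked = sorted(providers, key=lambda p: p.priority)
    elif strategy == "round_robin":
        shift = call_index % len(providers)
        ranked = providers[shift:] + providers[:shift]
    elif strategy == "quota_aware":
        ranked = sorted(providers, key=lambda p: p.quota_left, reverse=True)
    elif strategy == "latency":
        ranked = sorted(providers, key=lambda p: p.ewma_latency_ms)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [p.name for p in ranked]


pool = [
    ProviderState("openrouter", 0, 0.10, 900.0),
    ProviderState("groq", 1, 0.80, 120.0),
    ProviderState("cerebras", 2, 0.50, 150.0),
]
for strategy in ("priority", "round_robin", "quota_aware", "latency"):
    print(strategy, order(pool, strategy, call_index=1))
```

An EWMA weights recent samples more heavily than old ones, so a provider that slows down loses its ranking within a few calls without a single outlier dominating.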

Error Handling (Python)

Retryable errors (RateLimited, Transient) are handled internally; they surface only inside NoProvidersAvailable, once every provider has been exhausted. Non-retryable errors (AuthError, or a 4xx caused by the caller's request) surface immediately as ProviderError.

Python

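import freelm
from freelm import NoProvidersAvailable, ProviderError

llm = freelm.FreeLLM.from_env()

try:
    resp = llm.chat([{"role": "user", "content": "hi"}])
except NoProvidersAvailable as e:
    # Every candidate failed with a retryable error.
    print("All providers exhausted:", e.attempts)
    # e.attempts = [(candidate, exception), ...]
except ProviderError as e:
    # Non-retryable failure, raised immediately.
    print(e.provider, e.status, e.retryable)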
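
The split between retryable and non-retryable errors can be sketched as a simple failover loop. The two exception classes below are stand-ins that borrow freelm's names for readability; they are not the library's classes, and call_with_failover is a hypothetical helper.

```python
from typing import Callable, List, Tuple


class ProviderError(Exception):
    """Stand-in for a provider failure; not freelm's class."""

    def __init__(self, provider: str, status: int, retryable: bool) -> None:
        super().__init__(f"{provider} returned {status}")
        self.provider, self.status, self.retryable = provider, status, retryable


class NoProvidersAvailable(Exception):
    """Stand-in raised when every candidate failed with a retryable error."""

    def __init__(self, attempts: List[Tuple[str, Exception]]) -> None:
        super().__init__(f"{len(attempts)} candidates exhausted")
        self.attempts = attempts


def call_with_failover(candidates: List[Tuple[str, Callable[[], str]]]) -> str:
    attempts: List[Tuple[str, Exception]] = []
    for name, call in candidates:
        try:
            return call()
        except ProviderError as exc:
            if not exc.retryable:
                raise  # e.g. bad credentials: surface immediately
            attempts.append((name, exc))  # e.g. rate limit: try the next one
    raise NoProvidersAvailable(attempts)


def rate_limited() -> str:
    raise ProviderError("groq", 429, retryable=True)


def ok() -> str:
    return "hello"


result = call_with_failover([("groq", rate_limited), ("cerebras", ok)])
print(result)
```

Keeping the per-candidate exceptions in attempts lets the caller see why each provider failed, which a single aggregated message would hide.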

Health & Introspection

Call llm.health() at any time to inspect the live state of every key in the pool.

Python

import freelm

llm = freelm.FreeLLM.from_env()
for row in llm.health():
    print(row)
# Each row reports: provider, key (masked), ready, breaker state, rpd_used, last_error, ewma_latency_ms

Each response object also exposes: r.text, r.provider, r.model, r.usage, r.latency_ms, and r.raw.
