AI API Rate Limits Explained

AI API Rate Limits: A Complete Guide

Rate limits are one of the most common sources of frustration when working with AI APIs. Every provider caps how many requests, tokens, or concurrent connections you can use, and hitting those caps can disrupt both your application and your workflow. This guide explains how rate limits work across major AI providers, why they exist, and practical strategies for working within them or around them.

What Are Rate Limits?

Rate limits are restrictions that AI API providers place on how much you can use their service within a given time period. They serve several purposes:

Infrastructure protection: Prevent any single user from consuming disproportionate computing resources
Fair access: Ensure all users get reasonable response times
Cost control: Protect users from accidental excessive spending
Abuse prevention: Limit the impact of compromised API keys

Types of Rate Limits

AI providers typically enforce multiple types of rate limits simultaneously:

Requests per minute (RPM): The total number of API calls you can make per minute
Tokens per minute (TPM): The total input and output tokens processed per minute
Tokens per day (TPD): Daily token consumption ceiling
Concurrent requests: How many requests can be in-flight simultaneously
Images per minute: For vision or image generation endpoints
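Because these limits apply at the same time, a request has to fit under every one of them at once. The following is a minimal sketch of a client-side check over a sliding 60-second window. The class name `UsageWindow`, its methods, and the limits used are illustrative assumptions rather than any provider's API, and real providers may count tokens and windows differently.

```javascript
// Hypothetical client-side tracker: a request is allowed only if it fits
// under BOTH the requests-per-minute and tokens-per-minute limits.
class UsageWindow {
  constructor({ rpm, tpm, windowMs = 60000, now = () => Date.now() }) {
    this.rpm = rpm;
    this.tpm = tpm;
    this.windowMs = windowMs;
    this.now = now; // injectable clock, useful for testing
    this.events = []; // { time, tokens } for each recorded request
  }

  // Drop events that have aged out of the window.
  prune() {
    const cutoff = this.now() - this.windowMs;
    while (this.events.length > 0 && this.events[0].time <= cutoff) {
      this.events.shift();
    }
  }

  // True if a request using `tokens` tokens fits under both limits right now.
  canSend(tokens) {
    this.prune();
    const usedTokens = this.events.reduce((sum, e) => sum + e.tokens, 0);
    return this.events.length < this.rpm && usedTokens + tokens <= this.tpm;
  }

  // Record a request that was actually sent.
  record(tokens) {
    this.events.push({ time: this.now(), tokens });
  }
}
```

A caller would check `canSend(estimatedTokens)` before each request and call `record(...)` once the request goes out. Whichever limit is reached first is the one that blocks the request.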

Rate Limits by Provider

Anthropic (Claude)

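Anthropic uses a tiered system where limits increase as your spending grows. The figures below are approximate and change over time, so check the current documentation for your model: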

Tier 1 (new accounts): ~50 RPM, 40,000 TPM for Claude Sonnet
Tier 2 ($40+ spent): ~1,000 RPM, 80,000 TPM
Tier 3 ($200+ spent): ~2,000 RPM, 160,000 TPM
Tier 4 ($400+ spent): ~4,000 RPM, 400,000 TPM

Additionally, Anthropic returns 529 status codes when their infrastructure is overloaded, independent of your rate limits.

OpenAI (GPT)

Tier 1: ~500 RPM, 200,000 TPM for GPT-4o
Higher tiers: Progressively higher limits based on spending history
OpenAI uses standard 429 status codes for rate limit responses

Google (Gemini)

Free tier has significant restrictions (2-15 RPM depending on model)
Paid tier limits are substantially higher but vary by model and region

Understanding Rate Limit HTTP Responses

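When you hit a rate limit, the API returns a 429 status code along with headers that describe your quota and when to retry. Header names and formats vary by provider, so treat the following response as illustrative: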

HTTP/1.1 429 Too Many Requests
Content-Type: application/json
retry-after: 30
x-ratelimit-limit-requests: 1000
x-ratelimit-remaining-requests: 0
x-ratelimit-reset-requests: 2026-01-15T10:30:00Z

{
  "error": {
    "type": "rate_limit_error",
    "message": "Rate limit exceeded. Please retry after 30 seconds."
  }
}
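A client can use these headers to decide how long to wait instead of guessing. The sketch below assumes the headers arrive as a plain object with lowercase names and that the reset header carries a timestamp as in the example above. The function name `retryDelayMs` and the one-second fallback are assumptions for illustration. `retry-after` is handled in both forms the HTTP specification allows: a number of seconds or an HTTP date.

```javascript
// Work out how long to wait before retrying, in milliseconds.
// ASSUMPTION: `headers` is a plain object with lowercase header names.
function retryDelayMs(headers, now = Date.now(), fallbackMs = 1000) {
  const retryAfter = headers["retry-after"];
  if (retryAfter !== undefined) {
    const raw = String(retryAfter).trim();
    // Form 1: a number of seconds, e.g. "30".
    const seconds = Number(raw);
    if (raw !== "" && Number.isFinite(seconds)) {
      return Math.max(0, seconds * 1000);
    }
    // Form 2: an HTTP date, e.g. "Thu, 15 Jan 2026 10:30:00 GMT".
    const dateMs = Date.parse(raw);
    if (!Number.isNaN(dateMs)) return Math.max(0, dateMs - now);
  }
  // Otherwise fall back to the reset timestamp, if one is present.
  const reset = headers["x-ratelimit-reset-requests"];
  if (reset !== undefined) {
    const resetMs = Date.parse(String(reset));
    if (!Number.isNaN(resetMs)) return Math.max(0, resetMs - now);
  }
  return fallbackMs;
}
```

For the response shown above, `retryDelayMs({ "retry-after": "30" })` yields 30000 milliseconds.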

Strategies for Handling Rate Limits

1. Exponential Backoff with Jitter

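Exponential backoff is the standard approach for handling rate limit errors: wait longer after each failed attempt. Adding random jitter keeps many clients from retrying at the same instant, which prevents thundering herd problems: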

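// Assumes the official `openai` Node.js package; the API key is read from
// the OPENAI_API_KEY environment variable.
import OpenAI from "openai";

// maxRetries: 0 turns off the SDK's built-in retries so that the loop
// below is the only retry layer.
const client = new OpenAI({ maxRetries: 0 });

// Retry on 429 with exponential backoff plus jitter; rethrow anything else.
async function requestWithBackoff(params, maxRetries = 5) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await client.chat.completions.create(params);
    } catch (error) {
      const isLastAttempt = attempt === maxRetries - 1;
      if (error.status !== 429 || isLastAttempt) throw error;
      const baseDelay = 2 ** attempt * 1000; // 1s, 2s, 4s, ...
      const jitter = Math.random() * 1000; // up to 1s of random spread
      await new Promise((resolve) => setTimeout(resolve, baseDelay + jitter));
    }
  }
}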
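To make the retry schedule concrete, the delay calculation can be pulled out into a pure function. This is a sketch that mirrors the formula used above and adds one assumption of its own: a cap (`capMs`) so that late attempts do not wait indefinitely. The name `backoffDelayMs` and its options are hypothetical.

```javascript
// Delay before retry number `attempt` (0-based), in milliseconds.
// ASSUMPTION: the 30-second cap is an illustrative choice, not a standard.
function backoffDelayMs(
  attempt,
  { baseMs = 1000, capMs = 30000, random = Math.random } = {}
) {
  // Exponential growth: baseMs, 2 * baseMs, 4 * baseMs, ... up to capMs.
  const exponential = Math.min(capMs, baseMs * 2 ** attempt);
  // Jitter: a random extra delay between 0 and baseMs.
  const jitter = random() * baseMs;
  return exponential + jitter;
}

// Schedule with jitter disabled, to show the underlying growth.
const schedule = [0, 1, 2, 3, 4, 5].map((attempt) =>
  backoffDelayMs(attempt, { random: () => 0 })
);
console.log(schedule); // [ 1000, 2000, 4000, 8000, 16000, 30000 ]
```

Passing `random` as a parameter keeps the function deterministic in tests while production code uses `Math.random`.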

2. Request Queuing

Instead of making requests as fast as possible and handling errors, queue requests and process them at a controlled rate:

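class RateLimitedQueue {
  constructor(requestsPerMinute) {
    // Minimum spacing between request starts, in milliseconds.
    this.interval = 60000 / requestsPerMinute;
    this.queue = [];
    this.processing = false;
  }

  // Enqueue a function that returns a promise. The promise returned here
  // settles with that function's result or error.
  add(requestFn) {
    return new Promise((resolve, reject) => {
      this.queue.push({ requestFn, resolve, reject });
      this.process();
    });
  }

  process() {
    if (this.processing || this.queue.length === 0) return;
    this.processing = true;
    const { requestFn, resolve, reject } = this.queue.shift();
    // Start the request without awaiting it, so a slow response does not
    // push throughput below the configured rate. Requests may overlap.
    Promise.resolve().then(requestFn).then(resolve, reject);
    // Release the next request one interval after this one started.
    setTimeout(() => {
      this.processing = false;
      this.process();
    }, this.interval);
  }
}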

3. Multi-Account Distribution

Distributing requests across multiple API accounts multiplies your effective rate limits, which makes it one of the most effective strategies for high-throughput applications.

This is what relay services like claude4u.com do. They maintain pools of upstream accounts and distribute your requests across them. If one account hits a rate limit, requests are automatically routed to another. Your effective ceiling becomes roughly the combined limits of the pooled accounts rather than the limit of any single one.
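The routing idea can be sketched as a round-robin pool that skips any account still cooling down after a rate limit response. This is not how any particular relay service is implemented. The class `AccountPool`, its methods, and the account ids are hypothetical.

```javascript
// Hypothetical pool that rotates across upstream accounts and skips any
// account that is cooling down after a rate limit response.
class AccountPool {
  constructor(accountIds, now = () => Date.now()) {
    this.accounts = accountIds.map((id) => ({ id, availableAt: 0 }));
    this.next = 0; // round-robin cursor
    this.now = now; // injectable clock, useful for testing
  }

  // Return the next usable account id, or null if all are cooling down.
  acquire() {
    for (let i = 0; i < this.accounts.length; i++) {
      const index = (this.next + i) % this.accounts.length;
      const account = this.accounts[index];
      if (account.availableAt <= this.now()) {
        this.next = (index + 1) % this.accounts.length;
        return account.id;
      }
    }
    return null;
  }

  // Mark an account as rate limited for `cooldownMs` milliseconds.
  markLimited(id, cooldownMs) {
    const account = this.accounts.find((a) => a.id === id);
    if (account) account.availableAt = this.now() + cooldownMs;
  }
}
```

When a request returns 429, the caller would call `markLimited` with the delay from the `retry-after` header and then `acquire` again to pick a different account.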

4. Caching

Cache responses for identical or similar requests to avoid hitting the API unnecessarily:

Exact match caching for deterministic requests (temperature 0)
Semantic caching for similar prompts using embedding similarity
Set appropriate TTLs based on how often the ideal response would change
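Exact-match caching with a TTL can be sketched in a few lines. The example below is an in-memory illustration only. The names `cacheKey` and `ResponseCache` are hypothetical, and a production cache would more likely live in a shared store. Semantic caching is not shown because it depends on an embedding model.

```javascript
// Build a stable key: object keys are sorted so that equivalent requests
// with a different property order map to the same cache entry.
function cacheKey(value) {
  if (Array.isArray(value)) return "[" + value.map(cacheKey).join(",") + "]";
  if (value !== null && typeof value === "object") {
    const parts = Object.keys(value)
      .sort()
      .map((k) => JSON.stringify(k) + ":" + cacheKey(value[k]));
    return "{" + parts.join(",") + "}";
  }
  return JSON.stringify(value);
}

// Hypothetical exact-match cache with a fixed time-to-live per entry.
class ResponseCache {
  constructor(ttlMs, now = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now; // injectable clock, useful for testing
    this.entries = new Map();
  }

  // Return the cached response, or undefined on a miss or expired entry.
  get(params) {
    const key = cacheKey(params);
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key);
      return undefined;
    }
    return entry.response;
  }

  set(params, response) {
    this.entries.set(cacheKey(params), {
      response,
      expiresAt: this.now() + this.ttlMs,
    });
  }
}
```

The caller checks `get(params)` before calling the API and stores the result with `set(params, response)` afterwards, so repeated identical requests consume no quota until the entry expires.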

5. Prompt Optimization

Reduce token consumption to stay within TPM limits:

Keep system prompts concise
Summarize conversation history instead of sending full transcripts
Use smaller models for simple tasks
Leverage prompt caching features offered by providers
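One way to keep history within a token budget is to drop the oldest turns, which is a simpler stand-in for summarization. The sketch below assumes messages shaped like `{ role, content }` and uses a rough estimate of four characters per token. That ratio is a rule of thumb for English text, not a real tokenizer, and the function names are hypothetical.

```javascript
// Rough token estimate. ASSUMPTION: about 4 characters per token, which is
// only a rule of thumb for English text; use a real tokenizer for accuracy.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Keep system messages plus the most recent turns that fit in `maxTokens`.
function trimHistory(messages, maxTokens) {
  const system = messages.filter((m) => m.role === "system");
  const turns = messages.filter((m) => m.role !== "system");
  let budget =
    maxTokens - system.reduce((n, m) => n + estimateTokens(m.content), 0);

  const kept = [];
  // Walk backwards so the newest turns are kept first.
  for (let i = turns.length - 1; i >= 0; i--) {
    const cost = estimateTokens(turns[i].content);
    if (cost > budget) break;
    kept.unshift(turns[i]);
    budget -= cost;
  }
  return [...system, ...kept];
}
```

Dropping old turns loses information that a summary would keep, so this fits best where only recent context matters.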

Monitoring Rate Limit Usage

Proactive monitoring prevents rate limit surprises. Track these metrics:

Current RPM and TPM relative to limits
429/529 error rates over time
Average and p99 response latencies (which increase near rate limits)
Remaining quota from rate limit response headers
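These metrics can be computed from a simple log of recent requests. The sketch below assumes each sample is recorded as `{ time, status, latencyMs }`. The function name `summarizeUsage` and the returned field names are hypothetical, and a real deployment would more likely export these values to a metrics system.

```javascript
// Summarize the last 60 seconds of request samples. Each sample is assumed
// to look like { time: <ms timestamp>, status: <HTTP status>, latencyMs }.
function summarizeUsage(samples, rpmLimit, now = Date.now()) {
  const recent = samples.filter((s) => now - s.time < 60000);
  const limited = recent.filter((s) => s.status === 429 || s.status === 529);
  const latencies = recent.map((s) => s.latencyMs).sort((a, b) => a - b);
  const total = latencies.reduce((a, b) => a + b, 0);
  // Nearest-rank p99: the smallest value that covers 99% of the samples.
  const p99Index = Math.max(0, Math.ceil(latencies.length * 0.99) - 1);
  return {
    rpm: recent.length,
    utilization: recent.length / rpmLimit, // 1.0 means at the limit
    limitedRate: recent.length === 0 ? 0 : limited.length / recent.length,
    avgLatencyMs: recent.length === 0 ? 0 : total / recent.length,
    p99LatencyMs: latencies.length === 0 ? 0 : latencies[p99Index],
  };
}
```

An alert on `utilization` approaching 1.0 or on a rising `limitedRate` gives warning before rate limiting becomes visible to users.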

Do not ignore 529 (overloaded) responses from Anthropic's API. Unlike 429 errors, which reflect your own account's limits, 529 errors indicate system-wide capacity issues. Back off more aggressively and consider temporarily routing to an alternative model or provider.
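The difference in handling can be expressed as a small decision function. This is a sketch: the function name `retryPlan`, the four-times multiplier for 529, and the rule of suggesting a fallback after two attempts are all illustrative assumptions, not provider recommendations.

```javascript
// Decide how to react to a failed request. `attempt` is 0-based.
// ASSUMPTION: the multiplier and fallback threshold are illustrative.
function retryPlan(status, attempt) {
  const baseMs = 1000 * 2 ** attempt; // exponential backoff base
  if (status === 429) {
    // Limit on your own account: ordinary backoff is usually enough.
    return { retry: true, delayMs: baseMs, useFallback: false };
  }
  if (status === 529) {
    // Provider-wide overload: wait longer, and after a couple of attempts
    // suggest switching to a fallback model or provider.
    return { retry: true, delayMs: baseMs * 4, useFallback: attempt >= 2 };
  }
  // Other errors are not treated as capacity problems here.
  return { retry: false, delayMs: 0, useFallback: false };
}
```

Keeping this decision in one place makes it easy to tune the delays separately for each status code as you observe how each provider behaves.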

The Relay Service Advantage

For developers who need consistent, high-throughput access to AI APIs without managing rate limits manually, a relay service is the practical solution. Services like claude4u.com handle the complexity of multi-account management, automatic retry logic, and intelligent request distribution. You get a single API endpoint with effectively higher rate limits than any individual account, transparent error handling, and no need to build rate limiting infrastructure in your application.

Rate limits are a fact of life with AI APIs, but they do not have to be a bottleneck. With the right combination of client-side strategies and infrastructure-level solutions, you can build applications that handle high request volumes reliably.
