Claude API Rate Limits Explained: RPM, TPM, and How to Handle Them

Rate limits control how many requests and tokens you can send to the Claude API within a given time period. Understanding these limits is essential for building reliable applications that do not get throttled. This guide explains how Claude's rate limiting works, what your limits are, and how to handle rate limit errors gracefully.

Types of Rate Limits

Anthropic enforces three types of rate limits simultaneously:

Rate Limits by Usage Tier

Anthropic assigns usage tiers based on your account's spending history and trust level. Limits increase automatically as you spend more:

Exact limits vary by model. Opus typically has lower RPM limits than Sonnet due to higher compute requirements.
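
Because request and token limits apply at the same time, whichever budget runs out first sets your real throughput. The sketch below works this out for illustrative limits of 60 RPM and 60,000 TPM; the function name and the average tokens-per-request figures are assumptions made for the example, and real accounting may treat input and output tokens differently.

```python
def effective_requests_per_minute(rpm_limit: int, tpm_limit: int, tokens_per_request: int) -> int:
    """Return how many requests per minute fit under both limits at once."""
    if tokens_per_request <= 0:
        raise ValueError("tokens_per_request must be positive")
    # The token budget alone would allow this many requests per minute.
    by_tokens = tpm_limit // tokens_per_request
    # Both limits are enforced together, so the smaller allowance wins.
    return min(rpm_limit, by_tokens)


# Small requests: the request limit is the bottleneck.
print(effective_requests_per_minute(60, 60_000, 500))
# Large requests: the token limit is the bottleneck.
print(effective_requests_per_minute(60, 60_000, 2_000))
```

With small requests the RPM limit binds first; with large prompts the TPM limit does, which is why trimming prompt size can raise throughput even when RPM looks unused.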

Reading Rate Limit Headers

Every API response includes headers that show your current rate limit status:

# Key rate limit headers (values are illustrative)
anthropic-ratelimit-requests-limit: 60          # Max RPM
anthropic-ratelimit-tokens-limit: 60000         # Max TPM
anthropic-ratelimit-requests-remaining: 55      # Remaining requests
anthropic-ratelimit-tokens-remaining: 58000     # Remaining tokens
anthropic-ratelimit-requests-reset: 2025-01-01T00:01:00Z  # Request limit reset time
anthropic-ratelimit-tokens-reset: 2025-01-01T00:00:30Z    # Token limit reset time
retry-after: 5                                  # Seconds to wait (on 429)

# Python: Read rate limit headers
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# with_raw_response exposes the HTTP headers alongside the parsed message
raw = client.messages.with_raw_response.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello"}]
)
message = raw.parse()  # the usual Message object

print(f"Remaining requests: {raw.headers.get('anthropic-ratelimit-requests-remaining')}")
print(f"Remaining tokens: {raw.headers.get('anthropic-ratelimit-tokens-remaining')}")
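
The reset headers carry timestamps rather than durations, so a client usually needs to convert them into a wait time. A minimal sketch, assuming the value is an RFC 3339 UTC timestamp like the ones shown above; the helper name `seconds_until_reset` is hypothetical and not part of any SDK.

```python
from datetime import datetime, timezone
from typing import Optional


def seconds_until_reset(reset_value: str, now: Optional[datetime] = None) -> float:
    """Seconds from `now` until the reset timestamp (0.0 if it has already passed)."""
    # Older Python versions reject a trailing "Z", so normalize it to an offset.
    reset_at = datetime.fromisoformat(reset_value.replace("Z", "+00:00"))
    if now is None:
        now = datetime.now(timezone.utc)
    return max(0.0, (reset_at - now).total_seconds())


# Example with a fixed "now" so the result is reproducible.
fixed_now = datetime(2025, 1, 1, 0, 0, 30, tzinfo=timezone.utc)
print(seconds_until_reset("2025-01-01T00:01:00Z", fixed_now))
```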

Handling 429 Rate Limit Errors

When you exceed a rate limit, the API returns HTTP 429 (Too Many Requests). Implement exponential backoff with jitter:

import time
import random
import anthropic

client = anthropic.Anthropic()

def call_with_backoff(messages, max_retries=6):
    for attempt in range(max_retries):
        try:
            return client.messages.create(
                model="claude-sonnet-4-20250514",
                max_tokens=1024,
                messages=messages
            )
        except anthropic.RateLimitError as e:
            if attempt == max_retries - 1:
                raise

            # Exponential backoff with jitter
            base_delay = 2 ** attempt
            jitter = random.uniform(0, base_delay * 0.5)
            wait_time = base_delay + jitter

            print(f"Rate limited (attempt {attempt + 1}). Waiting {wait_time:.1f}s...")
            time.sleep(wait_time)

    raise Exception("Max retries exceeded")
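
A 429 response may also carry a retry-after value, and it is usually better to honor that hint than to guess. The sketch below is one possible way to combine the two: it uses the server hint when it parses as a number of seconds and otherwise falls back to the same exponential rule as above. The function name `backoff_delay` and the 60-second cap are assumptions for illustration.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, retry_after: Optional[str] = None, cap: float = 60.0) -> float:
    """Pick a wait time in seconds for a 0-based retry attempt."""
    if retry_after is not None:
        try:
            # Prefer the server's hint when it is a plain number of seconds.
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # Unparseable hint: fall back to computed backoff.
    # Exponential growth, capped so late attempts do not wait unreasonably long.
    base = min(cap, 2.0 ** attempt)
    # Jitter of up to 50% on top of the base spreads out simultaneous retries.
    return base + random.uniform(0, base * 0.5)


print(backoff_delay(3, retry_after="5"))  # uses the server hint
print(backoff_delay(3))                   # computed: between 8 and 12 seconds
```

Inside the `except anthropic.RateLimitError as e:` branch, the hint could be read with `e.response.headers.get("retry-after")`, assuming the exception exposes the underlying HTTP response as it does in current SDK versions.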

Node.js Rate Limit Handling

import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

async function callWithBackoff(messages, maxRetries = 6) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await client.messages.create({
        model: 'claude-sonnet-4-20250514',
        max_tokens: 1024,
        messages,
      });
    } catch (err) {
      if (err.status !== 429 || attempt === maxRetries - 1) throw err;

      const baseDelay = Math.pow(2, attempt) * 1000;
      const jitter = Math.random() * baseDelay * 0.5;
      const waitTime = baseDelay + jitter;

      console.log(`Rate limited. Waiting ${(waitTime / 1000).toFixed(1)}s...`);
      await new Promise((r) => setTimeout(r, waitTime));
    }
  }
}
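
Tip: The Anthropic SDKs include built-in retry logic with exponential backoff for rate limit errors. Check the SDK documentation for configuration options such as max_retries (maxRetries in the TypeScript SDK) to customize the retry behavior. If you also wrap calls in your own retry loop, as in the examples above, the two layers multiply, so consider setting the SDK option to 0 in that case.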

429 vs 529: What Is the Difference?
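
In short, a 429 means your own account has hit one of its rate limits, while a 529 means the API as a whole is temporarily overloaded, regardless of your usage. Both are worth retrying with backoff, but only a 429 is helped by lowering your own request rate. The sketch below shows one way a client might separate the cases; the function name and the category labels are hypothetical.

```python
def classify_error(status_code: int) -> str:
    """Map an HTTP status code to a coarse retry category."""
    if status_code == 429:
        # Your account exceeded a limit: back off and slow down.
        return "rate_limited"
    if status_code == 529:
        # The service is overloaded: back off, but your rate is not the cause.
        return "overloaded"
    if 500 <= status_code < 600:
        return "server_error"
    return "other"


for code in (429, 529, 500, 400):
    print(code, classify_error(code))
```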

Strategies to Avoid Rate Limits

  1. Use a relay service — Services like claude4u.com pool multiple Anthropic accounts, effectively multiplying your rate limits. When one account hits its limit, the relay automatically routes to another.
  2. Implement request queuing — Queue outgoing requests and process them at a controlled rate just below your limit.
  3. Use the Batch API — For non-urgent workloads, the Batch API has separate, higher limits and costs 50% less.
  4. Optimize request sizes — Smaller requests consume less of your TPM budget. Minimize prompt length and reduce max_tokens to only what is needed.
  5. Cache responses — If users ask similar questions, cache and reuse responses instead of making new API calls.
  6. Use appropriate models — Haiku and Sonnet often have higher rate limits than Opus. Use the smallest model that meets your quality requirements.
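
# Request queue with rate limiting (Python)
import asyncio
import time

class RateLimitedQueue:
    def __init__(self, rpm_limit=50):
        self.rpm_limit = rpm_limit
        # Minimum spacing between request starts, in seconds
        self.interval = 60 / rpm_limit
        self._lock = asyncio.Lock()
        self._next_start = 0.0

    async def submit(self, coro):
        # Reserve the next free start slot so that requests begin
        # no more often than once per interval
        async with self._lock:
            now = time.monotonic()
            wait = max(0.0, self._next_start - now)
            self._next_start = max(now, self._next_start) + self.interval
        # Sleep outside the lock so other callers can reserve their slots
        await asyncio.sleep(wait)
        return await coro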

Warning: Never retry in a tight loop without backoff. Requests sent immediately after a 429 are likely to be rejected again and can keep you throttled for longer. Always use exponential backoff with jitter.
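
The caching strategy from the list above can be sketched as a small in-memory cache keyed on the model and the exact message list. This is a minimal illustration under stated assumptions: `ResponseCache` and `get_or_call` are hypothetical names, the `call` argument stands in for whatever function actually hits the API, and only identical requests match. Reusing answers for merely similar questions would need extra normalization that is not shown here.

```python
import hashlib
import json
from typing import Any, Callable, Dict, List


class ResponseCache:
    """In-memory cache keyed on a hash of (model, messages)."""

    def __init__(self) -> None:
        self._store: Dict[str, Any] = {}

    @staticmethod
    def key(model: str, messages: List[dict]) -> str:
        # sort_keys makes the key independent of dict key order.
        payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(self, model: str, messages: List[dict],
                    call: Callable[[str, List[dict]], Any]) -> Any:
        k = self.key(model, messages)
        if k not in self._store:
            # Cache miss: spend one API request and remember the result.
            self._store[k] = call(model, messages)
        return self._store[k]
```

A real deployment would likely add expiry and a size bound, since responses can go stale and memory is finite.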

Monitoring Your Rate Limit Usage

Track your rate limit consumption to avoid hitting limits during critical operations:
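
One simple approach is to compute, after each response, how much of the current budget has been used and to slow down before it runs out. The sketch below takes the limit and remaining values as strings, as they would arrive in the rate limit headers; the function names and the 80% threshold are assumptions chosen for illustration.

```python
from typing import Optional


def usage_fraction(limit: Optional[str], remaining: Optional[str]) -> Optional[float]:
    """Fraction of the budget already used, or None if a value is missing or invalid."""
    try:
        limit_n, remaining_n = int(limit), int(remaining)
    except (TypeError, ValueError):
        return None
    if limit_n <= 0:
        return None
    return 1.0 - remaining_n / limit_n


def should_throttle(limit: Optional[str], remaining: Optional[str],
                    threshold: float = 0.8) -> bool:
    """True when usage has reached the threshold and the caller should slow down."""
    used = usage_fraction(limit, remaining)
    return used is not None and used >= threshold


print(should_throttle("60000", "58000"))  # plenty of token budget left
print(should_throttle("60", "6"))         # most of the request budget is used
```

Logging the fraction over time, rather than only acting on it, also shows whether you are consistently close to a limit and should request a higher tier or spread the load.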

Understanding and properly handling rate limits is what separates a demo from a production-ready Claude integration. Combine proper error handling with a relay service for the most reliable experience.
