
Claude API Rate Limits Explained: RPM, TPM, and How to Handle Them

Rate limits control how many requests and tokens you can send to the Claude API within a given time period. Understanding these limits is essential for building reliable applications that do not get throttled. This guide explains how Claude's rate limiting works, what your limits are, and how to handle rate limit errors gracefully.

Types of Rate Limits

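Anthropic enforces several rate limits simultaneously, and exceeding any one of them triggers throttling:

RPM (Requests Per Minute): The maximum number of API requests you can make per minute, regardless of size.
TPM (Tokens Per Minute): The maximum number of tokens you can process per minute. Large requests consume more of this limit. Anthropic's current limits track input tokens (ITPM) and output tokens (OTPM) separately.
TPD (Tokens Per Day): A cap on total tokens (input + output) over a 24-hour period. Which caps apply has changed over time, so check the Limits page in the Anthropic Console for your account's current values.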
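Because these limits apply at the same time, the tighter one determines your real throughput. The sketch below works through that arithmetic. The limit values and the average request size are hypothetical numbers chosen for illustration, not figures published by Anthropic.

```python
def max_requests_per_minute(rpm_limit: int, tpm_limit: int, avg_tokens_per_request: int) -> int:
    """Return how many requests per minute fit under both limits."""
    if avg_tokens_per_request <= 0:
        raise ValueError("avg_tokens_per_request must be positive")
    # Requests allowed by the token budget alone
    by_tokens = tpm_limit // avg_tokens_per_request
    # The binding limit is whichever one allows fewer requests
    return min(rpm_limit, by_tokens)


# Hypothetical account: 60 RPM, 60,000 TPM, about 2,000 tokens per request
print(max_requests_per_minute(60, 60_000, 2_000))  # 30
```

In this example the token budget, not the request count, is the bottleneck: only half of the 60 RPM allowance is usable.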

Rate Limits by Usage Tier

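Anthropic assigns usage tiers based on your account's spending history. Limits increase automatically as you move up a tier:

Tier 1 (new accounts): The lowest limits, on the order of tens of requests and tens of thousands of tokens per minute
Tier 2: Moderate limits after initial spending
Tier 3: Higher limits for established accounts
Tier 4: The highest standard limits; anything beyond this is a custom limit negotiated with Anthropic

The current numbers for each tier are listed on the Limits page in the Anthropic Console.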

Exact limits vary by model. Opus typically has lower RPM limits than Sonnet due to higher compute requirements.

Reading Rate Limit Headers

Every API response includes headers that show your current rate limit status. Anthropic's headers use the anthropic-ratelimit- prefix; if you call the API through a proxy or relay, the names you see may differ:

# Key rate limit headers (values are examples)
anthropic-ratelimit-requests-limit: 60                     # Max RPM
anthropic-ratelimit-tokens-limit: 60000                    # Max TPM
anthropic-ratelimit-requests-remaining: 55                 # Remaining requests
anthropic-ratelimit-tokens-remaining: 58000                # Remaining tokens
anthropic-ratelimit-requests-reset: 2025-01-01T00:01:00Z   # Request limit reset time
anthropic-ratelimit-tokens-reset: 2025-01-01T00:00:30Z     # Token limit reset time
retry-after: 5                                             # Seconds to wait (on 429)

# Python: Read rate limit headers
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# with_raw_response exposes the HTTP headers alongside the parsed message,
# so a single request is enough
raw = client.messages.with_raw_response.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello"}]
)
message = raw.parse()  # the usual Message object

print(f"Remaining requests: {raw.headers.get('anthropic-ratelimit-requests-remaining')}")
print(f"Remaining tokens: {raw.headers.get('anthropic-ratelimit-tokens-remaining')}")
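The reset headers carry timestamps rather than durations, so a client that wants to pause until the window resets has to convert them. A minimal sketch, assuming the value is an RFC 3339 UTC timestamp like the examples above; the helper name seconds_until_reset is hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional


def seconds_until_reset(reset_value: str, now: Optional[datetime] = None) -> float:
    """Seconds from `now` until the reset timestamp (0.0 if it has already passed)."""
    # Older Python versions do not parse a trailing "Z", so normalize it first
    reset_at = datetime.fromisoformat(reset_value.replace("Z", "+00:00"))
    now = now or datetime.now(timezone.utc)
    return max(0.0, (reset_at - now).total_seconds())


# Fixed "now" so the result is reproducible
now = datetime(2025, 1, 1, 0, 0, 30, tzinfo=timezone.utc)
print(seconds_until_reset("2025-01-01T00:01:00Z", now))  # 30.0
```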

Handling 429 Rate Limit Errors

When you exceed a rate limit, the API returns HTTP 429 (Too Many Requests). Implement exponential backoff with jitter:

import time
import random
import anthropic

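# Disable the SDK's built-in retries so only this loop retries
client = anthropic.Anthropic(max_retries=0)

def call_with_backoff(messages, max_retries=6):
    for attempt in range(max_retries):
        try:
            return client.messages.create(
                model="claude-sonnet-4-20250514",
                max_tokens=1024,
                messages=messages
            )
        except anthropic.RateLimitError as e:
            if attempt == max_retries - 1:
                raise

            # Prefer the server's retry-after hint when it is a number of seconds
            retry_after = e.response.headers.get("retry-after")
            try:
                wait_time = float(retry_after)
            except (TypeError, ValueError):
                # Otherwise fall back to exponential backoff with jitter
                base_delay = 2 ** attempt
                jitter = random.uniform(0, base_delay * 0.5)
                wait_time = base_delay + jitter

            print(f"Rate limited (attempt {attempt + 1}). Waiting {wait_time:.1f}s...")
            time.sleep(wait_time)

    # Only reachable when max_retries is zero or negative
    raise ValueError("max_retries must be at least 1")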
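The exponential delays in the function above grow quickly, so it helps to know how long a caller might block before choosing max_retries. The sketch below computes the minimum and maximum sleep for each retry using the same formula (2 ** attempt plus up to 50% jitter). It ignores the time spent on the requests themselves, and the helper name backoff_bounds is hypothetical.

```python
def backoff_bounds(max_retries: int = 6, jitter_fraction: float = 0.5):
    """Return (min_wait, max_wait) in seconds for each retry that sleeps."""
    bounds = []
    # The final attempt re-raises instead of sleeping, so it adds no wait
    for attempt in range(max_retries - 1):
        base = 2 ** attempt
        bounds.append((base, base * (1 + jitter_fraction)))
    return bounds


schedule = backoff_bounds()
print(schedule)
# [(1, 1.5), (2, 3.0), (4, 6.0), (8, 12.0), (16, 24.0)]
print(sum(high for _, high in schedule))  # 46.5
```

With the default of six attempts, a request that is rate limited every time can spend up to about 46.5 seconds sleeping before the error is raised.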

Node.js Rate Limit Handling

import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

async function callWithBackoff(messages, maxRetries = 6) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await client.messages.create({
        model: 'claude-sonnet-4-20250514',
        max_tokens: 1024,
        messages,
      });
    } catch (err) {
      if (err.status !== 429 || attempt === maxRetries - 1) throw err;

      const baseDelay = Math.pow(2, attempt) * 1000;
      const jitter = Math.random() * baseDelay * 0.5;
      const waitTime = baseDelay + jitter;

      console.log(`Rate limited. Waiting ${(waitTime / 1000).toFixed(1)}s...`);
      await new Promise((r) => setTimeout(r, waitTime));
    }
  }
}

Tip: The Anthropic SDKs already retry rate limit errors with exponential backoff (two retries by default). Set max_retries (Python) or maxRetries (TypeScript) on the client or per request to change this, and set it to 0 if you use your own retry loop so the two do not stack.

429 vs 529: What Is the Difference?

429 (Too Many Requests): Your account or key has exceeded one of its rate limits. The solution is to slow down or move to a higher tier.
529 (Overloaded): Anthropic's servers are under heavy load. This affects all users regardless of their individual limits, so switching keys or accounts does not help. The solution is to retry with backoff.
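The distinction matters in code because the two statuses call for different reactions. A minimal sketch of a status-to-action mapping; the function name and the action labels are hypothetical and only illustrate the decision.

```python
def retry_action(status_code: int) -> str:
    """Map an HTTP status to a coarse retry decision (labels are illustrative)."""
    if status_code == 429:
        # Your own limit was exceeded: wait, then send less
        return "backoff_and_reduce_rate"
    if status_code == 529:
        # Service-wide overload: wait and retry at the same rate
        return "backoff_and_retry"
    if 500 <= status_code < 600:
        # Other server errors are usually transient as well
        return "backoff_and_retry"
    # Client errors such as 400 or 401 will not succeed on retry
    return "do_not_retry"


for code in (429, 529, 400):
    print(code, retry_action(code))
```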

Strategies to Avoid Rate Limits

Use a relay service — Services like claude4u.com pool multiple Anthropic accounts, effectively multiplying your rate limits. When one account hits its limit, the relay automatically routes to another.
Implement request queuing — Queue outgoing requests and process them at a controlled rate just below your limit.
Use the Batch API — For non-urgent workloads, the Batch API has separate, higher limits and costs 50% less.
Optimize request sizes — Smaller requests consume less of your TPM budget. Minimize prompt length and reduce max_tokens to only what is needed.
Cache responses — If users ask similar questions, cache and reuse responses instead of making new API calls.
Use appropriate models — Haiku and Sonnet often have higher rate limits than Opus. Use the smallest model that meets your quality requirements.

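# Request queue with rate limiting (Python)
import asyncio
import time

class RateLimitedQueue:
    def __init__(self, rpm_limit=50):
        # Minimum spacing between request starts, in seconds
        self.interval = 60 / rpm_limit
        self._lock = asyncio.Lock()
        self._next_start = 0.0

    async def submit(self, coro):
        # Reserve the next start slot so that at most rpm_limit requests
        # begin in any one minute, however many tasks call submit() at once
        async with self._lock:
            now = time.monotonic()
            wait = max(0.0, self._next_start - now)
            self._next_start = max(now, self._next_start) + self.interval
        await asyncio.sleep(wait)
        return await coro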
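The caching strategy from the list above can be sketched in the same spirit. This is a minimal in-memory version keyed by a hash of the model and messages. It only matches exact repeats, so catching merely similar questions would need a normalization step that is not shown. The class name ResponseCache and the call parameter are hypothetical; call stands in for whatever function actually sends the request.

```python
import hashlib
import json


class ResponseCache:
    """In-memory cache keyed by a hash of the model and messages."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(model, messages):
        # sort_keys makes the serialized form stable across dict orderings
        payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(self, model, messages, call):
        key = self._key(model, messages)
        if key not in self._store:
            # Cache miss: make the real request once and remember the result
            self._store[key] = call(model, messages)
        return self._store[key]


# Stand-in for an API call so the example runs without network access
calls = []

def fake_call(model, messages):
    calls.append(messages)
    return "cached answer"


cache = ResponseCache()
msgs = [{"role": "user", "content": "Hello"}]
cache.get_or_call("some-model", msgs, fake_call)
cache.get_or_call("some-model", msgs, fake_call)
print(len(calls))  # 1
```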

Warning: Never implement aggressive retry loops without backoff. Hammering the API after a 429 error can keep you throttled for longer and adds load without getting any request through sooner. Always use exponential backoff with jitter.

Monitoring Your Rate Limit Usage

Track your rate limit consumption to avoid hitting limits during critical operations:

Log rate limit headers from every response
Set up alerts when remaining capacity drops below 20%
Use a relay service dashboard like claude4u.com for real-time rate limit monitoring across all API keys
Track TPM usage patterns to identify peak usage times
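The 20% alert above can be driven directly from the limit and remaining header values. A minimal sketch using the standard logging module; the function names and the logger name are hypothetical, and the header values are passed in as strings because that is how they arrive.

```python
import logging

logger = logging.getLogger("rate_limits")


def remaining_fraction(limit: str, remaining: str) -> float:
    """Fraction of the limit still available, from two header values."""
    limit_value = int(limit)
    if limit_value <= 0:
        raise ValueError("limit must be positive")
    return int(remaining) / limit_value


def check_capacity(limit: str, remaining: str, threshold: float = 0.2) -> bool:
    """Log a warning and return True when remaining capacity is below threshold."""
    fraction = remaining_fraction(limit, remaining)
    if fraction < threshold:
        logger.warning("Rate limit capacity low: %.0f%% remaining", fraction * 100)
        return True
    return False


print(check_capacity("60", "55"))  # False
print(check_capacity("60", "10"))  # True
```

In a real service the warning would typically feed an alerting system rather than only a log line.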

Understanding and properly handling rate limits is what separates a demo from a production-ready Claude integration. Combine proper error handling with a relay service for the most reliable experience.

