Claude API Rate Limits Explained: RPM, TPM, and How to Handle Them
Rate limits control how many requests and tokens you can send to the Claude API within a given time period. Understanding these limits is essential for building reliable applications that do not get throttled. This guide explains how Claude's rate limiting works, what your limits are, and how to handle rate limit errors gracefully.
Types of Rate Limits
Anthropic enforces three types of rate limits simultaneously:
- RPM (Requests Per Minute) — The maximum number of API requests you can make per minute, regardless of size.
- TPM (Tokens Per Minute) — The maximum number of input tokens you can send per minute. Large requests consume more of this limit.
- TPD (Tokens Per Day) — The maximum total tokens (input + output) you can use in a 24-hour period.
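Because the limits apply at the same time, whichever one you exhaust first sets your real throughput. The following is a minimal sketch of that arithmetic; `binding_limit` and its parameters are hypothetical names, and the numbers are illustrative rather than official limits.

```python
def binding_limit(rpm_limit, tpm_limit, avg_input_tokens):
    """Return (max requests per minute, name of the limit that binds first).

    Hypothetical helper: assumes every request sends roughly
    `avg_input_tokens` input tokens.
    """
    # How many requests fit into the per-minute token budget
    requests_by_tokens = tpm_limit // avg_input_tokens
    if requests_by_tokens < rpm_limit:
        return requests_by_tokens, "TPM"
    return rpm_limit, "RPM"


# Large prompts run out of tokens before they run out of requests
print(binding_limit(60, 60_000, 2_000))  # (30, 'TPM')
# Small prompts hit the request cap first
print(binding_limit(60, 60_000, 500))    # (60, 'RPM')
```

With 2,000-token prompts the token budget allows only 30 requests per minute, so raising the request rate further would not help.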
Rate Limits by Usage Tier
Anthropic assigns usage tiers based on your account's spending history and trust level. Limits increase automatically as you spend more:
- Tier 1 (Free/New): Lower limits, typically 60 RPM, 60K TPM
- Tier 2: Moderate limits after initial spending
- Tier 3: Higher limits for established accounts
- Tier 4 (Enterprise): Custom limits negotiated with Anthropic
Exact limits vary by model. Opus typically has lower RPM limits than Sonnet due to higher compute requirements.
Reading Rate Limit Headers
Every API response includes headers that show your current rate limit status:
# Key rate limit headers
anthropic-ratelimit-requests-limit: 60                    # Max RPM
anthropic-ratelimit-tokens-limit: 60000                   # Max TPM
anthropic-ratelimit-requests-remaining: 55                # Remaining RPM
anthropic-ratelimit-tokens-remaining: 58000               # Remaining TPM
anthropic-ratelimit-requests-reset: 2025-01-01T00:01:00Z  # RPM reset time
anthropic-ratelimit-tokens-reset: 2025-01-01T00:00:30Z    # TPM reset time
retry-after: 5                                            # Seconds to wait (on 429)

# Python: Read rate limit headers
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Access headers through the raw response
raw = client.messages.with_raw_response.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello"}],
)
response = raw.parse()  # the usual Message object
print(f"Remaining requests: {raw.headers.get('anthropic-ratelimit-requests-remaining')}")
print(f"Remaining tokens: {raw.headers.get('anthropic-ratelimit-tokens-remaining')}")
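The reset headers carry timestamps rather than durations, so a client usually needs to convert them into a wait time. Here is one possible sketch; `seconds_until_reset` is a hypothetical helper, and it assumes the value is an RFC 3339 UTC timestamp like the ones in the example above.

```python
from datetime import datetime, timezone


def seconds_until_reset(reset_value, now=None):
    """Seconds from `now` until the reset timestamp, never negative."""
    # Replace the trailing "Z" so older Python versions can parse it
    reset_at = datetime.fromisoformat(reset_value.replace("Z", "+00:00"))
    now = now or datetime.now(timezone.utc)
    return max(0.0, (reset_at - now).total_seconds())


# Fixed "now" so the example is reproducible
now = datetime(2025, 1, 1, 0, 0, 0, tzinfo=timezone.utc)
print(seconds_until_reset("2025-01-01T00:01:00Z", now))  # 60.0
```

A reset time that is already in the past yields 0.0, meaning the window has rolled over and the request can be sent right away.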
Handling 429 Rate Limit Errors
When you exceed a rate limit, the API returns HTTP 429 (Too Many Requests). Implement exponential backoff with jitter:
import time
import random
import anthropic

client = anthropic.Anthropic()

def call_with_backoff(messages, max_retries=6):
    for attempt in range(max_retries):
        try:
            return client.messages.create(
                model="claude-sonnet-4-20250514",
                max_tokens=1024,
                messages=messages
            )
        except anthropic.RateLimitError:
            # Give up and re-raise on the final attempt
            if attempt == max_retries - 1:
                raise
            # Exponential backoff with jitter
            base_delay = 2 ** attempt
            jitter = random.uniform(0, base_delay * 0.5)
            wait_time = base_delay + jitter
            print(f"Rate limited (attempt {attempt + 1}). Waiting {wait_time:.1f}s...")
            time.sleep(wait_time)
    # Only reached when max_retries is 0
    raise Exception("Max retries exceeded")
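The backoff above ignores the `retry-after` header that accompanies a 429. One way to combine the two is sketched below. `backoff_delay` and its `cap` parameter are hypothetical, and the sketch assumes the header value is a number of seconds, as in the header listing earlier. With the Python SDK the value can likely be read from the exception via `e.response.headers.get("retry-after")`, though that access path is an assumption here.

```python
import random


def backoff_delay(attempt, retry_after=None, cap=60.0):
    """Delay in seconds: exponential backoff with jitter, but never
    shorter than the server's retry-after hint when one is given."""
    base = min(cap, 2 ** attempt)  # cap the exponential growth
    delay = base + random.uniform(0, base * 0.5)
    if retry_after is not None:
        try:
            delay = max(delay, float(retry_after))
        except ValueError:
            pass  # non-numeric hint: fall back to plain backoff
    return delay


print(backoff_delay(0, retry_after="5"))  # 5.0
```

Taking the maximum of the two values means the client never retries sooner than the server asked, while the jitter still spreads out retries from many clients.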
Node.js Rate Limit Handling
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

async function callWithBackoff(messages, maxRetries = 6) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await client.messages.create({
        model: 'claude-sonnet-4-20250514',
        max_tokens: 1024,
        messages,
      });
    } catch (err) {
      // Rethrow anything that is not a rate limit error, and the final failure
      if (err.status !== 429 || attempt === maxRetries - 1) throw err;
      // Exponential backoff with jitter, in milliseconds
      const baseDelay = Math.pow(2, attempt) * 1000;
      const jitter = Math.random() * baseDelay * 0.5;
      const waitTime = baseDelay + jitter;
      console.log(`Rate limited. Waiting ${(waitTime / 1000).toFixed(1)}s...`);
      await new Promise((r) => setTimeout(r, waitTime));
    }
  }
}
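Note: the official Anthropic SDKs also retry some failed requests automatically, including 429 responses. Set max_retries (maxRetries in the Node.js SDK) when creating the client to customize the retry behavior.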
429 vs 529: What Is the Difference?
- 429 (Too Many Requests) — You personally have exceeded your rate limit. The solution is to slow down or upgrade your tier.
- 529 (Overloaded) — Anthropic's servers are under heavy load. This affects all users. The solution is to retry or use a relay service that can route to less-loaded accounts.
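Since the two status codes call for different reactions, it can help to make that decision explicit in code. This is a minimal sketch; `retry_action` and the action labels it returns are hypothetical and not part of any SDK.

```python
def retry_action(status_code):
    """Map an HTTP status code to a suggested reaction (illustrative labels)."""
    if status_code == 429:
        # Your own limit was exceeded: wait, then lower your request rate
        return "throttle"
    if status_code == 529:
        # Server-side overload: back off and try again later
        return "retry_later"
    if 200 <= status_code < 300:
        return "ok"
    # Other errors are not rate-limit related and need separate handling
    return "fail"


for code in (200, 429, 529, 400):
    print(code, retry_action(code))
```

Keeping "throttle" and "retry_later" separate lets a client reduce its own sending rate only when its own limit is the cause.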
Strategies to Avoid Rate Limits
- Use a relay service — Services like claude4u.com pool multiple Anthropic accounts, effectively multiplying your rate limits. When one account hits its limit, the relay automatically routes to another.
- Implement request queuing — Queue outgoing requests and process them at a controlled rate just below your limit.
- Use the Batch API — For non-urgent workloads, the Batch API has separate, higher limits and costs 50% less.
- Optimize request sizes — Smaller requests consume less of your TPM budget. Minimize prompt length and reduce max_tokens to only what is needed.
- Cache responses — If users ask similar questions, cache and reuse responses instead of making new API calls.
- Use appropriate models — Haiku and Sonnet often have higher rate limits than Opus. Use the smallest model that meets your quality requirements.
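# Request queue with rate limiting (Python)
import asyncio
import time

class RateLimitedQueue:
    """Spaces request starts at least 60 / rpm_limit seconds apart."""

    def __init__(self, rpm_limit=50):
        self.interval = 60 / rpm_limit
        self._lock = asyncio.Lock()
        self._next_start = 0.0

    async def submit(self, coro):
        # Reserve the next start slot; the lock keeps reservations consistent
        async with self._lock:
            now = time.monotonic()
            wait = max(0.0, self._next_start - now)
            self._next_start = max(now, self._next_start) + self.interval
        # Wait for the reserved slot, spreading requests across the minute
        await asyncio.sleep(wait)
        return await coro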
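The "Cache responses" strategy from the list above can be sketched as follows. This is an illustrative in-memory cache, not a production design: `ResponseCache`, `get_or_call`, and the `fake_call` stand-in are hypothetical, and it only reuses a response when the model and messages match exactly.

```python
import hashlib
import json


class ResponseCache:
    """In-memory cache keyed by a hash of the model and messages."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def key(model, messages):
        # Serialize deterministically so identical requests hash identically
        payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(self, model, messages, call):
        k = self.key(model, messages)
        if k not in self._store:
            # Cache miss: make the real call and remember the result
            self._store[k] = call(model, messages)
        return self._store[k]


calls = []

def fake_call(model, messages):
    """Stand-in for a real API call; records each invocation."""
    calls.append(messages)
    return f"response #{len(calls)}"


cache = ResponseCache()
msgs = [{"role": "user", "content": "Hello"}]
print(cache.get_or_call("example-model", msgs, fake_call))  # response #1
print(cache.get_or_call("example-model", msgs, fake_call))  # response #1
print(len(calls))  # 1
```

The second lookup is served from the cache, so it consumes neither a request nor any tokens from the rate limit budget.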
Monitoring Your Rate Limit Usage
Track your rate limit consumption to avoid hitting limits during critical operations:
- Log rate limit headers from every response
- Set up alerts when remaining capacity drops below 20%
- Use a relay service dashboard like claude4u.com for real-time rate limit monitoring across all API keys
- Track TPM usage patterns to identify peak usage times
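The 20% alert threshold mentioned above reduces to a small check on the limit and remaining values from the response headers. The helper below is a hypothetical sketch; it accepts the values as strings because that is how header values arrive.

```python
def low_capacity(limit, remaining, threshold=0.20):
    """True when remaining capacity is below `threshold` of the limit."""
    limit, remaining = int(limit), int(remaining)
    if limit <= 0:
        return True  # treat a missing or zero limit as "no capacity"
    return remaining / limit < threshold


print(low_capacity("60", "55"))       # False: about 92% of requests left
print(low_capacity("60000", "9000"))  # True: only 15% of tokens left
```

Running this check on every response and sending an alert when it returns True gives an early warning before the API starts returning 429 errors.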
Understanding and properly handling rate limits is what separates a demo from a production-ready Claude integration. Combine proper error handling with a relay service for the most reliable experience.