Fix Gemini 503 Overloaded Error
Fix Gemini 503 Error: Model Overloaded and MODEL_CAPACITY_EXHAUSTED Solutions
The Gemini API 503 error, reported with the message "The model is overloaded" or the error code MODEL_CAPACITY_EXHAUSTED, is one of the most frustrating issues developers encounter. It means Google's servers do not currently have enough capacity to handle your request. Unlike a rate limit error (429), which concerns your individual quota, a 503 indicates a server-side capacity problem that is not tied to your own usage. Here is how to diagnose and work around it.
Understanding the 503 Error
When you receive this error, the API response typically looks like this:
{
  "error": {
    "code": 503,
    "message": "The model is overloaded. Please try again later.",
    "status": "UNAVAILABLE"
  }
}
Or for Vertex AI users:
{
  "error": {
    "code": 503,
    "message": "MODEL_CAPACITY_EXHAUSTED: The model is temporarily unable to serve your request.",
    "status": "UNAVAILABLE"
  }
}
This happens most frequently with Gemini 2.5 Pro due to its high computational requirements and popularity. Flash models are less likely to experience capacity issues but are not immune.
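When handling these responses yourself, it can be more robust to branch on the `code` and `status` fields than on the message text. The following is a minimal sketch, assuming the error body has the JSON shape shown above; the helper name `is_capacity_error` is hypothetical and not part of any SDK.

```python
import json

def is_capacity_error(body: str) -> bool:
    """Return True if a JSON error body looks like a 503 capacity error."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False  # not JSON at all
    error = payload.get("error") if isinstance(payload, dict) else None
    if not isinstance(error, dict):
        return False  # JSON, but not the error shape shown above
    return error.get("code") == 503 or error.get("status") == "UNAVAILABLE"

sample = '{"error": {"code": 503, "message": "The model is overloaded.", "status": "UNAVAILABLE"}}'
print(is_capacity_error(sample))
```

A check like this lets retry logic treat capacity errors differently from quota errors (429) or malformed requests, which should not be retried blindly.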
Immediate Fixes
1. Implement Exponential Backoff with Retry
The most effective immediate solution is to retry the request with increasing delays between attempts:
import time
import random
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")

def generate_with_retry(prompt, max_retries=5):
    for attempt in range(max_retries):
        try:
            return client.models.generate_content(
                model="gemini-2.5-pro",
                contents=prompt,
            )
        except Exception as e:
            # Only retry capacity errors; re-raise anything else unchanged.
            if "503" not in str(e) and "overloaded" not in str(e).lower():
                raise
            if attempt == max_retries - 1:
                break  # no point sleeping after the final attempt
            # Exponential backoff (1s, 2s, 4s, ...) plus up to 1s of jitter.
            wait_time = (2 ** attempt) + random.uniform(0, 1)
            print(f"Model overloaded. Retrying in {wait_time:.1f}s...")
            time.sleep(wait_time)
    raise Exception("Max retries exceeded. Model still unavailable.")
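The wait time in this loop grows as 2^attempt seconds plus up to one second of random jitter, so successive waits are roughly 1, 2, 4, and 8 seconds. A common refinement, which is not part of the snippet above, is to cap the delay so that a larger `max_retries` does not produce multi-minute waits. The sketch below isolates the delay calculation; the function name `backoff_delay` and the 30-second cap are illustrative assumptions.

```python
import random

def backoff_delay(attempt, base=1.0, cap=30.0, jitter=1.0):
    """Seconds to wait before retrying after failed attempt number `attempt` (0-based)."""
    exponential = base * (2 ** attempt)
    # Cap the exponential part, then add jitter so clients do not retry in lockstep.
    return min(cap, exponential) + random.uniform(0, jitter)

for attempt in range(5):
    print(f"attempt {attempt}: wait about {backoff_delay(attempt):.1f}s")
```

The jitter matters when many clients hit the same overloaded model: without it, they would all retry at the same instants and recreate the spike.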
2. Switch to a Different Model
If Pro is overloaded, Flash often remains available and can handle many tasks adequately:
def generate_with_fallback(prompt):
    # Ordered from most to least capable; reuses the `client` created above.
    models = ["gemini-2.5-pro", "gemini-2.5-flash", "gemini-2.0-flash"]
    for model in models:
        try:
            return client.models.generate_content(
                model=model,
                contents=prompt,
            )
        except Exception as e:
            if "503" in str(e):
                print(f"{model} unavailable, trying next model...")
                continue
            raise  # not a capacity error, so surface it
    raise Exception("All models unavailable.")
3. Try a Different Region (Vertex AI)
If you are using Vertex AI, capacity varies by region. Try switching to a less loaded region:
- us-central1 (Iowa): most popular, often first to hit capacity
- us-east4 (Virginia): often less loaded
- europe-west1 (Belgium): good for European users
- asia-northeast1 (Tokyo): good for Asian users
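Switching regions can also be automated. The sketch below tries each region in order and moves on when a call fails with what looks like a capacity error. The helper names `call_with_region_fallback` and `vertex_generate`, the `your-project-id` placeholder, and the string-matching error check are assumptions for illustration; `genai.Client(vertexai=True, project=..., location=...)` is how the `google-genai` SDK targets a specific Vertex AI region.

```python
def looks_like_capacity_error(exc: Exception) -> bool:
    # Heuristic based on the error text shown earlier; adjust to your SDK's error types.
    text = str(exc)
    return "503" in text or "MODEL_CAPACITY_EXHAUSTED" in text

def call_with_region_fallback(prompt, regions, call_in_region):
    """Try call_in_region(region, prompt) for each region until one succeeds."""
    last_error = None
    for region in regions:
        try:
            return region, call_in_region(region, prompt)
        except Exception as exc:
            if not looks_like_capacity_error(exc):
                raise  # unrelated failure: do not mask it by switching regions
            last_error = exc
    raise RuntimeError("All regions are at capacity.") from last_error

def vertex_generate(region, prompt, project="your-project-id"):
    # Imported lazily so the helpers above work without the SDK installed.
    from google import genai
    client = genai.Client(vertexai=True, project=project, location=region)
    return client.models.generate_content(model="gemini-2.5-pro", contents=prompt)
```

A call such as `call_with_region_fallback("Hello", ["us-east4", "europe-west1", "us-central1"], vertex_generate)` returns the region that served the request together with the response. Keep data residency requirements in mind before routing traffic across continents.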
Long-Term Solutions
Reduce Your Request Size
Larger requests consume more server resources and are more likely to be rejected during capacity crunches. To reduce request size:
- Trim unnecessary context from your prompts
- Use context caching for repeated large contexts instead of sending them with every request
- Set a lower max_output_tokens value when you do not need long responses
- Split very large tasks into smaller, sequential requests
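Two of these points lend themselves to a short sketch: splitting a large input and capping the output length. The code below is illustrative only. `split_into_chunks`, `generate_short`, the 4,000-character budget, and the 256-token limit are hypothetical choices; `max_output_tokens` is passed through `types.GenerateContentConfig` from the `google-genai` SDK.

```python
def split_into_chunks(text, max_chars=4000):
    """Group blank-line-separated paragraphs into chunks of up to max_chars characters.

    A single paragraph longer than max_chars is kept whole rather than cut mid-sentence.
    """
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)  # current chunk is full, start a new one
            current = paragraph
    if current:
        chunks.append(current)
    return chunks

def generate_short(client, prompt, max_output_tokens=256):
    # Imported lazily so split_into_chunks works without the SDK installed.
    from google.genai import types
    return client.models.generate_content(
        model="gemini-2.5-flash",
        contents=prompt,
        config=types.GenerateContentConfig(max_output_tokens=max_output_tokens),
    )
```

Each chunk can then be sent as its own request, one after another, for example `for chunk in split_into_chunks(document): generate_short(client, chunk)`.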
Use Request Queuing
Instead of sending all requests simultaneously, implement a queue that processes requests at a controlled rate:
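import asyncio

async def generate_async(prompt):
    # Async counterpart of generate_content; reuses the `client` created earlier.
    return await client.aio.models.generate_content(
        model="gemini-2.5-pro",
        contents=prompt,
    )

class RequestQueue:
    def __init__(self, max_concurrent=3, delay=1.0):
        # Cap the number of requests in flight at any one time.
        self.semaphore = asyncio.Semaphore(max_concurrent)
        self.delay = delay  # pause in seconds before a slot is released

    async def process(self, prompt):
        async with self.semaphore:
            result = await generate_async(prompt)
            # Hold the slot briefly so requests are spaced out.
            await asyncio.sleep(self.delay)
            return result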
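To see the concurrency cap in action without calling the API, the same pattern can be exercised against a stand-in for the model call. The sketch below is self-contained; `ThrottledRunner`, `FakeModel`, `run_batch`, and the timings are hypothetical stand-ins rather than part of the code above.

```python
import asyncio

class ThrottledRunner:
    """Same semaphore pattern as the queue above, with the worker passed in."""
    def __init__(self, worker, max_concurrent=3, delay=0.0):
        self.worker = worker
        self.semaphore = asyncio.Semaphore(max_concurrent)
        self.delay = delay

    async def process(self, prompt):
        async with self.semaphore:
            result = await self.worker(prompt)
            await asyncio.sleep(self.delay)
            return result

class FakeModel:
    """Stand-in for the API that records how many calls overlap."""
    def __init__(self):
        self.in_flight = 0
        self.peak = 0

    async def generate(self, prompt):
        self.in_flight += 1
        self.peak = max(self.peak, self.in_flight)
        await asyncio.sleep(0.01)  # simulated latency
        self.in_flight -= 1
        return f"response to {prompt}"

async def run_batch(prompts, max_concurrent=3):
    model = FakeModel()
    runner = ThrottledRunner(model.generate, max_concurrent=max_concurrent)
    # gather() schedules every prompt at once; the semaphore does the throttling.
    results = await asyncio.gather(*(runner.process(p) for p in prompts))
    return results, model.peak

results, peak = asyncio.run(run_batch([f"p{i}" for i in range(10)]))
print(len(results), "responses, peak concurrency", peak)
```

Results come back in the same order as the prompts, and the recorded peak never exceeds `max_concurrent`, which is the property that keeps a burst of work from turning into a burst of simultaneous API calls.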
Schedule Requests During Off-Peak Hours
Capacity issues are most common during US business hours (roughly 9 AM to 6 PM Pacific Time). If your workload is flexible, scheduling batch processing outside that window can significantly reduce 503 errors.
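A batch job can check the clock before it starts. The sketch below treats the 9 AM to 6 PM Pacific window mentioned above as the peak period; the function name `is_off_peak` and its default hours are assumptions, and weekends are not treated specially.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

PACIFIC = ZoneInfo("America/Los_Angeles")

def is_off_peak(moment: datetime, start_hour=9, end_hour=18) -> bool:
    """True if `moment` falls outside start_hour..end_hour in Pacific Time.

    `moment` should be timezone-aware; naive datetimes are interpreted
    as system local time by astimezone().
    """
    local = moment.astimezone(PACIFIC)
    return not (start_hour <= local.hour < end_hour)

if is_off_peak(datetime.now(timezone.utc)):
    print("Off-peak: a reasonable time to start batch processing.")
else:
    print("Peak hours: consider deferring non-urgent work.")
```

Using a named time zone rather than a fixed UTC offset keeps the check correct across daylight saving changes.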
Use a Relay Service for Automatic Failover
For production applications where downtime is not acceptable, an API relay service like claude4u.com provides built-in handling for 503 errors. The relay service maintains multiple API keys across different projects and regions, automatically retrying failed requests on different endpoints and queuing requests when all endpoints are temporarily unavailable. This transparent handling means your application code does not need to implement complex retry logic — the relay service handles it for you.
When to Contact Google Support
If you experience persistent 503 errors lasting more than a few hours, it may indicate a larger service issue. Check the Google Cloud Status Dashboard for any ongoing incidents. For Vertex AI customers with premium support plans, you can open a support ticket to request capacity allocation for your project.