OpenAI API Best Practices for Production Applications
Moving from prototype to production with the OpenAI API requires careful attention to reliability, cost efficiency, security, and user experience. This guide compiles the most important best practices learned from real-world production deployments serving millions of requests.
1. Use Environment Variables for Configuration
Never hardcode API keys, model names, or base URLs in your source code:
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url=os.environ.get("OPENAI_BASE_URL", "https://claude4u.com/v1"),
)

MODEL = os.environ.get("OPENAI_MODEL", "gpt-4o")
MAX_TOKENS = int(os.environ.get("OPENAI_MAX_TOKENS", "2000"))
Warning: Leaked API keys can result in thousands of dollars in unauthorized charges within hours. Never commit keys to version control, log them, or include them in client-side code.
2. Implement Robust Error Handling
import random
import time

from openai import OpenAI, APIConnectionError, APIStatusError, RateLimitError

client = OpenAI(base_url="https://claude4u.com/v1")

def call_api(messages, max_retries=3):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="gpt-4o",
                messages=messages,
                max_tokens=2000,
                timeout=60,
            )
        except RateLimitError:
            # Exponential backoff with jitter
            wait = (2 ** attempt) + random.uniform(0, 1)
            time.sleep(wait)
        except APIConnectionError:
            if attempt == max_retries - 1:
                raise
            time.sleep(1)
        except APIStatusError as e:
            # Only APIStatusError carries status_code; retry 5xx, re-raise 4xx
            if e.status_code >= 500:
                time.sleep(2 ** attempt)
            else:
                raise
    raise RuntimeError("Max retries exceeded")
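The retry loop above waits 2 ** attempt seconds plus up to one second of jitter. As a rough sketch, the same schedule can be pulled into a standalone helper so the delays are easy to test and bound. The function name `backoff_delay` and the `base`, `cap`, and `jitter` parameters are hypothetical and not part of the OpenAI SDK:

```python
import random

def backoff_delay(attempt, base=1.0, cap=30.0, jitter=1.0):
    """Return the delay in seconds before retry number `attempt` (0-based).

    Hypothetical helper: the delay grows as base * 2**attempt, is capped at
    `cap`, and gets a random amount between 0 and `jitter` added so that
    many clients do not retry at the same instant.
    """
    delay = min(cap, base * (2 ** attempt))
    return delay + random.uniform(0, jitter)

if __name__ == "__main__":
    # Print one sample schedule; exact values vary because of the jitter
    for attempt in range(5):
        print(attempt, round(backoff_delay(attempt), 2))
```

Capping the delay keeps a long run of failures from producing multi-minute sleeps, which the uncapped 2 ** attempt schedule would eventually do.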
3. Always Set max_tokens
Without max_tokens, the model may generate extremely long responses, increasing both cost and latency unpredictably:
# Bad - no limit on output length
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
)

# Good - explicit output limit
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    max_tokens=1000,  # Control cost and latency
)
4. Choose the Right Model for Each Task
- Simple classification, extraction, formatting → gpt-4o-mini ($0.15/1M input)
- Complex reasoning, code generation, analysis → gpt-4o ($2.50/1M input)
- Math, logic, multi-step reasoning → o3-mini ($1.10/1M input)
- Embeddings for search → text-embedding-3-small ($0.02/1M tokens)
Tip: Use a two-tier approach: route simple requests to gpt-4o-mini and complex ones to gpt-4o. This can reduce costs by 80% while maintaining quality where it matters.
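One way to sketch the routing described above is a small lookup from task type to model name. The task category names and the `pick_model` function are assumptions made for illustration; only the model names come from the list above:

```python
# Hypothetical task categories; adjust them to match your own request types.
SIMPLE_TASKS = {"classification", "extraction", "formatting"}
REASONING_TASKS = {"math", "logic"}

def pick_model(task_type: str) -> str:
    """Route a request to a model tier based on its task type."""
    if task_type in SIMPLE_TASKS:
        return "gpt-4o-mini"
    if task_type in REASONING_TASKS:
        return "o3-mini"
    # Default to the more capable general-purpose model
    return "gpt-4o"

if __name__ == "__main__":
    for task in ("classification", "math", "code_generation"):
        print(task, "->", pick_model(task))
```

Defaulting unknown task types to the stronger model trades some cost for safety; a cost-sensitive deployment might default the other way.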
5. Implement Response Caching
import hashlib
import json

import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def cached_completion(client, messages, model="gpt-4o", ttl=3600):
    # Requests are always sent with temperature=0 so cached answers stay representative
    cache_key = "oai:" + hashlib.sha256(
        json.dumps({"model": model, "messages": messages}, sort_keys=True).encode()
    ).hexdigest()
    cached = r.get(cache_key)
    if cached is not None:
        return json.loads(cached)
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0,
        max_tokens=1000,
    )
    # Cache only the message text, not the full response object
    result = response.choices[0].message.content
    r.setex(cache_key, ttl, json.dumps(result))
    return result
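The cache key above hashes only the model and the messages, which is safe as long as every other request parameter is fixed. If callers are allowed to vary parameters such as temperature or max_tokens, the key probably needs to cover them too. A minimal sketch, assuming a hypothetical `make_cache_key` helper:

```python
import hashlib
import json

def make_cache_key(model, messages, **params):
    """Build a deterministic cache key from everything that affects the output.

    Hypothetical helper: `params` holds any extra request parameters
    (for example temperature or max_tokens) that should invalidate the cache
    when they change.
    """
    payload = {"model": model, "messages": messages, "params": params}
    # sort_keys makes the JSON text independent of dict and keyword order
    digest = hashlib.sha256(
        json.dumps(payload, sort_keys=True).encode("utf-8")
    ).hexdigest()
    return "oai:" + digest

if __name__ == "__main__":
    msgs = [{"role": "user", "content": "Hello"}]
    print(make_cache_key("gpt-4o", msgs, temperature=0, max_tokens=1000))
```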
6. Monitor Usage and Costs
import logging

logger = logging.getLogger("openai_usage")

def tracked_completion(client, messages, model="gpt-4o"):
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        max_tokens=1000,
    )
    usage = response.usage
    cost = estimate_cost(model, usage)
    logger.info("API call completed", extra={
        "model": model,
        "prompt_tokens": usage.prompt_tokens,
        "completion_tokens": usage.completion_tokens,
        "estimated_cost_usd": round(cost, 6),
    })
    return response

def estimate_cost(model, usage):
    # (input, output) price in USD per 1M tokens
    rates = {
        "gpt-4o": (2.50, 10.00),
        "gpt-4o-mini": (0.15, 0.60),
        "gpt-3.5-turbo": (0.50, 1.50),
    }
    # Models missing from the table are reported as costing 0
    inp, out = rates.get(model, (0, 0))
    return (usage.prompt_tokens * inp + usage.completion_tokens * out) / 1_000_000
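As a worked example of the cost formula above, consider a gpt-4o call with 1,200 prompt tokens and 300 completion tokens. The token counts are made up for illustration, and `estimate_cost` is restated here only so the snippet runs on its own:

```python
from types import SimpleNamespace

# (input, output) price in USD per 1M tokens, taken from the rates above
RATES_PER_MILLION = {"gpt-4o": (2.50, 10.00)}

def estimate_cost(model, usage):
    inp, out = RATES_PER_MILLION.get(model, (0, 0))
    return (usage.prompt_tokens * inp + usage.completion_tokens * out) / 1_000_000

# Stand-in for response.usage with illustrative token counts
usage = SimpleNamespace(prompt_tokens=1200, completion_tokens=300)

# (1200 * 2.50 + 300 * 10.00) / 1_000_000 = 6000 / 1_000_000
cost = estimate_cost("gpt-4o", usage)
print(round(cost, 6))  # 0.006
```

Although the prompt is four times longer than the completion here, each side contributes half of the cost, because output tokens are priced at four times the input rate.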
7. Manage Conversation Context Efficiently
import tiktoken

def build_messages(system_prompt, conversation, max_tokens=100000):
    """Trim conversation to fit within context window."""
    enc = tiktoken.encoding_for_model("gpt-4o")
    messages = [{"role": "system", "content": system_prompt}]
    budget = max_tokens - len(enc.encode(system_prompt)) - 10
    kept = []
    # Walk from newest to oldest so the most recent messages survive trimming
    for msg in reversed(conversation):
        # +4 approximates the per-message formatting overhead
        msg_tokens = len(enc.encode(msg["content"])) + 4
        if msg_tokens > budget:
            break
        kept.insert(0, msg)
        budget -= msg_tokens
    return messages + kept
8. Use Streaming for Better UX
Use streaming for user-facing applications. It cuts perceived latency because the first tokens appear as soon as they are generated, rather than after the whole response has completed:
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    stream=True,
    stream_options={"include_usage": True},
)
for chunk in stream:
    # The final usage chunk has an empty choices list, so guard the index
    if chunk.choices:
        content = chunk.choices[0].delta.content
        if content:
            print(content, end="", flush=True)
    if chunk.usage:
        print(f"\nTokens: {chunk.usage.total_tokens}")
9. Implement Application-Level Rate Limiting
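import asyncio
import time
from collections import deque

class TokenBucket:
    """Sliding-window limiter: at most `rpm` requests in any 60-second window."""

    def __init__(self, rpm=500):
        self.rpm = rpm
        self.timestamps = deque()

    async def acquire(self):
        while True:
            now = time.monotonic()
            # Drop timestamps that have left the 60-second window
            while self.timestamps and self.timestamps[0] <= now - 60:
                self.timestamps.popleft()
            if len(self.timestamps) < self.rpm:
                break
            # Wait for the oldest request to expire, then re-check the window
            await asyncio.sleep(60 - (now - self.timestamps[0]))
        self.timestamps.append(time.monotonic())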
10. Security Best Practices
- Validate all user input before sending it to the API to reduce the risk of prompt injection
- Sanitize model outputs before rendering them in HTML to prevent XSS attacks
- Set spending limits on your account and per API key
- Log requests without sensitive data: mask user PII and API keys
- Rotate API keys quarterly and after any suspected compromise
- Use a relay service to isolate upstream credentials from client applications
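Two of the items above, output sanitization and log masking, can be sketched with the standard library alone. The helper names `render_safe` and `mask_secrets` are hypothetical, and the key pattern assumes keys that start with "sk-"; adapt it to whatever key format your provider issues:

```python
import html
import re

# Assumed key format: "sk-" followed by at least 8 letters, digits, "_" or "-"
_KEY_PATTERN = re.compile(r"sk-[A-Za-z0-9_-]{8,}")

def render_safe(model_output: str) -> str:
    """Escape HTML special characters so model output is shown as text, not markup."""
    return html.escape(model_output)

def mask_secrets(log_line: str) -> str:
    """Replace anything that looks like an API key before the line is logged."""
    return _KEY_PATTERN.sub("sk-***", log_line)

if __name__ == "__main__":
    print(render_safe('<script>alert("x")</script>'))
    print(mask_secrets("auth failed for key=sk-abc123DEF456 on request 42"))
```

Escaping is enough when the output is inserted as plain text. If you render model output as Markdown or rich HTML, a dedicated sanitizer is likely a better fit than escaping alone.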
11. Implement Graceful Degradation
from openai import AsyncOpenAI

# The awaited calls below require the async client
client = AsyncOpenAI(base_url="https://claude4u.com/v1")

async def resilient_completion(messages):
    """Try the primary model, fall back to a faster model, then return a default."""
    try:
        return await client.chat.completions.create(
            model="gpt-4o", messages=messages, max_tokens=1000, timeout=30
        )
    except Exception:
        try:
            return await client.chat.completions.create(
                model="gpt-4o-mini", messages=messages, max_tokens=1000, timeout=15
            )
        except Exception:
            # Callers must handle this plain dict, which is not a ChatCompletion object
            return {"fallback": True, "content": "Service temporarily unavailable."}
Production Checklist
- API keys stored in environment variables or secrets manager
- Error handling with retries and exponential backoff
- max_tokens set on every request
- Usage monitoring and cost alerts configured
- Rate limiting on your application layer
- Input validation and output sanitization
- Streaming enabled for user-facing responses
- Appropriate model selected for each task type
- Response caching for deterministic queries
- Timeouts and fallback strategies configured
Tip: claude4u.com provides production-grade infrastructure out of the box: automatic load balancing, multi-account failover, rate limit pooling, usage analytics, and multi-provider routing. Instead of building these features yourself, use the relay service and focus on your application logic.