OpenAI API Best Practices

OpenAI API Best Practices for Production Applications

Moving from prototype to production with the OpenAI API requires careful attention to reliability, cost efficiency, security, and user experience. This guide compiles the most important best practices learned from real-world production deployments serving millions of requests.

1. Use Environment Variables for Configuration

Never hardcode API keys, model names, or base URLs in your source code:

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url=os.environ.get("OPENAI_BASE_URL", "https://claude4u.com/v1")
)

MODEL = os.environ.get("OPENAI_MODEL", "gpt-4o")
MAX_TOKENS = int(os.environ.get("OPENAI_MAX_TOKENS", "2000"))

Warning: Leaked API keys can result in thousands of dollars in unauthorized charges within hours. Never commit keys to version control, log them, or include them in client-side code.
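One way to act on this warning is to validate configuration once at startup and to keep the raw key out of log output. The sketch below is a minimal illustration using only the standard library; `Settings`, `load_settings`, and `mask_key` are hypothetical names, not part of the OpenAI SDK.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    api_key: str
    model: str
    max_tokens: int


def load_settings(env=os.environ):
    """Read configuration from the environment, failing fast on bad values."""
    api_key = env.get("OPENAI_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    max_tokens = int(env.get("OPENAI_MAX_TOKENS", "2000"))
    if max_tokens <= 0:
        raise ValueError("OPENAI_MAX_TOKENS must be a positive integer")
    return Settings(api_key, env.get("OPENAI_MODEL", "gpt-4o"), max_tokens)


def mask_key(key, visible=4):
    """Return a log-safe form of a key that keeps only the last few characters."""
    if len(key) <= visible:
        return "*" * len(key)
    return "*" * (len(key) - visible) + key[-visible:]
```

Calling `load_settings()` at process start surfaces a missing key immediately rather than on the first request, and `mask_key(settings.api_key)` is what should appear in any diagnostic output.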

2. Implement Robust Error Handling

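from openai import OpenAI, APIConnectionError, APIStatusError, RateLimitError
import time
import random

# The SDK retries some failures on its own; max_retries=0 disables that
# so the loop below is the only retry layer.
client = OpenAI(base_url="https://claude4u.com/v1", max_retries=0)

def call_api(messages, max_retries=3):
    """Call the chat API, retrying transient failures with exponential backoff."""
    last_error = None
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="gpt-4o",
                messages=messages,
                max_tokens=2000,
                timeout=60
            )
        except RateLimitError as e:
            # 429: back off exponentially, with jitter to avoid synchronized retries
            last_error, wait = e, (2 ** attempt) + random.uniform(0, 1)
        except APIConnectionError as e:
            # Network failure or timeout: retry after a short pause
            last_error, wait = e, 1
        except APIStatusError as e:
            # Only server errors (5xx) are worth retrying; other 4xx errors will fail again
            if e.status_code < 500:
                raise
            last_error, wait = e, 2 ** attempt
        # Do not sleep after the final attempt
        if attempt < max_retries - 1:
            time.sleep(wait)
    raise RuntimeError("Max retries exceeded") from last_error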
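The retry loop above computes its wait time inline. A small pure function makes the schedule easier to test and to cap. The sketch below uses a capped "full jitter" schedule, which is a swap-in for the additive jitter used above rather than the same formula; `backoff_delay` and its parameters are hypothetical names.

```python
import random


def backoff_delay(attempt, base=1.0, cap=30.0, rng=random.random):
    """Return a sleep time in seconds for a zero-based retry attempt.

    The upper bound doubles with each attempt up to `cap`; the actual delay
    is drawn uniformly between 0 and that bound ("full jitter").
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling
```

Passing `rng` explicitly keeps the function deterministic in tests, while production code can rely on the default random source.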

3. Always Set max_tokens

Without max_tokens, the model may generate extremely long responses, increasing both cost and latency unpredictably:

# Bad - no limit on output length
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages
)

# Good - explicit output limit
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    max_tokens=1000  # Control cost and latency
)

4. Choose the Right Model for Each Task

Simple classification, extraction, formatting → gpt-4o-mini ($0.15/1M input)
Complex reasoning, code generation, analysis → gpt-4o ($2.50/1M input)
Math, logic, multi-step reasoning → o3-mini ($1.10/1M input)
Embeddings for search → text-embedding-3-small ($0.02/1M tokens)

Tip: Use a two-tier approach: route simple requests to gpt-4o-mini and complex ones to gpt-4o. Depending on how much of your traffic the smaller model can absorb, this can cut costs by as much as 80% while maintaining quality where it matters.
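A two-tier setup can be as simple as a lookup table. The sketch below assumes the application already labels each request with a task type; the labels, `pick_model`, and `blended_input_price` are hypothetical, and the prices are copied from the list above and should be checked against current pricing.

```python
# Price per 1M input tokens, taken from the list above (may be out of date).
INPUT_PRICE = {"gpt-4o-mini": 0.15, "gpt-4o": 2.50}

# Hypothetical task labels; the application decides how requests get labeled.
MODEL_BY_TASK = {
    "classification": "gpt-4o-mini",
    "extraction": "gpt-4o-mini",
    "formatting": "gpt-4o-mini",
    "code": "gpt-4o",
    "analysis": "gpt-4o",
    "math": "o3-mini",
}


def pick_model(task_type, default="gpt-4o"):
    """Return the model for a task label, falling back to the stronger model."""
    return MODEL_BY_TASK.get(task_type, default)


def blended_input_price(mini_share):
    """Average input price per 1M tokens when `mini_share` of tokens go to gpt-4o-mini."""
    if not 0.0 <= mini_share <= 1.0:
        raise ValueError("mini_share must be between 0 and 1")
    return (mini_share * INPUT_PRICE["gpt-4o-mini"]
            + (1 - mini_share) * INPUT_PRICE["gpt-4o"])
```

Under these prices, sending 85% of input tokens to gpt-4o-mini gives a blended input price of about $0.50 per 1M tokens against $2.50 for gpt-4o alone, which is where a saving of roughly 80% comes from. Output tokens are priced separately and are ignored here.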

5. Implement Response Caching

import hashlib
import json
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def cached_completion(client, messages, model="gpt-4o", ttl=3600):
    """Return the completion text, served from Redis when the same request was seen before."""
    # Caching only makes sense for repeatable requests, so temperature is pinned to 0 below
    cache_key = "oai:" + hashlib.sha256(
        json.dumps({"model": model, "messages": messages}, sort_keys=True).encode()
    ).hexdigest()

    cached = r.get(cache_key)
    if cached is not None:
        return json.loads(cached)

    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0,
        max_tokens=1000
    )

    # Store only the text, so cache hits and misses return the same type
    result = response.choices[0].message.content
    r.setex(cache_key, ttl, json.dumps(result))
    return result
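The function above hashes only the model and the messages, which is safe as long as every other request parameter is fixed. If parameters such as max_tokens vary between callers, they should be part of the key as well. A minimal sketch, with `make_cache_key` as a hypothetical helper:

```python
import hashlib
import json


def make_cache_key(model, messages, **params):
    """Build a stable key from every input that can change the response."""
    payload = {"model": model, "messages": messages, "params": params}
    # sort_keys makes the JSON, and therefore the hash, independent of dict ordering
    canonical = json.dumps(payload, sort_keys=True, ensure_ascii=False)
    return "oai:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two calls with the same arguments in a different keyword order produce the same key, while changing any parameter produces a different one.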

6. Monitor Usage and Costs

import logging

logger = logging.getLogger("openai_usage")

def tracked_completion(client, messages, model="gpt-4o"):
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        max_tokens=1000
    )

    usage = response.usage
    cost = estimate_cost(model, usage)

    logger.info("API call completed", extra={
        "model": model,
        "prompt_tokens": usage.prompt_tokens,
        "completion_tokens": usage.completion_tokens,
        "estimated_cost_usd": round(cost, 6)
    })
    return response

def estimate_cost(model, usage):
    rates = {
        "gpt-4o": (2.50, 10.00),
        "gpt-4o-mini": (0.15, 0.60),
        "gpt-3.5-turbo": (0.50, 1.50),
    }
    inp, out = rates.get(model, (0, 0))
    return (usage.prompt_tokens * inp + usage.completion_tokens * out) / 1_000_000
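Logging per-call cost is most useful when something watches the running total. The sketch below is one simple way to turn the estimates into an alert signal; `BudgetTracker`, its thresholds, and the returned status strings are hypothetical, and a real deployment would also reset the total each day and persist it across processes.

```python
class BudgetTracker:
    """Accumulate estimated spend and report when it nears or exceeds a limit."""

    def __init__(self, limit_usd, alert_ratio=0.8):
        self.limit_usd = limit_usd
        self.alert_ratio = alert_ratio
        self.spent_usd = 0.0

    def record(self, cost_usd):
        """Add one call's estimated cost and return 'ok', 'warning', or 'over_limit'."""
        self.spent_usd += cost_usd
        if self.spent_usd >= self.limit_usd:
            return "over_limit"
        if self.spent_usd >= self.limit_usd * self.alert_ratio:
            return "warning"
        return "ok"
```

Feeding the value returned by `estimate_cost` into `record` after each call gives the application a hook for paging someone or switching to a cheaper model before the limit is reached.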

7. Manage Conversation Context Efficiently

import tiktoken

def build_messages(system_prompt, conversation, max_tokens=100000):
    """Trim conversation to fit within context window."""
    enc = tiktoken.encoding_for_model("gpt-4o")
    messages = [{"role": "system", "content": system_prompt}]
    budget = max_tokens - len(enc.encode(system_prompt)) - 10

    kept = []
    for msg in reversed(conversation):
        msg_tokens = len(enc.encode(msg["content"])) + 4
        if msg_tokens > budget:
            break
        kept.insert(0, msg)
        budget -= msg_tokens

    return messages + kept

8. Use Streaming for Better UX

Use streaming for user-facing applications. Total generation time is unchanged, but users see the first tokens as soon as the model produces them instead of waiting several seconds for the complete response:

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    stream=True,
    stream_options={"include_usage": True}
)

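for chunk in stream:
    # With include_usage, the final chunk has an empty choices list and carries the usage totals
    if chunk.choices:
        content = chunk.choices[0].delta.content
        if content:
            print(content, end="", flush=True)
    if chunk.usage:
        print(f"\nTokens: {chunk.usage.total_tokens}")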

9. Implement Application-Level Rate Limiting

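from collections import deque
import time
import asyncio

class TokenBucket:
    """Sliding-window limiter: allows at most `rpm` acquisitions per 60 seconds."""

    def __init__(self, rpm=500):
        self.rpm = rpm
        self.timestamps = deque()
        self.lock = asyncio.Lock()

    async def acquire(self):
        # Serialize callers so the window is checked and updated atomically
        async with self.lock:
            while True:
                now = time.monotonic()
                # Drop timestamps that have left the 60-second window
                while self.timestamps and self.timestamps[0] <= now - 60:
                    self.timestamps.popleft()
                if len(self.timestamps) < self.rpm:
                    self.timestamps.append(now)
                    return
                # Window is full: wait until the oldest entry expires, then re-check
                await asyncio.sleep(60 - (now - self.timestamps[0]))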

10. Security Best Practices

Validate all user input before sending it to the API, and treat it as untrusted data, to reduce the risk of prompt injection
Sanitize model outputs before rendering them in HTML to prevent XSS attacks
Set spending limits on your account and per API key
Log requests without sensitive data: mask user PII and API keys
Rotate API keys quarterly and after any suspected compromise
Use a relay service to isolate upstream credentials from client applications
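Two of the items above, output sanitization and log masking, can be sketched with the standard library alone. The helper names and the regular expressions below are illustrative assumptions: the email pattern is deliberately loose, and the `sk-` prefix is only an assumed key format, so real deployments should adapt both.

```python
import html
import re

# Loose patterns for illustration; tune them to the data you actually handle.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
KEY_RE = re.compile(r"sk-[A-Za-z0-9_-]{8,}")  # assumed key shape


def render_safe(model_output):
    """Escape model output so it is displayed as text, not interpreted as HTML."""
    return html.escape(model_output)


def redact(text):
    """Mask email addresses and key-like strings before the text is logged."""
    text = EMAIL_RE.sub("[email]", text)
    return KEY_RE.sub("[api-key]", text)
```

Escaping belongs at render time, as close to the template as possible, while redaction belongs in the logging path so that raw prompts never reach log storage.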

11. Implement Graceful Degradation

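from openai import AsyncOpenAI, APIError

# Awaiting requests requires the async client
async_client = AsyncOpenAI(base_url="https://claude4u.com/v1")

async def resilient_completion(messages):
    """Try the primary model, fall back to a faster model, then return a default reply."""
    try:
        return await async_client.chat.completions.create(
            model="gpt-4o", messages=messages, max_tokens=1000, timeout=30
        )
    except APIError:
        try:
            return await async_client.chat.completions.create(
                model="gpt-4o-mini", messages=messages, max_tokens=1000, timeout=15
            )
        except APIError:
            # This dict has a different shape from a completion object; callers must handle both
            return {"fallback": True, "content": "Service temporarily unavailable."}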

Production Checklist

API keys stored in environment variables or secrets manager
Error handling with retries and exponential backoff
max_tokens set on every request
Usage monitoring and cost alerts configured
Rate limiting on your application layer
Input validation and output sanitization
Streaming enabled for user-facing responses
Appropriate model selected for each task type
Response caching for deterministic queries
Timeouts and fallback strategies configured

Tip: claude4u.com provides production-grade infrastructure out of the box: automatic load balancing, multi-account failover, rate limit pooling, usage analytics, and multi-provider routing. Instead of building these features yourself, use the relay service and focus on your application logic.
