AI API Best Practices for Developers
Working with AI APIs effectively requires more than just making HTTP requests and parsing responses. From prompt engineering to error handling, streaming to cost management, there are proven patterns that separate production-grade AI integrations from fragile prototypes. This guide covers the essential best practices every developer should follow when building with AI APIs.
1. Design Your Prompts for Consistency
Prompt engineering is the foundation of reliable AI API usage. A well-structured prompt produces consistent, useful outputs while minimizing token waste.
// Bad: Vague, inconsistent results
const vagueResponse = await client.chat.completions.create({
  model: "claude-sonnet-4-20250514",
  messages: [
    { role: "user", content: "Check this code" }
  ]
});

// Good: Structured prompt with clear expectations
// (codeSnippet is the source code you want reviewed)
const structuredResponse = await client.chat.completions.create({
  model: "claude-sonnet-4-20250514",
  messages: [
    {
      role: "system",
      content: `You are a code reviewer. Analyze code for:
1. Bugs and logical errors
2. Security vulnerabilities
3. Performance issues
Respond in JSON with fields: issues[], suggestions[], severity`
    },
    { role: "user", content: `Review this function:\n${codeSnippet}` }
  ]
});
2. Implement Robust Error Handling
AI APIs can fail in ways that traditional APIs do not. Build resilience into every call:
async function callAIWithRetry(params, maxRetries = 3) {
  let lastError;
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await client.chat.completions.create(params);
    } catch (error) {
      lastError = error;
      const status = error.status;
      // Client errors (400, 401, 403) and errors without an HTTP status: don't retry
      if (status !== 429 && !(status >= 500)) throw error;
      // Out of attempts: stop here instead of waiting one more time
      if (attempt === maxRetries) break;

      let delay;
      if (status === 429) {
        // Rate limited: exponential backoff
        delay = Math.pow(2, attempt) * 1000;
      } else if (status === 529) {
        // Overloaded: wait longer
        delay = 10000;
      } else {
        // Other server error: linear backoff
        delay = attempt * 2000;
      }
      console.log(`Request failed with status ${status}. Retrying in ${delay}ms...`);
      await new Promise((r) => setTimeout(r, delay));
    }
  }
  // Keep the last underlying error available to callers
  throw new Error('Max retries exceeded', { cause: lastError });
}
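The delays in the retry function are deterministic, so many clients that fail at the same moment will also retry at the same moment. A common refinement is to randomize each wait ("full jitter"). The sketch below is one way to compute such a delay; `backoffDelay`, its parameter names, and the default values are illustrative assumptions, not provider recommendations.

```javascript
// Hypothetical helper: exponential backoff with "full jitter".
// baseMs and capMs are illustrative defaults, not provider guidance.
// `random` is injectable so the function can be tested deterministically.
function backoffDelay(attempt, baseMs = 1000, capMs = 30000, random = Math.random) {
  // Exponential ceiling for this attempt: baseMs * 2^attempt, capped at capMs.
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  // Pick a random delay in [0, ceiling) so clients do not retry in lockstep.
  return Math.floor(random() * ceiling);
}
```

Inside a retry loop, this value could replace a fixed delay, for example `await new Promise((r) => setTimeout(r, backoffDelay(attempt)))`.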
3. Use Streaming for Better User Experience
For interactive applications, streaming responses via Server-Sent Events (SSE) dramatically improves perceived latency. Users start seeing tokens as soon as the model generates them instead of waiting for the complete response.
const stream = await client.chat.completions.create({
  model: "claude-sonnet-4-20250514",
  messages: [{ role: "user", content: prompt }],
  stream: true
});

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  if (content) {
    process.stdout.write(content);
  }
}
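Applications usually need the complete text as well as the incremental tokens, for example to store the reply after rendering it. A minimal sketch, assuming chunks have the same `choices[0].delta.content` shape used in the loop above; `collectStream` and `fakeStream` are hypothetical names introduced here, and `fakeStream` only stands in for a real API stream.

```javascript
// Hypothetical helper: forwards each text delta to a callback and
// returns the full text once the stream ends.
async function collectStream(stream, onToken = () => {}) {
  let fullText = "";
  for await (const chunk of stream) {
    const content = chunk.choices?.[0]?.delta?.content;
    if (content) {
      onToken(content); // e.g. append to the UI
      fullText += content;
    }
  }
  return fullText;
}

// Stand-in for a real API stream, used only for demonstration.
async function* fakeStream() {
  yield { choices: [{ delta: { content: "Hel" } }] };
  yield { choices: [{ delta: {} }] }; // chunks without text are skipped
  yield { choices: [{ delta: { content: "lo" } }] };
}
```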
4. Manage Context Windows Efficiently
Every token in your request costs money and counts against the model's context window. Be intentional about what you include:
- Summarize conversation history instead of sending the full chat log
- Use system prompts wisely — keep them concise but complete
- Leverage prompt caching for repeated system prompts and few-shot examples
- Truncate or chunk large inputs rather than sending entire documents
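One simple way to apply the points above is to keep system messages and only as many recent turns as fit a token budget. The sketch below assumes messages with string `content` and uses a rough estimate of about 4 characters per token, which is only a heuristic; `estimateTokens` and `trimHistory` are hypothetical helpers.

```javascript
// Rough heuristic only: assumes about 4 characters per token.
// Use the provider's tokenizer or reported usage when accuracy matters.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Hypothetical helper: keeps all system messages plus the most recent
// non-system messages that fit within maxTokens (string content assumed).
function trimHistory(messages, maxTokens) {
  const system = messages.filter((m) => m.role === "system");
  const rest = messages.filter((m) => m.role !== "system");

  let used = system.reduce((sum, m) => sum + estimateTokens(m.content), 0);
  const kept = [];

  // Walk backwards so the newest turns are kept first.
  for (let i = rest.length - 1; i >= 0; i--) {
    const cost = estimateTokens(rest[i].content);
    if (used + cost > maxTokens) break;
    used += cost;
    kept.unshift(rest[i]);
  }
  return [...system, ...kept];
}
```

Dropped turns could be replaced by a short summary message rather than discarded outright, at the cost of one extra summarization call.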
5. Implement Request Cancellation
When a user navigates away or cancels an operation, stop the API request to avoid wasting tokens:
const controller = new AbortController();

// Cancel on user action
cancelButton.addEventListener('click', () => controller.abort());

try {
  const response = await fetch('https://claude4u.com/v1/chat/completions', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${apiKey}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify(params),
    signal: controller.signal
  });
  // Handle the response here, e.g. const data = await response.json();
} catch (error) {
  if (error.name === 'AbortError') {
    console.log('Request cancelled by user');
  } else {
    // Anything else is a real failure; don't swallow it
    throw error;
  }
}
6. Use the Right Model for the Task
Not every request needs the most powerful model. Match model capability to task complexity:
- Simple classification, extraction, formatting: Claude Haiku / GPT-4o-mini / Gemini Flash
- Code generation, analysis, general tasks: Claude Sonnet / GPT-4o / Gemini Pro
- Complex reasoning, research, difficult problems: Claude Opus / o3 / Gemini 2.5 Pro
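The tiers above can be encoded as a small lookup so that call sites name a task type rather than a model. This is a sketch only: the task labels are assumptions, and the model IDs are deliberate placeholders to be replaced with whatever identifiers your provider exposes.

```javascript
// Placeholder model IDs: substitute real identifiers from your provider.
const MODEL_TIERS = {
  simple: "small-fast-model",
  general: "mid-tier-model",
  complex: "top-tier-model",
};

// Hypothetical mapping from task type to capability tier.
const TASK_TIERS = new Map([
  ["classification", "simple"],
  ["extraction", "simple"],
  ["formatting", "simple"],
  ["code-generation", "general"],
  ["analysis", "general"],
  ["reasoning", "complex"],
  ["research", "complex"],
]);

function pickModel(taskType) {
  // Unknown task types default to the general tier.
  const tier = TASK_TIERS.get(taskType) ?? "general";
  return MODEL_TIERS[tier];
}
```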
7. Structure Your Output
When you need structured data from an AI model, use JSON mode or explicit schema instructions:
const response = await client.chat.completions.create({
  model: "claude-sonnet-4-20250514",
  messages: [
    {
      role: "user",
      content: `Extract entities from this text and return JSON:
{"people": [], "places": [], "dates": []}
Text: ${inputText}`
    }
  ],
  response_format: { type: "json_object" }
});
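Even with JSON mode, it is safer to validate the result before using it, since the model may omit a field or return a different shape. A minimal sketch that checks for the three arrays requested in the prompt above; `parseEntities` is a hypothetical helper, not part of any SDK.

```javascript
// Hypothetical helper: parses model output and verifies that it contains
// the three array fields requested in the prompt.
function parseEntities(rawText) {
  let data;
  try {
    data = JSON.parse(rawText);
  } catch (err) {
    throw new Error(`Model did not return valid JSON: ${err.message}`);
  }
  if (data === null || typeof data !== "object") {
    throw new Error("Expected a JSON object");
  }
  for (const key of ["people", "places", "dates"]) {
    if (!Array.isArray(data[key])) {
      throw new Error(`Missing or non-array field: ${key}`);
    }
  }
  return data;
}
```

Assuming the response follows the usual chat-completions shape, this could be called as `parseEntities(response.choices[0].message.content)`.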
8. Log and Monitor Everything
Production AI integrations need comprehensive observability:
- Log request/response metadata (model, tokens, latency) but never log prompt content in production
- Track token usage and costs per endpoint, user, and model
- Set up alerts for error rate spikes and latency increases
- Monitor rate limit headroom to anticipate scaling needs
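The first two points can be sketched as a wrapper that records metadata around any call while never touching message content. `withMetrics`, `callFn`, and `logFn` are hypothetical names, and the `usage.prompt_tokens` / `usage.completion_tokens` fields assume an OpenAI-compatible response shape.

```javascript
// Hypothetical wrapper: logs metadata only, never prompt or response text.
// callFn sends the request; logFn receives one record per call.
async function withMetrics(params, callFn, logFn = console.log) {
  const start = Date.now();
  const record = { model: params.model, status: "ok" };
  try {
    const response = await callFn(params);
    // Field names assume an OpenAI-compatible `usage` object.
    record.promptTokens = response.usage?.prompt_tokens ?? null;
    record.completionTokens = response.usage?.completion_tokens ?? null;
    return response;
  } catch (error) {
    record.status = "error";
    record.errorStatus = error.status ?? null;
    throw error;
  } finally {
    record.latencyMs = Date.now() - start;
    logFn(record);
  }
}
```

A call site might look like `withMetrics(params, (p) => client.chat.completions.create(p))`, with `logFn` pointed at your metrics pipeline.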
9. Implement Graceful Degradation
Design your application to keep functioning when the AI API is unavailable. Serve cached responses, fall back to simpler models, or disable AI features gracefully instead of showing error pages to users.
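One way to structure this is an ordered list of options that are tried in turn, ending in a static default. The sketch below is an assumption about how such a chain might look; `withFallback` is a hypothetical helper, and each entry in `attempts` is any function returning a promise.

```javascript
// Hypothetical helper: tries each option in order and returns the first
// success, or a static fallback value if every option fails.
async function withFallback(attempts, fallbackValue) {
  for (let i = 0; i < attempts.length; i++) {
    try {
      // `source` tells the caller which option produced the value.
      return { value: await attempts[i](), source: i };
    } catch (error) {
      console.warn(`Option ${i} failed: ${error.message}`);
    }
  }
  return { value: fallbackValue, source: "fallback" };
}
```

The list might contain the primary model call, then a cheaper model, then a cache lookup. The UI can use `source` to decide whether to show a "limited functionality" notice.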
10. Centralize Your AI API Access
Whether you build an internal abstraction layer or use a managed relay service like claude4u.com, centralizing AI API access gives you a single place to manage authentication, implement caching, track costs, and switch between providers. This is especially critical for teams where multiple developers and services are making AI API calls.
Following these best practices will help you build AI integrations that are reliable, cost-effective, and maintainable. The difference between a demo and a production system is not the AI model — it is the engineering around it.