AI API costs can spiral quickly. A few careless patterns in your code, and you're looking at hundreds of dollars in unexpected charges. This guide covers practical strategies to reduce costs without degrading your results.
Understanding AI API Pricing
Most AI APIs charge per token. A token is roughly 4 characters or 0.75 words in English. You pay separately for input tokens (your prompt) and output tokens (the response).
Current approximate pricing for popular models (as of early 2025):
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|-------|----------------------|------------------------|
| Claude 3.5 Haiku | $0.80 | $4.00 |
| Claude 3.5 Sonnet | $3.00 | $15.00 |
| Claude 3 Opus | $15.00 | $75.00 |
| GPT-4o | $2.50 | $10.00 |
| GPT-4o mini | $0.15 | $0.60 |
Key insight: Output tokens cost 4-5x as much as input tokens. Controlling output length has the biggest impact on costs.
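For example, a Claude 3.5 Sonnet call with 2,000 input tokens and 500 output tokens costs (2,000 / 1M × $3.00) + (500 / 1M × $15.00) = $0.006 + $0.0075, about 1.4 cents. At 10,000 such calls per month, that's roughly $135.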
Strategy 1: Use the Right Model for the Task
The most impactful change is matching model capability to task complexity.
Task-to-model mapping
| Task | Best Model | Why |
|------|-----------|-----|
| Classification (spam, sentiment) | Haiku / GPT-4o-mini | Fast, accurate for simple decisions |
| Summarization | Haiku / GPT-4o-mini | Doesn't need deep reasoning |
| Code generation | Sonnet / GPT-4o | Balance of speed and quality |
| Complex analysis | Sonnet | Good enough for most cases |
| Research, multi-step reasoning | Opus / GPT-4 | Only when cheaper models fail |
Implementation pattern
function selectModel(taskType: string): string {
  // Map each task type to the cheapest model that handles it well
  const modelMap: Record<string, string> = {
    "classify": "claude-3-5-haiku-20241022",
    "summarize": "claude-3-5-haiku-20241022",
    "generate_code": "claude-sonnet-4-20250514",
    "analyze": "claude-sonnet-4-20250514",
    "research": "claude-3-opus-20240229",
  };
  // Default to Sonnet for unknown task types
  return modelMap[taskType] || "claude-sonnet-4-20250514";
}
Savings: Using Haiku instead of Sonnet for simple tasks saves ~75% on those requests.
Strategy 2: Control Output Length
Output tokens are expensive. Most applications don't need 4000-token responses.
Set appropriate max_tokens
// Bad: max_tokens set far higher than the task needs
const response = await client.messages.create({
  model: "claude-sonnet-4-20250514",
  max_tokens: 8192, // the model is free to use all of it
  messages: [{ role: "user", content: prompt }],
});
// Good: Limit based on expected response
const response = await client.messages.create({
model: "claude-sonnet-4-20250514",
messages: [{ role: "user", content: prompt }],
max_tokens: 500, // Enough for most tasks
});
Ask for concise responses in your prompt
Add explicit instructions:
Respond concisely in 2-3 sentences.
or
Provide a brief answer (under 100 words).
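Combining both levers looks like this. The instruction wording and the 200-token cap are just examples, and question is a placeholder for the user's query:
const response = await client.messages.create({
  model: "claude-sonnet-4-20250514",
  max_tokens: 200, // hard backstop in case the instruction is ignored
  messages: [{
    role: "user",
    content: `${question}\n\nRespond concisely in 2-3 sentences.`,
  }],
});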
Savings: Reducing average output from 1000 to 300 tokens saves 70% on output costs.
Strategy 3: Trim Your Prompts
Long system prompts and unnecessary context inflate input costs.
Before: Bloated prompt
const systemPrompt = `
You are a helpful AI assistant. You should always be polite and professional.
Your goal is to help users with their questions. You have access to various
tools and capabilities. Please think carefully before responding. Make sure
your responses are accurate and helpful. If you're not sure about something,
say so. Always prioritize user safety and privacy...
[500 more words of instructions]
`;
After: Focused prompt
const systemPrompt = `
You are a customer support assistant. Be concise and helpful.
If you don't know something, say so.
`;
Only include relevant context
Don't dump entire documents into every request. Extract relevant sections first.
// Bad: Send entire document
const prompt = `Analyze this document: ${entireDocument}`;
// Good: Send relevant excerpt
const relevantSection = extractRelevantSection(entireDocument, query);
const prompt = `Analyze this excerpt: ${relevantSection}`;
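extractRelevantSection isn't a library function; how you implement it depends on your data. A minimal sketch, assuming plain-text documents and simple keyword-overlap scoring (a real system might use embeddings or a search index instead):
// Return the paragraphs that share the most words with the query
function extractRelevantSection(doc: string, query: string, maxChunks = 3): string {
  const queryWords = new Set(query.toLowerCase().split(/\W+/).filter(w => w.length > 3));
  return doc
    .split(/\n\s*\n/)
    .map(p => ({
      text: p,
      score: p.toLowerCase().split(/\W+/).filter(w => queryWords.has(w)).length,
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, maxChunks)
    .map(s => s.text)
    .join("\n\n");
}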
Savings: Cutting prompt length from 2000 to 500 tokens saves 75% on input costs.
Strategy 4: Cache Repeated Prompts
Anthropic's prompt caching cuts the input cost of a repeated prompt prefix by up to 90%.
When you mark a system prompt or shared context with cache_control and send it again, cached reads are billed at roughly 10% of the normal input rate. Cache writes cost about 25% more than regular input tokens, and the cache expires after a few minutes of inactivity, so it pays off when the same prefix is reused frequently.
const response = await client.messages.create({
  model: "claude-sonnet-4-20250514",
  max_tokens: 500,
  system: [
    {
      type: "text",
      text: longSystemPrompt,
      cache_control: { type: "ephemeral" }
    }
  ],
  messages: [{ role: "user", content: userQuery }],
});
Savings: For applications with consistent system prompts, this can reduce input costs by 80-90%.
Strategy 5: Batch Non-Urgent Requests
Anthropic's Batch API offers 50% off for requests that don't need immediate responses.
Good for:
- Nightly report generation
- Bulk content processing
- Background analysis tasks
// Submit batch job
const batch = await client.messages.batches.create({
  requests: items.map(item => ({
    custom_id: item.id,
    params: {
      model: "claude-sonnet-4-20250514",
      max_tokens: 1024,
      messages: [{ role: "user", content: item.prompt }],
    }
  }))
});
// Results available within 24 hours
Savings: 50% on all batched requests.
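The snippet above only submits the job; fetching results is a separate step. A rough sketch using the SDK's Message Batches methods (retrieve to poll status, results to iterate completed entries); treat the polling interval and result handling as placeholders to adapt:
// Poll until the batch has ended, then collect successful results by custom_id
async function collectBatchResults(batchId: string): Promise<Map<string, string>> {
  let batch = await client.messages.batches.retrieve(batchId);
  while (batch.processing_status !== "ended") {
    await new Promise(resolve => setTimeout(resolve, 60_000)); // check once a minute
    batch = await client.messages.batches.retrieve(batchId);
  }
  const output = new Map<string, string>();
  for await (const entry of await client.messages.batches.results(batchId)) {
    if (entry.result.type === "succeeded") {
      const block = entry.result.message.content[0];
      output.set(entry.custom_id, block.type === "text" ? block.text : "");
    }
  }
  return output;
}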
Strategy 6: Implement Smart Caching
Don't call the API for queries you've already answered.
import { createHash } from "crypto";
const cache = new Map<string, string>();
async function cachedQuery(prompt: string): Promise<string> {
const hash = createHash("md5").update(prompt).digest("hex");
if (cache.has(hash)) {
return cache.get(hash)!;
}
const response = await callAPI(prompt);
cache.set(hash, response);
return response;
}
For production, use Redis or a similar persistent cache with TTL.
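A sketch of the same pattern with Redis, using the node-redis client. The one-hour TTL and key prefix are arbitrary choices, and callAPI is the same placeholder as above:
import { createClient } from "redis";
import { createHash } from "crypto";
const redis = createClient({ url: process.env.REDIS_URL });
await redis.connect();
async function cachedQueryRedis(prompt: string): Promise<string> {
  // Hash the prompt so the key stays short regardless of prompt length
  const key = "ai-cache:" + createHash("sha256").update(prompt).digest("hex");
  const cached = await redis.get(key);
  if (cached) return cached;
  const response = await callAPI(prompt);
  await redis.set(key, response, { EX: 3600 }); // expire after one hour
  return response;
}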
Savings: Varies by use case, but common queries hitting cache can save 90%+.
Strategy 7: Use Streaming Wisely
Streaming doesn't change token costs, but it can help you:
- Stop early: If the response is going off-track, cancel the stream
- Reduce perceived latency: Users see results faster, reducing retry attempts
const stream = await client.messages.create({
  model: "claude-sonnet-4-20250514",
  max_tokens: 1024,
  messages: [{ role: "user", content: prompt }],
  stream: true,
});
let response = "";
for await (const chunk of stream) {
  // Only text deltas carry response text; other events are metadata
  if (chunk.type === "content_block_delta" && chunk.delta.type === "text_delta") {
    response += chunk.delta.text;
  }
  // Cancel if the response looks wrong; aborting stops generation and further output charges
  if (response.includes("I cannot") && response.length > 100) {
    stream.controller.abort();
    break;
  }
}
Real-World Example: Email Triage
Here's how these strategies combine for an email triage application:
| Component | Before | After | Savings |
|-----------|--------|-------|---------|
| Model | Sonnet for everything | Haiku for classification, Sonnet for drafts | 60% |
| System prompt | 800 tokens | 200 tokens with caching | 90% |
| Output length | Default (4096) | 300 for classification, 500 for drafts | 70% |
| Repeated queries | None | Redis cache | 40% |
| Total | ~$150/mo | ~$35/mo | 77% |
Monitoring and Alerts
Set up tracking to catch cost spikes early:
- Log every request with model, token counts, and cost
- Set daily/weekly budgets in your provider dashboard
- Alert on anomalies (2x normal daily spend triggers investigation)
function logAPICall(model: string, inputTokens: number, outputTokens: number) {
const cost = calculateCost(model, inputTokens, outputTokens);
console.log(JSON.stringify({
timestamp: new Date().toISOString(),
model,
inputTokens,
outputTokens,
cost
}));
}
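calculateCost is just a lookup against a rate table like the one at the top of this guide. A minimal sketch, with per-million-token rates as of early 2025 (the Sonnet 4 entry assumes the same rate as 3.5 Sonnet):
// Per-million-token rates, mirroring the pricing table above (early 2025)
const PRICING: Record<string, { input: number; output: number }> = {
  "claude-3-5-haiku-20241022": { input: 0.80, output: 4.00 },
  "claude-sonnet-4-20250514": { input: 3.00, output: 15.00 },
  "claude-3-opus-20240229": { input: 15.00, output: 75.00 },
};
function calculateCost(model: string, inputTokens: number, outputTokens: number): number {
  const rates = PRICING[model];
  if (!rates) return 0; // unknown model: log zero rather than guess
  return (inputTokens / 1_000_000) * rates.input + (outputTokens / 1_000_000) * rates.output;
}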
Summary
| Strategy | Typical Savings | Effort |
|----------|----------------|--------|
| Right model for task | 50-75% | Low |
| Control output length | 30-70% | Low |
| Trim prompts | 20-50% | Medium |
| Prompt caching | 80-90% on cached | Low |
| Batch API | 50% | Medium |
| Response caching | 40-90% | Medium |
Start with model selection and output limits. These require minimal code changes and have the biggest impact. Add caching as your usage grows.