Problem Context
Your AI feature launched. Users love it. Then the first bill arrives and it's 5x what you budgeted. The CEO asks why you're spending $30K/month on API calls for a feature that serves 10,000 users. You don't have great answers because you never instrumented the costs.
LLM costs are deceptively simple — price per token, tokens per request. But in practice, costs are driven by system prompt length, context window usage, retry rates, conversation length, and model choice. Without visibility into these drivers, optimization is guessing.
- You can't explain why your LLM costs increased 3x last month
- Your system prompt alone consumes 2,000 tokens on every request, and nobody's questioned it
- You don't know the per-user or per-feature cost of your AI capabilities
- You want to offer AI features at scale but the unit economics don't work at current prices
This article gives you the instrumentation, analysis, and optimization patterns to make LLM costs predictable and manageable.
Concept Explanation
Token economics has three layers: visibility (knowing what you spend), attribution (knowing why you spend it), and optimization (spending less for the same quality). Most teams skip the first two and jump to optimization — which means they're optimizing blind.
flowchart TD
A["Instrument Costs\n(per request)"] --> B["Attribute to Features\n(per feature/user)"]
B --> C["Identify Cost Drivers\n(prompts, retries, model)"]
C --> D["Optimize"]
D --> D1["Prompt Compression"]
D --> D2["Semantic Caching"]
D --> D3["Model Routing"]
D --> D4["Budget Guardrails"]
style A fill:#4f46e5,color:#fff,stroke:#4338ca
style B fill:#059669,color:#fff,stroke:#047857
style C fill:#d97706,color:#fff,stroke:#b45309
style D fill:#dc2626,color:#fff,stroke:#b91c1c
Token Pricing Basics
LLM APIs charge per token (roughly 4 characters in English). Input tokens (your prompt) and output tokens (the model's response) are priced differently — output tokens are typically 3-4x more expensive because they require sequential generation. A request with a 2,000-token system prompt, 500-token user input, and 500-token response costs differently than two requests with 500-token prompts each.
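To make that concrete, here's a quick back-of-envelope calculation (in Python for brevity; the rates are the gpt-4o list prices used later in this article, and your provider's current pricing may differ):

```python
# Cost of the request described above:
# 2,000-token system prompt + 500-token user input + 500-token response.
INPUT_PRICE = 2.50 / 1_000_000    # USD per input token (illustrative gpt-4o rate)
OUTPUT_PRICE = 10.00 / 1_000_000  # USD per output token

def request_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

per_request = request_cost(2_000 + 500, 500)
print(per_request)            # about $0.011 per request; input tokens dominate
print(per_request * 100_000)  # about $1,125/day at 100K requests/day
```

Note that the 2,000-token system prompt accounts for most of the input cost, which is why prompt compression (Step 2 below) is usually the first optimization worth making.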
The Hidden Cost Multipliers
- System prompts: Sent on every request. A 2K-token system prompt at 100K requests/day = 200M input tokens/day.
- Conversation history: Each turn resends the system prompt plus every previous turn, so context grows with each turn and cumulative tokens grow quadratically with conversation length. By turn 10, every request carries nine turns of history.
- Retries: A failed request that gets retried is billed for both attempts; the tokens spent on the failed call are not refunded.
- Redundant requests: Same question from different users gets computed from scratch every time without caching.
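The conversation-history multiplier is the easiest to underestimate. A rough model in Python (for brevity; the 2,000-token system prompt and 300 tokens per exchange are illustrative assumptions):

```python
SYSTEM_TOKENS = 2_000  # system prompt, resent on every turn
PER_TURN = 300         # rough tokens per user+assistant exchange (assumed)

def input_tokens_for_turn(n: int) -> int:
    # Turn n sends the system prompt plus all n-1 previous exchanges.
    return SYSTEM_TOKENS + (n - 1) * PER_TURN

def cumulative_input_tokens(turns: int) -> int:
    return sum(input_tokens_for_turn(n) for n in range(1, turns + 1))

print(input_tokens_for_turn(10))    # 4700 tokens of context on turn 10 alone
print(cumulative_input_tokens(10))  # 33500 total input tokens over 10 turns
```

Truncating or summarizing history after a few turns caps this growth at a constant per-request cost.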
Implementation
Step 1: Cost Instrumentation
public class CostTracker
{
private readonly IMetricsRecorder _metrics; // metrics abstraction used below; assumed injected via constructor
private static readonly Dictionary<string, (decimal Input, decimal Output)> PricePerMillionTokens = new()
{
["gpt-4o"] = (2.50m, 10.00m),
["gpt-4o-mini"] = (0.15m, 0.60m),
["claude-sonnet"] = (3.00m, 15.00m),
};
public decimal CalculateCost(string model, int inputTokens, int outputTokens)
{
if (!PricePerMillionTokens.TryGetValue(model, out var prices))
throw new ArgumentException($"No pricing configured for model '{model}'", nameof(model));
return (inputTokens * prices.Input / 1_000_000m)
+ (outputTokens * prices.Output / 1_000_000m);
}
public void Record(string feature, string userId, string model,
int inputTokens, int outputTokens)
{
var cost = CalculateCost(model, inputTokens, outputTokens);
_metrics.RecordHistogram("llm.cost.usd", (double)cost,
new("feature", feature),
new("model", model));
_metrics.RecordHistogram("llm.tokens.input", inputTokens,
new("feature", feature));
_metrics.RecordHistogram("llm.tokens.output", outputTokens,
new("feature", feature));
}
}
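Recording is only half the job; the attribution step (B in the flowchart) means aggregating those records by feature. A minimal sketch in Python, with a hypothetical record shape standing in for whatever your metrics backend exports:

```python
from collections import defaultdict

# Hypothetical export of what CostTracker.Record emits per call.
records = [
    {"feature": "chat", "model": "gpt-4o", "cost_usd": 0.0113},
    {"feature": "chat", "model": "gpt-4o", "cost_usd": 0.0241},
    {"feature": "summarize", "model": "gpt-4o-mini", "cost_usd": 0.0009},
]

def cost_by_feature(records: list[dict]) -> list[tuple[str, float]]:
    totals: dict[str, float] = defaultdict(float)
    for r in records:
        totals[r["feature"]] += r["cost_usd"]
    # Sort highest-cost first: the top entry is your optimization target.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

print(cost_by_feature(records))  # "chat" dominates in this toy data
```

In production this query runs against your metrics store, but the shape is the same: group by feature, sort by spend, and optimize the top entry first.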
Step 2: Prompt Compression
public class PromptOptimizer
{
// Before: 2,100 tokens
const string VerbosePrompt = """
You are a helpful customer support assistant working for Contoso Inc.
You should always be polite and professional in your responses.
When a customer asks about their order status, you should look up
their order using the order_lookup function.
When a customer has a billing question, you should check their
account using the account_lookup function.
You should never make up information...
[... 50 more lines of instructions ...]
""";
// After: 800 tokens — same behavior
const string CompressedPrompt = """
Contoso support assistant. Professional tone.
Rules:
- Order questions → order_lookup function
- Billing questions → account_lookup function
- Never fabricate. Say "I'll check" if unsure.
- Escalate: refund > $100, legal, complaints.
Response: ≤ 3 sentences unless detail requested.
""";
}
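Compression savings compound with volume. Assuming the 2,100 → 800 token reduction above, 100K requests/day, and the gpt-4o input rate from Step 1 (a sketch in Python for brevity):

```python
INPUT_PRICE = 2.50 / 1_000_000  # USD per input token (gpt-4o rate from Step 1)
REQUESTS_PER_DAY = 100_000

def daily_prompt_cost(prompt_tokens: int) -> float:
    # Cost of resending the system prompt on every request, per day.
    return prompt_tokens * INPUT_PRICE * REQUESTS_PER_DAY

savings = daily_prompt_cost(2_100) - daily_prompt_cost(800)
print(savings)  # $325/day, roughly $9,750/month, from one prompt edit
```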
Step 3: Semantic Caching
public class SemanticCache
{
private readonly ISearchClient _vectorStore;
private readonly IMetricsRecorder _metrics; // metrics abstraction (assumed injected), used for hit/miss counters
public async Task<string?> GetCachedResponse(string query)
{
var results = await _vectorStore.SearchAsync<CacheEntry>(
query,
new SearchOptions { Top = 1 });
var topResult = results.Value.GetResults().FirstOrDefault();
if (topResult?.Score > 0.95) // High similarity threshold
{
_metrics.RecordCounter("cache.hit", 1);
return topResult.Document.Response;
}
_metrics.RecordCounter("cache.miss", 1);
return null;
}
public async Task CacheResponse(string query, string response)
{
await _vectorStore.IndexDocumentsAsync(
IndexDocumentsBatch.Upload(new[]
{
new CacheEntry
{
Query = query,
Response = response,
CreatedAt = DateTimeOffset.UtcNow
}
}));
}
}
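Whether the cache pays for itself depends on hit rate versus lookup overhead: every lookup still costs an embedding call plus a vector search, even on a miss. A rough break-even model in Python; all the numbers here are illustrative assumptions:

```python
def net_daily_savings(requests: int, hit_rate: float,
                      llm_cost: float, lookup_cost: float) -> float:
    # Every hit avoids one LLM call; every request (hit or miss) pays the lookup.
    return requests * hit_rate * llm_cost - requests * lookup_cost

# 100K requests/day, 30% hit rate, ~$0.0113 per LLM call, ~$0.0001 per lookup.
print(net_daily_savings(100_000, 0.30, 0.0113, 0.0001))  # roughly $329/day saved
```

If your traffic has few repeated questions, the hit rate drops and the lookup overhead can eat the savings, so measure the hit rate before committing to a cache.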
Step 4: Budget Guardrails
public class BudgetGuard
{
private readonly decimal _dailyBudgetUsd;
private readonly decimal _perUserBudgetUsd;
private readonly CostTracker _costTracker;
private readonly ISpendStore _store;        // persisted daily/per-user spend (interface assumed)
private readonly IAlertService _alerts;     // alerting abstraction (assumed)
private readonly IMetricsRecorder _metrics; // metrics abstraction (assumed)
public async Task<bool> CanProceed(string userId, string model,
int estimatedInputTokens, int estimatedOutputTokens)
{
var estimatedCost = _costTracker.CalculateCost(
model, estimatedInputTokens, estimatedOutputTokens);
var dailySpend = await _store.GetDailySpendAsync();
if (dailySpend + estimatedCost > _dailyBudgetUsd)
{
_alerts.Trigger("daily-budget-exceeded");
return false;
}
var userSpend = await _store.GetUserDailySpendAsync(userId);
if (userSpend + estimatedCost > _perUserBudgetUsd)
{
_metrics.RecordCounter("budget.user_limited", 1);
return false;
}
return true;
}
}
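The flowchart also lists model routing, which slots in naturally next to the guardrails: before sending a request, pick the cheapest model whose quality is adequate for the task type. A minimal sketch in Python; the task-to-model table is an illustrative assumption that should be tuned against your own evals:

```python
# Bounded tasks go to the small model; open-ended generation gets the big one.
ROUTES = {
    "classification": "gpt-4o-mini",
    "extraction": "gpt-4o-mini",
    "summarization": "gpt-4o-mini",
    "open_ended_chat": "gpt-4o",
}

def route(task_type: str) -> str:
    # Default cheap; escalate only for tasks known to need the larger model.
    return ROUTES.get(task_type, "gpt-4o-mini")

print(route("extraction"))       # gpt-4o-mini
print(route("open_ended_chat"))  # gpt-4o
```

Defaulting to the cheap model and escalating on known-hard tasks keeps the routing failure mode inexpensive; the reverse default silently burns budget.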
Pitfalls
1. Optimizing before measuring
Teams switch to smaller models or compress prompts without knowing where the money goes. Instrument first. You might find that 80% of your cost is one feature with unnecessarily long conversation histories — a targeted fix, not a broad optimization.
2. Ignoring output token costs
Output tokens cost 3-4x more than input tokens. A chatty model that generates 1,000-token responses when 200 tokens would suffice can easily be your biggest cost driver. Set max_tokens aggressively and instruct the model to be concise.
3. Caching without invalidation
Semantic caching saves money, but stale cache entries serve outdated information. Set TTLs based on how fast your data changes. Product catalog cache can live for hours; stock price cache should live for seconds.
4. No per-user budget limits
Without per-user caps, one power user (or one bot) can consume your entire daily budget. Set reasonable per-user limits and implement graceful degradation — rate limit AI features, not the entire application.
Practical Takeaways
- Instrument every LLM call. Record model, input tokens, output tokens, and calculated cost. Attribute to feature and user. You can't optimize what you don't measure.
- Compress system prompts first. They're sent on every request and often bloated. A 60% reduction in system prompt length cuts that per-request cost by 60%, usually with no measurable quality loss.
- Cache semantically similar queries. FAQ-type interactions, repeated lookups, and common questions can be served from cache at near-zero cost.
- Route to the cheapest model that meets quality requirements. GPT-4o-mini is 16x cheaper than GPT-4o. For classification, extraction, and simple tasks, the quality difference is negligible.
- Set budget guardrails from day one. Daily caps, per-user caps, and per-feature caps prevent surprise bills. Alert before you hit them, not after.
