Problem Context
Your AI feature launched. Users love it. Then the first bill arrives and it's 5x what you budgeted. The CEO asks why you're spending $30K/month on API calls for a feature that serves 10,000 users. You don't have great answers because you never instrumented the costs.
LLM costs are deceptively simple — price per token, tokens per request. But in practice, costs are driven by system prompt length, context window usage, retry rates, conversation length, and model choice. Without visibility into these drivers, optimization is guessing.
- You can't explain why your LLM costs increased 3x last month
- Your system prompt alone consumes 2,000 tokens on every request, and nobody's questioned it
- You don't know the per-user or per-feature cost of your AI capabilities
- You want to offer AI features at scale but the unit economics don't work at current prices
This article gives you the instrumentation, analysis, and optimization patterns to make LLM costs predictable and manageable.
Concept Explanation
Token economics has three layers: visibility (knowing what you spend), attribution (knowing why you spend it), and optimization (spending less for the same quality). Most teams skip the first two and jump to optimization — which means they're optimizing blind.
flowchart TD
A["Instrument Costs\n(per request)"] --> B["Attribute to Features\n(per feature/user)"]
B --> C["Identify Cost Drivers\n(prompts, retries, model)"]
C --> D["Optimize"]
D --> D1["Prompt Compression"]
D --> D2["Semantic Caching"]
D --> D3["Model Routing"]
D --> D4["Budget Guardrails"]
style A fill:#4f46e5,color:#fff,stroke:#4338ca
style B fill:#059669,color:#fff,stroke:#047857
style C fill:#d97706,color:#fff,stroke:#b45309
style D fill:#dc2626,color:#fff,stroke:#b91c1c
Token Pricing Basics
LLM APIs charge per token (roughly 4 characters in English). Input tokens (your prompt) and output tokens (the model's response) are priced differently — output tokens are typically 3-4x more expensive because they require sequential generation. A request with a 2,000-token system prompt, 500-token user input, and 500-token response costs differently than two requests with 500-token prompts each.
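To make that concrete, here's a quick back-of-envelope calculation (in Python for brevity; the rates are the gpt-4o list prices used later in this article, and your provider's current pricing may differ):

```python
# Cost of the request described above:
# 2,000-token system prompt + 500-token user input + 500-token response.
INPUT_PRICE = 2.50 / 1_000_000    # USD per input token (illustrative gpt-4o rate)
OUTPUT_PRICE = 10.00 / 1_000_000  # USD per output token

def request_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

per_request = request_cost(2_000 + 500, 500)
print(per_request)            # about $0.011 per request; input tokens dominate
print(per_request * 100_000)  # about $1,125/day at 100K requests/day
```

Note that the 2,000-token system prompt accounts for most of the input cost, which is why prompt compression (Step 2 below) is usually the first optimization worth making.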
The Hidden Cost Multipliers
- System prompts: Sent on every request. A 2K-token system prompt at 100K requests/day = 200M input tokens/day.
- Conversation history: Each turn resends the system prompt plus every previous turn, so context grows with each turn and cumulative tokens grow quadratically with conversation length. By turn 10, every request carries nine turns of history.
- Retries: A failed request that gets retried is billed for both attempts; the tokens spent on the failed call are not refunded.
- Redundant requests: Same question from different users gets computed from scratch every time without caching.
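The conversation-history multiplier is the easiest to underestimate. A rough model in Python (for brevity; the 2,000-token system prompt and 300 tokens per exchange are illustrative assumptions):

```python
SYSTEM_TOKENS = 2_000  # system prompt, resent on every turn
PER_TURN = 300         # rough tokens per user+assistant exchange (assumed)

def input_tokens_for_turn(n: int) -> int:
    # Turn n sends the system prompt plus all n-1 previous exchanges.
    return SYSTEM_TOKENS + (n - 1) * PER_TURN

def cumulative_input_tokens(turns: int) -> int:
    return sum(input_tokens_for_turn(n) for n in range(1, turns + 1))

print(input_tokens_for_turn(10))    # 4700 tokens of context on turn 10 alone
print(cumulative_input_tokens(10))  # 33500 total input tokens over 10 turns
```

Truncating or summarizing history after a few turns caps this growth at a constant per-request cost.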
Implementation
Step 1: Cost Instrumentation
public class CostTracker
{
private readonly IMetricsRecorder _metrics; // metrics abstraction used below; assumed injected via constructor
private static readonly Dictionary<string, (decimal Input, decimal Output)> PricePerMillionTokens = new()
{
["gpt-4o"] = (2.50m, 10.00m),
["gpt-4o-mini"] = (0.15m, 0.60m),
["claude-sonnet"] = (3.00m, 15.00m),
};
public decimal CalculateCost(string model, int inputTokens, int outputTokens)
{
if (!PricePerMillionTokens.TryGetValue(model, out var prices))
throw new ArgumentException($"No pricing configured for model '{model}'", nameof(model));
return (inputTokens * prices.Input / 1_000_000m)
+ (outputTokens * prices.Output / 1_000_000m);
}
public void Record(string feature, string userId, string model,
int inputTokens, int outputTokens)
{
var cost = CalculateCost(model, inputTokens, outputTokens);
_metrics.RecordHistogram("llm.cost.usd", (double)cost,
new("feature", feature),
new("model", model));
_metrics.RecordHistogram("llm.tokens.input", inputTokens,
new("feature", feature));
_metrics.RecordHistogram("llm.tokens.output", outputTokens,
new("feature", feature));
}
}
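Recording is only half the job; the attribution step (B in the flowchart) means aggregating those records by feature. A minimal sketch in Python, with a hypothetical record shape standing in for whatever your metrics backend exports:

```python
from collections import defaultdict

# Hypothetical export of what CostTracker.Record emits per call.
records = [
    {"feature": "chat", "model": "gpt-4o", "cost_usd": 0.0113},
    {"feature": "chat", "model": "gpt-4o", "cost_usd": 0.0241},
    {"feature": "summarize", "model": "gpt-4o-mini", "cost_usd": 0.0009},
]

def cost_by_feature(records: list[dict]) -> list[tuple[str, float]]:
    totals: dict[str, float] = defaultdict(float)
    for r in records:
        totals[r["feature"]] += r["cost_usd"]
    # Sort highest-cost first: the top entry is your optimization target.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

print(cost_by_feature(records))  # "chat" dominates in this toy data
```

In production this query runs against your metrics store, but the shape is the same: group by feature, sort by spend, and optimize the top entry first.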
Step 2: Prompt Compression
public class PromptOptimizer
{
// Before: 2,100 tokens
const string VerbosePrompt = """
You are a helpful customer support assistant working for Contoso Inc.
You should always be polite and professional in your responses.
When a customer asks about their order status, you should look up
their order using the order_lookup function.
When a customer has a billing question, you should check their
account using the account_lookup function.
You should never make up information...
[... 50 more lines of instructions ...]
""";
// After: 800 tokens — same behavior
const string CompressedPrompt = """
Contoso support assistant. Professional tone.
Rules:
- Order questions → order_lookup function
- Billing questions → account_lookup function
- Never fabricate. Say "I'll check" if unsure.
- Escalate: refund > $100, legal, complaints.
Response: ≤ 3 sentences unless detail requested.
""";
}
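Compression savings compound with volume. Assuming the 2,100 → 800 token reduction above, 100K requests/day, and the gpt-4o input rate from Step 1 (a sketch in Python for brevity):

```python
INPUT_PRICE = 2.50 / 1_000_000  # USD per input token (gpt-4o rate from Step 1)
REQUESTS_PER_DAY = 100_000

def daily_prompt_cost(prompt_tokens: int) -> float:
    # Cost of resending the system prompt on every request, per day.
    return prompt_tokens * INPUT_PRICE * REQUESTS_PER_DAY

savings = daily_prompt_cost(2_100) - daily_prompt_cost(800)
print(savings)  # $325/day, roughly $9,750/month, from one prompt edit
```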
Step 3: Semantic Caching
public class SemanticCache
{
private readonly ISearchClient _vectorStore;
private readonly IMetricsRecorder _metrics; // metrics abstraction (assumed injected), used for hit/miss counters
public async Task<string?> GetCachedResponse(string query)
{
var results = await _vectorStore.SearchAsync<CacheEntry>(
query,
new SearchOptions { Top = 1 });
var topResult = results.Value.GetResults().FirstOrDefault();
if (topResult?.Score > 0.95) // High similarity threshold
{
_metrics.RecordCounter("cache.hit", 1);
return topResult.Document.Response;
}
_metrics.RecordCounter("cache.miss", 1);
return null;
}
public async Task CacheResponse(string query, string response)
{
await _vectorStore.IndexDocumentsAsync(
IndexDocumentsBatch.Upload(new[]
{
new CacheEntry
{
Query = query,
Response = response,
CreatedAt = DateTimeOffset.UtcNow
}
}));
}
}
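Whether the cache pays for itself depends on hit rate versus lookup overhead: every lookup still costs an embedding call plus a vector search, even on a miss. A rough break-even model in Python; all the numbers here are illustrative assumptions:

```python
def net_daily_savings(requests: int, hit_rate: float,
                      llm_cost: float, lookup_cost: float) -> float:
    # Every hit avoids one LLM call; every request (hit or miss) pays the lookup.
    return requests * hit_rate * llm_cost - requests * lookup_cost

# 100K requests/day, 30% hit rate, ~$0.0113 per LLM call, ~$0.0001 per lookup.
print(net_daily_savings(100_000, 0.30, 0.0113, 0.0001))  # roughly $329/day saved
```

If your traffic has few repeated questions, the hit rate drops and the lookup overhead can eat the savings, so measure the hit rate before committing to a cache.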
Step 4: Budget Guardrails
public class BudgetGuard
{
private readonly decimal _dailyBudgetUsd;
private readonly decimal _perUserBudgetUsd;
private readonly CostTracker _costTracker;
private readonly ISpendStore _store;        // persisted daily/per-user spend (interface assumed)
private readonly IAlertService _alerts;     // alerting abstraction (assumed)
private readonly IMetricsRecorder _metrics; // metrics abstraction (assumed)
public async Task<bool> CanProceed(string userId, string model,
int estimatedInputTokens, int estimatedOutputTokens)
{
var estimatedCost = _costTracker.CalculateCost(
model, estimatedInputTokens, estimatedOutputTokens);
var dailySpend = await _store.GetDailySpendAsync();
if (dailySpend + estimatedCost > _dailyBudgetUsd)
{
_alerts.Trigger("daily-budget-exceeded");
return false;
}
var userSpend = await _store.GetUserDailySpendAsync(userId);
if (userSpend + estimatedCost > _perUserBudgetUsd)
{
_metrics.RecordCounter("budget.user_limited", 1);
return false;
}
return true;
}
}
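The flowchart also lists model routing, which slots in naturally next to the guardrails: before sending a request, pick the cheapest model whose quality is adequate for the task type. A minimal sketch in Python; the task-to-model table is an illustrative assumption that should be tuned against your own evals:

```python
# Bounded tasks go to the small model; open-ended generation gets the big one.
ROUTES = {
    "classification": "gpt-4o-mini",
    "extraction": "gpt-4o-mini",
    "summarization": "gpt-4o-mini",
    "open_ended_chat": "gpt-4o",
}

def route(task_type: str) -> str:
    # Default cheap; escalate only for tasks known to need the larger model.
    return ROUTES.get(task_type, "gpt-4o-mini")

print(route("extraction"))       # gpt-4o-mini
print(route("open_ended_chat"))  # gpt-4o
```

Defaulting to the cheap model and escalating on known-hard tasks keeps the routing failure mode inexpensive; the reverse default silently burns budget.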
Pitfalls
1. Optimizing before measuring
Teams switch to smaller models or compress prompts without knowing where the money goes. Instrument first. You might find that 80% of your cost is one feature with unnecessarily long conversation histories — a targeted fix, not a broad optimization.
2. Ignoring output token costs
Output tokens cost 3-4x more than input tokens. A chatty model that generates 1,000-token responses when 200 tokens would suffice can easily be your biggest cost driver. Set max_tokens aggressively and instruct the model to be concise.
3. Caching without invalidation
Semantic caching saves money, but stale cache entries serve outdated information. Set TTLs based on how fast your data changes. Product catalog cache can live for hours; stock price cache should live for seconds.
4. No per-user budget limits
Without per-user caps, one power user (or one bot) can consume your entire daily budget. Set reasonable per-user limits and implement graceful degradation — rate limit AI features, not the entire application.
Practical Takeaways
- Instrument every LLM call. Record model, input tokens, output tokens, and calculated cost. Attribute to feature and user. You can't optimize what you don't measure.
- Compress system prompts first. They're sent on every request and often bloated. A 60% reduction in system prompt length cuts that per-request cost by 60%, usually with no measurable quality loss.
- Cache semantically similar queries. FAQ-type interactions, repeated lookups, and common questions can be served from cache at near-zero cost.
- Route to the cheapest model that meets quality requirements. GPT-4o-mini is 16x cheaper than GPT-4o. For classification, extraction, and simple tasks, the quality difference is negligible.
- Set budget guardrails from day one. Daily caps, per-user caps, and per-feature caps prevent surprise bills. Alert before you hit them, not after.
