Problem Context
You've integrated an LLM into your application. Now you need to test it. The problem: LLM calls are non-deterministic, slow, and expensive. Running your test suite against a live model burns tokens, takes minutes, and produces different results every run. Your CI pipeline becomes flaky and expensive, and nobody trusts it.
Traditional testing strategies don't map cleanly to LLM features. You can't write Assert.Equal("expected output", llmResponse) because the response changes every time. But skipping tests isn't an option: LLM-powered features have some of the highest bug surface area in your codebase.
- Your CI costs tripled after adding LLM integration tests that call Azure OpenAI on every push
- Tests pass on Monday and fail on Friday with the same code, because the model responded differently
- You're mocking the entire LLM layer and have zero confidence your prompts actually work
- You need to test prompt changes but have no way to measure if the new version is better or worse
This article gives you a layered testing strategy that's reliable, fast, and won't bankrupt your team.
Concept Explanation
LLM testing requires a layered approach. Each layer catches different types of defects at different costs:
```mermaid
flowchart TD
    A["Unit Tests - Mocked LLM"] --> B["Snapshot Tests - Recorded Responses"]
    B --> C["Evaluation Tests - Live LLM + Metrics"]
    C --> D["Production Monitoring - Real Traffic"]

    A1["Fast, Free, Deterministic"] -.-> A
    B1["Fast, Cheap, Semi-Deterministic"] -.-> B
    C1["Slow, Costly, Realistic"] -.-> C
    D1["Continuous, Real-World Validation"] -.-> D

    style A fill:#059669,color:#fff,stroke:#047857
    style B fill:#4f46e5,color:#fff,stroke:#4338ca
    style C fill:#7c3aed,color:#fff,stroke:#6d28d9
    style D fill:#d97706,color:#fff,stroke:#b45309
```
Layer 1: Unit Tests with Mocked LLM
Mock the LLM client entirely. Test your application logic (prompt construction, response parsing, error handling, retry behavior) without calling a model. These run in milliseconds, cost nothing, and should cover the large majority of your test cases.
Layer 2: Snapshot Tests with Recorded Responses
Record real LLM responses once, replay them in tests. This verifies your parsing and downstream logic against realistic outputs without calling the model repeatedly. Re-record snapshots when prompts change.
Layer 3: Evaluation Tests with Live LLM
Run a curated test set against the live model and measure quality metrics: accuracy, format compliance, relevance score. These run on schedule (nightly or pre-release), not on every commit.
Implementation
Unit Tests: Mocking the LLM Client
```csharp
// Interface for testability
public interface IChatService
{
    Task<string> GetCompletionAsync(string systemPrompt, string userMessage);
}

// Unit test with mocked LLM
[Fact]
public async Task ClassifyProduct_ReturnsCategory_WhenValidJson()
{
    var mockChat = new Mock<IChatService>();
    mockChat.Setup(c => c.GetCompletionAsync(It.IsAny<string>(), It.IsAny<string>()))
        .ReturnsAsync("""{"category": "electronics", "confidence": 0.95}""");

    var classifier = new ProductClassifier(mockChat.Object);
    var result = await classifier.ClassifyAsync("iPhone 15 Pro Max");

    Assert.Equal("electronics", result.Category);
    Assert.True(result.Confidence > 0.9);
}

[Fact]
public async Task ClassifyProduct_HandlesInvalidJson_Gracefully()
{
    var mockChat = new Mock<IChatService>();
    mockChat.Setup(c => c.GetCompletionAsync(It.IsAny<string>(), It.IsAny<string>()))
        .ReturnsAsync("This is not JSON at all");

    var classifier = new ProductClassifier(mockChat.Object);
    var result = await classifier.ClassifyAsync("some product");

    Assert.Equal("unknown", result.Category);
    Assert.Equal(0.0, result.Confidence);
}
```
Key insight: most LLM bugs are in your code, not in the model. Prompt construction, JSON parsing, null handling, timeout logic: all of these are testable without a live model.
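The same mock-based approach covers failure paths. As a sketch, assuming a hypothetical RetryingChatService decorator around IChatService with a maxRetries parameter (not part of the code above), Moq's SetupSequence can simulate transient faults and verify the retry behavior:

```csharp
[Fact]
public async Task GetCompletion_RetriesOnTransientFailure()
{
    // First two calls fail with transient errors, third succeeds.
    var mockChat = new Mock<IChatService>();
    mockChat.SetupSequence(c => c.GetCompletionAsync(It.IsAny<string>(), It.IsAny<string>()))
        .ThrowsAsync(new HttpRequestException("429 Too Many Requests"))
        .ThrowsAsync(new HttpRequestException("503 Service Unavailable"))
        .ReturnsAsync("""{"category": "electronics", "confidence": 0.9}""");

    // RetryingChatService is a hypothetical wrapper that retries transient failures.
    var retrying = new RetryingChatService(mockChat.Object, maxRetries: 3);
    var response = await retrying.GetCompletionAsync("system prompt", "user message");

    Assert.Contains("electronics", response);
    // Two failures plus one success = exactly three underlying calls.
    mockChat.Verify(c => c.GetCompletionAsync(It.IsAny<string>(), It.IsAny<string>()),
        Times.Exactly(3));
}
```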
Snapshot Tests: Record and Replay
```csharp
// Record a snapshot (run once, commit the file)
[Fact]
public async Task RecordSnapshot_Summarization()
{
    var client = CreateRealChatClient(); // live Azure OpenAI
    var response = await client.GetCompletionAsync(
        "Summarize this document in 3 bullet points.",
        documentText);

    File.WriteAllText("snapshots/summarize_doc1.json",
        JsonSerializer.Serialize(new { prompt = documentText, response }));
}

// Replay snapshot in CI
[Fact]
public async Task Summarization_ParsesCorrectly_FromSnapshot()
{
    var snapshot = JsonSerializer.Deserialize<Snapshot>(
        File.ReadAllText("snapshots/summarize_doc1.json"));

    var mockChat = new Mock<IChatService>();
    mockChat.Setup(c => c.GetCompletionAsync(It.IsAny<string>(), It.IsAny<string>()))
        .ReturnsAsync(snapshot.Response);

    var summarizer = new DocumentSummarizer(mockChat.Object);
    var result = await summarizer.SummarizeAsync(snapshot.Prompt);

    Assert.Equal(3, result.BulletPoints.Count);
    Assert.All(result.BulletPoints, bp => Assert.False(string.IsNullOrEmpty(bp)));
}
```
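The replay test deserializes a Snapshot type that isn't shown above. A minimal definition consistent with what the recording test writes (lowercase JSON keys, mapped explicitly here with JsonPropertyName) could be:

```csharp
using System.Text.Json.Serialization;

// Shape of the recorded file: { "prompt": "...", "response": "..." }
public sealed record Snapshot(
    [property: JsonPropertyName("prompt")] string Prompt,
    [property: JsonPropertyName("response")] string Response);
```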
Evaluation Tests: Measuring Quality
```csharp
// Evaluation harness: runs nightly, not per-commit
public class LlmEvaluationTests
{
    private readonly record struct EvalCase(
        string Input, string ExpectedCategory, string[] MustContain);

    private static readonly EvalCase[] TestCases =
    [
        new("iPhone 15 Pro", "electronics", ["phone", "Apple"]),
        new("Nike Air Max 90", "clothing", ["shoe", "sneaker"]),
        new("Organic almonds 1lb", "food", ["nut", "snack"]),
    ];

    [Fact]
    public async Task Classification_MeetsAccuracyThreshold()
    {
        var client = CreateRealChatClient();
        var classifier = new ProductClassifier(client);
        int correct = 0;

        foreach (var tc in TestCases)
        {
            var result = await classifier.ClassifyAsync(tc.Input);
            if (result.Category == tc.ExpectedCategory) correct++;
        }

        double accuracy = (double)correct / TestCases.Length;
        Assert.True(accuracy >= 0.85,
            $"Accuracy {accuracy:P0} below 85% threshold");
    }
}
```
Cost-Aware CI Configuration
```yaml
# GitHub Actions: separate LLM eval from regular CI
name: CI Pipeline

on: [push, pull_request]

jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: dotnet test --filter "Category!=LlmEval"

  llm-evaluation:
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main' || contains(github.event.pull_request.labels.*.name, 'eval-needed')
    steps:
      - uses: actions/checkout@v4
      - run: dotnet test --filter "Category=LlmEval"
        env:
          AZURE_OPENAI_ENDPOINT: ${{ secrets.AZURE_OPENAI_ENDPOINT }}
```
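The --filter expressions above rely on the evaluation tests being tagged with a trait. In xUnit that tagging looks like this:

```csharp
[Fact]
[Trait("Category", "LlmEval")] // matched by: dotnet test --filter "Category=LlmEval"
public async Task Classification_MeetsAccuracyThreshold()
{
    // ... live-model evaluation body as shown earlier ...
}
```

Untagged tests have no Category trait, so `--filter "Category!=LlmEval"` runs everything else.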
Pitfalls
1. Testing the model, not your code
Asserting that the LLM returns a specific sentence is testing OpenAI's model, not your application. Test your behavior: Does the parser handle the response? Does the retry logic work? Does the prompt include the right context?
2. Running live LLM tests on every commit
A test suite making 50 Azure OpenAI calls takes 2+ minutes and costs ~$0.50 per run. At 20 commits/day, that's $10/day per developer. Use mocks for CI and run live evaluation tests on schedule or label-triggered.
3. No baseline for evaluation
If you change a prompt, how do you know if it's better or worse? Without a baseline eval score before the change, you're guessing. Always run evals before and after prompt changes, and track scores over time.
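One lightweight way to keep a baseline (a sketch; the file name, flat score list, and 2-point tolerance are arbitrary choices, not an established convention) is to commit a history of eval scores and fail the run on a regression:

```csharp
// Sketch: compare tonight's accuracy against the last recorded baseline,
// then append the new score so the history tracks drift over time.
var history = File.Exists("eval-history.json")
    ? JsonSerializer.Deserialize<List<double>>(File.ReadAllText("eval-history.json"))!
    : new List<double>();

double baseline = history.Count > 0 ? history[^1] : 0.0;
Assert.True(accuracy >= baseline - 0.02,
    $"Accuracy {accuracy:P0} regressed more than 2 points from baseline {baseline:P0}");

history.Add(accuracy);
File.WriteAllText("eval-history.json", JsonSerializer.Serialize(history));
```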
4. Snapshot rot
Snapshots recorded 6 months ago may not reflect current model behavior. When tests pass against stale snapshots but fail in production, the snapshots are lying. Re-record snapshots quarterly or after model version changes.
5. Ignoring edge cases in mock responses
Mocked responses are always well-formatted. Real models return markdown in JSON, extra whitespace, partial JSON, or empty responses under load. Include adversarial mock responses in your test suite to catch parsing failures.
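A cheap way to get that coverage is to parameterize the invalid-JSON test from the Implementation section over the failure shapes models actually produce. A sketch using xUnit's [Theory]:

```csharp
[Theory]
[InlineData("")]                                               // empty response
[InlineData("```json\n{\"category\": \"food\"}\n```")]         // JSON wrapped in markdown fences
[InlineData("{\"category\": \"food\"")]                        // truncated JSON
[InlineData("Sure! Here is the classification: electronics.")] // prose instead of JSON
public async Task ClassifyProduct_SurvivesMalformedResponses(string response)
{
    var mockChat = new Mock<IChatService>();
    mockChat.Setup(c => c.GetCompletionAsync(It.IsAny<string>(), It.IsAny<string>()))
        .ReturnsAsync(response);

    var classifier = new ProductClassifier(mockChat.Object);
    var result = await classifier.ClassifyAsync("some product");

    // Contract from the earlier test: unparseable output degrades to "unknown".
    // If your parser strips markdown fences, assert "food" for that case instead.
    Assert.Equal("unknown", result.Category);
}
```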
Practical Takeaways
- Mock the LLM for 90% of tests. Test your code's behavior, not the model's intelligence. Unit tests should never call a live model.
- Use snapshot tests for integration coverage. Record real responses once, replay in CI for deterministic tests against realistic data.
- Run evaluation tests on schedule, not per-commit. Nightly or pre-release evals catch quality regressions without burning your budget.
- Track eval metrics over time. Accuracy, format compliance, latency: treat them like performance benchmarks, with alerting on regressions.
- Separate LLM test costs in CI. Use labels or branch rules to gate expensive evaluation runs. Your token budget is a test resource like compute.
