Problem Context
You've integrated an LLM into your application. Now you need to test it. The problem: LLM calls are non-deterministic, slow, and expensive. Running your test suite against a live model burns tokens, takes minutes, and produces different results every run. Your CI pipeline becomes flaky and expensive, and nobody trusts it.
Traditional testing strategies don't map cleanly to LLM features. You can't write Assert.Equal("expected output", llmResponse) because the response changes every time. But skipping tests isn't an option: LLM-powered features have some of the highest bug surface area in your codebase.
- Your CI costs tripled after adding LLM integration tests that call Azure OpenAI on every push
- Tests pass on Monday and fail on Friday with the same code, because the model responded differently
- You're mocking the entire LLM layer and have zero confidence your prompts actually work
- You need to test prompt changes but have no way to measure if the new version is better or worse
This article gives you a layered testing strategy that's reliable, fast, and won't bankrupt your team.
Concept Explanation
LLM testing requires a layered approach. Each layer catches different types of defects at different costs:
```mermaid
flowchart TD
    A["Unit Tests - Mocked LLM"] --> B["Snapshot Tests - Recorded Responses"]
    B --> C["Evaluation Tests - Live LLM + Metrics"]
    C --> D["Production Monitoring - Real Traffic"]

    A1["Fast, Free, Deterministic"] -.-> A
    B1["Fast, Cheap, Semi-Deterministic"] -.-> B
    C1["Slow, Costly, Realistic"] -.-> C
    D1["Continuous, Real-World Validation"] -.-> D

    style A fill:#059669,color:#fff,stroke:#047857
    style B fill:#4f46e5,color:#fff,stroke:#4338ca
    style C fill:#7c3aed,color:#fff,stroke:#6d28d9
    style D fill:#d97706,color:#fff,stroke:#b45309
```
Layer 1: Unit Tests with Mocked LLM
Mock the LLM client entirely. Test your application logic (prompt construction, response parsing, error handling, retry behavior) without calling a model. These run in milliseconds, cost nothing, and should cover the large majority of your test cases.
Layer 2: Snapshot Tests with Recorded Responses
Record real LLM responses once, replay them in tests. This verifies your parsing and downstream logic against realistic outputs without calling the model repeatedly. Re-record snapshots when prompts change.
Layer 3: Evaluation Tests with Live LLM
Run a curated test set against the live model and measure quality metrics: accuracy, format compliance, relevance score. These run on schedule (nightly or pre-release), not on every commit.
Implementation
Unit Tests: Mocking the LLM Client
```csharp
// Interface for testability
public interface IChatService
{
    Task<string> GetCompletionAsync(string systemPrompt, string userMessage);
}

// Unit test with mocked LLM
[Fact]
public async Task ClassifyProduct_ReturnsCategory_WhenValidJson()
{
    var mockChat = new Mock<IChatService>();
    mockChat.Setup(c => c.GetCompletionAsync(It.IsAny<string>(), It.IsAny<string>()))
        .ReturnsAsync("""{"category": "electronics", "confidence": 0.95}""");

    var classifier = new ProductClassifier(mockChat.Object);
    var result = await classifier.ClassifyAsync("iPhone 15 Pro Max");

    Assert.Equal("electronics", result.Category);
    Assert.True(result.Confidence > 0.9);
}

[Fact]
public async Task ClassifyProduct_HandlesInvalidJson_Gracefully()
{
    var mockChat = new Mock<IChatService>();
    mockChat.Setup(c => c.GetCompletionAsync(It.IsAny<string>(), It.IsAny<string>()))
        .ReturnsAsync("This is not JSON at all");

    var classifier = new ProductClassifier(mockChat.Object);
    var result = await classifier.ClassifyAsync("some product");

    Assert.Equal("unknown", result.Category);
    Assert.Equal(0.0, result.Confidence);
}
```
Key insight: most LLM bugs are in your code, not in the model. Prompt construction, JSON parsing, null handling, timeout logic: all of these are testable without a live model.
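The same mock-based approach covers failure paths. As a sketch, assuming a hypothetical RetryingChatService decorator around IChatService with a maxRetries parameter (not part of the code above), Moq's SetupSequence can simulate transient faults and verify the retry behavior:

```csharp
[Fact]
public async Task GetCompletion_RetriesOnTransientFailure()
{
    // First two calls fail with transient errors, third succeeds.
    var mockChat = new Mock<IChatService>();
    mockChat.SetupSequence(c => c.GetCompletionAsync(It.IsAny<string>(), It.IsAny<string>()))
        .ThrowsAsync(new HttpRequestException("429 Too Many Requests"))
        .ThrowsAsync(new HttpRequestException("503 Service Unavailable"))
        .ReturnsAsync("""{"category": "electronics", "confidence": 0.9}""");

    // RetryingChatService is a hypothetical wrapper that retries transient failures.
    var retrying = new RetryingChatService(mockChat.Object, maxRetries: 3);
    var response = await retrying.GetCompletionAsync("system prompt", "user message");

    Assert.Contains("electronics", response);
    // Two failures plus one success = exactly three underlying calls.
    mockChat.Verify(c => c.GetCompletionAsync(It.IsAny<string>(), It.IsAny<string>()),
        Times.Exactly(3));
}
```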
Snapshot Tests: Record and Replay
```csharp
// Record a snapshot (run once, commit the file)
[Fact]
public async Task RecordSnapshot_Summarization()
{
    var client = CreateRealChatClient(); // live Azure OpenAI
    var response = await client.GetCompletionAsync(
        "Summarize this document in 3 bullet points.",
        documentText);

    File.WriteAllText("snapshots/summarize_doc1.json",
        JsonSerializer.Serialize(new { prompt = documentText, response }));
}

// Replay snapshot in CI
[Fact]
public async Task Summarization_ParsesCorrectly_FromSnapshot()
{
    var snapshot = JsonSerializer.Deserialize<Snapshot>(
        File.ReadAllText("snapshots/summarize_doc1.json"));

    var mockChat = new Mock<IChatService>();
    mockChat.Setup(c => c.GetCompletionAsync(It.IsAny<string>(), It.IsAny<string>()))
        .ReturnsAsync(snapshot.Response);

    var summarizer = new DocumentSummarizer(mockChat.Object);
    var result = await summarizer.SummarizeAsync(snapshot.Prompt);

    Assert.Equal(3, result.BulletPoints.Count);
    Assert.All(result.BulletPoints, bp => Assert.False(string.IsNullOrEmpty(bp)));
}
```
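The replay test deserializes a Snapshot type that isn't shown above. A minimal definition consistent with what the recording test writes (lowercase JSON keys, mapped explicitly here with JsonPropertyName) could be:

```csharp
using System.Text.Json.Serialization;

// Shape of the recorded file: { "prompt": "...", "response": "..." }
public sealed record Snapshot(
    [property: JsonPropertyName("prompt")] string Prompt,
    [property: JsonPropertyName("response")] string Response);
```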
Evaluation Tests: Measuring Quality
```csharp
// Evaluation harness: runs nightly, not per-commit
public class LlmEvaluationTests
{
    private readonly record struct EvalCase(
        string Input, string ExpectedCategory, string[] MustContain);

    private static readonly EvalCase[] TestCases =
    [
        new("iPhone 15 Pro", "electronics", ["phone", "Apple"]),
        new("Nike Air Max 90", "clothing", ["shoe", "sneaker"]),
        new("Organic almonds 1lb", "food", ["nut", "snack"]),
    ];

    [Fact]
    public async Task Classification_MeetsAccuracyThreshold()
    {
        var client = CreateRealChatClient();
        var classifier = new ProductClassifier(client);
        int correct = 0;

        foreach (var tc in TestCases)
        {
            var result = await classifier.ClassifyAsync(tc.Input);
            if (result.Category == tc.ExpectedCategory) correct++;
        }

        double accuracy = (double)correct / TestCases.Length;
        Assert.True(accuracy >= 0.85,
            $"Accuracy {accuracy:P0} below 85% threshold");
    }
}
```
Cost-Aware CI Configuration
```yaml
# GitHub Actions: separate LLM eval from regular CI
name: CI Pipeline

on: [push, pull_request]

jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: dotnet test --filter "Category!=LlmEval"

  llm-evaluation:
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main' || contains(github.event.pull_request.labels.*.name, 'eval-needed')
    steps:
      - uses: actions/checkout@v4
      - run: dotnet test --filter "Category=LlmEval"
        env:
          AZURE_OPENAI_ENDPOINT: ${{ secrets.AZURE_OPENAI_ENDPOINT }}
```
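The --filter expressions above rely on the evaluation tests being tagged with a trait. In xUnit that tagging looks like this:

```csharp
[Fact]
[Trait("Category", "LlmEval")] // matched by: dotnet test --filter "Category=LlmEval"
public async Task Classification_MeetsAccuracyThreshold()
{
    // ... live-model evaluation body as shown earlier ...
}
```

Untagged tests have no Category trait, so `--filter "Category!=LlmEval"` runs everything else.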
Pitfalls
1. Testing the model, not your code
Asserting that the LLM returns a specific sentence is testing OpenAI's model, not your application. Test your behavior: Does the parser handle the response? Does the retry logic work? Does the prompt include the right context?
2. Running live LLM tests on every commit
A test suite making 50 Azure OpenAI calls takes 2+ minutes and costs ~$0.50 per run. At 20 commits/day, that's $10/day per developer. Use mocks for CI and run live evaluation tests on schedule or label-triggered.
3. No baseline for evaluation
If you change a prompt, how do you know if it's better or worse? Without a baseline eval score before the change, you're guessing. Always run evals before and after prompt changes, and track scores over time.
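One lightweight way to keep a baseline (a sketch; the file name, flat score list, and 2-point tolerance are arbitrary choices, not an established convention) is to commit a history of eval scores and fail the run on a regression:

```csharp
// Sketch: compare tonight's accuracy against the last recorded baseline,
// then append the new score so the history tracks drift over time.
var history = File.Exists("eval-history.json")
    ? JsonSerializer.Deserialize<List<double>>(File.ReadAllText("eval-history.json"))!
    : new List<double>();

double baseline = history.Count > 0 ? history[^1] : 0.0;
Assert.True(accuracy >= baseline - 0.02,
    $"Accuracy {accuracy:P0} regressed more than 2 points from baseline {baseline:P0}");

history.Add(accuracy);
File.WriteAllText("eval-history.json", JsonSerializer.Serialize(history));
```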
4. Snapshot rot
Snapshots recorded 6 months ago may not reflect current model behavior. When tests pass against stale snapshots but fail in production, the snapshots are lying. Re-record snapshots quarterly or after model version changes.
5. Ignoring edge cases in mock responses
Mocked responses are always well-formatted. Real models return markdown in JSON, extra whitespace, partial JSON, or empty responses under load. Include adversarial mock responses in your test suite to catch parsing failures.
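A cheap way to get that coverage is to parameterize the invalid-JSON test from the Implementation section over the failure shapes models actually produce. A sketch using xUnit's [Theory]:

```csharp
[Theory]
[InlineData("")]                                               // empty response
[InlineData("```json\n{\"category\": \"food\"}\n```")]         // JSON wrapped in markdown fences
[InlineData("{\"category\": \"food\"")]                        // truncated JSON
[InlineData("Sure! Here is the classification: electronics.")] // prose instead of JSON
public async Task ClassifyProduct_SurvivesMalformedResponses(string response)
{
    var mockChat = new Mock<IChatService>();
    mockChat.Setup(c => c.GetCompletionAsync(It.IsAny<string>(), It.IsAny<string>()))
        .ReturnsAsync(response);

    var classifier = new ProductClassifier(mockChat.Object);
    var result = await classifier.ClassifyAsync("some product");

    // Contract from the earlier test: unparseable output degrades to "unknown".
    // If your parser strips markdown fences, assert "food" for that case instead.
    Assert.Equal("unknown", result.Category);
}
```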
Practical Takeaways
- Mock the LLM for 90% of tests. Test your code's behavior, not the model's intelligence. Unit tests should never call a live model.
- Use snapshot tests for integration coverage. Record real responses once, replay in CI for deterministic tests against realistic data.
- Run evaluation tests on schedule, not per-commit. Nightly or pre-release evals catch quality regressions without burning your budget.
- Track eval metrics over time. Accuracy, format compliance, latency: treat them like performance benchmarks, with alerting on regressions.
- Separate LLM test costs in CI. Use labels or branch rules to gate expensive evaluation runs. Your token budget is a test resource like compute.
