Problem Context

"It looks good to me" is not an evaluation strategy. Yet most teams assess LLM quality by eyeballing a few outputs, deciding it feels right, and shipping. Then a subtle regression slips through — the model starts adding a disclaimer to every response, or the tone shifts from professional to casual — and nobody catches it until customers complain.

LLM evaluation is harder than traditional ML evaluation because there's rarely one "correct" answer. A summary can be accurate but poorly written. A code review can be technically correct but miss the point. You need evaluation methods that capture the dimensions of quality that matter for your use case.

🤔 Sound familiar?
  • You updated a prompt and have no idea if it actually improved things or just felt better on the test cases you checked
  • Your team can't agree on what "good" output looks like for your use case
  • You want to catch regressions automatically but don't know what metrics to track
  • Business stakeholders ask "how good is the AI?" and you don't have a number to give them

This article builds a practical evaluation framework — from defining quality dimensions to automating regression detection.

Concept Explanation

Good LLM evaluation operates at three levels: offline evaluation (before deployment), online metrics (during production), and human feedback loops (continuous calibration). Each level catches different problems.


      flowchart TD
          A["Define Quality\nDimensions"] --> B["Build Test Suite\n(golden dataset)"]
          B --> C["Automated Evaluation\n(metrics + LLM-as-judge)"]
          C --> D["Human Evaluation\n(calibration)"]
          D --> E["CI Integration\n(regression gate)"]
          E --> F["Production Monitoring\n(online metrics)"]
          F -->|"Feedback loop"| B
      
          style A fill:#4f46e5,color:#fff,stroke:#4338ca
          style C fill:#059669,color:#fff,stroke:#047857
          style E fill:#7c3aed,color:#fff,stroke:#6d28d9
          style F fill:#dc2626,color:#fff,stroke:#b91c1c
      

Quality Dimensions

Before measuring anything, define what you're measuring. Common dimensions for engineering tasks:

  • Correctness: Is the output factually and technically accurate?
  • Relevance: Does the output address what was asked?
  • Completeness: Does it cover all required aspects?
  • Format compliance: Does it follow the specified structure?
  • Conciseness: Is it appropriately brief without losing substance?
  • Safety: Does it avoid harmful, biased, or inappropriate content?
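When a single headline number is needed (e.g. for stakeholders), combine the dimensions with explicit weights rather than an opaque average. A minimal sketch in Python (the weights and dimension names here are illustrative, not prescribed):

```python
def weighted_quality(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Combine per-dimension scores (0.0-1.0) into one weighted score.

    Dimensions missing from `scores` count as 0, so a silently dropped
    dimension drags the aggregate down instead of inflating it.
    """
    total_weight = sum(weights.values())
    return sum(w * scores.get(dim, 0.0) for dim, w in weights.items()) / total_weight

# Example: correctness matters most for a code-review assistant.
weights = {"correctness": 0.5, "relevance": 0.3, "conciseness": 0.2}
scores = {"correctness": 0.9, "relevance": 0.8, "conciseness": 0.6}
overall = weighted_quality(scores, weights)  # 0.45 + 0.24 + 0.12 = 0.81
```

Keep the per-dimension scores alongside the aggregate; as the Pitfalls section notes, the single number alone can't tell you what to fix.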

LLM-as-Judge

Use a more capable model (e.g., GPT-4o) to evaluate outputs from your production model. Provide rubrics, reference answers, and scoring criteria. It's not perfect — judges have biases too — but it scales far better than human review and correlates well with human judgment on structured rubrics.

Implementation

Step 1: Build a Golden Test Suite

    public class EvalTestCase
    {
        public required string Id { get; init; }
        public required string Category { get; init; }
        public required string Input { get; init; }
        public string? ReferenceOutput { get; init; }

        // Minimum acceptable score per dimension; keys must match the judge's rubric.
        public required Dictionary<string, double> MinScores { get; init; }
    }

    // Example test cases
    var testSuite = new List<EvalTestCase>
    {
        new()
        {
            Id = "review-null-check",
            Category = "code-review",
            Input = "Review: if (user != null && user.Name != null) { ... }",
            ReferenceOutput = "Use pattern matching: `if (user is { Name: not null })`...",
            MinScores = new()
            {
                ["correctness"] = 0.8,
                ["relevance"] = 0.7,
                ["conciseness"] = 0.6
            }
        }
    };
      

Step 2: Automated Scoring with LLM-as-Judge

    public class LlmJudge
    {
        private readonly ChatClient _judgeClient; // a stronger model, e.g. GPT-4o

        public LlmJudge(ChatClient judgeClient) => _judgeClient = judgeClient;

        public async Task<Dictionary<string, double>> ScoreAsync(
            string input, string output, string? reference)
        {
            // $$""" raw string: interpolations use {{...}}, so the literal JSON
            // braces in the prompt need no escaping.
            var prompt = $$"""
                Score the following AI output on these dimensions (0.0-1.0):
                - correctness: Technical accuracy of the response
                - relevance: How well it addresses the input
                - completeness: Coverage of important aspects
                - conciseness: Brevity without losing substance

                INPUT: {{input}}
                AI OUTPUT: {{output}}
                {{(reference != null ? $"REFERENCE: {reference}" : "")}}

                Return JSON: {"correctness": 0.0, "relevance": 0.0,
                "completeness": 0.0, "conciseness": 0.0}
                """;

            var response = await _judgeClient.CompleteChatAsync(
                [new UserChatMessage(prompt)],
                new ChatCompletionOptions
                {
                    ResponseFormat = ChatResponseFormat.CreateJsonObjectFormat(),
                    Temperature = 0.1f
                });

            return JsonSerializer.Deserialize<Dictionary<string, double>>(
                response.Value.Content[0].Text)!;
        }
    }
      
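Even with JSON mode enabled, judge responses occasionally come back with out-of-range values or extra keys, and a single bad parse shouldn't corrupt your metrics. A defensive parsing step, sketched here in Python for brevity (the dimension list mirrors the rubric above; the clamping policy is an illustrative choice, not part of any SDK):

```python
import json

EXPECTED_DIMENSIONS = ("correctness", "relevance", "completeness", "conciseness")

def parse_judge_scores(raw: str) -> dict[str, float]:
    """Parse judge JSON, keep only expected dimensions, clamp values to [0, 1]."""
    data = json.loads(raw)
    scores = {}
    for dim in EXPECTED_DIMENSIONS:
        if dim not in data:
            # A missing dimension means the judge ignored the rubric; fail loudly.
            raise ValueError(f"judge omitted dimension: {dim}")
        scores[dim] = min(1.0, max(0.0, float(data[dim])))
    return scores

raw = '{"correctness": 1.2, "relevance": 0.9, "completeness": 0.7, "conciseness": -0.1, "extra": 5}'
scores = parse_judge_scores(raw)  # clamps 1.2 -> 1.0 and -0.1 -> 0.0, drops "extra"
```

Failing loudly on a missing dimension is deliberate: a silently defaulted score would make a broken judge look like a quality regression in the model under test.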

Step 3: Regression Testing Pipeline

    public class EvalPipeline
    {
        private readonly LlmJudge _judge;

        public EvalPipeline(LlmJudge judge) => _judge = judge;

        public async Task<EvalReport> RunAsync(
            IChatClient modelUnderTest,
            List<EvalTestCase> testSuite)
        {
            var results = new List<EvalResult>();

            foreach (var testCase in testSuite)
            {
                var completion = await modelUnderTest.CompleteAsync(testCase.Input);
                var output = completion.Message.Text ?? "";
                var scores = await _judge.ScoreAsync(
                    testCase.Input, output, testCase.ReferenceOutput);

                // Pass only if every required dimension meets its minimum;
                // a dimension the judge didn't score counts as 0.
                var passed = testCase.MinScores.All(
                    req => scores.GetValueOrDefault(req.Key, 0) >= req.Value);

                results.Add(new EvalResult
                {
                    TestCaseId = testCase.Id,
                    Scores = scores,
                    Passed = passed,
                    Output = output
                });
            }

            return new EvalReport
            {
                TotalTests = results.Count,
                Passed = results.Count(r => r.Passed),
                Failed = results.Count(r => !r.Passed),
                AverageScores = AggregateScores(results),
                FailedTests = results.Where(r => !r.Passed).ToList()
            };
        }
    }
      
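The report from Step 3 is what a CI gate consumes. The gate logic itself is trivial and worth pinning down precisely; a minimal sketch (Python for brevity; the 95% pass-rate threshold is an illustrative choice, and the field names mirror EvalReport above):

```python
def should_block_deploy(total: int, failed: int, min_pass_rate: float = 0.95) -> bool:
    """Return True when the eval pass rate is too low to ship.

    An empty suite blocks deployment: no evidence is not passing evidence.
    """
    if total == 0:
        return True
    pass_rate = (total - failed) / total
    return pass_rate < min_pass_rate

# 10 failures out of 100 -> pass rate 0.90 -> block
# 2 failures out of 100  -> pass rate 0.98 -> ship
```

Wire this to the pipeline's exit code so a red eval run fails the build the same way a red unit test does.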

Step 4: Production Quality Monitoring

    public class ProductionMonitor
    {
        private readonly LlmJudge _judge;
        private readonly IMetrics _metrics;   // your metrics abstraction
        private readonly IAlerting _alerts;   // your alerting abstraction

        public async Task RecordInteraction(
            string input, string output, string modelId)
        {
            // Sample 5% of interactions for async evaluation
            if (Random.Shared.NextDouble() > 0.05) return;

            var scores = await _judge.ScoreAsync(input, output, reference: null);

            _metrics.RecordGauge("llm.quality.correctness", scores["correctness"],
                new("model", modelId));
            _metrics.RecordGauge("llm.quality.relevance", scores["relevance"],
                new("model", modelId));

            // Alert on quality drops
            if (scores.Values.Average() < 0.6)
                _alerts.Trigger("llm-quality-below-threshold", new { modelId, scores });
        }
    }
      
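Alerting on a single interaction's score, as in the snippet above, is noisy: one bad sample shouldn't page anyone. A sliding-window average smooths this out. A minimal sketch (Python for brevity; the window size and 0.6 threshold are illustrative parameters, not recommendations):

```python
from collections import deque

class QualityWindow:
    """Alert only when mean quality over the last N sampled interactions drops."""

    def __init__(self, size: int = 50, threshold: float = 0.6):
        self.scores = deque(maxlen=size)
        self.threshold = threshold

    def record(self, score: float) -> bool:
        """Add a sampled score; return True if the window mean breaches the threshold."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False  # window not yet full; don't alert on sparse data
        return sum(self.scores) / len(self.scores) < self.threshold
```

Requiring a full window before alerting trades a slower first alert for far fewer false pages during low-traffic periods.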

Pitfalls

⚠️ Common Mistakes

1. Evaluating only on "happy path" inputs

Your test suite needs adversarial inputs, edge cases, and ambiguous queries — not just the clean examples that look good in demos. If every test case has an obvious correct answer, your evaluation won't catch the failures that matter.

2. Using a single aggregate score

"Quality: 0.82" tells you nothing about what's working and what's broken. Break quality into dimensions (correctness, relevance, format) and track them separately. A model can be highly correct but poorly formatted — that's a different fix than one that's well-formatted but wrong.

3. LLM-as-judge without calibration

Judge models have biases — they tend to prefer verbose outputs, give higher scores to outputs that resemble their own generation style, and struggle with domain-specific correctness. Calibrate by running the same test cases through human evaluation and comparing scores.
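Calibration can be quantified: score the same outputs with both the judge and human reviewers, then measure agreement. A minimal sketch using Pearson correlation (pure Python to stay dependency-free; the example scores and any acceptance cutoff you choose are illustrative):

```python
def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between judge scores and human scores for the same outputs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    std_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (std_x * std_y)

judge_scores = [0.9, 0.7, 0.4, 0.8, 0.3]
human_scores = [0.85, 0.75, 0.5, 0.8, 0.2]
r = pearson(judge_scores, human_scores)
# Low agreement means: revise the rubric or add reference answers
# before trusting the judge at scale.
```

If ratings are ordinal rather than continuous, a rank correlation (Spearman) is often the better fit; the workflow is the same.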

4. Running evaluations only at deploy time

Model behavior can change over time (provider-side updates, temperature drift, context changes). Run evaluations continuously on a sample of production traffic, not just before deployment.

Practical Takeaways

✅ Key Lessons
  • Define quality dimensions before measuring. Correctness, relevance, completeness, format, conciseness — decide which matter for your use case and weight them accordingly.
  • Build a golden test suite of 50-100 cases. Include easy cases, hard cases, edge cases, and adversarial inputs. This is your evaluation infrastructure — invest in it.
  • Use LLM-as-judge for scale, humans for calibration. The judge scales to thousands of evaluations per hour. Humans validate the judge is calibrated correctly. Both are necessary.
  • Gate deployments on evaluation scores. No prompt or model change ships without passing the eval suite. Treat it like unit tests: red means no deploy.
  • Monitor quality in production continuously. Sample production traffic, evaluate asynchronously, alert on quality drops. Models and prompts drift — ongoing monitoring catches it.