LLM Evaluation Beyond Vibes

Problem Context

"It looks good to me" is not an evaluation strategy. Yet most teams assess LLM quality by eyeballing a few outputs, deciding it feels right, and shipping. Then a subtle regression slips through — the model starts adding a disclaimer to every response, or the tone shifts from professional to casual — and nobody catches it until customers complain.

LLM evaluation is harder than traditional ML evaluation because there's rarely one "correct" answer. A summary can be accurate but poorly written. A code review can be technically correct but miss the point. You need evaluation methods that capture the dimensions of quality that matter for your use case.

🤔 Sound familiar?

You updated a prompt and have no idea if it actually improved things or just felt better on the test cases you checked
Your team can't agree on what "good" output looks like for your use case
You want to catch regressions automatically but don't know what metrics to track
Business stakeholders ask "how good is the AI?" and you don't have a number to give them

This article builds a practical evaluation framework — from defining quality dimensions to automating regression detection.

Concept Explanation

Good LLM evaluation operates at three levels: offline evaluation (before deployment), online metrics (during production), and human feedback loops (continuous calibration). Each level catches different problems.


      flowchart TD
          A["Define Quality\nDimensions"] --> B["Build Test Suite\n(golden dataset)"]
          B --> C["Automated Evaluation\n(metrics + LLM-as-judge)"]
          C --> D["Human Evaluation\n(calibration)"]
          D --> E["CI Integration\n(regression gate)"]
          E --> F["Production Monitoring\n(online metrics)"]
          F -->|"Feedback loop"| B
      
          style A fill:#4f46e5,color:#fff,stroke:#4338ca
          style C fill:#059669,color:#fff,stroke:#047857
          style E fill:#7c3aed,color:#fff,stroke:#6d28d9
          style F fill:#dc2626,color:#fff,stroke:#b91c1c

Quality Dimensions

Before measuring anything, define what you're measuring. Common dimensions for engineering tasks:

Correctness: Is the output factually and technically accurate?
Relevance: Does the output address what was asked?
Completeness: Does it cover all required aspects?
Format compliance: Does it follow the specified structure?
Conciseness: Is it appropriately brief without losing substance?
Safety: Does it avoid harmful, biased, or inappropriate content?

LLM-as-Judge

Use a more capable model (e.g., GPT-4o) to evaluate outputs from your production model. Provide rubrics, reference answers, and scoring criteria. It's not perfect — judges have biases too — but it scales far better than human review and correlates well with human judgment on structured rubrics.

Implementation

Step 1: Build a Golden Test Suite

public class EvalTestCase
      {
          public required string Id { get; init; }
          public required string Category { get; init; }
          public required string Input { get; init; }
          public string? ReferenceOutput { get; init; }
          public required Dictionary&lt;string, double&gt; MinScores { get; init; }
      }
      
      // Example test cases
      var testSuite = new List&lt;EvalTestCase&gt;
      {
          new()
          {
              Id = "review-null-check",
              Category = "code-review",
              Input = "Review: if (user != null &amp;&amp; user.Name != null) { ... }",
              ReferenceOutput = "Use pattern matching: `if (user is { Name: not null })`...",
              MinScores = new()
              {
                  ["correctness"] = 0.8,
                  ["actionability"] = 0.7,
                  ["conciseness"] = 0.6
              }
          }
      };

Step 2: Automated Scoring with LLM-as-Judge

public class LlmJudge
      {
          private readonly ChatClient _judgeClient; // GPT-4o
      
          public async Task&lt;Dictionary&lt;string, double&gt;&gt; ScoreAsync(
              string input, string output, string? reference)
          {
              var prompt = $"""
                  Score the following AI output on these dimensions (0.0-1.0):
                  - correctness: Technical accuracy of the response
                  - relevance: How well it addresses the input
                  - completeness: Coverage of important aspects
                  - conciseness: Brevity without losing substance
      
                  INPUT: {input}
                  AI OUTPUT: {output}
                  {(reference != null ? $"REFERENCE: {reference}" : "")}
      
                  Return JSON: {{"correctness": 0.0, "relevance": 0.0,
                  "completeness": 0.0, "conciseness": 0.0}}
                  """;
      
              var response = await _judgeClient.CompleteChatAsync(
                  [new UserChatMessage(prompt)],
                  new ChatCompletionOptions
                  {
                      ResponseFormat = ChatResponseFormat.CreateJsonObjectFormat(),
                      Temperature = 0.1f
                  });
      
              return JsonSerializer.Deserialize&lt;Dictionary&lt;string, double&gt;&gt;(
                  response.Value.Content[0].Text)!;
          }
      }

Step 3: Regression Testing Pipeline

public class EvalPipeline
      {
          public async Task&lt;EvalReport&gt; RunAsync(
              IChatClient modelUnderTest,
              List&lt;EvalTestCase&gt; testSuite)
          {
              var results = new List&lt;EvalResult&gt;();
      
              foreach (var testCase in testSuite)
              {
                  var output = await modelUnderTest.CompleteAsync(testCase.Input);
                  var scores = await _judge.ScoreAsync(
                      testCase.Input, output, testCase.ReferenceOutput);
      
                  var passed = testCase.MinScores.All(
                      req => scores.GetValueOrDefault(req.Key, 0) >= req.Value);
      
                  results.Add(new EvalResult
                  {
                      TestCaseId = testCase.Id,
                      Scores = scores,
                      Passed = passed,
                      Output = output
                  });
              }
      
              return new EvalReport
              {
                  TotalTests = results.Count,
                  Passed = results.Count(r => r.Passed),
                  Failed = results.Count(r => !r.Passed),
                  AverageScores = AggregateScores(results),
                  FailedTests = results.Where(r => !r.Passed).ToList()
              };
          }
      }

Step 4: Production Quality Monitoring

public class ProductionMonitor
      {
          public async Task RecordInteraction(
              string input, string output, string modelId)
          {
              // Sample 5% of interactions for async evaluation
              if (Random.Shared.NextDouble() > 0.05) return;
      
              var scores = await _judge.ScoreAsync(input, output, reference: null);
      
              _metrics.RecordGauge("llm.quality.correctness", scores["correctness"],
                  new("model", modelId));
              _metrics.RecordGauge("llm.quality.relevance", scores["relevance"],
                  new("model", modelId));
      
              // Alert on quality drops
              if (scores.Values.Average() < 0.6)
                  _alerts.Trigger("llm-quality-below-threshold", new { modelId, scores });
          }
      }

Pitfalls

⚠️ Common Mistakes

1. Evaluating only on "happy path" inputs

Your test suite needs adversarial inputs, edge cases, and ambiguous queries — not just the clean examples that look good in demos. If every test case has an obvious correct answer, your evaluation won't catch the failures that matter.

2. Using a single aggregate score

"Quality: 0.82" tells you nothing about what's working and what's broken. Break quality into dimensions (correctness, relevance, format) and track them separately. A model can be highly correct but poorly formatted — that's a different fix than one that's well-formatted but wrong.

3. LLM-as-judge without calibration

Judge models have biases — they tend to prefer verbose outputs, give higher scores to outputs that resemble their own generation style, and struggle with domain-specific correctness. Calibrate by running the same test cases through human evaluation and comparing scores.

4. Running evaluations only at deploy time

Model behavior can change over time (provider-side updates, temperature drift, context changes). Run evaluations continuously on a sample of production traffic, not just before deployment.

Practical Takeaways

✅ Key Lessons

Define quality dimensions before measuring. Correctness, relevance, completeness, format, conciseness — decide which matter for your use case and weight them accordingly.
Build a golden test suite of 50-100 cases. Include easy cases, hard cases, edge cases, and adversarial inputs. This is your evaluation infrastructure — invest in it.
Use LLM-as-judge for scale, humans for calibration. The judge scales to thousands of evaluations per hour. Humans validate the judge is calibrated correctly. Both are necessary.
Gate deployments on evaluation scores. No prompt or model change ships without passing the test suite. Treat it like a test suite — red means no deploy.
Monitor quality in production continuously. Sample production traffic, evaluate asynchronously, alert on quality drops. Models and prompts drift — ongoing monitoring catches it.

LLM Evaluation Beyond Vibes

Problem Context

Concept Explanation

Quality Dimensions

LLM-as-Judge

Implementation

Step 1: Build a Golden Test Suite

Step 2: Automated Scoring with LLM-as-Judge

Step 3: Regression Testing Pipeline

Step 4: Production Quality Monitoring

Pitfalls

1. Evaluating only on "happy path" inputs

2. Using a single aggregate score

3. LLM-as-judge without calibration

4. Running evaluations only at deploy time

Practical Takeaways

Enjoyed this article?

Continue reading

Token Economics: Understanding and Optimizing LLM Costs

Small Language Models in Production

When to Fine-Tune vs Few-Shot vs RAG

Discussion

Stay ahead of the AI curve.