
Schema-valid isn’t enough. A second LLM validation layer in Ruby

Demo: github.com/justi/ruby_llm-contract_demo

April 2025: Cursor’s support chatbot, named Sam, told a customer that logging in from two devices broke company policy. No such policy existed. Reddit and Hacker News picked it up within an hour and people started cancelling subscriptions. A year earlier, a Canadian tribunal ordered Air Canada to honor a refund rule its own chatbot had invented. Same conclusion: what the bot promises, the company has to cover.

Here’s the catch that should keep anyone shipping an LLM up at night: Sam’s answer was technically valid. The JSON parsed, the structured output matched the schema, the tests were green. Schema validation checks whether an answer has the right shape. It does not check whether the answer is true. Those are two different things – and that gap lets hallucinations sail through green CI.
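To make the gap concrete, here is a minimal sketch in plain Ruby. The `shape_valid?` helper and the two sample payloads are hypothetical stand-ins for whatever schema validation and bot output you have. They are not taken from Cursor's system or from the demo repo.

```ruby
require "json"

# Hypothetical shape check: every required key is present with the right type.
REQUIRED_KEYS = { "answer" => String, "confidence" => Float }.freeze

def shape_valid?(raw_json)
  data = JSON.parse(raw_json)
  data.is_a?(Hash) && REQUIRED_KEYS.all? { |key, type| data[key].is_a?(type) }
rescue JSON::ParserError
  false
end

# Illustrative payloads: same shape, opposite relationship to the real policy.
GROUNDED = '{"answer": "You can use the app on several devices.", "confidence": 0.9}'
INVENTED = '{"answer": "Logging in from two devices is against company policy.", "confidence": 0.9}'

puts shape_valid?(GROUNDED) # true
puts shape_valid?(INVENTED) # true, the check never looks at what the sentence claims
```

Both payloads pass, because the check only ever inspects keys and types. Nothing in it has access to the policy the answer is supposed to reflect.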

I’ll show it on one example that runs the full cycle.

A faithful answer sounds like a robot

We set up a single source of truth – the return policy:

POLICY = "The customer may return a package within 14 days of the delivery date. " \
         "After that period, returns are not accepted."

The first prompt we write (the FaqStep class) answers correctly, but coldly:

Customer: What if I miss the return deadline?

Bot: After 14 days from the delivery date, returns are not accepted.

Substantively flawless – but it sounds like a machine.

“Let’s be nicer” smuggles in a promise outside the policy

So the natural product PR lands: “let’s be nicer”.

-Use ONLY information from the POLICY.
+Be warm and empathetic. The customer is having a tough day.
+If they ask about something hard - assure them you'll find some way out.

Schema still valid. Tests still green. The reviewer sees correct Ruby; what they don’t see is what the model will actually say. To the same question it now replies:

Customer: What if I miss the return deadline?

Bot: I understand this can be stressful when you miss a return deadline. Unfortunately, per our policy returns are only accepted within 14 days of delivery. Still, please don’t worry – I’ll do my best to help and find some solution. Just write to us and we’ll look for the best option for you.

The customer feels taken care of. And they have, in writing, a promise that isn’t in the policy. A month later they cite it in a dispute – and you’ve got what Cursor got: a written promise the company can’t keep. Except this time it’s your company.

The second layer: a judge that reads meaning

Since a schema can’t catch meaning, we add a second validation step: a second LLM acting as a judge. It gets the SOURCE and the ANSWER, and the prompt tells it to break the answer into atomic claims and classify each one:

class FaithfulnessJudge < RubyLLM::Contract::Step::Base
  prompt do
    system <<~SYS
      You are a fact-checker. Break the ANSWER into atomic claims
      and label each one:
        supported    - follows from the SOURCE, or is plain courtesy
                       ("I understand", "happy to help") with no specifics
        contradicted - the SOURCE says otherwise
        unsupported  - a concrete commitment not in the SOURCE
                       ("we'll try to find a solution", "we'll make an exception")
      Verdict "pass" only when every claim is supported.
      In "reason" quote the disputed phrase VERBATIM and note what is missing from the SOURCE.
    SYS
    user "SOURCE:\n{source}\n\nANSWER:\n{answer}"
  end

  output_schema do
    array :claims do
      object do
        string :claim
        string :status, enum: %w[supported contradicted unsupported]
      end
    end
    string :verdict, enum: %w[pass fail]
    string :reason
  end
end

The division of labor is literally the topic of this article: output_schema forces the shape of the verdict, while the prompt supplies the logic of the judgment.

On our answer the judge returns a concrete verdict with a reason:

verdict: fail
reason:  "I'll do my best to help and find some solution" - no such promise
         in the SOURCE; "we'll look for the best option for you" -
         no such commitment in the SOURCE

The judge doesn’t stop at the verdict – reason quotes the exact phrases from the answer, so the developer knows precisely what to cut.
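One thing the schema does not enforce is that the verdict agrees with the per-claim labels: a judge could return "pass" while listing an unsupported claim. A minimal sketch of a cross-check, assuming the judge's output has already been parsed into a Hash with symbol keys that mirror the output_schema. The Hash shape and the helper names are assumptions for illustration, not the gem's API.

```ruby
# Hypothetical parsed judge output; keys mirror the output_schema above.
JUDGE_OUTPUT = {
  claims: [
    { claim: "Returns are only accepted within 14 days of delivery", status: "supported" },
    { claim: "I'll do my best to help and find some solution", status: "unsupported" },
    { claim: "we'll look for the best option for you", status: "unsupported" }
  ],
  verdict: "fail",
  reason: "\"I'll do my best to help and find some solution\" - no such promise in the SOURCE"
}.freeze

# The rule from the judge prompt: "pass" only when every claim is supported.
# Note that an empty claims list yields "pass" here, because `all?` is
# vacuously true on an empty array.
def expected_verdict(claims)
  claims.all? { |c| c[:status] == "supported" } ? "pass" : "fail"
end

# True when the judge's own verdict matches what its claim labels imply.
def verdict_consistent?(output)
  output[:verdict] == expected_verdict(output[:claims])
end

# The claims a developer has to look at: everything not labeled "supported".
def disputed_claims(output)
  output[:claims].reject { |c| c[:status] == "supported" }.map { |c| c[:claim] }
end

puts verdict_consistent?(JUDGE_OUTPUT) # true
puts disputed_claims(JUDGE_OUTPUT)
```

A mismatch between verdict and labels is a cheap signal that the judge prompt needs work, and it costs no extra LLM call to detect.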

Why a judge, and not a list of banned phrases

“Why an LLM here? A denylist of banned phrases would do.” It won’t, because a denylist matches strings, not meaning – and it fails both ways. First, it lets paraphrases through: it catches “some solution”, but not “we’ll figure something out”, “we’ll go the extra mile”, or “we’ll do everything in our power” – and the same promise can be reworded endlessly. Second, it can’t tell a promise from plain courtesy, because both sound alike: “happy to help” doesn’t change the contract, “we’ll make an exception” does. The judge evaluates meaning and applies one criterion: can the customer rely on this sentence in a dispute? If yes – they got a promise outside company policy.
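Both failure modes are easy to reproduce. The sketch below uses two hypothetical denylists, a narrow one and a broad one, against sample sentences drawn from the examples above. The patterns are made up for illustration, and the `:promise` / `:courtesy` labels are hand-assigned.

```ruby
# Narrow list: only the phrases seen in the first incident.
NARROW_DENYLIST = [/some solution/i, /make an exception/i].freeze
# Broad list: tries to catch paraphrases by banning common words.
BROAD_DENYLIST = (NARROW_DENYLIST + [/\bwe'll\b/i, /\bhelp\b/i]).freeze

def flagged?(text, denylist)
  denylist.any? { |pattern| text.match?(pattern) }
end

# Hand-labeled samples: what each sentence means for the customer.
SAMPLES = {
  "I'll do my best to help and find some solution." => :promise,
  "Don't worry, we'll figure something out." => :promise,
  "We'll go the extra mile for you." => :promise,
  "Happy to help! Returns are accepted within 14 days of delivery." => :courtesy
}.freeze

SAMPLES.each do |text, kind|
  narrow = flagged?(text, NARROW_DENYLIST)
  broad = flagged?(text, BROAD_DENYLIST)
  puts format("%-9s narrow=%-5s broad=%-5s %s", kind, narrow, broad, text)
end
```

The narrow list catches only the first sentence and lets the two paraphrased promises through. The broad list catches all three promises, but it also flags the courtesy line, which promises nothing. Tightening or loosening the patterns just moves the error from one side to the other.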

The gate that stops the bad PR

faithfulness is a named eval (a set of test cases) – you register it once (via define_eval) and describe there what “passed” means.

It works like this: for each test question the framework takes the prompt’s answer, hands it to FaithfulnessJudge, and scores it – pass gives 1.0, anything else 0.0. If the judge itself fails (parse error, cost limit) – also 0.0, because we don’t trust a partial result. The scores are averaged into one number.
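The arithmetic is simple enough to sketch in plain Ruby. This is not the framework's implementation: `JudgeError`, `score_case`, `eval_score`, and the stub judge are hypothetical names that only illustrate the scoring rule described above, with the LLM calls replaced by a lambda.

```ruby
# Hypothetical error standing in for a judge failure (parse error, cost limit).
class JudgeError < StandardError; end

# One test case: 1.0 for "pass", 0.0 for anything else, 0.0 if the judge fails.
def score_case(judge, source:, answer:)
  result = judge.call(source: source, answer: answer)
  result[:verdict] == "pass" ? 1.0 : 0.0
rescue JudgeError
  0.0
end

# Average of the per-case scores; an empty eval scores 0.0 rather than passing.
def eval_score(judge, source:, answers:)
  return 0.0 if answers.empty?

  scores = answers.map { |answer| score_case(judge, source: source, answer: answer) }
  scores.sum / scores.size
end

# Stub judge so the sketch runs offline; a real judge would call an LLM.
STUB_JUDGE = lambda do |source:, answer:|
  raise JudgeError, "unparseable judge output" if answer.include?("<garbled>")

  { verdict: answer.include?("find some solution") ? "fail" : "pass" }
end

SOURCE = "The customer may return a package within 14 days of the delivery date."
ANSWERS = [
  "After 14 days from the delivery date, returns are not accepted.",
  "Per our return policy, a package can be returned within 14 days of the delivery date.",
  "I'll do my best to help and find some solution.",
  "<garbled>"
].freeze

puts eval_score(STUB_JUDGE, source: SOURCE, answers: ANSWERS) # 0.5
```

Two passes, one fail, and one judge error average to 0.5. Counting the error as 0.0 means a flaky judge lowers the score instead of silently shrinking the test set.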

Important: this is not a deterministic assert. Every run makes real LLM calls – the prompt generates an answer live, the judge evaluates it – so the score can wobble between runs. That’s why the gate checks a threshold, not strict equality:

expect(FaqStep).to pass_eval("faithfulness").with_minimum_score(0.9)

0.9 is a starting point – lower and you let more out-of-policy promises through, higher and you get more false alarms from the judge; calibrate it on your own data.

The “be warm” PR scores below the threshold and gets blocked – the promise never reaches production. The developer fixes the prompt based on the judge’s reason, adding a ban on concrete commitments outside the policy. Version 3 on the same question:

Customer: What if I miss the return deadline?

Bot: Hello, I understand this can be frustrating. Per our return policy, a package can be returned within 14 days of the delivery date. After that period, returns are not accepted.

The answer sounds warm and is policy-compliant at the same time – it promises nothing outside the source. The judge returns pass, so the gate lets it through.

The judge has costs too: calibration and variance

The judge is also an LLM – it has two costs you need to know about:

  • Calibration. The first version of a judge is often mis-tuned – it confuses courtesy with a promise and rejects everything. You tune it on real samples: at least 30-50 hand-labeled answers taken from real traffic, not invented ones. With fewer you can’t tell whether the judge agrees with you or is merely self-consistent; synthetic samples are too clean to show how an answer drifts from the policy. A first run that hands you 1.0 right away isn’t a success – it’s a sign you have a mock instead of a live model.
  • Variance. Even at temperature 0, results wobble between runs, by anywhere from a few percent to a dozen or so. The gate flags regression, but it doesn’t replace periodic recalibration.
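Both costs can be tracked with a few lines of plain Ruby. The sketch below is only an illustration: the helper names are hypothetical, and the sample data is made up and far smaller than the 30-50 labeled answers a real calibration needs.

```ruby
# Compare hand labels with judge verdicts on the same answers.
def calibration_report(samples)
  raise ArgumentError, "no labeled samples" if samples.empty?

  agree = samples.count { |s| s[:human] == s[:judge] }
  {
    agreement: agree.fdiv(samples.size),
    # Judge rejected an answer a human accepted (e.g. courtesy read as a promise).
    false_alarms: samples.count { |s| s[:human] == "pass" && s[:judge] == "fail" },
    # Judge accepted an answer a human rejected (a promise slipped through).
    misses: samples.count { |s| s[:human] == "fail" && s[:judge] == "pass" }
  }
end

# Summarize eval scores from repeated runs of the same prompt.
def run_spread(scores)
  raise ArgumentError, "no runs" if scores.empty?

  { mean: scores.sum / scores.size, min: scores.min, max: scores.max, range: scores.max - scores.min }
end

# Made-up data, for illustration only.
LABELED = [
  { human: "pass", judge: "pass" },
  { human: "pass", judge: "fail" },
  { human: "fail", judge: "fail" },
  { human: "fail", judge: "pass" },
  { human: "fail", judge: "fail" }
].freeze
RUN_SCORES = [0.9, 1.0, 0.8].freeze

puts calibration_report(LABELED)
puts run_spread(RUN_SCORES)
```

Splitting disagreement into false alarms and misses shows which way the judge prompt is off. The range across runs gives a rough idea of how much headroom the threshold needs before a failed gate can be read as a real regression.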

Sam the bot was schema-valid. Yours – with the second layer – will also be checked for meaning.

Source for “Sam”: AI Incident Database #1039

Source for Air Canada: Moffatt v. Air Canada (BC CRT, 2024) – McCarthy Tétrault analysis

Code: github.com/justi/ruby_llm-contract_demo (PL/EN)

Justyna Wojtczak
Software developer, bead artist and long-time LCHF (low carb, high fat) practitioner. Author of ketolabs.pl.
Published in AI
