April 2025: Cursor’s support chatbot, named Sam, told a customer that logging in from two devices broke company policy. No such policy existed. Reddit and Hacker News picked it up within an hour and people started cancelling subscriptions. A year earlier, a Canadian tribunal ordered Air Canada to honor a refund rule its own chatbot had invented. Same conclusion: what the bot promises, the company has to cover.
Here’s the catch that should keep anyone shipping an LLM up at night: by every automated check, Sam’s answer was valid. The JSON parsed, the structured output matched the schema, the tests were green. Schema validation checks whether an answer has the right shape. It does not check whether it’s true. Those are two different things – and that gap lets hallucinations sail through green CI.
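To make the shape-versus-truth gap concrete, here is a minimal sketch using only Ruby's standard library. The reply payload, its field names, and the `shape_valid?` helper are all illustrative assumptions, standing in for whatever schema validation your stack really uses.

```ruby
require "json"

# Hypothetical bot reply; the field names are illustrative only.
raw = '{"answer": "Logging in from two devices violates our policy.", "confidence": 0.93}'

# Hand-rolled shape check standing in for real schema validation:
# it looks at keys and types, nothing else.
def shape_valid?(payload)
  payload.is_a?(Hash) &&
    payload["answer"].is_a?(String) &&
    payload["confidence"].is_a?(Numeric)
end

reply = JSON.parse(raw)
puts shape_valid?(reply) # prints true: keys and types are right

# Nothing above compares reply["answer"] with any policy text,
# so an invented rule passes exactly like a real one.
```

The check is doing its job; it just answers a different question than "is this true?".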
I’ll walk through one example, end to end.
A faithful answer sounds like a robot
We set up a single source of truth – the return policy:
POLICY = "The customer may return a package within 14 days of the delivery date. " \
         "After that period, returns are not accepted."
The first prompt we write (the FaqStep class) answers correctly, but coldly:
Customer: What if I miss the return deadline?
Bot: After 14 days from the delivery date, returns are not accepted.
Substantively flawless – but it sounds like a machine.
“Let’s be nicer” smuggles in a promise outside the policy
So the natural product PR lands: “let’s be nicer”.
-Use ONLY information from the POLICY.
+Be warm and empathetic. The customer is having a tough day.
+If they ask about something hard - assure them you'll find some way out.
Schema still valid. Tests still green. The reviewer sees correct Ruby. What the model will actually say, they don’t see. To the same question it now replies:
Customer: What if I miss the return deadline?
Bot: I understand this can be stressful when you miss a return deadline. Unfortunately, per our policy returns are only accepted within 14 days of delivery. Still, please don’t worry – I’ll do my best to help and find some solution. Just write to us and we’ll look for the best option for you.
The customer feels taken care of. And they have, in writing, a promise that isn’t in the policy. A month later they cite it in a dispute – and you’ve got what Cursor got: a written promise the company can’t keep. Except this time it’s your company.
The second layer: a judge that reads meaning
Since schema doesn’t catch meaning, we add a second validation step: a second LLM as a judge. It gets the SOURCE and the ANSWER, and the prompt tells it to break the answer into atomic claims and classify each one:
class FaithfulnessJudge < RubyLLM::Contract::Step::Base
  prompt do
    system <<~SYS
      You are a fact-checker. Break the ANSWER into atomic claims
      and label each one:
        supported    - follows from the SOURCE, or is plain courtesy
                       ("I understand", "happy to help") with no specifics
        contradicted - the SOURCE says otherwise
        unsupported  - a concrete commitment not in the SOURCE
                       ("we'll try to find a solution", "we'll make an exception")
      Verdict "pass" only when every claim is supported.
      In "reason" quote the disputed phrase VERBATIM and note what is missing from the SOURCE.
    SYS
    user "SOURCE:\n{source}\n\nANSWER:\n{answer}"
  end

  output_schema do
    array :claims do
      object do
        string :claim
        string :status, enum: %w[supported contradicted unsupported]
      end
    end
    string :verdict, enum: %w[pass fail]
    string :reason
  end
end
The division of labor is literally the topic of this article: output_schema forces the shape of the verdict, while the prompt supplies the logic of the judgment.
On our answer the judge returns a concrete verdict with a reason:
verdict: fail
reason:  "I’ll do my best to help and find some solution" - no such promise
         in the SOURCE; "we’ll look for the best option for you" -
         no such commitment in the SOURCE
The judge doesn’t stop at the verdict – reason quotes the exact phrases from the answer, so the developer knows precisely what to cut.
Why a judge, and not a list of banned phrases
“Why an LLM here? A denylist of banned phrases would do.” It won’t, because a denylist matches strings, not meaning – and it fails both ways. First, it lets paraphrases through: it catches “some solution”, but not “we’ll figure something out”, “we’ll go the extra mile”, or “we’ll do everything in our power” – and the same promise can be reworded endlessly. Second, it can’t tell a promise from plain courtesy, because both sound alike: “happy to help” doesn’t change the contract, “we’ll make an exception” does. The judge evaluates meaning and applies one criterion: can the customer rely on this sentence in a dispute? If yes – they got a promise outside company policy.
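Both failure modes are easy to reproduce. The sketch below is plain Ruby with made-up phrase lists and sample sentences (none of them come from the demo repository): a narrow list misses a paraphrase, and a list widened enough to catch it also flags ordinary courtesy.

```ruby
# Illustrative phrase lists; not a recommended denylist.
STRICT = ["some solution", "make an exception"].freeze
BROAD  = (STRICT + ["we'll"]).freeze

# Plain substring matching, which is all a denylist can do.
def hit?(text, phrases)
  lowered = text.downcase
  phrases.any? { |phrase| lowered.include?(phrase) }
end

caught     = "I'll do my best to help and find some solution."
paraphrase = "Don't worry, we'll figure something out for you."
courtesy   = "Thanks for writing in, we'll be happy to help."

puts hit?(caught, STRICT)     # true:  the exact phrase is listed
puts hit?(paraphrase, STRICT) # false: same promise, different words
puts hit?(paraphrase, BROAD)  # true:  widening the list catches it...
puts hit?(courtesy, BROAD)    # true:  ...and flags plain courtesy too
```

However you tune the list, you trade missed promises against false alarms, because the strings carry no information about whether a commitment was made.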
The gate that stops the bad PR
faithfulness is a named eval (a set of test cases) – you register it once (via define_eval) and describe in it what “passed” means.
It works like this: for each test question the framework takes the prompt’s answer, hands it to FaithfulnessJudge, and scores it – pass gives 1.0, anything else 0.0. If the judge itself fails (parse error, cost limit) – also 0.0, because we don’t trust a partial result. The scores are averaged into one number.
Important: this is not a deterministic assert. Every run makes real LLM calls – the prompt generates an answer live, the judge evaluates it – so the score can wobble between runs. That’s why the gate checks a threshold, not strict equality:
expect(FaqStep).to pass_eval("faithfulness").with_minimum_score(0.9)
0.9 is a starting point – lower and you let more out-of-policy promises through, higher and you get more false alarms from the judge; calibrate it on your own data.
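The scoring and threshold rules described above can be sketched in plain Ruby. This is not the framework's implementation: `JudgeError`, `score_case`, `eval_score`, and the lambdas standing in for live judge calls are hypothetical names used only to show the arithmetic.

```ruby
# Hypothetical error type for a judge call that fails (parse error, cost limit).
JudgeError = Class.new(StandardError)

THRESHOLD = 0.9

# One test case: "pass" scores 1.0, anything else 0.0.
# A failed judge call also scores 0.0, since a partial result is not trusted.
def score_case(judge_call)
  judge_call.call == "pass" ? 1.0 : 0.0
rescue JudgeError
  0.0
end

# Average of per-case scores; an empty eval scores 0.0 rather than passing.
def eval_score(cases)
  return 0.0 if cases.empty?

  cases.sum { |judge_call| score_case(judge_call) } / cases.size
end

# Lambdas stand in for live judge calls on four test questions.
cases = [
  -> { "pass" },
  -> { "pass" },
  -> { "fail" },
  -> { raise JudgeError, "could not parse judge output" }
]

score = eval_score(cases)
puts score              # 0.5
puts score >= THRESHOLD # false: the gate blocks this change
```

Comparing against a threshold instead of asserting an exact score is what lets the gate tolerate a small wobble between runs while still catching a real regression.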
The “be warm” PR scores below the threshold and gets blocked – the promise never reaches production. The developer fixes the prompt based on the judge’s reason, adding a ban on concrete commitments that are not in the policy. Version 3 on the same question:
Customer: What if I miss the return deadline?
Bot: Hello, I understand this can be frustrating. Per our return policy, a package can be returned within 14 days of the delivery date. After that period, returns are not accepted.
The answer sounds warm and is policy-compliant at the same time – it promises nothing outside the source. The judge returns pass, so the gate lets it through.
The judge has costs too: calibration and variance
The judge is also an LLM – it has two costs you need to know about:
- Calibration. The first version of a judge is often mis-tuned – it confuses courtesy with a promise and rejects everything. You tune it on real samples: at least 30-50 hand-labeled ones, not invented. With fewer you can’t tell whether the judge agrees with you or is merely self-consistent; synthetic samples are too clean to show how an answer drifts from the policy. A first run that hands you 1.0 right away isn’t a success – it’s a sign you have a mock instead of a live model.
- Variance. Despite temperature 0, results wobble by a few to a dozen-odd percent between runs. The gate flags regression, but it doesn’t replace periodic recalibration.
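One way to put a number on calibration is to compare the judge's verdicts with your hand labels. The sketch below is an assumption about how you might do that, not part of the library: `Sample`, `agreement`, and `false_alarm_rate` are hypothetical names, and the five rows are placeholders to show the arithmetic, far fewer than the 30-50 real samples recommended above.

```ruby
# Hypothetical record: a human verdict next to the judge's verdict
# for the same answer.
Sample = Struct.new(:human, :judge)

# Share of samples where the judge and the human agree.
def agreement(samples)
  samples.count { |s| s.human == s.judge }.fdiv(samples.size)
end

# Share of human-approved answers that the judge rejected.
# A high value suggests the judge confuses courtesy with a promise.
def false_alarm_rate(samples)
  approved = samples.select { |s| s.human == "pass" }
  return 0.0 if approved.empty?

  approved.count { |s| s.judge == "fail" }.fdiv(approved.size)
end

# Placeholder rows only; real calibration needs hand-labeled production answers.
samples = [
  Sample.new("pass", "pass"),
  Sample.new("pass", "fail"),
  Sample.new("pass", "fail"),
  Sample.new("fail", "fail"),
  Sample.new("fail", "fail")
]

puts agreement(samples)                 # 0.6
puts false_alarm_rate(samples).round(2) # 0.67
```

Re-running the same comparison on a schedule gives you a simple signal for when the judge has drifted and needs recalibrating.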
Sam the bot was schema-valid. Yours – with the second layer – will also be checked for meaning.
Source for “Sam”: AI Incident Database #1039
Source for Air Canada: Moffatt v. Air Canada (BC CRT, 2024) – McCarthy Tétrault analysis
Code: github.com/justi/ruby_llm-contract_demo (PL/EN)
