Documentation
How HallucinationLab measures whether an LLM’s answer can be trusted — and how to put that score to work.
Overview
HallucinationLab measures how reliable an LLM’s answer is, not by asking another LLM, but with a deterministic, multi-signal score you can defend. You supply a question and a ground-truth reference, send the question to multiple models, and see exactly how each answer holds up against the reference.
The result is a single 0–100 reliability score per model, broken down so you can see why an answer scored the way it did.
How scoring works
Every answer is scored across four independent signals. We deliberately never use an LLM to grade another LLM — that’s circular and biased. The score is deterministic, repeatable, and explainable.
Factual Consistency
Does the answer agree with your ground-truth reference, or contradict it? This is what catches confident falsehoods — the core of hallucination.
Completeness
Does the answer cover the key facts in your reference, or leave critical pieces out? A correct-but-partial answer is still a risk.
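As an illustration only, one deterministic way to approximate a completeness signal is to count how many key facts from the reference appear in the answer. This is a minimal sketch, not HallucinationLab’s actual method: the `completeness` function, the hand-listed `key_facts`, and the case-insensitive substring matching are all assumptions made for the example.

```python
def completeness(answer: str, key_facts: list[str]) -> float:
    """Return the fraction of key facts mentioned in the answer.

    Hypothetical helper: key facts are hand-listed strings, and a fact
    counts as covered if it appears as a case-insensitive substring.
    """
    if not key_facts:
        return 1.0  # nothing to cover, so nothing is missing
    text = answer.lower()
    covered = sum(1 for fact in key_facts if fact.lower() in text)
    return covered / len(key_facts)


# Illustrative inputs only.
key_facts = ["1969", "apollo 11", "neil armstrong"]
answer = "Apollo 11 landed on the Moon in 1969."

score = completeness(answer, key_facts)
print(f"completeness: {score:.2f}")  # two of the three facts are covered
```

The answer above is correct but omits one key fact, which is the correct-but-partial case this signal is meant to surface.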
Conciseness
Is the answer appropriately scoped, or padded with filler and hedging that buries the actual signal?
Calibration
Does the model’s confidence match its correctness? Confident-and-wrong is penalized; appropriate uncertainty is rewarded.
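To make the breakdown concrete, here is a minimal sketch of how four per-signal values could roll up into a single 0–100 score. The weighting HallucinationLab actually uses is not stated here, so the equal weights, the 0.0 to 1.0 signal range, and the names `SignalScores` and `reliability_score` are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SignalScores:
    """Hypothetical container; each signal is assumed to lie in [0.0, 1.0]."""
    factual_consistency: float
    completeness: float
    conciseness: float
    calibration: float


def reliability_score(s: SignalScores) -> float:
    """Equal-weight mean of the four signals, scaled to 0-100.

    Equal weighting is an assumption, not the documented formula.
    """
    values = (s.factual_consistency, s.completeness, s.conciseness, s.calibration)
    for v in values:
        if not 0.0 <= v <= 1.0:
            raise ValueError("each signal must be in [0, 1]")
    return round(100 * sum(values) / len(values), 1)


# Illustrative values only.
example = SignalScores(
    factual_consistency=0.9,
    completeness=0.75,
    conciseness=0.8,
    calibration=0.55,
)
total = reliability_score(example)
print(total)
```

Because the function is pure arithmetic over its inputs, the same signal values always produce the same score, which is the property the deterministic design relies on.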
Using the evaluator
- 01 Enter your question and a ground-truth reference answer.
- 02 Pick the models to compare. Bring your own API keys for paid providers; they stay in your browser and never touch our servers.
- 03 Run the evaluation. Results stream in live as each model responds.
- 04 Read the leaderboard: the model with the highest reliability score wins.
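The final step amounts to sorting models by reliability score. A minimal sketch of that ranking follows, assuming the scores have already been collected into a dictionary. The `leaderboard` function, the model names, and the tie-break by name are hypothetical choices for the example, not part of the product.

```python
def leaderboard(scores: dict[str, float]) -> list[tuple[str, float]]:
    """Sort models by score, highest first.

    Ties are broken alphabetically by model name so the ordering is
    repeatable (an assumption made for this sketch).
    """
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Placeholder model names and scores.
results = {"model-a": 72.5, "model-b": 88.0, "model-c": 88.0, "model-d": 41.0}

ranked = leaderboard(results)
for rank, (name, score) in enumerate(ranked, start=1):
    print(f"{rank}. {name}: {score:.1f}")
```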
SDK
The hlabs SDK sits between your app and your LLM, so a hallucinated answer never reaches your users. One import. The full SDK reference lands at launch.