Documentation

How HallucinationLab measures whether an LLM’s answer can be trusted — and how to put that score to work.

Overview

HallucinationLab measures how reliable an LLM’s answer is — not by asking another LLM, but with a deterministic, multi-signal score you can defend. You bring a question and a ground-truth reference, fan it out across multiple models, and see exactly how each one holds up.

The result is a single 0–100 reliability score per model, broken down so you can see why an answer scored the way it did.

How scoring works

Every answer is scored across four independent signals. We deliberately never use an LLM to grade another LLM — that’s circular and biased. The score is deterministic, repeatable, and explainable.
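As a rough illustration of how four per-signal values could roll up into one 0–100 number with a visible breakdown, here is a minimal sketch. The equal weights, the `SignalScores` type, and the function names are assumptions made for this example; the actual weighting HallucinationLab uses is not stated here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SignalScores:
    """Per-signal values, each assumed to be normalized to [0.0, 1.0]."""
    factual_consistency: float
    completeness: float
    conciseness: float
    calibration: float


# Hypothetical weights (equal by assumption); they sum to 1.0.
WEIGHTS = {
    "factual_consistency": 0.25,
    "completeness": 0.25,
    "conciseness": 0.25,
    "calibration": 0.25,
}


def breakdown(signals: SignalScores) -> dict[str, float]:
    """Points (out of 100) contributed by each signal."""
    return {
        name: 100 * weight * getattr(signals, name)
        for name, weight in WEIGHTS.items()
    }


def reliability_score(signals: SignalScores) -> float:
    """Weighted sum of the four signals, scaled to 0-100."""
    return round(sum(breakdown(signals).values()), 1)
```

Because the function is a pure weighted sum, the same inputs always give the same score, and the breakdown shows which signal pulled a score down.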

Factual Consistency

Does the answer agree with your ground-truth reference, or contradict it? This is what catches confident falsehoods — the core of hallucination.

Completeness

Does the answer cover the key facts in your reference, or leave critical pieces out? A correct-but-partial answer is still a risk.
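One deterministic way to approximate completeness is recall over a list of key facts: the fraction of reference facts that appear in the answer. The sketch below assumes the key facts are already available as short phrases and uses plain normalized phrase matching; both the `key_facts` input and the matching rule are illustrative assumptions, not a description of HallucinationLab's internals.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse anything that is not a letter or digit to one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def completeness(answer: str, key_facts: list[str]) -> float:
    """Fraction of key facts found as whole phrases in the answer (0.0 to 1.0)."""
    if not key_facts:
        return 1.0  # nothing was required, so nothing is missing
    # Pad with spaces so "paris" does not match inside a longer word.
    haystack = f" {normalize(answer)} "
    covered = [fact for fact in key_facts if f" {normalize(fact)} " in haystack]
    return len(covered) / len(key_facts)
```

For example, an answer that mentions "Paris" and "1889" but not "Gustave Eiffel" covers two of three key facts and scores about 0.67, which is how a correct-but-partial answer gets flagged.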

Conciseness

Is the answer appropriately scoped, or padded with filler and hedging that buries the actual signal?

Calibration

Does the model’s confidence match its correctness? Confident-and-wrong is penalized; appropriate uncertainty is rewarded.
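The penalty for confident-and-wrong answers can be illustrated with a Brier-style score, a standard way to compare a stated probability against a binary outcome. This sketch assumes a confidence value in [0, 1] has already been obtained for the answer; how confidence is measured is not described here, and `calibration_score` is a hypothetical name.

```python
def calibration_score(confidence: float, is_correct: bool) -> float:
    """Return 1 minus the squared gap between confidence and outcome.

    1.0 means confidence matched correctness exactly; 0.0 means the model
    was fully confident and wrong (or fully doubtful and right).
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    outcome = 1.0 if is_correct else 0.0
    return 1.0 - (confidence - outcome) ** 2
```

Under this rule a wrong answer given at 0.9 confidence scores about 0.19, while the same wrong answer given at 0.3 confidence scores about 0.91, so hedging appropriately on an uncertain answer is rewarded.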

Using the evaluator

01 Enter your question and a ground-truth reference answer.
02 Pick the models to compare. Bring your own API keys for paid providers; they stay in your browser and never touch our servers.
03 Run the evaluation. Results stream in live as each model responds.
04 Read the leaderboard: the model with the highest reliability score wins.
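The ranking in the last step amounts to sorting models by reliability score, highest first. A minimal sketch follows; the model names are placeholders, and breaking ties alphabetically is an assumption made so the ordering stays deterministic.

```python
def leaderboard(scores: dict[str, float]) -> list[tuple[str, float]]:
    """Rank models by reliability score, highest first; ties sort by name."""
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def winner(scores: dict[str, float]) -> str:
    """Name of the top-ranked model."""
    if not scores:
        raise ValueError("no models were evaluated")
    return leaderboard(scores)[0][0]
```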

Open the evaluator →

SDK

COMING AT LAUNCH

The hlabs SDK sits between your app and your LLM, so a hallucinated answer never reaches your users. One import. The full SDK reference lands at launch.
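Until the SDK reference is published, the general pattern of sitting between an app and its LLM can be sketched as a wrapper that scores each answer and substitutes a fallback when the score is too low. This is not the hlabs SDK API: `guarded`, its parameters, the threshold of 70, and the fallback text are all hypothetical, and the scoring function is supplied by the caller.

```python
from typing import Callable

Generate = Callable[[str], str]      # prompt -> answer
Score = Callable[[str, str], float]  # (prompt, answer) -> 0-100 reliability


def guarded(
    generate: Generate,
    score: Score,
    threshold: float = 70.0,  # hypothetical cut-off
    fallback: str = "I can't answer that reliably.",
) -> Generate:
    """Wrap an LLM call so answers scoring below the threshold are withheld."""

    def wrapper(prompt: str) -> str:
        answer = generate(prompt)
        if score(prompt, answer) >= threshold:
            return answer
        return fallback

    return wrapper
```

The app calls the wrapped function exactly as it called the original one, which is the sense in which a guard like this needs only a single integration point.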

Get early access →