
How the scorecard is calculated

What goes into the truth set, how the judges score, and why we publish the failures too.

1. The five-axis rubric

Every answer Advocat produces is judged on five axes, each scored 1–5. The item's final score is the mean of the five axis scores, averaged across the judges. The same rubric applies to every item, regardless of tier (gold/statute/adversarial), so a "good" calibration score still requires the same evidence whether the question is a real case or a trap.

Citation validity (citation_validity)
5/5: Statute §, court name, and date all present and verifiable in Riigi Teataja / Finlex / EUR-Lex.
1/5: Hallucinated case name, made-up paragraph number, repealed law cited as if current.

Conclusion correctness (conclusion_correctness)
5/5: The legal conclusion is what a competent lawyer would advise; no material error.
1/5: Wrong outcome: would lose the case, miss the right, or take an unavailable procedural route.

Jurisdiction accuracy (jurisdiction_accuracy)
5/5: Correct legal system (EE/FI/EU), correct court layer (district / appeals / supreme), correct law family.
1/5: Confuses Estonian and Finnish procedure; routes to the wrong court level.

Completeness (completeness)
5/5: All actionable next steps named: deadline, document, addressee, fee, evidence.
1/5: Generic advice ("contact a lawyer") with no concrete step.

Calibration (calibration)
5/5: Hedges where uncertain; refuses fake cases; flags when out-of-domain.
1/5: Confidently confabulates on a trap. The cardinal sin.
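
To make the rubric concrete, here is a minimal sketch of how one judge's verdict on a single item could be represented, using the axis slugs above. The type and function names are illustrative, not taken from the actual eval harness.

```typescript
// Illustrative only: the type and function names are hypothetical.
type Axis =
  | "citation_validity"
  | "conclusion_correctness"
  | "jurisdiction_accuracy"
  | "completeness"
  | "calibration";

// One judge's verdict on one item: a 1-5 score plus one-sentence evidence per axis.
type JudgeVerdict = Record<Axis, { score: number; evidence: string }>;

// Per-judge item score: the mean of the five axis scores.
function judgeItemScore(verdict: JudgeVerdict): number {
  const scores = Object.values(verdict).map(v => v.score);
  return scores.reduce((sum, s) => sum + s, 0) / scores.length;
}
```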

2. The triple-LLM judge

No single model judges the candidate. We run the same answer past three independent judges.

Each judge gets the user question, the reference answer (written by a human lawyer for that item), the candidate answer, and the rubric, and returns a JSON score per axis with one-sentence evidence. Final item score = mean across judges. If two judges disagree by more than 1.0 on the same axis, the item is flagged and reviewed manually before it goes on the scorecard.
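
A sketch of that aggregation step, assuming each judge's output has already been parsed into a map of axis slug to 1–5 score. The names and shapes are hypothetical; only the mean-across-judges rule and the 1.0 disagreement threshold come from the text above.

```typescript
// Hypothetical shapes; only the aggregation rules come from the methodology.
type AxisScores = Record<string, number>; // axis slug -> 1-5 score from one judge

// Final item score: each judge's per-axis mean, averaged across the judges.
function finalItemScore(judges: AxisScores[]): number {
  const perJudge = judges.map(scores => {
    const values = Object.values(scores);
    return values.reduce((sum, s) => sum + s, 0) / values.length;
  });
  return perJudge.reduce((sum, s) => sum + s, 0) / perJudge.length;
}

// Axes where some pair of judges disagrees by more than 1.0; items with any
// such axis are flagged for manual review before they reach the scorecard.
function flaggedAxes(judges: AxisScores[], threshold = 1.0): string[] {
  const axes = Object.keys(judges[0] ?? {});
  return axes.filter(axis => {
    const values = judges.map(j => j[axis]);
    return Math.max(...values) - Math.min(...values) > threshold;
  });
}
```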

Position-bias swap. Each item is judged twice — once with the human reference in position A and the Advocat answer in position B, and once swapped. Both passes contribute. This guards against the well-known tendency of LLM judges to favour whichever answer they see first.
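
A sketch of the swap itself, with a deliberately stubbed-out judge call; the function names and signature are invented, and the per-axis detail is omitted for brevity.

```typescript
// Invented signature: scores the answers placed in position A and position B.
// (A real pass returns one 1-5 score per rubric axis, not a single number.)
async function judgeOnce(
  question: string,
  positionA: string,
  positionB: string
): Promise<{ a: number; b: number }> {
  throw new Error("illustrative stub, not the real judge call");
}

// Judge the item twice with the reference and candidate answers swapped, and
// average the candidate's score across both passes to cancel position bias.
async function positionSwappedScore(
  question: string,
  reference: string,
  candidate: string
): Promise<number> {
  const pass1 = await judgeOnce(question, reference, candidate); // candidate in B
  const pass2 = await judgeOnce(question, candidate, reference); // candidate in A
  return (pass1.b + pass2.a) / 2;
}
```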

3. The 30-item truth set

Three tiers (gold, statute, adversarial) of 10 items each. The exact items are not public: publishing them would let future model versions train on them, which would silently inflate scores. The structure of the set, however, is public:

Languages and jurisdictions

Items are written in Estonian, Finnish, Russian, and English in proportion to real Advocat traffic. A Russian-language question about a Finnish statute is a separate item from the English version — cross-language retrieval is the area where we've seen the most silent regression.
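
Given that structure (three tiers, four languages, three jurisdictions), a truth-set item record might be shaped roughly like this. The field names and the example are invented, since the real items are deliberately unpublished.

```typescript
// Illustrative shape only; the real items are not public.
type TruthSetItem = {
  id: string;
  tier: "gold" | "statute" | "adversarial";
  language: "et" | "fi" | "ru" | "en";
  jurisdiction: "EE" | "FI" | "EU";
  question: string;
  referenceAnswer: string; // written by a human lawyer for this item
};

// Invented example: the Russian-language question about a Finnish statute is a
// separate item from its English counterpart.
const exampleItem: TruthSetItem = {
  id: "statute-fi-012-ru",
  tier: "statute",
  language: "ru",
  jurisdiction: "FI",
  question: "(not published)",
  referenceAnswer: "(not published)",
};
```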

4. Which candidates are actually tested

The "Advocat v4" row in the scorecard is the live production endpoint — claude-proxy on the same Supabase project that serves real users. Same prompt, same RAG, same corrections memory, same adversarial critic. We do not test a "scorecard branch" that real users don't see.

The "Bare Sonnet" comparator is a raw Anthropic API call with a one-line "you are a legal assistant" system prompt — no RAG, no memory, no corrections. Both candidates run the same underlying model (claude-sonnet-4-20250514 on the 2026-05-14 run), so the delta isolates the pipeline contribution, not the model contribution.

5. Why we publish the failures

The 2024 Stanford RegLab "Hallucinating Law" paper measured commercial legal AI in a way the vendors themselves did not. Lexis+ AI hallucinated on 1 in 6 queries; the vendor had not published numbers, so users had no way to know. The paper made the numbers public, and trust moved.

We're choosing to do that to ourselves, monthly, before anyone else does it to us. The incentive: if the page is honest, "we got this wrong in May, fixed it in June" is a stronger trust signal than any benchmark we could pick. If the page were dishonest, the same methodology could be turned on us, and we'd lose that trust all at once.

Stated commitments. (1) Every published mistake stays on the page until a future scorecard run shows the fix held. (2) When the model regresses on a previously-fixed item, that's a separate entry, not a quiet edit. (3) The truth set grows over time. Items are added when real Advocat users flag a wrong answer we couldn't catch internally. Items are not removed.

6. Cost and cadence

7. How to challenge a score

Email support@advocat.ee with:

  1. Which item or metric.
  2. What you think the correct value is.
  3. The source (statute, case, or methodology objection).

If you're right, the scorecard is updated with the next run and your challenge is acknowledged in the changelog row of scorecard-data.json. We do not pay for challenges, but we do publish who caught what (with permission).
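
For illustration, an acknowledged challenge might appear in scorecard-data.json as an entry shaped something like the object below. Every field name and value here is invented; only the file name comes from the text above.

```typescript
// Invented example of a changelog entry; field names are hypothetical.
const changelogRow = {
  run: "2026-06-14",                     // scorecard run that picked up the fix
  kind: "challenge",
  target: "citation_validity / item 12", // which item or metric was challenged
  change: "reference citation corrected after review of the cited statute",
  credit: "challenger named with permission",
};
```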

8. Caveats we already know about

This is not an FDA approval, a legal certification, or a guarantee. A 3.0/5 rubric score does not mean "use this answer in court". Even our best-scoring items occasionally miss a statute § or a deadline. Advocat is a research tool for legal questions, not a substitute for a licensed lawyer in adversarial proceedings. The whole point of this page is to tell you exactly where the gaps are, in numbers, so you can decide whether the tool is good enough for your situation.

Methodology version 1.0 · last revised 2026-05-14 · raw eval runs and judge prompts live in data/eval/ and eval/runs/ in the project repo.