Evaluation Results
Accuracy measured across 213 basic rules questions; answers were human-verified against expected answers.
| Model | Input (RAG) tokens | Recall accuracy | Calc accuracy | Cost / query | Avg response time |
|---|---|---|---|---|---|
| gpt-5.4 | ~20k | 99%* | 79%* | — | — |
| gpt-5.4-mini | ~20k | 93% | 61% | — | — |
| gpt-5-mini | ~20k | 99% | 86% | — | — |
| gpt-4.1-mini | ~20k | 91% | 50% | — | — |
Accuracy from human-reviewed evals · Cost and timing from live production chat
* Estimated: gpt-5.4 was run only on the 24 questions that gpt-5.4-mini failed; the remaining 190 were assumed correct.
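For reference, a minimal sketch of how per-question grades roll up into the accuracy columns above. The `Grade` record and its fields are hypothetical; the real grades come from human review:

```python
from dataclasses import dataclass

@dataclass
class Grade:
    """Hypothetical per-question result from a human-reviewed eval run."""
    question_id: str
    kind: str      # "recall" or "calc"
    passed: bool

def accuracy(grades: list[Grade], kind: str) -> float:
    """Fraction of questions of the given kind answered correctly."""
    subset = [g for g in grades if g.kind == kind]
    return sum(g.passed for g in subset) / len(subset)

# Toy data standing in for the 213 human-verified grades.
grades = [
    Grade("q1", "recall", True),
    Grade("q2", "recall", True),
    Grade("q3", "calc", True),
    Grade("q4", "calc", False),
]
print(f"recall: {accuracy(grades, 'recall'):.0%}, calc: {accuracy(grades, 'calc'):.0%}")
```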
- The latest models answer recall questions accurately on this simple eval. Next step: evaluate against a harder eval, which is being built now.
- The models don't reliably answer calculation questions. Next step: analyze calculation errors to identify whether failures stem from retrieval, reasoning, or arithmetic, then target the weakest link; a sketch of that triage follows below.
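A minimal sketch of what that triage could look like, assuming a hypothetical per-failure record (none of these names or fields come from the actual pipeline):

```python
from dataclasses import dataclass

@dataclass
class CalcFailure:
    """Hypothetical record for one failed calculation question."""
    question_id: str
    required_rule: str        # rule text the answer depends on
    retrieved_context: str    # what RAG actually passed to the model
    model_formula: str        # setup the model used
    expected_formula: str     # setup a correct answer would use
    model_result: float
    expected_result: float

def triage(f: CalcFailure) -> str:
    """Attribute a calculation failure to the first stage that broke."""
    if f.required_rule not in f.retrieved_context:
        return "retrieval"    # the model never saw the rule it needed
    if f.model_formula != f.expected_formula:
        return "reasoning"    # saw the rule, set the problem up wrong
    if f.model_result != f.expected_result:
        return "arithmetic"   # right setup, wrong math
    return "other"

# Example: rule retrieved, setup correct, math wrong -> "arithmetic"
print(triage(CalcFailure("q17", "Move 2 spaces per action",
                         "...Move 2 spaces per action...",
                         "2 * 3", "2 * 3", 5.0, 6.0)))
```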
Production Usage
Token consumption, cost, and response time from live chat interactions.
[Charts: Tokens per Question · Cost per Question · Response Time per Question. Data from production chat interactions.]
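As a sketch of how per-query cost can be derived from the token counts in those charts (the per-token prices below are placeholders, not quoted rates for any model):

```python
# Hypothetical prices in $/token; substitute the real rates for each model.
PRICE_PER_INPUT_TOKEN = 0.15 / 1_000_000
PRICE_PER_OUTPUT_TOKEN = 0.60 / 1_000_000

def cost_per_query(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one query from its input and output token counts."""
    return (input_tokens * PRICE_PER_INPUT_TOKEN
            + output_tokens * PRICE_PER_OUTPUT_TOKEN)

# A ~20k-token RAG prompt plus a modest answer:
print(f"${cost_per_query(20_000, 500):.4f}")  # $0.0033 at these rates
```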