Evaluation Results

Accuracy measured across 213 basic-rules questions, each human-verified against the expected answer.

Model          Input (RAG) tokens   Recall accuracy   Calc accuracy
gpt-5.4        ~20k                 99%*              79%*
gpt-5.4-mini   ~20k                 93%               61%
gpt-5-mini     ~20k                 99%               86%
gpt-4.1-mini   ~20k                 91%               50%

Accuracy from human-reviewed evals · Cost and timing from live production chat
* Estimated: gpt-5.4 was run only on the 24 questions gpt-5.4-mini failed; the remaining 189 were assumed correct.
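The footnote's estimation method can be sketched as a small function. The 213-question total and 24-question rerun come from the source; the rerun_correct value in the example is hypothetical:

```python
# Sketch of the footnote's estimate: the stronger model is re-run only on
# the questions the weaker model failed; everything the weaker model
# already passed is assumed correct for the stronger model too.

def estimate_accuracy(total: int, rerun: int, rerun_correct: int) -> float:
    """Estimated accuracy when only `rerun` failed questions were re-asked.

    total         -- size of the full eval set (213 here)
    rerun         -- questions re-run on the stronger model (24 here)
    rerun_correct -- of those, how many the stronger model answered correctly
    """
    assumed_correct = total - rerun  # questions the weaker model passed
    return (assumed_correct + rerun_correct) / total

# Hypothetical example: 20 of the 24 re-run questions answered correctly.
print(round(estimate_accuracy(213, 24, 20), 3))  # 0.981
```

Note the built-in optimism: any question both models would fail in the same way, but the weaker model happened to pass, is silently counted as correct.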

  • The latest models answer recall questions accurately on this simple eval. Next step: evaluate against a harder eval, which is now being built.
  • The models don't reliably answer calculation questions. Next step: analyze the calculation errors to identify whether failures stem from retrieval, reasoning, or arithmetic, then target the weakest link.
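The error analysis above amounts to hand-labeling each failed calculation question with the stage that broke and counting. A minimal sketch, with entirely hypothetical failure records and stage labels:

```python
from collections import Counter

# Hypothetical failure records for the calculation-error analysis.
# Each failure is hand-labeled with the stage that broke:
#   "retrieval"  -- wrong rules text was fetched into the context
#   "reasoning"  -- right text, but the calculation was set up wrong
#   "arithmetic" -- right setup, but the final number was wrong
failures = [
    {"question_id": 7,  "stage": "retrieval"},
    {"question_id": 12, "stage": "arithmetic"},
    {"question_id": 31, "stage": "reasoning"},
    {"question_id": 40, "stage": "arithmetic"},
]

by_stage = Counter(f["stage"] for f in failures)
weakest_link, count = by_stage.most_common(1)[0]
print(f"most common failure stage: {weakest_link} ({count} of {len(failures)})")
```

Retrieval failures point at chunking or ranking fixes; reasoning and arithmetic failures point at prompting or tool use instead, so the split determines which fix to try first.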

Production Usage

Token consumption, cost, and response time from live chat interactions.

[Charts: Tokens per Question · Cost per Question · Response Time per Question]

Data from production chat interactions
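The per-query cost charted above follows directly from token counts and per-token prices. A sketch with hypothetical prices (substitute each model's real rates):

```python
# Cost per query from token counts. Prices are hypothetical placeholders,
# expressed in USD per million tokens; real rates vary by model.
PRICE_PER_MTOK = {"input": 0.25, "output": 2.00}  # assumed, not real rates

def cost_per_query(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * PRICE_PER_MTOK["input"]
            + output_tokens * PRICE_PER_MTOK["output"]) / 1_000_000

# A ~20k-token RAG prompt with a 500-token answer:
print(f"${cost_per_query(20_000, 500):.4f}")  # $0.0060
```

With a ~20k-token RAG context, input tokens dominate even at a low input rate, so trimming retrieved context is the most direct cost lever.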