Evaluation Results
| Date | Model (Base / Fine-tuned) | Eval | Judge | Question Type | Pass | Fail | Needs Review | Total |
|---|---|---|---|---|---|---|---|---|
| 2026-02-18 | gpt-4.1-mini | asl-evals-combined | Human Review | Calc | 14 (50%) | 14 (50%) | 0 (0%) | 28 |
| 2026-02-18 | gpt-4.1-mini | asl-evals-combined | Human Review | Recall | 170 (91%) | 16 (9%) | 0 (0%) | 186 |
Column Notes:
- Model: Name of the base or fine-tuned model evaluated
- Eval: Evaluation file corresponding to a section of the rulebook
- Question Type: Calc = calculation questions (rule recall plus computation); Recall = pure rule-recall questions
- Note: Only Human Review results are shown; AI Judge results are hidden
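Each percentage in the table is that outcome's count divided by the row total, rounded to the nearest whole percent. A minimal sketch of that calculation (the `rates` helper is hypothetical, not part of the eval tooling):

```python
def rates(pass_n, fail_n, review_n):
    """Compute pass/fail/needs-review percentages for one eval row."""
    total = pass_n + fail_n + review_n
    return {name: round(100 * n / total)
            for name, n in [("pass", pass_n), ("fail", fail_n), ("review", review_n)]}

# Calc row: 14 pass, 14 fail, 0 needs review out of 28
print(rates(14, 14, 0))   # {'pass': 50, 'fail': 50, 'review': 0}

# Recall row: 170 pass, 16 fail, 0 needs review out of 186
print(rates(170, 16, 0))  # {'pass': 91, 'fail': 9, 'review': 0}
```

This reproduces the table's figures, e.g. 170/186 = 91.4%, shown as 91%.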