Evaluation Results

DATE        MODEL (Base / Fine-tuned)  EVAL                JUDGE         QUESTION TYPE  PASS       FAIL      NEEDS REVIEW  TOTAL
2026-02-18  gpt-4.1-mini               asl-evals-combined  Human Review  Calc           14 (50%)   14 (50%)  0 (0%)        28
2026-02-18  gpt-4.1-mini               asl-evals-combined  Human Review  Recall         170 (91%)  16 (9%)   0 (0%)        186

Column Notes:

  • Model: Name of the base model or, where applicable, the fine-tuned model
  • Eval: Evaluation file corresponding to a section of the rulebook
  • Question Type: Calc = calculation questions (rule recall plus computation); Recall = pure rule-recall questions
  • Note: Only Human Review results are shown (AI Judge results are hidden)
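
The percentages in the PASS, FAIL, and NEEDS REVIEW columns are consistent with each count divided by TOTAL and rounded to the nearest whole percent. The sketch below reproduces the table's cells under that assumption; the function name `format_cell` and the `results` dictionary are hypothetical and are not part of the evaluation tooling.

```python
def format_cell(count: int, total: int) -> str:
    """Format a count as 'N (P%)', with P rounded to a whole percent.

    Assumption: the percentage is count / total; the rounding rule used by
    the original dashboard is not documented.
    """
    if total == 0:
        return f"{count} (n/a)"
    return f"{count} ({round(100 * count / total)}%)"


# Counts copied from the Human Review rows in the table.
results = {
    "Calc": {"pass": 14, "fail": 14, "needs_review": 0},
    "Recall": {"pass": 170, "fail": 16, "needs_review": 0},
}

for question_type, counts in results.items():
    total = sum(counts.values())  # TOTAL = PASS + FAIL + NEEDS REVIEW
    cells = [format_cell(c, total) for c in counts.values()]
    print(question_type, *cells, total)
# Calc 14 (50%) 14 (50%) 0 (0%) 28
# Recall 170 (91%) 16 (9%) 0 (0%) 186
```

Because the Calc row has only 28 questions, a single changed verdict moves its pass rate by about 3.6 percentage points, whereas one verdict in the 186-question Recall row moves it by about 0.5.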