Evaluation Results

DATE        MODEL                      EVAL                        JUDGE         PASS      FAIL      NEEDS REVIEW  TOTAL
2026-01-14  gpt-4o / asl-formatted-v2  asl-evals-section-a-closed  AI Judge      46 (70%)  14 (21%)  6 (9%)        66
2026-01-14  gpt-4o / asl-formatted-v2  asl-evals-section-a-closed  Human Review  51 (77%)  15 (23%)  0 (0%)        66
2026-01-13  gpt-4o / asl-formatted-v2  asl-evals-section-b-closed  AI Judge      26 (70%)  2 (5%)    9 (24%)       37
2026-01-13  gpt-4o / asl-formatted-v2  asl-evals-section-b-closed  Human Review  34 (92%)  3 (8%)    0 (0%)        37

Column Notes:

  • Model: Base model / fine-tuned model name
  • Eval: Evaluation file corresponding to a section of the rulebook
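
The percentages in the results table appear to be each count divided by the row total, rounded to the nearest whole percent. The sketch below rechecks that arithmetic under that assumption. The function name `percentages` and the `ROWS` layout are hypothetical and not part of any evaluation tooling; the counts are copied from the table.

```python
# Counts copied from the results table, keyed by (eval, judge).
# Each value is (pass, fail, needs_review, total).
ROWS = {
    ("asl-evals-section-a-closed", "AI Judge"): (46, 14, 6, 66),
    ("asl-evals-section-a-closed", "Human Review"): (51, 15, 0, 66),
    ("asl-evals-section-b-closed", "AI Judge"): (26, 2, 9, 37),
    ("asl-evals-section-b-closed", "Human Review"): (34, 3, 0, 37),
}


def percentages(passed, failed, needs_review, total):
    """Return (pass, fail, needs_review) as whole-number percentages of total.

    Assumes the three outcome counts partition the total; raises otherwise.
    """
    if passed + failed + needs_review != total:
        raise ValueError("outcome counts do not add up to the total")
    return tuple(round(100 * n / total) for n in (passed, failed, needs_review))


if __name__ == "__main__":
    for (eval_name, judge), counts in ROWS.items():
        p, f, r = percentages(*counts)
        print(f"{eval_name} / {judge}: pass {p}%, fail {f}%, needs review {r}%")
```

Because each percentage is rounded independently, a row's percentages need not sum to exactly 100; the section-b AI Judge row, for instance, sums to 99.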