TRAIL: Trace Reasoning and Agentic Issue Localization Leaderboard
TRAIL-SWE Leaderboard
| Rank | Model | Joint Accuracy | Categorical F1 | Location Accuracy | Date |
|---|---|---|---|---|---|
| 1 | Llama-4-Maverick-17B-128E-Instruct | 0.050 | 0.148 | 0.238 | 2025-05-14 |
TRAIL-GAIA Leaderboard
| Rank | Model | Joint Accuracy | Categorical F1 | Location Accuracy | Date |
|---|---|---|---|---|---|
| 1 | Gemini-2.5-Pro-Preview-05-06 | 0.183 | 0.389 | 0.546 | 2025-05-14 |
| 2 | Gemini-2.5-Flash-Preview-04-17 | 0.100 | 0.337 | 0.372 | 2025-05-14 |
| 3 | OpenAI o3 | 0.092 | 0.296 | 0.535 | 2025-05-14 |
| 4 | Anthropic Claude-3.7-Sonnet | 0.047 | 0.254 | 0.204 | 2025-05-14 |
| 5 | GPT-4.1 | 0.028 | 0.218 | 0.107 | 2025-05-14 |
| 6 | OpenAI o1 | 0.013 | 0.138 | 0.040 | 2025-05-14 |
| 7 | Llama-4-Maverick-17B-128E-Instruct | 0.000 | 0.122 | 0.023 | 2025-05-14 |
| 8 | Llama-4-Scout-17B-16E-Instruct | 0.000 | 0.041 | 0.000 | 2025-05-14 |
Submit Your Results as ZIP
See instructions in README before submitting.
Model Performance Leaderboard
This Hugging Face Space hosts a leaderboard comparing model performance across the metrics of the TRAIL dataset.
Features
- Submit Your Answers: Run your model on the TRAIL dataset and submit your results.
- Leaderboard: View how your submissions are ranked.
Instructions
- Please refer to our GitHub repository at https://github.com/patronus-ai/trail-benchmark for step-by-step instructions on how to run your model with the TRAIL dataset.
- Please upload a zip file containing your model outputs. The zip file should contain:
- One or more directories with model outputs
- Each directory should contain JSON files with the model's predictions
- Directory names should indicate the split (GAIA_ or SWE_)
- Once the evaluation is complete, we'll upload the scores (this process will soon be automated).
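The upload layout described above can be sketched as follows. The directory names (`GAIA_myrun`, `SWE_myrun`), file name (`trace_0001.json`), and placeholder JSON body are illustrative assumptions, not the official schema; see the GitHub repository for the exact prediction format.

```python
import json
import zipfile
from pathlib import Path

# Build the expected zip layout: one directory per split, each holding
# JSON prediction files. Names below are hypothetical examples.
root = Path("submission")
for split_dir in ("GAIA_myrun", "SWE_myrun"):
    d = root / split_dir
    d.mkdir(parents=True, exist_ok=True)
    # Placeholder prediction file; the real JSON structure is defined
    # in the patronus-ai/trail-benchmark repository.
    (d / "trace_0001.json").write_text(json.dumps({"errors": []}))

# Zip the directories so the split-prefixed folders sit at the archive root.
with zipfile.ZipFile("trail_submission.zip", "w") as zf:
    for f in sorted(root.rglob("*.json")):
        zf.write(f, f.relative_to(root))

print(sorted(zipfile.ZipFile("trail_submission.zip").namelist()))
# → ['GAIA_myrun/trace_0001.json', 'SWE_myrun/trace_0001.json']
```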
Benchmarking on TRAIL
TRAIL (Trace Reasoning and Agentic Issue Localization) is a benchmark dataset of 148 annotated AI agent execution traces containing 841 errors across reasoning, execution, and planning categories. Built from real-world software engineering and information-retrieval tasks, it challenges even state-of-the-art LLMs, with the best model achieving only 11% accuracy, highlighting how difficult trace debugging is for complex agent workflows.
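As a toy illustration of how the leaderboard columns relate, the sketch below assumes (this is our reading, not a definition from the benchmark) that a "joint" hit requires both the predicted error category and its predicted location to match the annotation, so joint accuracy can never exceed location accuracy. The trace labels and step names are made up for the example.

```python
# Hypothetical (category, location) predictions vs. gold annotations.
preds = [("planning", "step_3"), ("execution", "step_7")]
gold = [("planning", "step_3"), ("reasoning", "step_7")]

n = len(gold)
loc_acc = sum(p[1] == g[1] for p, g in zip(preds, gold)) / n   # location only
cat_acc = sum(p[0] == g[0] for p, g in zip(preds, gold)) / n   # category only
joint = sum(p == g for p, g in zip(preds, gold)) / n           # both must match

print(loc_acc, cat_acc, joint)  # → 1.0 0.5 0.5
```

Note the leaderboard reports Categorical F1 rather than plain category accuracy; the point of the sketch is only that the joint score is bounded by each component score.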
License
This project is open source and available under the MIT license.