TRAIL: Trace Reasoning and Agentic Issue Localization Leaderboard
TRAIL-SWE Leaderboard
| Rank | Model | Joint Accuracy | Categorical F1 | Location Accuracy | Date |
|---|---|---|---|---|---|
| 1 | Llama-4-Maverick-17B-128E-Instruct | 0.050 | 0.148 | 0.238 | 2025-05-14 |
TRAIL-GAIA Leaderboard
| Rank | Model | Joint Accuracy | Categorical F1 | Location Accuracy | Date |
|---|---|---|---|---|---|
| 1 | Gemini-2.5-Pro-Preview-05-06 | 0.183 | 0.389 | 0.546 | 2025-05-14 |
| 2 | Gemini-2.5-Flash-Preview-04-17 | 0.100 | 0.337 | 0.372 | 2025-05-14 |
| 3 | OpenAI o3 | 0.092 | 0.296 | 0.535 | 2025-05-14 |
| 4 | Anthropic Claude-3.7-Sonnet | 0.047 | 0.254 | 0.204 | 2025-05-14 |
| 5 | GPT-4.1 | 0.028 | 0.218 | 0.107 | 2025-05-14 |
| 6 | OpenAI o1 | 0.013 | 0.138 | 0.040 | 2025-05-14 |
| 7 | Llama-4-Maverick-17B-128E-Instruct | 0.000 | 0.122 | 0.023 | 2025-05-14 |
| 8 | Llama-4-Scout-17B-16E-Instruct | 0.000 | 0.041 | 0.000 | 2025-05-14 |
Submit Your Results as ZIP
See instructions in README before submitting.
Model Performance Leaderboard
This Hugging Face Space hosts a leaderboard comparing model performance across the metrics of the TRAIL dataset.
Features
- Submit Your Answers: Run your model on the TRAIL dataset and submit your results.
- Leaderboard: View how your submissions are ranked.
Instructions
- Please refer to our GitHub repository at https://github.com/patronus-ai/trail-benchmark for step-by-step instructions on how to run your model with the TRAIL dataset.
- Please upload a zip file containing your model outputs. The zip file should contain:
- One or more directories with model outputs
- Each directory should contain JSON files with the model's predictions
- Directory names should indicate the split (GAIA_ or SWE_)
- Once the evaluation is complete, we'll upload the scores (this process will soon be automated).
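The upload layout described above can be sketched as follows. The directory names (`GAIA_myrun`, `SWE_myrun`), file name (`trace_0001.json`), and placeholder JSON body are illustrative assumptions, not the official schema; see the GitHub repository for the exact prediction format.

```python
import json
import zipfile
from pathlib import Path

# Build the expected zip layout: one directory per split, each holding
# JSON prediction files. Names below are hypothetical examples.
root = Path("submission")
for split_dir in ("GAIA_myrun", "SWE_myrun"):
    d = root / split_dir
    d.mkdir(parents=True, exist_ok=True)
    # Placeholder prediction file; the real JSON structure is defined
    # in the patronus-ai/trail-benchmark repository.
    (d / "trace_0001.json").write_text(json.dumps({"errors": []}))

# Zip the directories so the split-prefixed folders sit at the archive root.
with zipfile.ZipFile("trail_submission.zip", "w") as zf:
    for f in sorted(root.rglob("*.json")):
        zf.write(f, f.relative_to(root))

print(sorted(zipfile.ZipFile("trail_submission.zip").namelist()))
# → ['GAIA_myrun/trace_0001.json', 'SWE_myrun/trace_0001.json']
```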
Benchmarking on TRAIL
TRAIL (Trace Reasoning and Agentic Issue Localization) is a benchmark dataset of 148 annotated AI agent execution traces containing 841 errors across reasoning, execution, and planning categories. Built from real-world software engineering and information-retrieval tasks, it challenges even state-of-the-art LLMs, with the best model achieving only 11% accuracy, highlighting how difficult trace debugging is for complex agent workflows.
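As a toy illustration of how the leaderboard columns relate, the sketch below assumes (this is our reading, not a definition from the benchmark) that a "joint" hit requires both the predicted error category and its predicted location to match the annotation, so joint accuracy can never exceed location accuracy. The trace labels and step names are made up for the example.

```python
# Hypothetical (category, location) predictions vs. gold annotations.
preds = [("planning", "step_3"), ("execution", "step_7")]
gold = [("planning", "step_3"), ("reasoning", "step_7")]

n = len(gold)
loc_acc = sum(p[1] == g[1] for p, g in zip(preds, gold)) / n   # location only
cat_acc = sum(p[0] == g[0] for p, g in zip(preds, gold)) / n   # category only
joint = sum(p == g for p, g in zip(preds, gold)) / n           # both must match

print(loc_acc, cat_acc, joint)  # → 1.0 0.5 0.5
```

Note the leaderboard reports Categorical F1 rather than plain category accuracy; the point of the sketch is only that the joint score is bounded by each component score.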
License
This project is open source and available under the MIT license.