The AI Colosseum

Welcome to The AI Colosseum, an experimental arena where we test modern AI models against classic NP-hard optimization problems from Operations Research (OR). Our goal is to push these models beyond simple reasoning tasks and into complex mathematical decision-making.

TSP

The Traveling Salesman Problem (TSP) serves as our first benchmark, offering a clear metric for evaluating how well AI can handle combinatorial optimization. As the problem size increases, the number of possible tours grows factorially, so exhaustive search quickly becomes impractical. Can AI rise to the challenge?
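To make the growth in difficulty concrete, here is a minimal sketch of an exhaustive TSP solver together with a count of distinct tours. It assumes a symmetric instance given as a distance matrix; the function names and the 4-node example are hypothetical and are not taken from the benchmark itself.

```python
import itertools
import math


def tour_length(tour, dist):
    """Length of the closed tour, including the edge back to the start."""
    n = len(tour)
    return sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))


def brute_force_tsp(dist):
    """Return (best_tour, best_length) by trying every tour that starts at node 0."""
    n = len(dist)
    best_tour, best_len = None, math.inf
    for perm in itertools.permutations(range(1, n)):
        tour = (0,) + perm
        length = tour_length(tour, dist)
        if length < best_len:
            best_tour, best_len = tour, length
    return best_tour, best_len


def symmetric_tour_count(n):
    """Number of distinct tours on n >= 3 nodes when direction does not matter."""
    return math.factorial(n - 1) // 2


# Hypothetical instance: the four corners of a 3 x 4 rectangle.
points = [(0, 0), (3, 0), (3, 4), (0, 4)]
dist = [[math.dist(p, q) for q in points] for p in points]

best_tour, best_len = brute_force_tsp(dist)
print(best_tour, best_len)       # (0, 1, 2, 3) 14.0
print(symmetric_tour_count(10))  # 181440
print(symmetric_tour_count(20))  # 60822550204416000
```

Exhaustive search is still feasible at 10 nodes, but at 20 nodes the number of tours is far beyond what enumeration can cover, which is why exact solvers rely on smarter methods such as dynamic programming or branch and bound.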

TSP - 20 nodes up to 3 shots

The 20-node TSP is our current best benchmark for evaluating AI in OR. Unlike in the 10-node case, no model has yet reached the optimal solution.

| Model | Optimal | Opt. Gap (%) | Runtime (s) | Shots | Link |
|---|---|---|---|---|---|
| OpenAI o3-mini-high | N | 9.76 | 542 | 3 | OpenAI chatGPT |
| Google Gemini 2.0 F. Exp. 01-21 | N | 20.76 | 164 | 3 | Google AI Studio |
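The page does not define the "Opt. Gap (%)" column. A common convention, assumed here, is the relative excess of the model's tour length over the optimal tour length: gap = 100 × (L_model − L_opt) / L_opt. A minimal sketch under that assumption, with a hypothetical function name and made-up lengths:

```python
def optimality_gap_percent(model_length, optimal_length):
    """Relative excess of a tour over the optimum, in percent (assumed definition)."""
    if optimal_length <= 0:
        raise ValueError("optimal_length must be positive")
    return 100.0 * (model_length - optimal_length) / optimal_length


# Made-up lengths for illustration only, not benchmark data.
print(optimality_gap_percent(110.0, 100.0))  # 10.0
print(optimality_gap_percent(100.0, 100.0))  # 0.0
```

Under this definition a gap of 0.00 corresponds to an optimal tour, which is consistent with the "Optimal" column reading Y exactly when the gap is zero.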

TSP - 10 nodes up to 3 shots

The 10-node TSP served as a preliminary test to determine which models would be evaluated at the next level (20 nodes).

| Model | Optimal | Opt. Gap (%) | Runtime (s) | Shots | Link |
|---|---|---|---|---|---|
| Google Gemini 2.0 F. Exp. 01-21 | Y | 0.00 | 164 | 3 | Google AI Studio |
| OpenAI o3-mini-high | Y | 0.00 | 467 | 3 | OpenAI chatGPT |
| X Grok 3 beta Think | N | 12.99 | 744 | 3 | X AI Grok |
| X Grok 2 | N | 17.36 | 13 | 3 | X AI Grok |
| Ai2 Llama Tülu 3 405B | N | 17.36 | 74 | 3 | Ai2 Playground |
| OpenAI o1 | N | 18.41 | 568 | 3 | OpenAI chatGPT |
| Anthropic Claude 3.5 Sonnet | N | 19.90 | 41 | 3 | Anthropic Claude |
| Qwen QwQ2.5-Max-Preview Think | N | 19.95 | 765 | 3 | Qwen Chat |
| DeepSeek R1 | N | 26.47 | 614 | 3 | DeepSeek R1 |
| Anthropic Claude 3.7 Sonnet | N | 44.53 | 27 | 3 | Anthropic Claude |
| Mistral Le Chat | N | 58.56 | 102 | 3 | Mistral Le Chat |
| groq Llama 3.3 70B SpecDeck 8k | N | 95.20 | 3 | 3 | Groq Playground |
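The page does not spell out how "up to 3 shots" is scored. One plausible reading, assumed here and not confirmed by the source, is that a model gets up to three attempts and its shortest valid tour counts. The sketch below checks that each attempt visits every node exactly once and then keeps the best one; all names and the example data are hypothetical.

```python
import math


def is_valid_tour(tour, n):
    """True if the tour visits each of the nodes 0..n-1 exactly once."""
    return len(tour) == n and sorted(tour) == list(range(n))


def tour_length(tour, dist):
    """Length of the closed tour, including the edge back to the start."""
    n = len(tour)
    return sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))


def best_of_shots(attempts, dist):
    """Return (tour, length) for the shortest valid attempt, or None if none is valid."""
    n = len(dist)
    scored = [(tour_length(t, dist), list(t)) for t in attempts if is_valid_tour(t, n)]
    if not scored:
        return None
    length, tour = min(scored)
    return tour, length


# Hypothetical instance: the four corners of a 3 x 4 rectangle.
points = [(0, 0), (3, 0), (3, 4), (0, 4)]
dist = [[math.dist(p, q) for q in points] for p in points]

# Three made-up attempts: a crossing tour, an invalid tour (node 1 repeated,
# node 2 missing), and the perimeter tour.
attempts = [[0, 2, 1, 3], [0, 1, 1, 3], [0, 1, 2, 3]]
print(best_of_shots(attempts, dist))  # ([0, 1, 2, 3], 14.0)
```

Validating before scoring matters because a language model can return a sequence that skips or repeats nodes, and such a sequence may look shorter than any legitimate tour.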