Every Sunday, NPR’s Will Shortz, renowned for his work as The New York Times’ crossword editor, hosts the long-standing segment known as the Sunday Puzzle. While designed to be solvable with basic knowledge, these brainteasers still pose a challenge even for seasoned participants. Because they combine accessibility with genuine difficulty, some researchers believe the Sunday Puzzle provides a unique opportunity to test AI’s problem-solving capabilities.
A recent study conducted by researchers from Wellesley College, Oberlin College, the University of Texas at Austin, Northeastern University, Charles University, and the startup Cursor aimed to assess AI models using puzzles from the segment. Their findings revealed unexpected behaviors in reasoning models, such as OpenAI’s o1, including instances where the models seemingly “give up” and produce incorrect answers they recognize as wrong. “We sought to create a benchmark featuring problems understandable with general knowledge,” explained Arjun Guha, a computer science professor at Northeastern and co-author of the study.
Currently, AI benchmarking faces significant challenges. Many widely used tests measure proficiency in advanced academic subjects like PhD-level math and science, which do not reflect the needs of everyday users. Meanwhile, a number of existing benchmarks are nearing saturation, meaning AI models are achieving near-perfect scores, which reduces their usefulness for measuring progress.
The advantage of using Sunday Puzzle questions as a benchmark lies in their accessibility and structure. They don’t require specialized knowledge, and AI models cannot rely on memorization to solve them, making the challenges more suitable for evaluating reasoning abilities. “What makes these puzzles difficult is that progress is often slow until a breakthrough moment when everything suddenly makes sense,” Guha noted. “This process demands both insight and elimination.”
Of course, no benchmark is without limitations. The Sunday Puzzle is U.S.-centric and available only in English. Since the questions are publicly accessible, AI models trained on past puzzles might have encountered them before, although Guha states there’s no strong evidence of this happening. “New puzzles are introduced weekly, ensuring that the latest questions remain novel,” he added. “We plan to keep the benchmark updated and monitor how AI performance evolves.”
The benchmark consists of approximately 600 Sunday Puzzle riddles, with reasoning models like o1 and DeepSeek’s R1 leading the performance charts. These models verify their own responses before submitting answers, reducing errors. However, this thorough reasoning process comes at the cost of increased response time, ranging from a few extra seconds to minutes.
Some models, such as R1, occasionally acknowledge their struggles by explicitly stating phrases like “I give up” before providing a random incorrect response—behavior that closely resembles human frustration. Additionally, the models sometimes retract incorrect answers in an attempt to refine them, only to fail again. Some get stuck in endless loops of “thinking,” generate nonsensical justifications, or even second-guess correct answers for no clear reason.
“On difficult problems, R1 literally expresses ‘frustration,’” Guha noted. “It’s amusing to see a model mimic human-like emotions, but it raises questions about how frustration impacts reasoning quality.”
At present, the top-performing model on the benchmark is OpenAI’s o1, which achieved a 59% accuracy rate. This is followed by o3-mini, when set to high reasoning effort, at 47%, while R1 trailed at 35%. The research team now aims to expand their study to additional reasoning models, hoping to pinpoint areas for improvement.
“Effective reasoning doesn’t require a PhD, so benchmarks should not demand highly specialized knowledge,” Guha emphasized. “A more inclusive benchmark allows more researchers to analyze results, leading to better AI development. Since advanced AI is increasingly being integrated into everyday life, it’s crucial for the public to understand what these models can and cannot do.”