Alex Heyman
@alexheyman.bsky.social
PhD candidate AI/ML researcher at York University, ON, CA | they/them
Given that both OpenAI and DeepSeek have found diminishing returns on accuracy from chain-of-thought length, we advise focusing less on scaling up CoT and more on architecture/training innovations to combat the performance and reliability issues our work helps reveal.
(10/10)
February 13, 2025 at 6:14 PM
Reasoning models seem to have improved less at possibility-space exploration (important for low-greedy-score problems) than at linear step-by-step reasoning, and their lack of a zone of complete reliability here is concerning for efforts to deploy them fully autonomously.
(9/10)
February 13, 2025 at 6:14 PM
For both reasoning models, error rates are generally higher for the “friends” semantic frame than for the explicit math frame, perhaps due to math-focused training. For the standard LLMs, frame effects are substantial and show consistent trends within a given model but vary between models.
(8/10)
February 13, 2025 at 6:14 PM
However, the reasoning models (like the standard LLMs) fail to reach 0% error even in the simple domain of 4-vertex 2-coloring, and on colorable problems their error rates rise as greedy score decreases, even more sharply than the standard LLMs’ do.
(7/10)
February 13, 2025 at 6:14 PM
In addition to o1-mini and DeepSeek-R1, we test the standard LLMs Llama 3.1 405B, GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro. To the reasoning models’ credit, the standard LLMs fare much worse, exceeding 60% error on the most difficult problem types in all frames.
(6/10)
February 13, 2025 at 6:14 PM
We also present our problems in multiple semantic frames across prompts – explicitly in terms of a graph, or in terms of cities connected by highways or networks of friendships, or with a small phrasing change – motivated by an interest in probing models’ semantic biases.
(5/10)
February 13, 2025 at 6:14 PM
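For illustration, here is how the same constraint might be phrased in the math and friendship frames (illustrative wording only, not our exact prompts):

```python
# Hypothetical frame templates; the actual prompt wording used in the study differs.
FRAME_TEMPLATES = {
    "math": (
        "Vertices: {vertices}. Edges: {edges}. Colors: {colors}. "
        "Assign each vertex a color so that no edge joins two vertices "
        "of the same color, or state that this is impossible."
    ),
    "friends": (
        "People: {vertices}. Friendships: {edges}. Activities: {colors}. "
        "Assign each person an activity so that no two friends share an "
        "activity, or state that this is impossible."
    ),
}
```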
We further categorize colorable problems by how feasible they are to solve without exploring multiple coloring possibilities. The success rate of a simple, deliberately flawed randomized greedy algorithm over 10k trials yields a “greedy score” for each problem.
(4/10)
February 13, 2025 at 6:14 PM
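A rough sketch of how such a greedy score can be computed; the exact flaw in our greedy colorer is simplified here to "pick a random allowed color, no backtracking", and the names are illustrative:

```python
import random

def greedy_score(n_vertices, edges, n_colors, trials=10_000):
    """Estimate how often a simple randomized greedy colorer solves a problem.

    edges: list of (u, v) pairs over vertices 0..n_vertices-1.
    Illustrative sketch only; the study's exact greedy variant may differ.
    """
    adj = {v: set() for v in range(n_vertices)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    def one_trial():
        coloring = {}
        for v in random.sample(range(n_vertices), n_vertices):  # random vertex order
            used = {coloring[u] for u in adj[v] if u in coloring}
            free = [c for c in range(n_colors) if c not in used]
            if not free:
                return False  # deliberately no backtracking: the trial simply fails
            coloring[v] = random.choice(free)
        return True

    return sum(one_trial() for _ in range(trials)) / trials
```

Under this sketch, a colorable graph such as the 4-cycle with 2 colors scores below 1, since the colorer can paint two opposite corners with different colors and then dead-end on a shared neighbor.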
Given an undirected graph and a set of colors, we require models to color the vertices so that no two vertices sharing an edge also share a color, or to report that this is impossible. We generate a dataset of ~4k problems with vertex counts from 4 to 8 and color counts from 2 to 4.
(3/10)
February 13, 2025 at 6:14 PM
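The validity condition is easy to check programmatically; a minimal checker sketch (illustrative, not our evaluation code):

```python
def is_valid_coloring(edges, coloring, n_colors):
    """Return True iff every edge endpoint has an in-range color and no
    two adjacent vertices share a color."""
    for u, v in edges:
        if u not in coloring or v not in coloring:
            return False
        if coloring[u] not in range(n_colors) or coloring[v] not in range(n_colors):
            return False
        if coloring[u] == coloring[v]:
            return False
    return True

# A 4-cycle is 2-colorable...
print(is_valid_coloring([(0, 1), (1, 2), (2, 3), (3, 0)], {0: 0, 1: 1, 2: 0, 3: 1}, 2))  # True
# ...but a triangle is not.
print(is_valid_coloring([(0, 1), (1, 2), (2, 0)], {0: 0, 1: 1, 2: 0}, 2))  # False
```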