Alex Makelov
amakelov.bsky.social
Mechanistic interpretability
Creator of https://github.com/amakelov/mandala
prev. Harvard/MIT
machine learning, theoretical computer science, competition math.
Finally, it realizes and tries to fix the off-by-a-factor-of-6 issue. It writes a little essay giving what mathematicians would call a "moral" argument for why everything is OK. Pretty close!
December 5, 2024 at 9:20 PM
Then it counts these triples. Unfortunately, it counts ordered triples, which overestimates the number of unordered triples (what we care about) by about a factor of 6. Then it proceeds to the key step, lower-bounding the average number of representations:
December 5, 2024 at 9:19 PM
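A quick arithmetic sketch of the ordered-vs-unordered gap and the pigeonhole step. It assumes (the precise statement is only in the pic) that a "way" is an unordered triple of positive integers, repetition allowed, each at most 18,000. `pigeonhole_bound` is a hypothetical helper, not o1's output.

```python
from math import comb

def pigeonhole_bound(cap: int) -> tuple[int, int, int, int]:
    """Hypothetical helper. Assumes entries are positive integers in 1..cap
    and that a "way" is an unordered triple (repetition allowed)."""
    ordered = cap ** 3                 # ordered triples (a, b, c)
    unordered = comb(cap + 2, 3)       # multisets a <= b <= c
    num_sums = 3 * cap ** 2 - 2        # candidate sums of squares: 3 .. 3*cap^2
    # Pigeonhole: some candidate sum is hit by at least ceil(unordered / num_sums) triples.
    guaranteed = -(-unordered // num_sums)
    return ordered, unordered, num_sums, guaranteed

ordered, unordered, num_sums, guaranteed = pigeonhole_bound(18_000)
print(ordered / unordered)   # just under 6
print(guaranteed)            # 1001
print(unordered / 10**9)     # dividing by one billion instead: about 972
```

Under this reading, the denominator matters: dividing by the number of candidate sums (971,999,998) guarantees some sum with at least 1001 unordered representations, while dividing by a full billion gives only about 972. A different reading of the statement (e.g. distinct entries) changes these numbers.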
So how does o1 do? Still not perfect, but it gets the overall steps right! It goes for a direct pigeonhole argument, and eventually figures out that if we look at triples of numbers that are each at most 18,000, the sum of their squares is always less than 1,000,000,000 (it is at most 3 × 18,000² = 972,000,000):
December 5, 2024 at 9:18 PM
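A small check of o1's bound, assuming each entry is an integer between 1 and 18,000, so the largest possible sum of squares is 3 × 18,000². The second part, which finds the largest cap that still keeps every sum below one billion, is an extra for comparison and not part of o1's answer.

```python
import math

LIMIT = 1_000_000_000
CAP = 18_000  # o1's cap on each entry of the triple

# With entries in 1..CAP, the sum of squares is largest at (CAP, CAP, CAP).
largest_sum = 3 * CAP ** 2
print(largest_sum, largest_sum < LIMIT)  # 972000000 True

# Largest cap N with 3 * N^2 < LIMIT, i.e. N^2 <= (LIMIT - 1) // 3.
best_cap = math.isqrt((LIMIT - 1) // 3)
print(best_cap, 3 * best_cap ** 2)  # 18257 999954147
```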
Similarly, o1-mini (and o1-preview, from what I remember; it's no longer available in chat) recalls the asymptotic statement and spends more time discussing it, but likewise proves nothing about the constant.
December 5, 2024 at 9:18 PM
So how do LLMs do on this problem? 4o spits out a bunch of related facts and confidently asserts the (correct) answer without justification. Importantly, it states that the number of representations grows as sqrt(n) asymptotically, which is true, but the constant is decisive: sqrt(10^9) is only about 31,623.
December 5, 2024 at 9:17 PM
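To see why the constant matters, here is a rough numerical sketch (a heuristic, not a proof). Assuming again that representations are unordered triples a ≤ b ≤ c of positive integers, counting lattice points in the positive octant of a ball of radius sqrt(N) and dividing by 6 orderings suggests the average count over n ≤ N is about (π/36)·sqrt(N), ignoring boundary and repeated-entry effects. `average_reps` and `heuristic_average` are hypothetical helper names.

```python
import math

def average_reps(N: int) -> float:
    """Hypothetical helper: average, over n in 1..N, of the number of unordered
    triples a <= b <= c of positive integers with a^2 + b^2 + c^2 == n."""
    total = 0  # number of triples a <= b <= c with a^2 + b^2 + c^2 <= N
    a = 1
    while 3 * a * a <= N:
        b = a
        while a * a + 2 * b * b <= N:
            # c ranges over b .. isqrt(N - a^2 - b^2); the loop condition keeps this nonempty
            c_max = math.isqrt(N - a * a - b * b)
            total += c_max - b + 1
            b += 1
        a += 1
    return total / N

def heuristic_average(N: float) -> float:
    # Positive octant of a ball of radius sqrt(N) has volume (pi/6) * N^(3/2);
    # dividing by 6 orderings and by N candidate values gives (pi/36) * sqrt(N).
    return math.pi / 36 * math.sqrt(N)

N = 10**6
print(average_reps(N) / heuristic_average(N))  # close to 1
print(heuristic_average(10**9))                 # roughly 2760
```

Plugging N = 10^9 into the heuristic gives roughly 2,760; a constant three times smaller would already put the estimate below 1000. The heuristic only describes an average and proves nothing on its own.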
The problem superficially pattern-matches to some heavy-ish tools, like Pythagorean triples or Legendre's three-square theorem; however, the only solution I'm aware of is actually quite simple and uses no "theory".
December 5, 2024 at 9:16 PM
Some fun with o1 from OpenAI: there's a math problem I often give to "reasoning" AIs to try them out. It's basically to prove that there's a number less than 1 billion that you can write in 1000 different ways as a sum of 3 squares (precise statement in the pic).
December 5, 2024 at 9:15 PM
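For concreteness, a brute-force sketch of what "ways" could mean. It assumes (the precise statement is in the pic, not reproduced here) that a way is an unordered triple a ≤ b ≤ c of positive integers. `count_reps` is a hypothetical helper and is only meant for small n.

```python
import math

def count_reps(n: int) -> int:
    """Count unordered triples a <= b <= c of positive integers with
    a^2 + b^2 + c^2 == n (assumed reading of "ways")."""
    count = 0
    a = 1
    while 3 * a * a <= n:              # a is the smallest entry, so 3a^2 <= n
        b = a
        while a * a + 2 * b * b <= n:  # b <= c, so a^2 + 2b^2 <= n
            rest = n - a * a - b * b   # must be a perfect square c^2 with c >= b
            c = math.isqrt(rest)
            if c * c == rest and c >= b:
                count += 1
            b += 1
        a += 1
    return count

print(count_reps(27))  # 2: (1, 1, 5) and (3, 3, 3)
```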
yes, this is what mechanistic interpretability research looks like
November 24, 2024 at 7:51 PM