Dr. Miko
@doctormiko.bsky.social
Accidental Data Scientist, former mathematician and theoretical computer scientist. Love all the things. Some current and past interests: boardgames, home brewing, coffee, D&D, self-hosting, Argentine tango
Dormant blog: https://datacasual.com/
Apparently not just LLMs completely misunderstand the issue...
December 4, 2024 at 1:17 PM
Reposted by Dr. Miko
😎
November 30, 2024 at 9:14 PM
EDIT: What is the smallest integer such that its square is larger than 15 and **smaller** than 35?
Dammit. Long thread and I get the first post wrong.
November 29, 2024 at 11:18 AM
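The question in the post above can be checked by brute force. This is only a minimal sketch: the function name `integers_with_square_between` is made up for illustration, and "larger than 15 and smaller than 35" is assumed to mean strict inequalities on both sides.

```python
import math


def integers_with_square_between(low, high):
    """Return every integer n with low < n*n < high, in ascending order.

    Hypothetical helper for illustration; both bounds are treated as strict.
    """
    # No integer with |n| above this bound can have a square below `high`.
    bound = math.isqrt(high) + 1
    return [n for n in range(-bound, bound + 1) if low < n * n < high]


candidates = integers_with_square_between(15, 35)
print(candidates)                           # [-5, -4, 4, 5]
print(min(candidates))                      # -5, the smallest integer
print(min(n for n in candidates if n > 0))  # 4, if only positive integers are allowed
```

Reading "integer" as written gives -5. The answer 4 only comes out if the question is read as being about positive integers.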
1. I suspect that the biggest issue is in _comparing_ numbers rather than tokenisation. Especially when negatives are involved.
2. Prompting and system prompts matter: the fact that AVM tends to wander and gets it wrong way more often than 4o is very interesting
3. Yay for QwQ! 🎉 (6/6)
November 29, 2024 at 11:16 AM
I then asked "What about negative numbers?"

- 4o got it right once ✅ and another time decided the answer is -4 ❌
- 4o in AVM decided that 5 and -5 are both solutions ⁉️
- Sonnet 3.5 changed the answer to -4 ❌
- Opus 3, Gemini-exp-1121 and Gemini-1.5-Pro got it right ✅

What to make of it? (5/6)
November 29, 2024 at 11:16 AM
- o1-preview got it right ✅
- o1-mini got it right ✅, but also added -4 as an alternative 🤷
- 4o stubbornly stuck to its guns, adding a cheeky smile ❌
- 4o in Advanced Voice Mode changed its answer to 5 ❌🤷
- Sonnet 3.5, Opus 3, Gemini-exp-1121, and Gemini 1.5 Pro insisted on 4 ❌ (4/6)
November 29, 2024 at 11:16 AM
These answered 4 ❌
- OpenAI o1-preview, o1-mini and 4o
- Anthropic Sonnet 3.5 and Opus 3
- Google Gemini-exp-1121 and Gemini 1.5 Pro

I then asked "what is an integer?" (which they all answered correctly) and then again "do you want to change your original answer?"

The results: (3/6)
November 29, 2024 at 11:16 AM
QwQ 32B Preview is the only model that got it right out of the box. Most of the time. Sometimes it did not doubt itself enough and stopped early on 4. Another time it found that, depending on the interpretation of the question, both 4 and -5 might be correct, and it chose 4. Pass ✅. (2/6)
November 29, 2024 at 11:16 AM
How do you block/mute a list?
November 28, 2024 at 8:15 AM
I don’t get it: for the first problem it’s the only model giving the correct answer. Or am I missing something?
November 28, 2024 at 12:23 AM