Junjie Wu
junjie116.bsky.social
NLP PhD candidate @HKUST | Visiting PhD student @YaleNLP
In our study of ARAOC, we observe that even powerful LLMs like GPT-4o struggle with certain atomic operations. Building on this observation, we run a series of experiments probing why LLMs lack fluid intelligence, which yields several findings.

(4/4)
February 15, 2025 at 4:17 AM
Our ARAOC benchmark breaks ARC into atomic operations, shedding new light on why LLMs lack fluid intelligence.

(3/4)
February 15, 2025 at 4:16 AM
📊 Even top models solve just 19% of ARC tasks vs. ~75% for humans!

(2/4)
February 15, 2025 at 4:16 AM
3️⃣ Surface vs. True Comprehension:
SoTA LLMs grasp the definitions of the physical concepts perfectly and show a solid low-level understanding of the grid inputs. Yet the ~40% gap points to a fundamental difference between humans and LLMs in abstract pattern understanding.

(5/5)
February 15, 2025 at 4:10 AM
2️⃣ Reasoning Limitations: Despite its better ARC performance, o1 fails to outperform GPT-4o on PhysiCo. Both o1 and gemini-2.0-flash-thinking-exp trail GPT-4o on our leaderboard.

(4/5)
February 15, 2025 at 4:10 AM
Key Findings from Our Experiments (Figs. 2 & 3):

1️⃣ Human vs. LLM Performance: Humans achieve ~90% accuracy, while top LLMs, including reasoning models (e.g., o1) and vision-language models (e.g., GPT-4o), lag by ~40%.

(3/5)
February 15, 2025 at 4:10 AM
To explore this, we created PhysiCo, a benchmark in which core concepts of physical phenomena are represented as few-shot 2D grids. LLMs must identify the phenomenon from four choices.

(2/5)
February 15, 2025 at 4:09 AM