Junjie Wu
junjie116.bsky.social
NLP PhD candidate @HKUST | Visiting PhD student @YaleNLP
In our study of ARAOC, we observe that even powerful LLMs such as GPT-4o struggle with certain atomic operations. Building on this observation, we run a series of experiments to explore why LLMs lack fluid intelligence, which yields several findings.

(4/4)
February 15, 2025 at 4:17 AM
Our ARAOC benchmark breaks ARC tasks down into atomic operations, shedding new light on why LLMs lack fluid intelligence.

(3/4)
February 15, 2025 at 4:16 AM
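To make the "atomic operations" idea concrete, here is a minimal sketch of what one such operation on an ARC-style grid might look like. The function name, operation choice, and grid encoding are illustrative assumptions, not ARAOC's actual task format.

```python
def move_down(grid, steps=1):
    """Hypothetical atomic ARC-style operation: shift every row of the
    grid down by `steps`, filling the vacated top rows with zeros.
    `grid` is a list of equal-length rows of ints (0 = background)."""
    h = len(grid)
    w = len(grid[0])
    blank = [[0] * w for _ in range(min(steps, h))]
    return blank + [list(row) for row in grid[: h - steps]]

# A 3x2 grid with two colored cells, shifted down one step.
grid = [[1, 0],
        [0, 2],
        [0, 0]]
moved = move_down(grid)  # -> [[0, 0], [1, 0], [0, 2]]
```

A benchmark built from such primitives can test each transformation in isolation, rather than requiring a model to infer a whole composite ARC rule at once.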
📊 Even top models solve just 19% of ARC tasks vs. ~75% for humans!

(2/4)
February 15, 2025 at 4:16 AM
Key Findings from Our Experiments (Figs. 2 & 3):

1️⃣ Human vs. LLM Performance: Humans achieve ~90% accuracy, while top LLMs, including reasoning models (e.g., o1) and vision-language models (e.g., GPT-4o), lag by ~40%.

(3/5)
February 15, 2025 at 4:10 AM
🚀 Introducing PhysiCo: A New Benchmark for Evaluating Abstract Understanding in LLMs! 🚀

📚Link: physico-benchmark.github.io

While models like o3 have made impressive strides on ARC-AGI, how well do LLMs truly grasp the abstract patterns in ARC-style tasks?

(1/5)
February 15, 2025 at 4:09 AM