Junjie Wu
junjie116.bsky.social
NLP PhD candidate @HKUST | Visiting PhD student @YaleNLP
In our study of ARAOC, we observe that even powerful LLMs like GPT-4o struggle with certain atomic operations. Building on this observation, we run a series of experiments probing why LLMs lack fluid intelligence, which yields several findings.

(4/4)
February 15, 2025 at 4:17 AM
Our ARAOC benchmark breaks ARC into atomic operations, shedding new light on why LLMs lack fluid intelligence.

(3/4)
February 15, 2025 at 4:16 AM
📊 Even top models solve just 19% of ARC tasks vs. ~75% for humans!

(2/4)
February 15, 2025 at 4:16 AM
3️⃣ Surface vs. True Comprehension:
SoTA LLMs grasp the definitions of the physical concepts perfectly and show a solid low-level understanding of the grid inputs. Yet the ~40% gap points to a fundamental difference between humans and LLMs in abstract pattern understanding.

(5/5)
February 15, 2025 at 4:10 AM
2️⃣ Reasoning Limitations: Despite its better ARC performance, o1 fails to outperform GPT-4o on PhysiCo. Both o1 and gemini-2.0-flash-thinking-exp trail GPT-4o on our leaderboard.

(4/5)
February 15, 2025 at 4:10 AM
Key Findings from Our Experiments (Figs. 2 & 3):

1️⃣ Human vs. LLM Performance: Humans achieve ~90% accuracy, while top LLMs, including reasoning models (e.g., o1) and vision-language models (e.g., GPT-4o), lag by ~40%.

(3/5)
February 15, 2025 at 4:10 AM
To explore this, we created PhysiCo, a benchmark in which core concepts of physical phenomena are represented as few-shot 2D grids. LLMs must identify the phenomenon from four choices.

(2/5)
February 15, 2025 at 4:09 AM