Junjie Wu
junjie116.bsky.social
NLP PhD candidate @HKUST | Visiting PhD student @YaleNLP
In our study of ARAOC, we observe that even powerful LLMs such as GPT-4o struggle with certain atomic operations. Building on this observation, we run a series of experiments to explore why LLMs lack fluid intelligence, which yields several findings.

(4/4)
February 15, 2025 at 4:17 AM
Our ARAOC benchmark breaks ARC tasks down into atomic operations, shedding new light on why LLMs lack fluid intelligence.

(3/4)
February 15, 2025 at 4:16 AM
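To make the "atomic operations" idea concrete, here is a minimal sketch of what one such operation on an ARC-style grid might look like. The function name, operation choice, and grid encoding are illustrative assumptions, not ARAOC's actual task format.

```python
def move_down(grid, steps=1):
    """Hypothetical atomic ARC-style operation: shift every row of the
    grid down by `steps`, filling the vacated top rows with zeros.
    `grid` is a list of equal-length rows of ints (0 = background)."""
    h = len(grid)
    w = len(grid[0])
    blank = [[0] * w for _ in range(min(steps, h))]
    return blank + [list(row) for row in grid[: h - steps]]

# A 3x2 grid with two colored cells, shifted down one step.
grid = [[1, 0],
        [0, 2],
        [0, 0]]
moved = move_down(grid)  # -> [[0, 0], [1, 0], [0, 2]]
```

A benchmark built from such primitives can test each transformation in isolation, rather than requiring a model to infer a whole composite ARC rule at once.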
📊 Even top models solve just 19% of ARC tasks vs. ~75% for humans!

(2/4)
February 15, 2025 at 4:16 AM
Key Findings from Our Experiments (Figs. 2 & 3):

1️⃣ Human vs. LLM Performance: Humans achieve ~90% accuracy, while top LLMs, including reasoning models (e.g., o1) and vision-language models (e.g., GPT-4o), lag by ~40%.

(3/5)
February 15, 2025 at 4:10 AM
🚀 Introducing PhysiCo: A New Benchmark for Evaluating Abstract Understanding in LLMs! 🚀

📚Link: physico-benchmark.github.io

While models like o3 have made impressive strides on ARC-AGI, how well do LLMs truly grasp the abstract patterns in ARC-style tasks?

(1/5)
February 15, 2025 at 4:09 AM