Zory Zhang
zoryzhang.bsky.social
Computational modeling of human learning: cognitive development, language acquisition, social learning, causal learning... Brown PhD student with ‪@daphnab.bsky.social‬
Surprisingly, their accuracy does not differ between front views and side views, whereas humans' does (p<0.001). VLMs may rely on 👺head orientation rather than 👀eye-gaze direction, making them "robust" to side views, which increase the geometric ambiguity of eye direction.
🧵7/11
June 12, 2025 at 5:04 PM
On the other hand, the performance of Gemini 1.5 Pro, GPT-4o, InternLM, Qwen2.5, and GLM approaches chance level as difficulty increases (with greater proximity and more objects). They likely employ heuristics that break down under difficult conditions.
🧵6/11
Before that, we need to establish baselines. 65 human participants answered multiple-choice (MC) questions like the one below. Their performance degrades 📉 with increasing proximity, with more objects, and when the camera view switches from the front to the side.
🧵5/11
Beyond their chance-level accuracy, VLMs chose every possible answer almost equally often. Are they random guessers? 🤡 Spoiler: top-tier VLMs are not, as we found when we analyzed how their performance varies with the controlled variables. 🤗
🧵4/11
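A quick way to quantify "responded with every possible answer almost equally frequently" is a chi-square goodness-of-fit statistic against a uniform answer distribution. The sketch below uses made-up illustrative responses, not the paper's data, and a hand-rolled statistic rather than any analysis the authors report:

```python
from collections import Counter

# Hypothetical sketch: how close is a model's answer distribution to uniform?
# `responses` is illustrative, not real data from the study.
responses = ["A", "B", "C", "A", "D", "B", "C", "D", "A", "B", "C", "D"]
counts = Counter(responses)

n, k = len(responses), 4          # total answers, number of options
expected = n / k                  # uniform expectation per option
chi2 = sum((c - expected) ** 2 / expected for c in counts.values())
print(chi2)  # 0.0 here: a perfectly uniform spread; larger = less uniform
```

A chi-square value near zero (relative to its degrees of freedom, k−1) is consistent with uniform guessing; a large value indicates the model favors particular answers.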
We found that humans excel at gaze inference (~91% accuracy), but 94 of 111 VLMs performed about as well as if they had guessed randomly without looking at the images (~42%) 😲. Even the best, like GPT-4o, hit only ~50%. Bigger (or newer) VLMs are not better. 🫤
🧵3/11
We systematically manipulated variables across 900 evaluation stimuli: View (left/right/front), Proximity (1-3 scale), Number of objects (2-4), etc., and tested 65 human participants (45 stimuli per person) and 111 VLMs on them.
🧵2/11
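The three listed variables can be crossed into a condition grid like the sketch below. This assumes a full factorial crossing of just those three factors; the actual stimulus set (and whatever is behind "etc.") may be organized differently:

```python
from itertools import product

# Hypothetical sketch of the condition grid described in the thread
# (assumes a full factorial design; the study's real design may differ).
views = ["left", "right", "front"]
proximities = [1, 2, 3]    # 1-3 proximity scale
n_objects = [2, 3, 4]      # number of candidate objects

conditions = list(product(views, proximities, n_objects))
print(len(conditions))  # 27 condition cells
```

Crossing the three factors gives 3 × 3 × 3 = 27 cells, so the 900 stimuli allow multiple exemplars per cell.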
👁️ 𝐂𝐚𝐧 𝐕𝐢𝐬𝐢𝐨𝐧 𝐋𝐚𝐧𝐠𝐮𝐚𝐠𝐞 𝐌𝐨𝐝𝐞𝐥𝐬 (𝐕𝐋𝐌𝐬) 𝐈𝐧𝐟𝐞𝐫 𝐇𝐮𝐦𝐚𝐧 𝐆𝐚𝐳𝐞 𝐃𝐢𝐫𝐞𝐜𝐭𝐢𝐨𝐧?
Knowing where someone looks is key to a Theory of Mind. We test 111 VLMs and 65 humans to compare their inferences.
Project page: grow-ai-like-a-child.github.io/gaze/
🧵1/11