Zory Zhang
zoryzhang.bsky.social
Computational modeling of human learning: cognitive development, language acquisition, social learning, causal learning... Brown PhD student with ‪@daphnab.bsky.social‬
Surprisingly, their accuracy does not differ between front views and side views, whereas humans' does (p<0.001). VLMs may rely on 👺head orientation rather than 👀eye-gaze direction, making them "robust" to side views, which increase the geometric ambiguity of eye direction.
🧵7/11
June 12, 2025 at 5:04 PM
On the other hand, the performance of Gemini 1.5 Pro, GPT-4o, InternLM, Qwen2.5, and GLM approaches chance level as difficulty increases (with greater proximity and more objects). They likely employ heuristics that break down under difficult conditions.
🧵6/11
Before that, we need to establish baselines. 65 human participants answered multiple-choice (MC) questions like the one below. Their performance degrades 📉 with increasing proximity, with more objects, and when the camera view switches from the front to the side.
🧵5/11
Beyond their chance-level accuracy, VLMs chose every possible answer almost equally often. Are they random guessers? 🤡 Spoiler: top-tier VLMs are not, as we found when we analyzed how their performance varies with the controlled variables. 🤗
🧵4/11
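A quick way to quantify "responded with every possible answer almost equally frequently" is a chi-square goodness-of-fit statistic against a uniform answer distribution. The sketch below uses made-up illustrative responses, not the paper's data, and a hand-rolled statistic rather than any analysis the authors report:

```python
from collections import Counter

# Hypothetical sketch: how close is a model's answer distribution to uniform?
# `responses` is illustrative, not real data from the study.
responses = ["A", "B", "C", "A", "D", "B", "C", "D", "A", "B", "C", "D"]
counts = Counter(responses)

n, k = len(responses), 4          # total answers, number of options
expected = n / k                  # uniform expectation per option
chi2 = sum((c - expected) ** 2 / expected for c in counts.values())
print(chi2)  # 0.0 here: a perfectly uniform spread; larger = less uniform
```

A chi-square value near zero (relative to its degrees of freedom, k−1) is consistent with uniform guessing; a large value indicates the model favors particular answers.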
We found that humans excel at gaze inference (~91% accuracy), but 94 of 111 VLMs performed about as well as if they had guessed randomly without looking at the images (~42%) 😲. Even the best, like GPT-4o, hit only ~50%. Bigger (or newer) VLMs are not better. 🫤
🧵3/11
We systematically manipulated variables across 900 evaluation stimuli: View (left/right/front), Proximity (1-3 scale), Number of objects (2-4), etc., and tested 65 human participants (45 stimuli per person) and 111 VLMs on them.
🧵2/11
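The three listed variables can be crossed into a condition grid like the sketch below. This assumes a full factorial crossing of just those three factors; the actual stimulus set (and whatever is behind "etc.") may be organized differently:

```python
from itertools import product

# Hypothetical sketch of the condition grid described in the thread
# (assumes a full factorial design; the study's real design may differ).
views = ["left", "right", "front"]
proximities = [1, 2, 3]    # 1-3 proximity scale
n_objects = [2, 3, 4]      # number of candidate objects

conditions = list(product(views, proximities, n_objects))
print(len(conditions))  # 27 condition cells
```

Crossing the three factors gives 3 × 3 × 3 = 27 cells, so the 900 stimuli allow multiple exemplars per cell.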
👁️ 𝐂𝐚𝐧 𝐕𝐢𝐬𝐢𝐨𝐧 𝐋𝐚𝐧𝐠𝐮𝐚𝐠𝐞 𝐌𝐨𝐝𝐞𝐥𝐬 (𝐕𝐋𝐌𝐬) 𝐈𝐧𝐟𝐞𝐫 𝐇𝐮𝐦𝐚𝐧 𝐆𝐚𝐳𝐞 𝐃𝐢𝐫𝐞𝐜𝐭𝐢𝐨𝐧?
Knowing where someone looks is key to a Theory of Mind. We test 111 VLMs and 65 humans to compare their inferences.
Project page: grow-ai-like-a-child.github.io/gaze/
🧵1/11