Zory Zhang
@zoryzhang.bsky.social
Computational modeling of human learning: cognitive development, language acquisition, social learning, causal learning... Brown PhD student with @daphnab.bsky.social
Like this idea! Game dev has traditionally been hard to combine with psychology experiments, but AI is getting better and better at helping with this.
October 14, 2025 at 6:10 PM
GrowAI Team: @growai.bsky.social
June 12, 2025 at 5:04 PM
With the amazing GrowAI team: Pinyuan Feng (equal contribution), Bingyang Wang, Tianwei Zhao, Suyang Yu, Qingying Gao, @hokin.bsky.social , Ziqiao Ma, Yijiang Li, & Dezhi Luo.

🧵11/11 🎉
June 12, 2025 at 5:04 PM
Beyond understanding VLMs, this explanation suggests that VLM training should include more embodied social interaction, so that natural human-AI interaction can emerge from next-token/frame-prediction training. We also recommend better learning-curriculum design📚.
🧵9/11
June 12, 2025 at 5:04 PM
We leave this explanation open for further investigation. More broadly, this work shows how controlled studies can complement benchmarking: they supply phenomena that any explanation must account for, constraining the hypothesis space and helping us better understand VLMs🌟.
🧵8/11
June 12, 2025 at 5:04 PM
Surprisingly, their accuracy does not differ between front and side views, whereas humans' does (p<0.001). VLMs may rely on 👺head orientation rather than 👀eye-gaze direction, making them "robust" to side views, which increase the geometric ambiguity of eye direction.
🧵7/11
June 12, 2025 at 5:04 PM
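For the view comparison above, one common choice (the paper's exact test may differ) is a chi-square test on a correct/incorrect by front/side contingency table. A minimal sketch, with fabricated counts:

```python
# 2x2 contingency test of accuracy by camera view. Counts are fabricated;
# a small p-value would indicate accuracy differs by view, as the post
# reports for humans (p < 0.001).
from scipy.stats import chi2_contingency

table = [[430, 20],    # front view: [correct, incorrect]
         [380, 70]]    # side view:  [correct, incorrect]
chi2, p, dof, expected = chi2_contingency(table)
print(chi2, p)
```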
On the other hand, the performance of Gemini 1.5 Pro, GPT-4o, InternLM, Qwen2.5, and GLM drops toward chance as difficulty increases (with increasing proximity and number of objects). They likely employ heuristics that break down under difficult conditions.
🧵6/11
June 12, 2025 at 5:04 PM
Before that, we need to establish baselines. Sixty-five human participants answered multiple-choice questions like the one below. Their performance degrades 📉 with increasing proximity, with an increasing number of objects, and when the camera view switches from front to side.
🧵5/11
June 12, 2025 at 5:04 PM
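As an illustration of how a degradation pattern like the one above can be read off per-trial data, a condition-wise accuracy table. The data frame here is fabricated and far smaller than the real 65-participant dataset:

```python
# Accuracy per condition from per-trial results (fabricated toy data).
import pandas as pd

trials = pd.DataFrame({
    "proximity": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "correct":   [1, 1, 1, 1, 1, 0, 1, 0, 0],
})
print(trials.groupby("proximity")["correct"].mean())  # accuracy falls as proximity rises
```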
In addition to their chance-level accuracy, VLMs chose every possible answer almost equally often. Are they random guessers? 🤡 Spoiler: top-tier VLMs are not, as we found by analyzing how their performance varies with the controlled variables. 🤗
🧵4/11
June 12, 2025 at 5:04 PM
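One standard way to probe the "random guesser" question (not necessarily our exact analysis) is a chi-square goodness-of-fit test of response counts against a uniform distribution. A sketch with made-up tallies:

```python
# Chi-square goodness-of-fit: are all answer options chosen equally often?
# Tallies below are fabricated for illustration.
from scipy.stats import chisquare

counts = [78, 71, 74, 77]    # hypothetical counts of answers A-D
stat, p = chisquare(counts)  # null hypothesis: uniform responding
print(stat, p)               # a large p means "uniform" cannot be rejected
```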
We found that humans excel at gaze inference (~91% accuracy), but 94 of 111 VLMs performed about as well as if they had guessed randomly without looking at the images (~42%) 😲. Even the best, like GPT-4o, hit only ~50%. Bigger (or newer) VLMs are not better. 🫤
🧵3/11
June 12, 2025 at 5:04 PM
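A back-of-envelope note on the ~42% figure: the chance level for an n-option question is 1/n, so the overall baseline is a weighted mean over the trial mix. The proportions below are invented purely for illustration, not the actual mix:

```python
# Hypothetical trial mix (NOT the actual proportions) showing how a weighted
# chance baseline in the low 40s can arise from 2-4 option questions.
mix = {2: 0.6, 3: 0.2, 4: 0.2}                  # assumed share of n-option trials
chance = sum(share / n for n, share in mix.items())
print(f"{chance:.1%}")                          # 41.7% under this assumed mix
```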
We systematically manipulated variables across 900 evaluation stimuli: View (left/right/front), Proximity (1-3 scale), Number of objects (2-4), etc., and tested 65 human participants (45 stimuli per person) and 111 VLMs on them.
🧵2/11
June 12, 2025 at 5:04 PM
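For a feel of the factorial design above, a minimal sketch (not our actual stimulus-generation code; levels taken from the post, everything else assumed) of enumerating the condition grid:

```python
# Factorial grid over the manipulated variables described in the thread.
from itertools import product

views = ["left", "right", "front"]   # camera View
proximities = [1, 2, 3]              # Proximity (1-3 scale)
n_objects = [2, 3, 4]                # Number of candidate objects

grid = list(product(views, proximities, n_objects))
print(len(grid))  # 27 cells; crossing them with different actors and scenes
                  # would yield a full stimulus set such as the 900 used here
```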