Hokin
@hokin.bsky.social
Philosopher, Scientist, Engineer
https://hokindeng.github.io/
developmental embodiment 😎

#DevelopmentalEmbodiment #GrowAI
November 7, 2025 at 5:38 AM
While failure cases clearly show idiosyncratic patterns 🧩🤔, we currently lack a principled framework to systematically analyze or interpret them 🔍. We invite everyone to explore these examples 🧪, as they may offer valuable clues for future research directions 💡🧠🚀.
November 4, 2025 at 9:56 PM
Here is a video generated by the video models while solving Raven's Matrices. For more, check out grow-ai-like-a-child.com/video-reason/
November 4, 2025 at 9:55 PM
Raven's Matrices are one of the standard tasks for testing IQ in humans; they require subjects to find patterns and regularities. Intriguingly, video models are able to solve them quite well!
November 4, 2025 at 9:53 PM
Here is an example of testing mental rotation in video models. For more, check out grow-ai-like-a-child.com/video-reason/
November 4, 2025 at 9:52 PM
For testing mental rotation, we give them an n-voxel structure shown from a tilted camera view (20-40° elevation) and ask them to rotate it horizontally by exactly 180° of azimuth. The hard parts are 1) not deforming the structure and 2) rotating by exactly the right angle. Interestingly, some models are able to do it quite well.
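For intuition, here's a minimal sketch (not our actual rendering pipeline; the voxel layout, elevation, and azimuth values are illustrative) of how such a before/after pair could be rendered:

```python
# Render the same voxel structure at two azimuths exactly 180° apart,
# from a tilted elevation, as in the mental rotation task described above.
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical 4-voxel structure inside a 3x3x3 grid.
grid = np.zeros((3, 3, 3), dtype=bool)
grid[0, 0, 0] = grid[1, 0, 0] = grid[1, 1, 0] = grid[1, 1, 1] = True

fig = plt.figure(figsize=(8, 4))
for i, azim in enumerate([45, 45 + 180]):      # exactly 180° azimuth change
    ax = fig.add_subplot(1, 2, i + 1, projection="3d")
    ax.voxels(grid, facecolors="tab:blue", edgecolor="k")
    ax.view_init(elev=30, azim=azim)           # tilted view (20-40° range)
    ax.set_title(f"azim = {azim}°")
plt.savefig("mental_rotation_pair.png")
```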
November 4, 2025 at 9:52 PM
Here is a video example. For more, check out grow-ai-like-a-child.com/video-reason/
November 4, 2025 at 9:49 PM
For the Sudoku problems, the video models need to fill the gap with the correct number so that each row and each column contains 1, 2, and 3. Surprisingly, this is the easiest task for the video models.
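In code, the constraint is tiny. A minimal sketch of the rule (illustrative only, not the benchmark's actual scorer):

```python
# A 3x3 mini-Sudoku is solved when every row and every column
# contains {1, 2, 3} exactly once.
def is_solved(grid: list[list[int]]) -> bool:
    target = {1, 2, 3}
    return (all(set(row) == target for row in grid)
            and all(set(col) == target for col in zip(*grid)))

# One missing cell (0); filling it with 3 satisfies all constraints.
puzzle = [[1, 2, 0],
          [3, 1, 2],
          [2, 3, 1]]
puzzle[0][2] = 3
print(is_solved(puzzle))  # True
```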
November 4, 2025 at 9:49 PM
Here is an example of a video generated by the models solving the maze problem. Check out more at grow-ai-like-a-child.com/video-reason/
November 4, 2025 at 9:48 PM
In the maze problems, the video models need to generate videos that navigate the green dot 🟢 to the red flag 🚩. And they are also able to do this quite well ~
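For reference, a minimal sketch of how a ground-truth path could be computed for such a maze (an illustrative BFS, not the benchmark's actual checker):

```python
from collections import deque

def shortest_path(maze, start, goal):
    """maze: 2D list, 0 = open, 1 = wall; start/goal: (row, col)."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        r, c = path[-1]
        if (r, c) == goal:
            return path
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < len(maze) and 0 <= nc < len(maze[0])
                    and maze[nr][nc] == 0 and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append(path + [(nr, nc)])
    return None  # goal unreachable

maze = [[0, 1, 0],   # 🟢 at (0, 0), 🚩 at (0, 2)
        [0, 1, 0],
        [0, 0, 0]]
print(shortest_path(maze, start=(0, 0), goal=(0, 2)))
```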
November 4, 2025 at 9:48 PM
Here is a generated video solving the Chess problem. For more examples, check out: grow-ai-like-a-child.com/video-reason/
November 4, 2025 at 9:45 PM
Let's see some examples. Video models are able to figure out the checkmate moves in the following problems.
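To make "checkmate move" concrete, here's a minimal sketch using the python-chess library (not our scoring code; the position is an illustrative scholar's-mate setup):

```python
import chess  # pip install python-chess

def mate_in_one_moves(fen: str) -> list[str]:
    """Return all legal moves that immediately deliver checkmate."""
    board = chess.Board(fen)
    mates = []
    for move in list(board.legal_moves):
        board.push(move)
        if board.is_checkmate():
            mates.append(move.uci())
        board.pop()
    return mates

fen = "r1bqkbnr/pppp1ppp/2n5/4p3/2B1P3/5Q2/PPPP1PPP/RNB1K1NR w KQkq - 0 1"
print(mate_in_one_moves(fen))  # ['f3f7'], i.e. Qxf7#
```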
November 4, 2025 at 9:45 PM
Idiosyncratic behavioral patterns exist.

For example, Sora-2 somehow figures out how to solve Chess problems, but none of the other models has this ability.

Veo 3 and 3.1 are actually able to do mental rotation quite well, but fail badly on the maze problems.
November 4, 2025 at 9:44 PM
Tasks also exhibit a clear difficulty hierarchy, with Sudoku being the easiest and mental rotation the hardest, across all models.
November 4, 2025 at 9:38 PM
Models exhibit a clear performance hierarchy, with Sora-2 currently being the best model.
November 4, 2025 at 9:37 PM
The basic unit of VMEvalkit is the Task Pair:

1️⃣ Initial image: unsolved puzzle
2️⃣ Text instruction: “Solve this ...”
3️⃣ Final image: correct solution (hidden during generation)

Models see (1)+(2); we compare their output to (3). Simple and straightforward ✅
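As a data structure, the unit is tiny. A minimal sketch (field names are illustrative, not VMEvalkit's actual schema):

```python
from dataclasses import dataclass

@dataclass
class TaskPair:
    initial_image: str   # 1️⃣ unsolved puzzle (shown to the model)
    instruction: str     # 2️⃣ text prompt, e.g. "Solve this ..." (shown)
    final_image: str     # 3️⃣ correct solution (hidden; used for eval)

pair = TaskPair(
    initial_image="maze_0042_init.png",
    instruction="Solve this maze: move the green dot to the red flag.",
    final_image="maze_0042_solution.png",
)
# The model receives (initial_image, instruction); its generated video
# is compared against final_image during evaluation.
```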
November 4, 2025 at 9:36 PM
‼️ Video models are starting to reason, so let's build scaled evals in public together 🚀

github.com/hokindeng/VM... (Apache 2.0) offers
1️⃣ One-click inference across ALL available models
2️⃣ Unified API & datasets, plus auto-resume, error handling, and eval
3️⃣ Plug in new models and tasks in <5 lines of code (sketch below)
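For a feel of the plug-in flow, here's a hypothetical sketch assuming a registry-style interface; register_model and the generate() signature are illustrative guesses, NOT VMEvalkit's actual API (see the repo README for the real interface):

```python
# Toy stand-in for a framework-provided registry; in a real framework
# this decorator would be imported rather than defined by the user.
MODEL_REGISTRY: dict[str, type] = {}

def register_model(name: str):
    def deco(cls):
        MODEL_REGISTRY[name] = cls
        return cls
    return deco

# The "<5 lines" part: declare the model and one generate() entry point.
@register_model("my-video-model")
class MyVideoModel:
    def generate(self, image_path: str, prompt: str) -> str:
        # Call your model's inference here; return the output video path.
        return "outputs/my-video-model/demo.mp4"
```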

a thread (1/n)
November 4, 2025 at 9:34 PM
Excited to share that my essay with @carrot0817.bsky.social (Kaia Gao) on the representational substrate of world-reasoning in both humans and machines has been accepted to the SpaVLE Workshop at #NeurIPS2025 ✨

a thread (1/n)
November 2, 2025 at 11:50 PM
7) The result is astonishing 😱 No model is able to get both the manipulation tasks and the control tasks right at the same time, which seems completely trivial to humans ...

This suggests MLLMs completely lack core knowledge 🦜 and rely purely on shortcuts 🤫 ... (11/n)
June 30, 2025 at 7:07 AM
6) Lastly, we introduce "Concept Hacking" to reveal core knowledge deficiencies through a control-experiment setup.

Concept Hacking systematically manipulates the task-relevant features while preserving all task-irrelevant conditions ... (10/n)
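A toy illustration of the idea (hypothetical stimuli, not the paper's actual items): flip only the task-relevant feature, so the correct answer flips too; a shortcut-reliant model answers both trials identically and necessarily fails one.

```python
# Paired trials: identical task-irrelevant conditions, flipped
# task-relevant feature. (Hypothetical stimuli for illustration only.)
control = {
    "scene": "a ball rolls behind a screen",       # same surface form
    "task_relevant_feature": "the screen has no gap",
    "correct_answer": "the ball is hidden until it re-emerges",
}
hacked = {
    "scene": "a ball rolls behind a screen",       # unchanged
    "task_relevant_feature": "the screen has a gap in the middle",
    "correct_answer": "the ball appears briefly in the gap",  # flips
}
# A model relying on the shortcut "balls behind screens stay hidden"
# answers both trials the same way and fails exactly one of them.
```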
June 30, 2025 at 7:05 AM
5) Does Reasoning Help? 🤔 We further compared reasoning-augmented models with their corresponding instruction-tuned counterparts.

Scaling test-time compute doesn't seem to be a solution 😮‍💨 ... (9/n)
June 30, 2025 at 7:04 AM
4) 🧐 Would Core Knowledge Emerge from Pure Scaling? ‼️‼️‼️Nope‼️‼️‼️

By regressing the performance of 230 models with different parameter counts and data sizes, we quantify the scaling effects.

The observation is very intriguing: "higher-level" abilities seem to be more "scalable" than "lower-level" abilities (8/n)
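A minimal sketch of that kind of analysis (synthetic data, not the paper's results): regress each ability's score on log parameter count and compare slopes.

```python
import numpy as np

rng = np.random.default_rng(0)
log_params = np.log10(rng.uniform(1e8, 1e11, size=230))  # 230 models

# Hypothetical scores: a "higher-level" ability with a steep scaling
# slope vs. a "lower-level" core ability that barely moves with scale.
high_level = 0.10 * log_params + rng.normal(0.0, 0.05, 230)
low_level  = 0.01 * log_params + rng.normal(0.0, 0.05, 230) + 0.5

for name, y in [("high-level", high_level), ("low-level", low_level)]:
    slope, _ = np.polyfit(log_params, y, deg=1)
    print(f"{name}: slope per decade of parameters = {slope:.3f}")
```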
June 30, 2025 at 7:02 AM
3) Performance on core cognition abilities serves as a reliable predictor for achieving top results on high-level benchmarks.

Concretely, we compute the correlation matrices of MLLMs' abilities on our dataset, on another 26 public benchmarks, and on the 9 higher-level abilities defined by SEED-Bench (7/n)
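A minimal sketch of the correlation computation (synthetic scores, not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(0)
n_models = 50
core = rng.uniform(0.0, 1.0, n_models)                   # core-cognition score
benchmark = 0.8 * core + rng.normal(0.0, 0.1, n_models)  # a correlated benchmark

# Rows = abilities/benchmarks, columns = models; corrcoef correlates rows.
print(np.corrcoef(np.stack([core, benchmark])))  # 2x2 correlation matrix
```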
June 30, 2025 at 6:57 AM
2) The topology of MLLMs' internal cognitive representations is akin to that of humans, suggesting "cognition" is a natural kind.

As in humans, physical vs. intention understanding are orthogonal, while tool use and mechanics co-emerge.

🤔 "Artificial model organisms" for "cognition lesion study" ? 🤔 (6/n)
June 30, 2025 at 6:55 AM
‼️ RESULTS ‼️ First, MLLMs exhibit a reversed developmental trajectory 📉 compared to humans 📈: they excel at "high-level" tasks that we learn later in life 🙀 but struggle with "basic" ones that we develop in infancy 👶

This observation is statistically significant (5/n) 📊
June 30, 2025 at 6:46 AM