https://hokindeng.github.io/
For example, Sora-2 somehow figures out how to solve chess problems, while no other model shows that ability.
Veo 3 and 3.1 handle mental rotation quite well, but fail badly on the maze problems.
1️⃣ Initial image: unsolved puzzle
2️⃣ Text instruction: “Solve this ...”
3️⃣ Final image: correct solution (hidden during generation)
Models see (1)+(2); we compare their output to (3). Simple and straightforward ✅
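For the curious, here's roughly what that protocol looks like as code. A minimal sketch only: `generate_fn` and `score_fn` are hypothetical stand-ins, not the benchmark's actual API.

```python
from typing import Callable, Dict, List

# Minimal sketch of the eval protocol above. `generate_fn` and `score_fn`
# are hypothetical placeholders, NOT the benchmark repo's real API.

def evaluate(
    generate_fn: Callable[[bytes, str], bytes],  # (initial image, instruction) -> output image
    score_fn: Callable[[bytes, bytes], float],   # (output, hidden solution) -> score in [0, 1]
    tasks: List[Dict],
) -> float:
    scores = []
    for task in tasks:
        # The model sees (1) the unsolved puzzle and (2) the text instruction.
        output = generate_fn(task["initial_image"], task["instruction"])
        # (3) The ground-truth solution is only used here, after generation.
        scores.append(score_fn(output, task["solution_image"]))
    return sum(scores) / len(scores)  # mean score across puzzles
```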
github.com/hokindeng/VM... (Apache 2.0) offers
1⃣ One-click inference across ALL available models
2⃣ Unified API & datasets, with auto-resume, error handling, and eval built in
3⃣ Plug in new models and tasks in <5 lines of code (sketch below)
a thread (1/n)
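Roughly what the "<5 lines" pattern could look like. A hypothetical sketch of a registry-based plugin; the repo's actual class and registry names may differ:

```python
# Hypothetical sketch of a "<5 lines" plugin pattern; the real repo's
# registry and base-class names may be different.

MODEL_REGISTRY = {}

def register(name):
    def deco(cls):
        MODEL_REGISTRY[name] = cls  # make the model discoverable by name
        return cls
    return deco

@register("my-new-model")
class MyNewModel:
    def generate(self, image, prompt):
        ...  # call your model's API here
```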
This suggests MLLMs completely lack core knowledge 🦜 and rely purely on shortcuts 🤫 ... (11/n)
Concept Hacking systematically manipulates the task-relevant features while preserving all task-irrelevant conditions ... (10/n)
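A toy illustration of the idea (the task encoding below is entirely made up, not from the paper): flip only the feature that determines the answer, hold every surface condition fixed, and check whether the model's answer flips too.

```python
# Toy illustration of Concept Hacking (hypothetical task encoding).
# Original and manipulated trials share all task-irrelevant conditions
# (same objects, layout, wording); only the task-relevant feature flips,
# so the correct answer flips with it.

original = {
    "scene": "ball rolls behind a screen; its path is unobstructed",
    "prompt": "Where does the ball reappear?",
    "answer": "to the right of the screen",
}

manipulated = {
    **original,  # identical task-irrelevant conditions
    "scene": "ball rolls behind a screen; a hidden wall blocks its path",  # relevant flip
    "answer": "it does not reappear",
}

def took_shortcut(model_fn):
    # Identical answers on both trials = the model ignored the flipped
    # feature and pattern-matched the shared surface conditions.
    return (model_fn(original["scene"], original["prompt"])
            == model_fn(manipulated["scene"], manipulated["prompt"]))
```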
Scaling test-time compute doesn't seem to be a solution 😮💨 ... (9/n)
By regressing the performance of 230 models with different parameter counts & data sizes, we quantify the scaling effects.
The observation is very intriguing: "higher-level" abilities seem to be more "scalable" than "lower-level" abilities (8/n)
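A sketch of the kind of fit this describes, assuming a log-linear form with made-up numbers; the paper's exact specification may differ:

```python
import numpy as np

# Hypothetical scaling regression: fit score ~ a*log(params) + b*log(data) + c
# for one ability, then compare the fitted slopes across abilities.
log_params = np.log([1e9, 7e9, 13e9, 70e9])    # model sizes (made up)
log_data   = np.log([1e11, 3e11, 1e12, 2e12])  # training tokens (made up)
scores     = np.array([0.41, 0.48, 0.55, 0.63])  # one ability's scores (made up)

X = np.column_stack([log_params, log_data, np.ones_like(scores)])
coef, *_ = np.linalg.lstsq(X, scores, rcond=None)
print(f"param slope={coef[0]:.3f}, data slope={coef[1]:.3f}")
# Larger slopes = the ability benefits more from scale; the finding here
# is that "higher-level" abilities show the larger slopes.
```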
Concretely, we compute the correlation matrices of MLLM abilities on our dataset, 26 other public benchmarks, and the 9 higher-level abilities defined by SEED-Bench (7/n)
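In code, the analysis is essentially this (scores below are made up for illustration):

```python
import numpy as np

# Sketch of the cross-benchmark correlation analysis (made-up numbers).
# Rows = models, columns = abilities; corrcoef over columns shows which
# abilities rise and fall together across models.
scores = np.array([
    # physical  intention  tool-use  mechanics
    [0.62,      0.30,      0.55,     0.58],   # model A
    [0.48,      0.71,      0.40,     0.43],   # model B
    [0.70,      0.35,      0.66,     0.69],   # model C
    [0.45,      0.68,      0.38,     0.41],   # model D
])

corr = np.corrcoef(scores, rowvar=False)  # 4x4 ability-by-ability matrix
print(np.round(corr, 2))
# Near-zero off-diagonal between physical and intention = "orthogonal";
# high correlation between tool-use and mechanics = "co-emerge".
```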
As in humans, physical and intention understanding are orthogonal, while tool use and mechanics co-emerge.
🤔 "Artificial model organisms" for "cognition lesion study" ? 🤔 (6/n)
This observation is statistically significant (5/n) 📊