Kathy Garcia
@gkathy.bsky.social
Computational Cognitive Science PhD at Johns Hopkins with Leyla Isik

| BS @Stanford |

| 🔗 https://garciakathy.github.io/ |
Together, this work shows how different types of human similarity judgments can be leveraged to improve video models.

We also share our large-scale video similarity judgment dataset and code for hybrid triplet/RSA behavior-guided fine-tuning: github.com/garciakathy/...
GitHub - garciakathy/similarity-judgments-finetuning
October 3, 2025 at 1:48 PM
In follow-up experiments, we show that the fine-tuned model generalizes better to novel social tasks and avoids catastrophic forgetting, preserving baseline performance on action recognition tasks.
October 3, 2025 at 1:48 PM
After fine-tuning, the video model both captures more shared variance with language models AND explains more unique variance in human judgments, indicating it learned both language-like semantics and additional visual social nuances.
October 3, 2025 at 1:48 PM
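A minimal sketch of how such a shared/unique variance partitioning could be computed with nested regressions; all inputs below are placeholders and this is not necessarily the paper's exact pipeline:

```python
# Hypothetical variance partitioning between video-model and language-model
# predictors of human similarity judgments (placeholder data, illustrative only).
import numpy as np
from sklearn.linear_model import LinearRegression

def r2(X, y):
    # In-sample R^2 of a linear fit (a cross-validated version would be preferable).
    return LinearRegression().fit(X, y).score(X, y)

rng = np.random.default_rng(0)
X_video = rng.random((500, 10))   # placeholder video-model predictors per video pair
X_lang  = rng.random((500, 10))   # placeholder language-model predictors per video pair
y       = rng.random(500)         # placeholder human similarity judgments

r2_video = r2(X_video, y)
r2_lang  = r2(X_lang, y)
r2_both  = r2(np.hstack([X_video, X_lang]), y)

unique_video = r2_both - r2_lang                     # variance only the video model explains
unique_lang  = r2_both - r2_video                    # variance only the language model explains
shared       = r2_both - unique_video - unique_lang  # variance shared by both
print(f"shared={shared:.3f} unique_video={unique_video:.3f} unique_lang={unique_lang:.3f}")
```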
Hybrid fine-tuning substantially increases the match to human judgments on held-out videos and surpasses the best language-model baseline. The hybrid loss outperforms both the triplet-only and the RSA-only loss.
October 3, 2025 at 1:48 PM
We fine-tune a video transformer with a novel hybrid objective: triplet loss (local constraints) + RSA loss (global Pearson correlation over pairwise distances), capturing both local and global similarity structure. We use Low-Rank Adaptation (LoRA) to reduce the number of trainable parameters.
October 3, 2025 at 1:48 PM
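A rough sketch of such a hybrid objective, under assumptions rather than the authors' exact implementation (see the repo linked above for the real code): `emb` is assumed to be (batch, dim) embeddings from a LoRA-adapted video transformer, and `human_dist` a matching human dissimilarity matrix.

```python
# Minimal sketch (not the authors' exact code) of a hybrid triplet + RSA loss.
import torch
import torch.nn.functional as F

def triplet_term(anchor, positive, negative, margin=0.2):
    # Local constraints: each anchor should sit closer to its positive than its negative.
    d_pos = 1.0 - F.cosine_similarity(anchor, positive)
    d_neg = 1.0 - F.cosine_similarity(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()

def rsa_term(emb, human_dist):
    # Global structure: Pearson correlation between model and human pairwise
    # distances over the batch, turned into a loss as (1 - r).
    model_dist = 1.0 - F.cosine_similarity(emb.unsqueeze(1), emb.unsqueeze(0), dim=-1)
    iu = torch.triu_indices(emb.size(0), emb.size(0), offset=1)
    x = model_dist[iu[0], iu[1]]
    y = human_dist[iu[0], iu[1]]
    x = (x - x.mean()) / (x.std(unbiased=False) + 1e-8)
    y = (y - y.mean()) / (y.std(unbiased=False) + 1e-8)
    return 1.0 - (x * y).mean()

def hybrid_loss(anchor, positive, negative, emb, human_dist, alpha=0.5):
    # Weighted combination of the local (triplet) and global (RSA) terms.
    return alpha * triplet_term(anchor, positive, negative) + (1 - alpha) * rsa_term(emb, human_dist)
```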
Despite the task being purely visual, caption embeddings from a language model predict human similarity judgments better than any pretrained video model (e.g., mpnet-base-v2 > TimeSformer-base).
October 3, 2025 at 1:48 PM
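For intuition, a hedged sketch of this kind of comparison with sentence-transformers; the model name "all-mpnet-base-v2", the captions, and the human RDM file are assumptions, not the paper's exact setup.

```python
# Illustrative only: correlate caption-embedding distances with a human
# dissimilarity matrix for the same videos (an RSA-style comparison).
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import pearsonr
from sentence_transformers import SentenceTransformer

captions = [
    "two people shake hands in a kitchen",   # placeholder captions, one per video
    "a child chases a dog across a lawn",
    "a crowd watches a street performer",
]
human_dissim = np.load("human_rdm.npy")      # hypothetical (n_videos, n_videos) human RDM

model = SentenceTransformer("all-mpnet-base-v2")
emb = model.encode(captions)                 # (n_videos, dim) caption embeddings

model_rdm = pdist(emb, metric="cosine")      # upper-triangle pairwise distances
human_rdm = human_dissim[np.triu_indices(len(captions), k=1)]

r, _ = pearsonr(model_rdm, human_rdm)        # RSA score for the caption-embedding model
print(f"caption-embedding RSA: r = {r:.3f}")
```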
🚀 Together, this work highlights a major gap in AI's ability to match human social vision, and underscores the importance of developing AI models in dynamic social contexts [6/6]
April 23, 2025 at 6:08 PM
📹 While most model features (like architecture or training objective) did not affect performance, we saw a big advantage for video over image models along the lateral stream. But no model tested could predict anterior lateral stream responses well. [5/6]
April 23, 2025 at 6:08 PM
🔍 Unlike visual scene features and ventral stream responses, vision models struggled to match human action and social interaction ratings, and did a poor job of predicting brain responses along the recently proposed lateral stream, specialized for social perception. [4/6]
April 23, 2025 at 6:08 PM
🧠 We benchmarked 350+ image, video, and language models against human behavioral and neural responses to dynamic, social videos. [3/6]
April 23, 2025 at 6:08 PM
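As a rough illustration of how a single model's features might be scored against neural responses in this kind of benchmark, here is a generic cross-validated ridge-regression encoding sketch with placeholder data; it is not the paper's pipeline.

```python
# Generic encoding-model sketch: predict each voxel's response to held-out
# videos from a model's features, then score with Pearson correlation.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
features  = rng.random((200, 512))    # placeholder: model features, one row per video
responses = rng.random((200, 100))    # placeholder: neural responses (videos x voxels)

n_splits = 5
scores = np.zeros(responses.shape[1])
for train, test in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(features):
    fit = RidgeCV(alphas=np.logspace(-2, 4, 7)).fit(features[train], responses[train])
    pred = fit.predict(features[test])
    for v in range(responses.shape[1]):   # per-voxel prediction accuracy, averaged over folds
        scores[v] += np.corrcoef(pred[:, v], responses[test, v])[0, 1] / n_splits
print("mean encoding accuracy:", scores.mean())
```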
🎥 Real-world vision is dynamic, involving complex social interactions. Current AI models provide a good match to humans in static scene vision, but how do they fare with dynamic, social stimuli? 🤔 We set out to explore this! [2/6]
April 23, 2025 at 6:08 PM