Kathy Garcia
@gkathy.bsky.social
Computational Cognitive Science PhD at Johns Hopkins with Leyla Isik

| BS @Stanford |

| 🔗 https://garciakathy.github.io/ |
Together, this work shows how different types of human similarity judgments can be leveraged to improve video models.

We also share our large-scale video similarity judgment dataset and code for hybrid triplet/RSA behavior-guided fine-tuning: github.com/garciakathy/...
GitHub - garciakathy/similarity-judgments-finetuning
October 3, 2025 at 1:48 PM
In follow-up experiments, we show that the fine-tuned model generalizes better to novel social tasks and avoids catastrophic forgetting, preserving baseline performance on action recognition tasks.
October 3, 2025 at 1:48 PM
After fine-tuning, the video model both captures more shared variance with language models AND explains more unique variance in human judgments, indicating it learned both language-like semantics and additional visual social nuances.
October 3, 2025 at 1:48 PM
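A minimal sketch of how such a shared/unique variance partitioning could be computed with nested regressions; all inputs below are placeholders and this is not necessarily the paper's exact pipeline:

```python
# Hypothetical variance partitioning between video-model and language-model
# predictors of human similarity judgments (placeholder data, illustrative only).
import numpy as np
from sklearn.linear_model import LinearRegression

def r2(X, y):
    # In-sample R^2 of a linear fit (a cross-validated version would be preferable).
    return LinearRegression().fit(X, y).score(X, y)

rng = np.random.default_rng(0)
X_video = rng.random((500, 10))   # placeholder video-model predictors per video pair
X_lang  = rng.random((500, 10))   # placeholder language-model predictors per video pair
y       = rng.random(500)         # placeholder human similarity judgments

r2_video = r2(X_video, y)
r2_lang  = r2(X_lang, y)
r2_both  = r2(np.hstack([X_video, X_lang]), y)

unique_video = r2_both - r2_lang                     # variance only the video model explains
unique_lang  = r2_both - r2_video                    # variance only the language model explains
shared       = r2_both - unique_video - unique_lang  # variance shared by both
print(f"shared={shared:.3f} unique_video={unique_video:.3f} unique_lang={unique_lang:.3f}")
```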
Hybrid fine-tuning substantially increases the match to human judgments on held-out videos and surpasses the best language-model baseline. The hybrid loss outperforms both the triplet-only and the RSA-only loss.
October 3, 2025 at 1:48 PM
We fine-tune a video transformer with a novel hybrid objective: triplet loss (local constraints) + RSA loss (global Pearson correlation over pairwise distances), capturing both local and global similarity structure. We use Low-Rank Adaptation (LoRA) to reduce the number of trainable parameters.
October 3, 2025 at 1:48 PM
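A rough sketch of such a hybrid objective, under assumptions rather than the authors' exact implementation (see the repo linked above for the real code): `emb` is assumed to be (batch, dim) embeddings from a LoRA-adapted video transformer, and `human_dist` a matching human dissimilarity matrix.

```python
# Minimal sketch (not the authors' exact code) of a hybrid triplet + RSA loss.
import torch
import torch.nn.functional as F

def triplet_term(anchor, positive, negative, margin=0.2):
    # Local constraints: each anchor should sit closer to its positive than its negative.
    d_pos = 1.0 - F.cosine_similarity(anchor, positive)
    d_neg = 1.0 - F.cosine_similarity(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()

def rsa_term(emb, human_dist):
    # Global structure: Pearson correlation between model and human pairwise
    # distances over the batch, turned into a loss as (1 - r).
    model_dist = 1.0 - F.cosine_similarity(emb.unsqueeze(1), emb.unsqueeze(0), dim=-1)
    iu = torch.triu_indices(emb.size(0), emb.size(0), offset=1)
    x = model_dist[iu[0], iu[1]]
    y = human_dist[iu[0], iu[1]]
    x = (x - x.mean()) / (x.std(unbiased=False) + 1e-8)
    y = (y - y.mean()) / (y.std(unbiased=False) + 1e-8)
    return 1.0 - (x * y).mean()

def hybrid_loss(anchor, positive, negative, emb, human_dist, alpha=0.5):
    # Weighted combination of the local (triplet) and global (RSA) terms.
    return alpha * triplet_term(anchor, positive, negative) + (1 - alpha) * rsa_term(emb, human_dist)
```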
Despite the task being purely visual, caption embeddings from a language model predict human similarity judgments better than any pretrained video model (e.g., mpnet-base-v2 > TimeSformer-base).
October 3, 2025 at 1:48 PM
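For intuition, a hedged sketch of this kind of comparison with sentence-transformers; the model name "all-mpnet-base-v2", the captions, and the human RDM file are assumptions, not the paper's exact setup.

```python
# Illustrative only: correlate caption-embedding distances with a human
# dissimilarity matrix for the same videos (an RSA-style comparison).
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import pearsonr
from sentence_transformers import SentenceTransformer

captions = [
    "two people shake hands in a kitchen",   # placeholder captions, one per video
    "a child chases a dog across a lawn",
    "a crowd watches a street performer",
]
human_dissim = np.load("human_rdm.npy")      # hypothetical (n_videos, n_videos) human RDM

model = SentenceTransformer("all-mpnet-base-v2")
emb = model.encode(captions)                 # (n_videos, dim) caption embeddings

model_rdm = pdist(emb, metric="cosine")      # upper-triangle pairwise distances
human_rdm = human_dissim[np.triu_indices(len(captions), k=1)]

r, _ = pearsonr(model_rdm, human_rdm)        # RSA score for the caption-embedding model
print(f"caption-embedding RSA: r = {r:.3f}")
```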
🚀 Together, this work highlights a major gap in AI's ability to match human social vision, and underscores the importance of developing AI models in dynamic social contexts [6/6]
April 23, 2025 at 6:08 PM
📹 While most model features (like architecture or training objective) did not affect performance, we saw a big advantage for video over image models along the lateral stream. But no model tested could predict anterior lateral stream responses well. [5/6]
April 23, 2025 at 6:08 PM
🔍 Unlike visual scene features and ventral stream responses, vision models struggled to match human action and social interaction ratings, and did a poor job of predicting brain responses along the recently proposed lateral stream, specialized for social perception. [4/6]
April 23, 2025 at 6:08 PM
🧠 We benchmarked 350+ image, video, and language models against human behavioral and neural responses to dynamic, social videos. [3/6]
April 23, 2025 at 6:08 PM
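As a rough illustration of how a single model's features might be scored against neural responses in this kind of benchmark, here is a generic cross-validated ridge-regression encoding sketch with placeholder data; it is not the paper's pipeline.

```python
# Generic encoding-model sketch: predict each voxel's response to held-out
# videos from a model's features, then score with Pearson correlation.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
features  = rng.random((200, 512))    # placeholder: model features, one row per video
responses = rng.random((200, 100))    # placeholder: neural responses (videos x voxels)

n_splits = 5
scores = np.zeros(responses.shape[1])
for train, test in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(features):
    fit = RidgeCV(alphas=np.logspace(-2, 4, 7)).fit(features[train], responses[train])
    pred = fit.predict(features[test])
    for v in range(responses.shape[1]):   # per-voxel prediction accuracy, averaged over folds
        scores[v] += np.corrcoef(pred[:, v], responses[test, v])[0, 1] / n_splits
print("mean encoding accuracy:", scores.mean())
```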
🎥 Real-world vision is dynamic, involving complex social interactions. Current AI models provide a good match to humans in static scene vision, but how do they fare with dynamic, social stimuli? 🤔 We set out to explore this! [2/6]
April 23, 2025 at 6:08 PM