Hazel Doughty
@hazeldoughty.bsky.social
Assistant Professor at Leiden University, NL. Computer Vision, Video Understanding.
https://hazeldoughty.github.io
This was a monumental effort from a large team across Bristol, Leiden, Singapore and Bath.

The VQA benchmark only scratches the surface of what is possible to evaluate with this detail of annotation.

Check out the website if you want to know more: hd-epic.github.io
https://hd-epic.github.io/
February 7, 2025 at 12:27 PM
VQA Benchmark

Our benchmark tests understanding in recipes, ingredients, nutrition, fine-grained actions, 3D perception, object movement and gaze. Current models have a long way to go, with a best performance of 38% vs. a 90% human baseline.
February 7, 2025 at 12:27 PM
Scene & Object Movements

We reconstruct participants' kitchens and annotate every time an object is moved.
February 7, 2025 at 12:27 PM
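As a rough sketch of what a single object-movement annotation could look like, linking an object to the fixtures it was taken from and placed on, the record below is hypothetical and its field names are invented for illustration, not the dataset's actual schema.

# Hypothetical object-movement annotation record
# (field names are illustrative only, not the dataset's actual schema).
from dataclasses import dataclass

@dataclass
class ObjectMove:
    object_name: str      # e.g. "mug"
    video_id: str         # recording in which the move was observed
    pick_up_time: float   # seconds from the start of the video
    put_down_time: float
    from_fixture: str     # kitchen fixture the object left, e.g. "cupboard"
    to_fixture: str       # fixture it was placed on, e.g. "counter"

move = ObjectMove("mug", "P01-video-01", 12.4, 57.9, "cupboard", "counter")
print(f"{move.object_name}: {move.from_fixture} -> {move.to_fixture}")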
Fine-grained Actions

Every action has a dense description covering not only what happens in detail, but also how and why it happens.
February 7, 2025 at 12:27 PM
As well as annotating temporal segments corresponding to each step, we also annotate all the preparation needed to complete each step.
February 7, 2025 at 12:27 PM
Recipe & Nutrition

We collect details of all the recipes participants chose to perform over 3 days in their own kitchens, alongside ingredient weights and nutrition information.
February 7, 2025 at 12:27 PM
We propose a simple baseline using phrase-level negatives and visual prompting to balance coarse- and fine-grained performance. This can easily be combined with existing approaches. However, there is much potential for future work.
December 10, 2024 at 6:46 AM
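For a flavour of how phrase-level negatives can enter training: one common option is to add each video's hard negative captions to the candidate set of a standard contrastive loss. The sketch below is an assumption for illustration, not the paper's exact formulation, and the visual-prompting part is not shown.

# Minimal sketch: video-text contrastive loss with phrase-level hard negatives
# appended to the in-batch candidates (shapes and weighting are assumptions).
import torch
import torch.nn.functional as F

def contrastive_with_hard_negatives(video_emb, text_emb, hard_neg_emb, temperature=0.07):
    """video_emb: (B, D), text_emb: (B, D), hard_neg_emb: (B, K, D).

    Each video's candidates are the B in-batch captions plus its own K
    phrase-level negatives (captions differing by a single word or phrase).
    """
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    hard_neg_emb = F.normalize(hard_neg_emb, dim=-1)

    # In-batch similarities: (B, B); the diagonal holds the positives.
    sim_batch = video_emb @ text_emb.t()
    # Similarities to each video's own hard negatives: (B, K).
    sim_hard = torch.einsum('bd,bkd->bk', video_emb, hard_neg_emb)

    logits = torch.cat([sim_batch, sim_hard], dim=1) / temperature
    targets = torch.arange(video_emb.size(0), device=video_emb.device)
    return F.cross_entropy(logits, targets)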
Incorporating fine-grained negatives into training does improve fine-grained performance; however, it comes at the cost of coarse-grained performance.
December 10, 2024 at 6:46 AM
We use this evaluation to investigate current models and find they lack fine-grained understanding, particularly for adverbs and prepositions.

We also see that good coarse-grained performance does not necessarily indicate good fine-grained performance.
December 10, 2024 at 6:46 AM
We propose a new fine-grained evaluation approach which analyses a model's sensitivity to individual word variations in different parts of speech.

Our approach automatically creates new fine-grained negative captions and can be applied to any existing dataset.
December 10, 2024 at 6:46 AM
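As a rough illustration of the idea (not the actual implementation): a fine-grained negative caption can be formed by swapping one word for another of the same part of speech, so the negative differs from the original caption by exactly one word. The word pools and helper below are invented for the example.

# Toy sketch of building a single-word negative caption for a chosen
# part of speech (word lists and helper name are illustrative only).
import random

ALTERNATIVES = {
    "verb": ["opens", "closes", "lifts", "pushes"],
    "adverb": ["slowly", "quickly", "gently", "firmly"],
    "preposition": ["into", "onto", "under", "behind"],
}

def make_negative(caption, word, pos):
    """Replace `word` (with part of speech `pos`) by a different word of the
    same part of speech, yielding a caption differing by exactly one word."""
    candidates = [w for w in ALTERNATIVES[pos] if w != word]
    return caption.replace(word, random.choice(candidates), 1)

original = "the person slowly opens the drawer"
negative = make_negative(original, "slowly", "adverb")
print(original, "->", negative)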
Current video-text retrieval benchmarks focus on coarse-grained differences, as they only require distinguishing the correct caption from the captions of other, often irrelevant videos.

Captions thus rarely differ by a single word or concept.
December 10, 2024 at 6:46 AM
Training a model on these video-text pairs results in a representation that is beneficial to motion-focused downstream tasks, particularly when little data is available for finetuning.
December 10, 2024 at 6:42 AM
Since we know how our synthetic motions have been generated, we can also generate captions to describe them using pre-defined phrases. We then diversify the vocabulary and structure of our descriptions with our verb-variation paraphrasing.
December 10, 2024 at 6:42 AM
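As a toy illustration of template-based captioning for a known synthetic motion, followed by simple verb variation: the motion parameters, phrase template and synonym lists below are invented for the example and are not the actual generation pipeline.

# Toy sketch: fill a predefined phrase with the parameters used to generate
# a synthetic motion, then swap in a synonymous verb for variety.
import random

VERB_SYNONYMS = {
    "rotate": ["rotate", "spin", "turn"],
    "translate": ["move", "slide", "shift"],
}

def describe_motion(motion_type, direction, speed):
    """Describe a synthetic motion from its known generation parameters."""
    verb = random.choice(VERB_SYNONYMS[motion_type])
    return f"the object {verb}s {direction} at a {speed} pace"

print(describe_motion("rotate", "clockwise", "slow"))
print(describe_motion("translate", "to the left", "fast"))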