Hazel Doughty
@hazeldoughty.bsky.social
Assistant Professor at Leiden University, NL. Computer Vision, Video Understanding.
https://hazeldoughty.github.io
VQA Benchmark

Our benchmark tests understanding of recipes, ingredients, nutrition, fine-grained actions, 3D perception, object movement and gaze. Current models have a long way to go, with a best performance of 38% versus a 90% human baseline.
February 7, 2025 at 12:27 PM
Scene & Object Movements

We reconstruct participants' kitchens and annotate every time an object is moved.
February 7, 2025 at 12:27 PM
Fine-grained Actions

Every action has a dense description not only describing what happens in detail, but also how and why it happens.
February 7, 2025 at 12:27 PM
As well as annotating the temporal segments corresponding to each step, we also annotate all the preparation needed to complete each step.
February 7, 2025 at 12:27 PM
Recipe & Nutrition

We collect details of all the recipes participants chose to perform over 3 days in their own kitchens, alongside ingredient weights and nutrition.
February 7, 2025 at 12:27 PM
📢 Today we're releasing a new highly detailed dataset for video understanding: HD-EPIC

arxiv.org/abs/2502.04144

hd-epic.github.io

What makes the dataset unique is the vast detail of the annotations: 263 annotations per minute over 41 hours of video.
February 7, 2025 at 12:27 PM
Incorporating fine-grained negatives into training does improve fine-grained performance; however, it comes at the cost of coarse-grained performance.
December 10, 2024 at 6:46 AM
We use this evaluation to investigate current models and find they lack fine-grained understanding, particularly for adverbs and prepositions.

We also see that good coarse-grained performance does not necessarily indicate good fine-grained performance.
December 10, 2024 at 6:46 AM
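As an illustration of the kind of sensitivity probe described above, one could compare a model's video-text matching score for the original caption against its score for each single-word negative. The sketch below assumes a generic embedding-based retrieval model and uses cosine similarity as the score; it is not the paper's code.

```python
# Illustrative sketch: a model is "sensitive" to a word variation if it scores
# the original caption higher than the single-word negative for the same video.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def sensitivity(video_emb, pos_caption_emb, neg_caption_embs) -> float:
    """Fraction of negative captions ranked below the original caption."""
    pos = cosine(video_emb, pos_caption_emb)
    wins = [cosine(video_emb, neg) < pos for neg in neg_caption_embs]
    return sum(wins) / len(wins)

# Toy example with random embeddings standing in for a real video-text model.
rng = np.random.default_rng(0)
video = rng.normal(size=512)
print(sensitivity(video, rng.normal(size=512),
                  [rng.normal(size=512) for _ in range(5)]))
```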
We propose a new fine-grained evaluation approach which analyses a model's sensitivity to individual word variations in different parts-of-speech.

Our approach automatically creates new fine-grained negative captions and can be applied to any existing dataset.
December 10, 2024 at 6:46 AM
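A minimal sketch of how such single-word negatives could be constructed, assuming a spaCy part-of-speech tagger and a small hand-written substitution vocabulary. Both are illustrative placeholders; the actual generation pipeline may differ.

```python
# Illustrative sketch: create a fine-grained negative caption by swapping a
# single word of a chosen part-of-speech. The substitution lists are toy examples.
import spacy

nlp = spacy.load("en_core_web_sm")

# Hypothetical per-POS substitution lists (VERB, ADV, ADP = preposition).
SUBSTITUTIONS = {
    "VERB": {"open": "close", "pour": "spill", "cut": "tear"},
    "ADV": {"quickly": "slowly", "gently": "firmly"},
    "ADP": {"into": "out of", "onto": "under"},
}

def make_negative(caption: str, target_pos: str):
    """Return a caption differing from the input by one word of target_pos."""
    doc = nlp(caption)
    tokens = [t.text for t in doc]
    for i, tok in enumerate(doc):
        if tok.pos_ == target_pos and tok.lower_ in SUBSTITUTIONS.get(target_pos, {}):
            tokens[i] = SUBSTITUTIONS[target_pos][tok.lower_]
            return " ".join(tokens)
    return None  # no swappable word of this part-of-speech

print(make_negative("quickly pour the water into the pan", "ADV"))
# -> "slowly pour the water into the pan"
```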
Current video-text retrieval benchmarks focus on coarse-grained differences, since they only require distinguishing the correct caption from the captions of other, often irrelevant videos.

Captions thus rarely differ by a single word or concept.
December 10, 2024 at 6:46 AM
Our second #ACCV2024 oral, "Beyond Coarse-Grained Matching in Video-Text Retrieval", is also being presented today.

ArXiv: arxiv.org/abs/2410.12407

We go beyond coarse-grained retrieval and explore whether models can discern subtle single-word differences in captions.
December 10, 2024 at 6:46 AM
Since we know how our synthetic motions have been generated, we can also generate captions to describe them using pre-defined phrases. We then diversify the vocabulary and structure of our descriptions with our verb-variation paraphrasing.
December 10, 2024 at 6:42 AM
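A rough sketch of what template-based caption generation with a verb-variation step could look like. The template and synonym lists below are invented for illustration, not taken from the paper.

```python
# Illustrative sketch: describe a known synthetic motion with a template,
# then diversify the description by varying the verb.
import random

TEMPLATE = "the {object} {verb} {direction} across the frame"

VERB_VARIATIONS = {
    "moves": ["moves", "slides", "drifts", "travels"],
    "spins": ["spins", "rotates", "twirls"],
}

def describe_motion(obj: str, verb: str, direction: str, rng: random.Random) -> str:
    """Generate a caption for a synthetic motion, sampling a verb variation."""
    verb = rng.choice(VERB_VARIATIONS.get(verb, [verb]))
    return TEMPLATE.format(object=obj, verb=verb, direction=direction)

rng = random.Random(0)
print(describe_motion("cup", "moves", "leftwards", rng))
```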
We address this by proposing a method to learn motion-focused representations with available spatially-focused data. We first generate synthetic local object motions and inject these into training videos.
December 10, 2024 at 6:42 AM
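One way to picture the motion-injection step above is pasting a small patch into each frame along a simple trajectory, as in the NumPy sketch below. The linear trajectory and hard compositing are assumptions for illustration; the actual method may generate motions differently.

```python
# Illustrative sketch: inject a synthetic local motion by pasting a small patch
# into each frame of a video along a linear trajectory. Toy example only.
import numpy as np

def inject_linear_motion(video: np.ndarray, patch: np.ndarray,
                         start: tuple, end: tuple) -> np.ndarray:
    """video: (T, H, W, C) uint8; patch: (h, w, C). Returns a copy with the
    patch moving from `start` to `end` (top-left corners) over the T frames."""
    out = video.copy()
    t_total = len(video)
    h, w = patch.shape[:2]
    for t in range(t_total):
        alpha = t / max(t_total - 1, 1)
        y = int(round(start[0] + alpha * (end[0] - start[0])))
        x = int(round(start[1] + alpha * (end[1] - start[1])))
        out[t, y:y + h, x:x + w] = patch
    return out

# Toy usage: a grey 16x16 square sweeping across an 8-frame black clip.
clip = np.zeros((8, 128, 128, 3), dtype=np.uint8)
square = np.full((16, 16, 3), 128, dtype=np.uint8)
moved = inject_linear_motion(clip, square, start=(56, 0), end=(56, 112))
```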
Moreover, a large proportion of captions are uniquely identifiable by their nouns alone, meaning a caption can be matched to the correct video clip by recognizing only the correct nouns.
December 10, 2024 at 6:42 AM
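The noun-only identifiability observation above can be checked with a simple analysis: extract each caption's nouns and count how many captions have a noun set shared by no other caption. A sketch using spaCy is below; the tagger choice and lemma-set signature are assumptions for illustration.

```python
# Illustrative sketch: how often is a caption identifiable from its nouns alone?
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

def noun_signature(caption: str) -> frozenset:
    """Set of noun lemmas appearing in the caption."""
    return frozenset(t.lemma_ for t in nlp(caption) if t.pos_ in {"NOUN", "PROPN"})

captions = [
    "pour water into the pan",
    "pour milk into the glass",
    "cut the onion on the board",
]
signatures = [noun_signature(c) for c in captions]
counts = Counter(signatures)
unique = sum(counts[s] == 1 for s in signatures)
print(f"{unique}/{len(captions)} captions have a unique noun set")
```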
Captions in current video-language pre-training and downstream datasets are spatially focused, with nouns far more prevalent than verbs or adjectives.
December 10, 2024 at 6:42 AM
Today we're presenting our #ACCV2024 Oral "LocoMotion: Learning Motion-Focused Video-Language Representations".

We remove the spatial focus of video-language representations and instead train representations to have a motion focus.
December 10, 2024 at 6:42 AM