Zhaofeng Lin
@zhaofenglin.bsky.social
PhD student @Trinity College Dublin | Multimodal speech recognition
https://chaufanglin.github.io/
Congrats!! 🤩
July 2, 2025 at 11:47 AM
This paper investigates how AVSR systems exploit visual information.
Our findings reveal varying patterns across systems, which do not always align with human perception.

We recommend reporting *effective SNR gains* alongside WERs for a more comprehensive performance assessment 🧐
[8/8] 🧵
April 1, 2025 at 11:18 AM
Results show Auto-AVSR may rely more on audio, with a weaker correlation between MaFI scores and IWERs in AV mode.
In contrast, AVEC shows a stronger use of visual information, with a significant negative correlation, especially in noisy conditions.

[7/8] 🧵
April 1, 2025 at 11:18 AM
Finally, we explore the relationship between AVSR errors and Mouth & Facial Informativeness (MaFI) scores.

We calculated Individual Word Error Rates (IWERs) for each word and computed the Pearson correlation between MaFI scores and IWERs for audio-only, video-only, and AV models. [6/8] 🧵
April 1, 2025 at 11:17 AM
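
A rough sketch of the correlation analysis described in the post above, assuming per-word MaFI scores and IWERs have already been computed; the words and values below are illustrative placeholders, not data from the paper.

```python
# Pearson correlation between per-word MaFI scores and per-word IWERs.
# The dictionary below is hypothetical example data, not the paper's release.
from scipy.stats import pearsonr

# word -> (MaFI score, IWER) for one system in one noise condition.
per_word = {
    "about":  (0.31, 0.12),
    "people": (0.58, 0.08),
    "three":  (0.72, 0.05),
}

mafi = [m for m, _ in per_word.values()]
iwer = [w for _, w in per_word.values()]

r, p = pearsonr(mafi, iwer)
# A negative r means words with more informative mouth/face movements
# tend to be recognised with fewer errors.
print(f"Pearson r = {r:.3f} (p = {p:.3f})")
```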
Occlusion tests reveal AVSR models rely differently on visual segments.

Auto-AVSR & AV-RelScore are equally affected by initial & middle occlusions, while AVEC is more impacted by middle occlusion.

Unlike humans, AVSR models do not depend on initial visual cues.

[5/8] 🧵
April 1, 2025 at 11:17 AM
Speech perception research shows that visual cues at the start of a word have a stronger impact on humans.

We test AVSR by occluding the initial vs. middle third of frames for each word, comparing 3 conditions: no occlusion, initial occlusion, and middle occlusion.

[4/8] 🧵
April 1, 2025 at 11:17 AM
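
A minimal sketch of the occlusion protocol from the post above: for a word spanning a range of video frames, blank out either the initial or the middle third of that span. The function name, array shapes, and zero-masking are assumptions for illustration, not the paper's actual pipeline.

```python
import numpy as np

def occlude_word(frames, start, end, position="initial"):
    """frames: (T, H, W) lip-region video; [start, end) is the word's frame span."""
    frames = frames.copy()
    span = end - start
    third = max(span // 3, 1)
    if position == "initial":
        occ_start = start
    elif position == "middle":
        occ_start = start + third
    else:  # "none": leave the word untouched
        return frames
    frames[occ_start:occ_start + third] = 0.0  # blank the selected third of frames
    return frames

# Example: ~3 s of 25 fps video, one word covering frames 20-34.
video = np.random.rand(75, 96, 96)
occluded = occlude_word(video, start=20, end=35, position="middle")
```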
First, we revisit the *effective SNR gain*: the difference between 0 dB and the SNR at which the AVSR system's WER matches the reference audio-only WER at 0 dB.

This metric quantifies how much the visual modality lowers WER relative to the audio-only system. [3/8] 🧵
April 1, 2025 at 11:16 AM
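
A minimal sketch of how the effective SNR gain described above can be computed from a WER-vs-SNR curve, assuming WERs for the AVSR system at several SNRs and the audio-only WER at 0 dB are already measured; the numbers below are hypothetical.

```python
import numpy as np

def effective_snr_gain(avsr_snrs_db, avsr_wers, audio_only_wer_at_0db):
    """0 dB minus the SNR at which the AVSR WER equals the audio-only WER at 0 dB,
    found by linear interpolation along the AVSR WER-vs-SNR curve."""
    wers = np.asarray(avsr_wers, dtype=float)
    snrs = np.asarray(avsr_snrs_db, dtype=float)
    order = np.argsort(wers)                 # np.interp needs increasing x-values
    matching_snr = np.interp(audio_only_wer_at_0db, wers[order], snrs[order])
    return 0.0 - matching_snr

# Hypothetical curve: the AVSR system reaches the audio-only 0 dB WER (0.25)
# at roughly -7.5 dB, i.e. an effective SNR gain of about 7.5 dB.
snrs = [-10, -7.5, -5, -2.5, 0]
avsr_wers = [0.42, 0.25, 0.16, 0.10, 0.07]
print(effective_snr_gain(snrs, avsr_wers, audio_only_wer_at_0db=0.25))
```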
In this paper, we take a step back to assess SOTA systems (all from 2023) from a different perspective by considering *human speech perception*.

Through this, we hope to gain insight into whether the visual component is being fully exploited in existing AVSR systems. [2/8] 🧵
April 1, 2025 at 11:16 AM
🙋‍♂️
November 27, 2024 at 10:07 AM