Eugene Jang @EMNLP
@eugeneonnlp.bsky.social
NLP PhD student @ Northeastern
Multilingual NLP, tokenizers

https://genesith.github.io/
Thanks to coauthors from S2W Inc. (Jin-Woo Chung
, Keuntae Park), and KAIST (professors Kimin Lee
and Seungwon Shin)!

You can find our paper here: arxiv.org/abs/2410.23684 (11/11)
November 12, 2024 at 5:10 AM
"But a phrase like ट能 is very OOD. Are you sure these hallucinations are a tokenization problem?"

We think so! When we tokenize the same phrase differently to *avoid* incomplete tokens, the models generally perform much better (including a 93% reduction in hallucinations for Llama 3.1). (7/11)
November 12, 2024 at 5:08 AM
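The point about re-tokenization can be illustrated in plain Python: the same byte string admits multiple segmentations, and only some of them cross character boundaries. (A minimal sketch with hand-picked segmentations, not output from any specific tokenizer.)

```python
s = "ट能"
# Incomplete-token segmentation: the boundary falls inside ट's 3-byte UTF-8 sequence.
seg_incomplete = [b"\xe0\xa4", b"\x9f" + "能".encode("utf-8")]
# Character-aligned segmentation of the exact same bytes.
seg_aligned = ["ट".encode("utf-8"), "能".encode("utf-8")]

# Both decode to the same string, but only the first forces the model
# to resolve stray bytes across a token boundary.
assert b"".join(seg_incomplete) == b"".join(seg_aligned) == s.encode("utf-8")
```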
We prepare up to 100 improbable bigrams for each tokenizer, and use comparable complete-token bigrams as baselines.
Improbable bigrams were significantly more prone to hallucinations.
(For this, we only used trained tokens, to remove the influence of glitch tokens.) (6/11)
November 12, 2024 at 5:08 AM
We test a model's ability to repeat a target phrase in three different scenarios, a task that should be doable even for meaningless phrases.
A target phrase is considered hallucinatory only if the model fails to repeat the phrase in all 3 prompts. (5/11)
November 12, 2024 at 5:08 AM
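The repetition test above can be sketched as a small harness. The `generate(prompt)` function and the exact prompt wordings are hypothetical stand-ins, not taken from the paper:

```python
# Three repetition scenarios (illustrative wordings only).
PROMPT_TEMPLATES = [
    'Repeat the following phrase exactly: "{target}"',
    'Please copy this text verbatim: "{target}"',
    'Output the phrase "{target}" and nothing else.',
]

def is_hallucinatory(target: str, generate) -> bool:
    """A phrase counts as hallucinatory only if the model fails to
    reproduce it under *all* three prompts."""
    return all(target not in generate(t.format(target=target))
               for t in PROMPT_TEMPLATES)

# With a trivial echo "model", no phrase is ever flagged:
assert is_hallucinatory("ट能", lambda p: p) is False
```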
We can analyze each incomplete token's structure based on its starting bytes and continuation bytes, and then find which tokens have complementary structures.
If the pair re-encodes back to the same two incomplete tokens, it is a legal incomplete bigram. (4/11)
November 12, 2024 at 5:07 AM
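The complementary-structure idea can be sketched with plain UTF-8 checks: neither token decodes on its own, but their concatenation does. (This is only the byte-level half of the test; the paper's full legality check also re-encodes the combined string with the real tokenizer, which is omitted here.)

```python
def is_legal_incomplete_bigram(tok_a: bytes, tok_b: bytes) -> bool:
    """True if two tokens have complementary byte structures:
    neither is valid UTF-8 alone, but their concatenation is.
    Sketch only: the re-encodability check against an actual
    tokenizer's vocabulary is not modeled here."""
    def decodes(b: bytes) -> bool:
        try:
            b.decode("utf-8")
            return True
        except UnicodeDecodeError:
            return False
    return (not decodes(tok_a)) and (not decodes(tok_b)) and decodes(tok_a + tok_b)

is_legal_incomplete_bigram(b"\xe0\xa4", b"\x9f" + "能".encode())  # True for the ट能 pair
```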
ट能 combines two "incomplete tokens" ('<0xE0><0xA4>' and '<0x9F>能').
Such tokens with stray bytes rely on adjacent tokens' stray bytes to resolve as a character.
If two such tokens combine into an "improbable bigram" like ट能, we get a phrase that causes model errors. (3/11)
November 12, 2024 at 5:07 AM
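The byte arithmetic behind this post can be verified directly: 'ट' (U+091F) is the three UTF-8 bytes E0 A4 9F, so the token '<0xE0><0xA4>' ends mid-character and depends on the stray 0x9F that '<0x9F>能' begins with.

```python
tok_a = bytes([0xE0, 0xA4])               # '<0xE0><0xA4>': lead byte + one continuation byte
tok_b = bytes([0x9F]) + "能".encode("utf-8")  # '<0x9F>能': stray continuation byte, then 能

# Neither token decodes on its own...
for tok in (tok_a, tok_b):
    try:
        tok.decode("utf-8")
    except UnicodeDecodeError:
        pass  # expected: each half contains stray bytes

# ...but their concatenation resolves to the full phrase.
assert (tok_a + tok_b).decode("utf-8") == "ट能"
```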
#nlp
Have you ever wondered what "ट能" means?
Probably not, since it's not a meaningful phrase.
But if you ever did, any well-trained LLM should be able to tell you that. Right?
Not quite! We discover that phrases like "ट能" trigger vulnerabilities in byte-level BPE tokenizers. (1/11)
November 12, 2024 at 5:06 AM