Eugene Jang @EMNLP
@eugeneonnlp.bsky.social
NLP PhD student @ Northeastern
Multilingual NLP, tokenizers

https://genesith.github.io/
Thanks to coauthors from S2W Inc. (Jin-Woo Chung
, Keuntae Park), and KAIST (professors Kimin Lee
and Seungwon Shin)!

You can find our paper here: arxiv.org/abs/2410.23684 (11/11)
November 12, 2024 at 5:10 AM
"But a phrase like ट能 is very OOD. Are you sure these hallucinations are a tokenization problem?"

We think so! When we tokenize the same phrase differently to *avoid* incomplete tokens, the models generally perform much better (including a 93% reduction in hallucinations for Llama 3.1). (7/11)
November 12, 2024 at 5:08 AM
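The point about re-tokenization can be illustrated in plain Python: the same byte string admits multiple segmentations, and only some of them cross character boundaries. (A minimal sketch with hand-picked segmentations, not output from any specific tokenizer.)

```python
s = "ट能"
# Incomplete-token segmentation: the boundary falls inside ट's 3-byte UTF-8 sequence.
seg_incomplete = [b"\xe0\xa4", b"\x9f" + "能".encode("utf-8")]
# Character-aligned segmentation of the exact same bytes.
seg_aligned = ["ट".encode("utf-8"), "能".encode("utf-8")]

# Both decode to the same string, but only the first forces the model
# to resolve stray bytes across a token boundary.
assert b"".join(seg_incomplete) == b"".join(seg_aligned) == s.encode("utf-8")
```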
We prepare up to 100 improbable bigrams for each tokenizer, and use comparable complete-token bigrams as baselines.
Improbable bigrams were significantly more prone to hallucinations.
(For this, we only used trained tokens, to remove the influence of glitch tokens.) (6/11)
November 12, 2024 at 5:08 AM
We test a model's ability to repeat a target phrase in three different scenarios, a task that should be doable even for meaningless phrases.
A target phrase is considered hallucinatory only if the model fails to repeat the phrase in all 3 prompts. (5/11)
November 12, 2024 at 5:08 AM
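The repetition test above can be sketched as a small harness. The `generate(prompt)` function and the exact prompt wordings are hypothetical stand-ins, not taken from the paper:

```python
# Three repetition scenarios (illustrative wordings only).
PROMPT_TEMPLATES = [
    'Repeat the following phrase exactly: "{target}"',
    'Please copy this text verbatim: "{target}"',
    'Output the phrase "{target}" and nothing else.',
]

def is_hallucinatory(target: str, generate) -> bool:
    """A phrase counts as hallucinatory only if the model fails to
    reproduce it under *all* three prompts."""
    return all(target not in generate(t.format(target=target))
               for t in PROMPT_TEMPLATES)

# With a trivial echo "model", no phrase is ever flagged:
assert is_hallucinatory("ट能", lambda p: p) is False
```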
We can analyze each incomplete token's structure based on its starting bytes and continuation bytes, and then find which tokens have complementary structures.
If the pair re-encodes back to the same two incomplete tokens, it is a legal incomplete bigram. (4/11)
November 12, 2024 at 5:07 AM
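The complementary-structure idea can be sketched with plain UTF-8 checks: neither token decodes on its own, but their concatenation does. (This is only the byte-level half of the test; the paper's full legality check also re-encodes the combined string with the real tokenizer, which is omitted here.)

```python
def is_legal_incomplete_bigram(tok_a: bytes, tok_b: bytes) -> bool:
    """True if two tokens have complementary byte structures:
    neither is valid UTF-8 alone, but their concatenation is.
    Sketch only: the re-encodability check against an actual
    tokenizer's vocabulary is not modeled here."""
    def decodes(b: bytes) -> bool:
        try:
            b.decode("utf-8")
            return True
        except UnicodeDecodeError:
            return False
    return (not decodes(tok_a)) and (not decodes(tok_b)) and decodes(tok_a + tok_b)

is_legal_incomplete_bigram(b"\xe0\xa4", b"\x9f" + "能".encode())  # True for the ट能 pair
```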
ट能 combines two "incomplete tokens" ('<0xE0><0xA4>' and '<0x9F>能').
Such tokens with stray bytes rely on adjacent tokens' stray bytes to resolve as a character.
If two such tokens combine into an "improbable bigram" like ट能, we get a phrase that causes model errors. (3/11)
November 12, 2024 at 5:07 AM
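The byte arithmetic behind this post can be verified directly: 'ट' (U+091F) is the three UTF-8 bytes E0 A4 9F, so the token '<0xE0><0xA4>' ends mid-character and depends on the stray 0x9F that '<0x9F>能' begins with.

```python
tok_a = bytes([0xE0, 0xA4])               # '<0xE0><0xA4>': lead byte + one continuation byte
tok_b = bytes([0x9F]) + "能".encode("utf-8")  # '<0x9F>能': stray continuation byte, then 能

# Neither token decodes on its own...
for tok in (tok_a, tok_b):
    try:
        tok.decode("utf-8")
    except UnicodeDecodeError:
        pass  # expected: each half contains stray bytes

# ...but their concatenation resolves to the full phrase.
assert (tok_a + tok_b).decode("utf-8") == "ट能"
```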
#nlp
Have you ever wondered what "ट能" means?
Probably not, since it's not a meaningful phrase.
But if you ever did, any well-trained LLM should be able to tell you that. Right?
Not quite! We discover that phrases like "ट能" trigger vulnerabilities in byte-level BPE tokenizers. (1/11)
November 12, 2024 at 5:06 AM