Eugene Jang @EMNLP
@eugeneonnlp.bsky.social
NLP PhD student @ Northeastern
Multilingual NLP, tokenizers

https://genesith.github.io/
I’ll be presenting our work on Byte-level Tokenizer Vulnerabilities at the poster session at 2:00pm!

If you’ve ever encountered oddities or frustrations with #tokenization, I’d love to chat about it! #EMNLP
November 6, 2025 at 9:23 PM
Reposted by Eugene Jang @EMNLP
To paraphrase Dennett (rip 💔), the goal of reviewing is to determine truth, not to conquer your opponent.

Too many reviewers seem to not have internalised this. In my opinion, this is the hardest lesson a reviewer has to learn, and I want to share some thoughts.
November 27, 2024 at 5:25 PM
#nlp
Have you ever wondered what "ट能" means?
Probably not, since it's not a meaningful phrase.
But if you ever did, any well-trained LLM should be able to tell you that. Right?
Not quite! We discover phrases like "ट能" trigger vulnerabilities in Byte-Level BPE Tokenizers. (1/11)
November 12, 2024 at 5:06 AM
A platform for coexistence.
November 8, 2024 at 5:21 AM
Hello World!

The sky really is bluer on the other side.
November 8, 2024 at 5:05 AM