Lightnews — Scholar-powered news

Craig Schmidt

@craigschmidt.com

510 followers 2.3K following 52 posts

Interested in ML, AI, and NLP. Particularly interested in tokenization. Live in the Boston area and work in R&D at Kensho Technologies.

Craig Schmidt

@craigschmidt.com

I believe he’s talking about Olin College of Engineering. Created from scratch as an undergraduate-only school, with the first class in 2002. Kind of a Harvey Mudd of the East. Campus is near me, and they seem to attract great students.

October 2, 2025 at 9:34 PM

Craig Schmidt

@craigschmidt.com

The other is that there isn't a way to specify an initial vocabulary with all 256 bytes including the continuation character ##. See github.com/huggingface/.... So in short, if you use their WordPiece you might get <unk> tokens.

WordPiece can't always avoid <unk> even with ByteLevel pretokenization. · Issue #1863 · huggingface/tokenizers

The ByteLevel pre-tokenizer is largely used to avoid the possibility of an <unk> token. However, there is a problem with the continuation characters in WordPiece that prevents you from adding all o...

September 18, 2025 at 3:42 PM
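
To make the point in the post above concrete, here is a minimal toy sketch of greedy longest-match WordPiece, written from scratch rather than taken from huggingface/tokenizers. The function names, the `chr(0x100 + b)` byte alphabet (a stand-in for ByteLevel's real byte-to-character mapping), and both vocabularies are assumptions made for illustration. It shows why having all 256 byte symbols is not enough on its own: every symbol also needs a `##`-prefixed form, otherwise any multi-byte word can still fall back to `<unk>`.

```python
def wordpiece(word, vocab, unk="<unk>", prefix="##"):
    """Toy greedy longest-match-first WordPiece for one pre-token."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = prefix + piece  # non-initial pieces carry the continuation prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes <unk>
        pieces.append(match)
        start = end
    return pieces


def byte_symbols(text):
    """Hypothetical stand-in for ByteLevel: one visible character per UTF-8 byte."""
    return "".join(chr(0x100 + b) for b in text.encode("utf-8"))


ALPHABET = [chr(0x100 + b) for b in range(256)]
vocab_word_initial = set(ALPHABET)                                # 256 byte symbols only
vocab_both = vocab_word_initial | {"##" + s for s in ALPHABET}    # plus ## continuation forms

word = byte_symbols("hi")
print(wordpiece(word, vocab_word_initial))  # ['<unk>']
print(wordpiece(word, vocab_both))          # two pieces, no <unk>
```

With only the word-initial byte symbols, the second byte of "hi" has no `##` form to match, so the word collapses to `<unk>`; adding the continuation forms removes that failure in this toy model.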

Craig Schmidt

@craigschmidt.com

I've posted a few papers I missed, including yours, here: bsky.app/profile/crai.... Thomas pointed that out about 5 seconds after I posted on the Discord :-)

Craig Schmidt @craigschmidt.com · Jul 30

14) GRaMPa: Subword Regularisation by Skewing Uniform Segmentation Distributions with an Efficient Path-counting Markov Model
Thomas Bauwens et al
aclanthology.org/2025.acl-lon...

GRaMPa: Subword Regularisation by Skewing Uniform Segmentation Distributions with an Efficient Path-counting Markov Model

Thomas Bauwens, David Kaczér, Miryam De Lhoneux. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025.

aclanthology.org

July 30, 2025 at 3:17 PM

Craig Schmidt

@craigschmidt.com

16) Causal Estimation of Tokenisation Bias
Pietro Lesci et al
aclanthology.org/2025.acl-lon...

Causal Estimation of Tokenisation Bias

Pietro Lesci, Clara Meister, Thomas Hofmann, Andreas Vlachos, Tiago Pimentel. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025.

aclanthology.org

July 30, 2025 at 2:22 PM

Craig Schmidt

@craigschmidt.com

15) Tokenisation is NP-Complete
Philip Whittington et al
aclanthology.org/2025.acl-lon...

Tokenisation is NP-Complete

Philip Whittington, Gregor Bachmann, Tiago Pimentel. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025.

aclanthology.org

July 30, 2025 at 2:22 PM

Craig Schmidt

@craigschmidt.com

14) GRaMPa: Subword Regularisation by Skewing Uniform Segmentation Distributions with an Efficient Path-counting Markov Model
Thomas Bauwens et al
aclanthology.org/2025.acl-lon...

GRaMPa: Subword Regularisation by Skewing Uniform Segmentation Distributions with an Efficient Path-counting Markov Model

Thomas Bauwens, David Kaczér, Miryam De Lhoneux. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025.

aclanthology.org

July 30, 2025 at 2:22 PM

Craig Schmidt

@craigschmidt.com

13) Evaluating Tokenizer Adaptation Methods for Large Language Models on Low-Resource Programming Languages
Georgii Andriushchenko et al
aclanthology.org/2025.acl-srw...

Evaluating Tokenizer Adaptation Methods for Large Language Models on Low-Resource Programming Languages

Georgy Andryushchenko, Vladimir V. Ivanov. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop). 2025.

aclanthology.org

July 30, 2025 at 2:03 PM

Craig Schmidt

@craigschmidt.com

12) Retrofitting Large Language Models with Dynamic Tokenization
Darius Feher et al
aclanthology.org/2025.acl-lon...

Retrofitting Large Language Models with Dynamic Tokenization

Darius Feher, Ivan Vulić, Benjamin Minixhofer. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025.

aclanthology.org

July 30, 2025 at 2:03 PM

Craig Schmidt

@craigschmidt.com

11) TokAlign: Efficient Vocabulary Adaptation via Token Alignment
Chong Li et al
aclanthology.org/2025.acl-lon...

TokAlign: Efficient Vocabulary Adaptation via Token Alignment

Chong Li, Jiajun Zhang, Chengqing Zong. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025.

aclanthology.org

July 30, 2025 at 2:03 PM

Craig Schmidt

@craigschmidt.com

10) Sticking to the Mean: Detecting Sticky Tokens in Text Embedding Models
Kexin Chen et al
aclanthology.org/2025.acl-lon...

Sticking to the Mean: Detecting Sticky Tokens in Text Embedding Models

Kexin Chen, Dongxia Wang, Yi Liu, Haonan Zhang, Wenhai Wang. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025.

aclanthology.org

July 30, 2025 at 2:03 PM

Craig Schmidt

@craigschmidt.com

9) Inconsistent Tokenizations Cause Language Models to be Perplexed by Japanese Grammar
Andrew Gambardella et al
aclanthology.org/2025.acl-sho...

Inconsistent Tokenizations Cause Language Models to be Perplexed by Japanese Grammar

Andrew Gambardella, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2025.

aclanthology.org

July 30, 2025 at 2:03 PM

Craig Schmidt

@craigschmidt.com

8) Adversarial Tokenization
Renato Lui Geh et al
aclanthology.org/2025.acl-lon...

Adversarial Tokenization

Renato Geh, Zilei Shao, Guy Van Den Broeck. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025.

aclanthology.org

July 30, 2025 at 2:03 PM

Craig Schmidt

@craigschmidt.com

7) Incorporating Domain Knowledge into Materials Tokenization
Yerim Oh et al
aclanthology.org/2025.acl-lon...

Incorporating Domain Knowledge into Materials Tokenization

Yerim Oh, Jun-Hyung Park, Junho Kim, SungHo Kim, SangKeun Lee. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025.

aclanthology.org

July 30, 2025 at 2:03 PM

Craig Schmidt

@craigschmidt.com

6) Beyond Text Compression: Evaluating Tokenizers Across Scales
Jonas F. Lotz et al
aclanthology.org/2025.acl-lon...

Beyond Text Compression: Evaluating Tokenizers Across Scales

Jonas F. Lotz, António V. Lopes, Stephan Peitz, Hendra Setiawan, Leonardo Emili. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025.

aclanthology.org

July 30, 2025 at 2:03 PM

Craig Schmidt

@craigschmidt.com

5) Enhancing Character-Level Understanding in LLMs through Token Internal Structure Learning
Zhu Xu et al
aclanthology.org/2025.acl-lon...

Enhancing Character-Level Understanding in LLMs through Token Internal Structure Learning

Zhu Xu, Zhiqiang Zhao, Zihan Zhang, Yuchi Liu, Quanwei Shen, Fei Liu, Yu Kuang, Jian He, Conglin Liu. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1:...

aclanthology.org

July 30, 2025 at 2:03 PM

Craig Schmidt

@craigschmidt.com

4) Unsupervised Morphological Tree Tokenizer
Xiang Hu et al
aclanthology.org/2025.finding...

Unsupervised Morphological Tree Tokenizer

Qingyang Zhu, Xiang Hu, Pengyu Ji, Wei Wu, Kewei Tu. Findings of the Association for Computational Linguistics: ACL 2025. 2025.

aclanthology.org

July 30, 2025 at 2:03 PM

Craig Schmidt

@craigschmidt.com

3) Splintering Nonconcatenative Languages for Better Tokenization
Yuval Pinter et al
aclanthology.org/2025.finding...

Splintering Nonconcatenative Languages for Better Tokenization

Bar Gazit, Shaltiel Shmidman, Avi Shmidman, Yuval Pinter. Findings of the Association for Computational Linguistics: ACL 2025. 2025.

aclanthology.org

July 30, 2025 at 2:03 PM

Craig Schmidt

@craigschmidt.com

2) Tokenization is Sensitive to Language Variation
Anna Wegmann et al
aclanthology.org/2025.finding...

Tokenization is Sensitive to Language Variation

Anna Wegmann, Dong Nguyen, David Jurgens. Findings of the Association for Computational Linguistics: ACL 2025. 2025.

aclanthology.org

July 30, 2025 at 2:03 PM

Craig Schmidt

@craigschmidt.com

1) Byte Latent Transformer: Patches Scale Better Than Tokens
Artidoro Pagnoni et al
aclanthology.org/2025.acl-lon...

Byte Latent Transformer: Patches Scale Better Than Tokens

Artidoro Pagnoni, Ramakanth Pasunuru, Pedro Rodriguez, John Nguyen, Benjamin Muller, Margaret Li, Chunting Zhou, Lili Yu, Jason E Weston, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Ari Holtzman, Srini...

aclanthology.org

July 30, 2025 at 2:03 PM