Marco
@mcognetta.bsky.social
Language and keyboard stuff at Google + PhD student at Tokyo Institute of Technology.
I like computers and Korean and computers-and-Korean and high school CS education.
Georgia Tech → 연세대학교 → 東京工業大学.
https://theoreticallygoodwithcomputers.com/
Label Smoothing
Regularization for Classification Models
leimao.github.io
November 10, 2025 at 6:32 PM
If there was a token that was _never_ observed in training, it would always just get this tiny bit of loss assigned to it. So if there was a group of them, they would all sort of drift at the same rate.
November 10, 2025 at 6:32 PM
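A tiny NumPy sketch of that effect (my own toy numbers, not from the thread): every class that is never the correct label gets the same small target mass, so it picks up the same loss term and the same gradient at every step.

import numpy as np

V, eps, correct = 8, 0.01, 0          # toy vocab size, smoothing mass, correct class (assumed)
target = np.full(V, eps / (V - 1))    # smoothed target: eps/(V-1) on every "other" class
target[correct] = 1.0 - eps

logits = np.zeros(V)                  # pretend the model is maximally undecided
probs = np.exp(logits) / np.exp(logits).sum()

loss_terms = -target * np.log(probs)  # per-class contribution to the cross-entropy loss
grad = probs - target                 # gradient of the loss w.r.t. the logits

print(loss_terms[1:])                 # the same tiny loss term for every never-seen class
print(grad[1:])                       # the same gradient, so they all drift in lockstep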
The goal is to slightly "smooth" the target distribution: instead of probability 1 for the correct answer and 0 for everything else, it's like probability 0.99 for the correct one and 0.01/(num_classes - 1) for everything else.
November 10, 2025 at 6:32 PM
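A minimal sketch of that target construction (NumPy; the 0.99 and 0.01 are the numbers from the post, everything else is illustrative):

import numpy as np

def smooth_targets(correct_class, num_classes, eps=0.01):
    # Probability 1 - eps for the correct answer, eps/(num_classes - 1) for everything else.
    target = np.full(num_classes, eps / (num_classes - 1))
    target[correct_class] = 1.0 - eps
    return target

print(smooth_targets(correct_class=2, num_classes=5))
# roughly: [0.0025 0.0025 0.99 0.0025 0.0025]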
Ah, in multiclass classification (basically what you are doing with the final softmax layer and predicting the last token), the output distribution can be really "sharp" (it makes the model overconfident and brittle).
A trick to mitigate this is to add a little bit of weight to all other classes.
November 10, 2025 at 6:32 PM
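For what it's worth (my addition, not from the thread): PyTorch's built-in cross-entropy loss exposes exactly this trick through its label_smoothing argument, so you rarely need to build the smoothed targets by hand.

import torch
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss(label_smoothing=0.1)  # spread 0.1 of the target mass over the other classes
logits = torch.randn(4, 10)                         # (batch, num_classes), dummy values
labels = torch.tensor([1, 0, 3, 7])                 # dummy correct classes
print(loss_fn(logits, labels))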
Could it be from label smoothing? This would assign the same loss to all unseen tokens at all steps.
November 10, 2025 at 6:09 PM
Thanks to this site.
Tecendil
The most accurate and up to date Tengwar transcriber
www.tecendil.com
November 8, 2025 at 8:04 AM
Source: www.linkedin.com/in/stephen-w...
www.linkedin.com
November 7, 2025 at 7:28 PM
It's worse than this: the author has an undergraduate degree in mathematics!
November 7, 2025 at 7:27 PM
Temperature softmax is really quite cool (ha!). You might also like Gumbel softmax (which has a temperature analogue).
TLDR: you can sample from what _would have been_ the probability distribution produced by softmax by just adding this weird random variable to the logits and selecting the max.
November 7, 2025 at 12:39 AM
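A quick NumPy sketch of the Gumbel-max trick being described (toy logits of my own): add independent Gumbel(0, 1) noise to the logits and take the argmax, and that argmax is distributed exactly like a sample from softmax(logits).

import numpy as np

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.0])
softmax = np.exp(logits) / np.exp(logits).sum()

def gumbel_max_sample(logits):
    u = rng.uniform(size=logits.shape)
    g = -np.log(-np.log(u))           # the "weird random variable": Gumbel(0, 1) noise
    return np.argmax(logits + g)      # no normalization, no explicit sampling step

draws = np.array([gumbel_max_sample(logits) for _ in range(100_000)])
print(softmax)                                       # ~[0.665 0.245 0.090]
print(np.bincount(draws, minlength=3) / len(draws))  # empirical frequencies match closely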
The flattening is because lim_{T→∞} c/T = 0 for any fixed logit c, so as the temperature grows, the divided logits all get pushed toward 0, which means they all get converted to e^0 = 1, and the softmax calculation then gives 1/num_classes for each output class (vocabulary tokens, in your example).
November 7, 2025 at 12:39 AM
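A quick numerical check of that limit (my own toy logits): as the temperature T grows, softmax(logits / T) approaches the uniform distribution 1/num_classes.

import numpy as np

def softmax(z):
    z = z - z.max()                   # for numerical stability
    return np.exp(z) / np.exp(z).sum()

logits = np.array([5.0, 2.0, 0.0, -1.0])
for T in (1, 10, 100, 10_000):
    print(T, softmax(logits / T).round(3))
# As T grows, every entry approaches 1/4 = 0.25.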
This is a really fun problem actually. Given two strings x and y, what is the smallest DFA that accepts x but rejects y?
cs.uwaterloo.ca/~shallit/Tal...
Remarks on separating words
The separating words problem asks for the size of the smallest DFA needed to distinguish between two words of length <= n (by accepting one and rejecting the other). In this paper we survey what is kn...
arxiv.org
November 7, 2025 at 12:25 AM
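A toy sketch of the easiest case (my own illustration, not from the paper): if the two strings have different lengths, then "count the length modulo m" already separates them for some small m, and that counter is an m-state DFA.

def separating_modulus(x: str, y: str) -> int:
    # Smallest m such that an m-state "length mod m" DFA accepts x but rejects y
    # (accepting states = those congruent to len(x) mod m). Only handles len(x) != len(y).
    assert len(x) != len(y), "this sketch only covers the different-length case"
    m = 2
    while len(x) % m == len(y) % m:
        m += 1
    return m

print(separating_modulus("abba", "abbab"))  # -> 2: even length vs. odd length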
It's not that I didn't want to study the things I wrote about, it's just that at the time of the SoP, those were what I thought I'd have the most fun working on.
But when you hit your first idea, you should just run with it even if it's not in your "core focus", cause maybe it will become that.
November 7, 2025 at 12:20 AM
For my PhD, I wrote about CJK tokenization, federated learning, and neural language model interpretability via formal language theory.
I ended up doing a lot on CJK, but my thesis is about formal aspects of tokenization.
November 7, 2025 at 12:20 AM
I wrote about two really specific problems for my master's: the string separability problem and high-quality software implementations of automata operations.
I ended up writing my thesis on probabilistic automata algorithms.
November 7, 2025 at 12:20 AM
Obviously this is an exaggeration; I really did get the main research areas right (i.e., automata for my master's and tokenization for my PhD), but the specifics were WAY off.
It just happens that way. You find something cool to work on and dig really deep and voilà, you are a researcher.
November 7, 2025 at 12:20 AM