Marco
@mcognetta.bsky.social
Language and keyboard stuff at Google + PhD student at Tokyo Institute of Technology.
I like computers and Korean and computers-and-Korean and high school CS education.
Georgia Tech → 연세대학교 → 東京工業大学.
https://theoreticallygoodwithcomputers.com/
Label Smoothing
Regularization for Classification Models
leimao.github.io
November 10, 2025 at 6:32 PM
If there was a token that was _never_ observed in training, it would always just get this tiny bit of loss assigned to it. So if there was a group of them, they would all sort of drift at the same rate.
November 10, 2025 at 6:32 PM
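A tiny NumPy sketch of that effect (my own toy numbers, not from the thread): every class that is never the correct label gets the same small target mass, so it picks up the same loss term and the same gradient at every step.

import numpy as np

V, eps, correct = 8, 0.01, 0          # toy vocab size, smoothing mass, correct class (assumed)
target = np.full(V, eps / (V - 1))    # smoothed target: eps/(V-1) on every "other" class
target[correct] = 1.0 - eps

logits = np.zeros(V)                  # pretend the model is maximally undecided
probs = np.exp(logits) / np.exp(logits).sum()

loss_terms = -target * np.log(probs)  # per-class contribution to the cross-entropy loss
grad = probs - target                 # gradient of the loss w.r.t. the logits

print(loss_terms[1:])                 # the same tiny loss term for every never-seen class
print(grad[1:])                       # the same gradient, so they all drift in lockstep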
The goal is to slightly "smooth" the target distribution: instead of probability 1 for the correct answer and 0 for everything else, it's like probability 0.99 for the correct one and 0.01/(num_classes - 1) for everything else.
November 10, 2025 at 6:32 PM
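A minimal sketch of that target construction (NumPy; the 0.99 and 0.01 are the numbers from the post, everything else is illustrative):

import numpy as np

def smooth_targets(correct_class, num_classes, eps=0.01):
    # Probability 1 - eps for the correct answer, eps/(num_classes - 1) for everything else.
    target = np.full(num_classes, eps / (num_classes - 1))
    target[correct_class] = 1.0 - eps
    return target

print(smooth_targets(correct_class=2, num_classes=5))
# roughly: [0.0025 0.0025 0.99 0.0025 0.0025]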
Ah, in multiclass classification (basically what you are doing with the final softmax layer and predicting the last token), the output distribution can be really "sharp" (it makes the model overconfident and brittle).
A trick to mitigate this is to add a little bit of weight to all other classes.
November 10, 2025 at 6:32 PM
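For what it's worth (my addition, not from the thread): PyTorch's built-in cross-entropy loss exposes exactly this trick through its label_smoothing argument, so you rarely need to build the smoothed targets by hand.

import torch
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss(label_smoothing=0.1)  # spread 0.1 of the target mass over the other classes
logits = torch.randn(4, 10)                         # (batch, num_classes), dummy values
labels = torch.tensor([1, 0, 3, 7])                 # dummy correct classes
print(loss_fn(logits, labels))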
Could it be from label smoothing? This would assign the same loss to all unseen tokens at all steps.
November 10, 2025 at 6:09 PM
Thanks to this site.
Tecendil
The most accurate and up to date Tengwar transcriber
www.tecendil.com
November 8, 2025 at 8:04 AM
Source: www.linkedin.com/in/stephen-w...
www.linkedin.com
November 7, 2025 at 7:28 PM
It's worse than this: the author has an undergraduate degree in mathematics!
November 7, 2025 at 7:27 PM
Temperature softmax is really quite cool (ha!). You might also like Gumbel softmax (which has a temperature analogue).
TLDR: you can sample from what _would have been_ the probability distribution produced by softmax by just adding this weird random variable to the logits and selecting the max.
November 7, 2025 at 12:39 AM
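A quick NumPy sketch of the Gumbel-max trick being described (toy logits of my own): add independent Gumbel(0, 1) noise to the logits and take the argmax, and that argmax is distributed exactly like a sample from softmax(logits).

import numpy as np

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.0])
softmax = np.exp(logits) / np.exp(logits).sum()

def gumbel_max_sample(logits):
    u = rng.uniform(size=logits.shape)
    g = -np.log(-np.log(u))           # the "weird random variable": Gumbel(0, 1) noise
    return np.argmax(logits + g)      # no normalization, no explicit sampling step

draws = np.array([gumbel_max_sample(logits) for _ in range(100_000)])
print(softmax)                                       # ~[0.665 0.245 0.090]
print(np.bincount(draws, minlength=3) / len(draws))  # empirical frequencies match closely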
The flattening is because lim_{T→∞} c/T = 0 for any fixed logit c, so as the temperature grows, the divided logits all get pushed toward 0, which means they all get converted to e^0 = 1, and the softmax calculation then gives 1/num_classes for each output class (vocabulary tokens, in your example).
November 7, 2025 at 12:39 AM
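A quick numerical check of that limit (my own toy logits): as the temperature T grows, softmax(logits / T) approaches the uniform distribution 1/num_classes.

import numpy as np

def softmax(z):
    z = z - z.max()                   # for numerical stability
    return np.exp(z) / np.exp(z).sum()

logits = np.array([5.0, 2.0, 0.0, -1.0])
for T in (1, 10, 100, 10_000):
    print(T, softmax(logits / T).round(3))
# As T grows, every entry approaches 1/4 = 0.25.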
This is a really fun problem actually. Given two strings x and y, what is the smallest DFA that accepts x but rejects y?
cs.uwaterloo.ca/~shallit/Tal...
Remarks on separating words
The separating words problem asks for the size of the smallest DFA needed to distinguish between two words of length <= n (by accepting one and rejecting the other). In this paper we survey what is kn...
arxiv.org
November 7, 2025 at 12:25 AM
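A toy sketch of the easiest case (my own illustration, not from the paper): if the two strings have different lengths, then "count the length modulo m" already separates them for some small m, and that counter is an m-state DFA.

def separating_modulus(x: str, y: str) -> int:
    # Smallest m such that an m-state "length mod m" DFA accepts x but rejects y
    # (accepting states = those congruent to len(x) mod m). Only handles len(x) != len(y).
    assert len(x) != len(y), "this sketch only covers the different-length case"
    m = 2
    while len(x) % m == len(y) % m:
        m += 1
    return m

print(separating_modulus("abba", "abbab"))  # -> 2: even length vs. odd length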
It's not that I didn't want to study the things I wrote about, it's just that at the time of the SoP, those were what I thought I'd have the most fun working on.
But when you hit your first idea, you should just run with it even if it's not in your "core focus", cause maybe it will become that.
November 7, 2025 at 12:20 AM
For my PhD, I wrote about CJK tokenization, federated learning, and neural language model interpretability via formal language theory.
I ended up doing a lot on CJK, but my thesis is about formal aspects of tokenization.
November 7, 2025 at 12:20 AM
I wrote about two really specific problems for my master's: the string separability problem and high-quality software implementations of automata operations.
I ended up writing my thesis on probabilistic automata algorithms.
November 7, 2025 at 12:20 AM
Obviously this is an exaggeration; I really did get the main research areas right (i.e., automata for my master's and tokenization for my PhD), but the specifics were WAY off.
It just happens that way. You find something cool to work on and dig really deep and voilà, you are a researcher.
November 7, 2025 at 12:20 AM