Tom Kempton
@tomkempton.bsky.social
Pure mathematician working in Ergodic Theory, Fractal Geometry, and (recently) Large Language Models. Senior Lecturer (= Associate Professor) at the University of Manchester.
Haven't seen this and it sounds interesting; could you post a link to something using it? Thanks!
March 27, 2025 at 6:29 PM
Thanks, I'll take a look!
February 12, 2025 at 4:06 PM
Thanks for the reply! What I meant by confidence here (possibly the wrong word) isn't how concentrated the output probability vector is, but how close we think it is to the true next-token distribution (if such a thing existed...).
February 12, 2025 at 11:48 AM
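To separate the two notions contrasted in the post above: concentration is a property of the model's output vector alone (e.g. its entropy), while the "confidence" described here needs a reference distribution to compare against (e.g. a KL divergence). A toy numpy sketch; the "true" distribution below is invented purely for illustration, since in practice no such reference is available.

```python
import numpy as np
from scipy.special import softmax, rel_entr

logits = np.array([4.0, 1.0, 0.0, -1.0])
p_model = softmax(logits)                    # model's next-token distribution

# How concentrated the output is: depends only on p_model.
entropy = -(p_model * np.log(p_model)).sum()

# How close it is to a hypothetical true next-token distribution:
# this needs that distribution, which is invented here for illustration only.
p_true = np.array([0.7, 0.2, 0.05, 0.05])
kl = rel_entr(p_true, p_model).sum()         # KL(p_true || p_model)

print(f"entropy of model output: {entropy:.3f}")
print(f"KL(true || model):       {kl:.3f}")
```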
I'm not sure I really believe that there's no information to be gleaned though. Maybe one needs to think more about training dynamics...
February 12, 2025 at 11:44 AM
So one answer to my question, which I'd not thought about until your answer, is that while softmax is not injective on R^|V| (it's invariant under adding a constant to every logit), it is injective when you restrict it to the column space of the output embedding matrix, so there's nothing to think about here.
February 12, 2025 at 11:44 AM
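To make the point above concrete: softmax(z) = softmax(z + c·1), so softmax fails to be injective exactly along the all-ones direction; if the column space of the output embedding avoids that direction, two distinct points of the column space can never differ by a multiple of the all-ones vector, so softmax is injective there. A minimal numpy sketch, where the matrix W_U is a random stand-in for the output embedding rather than weights from any particular model:

```python
import numpy as np

def softmax(z):
    # Subtracting the max does not change the output; this is exactly the
    # shift-invariance that makes softmax non-injective on R^|V|.
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
V, d = 8, 3                      # toy vocabulary size and hidden dimension
W_U = rng.normal(size=(V, d))    # random stand-in for the output embedding matrix

h = rng.normal(size=d)           # a hidden state
logits = W_U @ h                 # logits lie in the column space of W_U

# Shift-invariance: adding c * (1,...,1) leaves the probabilities unchanged.
shifted = logits + 5.0 * np.ones(V)
print(np.allclose(softmax(logits), softmax(shifted)))   # True

# Generically the all-ones vector is NOT in the column space of W_U,
# so softmax restricted to that column space is injective.
ones = np.ones(V)
residual = ones - W_U @ np.linalg.lstsq(W_U, ones, rcond=None)[0]
print(np.linalg.norm(residual) > 1e-8)                  # True for generic W_U
```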
I'd guess X shouldn't be in this column space, otherwise there's a wasted dimension which doesn't make it to the output (although it would be interesting to see whether, if you included it, it contained interesting info).
February 12, 2025 at 11:44 AM
Presumably this is well studied; could anyone point me in the direction of references?
February 12, 2025 at 8:29 AM
Let's call a logits vector 'large' if the denominator in the softmax (the sum of exponentiated logits) is large. Might we guess that large logits vectors correspond to confident situations where the model is satisfied with the possible choices of next token (either many good options, or just one option that looks great)?
February 12, 2025 at 8:29 AM
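One way to make 'large' concrete is the log of the softmax denominator, i.e. the log-partition function logsumexp(logits). A toy sketch under that reading (the example logits are invented, not taken from any model); it shows how both of the situations described in the post, one great option or many good options, give a large denominator while having very different entropies:

```python
import numpy as np
from scipy.special import logsumexp, softmax

def denominator_size(logits):
    # Log of the softmax denominator: logsumexp(logits).
    return logsumexp(logits)

def entropy(logits):
    p = softmax(logits)
    return -(p * np.log(p)).sum()

# Toy logits vectors illustrating the two 'confident' situations and a flat one.
cases = {
    "one great option":  np.array([10.0, 0.0, 0.0, 0.0]),
    "many good options": np.array([8.0, 8.0, 8.0, 0.0]),
    "flat / uncertain":  np.zeros(4),
}

for name, z in cases.items():
    print(f"{name:18s} logsumexp: {denominator_size(z):6.2f}  "
          f"entropy: {entropy(z):5.2f}")
```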
Is it just that we initialise the network with small weights and so our prior is that this should persist?

Tips or links would be very welcome!
January 31, 2025 at 8:43 AM
Theoretically, the later layers that get skipped could permute the coordinates of the earlier activations, or multiply all the activations by -1. So I don't see any reason to expect that training a language model results in a model where naively applying the output embedding to earlier layers is a sensible thing to do.
January 31, 2025 at 8:43 AM
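For reference, the operation being questioned above is the "logit lens" style readout: apply the output embedding directly to an intermediate layer's residual stream. A minimal sketch using Hugging Face transformers; the choice of GPT-2 and the use of the final layer norm before unembedding are my assumptions, since the post doesn't name a model:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# gpt2 chosen only as an example model; the post above doesn't specify one.
tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

inputs = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

W_U = model.lm_head.weight        # output embedding (unembedding) matrix
ln_f = model.transformer.ln_f     # final layer norm, normally applied before unembedding

# Naively decode each layer's residual stream at the last position.
# The point of the post: nothing in the training objective forces these
# intermediate readouts to be meaningful, since the skipped later layers
# could in principle permute or flip the representation before the final readout.
for layer, h in enumerate(out.hidden_states):
    logits = ln_f(h[0, -1]) @ W_U.T
    top = tok.decode(logits.argmax().item())
    print(f"layer {layer:2d}: top token = {top!r}")
```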