Tom Kempton
@tomkempton.bsky.social
Pure mathematician working in Ergodic Theory, Fractal Geometry, and (recently) Large Language Models. Senior Lecturer (= Associate Professor) at the University of Manchester.
Haven't seen this and it sounds interesting; could you post a link to something using it? Thanks!
March 27, 2025 at 6:29 PM
Thanks, I'll take a look!
February 12, 2025 at 4:06 PM
Thanks for the reply! What I meant by confidence here (possibly the wrong word) isn't how concentrated the output probability vector is, but how close we think it is to the true next-token distribution (if such a thing existed...).
February 12, 2025 at 11:48 AM
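To separate the two notions contrasted in the post above: concentration is a property of the model's output vector alone (e.g. its entropy), while the "confidence" described here needs a reference distribution to compare against (e.g. a KL divergence). A toy numpy sketch; the "true" distribution below is invented purely for illustration, since in practice no such reference is available.

```python
import numpy as np
from scipy.special import softmax, rel_entr

logits = np.array([4.0, 1.0, 0.0, -1.0])
p_model = softmax(logits)                    # model's next-token distribution

# How concentrated the output is: depends only on p_model.
entropy = -(p_model * np.log(p_model)).sum()

# How close it is to a hypothetical true next-token distribution:
# this needs that distribution, which is invented here for illustration only.
p_true = np.array([0.7, 0.2, 0.05, 0.05])
kl = rel_entr(p_true, p_model).sum()         # KL(p_true || p_model)

print(f"entropy of model output: {entropy:.3f}")
print(f"KL(true || model):       {kl:.3f}")
```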
I'm not sure I really believe that there's no information to be gleaned though. Maybe one needs to think more about training dynamics...
February 12, 2025 at 11:44 AM
So one answer to my question, which I'd not thought about until your answer, is that while softmax is not injective on R^|V| (it's invariant under adding a constant to every logit), it is injective when you restrict it to the column space of the output embedding matrix, so there's nothing to think about here.
February 12, 2025 at 11:44 AM
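To make the point above concrete: softmax(z) = softmax(z + c·1), so softmax fails to be injective exactly along the all-ones direction; if the column space of the output embedding avoids that direction, two distinct points of the column space can never differ by a multiple of the all-ones vector, so softmax is injective there. A minimal numpy sketch, where the matrix W_U is a random stand-in for the output embedding rather than weights from any particular model:

```python
import numpy as np

def softmax(z):
    # Subtracting the max does not change the output; this is exactly the
    # shift-invariance that makes softmax non-injective on R^|V|.
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
V, d = 8, 3                      # toy vocabulary size and hidden dimension
W_U = rng.normal(size=(V, d))    # random stand-in for the output embedding matrix

h = rng.normal(size=d)           # a hidden state
logits = W_U @ h                 # logits lie in the column space of W_U

# Shift-invariance: adding c * (1,...,1) leaves the probabilities unchanged.
shifted = logits + 5.0 * np.ones(V)
print(np.allclose(softmax(logits), softmax(shifted)))   # True

# Generically the all-ones vector is NOT in the column space of W_U,
# so softmax restricted to that column space is injective.
ones = np.ones(V)
residual = ones - W_U @ np.linalg.lstsq(W_U, ones, rcond=None)[0]
print(np.linalg.norm(residual) > 1e-8)                  # True for generic W_U
```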
I'd guess X shouldn't be in this column space, otherwise there's a wasted dimension which doesn't make it to the output (although it would be interesting to see whether, if you included it, it contained interesting info).
February 12, 2025 at 11:44 AM
Presumably this is well studied; could anyone point me in the direction of references?
February 12, 2025 at 8:29 AM
Let's call a logits vector 'large' if the denominator in the softmax (the sum of exponentiated logits) is large. Might we guess that large logits vectors correspond to confident situations where the model is satisfied with the possible choices of next token (either many good options, or just one option that looks great)?
February 12, 2025 at 8:29 AM
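One way to make 'large' concrete is the log of the softmax denominator, i.e. the log-partition function logsumexp(logits). A toy sketch under that reading (the example logits are invented, not taken from any model); it shows how both of the situations described in the post, one great option or many good options, give a large denominator while having very different entropies:

```python
import numpy as np
from scipy.special import logsumexp, softmax

def denominator_size(logits):
    # Log of the softmax denominator: logsumexp(logits).
    return logsumexp(logits)

def entropy(logits):
    p = softmax(logits)
    return -(p * np.log(p)).sum()

# Toy logits vectors illustrating the two 'confident' situations and a flat one.
cases = {
    "one great option":  np.array([10.0, 0.0, 0.0, 0.0]),
    "many good options": np.array([8.0, 8.0, 8.0, 0.0]),
    "flat / uncertain":  np.zeros(4),
}

for name, z in cases.items():
    print(f"{name:18s} logsumexp: {denominator_size(z):6.2f}  "
          f"entropy: {entropy(z):5.2f}")
```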
Is it just that we initialise the network with small weights and so our prior is that this should persist?

Tips or links would be very welcome!
January 31, 2025 at 8:43 AM
Theoretically, the later layers that get skipped could permute the coordinates of the earlier activations, or multiply all the activations by -1. So I don't see any reason to expect that training a language model results in a model where naively applying the output embedding to earlier layers is a sensible thing to do.
January 31, 2025 at 8:43 AM
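For reference, the operation being questioned above is the "logit lens" style readout: apply the output embedding directly to an intermediate layer's residual stream. A minimal sketch using Hugging Face transformers; the choice of GPT-2 and the use of the final layer norm before unembedding are my assumptions, since the post doesn't name a model:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# gpt2 chosen only as an example model; the post above doesn't specify one.
tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

inputs = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

W_U = model.lm_head.weight        # output embedding (unembedding) matrix
ln_f = model.transformer.ln_f     # final layer norm, normally applied before unembedding

# Naively decode each layer's residual stream at the last position.
# The point of the post: nothing in the training objective forces these
# intermediate readouts to be meaningful, since the skipped later layers
# could in principle permute or flip the representation before the final readout.
for layer, h in enumerate(out.hidden_states):
    logits = ln_f(h[0, -1]) @ W_U.T
    top = tok.decode(logits.argmax().item())
    print(f"layer {layer:2d}: top token = {top!r}")
```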