Matt Goldrick
@mattgoldrick.bsky.social
Linguistics and cognitive science at Northwestern. Opinions are my own. he/him/his
And it should be noted that this work might help us reimagine the nature of the computations underlying acquisition -- so for 'tokenization', it isn't entirely clear what the tokens should eventually be users.umiacs.umd.edu/~nhf/papers/...
November 10, 2025 at 11:04 PM
These are fantastic, and I think there's a ton of interesting work to be done here -- because tokenization/discovery of language structure is far from trivial and definitely not 'solved' in a general sense.
November 10, 2025 at 11:02 PM
I'm very excited about these models, but I think we're a long way from being able to say we have an in-principle solution for realistic training sizes
November 10, 2025 at 8:20 PM
I think this is @glupyan.bsky.social's link: ai.meta.com/blog/textles... which includes the ZeroSpeech benchmarks. I agree that self-supervised models are really interesting (I'm using them in my own work right now), but as far as I know these require huge amounts of training data
November 10, 2025 at 8:19 PM
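As an illustration of what audio-only 'tokenization' means in this thread: a minimal sketch, assuming HuggingFace's pretrained wav2vec 2.0 checkpoint (facebook/wav2vec2-base, trained on audio alone) plus scikit-learn k-means -- the checkpoint, cluster count, and per-utterance clustering are illustrative choices, not the specific pipeline anyone in the thread is using. A self-supervised encoder maps raw audio to frame-level vectors, and clustering turns those into discrete pseudo-tokens.

```python
# A minimal, illustrative sketch (not from the thread): discretize raw
# audio into "pseudo-tokens" by encoding it with a self-supervised
# speech model and clustering the frame-level features.
import torch
from sklearn.cluster import KMeans
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

MODEL_NAME = "facebook/wav2vec2-base"  # pretrained on audio alone, no text
extractor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL_NAME)
model = Wav2Vec2Model.from_pretrained(MODEL_NAME).eval()

def audio_to_units(waveform, sample_rate=16000, n_units=50):
    """Map a 1-D waveform to a sequence of discrete unit IDs."""
    inputs = extractor(waveform, sampling_rate=sample_rate, return_tensors="pt")
    with torch.no_grad():
        # (1, T, 768) -> (T, 768): one feature vector per ~20 ms frame
        frames = model(**inputs).last_hidden_state.squeeze(0)
    # Toy per-utterance clustering; real unit-discovery pipelines fit
    # k-means once on features from a large corpus, then assign frames.
    km = KMeans(n_clusters=n_units, n_init=10).fit(frames.numpy())
    return km.labels_  # one discrete "token" per frame
```

Discrete unit sequences of roughly this kind are what ZeroSpeech-style systems are built on and evaluated against.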
My understanding, @glupyan.bsky.social (correct me if I'm wrong!), is that tokenization is very much an open research area, esp. without any access to text -- I'm not aware of any BabyLM work that examines audio-only or AV-only tokenization
November 10, 2025 at 4:59 PM
The mindset I try to use is: express *appropriate* gratitude for someone donating their time to sit down and think about your work. I can do that without wanting to vomit
October 31, 2025 at 5:06 PM