bergelsonlab
@bergelsonlab.bsky.social
Official account for the Bergelson Lab at Harvard, sporadically maintained by the PI:).
Just a lab, trying to figure out how babies learn language, somehow caught in the crosshairs of gov't admin battles.
@glupyan.bsky.social link's broken, but i'm thinking of e.g. www.zerospeech.com, where less text-documented languages are akin to babies. but e.g. training models on CHILDES audio vs. text is a totally differently successful enterprise, so it is in principle a roadblock for now, i think?
The Zero Resource Speech Benchmark (series)
www.zerospeech.com
November 10, 2025 at 7:25 PM
are you separating tokenizing from segmenting? hard for whom? bc getting words from raw audio (or sign) is still pretty rough going for our best ASR systems in anything approaching naturalistic cases (& ofc it takes babies months for phonotactics, up to years for harder, rarer morpho stuff)
November 10, 2025 at 4:41 PM
stop biasing the sample elena
November 10, 2025 at 4:30 PM
I'll be very curious to look back in a decade at this scientific moment. And despite my grumps above, I do think a lot of really interesting insights will one day come out of this chapter of cognitive science. #CogSciSky
November 10, 2025 at 4:30 PM
3) a more meta-point. i was a bit surprised not to hear mention of the 'costs' of working w/LLMs. everyone knows fMRI is expensive so let's be choosy in how we scan, but all these (environmentally crushing & ethically fraught) LLMs are still totally open season. we're not 'paying'...yet.
November 10, 2025 at 4:30 PM
2) it feels circular to take the products of the human linguistic system & ask if its structure could be learned w/o it. The model vs. human link-up feels very evocative of @davidpoeppel.bsky.social's points about aligning the "parts lists" of cognition vs. neurobiology, but subbing LLMs for neuro 3/4
November 10, 2025 at 4:30 PM
1) 'baby-like' LMs are just capped at a smaller # of words (e.g. 10 million in #BabyLM).
But starting w/tokenized text solves a huge part of the problem: figuring out where the words begin/end in the first place (over time). Baby input doesn't come pre-chewed. (cf. zero-resource folks, Dupoux etc.) 2/4
November 10, 2025 at 4:30 PM
*wept 😭
October 27, 2025 at 12:08 PM