After switching the encoder to a pretrained ResNet18, freezing its layers, and training for 1 epoch, my model can (kind of) drive, having learned from 81k frames across 44 laps of me playing.
May 29, 2025 at 3:04 AM
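A minimal sketch of the setup described in the post above, assuming a PyTorch behavior-cloning model: a pretrained ResNet18 backbone with its weights frozen, feeding a small trainable head that predicts driving controls from a single frame. The head size and the two-dimensional output (e.g. steering + throttle) are assumptions, not the author's exact configuration.

# Hypothetical sketch: frozen pretrained ResNet18 encoder + small trainable head.
import torch
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

class DrivingPolicy(nn.Module):
    def __init__(self, n_actions: int = 2):
        super().__init__()
        backbone = resnet18(weights=ResNet18_Weights.DEFAULT)
        # Drop the ImageNet classification layer, keep the conv feature extractor.
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])
        for p in self.encoder.parameters():
            p.requires_grad = False          # freeze the pretrained layers
        self.head = nn.Sequential(           # only this part is trained
            nn.Flatten(),
            nn.Linear(512, 128),
            nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():                # encoder is frozen, skip its grads
            feats = self.encoder(frames)
        return self.head(feats)

policy = DrivingPolicy()
# Only the head's parameters are passed to the optimizer, so one epoch of
# training touches a small fraction of the network.
optimizer = torch.optim.Adam(policy.head.parameters(), lr=1e-4)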
I'm supposed to be writing a technical report, but I can't stop testing out my music LSTM (a tech demo for my approach to language modeling audio). Only 18M parameters, btw.
May 24, 2025 at 3:30 AM
Audio language modeling has typically meant training models to VQ the raw audio directly. But what if we quantized mel spectrograms instead, trained a vocoder like iSTFTNet, and then trained our AR prior on the mel-spectrogram indices? We can language model 44.1 kHz audio with a single 1k codebook.
May 3, 2025 at 8:10 PM
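A minimal sketch of the quantization step described in the post above, assuming a single 1024-entry codebook applied per mel-spectrogram frame. The shapes, the straight-through trick, and the commitment loss are generic VQ-VAE choices, not the author's exact recipe; the iSTFTNet-style vocoder and the AR prior themselves are not shown.

# Hypothetical sketch: snap mel frames to a 1k codebook, yielding token indices.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MelVQ(nn.Module):
    def __init__(self, n_mels: int = 80, codebook_size: int = 1024):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, n_mels)

    def forward(self, mel: torch.Tensor):
        # mel: (batch, time, n_mels) -- one codebook index per mel frame
        B, T, M = mel.shape
        flat = mel.reshape(-1, M)                          # (B*T, n_mels)
        dists = torch.cdist(flat, self.codebook.weight)    # (B*T, K)
        indices = dists.argmin(dim=-1).view(B, T)          # (B, T)
        quantized = self.codebook(indices)                 # (B, T, n_mels)
        # Straight-through estimator so gradients can reach an upstream encoder.
        quantized = mel + (quantized - mel).detach()
        commit_loss = F.mse_loss(mel, quantized.detach())
        return quantized, indices, commit_loss

vq = MelVQ()
mel = torch.randn(1, 200, 80)       # made-up batch of mel frames
_, token_ids, _ = vq(mel)           # (1, 200) integer tokens

The token_ids sequence is what an autoregressive prior would be trained on; a vocoder like iSTFTNet would then map the quantized mel spectrograms back to a 44.1 kHz waveform.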