paper: arxiv.org/abs/2411.198...
They built a pure transformer codec (1B params) with an FSQ (finite scalar quantization) bottleneck instead of the usual RVQ (residual vector quantization) approach. At 400-700 bits per second, it produces extremely high-quality speech that gets close to the original audio.
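To make the FSQ idea and the bitrate figure concrete, here is a minimal sketch of how a finite scalar quantizer rounds each latent dimension to a small fixed grid, and how the bitrate follows from the grid size and the token rate. The level counts and the 40 tokens/s rate are illustrative assumptions chosen to land in the 400-700 bps range, not the paper's actual configuration, and the function names are hypothetical.

```python
import math


def fsq_quantize(z, levels):
    """Round each latent value to one of levels[i] integer grid points.

    Assumes odd level counts so the grid is symmetric around zero.
    """
    codes = []
    for value, n_levels in zip(z, levels):
        half = (n_levels - 1) / 2           # e.g. 5 levels -> grid -2..2
        bounded = math.tanh(value) * half   # squash into [-half, half]
        codes.append(int(round(bounded)))   # snap to the nearest grid point
    return codes


def bits_per_token(levels):
    """Each token picks one of prod(levels) codes, so it costs log2 of that."""
    return math.log2(math.prod(levels))


def bitrate_bps(levels, tokens_per_second):
    """Bitrate is bits per token times tokens emitted per second."""
    return bits_per_token(levels) * tokens_per_second


# Hypothetical configuration (an assumption, not taken from the paper).
example_levels = [5, 5, 5, 5, 5, 5]
example_rate = 40  # tokens per second

print(fsq_quantize([0.1, -3.0, 0.9, 2.5, -0.4, 0.0], example_levels))
print(round(bitrate_bps(example_levels, example_rate), 1), "bits per second")
```

Unlike RVQ, there is no learned codebook or stack of residual stages here: the codebook is implicit in the rounding grid, which is one reason FSQ is attractive as a simple bottleneck.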