Paper: arxiv.org/abs/2406.18009
Demo: aka.ms/e2tts
Simple: just a stack of Transformer and linear layers, with no convolutions.
Faster and better: higher audio reconstruction quality with fewer multiply-accumulate operations (MACs) than strong convolution-based baselines.