Learn more → buff.ly/6xLHLk6
arxiv.org/pdf/2306.08543
agustinus.kristia.de/blog/forward...
openreview.net/pdf?id=3zKta...
- It frees the student to drop modes (the divergence contribution is 0 wherever the student places no probability mass)
- It adds optimization pressure even where the teacher distribution has no coverage (e.g. sampling noise); see the sketch below
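A minimal sketch of the two objectives, assuming the bullets refer to reverse-KL distillation (as in the MiniLLM-style setup linked above), with p the teacher and q_θ the student:

$$
\mathrm{KL}(q_\theta \,\|\, p) \;=\; \sum_x q_\theta(x)\,\log\frac{q_\theta(x)}{p(x)},
\qquad
\mathrm{KL}(p \,\|\, q_\theta) \;=\; \sum_x p(x)\,\log\frac{p(x)}{q_\theta(x)}
$$

In the reverse form, the summand vanishes wherever q_θ(x) = 0 (with the convention 0·log 0 = 0), so dropped teacher modes cost nothing; but wherever p(x) ≈ 0 while q_θ(x) > 0, the log ratio blows up, pushing student mass off regions the teacher does not cover. The forward form behaves the opposite way: it penalizes q_θ(x) → 0 anywhere the teacher has mass, forcing the student to cover every teacher mode.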
1. Flop-efficient, because there is no need for RL-style search or KD-style label generation.
2. Rich in realistic mistakes and particularities of voice
3. Dense training signal