Senior Research Scientist, Google | PhD, Princeton University
TLDR: Stacking, i.e., growing model depth gradually, not only improves training efficiency (if done right) but also significantly improves downstream tasks that require *reasoning*, at similar perplexity. 1/n
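To make "growing model depth gradually" concrete, here is a minimal sketch of the stacking idea: train a shallow model, then periodically double its depth by copying the trained layers on top of themselves and continuing training. The layer counts, growth schedule, and deepcopy-based initialization here are illustrative assumptions, not the exact recipe from the work this thread describes.

```python
# Minimal stacking sketch (assumed details: depth schedule and
# copy-based growth are illustrative, not the authors' exact method).
import copy
import torch.nn as nn

class StackableLM(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.layers = nn.ModuleList(
            [copy.deepcopy(layer) for _ in range(n_layers)]
        )

    def grow(self):
        # Double the depth by stacking a copy of the current layers on top,
        # so the new upper half starts from the trained lower half's weights.
        self.layers.extend([copy.deepcopy(l) for l in self.layers])

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

model = StackableLM(n_layers=2)
# ... train for a while at depth 2 ...
model.grow()  # now depth 4; continue training, repeat until target depth
```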