Sergei Vassilvitskii
@vsergei.bsky.social
Algorithms, predictions, privacy.
https://theory.stanford.edu/~sergei/
Synthetic data is all the rage in LLM training, but why does it work? In arxiv.org/abs/2502.08924 we show how to analyze this question through the lens of boosting. Unlike in classical boosting, however, the assumptions on the data and on the learning method are inverted.
Escaping Collapse: The Strength of Weak Data for Large Language Model Training
Synthetically-generated data plays an increasingly larger role in training large language models. However, while synthetic data has been found to be useful, studies have also shown that without proper...
arxiv.org
February 14, 2025 at 1:48 PM
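A minimal toy sketch of the boosting-flavored intuition (not the paper's construction; all names, rates, and the update rule are illustrative assumptions): each round the current model generates synthetic answers, a weak curator keeps correct answers only slightly more often than incorrect ones, and retraining moves the model toward the curated mix. Even that weak preference is enough for accuracy to ratchet upward instead of collapsing.

```python
# Toy simulation of iterative training on weakly curated synthetic data.
# Hypothetical parameters and dynamics; not the algorithm from the paper.
import random

random.seed(0)

ROUNDS = 15
SAMPLES_PER_ROUND = 5000
P_KEEP_CORRECT = 0.55     # weak curator: barely prefers correct answers
P_KEEP_INCORRECT = 0.45   # ...to incorrect ones

accuracy = 0.30           # initial model accuracy on the task of interest

for rnd in range(1, ROUNDS + 1):
    kept_correct = kept_total = 0
    for _ in range(SAMPLES_PER_ROUND):
        correct = random.random() < accuracy            # model generates an answer
        keep_prob = P_KEEP_CORRECT if correct else P_KEEP_INCORRECT
        if random.random() < keep_prob:                 # weak curation step
            kept_total += 1
            kept_correct += correct
    if kept_total == 0:
        continue
    # Idealized retraining: the new model matches the quality of its curated data,
    # which exceeds the old accuracy whenever the curator is (even weakly) informative.
    accuracy = kept_correct / kept_total
    print(f"round {rnd:2d}: model accuracy {accuracy:.3f}")
```

With P_KEEP_CORRECT equal to P_KEEP_INCORRECT the curated mix carries no signal and accuracy stalls; any gap between the two acts like a weak-learner advantage, and iterating the loop amplifies it, which is the boosting analogy in spirit.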