Emily Byun
@yewonbyun.bsky.social
PhD Student in Machine Learning at CMU. yewonbyun.github.io
14/ This work will be presented as a spotlight talk today at the #COLM2025 SocialSim workshop and at NeurIPS 2025.
Paper: arxiv.org/abs/2508.06635
Code: github.com/lasilab/valid-synth-inference
October 10, 2025 at 4:12 PM
13/ I really enjoyed working on this project with the brilliant and kindest @shantanug.bsky.social and great mentors @zacharylipton.bsky.social @donskerclass.bsky.social @brwilder.bsky.social
October 10, 2025 at 4:12 PM
12/ This framework provides a foundation for easily extensible estimation methods that can safely incorporate the growing variety and quality of synthetic data sources.
October 10, 2025 at 4:12 PM
11/ At a fundamental level, this work takes a step towards understanding how synthetic data from foundation models can be used to support valid inference. As the usage and promise of FMs continue to grow, so too will the complexity of pipelines that incorporate their outputs.
October 10, 2025 at 4:12 PM
10/ Empirically, we observe large gains in estimation performance (lower MSE + tighter confidence intervals with valid coverage) across diverse computational social science tasks, with benefits most pronounced in low-label regimes.
October 10, 2025 at 4:12 PM
9/ In other words, in the worst case where synthetic data is *completely* uninformative (bad quality), including it does not hurt, at least asymptotically.
October 10, 2025 at 4:12 PM
8/ When the synthetic and real data moments are independent of each other, the variance reduces to the optimal variance based only on the real data.
October 10, 2025 at 4:12 PM
7/ Precisely: GMM measures the cross-correlations between the synthetic and real data moments and combines them so that the variance of the real data moments shrinks whenever the synthetic data moments carry information about them.
October 10, 2025 at 4:12 PM
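To make the intuition in 7/ and 8/ concrete, here is a minimal simulation sketch for the simplest possible case, estimating a mean. It is not the paper's estimator: the Gaussian data model, the names `combined_mean` and `simulate`, and the closed-form weight are all assumptions chosen for illustration. The real outcome `y` has mean 1.0, the synthetic proxy `s` has mean 0 (a separate parameter from the target), and `rho` controls how predictive the proxy is of the real data.

```python
import numpy as np

def combined_mean(y, s_paired, s_pool):
    """Estimate E[y] from n real samples plus a synthetic proxy.

    Hypothetical helper: y and s_paired are observed on the same n units,
    s_pool is an independent pool of N proxy-only samples.
    """
    n, N = len(y), len(s_pool)
    cov = np.cov(y, s_paired)  # 2x2 sample covariance of (y, s)
    # Variance-minimizing weight on the proxy correction for this linear
    # combination; it is zero in expectation when y and s are uncorrelated.
    beta = cov[0, 1] / cov[1, 1] * N / (n + N)
    return y.mean() - beta * (s_paired.mean() - s_pool.mean())

def simulate(rho, n=200, N=5000, reps=1000, seed=0):
    """Return arrays of real-only and combined estimates over many draws."""
    rng = np.random.default_rng(seed)
    naive = np.empty(reps)
    combined = np.empty(reps)
    for r in range(reps):
        s = rng.standard_normal(n)                # proxy, mean 0
        noise = rng.standard_normal(n)
        y = 1.0 + rho * s + np.sqrt(1 - rho**2) * noise  # target mean 1.0
        s_pool = rng.standard_normal(N)           # extra synthetic-only data
        naive[r] = y.mean()
        combined[r] = combined_mean(y, s, s_pool)
    return naive, combined

if __name__ == "__main__":
    for rho in (0.9, 0.0):
        naive, combined = simulate(rho)
        print(f"rho={rho}: mean={combined.mean():.3f}, "
              f"variance ratio={combined.var() / naive.var():.2f}")
```

Under these assumptions the combined estimate stays centered on the real-data mean even though the proxy has a different mean. Its variance relative to the real-only mean should be close to 1 - rho^2 * N / (n + N), so an informative proxy shrinks it substantially, while an independent proxy (rho = 0) leaves it essentially at the real-only variance.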
6/ Why and when does synthetic data help? We found that incorporating synthetic data leads to more precise estimation and tighter confidence intervals when its moments are predictive of the real data moments.
October 10, 2025 at 4:12 PM
5/ A priori, it was not obvious whether adding moments based solely on synthetic data (defined in terms of a parameter separate from the target) would benefit, or even affect, the estimation of the target parameter of the real data.
October 10, 2025 at 4:12 PM
4/ We propose a solution via a new estimator based on the generalized method of moments (GMM) that allows us to incorporate these multiple sources of information by adding moments.
October 10, 2025 at 4:12 PM
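As a rough picture of what "adding moments" can look like, consider the toy problem of estimating a mean. Everything here is an illustrative assumption, not the paper's notation: Y_i are n real labeled outcomes with target θ = E[Y], S_i are synthetic proxies on those same n units, S̃_j are N extra synthetic-only samples, and η = E[S] is a separate nuisance parameter for the synthetic data.

```latex
% Stacked sample moments: one real-data moment, two synthetic-data moments
g_n(\theta, \eta) =
\begin{pmatrix}
  \frac{1}{n} \sum_{i=1}^{n} (Y_i - \theta) \\[4pt]
  \frac{1}{n} \sum_{i=1}^{n} (S_i - \eta) \\[4pt]
  \frac{1}{N} \sum_{j=1}^{N} (\tilde{S}_j - \eta)
\end{pmatrix},
\qquad
(\hat{\theta}, \hat{\eta}) = \arg\min_{\theta, \eta} \;
  g_n(\theta, \eta)^{\top} \, \hat{W} \, g_n(\theta, \eta)
```

With three moments and two parameters the system is over-identified, and the standard efficient choice of Ŵ is the inverse of the estimated covariance of the moments. The off-diagonal entry linking the first two moments is how information from the synthetic data can reach the estimate of θ, even though θ appears only in the real-data moment.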
3/ Problem: Naively aggregating these different sources of information leads to highly biased estimates, due to differences in the underlying distributions.
October 10, 2025 at 4:12 PM
2/ In limited-label regimes, LLMs offer practitioners a cheap way to obtain imperfect labels and even generate entirely new synthetic samples.
October 10, 2025 at 4:12 PM
would love to join!
November 19, 2024 at 6:04 PM