@danielmisrael.bsky.social
Please check out the rest of the paper! We also propose: how MARIA can be used for test-time scaling, how to initialize MARIA weights for efficient training, how MARIA representations differ, and more…

arxiv.org/abs/2502.06901

Thanks to my advisors Aditya Grover and @guyvdb.bsky.social
Enabling Autoregressive Models to Fill In Masked Tokens
Historically, LLMs have been trained using either autoregressive (AR) or masked language modeling (MLM) objectives, with AR models gaining dominance in recent years. However, AR models are inherently ...
February 14, 2025 at 12:28 AM
MARIA 1B achieves the best throughput, and MARIA 7B achieves throughput similar to DiffuLlama but, as noted above, better samples. Here we see that ModernBERT, despite being much smaller, does not scale well for masked infilling because it cannot use a KV cache.
February 14, 2025 at 12:28 AM
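A rough way to see the scaling gap: with a KV cache, an AR decoder only pays for one new query position per filled token, whereas an encoder-only MLM like ModernBERT must re-encode the entire sequence at every step. A minimal back-of-the-envelope sketch, assuming a simple quadratic attention-cost model (illustrative numbers, not measurements from the paper):

```python
def attention_cost(seq_len: int) -> int:
    """Toy cost model: attention work ~ query_positions * key_positions."""
    return seq_len * seq_len

def infill_cost_ar_kv_cache(seq_len: int, num_masked: int) -> int:
    # Prefill once over the full sequence, then each filled token is a
    # single new query attending to the cached keys/values.
    cost = attention_cost(seq_len)
    for step in range(num_masked):
        cost += seq_len + step  # one query position against all cached positions
    return cost

def infill_cost_mlm_no_cache(seq_len: int, num_masked: int) -> int:
    # An encoder-only MLM has no KV cache: each newly filled token
    # triggers a full re-encode of the whole sequence.
    return num_masked * attention_cost(seq_len)

if __name__ == "__main__":
    n, masked = 1024, 512  # e.g. 50% of tokens masked
    print("AR + KV cache :", infill_cost_ar_kv_cache(n, masked))
    print("MLM, no cache :", infill_cost_mlm_no_cache(n, masked))
```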
We perform infilling on downstream data with 50 percent of words masked. Using GPT-4o-mini as a judge, we compute Elo scores for each model. MARIA 7B and 1B have the highest Elo ratings under the Bradley-Terry model.
February 14, 2025 at 12:28 AM
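For context, Bradley-Terry ratings can be fit from the judge's pairwise verdicts with the standard MM updates, then rescaled to an Elo-like scale. The sketch below is generic, with a made-up win matrix and an illustrative rescaling; it is not the paper's evaluation code and the numbers are not real results:

```python
import numpy as np

def fit_bradley_terry(wins: np.ndarray, iters: int = 200) -> np.ndarray:
    """Fit Bradley-Terry strengths via MM updates.

    wins[i, j] = number of pairwise comparisons model i won against model j.
    Returns ratings mapped to an Elo-like scale (anchored near 1000).
    """
    n = wins.shape[0]
    games = wins + wins.T            # total comparisons per pair
    total_wins = wins.sum(axis=1)
    p = np.ones(n)
    for _ in range(iters):
        denom = np.zeros(n)
        for i in range(n):
            for j in range(n):
                if i != j and games[i, j] > 0:
                    denom[i] += games[i, j] / (p[i] + p[j])
        p = total_wins / denom
        p /= p.sum()                 # fix the overall scale each iteration
    return 400.0 * np.log10(p / p.mean()) + 1000.0

# Illustrative win matrix over four hypothetical systems (not real results).
models = ["MARIA 7B", "MARIA 1B", "DiffuLlama", "ModernBERT-AR"]
wins = np.array([
    [0, 60, 70, 80],
    [40, 0, 65, 75],
    [30, 35, 0, 60],
    [20, 25, 40, 0],
])
for name, rating in zip(models, fit_bradley_terry(wins)):
    print(f"{name:15s} {rating:7.1f}")
```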
MARIA achieves far better perplexity on downstream masked-infilling test sets than either using ModernBERT autoregressively or discrete diffusion models. Relative to parameter count, MARIA is the most effective way to scale models for masked token infilling.
February 14, 2025 at 12:28 AM
We can get the best of both worlds with MARIA: train a linear decoder to combine the hidden states of an AR model and an MLM. This enables AR masked infilling with the advantages of the more scalable AR architecture, such as KV-cached inference. We combine OLMo and ModernBERT.
February 14, 2025 at 12:28 AM
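Concretely, a MARIA-style decoder is just a linear head over the per-token hidden states of the two backbones, concatenated. A minimal PyTorch sketch with assumed hidden sizes and placeholder names, not the released implementation:

```python
import torch
import torch.nn as nn

class MariaStyleDecoder(nn.Module):
    """Linear decoder over concatenated AR (e.g. OLMo) and MLM (e.g. ModernBERT) states."""

    def __init__(self, ar_dim: int, mlm_dim: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(ar_dim + mlm_dim, vocab_size)

    def forward(self, ar_hidden: torch.Tensor, mlm_hidden: torch.Tensor) -> torch.Tensor:
        # ar_hidden:  (batch, seq, ar_dim)   from the causal LM (KV-cacheable)
        # mlm_hidden: (batch, seq, mlm_dim)  from the bidirectional encoder
        fused = torch.cat([ar_hidden, mlm_hidden], dim=-1)
        return self.proj(fused)              # (batch, seq, vocab_size) logits

# Toy usage with random tensors standing in for OLMo / ModernBERT outputs.
decoder = MariaStyleDecoder(ar_dim=4096, mlm_dim=1024, vocab_size=50304)
ar_h = torch.randn(2, 16, 4096)
mlm_h = torch.randn(2, 16, 1024)
print(decoder(ar_h, mlm_h).shape)  # torch.Size([2, 16, 50304])
```

The intent, per the thread, is that the MLM supplies bidirectional context over the masked input while the AR side keeps KV-cached left-to-right decoding; the linear head fuses the two streams into token logits.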
Autoregressive (AR) LMs are more compute-efficient to train than masked LMs (MLMs), which compute a loss on only a fixed fraction of tokens (e.g. 30%) instead of 100% as in AR training. Unlike MLMs, AR models can also use a KV cache at inference time, but they cannot infill masked tokens.
February 14, 2025 at 12:28 AM
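The training-efficiency point comes down to how many positions receive a loss signal per forward pass. A minimal sketch of the two objectives, where the mask ratio, shapes, and random tensors are illustrative stand-ins for real model outputs:

```python
import torch
import torch.nn.functional as F

vocab, batch, seq = 100, 4, 32
tokens = torch.randint(0, vocab, (batch, seq))
logits = torch.randn(batch, seq, vocab)  # stand-in for model outputs

# AR objective: every position predicts the next token -> loss on ~100% of tokens.
ar_loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab), tokens[:, 1:].reshape(-1)
)

# MLM objective: only a fixed fraction (e.g. 30%) of positions are masked,
# and the loss is computed on those positions alone.
mask = torch.rand(batch, seq) < 0.3
mlm_loss = F.cross_entropy(logits[mask], tokens[mask])

print(f"supervised tokens per batch: AR={batch * (seq - 1)}, MLM={int(mask.sum())}")
```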
Interesting. I always felt the reviews should be even more independent so that the aggregate score has lower variance
January 3, 2025 at 9:21 PM