Jason Weston
@jasonweston.bsky.social

Senior Director, Research Scientist @ Meta FAIR + Visiting Prof @ NYU.
Pretrain+SFT: NLP from Scratch (2011). Multilayer attention+position encode+LLM: MemNet (2015). Recent (2024): Self-Rewarding LLMs & more!

Our new work on continuous chain of thought:
Training Large Language Models to Reason in a Continuous Latent Space

Introduces a new paradigm for LLM reasoning: Chain of Continuous Thought (COCONUT).

Directly feed the last hidden state (a continuous thought) as the input embedding for the next token.

arxiv.org/abs/2412.06769
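
A minimal sketch of the continuous-thought loop described above, assuming a Hugging-Face-style causal LM; model, input_ids, and n_thoughts are illustrative names, not the paper's code.

import torch

def continuous_thoughts(model, input_ids, n_thoughts=4):
    # Embed the prompt once; shape (batch, seq, hidden).
    embeds = model.get_input_embeddings()(input_ids)
    for _ in range(n_thoughts):
        out = model(inputs_embeds=embeds, output_hidden_states=True)
        # The last hidden state at the final position is one "continuous thought".
        thought = out.hidden_states[-1][:, -1:, :]
        # Feed it straight back as the next input embedding,
        # skipping the decode-to-token / re-embed round trip.
        embeds = torch.cat([embeds, thought], dim=1)
    return embeds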

Analysis: AD picks a high temp for creative prompts & a low temp for fact-seeking ones, learnt automatically via training.

Our methods, AD & Latent Preference Optimization, are general & can be applied to train other hyperparams or latent features.

Excited to see how people *adapt* this research!
🧵4/4

We train on a mix of tasks:
GSM8K - requires factuality (low temp)
Stories - requires creativity (high temp)
UltraFeedback - general instruction following, requires a mix

Results: Adaptive Decoding outperforms any fixed temperature, choosing temperatures automatically via the AD layer.
🧵3/4

Recipe 👩‍🍳:
Adaptive Decoder (AD) Layer:
- Assigns a probability to each hyperparam choice (decoding temp) given the hidden state. Given the chosen temp, sample a token.

Training (Latent PO):
- Train AD by sampling params + tokens, then use a reward model to build chosen/rejected hyperparam preference pairs
🧵2/4
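
A rough sketch of one Latent PO step under my reading of this recipe: sample two temperature choices, score the completions each produces with a reward model, and push the AD layer toward the winner with a DPO-style loss on the latent choice. All names (ad_logits, beta, etc.) are assumptions, not the paper's API.

import torch
import torch.nn.functional as F

def latent_po_step(ad_logits, idx_a, idx_b, reward_a, reward_b, beta=0.1):
    # ad_logits: (batch, n_temps) scores from the AD layer for one token.
    # idx_a, idx_b: two sampled temperature indices, (batch,) long tensors.
    # reward_a, reward_b: reward-model scores of the resulting completions.
    logp = F.log_softmax(ad_logits, dim=-1)
    prefer_a = reward_a >= reward_b
    chosen = torch.where(prefer_a, idx_a, idx_b)
    rejected = torch.where(prefer_a, idx_b, idx_a)
    logp_c = logp.gather(-1, chosen.unsqueeze(-1)).squeeze(-1)
    logp_r = logp.gather(-1, rejected.unsqueeze(-1)).squeeze(-1)
    # DPO-style objective on the latent choice (reference term omitted).
    return -F.logsigmoid(beta * (logp_c - logp_r)).mean()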

🚨 Adaptive Decoding via Latent Preference Optimization 🚨
- New layer for the Transformer that selects decoding params automatically *per token*
- Learnt via a new method, Latent Preference Optimization
- Outperforms any fixed-temperature decoding, choosing creativity or factuality as needed
arxiv.org/abs/2411.09661
🧵1/4
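
For concreteness, a minimal sketch of what such a layer could look like: a small head over the hidden state scoring a fixed grid of candidate temperatures, then sampling a token at the chosen temperature. The hidden size, temperature grid, and all names here are illustrative assumptions, not the paper's implementation.

import torch
import torch.nn as nn

class AdaptiveDecoder(nn.Module):
    # Per-token hyperparameter selection: a linear head over the hidden
    # state assigns a probability to each candidate decoding temperature.
    def __init__(self, hidden_dim, temps=(0.1, 0.5, 1.0, 1.5)):
        super().__init__()
        self.register_buffer("temps", torch.tensor(temps))
        self.head = nn.Linear(hidden_dim, len(temps))

    def forward(self, hidden, logits):
        # hidden: (batch, hidden_dim); logits: (batch, vocab).
        temp_probs = self.head(hidden).softmax(dim=-1)
        idx = torch.multinomial(temp_probs, 1).squeeze(-1)  # sampled temp index
        temp = self.temps[idx].unsqueeze(-1)                # (batch, 1)
        # Sample the next token at the chosen temperature.
        token = torch.multinomial((logits / temp).softmax(dim=-1), 1)
        return token, idx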