Jason Weston
@jasonweston.bsky.social
Senior Director, Research Scientist @ Meta FAIR + Visiting Prof @ NYU.
Pretrain+SFT: NLP from Scratch (2011). Multilayer attention+position encode+LLM: MemNet (2015). Recent (2024): Self-Rewarding LLMs & more!
Analysis: AD picks high temp for creative & low for fact-seeking prompts, automatically via training.
Our methods AD & Latent Pref Optimization are general & can be applied to train other hyperparams or latent features.
Excited to see how people *adapt* this research!
🧵4/4
November 22, 2024 at 1:06 PM
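A quick illustration of the role temperature plays here (toy logits, not from the paper): low temperature sharpens the next-token distribution toward the top choice, which suits fact-seeking prompts, while high temperature flattens it, which suits creative ones. A minimal sketch:

```python
import numpy as np

def softmax_with_temperature(logits, temp):
    """Rescale logits by 1/temp, then softmax."""
    z = np.asarray(logits, dtype=float) / temp
    z -= z.max()  # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

# Hypothetical next-token logits: one strong candidate plus alternatives.
logits = [4.0, 2.0, 1.0, 0.5]

low = softmax_with_temperature(logits, temp=0.3)   # sharply peaked: factual
high = softmax_with_temperature(logits, temp=1.5)  # flatter: more diverse

print(low.round(3), high.round(3))
```

At temp 0.3 nearly all probability mass sits on the top token; at temp 1.5 the alternatives get a real chance of being sampled, which is what makes generations more varied.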
We train on a mix of tasks:
GSM8K - requires factuality (low temp)
Stories - requires creativity (high temp)
UltraFeedback - general instruction following, requires a mix
Results: Adaptive Decoding outperforms any fixed temperature, automatically choosing via the AD layer.
🧵3/4
November 22, 2024 at 1:06 PM
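A toy simulation of why no single fixed temperature can win on a mixed task set (the reward functions are invented for illustration, not the paper's benchmarks): factual tasks here prefer low temp, creative tasks high temp, so per-prompt selection beats any one setting.

```python
import numpy as np

# Invented per-task reward as a function of decoding temperature:
# "math" peaks at low temp, "story" peaks at high temp.
def reward(task, temp):
    if task == "math":
        return 1.0 - abs(temp - 0.3)
    return 1.0 - abs(temp - 1.2)

tasks = ["math", "story"] * 50   # a 50/50 mixed task set
temps = [0.3, 0.7, 1.2]          # candidate fixed temperatures

# Average reward of each fixed temperature over the whole mix.
fixed = {t: np.mean([reward(task, t) for task in tasks]) for t in temps}

# Adaptive: pick the best temperature per task (what a trained AD layer
# would approximate from the prompt's hidden state).
adaptive = np.mean([max(reward(task, t) for t in temps) for task in tasks])

print(fixed, adaptive)
```

Every fixed choice trades one task type off against the other; the adaptive policy pays no such cost, which mirrors the result reported in the thread.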
Recipe 👩🍳:
Adaptive Decoder (AD) Layer:
- Assigns probability to each hyperparam choice (decoding temp) given hidden state. Given temp, sample a token.
Training (Latent PO):
- Train AD by sampling params+tokens & using a reward model to build chosen/rejected hyperparam preference pairs
🧵2/4
November 22, 2024 at 1:06 PM
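The recipe above can be sketched in code. Everything here is a hypothetical minimal version — the temperature set, the linear head, and the pairing procedure are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed discrete set of decoding temperatures the AD layer chooses among.
TEMPERATURES = [0.3, 0.7, 1.0, 1.5]

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

class AdaptiveDecoderLayer:
    """Toy AD layer: a linear head on the LM's hidden state that assigns a
    probability to each temperature choice (a sketch, not FAIR's code)."""
    def __init__(self, hidden_dim, n_choices=len(TEMPERATURES)):
        self.W = rng.normal(scale=0.1, size=(n_choices, hidden_dim))

    def probs(self, hidden_state):
        return softmax(self.W @ hidden_state)

    def sample_temperature(self, hidden_state):
        p = self.probs(hidden_state)
        i = rng.choice(len(TEMPERATURES), p=p)
        return TEMPERATURES[i], i

def sample_token(token_logits, temp):
    """Given the sampled temp, sample a token from the base LM's logits."""
    p = softmax(np.asarray(token_logits, dtype=float) / temp)
    return rng.choice(len(p), p=p)

def latent_po_pair(ad, hidden_state, reward_of_temp):
    """Sketch of Latent PO data collection (assumed procedure): sample two
    hyperparam choices, score the resulting generations with a reward
    model, and label the higher-scoring one 'chosen', the other
    'rejected'. A DPO-style loss on ad.probs would then shift probability
    toward the chosen choice."""
    _, i_a = ad.sample_temperature(hidden_state)
    _, i_b = ad.sample_temperature(hidden_state)
    if reward_of_temp(TEMPERATURES[i_a]) >= reward_of_temp(TEMPERATURES[i_b]):
        return i_a, i_b  # (chosen, rejected)
    return i_b, i_a

# Usage: one decoding step with a stand-in hidden state and token logits.
ad = AdaptiveDecoderLayer(hidden_dim=16)
hidden = rng.normal(size=16)
temp, choice = ad.sample_temperature(hidden)
token = sample_token([3.0, 1.0, 0.5], temp)
```

The key design point from the thread is that the hyperparameter choice is itself a latent variable with its own distribution, so the same preference-pair machinery used for tokens can train it.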