Hadi Khalaf
@hadikh.bsky.social
phd @ harvard seas, thinking about alignment, information theory, and the like
Reposted by Hadi Khalaf
AI is built to “be helpful” or “avoid harm”, but which principles should it prioritize and when? We call this alignment discretion. As Asimov's stories show: balancing such principles for AI behavior is tricky. In fact, we find that AI has its own set of priorities. (comic by @xkcd.com)🧵👇
February 19, 2025 at 9:08 PM
I feel queasy when I read LLM interpretability papers; some results seem wonderful, but I am distrustful of the methodology and interpretation
February 2, 2025 at 4:19 AM
Reward modeling + BoN seem like a poor man's way to get good alignment because PPO training is expensive. Are perfect reward signals all we need? I'm not convinced
February 1, 2025 at 6:06 AM
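For context on the BoN half of that comparison, here is a minimal sketch of best-of-N sampling against a reward model; `generate` and `reward` are hypothetical stand-ins for a policy model's sampler and a trained reward model, not any particular library's API.

```python
# Minimal sketch of best-of-N (BoN) sampling against a reward model.
# `generate(prompt)` and `reward(prompt, response)` are hypothetical
# stand-ins for a policy model's sampler and a trained reward model.

def best_of_n(prompt, generate, reward, n=16):
    """Draw n candidate responses and return the one the reward model scores highest."""
    candidates = [generate(prompt) for _ in range(n)]
    scores = [reward(prompt, c) for c in candidates]
    best = max(range(n), key=lambda i: scores[i])
    return candidates[best]
```

The appeal is that this needs no gradient updates to the policy, only extra inference plus a good reward signal, which is exactly where the "are perfect reward signals all we need?" question bites.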
The current alignment paradigm is plagued by the fact that it's a flimsy adaptation of RL. RLHF is not RL, but it could be and maybe it should be. There's something missing in the treatment of feedback-based alignment, and there are fundamental differences between it & RL that are not clear to me!
February 1, 2025 at 6:03 AM
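For concreteness, the contrast this post gestures at can be written in standard textbook form (not the author's notation): RLHF usually maximizes a learned reward over single responses with a KL anchor to a reference policy, while generic RL maximizes a discounted return over a sequential decision process.

\[
\text{RLHF:}\quad \max_{\pi}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\!\big[\, r_\phi(x, y) \,\big] \;-\; \beta\,\mathrm{KL}\big(\pi(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big)
\]
\[
\text{RL:}\quad \max_{\pi}\; \mathbb{E}_{\pi}\!\Big[\, \textstyle\sum_{t \ge 0} \gamma^{t}\, r(s_t, a_t) \,\Big]
\]

The learned reward \(r_\phi\), the single-turn (bandit-like) setup, and the KL term to \(\pi_{\mathrm{ref}}\) are the most visible departures from the generic RL objective.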
spending my sunday battling with tikz (im losing)
December 9, 2024 at 1:17 AM
true agi is when your llm stops affirming everything you say
December 6, 2024 at 3:43 AM
doing your phd at a time when it *feels* like everyone is studying the same things and you'd be facing major FOMO if you don't... is not fun
November 26, 2024 at 1:37 AM