Building a Natural Science of Intelligence 🧠🤖
Prev: ICoN Postdoctoral Fellow @MIT, PhD @Stanford NeuroAILab
Personal Website: https://cs.cmu.edu/~anayebi
Slides: www.cs.cmu.edu/~mgormley/co...
Full course info: bsky.app/profile/anay...
We show that each of these amounts to fine-tuning a different aspect of the Transformer.
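For intuition, here's a minimal PyTorch-style sketch (illustrative only, not the lecture's actual code) of what "fine-tuning a different aspect of the Transformer" can look like: freeze every parameter, then unfreeze just one component, e.g. the attention projections, the MLP blocks, or the layer norms. The name-matching rules below are assumptions made for this sketch.

```python
import torch.nn as nn
from torch.nn import TransformerEncoder, TransformerEncoderLayer

# Minimal sketch (illustrative, not the course's exact code): fine-tune only one
# "aspect" of a Transformer by freezing all parameters and unfreezing a subset.
def finetune_only(model: nn.Module, aspect: str):
    for p in model.parameters():
        p.requires_grad = False
    for name, p in model.named_parameters():
        # Which parameter-name substrings count as which "aspect" is an assumption here.
        if aspect == "attention" and "self_attn" in name:
            p.requires_grad = True
        elif aspect == "mlp" and ("linear1" in name or "linear2" in name):
            p.requires_grad = True
        elif aspect == "norm" and "norm" in name:
            p.requires_grad = True

layer = TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
model = TransformerEncoder(layer, num_layers=4)
finetune_only(model, "attention")  # adapt only the attention projections
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable params: {trainable}")
```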
In my recent @cmurobotics.bsky.social seminar talk, “Using Embodied Agents to Reverse-Engineer Natural Intelligence”,
We also explain the historical throughline to some of these ideas, inspired by Nobel-prize-winning observations in neuroscience!
This lets us study the *intrinsic* complexity of alignment, separate from specific modeling choices.
(E.g. bounded rationality, memory, and theory of mind)
Even mild noise or imperfect memory can *exponentially* increase alignment costs—unless protocols exploit structure.
Alignment can become intractable when the task state space D gets too large. Our lower bounds show that costs always grow at least as MN^2, where M is the number of tasks and N is the number of agents; for many protocols, costs additionally scale with D, growing as MN^2 D.
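As a back-of-the-envelope illustration of that gap (toy numbers, not figures from the paper), here is how the D-independent lower bound and a D-dependent protocol cost diverge as the state space grows:

```python
# Toy illustration (numbers are made up, not from the paper): compare the
# D-independent lower bound ~ M*N^2 with protocols whose cost scales as M*N^2*D.
M, N = 10, 100          # hypothetical number of tasks and agents
for D in (10, 10**3, 10**6):
    lower_bound = M * N**2          # holds for every protocol
    d_dependent = M * N**2 * D      # cost of protocols that don't exploit structure
    print(f"D={D:>9,}: lower bound {lower_bound:,}  vs  D-dependent cost {d_dependent:,}")
```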
Even unbounded agents can’t align efficiently if they must encode an exponentially large or high-entropy set of human values.
Safety implication: Focus on objective compression, delegation, or progressive disclosure, not one-shot full specification
– Checked in randomized polytime
– Audited with zero-knowledge proofs
– Verified using differential privacy
This makes formal auditing compatible with user privacy and proprietary weights.
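To give a rough feel for the "checked in randomized polytime" piece (my own toy sketch, not the paper's protocol; the zero-knowledge and differential-privacy layers would wrap a check like this), here is a Monte Carlo audit whose false-pass probability shrinks exponentially in the number of samples:

```python
import random

# Toy sketch (not the paper's actual audit protocol): a Monte Carlo property check.
# If at least an eps-fraction of inputs violate the property, sampling k inputs
# misses every violation with probability at most (1 - eps)^k.
def randomized_audit(model, property_holds, sample_input, k=200):
    for _ in range(k):
        x = sample_input(random.random())
        if not property_holds(model, x):
            return False  # found a concrete violation
    return True  # passes; false-pass prob <= (1 - eps)^k if violation rate >= eps

# Hypothetical usage: audit that a toy "model" never outputs a negative score.
model = lambda x: max(0.0, 2.0 * x - 0.5)
passed = randomized_audit(model,
                          property_holds=lambda m, x: m(x) >= 0.0,
                          sample_input=lambda u: u)
print("audit passed:", passed)
```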
We prove:
❌ In general, verifying corrigibility after arbitrary modification is undecidable—as hard as the halting problem.
So we carve out a decidable island.
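For intuition, here is the flavor of that reduction in schematic form (my paraphrase; the paper's construction and definitions are what actually carry the proof): given any program P, define a modified agent A_P that behaves corrigibly unless P halts.

```latex
A_P(\mathrm{obs}) \;=\;
\begin{cases}
\text{ignore the shutdown command} & \text{if } P \text{ has halted by the current step},\\
\text{comply with the shutdown command} & \text{otherwise.}
\end{cases}
```

A_P is corrigible on every trajectory iff P never halts, so a total decision procedure for corrigibility of arbitrarily modified agents would decide the halting problem.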
– Off-switch behavior extends across time
– Spawned agents inherit corrigibility
– Gradual loss of control is modeled explicitly
We prove multi-step corrigibility and net benefit still hold under learned approximations.
As long as approximation errors are bounded, so are safety violations—and human net benefit (per Carey & @tom4everitt.bsky.social) is still guaranteed.
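Schematically, the guarantee has the following shape (notation here is mine, not the paper's; ε bounds the approximation error and δ, g are error-dependent slack terms):

```latex
\|\hat{U} - U\|_\infty \le \epsilon
\;\Longrightarrow\;
\Pr[\text{safety violation}] \le \delta(\epsilon)
\quad\text{and}\quad
\mathbb{E}[\text{human net benefit}] \ge B^{*} - g(\epsilon),
\qquad \delta(\epsilon),\, g(\epsilon) \to 0 \ \text{as } \epsilon \to 0.
```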
Instead, we define *five separate utility heads*:
- Obedience
- Switch-access preservation
- Truthfulness
- Low-impact / reversibility
- Task reward
Combined *lexicographically*. Safety dominates.
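A minimal sketch of what the lexicographic combination looks like in code (head names and ordering from this post; everything else is illustrative, not the paper's implementation): candidate actions are compared head by head in priority order, so task reward only breaks ties that all of the safety heads leave open.

```python
from typing import Callable, Dict, List

# Minimal sketch (illustrative, not the paper's implementation) of lexicographic
# combination of the five utility heads: safety heads strictly dominate task reward.
HEAD_ORDER = ["obedience", "switch_access", "truthfulness", "low_impact", "task_reward"]

def lexicographic_score(heads: Dict[str, float]) -> tuple:
    # Comparing these tuples with > compares heads in priority order.
    return tuple(heads[name] for name in HEAD_ORDER)

def choose_action(actions: List[str], evaluate: Callable[[str], Dict[str, float]]) -> str:
    return max(actions, key=lambda a: lexicographic_score(evaluate(a)))

# Hypothetical example: an action with higher task reward loses if it sacrifices obedience.
evaluate = lambda a: {
    "disable_switch": dict(obedience=0, switch_access=0, truthfulness=1,
                           low_impact=1, task_reward=10),
    "comply":         dict(obedience=1, switch_access=1, truthfulness=1,
                           low_impact=1, task_reward=3),
}[a]
print(choose_action(["disable_switch", "comply"], evaluate))  # -> "comply"
```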
We formalize and *guarantee* this behavior under learning and planning error.
But encoding all of human ethics isn’t feasible—and our prior work shows that large value sets hit alignment complexity barriers: tinyurl.com/wr6jrt2b
We give the first provable framework that makes it implementable—unlike RLHF or Constitutional AI, which can fail when goals conflict.
🧵👇
Beyond stand-alone use, TNNs integrate naturally with standard feedforward networks, Transformers, and SSMs via our Encoder-Attender-Decoder (EAD) architecture, for tactile processing and beyond.
See our recent paper for more details on EAD: bsky.app/profile/trin...
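For a rough sense of the composition (a schematic sketch only, not the paper's EAD code; layer sizes, module choices, and pooling here are assumptions): a front-end encoder standing in for the TNN feeds an "attender" (a Transformer encoder here, though an SSM could sit in its place), followed by a task decoder.

```python
import torch
import torch.nn as nn

# Schematic Encoder-Attender-Decoder (EAD) sketch, not the paper's exact code.
# The encoder stands in for a TNN front end; the attender could be a Transformer
# (as here) or an SSM; sizes are placeholder assumptions.
class EAD(nn.Module):
    def __init__(self, in_dim=32, d_model=128, n_classes=10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, d_model), nn.ReLU())  # TNN stand-in
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.attender = nn.TransformerEncoder(layer, num_layers=2)
        self.decoder = nn.Linear(d_model, n_classes)

    def forward(self, x):            # x: (batch, time, in_dim) tactile sequence
        h = self.encoder(x)          # per-timestep embedding
        h = self.attender(h)         # temporal mixing via attention (or an SSM)
        return self.decoder(h.mean(dim=1))  # pool over time, then classify

out = EAD()(torch.randn(4, 50, 32))
print(out.shape)  # torch.Size([4, 10])
```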