Dane Carnegie Malenfant
@dvnxmvlhdf5.bsky.social
MSc. @mila-quebec.bsky.social and @mcgill.ca in the LiNC lab
Fixating on multi-agent RL, Neuro-AI and decisions
Ēka ē-akimiht
https://danemalenfant.com/
2/3 It was wonderful to see machine learning’s impact across so many fields. Especially work from Taiwan. Like Canada, Taiwan has Indigenous peoples; I believe cultural and traditional knowledge can strengthen ML systems and how they generalize and align to real-world contexts.
November 3, 2025 at 3:53 PM
1/3 Thank you to CIFAR and partners in DSET for bringing me to Banff to speak on my research: The challenge of hidden gifts in multi-agent reinforcement learning arxiv.org/abs/2505.20579. We introduce a novel task on reciprocity with a scarce resource; take what you need, leave what you don’t.
November 3, 2025 at 3:53 PM
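As a rough illustration of the kind of task in the paper, here is a toy sketch (this is not the paper's actual environment; the agent count, reward values, and action set are made up): several agents share one scarce key, each needs it once to open their own door, and a collective bonus arrives only when every door is open, so agents do best when they leave the key behind after using it.

```python
import numpy as np

class ToyHiddenGiftTask:
    """Toy stand-in for a 'take what you need, leave what you don't' task."""
    KEEP, LEAVE, OPEN_AND_KEEP, OPEN_AND_LEAVE = range(4)

    def __init__(self, n_agents=3, own_reward=1.0, collective_reward=5.0):
        self.n = n_agents
        self.own_reward = own_reward
        self.collective_reward = collective_reward
        self.reset()

    def reset(self):
        self.doors_open = np.zeros(self.n, dtype=bool)
        self.key_holder = np.random.randint(self.n)   # the scarce resource
        return self._obs()

    def _obs(self):
        # Each agent only sees its own door and whether it holds the key; it
        # never sees who left the key for it -- the "hidden gift".
        return [(bool(self.doors_open[i]), self.key_holder == i) for i in range(self.n)]

    def step(self, actions):
        rewards = np.zeros(self.n)
        i = self.key_holder
        a = actions[i]                        # only the key holder can act on the key
        if a in (self.OPEN_AND_KEEP, self.OPEN_AND_LEAVE) and not self.doors_open[i]:
            self.doors_open[i] = True
            rewards[i] += self.own_reward     # reward for opening your own door
        if a in (self.LEAVE, self.OPEN_AND_LEAVE):
            closed = np.flatnonzero(~self.doors_open)
            if closed.size > 0:               # leave the key for someone who still needs it
                self.key_holder = int(np.random.choice(closed))
        done = bool(self.doors_open.all())
        if done:
            rewards += self.collective_reward # shared bonus once every door is open
        return self._obs(), rewards, done

env = ToyHiddenGiftTask(n_agents=3)
obs = env.reset()
obs, rewards, done = env.step([env.OPEN_AND_LEAVE] * 3)  # everyone opens (if able) and shares
```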
I really enjoy reading equations in formal fields (e.g., math, computer science) because of all the colours and shapes that appear in your mind, but making low-math public-outreach posters is fun too @ivado.bsky.social @mila-quebec.bsky.social
October 22, 2025 at 3:28 PM
A nice feeling after long long hours
October 21, 2025 at 9:51 PM
To follow up on the asymptotic proof that the self-correction term works with any number of agents or coalitions, here are the results for 3 agents.
The policy-gradient agents' performance suffers with more agents, but self-correction still stabilizes learning arxiv.org/abs/2505.20579
October 16, 2025 at 5:53 PM
Here is my plan to make Bluesky more fun and active:
October 8, 2025 at 12:06 AM
4/8
To communicate this to a general audience and the #art community, I built a minimal task: two Gaussian bandits. One agent optimizes with entropy; the other doesn’t. Mid-training, the reward distribution jumps.
October 7, 2025 at 6:34 PM
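A minimal sketch of that two-bandit demo (my own reconstruction, not the code behind the posters; the arm means, noise level, learning rate, and entropy weight are all made-up values): a softmax policy trained with a REINFORCE-style update, with or without an entropy bonus, on a 2-armed Gaussian bandit whose arm means swap halfway through training.

```python
import numpy as np

def run(entropy_coef, steps=4000, lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    means = np.array([1.0, 0.0])          # arm 0 pays best at first
    logits = np.zeros(2)                  # softmax policy parameters
    rewards = []
    for t in range(steps):
        if t == steps // 2:
            means = means[::-1]           # mid-training jump in the reward distribution
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        arm = rng.choice(2, p=probs)
        r = rng.normal(means[arm], 0.5)   # Gaussian reward
        rewards.append(r)
        # REINFORCE gradient for the sampled arm: d log pi(arm) / d logits.
        pg = -probs.copy()
        pg[arm] += 1.0
        # Gradient of the policy entropy with respect to the logits.
        H = -np.sum(probs * np.log(probs + 1e-12))
        ent_grad = -probs * (np.log(probs + 1e-12) + H)
        logits += lr * (r * pg + entropy_coef * ent_grad)
    return float(np.mean(rewards[-500:]))  # average reward well after the jump

# The entropy-regularized agent keeps some probability on the "bad" arm, so it
# typically adapts faster once the arms swap (exact numbers depend on settings).
print("with entropy bonus:   ", run(entropy_coef=0.5))
print("without entropy bonus:", run(entropy_coef=0.0))
```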
2/8
I proposed a reinforcement-learning (RL) demo: add a maximum-entropy term to increase the longevity of systems in a non-stationary environment. This is well known to the RL research community: openreview.net/forum?id=PtS...
(photo by Félix Bonne-Vie)
October 7, 2025 at 6:34 PM
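For readers outside RL: the maximum-entropy objective referred to here, in its standard textbook form (not taken from the linked work or the demo code), simply adds an entropy bonus, weighted by a temperature α, to the usual expected return:

```latex
J(\pi) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t} r(s_t, a_t) \;+\; \alpha\,\mathcal{H}\big(\pi(\cdot \mid s_t)\big)\right],
\qquad
\mathcal{H}\big(\pi(\cdot \mid s)\big) \;=\; -\sum_{a} \pi(a \mid s)\,\log \pi(a \mid s).
```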
My eye colour apparently changed after 6 years
October 3, 2025 at 12:07 AM
Hanover’s Oktoberfest honouring hip hop’s best
September 28, 2025 at 12:56 PM
In particular, I started by presenting a validation experiment for the self-correction term.
Rather than "if x then y" this tested "if not x then not y".
This inhibits learning the sub-policy for maximizing collective reward. Agents compete even with a larger reward signal not to
Rather than "if x then y" this tested "if not x then not y".
This inhibits learning the sub-policy for maximizing collective reward. Agents compete even with a larger reward signal not to
September 20, 2025 at 3:42 PM
The “proof” to the below thread in one page
August 19, 2025 at 6:46 PM
But with self-correction, not all combinations of agents need to be calculated. Only the coefficients for each level of the tree are needed, yielding O(log V) complexity
August 16, 2025 at 6:07 PM
Without self-correction, this is like walking through an n-ary tree where the highest-order gradient is at the root
August 16, 2025 at 6:07 PM
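A toy rendering of that counting argument, as I read it (hypothetical numbers and names, not code from the paper): walking every node of a balanced n-ary tree touches V terms, while keeping only one coefficient per level touches about log_n(V) terms.

```python
from math import comb, log

def nodes_in_tree(n, depth):
    # Full enumeration: one term per node of a balanced n-ary tree, O(V) terms.
    return sum(n ** level for level in range(depth + 1))

def per_level_coefficients(depth):
    # Self-correction view: one binomial coefficient per level, O(log V) terms.
    return [comb(depth, level) for level in range(depth + 1)]

n, depth = 3, 4
V = nodes_in_tree(n, depth)
print("terms without self-correction (all nodes):", V)
print("terms with self-correction (one per level):", len(per_level_coefficients(depth)),
      "~ log_n(V) =", round(log(V, n), 2))
```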
and the full update is now:
August 16, 2025 at 6:07 PM
Then the sum converges to a distribution over gradient operators, with I as the identity operator.
August 16, 2025 at 6:07 PM
Now, for clarity, let f denote the correction-term function
August 16, 2025 at 6:07 PM
This would continue with more and more agents, and the orders of the gradients neatly follow a binomial distribution
August 16, 2025 at 6:07 PM
These objectives come out cleanly in the global optimization, but since there are 2 ways to leave the key with 1 agent and 1 way to leave the key with all 3 agents out of three agents, we include a coefficient in front of the second term.
August 16, 2025 at 6:07 PM
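For what it's worth, the 2 and the 1 above are consistent with binomial coefficients over the other two agents (my reading of the counting, not the paper's notation):

```python
from math import comb

others = 2                                    # the two agents besides "me"
ways = {k: comb(others, k) for k in (1, 2)}   # ways to leave the key with k of them
print(ways)                                   # {1: 2, 2: 1} -> the coefficients 2 and 1
```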
Then with policy independence (there isn't a better action to take since success now requires another agent), we have another correction term.
August 16, 2025 at 6:07 PM
Using agent k's value approximation as a surrogate for the expected collective reward is the same as before, but it leads to a higher-order gradient.
August 16, 2025 at 6:07 PM
But now consider an extension of the reward function that rewards two agents opening their doors and gives a larger collective reward for 3 agents opening their doors. The 2-agent case has already been covered, but now the collective reward requires 3 agents.
August 16, 2025 at 6:07 PM
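A tiny sketch of that extended reward (the reward values are made up): a bonus once any two doors are open and a larger collective bonus once all three are.

```python
def collective_bonus(doors_open, pair_bonus=2.0, full_bonus=5.0):
    # Tiered collective reward: 2 open doors pay something, 3 pay more.
    n_open = sum(doors_open)
    if n_open == 3:
        return full_bonus
    if n_open == 2:
        return pair_bonus
    return 0.0

print(collective_bonus([True, True, False]))  # 2.0
print(collective_bonus([True, True, True]))   # 5.0
```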
And this term is equivalent to the Q-value estimate of the collective shared reward (which is non-stationary and changes between policy updates)
August 16, 2025 at 6:07 PM