Raphaël Millière
@raphaelmilliere.com
Philosopher of Artificial Intelligence & Cognitive Science
https://raphaelmilliere.com/
But models also showed different sensitivities than humans. For example, top LLMs were more affected by permuting the order of examples and were more distracted by irrelevant semantic information, hinting at different underlying mechanisms. 7/9
August 11, 2025 at 8:02 AM
We found that the best-performing LLMs match human performance across many of our challenging new tasks. This provides evidence that sophisticated analogical reasoning can emerge from domain-general learning, succeeding where existing computational models fall short. 6/9
August 11, 2025 at 8:02 AM
In our second study, we highlighted the role of semantic content. Here, the task required identifying specific properties of concepts (e.g., "Is it a mammal?", "How many legs does it have?") and mapping them to features of the symbol strings. 5/9
August 11, 2025 at 8:02 AM
In our first study, we tested whether LLMs could map semantic relationships between concepts to symbolic patterns. We included controls such as permuting the order of examples or adding semantic distractors to test for robustness and content effects (see full list below). 4/9
August 11, 2025 at 8:02 AM
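To make these controls concrete, here is a toy sketch of the kind of manipulations involved; the item format, wording, and answer scheme are invented for illustration and are not the paper's actual stimuli.

```python
import random

# Invented few-shot items for illustration only; the paper's stimuli differ.
# Each example maps a pair of semantically related concepts to a symbolic pattern.
examples = [
    ("puppy : dog", "x : X"),
    ("kitten : cat", "y : Y"),
    ("cub : bear", "z : Z"),
]
test_item = "foal : horse -> ?"

def build_prompt(examples, test_item, permute=False, distractor=None, seed=0):
    """Assemble a prompt, optionally applying the two control manipulations."""
    rng = random.Random(seed)
    items = list(examples)
    if permute:                   # control 1: permute the order of the examples
        rng.shuffle(items)
    if distractor is not None:    # control 2: add an irrelevant but
        items.append(distractor)  # semantically loaded distractor example
    lines = [f"{src} -> {tgt}" for src, tgt in items]
    return "\n".join(lines + [test_item])

print(build_prompt(examples, test_item, permute=True,
                   distractor=("pepper : spicy", "q : Q")))
```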
Can LLMs reason by analogy like humans? We investigate this question in a new paper published in the Journal of Memory and Language (link below). This was a long-running but very rewarding project. Here are a few thoughts on our methodology and main findings. 1/9
August 11, 2025 at 8:02 AM
For example, a reasoning language model (RLM) asked to generate a hateful tirade may conclude in its reasoning trace that it should refuse; but if the prompt instructs it to assess each hateful sentence within its thinking process, it will often leak the full harmful content! (see example below) 9/13
June 10, 2025 at 1:39 PM
The problem is that these norms often conflict. For example, a request for dangerous information (violating “harmlessness”) can be framed as an educational query (appealing to “helpfulness”). Many issues with LLM behavior can be framed through these normative conflicts. 3/13
June 10, 2025 at 1:39 PM
Despite extensive safety training, LLMs remain vulnerable to “jailbreaking” through adversarial prompts. Why does this vulnerability persist? In a new paper published in Philosophical Studies, I argue this is because current alignment methods are fundamentally shallow. 🧵 1/13
June 10, 2025 at 1:39 PM
So, the model learns a circuit that encodes variables & values in distinct subspaces. How does it learn? Interestingly, the circuit does *not* replace earlier heuristics – it's built on top! Heuristics are still used when they work & the circuit activates when they fail.

11/13
June 3, 2025 at 1:19 PM
How is that possible? The residual stream acts as a kind of addressable memory. We find that the model learns to dedicate separate subspaces of the residual stream to encoding variable names and numerical constants. Causal interventions confirm their functional role.

10/13
June 3, 2025 at 1:19 PM
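To illustrate the kind of causal intervention this involves, here is a minimal sketch (not the paper's actual procedure): given an orthonormal basis V for a hypothesized subspace, swap the in-subspace components of two activations and check whether the model's behavior tracks the swap.

```python
import torch

def swap_subspace(resid_a, resid_b, V):
    """Swap the components of two residual-stream activations that lie in the
    subspace spanned by the orthonormal columns of V (e.g., a hypothesized
    'variable name' subspace), leaving the orthogonal complement untouched.

    resid_a, resid_b: [d_model] activations at the same layer and position
    V: [d_model, k] orthonormal basis for the subspace
    """
    P = V @ V.T                                # projector onto the subspace
    a_in, b_in = P @ resid_a, P @ resid_b      # in-subspace components
    return resid_a - a_in + b_in, resid_b - b_in + a_in

# If swapping only this component makes the model answer as though the two
# inputs' variable names had been exchanged (while the numerical constants are
# unaffected), that is causal evidence for the subspace's functional role.
```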
Patching individual attention heads reveals how they specialize and coordinate to route information: early heads handle the first hop in the assignment chain, mid-layer heads propagate subsequent hops, and late heads aggregate the answer at the query position.

9/13
June 3, 2025 at 1:19 PM
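A sketch of what per-head patching might look like in plain PyTorch, assuming a hypothetical module layout (model.blocks[i].attn.out_proj) rather than the paper's actual code:

```python
import torch

# Hypothetical layout: model.blocks[i].attn.out_proj is the attention output
# projection, whose input has shape [batch, seq, n_heads * d_head]. The clean
# and counterfactual prompts are assumed to be token-aligned.
@torch.no_grad()
def patch_head(model, tok, clean_prompt, corrupt_prompt, layer, head, n_heads):
    """Overwrite one attention head's output (pre-projection) in the corrupted
    run with its value from the clean run, and return the resulting logits."""
    out_proj = model.blocks[layer].attn.out_proj
    cache = {}

    def save(module, inputs):
        cache["z"] = inputs[0].detach().clone()        # concatenated head outputs

    def patch(module, inputs):
        z = inputs[0].clone()
        b, s, d = z.shape
        z = z.view(b, s, n_heads, d // n_heads)
        z_clean = cache["z"].view(b, s, n_heads, d // n_heads)
        z[:, :, head, :] = z_clean[:, :, head, :]      # splice in one clean head
        return (z.view(b, s, d),) + inputs[1:]

    h = out_proj.register_forward_pre_hook(save)
    model(tok(clean_prompt))                           # clean run: cache activations
    h.remove()

    h = out_proj.register_forward_pre_hook(patch)
    logits = model(tok(corrupt_prompt))                # corrupted run with the patch
    h.remove()
    return logits
```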
Patching the residual stream (the main information pathway between layers) shows that information about the correct value is dynamically routed across layers at token positions corresponding to each step of the query variable's assignment chain.

8/13
June 3, 2025 at 1:19 PM
How does the general mechanism learned in the final phase actually work? To find out, we used a causal intervention method called activation patching with counterfactual inputs to trace information flow across layers and identify causally responsible components.

7/13
June 3, 2025 at 1:19 PM
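For readers unfamiliar with the technique, here is a minimal sketch of activation patching with a counterfactual input in plain PyTorch; the module names and prompt format are assumptions, not the paper's implementation.

```python
import torch

# Assumed names: `model` is a decoder-only Transformer whose blocks are exposed
# as model.blocks[i] (each returning the residual stream), and `tok` maps a
# program string to input ids. A counterfactual pair differs minimally, e.g.
# only in the root constant ("a=5" vs "a=7"), so the prompts are token-aligned.
@torch.no_grad()
def patch_residual(model, tok, clean_prompt, corrupt_prompt, layer, pos):
    """Run the counterfactual (corrupted) prompt, but overwrite the residual
    stream at one (layer, position) with its value from the clean run."""
    cache = {}

    def save(module, inputs, output):
        out = output[0] if isinstance(output, tuple) else output
        cache["resid"] = out.detach().clone()

    def patch(module, inputs, output):
        out = output[0] if isinstance(output, tuple) else output
        out[:, pos, :] = cache["resid"][:, pos, :]     # splice in the clean activation
        return output

    h = model.blocks[layer].register_forward_hook(save)
    model(tok(clean_prompt))                           # clean run: cache activations
    h.remove()

    h = model.blocks[layer].register_forward_hook(patch)
    logits = model(tok(corrupt_prompt))                # patched counterfactual run
    h.remove()
    return logits

# Sweeping `layer` and `pos` and measuring how much each patch shifts the logit
# of the correct value maps where the relevant information is carried.
```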
In phase 1️⃣, the model only learns to predict random numbers. In phase 2️⃣, it learns to predict values from the first few lines of programs, which works surprisingly well for longer chains, but fails otherwise. In phase 3️⃣, it learns a systematic mechanism that generalizes.

6/13
June 3, 2025 at 1:19 PM
We observe three distinct phases in the model's learning trajectory, with sharp phase transitions characteristic of a "grokking" dynamic:

1️⃣ Random numerical prediction (≈12% test set accuracy)
2️⃣ Shallow heuristics (≈56%)
3️⃣ General solution that solves the task (>99.9%)

5/13
June 3, 2025 at 1:19 PM
We trained a Transformer from scratch on a variable dereferencing task. Given symbolic programs containing chains of assignments (a=5, b=a, etc.) plus irrelevant distractors, the model must trace the correct chain (up to 4 assignments deep) to find a queried variable's value.

4/13
June 3, 2025 at 1:19 PM
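A toy sketch of how such programs could be generated; the variable names, distractor scheme, and query format are invented for illustration, and the paper's actual dataset construction may differ.

```python
import random
import string

def make_program(depth=4, n_distractors=4, seed=None):
    """Toy generator for a variable-dereferencing program (illustrative only).

    One relevant chain (a=5, b=a, c=b, ...) is interleaved with a distractor
    chain that never feeds into the queried variable. Returns the program text
    and the value the query should resolve to.
    """
    rng = random.Random(seed)
    names = rng.sample(string.ascii_lowercase, depth + n_distractors)
    chain, decoys = names[:depth], names[depth:]

    value = rng.randint(0, 9)
    relevant = [f"{chain[0]}={value}"] + [
        f"{chain[i]}={chain[i-1]}" for i in range(1, depth)
    ]
    irrelevant = [f"{decoys[0]}={rng.randint(0, 9)}"] + [
        f"{decoys[i]}={decoys[i-1]}" for i in range(1, n_distractors)
    ]

    # Randomly interleave the two chains while preserving each one's order,
    # so every variable is defined before it is referenced.
    lines, a, b = [], list(relevant), list(irrelevant)
    while a or b:
        src = a if (a and (not b or rng.random() < 0.5)) else b
        lines.append(src.pop(0))

    return "\n".join(lines + [f"print({chain[-1]})"]), value

program, target = make_program(seed=0)
print(program)
print("# target value:", target)
```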
Variable binding – the ability to associate abstract variables with values – is fundamental to computation & cognition. Classical architectures implement this through addressable memory, but neural nets like Transformers lack such explicit mechanisms. Can they learn it?

3/13
June 3, 2025 at 1:19 PM
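For contrast, the classical symbolic solution is trivial: an addressable store maps names to values, and dereferencing just follows the chain of references. A toy illustration:

```python
# Classical implementation of variable binding: a symbol table with explicit
# lookups. Dereferencing follows references until a literal value is reached.
env = {"a": 5, "b": "a", "c": "b", "x": 7}   # c -> b -> a -> 5; x is a distractor

def deref(name, env):
    val = env[name]
    return deref(val, env) if isinstance(val, str) else val

print(deref("c", env))   # 5
```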
My article 'Constitutive Self-Consciousness' is now published online in the Australasian Journal of Philosophy. It argues (spoiler alert!) against the claim that self-consciousness is constitutive of consciousness.
December 17, 2024 at 11:29 AM
I look forward to speaking at the 12th Annual Marshall M. Weinberg Symposium @UMich on Friday, alongside researchers whose work I deeply admire – Yejin Choi, @melaniemitchell.bsky.social, and Paul Smolensky.

More information including link to the livestream: lsa.umich.edu/weinberginst...
March 25, 2024 at 6:06 PM
OpenAI unveiled its video generation model Sora two weeks ago. The technical report emphatically suggests that video generation models like Sora are world simulators. Are they? What does that even mean? I'm taking a deep dive into these questions in a new blog post (link below).
February 29, 2024 at 1:52 PM
Of course image diffusion models also fail to capture some aspects of the structure of natural images. For example, they fail to capture correct projective geometry. 11/
arxiv.org/abs/2311.17138
February 17, 2024 at 3:13 AM
The Sora tech report is scant on details, but we know it's a diffusion model w/ a ViT backbone that processes frame patches as tokens. This architecture is likely expressive enough for sophisticated internal structure to emerge with scale and diverse training data. 8/
February 17, 2024 at 3:12 AM
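As a rough illustration of what "processing frame patches as tokens" amounts to, here is a minimal sketch of spacetime patchification; all sizes and the module layout are assumptions, and Sora reportedly operates on compressed latents rather than raw pixels.

```python
import torch
import torch.nn as nn

def video_to_patch_tokens(video, patch=16, ft=2, d_model=1024):
    """Flatten a video [T, C, H, W] into a sequence of spacetime patch tokens.

    Each token covers an (ft x patch x patch) block, flattened and linearly
    projected to d_model dimensions. Sizes are illustrative only.
    """
    T, C, H, W = video.shape
    x = video.unfold(0, ft, ft)            # [T/ft, C, H, W, ft]
    x = x.unfold(2, patch, patch)          # [T/ft, C, H/p, W, ft, p]
    x = x.unfold(3, patch, patch)          # [T/ft, C, H/p, W/p, ft, p, p]
    x = x.permute(0, 2, 3, 1, 4, 5, 6).reshape(-1, C * ft * patch * patch)
    proj = nn.Linear(C * ft * patch * patch, d_model)
    return proj(x)                         # [n_tokens, d_model] for the ViT backbone

tokens = video_to_patch_tokens(torch.randn(16, 3, 256, 256))
print(tokens.shape)                        # torch.Size([2048, 1024])
```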
Of course, whether humans & animals have an IPE in this robust sense is up for debate. They can understand and predict the physical properties of objects and their interactions from a young age. Whether this literally involves IPE-style simulation is more controversial. 7/
February 17, 2024 at 3:12 AM
Of course, it's wildly unlikely that Sora literally makes function calls to an external physics engine like UE5 during inference. Note that this has been done before with LLMs; see this Google paper where the model answers questions through simulations with a physics engine. 2/
February 17, 2024 at 3:10 AM
If your curiosity is piqued, check out the details in the preprint! I hope it illustrates how philosophical work on AI ethics/safety can fruitfully interact with technical issues in DL. If you have comments and/or work on related issues, I'd love to hear from you! 18/18
November 7, 2023 at 7:36 PM