André Panisson
@panisson.bsky.social
Principal Researcher @ CENTAI.eu | Leading the Responsible AI Team. Building Responsible AI through Explainable AI, Fairness, and Transparency. Researching Graph Machine Learning, Data Science, and Complex Systems to understand collective human behavior.
As seen in the preprint recently published on arXiv, the authors include Neel Nanda from Google DeepMind, who leads its mechanistic interpretability team.
arxiv.org/abs/2411.14257
Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models
Hallucinations in large language models are a widespread problem, yet the mechanisms behind whether models will hallucinate are poorly understood, limiting our ability to solve this problem. Using spa...
arxiv.org
November 30, 2024 at 7:57 PM
You might like the work from @aliciacurth.bsky.social. Fantastic contributions to understanding this effect.
November 19, 2024 at 7:29 AM
👋 I do research on xAI for Graph ML and am starting to explore Mechanistic Interpretability. I'd love to be added!
November 17, 2024 at 9:21 PM
Since LLMs are essentially artefacts of human knowledge, we can use them as a lens to study human biases and behaviour patterns. Exploring their learned representations could unlock new insights. Got ideas or want to collaborate on this? Let’s connect!
November 16, 2024 at 5:46 PM
In "Do I Know This Entity?", Sparse autoencoders reveal how LLMs recognize entities they ‘know’—and how this self-knowledge impacts hallucinations. These insights could help steer models to refuse or hallucinate less. Fascinating work on interpretability of LLMs!
openreview.net/forum?id=WCR...
Do I Know This Entity? Knowledge Awareness and Hallucinations in...
Hallucinations in large language models are a widespread problem, yet the mechanisms behind whether models will hallucinate are poorly understood, limiting our ability to solve this problem. Using...
openreview.net
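A minimal sketch of the kind of latent-direction steering the paper points to, assuming a PyTorch setup; the tensor shapes, the "unknown entity" direction name, and the scale alpha are illustrative assumptions, not the authors' code.

```python
# Minimal steering sketch (assumptions: PyTorch, residual-stream activations of
# shape (batch, seq, d_model), and a direction taken from an SAE decoder column).
import torch

def steer(acts: torch.Tensor, direction: torch.Tensor, alpha: float = 5.0) -> torch.Tensor:
    """Shift activations along a unit-normalised SAE latent direction."""
    direction = direction / direction.norm()
    return acts + alpha * direction

# Toy usage: nudging activations toward a hypothetical "unknown entity" latent,
# the kind of intervention the paper links to refusing rather than hallucinating.
acts = torch.randn(1, 12, 768)            # stand-in for one layer's activations
unknown_entity_dir = torch.randn(768)     # stand-in for the SAE decoder direction
steered = steer(acts, unknown_entity_dir, alpha=8.0)
```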
November 16, 2024 at 5:39 PM
In Scaling and Evaluating Sparse Autoencoders, the authors extract 16M concepts (latents) from GPT-4 (guess the authors?).
They simplify sparsity tuning with k-sparse autoencoders, and the results show clear improvements in interpretability. Code, models (not all!), and a visualizer are included. (A toy k-sparse sketch follows the link.)
openreview.net/forum?id=tcs...
Scaling and evaluating sparse autoencoders
Sparse autoencoders provide a promising unsupervised approach for extracting interpretable features from a language model by reconstructing activations from a sparse bottleneck layer. Since...
openreview.net
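A minimal sketch of the k-sparse (TopK) idea, assuming a PyTorch setup; the dimensions and k below are illustrative, and the models in the paper include extra details (biases, normalisation, auxiliary losses) omitted here.

```python
# k-sparse (TopK) autoencoder sketch: only the k largest latent pre-activations
# are kept, so sparsity is set directly by k instead of tuning an L1 penalty.
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    def __init__(self, d_model: int, n_latents: int, k: int):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_model, n_latents)
        self.decoder = nn.Linear(n_latents, d_model)

    def forward(self, x: torch.Tensor):
        pre = torch.relu(self.encoder(x))
        # Keep only the top-k activations per example; zero out the rest.
        topk = torch.topk(pre, self.k, dim=-1)
        latents = torch.zeros_like(pre).scatter_(-1, topk.indices, topk.values)
        return self.decoder(latents), latents

# Training reduces to plain reconstruction loss; no sparsity coefficient to tune.
sae = TopKSAE(d_model=768, n_latents=16_384, k=32)
acts = torch.randn(8, 768)                # stand-in for residual-stream activations
recon, latents = sae(acts)
loss = ((recon - acts) ** 2).mean()
```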
November 16, 2024 at 5:38 PM