André Panisson
@panisson.bsky.social
Principal Researcher @ CENTAI.eu | Leading the Responsible AI Team. Building Responsible AI through Explainable AI, Fairness, and Transparency. Researching Graph Machine Learning, Data Science, and Complex Systems to understand collective human behavior.
As seen in the preprint recently published on arXiv, the authors include Neel Nanda from Google DeepMind, who leads its mechanistic interpretability team.
arxiv.org/abs/2411.14257
Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models
Hallucinations in large language models are a widespread problem, yet the mechanisms behind whether models will hallucinate are poorly understood, limiting our ability to solve this problem. Using spa...
arxiv.org
November 30, 2024 at 7:57 PM
You might like the work from @aliciacurth.bsky.social. Fantastic contributions to understanding this effect.
November 19, 2024 at 7:29 AM
👋 I do research on xAI for Graph ML and am starting to explore Mechanistic Interpretability. I'd love to be added!
November 17, 2024 at 9:21 PM
Since LLMs are essentially artefacts of human knowledge, we can use them as a lens to study human biases and behaviour patterns. Exploring their learned representations could unlock new insights. Got ideas or want to collaborate on this? Let’s connect!
November 16, 2024 at 5:46 PM
In "Do I Know This Entity?", Sparse autoencoders reveal how LLMs recognize entities they ‘know’—and how this self-knowledge impacts hallucinations. These insights could help steer models to refuse or hallucinate less. Fascinating work on interpretability of LLMs!
openreview.net/forum?id=WCR...
Do I Know This Entity? Knowledge Awareness and Hallucinations in...
Hallucinations in large language models are a widespread problem, yet the mechanisms behind whether models will hallucinate are poorly understood, limiting our ability to solve this problem. Using...
openreview.net
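A minimal sketch of the kind of latent-direction steering the paper points to, assuming a PyTorch setup; the tensor shapes, the "unknown entity" direction name, and the scale alpha are illustrative assumptions, not the authors' code.

```python
# Minimal steering sketch (assumptions: PyTorch, residual-stream activations of
# shape (batch, seq, d_model), and a direction taken from an SAE decoder column).
import torch

def steer(acts: torch.Tensor, direction: torch.Tensor, alpha: float = 5.0) -> torch.Tensor:
    """Shift activations along a unit-normalised SAE latent direction."""
    direction = direction / direction.norm()
    return acts + alpha * direction

# Toy usage: nudging activations toward a hypothetical "unknown entity" latent,
# the kind of intervention the paper links to refusing rather than hallucinating.
acts = torch.randn(1, 12, 768)            # stand-in for one layer's activations
unknown_entity_dir = torch.randn(768)     # stand-in for the SAE decoder direction
steered = steer(acts, unknown_entity_dir, alpha=8.0)
```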
November 16, 2024 at 5:39 PM
In Scaling and Evaluating Sparse Autoencoders, the authors extract 16M concepts (latents) from GPT-4 (guess the authors?).
They simplify sparsity tuning with k-sparse autoencoders, and the results show clear improvements in interpretability. Code, models (not all!), and a visualizer are included. (A toy k-sparse sketch follows the link.)
openreview.net/forum?id=tcs...
Scaling and evaluating sparse autoencoders
Sparse autoencoders provide a promising unsupervised approach for extracting interpretable features from a language model by reconstructing activations from a sparse bottleneck layer. Since...
openreview.net
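A minimal sketch of the k-sparse (TopK) idea, assuming a PyTorch setup; the dimensions and k below are illustrative, and the models in the paper include extra details (biases, normalisation, auxiliary losses) omitted here.

```python
# k-sparse (TopK) autoencoder sketch: only the k largest latent pre-activations
# are kept, so sparsity is set directly by k instead of tuning an L1 penalty.
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    def __init__(self, d_model: int, n_latents: int, k: int):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_model, n_latents)
        self.decoder = nn.Linear(n_latents, d_model)

    def forward(self, x: torch.Tensor):
        pre = torch.relu(self.encoder(x))
        # Keep only the top-k activations per example; zero out the rest.
        topk = torch.topk(pre, self.k, dim=-1)
        latents = torch.zeros_like(pre).scatter_(-1, topk.indices, topk.values)
        return self.decoder(latents), latents

# Training reduces to plain reconstruction loss; no sparsity coefficient to tune.
sae = TopKSAE(d_model=768, n_latents=16_384, k=32)
acts = torch.randn(8, 768)                # stand-in for residual-stream activations
recon, latents = sae(acts)
loss = ((recon - acts) ** 2).mean()
```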
November 16, 2024 at 5:38 PM