Patrick Kahardipraja
pkhdipraja.bsky.social
PhD student @ Fraunhofer HHI. Interpretability, incremental NLP, and NLU. https://pkhdipraja.github.io/
Autointerp provides descriptions of LLM features, but how these descriptions are evaluated varies from one setting to another. We propose FADE, a framework that enables standardized, automatic evaluation of the alignment between features and their autointerp descriptions across various metrics.
July 16, 2025 at 1:26 PM
Building on these insights, we present a probe that tracks knowledge provenance during inference and shows where that knowledge is localized within the input prompt. Our attempt shows promising results, with >94% ROC AUC and >84% localization accuracy.

4/4
May 26, 2025 at 4:01 PM
Using interpretability tools, we find that heads important for RAG fall into two categories: parametric heads, which encode relational knowledge, and in-context heads, which are responsible for processing information in the prompt.

2/
May 26, 2025 at 4:01 PM