Mechanistic interpretability emphasizes features and circuits as the fundamental units of analysis and usually aims at understanding a fully trained neural network.
Jargony shortname: mech interp.
A different perspective: Against Almost Every Theory of Impact of Interpretability by Charbel-Raphaël.
Personal take: Mech interp is producing many results these days, but there are two key problems it needs to solve:
- Interpretability techniques so far only work on models much smaller than the frontier ones. What we really need is the ability to interpret a stronger model with a weaker, cheaper one, not the other way around.
- We’ve made some progress on reading representations of concepts in a neural network, but what we ultimately need is to detect more complex thoughts and plans like “To maximize profit, I’m going to lie to the auditor.” How do we go from detecting representations to detecting a Rube Goldberg machine of a plan that outsmarts humans?
For these reasons I’m more excited about using mech interp to learn the science behind neural networks and deep learning. Some ideas in this direction are Developmental Interpretability and Singular Learning Theory.