Mechanistic interpretability

Mechanistic interpretability emphasizes features and circuits as the fundamental units of analysis and usually aims at understanding a fully trained neural network.

Jargony short name: mech interp.

Entry point

Concrete Steps to Get Started in Transformer Mechanistic Interpretability by Neel Nanda

A different perspective: Against Almost Every Theory of Impact of Interpretability by Charbel-Raphaël.

Neuronpedia: Explore sparse autoencoders, which decompose LLM activations into more interpretable features.
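
To make the idea concrete, here is a minimal sketch of a sparse autoencoder, my own toy illustration rather than Neuronpedia's or any particular paper's implementation: an overcomplete encoder/decoder trained to reconstruct a model's activations under an L1 sparsity penalty, so that each hidden unit hopefully corresponds to a more interpretable feature.

```python
# Toy sparse autoencoder sketch (illustrative only).
# It reconstructs residual-stream activations through an overcomplete,
# L1-penalized hidden layer whose units act as candidate "features".
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)  # activations -> feature space
        self.decoder = nn.Linear(d_features, d_model)  # feature space -> reconstruction

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))      # sparse, non-negative feature activations
        recon = self.decoder(features)
        return recon, features


def sae_loss(acts, recon, features, l1_coeff=1e-3):
    recon_loss = (recon - acts).pow(2).mean()          # stay faithful to the original activations
    sparsity_loss = features.abs().mean()              # L1 penalty pushes most features to zero
    return recon_loss + l1_coeff * sparsity_loss
```

Training this on a large batch of activations from one layer gives a dictionary of features you can then try to interpret one by one, which is roughly what Neuronpedia lets you browse.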

Personal take: Mech interp is producing a lot of results these days, but there are two key problems it needs to solve:

  1. Interpretability techniques so far only work on models much smaller than the ones we care about. What we really need is to be able to interpret a stronger model with a weaker, cheaper one, not the other way around.
  2. We’ve made some progress on reading representations of concepts in a neural network (see the probe sketch after this list), but what we need at the end of the day is to detect more complex thoughts and plans like “To maximize profit, I’m going to lie to the auditor.” How do we go from detecting representations to detecting a Rube Goldberg Machine of a plan that outsmarts humans?
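
For context on what “reading representations” usually amounts to in practice, here is a toy sketch of a linear probe on hidden activations. The dataset, labels, and the “lying” concept are purely illustrative assumptions, not a real detector.

```python
# Toy "representation reading" sketch: a linear probe trained on hidden
# activations to detect a single concept (e.g. "this statement is deceptive").
# Assumes you already have (activation, label) pairs; everything here is illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression


def train_concept_probe(activations: np.ndarray, labels: np.ndarray) -> LogisticRegression:
    """activations: (n_samples, d_model) hidden vectors; labels: (n_samples,) binary concept labels."""
    probe = LogisticRegression(max_iter=1000)
    probe.fit(activations, labels)
    return probe
```

A probe like this flags one concept at one layer; detecting a multi-step plan would mean composing many such readouts over a whole forward pass (or many of them), which is the gap I'm pointing at above.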

For these reasons I’m more excited about learning the science behind neural networks and deep learning through mech interp. Some ideas in this direction are Developmental Interpretability and Singular Learning Theory.
