Developmental interpretability

Developmental interpretability is a novel AI alignment research agenda studying how structure forms in neural networks.

Source: “Towards Developmental Interpretability” (LessWrong).

Like Singular Learning Theory, on which it builds, developmental interpretability is centrally about understanding phase transitions during training. This is relevant to alignment in both a mundane and a profound sense:

In its mundane form, the goal of developmental interpretability is to:

  • advance the science of detecting when structural changes happen during training,
  • localize these changes to a subset of the weights, and
  • give the changes their proper context within the broader set of computational structures in the current state of the network.

This is all valuable information that can tell evaluation pipelines or mechanistic interpretability tools when and where to look, thereby lowering the alignment tax; a rough sketch of the detection step is given below.
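
As an illustration of the first bullet above, here is a minimal sketch of one way such detection could work: estimate the local learning coefficient (a model-complexity measure from Singular Learning Theory) at successive training checkpoints with SGLD sampling, and flag checkpoints where the estimate jumps. This is a from-scratch toy, not the devinterp library’s API; the model, data loader, and hyperparameters (step size, beta, gamma, number of steps) are illustrative assumptions.

```python
# Toy sketch (not the devinterp library): SGLD-based estimate of the local
# learning coefficient (LLC) around a checkpoint's weights w*.
import copy
import math
import torch


def estimate_llc(model, loss_fn, data_loader, n_data,
                 n_steps=200, step_size=1e-4, gamma=1.0, device="cpu"):
    """Crude estimate of lambda_hat ~ n * beta * (E_w[L(w)] - L(w*)),
    where the expectation is over SGLD samples localized around w*."""
    model = copy.deepcopy(model).to(device)
    params = [p for p in model.parameters() if p.requires_grad]
    beta = 1.0 / math.log(n_data)  # a common inverse-temperature choice

    def mean_loss():
        # Average loss over the whole loader at the current weights.
        total, count = 0.0, 0
        with torch.no_grad():
            for x, y in data_loader:
                x, y = x.to(device), y.to(device)
                total += loss_fn(model(x), y).item() * len(x)
                count += len(x)
        return total / count

    center_loss = mean_loss()                       # L(w*)
    anchors = [p.detach().clone() for p in params]  # w*, for the localizing pull

    running, seen = 0.0, 0
    data_iter = iter(data_loader)
    for _ in range(n_steps):                        # no burn-in, for brevity
        try:
            x, y = next(data_iter)
        except StopIteration:
            data_iter = iter(data_loader)
            x, y = next(data_iter)
        x, y = x.to(device), y.to(device)
        model.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        with torch.no_grad():
            for p, p0 in zip(params, anchors):
                grad = p.grad if p.grad is not None else torch.zeros_like(p)
                drift = n_data * beta * grad + gamma * (p - p0)
                noise = torch.randn_like(p) * math.sqrt(step_size)
                p.add_(-0.5 * step_size * drift + noise)
        running += loss.item()
        seen += 1

    return n_data * beta * (running / seen - center_loss)
```

Computing this estimate at each saved checkpoint and flagging sudden jumps gives a crude change detector for candidate phase transitions; localizing a change to a subset of the weights and interpreting it in context are the harder follow-on steps listed above.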

In its profound form, developmental interpretability aims to understand the underlying “program” of a trained neural network as a combination of this transcript of phase transitions (the form) and learned knowledge that is less universal and more perceptual (the content).

How to help

Check out the Project Ideas page on devinterp.com and join their Discord.
