Quick thoughts on theories of change for interpretability

During winter 2023 I wanted to do a literature review of Mechanistic Interpretability research agendas, analyzing their theories of change and how they relate to AI regulation. That didn’t end up happening, but I thought about this a bit and wanted to record my thoughts here at the very least.

Mechanistic Interpretability (mechinterp) is a popular agenda in AI Alignment. It has also received its fair share of criticism when it comes to its theory of change, and a popular opinion I hear in alignment is that too many people are working on mechinterp.

Overall I think mechinterp is good, and even though it’s popular in alignment, we can probably still use a lot more people in this field. Reading an alien mind that operates orders of magnitude faster than a human mind is bound to be difficult, so instead of thinking of mechinterp as “aiming for a specific method to work”, I think of it as “the general practice of interpreting model internals”, which gives us empirical knowledge about those alien minds. From that knowledge we can form new theories and then develop better ways to control them.

Relevance to AI Governance

Not a whole lotta novel ideas here. If we can do mechinterp well, then we could run better benchmarking by also checking what’s going on inside the model. Perhaps there could be an analogy to malicious code that we can “search” for within the model, too? Search to see if this English Tutor model contains the circuits required for protein synthesis… (a toy sketch of what such a scan might look like is below). We can probably get a lot of ideas just by imagining the model as open code after mechinterp, and applying software governance practices.
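
To make the “scan the model like code” idea slightly more concrete, here is a minimal, entirely hypothetical sketch. It assumes we already have some way of extracting internal activations from models (the arrays below are synthetic stand-ins, and names like `capability_acts` and `new_model_acts` are made up for illustration), and it uses a simple linear probe as a placeholder for whatever circuit-level detector mechinterp might actually give us:

```python
# Toy sketch: "scanning" a model's internals for a capability signature,
# loosely analogous to signature-based malware scanning.
# All data here is synthetic; in practice these arrays would be activations
# extracted from real models with an interpretability toolkit.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 64

# Hypothetical activations from a reference model on capability-related prompts
# (e.g., protein synthesis) vs. unrelated baseline prompts.
capability_acts = rng.normal(loc=0.5, scale=1.0, size=(200, d_model))
baseline_acts = rng.normal(loc=0.0, scale=1.0, size=(200, d_model))

# Fit a linear probe separating capability-related activations from baseline ones.
X = np.vstack([capability_acts, baseline_acts])
y = np.array([1] * len(capability_acts) + [0] * len(baseline_acts))
probe = LogisticRegression(max_iter=1000).fit(X, y)

# "Scan" activations from the model under audit (say, the English Tutor model).
new_model_acts = rng.normal(loc=0.0, scale=1.0, size=(500, d_model))
scores = probe.predict_proba(new_model_acts)[:, 1]

# Flag the audit if a meaningful fraction of activations light up the probe.
flag_rate = float((scores > 0.9).mean())
print(f"fraction of activations flagged: {flag_rate:.3f}")
if flag_rate > 0.05:
    print("capability signature detected -- escalate for manual review")
```

The point isn’t the probe itself; it’s that once model internals are legible, auditing starts to look like signature scanning and static analysis, which software governance already has playbooks for.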
