Research interests (archive)

This page archives my past research interests and what happened to them. Archiving them allows me to trace the changes in my thinking and circumstances, and keeps the Research Interests page focused on what I actively work on or think about.


2024 Summer

Predicting and preventing adversarial vulnerability via developmental interpretability

I’m excited to understand the science behind how capabilities emerge through deep learning. The goal of this understanding is to be able to predict when new capabilities will arise, and ideally what they will be. One promising agenda in this direction is Developmental Interpretability, which aims to understand phase transitions in neural networks and to invent automatic methods for detecting them. More specifically, I want to empirically study the training dynamics that lead to adversarial vulnerabilities.

Key Questions

  • Adversarial vulnerabilities exist because the learned function is not quite the one we want. How important is this for alignment? (Eliezer has some writings on this, but I don’t fully understand them.)
  • In addition to knowing when new capabilities or vulnerabilities arise during training, can we develop techniques to steer the training?

Result: Dropped for now. Devinterp can probably help with adversarial vulnerability, but there are better questions for devinterp to study. Most importantly, I couldn’t figure out an experiment design we could do in an academic setting.

Understanding humans’ over-reliance on AI

I’m working on an AI assistant in the form of lightweight AR glasses with a predictive UI that helps the user get through their day, suggesting the right actions to take at the right time in the right context. This is tricky to get right, and bad suggestions could rob the user of their agency. The harm would increase with the AI’s capabilities. How do we prevent this? Perhaps by first understanding the existing dynamics of human+AI collaboration, distinguishing the elements that contribute to those dynamics, and identifying settings that are likely to be harmful. An example of such a setting: a tired programmer’s need to finish a project quickly + an LLM code assistant’s tendency to generate erroneous code that still looks right = code with non-obvious vulnerabilities gets run in production. The output of this research could be a design guideline that helps companies make better decisions about developing and deploying AI tools that interface with humans. This may also help alignment schemes that rely on human-AI collaboration or full-on AI supervision (e.g. superalignment) go better.

Key Questions

  • In the formula of humans’ over-reliance on AI, humans are unchanging while AI changes rapidly. How useful will this research be as AI improves and gains unexpected capabilities?
    • I think this research would likely first propose different types of AI systems categorized by how we interact with them, then theorize about over-reliance by category. This approach keeps it useful for future AI systems so long as they fall under the proposed categories.

Result: I’m still interested in studying this! The idea is too loose as stated, though; it needs specific research questions as well as an experiment design.

2023 Winter

Literature review of mechanistic interpretability - theory of change

Analyze the theory of change for mech interp research agendas, and inform governing bodies on how it could help AI regulation.

Result: A full literature review proved to be beyond my ability and time budget. I did think about this a bit; see Quick thoughts on theory of changes for interpretability.

✨ Vibe-based research ✨ + Cyborgism

Forgive the title. When I talk about vibes, I’m thinking about our nervous system’s innate ability to pattern-match and intuit conclusions: System 1 thinking (Daniel Kahneman, Thinking, Fast and Slow).

I want to note that I’m not suggesting we determine whether an AI is aligned by talking to it and “feeling out” how trustworthy it is. Rather, I think human intuition is a powerful process that we can harness or enhance. A success story here might look like exploring the neuron semantics of an LLM with a visualizer, noticing strange patterns, and in turn making conjectures about its mechanisms. Then we use rigorous System 2 thinking to verify those conjectures.

Similar ideas appear in Cyborgism:

The object level plan of creating cyborgs for alignment boils down to two main directions: […] 2. Train alignment researchers to use these tools, develop a better intuitive understanding of how GPT behaves, leverage that understanding to exert fine-grained control over the model, and to do important cognitive work while staying grounded to the problem of solving alignment.

Finally, I think it’s also valuable to better understand ✨vibes✨. There are superficial similarities (or are they that superficial?) between artificial neural networks and the System 1 thinking of our own organic neural networks. We should keep our minds open and draw inspiration from neuroscience and other disciplines.

“a vibe is a compression scheme is a probabilistic model.” - janus

“To think a lot but all at once, we have to think associatively, self-referentially, vividly, temporally” - Peli Grietzer, A Theory of Vibes

Result: This agenda is far less concrete and more of a ✨vibe✨. I still think there’s something to the parallel between deep learning and our System 1 thinking, and it’s very much worth studying. However, that job is probably better left to a neuroscience researcher. And the “success story” in the original proposal would probably never work, given that 1. there’s no guarantee that the same trained intuitions transfer across different LLMs, and 2. the AI would probably be operating orders of magnitude faster than we do, so it’s unlikely that a human could react fast enough against a coup, even if it’s something they could identify. Perhaps automation can solve no. 2.
