Research interests

“Don’t work on dumb things!” - Lennart Heim

I’m interested in researching how to reduce AI risk, and I’m currently developing a research agenda for my Master’s at Oregon State University.

I’m also documenting my key questions about each research direction below. I’d love to hear what you think of them!

Predicting and preventing adversarial vulnerability via developmental interpretability

I’m excited to understand the science behind how capabilities emerge through deep learning. The goal of this understanding is to be able to predict when new capabilities will arise, and ideally what they will be. One promising agenda in this direction is Developmental Interpretability, which aims to understand phase transitions in neural networks and to develop automated methods for detecting them. More specifically, I want to empirically study the training dynamics that lead to adversarial vulnerabilities.
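To make this concrete, here is a minimal sketch of the kind of experiment I have in mind (my own toy setup, not an established Developmental Interpretability method): train a small model and, at regular intervals, measure how often a one-step FGSM attack flips its predictions. A sharp change in that curve over training is the sort of transition I’d want to detect automatically.

```python
# Toy sketch: track adversarial vulnerability across training.
# All data, model sizes, and hyperparameters here are made up for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Synthetic two-class data as a stand-in for a real dataset.
X = torch.randn(2048, 20)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).long()

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def fgsm_success_rate(model, X, y, eps=0.25):
    """Fraction of correctly classified points whose prediction flips
    under a one-step FGSM perturbation of size eps."""
    X_adv = X.clone().requires_grad_(True)
    loss = F.cross_entropy(model(X_adv), y)
    loss.backward()
    X_pert = X_adv + eps * X_adv.grad.sign()
    with torch.no_grad():
        clean_ok = model(X).argmax(1) == y
        adv_wrong = model(X_pert).argmax(1) != y
    return (clean_ok & adv_wrong).float().mean().item()

for step in range(2001):
    opt.zero_grad()
    F.cross_entropy(model(X), y).backward()
    opt.step()
    if step % 200 == 0:
        rate = fgsm_success_rate(model, X, y)
        print(f"step {step:5d}  FGSM success rate: {rate:.3f}")
```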

Key Questions

  • Adversarial vulnerabilities exist because the learned function is not quite the one we want. How important is this for alignment? (Eliezer has some writings on this, but I don’t fully understand them.)
  • Beyond knowing when new capabilities or vulnerabilities arise during training, can we develop techniques to steer training?

Understanding human over-reliance on AI

I’m working on an AI assistant in the form of lightweight AR glasses with a predictive UI that helps the user get through their day, suggesting the right actions at the right time in the right context. This is tricky to get right, and bad suggestions could rob the user of their agency; the harm would only grow with AI capabilities. How do we prevent this? Perhaps by first understanding the existing dynamics of human+AI collaboration, distinguishing the elements that contribute to those dynamics, and identifying settings that are likely to be harmful. An example of such a setting: a tired programmer’s need to finish a project quickly + an LLM code assistant’s tendency to generate erroneous code that still looks right = code with non-obvious vulnerabilities running in production. The output of this research could be a design guideline that helps companies make better decisions about developing and deploying AI tools that interface with humans. It may also help alignment schemes that rely on human-AI collaboration or full-on AI supervision (e.g. superalignment) go better.
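To make the programmer example concrete, here is a hypothetical snippet of the kind an assistant might produce: it reads fine at a glance and runs, but it builds a SQL query by interpolating user input, a classic injection vulnerability a tired reviewer could wave through. (The function names and schema are made up for illustration.)

```python
# Hypothetical illustration of "looks right but is subtly wrong".
import sqlite3

def find_user(conn: sqlite3.Connection, username: str):
    # Vulnerable: f-string interpolation puts raw user input into the query.
    cur = conn.execute(f"SELECT id, email FROM users WHERE name = '{username}'")
    return cur.fetchone()

def find_user_safe(conn: sqlite3.Connection, username: str):
    # Safe: the driver escapes the bound parameter.
    cur = conn.execute("SELECT id, email FROM users WHERE name = ?", (username,))
    return cur.fetchone()
```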

Key Questions

  • In this picture of human over-reliance on AI, humans change slowly while AI changes rapidly. How useful will this research be as AI improves and gains unexpected capabilities?
    • I think this research will likely first propose different categories of AI systems based on how we interact with them, then theorize about over-reliance for each category. That should keep it useful for future AI systems as long as they fall into the proposed categories.

Finally, I’m always on the lookout for research agendas and projects that leverage my game dev experience. If you’ve got an idea, let’s get in touch!


Learn about my previous research interests, and what happened to those hopes and dreams.