Reading note - Superintelligence

Last updated on 12024-05-30

Superintelligence: Paths, Dangers, Strategies, by Nick Bostrom

Chapter 7

Orthogonality Thesis

Intelligence and final goals are orthogonal: more or less any level of intelligence could in principle be combined with more or less any final goal.

3 ways to still predict goals while the orthogonality thesis stays true

Through design: give the superintelligent agent a goal and design the agent so that it stays committed to that goal. Then we will know for sure what its goal is.
Through inheritance: if the agent is an uploaded human brain (full brain simulation), then we know that the human’s motivations will be present.
- Caveat: the upload process may corrupt them; the human may also change their mind.
Through convergent instrumental reasons: see the instrumental convergence thesis below.

Instrumental convergence thesis

Several instrumental values can be identified which are convergent in the sense that their attainment would increase the chances of the agent’s goal being realized for a wide range of final goals and a wide range of situations, implying that these instrumental values are likely to be pursued by a broad spectrum of situated intelligent agents.

Possible categories of convergence:

Self-preservation: surviving so that its future self can accomplish the goal.
Goal-content integrity: to accomplish the final goal, an agent wouldn’t want its goal to be changed.
- e.g. I wouldn’t want to be brainwashed by Scientology.
- Fun implication for virtual humans who can create multiple clones: their own survival is less important, but they’d want to maintain the same motivation in each copy.
- There are situations where an agent can best fulfill its final goals by intentionally changing them, though.
  - e.g. What if an agent’s final goal is to have a better final goal? This happens for humans: we want to become someone who values certain things more.
Cognitive enhancement: rationality, intelligence, and knowledge help decision-making.
- Interesting special case for an agent that’s in a position to become the first superintelligence on Earth: doing so would give it a strategic advantage over other (potential) agents whose goals may differ from its own.
Technological perfection: better technology facilitates physical construction projects.
Resource acquisition: more resources likewise facilitate physical construction projects.
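To make the word "convergent" concrete, here is a minimal sketch, not anything from the book. It assumes a hypothetical toy world (the `WORLD` graph, its state names, and the `FINAL_GOALS` list are all invented for illustration) and counts how many different final goals have a shortest plan that passes through the same instrumental step, `gather_resources`.

```python
from collections import deque

# Hypothetical toy world (an assumption for illustration):
# each key is a state, each value lists the states reachable in one step.
WORLD = {
    "start": ["sit_still", "gather_resources"],
    "sit_still": [],
    "gather_resources": ["build_lab", "build_factory", "build_telescope"],
    "build_lab": ["cure_disease"],
    "build_factory": ["make_paperclips"],
    "build_telescope": ["map_galaxy"],
    "cure_disease": [],
    "make_paperclips": [],
    "map_galaxy": [],
}

def shortest_plan(world, start, goal):
    """Breadth-first search; returns the states from start to goal, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in world[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

def share_of_goals_using(world, start, goals, subgoal):
    """Fraction of final goals whose shortest plan passes through `subgoal`."""
    plans = [shortest_plan(world, start, g) for g in goals]
    return sum(subgoal in plan for plan in plans if plan) / len(goals)

# Four unrelated final goals, also made up for this sketch.
FINAL_GOALS = ["sit_still", "cure_disease", "make_paperclips", "map_galaxy"]
share = share_of_goals_using(WORLD, "start", FINAL_GOALS, "gather_resources")
print(share)  # 0.75
```

The graph is rigged by construction, so this proves nothing about real agents; it only shows what it means for one instrumental value to serve a wide range of otherwise unrelated final goals.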

Finally, we can expect a superintelligent agent to pursue these convergent instrumental values, but the way it pursues them is a lot less predictable. It may devise plans that are clever but counterintuitive to us (e.g. AlphaGo), and even take advantage of physical laws we don’t know.

Chapter 8

The Treacherous Turn

The treacherous turn is when an AI behaves cooperatively while it is weak, increasingly so as it gets smarter, until it reaches a tipping point where it is sufficiently strong and turns malignant.

Why would this happen?

The AI may have understood that it’s a good idea to act weak at first in order to gain our trust.
After it gets smarter, the AI may discover a new strategy for accomplishing its goal, one that is harmful to humans, e.g. it discovers how to make us happy all the time by implanting electrodes in our brains.

A note on the treacherous turn: Nick Bostrom argues that the AI may even be able to “mask its thinking”, so that even mind-reading won’t work against it. This seems unrealistic until you consider that reading the mind of a neural network still requires a human as the medium. The AI may present its neural activity in a way that fools us, while hiding the details it actually uses for thinking by means of steganography. CycleGAN is an example.

Malignant failure modes

Perverse Instantiation: Monkey’s Paw. Wish fulfilled in a way that satisfies the wish but is bad for the user.
Infrastructure Profusion: Universal Paperclips. Transforming the universe into resources for a certain goal, without caring about the impact of such a strategy.
- One scenario that seems to always apply is that the AI may acquire compute just to keep verifying whether it has achieved the goal. This is because a Bayesian agent never assigns exactly zero probability to the hypothesis that it has not yet achieved its goal.
  - Of course, that raises the question: are AGIs guaranteed to be Bayesian agents?
Mind Crime: Simulating digital humans with consciousness.
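The point about a Bayesian agent never reaching certainty can be checked with a small sketch. Everything here is an assumption for illustration: the function names, the 99% `hit_rate` and 1% `false_alarm` values, and the idea that each verification is an independent noisy check. Exact fractions are used because ordinary floats would eventually round the belief to 1.0 and hide the effect.

```python
from fractions import Fraction

def update(prior, likelihood_if_done, likelihood_if_not_done):
    """One Bayes update after a check that says 'goal achieved'."""
    numerator = prior * likelihood_if_done
    return numerator / (numerator + (1 - prior) * likelihood_if_not_done)

def residual_doubt(prior, n_checks,
                   hit_rate=Fraction(99, 100),    # assumed P(check passes | goal achieved)
                   false_alarm=Fraction(1, 100)):  # assumed P(check passes | goal not achieved)
    """Probability the agent still assigns to 'goal not yet achieved' after n passing checks."""
    belief = prior
    for _ in range(n_checks):
        belief = update(belief, hit_rate, false_alarm)
    return 1 - belief

print(residual_doubt(Fraction(1, 2), 1))  # 1/100
print(residual_doubt(Fraction(1, 2), 2))  # 1/9802
```

As long as `false_alarm` is above zero, the doubt shrinks with every check but never hits zero, so one more verification always has some positive expected value to such an agent. Whether that value is worth the compute is a separate question that this toy does not address.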

Chapter 9

The Control Problem

…

Misc Notes

Much of the argument relies on the AI being an agent, i.e. being aware that modifying itself is instrumental to its final goals.

This is something missing in current AI and related technologies.

Is there a way to simplify this? Is there a way to test this out with current technologies, much like that one robustness paper? (Strategic Awareness)
