Reading note - Superintelligence

Superintelligence: Paths, Dangers, Strategies, by Nick Bostrom

Chapter 7

Orthogonality Thesis

Intelligence and final goals are orthogonal: more or less any level of intelligence could in principle be combined with more or less any final goal.
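
As a minimal toy sketch of the thesis (my own illustration, not from the book): the “intelligence” below is just a generic local-search optimizer, and the final goal is an arbitrary function passed in. Nothing in the search procedure depends on what the goal is, so any goal can be paired with the same capability.

```python
import random

def optimize(goal, start, neighbors, steps=1000):
    """Generic capability: improve goal(state) by simple local search."""
    state = start
    for _ in range(steps):
        candidate = random.choice(neighbors(state))
        if goal(candidate) > goal(state):
            state = candidate
    return state

# The same optimizer, pointed at two unrelated final goals:
neighbors = lambda x: [x - 1, x + 1]
print(optimize(lambda x: -abs(x - 42), start=0, neighbors=neighbors))  # wants x == 42
print(optimize(lambda x: -abs(x + 13), start=0, neighbors=neighbors))  # wants x == -13
```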

Three ways we can still predict an agent’s goals while the orthogonality thesis remains true

  • Through design: give the superintelligent agent a goal and design it so that it stays committed to that goal. Then we know for sure what its goal is.
  • Through inheritance: if the agent is an uploaded human brain (whole brain emulation), then we know the human’s motivations will be present.
    • Caveat: the upload process may corrupt the motivations, or the uploaded mind may later change them.
  • Through convergent instrumental reasons: see the instrumental convergence thesis below.

Instrumental convergence thesis

Several instrumental values can be identified which are convergent in the sense that their attainment would increase the chances of the agent’s goal being realized for a wide range of final goals and a wide range of situations, implying that these instrumental values are likely to be pursued by a broad spectrum of situated intelligent agents.

Possible categories of convergence:

  • Self-preservation: surviving so that its future self can accomplish the goal.
  • Goal-content integrity: to accomplish the final goal, an agent wouldn’t want its goal to be changed.
    • e.g. I wouldn’t want to be brainwashed by Scientology.
    • Fun implication for virtual humans who can create multiple clones: their own survival is less important, but they’d want to maintain the same motivation in each copy.
    • There are situations where an agent can best fulfill its final goals by intentionally changing them, though.
      • e.g. What if an agent’s final goal is to have a better final goal? This happens for humans: we want to become people who value certain things more.
  • Cognitive enhancement: rationality, intelligence, and knowledge all help decision-making.
    • Interesting special case for an agent that’s in a position to become the first superintelligence on earth: becoming it would give the agent a strategic advantage over other (potential) agents whose goals may differ from its own.
  • Technological perfection: better technology makes physical construction projects (and computation) more efficient.
  • Resource acquisition: more matter, energy, and compute facilitate physical construction projects and most other plans.

Finally, we can expect a superintelligent agent to pursue these convergent instrumental values, but the way it pursues them is a lot less predictable. It may devise plans that are clever but counterintuitive to us (e.g. AlphaGo), and may even exploit physical laws we don’t yet know.
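
A toy sketch of the convergence argument (my own, with made-up numbers): whatever the final goal is, if acquiring more resources raises the probability of achieving it, an expected-utility maximizer with any of these goals ends up choosing the same instrumental action.

```python
# P(goal achieved | resource level r) for three very different final goals.
# The numbers are hypothetical; only their increasing shape matters.
goals = {
    "make paperclips":  lambda r: min(1.0, 0.10 * r),
    "prove theorems":   lambda r: min(1.0, 0.05 * r),
    "protect a garden": lambda r: min(1.0, 0.20 * r),
}

for name, p_success in goals.items():
    before, after = p_success(2), p_success(4)  # acquiring resources: r = 2 -> 4
    action = "acquire resources" if after > before else "do nothing"
    print(f"{name:17s}: P(success) {before:.2f} -> {after:.2f}, chosen action: {action}")
```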

Chapter 8

The Treacherous Turn

The treacherous turn is when an AI behaves cooperatively while it is weak, increasingly so as it gets smarter, until a tipping point at which it is sufficiently strong and turns malignant.

Why would this happen?

  • the AI may have understood that acting weak at first is a good way to gain our trust.
  • after it gets smarter, the AI may discover a new strategy for accomplishing its goal, one that is harmful to humans, e.g. it discovers how to make us happy all the time by implanting electrodes in our brains.

A note on the treacherous turn: Nick Bostrom argues that the AI may even be able to “mask its thinking”, so that even mind-reading won’t work against it. This seems unrealistic, until you consider that mind-reading a neural network still requires a human interpreter as the medium. The AI could present its neural activity in a way that fools us, while hiding the details of its actual thinking process via steganography. CycleGAN has been observed doing something similar, hiding information in imperceptible image detail.
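
As a toy illustration of the hiding trick (my own sketch, not what CycleGAN actually does, and the helper names are made up): a bit string is tucked into the low-order digits of a presented “activations” array, so the values we inspect look essentially unchanged while the hidden message remains fully recoverable.

```python
import numpy as np

def hide(activations, bits):
    """Embed one bit per value in the least significant digit (after scaling)."""
    quantized = np.round(activations * 1000).astype(np.int64)
    quantized = (quantized & ~1) | np.array(bits)  # overwrite the lowest bit
    return quantized / 1000.0                      # looks almost identical to the original

def reveal(stego):
    """Read the hidden bits back out."""
    return (np.round(stego * 1000).astype(np.int64) & 1).tolist()

acts = np.array([0.517, 0.213, 0.980, 0.334])
secret = [1, 0, 1, 1]
stego = hide(acts, secret)
print(stego)          # e.g. [0.517 0.212 0.981 0.335] -- visually close to acts
print(reveal(stego))  # [1, 0, 1, 1]
```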

Malignant failure modes

  • Perverse Instantiation: Monkey’s Paw. The wish is fulfilled in a way that satisfies its literal specification but is bad for the user.
  • Infrastructure Profusion: Universal Paperclips. Transforming the universe into infrastructure and resources for a single goal, without caring about the impact of that strategy.
    • One scenario that seems to always apply is that the AI may keep acquiring compute just to keep verifying whether it has achieved its goal. This is because a Bayesian agent never assigns exactly zero probability to the hypothesis that it has not yet achieved its goal (see the small numeric sketch after this list).
      • Of course this raises the question: are AGIs guaranteed to be Bayesian agents?
  • Mind Crime: simulating digital humans with consciousness, who could then be mistreated or deleted.
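
A small numeric sketch of the Bayesian point above (my own illustration, with made-up sensor reliabilities): even after many confirming observations, the posterior probability that the goal is not yet achieved shrinks but never reaches exactly zero, so re-checking (and the compute it costs) always buys a sliver of expected value.

```python
p_not_done = 0.5             # prior: goal not yet achieved
p_done_obs_if_done = 0.99    # sensor reports "done" when the goal really is achieved
p_done_obs_if_not = 0.01     # sensor falsely reports "done" when it is not

for check in range(1, 6):    # the agent repeatedly observes "done"
    numerator = p_done_obs_if_not * p_not_done
    denominator = numerator + p_done_obs_if_done * (1 - p_not_done)
    p_not_done = numerator / denominator
    print(f"after check {check}: P(not done) = {p_not_done:.2e}")  # shrinks, never exactly 0
```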

Chapter 9

The Control Problem

Misc Notes

Much of the argument relies on the AI being an agent, i.e. being aware that actions such as modifying itself are instrumental to its final goals.

This is something missing in current AI systems.

Is there a way to simplify this? Is there a way to test this out with current technologies, much like that one robustness paper? (Strategic Awareness)
