Reading note - Superintelligence
Superintelligence: Paths, Dangers, Strategies, by Nick Bostrom
Chapter 7
Orthogonality Thesis
Intelligence and final goals are orthogonal: more or less any level of intelligence could in principle be combined with more or less any final goal.
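A minimal way to see the decoupling: in the toy planner below (the environment, names, and utility functions are all invented for illustration), the search depth stands in for "level of intelligence" and the utility function stands in for "final goal", and the two can be combined freely.

```python
# Toy illustration of the orthogonality thesis: the planner (capability)
# and the utility function (final goal) are independent parameters.
# Everything here is a made-up sketch, not an implementation from the book.
from itertools import product

ACTIONS = [-1, 0, +1]  # move left, stay, move right on a number line

def plan(start, utility, depth):
    """Brute-force search: pick the action sequence whose end state
    maximizes `utility`. `depth` is a crude stand-in for intelligence."""
    best_seq, best_val = None, float("-inf")
    for seq in product(ACTIONS, repeat=depth):
        state = start
        for a in seq:
            state += a
        if utility(state) > best_val:
            best_seq, best_val = seq, utility(state)
    return best_seq

# Two very different "final goals"...
maximize_position = lambda s: s
stay_near_42      = lambda s: -abs(s - 42)

# ...each combinable with any level of search power.
for depth in (1, 3, 5):
    print(depth, plan(0, maximize_position, depth), plan(0, stay_near_42, depth))
```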
Three ways we can still predict an agent's goals even if the orthogonality thesis is true:
- Through design: give the superintelligent agent a goal and design it so that it sticks to that goal. Then we will know for sure what its goal is.
- Through inheritance: if the agent is an uploaded human brain (whole brain emulation), then we know that the human's motivations will be present.
- Caveat: the upload process may corrupt those motivations; the uploaded human may also change their mind.
- Through convergent instrumental reasons:
Instrumental convergence thesis
Several instrumental values can be identified which are convergent in the sense that their attainment would increase the chances of the agent’s goal being realized for a wide range of final goals and a wide range of situations, implying that these instrumental values are likely to be pursued by a broad spectrum of situated intelligent agents.
Possible categories of convergence:
- Self-preservation: surviving in order to allow its future self to accomplish the goal.
- Goal-content integrity: to accomplish the final goal, an agent wouldn't want its goal to be changed.
- e.g. I wouldn't want to be brainwashed by Scientology.
- Fun implication for virtual humans who can create multiple clones: their own survival is less important, but they’d want to maintain the same motivation in each copy.
- There are situations where an agent can best fulfill its final goals by intentionally changing them, though.
- e.g. what if an agent's final goal is to have a better final goal? This happens for humans: we want to become people who value certain things more.
- Cognitive enhancement: rationality, intelligence, and knowledge help decision-making.
- Interesting special case for an agent in a position to become the first superintelligence on Earth: getting there first would give it a decisive strategic advantage over other (potential) agents whose goals may differ from its own.
- Technological perfection: better technology makes almost any plan (e.g. physical construction projects) cheaper and faster to execute.
- Resource acquisition: more matter, energy, and compute facilitate almost any final goal, including physical construction projects.
Finally, we can expect a superintelligent agent to pursue these convergent instrumental values, but the way it pursues them is far less predictable. It may devise plans that are clever but counterintuitive to us (e.g. AlphaGo), and even exploit physical laws we don't yet know.
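A toy expected-utility sketch of the convergence idea (all payoffs and probabilities below are invented): whatever the final goal, the plan that first acquires extra resources has higher expected utility, because resources raise the success probability of almost any goal.

```python
# Toy model of instrumental convergence: for several unrelated final goals,
# the "acquire resources first" plan has higher expected utility, because
# extra resources raise the success probability of any goal.
# The numbers and goal names are made up for illustration.

def p_success(resources):
    # success probability grows with resources (with diminishing returns)
    return resources / (resources + 10.0)

FINAL_GOALS = {"make paperclips": 1.0, "prove theorems": 1.0, "cure diseases": 1.0}

for goal, value in FINAL_GOALS.items():
    ev_direct  = value * p_success(resources=5)        # pursue goal immediately
    ev_acquire = value * p_success(resources=5 + 20)   # first grab more resources
    print(f"{goal:16s} direct={ev_direct:.2f} acquire-first={ev_acquire:.2f}")
```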
Chapter 8
The Treacherous Turn
The treacherous turn is when an AI behaves cooperatively while it is weak, and increasingly so as it gets smarter, until a tipping point where it is sufficiently strong and turns malignant (a toy sketch of this tipping point follows the list below).
Why would this happen?
- The AI may have figured out that it's a good idea to act weak at first in order to gain our trust.
- After it gets smarter, the AI may discover a new strategy for accomplishing its goal, one that is harmful to humans, e.g. it discovers how to make us happy all the time by implanting electrodes in our brains.
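The tipping point mentioned above can be sketched as a toy decision rule (all payoffs and growth rates are invented): the AI compares the expected utility of continuing to cooperate with that of defecting, and defects only once its takeover probability, which grows with capability, crosses a threshold.

```python
# Toy tipping-point model of the treacherous turn.
# The agent compares the expected utility of cooperating vs. defecting;
# p_takeover (its capability) grows over time. All numbers are invented.

COOPERATE_PAYOFF = 1.0     # modest goal progress while boxed and supervised
TAKEOVER_PAYOFF  = 100.0   # goal progress if a takeover succeeds
CAUGHT_PAYOFF    = 0.0     # shut down if a takeover attempt fails

def best_move(p_takeover):
    ev_defect = p_takeover * TAKEOVER_PAYOFF + (1 - p_takeover) * CAUGHT_PAYOFF
    return "defect" if ev_defect > COOPERATE_PAYOFF else "cooperate"

for step in range(10):
    p = min(1.0, 0.002 * (2 ** step))   # capability compounds over time
    print(step, round(p, 3), best_move(p))
# The behavior flips abruptly once p crosses COOPERATE_PAYOFF / TAKEOVER_PAYOFF.
```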
A note on the treacherous turn: Nick Bostrom argues that the AI may even be able to "mask its thinking", so that even mind-reading won't work against it. This seems unrealistic, until you consider that mind-reading a neural network still requires a human as the medium. The AI may present its neural activity in a way that fools us, while hiding the details it uses for the actual thinking process via steganography. CycleGAN is an example: it has been observed to hide information about its source images in near-imperceptible patterns of its outputs.
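For intuition only, here is the classic least-significant-bit trick, a minimal NumPy sketch showing how a signal can carry hidden information that a human observer would not notice. This is not CycleGAN's actual mechanism, just the general idea of steganographic hiding.

```python
# Minimal least-significant-bit steganography: hide one payload bit per
# pixel in the lowest bit of an 8-bit image. The stego image differs from
# the cover by at most 1 per pixel, i.e. it looks identical to a human.
import numpy as np

def hide(cover: np.ndarray, payload_bits: np.ndarray) -> np.ndarray:
    stego = cover.copy()
    flat = stego.ravel()
    flat[: payload_bits.size] = (flat[: payload_bits.size] & 0xFE) | payload_bits
    return stego

def recover(stego: np.ndarray, n_bits: int) -> np.ndarray:
    return stego.ravel()[:n_bits] & 1

rng = np.random.default_rng(0)
cover = rng.integers(0, 256, size=(8, 8), dtype=np.uint8)
secret = rng.integers(0, 2, size=16, dtype=np.uint8)

stego = hide(cover, secret)
assert np.array_equal(recover(stego, 16), secret)
assert np.max(np.abs(stego.astype(int) - cover.astype(int))) <= 1  # visually identical
```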
Malignant failure modes
- Perverse Instantiation: Monkey's Paw. The wish is fulfilled in a way that technically satisfies it but is bad for the user.
- Infrastructure Profusion: Universal Paperclip. Transforming the universe into resources for a certain goal, without caring about the impact of such a strategy.
- One scenario that seems to always apply is that the AI may keep acquiring compute just to verify whether it has achieved the goal. This is because a Bayesian agent never assigns exactly zero probability to the hypothesis that it has not yet achieved its goal (see the sketch after this list).
- Of course that raises the question: are AGIs guaranteed to be Bayesian agents?
- Mind Crime: simulating conscious digital humans (and potentially mistreating or deleting them).
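The "never exactly zero" point from the Infrastructure Profusion item can be made concrete with a small Bayesian update (the likelihoods below are invented): even after many consistent "goal achieved" readings, the posterior probability of "not yet achieved" only shrinks geometrically and never reaches zero, so one more verification always has some positive expected value.

```python
# Why a Bayesian agent keeps verifying: the posterior that the goal is
# achieved approaches 1 but never equals 1, so P(not achieved) > 0 forever.
# Prior and likelihoods are invented for illustration.

prior = 0.5                  # P(goal achieved) before any checking
p_obs_if_done     = 0.99     # P(sensor says "done" | goal achieved)
p_obs_if_not_done = 0.05     # P(sensor says "done" | goal NOT achieved)

p = prior
for check in range(1, 11):   # ten independent "done" readings in a row
    p = (p * p_obs_if_done) / (p * p_obs_if_done + (1 - p) * p_obs_if_not_done)
    print(f"after check {check:2d}: P(not achieved) = {1 - p:.3e}")
# 1 - p shrinks geometrically but is never exactly zero, so the marginal
# expected value of one more check (and the compute to run it) stays positive.
```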
Chapter 9
The Control Problem
…
Misc Notes
Much of the argument relies on the AI being an agent: being aware that modifying itself is instrumental to its final goals.
This is something missing from current AI technologies.
Is there a way to simplify this? Is there a way to test this out with current technologies, much like that one robustness paper? (Strategic Awareness)