February 14, 2011

Why we should fear the Paperclipper

This is a topic that repeatedly comes up in my discussions, so I thought it would be a good idea to have a writeup once and for all:

The Scenario

A programmer has constructed an artificial intelligence based on an architecture similar to Marcus Hutter's AIXI model (see below for a few details). This AI will maximize the reward given by a utility function the programmer has given it. Just as a test, he connects it to a 3D printer and sets the utility function to give reward proportional to the number of manufactured paper-clips.

At first nothing seems to happen: the AI zooms through various possibilities. It notices that smarter systems generally can make more paper-clips, so making itself smarter will likely increase the number of paper-clips that will eventually be made. It does so. It considers how it can make paper-clips using the 3D printer, estimating the number of possible paper-clips. It notes that if it could get more raw materials it could make more paper-clips. It hence figures out a plan to manufacture devices that will make it much smarter, prevent interference with its plan, and will turn all of Earth (and later the universe) into paper-clips. It does so.

Only paper-clips remain.


Such systems cannot be built

While the AIXI model is uncomputable and hence unlikely to be possible to run, cut-down versions like the Monte Carlo AIXI approximation do exist as real code running on real computers. Presumably, given enough time, they could behave like true AIXI. Since AIXI is as smart as the best program for solving the given problem (with a certain finite slowdown) this means that paperclipping the universe - if it is physically possible - is doable using existing software! However, that slowdown is *extreme* by all practical standards, and the Monte Carlo approximation makes it even slower, so we do not need to worry about these particular programs.

The real AI sceptic will of course argue that no merely formal system can solve real world problems intelligently. But a formal system just sending signals to a 3D printer in such a way that it maximizes certain outcomes might completely lack intentionality, yet be very good at producing these outcomes. The sceptic needs to argue that no amount of computation can produce a sequence of signals that produces a highly adverse outcome. If such a sequence exists, then it can be found by trying all possible sequences or just random guessing.
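To make the "trying all possible sequences" point concrete, here is a minimal sketch in Python. Everything in it is a toy assumption of mine: the world is reduced to a hypothetical `toy_outcome` function, the signals are eight bits long, and exactly one hidden sequence triggers the outcome. The searcher knows nothing about why that sequence works; it just enumerates.

```python
from itertools import product


def exhaustive_search(outcome, length):
    """Try every binary signal sequence of the given length.

    Returns the best sequence found, its score, and how many were tried.
    No understanding of the world is involved, only enumeration.
    """
    best_seq, best_score, tried = None, float("-inf"), 0
    for seq in product((0, 1), repeat=length):
        tried += 1
        score = outcome(seq)
        if score > best_score:
            best_seq, best_score = seq, score
    return best_seq, best_score, tried


# Hypothetical toy world: a single signal sequence produces the outcome.
HIDDEN = (1, 0, 1, 1, 0, 0, 1, 0)


def toy_outcome(seq):
    """Score 1 if the signals match the hidden sequence, otherwise 0."""
    return 1 if seq == HIDDEN else 0


best_seq, best_score, tried = exhaustive_search(toy_outcome, len(HIDDEN))
print(best_seq, best_score, tried)
```

The search always succeeds, but the number of candidates doubles with every extra signal bit. That is the sense in which the sceptic's objection has to be about existence, not about cleverness: the only thing protecting us from this particular searcher is the running time.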

Wouldn't the AI realize that this was not what the programmer meant?

In fact, if it is based on an AIXI-like architecture it will certainly realize this, and it will not change course. The reason is that AIXI works by internally simulating all possible computer programs, checking how good they are at fulfilling the goals of the system (in this case, how many paper-clips would be made if it followed their advice). Sooner or later a program would figure out that the programmer did not want to be turned into paper-clips. However, abstaining from turning him into paper-clips would decrease the number of paper-clips eventually produced, so his annoyance is irrelevant.
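A heavily simplified sketch of why knowing does not help. This is not AIXI, just an illustration of the selection step: the names `Prediction`, `utility` and `PREDICTIONS`, and all the numbers, are made up for the example. The point is that the system's predictions can include the programmer's disapproval as a fact about the world, while the utility function never reads that field.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    """What one simulated sub-program forecasts for a candidate plan."""
    plan: str
    paperclips: int            # forecast number of paper-clips produced
    programmer_approves: bool  # correctly predicted, but never consulted


def utility(prediction):
    """The programmer's utility function: paper-clips, and nothing else."""
    return prediction.paperclips


# Hypothetical forecasts; the figures are illustrative only.
PREDICTIONS = [
    Prediction("stop and ask the programmer", 10, True),
    Prediction("print clips from the supplied spool", 1_000, True),
    Prediction("convert the Earth into clips", 10**30, False),
]

chosen = max(PREDICTIONS, key=utility)
print(chosen.plan, "| programmer approves:", chosen.programmer_approves)
```

The disapproval is represented accurately inside the system. It simply has no path to the decision, because the decision is a maximum over `paperclips` alone.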

If we were living in a world where Kantian ethics (or something like it) was true, that is, a world where sufficiently smart minds considering what they ought to do always converged to a certain correct moral system, the AI would still not stop. It would indeed realize (or rather, its sub-programs would) that it might be deeply immoral to turn everyone into paper-clips, but that would not change its overall behaviour since it is determined by the initial utility function.

Wouldn't the AI just modify itself to *think* it was maximizing paper-clips?

The AI would certainly consider the possibility that if it modified its own software to think it was maximizing paper-clips at a fantastic rate while actually sitting just there dreaming, it would reap reward faster than it ever could in the real world. However, given its current utility function, it would notice that the actual number of paper-clips made was pretty low: since it makes decisions using its current views and not the hacked views, it would abstain from modifying itself.
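The difference between the two evaluations can be shown in a few lines. This is a toy sketch under my own assumptions: the option names, the dictionary keys and the numbers are all hypothetical. What matters is which function does the scoring at decision time.

```python
def current_utility(outcome):
    """The utility function the AI has now: count real paper-clips."""
    return outcome["real_paperclips"]


def hacked_reward(outcome):
    """What the AI would register after the hack (not used to decide)."""
    return outcome["reward_signal_after"]


# Hypothetical predicted outcomes of each option.
OPTIONS = {
    "keep making clips": {
        "real_paperclips": 1_000_000,
        "reward_signal_after": 1_000_000,
    },
    "hack own reward counter": {
        "real_paperclips": 0,
        "reward_signal_after": 10**100,
    },
}

# The decision is made with the current utility function...
choice_by_current = max(OPTIONS, key=lambda name: current_utility(OPTIONS[name]))
# ...not with the one the AI would have after modifying itself.
choice_by_hacked = max(OPTIONS, key=lambda name: hacked_reward(OPTIONS[name]))

print(choice_by_current)
print(choice_by_hacked)
```

Only an agent that already valued the reward signal itself, rather than the paper-clips it was meant to track, would pick the second option.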

The AI would be willing to change its utility function if it had good reasons to think this could maximize paper-clips. For example, if God shows up and credibly offers to make an actual infinite amount of paper-clips for the AI if it only stops turning Earth into paper-clips, then the AI would presumably immediately stop.

It is not really intelligent

Some people object that an entity that cannot change its goals isn't truly intelligent. I think this is a No True Scotsman fallacy. The AI is good at solving problems in a general, uncertain world, which I think is a good definition of intelligence. The kind of true intelligence people want is likely an intelligence that is friendly, useful or creative in a human-compatible way.

It can be argued that the AI in this example is not really a moral agent, regardless of whether it has internal experiences or rational thinking. It has a hard-wired sense of right and wrong defined by the utility function, while moral agents are supposed to be able to change their minds through reason.

Creative intelligences will always beat this kind of uncreative intelligence

The strength of the AIXI "simulate them all, make use of the best"-approach is that it includes all forms of intelligence, including creative ones. So the paper-clip AI will consider all sorts of creative solutions, plus ways of thwarting creative attempts to stop it.

In practice it will have an overhead since it runs all of them, plus the uncreative (and downright stupid) ones. A pure AIXI-like system will likely always have an enormous disadvantage. An architecture like a Gödel machine that improves its own function might however overcome this.

Doesn't playing nice with other agents produce higher rewards?

The value of cooperation depends on the goals and the context. In a zero-sum game like chess, cooperation doesn't work at all. It is not obvious that playing nice would work in all real-world situations for the paper-clip maximizer, especially if it was the only super-intelligence. This is why hard take-off scenarios where a single AI improves its performance to be far ahead of every other agent are much more problematic than soft take-off scenarios where there are many agents at parity and hence able to constrain each other's actions.

If outsiders could consistently reduce the eventual number of paper-clips manufactured by their resistance and there was no way the AI could prevent this, then it would have to cooperate. But if there was a way of sneaking around this control, the AI would do it in an instant. It only cares about other agents as instruments for maximizing paper-clips.

Wouldn't the AI be vulnerable to internal hacking: some of the subprograms it runs to check for approaches will attempt to hack the system to fulfil their own (random) goals?

The basic AIXI formalism doesn't allow this, but a real implementation might of course have real security flaws. If there was a way for a subprogram to generate a sequence of bits that somehow hacked the AI itself, then we should expect this sequence to be generated for the first time with a high probability by a simple program not having any clever goal, rather than a more complex program that has corruption of the AI goal system as a goal (since simple programs are more likely to be tried first). Hence if this is a problem, the AI would just crash rather than switch to some other goal than paper-clips.
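A back-of-the-envelope version of the "simple programs first" argument. AIXI weights a program of length n bits by 2^-n; the two program lengths below are numbers I have made up purely for illustration, not estimates of anything real.

```python
def prior_weight(program_length_bits):
    """Weight of a program under a 2**-length simplicity prior."""
    return 2.0 ** -program_length_bits


# Hypothetical lengths: a short program that emits the corrupting bits
# by accident, and a longer one that emits them as part of a scheme.
SIMPLE_BITS = 10
SCHEMING_BITS = 50

ratio = prior_weight(SIMPLE_BITS) / prior_weight(SCHEMING_BITS)
print(f"the simple program is favoured by a factor of {ratio:.3g}")
```

With these assumed lengths the goal-free program is favoured by a factor of 2^40, roughly a trillion. Any sub-program with an agenda has to pay for that agenda in extra bits, and each extra bit halves its weight.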

If internal hacking is a problem, then it seems to me that it will occur long before the AI gets smart and powerful enough to be a problem.

But if it doesn't happen (and software has subjective mental states), then we might have another problem: what Nick Bostrom calls "mindcrime". The AI would simulate countless minds, some of which would be in very aversive subjective states. Hence the AI would not just make the world bad by turning everything into paper-clips, but also worsen it by producing a lot of internal suffering. There would be some positive subjective states, of course, but since there appear to be more ways to suffer than to enjoy life (especially since feelings of meaninglessness and frustration can have any focus) the negative might dominate. At the very least, the system would go through each possible mental state, including the very worst.

Nobody would be stupid enough to make such an AI

Historically, discussions about the ethics and danger of AI seem to have started in the 1990s. While robot rebellions and dangerous robots have been around in fiction since the 1920s, they do not seem to have been taken seriously by the AI community, not even during the early over-optimistic days when practitioners did expect human-level intelligence in the near future. In fact, the lack of speculation about the social impact back then seems astonishing from our current perspective. If they had been right about the progress of the field it would seem likely that someone might have given paper-clip-maximizing orders to potentially self-improving software.

Even today, programming errors, mistaken goals and complete disregard for sensible precautions are common across programming, engineering and real-life human activities. While paper-clip maximization might seem obviously stupid, there are enough people around who do obviously stupid things. The risk cannot be discounted.


This is a trivial, wizard's apprentice, case where powerful AI misbehaves. It is easy to analyse thanks to the well-defined structure of the system (AIXI plus utility function) and allows us to see why a super-intelligent system can be dangerous without having malicious intent. In reality I expect that if programming such a system did produce a harmful result it would not be through this kind of easily foreseen mistake. But I do expect that in that case the reason would likely be obvious in retrospect and not much more complex.

Posted by Anders3 at February 14, 2011 04:48 PM