Tuesday, April 10, 2012

What is NP and why is it so hard?

For some reason that now escapes me, I recently got into a discussion with a fellow developer about the Japanese puzzle game Nurikabe in which I mentioned that solving Nurikabe was in NP. (In fact, "does this puzzle have a solution" is NP-Complete.) I then realized this person had no idea what I was talking about.

(UPDATE: I think this excellent STL video is what started this whole mess. Blame Stephan!)

If you spend enough time in certain areas of the computer industry (notably cryptography, language/compiler design, or game theory) you will frequently hear phrases like "this problem is NP-hard", or "that's equivalent to solving the halting problem." Claiming that a problem is NP-hard or even unsolvable is a very important statement, but one that is often misunderstood even by people with a general idea of what those terms mean.

So, as is my habit recently, I'm taking to my blog to expand on that conversation; this is somewhat outside the usual subject matter, but hopefully as developers you will find it interesting anyway. (And hope that no one at work ever asks me how a neutron star forms.) Also, fair warning: math ahead.

Computability Theory and Computational Complexity

The concepts we'll be talking about here come from an area of theoretical mathematics called computation theory, which is basically the point where math turns into computer science. In particular, we're dealing with a branch of computation theory called computational complexity theory and, to a lesser degree, computability theory. These forms of mathematics deal with the processes and techniques that computers use to solve problems. They are not concerned so much with solving actual problems, as they are with formalizing the process by which we arrive at and verify solutions to problems in general. Many of the pioneers in programming and language design were actually mathematicians in this field: Alan Turing, Alonzo Church, and John von Neumann for example.

When we talk about a problem in this sense, we are making a distinction between a general problem, such as "sorting a list" or "factoring a number", and instances of the problem, such as "sort this list of integers from 1 to 100" or "find all of the factors of this 19 digit integer". Computational theory attempts to make statements that apply to all instances of a problem. Certain specific instances of a problem may be easier to solve, but computational theory is about making statements that apply to the problem in general. This lets us make educated decisions about problems even given the absolute worst case input.

Computation theory deals with a variety of problem types, but for our purposes we're going to focus on a particular subset called decision problems. These are simply problems that have a "yes" or "no" answer. Most problems we deal with in practice are function problems, which have more complex answers. In theory, any function problem can be rewritten as a decision problem, and solving one effectively solves the other, but decisions problems are easier to reason about. Thus, its common to focus discussion on just decision problems, with the understanding that the same reasoning could be applied to function problems when needed.

The first thing we can ask about a given problem is whether or not a computer can even solve the problem. This question is usually expressed in terms of a theoretical form of computer called a Turing machine that provides a mathematical model for how a computer works. We can then say that a problem is decidable if there is any algorithm that we can run on the model computer that always produces the correct answer. There are two things that are required for a problem to be decidable. First, whatever answer the computer produces must be the correct one. More importantly, however, the computer must produce an answer if one exists, and must fail if one does not. This is called halting, and a key aspect of computational theory.

One of the most famous problems in computer science is the halting problem, which basically asks "given an arbitrary computer program running on a Turing machine, will it ever halt?". As far as we know, that problem is not decidable: there is no single algorithm that we can run against all computer programs ever that will tell us, even knowing its input, if that program is guaranteed to "eventually" finish, or will run forever. The study of this question is what led to the development of the Turing machine in the first place; it has since been demonstrated that many other very hard problems can be transformed, or reduced, to something that is equivalent to the halting problem. If we were able to find a solution to this other hard problem, that same solution could be used to solve the halting problem. Since the halting problem cannot be solved, no such solution can possibly exist, and we must therefore assume that our new problem is also unsolvable. When this happens, we say that the new problem is "equivalent to solving the halting problem", which is developer shorthand for "theoretically impossible", and we must then look for ways to change the problem to something that can be decided. Typically we accomplish this by limiting the input, or imposing artificial limits on the program's run length (e.g. can only run for 3 minutes, can only loop 100 times, etc.)

Time Complexity And The Big-O

Once we have determined that a problem is (or is likely to be) decidable, the next question we want to ask is how long it will take to decide the answer. Before we get there, though, lets take a quick refresher through the concept of time complexity, which is a related concept that applies to algorithms instead of problems. You should be at least vaguely familiar with this part, which is expressed in terms of "big-O notation". The complexity of an algorithm (time, space, whatever) is given in terms of the size of the input to the algorithm -- the n. It doesn't try to calculate precise, real-world performance metrics for the algorithm; rather, it expresses the general behavior of the algorithm as the size of its input increases. This is a key distinction that is often lost on inexperience developers: an ideal hash table lookup has a time complexity well below that of an unsorted array search. But, if the hash function itself takes a long time to run, the array may run in faster in clock time, for small inputs, so it may be a better practical choice in some circumstances. However, as the input size grows, eventually the hash table will dramatically outperform the array, so as a general-use solution, the hash table would be considered better.

Time complexity is written in the form O(x), where x is some mathematical value in terms of n, the input size.  These are not exact values; if one algorithm takes twice as long as another algorithm for the same input size, it will have the same overall time complexity. (In other words, constant factors are dropped from the big-O notation, as in the long run they become insignificant compared to the size of n.)

There are dozens of standard complexity measurements, but for most purposes we're only concerned about a small number of them. The most relevant ones are:
  • Constant time, O(1). This algorithm takes the same amount of time to run, regardless of the input size. Given an ideal hash function, inserting or searching a hash table takes constant time.
  • Logarithmic time, O(log n). This algorithm's performance degrades very, very slowly as the input size increases. Most "binary" algorithms, such as inserting or searching a balanced binary tree, run in logarithmic time.
  • Linear time, O(n). This algorithm's performance is directly proportional to the input size. Finding an algorithm that runs in at most linear time is a common goal for solving problems. Many simple iterative operations, like searching an unsorted linked list or array, run in linear time.
  • Polynomial time, O(n ^ k). Performance for this algorithm degrades very, very quickly as the input size grows. Many brute-force or naieve sorts and searches (e.g. bubble sort) run in polynomial time.
  • Exponential time, O(2 ^ n). Performance for these algorithms is downright awful. The algorithms that "play" board games like chess are often exponential time, where n is the size of the board.
For now, we don't really need to know the meaning of all of those terms. When designing and tuning algorithms, we try for the fastest time (and space) complexity we can achieve. But for our purposes, we don't need to be that precise. Instead, lets just pick an arbitrary point as declare that as "good enough". For various reasons, that point is the polynomial time complexity. If an algorithm runs in polynomial time or less, we will consider that algorithm "fast enough". If it takes longer than polynomial time, say exponential time, then its "not fast".

Complexity Classes

When analyzing problems, this is exactly what we do. We organize the problems into complexity classes that are based on whether or not a problem can be solved in polynomial time. We can do this because, at the problem solving level, we aren't as concerned with how fast we can solve a given problem, but whether or not a computer is able to solve the problem in a reasonable human timescale at all. Certain exponential-time algorithms, on moderately large inputs, could take hundreds or thousands of years to execute, even with the combined computing power of the entire planet.

So, for the rest of this blog post, whenever you see "polynomial time" feel free to mentally substitute "will finish during my lifetime". Whether that means 10 seconds, 10 minutes, or 10 hours will depend on the exact input size, but as long as it doesn't become 10 million years, we're calling that acceptable.

P and NP are two of the fundamental complexity classes that include problems which can be solved in polynomial time. These are mathematical sets: we say a problem is "in" a complexity class if it meets all the requirements to be a member of the set. Problems that are in P can be solved by a Turing machine in polynomial (or less, e.g. constant, logarithmic, etc) time. Intuitively, we consider these "easy" problems, though that term is somewhat relative. Problems that are in NP can be solved by a non-deterministic Turning machine in polynomial time. This is a special type of theoretical computer model that can perform more than one possible action for any given state; real computer hardware is deterministic, in that it always takes the same action for the same set of inputs. One way to look at a non-deterministic Turning machine is that it is capable of making "really good guesses" about what action to take, whereas a standard machine would have to try all possible options to find the correct one.

The previous definitions are technically correct, but don't really explain why P and NP are so important. Fortunately, there are equivalent definitions that are much more relevant: If a problem is in P, that means we can solve that problem fairly quickly. If a problem is in NP, but not P, that means we cannot (in the general case) find a solution quickly, but we can verify the solution quickly. In addition, there are related complexity classes such as NP-Hard and NP-Complete. These relate to the fact that many problems can be rewritten, or reduced, such that they are equivalent to each other. NP-Hard problems are those problems that are at least as hard as every problem in NP. Not all NP-Hard problems are in NP; even verifying the solutions to those problems may be difficult. Those problems are NP-Hard, and also in NP, are called NP-Complete; intuitively, NP-Complete problems are the hardest problems that are in NP (e.g. all equally as difficult as the others.)

As an example, lets look at one very famous NP-Complete problem, the so-called "Travelling salesman problem". In this problem, we have a salesperson that is flying around the country giving presentations. We know all of the cities she needs to visit, and how far apart those cities are. We also know that she only wants to visit each city once, then return home. The decision problem is then stated as "given any distance D (say, in miles) is there any complete path that is shorter than D?". (This is an example of how decision problems and function problems are equal: to answer "yes" to that question would, in effect, require that we find at least one route that is shorter than D miles long.) To actually produce an answer to that question is a difficult problem; the best solution algorithm we have for this problem takes exponential time. However, once we have a solution to the problem, we can easily verify that it is correct: add up the lengths of all segments and compare with D.

Another extremely important case of an NP problem is integer factorization, which is at the heart of modern cryptography. Given a very large number p, the problem is to find an integer, other than p or 1, that is a factor of p. (The decision-problem version is a bit more complex, but the analysis is identical). The best case solutions to this problem currently run at just slightly better than exponential time ("sub-exponential" time, e.g. O(2 ^ sqrt(n)) or similar); in clock time, for even moderately sized inputs, this is very long. Factoring a 768-bit number took over 2000 hours of CPU clock time, and modern crypto keys are as much as 4096 bits long (something like 30 years of clock time). However, if you are given two numbers p and n, determining "is n a factor of p" is simple integer division. This is known as a "one-way algorithm", and is crucial to the security of asymmetric-key encryption.

(This is also a good explanation of how a non-deterministic Turing machine works. Since this problem is in NP, it must be solvable in polynomial time by a non-deterministic Turning machine; and indeed, a theoretical quantum computer algorithm, which is capable of non-deterministic computation, has been found that runs in polynomial time. Fortunately, at the moment, no practical machine exists that behaves this way.)

P May Or May Not Be NP

Which brings us to one of the most important unanswered questions in all of computer science: are the sets P and NP actually equal? Is P = NP? We know that all problems that are in P are, by definition, also in NP. We also know there are problems that are outside of NP - they cannot possibly be solved in polynomial time. The open question is, are there really problems that are verifiable in polynomial time, but not solvable in polynomial time?

There are plenty of problems that we think fit that criteria, two of which we just discussed. But its possible that we just don't know enough, and that there is an algorithm that factors integers in polynomial time, for example. And this is where things get kinda messy. Remember we mentioned, briefly, the concept of an NP-Complete problem: the "hardest" problems that are in NP. We've already talked about how all of these problems are roughly equivalent to each other; more precisely, we can transform or reduce any NP-Complete problem to any other NP-Complete problem in polynomial time. If we have two NP-Complete problems, Problem A and Problem B, we can always come up with a "fast" algorithm that says "Convert problem A into problem B",  and then solve problem B. This means, if we can solve any one of those problems quickly, we can solve all of them quickly, by simply saying "transform problem A into already-solved problem C, then apply the solution to problem C."

That means, if we can prove that even one NP-Complete problem is also in P, then we can prove that all of NP is also in P -- that there are no one-way algorithms. Most computer scientists are pretty sure that this is not true: that there are problems in NP that are not in P, because it also means a whole bunch of other things we don't think are true.


Math and computer science theory is something that I don't think many developers spend a lot of their time worrying about. (Similarly, I think too many college professors worry too much about theory and not enough about practical coding techniques, like design patterns.) But just like theory is meaningless if you never put it into practice, the practical solutions never arise of no one bothers to study and invent them in the first place. These kind of computer science theoretical exercises are important for software developers to understand, at least broadly, in order to truly understand what's going on "under the hood" of their practical applications.

Next time, though, back to real coding, with the first pattern in my Practical Patterns series, the Singleton.


Anonymous said...

Great post. Thanks!