Sunday, February 8, 2015

Order of Growth - Doubling Hypothesis

When designing an algorithm that must process large data sets, performance is usually a big concern. One way to determine the running time (i.e. performance) is to simply run the program against various test cases measure the running time. For example:

long startTime = System.currentTimeMillis();

// run your algorithm

long endTime = System.currentTimeMillis();
double timeToExecute = (now - start) / 1000.0;

This works most of the time, but you don't always want to test your code against extremely large data sets. Instead, you'd rather run the code against a handful of smaller data sets and estimate how much the performance degrades as the size of the data set grows. This can save considerable time when you don't want to wait for your program to finish processing millions of records. Let's say we're trying to estimate the running time of our application when the number of records exceeds one million.

To estimate the running time for this large data set, we start with asking the question, "What is the effect on the running time of doubling the size of the data set?".

We can start by creating a list that contains four columns, like such, where N is the number of records to process.

Nsecondsratiolog_2 ratio
256???
512???
1024???
2048???
4096???
8192???
16384???
32768???


We have question marks for all of the other values for now. The next step is to run our algorithm against each sized data set in our table and add the number of seconds it takes to execute each data set to our table. As an example, let's say that our results look like this:

Nsecondsratiolog_2 ratio
2560.000??
5120.004??
10240.019??
20480.088??
40960.406??
81921.825??
163848.213??
3276836.959??

At this point, we've recorded the number of second that it takes to process each data set, where the size of the data set ranges from 256 items to 32768 items, doubling the size with each test (hence the name "doubling hypothesis").

The next step is to fill in the values in the ratio column. We'll start with the very last row. The value for the last row is the result of 36.959 / 8.213. The value for the row above it is the result of 8.213 / 1.825. We then continue until we've calculated all ratios. Note that there will not be a ratio for the top row (since there's no row above it to divide by). There will also not be a ratio for the second row, since we can't divide by zero.

After calculating the ratios, our table looks like this:

Nsecondsratiolog_2 ratio
2560.000-?
5120.004-?
10240.0194.75?
20480.0884.63?
40960.4064.61?
81921.8254.50?
163848.2134.50?
3276836.9594.50?

The last column is the log_2 ratio, which is shorthand for "log base 2 ratio". This is simply the log base 2 of the ratio value. For example, the value in the last row is the log base 2 of 4.50. If you're using the Windows calculator, you can use the natural logarithm, like this:

ln(4.50) / ln(2)

After calculating the log base 2 for each ratio value, we have the following table:

Nsecondsratiolog_2 ratio
2560.000--
5120.004--
10240.0194.752.25
20480.0884.632.21
40960.4064.612.20
81921.8254.502.17
163848.2134.502.17
3276836.9594.502.17

Notice that log_2 ratio seems to converge at 2.17. We can now plug this value into the following equation (read as "A times N to the power of b"), where "A" is a constant, "b" is the log_2 ratio, and N is the number of items we need to process. Note that T(N) is the time it takes to process N number of records.

T(N) = A * (N ^ b)

We already know that b is approximately 2.17. We can find the value of the constant A by solving the equation for one of our timings. You'll want to use one of the larger values of N. You can even try it for a few values of N. For example, if we try to solve for A using N = 32,768, we get:

A * (32,768 ^ 2.17) = 36.959
A = 0.000000005877516

So, if we wanted to know how long it would take to process 1,048,576 records, we would just calculate it as:

T(N) = 0.000000005877516 * (1,048,576 ^ 2.17)
T(N) = 68,217.474 seconds

This doubling hypothesis allows us to estimate the amount of time that our algorithm will take to process very large sets of data, without having to run the program and wait for it to complete. This can be very useful when trying to determine if we're on the right track or need to refine our algorithm to handle large inputs.