
Fast * Many = Slow, or, the Problem of Fast Code

Several years ago I made a little math joke about a process that performed millions of look-ups and took forever to complete. That joke, a short one-liner (“Fast Times Many Equals Slow”), has since become a bit of a mantra.

The problem can be particularly pernicious because the offending code might be well designed, extensively tested, and extremely fast. All three qualities are desirable and, in some cases, indisputable for a given routine. The performance problem arises when this fast code is executed repeatedly, and it can sneak up on you. Maybe you write a small routine that executes in 14 microseconds and then you invoke it a thousand times. That means your code contributes a total of 0.014 seconds to your process. Without any other context, that sounds pretty good to me and would often be considered acceptable, maybe even excellent.

However, if the process is ramped up to a billion iterations, now that fast routine adds almost 4 hours to your processing time.

0.000014 seconds * 1,000,000,000 iterations = 14,000 seconds = 3:53:20

If 4 hours (plus whatever time it takes to perform the rest of the process) is considered slow… is the problem that your routine is slow or is the problem that you ran it many times?

While I’ve seen this problem repeatedly over the years, it’s been particularly prevalent lately across a variety of applications from multiple sources (i.e., it wasn’t just one developer implementing the same design over and over). I’m sure it’s merely coincidence that so many instances were revealed in a short period of time, but that doesn’t negate the underlying cause.

Iteration is a common programming technique and is often necessary. Sometimes you can mask it by hiding behind an API that performs the iteration for you, but doing so doesn’t remove the problem, it simply shoves it somewhere else. In some cases, though, pushing the work somewhere else might be the right solution. There is nothing inherently wrong with iteration itself; the problem is the choice of what is included within each cycle. A few weeks ago I wrote about the importance of timing instrumentation and how I used it to solve a performance problem.

The problem discussed in that article was caused by misplaced iteration. Program A would read information from an external source and pass it to Program B to parse and process the individual records retrieved from the remote service. The iteration issue occurred because A would only feed the data to B in small increments, so B still needed to iterate over each record. That part was unavoidable (at some point “somebody” needed to act on each record); the problem was that B already had iteration built in, and A simply added extra work by pre-iterating. Not only that, each cycle carried additional overhead: a new database call, additional network traffic, and additional response processing. Internal caching benefits within B were reduced because it was forced to start over with each new invocation.

It might have been possible to make the iteration in A and/or B a little faster, or to rewrite B to support multi-call caching when processing within a larger set, but the simplest and most beneficial change was to simply pass all of the data in one call from A to B. This completely eliminated the overhead of making distinct calls. The internal work of B didn’t change at all; it iterated the same as it always did. Admittedly, its internal caching did become more effective, but the reduction of the back-and-forth communication with A was the biggest savings. The entire process improved by about a factor of 10.
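
As a concrete illustration of the shape of that change, here is a minimal sketch in Python. The function names and structure are assumptions for illustration, not the actual code from that article; the point is only where the iteration lives and how often the per-call overhead gets paid.

```python
# Hypothetical stand-ins: "call_program_b" represents the database call Program A made,
# and each invocation carries fixed overhead (call setup, network round trip, cold caches).

def process_per_record(records, call_program_b):
    # Original design: Program A pre-iterates and pays the call overhead once per record.
    for record in records:
        call_program_b([record])

def process_in_one_call(records, call_program_b):
    # Revised design: one call total; Program B iterates internally, exactly as it always did,
    # and its internal caching stays warm across the whole set.
    call_program_b(records)
```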

It’s important to note this was not a scaling factor in the billions as in my earlier example. While such volumes are not uncommon, many processes have no need to cycle that many times. In this example the code was only iterating over tens of thousands of rows, a magnitude many people and processes would consider “normal.” It’s also important to note that no single cycle was slow. Program A was able to parse and split the data quickly; B was able to parse and process the records quickly.

As noted above, this is why iteration problems can be a subtle threat. Everything passed unit testing and seemed fast, because it was. It was only in the larger scale testing with full external volumes that the problem revealed itself.

So, what is one to do about it?

  1. Check to see if you are iterating needlessly. As seen in my parsing example, there was no need to add an extra layer of iteration. In the context of databases, are you iterating row by row when you could be performing set-based operations? The database engine may still need to iterate across a data set, but it’s highly optimized for doing so; adding row-by-row processing in your client is just another instance of layering iteration on top of iteration already in place. This is sometimes the hardest path to implement because it’s essentially a redesign, but it can also be the most effective because it completely eliminates some steps. (A short sketch follows this list.)
  2. Upon what are you iterating? If you are processing a year’s worth of data and looping through each month, do you perform duplicate work in each cycle? If so, can you move some of that work into a step of its own and only do it once? This is a good place to check for unnecessary “fast” steps such as key-value look-ups. Maybe for a given customer id you do a lookup for the customer name. It may be very fast with a nice, indexed, unique-key retrieval, but if you do it for every order, every day, of every month, it adds up, especially when you do all of that again for the next customer, and the next, and so on. If you must iterate, can you sort the data before starting to remove the need to process identical data repeatedly? If you sort orders by customer, you can look up the customer information once on the first order for that customer and reuse it until all of that customer’s data is exhausted. (See the second sketch after this list.)
  3. Is each cycle as fast as it can be? This may seem obvious, but it’s still worth a look. The same multiplicative factor that makes iteration painful can be used to your advantage as well. I worked with another DBA on a process iterating over hundreds of millions of rows. While it would have been nice to ask the vendor to redesign their product, that wasn’t really feasible as an immediate solution. Instead we shaved a few milliseconds off each iteration. One millisecond saved across one million executions is over 16 minutes removed from the total processing time. Multiply that by hundreds and we get DAYS of processing time improvement. Again, this was still not ideal, but it was appreciated by the end users waiting on the data.
  4. Are you including extra overhead in your iterations? Can you consolidate steps? This check is a special-case combination of the previous two. As an example, consider a process that reads a web service, passes the results to another process, and then takes the results of that and passes them to a third process to insert or update some repository – a fairly straightforward ETL process. If you watch the data in an ETL flow, do you “E”xtract one row, “T”ransform that row, and then “L”oad that row? Might it be faster to Extract all of the rows, or a large volume of them, then pass those to a transform process, which then passes them in bulk to a loader? You can eliminate the overhead of invoking each step. This overhead could take many forms, including SQL vs. PL/SQL context switches, network latency, file open/close operations, and more. Maybe instead of processing 1000 rows individually, you can process all 1000 together in each step, or if that’s too expensive then maybe 10 batches of 100, 4 batches of 250, 20 batches of 50… whatever the hardware can handle. (See the third sketch after this list.)
  5. Last option – can you parallelize? If individual iteration is required, can you do two or more at the same time? Parallelizing is the most resource-intensive plan of attack, but that doesn’t necessarily mean it’s inappropriate. If you can isolate data into independent chunks and have the hardware available to process them, then why not? The most appealing part of parallel processing is that it can be combined with any or all of the other solutions. The most important thing to remember when implementing parallelization is that it doesn’t actually fix anything. If your process is flawed, it might be possible to throw enough hardware (money) at the problem to reduce the wall-clock time to something acceptable, but doing so means your scaling is determined entirely by your wallet, and it’s usually not going to be linear. That is, doubling the hardware doesn’t mean you’ll double the processing speed. There are problem sets where parallelizing adds benefits beyond reducing processing time. If you have independent data, such as regional sales, it makes sense to process the East region separately from the West region. Not only can you do both sets of work at the same time, but you also get a bonus of resiliency: a failure in the East doesn’t necessarily impact the West processing. (See the last sketch after this list.)
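
The items above are easier to see in code. First, a minimal sketch of item 1 using Python’s standard sqlite3 module; the orders table and the discount logic are invented for illustration. Both functions make the same change, but the second hands the iteration to the database engine instead of stacking client-side iteration on top of it.

```python
import sqlite3

# Invented schema purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, total REAL);
    INSERT INTO orders (status, total) VALUES ('OPEN', 100.0), ('OPEN', 250.0), ('CLOSED', 80.0);
""")

def discount_row_by_row(conn):
    # Client-side iteration layered on top of the engine's own iteration: one statement per row.
    for (order_id,) in conn.execute("SELECT id FROM orders WHERE status = 'OPEN'").fetchall():
        conn.execute("UPDATE orders SET total = total * 0.9 WHERE id = ?", (order_id,))

def discount_set_based(conn):
    # One set-based statement; the engine still iterates internally, but it is optimized for it.
    conn.execute("UPDATE orders SET total = total * 0.9 WHERE status = 'OPEN'")

discount_set_based(conn)
```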
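
Next, a sketch of the sort-and-reuse idea from item 2. The lookup function, field names, and per-order handler are hypothetical; sorting by customer simply means the “fast” lookup runs once per customer instead of once per order.

```python
def lookup_customer_name(customer_id):
    # Stand-in for the fast, indexed, unique-key lookup described above.
    return f"Customer #{customer_id}"

def handle(order, customer_name):
    # Placeholder for the real per-order work.
    print(customer_name, order["amount"])

def process_orders(orders):
    # orders: list of dicts such as {"customer_id": 7, "amount": 19.99}
    orders.sort(key=lambda o: o["customer_id"])    # group each customer's orders together
    current_id, current_name = None, None
    for order in orders:
        if order["customer_id"] != current_id:
            current_id = order["customer_id"]
            current_name = lookup_customer_name(current_id)   # once per customer, not per order
        handle(order, current_name)

process_orders([
    {"customer_id": 2, "amount": 10.0},
    {"customer_id": 1, "amount": 5.0},
    {"customer_id": 2, "amount": 7.5},
])
```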
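
For item 4, a sketch of batching the load step, again with sqlite3 and an invented staging table. The batch size is arbitrary; the point is that the per-call overhead is paid once per batch rather than once per row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (id INTEGER, payload TEXT)")

def load_row_by_row(conn, rows):
    # One insert call per row: the overhead of each invocation is paid for every row.
    for row in rows:
        conn.execute("INSERT INTO staging (id, payload) VALUES (?, ?)", row)

def load_in_batches(conn, rows, batch_size=250):
    # Hand the loader a whole batch at a time; same rows, far fewer calls.
    for start in range(0, len(rows), batch_size):
        conn.executemany("INSERT INTO staging (id, payload) VALUES (?, ?)",
                         rows[start:start + batch_size])

load_in_batches(conn, [(i, f"row {i}") for i in range(1000)])
```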
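
Finally, a sketch of item 5 using Python’s concurrent.futures. The region names and the work done per region are made up; the structure shows the resiliency bonus mentioned above, since one region failing does not stop the others.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_region(region, orders):
    # Stand-in for the real regional processing.
    return region, sum(orders)

regions = {"East": [10.0, 20.0], "West": [5.0, 7.5, 2.5]}

with ThreadPoolExecutor(max_workers=2) as pool:
    futures = {pool.submit(process_region, name, data): name for name, data in regions.items()}
    for future in as_completed(futures):
        try:
            region, total = future.result()
            print(region, "processed, total =", total)
        except Exception as exc:
            # A failure in one region does not prevent the others from completing.
            print(futures[future], "failed:", exc)
```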

It might be more accurate to say “Fast * Many = A lot of time,” but “Fast * Many = Slow” is a better mnemonic. Plus, the phrase is used when there is a performance issue in some looping process, which is most commonly described as the process being “slow.” My point was not to suggest that all iteration is bad, but rather to consider the larger-scale impacts when it occurs and what, if anything, you can or should do about it.

This has been a bit of a rant, but hopefully I was able to include some useful information along the way.