Solving Performance Problems with a Touch of Magic

When I’m tackling a complex performance problem spanning multiple programs, machines, services, etc. I like to wave a magic wand, invoke mystic elves, or call on friendly gnomes to assist me and simply make part of the problem disappear.

That particular step now takes zero time, thereby removing any impact it had on the overall process performance. Voila… problem solved! Right? Problem is solved?

No, of course not. There is no such magic. But we can pretend and see what would be the result if it were true.

The critical feature enabling this technique is to have timings of the various steps. If you don’t have timings, you don’t know what value to magically reduce to zero. And, if you don’t have timings, even if you could magically reduce a step to zero time, you’d have no way of knowing for sure whence the improvement came (although you might be able to take a reasonable guess.)

Why pretend to use “magic”?

If you have the timing framework and magic doesn’t really work, then what’s the point of pretending? The technique is used to drive focus on real problems vs imaginary ones; or even if there are multiple real problems, then to prioritize them.

For example, I was recently presented with an application running a multi-hour batch process. From the database-side, I could see they were inserting and updating data pretty frequently but in tiny bursts.

One of the best things about an Oracle database is the immense amount of instrumentation built into it. I was able to measure how much time the db spent working and how much time it spent waiting on the application.

After two hours, four minutes, and 23 seconds of the process running. The database had spent 7451 seconds (2:04:11) waiting for its next instruction and 12 seconds actually working.

This is where I waved my magic wand.

You have a two-hour process. If I magically make the database run in zero-time. The end result will be you still have a two-hour process. I didn’t claim the sql was perfect or that the database parameters or object definitions couldn’t possibly be improved in some way, but even if we did focus on the database as the problem, and we improved it to such a magically impossible degree such that it took zero-time, that cost and effort would have a negligible impact on the process’s overall performance.

Thus, we need to look elsewhere for a solution that produces meaningful improvement.

All of this might seem obvious, and in this extreme example, it is. But, even in this extreme – I needed to provide the numbers, the method in which I came up with them, and the explanation of what they meant in order to make the magic wand useful.

Before I gathered my numbers I could see in V$SESSION that they were often idle. But I could also see they were running lots of different statements and were very busy. So, while I had a reasonable belief that something was happening on the client side; it wasn’t until I gathered the wait timings and did the math that I found it was as extreme as it turned out to be (99.8% idle.)

Below is a version of the query I used while their session was running to get my numbers.

  SELECT x.*,
         ROUND(seconds_waiting_on_client / (seconds_waiting_on_client + seconds_everything_else) * 100, 3) percent_waiting
    FROM (  SELECT sid,
                   SUM(CASE WHEN event = 'SQL*Net message from client' THEN time_waited_micro END) / 1000000
                       seconds_waiting_on_client,
                   SUM(CASE WHEN event != 'SQL*Net message from client' THEN time_waited_micro END) / 1000000
                       seconds_everything_else
              FROM v$session_event
             WHERE sid IN (SELECT sid
                             FROM v$session  -- change this where clause as needed
                            WHERE username = 'SDS')
          GROUP BY sid) x
ORDER BY sid;

There are three important caveats about using this query.

The session needed to still be running.
The particular wait event I focused on (SQL*Net message from client) assumed there is no human interaction, which was the case for this process.
It would still be possible to use this event if there were, but it would be a less reliable measure of the system itself. Since this was a continuously operational batch process, the wait event did reflect the database waiting on the client application to tell it to do something, without artificial delays waiting for a user to click or type something.
The “everything else” is actually generous to the client’s number because “everything else” includes wait events inside the database as well as a few other client waits like messages to the client and more data from and to the client not just the db work-time.

So – before you use this query yourself to try to claim “it’s not the database”, make sure you understand the client application’s operation. Especially if there is human interaction involved.

Also note, if the “everything else” portion had been larger, it might have been necessary to break that into more distinct buckets to capture where that time was being consumed. Was it in the other client-traffic events? Was it in IO? Was it in contention with other sessions on the db? How much was actual churning on CPU? The magic wand technique worked particularly well in this example because it was such an extreme even with “everything else” being lumped together.

It’s also important to note that this could be reversed and the magic wand still works. If the client wait time was small and the work time was large (even after breaking it into more granular buckets) then we could magically have the client disappear to show it as a non-factor. Similarly, if we had to divide the “everything else” bucket into more specific categories, we might find we want to magically remove the CPU time and focus on IO waits, or magically remove those and focus on contention waits. The technique can still work, regardless of where the performance issues lie.

Finally, using magic can also be useful as a means of deciding when to stop spending time, effort, and money trying to improve a process. If you have an hour-long process that invokes a service that you have no control over, and you spend 59 minutes waiting for the service, you can magically wave all of your time away and you’ll probably still call it a one-hour process. You might be able to pressure your service provider to make some improvements; but no amount of resources expended on your side is going to significantly help the situation – even if you use magic to fix your side.

I’ll note again, that in many cases, the problem might be obvious and you might not need this technique, but sometimes I find it helpful in describing how to tackle a problem. It can add emphasis to help make your point that even using magic won’t help when it’s applied to the wrong areas.

I hope you find a use for a little magic in your day to help solve some problems. Questions and comments, as always, are welcome.

Why pretend to use “magic”?

This is where I waved my magic wand.

Leave a ReplyCancel reply