
Timing is Everything, but Recording it is Even Better

I got back from speaking at Collaborate 18 a week ago. My topic was performance diagnostics.
The recurring theme of my talk was that developers should leave themselves some breadcrumbs to mark their trail.

The Oracle database has wonderful diagnostic tools like ASH/AWR, SQL Monitor, tracing, and the hierarchical profiler; but as good as these tools are, the best way to measure where your code spends its time should be your own logs. In particular, logs with timestamps.

It always blows my mind when a developer asks me to do initial triage on a system because they don’t know where the long-running processes are. Why don’t you have logs that capture the begin and end points of important processes? Even if you don’t have a reason to suspect a performance issue, a simple “Starting ABC”/“Ending ABC” set of bookends provides scope for other debug or error messages. Later, if you do need to check performance, having the start and end points of your process logged gives you a way to see how long it ran.

Coincidentally, the day I got back to the office after the conference, I was contacted about a process running inordinately slowly. Of course I asked where the slow part was; they couldn’t really say, but they took a guess at one particular procedure. As it happens, about 10 years ago I wrote a lot of the code behind that procedure. Naturally, I was interested to see if I had learned anything in the past 10 years that I could use to make it faster; but first, I simply wrapped the procedure with the app team’s own logging framework and asked them to run the application.

What we found was that the old procedure was actually running quite fast, averaging about 0.13 seconds per run with a worst case of about 0.82 seconds. The problem was that it was being called repeatedly inside an expensive loop, with each iteration feeding the procedure a small amount of data to process. The fix was to remove the iterative process and let the old procedure run on larger data sets. The total execution time dropped from minutes to seconds.
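
To illustrate the shape of the problem (a hypothetical sketch; process_batch, id_list_t, and work_items stand in for the real code, which I can’t share), the difference was essentially this:

-- Slow: calling the procedure once per row pays the per-call
-- overhead on every iteration.
BEGIN
    FOR r IN (SELECT id FROM work_items)
    LOOP
        process_batch(id_list_t(r.id));  -- tiny payload each time
    END LOOP;
END;

-- Fast: one call over the full set lets the procedure work the way
-- it was designed to.
DECLARE
    l_ids   id_list_t;
BEGIN
    SELECT id BULK COLLECT INTO l_ids FROM work_items;

    process_batch(l_ids);
END;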

While I had full DBA access to this system, I didn’t use an AWR report, 10046 tracing, SQL Monitor, a PL/SQL profiler, ASH data, or any sort of “Top SQL” scripts and tools to determine the problem. I didn’t need an extensive instrumentation system for this; all I needed was a set of simple timestamped begin and end messages to tell me where the application was spending its time. Of course, I could have added even more logging to get more granular timings, but the point is that it’s often not necessary. If you can track the progression of the work, you can determine where your application is spending its time without extended privileges to administrative views or tools.

You can then talk to your DBA about “why” a particular SQL statement or procedure is slow, and determine what to do about it. I’m not going to recommend any particular instrumentation framework or tool, because the specific choice isn’t as important as the methodology. It’s quite simple: create a procedure that writes timestamped messages to a table, preferably within an autonomous transaction. Then adding a mere three lines of code to each procedure or function will be sufficient for tracking most performance-related issues.

The most basic logging framework could be nothing more than this:

CREATE TABLE app_log
(
    logtime       TIMESTAMP WITH TIME ZONE,
    logmessage    VARCHAR2(4000)
);

CREATE OR REPLACE PROCEDURE write_log(p_message IN VARCHAR2)
IS
    -- Autonomous transaction: the log write commits on its own,
    -- without disturbing the caller's transaction.
    PRAGMA AUTONOMOUS_TRANSACTION;
BEGIN
    INSERT INTO app_log(logtime, logmessage)
         VALUES (SYSTIMESTAMP, p_message);

    COMMIT;
END;

Usage of the above can be as simple as begin and end messages. There will be two or more end messages: one for success and one for each failure path. If you have multiple exception clauses, each should have its own end message.

BEGIN
    write_log('Begin process ....');
    -- Do process ....
    write_log('End process ....');
EXCEPTION
    WHEN OTHERS THEN
        write_log('End process .... with error');
        RAISE;  -- handle or reraise the exception as appropriate
END;
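
Once those bookends exist, finding where the time goes is just a query. Here is a sketch against the app_log table above; with only a timestamp and a message, the gap between consecutive rows approximates each step’s duration (a fuller framework would carry a session or process key so begin/end pairs can be matched exactly):

-- Elapsed time between consecutive log rows, oldest first.
SELECT logtime,
       logmessage,
       logtime - LAG(logtime) OVER (ORDER BY logtime) AS elapsed
  FROM app_log
 ORDER BY logtime;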

You can, of course, expand on the above, and you would most likely want to include other information to help properly scope execution. For example, you might want to include the username, machine, session ID, or instance number, as sketched below. You can also include intermediate step messages, which can be used to produce more granular performance diagnostics. If you already have a logging/debug message framework, that’s great; simply extend its usage to include begin/end messages and you’ve turned your debugging into performance instrumentation.
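
As a sketch, an expanded version of the earlier table and procedure might capture session details automatically via SYS_CONTEXT (the columns here are illustrative; keep whatever scoping information your system needs):

CREATE TABLE app_log
(
    logtime       TIMESTAMP WITH TIME ZONE,
    username      VARCHAR2(128),
    machine       VARCHAR2(128),
    session_id    NUMBER,
    instance_no   NUMBER,
    logmessage    VARCHAR2(4000)
);

CREATE OR REPLACE PROCEDURE write_log(p_message IN VARCHAR2)
IS
    PRAGMA AUTONOMOUS_TRANSACTION;
BEGIN
    -- Session context is captured automatically so callers still
    -- pass nothing but the message text.
    INSERT INTO app_log(logtime, username, machine,
                        session_id, instance_no, logmessage)
         VALUES (SYSTIMESTAMP,
                 SYS_CONTEXT('USERENV', 'SESSION_USER'),
                 SYS_CONTEXT('USERENV', 'HOST'),
                 TO_NUMBER(SYS_CONTEXT('USERENV', 'SID')),
                 TO_NUMBER(SYS_CONTEXT('USERENV', 'INSTANCE')),
                 p_message);

    COMMIT;
END;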

It’s such a small amount of work for immense potential benefit. Unfortunately, performance instrumentation is often underutilized.

Also note that these logs can be defensive in nature. If your application is just one cog in a larger machine, perhaps a web service or a process inside a larger ETL system, having logs of when your portion of the work began and ended can help resolve finger-pointing as to where a problem lies.

A reliable set of performance metrics also makes for easier audits of SLA/QoS contracts, and it can be used internally to raise alerts when an SLA threshold is close to being breached.
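
As a sketch of such a check (assuming the begin/end message text shown earlier and a hypothetical 60-second SLA), a scheduled job could flag runs approaching the limit:

-- Pair each begin message with the next end message and flag any
-- run whose elapsed time exceeds 80% of a hypothetical 60-second SLA.
SELECT b.logtime AS started,
       MIN(e.logtime) - b.logtime AS elapsed
  FROM app_log b
  JOIN app_log e
    ON e.logtime > b.logtime
   AND e.logmessage LIKE 'End process%'
 WHERE b.logmessage LIKE 'Begin process%'
 GROUP BY b.logtime
HAVING MIN(e.logtime) - b.logtime > NUMTODSINTERVAL(48, 'SECOND');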

Also, as mentioned earlier, these logs don’t require extended privileges, so it is easier to investigate problems on systems where you have less access than you might elsewhere.

I hope this helps inspire you to track your own application’s execution with some sort of logging framework. Which one isn’t as important as consistently using it.