SQL meets Calculus - Sean D. Stuber

Yesterday, Connor McDonald posted an interesting article about a math puzzle he received asking him to calculate the integral of a function over a range. He posted two variations using the Trapezoidal Rule. In his post he mentions the Riemann sums method and I thought it would be interesting to explore that option as well.

I’ve refactored his queries to both have the same shape and made them functionally equivalent (i.e. they produce the same results given the same inputs.) In both of my versions I calculate the f values first and in the outer query I sum the results. The difference between the two, as in Connor’s examples, is the use of CONNECT BY vs a Recursive-WITH clause.

WITH
    FUNCTION f(x NUMBER)
        RETURN NUMBER
    IS
    BEGIN
        RETURN 3 * x * x + 2 * x;
    END;
    boundaries AS (SELECT 1 x_start, 5 x_end, 0.00001 delta FROM DUAL),
    subresults
    AS
        (    SELECT LEVEL seq, 
                    f(x_start + (LEVEL - 1) * delta) fnc, 
                    x_start + (LEVEL - 1) * delta inp
               FROM boundaries
         CONNECT BY LEVEL <= (x_end - x_start) / delta + 1)
SELECT
   SUM(fnc) * delta reimann_sum,
   SUM(CASE WHEN seq IN (1, (x_end - x_start) / delta + 1) 
         THEN 1 
         ELSE 2
        END * fnc) * delta / 2 trapezoidal_rule
  FROM subresults CROSS JOIN boundaries;

Or the same thing but using Recursive-WITH…

WITH
    FUNCTION f(x NUMBER)
        RETURN NUMBER
    IS
    BEGIN
        RETURN 3 * x * x + 2 * x;
    END;
    boundaries AS(SELECT 1 x_start, 5 x_end, 0.00001 delta FROM DUAL),
    subresults (seq, fnc, inp)
    AS
        (SELECT 1 seq, f(x_start) fnc, x_start inp FROM boundaries
         UNION ALL
         SELECT seq + 1, f(inp + delta), inp + delta
           FROM subresults, boundaries
          WHERE inp + delta <= x_end)
SELECT
   SUM(fnc) * delta reimann_sum,
   SUM(CASE WHEN seq IN (1, (x_end - x_start) / delta + 1) 
         THEN 1 
         ELSE 2
        END * fnc) * delta / 2 trapezoidal_rule
  FROM subresults CROSS JOIN boundaries;

As Connor noted, the trapezoidal rule did produce slightly more accurate results.

A step further…

I liked this topic and wanted to see if I could do more with it. Could I do the inverse? That is, could I calculate the derivative of a function?

From introductory calculus we know the function f(x) is differentiable at x if the limit $\displaystyle\lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$ exists. That limit will be the slope of the tangent to curve of f(x), that is, the derivative at x.

So, perhaps the most obvious approach to try to simulate the limit directly. We’ll calculate the slope at x for h in a series where h gets progressively smaller and we’ll loop until the precision gets so small that the resulting slope calculation starts to repeat because Oracle’s NUMBER data type can’t represent distinct values anymore. This will be as close to 0 as our limit can get.

WITH
    FUNCTION f(x NUMBER)
        RETURN NUMBER
    IS
    BEGIN
        RETURN x ** 3 + x ** 2;
    END;
    params AS(SELECT 5 x, 1 h FROM DUAL),
    limit_series (seq, x, slope, h)
    AS
        (SELECT 1 seq, x, (f(x+h) - f(x)) / h, h/10
           FROM params
         UNION ALL
         SELECT seq + 1, x, (f(x+h) - f(x)) / h, h/10
           FROM limit_series
          WHERE slope != (f(x + h) - f(x)) / h)
  SELECT *
    FROM limit_series
ORDER BY seq;

SEQ X SLOPE                          H
--- - ---------------------------    ----------------------
1   5 102                            0.1                                   
2   5 86.61                          0.01                                  
3   5 85.1601                        0.001                                 
4   5 85.016001                      0.0001                                
5   5 85.00160001                    0.00001                               
6   5 85.0001600001                  0.000001                              
7   5 85.000016000001                0.0000001                             
8   5 85.00000160000001              0.00000001                            
9   5 85.0000001600000001            0.000000001                           
10  5 85.000000016000000001          0.0000000001                          
11  5 85.00000000160000000001        0.00000000001                         
12  5 85.0000000001600000000001      0.000000000001                        
13  5 85.000000000016000000000001    0.0000000000001                       
14  5 85.0000000000016               0.00000000000001                      
15  5 85.00000000000016              0.000000000000001                     
16  5 85.000000000000016             0.0000000000000001                    
17  5 85.0000000000000016            0.00000000000000001                   
18  5 85.00000000000000016           0.000000000000000001                  
19  5 85.000000000000000016          0.0000000000000000001                 
20  5 85                             0.00000000000000000001

That turned out to be pretty easy and we can jump directly to the final answer with a small tweak to the final query sorting the values in reverse and taking the first row (seq=20)

  SELECT slope 
    FROM limit_series
ORDER BY seq DESC
FETCH FIRST 1 ROWS ONLY;

SLOPE
-----
85

And we can verify the results because the derivative of $\displaystyle x^3+x^2$ is $\displaystyle 3x^2+2x$ and therefore the derivative value at x=5 is $\displaystyle 3\cdot(5^2) + 2\cdot(5)$ or $75 + 10$ or $85$ .

Let’s try another function now cos(x) whose derivative is -sin(x).

WITH
    FUNCTION f(x NUMBER)
        RETURN NUMBER
    IS
    BEGIN
        RETURN cos(x);
    END;
    params AS(SELECT 5 x, 1 h FROM DUAL),
    limit_series (seq, x, slope, h)
    AS
        (SELECT 1 seq, x, (f(x+h) - f(x)) / h, h/10
           FROM params
         UNION ALL
         SELECT seq + 1, x, (f(x+h) - f(x)) / h, h/10
           FROM limit_series
          WHERE slope != (f(x + h) - f(x)) / h)
  SELECT slope 
    FROM limit_series
ORDER BY seq DESC
FETCH FIRST 1 ROWS ONLY;

SLOPE
-----
0

Oops, that’s not right! It should be something close to 0.9589. If we remove the ORDER BY and FETCH FIRST clauses we can see the series trend.

0.6765081011871397560790131264093670971281
0.943155572497542988539363814154243625703
0.95748999356370644806813409211376650739
0.9587822837615216602736007115817115509
0.958910089955670003198194216079624562
0.95892285633622909333678406976316905
0.9589241328318859165793975821436058
0.958924260480027597524728563518437
0.95892427324482752559495185125713
0.9589242745213073760017205615097
0.958924274648955359618394885793
0.95892427466172015796582229163
0.9589242746629966378004226305
0.95892427466312428578388118
0.95892427466313705058222794
0.9589242746631383270620536
0.958924274663138454709886
0.95892427466313846747505
0.9589242746631384687557
0.958924274663138469013
0.95892427466313846892
0.9589242746631384839
0.958924274663138464
0.95892427466313856
0.9589242746631501
0.9589242746631
0.9589242746636
0.9589242746726
0.958924274783
0.9589242752
0.9589242798
0.958924411
0.95892467
0.9589222
0.959018
0.95953
0.9642
0.929
1.54
0

We can see the same problem using the secant method, that is $\displaystyle\lim_{h \to 0} \frac{f(x+h) - f(x-h)}{2h}$

WITH
    FUNCTION f(x NUMBER)
        RETURN NUMBER
    IS
    BEGIN
        RETURN cos(x);
    END;
    params AS(SELECT 5 x, 1 h FROM DUAL),
    limit_series (seq, x, slope, h)
    AS
        (SELECT 1 seq, x, (f(x+h) - f(x-h)) / (2*h), h/10
           FROM params
         UNION ALL
         SELECT seq + 1, x, (f(x+h) - f(x-h)) / (2*h), h/10
           FROM limit_series
          WHERE slope != (f(x+h) - f(x-h)) / (2*h))
  SELECT slope 
    FROM limit_series;

0.80690695375698896759241024051033739345
0.957326866452025794131232044035308361106
0.95890829267180424921144573688737137974
0.95892411484243401607217485403390024155
0.9589242730649313452536938533630892785
0.95892427464715639764884867536399627
0.95892427466297864818071055773554835
0.958924274663136870686029967592652
0.958924274663138452911083161770375
0.95892427466313846873333369371355
0.9589242746631384688915561989995
0.958924274663138468893138423115
0.95892427466313846889315424525
0.958924274663138468893154465
0.958924274663138468893154635
0.95892427466313846889315095
0.9589242746631384688930085
0.958924274663138468893035
0.9589242746631384688973
0.9589242746631384689925
0.95892427466313846866
0.95892427466313848135
0.9589242746631384385
0.958924274663138065
0.9589242746631426
0.9589242746630745
0.95892427466335
0.9589242746676
0.9589242747075
0.95892427444
0.9589242722
0.958924335
0.958923905
0.95891715
0.958992
0.959265
0.96155
0.9535
1.28
0

In both the tangent and secant estimates we can see they fairly quickly approached the correct answer but then began dwindling away until the rounding errors and precision limits caused them to fluctuate and then become 0. This is a known problem with an arithmetic approach in computation due to the limits of the numerical data types.

As h approaches 0 rounding errors evaluating the functions begin to creep in. Furthermore we know that eventually x+h will no longer be distinguishable from x. For the Oracle NUMBER type, that limit is 10^-38. While the NUMBER type can represent values much smaller than that, it can’t reliably apply them to other values in arithmetic or function calls while maintaining mathematical integrity. Formally, this limit is known a machine epsilon. The name originating from the hardware limitations. The NUMBER type isn’t represented in the hosting hardware itself but the same principle still applies. There will always be some limit. Even if the NUMBER type, operators, and functions were able to hold and process a billion significant digits there would still be a limit. It would be a much smaller value but it would still exist.

Fortunately, if we know the epsilon value we can use it to calculate a viable minimum h. This will not only ameliorate the precision problems, it also means we don’t need to loop for repeating values. Once we know the target h-value, we can just calculate the slope at that point.

The target h is value is $\displaystyle x\sqrt{\epsilon}$ . This doesn’t work for x=0 and, if you have very large or very small values or many significant digits for x the epsilon root may be less useful as you will introduce a new rounding error in the multiplication; but this should work for most data ranges.

WITH
    FUNCTION f(x NUMBER)
        RETURN NUMBER
    IS
    BEGIN
        RETURN COS(x);
    END;
    params AS(SELECT 5 x, 
                     5 * SQRT(POWER(10, -38)) h
                FROM DUAL)
SELECT (f(x + h) - f(x)) / h
  FROM params;

SLOPE
-----------------------
0.958924274663138468809

Comparing that to the known derivative -sin(x), we can see the estimate is now quite close; only deviating after 19 digits.

SELECT -SIN(5) FROM DUAL;

SLOPE
------------------------------------------
0.9589242746631384688931544061559939733579

Going back to our initial polynomial function $\displaystyle x^3+x^2$ we can see this method is still valid.

WITH
    FUNCTION f(x NUMBER)
        RETURN NUMBER
    IS
    BEGIN
        RETURN x ** 3 + x ** 2;
    END;
    params AS(SELECT 5 x, 
                     5 * SQRT(POWER(10, -38)) h
                FROM DUAL)
SELECT (f(x + h) - f(x)) / h slope
  FROM params;

SLOPE
---------------------
85.000000000000000008

Earlier I mentioned using a secant as an alternate means of calculating the slope. You can use the same method here with the secant formula.

WITH
    FUNCTION f(x NUMBER)
        RETURN NUMBER
    IS
    BEGIN
        RETURN x ** 3 + x ** 2;
    END;
    params AS(SELECT 5 x, 
                     5 * SQRT(POWER(10, -38)) h
                FROM DUAL)
SELECT (f(x + h) - f(x-h)) / (2*h) slope
  FROM params;

SLOPE
-----
85

WITH
    FUNCTION f(x NUMBER)
        RETURN NUMBER
    IS
    BEGIN
        RETURN cos(x);
    END;
    params AS(SELECT 5 x, 
                     5 * SQRT(POWER(10, -38)) h
                FROM DUAL)
SELECT (f(x + h) - f(x-h)) / (2*h) slope
  FROM params;

SLOPE
------------------------
0.9589242746631384688821

And we get a little more precision using this method.

There are many other methods of calculating derivatives; hopefully these give you a little taste. I thought this was a fun exercise and I thank Connor for posting the original inspiration that got it started.

A step further…

Leave a ReplyCancel reply