Reading a PDF blob with PL/SQL and SQL

This is another article inspired by a recent request. That is, given a PDF file with tabular data, could I extract the data into a more human-usable format?
Yes, it is possible but it requires multiple steps and does come with some caveats.

The first step is converting the PDF into a format that is more easily processed with SQL and PL/SQL. To that end, Oracle Text includes functionality to parse a PDF BLOB into an XHTML CLOB.

With a BLOB containing a PDF with contents that look like this:

We can convert that to text of this form:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
<meta name="SOURCE MIME TYPE" content="application/pdf" />
<meta name="creation date" content="D:20180415223633-04'00'"></meta>
<meta name="revision date" content="D:20180415223633-04'00'"></meta>
<meta name="Creator" content="
Microsoft® Excel® 2010"></meta>
<meta name="publisher" content="Microsoft® Excel® 2010"></meta>
<style type="text/css">
div.c {
 white-space:pre;
 position:absolute;
}
</style>
</head>
<body>

...

XHTML can’t be parsed directly using native functions; but simply removing the !DOCTYPE declaration is sufficient to turn the contents into usable XML.
Within the BODY of the document, the PDF text will be stored in DIV tags such as these:

<div class="c" style="top:111px;left:70px;font-size:15px;font-family:Calibri, serif;">Pocket </div>
<div class="c" style="top:111px;left:195px;font-size:15px;font-family:Calibri, serif;">3 </div>
<div class="c" style="top:111px;left:239px;font-size:15px;font-family:Calibri, serif;">$1.68 </div>
<div class="c" style="top:111px;left:320px;font-size:15px;font-family:Calibri, serif;">$5.04</div>
<div class="c" style="top:131px;left:70px;font-size:15px;font-family:Calibri, serif;">Wocket </div>
<div class="c" style="top:131px;left:195px;font-size:15px;font-family:Calibri, serif;">5 </div>
<div class="c" style="top:131px;left:239px;font-size:15px;font-family:Calibri, serif;">$2.00 </div>
<div class="c" style="top:131px;left:313px;font-size:15px;font-family:Calibri, serif;">$10.00</div>
<div class="c" style="top:150px;left:70px;font-size:15px;font-family:Calibri, serif;">Widget </div>
<div class="c" style="top:150px;left:188px;font-size:15px;font-family:Calibri, serif;">13 </div>
<div class="c" style="top:150px;left:213px;font-size:15px;font-family:Calibri, serif;">$2,342.23 </div>
<div class="c" style="top:150px;left:287px;font-size:15px;font-family:Calibri, serif;">$30,448.99</div>

However, there is no concept of a “table” associated with these nodes. To create rows and columns the style attribute must be parsed based on the top and left pixel positions. Data that is in the same “row” will usually be written to the same top location. Columns are little more difficult because numeric data will typically be right-justified; but the text is stored within the pdf based on it’s left-most position. To construct more traditional columns you must count values within a row. The 1st position within a row, regardless of the pixel location is the 1st column, 2nd position is the 2nd column and so on.

Thus, in the example above, column one is “Pocket”, “Wocket”, and “Widget” all at position 70. Column two is the number 3 at position 195, number 5 also at position 195 and then number 13 at position 188. The same process is repeated for the Price and Total columns. Pivoting by the column counters can turn the sequential DIV values into traditional rows and columns, producing results like this:

A                    B                    C                    D                   
-------------------- -------------------- -------------------- --------------------
Sample Table                                                                       
Item                 Qty                  Price                Total               
Pocket               3                    $1.68                $5.04               
Wocket               5                    $2.00                $10.00              
Widget               13                   $2,342.23            $30,448.99

Note, the table header became left-justified as a result of the parsing technique because there was only one “column” in that row. Similar issues will result if there are blank (null) values within a table, since there will be no place holder to count. So, this technique is not guaranteed to produce immediately useful results. The output may require manual adjustment if data is out of sequence.

I have loaded a few blobs into a table, simply named “my_blobs” and extract the contents with queries of this form, extending the PIVOT clause to add columns as needed. The PDF referenced below is the same one linked to above.

  SELECT a,
         b,
         c,
         d
    FROM (SELECT rown, DENSE_RANK() OVER(PARTITION BY rown ORDER BY coln) coln, val
            FROM (SELECT TO_NUMBER(SUBSTR(REGEXP_SUBSTR(style, 'top:[0-9]+'), 5)) rown,
                         TO_NUMBER(SUBSTR(REGEXP_SUBSTR(style, 'left:[0-9]+'), 6)) coln,
                         val
                    FROM XMLTABLE(xmlnamespaces(DEFAULT 'http://www.w3.org/1999/xhtml'), '/html/body/div'
                                  PASSING (SELECT pdf_to_xml(content)
                                             FROM my_blobs
                                            WHERE name = 'SampleTable.pdf')
                                  COLUMNS style VARCHAR2(100) PATH '@style', val VARCHAR2(20) PATH 'text()') x))
         PIVOT (MAX(val) FOR coln IN (1 a, 2 b, 3 c, 4 d))
ORDER BY rown;

The same query logic also works with larger PDFs too. Here’s one from the publicly available Statistical Annex of the European Economy.

SELECT a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z
    FROM (SELECT rown, DENSE_RANK() OVER(PARTITION BY rown ORDER BY coln) coln, val
            FROM (SELECT TO_NUMBER(SUBSTR(REGEXP_SUBSTR(style, 'top:[0-9]+'), 5)) rown,
                         TO_NUMBER(SUBSTR(REGEXP_SUBSTR(style, 'left:[0-9]+'), 6)) coln,
                         TRIM(val) val
                    FROM XMLTABLE(
                             xmlnamespaces(DEFAULT 'http://www.w3.org/1999/xhtml'), '/html/body/div'
                             PASSING (SELECT pdf_to_xml(content)
                                        FROM my_blobs
                                       WHERE name = 'https://ec.europa.eu/info/sites/info/files/saee_autumn_2017_en.pdf')
                             COLUMNS style VARCHAR2(100) PATH '@style', val VARCHAR2(10) PATH 'text()') x))
         PIVOT (MAX(val)
               FOR coln
               IN (1 a,2 b,3 c,4 d,5 e,6 f,7 g,8 h,9 i,10 j,11 k,12 l,13 m,14 n,15 o,16 p,17 q,18 r,19 s,20 t,21 u,22 v,23 w,24 x,25 y,26 z))
ORDER BY rown

Scrolling through the we can pick out individual tables such as this one listing Population and Labor market numbers for various countries by year. Note this also provides an example of the problem with missing values in a table. 1991 has a summary row in addition to a row of the individual country values. The values on the right side of the table are parsed into the left by the column counting.

1960       9153       55433      1221       2835       8332       30263      46636      50200      573        2121       2779       332        333        11487      7047       8858       1585       4068       4430       247685     235006                                                
1965       9464       58619      1305       2877       8550       31962      49825      52112      591        2266       2971       350        323        12295      7271       8999       1649       4371       4564       260363     246887                                                
1970       9656       60651      1371       2951       8793       33761      51939      53822      614        2359       3140       358        327        13032      7467       8680       1725       4538       4606       269790     255715                                                
1975       9801       61829      1441       3178       9047       35758      53926      55441      502        2456       3302       365        309        13660      7579       9093       1794       4739       4711       278930     264388                                                
1980       9859       61566      1489       3402       9643       37491      55150      56434      509        2512       3413       364        321        14148      7549       9766       1901       4980       4780       285277     270152                                                
1981       9858       61682      1500       3444       9729       37759      55462      56502      515        2519       3433       365        324        14247      7569       9851       1907       5016       4800       286481     271268                                                
1982       9856       61638      1510       3481       9790       37986      55797      56544      521        2531       3457       366        331        14312      7574       9912       1910       5055       4827       287397     272081                                                
1983       9854       61423      1521       3506       9847       38172      56096      56564      528        2546       3485       366        335        14368      7562       9958       1922       5092       4856       288000     272571                                                
1984       9855       61175      1531       3530       9896       38330      56369      56577      535        2562       3514       366        335        14423      7561       9996       1932       5127       4882       288497     272961                                                
1985       9857       61024      1541       3541       9934       38470      56649      56593      542        2579       3545       367        341        14488      7565       10024      1942       5162       4902       289064     273413                                                
1986       9859       61066      1552       3542       9967       38585      56937      56596      548        2600       3579       368        347        14567      7570       10033      1966       5194       4918       289794     274008                                                
1987       9870       61077      1565       3543       10001      38685      57244      56602      554        2627       3616       371        350        14664      7575       10030      1990       5223       4932       290516     274593                                                
1988       9904       61449      1574       3531       10037      38767      57572      56629      560        2653       3655       374        353        14760      7585       10020      1995       5251       4947       291615     275574                                                
1989       9940       62063      1581       3510       10089      38828      57912      56672      568        2667       3684       378        356        14846      7620       10005      1996       5276       4964       292954     276826                                                
1990       9968       63253      1582       3506       10197      38867      58227      56719      580        2663       3698       382        360        14947      7678       9983       1998       5299       4986       294894     278714                                                
1991       640                   280354                                                                                                                                                                                                                                                      
1991       10006      79973      1574       3526       10320      38966      58520      56759      595        2651       3704       387        364        15068      7755       9960       1999       5303       5014       312444     296253                                                
1992       10047      80500      1545       3549       10399      39158      58811      56797      611        2614       3700       392        367        15182      7841       9952       1996       5305       5042       313811     297671                                                
1993       10086      80946      1506       3563       10460      39361      59066      56832      626        2563       3683       398        371        15290      7906       9965       1992       5325       5066       315006     298940                                                
1994       10116      81147      1474       3571       10513      39549      59286      56843      639        2521       3657       404        375        15381      7936       9992       1989       5347       5088       315828     299826                                                
1995       10137      81308      1448       3601       10562      39719      59501      56844      651        2485       3629       410        378        15460      7948       10026      1989       5363       5108       316567     300624                                                
1996       10157      81466      1425       3626       10609      39884      59713      56860      661        2457       3602       414        380        15526      7959       10064      1990       5374       5125       317292     301403                                                
1997       10181      81510      1406       3661       10661      40050      59926      56890      671        2433       3575       420        383        15608      7968       10109      1986       5383       5140       317962     302124                                                
1998       10203      81446      1393       3714       10721      40214      60147      56907      679        2410       3549       425        385        15703      7977       10160      1982       5391       5154       318560     302770                                                
1999       10226      81422      1379       3755       10762      40370      60457      56916      687        2391       3524       431        387        15809      7992       10218      1984       5396       5166       319272     303524                                                
2000       10251      81457      1401       3804       10806      40554      60872      56942      694        2368       3500       437        390        15922      8012       10290      1989       5401       5176       320266     304523                                                
2001       10287      81517      1393       3864       10862      40766      61317      56980      702        2338       3471       442        393        16043      8042       10363      1992       5380       5188       321338     305670                                                
2002       10333      81578      1384       3932       10902      41424      61764      57100      710        2310       3443       447        396        16147      8082       10420      1995       5379       5201       322944     307328                                                
2003       10376      81549      1375       3997       10928      42196      62202      57413      718        2288       3415       452        399        16223      8118       10459      1996       5379       5213       324697     309126                                                
2004       10421      81456      1366       4067       10955      42859      62661      57845      728        2263       3377       459        401        16276      8169       10484      1997       5382       5228       326396     310880                                                
2005       10479      81337      1359       4160       10987      43663      63133      58191      739        2239       3323       466        404        16317      8225       10503      2001       5387       5246       328157     312707                                                
2006       10548      81173      1351       4270       11020      44361      63574      58428      751        2219       3270       473        405        16341      8268       10522      2008       5391       5266       329639     314245                                                
2007       10626      80992      1343       4400       11049      45236      63967      58787      767        2201       3231       481        407        16378      8295       10543      2019       5397       5289       331407     316042                                                
2008       10710      80764      1338       4496       11078      45983      64324      59242      787        2178       3198       489        409        16440      8322       10558      2022       5406       5313       333057     317719                                                
2009       10797      80483      1336       4539       11107      46368      64655      59578      808        2142       3163       498        412        16526      8341       10568      2042       5418       5339       334120     318800                                                
2010       10896      80284      1333       4560       11121      46562      64974      59830      829        2097       3097       508        414        16612      8361       10573      2049       5430       5363       334894     319643                                                
2011       10994      80275      1330       4577       11105      46736      65294      60060      851        2059       3028       519        416        16693      8389       10558      2053       5398       5388       335722     320588                                                
2012       11068      80426      1325       4590       11045      46766      65615      60339      864        2034       2988       532        419        16752      8426       10515      2057       5406       5414       336581     321488                                                
2013       11125      80646      1320       4602       10965      46593      65953      60646      862        2013       2958       545        423        16800      8477       10457      2060       5413       5439       337298     322249                                                
2014       11180      80983      1316       4615       10892      46455      66290      60789      853        1994       2932       558        427        16863      8544       10401      2062       5419       5463       338036     323033                                                
2015       11239      81687      1313       4642       10821      46410      66590      60731      848        1977       2905       569        432        16932      8630       10358      2063       5422       5481       339050     324089                                                
2016       11295      82491      1316       4684       10784      46450      66858      60628      852        1959       2868       584        438        17030      8740       10326      2065       5431       5495       340292     325364                                                
2017       11358      83171      1320       4734       10734      46491      67219      60587      852        1940       2825       598        442        17104      8810       10305      2068       5434       5516       341509     326629                                                
2018       11416      83575      1321       4782       10677      46529      67582      60578      857        1921       2785       613        445        17174      8871       10295      2070       5438       5537       342467     327630                                                
2019       11475      83784      1321       4826       10620      46565      67947      60559      862        1905       2749       627        449        17239      8925       10289      2072       5442       5558       343215     328415

Another option, if you know ranges of horizontal pixel locations that define columns, you can use those ranges with a CASE statement instead of counting from left to right. While more complex, this would allow you to maintain correct position even with missing values. The counting method allows the query to be more reusable across varied documents but specifying ranges of left-values may produce more desirable results.

In both of the queries above I use a function to convert the BLOB to an XMLTYPE. I’ve used two forms of the function. The first, and simpler form uses the deprecated IFILTER function.

CREATE OR REPLACE FUNCTION pdf_to_xml(p_pdf IN BLOB)
    RETURN XMLTYPE
IS
    v_clob   CLOB;
    v_xml    XMLTYPE;
BEGIN
    DBMS_LOB.createtemporary(v_clob, TRUE);
    DBMS_LOB.open(v_clob, DBMS_LOB.lob_readwrite);
    ctx_doc.ifilter(p_pdf, v_clob);

    DBMS_LOB.close(v_clob);

    -- This is an quick hack for illustration purposes
    -- to prevent validation errors.
    -- Simply take out the DOCTYPE and DTD references.
    v_xml := xmltype(REGEXP_REPLACE(v_clob, '<!DOCTYPE [^>]+>'));

    DBMS_LOB.freetemporary(v_clob);
    RETURN v_xml;
END;

The second form requires the creation of a text policy and preference before creating the function. For this purpose, some of the simple defaults from the Oracle documentation are sufficient.

BEGIN
    ctx_ddl.create_preference(preference_name => 'fast_filter', object_name => 'AUTO_FILTER');
    ctx_ddl.set_attribute(preference_name   => 'fast_filter',
                          attribute_name    => 'OUTPUT_FORMATTING',
                          attribute_value   => 'FALSE');
    ctx_ddl.create_policy(policy_name => 'my_policy', filter => 'fast_filter');
END;

CREATE OR REPLACE FUNCTION pdf_to_xml(p_pdf IN BLOB)
    RETURN XMLTYPE
IS
    v_clob   CLOB;
    v_xml    XMLTYPE;
BEGIN
    DBMS_LOB.createtemporary(v_clob, TRUE);
    DBMS_LOB.open(v_clob, DBMS_LOB.lob_readwrite);
    ctx_doc.policy_filter('my_policy',
                          p_pdf,
                          v_clob,
                          FALSE);

    DBMS_LOB.close(v_clob);

    -- This is an quick hack for illustration purposes
    -- to prevent validation errors.
    -- Simply take out the DOCTYPE and DTD references. 
    v_xml := xmltype(REGEXP_REPLACE(v_clob, '<!DOCTYPE [^>]+>'));

    DBMS_LOB.freetemporary(v_clob);
    RETURN v_xml;
END;

If you use the second form and later decide the policy and preference are no longer needed you can drop them:

BEGIN
    ctx_ddl.drop_policy('my_policy');
    ctx_ddl.drop_preference('fast_filter');
END;

Hopefully the function and query above will provide a solid framework for pursuing similar searches within your own PDF files. One last caveat I’ve found is that rows are not always exactly aligned. Fortunately the worst case I’ve found was some columns within a “row” were offset from the others by a single pixel. I was able to compensate by rounding the row value to nearest 10 ROUND(…,-1). For documents with small font size or rows with tight spacing that may be insufficient, requiring more sophisticated grouping math. Also, I expect that offset of 1 is simply the case for some of the specific files I’ve encountered. Other files may have more widely varied values, again requiring more complicated rounding or grouping rules to aggregate to consistent values.

2 thoughts on “Reading a PDF blob with PL/SQL and SQL”

Kostas Georgiou says:

2019-10-23 at 08:25

Hello, i ve tried this, taking as example the first pdf. When i m running the select statement which uses the pdf_to_xml function i am taking the following error

ORA-31020: For security reasons, ftp and http access over XDB repository is not allowed on server side

Any ideas?

1. Sean D. Stuber says:
  
  2019-11-11 at 21:39
  
  Hi thanks for reading. I’m sorry for the slow response.
  
  The error explains the problem pretty nicely but doesn’t tell you much about what to do about it.
  The XDB repository can’t make the call to the external source to validate against the DTD when constructing the xmltype.
  
  So, you have a few options
  
  1 – you can install the DTD locally and validate against that
  2 – you can remove the DOCTYPE tag with DTD completely so there’s nothing to validate (I’ve modified the example functions to do this)
  3 – you can disable DTD validation by setting a session event (for more information on this see MOS Doc ID 1407634.1)
  
  ALTER SESSION SET EVENTS = ‘31156 trace name context forever, level 0x2’

2 thoughts on “Reading a PDF blob with PL/SQL and SQL”

Leave a ReplyCancel reply