This is another article inspired by a recent request. That is, given a PDF file with tabular data, could I extract the data into a more human-usable format?
Yes, it is possible but it requires multiple steps and does come with some caveats.
The first step is converting the PDF into a format that is more easily processed with SQL and PL/SQL. To that end, Oracle Text includes functionality to parse a PDF BLOB into an XHTML CLOB.
With a BLOB containing a PDF with contents that look like this:
We can convert that to text of this form:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml"> <head> <meta http-equiv="content-type" content="text/html; charset=utf-8" /> <meta name="SOURCE MIME TYPE" content="application/pdf" /> <meta name="creation date" content="D:20180415223633-04'00'"></meta> <meta name="revision date" content="D:20180415223633-04'00'"></meta> <meta name="Creator" content=" Microsoft® Excel® 2010"></meta> <meta name="publisher" content="Microsoft® Excel® 2010"></meta> <style type="text/css"> div.c { white-space:pre; position:absolute; } </style> </head> <body>
...
XHTML can’t be parsed directly using native functions; but simply removing the !DOCTYPE declaration is sufficient to turn the contents into usable XML.
Within the BODY of the document, the PDF text will be stored in DIV tags such as these:
<div class="c" style="top:111px;left:70px;font-size:15px;font-family:Calibri, serif;">Pocket </div> <div class="c" style="top:111px;left:195px;font-size:15px;font-family:Calibri, serif;">3 </div> <div class="c" style="top:111px;left:239px;font-size:15px;font-family:Calibri, serif;">$1.68 </div> <div class="c" style="top:111px;left:320px;font-size:15px;font-family:Calibri, serif;">$5.04</div> <div class="c" style="top:131px;left:70px;font-size:15px;font-family:Calibri, serif;">Wocket </div> <div class="c" style="top:131px;left:195px;font-size:15px;font-family:Calibri, serif;">5 </div> <div class="c" style="top:131px;left:239px;font-size:15px;font-family:Calibri, serif;">$2.00 </div> <div class="c" style="top:131px;left:313px;font-size:15px;font-family:Calibri, serif;">$10.00</div> <div class="c" style="top:150px;left:70px;font-size:15px;font-family:Calibri, serif;">Widget </div> <div class="c" style="top:150px;left:188px;font-size:15px;font-family:Calibri, serif;">13 </div> <div class="c" style="top:150px;left:213px;font-size:15px;font-family:Calibri, serif;">$2,342.23 </div> <div class="c" style="top:150px;left:287px;font-size:15px;font-family:Calibri, serif;">$30,448.99</div>
However, there is no concept of a “table” associated with these nodes. To create rows and columns the style attribute must be parsed based on the top and left pixel positions. Data that is in the same “row” will usually be written to the same top location. Columns are little more difficult because numeric data will typically be right-justified; but the text is stored within the pdf based on it’s left-most position. To construct more traditional columns you must count values within a row. The 1st position within a row, regardless of the pixel location is the 1st column, 2nd position is the 2nd column and so on.
Thus, in the example above, column one is “Pocket”, “Wocket”, and “Widget” all at position 70. Column two is the number 3 at position 195, number 5 also at position 195 and then number 13 at position 188. The same process is repeated for the Price and Total columns. Pivoting by the column counters can turn the sequential DIV values into traditional rows and columns, producing results like this:
A B C D -------------------- -------------------- -------------------- -------------------- Sample Table Item Qty Price Total Pocket 3 $1.68 $5.04 Wocket 5 $2.00 $10.00 Widget 13 $2,342.23 $30,448.99
Note, the table header became left-justified as a result of the parsing technique because there was only one “column” in that row. Similar issues will result if there are blank (null) values within a table, since there will be no place holder to count. So, this technique is not guaranteed to produce immediately useful results. The output may require manual adjustment if data is out of sequence.
I have loaded a few blobs into a table, simply named “my_blobs” and extract the contents with queries of this form, extending the PIVOT clause to add columns as needed. The PDF referenced below is the same one linked to above.
SELECT a, b, c, d FROM (SELECT rown, DENSE_RANK() OVER(PARTITION BY rown ORDER BY coln) coln, val FROM (SELECT TO_NUMBER(SUBSTR(REGEXP_SUBSTR(style, 'top:[0-9]+'), 5)) rown, TO_NUMBER(SUBSTR(REGEXP_SUBSTR(style, 'left:[0-9]+'), 6)) coln, val FROM XMLTABLE(xmlnamespaces(DEFAULT 'http://www.w3.org/1999/xhtml'), '/html/body/div' PASSING (SELECT pdf_to_xml(content) FROM my_blobs WHERE name = 'SampleTable.pdf') COLUMNS style VARCHAR2(100) PATH '@style', val VARCHAR2(20) PATH 'text()') x)) PIVOT (MAX(val) FOR coln IN (1 a, 2 b, 3 c, 4 d)) ORDER BY rown;
The same query logic also works with larger PDFs too. Here’s one from the publicly available Statistical Annex of the European Economy.
SELECT a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z FROM (SELECT rown, DENSE_RANK() OVER(PARTITION BY rown ORDER BY coln) coln, val FROM (SELECT TO_NUMBER(SUBSTR(REGEXP_SUBSTR(style, 'top:[0-9]+'), 5)) rown, TO_NUMBER(SUBSTR(REGEXP_SUBSTR(style, 'left:[0-9]+'), 6)) coln, TRIM(val) val FROM XMLTABLE( xmlnamespaces(DEFAULT 'http://www.w3.org/1999/xhtml'), '/html/body/div' PASSING (SELECT pdf_to_xml(content) FROM my_blobs WHERE name = 'https://ec.europa.eu/info/sites/info/files/saee_autumn_2017_en.pdf') COLUMNS style VARCHAR2(100) PATH '@style', val VARCHAR2(10) PATH 'text()') x)) PIVOT (MAX(val) FOR coln IN (1 a,2 b,3 c,4 d,5 e,6 f,7 g,8 h,9 i,10 j,11 k,12 l,13 m,14 n,15 o,16 p,17 q,18 r,19 s,20 t,21 u,22 v,23 w,24 x,25 y,26 z)) ORDER BY rown
Scrolling through the we can pick out individual tables such as this one listing Population and Labor market numbers for various countries by year. Note this also provides an example of the problem with missing values in a table. 1991 has a summary row in addition to a row of the individual country values. The values on the right side of the table are parsed into the left by the column counting.
1960 9153 55433 1221 2835 8332 30263 46636 50200 573 2121 2779 332 333 11487 7047 8858 1585 4068 4430 247685 235006 1965 9464 58619 1305 2877 8550 31962 49825 52112 591 2266 2971 350 323 12295 7271 8999 1649 4371 4564 260363 246887 1970 9656 60651 1371 2951 8793 33761 51939 53822 614 2359 3140 358 327 13032 7467 8680 1725 4538 4606 269790 255715 1975 9801 61829 1441 3178 9047 35758 53926 55441 502 2456 3302 365 309 13660 7579 9093 1794 4739 4711 278930 264388 1980 9859 61566 1489 3402 9643 37491 55150 56434 509 2512 3413 364 321 14148 7549 9766 1901 4980 4780 285277 270152 1981 9858 61682 1500 3444 9729 37759 55462 56502 515 2519 3433 365 324 14247 7569 9851 1907 5016 4800 286481 271268 1982 9856 61638 1510 3481 9790 37986 55797 56544 521 2531 3457 366 331 14312 7574 9912 1910 5055 4827 287397 272081 1983 9854 61423 1521 3506 9847 38172 56096 56564 528 2546 3485 366 335 14368 7562 9958 1922 5092 4856 288000 272571 1984 9855 61175 1531 3530 9896 38330 56369 56577 535 2562 3514 366 335 14423 7561 9996 1932 5127 4882 288497 272961 1985 9857 61024 1541 3541 9934 38470 56649 56593 542 2579 3545 367 341 14488 7565 10024 1942 5162 4902 289064 273413 1986 9859 61066 1552 3542 9967 38585 56937 56596 548 2600 3579 368 347 14567 7570 10033 1966 5194 4918 289794 274008 1987 9870 61077 1565 3543 10001 38685 57244 56602 554 2627 3616 371 350 14664 7575 10030 1990 5223 4932 290516 274593 1988 9904 61449 1574 3531 10037 38767 57572 56629 560 2653 3655 374 353 14760 7585 10020 1995 5251 4947 291615 275574 1989 9940 62063 1581 3510 10089 38828 57912 56672 568 2667 3684 378 356 14846 7620 10005 1996 5276 4964 292954 276826 1990 9968 63253 1582 3506 10197 38867 58227 56719 580 2663 3698 382 360 14947 7678 9983 1998 5299 4986 294894 278714 1991 640 280354 1991 10006 79973 1574 3526 10320 38966 58520 56759 595 2651 3704 387 364 15068 7755 9960 1999 5303 5014 312444 296253 1992 10047 80500 1545 3549 10399 39158 58811 56797 611 2614 3700 392 367 15182 7841 9952 1996 5305 5042 313811 297671 1993 10086 80946 1506 3563 10460 39361 59066 56832 626 2563 3683 398 371 15290 7906 9965 1992 5325 5066 315006 298940 1994 10116 81147 1474 3571 10513 39549 59286 56843 639 2521 3657 404 375 15381 7936 9992 1989 5347 5088 315828 299826 1995 10137 81308 1448 3601 10562 39719 59501 56844 651 2485 3629 410 378 15460 7948 10026 1989 5363 5108 316567 300624 1996 10157 81466 1425 3626 10609 39884 59713 56860 661 2457 3602 414 380 15526 7959 10064 1990 5374 5125 317292 301403 1997 10181 81510 1406 3661 10661 40050 59926 56890 671 2433 3575 420 383 15608 7968 10109 1986 5383 5140 317962 302124 1998 10203 81446 1393 3714 10721 40214 60147 56907 679 2410 3549 425 385 15703 7977 10160 1982 5391 5154 318560 302770 1999 10226 81422 1379 3755 10762 40370 60457 56916 687 2391 3524 431 387 15809 7992 10218 1984 5396 5166 319272 303524 2000 10251 81457 1401 3804 10806 40554 60872 56942 694 2368 3500 437 390 15922 8012 10290 1989 5401 5176 320266 304523 2001 10287 81517 1393 3864 10862 40766 61317 56980 702 2338 3471 442 393 16043 8042 10363 1992 5380 5188 321338 305670 2002 10333 81578 1384 3932 10902 41424 61764 57100 710 2310 3443 447 396 16147 8082 10420 1995 5379 5201 322944 307328 2003 10376 81549 1375 3997 10928 42196 62202 57413 718 2288 3415 452 399 16223 8118 10459 1996 5379 5213 324697 309126 2004 10421 81456 1366 4067 10955 42859 62661 57845 728 2263 3377 459 401 16276 8169 10484 1997 5382 5228 326396 310880 2005 10479 81337 1359 4160 10987 43663 63133 58191 739 2239 3323 466 404 16317 8225 10503 2001 5387 5246 328157 312707 2006 10548 81173 1351 4270 11020 44361 63574 58428 751 2219 3270 473 405 16341 8268 10522 2008 5391 5266 329639 314245 2007 10626 80992 1343 4400 11049 45236 63967 58787 767 2201 3231 481 407 16378 8295 10543 2019 5397 5289 331407 316042 2008 10710 80764 1338 4496 11078 45983 64324 59242 787 2178 3198 489 409 16440 8322 10558 2022 5406 5313 333057 317719 2009 10797 80483 1336 4539 11107 46368 64655 59578 808 2142 3163 498 412 16526 8341 10568 2042 5418 5339 334120 318800 2010 10896 80284 1333 4560 11121 46562 64974 59830 829 2097 3097 508 414 16612 8361 10573 2049 5430 5363 334894 319643 2011 10994 80275 1330 4577 11105 46736 65294 60060 851 2059 3028 519 416 16693 8389 10558 2053 5398 5388 335722 320588 2012 11068 80426 1325 4590 11045 46766 65615 60339 864 2034 2988 532 419 16752 8426 10515 2057 5406 5414 336581 321488 2013 11125 80646 1320 4602 10965 46593 65953 60646 862 2013 2958 545 423 16800 8477 10457 2060 5413 5439 337298 322249 2014 11180 80983 1316 4615 10892 46455 66290 60789 853 1994 2932 558 427 16863 8544 10401 2062 5419 5463 338036 323033 2015 11239 81687 1313 4642 10821 46410 66590 60731 848 1977 2905 569 432 16932 8630 10358 2063 5422 5481 339050 324089 2016 11295 82491 1316 4684 10784 46450 66858 60628 852 1959 2868 584 438 17030 8740 10326 2065 5431 5495 340292 325364 2017 11358 83171 1320 4734 10734 46491 67219 60587 852 1940 2825 598 442 17104 8810 10305 2068 5434 5516 341509 326629 2018 11416 83575 1321 4782 10677 46529 67582 60578 857 1921 2785 613 445 17174 8871 10295 2070 5438 5537 342467 327630 2019 11475 83784 1321 4826 10620 46565 67947 60559 862 1905 2749 627 449 17239 8925 10289 2072 5442 5558 343215 328415
Another option, if you know ranges of horizontal pixel locations that define columns, you can use those ranges with a CASE statement instead of counting from left to right. While more complex, this would allow you to maintain correct position even with missing values. The counting method allows the query to be more reusable across varied documents but specifying ranges of left-values may produce more desirable results.
In both of the queries above I use a function to convert the BLOB to an XMLTYPE. I’ve used two forms of the function. The first, and simpler form uses the deprecated IFILTER function.
CREATE OR REPLACE FUNCTION pdf_to_xml(p_pdf IN BLOB) RETURN XMLTYPE IS v_clob CLOB; v_xml XMLTYPE; BEGIN DBMS_LOB.createtemporary(v_clob, TRUE); DBMS_LOB.open(v_clob, DBMS_LOB.lob_readwrite); ctx_doc.ifilter(p_pdf, v_clob); DBMS_LOB.close(v_clob); -- This is an quick hack for illustration purposes -- to prevent validation errors. -- Simply take out the DOCTYPE and DTD references. v_xml := xmltype(REGEXP_REPLACE(v_clob, '<!DOCTYPE [^>]+>')); DBMS_LOB.freetemporary(v_clob); RETURN v_xml; END;
The second form requires the creation of a text policy and preference before creating the function. For this purpose, some of the simple defaults from the Oracle documentation are sufficient.
BEGIN ctx_ddl.create_preference(preference_name => 'fast_filter', object_name => 'AUTO_FILTER'); ctx_ddl.set_attribute(preference_name => 'fast_filter', attribute_name => 'OUTPUT_FORMATTING', attribute_value => 'FALSE'); ctx_ddl.create_policy(policy_name => 'my_policy', filter => 'fast_filter'); END; CREATE OR REPLACE FUNCTION pdf_to_xml(p_pdf IN BLOB) RETURN XMLTYPE IS v_clob CLOB; v_xml XMLTYPE; BEGIN DBMS_LOB.createtemporary(v_clob, TRUE); DBMS_LOB.open(v_clob, DBMS_LOB.lob_readwrite); ctx_doc.policy_filter('my_policy', p_pdf, v_clob, FALSE); DBMS_LOB.close(v_clob); -- This is an quick hack for illustration purposes -- to prevent validation errors. -- Simply take out the DOCTYPE and DTD references. v_xml := xmltype(REGEXP_REPLACE(v_clob, '<!DOCTYPE [^>]+>')); DBMS_LOB.freetemporary(v_clob); RETURN v_xml; END;
If you use the second form and later decide the policy and preference are no longer needed you can drop them:
BEGIN ctx_ddl.drop_policy('my_policy'); ctx_ddl.drop_preference('fast_filter'); END;
Hopefully the function and query above will provide a solid framework for pursuing similar searches within your own PDF files. One last caveat I’ve found is that rows are not always exactly aligned. Fortunately the worst case I’ve found was some columns within a “row” were offset from the others by a single pixel. I was able to compensate by rounding the row value to nearest 10 ROUND(…,-1). For documents with small font size or rows with tight spacing that may be insufficient, requiring more sophisticated grouping math. Also, I expect that offset of 1 is simply the case for some of the specific files I’ve encountered. Other files may have more widely varied values, again requiring more complicated rounding or grouping rules to aggregate to consistent values.
Hello, i ve tried this, taking as example the first pdf. When i m running the select statement which uses the pdf_to_xml function i am taking the following error
ORA-31020: For security reasons, ftp and http access over XDB repository is not allowed on server side
Any ideas?
Hi thanks for reading. I’m sorry for the slow response.
The error explains the problem pretty nicely but doesn’t tell you much about what to do about it.
The XDB repository can’t make the call to the external source to validate against the DTD when constructing the xmltype.
So, you have a few options
1 – you can install the DTD locally and validate against that
2 – you can remove the DOCTYPE tag with DTD completely so there’s nothing to validate (I’ve modified the example functions to do this)
3 – you can disable DTD validation by setting a session event (for more information on this see MOS Doc ID 1407634.1)
ALTER SESSION SET EVENTS = ‘31156 trace name context forever, level 0x2’