Yesterday Kim Berg Hansen posted a couple of entries on his blog about scraping and parsing the Oracle Developer Choice Awards voting pages and building summaries of the results.
I’ve been doing something similar since the voting began, but took a slightly different approach. I do like the idea of storing the results in a table to be able to derive a history, but I never did that; I simply looked at a snapshot in time and didn’t bother to keep trends.
Kim took two approaches: one using an APEX call, which he said didn’t work very reliably, and another where he simply pastes in the HTML himself.
My method is similar, but I use UTL_HTTP calls to retrieve the HTML.
Also, my function returns the raw HTML and I do the cleanup in my SQL statement, while he cleans his up within his scraping function.
Since he’s storing the results, that makes sense: why incur the cost of cleanup each time he reads his table?
Here is my version of the scraping function:
CREATE OR REPLACE FUNCTION read_oracle_dev_choice(p_category IN VARCHAR2)
    RETURN CLOB
IS
    v_http_request  UTL_HTTP.req;
    v_http_response UTL_HTTP.resp;
    v_buffer        VARCHAR2(32767);
    v_clob          CLOB;
    v_req_ctx       UTL_HTTP.request_context_key;
BEGIN
    v_req_ctx := UTL_HTTP.create_request_context('file:/home/oracle/devchoice', 'pa55w0rd', FALSE);

    v_http_request :=
        UTL_HTTP.begin_request(
            'https://community.oracle.com/community/database/awards/' || p_category || '-voting/',
            'GET',
            'HTTP/1.1',
            v_req_ctx);

    v_http_response := UTL_HTTP.get_response(v_http_request);
    --DBMS_OUTPUT.put_line('Response status code: ' || v_http_response.status_code);
    --DBMS_OUTPUT.put_line('Response reason phrase: ' || v_http_response.reason_phrase);
    --DBMS_OUTPUT.put_line('Response HTTP version: ' || v_http_response.http_version);

    DBMS_LOB.createtemporary(v_clob, TRUE);

    BEGIN
        LOOP
            UTL_HTTP.read_text(v_http_response, v_buffer, 32767);
            DBMS_LOB.writeappend(v_clob, LENGTH(v_buffer), v_buffer);
        END LOOP;
    EXCEPTION
        WHEN UTL_HTTP.end_of_body
        THEN
            UTL_HTTP.end_response(v_http_response);
    END;

    UTL_HTTP.destroy_request_context(v_req_ctx);

    RETURN v_clob;
END;
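Once the function compiles, a quick sanity check might look like the following — a sketch only, which assumes the ACL and wallet prerequisites mentioned below are already in place:

```sql
-- Quick smoke test: fetch one category and report how much HTML came back.
-- Assumes the ACL and wallet entries for community.oracle.com exist.
DECLARE
    v_html CLOB;
BEGIN
    v_html := read_oracle_dev_choice('sql');
    DBMS_OUTPUT.put_line('Fetched ' || DBMS_LOB.getlength(v_html) || ' characters');
END;
/
```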
Once this scraping function has been created and you have your corresponding ACL and wallet entries, you can extract the results with a single SQL statement.
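For reference, the ACL piece might look roughly like this in 11g-style syntax — purely a sketch: the ACL file name and principal are placeholders you would adapt to your own environment, and you still need an Oracle wallet (created with orapki) holding the site's certificate chain for the HTTPS call:

```sql
-- Hypothetical ACL setup; 'devchoice.xml' and 'SCOTT' are placeholders.
BEGIN
    DBMS_NETWORK_ACL_ADMIN.create_acl(
        acl         => 'devchoice.xml',
        description => 'HTTP access to community.oracle.com',
        principal   => 'SCOTT',   -- the schema that owns read_oracle_dev_choice
        is_grant    => TRUE,
        privilege   => 'connect');

    DBMS_NETWORK_ACL_ADMIN.assign_acl(
        acl        => 'devchoice.xml',
        host       => 'community.oracle.com',
        lower_port => 443,
        upper_port => 443);

    COMMIT;
END;
/
```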
SELECT *
  FROM (SELECT x.*,
               RANK() OVER (PARTITION BY category
                            ORDER BY (score_rank + upvote_rank + uppct_rank)) avg_rank
          FROM (SELECT x.*,
                       RANK() OVER (PARTITION BY category ORDER BY score DESC) score_rank,
                       RANK() OVER (PARTITION BY category ORDER BY upvotes DESC) upvote_rank,
                       RANK() OVER (PARTITION BY category ORDER BY uppct DESC) uppct_rank
                  FROM (SELECT x.*, ROUND(100 * upvotes / totalvotes, 2) uppct
                          FROM (SELECT category,
                                       finalist,
                                       score,
                                       (score + 10 * votes) / 20 upvotes,
                                       (10 * votes - score) / 20 downvotes,
                                       votes totalvotes
                                  FROM (SELECT cat category,
                                               TO_NUMBER(REGEXP_SUBSTR(score, '[0-9]+')) score,
                                               TO_NUMBER(REGEXP_SUBSTR(votes, '[0-9]+')) votes,
                                               TRIM(finalist) finalist
                                          FROM (SELECT 'sql' cat, read_oracle_dev_choice('sql') html FROM DUAL UNION ALL
                                                SELECT 'plsql', read_oracle_dev_choice('plsql') FROM DUAL UNION ALL
                                                SELECT 'apex', read_oracle_dev_choice('apex') FROM DUAL UNION ALL
                                                SELECT 'db-design', read_oracle_dev_choice('db-design') FROM DUAL UNION ALL
                                                SELECT 'ords', read_oracle_dev_choice('ords') FROM DUAL),
                                               XMLTABLE(
                                                   '//div[@class="jive-content-ideas-list-item clearfix"]'
                                                   PASSING xmltype(
                                                              REGEXP_REPLACE(
                                                                  REGEXP_REPLACE(
                                                                      REGEXP_REPLACE(html, '^.+?', '', 1, 0, 'n'),
                                                                      '.+$', '', 1, 0, 'n'),
                                                                  CHR(38) || '[^;]+?;'))
                                                   COLUMNS score    VARCHAR2(20) PATH './div/div/strong',
                                                           votes    VARCHAR2(20) PATH './div/span/span[@class="idea-vote-count"]',
                                                           finalist VARCHAR2(20) PATH './div[@class="jive-content"]/div/div[@class="jive-content-title"]') x)) x) x) x) x)
 ORDER BY category DESC, score DESC;

And this produces results like this:
CATEGORY   FINALIST              SCORE  UPVOTES  DOWNVOTES  TOTALVOTES  UPPCT  SCORE_RANK  UPVOTE_RANK  UPPCT_RANK  AVG_RANK
---------  --------------------  -----  -------  ---------  ----------  -----  ----------  -----------  ----------  --------
sql        Emrah Mete             3010      335         34         369  90.79           1            1           1         1
sql        Sayan Malakshinov      1780      210         32         242  86.78           2            2           2         2
sql        Sean Stuber             790       98         19         117  83.76           3            3           3         3
sql        Kim Berg Hansen         680       86         18         104  82.69           4            4           4         4
sql        Erik Van Roon           400       55         15          70  78.57           5            5           6         5
sql        Justin Cave             300       44         14          58  75.86           6            6           8         7
sql        Matthias Rogel          290       38          9          47  80.85           7            7           5         6
sql        Stew Ashton             260       38         12          50     76           8            7           7         8
plsql      Adrian Billington      1080      112          4         116  96.55           1            2           1         1
plsql      Roger Troller          1030      117         14         131  89.31           2            1           4         2
plsql      Sean Stuber             910       97          6         103  94.17           3            3           2         3
plsql      Patrick Barel           690       79         10          89  88.76           4            4           5         4
plsql      Morten Braten           620       68          6          74  91.89           5            5           3         4
plsql      Kim Berg Hansen         440       51          7          58  87.93           6            6           7         6
plsql      Bill Coulam             400       46          6          52  88.46           7            7           6         7
ords       Dietmar Aust           1240      128          4         132  96.97           1            1           1         1
ords       Dimitri Gielis          800       86          6          92  93.48           2            2           2         2
ords       Morten Braten           430       50          7          57  87.72           3            3           4         3
ords       Kiran Pawar             350       39          4          43   90.7           4            4           3         4
ords       Anton Nielsen           240       28          4          32   87.5           5            5           5         5
ords       Tim St. Hilaire         130       16          3          19  84.21           6            6           6         6
db-design  Heli Helskyaho         1030      119         16         135  88.15           1            1           2         1
db-design  Michelle Kolbe          630       75         12          87  86.21           2            2           3         2
db-design  Rob Lockard             520       57          5          62  91.94           3            3           1         2
db-design  Mark Hoxey              160       23          7          30  76.67           4            4           4         4
apex       Jari Laine              720       78          6          84  92.86           1            1           4         1
apex       Morten Braten           680       73          5          78  93.59           2            2           3         2
apex       Juergen Schuster        560       58          2          60  96.67           3            3           1         2
apex       Kiran Pawar             430       48          5          53  90.57           4            4           5         5
apex       Karen Cannell           280       30          2          32  93.75           5            5           2         4
apex       Paul MacMillan          130       21          8          29  72.41           6            6           7         6
apex       Trent Schafer           120       17          5          22  77.27           7            7           6         7

I hope you find this helpful and a fun exercise to pursue.
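A note on the upvote/downvote expressions in the query: the page only exposes a score and a total vote count, but since each upvote is worth +10 and each downvote -10, we have score = 10 * (up - down) and votes = up + down. Solving that pair gives the (score + 10 * votes) / 20 and (10 * votes - score) / 20 expressions. Checking the arithmetic against Emrah Mete's row above:

```sql
-- Sanity check: a score of 3010 over 369 total votes should
-- decompose into 335 upvotes and 34 downvotes.
SELECT (3010 + 10 * 369) / 20 AS upvotes,    -- 335
       (10 * 369 - 3010) / 20 AS downvotes   -- 34
  FROM DUAL;
```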
If you haven’t voted, I encourage you to do so here…
I’m a finalist in the SQL and PL/SQL categories and would, of course, appreciate a vote. Be sure to check the other categories too, though, and vote up the finalists there that have helped you in some way.
Also note, you can vote up more than one person in a category, so by all means vote up as many as you want, including the other finalists in my categories if you want.

p.s. – an interesting “feature” of the web scraping is that the URL works better when it includes a trailing slash “/”. I have no idea why this should help, but without it the returned CLOB will sometimes (often) contain “Unexpected Error Occurred” rather than the expected results. Occasionally the reverse happens, though. If anyone can explain why this is, I would appreciate it.
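One way to poke at the trailing-slash mystery is to issue the slash-less request and look at the status line rather than the body — a 301/302 here would suggest redirect handling is involved. This is just an investigative sketch, reusing the same wallet path and password as the function above:

```sql
-- Request the page WITHOUT the trailing slash and print the status line.
DECLARE
    v_ctx  UTL_HTTP.request_context_key;
    v_req  UTL_HTTP.req;
    v_resp UTL_HTTP.resp;
BEGIN
    v_ctx  := UTL_HTTP.create_request_context('file:/home/oracle/devchoice', 'pa55w0rd', FALSE);
    v_req  := UTL_HTTP.begin_request(
                  'https://community.oracle.com/community/database/awards/sql-voting',
                  'GET', 'HTTP/1.1', v_ctx);
    v_resp := UTL_HTTP.get_response(v_req);
    DBMS_OUTPUT.put_line(v_resp.status_code || ' ' || v_resp.reason_phrase);
    UTL_HTTP.end_response(v_resp);
    UTL_HTTP.destroy_request_context(v_ctx);
END;
/
```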