Yesterday Kim Berg Hansen posted a couple of entries on his blog about scraping and parsing the Oracle Developer Choice Awards voting pages and building summaries of the results.
I’ve been doing something similar since the voting began, but took a slightly different approach. I do like the idea of storing the results in a table to be able to derive a history, but I never did that; I simply looked at a snapshot in time and didn’t bother to keep trends.
Kim took two approaches: one using an APEX call, which he said didn’t work very reliably, and another where he simply pastes in the HTML himself.
My method is similar, but I use UTL_HTTP calls to retrieve the HTML.
Also, my function returns the raw HTML and I do the cleanup in my SQL statement, while he cleans his up within his scraping function.
Since he’s storing the results, that makes sense: why incur the cost of cleanup each time he reads his table?
Here is my version of the scraping function:
CREATE OR REPLACE FUNCTION read_oracle_dev_choice(p_category IN VARCHAR2)
    RETURN CLOB
IS
    v_http_request  UTL_HTTP.req;
    v_http_response UTL_HTTP.resp;
    v_buffer        VARCHAR2(32767);
    v_clob          CLOB;
    v_req_ctx       UTL_HTTP.request_context_key;
BEGIN
    v_req_ctx := UTL_HTTP.create_request_context('file:/home/oracle/devchoice', 'pa55w0rd', FALSE);

    v_http_request :=
        UTL_HTTP.begin_request(
            'https://community.oracle.com/community/database/awards/' || p_category || '-voting/',
            'GET',
            'HTTP/1.1',
            v_req_ctx);

    v_http_response := UTL_HTTP.get_response(v_http_request);
    --DBMS_OUTPUT.put_line('Response status code: ' || v_http_response.status_code);
    --DBMS_OUTPUT.put_line('Response reason phrase: ' || v_http_response.reason_phrase);
    --DBMS_OUTPUT.put_line('Response HTTP version: ' || v_http_response.http_version);

    DBMS_LOB.createtemporary(v_clob, TRUE);

    BEGIN
        LOOP
            UTL_HTTP.read_text(v_http_response, v_buffer, 32767);
            DBMS_LOB.writeappend(v_clob, LENGTH(v_buffer), v_buffer);
        END LOOP;
    EXCEPTION
        WHEN UTL_HTTP.end_of_body
        THEN
            UTL_HTTP.end_response(v_http_response);
    END;

    UTL_HTTP.destroy_request_context(v_req_ctx);

    RETURN v_clob;
END;
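Once the function compiles, a quick sanity check might look like the following — a sketch only, which assumes the ACL and wallet prerequisites mentioned below are already in place:

```sql
-- Quick smoke test: fetch one category and report how much HTML came back.
-- Assumes the ACL and wallet entries for community.oracle.com exist.
DECLARE
    v_html CLOB;
BEGIN
    v_html := read_oracle_dev_choice('sql');
    DBMS_OUTPUT.put_line('Fetched ' || DBMS_LOB.getlength(v_html) || ' characters');
END;
/
```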
Once this scraping function has been created and you have your corresponding ACL and wallet entries, you can extract the results with a single SQL statement.
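For reference, the ACL piece might look roughly like this in 11g-style syntax — purely a sketch: the ACL file name and principal are placeholders you would adapt to your own environment, and you still need an Oracle wallet (created with orapki) holding the site's certificate chain for the HTTPS call:

```sql
-- Hypothetical ACL setup; 'devchoice.xml' and 'SCOTT' are placeholders.
BEGIN
    DBMS_NETWORK_ACL_ADMIN.create_acl(
        acl         => 'devchoice.xml',
        description => 'HTTP access to community.oracle.com',
        principal   => 'SCOTT',   -- the schema that owns read_oracle_dev_choice
        is_grant    => TRUE,
        privilege   => 'connect');

    DBMS_NETWORK_ACL_ADMIN.assign_acl(
        acl        => 'devchoice.xml',
        host       => 'community.oracle.com',
        lower_port => 443,
        upper_port => 443);

    COMMIT;
END;
/
```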
SELECT *
  FROM (SELECT x.*,
               RANK() OVER (PARTITION BY category
                            ORDER BY (score_rank + upvote_rank + uppct_rank)) avg_rank
          FROM (SELECT x.*,
                       RANK() OVER (PARTITION BY category ORDER BY score DESC) score_rank,
                       RANK() OVER (PARTITION BY category ORDER BY upvotes DESC) upvote_rank,
                       RANK() OVER (PARTITION BY category ORDER BY uppct DESC) uppct_rank
                  FROM (SELECT x.*, ROUND(100 * upvotes / totalvotes, 2) uppct
                          FROM (SELECT category,
                                       finalist,
                                       score,
                                       (score + 10 * votes) / 20 upvotes,
                                       (10 * votes - score) / 20 downvotes,
                                       votes totalvotes
                                  FROM (SELECT cat category,
                                               TO_NUMBER(REGEXP_SUBSTR(score, '[0-9]+')) score,
                                               TO_NUMBER(REGEXP_SUBSTR(votes, '[0-9]+')) votes,
                                               TRIM(finalist) finalist
                                          FROM (SELECT 'sql' cat, read_oracle_dev_choice('sql') html FROM DUAL UNION ALL
                                                SELECT 'plsql', read_oracle_dev_choice('plsql') FROM DUAL UNION ALL
                                                SELECT 'apex', read_oracle_dev_choice('apex') FROM DUAL UNION ALL
                                                SELECT 'db-design', read_oracle_dev_choice('db-design') FROM DUAL UNION ALL
                                                SELECT 'ords', read_oracle_dev_choice('ords') FROM DUAL),
                                               XMLTABLE(
                                                   '//div[@class="jive-content-ideas-list-item clearfix"]'
                                                   PASSING xmltype(
                                                              REGEXP_REPLACE(
                                                                  REGEXP_REPLACE(
                                                                      REGEXP_REPLACE(html, '^.+?', '', 1, 0, 'n'),
                                                                      '.+$', '', 1, 0, 'n'),
                                                                  CHR(38) || '[^;]+?;'))
                                                   COLUMNS score    VARCHAR2(20) PATH './div/div/strong',
                                                           votes    VARCHAR2(20) PATH './div/span/span[@class="idea-vote-count"]',
                                                           finalist VARCHAR2(20) PATH './div[@class="jive-content"]/div/div[@class="jive-content-title"]') x)) x) x) x) x)
 ORDER BY category DESC, score DESC;

And this produces results like this:
CATEGORY   FINALIST              SCORE  UPVOTES  DOWNVOTES  TOTALVOTES  UPPCT  SCORE_RANK  UPVOTE_RANK  UPPCT_RANK  AVG_RANK
---------  --------------------  -----  -------  ---------  ----------  -----  ----------  -----------  ----------  --------
sql        Emrah Mete             3010      335         34         369  90.79           1            1           1         1
sql        Sayan Malakshinov      1780      210         32         242  86.78           2            2           2         2
sql        Sean Stuber             790       98         19         117  83.76           3            3           3         3
sql        Kim Berg Hansen         680       86         18         104  82.69           4            4           4         4
sql        Erik Van Roon           400       55         15          70  78.57           5            5           6         5
sql        Justin Cave             300       44         14          58  75.86           6            6           8         7
sql        Matthias Rogel          290       38          9          47  80.85           7            7           5         6
sql        Stew Ashton             260       38         12          50     76           8            7           7         8
plsql      Adrian Billington      1080      112          4         116  96.55           1            2           1         1
plsql      Roger Troller          1030      117         14         131  89.31           2            1           4         2
plsql      Sean Stuber             910       97          6         103  94.17           3            3           2         3
plsql      Patrick Barel           690       79         10          89  88.76           4            4           5         4
plsql      Morten Braten           620       68          6          74  91.89           5            5           3         4
plsql      Kim Berg Hansen         440       51          7          58  87.93           6            6           7         6
plsql      Bill Coulam             400       46          6          52  88.46           7            7           6         7
ords       Dietmar Aust           1240      128          4         132  96.97           1            1           1         1
ords       Dimitri Gielis          800       86          6          92  93.48           2            2           2         2
ords       Morten Braten           430       50          7          57  87.72           3            3           4         3
ords       Kiran Pawar             350       39          4          43   90.7           4            4           3         4
ords       Anton Nielsen           240       28          4          32   87.5           5            5           5         5
ords       Tim St. Hilaire         130       16          3          19  84.21           6            6           6         6
db-design  Heli Helskyaho         1030      119         16         135  88.15           1            1           2         1
db-design  Michelle Kolbe          630       75         12          87  86.21           2            2           3         2
db-design  Rob Lockard             520       57          5          62  91.94           3            3           1         2
db-design  Mark Hoxey              160       23          7          30  76.67           4            4           4         4
apex       Jari Laine              720       78          6          84  92.86           1            1           4         1
apex       Morten Braten           680       73          5          78  93.59           2            2           3         2
apex       Juergen Schuster        560       58          2          60  96.67           3            3           1         2
apex       Kiran Pawar             430       48          5          53  90.57           4            4           5         5
apex       Karen Cannell           280       30          2          32  93.75           5            5           2         4
apex       Paul MacMillan          130       21          8          29  72.41           6            6           7         6
apex       Trent Schafer           120       17          5          22  77.27           7            7           6         7

I hope you find this helpful and a fun exercise to pursue.
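A note on the upvote/downvote expressions in the query: the page only exposes a score and a total vote count, but since each upvote is worth +10 and each downvote -10, we have score = 10 * (up - down) and votes = up + down. Solving that pair gives the (score + 10 * votes) / 20 and (10 * votes - score) / 20 expressions. Checking the arithmetic against Emrah Mete's row above:

```sql
-- Sanity check: a score of 3010 over 369 total votes should
-- decompose into 335 upvotes and 34 downvotes.
SELECT (3010 + 10 * 369) / 20 AS upvotes,    -- 335
       (10 * 369 - 3010) / 20 AS downvotes   -- 34
  FROM DUAL;
```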
If you haven’t voted, I encourage you to do so here…
I’m a finalist in the SQL and PL/SQL categories and would, of course, appreciate a vote. Be sure to check the other categories too, though, and vote up the finalists there that have helped you in some way.
Also note, you can vote up more than one person in a category, so by all means vote up as many as you want, including the other finalists in my categories if you want.

p.s. – an interesting “feature” of the web scraping is that the URL works better when it includes a trailing slash “/”. I have no idea why this should help, but without it the returned CLOB will sometimes (often) contain “Unexpected Error Occurred” rather than the expected results. Occasionally the reverse happens, though. If anyone can explain why this is, I would appreciate it.
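One way to poke at the trailing-slash mystery is to issue the slash-less request and look at the status line rather than the body — a 301/302 here would suggest redirect handling is involved. This is just an investigative sketch, reusing the same wallet path and password as the function above:

```sql
-- Request the page WITHOUT the trailing slash and print the status line.
DECLARE
    v_ctx  UTL_HTTP.request_context_key;
    v_req  UTL_HTTP.req;
    v_resp UTL_HTTP.resp;
BEGIN
    v_ctx  := UTL_HTTP.create_request_context('file:/home/oracle/devchoice', 'pa55w0rd', FALSE);
    v_req  := UTL_HTTP.begin_request(
                  'https://community.oracle.com/community/database/awards/sql-voting',
                  'GET', 'HTTP/1.1', v_ctx);
    v_resp := UTL_HTTP.get_response(v_req);
    DBMS_OUTPUT.put_line(v_resp.status_code || ' ' || v_resp.reason_phrase);
    UTL_HTTP.end_response(v_resp);
    UTL_HTTP.destroy_request_context(v_ctx);
END;
/
```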