Close

Splitting a clob into rows

I’ve used this tool for a wide variety of other parsing projects. One of the interesting tuning techniques I used was to pull the clob apart into 32K varchar2 chunks.
It is possible to split the clob directly using the DBMS_LOB package or through the overloaded SQL functions; but clobs are expensive objects. Varchar2 variables on the other hand are relatively light weight making the sub-parsing within them much faster. Doing this does take a little bit of care though to make sure the chunks don’t accidentally split a line in two.

I make a few assumptions in the code below. I assume most lines will be more than 32K long. If they are, I will split into fixed length lines until I either run out of data or reach a delimiter. I also assume I’ll be working with single byte characters. If you use multibyte character sets, there are two places in the code, you will want to change. These lines are indicated with comments.

The returned VCTAB type is a simple table collection type.

CREATE OR REPLACE TYPE VCTAB as table of varchar2(4000)
CREATE OR REPLACE FUNCTION split_clob(
    p_clob        IN CLOB,
    p_delimiter   IN VARCHAR2 DEFAULT CHR(10))
    RETURN vctab
    PIPELINED
IS
    --                    .///.
    --                   (0 o)
    ---------------0000--(_)--0000---------------
    --
    --  Sean D. Stuber
    --  sean.stuber@gmail.com
    --
    --             oooO      Oooo
    --------------(   )-----(   )---------------
    --             \ (       ) /
    --              \_)     (_/

    c_line_limit    CONSTANT INTEGER := 4000; -- for single byte characters
    --c_line_limit    CONSTANT INTEGER := 1000; -- for multibyte characters
    
    -- Chunk limit is slightly less than a full varchar2 limit to allow for
    -- dbms_lob.substr return length limits with multibyte character sets
    -- For single byte characters, this is still sufficient.

    c_chunk_limit   CONSTANT INTEGER := 8 * c_line_limit;

    v_clob_length            INTEGER;
    v_clob_index             INTEGER;
    v_chunk                  VARCHAR2(32767);
    v_chunk_end              INTEGER;
    v_chunk_length           INTEGER;
    v_chunk_index            INTEGER;
    v_delim_len              INTEGER := LENGTH(p_delimiter);
    v_line_end               INTEGER;
BEGIN
    v_clob_length := DBMS_LOB.getlength(p_clob);
    v_clob_index := 1;

    WHILE v_clob_index <= v_clob_length
    LOOP
        --
        -- Pull one chunk off the clob at a time and process it.
        -- This is because it's MUCH faster to use built in functions
        -- on a varchar2 type than to use dbms_lob functions on a clob.
        --
        v_chunk := DBMS_LOB.SUBSTR(p_clob, c_chunk_limit, v_clob_index);

        IF v_clob_index > v_clob_length - c_chunk_limit  -- for single byte
        --IF v_clob_index > v_clob_length - length(v_chunk) -- for multibyte
        THEN
            -- If we walked off the end the clob,
            -- then the chunk is whatever we picked up at the end
            -- delimited or not.
            v_clob_index := v_clob_length + 1;
        ELSE
            -- Find the last delimiter in the chunk, mark that as the end.
            v_chunk_end := INSTR(v_chunk, p_delimiter, -1);

            IF v_chunk_end = 0
            THEN
                -- If there aren't any delimiters in a chunk
                -- then return a slightly smaller piece of the chunk
                -- to avoid accidentally splitting a delimiter
                -- that spans two chunks.
                -- This would cause part of the delimiter
                -- to be returned in the last line
                -- of one chunk, and the rest
                -- in the first line of the next chunk.
                -- To avoid the possibility,
                -- simply reduce the chunk by one line.
                v_chunk := SUBSTR(v_chunk, 1, c_chunk_limit - c_line_limit);
                v_clob_index := v_clob_index + c_chunk_limit - c_line_limit;
            ELSE
                -- If there are delimiters (this is the expected condtion)
                -- then pull the chunk up to and including the delimiter.
                v_chunk := SUBSTR(v_chunk, 1, v_chunk_end + v_delim_len - 1);
                v_clob_index := v_clob_index + v_chunk_end + v_delim_len - 1;
            END IF;
        END IF;

        --
        --  Given a varchar2 chunk split it into lines
        --
        v_chunk_index := 1;
        v_chunk_length := LENGTH(v_chunk);

        WHILE v_chunk_index <= v_chunk_length
        LOOP
            v_line_end := INSTR(v_chunk, p_delimiter, v_chunk_index);

            IF v_line_end = 0 OR (v_line_end - v_chunk_index) > c_line_limit
            THEN
                PIPE ROW (SUBSTR(v_chunk, v_chunk_index, c_line_limit));
                v_chunk_index := v_chunk_index + c_line_limit;
            ELSE
                PIPE ROW (SUBSTR(v_chunk,
                                 v_chunk_index,
                                 v_line_end - v_chunk_index));
                v_chunk_index := v_line_end + v_delim_len;
            END IF;
        END LOOP;
    END LOOP;

    RETURN;
EXCEPTION
    WHEN no_data_needed
    THEN
        NULL;
END split_clob;
/


18 thoughts on “Splitting a clob into rows

  1. No other comments here? Really? This solution seems to be the fastest clob splitter by far – if you are parsing large clobs you must try this.

    Thanks Sean!

  2. I have found many functions of this type. This is the only one that works and does not alter the data.

  3. Thanks Sean! Excellent piece of code! Great work! I was using a function which was using REGEXP_LIKE and was running for 5 hours on some table. This code reduced the time to 0,4 seconds!

  4. Dear Sean,

    It is perfect code which helped me a lot, thank you very much.

    Could I have a question?

    The result of this function is collection.
    Can I get somehow immediately a single row?
    Becase in this case I get only one row(‘Collection’), after that I have to click on this row(‘Collection’) and I get the values which I need.

    Could you help me please? I do very appreciate.

    Regards,

    1. Hi, thank you for reading and I’m glad you found the function useful.

      The result being returned is one row. It’s one row consisting of a collection.
      But, if you want the elements of the collection to be returned as rows, you will need to query the collection as if it were a table.

      There are a few ways you can do that:

      You can pass your query to the TABLE function and then query it

      I can’t see your query, so I’ll just make up my own.
      I have a table containing the texts of various books called “ETEXTS”

      Below I’m parsing the lines of the book War and Peace into rows.

      SELECT COLUMN_VALUE
      FROM TABLE((SELECT split_clob(text)
      FROM etexts
      WHERE name = ‘Project Gutenberg: War and Peace’))

      Another method is to use the join syntax

      SELECT t.COLUMN_VALUE
      FROM etexts, table(split_clob(etexts.text)) t
      WHERE name = ‘Project Gutenberg: War and Peace’

      Depending on your database version, you might not even need to use the TABLE function as it will be implied.

      SELECT t.COLUMN_VALUE
      FROM etexts, split_clob(etexts.text) t
      WHERE name = ‘Project Gutenberg: War and Peace’

      COLUMN_VALUE is the system generated name of the collection results when it is queried as a table.
      You can alias it to something more friendly just as you would with a normal table or view column.

      I hope that helps

  5. Hi Sean,

    Thank you very much for this piece of code – it has helped a great deal in past couple of years. Nevertheless I think I’ve found a small bug in it.

    Please consider following example.
    * Database charset is multi-byte (e.g. AL32UTF8) which means that the length of v_chunk (after calling dbms_lob.substr) might be less than c_chunk_limit even when the end of the clob is not reached yet.
    * p_clob length is 32725 (which is less than c_chunk_limit).
    * First v_chunk length is 32411 (which is less than p_clob length).

    In this case the condition “v_clob_index > v_clob_length – c_chunk_limit” returns true already on first time. As a result first 32411 chars are read from clob and the rest is ignored, which is not expected behaviour.

    For a fix I’d suggest changing the condition to following:
    “v_clob_index > v_clob_length – length(v_chunk)”
    I.e. instead of c_chunk_limit use the number of actually read characters.

    And as a side note – currently you have v_clob_length initialized twice, but looks like one should be enough.

    Best regards,
    Madis

    1. Thank you for reading and your comments. I will admit, I’ve written the code primarily for single byte character sets as that is what I mostly use.
      I have made updates in the code to better support multibyte characters and have indicated in the comments where the lines should be changed.

  6. Thanks a lot for this very helpful article, Sean.
    Saved me lots of time and effort today.

  7. Hi Sean,
    I am trying to get your function running but it’s still returning a clob. what am i doing wrong

    SELECT COLUMN_VALUE
    FROM TABLE((SELECT split_clob(myclob)
    FROM mytable
    where id = ‘20212022JA20051102BN’))

    1. The function, as presented in the article, can’t return a CLOB, the return type is “VCARRAY”

      CREATE OR REPLACE FUNCTION split_clob(p_clob IN CLOB, p_delimiter IN VARCHAR2 DEFAULT CHR(10))
      RETURN vcarray

      Try copying the code again and recompiling the function. The return code will be a collection of varchar2 values (vcarray)

      Note you do have to precreate the vcarray collection type before compiling the function.
      The collection’s ddl is in the code box above the function.

  8. Hello Sean, I’ve used this code to split clob of up to 45 MB in lines and it really much faster then what we used before. However, I think I found a bug. I believe the line
    v_chunk := SUBSTR(v_chunk, 1, v_chunk_end);
    (Just under raise_application_error(-20000,’No delimiters found!’);)
    should read:
    v_chunk := SUBSTR(v_chunk, 1, v_chunk_end + pl_delim_len – 1);
    That way, the whole of the delimeter will be part of the chunk.
    We use a delimiter like chr(10||chr(13) (carriage return/line feed) with a length > 1
    In that case, without the fix, part of the delimeter will become part of the piped line and with the start of a new chunk of 32K, we even missed lines.

    With the fix I believe it works fine.

    1. Thank you for pointing this out. I’ve made several changes in the code including adding better support for multibyte characters.

Comments are closed.