Close

Using Object Types: User-Defined Aggregates – CSV aggregation and uniqueness

Previously, I’ve shown how to use a user-defined aggregate to process numeric values; but they can be used for text as well.

Recently I was tasked with importing some denormalized text data and aggregating into a consistent form.

The source data consisted of many rows holding identifiers, multiple comma-separated values (CSV,) and several other fields. For the purposes of this article I’m only concerned with the identifiers and the CSV text.

So, after importing the source I ended up with raw data in this form:

CREATE TABLE csv_example
(
    id      NUMBER,
    csv1    VARCHAR2(100),
    csv2    VARCHAR2(100)
);

INSERT INTO csv_example(id, csv1, csv2) VALUES (1, 'a', 'a,c');
INSERT INTO csv_example(id, csv1, csv2) VALUES (1, 'a,b,c,d', 'e,f');
INSERT INTO csv_example(id, csv1, csv2) VALUES (1, 'd,a', 'a,d');
INSERT INTO csv_example(id, csv1, csv2) VALUES (2, 'a,c', 'b');
INSERT INTO csv_example(id, csv1, csv2) VALUES (2, 'b,c', 'd');
INSERT INTO csv_example(id, csv1, csv2) VALUES (2, 'e,a,c', 'e,c');
INSERT INTO csv_example(id, csv1, csv2) VALUES (3, 'x', 'x');
INSERT INTO csv_example(id, csv1, csv2) VALUES (3, 'y', NULL);
INSERT INTO csv_example(id, csv1, csv2) VALUES (3, 'y', 'y');
INSERT INTO csv_example(id, csv1, csv2) VALUES (3, 'z', NULL);

COMMIT;

So there are rows with duplicate values, rows with overlapping csv sets, rows with single values, and rows with multiple values. Then desired results would be one row for each id, where the distinct sub-values of the multiple text columns aggregated into a single csv.

ID  CSV
--  ---------
1   a,b,c,d,e,f
2   a,b,c,d,e
3   x,y,z

One way would be to use a function such as str2tbl on a concatenated list of both values.

SELECT id, LISTAGG(val, ',') WITHIN GROUP (ORDER BY val) csv
  FROM (SELECT DISTINCT id, COLUMN_VALUE val
          FROM (SELECT id, COLUMN_VALUE
                  FROM csv_example, TABLE(str2tbl(csv1||','||csv2))))
GROUP BY id

For a small number of values that are themselves small, this works fine; but as the number of columns increases, and the number of values in each column increases, and the size of the individual values increases, this method breaks down. Eventually we’ll hit the upper limit of a varchar2 when concatenating them to build the initial collection from str2tbl.

To address that we can create a collection for each column and use the MULTISET UNION operator to combine them into one large collection. This method allows for any number of columns with contents of any size. The only limiting factor then is the size of the distinct list of final values. Assuming the final delimited string will fit within a varchar2 value, then we can use the native LISTAGG function.

SELECT id, LISTAGG(val, ',') WITHIN GROUP (ORDER BY val) csv
  FROM (SELECT DISTINCT id, COLUMN_VALUE val
          FROM (SELECT id, str2tbl(csv1) MULTISET UNION str2tbl(csv2) coll
                  FROM csv_example),
               TABLE(coll))
GROUP BY id

If the resulting value will be too big for a varchar2 then we can use tbl2clob to create a clob value from the resulting aggregated set of values.

SELECT id, tbl2clob(SET(CAST(COLLECT(COLUMN_VALUE ORDER BY COLUMN_VALUE) AS vctab)), ',') val
  FROM (SELECT id, str2tbl(csv1) MULTISET UNION str2tbl(csv2) coll
          FROM csv_example),
       TABLE(coll)
GROUP BY id;

While all of those work, are reasonably efficient and not overly complex, I thought it would be even easier if I could simply aggregate the csv columns more directly. To that end I have constructed a user-defined aggregate that will accept a collection of columns, each of which is a comma-separated value list. The aggregation will combine the contents of all of the columns into a single collection of distinct elements, sort them, and then return a single CLOB value.
If a varchar2 result would suffice it would simply be a matter of change clob to varchar2 below and swapping the tbl2clob function with the tbl2str function.

CREATE OR REPLACE TYPE csvtabagg_obj AS OBJECT
(
    v_values vctab,
    STATIC FUNCTION odciaggregateinitialize(ctx IN OUT csvtabagg_obj)
        RETURN NUMBER,
    MEMBER FUNCTION odciaggregateiterate(self IN OUT csvtabagg_obj, p_csvtab IN vctab)
        RETURN NUMBER,
    MEMBER FUNCTION odciaggregatemerge(self IN OUT csvtabagg_obj, ctx2 IN csvtabagg_obj)
        RETURN NUMBER,
    MEMBER FUNCTION odciaggregateterminate(self IN csvtabagg_obj, returnvalue OUT CLOB, flags IN NUMBER)
        RETURN NUMBER
);
/

CREATE OR REPLACE TYPE BODY csvtabagg_obj
IS
    STATIC FUNCTION odciaggregateinitialize(ctx IN OUT csvtabagg_obj)
        RETURN NUMBER
    IS
    BEGIN
        ctx := csvtabagg_obj(NULL);
        RETURN odciconst.success;
    END odciaggregateinitialize;

    MEMBER FUNCTION odciaggregateiterate(self IN OUT csvtabagg_obj, p_csvtab IN vctab)
        RETURN NUMBER
    IS
        v_length   PLS_INTEGER;
        v_start    PLS_INTEGER;
        v_index    PLS_INTEGER;
    BEGIN
        IF v_values IS NULL
        THEN
            v_values := vctab();
        END IF;

        FOR i IN 1 .. p_csvtab.COUNT
        LOOP
            v_length := LENGTH(p_csvtab(i));
            v_start := 1;

            WHILE (v_start <= v_length)
            LOOP
                v_index := INSTR(p_csvtab(i), ',', v_start);

                v_values.EXTEND();

                IF v_index = 0
                THEN
                    v_values(v_values.COUNT) := SUBSTR(p_csvtab(i), v_start);
                    v_start := v_length + 1;
                ELSE
                    v_values(v_values.COUNT) := SUBSTR(p_csvtab(i), v_start, v_index - v_start);
                    v_start := v_index + 1;
                END IF;
            END LOOP;
        END LOOP;

        RETURN odciconst.success;
    END odciaggregateiterate;

    MEMBER FUNCTION odciaggregatemerge(self IN OUT csvtabagg_obj, ctx2 IN csvtabagg_obj)
        RETURN NUMBER
    IS
    BEGIN
        IF v_values IS NULL
        THEN
            v_values := ctx2.v_values;
        ELSE
            v_values := v_values MULTISET UNION ctx2.v_values;
        END IF;

        RETURN odciconst.success;
    END odciaggregatemerge;

    MEMBER FUNCTION odciaggregateterminate(self IN csvtabagg_obj, returnvalue OUT CLOB, flags IN NUMBER)
        RETURN NUMBER
    IS
    BEGIN
        SELECT tbl2clob(CAST(COLLECT(COLUMN_VALUE ORDER BY COLUMN_VALUE) AS vctab), ',')
          INTO returnvalue
          FROM TABLE(SET(v_values));

        RETURN odciconst.success;
    END odciaggregateterminate;
END;
/

CREATE OR REPLACE FUNCTION csvagg2clob(p_csv IN vctab)
    RETURN CLOB
    PARALLEL_ENABLE
    AGGREGATE USING csvtabagg_obj;
/

With the new aggregate type and function, the query is reduced to a simple function call and group by, easily extensible to additional columns simply by adding them to the vctab collection in the aggregate parameter.

SQL> SELECT id, csvagg2clob(vctab(csv1, csv2)) csv
  2    FROM csv_example
  3  GROUP BY id;

        ID CSV
---------- ---------------------------------------
         1 a,b,c,d,e,f
         2 a,b,c,d,e
         3 x,y,z

I thought this was an interesting request when it was first presented. I hope the various options above provide some inspiration in your own parsing and reporting tasks.