In my previous post I showed how to set up data loads for the TPC-H test data. All of the steps provided should work regardless of the data volume. However, for large data sets you may want to parallelize the data generation and data import in order to speed up the process.
The user, directory, permissions, target tables, constraints, and indexes from the previous post will be the same. As with the small data sets the constraints and indexes won’t be added until after the data is loaded.
As with the small sets, dbgen is used but instead of scale 4, I’ll use scale 100 (approximately 100GB of data.)
I’ll split it into 4 parts for parallelization. Thus needing to call dbgen 4 times, once for each part.
$ ./dbgen -s 100 -S 1 -C 4 $ ./dbgen -s 100 -S 2 -C 4 $ ./dbgen -s 100 -S 3 -C 4 $ ./dbgen -s 100 -S 4 -C 4
These can be run sequentially or in 4 separate sessions. If you have higher cpu capacity then I suggest running with more than 4 parts and/or run them in parallel sessions.
Note, sessions other than the first will each try to replace the nation.tbl file so you’ll be prompted if you want to replace it or not. Y or N is fine, the same data will be generated regardless.
$ ./dbgen -s 100 -S 2 -C 4 TPC-H Population Generator (Version 2.17.2) Copyright Transaction Processing Performance Council 1994 - 2010 Do you want to overwrite ./nation.tbl ? [Y/N]: n
When all 4 calls are complete you’ll have the following files.
$ ls *.tbl.* customer.tbl.1 orders.tbl.1 part.tbl.2 customer.tbl.2 orders.tbl.2 part.tbl.3 customer.tbl.3 orders.tbl.3 part.tbl.4 customer.tbl.4 orders.tbl.4 region.tbl lineitem.tbl.1 partsupp.tbl.1 supplier.tbl.1 lineitem.tbl.2 partsupp.tbl.2 supplier.tbl.2 lineitem.tbl.3 partsupp.tbl.3 supplier.tbl.3 lineitem.tbl.4 partsupp.tbl.4 supplier.tbl.4 nation.tbl part.tbl.1
The NATION and REGION tables are small and fixed in size (25 and 5 rows respectively) regardless of the scale value chosen. So they only have one file and will not be processed in parallel.
The other tables will use parallel 4 with multi-file locations in their corresponding external tables.
CREATE TABLE tpch.ext_part ( p_partkey NUMBER(10, 0), p_name VARCHAR2(55), p_mfgr CHAR(25), p_brand CHAR(10), p_type VARCHAR2(25), p_size INTEGER, p_container CHAR(10), p_retailprice NUMBER, p_comment VARCHAR2(23) ) ORGANIZATION EXTERNAL (TYPE oracle_loader DEFAULT DIRECTORY tpch_dir ACCESS PARAMETERS ( FIELDS TERMINATED BY '|' MISSING FIELD VALUES ARE NULL ) LOCATION('part.tbl*')) PARALLEL 4; CREATE TABLE tpch.ext_supplier ( s_suppkey NUMBER(10, 0), s_name CHAR(25), s_address VARCHAR2(40), s_nationkey NUMBER(10, 0), s_phone CHAR(15), s_acctbal NUMBER, s_comment VARCHAR2(101) ) ORGANIZATION EXTERNAL (TYPE oracle_loader DEFAULT DIRECTORY tpch_dir ACCESS PARAMETERS ( FIELDS TERMINATED BY '|' MISSING FIELD VALUES ARE NULL ) LOCATION('supplier.tbl*')) PARALLEL 4; CREATE TABLE tpch.ext_partsupp ( ps_partkey NUMBER(10, 0), ps_suppkey NUMBER(10, 0), ps_availqty INTEGER, ps_supplycost NUMBER, ps_comment VARCHAR2(199) ) ORGANIZATION EXTERNAL (TYPE oracle_loader DEFAULT DIRECTORY tpch_dir ACCESS PARAMETERS ( FIELDS TERMINATED BY '|' MISSING FIELD VALUES ARE NULL ) LOCATION('partsupp.tbl*')) PARALLEL 4; CREATE TABLE tpch.ext_customer ( c_custkey NUMBER(10, 0), c_name VARCHAR2(25), c_address VARCHAR2(40), c_nationkey NUMBER(10, 0), c_phone CHAR(15), c_acctbal NUMBER, c_mktsegment CHAR(10), c_comment VARCHAR2(117) ) ORGANIZATION EXTERNAL (TYPE oracle_loader DEFAULT DIRECTORY tpch_dir ACCESS PARAMETERS ( FIELDS TERMINATED BY '|' MISSING FIELD VALUES ARE NULL ) LOCATION('customer.tbl*')) PARALLEL 4; -- read date values as yyyy-mm-dd text CREATE TABLE tpch.ext_orders ( o_orderkey NUMBER(10, 0), o_custkey NUMBER(10, 0), o_orderstatus CHAR(1), o_totalprice NUMBER, o_orderdate CHAR(10), o_orderpriority CHAR(15), o_clerk CHAR(15), o_shippriority INTEGER, o_comment VARCHAR2(79) ) ORGANIZATION EXTERNAL (TYPE oracle_loader DEFAULT DIRECTORY tpch_dir ACCESS PARAMETERS ( FIELDS TERMINATED BY '|' MISSING FIELD VALUES ARE NULL ) LOCATION('orders.tbl*')) PARALLEL 4; -- read date values as yyyy-mm-dd text CREATE TABLE tpch.ext_lineitem ( l_orderkey NUMBER(10, 0), l_partkey NUMBER(10, 0), l_suppkey NUMBER(10, 0), l_linenumber INTEGER, l_quantity NUMBER, l_extendedprice NUMBER, l_discount NUMBER, l_tax NUMBER, l_returnflag CHAR(1), l_linestatus CHAR(1), l_shipdate CHAR(10), l_commitdate CHAR(10), l_receiptdate CHAR(10), l_shipinstruct CHAR(25), l_shipmode CHAR(10), l_comment VARCHAR2(44) ) ORGANIZATION EXTERNAL (TYPE oracle_loader DEFAULT DIRECTORY tpch_dir ACCESS PARAMETERS ( FIELDS TERMINATED BY '|' MISSING FIELD VALUES ARE NULL ) LOCATION('lineitem.tbl*')) PARALLEL 4; CREATE TABLE tpch.ext_nation ( n_nationkey NUMBER(10, 0), n_name CHAR(25), n_regionkey NUMBER(10, 0), n_comment VARCHAR(152) ) ORGANIZATION EXTERNAL (TYPE oracle_loader DEFAULT DIRECTORY tpch_dir ACCESS PARAMETERS ( FIELDS TERMINATED BY '|' MISSING FIELD VALUES ARE NULL ) LOCATION('nation.tbl')); CREATE TABLE tpch.ext_region ( r_regionkey NUMBER(10, 0), r_name CHAR(25), r_comment VARCHAR(152) ) ORGANIZATION EXTERNAL (TYPE oracle_loader DEFAULT DIRECTORY tpch_dir ACCESS PARAMETERS ( FIELDS TERMINATED BY '|' MISSING FIELD VALUES ARE NULL ) LOCATION('region.tbl'));
In the previous post the target tables weren’t created with a parallel clause. So, add that now.
ALTER TABLE tpch.part PARALLEL 4; ALTER TABLE tpch.supplier PARALLEL 4; ALTER TABLE tpch.partsupp PARALLEL 4; ALTER TABLE tpch.customer PARALLEL 4; ALTER TABLE tpch.orders PARALLEL 4; ALTER TABLE tpch.lineitem PARALLEL 4;
Clear out any test data from prior runs and load the new data. Use parallel dml to ensure maximum speed on load.
TRUNCATE TABLE tpch.part; TRUNCATE TABLE tpch.supplier; TRUNCATE TABLE tpch.partsupp; TRUNCATE TABLE tpch.customer; TRUNCATE TABLE tpch.orders; TRUNCATE TABLE tpch.lineitem; TRUNCATE TABLE tpch.nation; TRUNCATE TABLE tpch.region; ALTER SESSION SET nls_date_format='YYYY-MM-DD'; ALTER SESSION ENABLE PARALLEL DML; INSERT /*+ APPEND */ INTO tpch.part SELECT * FROM tpch.ext_part; INSERT /*+ APPEND */ INTO tpch.supplier SELECT * FROM tpch.ext_supplier; INSERT /*+ APPEND */ INTO tpch.partsupp SELECT * FROM tpch.ext_partsupp; INSERT /*+ APPEND */ INTO tpch.customer SELECT * FROM tpch.ext_customer; INSERT /*+ APPEND */ INTO tpch.orders SELECT * FROM tpch.ext_orders; INSERT /*+ APPEND */ INTO tpch.lineitem SELECT * FROM tpch.ext_lineitem; INSERT /*+ APPEND */ INTO tpch.nation SELECT * FROM tpch.ext_nation; INSERT /*+ APPEND */ INTO tpch.region SELECT * FROM tpch.ext_region;
Like the dbgen step you can run these sequentially or in parallel. Because the selects and inserts will all be parallel, I chose to run them sequentially. If you have the hardware to run in parallel, you can do so, or simply increase the number of input files and the parallel parameter on the tables to match your available cpu capacity.
Finally, add the constraints and indexes as shown in the previous post and you’re done.