While Oracle has their own SCOTT, SH, OE, HR, etc. sample schemas, it’s often useful to have the ability to scale data volumes for different experiments. The TPC-H schema and sample data sets provide a convenient means of doing so. They are especially helpful in scenarios such as this, blogging, where readers may have a database but not the same data. Setting up the TPC-H data is fairly simple.
First, go to the TPC home, click Downloads and select the tools zip file. You may also want to read the pdf file documenting the schema and the data scales. Unzip the file somewhere on your linux system. As of the time of this writing 2.17.2 is the current version. In the future the steps may change slightly but should be fairly similar to what is described below.
When you unzip, a directory 2.17.2 will be created. Change to that directory, make a copy of the makefile template and edit it.
$ cd 2.17.2/dbgen $ cp makefile.suite makefile $ vi makefile
Within the makefile, change the following lines
CC = gcc DATABASE= ORACLE MACHINE = LINUX WORKLOAD = TPCH
Then make will create the dbgen utility.
$ make gcc -g -DDBNAME=\"dss\" -DLINUX -DORACLE -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64 -c -o build.o build.c gcc -g -DDBNAME=\"dss\" -DLINUX -DORACLE -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64 -c -o driver.o driver.c gcc -g -DDBNAME=\"dss\" -DLINUX -DORACLE -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64 -c -o bm_utils.o bm_utils.c gcc -g -DDBNAME=\"dss\" -DLINUX -DORACLE -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64 -c -o rnd.o rnd.c gcc -g -DDBNAME=\"dss\" -DLINUX -DORACLE -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64 -c -o print.o print.c gcc -g -DDBNAME=\"dss\" -DLINUX -DORACLE -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64 -c -o load_stub.o load_stub.c gcc -g -DDBNAME=\"dss\" -DLINUX -DORACLE -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64 -c -o bcd2.o bcd2.c gcc -g -DDBNAME=\"dss\" -DLINUX -DORACLE -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64 -c -o speed_seed.o speed_seed.c gcc -g -DDBNAME=\"dss\" -DLINUX -DORACLE -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64 -c -o text.o text.c gcc -g -DDBNAME=\"dss\" -DLINUX -DORACLE -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64 -c -o permute.o permute.c gcc -g -DDBNAME=\"dss\" -DLINUX -DORACLE -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64 -c -o rng64.o rng64.c gcc -g -DDBNAME=\"dss\" -DLINUX -DORACLE -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64 -O -o dbgen build.o driver.o bm_utils.o rnd.o print.o load_stub.o bcd2.o speed_seed.o text.o permute.o rng64.o -lm gcc -g -DDBNAME=\"dss\" -DLINUX -DORACLE -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64 -c -o qgen.o qgen.c gcc -g -DDBNAME=\"dss\" -DLINUX -DORACLE -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64 -c -o varsub.o varsub.c gcc -g -DDBNAME=\"dss\" -DLINUX -DORACLE -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64 -O -o qgen build.o bm_utils.o qgen.o rnd.o varsub.o text.o bcd2.o permute.o speed_seed.o rng64.o -lm
We’ll generate a small, approximately 4GB, set of test data and place it in a directory we’ll use later for the upload into the Oracle tables.
$ ./dbgen -s 4 TPC-H Population Generator (Version 2.17.2) Copyright Transaction Processing Performance Council 1994 - 2010 $ ls *.tbl customer.tbl lineitem.tbl nation.tbl orders.tbl partsupp.tbl part.tbl region.tbl supplier.tbl $ mv *.tbl /home/oracle/tpch
Within the database, set up the tpch schema. All tables will be created according to the layouts described in section 1.4.1 of the TPC-H standard specification. For the purposes of data import, a set of external tables will also be created. These are not part of TPC-H itself and may be left in place or dropped after data load is complete. Also, per section 1.4.2, constraints are optional. I’m including the allowable primary key, foreign key, not null, and check constraints described in the 1.4.2 subsections. Other than the default indexes created to support the primary key constraints, no other indexes are included in the steps below. For efficiency of data loading, the constraints and indexes will be added after the data loading is complete.
CREATE USER tpch IDENTIFIED BY tpch; GRANT CREATE SESSION, CREATE TABLE, UNLIMITED TABLESPACE TO tpch; CREATE OR REPLACE DIRECTORY tpch_dir AS '/home/oracle/tpch'; GRANT READ ON DIRECTORY tpch_dir TO tpch; -- 1.4.1 -- per 1.4.2.1 all table columns may be defined NOT NULL CREATE TABLE tpch.ext_part ( p_partkey NUMBER(10, 0), p_name VARCHAR2(55), p_mfgr CHAR(25), p_brand CHAR(10), p_type VARCHAR2(25), p_size INTEGER, p_container CHAR(10), p_retailprice NUMBER, p_comment VARCHAR2(23) ) ORGANIZATION EXTERNAL (TYPE oracle_loader DEFAULT DIRECTORY tpch_dir ACCESS PARAMETERS ( FIELDS TERMINATED BY '|' MISSING FIELD VALUES ARE NULL ) LOCATION('part.tbl')); CREATE TABLE tpch.part ( p_partkey NUMBER(10, 0) NOT NULL, p_name VARCHAR2(55) NOT NULL, p_mfgr CHAR(25) NOT NULL, p_brand CHAR(10) NOT NULL, p_type VARCHAR2(25) NOT NULL, p_size INTEGER NOT NULL, p_container CHAR(10) NOT NULL, p_retailprice NUMBER NOT NULL, p_comment VARCHAR2(23) NOT NULL ); CREATE TABLE tpch.ext_supplier ( s_suppkey NUMBER(10, 0), s_name CHAR(25), s_address VARCHAR2(40), s_nationkey NUMBER(10, 0), s_phone CHAR(15), s_acctbal NUMBER, s_comment VARCHAR2(101) ) ORGANIZATION EXTERNAL (TYPE oracle_loader DEFAULT DIRECTORY tpch_dir ACCESS PARAMETERS ( FIELDS TERMINATED BY '|' MISSING FIELD VALUES ARE NULL ) LOCATION('supplier.tbl')); CREATE TABLE tpch.supplier ( s_suppkey NUMBER(10, 0) NOT NULL, s_name CHAR(25) NOT NULL, s_address VARCHAR2(40) NOT NULL, s_nationkey NUMBER(10, 0) NOT NULL, s_phone CHAR(15) NOT NULL, s_acctbal NUMBER NOT NULL, s_comment VARCHAR2(101) NOT NULL ); CREATE TABLE tpch.ext_partsupp ( ps_partkey NUMBER(10, 0), ps_suppkey NUMBER(10, 0), ps_availqty INTEGER, ps_supplycost NUMBER, ps_comment VARCHAR2(199) ) ORGANIZATION EXTERNAL (TYPE oracle_loader DEFAULT DIRECTORY tpch_dir ACCESS PARAMETERS ( FIELDS TERMINATED BY '|' MISSING FIELD VALUES ARE NULL ) LOCATION('partsupp.tbl')); CREATE TABLE tpch.partsupp ( ps_partkey NUMBER(10, 0) NOT NULL, ps_suppkey NUMBER(10, 0) NOT NULL, ps_availqty INTEGER NOT NULL, ps_supplycost NUMBER NOT NULL, ps_comment VARCHAR2(199) NOT NULL ); CREATE TABLE tpch.ext_customer ( c_custkey NUMBER(10, 0), c_name VARCHAR2(25), c_address VARCHAR2(40), c_nationkey NUMBER(10, 0), c_phone CHAR(15), c_acctbal NUMBER, c_mktsegment CHAR(10), c_comment VARCHAR2(117) ) ORGANIZATION EXTERNAL (TYPE oracle_loader DEFAULT DIRECTORY tpch_dir ACCESS PARAMETERS ( FIELDS TERMINATED BY '|' MISSING FIELD VALUES ARE NULL ) LOCATION('customer.tbl')); CREATE TABLE tpch.customer ( c_custkey NUMBER(10, 0) NOT NULL, c_name VARCHAR2(25) NOT NULL, c_address VARCHAR2(40) NOT NULL, c_nationkey NUMBER(10, 0) NOT NULL, c_phone CHAR(15) NOT NULL, c_acctbal NUMBER NOT NULL, c_mktsegment CHAR(10) NOT NULL, c_comment VARCHAR2(117) NOT NULL ); -- read date values as yyyy-mm-dd text CREATE TABLE tpch.ext_orders ( o_orderkey NUMBER(10, 0), o_custkey NUMBER(10, 0), o_orderstatus CHAR(1), o_totalprice NUMBER, o_orderdate CHAR(10), o_orderpriority CHAR(15), o_clerk CHAR(15), o_shippriority INTEGER, o_comment VARCHAR2(79) ) ORGANIZATION EXTERNAL (TYPE oracle_loader DEFAULT DIRECTORY tpch_dir ACCESS PARAMETERS ( FIELDS TERMINATED BY '|' MISSING FIELD VALUES ARE NULL ) LOCATION('orders.tbl')); CREATE TABLE tpch.orders ( o_orderkey NUMBER(10, 0) NOT NULL, o_custkey NUMBER(10, 0) NOT NULL, o_orderstatus CHAR(1) NOT NULL, o_totalprice NUMBER NOT NULL, o_orderdate DATE NOT NULL, o_orderpriority CHAR(15) NOT NULL, o_clerk CHAR(15) NOT NULL, o_shippriority INTEGER NOT NULL, o_comment VARCHAR2(79) NOT NULL ); -- read date values as yyyy-mm-dd text CREATE TABLE tpch.ext_lineitem ( l_orderkey NUMBER(10, 0), l_partkey NUMBER(10, 0), l_suppkey NUMBER(10, 0), l_linenumber INTEGER, l_quantity NUMBER, l_extendedprice NUMBER, l_discount NUMBER, l_tax NUMBER, l_returnflag CHAR(1), l_linestatus CHAR(1), l_shipdate CHAR(10), l_commitdate CHAR(10), l_receiptdate CHAR(10), l_shipinstruct CHAR(25), l_shipmode CHAR(10), l_comment VARCHAR2(44) ) ORGANIZATION EXTERNAL (TYPE oracle_loader DEFAULT DIRECTORY tpch_dir ACCESS PARAMETERS ( FIELDS TERMINATED BY '|' MISSING FIELD VALUES ARE NULL ) LOCATION('lineitem.tbl')); CREATE TABLE tpch.lineitem ( l_orderkey NUMBER(10, 0), l_partkey NUMBER(10, 0), l_suppkey NUMBER(10, 0), l_linenumber INTEGER, l_quantity NUMBER, l_extendedprice NUMBER, l_discount NUMBER, l_tax NUMBER, l_returnflag CHAR(1), l_linestatus CHAR(1), l_shipdate DATE, l_commitdate DATE, l_receiptdate DATE, l_shipinstruct CHAR(25), l_shipmode CHAR(10), l_comment VARCHAR2(44) ); CREATE TABLE tpch.ext_nation ( n_nationkey NUMBER(10, 0), n_name CHAR(25), n_regionkey NUMBER(10, 0), n_comment VARCHAR(152) ) ORGANIZATION EXTERNAL (TYPE oracle_loader DEFAULT DIRECTORY tpch_dir ACCESS PARAMETERS ( FIELDS TERMINATED BY '|' MISSING FIELD VALUES ARE NULL ) LOCATION('nation.tbl')); CREATE TABLE tpch.nation ( n_nationkey NUMBER(10, 0), n_name CHAR(25), n_regionkey NUMBER(10, 0), n_comment VARCHAR(152) ); CREATE TABLE tpch.ext_region ( r_regionkey NUMBER(10, 0), r_name CHAR(25), r_comment VARCHAR(152) ) ORGANIZATION EXTERNAL (TYPE oracle_loader DEFAULT DIRECTORY tpch_dir ACCESS PARAMETERS ( FIELDS TERMINATED BY '|' MISSING FIELD VALUES ARE NULL ) LOCATION('region.tbl')); CREATE TABLE tpch.region ( r_regionkey NUMBER(10, 0), r_name CHAR(25), r_comment VARCHAR(152) );
Now load the data. The external tables read the date values as text, so we must set the NLS_DATE_FORMAT prior to loading so the text will be parsed correctly, or embed the formatting within each sql statement. For the small data set in this example, the steps described here should complete within a few minutes. As you scale up into larger volumes, these steps are still valid; but you may want to split the loads into separate steps and alter the external to read multiple files in parallel and use parallel dml on insert in order to speed up the process. The truncate lines aren’t necessary for the first time data load; but are included for future reloads of the dbgen data with other scaling.
TRUNCATE TABLE tpch.part; TRUNCATE TABLE tpch.supplier; TRUNCATE TABLE tpch.partsupp; TRUNCATE TABLE tpch.customer; TRUNCATE TABLE tpch.orders; TRUNCATE TABLE tpch.lineitem; TRUNCATE TABLE tpch.nation; TRUNCATE TABLE tpch.region; ALTER SESSION SET nls_date_format='YYYY-MM-DD'; INSERT /*+ APPEND */ INTO tpch.part SELECT * FROM tpch.ext_part; INSERT /*+ APPEND */ INTO tpch.supplier SELECT * FROM tpch.ext_supplier; INSERT /*+ APPEND */ INTO tpch.partsupp SELECT * FROM tpch.ext_partsupp; INSERT /*+ APPEND */ INTO tpch.customer SELECT * FROM tpch.ext_customer; INSERT /*+ APPEND */ INTO tpch.orders SELECT * FROM tpch.ext_orders; INSERT /*+ APPEND */ INTO tpch.lineitem SELECT * FROM tpch.ext_lineitem; INSERT /*+ APPEND */ INTO tpch.nation SELECT * FROM tpch.ext_nation; INSERT /*+ APPEND */ INTO tpch.region SELECT * FROM tpch.ext_region;
And finally, add the constraints and indexes.
ALTER TABLE tpch.part ADD CONSTRAINT pk_part PRIMARY KEY(p_partkey); ALTER TABLE tpch.supplier ADD CONSTRAINT pk_supplier PRIMARY KEY(s_suppkey); ALTER TABLE tpch.partsupp ADD CONSTRAINT pk_partsupp PRIMARY KEY(ps_partkey, ps_suppkey); ALTER TABLE tpch.customer ADD CONSTRAINT pk_customer PRIMARY KEY(c_custkey); ALTER TABLE tpch.orders ADD CONSTRAINT pk_orders PRIMARY KEY(o_orderkey); ALTER TABLE tpch.lineitem ADD CONSTRAINT pk_lineitem PRIMARY KEY(l_linenumber, l_orderkey); ALTER TABLE tpch.nation ADD CONSTRAINT pk_nation PRIMARY KEY(n_nationkey); ALTER TABLE tpch.region ADD CONSTRAINT pk_region PRIMARY KEY(r_regionkey); -- 1.4.2.3 ALTER TABLE tpch.partsupp ADD CONSTRAINT fk_partsupp_part FOREIGN KEY(ps_partkey) REFERENCES tpch.part(p_partkey); ALTER TABLE tpch.partsupp ADD CONSTRAINT fk_partsupp_supplier FOREIGN KEY(ps_suppkey) REFERENCES tpch.supplier(s_suppkey); ALTER TABLE tpch.customer ADD CONSTRAINT fk_customer_nation FOREIGN KEY(c_nationkey) REFERENCES tpch.nation(n_nationkey); ALTER TABLE tpch.orders ADD CONSTRAINT fk_orders_customer FOREIGN KEY(o_custkey) REFERENCES tpch.customer(c_custkey); ALTER TABLE tpch.lineitem ADD CONSTRAINT fk_lineitem_order FOREIGN KEY(l_orderkey) REFERENCES tpch.orders(o_orderkey); ALTER TABLE tpch.lineitem ADD CONSTRAINT fk_lineitem_part FOREIGN KEY(l_partkey) REFERENCES tpch.part(p_partkey); ALTER TABLE tpch.lineitem ADD CONSTRAINT fk_lineitem_supplier FOREIGN KEY(l_suppkey) REFERENCES tpch.supplier(s_suppkey); ALTER TABLE tpch.lineitem ADD CONSTRAINT fk_lineitem_partsupp FOREIGN KEY(l_partkey, l_suppkey) REFERENCES tpch.partsupp(ps_partkey, ps_suppkey); -- 1.4.2.4 - 1 ALTER TABLE tpch.part ADD CONSTRAINT chk_part_partkey CHECK(p_partkey >= 0); ALTER TABLE tpch.supplier ADD CONSTRAINT chk_supplier_suppkey CHECK(s_suppkey >= 0); ALTER TABLE tpch.customer ADD CONSTRAINT chk_customer_custkey CHECK(c_custkey >= 0); ALTER TABLE tpch.partsupp ADD CONSTRAINT chk_partsupp_partkey CHECK(ps_partkey >= 0); ALTER TABLE tpch.region ADD CONSTRAINT chk_region_regionkey CHECK(r_regionkey >= 0); ALTER TABLE tpch.nation ADD CONSTRAINT chk_nation_nationkey CHECK(n_nationkey >= 0); -- 1.4.2.4 - 2 ALTER TABLE tpch.part ADD CONSTRAINT chk_part_size CHECK(p_size >= 0); ALTER TABLE tpch.part ADD CONSTRAINT chk_part_retailprice CHECK(p_retailprice >= 0); ALTER TABLE tpch.partsupp ADD CONSTRAINT chk_partsupp_availqty CHECK(ps_availqty >= 0); ALTER TABLE tpch.partsupp ADD CONSTRAINT chk_partsupp_supplycost CHECK(ps_supplycost >= 0); ALTER TABLE tpch.orders ADD CONSTRAINT chk_orders_totalprice CHECK(o_totalprice >= 0); ALTER TABLE tpch.lineitem ADD CONSTRAINT chk_lineitem_quantity CHECK(l_quantity >= 0); ALTER TABLE tpch.lineitem ADD CONSTRAINT chk_lineitem_extendedprice CHECK(l_extendedprice >= 0); ALTER TABLE tpch.lineitem ADD CONSTRAINT chk_lineitem_tax CHECK(l_tax >= 0); -- 1.4.2.4 - 3 ALTER TABLE tpch.lineitem ADD CONSTRAINT chk_lineitem_discount CHECK(l_discount >= 0.00 AND l_discount <= 1.00); -- 1.4.2.4 - 4 ALTER TABLE tpch.lineitem ADD CONSTRAINT chk_lineitem_ship_rcpt CHECK(l_shipdate <= l_receiptdate);
And that's it, you should now have a complete TPC-H scale-4 data set to complete either the TPC-H suite of test queries, oracle labs, or run your own tests.
Enjoy!