
Note: All posts are based on a practical approach and avoid lengthy theory. Everything has been tested on development servers. Please do not try any post on production servers until you are sure.

Tuesday, May 30, 2017

Creating External Table for HDFS using Oracle Connector for Hadoop (OSCH)


Introduction


Oracle Big Data Connectors facilitate access to data stored in an Apache Hadoop cluster. They can be licensed for use on either Oracle Big Data Appliance or a Hadoop cluster running on commodity hardware. Several connectors are available; for the purpose of this post we will work with Oracle SQL Connector for Hadoop Distributed File System.

Oracle SQL Connector for Hadoop Distributed File System (OSCH) enables an Oracle external table to access data stored in Hadoop Distributed File System (HDFS) files or in an Apache Hive table. The data can remain in HDFS or the Hive table, or it can be loaded into an Oracle database. OSCH is a command-line utility that accepts the generic command-line arguments supported by the org.apache.hadoop.util.Tool interface. It also provides a preprocessor for Oracle external tables.


You install OSCH on the system where Oracle Database runs, because it works as a Hadoop client. If Hive tables are used as the data source, then you must also install and run Oracle SQL Connector for HDFS on a Hadoop client where users access Hive.

For Oracle RAC systems including Oracle Exadata Database Machine, you must install and configure Oracle SQL Connector for HDFS using identical paths on all systems running Oracle instances.


OSCH uses location files (specified in the location clause of the external table) that contain only the Universal Resource Identifiers (URIs) of the data files. A data file contains the data stored in HDFS. The format of a location file name is osch-timestamp-number-n.

You can enable parallel processing with external tables by specifying multiple files in the LOCATION clause of the external table. The number of files determines the number of child processes started by the external table during a table read, which is known as the degree of parallelism (DOP).

Software Requirements

OSCH requires the following software.

On the Hadoop cluster:
- Cloudera's Distribution including Apache Hadoop version 4 (CDH4) or version 5 (CDH5), Apache Hadoop 1.0 (formerly 0.20.2), or Apache Hadoop 1.1.1
- Java Development Kit (JDK) 1.6_08 or later. Consult the distributor of your Hadoop software (Cloudera or Apache) for the recommended version.
- Hive 0.7.1, 0.8.1, 0.9.0, 0.10.0, or 0.12.0 (required for Hive table access, otherwise optional)

On the Oracle Database system and Hadoop client systems:
- Oracle Database 12c, Oracle Database 11g release 2 (11.2.0.2 or later), or Oracle Database 10g release 2 (10.2.0.5) for Linux
- The same version of Hadoop as your Hadoop cluster: CDH4, CDH5, Apache Hadoop 1.0, or Apache Hadoop 1.1.1
- The same version of JDK as your Hadoop cluster

Prerequisites:

Hadoop Cluster is already up and running 
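
As a quick sanity check (not part of the original steps), confirm from the machine where you will run OSCH that the HDFS client can reach the cluster.

[hdpclient@en01 ~]$ hdfs version
[hdpclient@en01 ~]$ hdfs dfs -ls hdfs://nn01:9000/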

Steps to use OSCH

1- Object Preparation at Oracle Side

Create related tablespace and user for OSCH

SQL> CREATE TABLESPACE hdfsdata datafile 'hdfsdata.ora' size 100M;

Tablespace created.

SQL> CREATE USER hdfsuser IDENTIFIED BY hdfsuser    DEFAULT TABLESPACE hdfsdata    QUOTA UNLIMITED ON hdfsdata;

User created.


SQL> GRANT CREATE SESSION, CREATE TABLE, CREATE VIEW TO hdfsuser;
Grant succeeded.
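
As a quick check (not in the original post), confirm the new user can connect and sees its default tablespace before going further.

SQL> conn hdfsuser/hdfsuser
SQL> select username, default_tablespace from user_users;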

2- Directory Preparation (OS Level)

Create the appropriate folders for OSCH on the Hadoop client/Oracle server machine.

[root@en01 ~]# mkdir -p /usr/hadoopsw/osch
[root@en01 ~]# chmod -R 777 /usr/hadoopsw/osch


3- Download and Extract OSCH

Download the latest OSCH zip file (oraosch-3.6.0.zip) from the URL below and extract it into the /usr/hadoopsw/osch directory.

http://www.oracle.com/technetwork/database/database-technologies/bdc/big-data-connectors/downloads/index.html

[root@en01 ~]# cd /usr/hadoopsw/osch/

[root@en01 osch]# ll

total 6084
-rw-r--r--. 1 root root 6229854 May 28 2017 oraosch-3.6.0.zip


[root@en01 osch]# unzip oraosch-3.6.0.zip

Archive: oraosch-3.6.0.zip
extracting: orahdfs-3.6.0.zip
inflating: README.txt


[root@en01 osch]# ll
total 12176
-r--r--r--. 1 root root 6227413 Aug 30  2016 orahdfs-3.6.0.zip
-rw-r--r--. 1 root root 6229854 May 28  2017 oraosch-3.6.0.zip
-r-xr-xr-x. 1 root root    6034 Aug 30  2016 README.txt

[root@en01 osch]# unzip orahdfs-3.6.0.zip

Archive:  orahdfs-3.6.0.zip
   creating: orahdfs-3.6.0/
   creating: orahdfs-3.6.0/examples/
   creating: orahdfs-3.6.0/examples/sql/
  inflating: orahdfs-3.6.0/examples/sql/mkhive_unionall_view.sql
   creating: orahdfs-3.6.0/jlib/
  inflating: orahdfs-3.6.0/jlib/osdt_cert.jar
  inflating: orahdfs-3.6.0/jlib/oraloader.jar
  inflating: orahdfs-3.6.0/jlib/oraclepki.jar
  inflating: orahdfs-3.6.0/jlib/ojdbc7.jar
  inflating: orahdfs-3.6.0/jlib/osdt_core.jar
  inflating: orahdfs-3.6.0/jlib/ora-hadoop-common.jar
  inflating: orahdfs-3.6.0/jlib/orahdfs.jar
  inflating: orahdfs-3.6.0/jlib/orai18n.jar
   creating: orahdfs-3.6.0/log/
   creating: orahdfs-3.6.0/doc/
  inflating: orahdfs-3.6.0/doc/README.txt
   creating: orahdfs-3.6.0/bin/
  inflating: orahdfs-3.6.0/bin/hdfs_stream
  inflating: orahdfs-3.6.0/bin/hdfs_stream.cmd


4- Setting Environment Variables 

Set the environment variables in .bash_profile. In my setup the Hadoop client, Hive, Pig, and the Oracle server are all on one machine, so you will see all the related environment variables in the same .bash_profile.

vi ~/.bash_profile

###################Inside .bash_profile [hdpclient]########################
## JAVA env variables
export JAVA_HOME=/usr/java/default
export PATH=$PATH:$JAVA_HOME/bin
export CLASSPATH=.:$JAVA_HOME/jre/lib:$JAVA_HOME/lib:$JAVA_HOME/lib/tools.jar    

## Hadoop Variables
export HADOOP_HOME=/usr/hadoopsw/hadoop-2.7.3
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

#Hive Variables
export HIVE_HOME=/usr/hadoopsw/apache-hive-2.1.1-bin
export PATH=$PATH:$HIVE_HOME/bin
export CLASSPATH=$CLASSPATH:$HADOOP_HOME/lib/*:.
export CLASSPATH=$CLASSPATH:$HIVE_HOME/lib/*:.

#Derby Variables
export DERBY_HOME=/usr/hadoopsw/db-derby-10.13.1.1-bin
export PATH=$PATH:$DERBY_HOME/bin
export CLASSPATH=$CLASSPATH:$DERBY_HOME/lib/derby.jar:$DERBY_HOME/lib/derbyclient.jar:$DERBY_HOME/lib/derbytools.jar
export DERBY_OPTS=-Dij.protocol=jdbc:derby://hadoopedge1/

# PIG Variables
export PIG_HOME=/usr/hadoopsw/pig-0.16.0
export PATH=$PATH:$PIG_HOME/bin
export PIG_CLASSPATH=$HADOOP_HOME/etc/hadoop/

# Oracle and OSCH Settings
TMP=/tmp; export TMP
TMPDIR=$TMP; export TMPDIR

ORACLE_HOSTNAME=en01; export ORACLE_HOSTNAME
ORACLE_UNQNAME=orcl; export ORACLE_UNQNAME
ORACLE_BASE=/u01/app/oracle; export ORACLE_BASE
ORACLE_HOME=$ORACLE_BASE/product/12.2.0.1/db_1; export ORACLE_HOME
ORACLE_SID=orcl; export ORACLE_SID
export NLS_LANG=American_America.AL32UTF8
PATH=/usr/sbin:$PATH; export PATH
PATH=$ORACLE_HOME/bin:$PATH; export PATH

LD_LIBRARY_PATH=$ORACLE_HOME/lib:/lib:/usr/lib; export LD_LIBRARY_PATH
CLASSPATH=$CLASSPATH:$ORACLE_HOME/jlib:$ORACLE_HOME/rdbms/jlib; export CLASSPATH

export OSCH_HOME=/usr/hadoopsw/osch/orahdfs-3.6.0
export HADOOP_CLASSPATH=$HADOOP_HOME/lib/*:$OSCH_HOME/jlib/*:$HIVE_HOME/lib/*:$HIVE_HOME/conf
export PATH=$PATH:$HADOOP_MAPRED_HOME:$OSCH_HOME/bin


We need to make one more environment variable change, this time in the preprocessor script $OSCH_HOME/bin/hdfs_stream. The hdfs_stream script does not inherit the environment variables set in the bash shell; its PATH environment variable is set to /bin:/usr/bin. The script needs to access the Hadoop bin directory, and since we have not installed Hadoop under /bin or /usr/bin, add the Hadoop bin directory to the PATH variable in the hdfs_stream script.

vi /usr/hadoopsw/osch/orahdfs-3.6.0/bin/hdfs_stream

export PATH=/usr/hadoopsw/hadoop-2.7.3/bin:/usr/bin:/bin


If the Hadoop bin directory is not added to the PATH variable in hdfs_stream, the following error is generated when a SQL query is run against the external table generated by OSCH.

ORA-29913: error in executing ODCIEXTTABLEFETCH callout
ORA-29400: data cartridge error
KUP-04095: preprocessor command /osch/orahdfs-3.6.0/bin/hdfs_stream encountered error "which no hadoop in (/usr/bin:bin)
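
To confirm the PATH fix, you can check that the hadoop launcher resolves from the restricted PATH that hdfs_stream uses (a quick check, not from the original post).

[hdpclient@en01 ~]$ env -i PATH=/usr/hadoopsw/hadoop-2.7.3/bin:/usr/bin:/bin sh -c 'command -v hadoop'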


5- Creating Directory Objects and Setting Permissions

Before running OSCH, some directory objects must be created, with appropriate permissions granted to the Oracle Database schema (i.e. hdfsuser) for which an external table is to be created on the HDFS data. For one of the directory objects, create an OS directory with the following commands.

[root@en01 ~]# mkdir -p /data/ora_ext_tab_dir
[root@en01 ~]# chmod -R 777 /data/ora_ext_tab_dir



Run the following SQL script in SQL*Plus to create the directory objects OSCH_EXTTAB_DIR and OSCH_BIN_PATH and grant the required permissions to the hdfsuser schema.

GRANT CREATE ANY DIRECTORY to hdfsuser;

CREATE OR REPLACE DIRECTORY OSCH_EXTTAB_DIR AS '/data/ora_ext_tab_dir';


CREATE OR REPLACE DIRECTORY OSCH_BIN_PATH AS '/usr/hadoopsw/osch/orahdfs-3.6.0/bin';


GRANT READ, WRITE ON DIRECTORY OSCH_EXTTAB_DIR TO hdfsuser;


GRANT CREATE SESSION, CREATE TABLE, CREATE VIEW TO hdfsuser;


GRANT EXECUTE ON sys.utl_file TO hdfsuser;


GRANT READ, EXECUTE ON DIRECTORY OSCH_BIN_PATH TO hdfsuser;
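
To confirm the directory objects point where you expect, you can query the data dictionary while still connected as a DBA user (a quick check, not part of the original script).

SELECT directory_name, directory_path
FROM dba_directories
WHERE directory_name IN ('OSCH_EXTTAB_DIR', 'OSCH_BIN_PATH');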



6- Adding Data to HDFS

We will be creating an external table on a delimited text file in HDFS. Create the following text file, catalog.txt, store it in HDFS, and then we will read it using an Oracle external table.

1,Item1
2,Item2
3,Item3
4,Item4
5,Item5


[hdpsysuser@nn01 ~]$ hdfs dfs -mkdir /catalog

[hdpsysuser@nn01 ~]$ hdfs dfs -put /data/mydata/catalog.txt /catalog/catalog.txt
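
Confirm the file landed in HDFS (a quick check, not shown in the original post).

[hdpsysuser@nn01 ~]$ hdfs dfs -ls /catalog
[hdpsysuser@nn01 ~]$ hdfs dfs -cat /catalog/catalog.txt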




7- Using the ExternalTable Tool

OSCH provides the ExternalTable tool to create an external table; it may be run with the following command.

hadoop jar $OSCH_HOME/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable [genericOptions] COMMAND [options]



The genericOptions are the generic command-line options supported by the Hadoop Tool interface. The ExternalTable tool is used with one of the commands listed in the following table.

Command          Description
-createTable     Creates an external table and also publishes the HDFS data URIs to the location files of the table
-describe        Describes tables created by OSCH
-drop            Drops tables created by OSCH
-publish         Publishes the HDFS data URIs to the location files of an external table
-listLocations   Prints the location files of an external table
-getDDL          Prints the DDL of an external table


The ExternalTable tool supports the following options.

Option                 Description
--noexecute            Prints the execution plan of the -createTable, -publish, and -drop commands
--details              Prints a detailed listing of the location files for the -listLocations command
--output <filename>    Writes the table DDL from -createTable to the specified file
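
For example, --noexecute can be combined with -createTable to preview the generated DDL without touching the database. Using the configuration file built in the next step, the invocation would look like this (a sketch, not from the original post):

hadoop jar $OSCH_HOME/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable -conf catalog_hdfs.xml -createTable --noexecute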


8- Creating the Configuration file for the ExternalTable tool


The OSCH ExternalTable tool is driven by configuration properties: general properties and connection properties. The configuration properties may be specified either on the command line with -D or in an XML configuration file. We shall use an XML configuration file. Create a file catalog_hdfs.xml in which each property is specified with the following structure.

catalog_hdfs.xml
<?xml version="1.0"?>
<configuration>
<property>
<name></name>
<value></value>
</property>
</configuration>




OSCH supports three input formats: delimited text files, Data Pump files, and Hive tables. For each input format, some configuration properties are required and some are optional. For the delimited text file input, the following configuration properties are required.

oracle.hadoop.exttab.tableName
  Schema-qualified name of the external table; the schema defaults to the connecting user name if omitted.
  Value used: HDFSUSER.CATALOG_EXT

oracle.hadoop.exttab.defaultDirectory
  The Oracle Database directory object for the external table.
  Value used: OSCH_EXTTAB_DIR (created earlier)

oracle.hadoop.exttab.dataPaths
  A comma-separated list of HDFS paths.
  Value used: hdfs://nn01:9000/catalog

oracle.hadoop.exttab.columnCount or oracle.hadoop.exttab.columnNames
  columnCount is the number of columns in the external table; columnNames is a comma-separated list of column names and defaults to C1, C2, ... Cn.
  Value used: CATALOGID,CATALOGNAME

oracle.hadoop.connection.url
  The JDBC connection URL for Oracle Database.
  Value used: jdbc:oracle:thin:@en01:1521:orcl

oracle.hadoop.connection.user
  The Oracle Database user.
  Value used: HDFSUSER

The data type mapping for the Oracle Database columns may be set with the oracle.hadoop.exttab.colMap.columnType property, which applies to all columns and defaults to VARCHAR2. The mapping may also be set per column with the oracle.hadoop.exttab.colMap.column_name.columnType property, as we have done. The source type is set with the oracle.hadoop.exttab.sourceType property, which may be set to text, hive, or datapump; the default is text. The number of location files for the external table is set with oracle.hadoop.exttab.locationFileCount and defaults to 4. The specified value is not always used: the lesser of the number of data files and the specified oracle.hadoop.exttab.locationFileCount value is used. For delimited text file input, the field terminator may be specified with the oracle.hadoop.exttab.fieldTerminator property and defaults to ','. The configuration file catalog_hdfs.xml is listed below; copy the file to the /usr/hadoopsw/osch directory.


catalog_hdfs.xml

<?xml version="1.0"?>
<configuration>
<property>
<name>oracle.hadoop.exttab.tableName</name>
<value>HDFSUSER.CATALOG_EXT</value>
</property>
<property>
<name>oracle.hadoop.exttab.locationFileCount</name>
<value>3</value>
</property>
<property>
<name>oracle.hadoop.exttab.dataPaths</name>
<value>hdfs://nn01:9000/catalog</value>
</property>
<property>
<name>oracle.hadoop.exttab.fieldTerminator</name>
<value>\u002C</value>
</property>
<property>
<name>oracle.hadoop.exttab.defaultDirectory</name>
<value>OSCH_EXTTAB_DIR</value>
</property>
<property>
<name>oracle.hadoop.exttab.columnNames</name>
<value>CATALOGID,CATALOGNAME</value>
</property>
<property>
<name>oracle.hadoop.exttab.colMap.CATALOGID.columnType</name>
<value>NUMBER</value>
</property>
<property>
<name>oracle.hadoop.exttab.colMap.CATALOGNAME.columnType</name>
<value>VARCHAR2</value>
</property>
<property>
<name>oracle.hadoop.exttab.sourceType</name>
<value>text</value>
</property>
<property>
<name>oracle.hadoop.connection.url</name>
<value>jdbc:oracle:thin:@en01:1521:orcl</value>
</property>
<property>
<name>oracle.hadoop.connection.user</name>
<value>HDFSUSER</value>
</property>
</configuration>
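
As noted earlier, the same properties could instead be passed on the command line with generic -D options; a minimal sketch (the values mirror catalog_hdfs.xml above):

hadoop jar $OSCH_HOME/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable \
  -D oracle.hadoop.exttab.tableName=HDFSUSER.CATALOG_EXT \
  -D oracle.hadoop.exttab.defaultDirectory=OSCH_EXTTAB_DIR \
  -D oracle.hadoop.exttab.dataPaths=hdfs://nn01:9000/catalog \
  -D oracle.hadoop.exttab.columnNames=CATALOGID,CATALOGNAME \
  -D oracle.hadoop.connection.url=jdbc:oracle:thin:@en01:1521:orcl \
  -D oracle.hadoop.connection.user=HDFSUSER \
  -createTable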


9- Creating the External Table for the Delimited Text File in HDFS

Next, we shall run the ExternalTable tool to create an external table for the delimited text file catalog.txt in HDFS. The configuration file is specified on the command line with -conf. Optionally, the oracle.hadoop.exttab.printStackTrace property may be specified to print the stack trace. To create the external table and add the data from the HDFS file catalog.txt to it, run the following command from the /usr/hadoopsw/osch directory.

hadoop jar $OSCH_HOME/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable -Doracle.hadoop.exttab.printStackTrace=true -conf catalog_hdfs.xml -createTable

OSCH starts and prompts for the password of the Oracle Database schema specified as the user in the configuration file. Unlike Oracle Loader for Hadoop, the password is entered at a prompt rather than in the configuration file. Enter the HDFSUSER schema password and press Enter.



[hdpclient@en01 osch]$ pwd

/usr/hadoopsw/osch



[hdpclient@en01 osch]$ hadoop jar $OSCH_HOME/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable -Doracle.hadoop.exttab.printStackTrace=true -conf catalog_hdfs.xml -createTable

Oracle SQL Connector for HDFS Release 3.6.0 - Production

Copyright (c) 2011, 2016, Oracle and/or its affiliates. All rights reserved.

[Enter Database Password:]
The create table command succeeded.

User: "HDFSUSER" performed the following actions in schema: HDFSUSER

CREATE TABLE "HDFSUSER"."CATALOG_EXT"
(
 "CATALOGID"                      NUMBER,
 "CATALOGNAME"                    VARCHAR2(4000)
)
ORGANIZATION EXTERNAL
(
   TYPE ORACLE_LOADER
   DEFAULT DIRECTORY "OSCH_EXTTAB_DIR"
   ACCESS PARAMETERS
   (
     RECORDS DELIMITED BY 0X'0A'
     CHARACTERSET AL32UTF8
     PREPROCESSOR "OSCH_BIN_PATH":'hdfs_stream'
     FIELDS TERMINATED BY 0X'2C'
     MISSING FIELD VALUES ARE NULL
     (
       "CATALOGID" CHAR,
       "CATALOGNAME" CHAR(4000)
     )
   )
   LOCATION
   (
     'osch-20170530021546-6194-1'
   )
) PARALLEL REJECT LIMIT UNLIMITED;

The following location files were created.

osch-20170530021546-6194-1 contains 1 URI, 52 bytes

          52 hdfs://nn01:9000/catalog/catalog.txt

The following location files were created.

osch-20170529035900-5230-1 contains 1 URI, 40 bytes

          40 hdfs://nn01:9000/catalog/catalog.txt

[hdpclient@en01 osch]$




[hdpclient@en01 osch]$ ll /data/ora_ext_tab_dir

total 4
-rw-r--r--. 1 oracle oinstall 596 May 30 14:15 osch-20170530021546-6194-1


10- Verify Data via the Location File and SQL*Plus

[hdpclient@en01 osch]$ hdfs_stream /data/ora_ext_tab_dir/osch-20170530021546-6194-1

1,Item1
2,Item2
3,Item3
4,Item4
5,Item5


Run a DESC statement in SQL*Plus to list the description of the external table that was created.

SQL> conn hdfsuser/hdfsuser
Connected.

SQL> set lines 100
SQL> desc catalog_ext;
 Name                                                  Null?    Type
 ----------------------------------------------------- -------- ------------------------------------
 CATALOGID                                                      NUMBER
 CATALOGNAME                                                    VARCHAR2(4000)


Run a SQL query on the external table to list the HDFS data referenced by the location file that OSCH generated for the external table. The HDFS data on which the location file is based gets listed. If you encounter an error, examine it and resolve it first.

SQL> select * from catalog_ext;

select * from catalog_ext
*
ERROR at line 1:
ORA-12801: error signaled in parallel query server P000
ORA-29913: error in executing ODCIEXTTABLEFETCH callout
ORA-29400: data cartridge error
KUP-04004: error while reading file
/data/ora_ext_tab_dir/osch-20170529035900-5230-1


[hdpclient@en01 osch]$ ll /data/ora_ext_tab_dir/

total 8
-rw-r-----. 1 oracle oinstall 2787 May 29 16:05 CATALOG_EXT_7588.log
-rw-r--r--. 1 oracle oinstall 596 May 29 15:59 osch-20170529035900-5230-1



I checked the log file and found a permission issue. The log file is located in the same folder designated as the external table's directory object.

[hdpclient@en01 osch]$ cat /data/ora_ext_tab_dir/

CATALOG_EXT_7588.log        osch-20170530021546-6194-1


LOG file opened at 05/29/17 16:09:53


KUP-05007: Warning: Intra source concurrency disabled because the preprocessor option is being used.


Field Definitions for table CATALOG_EXT
Record format DELIMITED, delimited by 0A
Data in file has same endianness as the platform
Rows with all null fields are accepted


Fields in Data Source:


CATALOGID CHAR (255)
Terminated by "2C"
Trim whitespace same as SQL Loader
CATALOGNAME CHAR (4000)
Terminated by "2C"
Trim whitespace same as SQL Loader
KUP-04004: error while reading file /data/ora_ext_tab_dir/osch-20170529035900-5230-1
KUP-04017: OS message: Error 0
KUP-04017: OS message: /usr/hadoopsw/osch/orahdfs-3.6.0/bin/hdfs_stream: line 286: /usr/hadoopsw/osch/orahdfs-3.6.0/bin/../log/osch-log_16999-0529171496063393.log: Permission denied


KUP-04118: operation "pipe read", location "skudmir:2"



Note the permission problem and grant the required permissions.


[root@en01 ~]# chmod -R 777 /data
[root@en01 ~]# chmod -R 777 /usr/hadoopsw/osch


Try again

SQL> select * from catalog_ext;


CATALOGID
----------
CATALOGNAME
--------------------------------------------------------------------------------
1
Item1


2
Item2


3
Item3




CATALOGID
----------
CATALOGNAME
--------------------------------------------------------------------------------
4
Item4


5
Item5




SQL>



11- Perform some more tests


Remove the file on HDFS and try the above query again

[hdpsysuser@nn01 ~]$ hdfs dfs -rm /catalog/*

17/05/29 16:18:25 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 120 minutes, Emptier interval = 45 minutes.
Moved: 'hdfs://nn01:9000/catalog/catalog.txt' to trash at: hdfs://nn01:9000/user/hdpsysuser/.Trash/Current


SQL> select * from catalog_ext;

select * from catalog_ext
*
ERROR at line 1:
ORA-12801: error signaled in parallel query server P000
ORA-29913: error in executing ODCIEXTTABLEFETCH callout
ORA-29400: data cartridge error
KUP-04004: error while reading file /data/ora_ext_tab_dir/osch-20170530021546-6194-1



Add the file to HDFS again and retry the query; it will succeed.

[hdpsysuser@nn01 ~]$ hdfs dfs -put /data/mydata/catalog.txt /catalog

SQL> select * from catalog_ext;

 CATALOGID
----------
CATALOGNAME
----------------------------------------------------------------------------------------------------
         1
Item1

         2
Item2




Now create a second table in Oracle and join it with the external table.

create table child(id number, cat_detail varchar2(20));

insert into child values (1,'Item-1 Detail 1');
insert into child values (1,'Item-1 Detail 2');
insert into child values (2,'Item-2 Detail 1');
insert into child values (4,'Item-1 Detail 2');


commit;


select ce.catalogid, catalogname, c.cat_detail
from catalog_ext ce, child c
where ce.CATALOGID = c.ID;










You can also run aggregate queries against the external table:

select catalogname, count(*) cnt from hdfsuser.catalog_ext group by catalogname having count(*) > 0 order by CATALOGNAME desc;

12- Describing Tables created by OSCH

To describe the table(s) created by OSCH, run the -describe command using the ExternalTable tool. The output from the command is listed below:

[hdpclient@en01 osch]$ hadoop jar $OSCH_HOME/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable -Doracle.hadoop.exttab.printStackTrace=true -conf catalog_hdfs.xml -describe

Oracle SQL Connector for HDFS Release 3.6.0 - Production
Copyright (c) 2011, 2016, Oracle and/or its affiliates. All rights reserved.

[Enter Database Password:]

The described object is "HDFSUSER"."CATALOG_EXT"

CREATE TABLE "HDFSUSER"."CATALOG_EXT"
( "CATALOGID" NUMBER,

"CATALOGNAME" VARCHAR2(4000)

)

ORGANIZATION EXTERNAL
( TYPE ORACLE_LOADER
DEFAULT DIRECTORY "OSCH_EXTTAB_DIR"
ACCESS PARAMETERS
( RECORDS DELIMITED BY 0X'0A'
CHARACTERSET AL32UTF8
PREPROCESSOR "OSCH_BIN_PATH":'hdfs_stream'
FIELDS TERMINATED BY 0X'2C'
MISSING FIELD VALUES ARE NULL
(
"CATALOGID" CHAR,
"CATALOGNAME" CHAR(4000)
)
)
LOCATION
( 'osch-20170529035900-5230-1'
)
)
REJECT LIMIT UNLIMITED
PARALLEL
Listing Location Files for external table: [CATALOG_EXT]
osch-20170529035900-5230-1 contains 1 URI, 40 bytes
40 hdfs://nn01:9000/catalog/catalog.txt


13- Listing the Location Files of the External Table

To list the location files for the external table HDFSUSER.CATALOG_EXT, run the -listLocations command with the ExternalTable tool.

[hdpclient@en01 osch]$ hadoop jar $OSCH_HOME/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable -Doracle.hadoop.exttab.printStackTrace=true -conf catalog_hdfs.xml -listLocations

Oracle SQL Connector for HDFS Release 3.6.0 - Production

Copyright (c) 2011, 2016, Oracle and/or its affiliates. All rights reserved.

[Enter Database Password:]

Listing Location Files for external table: [CATALOG_EXT]

osch-20170529035900-5230-1 contains 1 URI, 40 bytes


40 hdfs://nn01:9000/catalog/catalog.txt

14- Getting the DDL for the External Table

The DDL for the external table may be output with the -getDDL command.


[hdpclient@en01 osch]$ hadoop jar $OSCH_HOME/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable -Doracle.hadoop.exttab.printStackTrace=true -conf catalog_hdfs.xml -getDDL


Oracle SQL Connector for HDFS Release 3.6.0 - Production
Copyright (c) 2011, 2016, Oracle and/or its affiliates. All rights reserved.
[Enter Database Password:]
CREATE TABLE "HDFSUSER"."CATALOG_EXT"
( "CATALOGID" NUMBER,
"CATALOGNAME" VARCHAR2(4000)
)
ORGANIZATION EXTERNAL
( TYPE ORACLE_LOADER
DEFAULT DIRECTORY "OSCH_EXTTAB_DIR"
ACCESS PARAMETERS
( RECORDS DELIMITED BY 0X'0A'
CHARACTERSET AL32UTF8
PREPROCESSOR "OSCH_BIN_PATH":'hdfs_stream'
FIELDS TERMINATED BY 0X'2C'
MISSING FIELD VALUES ARE NULL
(
"CATALOGID" CHAR,
"CATALOGNAME" CHAR(4000)
)
)
LOCATION
( 'osch-20170529035900-5230-1'
)
)
REJECT LIMIT UNLIMITED
PARALLEL

15- Adding the HDFS Data to the External Table Location files Separately


The -createTable command adds the HDFS data URIs to the location files of the external table, but it may be necessary to add the HDFS data separately. For example, the data in the HDFS directory may have been modified and the new data needs to be added to the location files, or the external table may have been created separately and OSCH is being used only to add the data from HDFS. The -publish command is used to add the HDFS data to the external table's location files. If a location file already exists, it is deleted and a new location file is created. Run the -publish command as follows.

Add another text file with the same fields; the external table does not read the new file's contents until you publish it.
vi catalog1.txt

6,Item6
7,Item7


[hdpsysuser@nn01 ~]$ hdfs dfs -put /data/mydata/catalog1.txt /catalog


[hdpclient@en01 osch]$ hadoop jar $OSCH_HOME/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable -Doracle.hadoop.exttab.printStackTrace=true -conf catalog_hdfs.xml -publish


Oracle SQL Connector for HDFS Release 3.6.0 - Production

Copyright (c) 2011, 2016, Oracle and/or its affiliates. All rights reserved.

[Enter Database Password:]
The publish command succeeded.

ALTER TABLE "HDFSUSER"."CATALOG_EXT"
LOCATION
(
'osch-20170529045207-379-1',
'osch-20170529045207-379-2'
);

The following location files were created.
osch-20170529045207-379-1 contains 1 URI, 52 bytes
52 hdfs://nn01:9000/catalog/catalog.txt
osch-20170529045207-379-2 contains 1 URI, 20 bytes
20 hdfs://nn01:9000/catalog/catalog1.txt
The following location files were deleted.
osch-20170529035900-5230-1 was deleted.

The output from the -publish command indicates that the HDFSUSER.CATALOG_EXT external table was altered: new location files were created and the previous location file was deleted.
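
After the publish, re-querying the external table should also return the two rows from catalog1.txt, for a total of 7 rows (a quick check; output not shown in the original post).

SQL> select count(*) from catalog_ext;
SQL> select * from catalog_ext order by catalogid;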



16- Find the Location Files/Log Files:

The location files and the log files are created in the OS directory that backs the external table's directory object, which is the /data/ora_ext_tab_dir directory created earlier.


[hdpclient@en01 osch]$ ll /data/ora_ext_tab_dir/

total 28
-rwxrwxrwx. 1 oracle oinstall 11327 May 29 16:53 CATALOG_EXT_7588.log
-rw-r-----. 1 oracle oinstall 2 May 29 16:53 CATALOG_EXT_7592.bad
-rw-r-----. 1 oracle oinstall 705 May 29 16:53 CATALOG_EXT_7592.log
-rw-r--r--. 1 oracle oinstall 595 May 29 16:52 osch-20170529045207-379-1
-rw-r--r--. 1 oracle oinstall 596 May 29 16:52 osch-20170529045207-379-2




17- Enabling Parallel Processing
To enable parallel processing with external tables, you must specify multiple files in the location clause of the external table. The number of files, also known as the degree of parallelism, determines the number of child processes started by the external table during a table read. Ideally, the degree of parallelism is no larger than the number of data files, to avoid idle child processes.


Parallel processing is extremely important when you are working with large volumes of data. When you use external tables, always enable parallel query with this SQL command:

ALTER SESSION ENABLE PARALLEL QUERY;

Before loading the data into an Oracle database from the external files created by Oracle SQL Connector for HDFS, enable parallel DDL:

ALTER SESSION ENABLE PARALLEL DDL;
Hints such as APPEND and PQ_DISTRIBUTE also improve performance when you are inserting data.
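
Putting these together, a typical parallel, direct-path load from the OSCH external table into a regular Oracle table might look like the following sketch (the target table CATALOG is an assumption and is not created anywhere in this post):

ALTER SESSION ENABLE PARALLEL QUERY;
ALTER SESSION ENABLE PARALLEL DDL;
ALTER SESSION ENABLE PARALLEL DML;

-- Assumed target table; adjust the definition to your needs
CREATE TABLE catalog (catalogid NUMBER, catalogname VARCHAR2(4000)) PARALLEL;

-- Direct-path, parallel insert from the OSCH external table
INSERT /*+ APPEND */ INTO catalog
SELECT catalogid, catalogname FROM catalog_ext;

COMMIT;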



