Please see my other blog, EBMentors, for Oracle E-Business Suite posts.


Note: All the posts are based on a practical approach, avoiding lengthy theory. All have been tested on development servers. Please don't test any post on a production server until you are sure.

Monday, January 08, 2018

Processing Twitter (JSON) Data in Oracle (12c External Table)


Problem:


  • We have a live Twitter stream ingested by Flume into our Hadoop cluster.
  • Flume generates too many files in HDFS: about 2 files per second, roughly 172k files per day.
  • We have to process the Flume-generated Twitter JSON files.
  • We created an Oracle external table over the Twitter JSON files, but performance is very poor because of the sheer number of files (the table is sketched below).
  • We need a remedy for the above issues.
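For context, the table in question looks roughly like this; a minimal sketch, assuming the HDFS files are exposed through an Oracle directory object (directory name, file pattern, and field size are hypothetical):

  CREATE TABLE twitter_ext (
    json_doc CLOB
  )
  ORGANIZATION EXTERNAL (
    TYPE ORACLE_LOADER
    DEFAULT DIRECTORY hdfs_dir            -- hypothetical directory object over the HDFS mount
    ACCESS PARAMETERS (
      RECORDS DELIMITED BY NEWLINE
      FIELDS (json_doc CHAR(32767))       -- one JSON tweet per line, as written by Flume
    )
    LOCATION ('FlumeData.*')              -- 12.2 accepts wildcards; earlier releases need each file listed
  )
  REJECT LIMIT UNLIMITED;

  -- the standard 12c JSON functions can then extract fields from each document
  SELECT JSON_VALUE(json_doc, '$.text') AS tweet_text
  FROM   twitter_ext
  WHERE  ROWNUM <= 10;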

Thursday, January 04, 2018

Using Preprocessor with External Table [Over HDFS]

Oracle 11g Release 2 introduced the PREPROCESSOR clause to identify a directory object and script used to process the files before they are read by the external table. This feature was backported to 11gR1 (11.1.0.7). The PREPROCESSOR clause is especially useful for reading compressed files, since they are unzipped and piped straight into the external table process without ever having to be unzipped on the file system.
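A minimal sketch of the clause, assuming a directory object exec_dir that holds a small wrapper script around zcat (all object and file names here are hypothetical):

  CREATE TABLE sales_ext (
    sale_id NUMBER,
    amount  NUMBER
  )
  ORGANIZATION EXTERNAL (
    TYPE ORACLE_LOADER
    DEFAULT DIRECTORY data_dir
    ACCESS PARAMETERS (
      RECORDS DELIMITED BY NEWLINE
      PREPROCESSOR exec_dir:'zcat.sh'     -- script that execs /bin/zcat on the input file
      FIELDS TERMINATED BY ','
    )
    LOCATION ('sales.csv.gz')             -- read through zcat, never unzipped on disk
  )
  REJECT LIMIT UNLIMITED;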

Partitioning Oracle (12c) External Table [Over HDFS]


Partitioned external tables were introduced in Oracle Database 12c Release 2 (12.2), allowing external tables to benefit from partition pruning and partition-wise joins. With the exception of hash partitioning, most partitioning and subpartitioning strategies are supported, with some restrictions. In this post I've put together a test that uses partitioning to improve the performance of an external table over HDFS.
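For illustration, a range-partitioned external table sketch in the 12.2 syntax, with a per-partition LOCATION clause (directory, files, and columns are hypothetical):

  CREATE TABLE logs_ext (
    log_date DATE,
    message  VARCHAR2(4000)
  )
  ORGANIZATION EXTERNAL (
    TYPE ORACLE_LOADER
    DEFAULT DIRECTORY hdfs_dir
    ACCESS PARAMETERS (
      RECORDS DELIMITED BY NEWLINE
      FIELDS TERMINATED BY ','
    )
  )
  REJECT LIMIT UNLIMITED
  PARTITION BY RANGE (log_date) (
    PARTITION p2016 VALUES LESS THAN (DATE '2017-01-01') LOCATION ('logs_2016.csv'),
    PARTITION p2017 VALUES LESS THAN (DATE '2018-01-01') LOCATION ('logs_2017.csv')
  );

A predicate on log_date then prunes the scan to the matching file(s) instead of reading everything.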

Optimizing NFS Performance [HDP NFS]


Introduction

You may experience poor performance when using NFS. Careful analysis of your environment, from both the client and the server point of view, is the first step toward optimal NFS performance. Aside from general network configuration - appropriate network capacity, faster NICs, full-duplex settings to reduce collisions, agreement on network speed among switches and hubs, etc. - some of the most important client optimization settings are the NFS data transfer buffer sizes, specified by the mount command options rsize and wsize, as shown below.
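For example, the buffer sizes are set at mount time (server, export path, and values are placeholders; the optimal sizes depend on your kernel, NFS version, and network, so benchmark with different powers of two):

  mount -t nfs -o rw,hard,vers=3,rsize=32768,wsize=32768 nfsserver:/export/data /mnt/data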

Monday, December 25, 2017

Offload Oracle Data to HDFS using RO Tablespaces


Purpose:

Offloading Oracle Data to HDFS

Prerequisites:

Hortonworks NFS Gateway is running. Please visit the post below.
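In outline, the offload itself boils down to the following; a minimal sketch with a hypothetical tablespace name and paths, where /hdfs_mnt stands for the NFS-mounted HDFS:

  ALTER TABLESPACE hist_data READ ONLY;
  -- copy the datafile onto the HDFS-backed mount at the OS level, e.g.
  --   cp /u01/oradata/hist01.dbf /hdfs_mnt/oradata/hist01.dbf
  ALTER TABLESPACE hist_data OFFLINE;
  ALTER DATABASE RENAME FILE '/u01/oradata/hist01.dbf' TO '/hdfs_mnt/oradata/hist01.dbf';
  ALTER TABLESPACE hist_data ONLINE;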

Creating Oracle External Table (12c) on HDFS using HDP NFS Gateway

Purpose:

Offloading Oracle Data to HDFS

Prerequisites:

Hortonworks NFS Gateway is running.
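The key step is a directory object pointing into the mounted HDFS path (mount point, path, and grantee are hypothetical); the external table DDL itself then takes the usual form shown in the posts above:

  CREATE OR REPLACE DIRECTORY hdfs_dir AS '/hdfs_mnt/user/oracle/data';
  GRANT READ, WRITE ON DIRECTORY hdfs_dir TO scott;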

Configuring NFS Gateway for HDFS [HDP]



The NFS Gateway for HDFS allows clients to mount HDFS and interact with it through NFS, as if it were part of their local file system. The gateway supports NFSv3.
After mounting HDFS, a user can:

  • Browse the HDFS file system as if it were part of the local file system
  • Upload files to and download files from HDFS
  • Stream data directly into HDFS through the mount point (file append is supported, but random write is not)
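The mount itself is a standard NFSv3 mount (gateway host and mount point are placeholders):

  mount -t nfs -o vers=3,proto=tcp,nolock hdp-gateway:/ /hdfs_mnt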

Tuesday, December 19, 2017

Configure Hortonworks Hive ODBC Driver for Oracle HS


The Hortonworks Hive ODBC Driver is used for direct SQL and HiveQL access to Apache Hadoop / Hive distributions, enabling Business Intelligence (BI), analytics, and reporting on Hadoop / Hive-based data. The driver efficiently transforms an application's SQL query into the equivalent form in HiveQL.
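On the Oracle side, the heterogeneous-services init file essentially just points at the ODBC DSN; a sketch with hypothetical SID name, DSN, and paths:

  # $ORACLE_HOME/hs/admin/initHIVE.ora (the HIVE SID name is hypothetical)
  HS_FDS_CONNECT_INFO = HiveDSN                  # DSN defined in odbc.ini for the Hortonworks driver
  HS_FDS_SHAREABLE_NAME = /usr/lib64/libodbc.so  # unixODBC driver manager library
  set ODBCINI=/etc/odbc.ini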


Monday, December 18, 2017

Query Teradata Presto from Oracle using ODBC Heterogeneous Gateway [RHEL 7]


Presto is an open-source distributed SQL query engine designed to query large data sets distributed over one or more heterogeneous data sources (in Italian, "presto" means fast). It runs interactive analytic queries against data sources of all sizes, and through a single query, data is accessed where it resides. Typically this means data in the Hadoop Distributed File System (HDFS); however, unlike other SQL-on-Hadoop engines, Presto can also query data sources such as Apache Cassandra, relational databases, or even proprietary data stores.
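Once the ODBC gateway, listener, and TNS entry are in place, Presto is reached through an ordinary database link; a sketch where the link name, credentials, TNS alias, and table are all hypothetical:

  -- 'PRESTO' is the TNS alias pointing at the ODBC gateway SID
  CREATE DATABASE LINK presto_link CONNECT TO "user" IDENTIFIED BY "password" USING 'PRESTO';
  -- HS usually requires quoting lowercase remote identifiers
  SELECT * FROM "orders"@presto_link WHERE ROWNUM <= 10;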

Wednesday, November 08, 2017

Using HDP Zeppelin



Apache Zeppelin is a web-based notebook that enables interactive data analytics. With Zeppelin, you can make beautiful data-driven, interactive, and collaborative documents with a rich set of pre-built language backends, or interpreters; an interpreter is a plugin that lets you access processing engines and data sources from the Zeppelin UI. Available backends include Scala (with Apache Spark), Python (with Apache Spark), SparkSQL, Hive, Markdown, Angular, and Shell.
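Each paragraph in a Zeppelin note starts with an interpreter directive; for instance, a SparkSQL paragraph (the sales table is hypothetical):

  %sql
  SELECT category, COUNT(*) AS cnt
  FROM sales
  GROUP BY category
  ORDER BY cnt DESC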