Please see my other blog, EBMentors, for Oracle E-Business Suite posts.


Note: All the posts take a practical approach and avoid lengthy theory. Everything has been tested on development servers. Please don't apply any post to a production server until you are sure.

Thursday, April 19, 2018

Working with Elasticsearch


Introduction

Elasticsearch is a distributed, scalable, real-time search and analytics engine built on top of Apache Lucene™. Lucene (a library) is arguably the most advanced, high-performance, and fully featured search engine library in existence today, whether open source or proprietary. Elasticsearch enables you to search, analyze, and explore your data, whether you need full-text search, real-time analytics of structured data, or a combination of the two.
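
To give a quick taste (a minimal sketch, assuming a default single-node Elasticsearch listening on localhost:9200; the index, type, and field names here are invented for illustration), you can index and search a document with nothing but curl:

# index a sample document
curl -X PUT 'http://localhost:9200/megacorp/employee/1' -H 'Content-Type: application/json' -d '{"first_name": "John", "last_name": "Smith", "about": "I love big data"}'

# run a full-text search for it
curl -X GET 'http://localhost:9200/megacorp/employee/_search?q=last_name:Smith&pretty'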

Tuesday, April 17, 2018

Configure Rsyslog with Any Log File



Modern Linux distros ship with Rsyslog, which offers some nice additional functionality: its imfile module can turn any standard text file into a stream of syslog messages.
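
As a flavour of what the post covers, a minimal imfile configuration might look like the sketch below (the log file path, tag, facility, and log server are placeholders):

# /etc/rsyslog.d/myapp.conf -- hypothetical drop-in configuration
module(load="imfile")                # load the text-file input module

input(type="imfile"
      File="/var/log/myapp/app.log" # any plain text log file
      Tag="myapp:"                  # tag stamped on each message
      Severity="info"
      Facility="local6")

# route the resulting syslog messages, e.g. to a central log server
local6.* @@logserver.example.com:514

Restart rsyslog afterwards (systemctl restart rsyslog) for the change to take effect.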

Tuesday, March 20, 2018

HDFS Centralized Cache Management



Due to increasing memory capacity, many interesting working sets can fit entirely in aggregate cluster memory. With HDFS centralized cache management, applications can take advantage of the performance benefits of in-memory computation. Cluster cache state is aggregated and controlled by the NameNode, allowing application schedulers to place their tasks for cache locality.
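
To give a flavour of the interface (a sketch only; the pool name and path are hypothetical), caching is driven from the command line with hdfs cacheadmin:

# create a cache pool, then pin a hot directory into cluster memory
hdfs cacheadmin -addPool analytics -owner hdfs -group hdfs
hdfs cacheadmin -addDirective -path /user/hive/warehouse/hot_table -pool analytics

# verify what the NameNode has been asked to cache
hdfs cacheadmin -listDirectives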

Configuring ACLs on HDFS


Access Control Lists (ACLs) extend the HDFS permission model to support more granular file access based on arbitrary combinations of users and groups. We will discuss how to use ACLs on the Hadoop Distributed File System (HDFS).
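
For instance (assuming ACLs are enabled with dfs.namenode.acls.enabled=true in hdfs-site.xml; the path and user below are placeholders), granting an extra user read/write access looks like this:

# add a named-user ACL entry, then inspect the resulting ACL
hdfs dfs -setfacl -m user:hive:rw- /data/sales
hdfs dfs -getfacl /data/sales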


Tuesday, March 13, 2018

Spooling Files to HBase using Flume


Scenario:

One of my team members wants to upload the contents of files landing in a specific directory (a spooling directory) to HBase for some analysis. For this purpose we will use Flume's spooldir source, which lets users and applications place files in the spooling directory; each line is processed as one event and put into HBase. It is assumed that the Hadoop cluster and HBase are up and running; our environment is HDP 2.6.
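
The heart of such a setup is the agent configuration. A stripped-down sketch (the agent name, spooling directory, HBase table, and column family are all hypothetical) looks like this:

# spool-to-hbase.conf -- minimal Flume agent sketch
a1.sources  = src1
a1.channels = ch1
a1.sinks    = sink1

# watch a spooling directory; each line of each file becomes one event
a1.sources.src1.type     = spooldir
a1.sources.src1.spoolDir = /data/spool
a1.sources.src1.channels = ch1

a1.channels.ch1.type = memory

# write events into an HBase table
a1.sinks.sink1.type         = hbase
a1.sinks.sink1.table        = spool_data
a1.sinks.sink1.columnFamily = cf
a1.sinks.sink1.serializer   = org.apache.flume.sink.hbase.RegexHbaseEventSerializer
a1.sinks.sink1.channel      = ch1

The agent would then be started with something like flume-ng agent -n a1 -f spool-to-hbase.conf.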

Tuesday, February 06, 2018

Integrating Hadoop Cluster with Microsoft Azure Blob Storage

Introduction

Azure Blob storage is a service for storing large amounts of unstructured object data, such as text or binary data, that can be accessed from anywhere in the world via HTTP or HTTPS. You can use Blob storage to expose data publicly to the world, or to store application data privately. All access to Azure Storage is done through a storage account. 
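
Concretely (a sketch with placeholder account, container, and key values), integrating a Hadoop cluster boils down to registering the storage account key in core-site.xml and addressing the container through the wasb:// scheme:

<!-- core-site.xml: account name and access key are placeholders -->
<property>
  <name>fs.azure.account.key.MYACCOUNT.blob.core.windows.net</name>
  <value>YOUR_ACCESS_KEY</value>
</property>

With the hadoop-azure jar on the classpath, hadoop fs -ls wasb://mycontainer@MYACCOUNT.blob.core.windows.net/ should then list the container.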

Monday, January 29, 2018

Install SQL Server 2017 on Linux [RHEL]


SQL Server 2017 now runs on Linux. It’s the same SQL Server database engine, with many similar features and services regardless of your operating system.

I'm providing the straightforward installation steps below.
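
The core of it (a sketch following Microsoft's documented RHEL repository; run with root privileges) is only a few commands:

# register the SQL Server 2017 repo and install the engine
sudo curl -o /etc/yum.repos.d/mssql-server.repo https://packages.microsoft.com/config/rhel/7/mssql-server-2017.repo
sudo yum install -y mssql-server

# set the SA password and edition, then confirm the service is running
sudo /opt/mssql/bin/mssql-conf setup
systemctl status mssql-server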

Thursday, January 25, 2018

Using Avro with Hive and Presto



Pre-requisites


The AvroSerde allows users to read or write Avro data as Hive tables; the post summarizes its key points.
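
As a taste of what the post covers (a sketch; the table name, location, and schema URL are hypothetical), declaring an Avro-backed Hive table can be as simple as:

-- Hive 0.14+ understands STORED AS AVRO directly
CREATE EXTERNAL TABLE users_avro
STORED AS AVRO
LOCATION '/data/avro/users'
TBLPROPERTIES ('avro.schema.url'='hdfs:///schemas/user.avsc');

Presto can then query the same table through its Hive connector.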

Working with Apache Avro to manage Big Data Files


What is Avro?


Apache Avro is a language-neutral data serialization system and a preferred tool for serializing data in Hadoop. Serialization is the process of translating data structures or object state into a binary or textual form, so the data can be transported over a network or stored on persistent storage. Once the data has been transported over the network or retrieved from storage, it needs to be deserialized again.
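
Avro schemas are themselves written in JSON. A minimal record schema (the record and field names are invented for illustration) looks like this:

{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age",  "type": "int"}
  ]
}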

Custom Merge Utility for Flume Generated Files


Problem:

My client is streaming tweets to an HDFS location, and thousands of Flume files are being created there. A Hive external table has been created on this location, and the external table's performance degrades when there are too many small files.
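
One crude way to see the idea (a sketch with made-up paths; the post builds a proper reusable utility) is to concatenate the many small files into one larger file and re-stage it for the external table:

# pull the small Flume files down as a single merged file, then push it back
hdfs dfs -getmerge /landing/tweets /tmp/tweets_merged.json
hdfs dfs -put /tmp/tweets_merged.json /landing/tweets_merged/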