Tuesday, February 06, 2018

Integrating Hadoop Cluster with Microsoft Azure Blob Storage

Introduction

Azure Blob storage is a service for storing large amounts of unstructured object data, such as text or binary data, that can be accessed from anywhere in the world via HTTP or HTTPS. You can use Blob storage to expose data publicly to the world, or to store application data privately. All access to Azure Storage is done through a storage account. 
Container: A container groups a set of blobs, and every blob must live in a container. An account can contain an unlimited number of containers, and a container can store an unlimited number of blobs. Note that container names must be lowercase.
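If you want to follow along and don't yet have a storage account or container, the Azure CLI can create both. A minimal sketch, assuming the az CLI is installed and you are logged in; the resource group, account, and container names below are placeholders:

# create a resource group, a storage account, and a container
az group create --name hdp-demo-rg --location eastus
az storage account create --name mystorageacct --resource-group hdp-demo-rg --location eastus --sku Standard_LRS
az storage container create --name mycontainer --account-name mystorageacct
# list the access keys; one of them is needed for core-site.xml later
az storage account keys list --account-name mystorageacct --resource-group hdp-demo-rg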

A blob is a file of any type and size. Azure Storage offers three types of blobs: block blobs, append blobs, and page blobs.

Block blobs are ideal for storing text or binary files, such as documents and media files. A single block blob can contain up to 50,000 blocks of up to 100 MB each, for a total size of slightly more than 4.75 TB (100 MB X 50,000).

Append blobs are similar to block blobs in that they are made up of blocks, but they are optimized for append operations, so they are useful for logging scenarios. A single append blob can contain up to 50,000 blocks of up to 4 MB each, for a total size of slightly more than 195 GB (4 MB X 50,000).

Page blobs can be up to 1 TB in size, and are more efficient for frequent read/write operations. Azure Virtual Machines use page blobs as operating system and data disks.
Common uses of Blob storage include:
  • Serving images or documents directly to a browser.
  • Storing files for distributed access.
  • Streaming video and audio.
  • Storing data for backup and restore, disaster recovery, and archiving.
  • Storing data for analysis by an on-premises or Azure-hosted service.



Problem:

Our HDP cluster is deployed in our own data center, and we want to access cloud storage via the connectors available for Amazon Web Services (Amazon S3) and Microsoft Azure (ADLS, WASB).

Before you can work with data stored in a cloud storage service, you must configure authentication with that service. Once authentication is configured, the cloud connectors let you seamlessly access and work with data stored in Amazon S3, Azure ADLS, and Azure WASB, including, but not limited to, the following use cases:

  • Collect data for analysis and then load it into Hadoop ecosystem applications such as Hive or Spark directly from cloud storage services.
  • Persist data to cloud storage services for use outside of HDP clusters.
  • Copy data stored in cloud storage services to HDFS for analysis and then copy back to the cloud when done.
  • Share data between multiple HDP clusters – and between various external non-HDP systems – by pointing at the same data sets in the cloud object stores.

The S3, ADLS and WASB connectors are implemented as individual Hadoop modules. The libraries and their dependencies are automatically placed on the classpath.
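Each connector exposes its object store under a distinct filesystem scheme, so once authentication is configured, cloud paths can be used wherever an HDFS path is accepted. A quick sketch of the URI forms (the bucket, account, and container names are placeholders):

hadoop fs -ls s3a://my-bucket/path/                                  # Amazon S3 (s3a connector)
hadoop fs -ls adl://my_adls_account.azuredatalakestore.net/path/     # Azure Data Lake Store (ADLS)
hadoop fs -ls wasb://my_container@my_account.blob.core.windows.net/  # Azure Blob storage (WASB)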






Getting Started with WASB
After setting up your Azure account you can start working with WASB. WASB is enabled automatically on HDInsight clusters, but you can also configure a Blob storage account manually on any Hadoop installation that has Internet access to the storage account. Here are the steps:

Step 1. Verify your Hadoop version. It needs to be 2.7.1 or later.

[hdpclient@en01 ~]$ hadoop version

Hadoop 2.7.3

Step 2. Verify that the following two JAR files are present in your Hadoop installation.

[hdpclient@en01 ~]$ ls /usr/hadoopsw/hadoop-2.7.3/share/hadoop/tools/lib/*azure*.jar
/usr/hadoopsw/hadoop-2.7.3/share/hadoop/tools/lib/azure-storage-2.0.0.jar
/usr/hadoopsw/hadoop-2.7.3/share/hadoop/tools/lib/hadoop-azure-2.7.3.jar

Step 3. Modify hadoop-env.sh. Add these two JAR files to the Hadoop classpath, either by exporting HADOOP_CLASSPATH in your shell (as shown below) or by appending the same export to the end of $HADOOP_HOME/etc/hadoop/hadoop-env.sh.
[hdpclient@en01 ~]$ echo $HADOOP_CLASSPATH
/usr/hadoopsw/hadoop-2.7.3/lib/*:/usr/hadoopsw/osch/orahdfs-3.6.0/jlib/*:/usr/hadoopsw/apache-hive-2.1.1-bin/lib/*:/usr/hadoopsw/apache-hive-2.1.1-bin/conf

[hdpclient@en01 ~]$ export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/usr/hadoopsw/hadoop-2.7.3/share/hadoop/tools/lib/azure-storage-2.0.0.jar:/usr/hadoopsw/hadoop-2.7.3/share/hadoop/tools/lib/hadoop-azure-2.7.3.jar
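To make this change permanent rather than per-session, the same export can be appended to hadoop-env.sh, for example:

# appended to $HADOOP_HOME/etc/hadoop/hadoop-env.sh
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/usr/hadoopsw/hadoop-2.7.3/share/hadoop/tools/lib/azure-storage-2.0.0.jar:/usr/hadoopsw/hadoop-2.7.3/share/hadoop/tools/lib/hadoop-azure-2.7.3.jar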


Step 4. Modify core-site.xml 

Modify the $HADOOP_HOME/etc/hadoop/core-site.xml file to add the following name-value pairs to the configuration, pointing it to the Blob storage.

<property>
  <name>fs.AbstractFileSystem.wasb.impl</name>
  <value>org.apache.hadoop.fs.azure.Wasb</value>
</property>

<property>
  <name>fs.azure.account.key.my_blob_account_name.blob.core.windows.net</name>
  <value>my_blob_account_key</value>
</property>

<!-- optionally set the default file system to a container -->
<property>
  <name>fs.defaultFS</name>
  <value>wasb://my_container_name@my_blob_account_name.blob.core.windows.net</value>
</property>

<!-- concrete account key property for the opsbdata account used in this post -->
<property>
  <name>fs.azure.account.key.opsbdata.blob.core.windows.net</name>
  <value>n/BJAFPQrvi2CatrPsNFn2y2Z5zBg0pfyivamJnY78vcctIotqLGGoAtk7ulRZlVdVl99yOJV4t6dIAXAwlSDg==</value>
</property>


If you don't want to expose your storage account key in core-site.xml, you can also encrypt it or keep it outside the configuration file.
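One option is the Hadoop credential provider framework, which keeps the key in a keystore file instead of the configuration. A minimal sketch, assuming a local JCEKS path of your choosing:

# store the account key in a credential store instead of core-site.xml
hadoop credential create fs.azure.account.key.opsbdata.blob.core.windows.net \
  -value <my_blob_account_key> \
  -provider localjceks://file/home/hdpclient/wasb.jceks

Then point hadoop.security.credential.provider.path in core-site.xml at the same localjceks:// URI and remove the plain-text key property.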


Step 5. Test the connection.
[hdpclient@en01 ~]$ hadoop fs -ls wasb://opsbdatacontainer@opsbdata.blob.core.windows.net/


18/02/05 14:26:43 INFO impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
18/02/05 14:26:43 INFO impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
18/02/05 14:26:43 INFO impl.MetricsSystemImpl: azure-file-system metrics system
started
Found 1 items
-rwxrwxrwx 1 673 2018-02-05 12:52 wasb://opsbdatacontainer@opsbdata.blob.core.windows.net/pgemp.csv

18/02/05 14:26:44 INFO impl.MetricsSystemImpl: Stopping azure-file-system metrics system...
18/02/05 14:26:44 INFO impl.MetricsSystemImpl: azure-file-system metrics system stopped.
18/02/05 14:26:44 INFO impl.MetricsSystemImpl: azure-file-system metrics system shutdown complete.

On HDP, add the above two properties (Step 4) to the custom core-site.xml.




Test a few other commands

[hdfs@nn01 ~]$ hadoop fs -mkdir wasb://opsbdatacontainer@opsbdata.blob.core.windows.net/testdir 

[hdfs@nn01 ~]$ hadoop fs -ls wasb://opsbdatacontainer@opsbdata.blob.core.windows.net/


[hdfs@nn01 ~]$ hadoop fs -put /data/mydata/catalog.txt wasb://opsbdatacontainer@opsbdata.blob.core.windows.net/testdir


[hdfs@nn01 ~]$ hadoop fs -cat wasb://opsbdatacontainer@opsbdata.blob.core.windows.net/testdir




[hdfs@nn01 ~]$ hadoop fs -cat wasb://opsbdatacontainer@opsbdata.blob.core.windows.net/testdir/catalog.txt


18/02/05 15:03:35 INFO impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
18/02/05 15:03:35 INFO impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
18/02/05 15:03:35 INFO impl.MetricsSystemImpl: azure-file-system metrics system started
18/02/05 15:03:37 ERROR azure.NativeAzureFileSystem: Encountered Storage Exception for read on Blob : testdir/catalog.txt Exception details: java.io.IOException Error Code : AuthenticationFailed

[root@nn01 ~]# hdfs dfs -chmod 777 wasb://opsbdatacontainer@opsbdata.blob.core.windows.net/testdir/catalog1.txt

[root@nn01 ~]# hadoop fs -chown hdfs wasb://opsbdatacontainer@opsbdata.blob.core.windows.net/testdir


Copying Data Between the Hadoop Cluster and MS Azure Blob Storage

The distributed copy command, distcp, is a general utility for copying large data sets between distributed filesystems within and across clusters.


hadoop distcp hdfs://nn01:8020/data/oraclenfs/emp.csv wasb://opsbdatacontainer@opsbdata.blob.core.windows.net/testdir

OR

[hdfs@nn01 ~]$ hadoop distcp wasb://opsbdatacontainer@opsbdata.blob.core.windows.net/testdir/emp.csv hdfs://nn01:8020/tmp
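distcp runs as a MapReduce job, so the usual options apply; for example, -m limits the number of map tasks and -update copies only files that are missing or changed at the destination. A sketch reusing the paths above:

hadoop distcp -update -m 10 hdfs://nn01:8020/data/oraclenfs wasb://opsbdatacontainer@opsbdata.blob.core.windows.net/testdir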

Creating a Hive Table on Azure Blob Storage

DROP table scott.ITEMS;


CREATE EXTERNAL TABLE scott.ITEMS(
ITEM_ID int,
ITEM_NAME string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION 'wasb://opsbdatacontainer@opsbdata.blob.core.windows.net/testdir/';

OR

CREATE EXTERNAL TABLE scott.ITEMS(
ITEM_ID int,
ITEM_NAME string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
STORED AS TEXTFILE
LOCATION 'wasb://opsbdatacontainer@opsbdata.blob.core.windows.net/testdir/';

[hdfs@nn01 ~]$  beeline -u jdbc:hive2://dn04:10000/flume -n hive -p hive
Connecting to jdbc:hive2://dn04:10000/flume
Connected to: Apache Hive (version 1.2.1000.2.6.1.0-129)
Driver: Hive JDBC (version 1.2.1000.2.6.1.0-129)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.2.1000.2.6.1.0-129 by Apache Hive
0: jdbc:hive2://dn04:10000/flume> CREATE EXTERNAL TABLE scott.ITEMS(
0: jdbc:hive2://dn04:10000/flume> ITEM_ID int,
0: jdbc:hive2://dn04:10000/flume> ITEM_NAME string
0: jdbc:hive2://dn04:10000/flume> )
0: jdbc:hive2://dn04:10000/flume> STORED AS TEXTFILE
0: jdbc:hive2://dn04:10000/flume> LOCATION 'wasb://opsbdatacontainer@opsbdata.blob.core.windows.net/testdir/';
Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:java.security.AccessControlException: Permission denied: user=hive, path="wasb://opsbdatacontainer@opsbdata.blob.core.windows.net/testdir":hdfs:supergroup:drw-r--r--) (state=08S01,code=1)
0: jdbc:hive2://dn04:10000/flume>

If you get this permission error, resolve it first and then create the Hive table.

[root@nn01 ~]# su - hdfs
[hdfs@nn01 ~]$ hdfs dfs -chmod -R 777 wasb://opsbdatacontainer@opsbdata.blob.core.windows.net/testdir
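You can confirm the new permissions on the directory before re-running the CREATE TABLE, for example:

[hdfs@nn01 ~]$ hdfs dfs -ls -d wasb://opsbdatacontainer@opsbdata.blob.core.windows.net/testdir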


0: jdbc:hive2://dn04:10000/flume> select * from scott.ITEMS;
+----------------+------------------+--+
| items.item_id  | items.item_name  |
+----------------+------------------+--+
| 1              | Item1            |
| 2              | Item2            |
| 3              | Item3            |
| 4              | Item4            |
| 5              | Item5            |
| 6              | Item8            |
+----------------+------------------+--+
6 rows selected (5.055 seconds)