
Note: All posts take a practical approach and avoid lengthy theory. Everything has been tested on development servers. Please don't try any post on production servers until you are sure.

Tuesday, August 01, 2017

Hortonworks - Using HDP Spark SQL



Using SQLContext, Apache Spark SQL can read data directly from the file system. This is useful when the data you are trying to analyze does not reside in Apache Hive (for example, JSON files stored in HDFS).

Using HiveContext, Spark SQL can also read data by interacting with the Hive MetaStore. If you already use Hive, you should use HiveContext; it supports all Hive data formats and user-defined functions (UDFs), and it enables you to have full access to the HiveQL parser. HiveContext extends SQLContext, so HiveContext supports all SQLContext functionality.



[root@dn04 ~]# su spark
[spark@dn04 root]$ spark-shell
scala> import org.apache.spark.sql._
scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
scala> val dfcsv = sqlContext.read.csv("/data/common/emp.csv")
scala> dfcsv.show(2)

+----+-----+--------+----+---------+----+----+---+
| _c0|  _c1|     _c2| _c3|      _c4| _c5| _c6|_c7|
+----+-----+--------+----+---------+----+----+---+
|7369|SMITH|   CLERK|7902|17-DEC-80| 800|null| 20|
|7499|ALLEN|SALESMAN|7698|20-FEB-81|1600| 300| 30|
+----+-----+--------+----+---------+----+----+---+
only showing top 2 rows
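
Because the CSV file has no header row, Spark assigns the generic column names _c0 through _c7. A quick sketch of giving them meaningful names with toDF (the names below are assumed from the classic SCOTT.EMP layout; adjust to your actual file):

```scala
// Rename the default _c0.._c7 columns; names assume the SCOTT.EMP layout
scala> val emp = dfcsv.toDF("empno", "ename", "job", "mgr",
                            "hiredate", "sal", "comm", "deptno")
scala> emp.select("ename", "sal").show(2)
```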

Reading a JSON file in HDFS

scala> val df = spark.read.json("/flume/twitter/FlumeData.1501481206851")
scala> df.show(2)
scala> df.select("text").show(2)
scala> df.select($"text", $"id").show(2)
scala> df.count()
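
The loaded JSON DataFrame can also be queried with plain SQL by registering it as a temporary view (the view name tweets below is just an illustration):

```scala
// Register the DataFrame as a temporary view, then query it with SQL
scala> df.createOrReplaceTempView("tweets")
scala> spark.sql("SELECT text, id FROM tweets LIMIT 2").show()
```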


Query Hive Table
scala> sql("SELECT * FROM scott.dept").show()
+------+----------+--------+
|deptno|     dname|     loc|
+------+----------+--------+
|    10|ACCOUNTING|NEW YORK|
|    20|  RESEARCH|  DALLAS|
|    30|     SALES| CHICAGO|
|    40|OPERATIONS|  BOSTON|
+------+----------+--------+
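
Multiple Hive tables can be combined in a single query as well. A sketch joining scott.dept with scott.emp (assuming the emp table exists in the same schema):

```scala
// Join emp and dept and count employees per department
scala> sql("""SELECT d.dname, COUNT(*) AS emp_count
              FROM scott.emp e
              JOIN scott.dept d ON e.deptno = d.deptno
              GROUP BY d.dname""").show()
```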



Connect to Spark with beeline
By default, the Hortonworks Spark2 Thrift Server listens on port 10016; just connect to it using that port.

[spark@dn04 current]$ beeline
Beeline version 1.2.1000.2.6.1.0-129 by Apache Hive
beeline> !connect jdbc:hive2://dn04:10016/scott
Connecting to jdbc:hive2://dn04:10016/scott
Enter username for jdbc:hive2://dn04:10016/scott:
Enter password for jdbc:hive2://dn04:10016/scott:
Connected to: Spark SQL (version 2.1.1.2.6.1.0-129)
Driver: Hive JDBC (version 1.2.1000.2.6.1.0-129)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://dn04:10016/scott> show tables;
+-----------+------------+--------------+--+
| database  | tableName  | isTemporary  |
+-----------+------------+--------------+--+
| scott     | dept       | false        |
| scott     | dual       | false        |
| scott     | emp        | false        |
+-----------+------------+--------------+--+
3 rows selected (0.121 seconds)



Connect to Spark SQL via JDBC/ODBC
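
For programmatic access, the same Thrift Server endpoint used by beeline can be reached with the Hive JDBC driver. A minimal sketch, assuming the host, port, database, username, and empty password from the beeline example, and that the Hive JDBC jar is on the classpath:

```scala
import java.sql.DriverManager

// Register the Hive JDBC driver (optional with newer drivers, which self-register)
Class.forName("org.apache.hive.jdbc.HiveDriver")

// Same endpoint as the beeline example; adjust host/port/credentials as needed
val conn = DriverManager.getConnection(
  "jdbc:hive2://dn04:10016/scott", "spark", "")
val stmt = conn.createStatement()
val rs   = stmt.executeQuery("SELECT deptno, dname, loc FROM dept")
while (rs.next()) {
  println(s"${rs.getInt("deptno")}  ${rs.getString("dname")}  ${rs.getString("loc")}")
}
rs.close(); stmt.close(); conn.close()
```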






