SparkSQL and JDBC Connection
SQL is one of the most widely used languages for querying data in databases. SparkSQL lets users run SQL commands to process data in Spark as if the data were stored in a database: it translates the SQL into Spark’s RDD operations. With Spark’s Thrift server, any JDBC-compatible client can submit SQL queries to SparkSQL.
In this section, we create a Spark cluster with a Thrift server and then show how to use a JDBC client, beeline, to manipulate and query data.
Launch a Spark cluster with Thrift server
Follow this guide to launch a Spark cluster.
Then, click on Scale to add a superset node, which has the Thrift server installed.
This node also runs Apache Superset, a data visualization tool that can connect to the Thrift server via a JDBC connection. If interested, see the Thailand Government Spending example for a guide to using SparkSQL with Superset.
Note the IP address of the superset node; we will use it to log in to the node in the next step.
Login to the node and download sample data
Log in to the superset node using its IP address.
$ ssh centos@IP_OF_SUPERSET
[centos@test-superset-0 ~]$
Run the following commands to download the sample data and put it into HDFS. The data comes from the UCI Bank Marketing dataset.
[centos@test-superset-0 ~]$ wget http://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip
[centos@test-superset-0 ~]$ unzip bank.zip
[centos@test-superset-0 ~]$ hdfs dfs -put bank-full.csv /user/centos
The sample data looks like:
"age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y"
58;"management";"married";"tertiary";"no";2143;"yes";"no";"unknown";5;"may";261;1;-1;0;"unknown";"no"
44;"technician";"single";"secondary";"no";29;"yes";"no";"unknown";5;"may";151;1;-1;0;"unknown";"no"
Submit queries with beeline
Connect beeline to the Thrift server. Suppose the superset node is named test-superset-0. We set the connection string to
jdbc:hive2://localhost:10000
Set the username to centos. The password is ignored by default.
[centos@test-superset-0 ~]$ beeline
Beeline version 1.2.1.spark2 by Apache Hive
beeline> !connect jdbc:hive2://localhost:10000
Connecting to jdbc:hive2://localhost:10000
Enter username for jdbc:hive2://localhost:10000: centos
Enter password for jdbc:hive2://localhost:10000:
18/08/01 16:48:08 INFO Utils: Supplied authorities: localhost:10000
18/08/01 16:48:08 INFO Utils: Resolved authority: localhost:10000
18/08/01 16:48:08 INFO HiveConnection: Will try to open client transport with JDBC Uri: jdbc:hive2://localhost:10000
Connected to: Spark SQL (version 2.3.1)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://localhost:10000>
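Alternatively, you can skip the interactive !connect step and pass the connection string and username on the command line (assuming the same host, port, and user as above):
[centos@test-superset-0 ~]$ beeline -u jdbc:hive2://localhost:10000 -n centos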
Then, type the following SQL commands in beeline.
CREATE DATABASE bank;
USE bank;
CREATE TABLE bank
USING csv
OPTIONS (header 'true', inferSchema 'true', delimiter '\073', mode 'FAILFAST')
LOCATION '/user/centos/bank-full.csv';
SELECT age, count(1) FROM bank GROUP BY age ORDER BY age;
+------+-----------+--+
| age | count(1) |
+------+-----------+--+
| 18 | 12 |
| 19 | 35 |
| 20 | 50 |
| 21 | 79 |
... skip ...
Note that beeline cannot parse a literal ; (semicolon) inside a SQL statement correctly, so in the OPTIONS clause above we pass the delimiter as \073, its ASCII code in octal, instead.
Type Ctrl-D to exit from beeline.
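Once the table exists, you can also run a single query without entering the interactive shell by passing it to beeline with the -e option. Here is a sketch that counts rows per job, assuming the same connection settings and the bank.bank table created above:
[centos@test-superset-0 ~]$ beeline -u jdbc:hive2://localhost:10000 -n centos \
    -e "SELECT job, count(1) AS n FROM bank.bank GROUP BY job ORDER BY n DESC"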