Using Zeppelin Notebook for Spark

Zeppelin is a web-based notebook for interactive programming and data visualization in browser. It supports many programming languages via Zeppelin interpreters such as scala, python, R, SQL and Bash.

Start a Zeppelin node

First, launch a spark cluster as described previously here. Then, click on the “Scale Cluster” button.

Add a zeppelin node group and click Scale.

After the cluster has finished scaling, click on the zeppelin URL to bring up the Zeppelin front page.

Create a notebook

Let’s create a new notebook.

Enter the note name, e.g. Bank and click Create.

Load sample data

In the first paragraph, copy and paste the following line. Click on the play button and wait until the paragraph has finished executing. This paragraph executes commands in Bash shell (as denoted by %sh interpreter in the first line).

%sh
wget http://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip
unzip bank.zip
hdfs dfs -put bank-full.csv /user/centos

The sample data looks like:

"age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y"
58;"management";"married";"tertiary";"no";2143;"yes";"no";"unknown";5;"may";261;1;-1;0;"unknown";"no"
44;"technician";"single";"secondary";"no";29;"yes";"no";"unknown";5;"may";151;1;-1;0;"unknown";"no"

This sample data is from UCI bank dataset.

Read data and create a table in SparkSQL

In the next paragraph, remove %sh in the first line in order to use the default spark interpreter.

Copy the following code block and play this paragraph. The program will read the CSV file and create a “bank” table. It may take seconds to finish due to the overhead of creating a new Spark context. This example is adapted from Zeppelin Tutorial.

val bankText = sc.textFile("/user/centos/bank-full.csv")

case class Bank(age:Integer, job:String, marital : String, education : String, balance : Integer)

// split each line, filter out header (starts with "age"), and map it into Bank case class
val bank = bankText.map(s=>s.split(";")).filter(s=>s(0)!="\"age\"").map(
    s=>Bank(s(0).toInt,
            s(1).replaceAll("\"", ""),
            s(2).replaceAll("\"", ""),
            s(3).replaceAll("\"", ""),
            s(5).replaceAll("\"", "").toInt
        )
)

// convert to DataFrame and create temporal table
bank.toDF().registerTempTable("bank")

Query and create charts

Now, run the following SQL query in the next paragraph.

%sql
select age, count(1) from bank group by age order by age

The query result is shown in table by default.

Click on the Area Chart button to change the display of the result.

In the next paragraph, create a pie chart with the following query.

select marital, count(1) from bank group by marital

Then a bar chart with query:

select job, count(1) from bank group by job order by count(1) desc

We can adjust size and layout of the paragraphs as follows.

Scale down Zeppelin node group if it is no longer used

To shutdown the Zeppelin node, please scale down the Zeppelin node group to 0. The KitWai backend service will disconnect it from the cluster cleanly. Attempting to terminate the instance directly may cause undesirable results.