Using RStudio Server with SparkR
This section shows how to use RStudio Server with SparkR on a Spark cluster.
Start an RStudio server node
First, launch a Spark cluster as described previously here. Then, click the “Scale Cluster” button.
Add an rstudio node group and click Scale.
After the cluster has finished scaling, click on the RStudio URL to bring up the RStudio Web UI.
Sign in and change the default password
Sign in with centos as both the default username and password to enter the RStudio IDE.
After logging in, change the password by opening the Tools menu and selecting Shell.
In the terminal tab, type passwd to change the password.
Create a Spark session
Back in the R console tab, type
sparkR.session()
to create a Spark session.
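Depending on how the image is set up, the SparkR package may need to be attached before the session can be created; a minimal sketch, assuming Spark is installed under SPARK_HOME (the appName used here is illustrative):

```
# Attach SparkR from the Spark installation directory (path is an assumption)
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))

# Start (or connect to) a Spark session
sparkR.session(appName = "rstudio-sparkr-demo")
```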
Create a DataFrame and query it using SQL
Let’s create a DataFrame from R’s faithful dataset and display its first rows using SparkR::head().
df <- as.DataFrame(faithful)
SparkR::head(df)
We should see the following result.
eruptions waiting
1 3.600 79
2 1.800 54
3 3.333 74
4 2.283 62
5 4.533 85
6 2.883 55
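Beyond head(), ordinary SparkR transformations can be applied to the same DataFrame. As a small sketch using standard SparkR API functions, the following computes the average eruption time per waiting interval (column names come from the faithful dataset):

```
df <- as.DataFrame(faithful)

# Group by waiting time and average the eruption durations
agg <- summarize(groupBy(df, df$waiting),
                 avg_eruptions = avg(df$eruptions))

# Show the first rows, ordered by waiting time
head(arrange(agg, agg$waiting))
```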
We can also read a CSV file into a DataFrame and query it with SQL commands. Let’s download a sample bank dataset from the UCI bank dataset page. In the terminal tab, type the following shell commands.
[centos@test-rstudio-0 ~]$ wget http://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip
[centos@test-rstudio-0 ~]$ unzip bank.zip
[centos@test-rstudio-0 ~]$ hdfs dfs -put bank-full.csv /user/centos
In the R console, read the bank-full.csv file into a DataFrame named bankDF.
bankDF <- read.df("bank-full.csv", source = "csv", header = TRUE, delimiter = ";")
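By default, a CSV read like this may load every column as strings. Passing the inferSchema option (a standard Spark CSV reader option) lets Spark detect numeric columns, which can be verified with printSchema(); a sketch:

```
# Re-read the CSV and let Spark infer column types
bankDF <- read.df("bank-full.csv", source = "csv",
                  header = TRUE, delimiter = ";", inferSchema = "true")

# Inspect the resulting schema (age should now be numeric, not string)
printSchema(bankDF)
```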
Create a temporary view in Spark SQL and execute an SQL query.
createOrReplaceTempView(bankDF, "bank")
resultDF <- sql("SELECT age, count(1) FROM bank GROUP BY age ORDER BY age")
showDF(resultDF)
+---+--------+
|age|count(1)|
+---+--------+
| 18| 12|
| 19| 35|
| 20| 50|
| 21| 79|
| 22| 129|
| 23| 202|
| 24| 302|
| 25| 527|
| 26| 805|
| 27| 909|
| 28| 1038|
| 29| 1185|
| 30| 1757|
| 31| 1996|
| 32| 2085|
| 33| 1972|
| 34| 1930|
| 35| 1894|
| 36| 1806|
| 37| 1696|
+---+--------+
only showing top 20 rows
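The query result can also be pulled back to the driver as a local R data frame with collect(), after which any base R tooling applies; a small sketch (the plot is illustrative):

```
# Bring the distributed result back as an ordinary R data.frame
localDF <- collect(resultDF)

# The second column is named "count(1)", so index it by position
plot(localDF$age, localDF[[2]], type = "h",
     xlab = "age", ylab = "count")
```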
Currently, SparkR sees less development activity than the mainstream Scala, Java, and Python Spark APIs. For other SparkR operations, please see the official documentation.
Install additional R libraries
Operations that process DataFrames on Spark worker nodes and require additional R packages need those packages to be present on every worker node. Users must therefore install the additional packages manually on each node, following the standard R package installation procedure.
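For example, a package such as forecast (a hypothetical choice here) would be installed from the R console with the standard command, repeated on each worker node:

```
# Standard R package installation; must be run on every worker node as well
install.packages("forecast", repos = "https://cloud.r-project.org")
```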
Scale down the RStudio node group when it is no longer used
To shut down the RStudio node, scale the RStudio node group down to 0. The KitWai backend service will then disconnect it from the cluster cleanly. Terminating the instance directly may cause undesirable results.