Requirements

First, you need to have R and Java installed. Installing them is outside the scope of this note, but you can find lots of good tutorials that cover it. If you have problems, please let us know (you can find the email address below).

Next, check that you can install rJava, the R interface to Java. This can be done by calling

install.packages("rJava")

Depending on your operating system, this can produce an error, in which case some extra steps are required. Try searching for something like "rJava install error ubuntu". Again, please let us know if you get stuck.
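Once the installation finishes, a quick way to confirm that rJava can actually find and start Java is to initialise the JVM from R and ask it for its version. This is only a minimal sketch, assuming rJava installed without errors; the variable name java_version is ours.

```r
# Load the R-to-Java bridge; this fails if rJava was not installed correctly.
library(rJava)

# Start a Java Virtual Machine inside the current R session.
.jinit()

# Call the static Java method System.getProperty("java.version").
# "S" tells rJava that the method returns a Java String.
java_version <- .jcall("java/lang/System", "S", "getProperty", "java.version")
print(java_version)
```

If library(rJava) or .jinit() fails here, the problem is most likely in the Java setup rather than in Spark.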

You can also install a few packages we may use in the presentation. Simply call:

install.packages(c("rmarkdown", "ggplot2", "pipeR", "whisker", "data.table", "reshape2"))
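If some of these packages are already installed, you may prefer to install only the missing ones. The following is a small sketch of that idea; pkgs and missing_pkgs are names we made up for illustration.

```r
# The packages listed above.
pkgs <- c("rmarkdown", "ggplot2", "pipeR", "whisker", "data.table", "reshape2")

# Keep only those that are not yet present in the local library.
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))

# Install whatever is missing; do nothing if everything is already there.
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}
```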

Downloading Spark

The easiest way to get Spark is to download a pre-built version from http://spark.apache.org/downloads.html.

SparkR test drive

The way we use SparkR here is far from being an example of best practice, not only because it does not run on more than one computer, but also because we isolate the SparkR package from other packages by hardcoding the library path. It should, however, help you set up Spark(R) quickly for a test drive.

Let's say you have downloaded Spark and unpacked it to the folder

/home/bartek/programs/spark-1.5.2-bin-hadoop2.6

Every time you see this path, please change it to the one you have. (Windows users will probably also need to adjust the path separators: in an R string a backslash must be written as \\, while forward slashes / usually work as well.)
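To avoid editing the path in several places, one option is to store it once in a variable and build the SparkR library folder from it. This is just a sketch; spark_home and spark_r_lib are names we chose here, and the path is the example path from above.

```r
# Change this single line to the folder where you unpacked Spark.
spark_home <- "/home/bartek/programs/spark-1.5.2-bin-hadoop2.6"

# The SparkR package shipped with Spark lives in the R/lib subfolder.
# file.path() joins the parts with "/", which R accepts on all platforms.
spark_r_lib <- file.path(spark_home, "R", "lib")

# Warn early if the folder is not where we expect it.
if (!file.exists(spark_r_lib)) {
  warning("SparkR library folder not found: ", spark_r_lib)
}
```

With these in place, lib.loc = spark_r_lib and sparkHome = spark_home can be used in the calls to library() and sparkR.init().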

Then try to run the following code:

Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.2.0" "sparkr-shell"')
library(SparkR, lib.loc = "/home/bartek/programs/spark-1.5.2-bin-hadoop2.6/R/lib/")
sc <- sparkR.init(master = "local", sparkHome = "/home/bartek/programs/spark-1.5.2-bin-hadoop2.6")
## Launching java with spark-submit command /home/bartek/programs/spark-1.5.2-bin-hadoop2.6/bin/spark-submit   "--packages" "com.databricks:spark-csv_2.10:1.2.0" "sparkr-shell" /tmp/RtmpotcRqA/backend_portc33355e4884
sqlContext <- sparkRSQL.init(sc)

library('pipeR')
library('ggplot2')

df <- createDataFrame(sqlContext, mtcars)
model <- glm(mpg ~ wt, data = df, family = "gaussian")
summary(model)
## $coefficients
##              Estimate
## (Intercept) 37.285126
## wt          -5.344472
predictions <- predict(model, newData = df)
class(predictions)
## [1] "DataFrame"
## attr(,"package")
## [1] "SparkR"
predictions %>>%
  select("wt", "mpg", "prediction") %>>%
  collect %>>%
  ggplot() + geom_point(aes(wt, prediction - mpg))  +
  geom_hline(yintercept = 0) + theme_bw()

sparkR.stop()
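Because mtcars is a small data set that ships with R, the coefficients reported by SparkR can be cross-checked against an ordinary linear model fitted locally, without Spark. The sketch below uses base R's lm; local_fit and local_residuals are our own names.

```r
# Fit the same model, mpg ~ wt, on the local mtcars data frame.
local_fit <- lm(mpg ~ wt, data = mtcars)

# These estimates should agree with the SparkR summary(model) output above.
print(coef(local_fit))

# The same quantity that the plot shows: prediction minus observed mpg.
local_residuals <- fitted(local_fit) - mtcars$mpg
print(summary(local_residuals))
```

If the two sets of coefficients differ noticeably, the Spark setup is the first thing to check.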

Contact

bartekskorulski@gmail.com