First, you must have R and java installed. This is a bit out the scope of this note, but Let me cover few things.
On Ubuntu:
sudo add-apt-repository ppa:webupd8team/jav
sudo apt update
sudo apt install oracle-java8-installer
You can manage java version by calling
sudo update-alternatives --config java
and editing:
sudo nano /etc/environment
by adding line
JAVA_HOME="/usr/lib/jvm/java-8-oracle"
Then test it by calling
source /etc/environment
echo $JAVA_HOME
java -version
https://launchpad.net/~marutter/+archive/ubuntu/c2d4u
sudo add-apt-repository ppa:marutter/c2d4u
sudo apt update
sudo apt install r-base r-base-dev
sudo apt-get install gdebi-core
wget https://download1.rstudio.org/rstudio-xenial-1.1.453-amd64.deb
sudo gdebi rstudio-xenial-1.1.453-amd64.deb
rm rstudio-xenial-1.1.453-amd64.deb
Next, you have to see if you can install R interface to java. This can be done by calling
sudo R CMD javareconf
Now you can install in R it:
install.packages("rJava")
Depending or your Operating System this can produce an error, and some extra action would be required. Please google something like rJava install error ubuntu
etc.
Test it in R by calling, for example:
library(rJava)
.jinit()
## [1] 0
Double <- J("java.lang.Double")
d <- new(Double, "10.2")
d
## [1] "Java-Object{10.2}"
https://github.com/snowflakedb/dplyr-snowflakedb/wiki/Configuring-R-rJava-RJDBC-on-Mac-OS-X
You can also install few packages we may use in presentation. Simply call:
install.packages(c("rmarkdown", "ggplot2", "magrittr", "whisker", "data.table"))
rmarkdown
: package that makes possible to create this documentggplot2
: popular graphing package,magrittr
: nicer functions chainingwhisker
: it’s Movemberdata.table
, reshape2
: an alternative for “data.frame”The easiest way to downloading spark is getting pre-bulid version from [http://spark.apache.org/downloads.html].
The way we use SparkR here is far from being en example of best practice. Not only because it does not run on more than one computer, but also because we isolate the SparkR package from other packages by hardcoding library path. This should help you set up you Spark(R) fast for test drive.
Lets say you have downloaded and uncompress it to the folder
/home/bartek/programs/spark-2.3.0-bin-hadoop2.7
So every time you see this path please change it to the one you have (windows users have to probably change also slashes /
to backslashes \
and add something like /C/
)
Then try to run the following code in R:
spark_path <- '/Users/bartek/programs/spark-2.3.0-bin-hadoop2.7'
if (nchar(Sys.getenv("SPARK_HOME")) < 1) {
Sys.setenv(SPARK_HOME = spark_path)
}
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = "2g"))
## Java ref type org.apache.spark.sql.SparkSession id 1
library('magrittr')
library('ggplot2')
df <- as.DataFrame(mtcars)
model <- glm(mpg ~ wt, data = df, family = "gaussian")
summary(model)
##
## Deviance Residuals:
## (Note: These are approximate quantiles with relative error <= 0.01)
## Min 1Q Median 3Q Max
## -4.5432 -2.6110 -0.2001 1.2973 6.8727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.2851 1.8776 19.858 0.000e+00
## wt -5.3445 0.5591 -9.559 1.294e-10
##
## (Dispersion parameter for gaussian family taken to be 9.277398)
##
## Null deviance: 1126.05 on 31 degrees of freedom
## Residual deviance: 278.32 on 30 degrees of freedom
## AIC: 166
##
## Number of Fisher Scoring iterations: 1
predictions <- predict(model, newData = df)
class(predictions)
## [1] "SparkDataFrame"
## attr(,"package")
## [1] "SparkR"
predictions %>%
select("wt", "mpg", "prediction") %>%
collect %>%
ggplot() + geom_point(aes(wt, prediction - mpg)) +
geom_hline(yintercept=0) + theme_bw()
One can also install sparklyr. First we install devtools.
install.packages("devtools")
devtools::install_github("hadley/devtools") ## for latest version
devtools::install_github("rstudio/sparklyr")
devtools::install_github("tidyverse/dplyr")
Check available versions:
library(sparklyr)
spark_available_versions()
## spark
## 1 1.6.3
## 2 1.6.2
## 3 1.6.1
## 4 1.6.0
## 5 2.0.0
## 6 2.0.1
## 7 2.0.2
## 8 2.1.0
## 9 2.1.1
## 10 2.2.0
## 11 2.2.1
## 12 2.3.0
spark_install(version = "2.3.0")
## Spark 2.3.0 for Hadoop 2.7 or later already installed.
library(sparklyr)
sc <- spark_connect(master = "local")
spark_path = sc$spark_home
spark_disconnect(sc)
Sys.setenv(
SPARK_HOME=spark_path
)
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = "2g"))
## Java ref type org.apache.spark.sql.SparkSession id 1