R is a programming language for statistical computing that is widely used by statisticians and data scientists. Running R applications on a single machine was sufficient for a long time, but it becomes a limiting factor as data volumes grow and analyses become more advanced.
That's why the R community developed sparklyr to scale data engineering, data science, and machine learning with Apache Spark. It supports the main Apache Spark use cases (batch, streaming, ML, graph, and SQL) and integrates with well-known R packages such as dplyr, DBI, and broom. More information can be found on sparklyr.ai.
The problem is that the integration between sparklyr and Apache Spark is brittle: it's hard to get the right mix of libraries and environment setup. One of our customers tried to get it working on EMR and described the experience as "a nightmare". By contrast, by building their own Docker images and running them on our Spark-on-Kubernetes platform, they were able to make their sparklyr setup work reliably.
So let's see how to get your sparklyr applications running at scale with Spark-on-Kubernetes! All of the code for this tutorial is available in this GitHub repository.
Requirements
You must configure a Docker image. This is the most difficult part, but we did it for you!
The following Dockerfile uses one of our published images as a base - see this blog post and our Docker Hub repository for more details on these images.
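The sketch below is illustrative rather than the exact Dockerfile from the repository: the base image tag, the package list, the littler symlink path, and the COPY destination are assumptions, so check the Docker Hub repository for the tags currently available and adapt the paths to your base distribution.

```dockerfile
# Illustrative sketch -- the base image tag below is an example, not necessarily the latest one.
FROM datamechanics/spark:3.1.1-hadoop-3.2.0-java-8-scala-2.12-python-3.8-latest

USER root

# Install R and littler, which provides the install2.r helper script
# (assumes a Debian/Ubuntu-based image; the symlink path may differ on your base).
RUN apt-get update && \
    apt-get install -y --no-install-recommends r-base r-base-dev r-cran-littler r-cran-docopt && \
    ln -sf /usr/lib/R/site-library/littler/examples/install2.r /usr/local/bin/install2.r && \
    rm -rf /var/lib/apt/lists/*

# Tune the list of R packages your application needs here
RUN install2.r --error \
    sparklyr \
    tidyverse \
    arrow

# Copy your application code into the image (destination path is an assumption)
COPY RExamples.R /opt/spark/work-dir/
```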
You can tune the installed packages in the RUN install2.r section. The tidyverse contains many well-known packages such as dplyr and ggplot2.
Once your image is built and available in your registry, it contains all your dependencies and only takes a few seconds to load when you run your applications.
Develop your sparklyr application
We will show you a few code samples to get started. You can find more examples in the sparklyr GitHub repo.
There are two critical topics:
- Creating the Spark Session
- Understanding when your R object is an interface to a Spark DataFrame or to a local R dataset.
Experienced sparklyr developers can look at the Spark session creation and then skip directly to Run your Spark application at scale.
Create the Spark Session
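Here is a minimal sketch of the session creation. The values below are placeholders: "local[*]" is convenient for a quick test on your laptop, while on a Spark-on-Kubernetes platform the master URL and most Spark settings are provided when the application is submitted, so a plain spark_config() is usually enough.

```r
library(sparklyr)

# Placeholder configuration -- adjust these keys and values to your own application.
conf <- spark_config()
conf$spark.sql.shuffle.partitions <- 8   # example tuning key

# "local[*]" is handy for local testing; on the platform the master and
# Kubernetes settings are injected at submission time.
sc <- spark_connect(master = "local[*]", config = conf, app_name = "sparklyr-example")
```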
Create a Spark DataFrame
The sparklyr copy_to function returns a reference to the generated Spark DataFrame as a tbl_spark. The returned object will act as a dplyr-compatible interface to the underlying Spark table.
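For example, the built-in mtcars dataset can be copied to Spark (the table name "cars" is just an example reused throughout this post):

```r
# Copy a local R data frame to Spark; cars_tbl is a tbl_spark, i.e. a
# dplyr-compatible reference to the Spark DataFrame registered as "cars".
cars_tbl <- copy_to(sc, mtcars, name = "cars", overwrite = TRUE)
cars_tbl
```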
List available Spark tables
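Both the dplyr and DBI interfaces can list the tables registered in the session:

```r
# dplyr interface
src_tbls(sc)

# DBI interface
DBI::dbListTables(sc)
```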
Use dplyr (see documentation)
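Standard dplyr verbs are translated to Spark SQL and executed on the cluster; collect() brings the (small) result back as a local R data frame. For example:

```r
library(dplyr)

cars_tbl %>%
  filter(cyl >= 6) %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  arrange(cyl) %>%
  collect()
```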
Apply an R function to a Spark DataFrame
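spark_apply() runs an R function on each partition of the Spark DataFrame; the function receives a plain R data frame and must return one. A small sketch:

```r
cars_tbl %>%
  spark_apply(function(df) {
    df$kpl <- df$mpg * 0.425   # add a kilometres-per-litre column
    df
  })
```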
Cache
Spark DataFrames can be explicitly cached in memory and uncached when they are no longer needed.
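For example, with the "cars" table created above:

```r
# Cache the "cars" table in memory...
tbl_cache(sc, "cars")

# ...and release it when it is no longer needed
tbl_uncache(sc, "cars")
```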
Query Spark tables with SQL
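The sparklyr connection also implements the DBI interface, so plain SQL can be sent to Spark; for example:

```r
library(DBI)

dbGetQuery(sc, "SELECT cyl, AVG(mpg) AS avg_mpg FROM cars GROUP BY cyl")
```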
Writing and Reading Parquet
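The bucket path below is a placeholder; when running on a cluster, write to a distributed store (S3, GCS, ADLS, ...) rather than the driver's local filesystem:

```r
# Write the Spark DataFrame as Parquet files
spark_write_parquet(cars_tbl, path = "s3a://<your-bucket>/cars", mode = "overwrite")

# Read them back into a new Spark table
cars_parquet <- spark_read_parquet(sc, name = "cars_parquet",
                                   path = "s3a://<your-bucket>/cars")
```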
Creating Plots
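A common pattern is to aggregate in Spark, collect the small result into R, and plot it with ggplot2; the output path below is just an example:

```r
library(dplyr)
library(ggplot2)

# Aggregate in Spark, then collect the (small) result locally
mpg_by_cyl <- cars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()

p <- ggplot(mpg_by_cyl, aes(x = factor(cyl), y = avg_mpg)) +
  geom_col() +
  labs(x = "Cylinders", y = "Average miles per gallon")

# Save the plot to a local file on the driver
ggsave("/tmp/mpg_by_cyl.png", plot = p)
```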
Various R packages, such as cloudyr or AzureStor, can then be used to copy the file to cloud storage.
Don't forget to end the Spark Session
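Disconnecting releases the cluster resources held by the application:

```r
spark_disconnect(sc)
```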
Run your Spark application at scale
You must first define a Data Mechanics configuration through a template or a configOverride.
The code to be executed is in the file RExamples.R, which was copied into the Docker image. Other ways to package your applications are documented here.
The Data Mechanics platform allows you to monitor and optimize your Spark applications with Delight, our new and improved Spark UI and History server.
Conclusion
Special thanks go to our customer running sparklyr workloads on the Data Mechanics platform for sharing their tricks and setup. We hope this tutorial will help you be successful with Spark and R!