Using Spark from R for performance with arbitrary code
2020-02-20
Chapter 1 Welcome
Apache Spark is a popular open-source analytics engine for big data processing, and thanks to the sparklyr and SparkR packages, the power of Spark is also available to R users.
1.1 What will you find in this book
This short publication attempts to provide practical insights into using the sparklyr interface to gain the benefits of Apache Spark while still retaining the ability to use R code organized in custom-built functions and packages.
It focuses on the different interfaces that the sparklyr package provides for communication between R and Spark, namely:
- Constructing functions by piping dplyr verbs
- Constructing SQL and executing it with Spark
- Using the lower-level invoke API to manipulate Spark’s Java objects from R
- Exploring the invoke API from R with Java reflection and examining invokes with logs
If you are interested in the sparklyr package and working with Spark from R in general, we strongly recommend the very comprehensive Mastering Spark with R book available online for free.
1.2 Book sources
This book is rendered and published automatically from publicly accessible Git repositories. You can find the following:
- Content sources in the sparkfromr GitHub repository
- Rendered version in the sparkfrom_deployed GitHub repository
- Automatically built Docker image used to render the book on DockerHub
- Sources used to build the Docker images in the sparkfrom_docker GitHub repository
All contributions to the above are most welcome.
1.3 Acknowledgments
The creation of this book would not have been possible without many openly available resources, such as the R packages around the rmarkdown ecosystem created by Yihui Xie, in particular the bookdown package with which this publication is rendered. This project also relies heavily on the Rocker Project, which provides Docker images for the R environment thanks to Carl Boettiger, Dirk Eddelbuettel, and Noam Ross. Last but not least, there would be nothing to write about in this short book without the sparklyr package written by Javier Luraschi et al., the R programming language maintained by the R Core Team, and Apache Spark with its creators and maintainers. My thanks go to the creators and maintainers of all these amazing open-source tools.
Differences of habit and language are nothing at all if our aims are identical and our hearts are open.
- Albus Dumbledore