Chapter 4 Connecting and using a local Spark instance
The following code chunks will prepare the R session so that we can experiment with the code presented in the book. We will attach the needed R packages, initialize and connect to a local Spark instance, and copy the weather and flights datasets from the nycflights13 package to the Spark instance so that we can work with them in our examples.
4.1 Packages and data
We will use the sparklyr package to interface with Spark. Since sparklyr works very well with the dplyr package and its vocabulary, we will take advantage of the dplyr syntax and the pipe operator. As a source of data for our code examples, we will use the nycflights13 package, which conveniently provides airline data for flights departing New York City in 2013.
# Attach packages
suppressPackageStartupMessages({
  library(sparklyr)
  library(dplyr)
  library(nycflights13)
})
# Add an id column to the datasets
weather <- nycflights13::weather %>%
  mutate(id = 1L:nrow(nycflights13::weather)) %>%
  select(id, everything())

flights <- nycflights13::flights %>%
  mutate(id = 1L:nrow(nycflights13::flights)) %>%
  select(id, everything())
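To quickly check that the id columns were added as intended, we can peek at the first rows of the local data frames before sending them to Spark. This is just an optional sanity check, not part of the original preparation:

# Optional sanity check on the local data frames
weather %>% select(id, origin, time_hour) %>% head()
flights %>% select(id, carrier, flight, time_hour) %>% head()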
4.2 Connecting to Spark and providing it with data
As a second step, we will now connect to a Spark instance running on our local machine and send the prepared data to the instance using the copy_to() function so that we can work with the data in Spark. We assign the outputs of copy_to() to objects called tbl_weather and tbl_flights, which are references to the DataFrame objects within Spark.
# Connect to a local Spark instance
sc <- sparklyr::spark_connect(master = "local")
# Copy the weather dataset to the instance
tbl_weather <- dplyr::copy_to(
  dest = sc,
  df = weather,
  name = "weather",
  overwrite = TRUE
)
# Copy the flights dataset to the instance
tbl_flights <- dplyr::copy_to(
  dest = sc,
  df = flights,
  name = "flights",
  overwrite = TRUE
)
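If we want to confirm that the data frames were registered in the Spark instance under the names we chose, we can list the tables known to the connection. The snippet below is a minimal sketch using dplyr's src_tbls(); we expect it to return "flights" and "weather":

# List the tables registered in the Spark instance
src_tbls(sc)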
4.3 First glance at the data
To make sure our datasets are available in the Spark instance, we can look at the first few rows of the two datasets we have copied to Spark. Notice how the printed output starts with Source: spark<?> [?? x 20] for flights and Source: spark<?> [?? x 16] for weather, telling us that the data indeed comes from a Spark instance and that the number of rows is unknown at this point. This is because Spark does only the minimal work needed to show us the first 6 rows, instead of going through the entire dataset. We will talk more about the lazy nature of Spark operations in later chapters.
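The outputs below can be reproduced by retrieving the first rows of the table references, for example with head(). The exact call used here is an assumption on our part; printing the references directly would work similarly, just with a different number of rows shown:

# Show the first 6 rows of each Spark table
head(tbl_flights)
head(tbl_weather)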
## # Source: spark<?> [?? x 20]
## id year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <int> <dbl> <int>
## 1 1 2013 1 1 517 515 2 830
## 2 2 2013 1 1 533 529 4 850
## 3 3 2013 1 1 542 540 2 923
## 4 4 2013 1 1 544 545 -1 1004
## 5 5 2013 1 1 554 600 -6 812
## 6 6 2013 1 1 554 558 -4 740
## # … with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
## # Source: spark<?> [?? x 16]
## id origin year month day hour temp dewp humid wind_dir wind_speed
## <int> <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4
## 2 2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06
## 3 3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5
## 4 4 EWR 2013 1 1 4 39.9 28.0 62.2 250 12.7
## 5 5 EWR 2013 1 1 5 39.0 28.0 64.4 260 12.7
## 6 6 EWR 2013 1 1 6 37.9 28.0 67.2 240 11.5
## # … with 5 more variables: wind_gust <dbl>, precip <dbl>, pressure <dbl>,
## # visib <dbl>, time_hour <dttm>
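As a small preview of that lazy behavior, we can build a dplyr pipeline on tbl_flights and inspect the SQL that sparklyr translates it into. Nothing is computed until we explicitly request results, for example with collect(). This is only an illustrative sketch and not part of the session preparation:

# Build a lazy query; no computation happens yet
delayed_flights <- tbl_flights %>%
  filter(dep_delay > 60) %>%
  select(id, carrier, origin, dest, dep_delay)

# Inspect the SQL translation of the dplyr pipeline
dplyr::show_query(delayed_flights)

# Only now is the query executed in Spark and the result collected into R
delayed_flights %>% head() %>% collect()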