Chapter 4 Connecting and using a local Spark instance
The following code chunks will prepare the R session so that we can experiment with the code presented in the book. We will attach the needed R packages, initialize and connect to a local Spark instance, and copy the weather and flights datasets from the nycflights13 package to the Spark instance so that we can work with them in our examples.
4.1 Packages and data
We will use the sparklyr package to interface with Spark. Since sparklyr works very well with the dplyr package and its vocabulary, we will take advantage of the dplyr syntax and the pipe operator. As a source of data for our code examples, we will use the nycflights13 package, which conveniently provides airline data for flights departing New York City in 2013.
# Attach packages
suppressPackageStartupMessages({
  library(sparklyr)
  library(dplyr)
  library(nycflights13)
})
# Add an id column to the datasets
weather <- nycflights13::weather %>%
  mutate(id = 1L:nrow(nycflights13::weather)) %>%
  select(id, everything())

flights <- nycflights13::flights %>%
  mutate(id = 1L:nrow(nycflights13::flights)) %>%
  select(id, everything())
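To quickly check that the id columns were added as intended, we can peek at the first rows of the local data frames before sending them to Spark. This is just an optional sanity check, not part of the original preparation:

# Optional sanity check on the local data frames
weather %>% select(id, origin, time_hour) %>% head()
flights %>% select(id, carrier, flight, time_hour) %>% head()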
4.2 Connecting to Spark and providing it with data
As a second step, we will now connect to a Spark instance running on our local machine and send the prepared data to the instance using the copy_to() function so that we can work with the data in Spark. We assign the outputs of copy_to() to objects called tbl_weather and tbl_flights, which are references to the DataFrame objects within Spark.
# Connect to a local Spark instance
sc <- sparklyr::spark_connect(master = "local")
# Copy the weather dataset to the instance
tbl_weather <- dplyr::copy_to(
  dest = sc,
  df = weather,
  name = "weather",
  overwrite = TRUE
)
# Copy the flights dataset to the instance
tbl_flights <- dplyr::copy_to(
  dest = sc,
  df = flights,
  name = "flights",
  overwrite = TRUE
)
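If we want to confirm that the data frames were registered in the Spark instance under the names we chose, we can list the tables known to the connection. The snippet below is a minimal sketch using dplyr's src_tbls(); we expect it to return "flights" and "weather":

# List the tables registered in the Spark instance
src_tbls(sc)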
4.3 First glance at the data
To make sure our datasets are available in the Spark instance, we can look at the first few rows of the two datasets we have copied to Spark. Notice how the printed output starts with Source: spark<?> [?? x 20] for flights and Source: spark<?> [?? x 16] for weather, telling us that the data indeed comes from a Spark instance and that the number of rows is unknown at this point. This is because Spark does only the minimal work needed to show us the first 6 rows, instead of going through the entire dataset. We will talk more about the lazy nature of Spark operations in later chapters.
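The outputs below can be reproduced by retrieving the first rows of the table references, for example with head(). The exact call used here is an assumption on our part; printing the references directly would work similarly, just with a different number of rows shown:

# Show the first 6 rows of each Spark table
head(tbl_flights)
head(tbl_weather)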
## # Source: spark<?> [?? x 20]
## id year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <int> <dbl> <int>
## 1 1 2013 1 1 517 515 2 830
## 2 2 2013 1 1 533 529 4 850
## 3 3 2013 1 1 542 540 2 923
## 4 4 2013 1 1 544 545 -1 1004
## 5 5 2013 1 1 554 600 -6 812
## 6 6 2013 1 1 554 558 -4 740
## # … with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
## # Source: spark<?> [?? x 16]
## id origin year month day hour temp dewp humid wind_dir wind_speed
## <int> <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4
## 2 2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06
## 3 3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5
## 4 4 EWR 2013 1 1 4 39.9 28.0 62.2 250 12.7
## 5 5 EWR 2013 1 1 5 39.0 28.0 64.4 260 12.7
## 6 6 EWR 2013 1 1 6 37.9 28.0 67.2 240 11.5
## # … with 5 more variables: wind_gust <dbl>, precip <dbl>, pressure <dbl>,
## # visib <dbl>, time_hour <dttm>
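As a small preview of that lazy behavior, we can build a dplyr pipeline on tbl_flights and inspect the SQL that sparklyr translates it into. Nothing is computed until we explicitly request results, for example with collect(). This is only an illustrative sketch and not part of the session preparation:

# Build a lazy query; no computation happens yet
delayed_flights <- tbl_flights %>%
  filter(dep_delay > 60) %>%
  select(id, carrier, origin, dest, dep_delay)

# Inspect the SQL translation of the dplyr pipeline
dplyr::show_query(delayed_flights)

# Only now is the query executed in Spark and the result collected into R
delayed_flights %>% head() %>% collect()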