What this article covers

A typical flyover workflow has the following steps:

  1. Combine different data sets into a single table.
  2. Apply a plotting function to the columns of the table.
  3. Build a display to navigate the plots.

This vignette will introduce step 1: how flyover’s helper functions can be used to prepare multiple data sets for comparison.

flyover generally expects you to clean your data sets before plotting. The helper functions described here are provided as a convenience within the flyover workflow, not as a substitute for that cleaning.

Starting from separate data sets

If your goal is to compare distributions of variables across multiple data sets, you must first collect those data sets into a single table. flyover provides basic functions to accomplish this.

Let’s say we have two data sets we would like to compare, one from an old data process and one from a new one. They consist of numeric and categorical data, and they don’t share all the same columns.

old <- data.frame(num1 = rnorm(5),
                  num2 = rexp(5),
                  cat1 = letters[1:5],
                  cat2 = rep("z", 5))

new <- data.frame(num1 = rnorm(5),
                  num2 = rexp(5),
                  cat1 = letters[1:5])

The first step is to create a named list of the data. The names will be used as the levels of a grouping variable to identify observations in later plots, so you want them to be meaningful. By default the names are taken to be the object names you pass, but you can supply a character vector of names to make them more descriptive.

library(flyover)

data_list <- enlist_data(old, new, names = c("my old data", "my new data"))

str(data_list)
## List of 2
##  $ my old data: tibble [5 × 4] (S3: tbl_df/tbl/data.frame)
##   ..$ num1: num [1:5] 0.586 0.709 -0.109 -0.453 0.606
##   ..$ num2: num [1:5] 2.878 1.605 0.471 6.402 1.258
##   ..$ cat1: chr [1:5] "a" "b" "c" "d" ...
##   ..$ cat2: chr [1:5] "z" "z" "z" "z" ...
##  $ my new data: tibble [5 × 3] (S3: tbl_df/tbl/data.frame)
##   ..$ num1: num [1:5] -0.0942 -0.2469 1.6612 -0.4489 0.546
##   ..$ num2: num [1:5] 1.2525 0.3971 0.0881 1.7222 5.3849
##   ..$ cat1: chr [1:5] "a" "b" "c" "d" ...
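
To make the structure concrete, the sketch below builds a comparable named list with base R alone. It assumes that enlist_data does little more than collect its inputs and attach names; the tibble conversion visible in the output above is skipped, and manual_list is a name chosen here for illustration.

```r
# Two small data sets shaped like the ones above
old <- data.frame(num1 = rnorm(5),
                  num2 = rexp(5),
                  cat1 = letters[1:5],
                  cat2 = rep("z", 5))

new <- data.frame(num1 = rnorm(5),
                  num2 = rexp(5),
                  cat1 = letters[1:5])

# Base R stand-in for the named list (assumption: not flyover's actual code)
manual_list <- list(old, new)
names(manual_list) <- c("my old data", "my new data")

# The names identify each data set; the column counts differ
names(manual_list)
vapply(manual_list, ncol, integer(1))
```

Whatever names you choose here are the labels that will distinguish the data sets in later plots.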

Creating a single table from a named list

All that is left is to stack the data in the list together to create one table. In the process, a grouping variable is created, whose name defaults to flyover_id_.

data_stack <- stack_data(data_list)

data_stack
## # A tibble: 10 × 5
##    flyover_id_    num1   num2 cat1  cat2 
##    <chr>         <dbl>  <dbl> <chr> <chr>
##  1 my old data  0.586  2.88   a     z    
##  2 my old data  0.709  1.61   b     z    
##  3 my old data -0.109  0.471  c     z    
##  4 my old data -0.453  6.40   d     z    
##  5 my old data  0.606  1.26   e     z    
##  6 my new data -0.0942 1.25   a     NA   
##  7 my new data -0.247  0.397  b     NA   
##  8 my new data  1.66   0.0881 c     NA   
##  9 my new data -0.449  1.72   d     NA   
## 10 my new data  0.546  5.38   e     NA
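
The stacking step can be approximated in base R, which may help clarify where the NA values come from. This is a sketch under the assumption that stack_data pads missing columns and prepends an identifier column; stack_sketch and its arguments are hypothetical names, not part of flyover.

```r
# Hypothetical helper: stack a named list of data frames into one table
stack_sketch <- function(data_list, group_var = "flyover_id_") {
  # Union of column names, in order of first appearance
  all_cols <- unique(unlist(lapply(data_list, names)))

  padded <- lapply(names(data_list), function(id) {
    df <- data_list[[id]]
    # Add any column this data set lacks, filled with NA
    for (col in setdiff(all_cols, names(df))) df[[col]] <- NA
    # Prepend the identifier, recycled to the number of rows
    out <- data.frame(id, df[all_cols], stringsAsFactors = FALSE)
    names(out)[1] <- group_var
    out
  })

  do.call(rbind, padded)
}

old <- data.frame(num1 = rnorm(5), cat1 = letters[1:5], cat2 = rep("z", 5))
new <- data.frame(num1 = rnorm(5), cat1 = letters[1:5])

stacked <- stack_sketch(list("my old data" = old, "my new data" = new))

dim(stacked)
# Cross-tabulate data set against whether cat2 is missing
table(stacked$flyover_id_, is.na(stacked$cat2))
```

If dplyr is available, dplyr::bind_rows() with its .id argument performs a similar row-wise stack.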

Notice that the variable cat2 appears in the old data set but not the new one, so its values are NA in the records corresponding to the new data. Plotting this field may not be useful, since the plot would tell you nothing about how the distribution changed, and you may already be aware of the column’s absence from the newer data. You can remove columns that don’t appear in every data set (and you can also change the grouping variable name):

data_stack_drop <- stack_data(data_list, drop_mismatches = TRUE, group_var = "grp")

data_stack_drop
## # A tibble: 10 × 4
##    grp            num1   num2 cat1 
##    <chr>         <dbl>  <dbl> <chr>
##  1 my old data  0.586  2.88   a    
##  2 my old data  0.709  1.61   b    
##  3 my old data -0.109  0.471  c    
##  4 my old data -0.453  6.40   d    
##  5 my old data  0.606  1.26   e    
##  6 my new data -0.0942 1.25   a    
##  7 my new data -0.247  0.397  b    
##  8 my new data  1.66   0.0881 c    
##  9 my new data -0.449  1.72   d    
## 10 my new data  0.546  5.38   e
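
One way to picture the drop_mismatches behaviour is as an intersection of column names across the list. The sketch below is written under that assumption and uses only base R; shared_cols and trimmed are illustrative names.

```r
old <- data.frame(num1 = rnorm(5),
                  num2 = rexp(5),
                  cat1 = letters[1:5],
                  cat2 = rep("z", 5))

new <- data.frame(num1 = rnorm(5),
                  num2 = rexp(5),
                  cat1 = letters[1:5])

data_list <- list("my old data" = old, "my new data" = new)

# Columns present in every data set
shared_cols <- Reduce(intersect, lapply(data_list, names))

# Keep only the shared columns in each data set
trimmed <- lapply(data_list, function(df) df[shared_cols])

shared_cols
vapply(trimmed, ncol, integer(1))
```

Because Reduce applies intersect pairwise, the same approach extends to lists of more than two data sets.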

Using a pipe

Notice that these functions (indeed, all the functions in this package) are data-first, which makes them pipe-friendly. You could therefore write your code like so:

library(magrittr)  # provides the %>% pipe

data_stack <-
  old %>%
  enlist_data(new, names = c("old data", "new data")) %>%
  stack_data()
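
As a reminder of what "data-first" buys you, the sketch below shows that a piped call and the equivalent nested call give the same result. It uses R's native |> pipe (R 4.1 or later) and two hypothetical stand-in functions, make_list and count_rows, which are not part of flyover.

```r
# Hypothetical stand-ins for illustration only (not flyover functions)
make_list <- function(x, y, names) stats::setNames(list(x, y), names)
count_rows <- function(data_list) vapply(data_list, nrow, integer(1))

old <- data.frame(a = 1:3)
new <- data.frame(a = 1:2)

# Nested call: read from the inside out
nested <- count_rows(make_list(old, new, names = c("old data", "new data")))

# Piped call: the left-hand side becomes the first argument
piped <- old |>
  make_list(new, names = c("old data", "new data")) |>
  count_rows()

identical(nested, piped)
```

The piped form reads in the order the steps happen, which is the main reason to prefer it for longer chains.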