A typical flyover workflow has the following steps:
This vignette will introduce step 1: how flyover’s helper functions can be used to prepare multiple data sets for comparison.
This package generally relies on the conscientiousness of the user to
clean up data sets prior to plotting, but the helper functions are
provided for convenience as part of the flyover workflow.
If your goal is to compare distributions of variables across multiple
data sets, you must first collect those data sets into a single table
prior to plotting. flyover provides basic functions to accomplish this.
Let’s say we have two data sets we would like to compare, one from an old data process and one from a new one. They consist of numeric and categorical data, and they don’t share all the same columns.
old <- data.frame(num1 = rnorm(5),
                  num2 = rexp(5),
                  cat1 = letters[1:5],
                  cat2 = rep("z", 5))
new <- data.frame(num1 = rnorm(5),
                  num2 = rexp(5),
                  cat1 = letters[1:5])
The first step is to create a named list of the data. The names will be used as the levels of a grouping variable to identify observations in later plots, so you want them to be meaningful. By default the names are taken to be the object names you pass, but you can supply a character vector of names to make them more descriptive.
data_list <- enlist_data(old, new, names = c("my old data", "my new data"))
str(data_list)
## List of 2
## $ my old data: tibble [5 × 4] (S3: tbl_df/tbl/data.frame)
## ..$ num1: num [1:5] 0.586 0.709 -0.109 -0.453 0.606
## ..$ num2: num [1:5] 2.878 1.605 0.471 6.402 1.258
## ..$ cat1: chr [1:5] "a" "b" "c" "d" ...
## ..$ cat2: chr [1:5] "z" "z" "z" "z" ...
## $ my new data: tibble [5 × 3] (S3: tbl_df/tbl/data.frame)
## ..$ num1: num [1:5] -0.0942 -0.2469 1.6612 -0.4489 0.546
## ..$ num2: num [1:5] 1.2525 0.3971 0.0881 1.7222 5.3849
## ..$ cat1: chr [1:5] "a" "b" "c" "d" ...
All that is left is to stack the data in the list together to create
one table. In the process, a grouping variable is created, whose name
defaults to flyover_id_.
data_stack <- stack_data(data_list)
data_stack
## # A tibble: 10 × 5
## flyover_id_ num1 num2 cat1 cat2
## <chr> <dbl> <dbl> <chr> <chr>
## 1 my old data 0.586 2.88 a z
## 2 my old data 0.709 1.61 b z
## 3 my old data -0.109 0.471 c z
## 4 my old data -0.453 6.40 d z
## 5 my old data 0.606 1.26 e z
## 6 my new data -0.0942 1.25 a NA
## 7 my new data -0.247 0.397 b NA
## 8 my new data 1.66 0.0881 c NA
## 9 my new data -0.449 1.72 d NA
## 10 my new data 0.546 5.38 e NA
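To make the stacking step concrete, here is a minimal base-R sketch of the same idea: pad each data set with the columns it lacks, tag every row with the name of its list element, and bind the rows. This is only an illustration under assumptions. It is not how stack_data() is implemented, and the small `old` and `new` data frames below are made-up stand-ins for the ones above.

```r
# Illustrative base-R sketch; NOT flyover's actual implementation.
# Small made-up stand-ins for the old/new data sets.
old <- data.frame(num1 = c(0.5, -0.1), cat1 = c("a", "b"), cat2 = c("z", "z"))
new <- data.frame(num1 = c(1.2, 0.3), cat1 = c("a", "b"))
data_list <- list("my old data" = old, "my new data" = new)

# Union of all column names, in order of first appearance
all_cols <- unique(unlist(lapply(data_list, names), use.names = FALSE))

# Pad each data set with NA for the columns it lacks,
# then prepend a grouping column holding the list name
padded <- Map(function(df, nm) {
  for (col in setdiff(all_cols, names(df))) {
    df[[col]] <- NA
  }
  data.frame(flyover_id_ = nm, df[all_cols], stringsAsFactors = FALSE)
}, data_list, names(data_list))

# Bind the padded data sets row-wise into one table
stacked <- do.call(rbind, unname(padded))
rownames(stacked) <- NULL
stacked
```

The grouping column name `flyover_id_` is copied from the default shown above; everything else in the sketch is a hypothetical choice made for illustration.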
Notice that the variable cat2 appears in the old data set but not the
new one, so the values for this variable are NA in the records
corresponding to the new data. Plotting this field may not be useful to
you, since the plot would give you no information about how the
distribution changed, and you may already be aware of the column’s
absence in newer data. You can remove columns that don’t appear in every
data set (and you can also change the grouping variable name):
data_stack_drop <- stack_data(data_list, drop_mismatches = TRUE, group_var = "grp")
data_stack_drop
## # A tibble: 10 × 4
## grp num1 num2 cat1
## <chr> <dbl> <dbl> <chr>
## 1 my old data 0.586 2.88 a
## 2 my old data 0.709 1.61 b
## 3 my old data -0.109 0.471 c
## 4 my old data -0.453 6.40 d
## 5 my old data 0.606 1.26 e
## 6 my new data -0.0942 1.25 a
## 7 my new data -0.247 0.397 b
## 8 my new data 1.66 0.0881 c
## 9 my new data -0.449 1.72 d
## 10 my new data 0.546 5.38 e
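Before dropping mismatched columns, it can help to see which columns differ and which data sets lack them. The following base-R sketch shows one way to check this on a named list of data frames. It is an assumption-laden illustration rather than part of flyover: the variable names (`shared_cols`, `mismatched_cols`, `lacking`) and the small example data are hypothetical.

```r
# Illustrative base-R sketch; these helper names are hypothetical, not flyover API.
data_list <- list(
  "my old data" = data.frame(num1 = 1:2, cat1 = c("a", "b"), cat2 = c("z", "z")),
  "my new data" = data.frame(num1 = 3:4, cat1 = c("a", "b"))
)

col_names <- lapply(data_list, names)

# Columns present in every data set
shared_cols <- Reduce(intersect, col_names)

# Columns present in at least one data set but not in all of them
all_cols <- unique(unlist(col_names, use.names = FALSE))
mismatched_cols <- setdiff(all_cols, shared_cols)

# For each mismatched column, list the data sets that lack it
lacking <- lapply(
  setNames(mismatched_cols, mismatched_cols),
  function(col) {
    has_col <- vapply(col_names, function(nms) col %in% nms, logical(1))
    names(data_list)[!has_col]
  }
)

shared_cols
lacking
```

Columns in `mismatched_cols` are the ones you would expect to lose when asking for columns that don’t appear in every data set to be removed.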
Notice that these functions – indeed, all the functions in this package – are data-first, meaning they are pipe-friendly. Thus you could in theory write your code like so:
library(magrittr)  # provides %>%; assumed here in case the pipe is not already attached
data_stack <-
  old %>%
  enlist_data(new, names = c("old data", "new data")) %>%
  stack_data()