This article illustrates how the flyover package can be used to monitor
data drift over time. If you pull data at regular intervals, you may
want to know how that data changes from one pull to the next. A large
enough shift in distributions may mean you need to retrain models or
take other actions. Missing values may also wreak havoc if upstream
data processes change without warning.
We will examine a subset of the storms data that is available from the
dplyr package. The example is illustrative only.
library(flyover)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
my_data <-
  dplyr::storms %>%
  rename(ts_diameter = tropicalstorm_force_diameter,
         hu_diameter = hurricane_force_diameter) %>%
  filter(year > 2006) %>%
  select(year, ts_diameter, hu_diameter, lat, long, status, category, wind, pressure)
str(my_data)
## tibble [4,309 × 9] (S3: tbl_df/tbl/data.frame)
## $ year : num [1:4309] 2007 2007 2007 2007 2007 ...
## $ ts_diameter: int [1:4309] 0 110 110 120 120 0 0 0 60 240 ...
## $ hu_diameter: int [1:4309] 0 0 0 0 0 0 0 0 0 0 ...
## $ lat : num [1:4309] 22.3 23.6 24.3 25.1 27 27.5 29.7 35.5 37.1 39.1 ...
## $ long : num [1:4309] -85.8 -85.7 -85.2 -84.6 -83.2 -82.7 -82.1 -66.5 -65.5 -64.2 ...
## $ status : chr [1:4309] "tropical depression" "tropical storm" "tropical storm" "tropical storm" ...
## $ category : Ord.factor w/ 7 levels "-1"<"0"<"1"<"2"<..: 1 2 2 2 2 1 1 1 2 2 ...
## $ wind : int [1:4309] 30 40 50 45 40 30 30 30 35 45 ...
## $ pressure : int [1:4309] 1004 1000 997 998 1000 1000 1001 1007 1004 999 ...
When you build a data set to monitor drift, I recommend using a group
column formatted along the lines of ISO 8601, e.g. YYYY-MM-DD, and
stacking the data sets in date order. ISO 8601 dates sort
chronologically even as plain text, which keeps your output easy to
read.
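As a minimal sketch of that stacking step, assume each pull arrives as its own data frame. The pulls, their values, and the `pull_date` column name below are all hypothetical; base R is used so the example runs on its own.

```r
# Hypothetical monthly pulls of the same table; all values are made up.
pulls <- list(
  "2023-01-01" = data.frame(wind = c(30, 45, 50)),
  "2023-02-01" = data.frame(wind = c(35, 40, 65)),
  "2023-03-01" = data.frame(wind = c(40, 60, 80))
)

# Tag every pull with its ISO 8601 date in a group column.
# `pull_date` is an illustrative column name, not something flyover requires.
tagged <- Map(
  function(df, date) data.frame(pull_date = date, df, stringsAsFactors = FALSE),
  pulls,
  names(pulls)
)

# Stack the pulls in date order. Sorting the names is enough because
# ISO 8601 dates sort chronologically as plain text.
stacked <- do.call(rbind, unname(tagged[sort(names(tagged))]))

str(stacked)
```

A stacked data frame like this could then presumably be passed to build_plots with group_var = "pull_date", in the same way that "year" is used in this article.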
I recommend using flyover_binline_ridges or flyover_density_ridges to
monitor drift of numeric variables over time.
ridges <- build_plots(my_data, flyover_density_ridges, group_var = "year")
build_display(ridges,
              display_name = "ridges",
              output_dir = "display-storm-ridges")
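If you want numbers to sit alongside the ridge plots, a per-group quantile table is one simple companion. The sketch below is not part of flyover and does not reflect how its plots are computed; it uses base R on a small made-up data frame (`toy_wind` is a hypothetical name).

```r
# Made-up wind speeds for two years; values are illustrative only.
toy_wind <- data.frame(
  year = rep(c(2007, 2008), each = 5),
  wind = c(30, 40, 50, 45, 40, 35, 60, 75, 90, 55)
)

# Quartiles of wind within each year: one column per year, one row per quantile.
wind_quartiles <- sapply(
  split(toy_wind$wind, toy_wind$year),
  quantile,
  probs = c(0.25, 0.5, 0.75)
)

wind_quartiles
```

A jump in the median or in the spread between the quartiles from one group to the next is the same kind of shift that the ridge plots show visually.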
I recommend using flyover_bar_fill to monitor drift of categorical
variables over time. This works best when the number of categories is
small.
bar_fill <- build_plots(my_data, flyover_bar_fill, group_var = "year")
build_display(bar_fill,
              display_name = "bar fill",
              output_dir = "display-storm-bar-fill")
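The quantity a filled bar chart conveys is the share of each category within a group. As a rough sketch of that calculation, independent of flyover, the base R code below computes those shares on a made-up data frame (`toy_status` is a hypothetical name).

```r
# Made-up storm statuses for two years; values are illustrative only.
toy_status <- data.frame(
  year = rep(c(2007, 2008), each = 4),
  status = c("hurricane", "tropical storm", "tropical storm", "tropical depression",
             "hurricane", "hurricane", "tropical storm", "hurricane")
)

# Cross-tabulate year by status, then convert counts to within-year shares.
# margin = 1 normalises each row (year) so that it sums to 1.
status_shares <- prop.table(table(toy_status$year, toy_status$status), margin = 1)

round(status_shares, 2)
```

With only a few categories, a table like this stays readable for the same reason the plot does.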
You can also monitor missing values, either as raw counts or as a percent of total observations for the group.
na_count <- build_plots(my_data, flyover_na_count, group_var = "year")
build_display(na_count,
              display_name = "NA count",
              output_dir = "display-storm-na-count")
na_percent <- build_plots(my_data, flyover_na_percent, group_var = "year")
build_display(na_percent,
              display_name = "NA percent",
              output_dir = "display-storm-na-percent")
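To make the two measures concrete, here is a minimal base R sketch of a raw NA count and an NA percent per group. It is not flyover's implementation; the data frame and the names `toy_na`, `toy_na_count`, and `toy_na_percent` are hypothetical.

```r
# Made-up diameters with missing values; values are illustrative only.
toy_na <- data.frame(
  year = rep(c(2007, 2008), each = 4),
  ts_diameter = c(0, 110, NA, 120, NA, NA, 60, NA)
)

is_missing <- is.na(toy_na$ts_diameter)

# Raw count of missing values within each year.
toy_na_count <- tapply(is_missing, toy_na$year, sum)

# Missing values as a percent of that year's observations.
toy_na_percent <- 100 * tapply(is_missing, toy_na$year, mean)

toy_na_count
toy_na_percent
```

The count is useful when group sizes are similar. The percent is easier to compare when the number of observations differs a lot from one pull to the next.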