Monitoring data drift

What this article covers

This article shows how the flyover package can help you monitor data drift over time. When you pull data at regular intervals, you may want to know how that data changes from one pull to the next. A large enough shift in a distribution may mean you need to retrain models or take other action. Missing values can also wreak havoc if upstream data processes change without warning.

Data for this example

We will examine a subset of the storm data available in the dplyr package (dplyr::storms). It is used here for illustration only.

library(flyover)
library(dplyr)

my_data <-
  dplyr::storms %>%
  rename(ts_diameter = tropicalstorm_force_diameter,
         hu_diameter = hurricane_force_diameter) %>%
  filter(year > 2006) %>%
  select(year, ts_diameter, hu_diameter, lat, long, status, category, wind, pressure)

str(my_data)

## tibble [4,309 × 9] (S3: tbl_df/tbl/data.frame)
##  $ year       : num [1:4309] 2007 2007 2007 2007 2007 ...
##  $ ts_diameter: int [1:4309] 0 110 110 120 120 0 0 0 60 240 ...
##  $ hu_diameter: int [1:4309] 0 0 0 0 0 0 0 0 0 0 ...
##  $ lat        : num [1:4309] 22.3 23.6 24.3 25.1 27 27.5 29.7 35.5 37.1 39.1 ...
##  $ long       : num [1:4309] -85.8 -85.7 -85.2 -84.6 -83.2 -82.7 -82.1 -66.5 -65.5 -64.2 ...
##  $ status     : chr [1:4309] "tropical depression" "tropical storm" "tropical storm" "tropical storm" ...
##  $ category   : Ord.factor w/ 7 levels "-1"<"0"<"1"<"2"<..: 1 2 2 2 2 1 1 1 2 2 ...
##  $ wind       : int [1:4309] 30 40 50 45 40 30 30 30 35 45 ...
##  $ pressure   : int [1:4309] 1004 1000 997 998 1000 1000 1001 1007 1004 999 ...

When you assemble data to monitor drift, I recommend using a group column formatted according to ISO 8601, e.g. YYYY-MM-DD, and stacking the data sets in date order. ISO 8601 dates sort chronologically even when stored as text, which helps keep your output easy to read.
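As a minimal sketch of that layout, assume each pull arrives as its own data frame. The objects pull_jan and pull_feb and the column name pull_date are hypothetical and are not part of flyover. The pulls can be stacked with dplyr::bind_rows, which stores each list name in the column given by .id.

```r
library(dplyr)

# Hypothetical data pulls; in practice these would come from your data source
pull_jan <- data.frame(wind = c(30, 45, 50), pressure = c(1004, 998, 997))
pull_feb <- data.frame(wind = c(35, 40, NA), pressure = c(1001, 1000, 999))

# Name each pull by its ISO 8601 date, then stack them;
# .id stores the list names in a new pull_date column
stacked <- bind_rows(
  list("2024-01-01" = pull_jan, "2024-02-01" = pull_feb),
  .id = "pull_date"
) %>%
  arrange(pull_date)  # text sort is also chronological for ISO 8601 dates

stacked
```

Following the pattern used in the rest of this article, the stacked data could then be passed to build_plots with group_var = "pull_date".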

Distribution drift

Numeric distributions

I recommend using flyover_binline_ridges or flyover_density_ridges to monitor drift of numeric variables over time.

ridges <- build_plots(my_data, flyover_density_ridges, group_var = "year")
build_display(ridges,
              display_name = "ridges",
              output_dir   = "display-storm-ridges")
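If you also want numbers to go with the plots, a per-group quantile table is one option. The sketch below uses plain dplyr on a small toy data frame (the toy object and its values are made up for illustration). It is not a flyover feature, only a companion summary.

```r
library(dplyr)

# Toy stand-in for the storm data used in this article
toy <- data.frame(
  year = c(2007, 2007, 2007, 2008, 2008, 2008),
  wind = c(30, 40, 50, 60, 70, 80)
)

# One row per group with the quartiles of the variable
wind_summary <- toy %>%
  group_by(year) %>%
  summarise(
    q25    = quantile(wind, 0.25, na.rm = TRUE, names = FALSE),
    median = median(wind, na.rm = TRUE),
    q75    = quantile(wind, 0.75, na.rm = TRUE, names = FALSE),
    .groups = "drop"
  )

wind_summary
```

A steady movement of the median or a widening gap between q25 and q75 from one group to the next points to the same kind of shift the ridge plots show visually.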

Categorical distributions

I recommend using flyover_bar_fill to monitor drift of categorical variables over time. This will work best if the number of categories is small.

bar_fill <- build_plots(my_data, flyover_bar_fill, group_var = "year")
build_display(bar_fill,
              display_name = "bar fill",
              output_dir   = "display-storm-bar-fill")
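When a variable has many categories, one way to keep the count small is to pool the rare levels before plotting. The sketch below assumes the forcats package, which this article does not otherwise use, and a made-up toy data frame. The cutoff of two levels is arbitrary.

```r
library(dplyr)
library(forcats)

# Toy data with two common categories and several rare ones
toy <- data.frame(
  year   = rep(c(2007, 2008), each = 6),
  status = c("hurricane", "hurricane", "hurricane",
             "tropical storm", "tropical storm", "other low",
             "hurricane", "hurricane", "tropical storm",
             "tropical storm", "extratropical", "tropical wave")
)

# Keep the 2 most frequent levels and pool the rest into "Other"
toy_lumped <- toy %>%
  mutate(status = fct_lump_n(status, n = 2, other_level = "Other"))

count(toy_lumped, status)
```

Lumping over the stacked data, rather than separately for each group, keeps the category set identical across groups so the bars stay comparable.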

Data quality

You can also monitor missing values, either as raw counts or as a percent of total observations for the group.

Missing value counts

na_count <- build_plots(my_data, flyover_na_count, group_var = "year")
build_display(na_count,
              display_name = "NA count",
              output_dir   = "display-storm-na-count")

Missing value percentages

na_percent <- build_plots(my_data, flyover_na_percent, group_var = "year")
build_display(na_percent,
              display_name = "NA percent",
              output_dir   = "display-storm-na-percent")
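To cross-check the displays numerically, the same two quantities can be computed with plain dplyr. This is a sketch on a made-up toy data frame. Whether flyover computes its counts and percentages in exactly this way is an assumption.

```r
library(dplyr)

# Toy stand-in for the storm data, with some missing values
toy <- data.frame(
  year        = c(2007, 2007, 2008, 2008, 2008, 2008),
  ts_diameter = c(0, NA, 110, NA, NA, 120),
  wind        = c(30, 40, 50, NA, 40, 30)
)

# For every non-grouping column: number and percent of NA values per group
na_summary <- toy %>%
  group_by(year) %>%
  summarise(
    across(everything(), list(
      na_count   = ~ sum(is.na(.x)),
      na_percent = ~ 100 * mean(is.na(.x))
    )),
    .groups = "drop"
  )

na_summary
```

The result has one row per group and two columns per variable, named like ts_diameter_na_count and ts_diameter_na_percent.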

2024-01-29