7  Missing, missing data: explicit and implicit missings

This book contains practical guides on exploring missing data, as well as some of the deeper details of how naniar works to help you better explore your missing data. A large component of this book is the set of exercises that accompany each section in each chapter.

library(naniar)
library(tidyr)

So far, we have learned how to apply tools and strategies to explore, search for, and replace missing values. We know how to search for and replace recorded missing values masquerading as real values, including sneaky strings like “N/A”, “missing”, and “no record”. But what if an entire row is omitted? Then there is no record of those missing values in our data, but they are still missing. Sounds strange, perhaps? So far, we have only explored missing values that exist in our data. Values that exist in our data as a recorded missing value are called explicit missing values. Values that are absent because their row was never recorded are, in effect, missing missing values, more often called implicit missings.

More briefly: missing values in a dataset can either be explicit, meaning they are missing but recorded, or implicit, meaning that their presence is only implied based on other information (e.g. existing factor levels) in the data.

7.0.1 Explore implicit missings

Imagine we have tetris scores for three friends: Robin, Sam, and Blair. Their scores are recorded in the morning, afternoon, and evening, as shown below:

set.seed(2020-07-08) # evaluated as arithmetic, so the seed is 2020 - 7 - 8 = 2005
tetris <- data.frame(
  name = c(rep("robin", 3),
           rep("sam", 2),
           rep("blair",3)),
  time = c("morning", "afternoon", "evening",
           "morning", "afternoon",
           "morning", "afternoon", "evening"),
  value = c(floor(runif(3, 0L, 1000L)),
            floor(runif(2, 20L, 1000L)),
            floor(runif(3, 850L, 1000L)))
)
knitr::kable(tetris)
name   time       value
robin  morning      832
robin  afternoon     86
robin  evening      897
sam    morning       93
sam    afternoon    688
blair  morning      952
blair  afternoon    954
blair  evening      955

Do you notice something different about one of the friends’ records? Sam’s score is recorded for morning and afternoon, but their evening score is missing entirely. Sam’s evening score is not recorded as missing; the evening record is not even there! This becomes clearer if we pivot the data to wide format, so that we have one column each for morning, afternoon, and evening.

tetris %>%
  pivot_wider(id_cols = name,
              names_from = time,
              values_from = value) %>% 
  knitr::kable()
name   morning  afternoon  evening
robin      832         86      897
sam         93        688       NA
blair      952        954      955

Notice how there is now an NA indicated for Sam’s evening score? The missing value we see here did not show up before - in long format, it was actually a missing missing value!

In this example, Sam’s evening score is an implicit missing value.
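An implicit missing can also be spotted without reshaping, by cross-tabulating the identifying variables and looking for combinations that never occur. The sketch below is one illustrative way to do that with base R's table(), not a naniar function. It assumes the tetris scores are hard-coded from the table above instead of being drawn with runif, so that it runs on its own.

```r
# Rebuild the tetris data with the scores copied from the table above
tetris <- data.frame(
  name = c(rep("robin", 3), rep("sam", 2), rep("blair", 3)),
  time = c("morning", "afternoon", "evening",
           "morning", "afternoon",
           "morning", "afternoon", "evening"),
  value = c(832, 86, 897, 93, 688, 952, 954, 955)
)

# Count how many rows exist for every name-by-time combination
combo_counts <- table(tetris$name, tetris$time)
combo_counts

# Combinations with a count of zero have no row at all: implicit missings
which(combo_counts == 0, arr.ind = TRUE)
```

The only zero count sits at the intersection of sam and evening, which is the record that is missing from the long-format data.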

7.0.2 Making implicit missings explicit

It can sometimes be useful to make implicit missing values explicit (even in long format), which we can do using the complete function from tidyr. With the tetris data, that looks like this:

tetris %>%
  tidyr::complete(name, time)
# A tibble: 9 × 3
  name  time      value
  <chr> <chr>     <dbl>
1 blair afternoon   954
2 blair evening     955
3 blair morning     952
4 robin afternoon    86
5 robin evening     897
6 robin morning     832
7 sam   afternoon   688
8 sam   evening      NA
9 sam   morning      93

We see that an observation has now been created for Sam’s evening score, with value recorded as NA.

What is the complete function actually doing? Based on the specified variables name and time, the function has generated every combination of the values that appear in those two variables (i.e., because Blair and Robin have an evening score, we expect that Sam should too), and a new observation is created to make Sam’s implicit missing evening score an explicit one that appears in the data.
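To make that behaviour concrete, here is a rough two-step sketch of what complete does for this example: build the grid of combinations with tidyr::expand, then join the recorded scores onto it. It assumes dplyr is installed (it is not loaded elsewhere in this chapter), and the tetris scores are hard-coded from the earlier table so the block runs on its own. It is an approximation for illustration, not the internal implementation.

```r
library(tidyr)
library(dplyr)  # assumption: dplyr is available; used here only for full_join()

# Rebuild the tetris data with the scores copied from the earlier table
tetris <- data.frame(
  name = c(rep("robin", 3), rep("sam", 2), rep("blair", 3)),
  time = c("morning", "afternoon", "evening",
           "morning", "afternoon",
           "morning", "afternoon", "evening"),
  value = c(832, 86, 897, 93, 688, 952, 954, 955)
)

# Step 1: every combination of the observed names and times (3 x 3 = 9 rows)
all_combos <- tidyr::expand(tetris, name, time)

# Step 2: join the recorded scores onto that grid; a combination with no
# recorded score gets NA in the value column
completed_by_hand <- full_join(all_combos, tetris, by = c("name", "time"))

completed_by_hand
```

The result has nine rows, and the row for sam in the evening is the only one with a missing value.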

The tools we have learned so far to count, summarize, and visualize NA values would not detect the implicit missing for Sam’s evening score. Once it is converted to an explicit missing using complete, it is detected, because the new row has been populated with NA.
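A small check of this point, using naniar's n_miss to count missing values before and after calling complete. The tetris scores are hard-coded from the earlier table (an assumption made so the block runs on its own), and tetris_complete is a name introduced here for illustration.

```r
library(naniar)
library(tidyr)

# Rebuild the tetris data with the scores copied from the earlier table
tetris <- data.frame(
  name = c(rep("robin", 3), rep("sam", 2), rep("blair", 3)),
  time = c("morning", "afternoon", "evening",
           "morning", "afternoon",
           "morning", "afternoon", "evening"),
  value = c(832, 86, 897, 93, 688, 952, 954, 955)
)

# In long format the missing evening score has no row, so nothing is counted
n_miss(tetris)

# After complete(), the new row carries an NA that the counting tools can see
tetris_complete <- complete(tetris, name, time)
n_miss(tetris_complete)
```

The first call reports 0 missing values and the second reports 1.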

7.0.3 Handling explicitly missing values

Sometimes missing data is entered to help make a dataset more readable. For example, imagine if we had the following structure for our tetris data:

tetris_empty <- tibble::tibble(
  name = c("robin", NA, NA,
           "sam", NA, NA,
           "blair",NA, NA),
  time = c("morning", "afternoon", "evening",
           "morning", "afternoon", "evening",
           "morning", "afternoon", "evening"),
  value = c(floor(runif(3, 0L, 1000L)),
            floor(runif(3, 20L, 1000L)),
            floor(runif(3, 850L, 1000L)))
)
knitr::kable(tetris_empty)
name   time       value
robin  morning      340
NA     afternoon    376
NA     evening      527
sam    morning      594
NA     afternoon    399
NA     evening       26
blair  morning      939
NA     afternoon    974
NA     evening      974

Sometimes this kind of format is used to make something more pleasant to read in a spreadsheet. Now, we happen to know something about the data structure here - that there are three records per person, at morning, afternoon, and evening. What we want to do is fill these missing values by populating each NA with the player’s name that comes before it. The fill function from tidyr does just that: each NA in a variable is populated with the most recent non-NA value before (i.e., above) it.

tetris_empty %>%
  tidyr::fill(name) %>% 
  knitr::kable()
name   time       value
robin  morning      340
robin  afternoon    376
robin  evening      527
sam    morning      594
sam    afternoon    399
sam    evening       26
blair  morning      939
blair  afternoon    974
blair  evening      974

This method of filling in missing values is referred to as “last observation carried forward”, sometimes abbreviated as “LOCF”.

Beware: this requires that your data are carefully organized before using fill! The fill function does NOT predict what the entry should be based on other variable values or factor levels; it simply populates each missing value with the most recent non-missing value for that variable. Be very careful with this method of populating missings, and understand that it is only useful in specific cases, such as the spreadsheet-style layout above. It is not a generally recommended way to replace missing values.
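To illustrate the warning, the sketch below uses a hypothetical variant of the data, here called tetris_gap, in which the name on Sam’s first row was never entered. The scores are hard-coded from the table above and the object names are made up for this example.

```r
library(tidyr)

# Hypothetical variant of tetris_empty: the name on Sam's first row was never
# entered, so there is no "sam" label for fill() to carry down
tetris_gap <- tibble::tibble(
  name = c("robin", NA, NA,
           NA, NA, NA,
           "blair", NA, NA),
  time = rep(c("morning", "afternoon", "evening"), times = 3),
  value = c(340, 376, 527, 594, 399, 26, 939, 974, 974)
)

# fill() runs without complaint, but rows 4 to 6 are now labelled "robin"
tetris_gap_filled <- fill(tetris_gap, name)
tetris_gap_filled
```

No error or warning is raised, and Sam’s three scores are silently attributed to Robin. This is why the structure of the data needs to be checked before fill is applied.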