7 Missing, missing data: explicit and implicit missings

This book contains both practical guides on exploring missing data and some of the deeper details of how naniar works to help you better explore your missing data. A large component of this book is the exercises that accompany each section in each chapter.

library(naniar)
library(tidyr)

So far, we have learned how to apply tools and strategies to explore, search for, and replace missing values. We know how to find and replace recorded missing values masquerading as real values, including sneaky strings like “N/A”, “missing”, and “no record”. But what if an entire row is omitted? Then there is no record of those missing values in our data, but they are still missing. So far, we have only explored missing values that exist in our data. Sounds strange, perhaps? Values that exist in our data as a recorded missing value are called explicit missing values. Values that are absent altogether, such as those in an omitted row, are in fact missing missing values, more often called implicit missings.

More briefly: missing values in a dataset can either be explicit, meaning they are missing but recorded, or implicit, meaning that their presence is only implied based on other information (e.g. existing factor levels) in the data.

7.0.1 Explore implicit missings

Imagine we have Tetris scores for three friends: Robin, Sam, and Blair. Their scores are recorded in the morning, afternoon, and evening, as shown below:

# Note: 2020-07-08 is evaluated as arithmetic (2020 - 7 - 8), so the seed is 2005
set.seed(2020-07-08)
tetris <- data.frame(
  name = c(rep("robin", 3),
           rep("sam", 2),
           rep("blair",3)),
  time = c("morning", "afternoon", "evening",
           "morning", "afternoon",
           "morning", "afternoon", "evening"),
  value = c(floor(runif(3, 0L, 1000L)),
            floor(runif(2, 20L, 1000L)),
            floor(runif(3, 850L, 1000L)))
)

knitr::kable(tetris)

name	time	value
robin	morning	832
robin	afternoon	86
robin	evening	897
sam	morning	93
sam	afternoon	688
blair	morning	952
blair	afternoon	954
blair	evening	955

Do you notice something different about one of the friends’ records? Sam’s score is recorded for morning and afternoon, but their evening score is missing entirely. Sam’s evening score is not recorded as missing - the evening record is not even there! This becomes clearer if we spread out the data, so that we have one column each for morning, afternoon, and evening.

tetris %>%
  pivot_wider(id_cols = name,
              names_from = time,
              values_from = value) %>% 
  knitr::kable()

name	morning	afternoon	evening
robin	832	86	897
sam	93	688	NA
blair	952	954	955

Notice how there is now an NA indicated for Sam’s evening score? The missing value we see here did not show up before - in long format, it was actually a missing missing value!

In this example, Sam’s evening score is an implicit missing value.
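
An implicit missing can also be spotted without reshaping, by cross-tabulating the two identifying variables. The sketch below is one possible check, not a naniar feature: it rebuilds the tetris data with the scores from the table above typed in directly (an assumption, so the example does not depend on the random seed) and uses base R's table. A count of 0 marks a combination that has no row.

```r
# Scores typed in from the table above, so no random seed is needed
tetris <- data.frame(
  name = c(rep("robin", 3), rep("sam", 2), rep("blair", 3)),
  time = c("morning", "afternoon", "evening",
           "morning", "afternoon",
           "morning", "afternoon", "evening"),
  value = c(832, 86, 897, 93, 688, 952, 954, 955)
)

# Count the rows for every name-by-time combination
combo_counts <- table(tetris$name, tetris$time)
combo_counts

# Combinations with no row at all are the implicit missings
which(combo_counts == 0, arr.ind = TRUE)
```

This only works when we know which variables should form complete combinations, here name and time.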

7.0.2 Making implicit missings explicit

It can sometimes be useful to make implicit missing values explicit (even in long format), which we can do using the complete function from tidyr. With the tetris data, that looks like this:

tetris %>%
  tidyr::complete(name, time)

# A tibble: 9 × 3
  name  time      value
  <chr> <chr>     <dbl>
1 blair afternoon   954
2 blair evening     955
3 blair morning     952
4 robin afternoon    86
5 robin evening     897
6 robin morning     832
7 sam   afternoon   688
8 sam   evening      NA
9 sam   morning      93

We see that now an observation has been created for Sam’s evening score, with value recorded as NA.

What is the complete function actually doing? Based on the specified variables name and time, the function has identified every combination of the values those two variables take in the data (i.e., because Blair and Robin have an evening score, we expect that Sam should too), and a new observation is created to make Sam’s implicit missing evening score an explicit one that appears in the data.
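
To make this less of a black box, the same result can be built by hand. The tidyr documentation describes complete as a wrapper around expand, dplyr::full_join, and replace_na; the sketch below follows the first two steps. It assumes dplyr is installed, types the scores in from the table above rather than regenerating them, and uses the illustrative names all_combos and by_hand.

```r
library(tidyr)
library(dplyr)

# Scores typed in from the table above, so no random seed is needed
tetris <- data.frame(
  name = c(rep("robin", 3), rep("sam", 2), rep("blair", 3)),
  time = c("morning", "afternoon", "evening",
           "morning", "afternoon",
           "morning", "afternoon", "evening"),
  value = c(832, 86, 897, 93, 688, 952, 954, 955)
)

# Step 1: every combination of the observed names and times (3 x 3 rows)
all_combos <- tidyr::expand(tetris, name, time)

# Step 2: join the recorded scores back on; combinations that have
# no recorded score are left with NA in value
by_hand <- full_join(all_combos, tetris, by = c("name", "time"))
by_hand
```

Sam’s evening row appears in all_combos but has no match in tetris, so the join leaves its value as NA.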

The implicit missing for Sam’s evening score would not be detected using the tools to count, summarize, and visualize NA values that we have learned so far. Once converted to an explicit missing using complete, it is detected, because it has been populated with NA.
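
As a quick check of this, we can count the missing values before and after completing the data with n_miss from naniar. This is a small sketch with the scores typed in from the table above; tetris_complete is an illustrative name.

```r
library(naniar)
library(tidyr)

# Scores typed in from the table above, so no random seed is needed
tetris <- data.frame(
  name = c(rep("robin", 3), rep("sam", 2), rep("blair", 3)),
  time = c("morning", "afternoon", "evening",
           "morning", "afternoon",
           "morning", "afternoon", "evening"),
  value = c(832, 86, 897, 93, 688, 952, 954, 955)
)

# The implicit missing is not counted: there is no NA to find
n_before <- n_miss(tetris)
n_before

# After complete, Sam's evening score is an explicit NA and is counted
tetris_complete <- complete(tetris, name, time)
n_after <- n_miss(tetris_complete)
n_after
```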

7.0.3 Handling explicitly missing values

Sometimes missing data is entered to help make a dataset more readable. For example, imagine if we had the following structure for our tetris data:

tetris_empty <- tibble::tibble(
  name = c("robin", NA, NA,
           "sam", NA, NA,
           "blair",NA, NA),
  time = c("morning", "afternoon", "evening",
           "morning", "afternoon", "evening",
           "morning", "afternoon", "evening"),
  value = c(floor(runif(3, 0L, 1000L)),
            floor(runif(3, 20L, 1000L)),
            floor(runif(3, 850L, 1000L)))
)

knitr::kable(tetris_empty)

name	time	value
robin	morning	340
NA	afternoon	376
NA	evening	527
sam	morning	594
NA	afternoon	399
NA	evening	26
blair	morning	939
NA	afternoon	974
NA	evening	974

Sometimes this kind of format is used to make something more pleasant to read in a spreadsheet. Now, we happen to know something about the data structure here - that there are three records per person, at morning, afternoon, and evening. What we want to do is fill these missing values by populating each NA with the player’s name that comes before it. The fill function from tidyr does just that: each NA in a variable is populated with the most recent non-NA value before (i.e., above) it.

tetris_empty %>%
  tidyr::fill(name) %>% 
  knitr::kable()

name	time	value
robin	morning	340
robin	afternoon	376
robin	evening	527
sam	morning	594
sam	afternoon	399
sam	evening	26
blair	morning	939
blair	afternoon	974
blair	evening	974

This method of filling in missing values is referred to as “last observation carried forward” and is sometimes abbreviated as “locf”.

Beware: this requires that your data are carefully organized before using fill! The fill function does NOT predict what the entry should be based on other variable values or factor levels; it simply populates each missing value with the most recent non-missing value for that variable. Be very careful with this method to populate missings, and understand that it is only useful in unique cases and not a generally suggested option to replace missing values.
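
To see how this can go wrong, here is a small sketch using made-up data in the same layout (the scores object and its values are illustrative only). Sorting the rows by score before filling breaks the assumption that each name sits directly above the rows it belongs to.

```r
library(tidyr)

# Illustrative data: each name appears once, above that player's rows
scores <- tibble::tibble(
  name  = c("robin", NA, "sam", NA),
  time  = c("morning", "afternoon", "morning", "afternoon"),
  value = c(340, 376, 594, 399)
)

# Sorting by score changes the row order that fill relies on
reordered <- scores[order(scores$value, decreasing = TRUE), ]

# fill copies down whatever name happens to be above each NA
filled <- fill(reordered, name)
filled
```

In scores, the value 376 is Robin’s afternoon score, but after reordering and filling it is labelled with Sam’s name. fill gives no warning, because it has no way of knowing the row order changed.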