library(naniar)
library(tidyr)
7 Missing, missing data: explicit and implicit missings
This book contains both practical guides on exploring missing data and some of the deeper details of how naniar works to help you better explore your missing data. A large component of this book is the exercises that accompany each section in each chapter.
So far, we have learned how to apply tools and strategies to explore, search for, and replace missing values. We know how to find and replace recorded missing values masquerading as real values, including sneaky strings like “N/A”, “missing”, and “no record”. But what if an entire row is omitted? Then there is no record of those missing values in our data, but they are still missing. Sounds strange, perhaps? Values that exist in our data as recorded missing values are called explicit missing values, and so far these are the only kind we have explored. Values that are absent from the data altogether are, in effect, missing missing values, more often called implicit missings.
More briefly: missing values in a dataset can either be explicit, meaning they are missing but recorded, or implicit, meaning that their presence is only implied based on other information (e.g. existing factor levels) in the data.
7.0.1 Explore implicit missings
Imagine we have tetris scores for three friends: Robin, Sam, and Blair. Their scores are recorded in the morning, afternoon, and evening, as shown below:
set.seed(2020-07-08)
tetris <- data.frame(
  name = c(rep("robin", 3),
           rep("sam", 2),
           rep("blair", 3)),
  time = c("morning", "afternoon", "evening",
           "morning", "afternoon",
           "morning", "afternoon", "evening"),
  value = c(floor(runif(3, 0L, 1000L)),
            floor(runif(2, 20L, 1000L)),
            floor(runif(3, 850L, 1000L)))
)
knitr::kable(tetris)
name | time | value |
---|---|---|
robin | morning | 832 |
robin | afternoon | 86 |
robin | evening | 897 |
sam | morning | 93 |
sam | afternoon | 688 |
blair | morning | 952 |
blair | afternoon | 954 |
blair | evening | 955 |
Do you notice something different about one of the friends’ records? Sam’s score is recorded for morning and afternoon, but their evening score is missing entirely. Sam’s evening score is not recorded as missing - the evening record is not even there! This becomes clearer if we spread out the data, so that we have one column each for morning, afternoon, and evening.
tetris %>%
  pivot_wider(id_cols = name,
              names_from = time,
              values_from = value) %>%
  knitr::kable()
name | morning | afternoon | evening |
---|---|---|---|
robin | 832 | 86 | 897 |
sam | 93 | 688 | NA |
blair | 952 | 954 | 955 |
Notice how there is now an NA indicated for Sam’s evening score? The missing value we see here did not show up before - in long format, it was actually a missing missing value!
In this example, Sam’s evening score is an implicit missing value.
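Widening the data is one way to reveal an implicit missing; counting how many rows exist for each combination of name and time is another. The sketch below is a minimal base R illustration, not part of naniar: it rebuilds the tetris scores by hand from the table above, and the names `combo_counts` and `absent` are made up for this example.

```r
# Rebuild the tetris scores from the table above (values copied by hand).
tetris <- data.frame(
  name  = c(rep("robin", 3), rep("sam", 2), rep("blair", 3)),
  time  = c("morning", "afternoon", "evening",
            "morning", "afternoon",
            "morning", "afternoon", "evening"),
  value = c(832, 86, 897, 93, 688, 952, 954, 955)
)

# No explicit missing values are recorded anywhere in the long data.
sum(is.na(tetris))

# Count the rows that exist for each name/time combination.
combo_counts <- as.data.frame(table(tetris$name, tetris$time))
names(combo_counts) <- c("name", "time", "n")

# A count of zero marks a combination that never appears in the data:
# a candidate implicit missing value.
absent <- combo_counts[combo_counts$n == 0, ]
absent
```

This only flags combinations of values that already occur somewhere in the data; a player or time of day that was never recorded at all would not show up in the counts.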
7.0.2 Making implicit missings explicit
It can sometimes be useful to make implicit missing values explicit (even in long format), which we can do using the complete function from tidyr. With the tetris data, that looks like this:
tetris %>%
  tidyr::complete(name, time)
# A tibble: 9 × 3
name time value
<chr> <chr> <dbl>
1 blair afternoon 954
2 blair evening 955
3 blair morning 952
4 robin afternoon 86
5 robin evening 897
6 robin morning 832
7 sam afternoon 688
8 sam evening NA
9 sam morning 93
We see that an observation has now been created for Sam’s evening score, with value recorded as NA.
What is the complete function actually doing? Based on the specified variables name and time, the function has identified expected combinations of those two variables across all groups (i.e., because Blair and Robin have an evening score, we expect that Sam should too) - and a new observation is created to make Sam’s implicit missing evening score an explicit one that appears in the data.
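To make that description concrete, the two steps can be approximated by hand: build every combination of name and time, then join the recorded scores onto it. This is a rough sketch of the idea rather than how tidyr implements it internally; the data are retyped from the table above, and `all_combos` and `by_hand` are names invented for this example.

```r
library(tidyr)

# Tetris scores retyped from the table above.
tetris <- data.frame(
  name  = c(rep("robin", 3), rep("sam", 2), rep("blair", 3)),
  time  = c("morning", "afternoon", "evening",
            "morning", "afternoon",
            "morning", "afternoon", "evening"),
  value = c(832, 86, 897, 93, 688, 952, 954, 955)
)

# Step 1: every combination of the name and time values seen in the data.
all_combos <- tidyr::expand(tetris, name, time)

# Step 2: keep every combination and attach a score where one was recorded.
# Combinations without a matching row receive NA.
by_hand <- merge(all_combos, tetris, by = c("name", "time"), all.x = TRUE)

# The only row without a recorded score.
by_hand[is.na(by_hand$value), ]

# For comparison, the one-step version.
completed <- tidyr::complete(tetris, name, time)
```

Note that the expected combinations are built only from values already present in the data, so a time of day that nobody played would not be added.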
Whereas the implicit missing for Sam’s evening score would not be detected using the tools to count, summarize and visualize NA values we have learned so far, when converted to an explicit missing using complete, it would be detected because it has been populated with NA.
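A quick before-and-after count illustrates the difference. The sketch uses base R's `is.na()` as a stand-in for the counting tools, retypes the tetris scores from the table above, and introduces `tetris_complete` as a name for this example only.

```r
library(tidyr)

# Tetris scores retyped from the table above.
tetris <- data.frame(
  name  = c(rep("robin", 3), rep("sam", 2), rep("blair", 3)),
  time  = c("morning", "afternoon", "evening",
            "morning", "afternoon",
            "morning", "afternoon", "evening"),
  value = c(832, 86, 897, 93, 688, 952, 954, 955)
)

# Before: the implicit missing is invisible to a count of NA values.
n_before <- sum(is.na(tetris$value))

# After: complete() adds the absent row, so the same count now sees it.
tetris_complete <- tidyr::complete(tetris, name, time)
n_after <- sum(is.na(tetris_complete$value))

c(before = n_before, after = n_after)

# Missing scores per player in the completed data.
tapply(is.na(tetris_complete$value), tetris_complete$name, sum)
```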
7.0.3 Handling explicitly missing values
Sometimes missing data is entered to help make a dataset more readable. For example, imagine if we had the following structure for our tetris data:
tetris_empty <- tibble::tibble(
  name = c("robin", NA, NA,
           "sam", NA, NA,
           "blair", NA, NA),
  time = c("morning", "afternoon", "evening",
           "morning", "afternoon", "evening",
           "morning", "afternoon", "evening"),
  value = c(floor(runif(3, 0L, 1000L)),
            floor(runif(3, 20L, 1000L)),
            floor(runif(3, 850L, 1000L)))
)
knitr::kable(tetris_empty)
name | time | value |
---|---|---|
robin | morning | 340 |
NA | afternoon | 376 |
NA | evening | 527 |
sam | morning | 594 |
NA | afternoon | 399 |
NA | evening | 26 |
blair | morning | 939 |
NA | afternoon | 974 |
NA | evening | 974 |
Sometimes this kind of format is used to make something more pleasant to read in a spreadsheet. Now, we happen to know something about the data structure here - that there are three records per person, at morning, afternoon, and evening. What we want to do is fill these missing values by populating each NA with the player’s name that comes before it. The fill function from tidyr does just that: each NA in a variable is populated with the most recent non-NA value before (i.e., above) it.
tetris_empty %>%
  tidyr::fill(name) %>%
  knitr::kable()
name | time | value |
---|---|---|
robin | morning | 340 |
robin | afternoon | 376 |
robin | evening | 527 |
sam | morning | 594 |
sam | afternoon | 399 |
sam | evening | 26 |
blair | morning | 939 |
blair | afternoon | 974 |
blair | evening | 974 |
This method of filling in missing values is referred to as “last observation carried forward” and is sometimes abbreviated as “locf”.
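To show what “carrying forward” means mechanically, here is a small base R sketch of the idea. The helper `locf()` is hypothetical, written only for this illustration; it is not a tidyr or naniar function and is not how fill is implemented.

```r
# Hypothetical helper: carry the last non-missing value forward.
locf <- function(x) {
  # Position of each non-missing entry, and 0 where the entry is NA.
  pos <- ifelse(is.na(x), 0L, seq_along(x))
  # The running maximum is the position of the most recent non-missing entry.
  last_seen <- cummax(pos)
  # Leading NAs have nothing before them to carry forward, so they stay NA.
  last_seen[last_seen == 0] <- NA
  x[last_seen]
}

# The name column from the table above.
name <- c("robin", NA, NA, "sam", NA, NA, "blair", NA, NA)
locf(name)

# A missing value at the very start has no earlier value to borrow.
locf(c(NA, "sam", NA))
```

The second call highlights a limitation of carrying values forward: an NA in the first position is left unfilled.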
Beware: this requires that your data are carefully organized before using fill! The fill function does NOT predict what the entry should be based on other variable values or factor levels; it simply populates each missing value with the most recent non-missing value for that variable. Be very careful with this method of populating missings, and understand that it is only useful in specific cases and is not a generally recommended way to replace missing values.
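The following sketch illustrates how row order changes the result. It retypes the data from the table above and, as an assumption made purely for illustration, sorts the rows by score before filling; `sorted_first`, `filled_ok`, and `filled_bad` are names invented for this example.

```r
library(tidyr)

# Data retyped from the table above.
tetris_empty <- data.frame(
  name  = c("robin", NA, NA, "sam", NA, NA, "blair", NA, NA),
  time  = rep(c("morning", "afternoon", "evening"), 3),
  value = c(340, 376, 527, 594, 399, 26, 939, 974, 974)
)

# Filling in the original row order gives each player their own three rows.
filled_ok <- tidyr::fill(tetris_empty, name)
table(filled_ok$name)

# Sorting by score BEFORE filling separates the NA rows from their owner,
# so fill() borrows whichever name happens to sit above each NA.
sorted_first <- tetris_empty[order(tetris_empty$value), ]
filled_bad <- tidyr::fill(sorted_first, name)
table(filled_bad$name, useNA = "ifany")
```

After sorting, the per-player counts no longer come out as three rows each, and the lowest score ends up with no name at all because nothing sits above it. The function raises no warning in either case, so checking the row order is left to the analyst.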