bird | count | researcher |
---|---|---|
kookaburra | NA | A |
kookaburra | 0 | B |
crow | NA | A |
crow | 1 | B |
pigeon | -999 | A |
pigeon | -9999 | B |
1 Introduction to missing data
This book contains both practical guides on exploring missing data and some of the deeper details of how naniar works, to help you better explore your missing data. A large component of this book is the exercises that accompany each section in each chapter.
1.1 What are missing values?
First, we need to define missing values:
Missing values are values that should have been recorded but were not.
Consider these two examples where you are out counting birds in an area:
- You see a bird, but forget to record the observation and leave the value blank.
- You do not see any birds, and record a 0 value.
The first of these is an example of a missing value: the value was intended to be recorded but was not. The second is a record of the absence of birds.
In other words, if you did not see any birds, you should enter a value indicating that no birds were seen, and a 0 does exactly that. A blank is different: because there is no record where there should have been one, it is a missing value.
How do we note when a value is missing? There are many ways it might be recorded, depending on a variety of factors, such as the standards in a given field or the way that the data were collected. Here are a few examples:
- Blank records (e.g. empty cells in a spreadsheet)
- Consistently recorded indicators for missing values, such as “NA”, “N/A”, or “-9999”, throughout the data
- A combination of different values meant to indicate a missing value (for example, if a team of researchers is recording urchin diameters, Researcher A might enter “-9999” for a missing measurement, while Researcher B enters “N/A”)
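You can imagine how chaotic this might get when we have many different types of missing value all recorded together. Imagine a dataset like this:
bird | count | researcher |
---|---|---|
kookaburra | NA | A |
kookaburra | 0 | B |
crow | NA | A |
crow | 1 | B |
pigeon | -999 | A |
pigeon | -9999 | B |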
To simplify things, we will start by exploring cleaned up missing values: those stored as NA, which is R’s standard way of representing missing values. Transforming the chaos above, this is how the data would be represented if the missing values were recorded appropriately:
bird | count | researcher |
---|---|---|
kookaburra | NA | A |
kookaburra | 0 | B |
crow | NA | A |
crow | 1 | B |
pigeon | NA | A |
pigeon | NA | B |
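The clean-up itself can be sketched in a few lines of base R. This is only an illustration: the data frame name `birds` is hypothetical, and it assumes the counts were read in as numbers and that -999 and -9999 are the only codes standing in for a missing value.

```r
# Hypothetical data frame mirroring the messy table; names are illustrative.
birds <- data.frame(
  bird = c("kookaburra", "kookaburra", "crow", "crow", "pigeon", "pigeon"),
  count = c(NA, 0, NA, 1, -999, -9999),
  researcher = c("A", "B", "A", "B", "A", "B")
)

# Assumed sentinel codes that stand in for "missing" in this dataset.
missing_codes <- c(-999, -9999)

# Replace every sentinel code with NA; genuine zeros are left untouched.
birds$count[birds$count %in% missing_codes] <- NA

birds
```

The recorded 0 and 1 survive unchanged, while both pigeon rows now hold NA. naniar's replace_with_na() function is designed for the same job.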
To help explore and understand missing values, we’ll be using the naniar package, which provides many helpers to make it easier to explore, understand, and visualise missing values.
1.2 How does R deal with missing values?
Before we start exploring missingness, we need to understand how R interprets and processes missing values. R stores missing values as NA, which stands for Not Available. R deals with NAs in unique, and sometimes unexpected, ways.
1.2.1 Missing values in basic R operations
What happens when we mix missing values (NA) with our calculations? We need to know how R deals with missing values in operations so we can recognize these cases and deal with them appropriately.
The general rule for NAs in R calculations is:
Calculations with NA return NA.
Several outcomes for common operations that include NA are:
- NA + [anything] = NA
- NA - [anything] = NA
- NA * [anything] = NA
- NA / [anything] = NA
- NA == [anything*] = NA
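These rules can be checked directly at the console. The sketch below uses only base R. The last two lines are an addition to the list above: they show operations where the answer does not depend on the unknown value, so R can return a result.

```r
# Arithmetic and comparison with NA propagate the missingness.
NA + 1      # NA
NA * 0      # NA
NA == 1     # NA
NA == NA    # NA: two unknown values cannot be shown to be equal

# Because NA == NA is NA, use is.na() to test for missingness.
is.na(NA)   # TRUE

# The result is the same whatever the missing value is, so it is not NA.
NA^0        # 1
NA | TRUE   # TRUE
```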
For example, suppose we have a heights dataset containing the heights of four friends (Sophie, Dan, Fred, and Liz):
heights <- tibble::tibble(
  name = c("Sophie", "Dan", "Fred", "Liz"),
  height = c(163, 175, NA, NA)
)
heights
# A tibble: 4 × 2
name height
<chr> <dbl>
1 Sophie 163
2 Dan 175
3 Fred NA
4 Liz NA
The sum of the height variable returns NA:
sum(heights$height)
[1] NA
This is because we cannot know the sum of a number and a missing value. Similarly, if we try to find the mean height, NA is returned:
mean(heights$height)
[1] NA
When an operation on data containing an NA returns an NA, it tells us the missing values are not being ignored in the calculation, reflecting the default argument na.rm = FALSE (read: “Remove NAs? No!”) in many functions.
Always check the default NA action (e.g. na.rm = FALSE) for functions. As we will see later, the default in some functions is to remove NAs, sometimes without warning.
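One base R function that behaves this way is table(), which leaves NA out of its counts unless asked otherwise. A small sketch, using a hypothetical vector named `pets`:

```r
# A small vector with one missing value (hypothetical example data).
pets <- c("cat", "dog", NA, "cat")

# By default, table() leaves NA out of the counts without any warning.
default_counts <- table(pets)

# Asking for NA explicitly adds it as its own category.
full_counts <- table(pets, useNA = "ifany")

sum(default_counts)  # 3: the NA was dropped
sum(full_counts)     # 4: the NA is counted
```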
Can we override the default NA action? Sure! For example, we can calculate the mean of the non-missing heights in our example dataset by updating the action to na.rm = TRUE (read: “Remove NAs? Yes!”). The mean value is then calculated based on the two existing height values, and any NAs are ignored.
mean(heights$height, na.rm = TRUE)
[1] 169
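The same argument is available in other base R summary functions. A short sketch, recreating the height values as a plain vector so that it runs on its own:

```r
# The four heights from the example, as a plain numeric vector.
height <- c(163, 175, NA, NA)

sum(height, na.rm = TRUE)   # 338
max(height, na.rm = TRUE)   # 175

# na.rm = TRUE hides how many values were dropped, so it is worth
# checking how many observations the result is actually based on.
sum(!is.na(height))         # 2
```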
Now that we know a bit about how R stores and handles missing values, we can start exploring them.
1.3 Do my data contain missing values?
library(naniar)
Missing values don’t jump out and scream “I’m here!”. They’re usually hidden, like a needle in a haystack - especially in large datasets. We need tools (or rather, functions) to quickly identify and count missing values.
Let’s create an example vector x, which contains missing values encoded as NA:
x <- c(1, NA, 3, NA, NA, 5, 8)
x
[1] 1 NA 3 NA NA 5 8
In this small vector (n = 7), we can quickly see that the 2nd, 4th and 5th values in the vector are NA. With larger data, however, we would want tools to identify these for us, instead of manually looking for them. Two functions for identifying NAs are are_na() and any_na(). These are from the naniar R package.
1.3.1 are_na(): which values are NA?
The are_na() function checks each value in a vector or data frame (i.e., for each value it asks “is this value NA?”) then returns TRUE (if NA) or FALSE (if anything besides NA).
are_na(x)
[1] FALSE TRUE FALSE TRUE TRUE FALSE FALSE
As expected, the three NA elements in x return TRUE.
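Because TRUE counts as 1 in arithmetic, a logical vector like this can also be summarised. The sketch below uses base R's is.na(), which gives the same TRUE/FALSE pattern for this vector, so it runs without naniar installed:

```r
x <- c(1, NA, 3, NA, NA, 5, 8)

# TRUE where the value is missing, FALSE otherwise.
missing <- is.na(x)

which(missing)  # positions of the missing values: 2 4 5
sum(missing)    # number of missing values: 3
mean(missing)   # proportion missing: 3 of 7, about 0.43
```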
1.3.2 any_na(): are there any NAs?
The are_na() function tells us which values are NA. If we instead want to know whether any elements in our data are NA, we can use any_na(). The any_na() function returns TRUE if there are any missing values (stored as NAs), and FALSE if there are none.
any_na(x)
[1] TRUE
Because x contains at least one NA, any_na(x) returns TRUE. It returns FALSE when there are no NA values:
any_na(c(1, 2, 3, 4))
[1] FALSE
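For a data frame, the same question can be asked one column at a time. A sketch in base R, using anyNA() as the base R check for any missing value and a hypothetical data frame named `friends`:

```r
# Hypothetical data frame; only the height column has missing values.
friends <- data.frame(
  name = c("Sophie", "Dan", "Fred", "Liz"),
  height = c(163, 175, NA, NA)
)

# Does each column contain at least one NA?
has_missing <- sapply(friends, anyNA)
has_missing              # name: FALSE, height: TRUE

# How many NAs does each column contain?
n_missing <- colSums(is.na(friends))
n_missing                # name: 0, height: 2
```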
The any_na() and are_na() functions can give us a “heads up” about whether or not our data contains missing values. To deal with them responsibly, however, we need to dig further into patterns of missingness. The next step is exploring missingness visually.
1.3.3 Your Turn: Exercises
You can complete the exercises in an interactive environment using the learnr exercises for this section at (link).