1  Introduction to missing data

This book contains practical guides on exploring missing data, along with some of the deeper details of how naniar works to help you better explore your missing data. The exercises that accompany each section in each chapter are a large component of the book.

1.1 What are missing values?

First, we need to define missing values:

Missing values are values that should have been recorded but were not.

Consider these two examples where you are out counting birds in an area:

  1. You see a bird, but forget to record the observation and leave the value blank.
  2. You see no birds, and record a 0 value.

The first of these is an example of a missing value - the value was intended to be recorded but was not. The second is a record of the absence of birds.

In other words: if you did not see any birds, entering a 0 records that fact, so the value is not missing. In the first case, there is no record where there should have been one, so it is a missing value.

How do we note when a value is missing? There are many ways it might be recorded, depending on a variety of factors, such as the standards in a given field or the way that the data were collected. Here are a few examples:

  • Blank records (e.g. empty cells in a spreadsheet)
  • Consistently recorded indicators for missing values, such as “NA”, “N/A”, or “-9999”, throughout the data
  • A combination of different values meant to indicate a missing value (for example, if a team of researchers are recording urchin diameters, Researcher A might enter “-9999” for a missing measurement, while Researcher B enters “N/A”)

You can imagine how chaotic this might get when we have many different types of missing value all recorded together. Imagine a dataset like this:

bird        count  researcher
kookaburra  NA     A
kookaburra  0      B
crow        NA     A
crow        1      B
pigeon      -999   A
pigeon      -9999  B

To simplify things, we will start by exploring cleaned-up missing values: those stored as NA, which is R’s standard way of representing missing values. Transforming the chaos above, this is how the data would look if the missing values were recorded appropriately:

bird        count  researcher
kookaburra  NA     A
kookaburra  0      B
crow        NA     A
crow        1      B
pigeon      NA     A
pigeon      NA     B
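As a minimal sketch of that transformation, the messy table can be rebuilt by hand and its sentinel codes recoded to NA using base R only. The object names (birds, missing_codes, birds_clean) are hypothetical, and the sketch assumes that -999 and -9999 are the only codes used to mean "missing". naniar also offers replace_with_na() for this kind of recoding.

```r
# Hypothetical reconstruction of the messy table (base R only).
birds <- data.frame(
  bird = c("kookaburra", "kookaburra", "crow", "crow", "pigeon", "pigeon"),
  count = c(NA, 0, NA, 1, -999, -9999),
  researcher = c("A", "B", "A", "B", "A", "B")
)

# Assumption: these are the only sentinel codes that mean "missing".
missing_codes <- c(-999, -9999)

# Recode the sentinel values to NA; the real zero is left untouched.
birds_clean <- birds
birds_clean$count[birds_clean$count %in% missing_codes] <- NA

birds_clean$count
# [1] NA  0 NA  1 NA NA
```

Note that the 0 recorded by Researcher B survives the recoding: it is an observation of no birds, not a missing value.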

To help explore and understand missing values, we’ll be using the naniar package, which provides many helpers to make it easier to explore, understand, and visualise missing values.

1.2 How does R deal with missing values?

Before we start exploring missingness, we need to understand how R interprets and processes missing values. R stores missing values as NA, which stands for Not Available. R deals with NAs in unique, and sometimes unexpected, ways.

1.2.1 Missing values in basic R operations

What happens when we mix missing values (NA) with our calculations? We need to know how R deals with missing values in operations so we can recognize these cases and deal with them appropriately.

The general rule for NAs in R calculations is:

Calculations with NA usually return NA.

Several outcomes for common operations that include NA are:

  • NA + [anything] = NA
  • NA - [anything] = NA
  • NA * [anything] = NA
  • NA / [anything] = NA
  • NA == [anything*] = NA
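These rules are easy to check at the console. The sketch below uses arbitrary numbers and a hypothetical results vector. It also shows a few base R cases where the answer does not depend on the missing value, so R can return a definite result.

```r
# Arithmetic with NA propagates the missing value.
results <- c(NA + 1, NA - 1, NA * 0, NA / 2)
results
# [1] NA NA NA NA

# Comparisons with NA are also NA, even NA compared with itself.
NA == 5    # NA
NA == NA   # NA

# A few exceptions: the result is the same whatever the missing value is.
NA^0         # 1
NA | TRUE    # TRUE
NA & FALSE   # FALSE
```

Because NA == NA is itself NA, a test such as x == NA cannot be used to find missing values. A dedicated function is needed instead.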

For example, suppose we have a heights dataset containing the heights of four friends (Sophie, Dan, Fred, and Liz):

heights <- tibble::tibble(
  name = c("Sophie", "Dan", "Fred", "Liz"),
  height = c(163, 175, NA, NA)
)

heights
# A tibble: 4 × 2
  name   height
  <chr>   <dbl>
1 Sophie    163
2 Dan       175
3 Fred       NA
4 Liz        NA

The sum of the height variable returns NA:

sum(heights$height)
[1] NA

This is because we cannot know the sum of a number and a missing value. Similarly, if we try to find the mean height, NA is returned:

mean(heights$height)
[1] NA

When an operation on data containing an NA returns an NA, it tells us the missing values are not being ignored in the calculation, reflecting the default argument na.rm = FALSE (read: “Remove NAs? No!”) in many functions.

Always check the default NA action (e.g. na.rm = FALSE) for functions. As we will see later, the default in some functions is to remove NA - sometimes without warning.

Can we override the default NA action? Sure! For example, we can calculate the mean of the non-missing heights in our example dataset by updating the action to na.rm = TRUE (read: “Remove NAs? Yes!”). The mean value is then calculated based on the two existing height values, and any NA are ignored.

mean(heights$height, na.rm = TRUE)
[1] 169
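The same argument is available in other base R summary functions. As a small sketch using a plain vector with the same four heights (the vector name height is just for illustration), it can also help to count how many values actually contributed to the result:

```r
# The four heights from the example, as a plain numeric vector.
height <- c(163, 175, NA, NA)

# na.rm = TRUE drops the NA values before summarising.
sum(height, na.rm = TRUE)   # 338
max(height, na.rm = TRUE)   # 175

# How many values were actually used in those summaries?
n_used <- sum(!is.na(height))
n_used                      # 2
```

Reporting the number of values used alongside the summary makes it clear that the mean of 169 is based on two people, not four.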

Now that we know a bit about how R stores and handles missing values, we can start exploring them.

1.3 Do my data contain missing values?

library(naniar)

Missing values don’t jump out and scream “I’m here!”. They’re usually hidden, like a needle in a haystack - especially in large datasets. We need tools (or rather, functions) to quickly identify and count missing values.

Let’s create an example vector x, which contains missing values encoded as NA:

x <- c(1, NA, 3, NA, NA, 5, 8)
x
[1]  1 NA  3 NA NA  5  8

In this small vector (n = 7), we can quickly see that the 2nd, 4th and 5th values in the vector are NA. With larger data, however, we would want tools to identify these for us, instead of manually looking for them. Two functions for identifying NAs are are_na() and any_na(). These are from the naniar R package.

1.3.1 are_na(): which values are NA?

The are_na() function checks each value in a vector or data frame (i.e., for each value it asks “is this value NA?”), then returns TRUE (if NA) or FALSE (if anything besides NA).

are_na(x)
[1] FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE

As expected, the three NA elements in x return TRUE.
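A logical vector like this can be summarised further. The sketch below uses base R's is.na(), which gives the same TRUE/FALSE pattern for this vector, to find the positions of the missing values and to count them. The name na_flags is hypothetical.

```r
x <- c(1, NA, 3, NA, NA, 5, 8)

# TRUE where a value is missing, FALSE otherwise.
na_flags <- is.na(x)

# Positions of the missing values.
which(na_flags)   # 2 4 5

# TRUE counts as 1, so sum() gives the number of missing values
# and mean() gives the proportion missing.
sum(na_flags)     # 3
mean(na_flags)    # 0.4285714
```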

1.3.2 any_na(): are there any NAs?

The are_na() function tells us which values are NA. If we want to know whether any elements in our data are NA, we can instead use any_na(). The any_na() function returns TRUE if there are any missing values (stored as NA), and FALSE if there are none.

any_na(x)
[1] TRUE

Because x contains at least one NA, any_na(x) returns TRUE. If there are no NA values, it returns FALSE:

any_na(c(1, 2, 3, 4))
[1] FALSE
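With a data frame, a natural follow-up question is which columns contain missing values. As a minimal sketch, base R's anyNA() answers the same yes/no question for a vector and can be applied to each column in turn. The heights data is rebuilt here as a plain data frame, and cols_with_na is a hypothetical name.

```r
heights <- data.frame(
  name = c("Sophie", "Dan", "Fred", "Liz"),
  height = c(163, 175, NA, NA)
)

# One TRUE/FALSE per column: does this column contain any NA?
cols_with_na <- vapply(heights, anyNA, logical(1))

# Names of the columns that contain at least one missing value.
names(cols_with_na)[cols_with_na]   # "height"
```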

The any_na() and are_na() functions can give us a “heads up” about whether or not our data contain missing values. To deal with them responsibly, however, we need to dig further into patterns of missingness. The next step is exploring missingness visually.

1.3.3 Your Turn: Exercises

You can complete the exercises in an interactive environment using the learnr exercises for this section at (link).