This book contains both practical guides to exploring missing data and some of the deeper details of how `naniar` works, to help you better explore your missing data. A large component of this book is the exercises that accompany each section in each chapter.

## 1.1 What are missing values?

First, we need to define missing values:

Missing values are values that should have been recorded but were not.

Consider these two examples where you are out counting birds in an area:

1. You see a bird, but forget to record the observation and leave the value blank.
2. You do not see any birds, and record a value of 0.

The first of these is an example of a missing value - the value was intended to be recorded but was not. The second is a record of the absence of birds.

In other words, seeing no birds is itself an observation, and it should be entered as a value (such as 0) indicating that no birds were seen. A missing value arises only when there is no record where there should have been one.

How do we note when a value is missing? There are many ways it might be recorded, depending on a variety of factors, such as the standards in a given field or the way the data were collected. Here are a few examples:

• Blank records (e.g. empty cells in a spreadsheet)
• Consistently recorded indicators for missing values, such as “NA”, “N/A”, or “-9999”, throughout the data
• A combination of different values meant to indicate a missing value (for example, if a team of researchers are recording urchin diameters, Researcher A might enter “-9999” for a missing measurement, while Researcher B enters “N/A”)

You can imagine how chaotic this might get when we have many different types of missing value all recorded together. Imagine a dataset like this:

| bird | count | researcher |
|---|---|---|
| kookaburra | NA | A |
| kookaburra | 0 | B |
| crow | NA | A |
| crow | 1 | B |
| pigeon | -999 | A |
| pigeon | -9999 | B |

To simplify things, we will start by exploring cleaned-up missing values: those stored as `NA`, which is R’s standard way of representing missing values. Transforming the chaos above, this is how the data would look if the missing values were recorded appropriately:

| bird | count | researcher |
|---|---|---|
| kookaburra | NA | A |
| kookaburra | 0 | B |
| crow | NA | A |
| crow | 1 | B |
| pigeon | NA | A |
| pigeon | NA | B |
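One way this clean-up could be done is sketched below in base R. The data frame name `birds` and the vector `missing_codes` are hypothetical, made up to mirror the tables above, and the sketch assumes that -999 and -9999 are the only codes used to mean "missing".

```r
# Hypothetical data frame mirroring the messy table above
birds <- data.frame(
  bird = c("kookaburra", "kookaburra", "crow", "crow", "pigeon", "pigeon"),
  count = c(NA, 0, NA, 1, -999, -9999),
  researcher = c("A", "B", "A", "B", "A", "B")
)

# Assumption: these are the only sentinel codes used for "missing"
missing_codes <- c(-999, -9999)

# Replace the sentinel codes with NA; genuine zeros are left untouched
birds$count[birds$count %in% missing_codes] <- NA

birds
```

The `naniar` package also offers `replace_with_na()` for this kind of task; the base R version is used here only to keep the sketch free of dependencies.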

To help explore and understand missing values, we’ll be using the `naniar` package, which provides many helpers to make it easier to explore, understand, and visualise missing values.

## 1.2 How does R deal with missing values?

Before we start exploring missingness, we need to understand how R interprets and processes missing values. R stores missing values as `NA`, which stands for Not Available. R deals with `NA`s in unique, and sometimes unexpected, ways.

### 1.2.1 Missing values in basic R operations

What happens when we mix missing values (`NA`) with our calculations? We need to know how R deals with missing values in operations so we can recognize these cases and deal with them appropriately.

The general rule for `NA`s in R calculations is:

Calculations with `NA` return `NA`.

Several outcomes for common operations that include `NA` are:

• `NA` + [anything] = `NA`
• `NA` - [anything] = `NA`
• `NA` * [anything] = `NA`
• `NA` / [anything] = `NA`
• `NA` == [anything*] = `NA`
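These rules are easy to check at the console. The short sketch below uses only base R, and it also shows why `is.na()` rather than `==` is the usual way to test whether a value is missing.

```r
NA + 1    # NA
NA * 0    # NA
NA == 1   # NA
NA == NA  # NA: two unknown values cannot be confirmed equal

# Because `==` returns NA rather than TRUE or FALSE,
# use is.na() to test for missingness
is.na(NA)        # TRUE
is.na(c(1, NA))  # FALSE TRUE
```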

For example, suppose we have a `heights` dataset containing the heights of four friends (Sophie, Dan, Fred, and Liz):

```r
heights <- tibble::tibble(
  name = c("Sophie", "Dan", "Fred", "Liz"),
  height = c(163, 175, NA, NA)
)

heights
```
```
# A tibble: 4 × 2
  name   height
  <chr>   <dbl>
1 Sophie    163
2 Dan       175
3 Fred       NA
4 Liz        NA
```

The sum of the `height` variable returns `NA`:

```r
sum(heights$height)
```
```
[1] NA
```

This is because we cannot know the sum of a number and a missing value. Similarly, if we try to find the mean height, `NA` is returned:

```r
mean(heights$height)
```
```
[1] NA
```

When an operation on data containing an `NA` returns an `NA`, it tells us the missing values are not being ignored in the calculation, reflecting the default argument `na.rm = FALSE` (read: “Remove NAs? No!”) in many functions.

Always check the default `NA` action (e.g. `na.rm = FALSE`) for functions. As we will see later, the default in some functions is to remove `NA` - sometimes without warning.
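As a small illustration of a function that drops `NA` by default, consider base R's `table()`. The `answers` vector is made up for this sketch.

```r
# Hypothetical survey responses, one of which is missing
answers <- c("yes", "no", NA, "yes")

# By default, table() leaves NA out of the counts without any message
table(answers)

# Ask for NA to be counted explicitly
table(answers, useNA = "ifany")
```

The first call counts three responses and the second counts four, so the default output gives no hint that a value was missing.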

Can we override the default `NA` action? Sure! For example, we can calculate the mean of the non-missing heights in our example dataset by updating the action to `na.rm = TRUE` (read: “Remove NAs? Yes!”). The mean value is then calculated based on the two existing height values, and any `NA` values are ignored.

```r
mean(heights$height, na.rm = TRUE)
```
```
[1] 169
```
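The `na.rm` argument is available in several other base R summary functions, such as `sum()` and `max()`. The sketch below rebuilds the example data as a plain `data.frame` (an assumption made only so that it runs without extra packages), and also counts how many values each summary is actually based on.

```r
# Same example data as above, rebuilt as a base R data frame
heights <- data.frame(
  name = c("Sophie", "Dan", "Fred", "Liz"),
  height = c(163, 175, NA, NA)
)

sum(heights$height, na.rm = TRUE)  # 338
max(heights$height, na.rm = TRUE)  # 175

# It is worth checking how many values a summary is actually based on
sum(!is.na(heights$height))        # 2
```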

Now that we know a bit about how R stores and handles missing values, we can start exploring them.

## 1.3 Do my data contain missing values?

```r
library(naniar)
```

Missing values don’t jump out and scream “I’m here!”. They’re usually hidden, like a needle in a haystack - especially in large datasets. We need tools (or rather, functions) to quickly identify and count missing values.

Let’s create an example vector `x`, which contains missing values encoded as `NA`:

```r
x <- c(1, NA, 3, NA, NA, 5, 8)
x
```
```
[1]  1 NA  3 NA NA  5  8
```

In this small vector (n = 7), we can quickly see that the 2nd, 4th and 5th values in the vector are `NA`. With larger data, however, we would want tools to identify these for us, instead of manually looking for them. Two functions for identifying `NA`s are `are_na()` and `any_na()`. These are from the `naniar` R package.

### 1.3.1 `are_na()`: which values are `NA`?

The `are_na()` function checks each value in a vector or data frame (i.e., for each value it asks “is this value `NA`?”), then returns TRUE (if `NA`) or FALSE (if anything besides `NA`).

```r
are_na(x)
```
```
[1] FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE
```

As expected, the three `NA` elements in `x` return `TRUE`.
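A logical vector like this is a useful building block. The sketch below uses base R's `is.na()`, which for this vector gives the same TRUE/FALSE pattern as the output above, to find the positions of the missing values and to count them.

```r
x <- c(1, NA, 3, NA, NA, 5, 8)

is.na(x)         # FALSE TRUE FALSE TRUE TRUE FALSE FALSE
which(is.na(x))  # positions of the missing values: 2 4 5
sum(is.na(x))    # number of missing values: 3
mean(is.na(x))   # proportion of values that are missing
```

Summing works because R treats `TRUE` as 1 and `FALSE` as 0, so the sum of a logical vector is the number of `TRUE` values.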

### 1.3.2 `any_na()`: are there any `NA`s?

The `are_na()` function tells us which values are `NA`. If we want to know whether any elements in our data are `NA`, we can instead use `any_na()`. The `any_na()` function returns TRUE if there are any missing values (stored as `NA`), and FALSE if there are none.

```r
any_na(x)
```
```
[1] TRUE
```

Because `x` contains at least one `NA`, `any_na(x)` returns TRUE. It returns FALSE when there are no `NA` values:

```r
any_na(c(1, 2, 3, 4))
```
```
[1] FALSE
```
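The same kind of check can be applied to a whole data frame. The sketch below uses base R only and rebuilds the earlier `heights` example as a plain `data.frame` (an assumption made so that it runs without extra packages). It first asks whether anything is missing, then counts the missing values in each column.

```r
# Same example data as in section 1.2.1, rebuilt as a base R data frame
heights <- data.frame(
  name = c("Sophie", "Dan", "Fred", "Liz"),
  height = c(163, 175, NA, NA)
)

# Is anything missing anywhere in the data frame?
any(is.na(heights))      # TRUE

# How many values are missing in each column?
colSums(is.na(heights))  # name: 0, height: 2
```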

The `any_na()` and `are_na()` functions can give us a “heads up” about whether or not our data contain missing values. To deal with them responsibly, however, we need to dig further into patterns of missingness. The next step is exploring missingness visually.