This book contains both practical guides to exploring missing data and some of the deeper details of how `naniar` works, to help you better explore your missing data. A large component of this book is the set of exercises that accompany each section in each chapter.

``library(naniar)``

Missing data are a special part of R: they are baked right into the software rather than only being made available by certain R packages. However, missing data have some quirks that can catch you off guard. Let’s call these the “missing data gotchas” and discuss some of them now.

## 2.1 NaN vs NA

In R, there is a special value, `NaN`, which stands for “Not a Number”. A `NaN` will come from operations like the square root of -1:

``sqrt(-1)``
``Warning in sqrt(-1): NaNs produced``
``[1] NaN``

Now, R actually interprets `NaN` as a missing value, treating it the same way it treats `NA`, even though it is technically not a missing value.

``any_na(NaN)``
``[1] TRUE``

This might come up in a data analysis. If you transform some data with the square root and then count the number of missing values, any negative value will show up as missing, and you might get caught out.

``library(tidyverse)``
``── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──``
``````✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1``````
``── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──``
``````library(naniar)
vec <- c(-1:4)
sqrt(vec)``````
``Warning in sqrt(vec): NaNs produced``
``[1]      NaN 0.000000 1.000000 1.414214 1.732051 2.000000``
``sqrt(vec) %>% n_miss()``
``Warning in sqrt(vec): NaNs produced``
``[1] 1``
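
If you need to tell the two apart, base R's `is.nan()` is `TRUE` only for `NaN`, whereas `is.na()` is `TRUE` for both. Here is a minimal sketch of that difference; the vector `x` is made up for illustration:

```r
# A made-up vector with one NA, one NaN, and one ordinary value
x <- c(NA, NaN, 1)

na_flags  <- is.na(x)    # TRUE for both NA and NaN
nan_flags <- is.nan(x)   # TRUE only for NaN

# "Not recorded" in the usual sense: NA but not NaN
true_na <- is.na(x) & !is.nan(x)

na_flags
nan_flags
true_na
```

Counting `sum(is.nan(x))` after a transformation is one way to check how many of your "missing" values were actually created by the arithmetic.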

## 2.2 NULL vs NA

In R, `NULL` is an empty value. For example, if we create a vector of `NULL` values, only a single `NULL` comes back:

``c(NULL, NULL, NULL)``
``NULL``

Compare this to a vector of NA values:

``c(NA, NA, NA)``
``[1] NA NA NA``

Importantly, NULL values are not missing values, but rather just “empty” values. This is subtly different from missing: An empty bucket isn’t missing water.

``any_na(NULL)``
``[1] FALSE``

Another way to think about this is if you were recording features of animals - animals are all quite different! So you record `horn_length` of a mouse as NULL - because mice do not have horns. It’s not that it should have been recorded and wasn’t - it shouldn’t be recorded because it doesn’t exist.
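
The "empty" versus "missing" distinction also shows up in lengths. The sketch below uses only base R, and the vector `x` is made up for illustration:

```r
null_length <- length(NULL)   # 0: there is nothing there
na_length   <- length(NA)     # 1: one value, which happens to be missing

# NULL silently drops out of a vector, while NA keeps its place
x <- c(1, NULL, NA, 3)
x_length <- length(x)         # 3

# is.null() is the check for emptiness
null_check <- is.null(NULL)   # TRUE

# is.na(NULL) has no elements to test, so it returns a zero-length result
na_check_length <- length(is.na(NULL))  # 0

null_length
na_length
x
```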

## 2.3 Inf vs NA

`Inf` is an infinite value, and results from expressions like `10 / 0`:

``10 / 0``
``[1] Inf``

It is not counted as a missing value:

``any_na(Inf)``
``[1] FALSE``
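
Because `Inf` is not flagged as missing, it can slip through a missing-data check and still dominate a summary. As a rough sketch, base R's `is.finite()` and `is.infinite()` can be used to look for it; the vector `x` is made up for illustration:

```r
# A made-up vector mixing ordinary, infinite, and missing values
x <- c(1, Inf, -Inf, NA, NaN)

finite_flags   <- is.finite(x)    # TRUE only for the ordinary value
infinite_flags <- is.infinite(x)  # TRUE for Inf and -Inf

# A single Inf is enough to make the mean infinite
mean_with_inf <- mean(c(1, 2, Inf))

finite_flags
infinite_flags
mean_with_inf
```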

# 3 “NA” vs NA

The function `is.na()` returns `TRUE` for `NA`:

``is.na(NA)``
``[1] TRUE``

But the quoted character string `"NA"` is not missing:

``is.na("NA")``
``[1] FALSE``
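
This matters when missing values have been stored as the text `"NA"`. One possible way to turn them into real missing values in base R is sketched below; the `responses` vector is hypothetical:

```r
# Hypothetical survey responses where one "NA" was typed in as text
responses <- c("yes", "NA", "no", NA)

before <- is.na(responses)   # only the real NA is flagged

# Replace the literal string "NA" with a proper character NA.
# %in% is used so that the existing NA does not produce an NA in the index.
responses[responses %in% "NA"] <- NA_character_

after <- is.na(responses)    # now both are flagged

before
after
```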

## 3.1 Conditional statements and NA

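Beware of logical and arithmetic operations that involve missing values. For example:

• `NA | TRUE` is `TRUE`
• `NA | FALSE` is `NA`
• `NA + NaN` is `NA`
• `NaN + NA` is `NaN`

The last two results are what this machine returned. R’s documentation notes that when `NA` and `NaN` are mixed in a computation, whether you get `NA` or `NaN` back is not guaranteed and may depend on the platform.
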
``NA | TRUE``
``[1] TRUE``
``NA | FALSE``
``[1] NA``
``NA + NaN``
``[1] NA``
``NaN + NA``
``[1] NaN``
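
The logical results make sense if you read `NA` as "unknown": `NA | TRUE` is `TRUE` whatever the unknown value is, whereas `NA | FALSE` depends on it. The sketch below shows the matching behaviour for `&`, and how an `NA` in a condition carries through to subsetting. The vector `vals` is made up for illustration:

```r
and_false <- NA & FALSE   # FALSE: the answer is FALSE whatever NA stands for
and_true  <- NA & TRUE    # NA: the answer depends on the unknown value

# An NA in a logical condition produces an NA in the subset
vals <- c(1, NA, 3)
with_na    <- vals[vals > 2]          # NA 3
without_na <- vals[which(vals > 2)]   # 3, because which() drops the NA

# isTRUE() gives a definite TRUE/FALSE, which is what if() needs
safe_check <- isTRUE(NA > 1)          # FALSE

and_false
and_true
with_na
without_na
safe_check
```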

## 3.2 The multiple flavours of NA values

`NA` values represent missing values in R. There are actually many different flavours of NA values in R:

• `NA` for logical
• `NA_character_` for characters
• `NA_integer_` for integer values
• `NA_real_` for doubles (values with decimal points)
• `NA_complex_` for complex values (like `1i`)

So what? What does this mean?

``is.na(NA)``
``[1] TRUE``
``is.na(NA_character_)``
``[1] TRUE``
``is.character(NA_character_)``
``[1] TRUE``
``is.double(NA_character_)``
``[1] FALSE``
``is.integer(NA_integer_)``
``[1] TRUE``
``is.logical(NA)``
``[1] TRUE``
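
Another way to see the flavours side by side is to ask for the type of each one with base R's `typeof()`. This is only a small sketch, and `na_types` is a name chosen for illustration:

```r
# Ask each flavour of NA what type it is
na_types <- vapply(
  list(NA, NA_integer_, NA_real_, NA_character_, NA_complex_),
  typeof,
  character(1)
)

na_types
# [1] "logical"   "integer"   "double"    "character" "complex"
```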

Uhhh-huh. So, neat? Right? NA values are this dual entity that comes in different classes? Yup! And `NA`, `NA_integer_`, `NA_real_`, and `NA_character_` are among the special reserved words in R. That’s a fun fact.

OK, so why care about this? Well, in R, when you create a vector, all of its elements have to resolve to the same class. Not sure what I mean?

Well, imagine you want to have the values 1 to 3:

``c(1,2,3)``
``[1] 1 2 3``

And then you add one that is in quotes, “hello there”:

``c(1,2,3, "hello there")``
``[1] "1"           "2"           "3"           "hello there"``

They all get converted to “character”.
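
This conversion follows a fixed order in base R: logical gives way to integer, integer to double, and double to character. A short sketch of that order, using made-up values:

```r
# Each pair is promoted to the more general of the two types
lgl_int <- typeof(c(TRUE, 1L))    # "integer"
int_dbl <- typeof(c(1L, 2.5))     # "double"
dbl_chr <- typeof(c(2.5, "a"))    # "character"

lgl_int
int_dbl
dbl_chr
```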

Well, it turns out that `NA` values need to have that feature as well. They aren’t this amorphous value that magically takes on the class. Well, they kind of are actually, and that’s kind of the point: we don’t notice it, and it’s one of the great things about R that it has native support for NA values.

So, imagine this tiny vector, then:

``````vec <- c("a", NA)
vec``````
``[1] "a" NA ``
``is.character(vec[1])``
``[1] TRUE``
``is.na(vec[1])``
``[1] FALSE``
``is.character(vec[2])``
``[1] TRUE``
``is.na(vec[2])``
``[1] TRUE``
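
The same thing happens for other types: a plain `NA` placed in a vector quietly becomes the flavour that matches its neighbours. A small sketch using base R's `typeof()` and `identical()`:

```r
chr_type <- typeof(c("a", NA))   # "character"
int_type <- typeof(c(1L, NA))    # "integer"
dbl_type <- typeof(c(1.5, NA))   # "double"

# The NA inside the character vector is exactly NA_character_
is_chr_na <- identical(c("a", NA)[2], NA_character_)   # TRUE

chr_type
int_type
dbl_type
is_chr_na
```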

OK, so, what’s the big deal? What’s the deal with this long lead up? Stay with me, we’re nearly there:

``````vec <- c(1:5)
vec``````
``[1] 1 2 3 4 5``

Now, let’s say we want to replace values greater than 4 with the next line of the song by Feist.

If we use base R’s `ifelse()`:

``ifelse(vec > 4, yes = "tell me that you love me more", no = vec)``
``````[1] "1"                             "2"
[3] "3"                             "4"
[5] "tell me that you love me more"``````

It converts everything to a character. We get what we want here.

Now, if we use `dplyr::if_else`:

``dplyr::if_else(vec > 4, true = "tell me that you love me more", false = vec)``
``````Error in `dplyr::if_else()`:
! `false` must be a character vector, not an integer vector.``````

Ooo, an error? This is useful, because you might have a case where you do something like this:

``dplyr::if_else(vec > 4, true = "5", false = vec)``
``````Error in `dplyr::if_else()`:
! `false` must be a character vector, not an integer vector.``````

Base R wouldn’t protect you against this:

``ifelse(vec > 4, yes = "5", no = vec)``
``[1] "1" "2" "3" "4" "5"``
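
When a character result really is what you want, one option is to make the conversion explicit, so that both branches of `dplyr::if_else()` have the same type. This is a sketch rather than the only way to do it, and it assumes `dplyr` is installed:

```r
library(dplyr)

vec <- 1:5

# Convert the integer branch to character ourselves,
# so that `true` and `false` are both character vectors
lyric <- if_else(
  vec > 4,
  true = "tell me that you love me more",
  false = as.character(vec)
)

lyric
```

The conversion that `ifelse()` did silently is now written down in the code, which is the point of the stricter function.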

So why does that matter for NA values?

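Well, because if you try to replace values greater than 4 with `NA`, you’ll get a similar error (this is with dplyr 1.0.9, the version loaded above; dplyr 1.1.0 and later relaxed this check and accept a plain `NA` here):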

``dplyr::if_else(vec > 4, true = NA, false = vec)``
``````Error in `dplyr::if_else()`:
! `false` must be a logical vector, not an integer vector.``````

But this can be resolved by using the appropriate `NA` type:

``dplyr::if_else(vec > 4, true = NA_integer_, false = vec)``
``[1]  1  2  3  4 NA``
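
For comparison, a base R alternative for this particular task is `replace()`, which keeps the type of the original vector. A minimal sketch:

```r
vec <- 1:5

# Replace values greater than 4 with NA; the result stays an integer vector
replaced <- replace(vec, vec > 4, NA)

replaced
typeof(replaced)   # "integer"
```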

And that’s why it’s important to know about.

It’s one of these somewhat annoying things that you can come across in the tidyverse, but it’s also kind of great. It’s opinionated, and it means that you will almost certainly save yourself a whole world of pain later.

What is kind of fun is that using base R you can get some interesting results playing with the different types of `NA` values, like so:

``ifelse(vec > 4, yes = NA, no = vec)``
``[1]  1  2  3  4 NA``
``ifelse(vec > 4, yes = NA_character_, no = vec)``
``[1] "1" "2" "3" "4" NA ``
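
The printed output hides part of the story, because the type of the result changes with the flavour of `NA` you supply. A small sketch that checks the types with `typeof()`:

```r
vec <- 1:5

# The flavour of NA in `yes` decides the type of the whole result
type_plain <- typeof(ifelse(vec > 4, yes = NA, no = vec))             # "integer"
type_real  <- typeof(ifelse(vec > 4, yes = NA_real_, no = vec))       # "double"
type_chr   <- typeof(ifelse(vec > 4, yes = NA_character_, no = vec))  # "character"

type_plain
type_real
type_chr
```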

It’s also worth knowing that `case_when()` fails for the same reason, although the error message it gave here is much less helpful:

``````dplyr::case_when(
  vec > 4 ~ NA,
  TRUE ~ vec
)``````
``Error in names(message) <- `*vtmp*`: 'names' attribute [1] must be the same length as the vector [0]``

But this can be resolved by using the appropriate `NA` value:

``````dplyr::case_when(
  vec > 4 ~ NA_integer_,
  TRUE ~ vec
)``````
``[1]  1  2  3  4 NA``