2  Missing data gotchas

This book contains both practical guides on exploring missing data and some of the deeper details of how naniar works to help you better explore your missing data. A large component of this book is the exercises that accompany each section in each chapter.

library(naniar)

Missing data are a special part of R: they are baked right into the software, rather than only being made available by certain R packages. However, missing data have some quirks that can catch you off guard. Let’s call these the “missing data gotchas” and discuss some of them now.

2.1 NaN vs NA

In R, there is a special value, NaN, which stands for “Not a Number”. A NaN will come from operations like the square root of -1:

sqrt(-1)
Warning in sqrt(-1): NaNs produced
[1] NaN

Now, R’s missing-value checks actually treat NaN as missing, the same way they treat NA, even though it is technically not a missing value.

any_na(NaN)
[1] TRUE

This might come up in a data analysis: if you transform some data with the square root and then count the number of missing values, any negative value in the data will be counted as missing, and you might get caught out.

library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library(naniar)
vec <- c(-1:4)
sqrt(vec)
Warning in sqrt(vec): NaNs produced
[1]      NaN 0.000000 1.000000 1.414214 1.732051 2.000000
sqrt(vec) %>% n_miss()
Warning in sqrt(vec): NaNs produced
[1] 1
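If you need to tell the two apart, base R’s is.nan() is TRUE only for NaN, while is.na() is TRUE for both NA and NaN. A minimal sketch using only base R (the vector x is made up for illustration):

```r
# A made-up vector containing a negative value and a genuine NA
x <- c(-1, 0, 4, NA)

# sqrt() of a negative number gives NaN; the warning is suppressed for clarity
rooted <- suppressWarnings(sqrt(x))

n_na_or_nan <- sum(is.na(rooted))                    # NA and NaN counted together
n_nan       <- sum(is.nan(rooted))                   # only the NaN values
n_true_na   <- sum(is.na(rooted) & !is.nan(rooted))  # only the genuine NA values

c(n_na_or_nan, n_nan, n_true_na)
```

Counting NaN separately is one way to check whether a transformation, rather than the raw data, introduced the "missing" values.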

2.2 NULL vs NA

In R, NULL is an empty value. For example, if we create a vector of NULL values, the result is a single NULL:

c(NULL, NULL, NULL)
NULL

Compare this to a vector of NA values:

c(NA, NA, NA)
[1] NA NA NA

Importantly, NULL values are not missing values, but rather just “empty” values. This is subtly different from missing: an empty bucket isn’t missing water.

any_na(NULL)
[1] FALSE

Another way to think about this is if you were recording features of animals - animals are all quite different! So you record horn_length of a mouse as NULL - because mice do not have horns. It’s not that it should have been recorded and wasn’t - it shouldn’t be recorded because it doesn’t exist.
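One way to see the difference is to look at lengths: NULL takes up no space in a vector, while each NA holds a place. A small base R sketch (the horn_length values are invented for illustration):

```r
empty  <- c(NULL, NULL, NULL)
na_vec <- c(NA, NA, NA)

length(empty)   # 0: the NULL elements vanish
length(na_vec)  # 3: each NA holds a place

is.null(empty)  # TRUE
is.na(empty)    # logical(0): there is nothing to test

# NULL cannot occupy a slot in an atomic vector, so it is silently dropped
horn_length <- c(12.5, NULL, 30)
length(horn_length)  # 2
```

Because NULL is dropped from atomic vectors, a column in a data frame cannot hold it; only a list can keep a NULL element in place.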

2.3 Inf vs NA

Inf is an infinite value, and results from expressions like 10 / 0:

10 / 0
[1] Inf

It is not counted as a missing value:

any_na(Inf)
[1] FALSE
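If you want to catch infinite values alongside missing ones, base R’s is.finite() is FALSE for NA, NaN, Inf, and -Inf alike. A short sketch with a made-up vector:

```r
x <- c(1, Inf, -Inf, NA, NaN)

is.na(x)        # FALSE FALSE FALSE  TRUE  TRUE
is.infinite(x)  # FALSE  TRUE  TRUE FALSE FALSE
is.finite(x)    #  TRUE FALSE FALSE FALSE FALSE

# Number of values that are not usable as ordinary numbers
n_unusable <- sum(!is.finite(x))
n_unusable  # 4
```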

3 “NA” vs NA

Using the function is.na() will return TRUE for NA:

is.na(NA)
[1] TRUE

But the quoted character string “NA” is not missing:

is.na("NA")
[1] FALSE
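If your data arrive with the string "NA" where a real missing value was intended, you can convert them yourself. A minimal base R sketch, assuming a made-up character vector x:

```r
x <- c("a", "NA", NA)

is.na(x)  # FALSE FALSE TRUE: the string "NA" is not missing

# %in% never returns NA, so it is safe to use for indexing here
x_clean <- x
x_clean[x_clean %in% "NA"] <- NA

is.na(x_clean)  # FALSE TRUE TRUE
```

Using %in% rather than == avoids comparing against the existing NA, which would itself return NA.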

3.1 Conditional statements and NA

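Beware of logical and arithmetic operations with missing values. Logical operators return NA only when the missing value could change the answer, and arithmetic that mixes NA and NaN can return either one depending on the order of the operands (R does not guarantee which, and it may vary by platform). For example:

  • NA or TRUE is TRUE
  • NA or FALSE is NA
  • NA + NaN is NA
  • NaN + NA is NaN
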
NA | TRUE
[1] TRUE
NA | FALSE
[1] NA
NA + NaN
[1] NA
NaN + NA
[1] NaN
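The same logic applies to “and”, and it matters when you subset with a condition. A small base R sketch (the vector x is invented for illustration):

```r
and_false <- NA & FALSE  # FALSE: the answer is FALSE whatever the NA is
and_true  <- NA & TRUE   # NA: the answer depends on the missing value

x <- c(1, NA, 5)
test <- x > 2            # FALSE NA TRUE

kept_with_na <- x[test]         # NA 5: an NA in the condition yields an NA
kept_clean   <- x[which(test)]  # 5: which() keeps only the TRUE positions
```

Wrapping the condition in which() is one common way to drop the positions where the comparison itself is missing.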

3.2 The multiple flavours of NA values

NA values represent missing values in R. There are actually many different flavours of NA values in R:

  • NA for logical
  • NA_character_ for characters
  • NA_integer_ for integer values
  • NA_real_ for doubles (values with decimal points)
  • NA_complex_ for complex values (like 1i)

So what? What does this mean?

is.na(NA)
[1] TRUE
is.na(NA_character_)
[1] TRUE
is.character(NA_character_)
[1] TRUE
is.double(NA_character_)
[1] FALSE
is.integer(NA_integer_)
[1] TRUE
is.logical(NA)
[1] TRUE
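You can inspect the type of each flavour directly with typeof(). A quick base R sketch:

```r
na_types <- c(
  logical   = typeof(NA),
  character = typeof(NA_character_),
  integer   = typeof(NA_integer_),
  double    = typeof(NA_real_),
  complex   = typeof(NA_complex_)
)
na_types

# Every flavour is still reported as missing
flavours <- list(NA, NA_character_, NA_integer_, NA_real_, NA_complex_)
all_missing <- all(vapply(flavours, is.na, logical(1)))
all_missing  # TRUE
```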

Uhhh-huh. So, neat? Right? NA values are this dual entity: they are missing, but they also have a class? Yup! And NA, NA_integer_, NA_real_, and NA_character_ are among the special reserved words in R. That’s a fun fact.

OK, so why care about this? Well, in R, when you create a vector, all of its elements have to resolve to the same class. Not sure what I mean?

Well, imagine you want to have the values 1 to 3:

c(1,2,3)
[1] 1 2 3

And then you add one that is in quotes, “hello there”:

c(1,2,3, "hello there")
[1] "1"           "2"           "3"           "hello there"

They all get converted to “character”.

Well, it turns out that NA values need to follow this rule as well: an NA isn’t an amorphous value that sits outside the class system. In everyday use it does seem to magically take on the class of the vector it is in, and that’s kind of the point: R has native support for NA values, so we rarely notice the coercion happening.

So, imagine this tiny vector, then:

vec <- c("a", NA)
vec
[1] "a" NA 
is.character(vec[1])
[1] TRUE
is.na(vec[1])
[1] FALSE
is.character(vec[2])
[1] TRUE
is.na(vec[2])
[1] TRUE
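The same coercion happens for every vector type: a plain NA placed in a vector takes on the type of its neighbours. A short base R sketch:

```r
chr_type <- typeof(c("a", NA))   # "character"
int_type <- typeof(c(1L, NA))    # "integer"
dbl_type <- typeof(c(1.5, NA))   # "double"
lgl_type <- typeof(c(TRUE, NA))  # "logical"

# The NA that comes back out is the typed flavour
chr_na_is_typed <- identical(c("a", NA)[2], NA_character_)  # TRUE
int_na_is_typed <- identical(c(1L, NA)[2], NA_integer_)     # TRUE
```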

OK, so, what’s the big deal? What’s the deal with this long lead up? Stay with me, we’re nearly there:

vec <- c(1:5)
vec
[1] 1 2 3 4 5

Now, let’s say we want to replace values greater than 4 with the next line in the song by Feist.

If we use base R’s ifelse:

ifelse(vec > 4, yes = "tell me that you love me more", no = vec)
[1] "1"                             "2"                            
[3] "3"                             "4"                            
[5] "tell me that you love me more"

It converts everything to a character. We get what we want here.

Now, if we use dplyr::if_else:

dplyr::if_else(vec > 4, true = "tell me that you love me more", false = vec)
Error in `dplyr::if_else()`:
! `false` must be a character vector, not an integer vector.

Ooo, an error? This is useful, because it catches cases where you accidentally mix types, like this:

dplyr::if_else(vec > 4, true = "5", false = vec)
Error in `dplyr::if_else()`:
! `false` must be a character vector, not an integer vector.

Base R’s ifelse would not protect you against this, and silently converts everything to character:

ifelse(vec > 4, yes = "5", no = vec)
[1] "1" "2" "3" "4" "5"

So why does that matter for NA values?

Well, because if you try to replace values greater than 4 with NA, you’ll get a similar error, since a plain NA is logical:

dplyr::if_else(vec > 4, true = NA, false = vec)
Error in `dplyr::if_else()`:
! `false` must be a logical vector, not an integer vector.

But this can be resolved by using the appropriate NA type:

dplyr::if_else(vec > 4, true = NA_integer_, false = vec)
[1]  1  2  3  4 NA

And that’s why it’s important to know about.
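If you don’t remember the name of the typed NA you need, the base R coercion functions give you the same values. The sketch below also defines na_like(), a hypothetical helper written for this example (it is not part of dplyr or naniar) that returns an NA matching the type of whatever vector it is given:

```r
vec <- 1:5

# Coercing a plain NA gives the typed flavours
same_int <- identical(as.integer(NA), NA_integer_)      # TRUE
same_chr <- identical(as.character(NA), NA_character_)  # TRUE
same_dbl <- identical(as.numeric(NA), NA_real_)         # TRUE

# Hypothetical helper: indexing with an integer NA returns one NA of x's type
na_like <- function(x) x[NA_integer_]

int_na_type <- typeof(na_like(vec))          # "integer"
chr_na_type <- typeof(na_like(c("a", "b")))  # "character"
```

A helper like this could be passed as the true or false argument when the type of the input vector is not known in advance.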

It’s one of these somewhat annoying things that you can come across in the tidyverse, but it’s also kind of great. It’s opinionated, and it means that you will almost certainly save yourself a whole world of pain later.

What is kind of fun is that using base R you can get some interesting results playing with the different types of NA values, like so:

ifelse(vec > 4, yes = NA, no = vec)
[1]  1  2  3  4 NA
ifelse(vec > 4, yes = NA_character_, no = vec)
[1] "1" "2" "3" "4" NA 
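Checking the result with typeof() makes the silent coercion visible. A base R sketch, using the same vec as above:

```r
vec <- 1:5

# A plain (logical) NA is coerced up to the type of the other branch
type_plain <- typeof(ifelse(vec > 4, yes = NA, no = vec))             # "integer"

# A typed NA can drag the whole result to a wider type
type_real  <- typeof(ifelse(vec > 4, yes = NA_real_, no = vec))       # "double"
type_chr   <- typeof(ifelse(vec > 4, yes = NA_character_, no = vec))  # "character"
```

So with base ifelse, the flavour of NA you supply can change the type of every other value in the result, without any warning.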

It’s also worth knowing that you’ll get an error in case_when too, although the message here is much less helpful:

dplyr::case_when(
  vec > 4 ~ NA,
  TRUE ~ vec
  )
Error in names(message) <- `*vtmp*`: 'names' attribute [1] must be the same length as the vector [0]

But this can be resolved by using the appropriate NA value:

dplyr::case_when(
  vec > 4 ~ NA_integer_,
  TRUE ~ vec
  )
[1]  1  2  3  4 NA