library(naniar)
2 Missing data gotchas
This book contains practical guides on exploring missing data, as well as some of the deeper details of how naniar works to help you better explore your missing data. A large component of this book is the exercises that accompany each section in each chapter.
Missing data are a special part of R: they are baked right into the software, rather than being made available only by certain R packages. However, missing data have some quirks that can catch you off guard. Let’s call these the “missing data gotchas” and discuss some of them now.
2.1 NaN vs NA
In R, there is a special value, NaN, which stands for “Not a Number”. A NaN will come from operations like the square root of -1:
sqrt(-1)
Warning in sqrt(-1): NaNs produced
[1] NaN
Now, R actually treats NaN as a missing value, the same way it treats NA, even though it is technically not a missing value.
any_na(NaN)
[1] TRUE
This might come up in a data analysis: if you transform some data with the square root, and one of the values is negative, then counting the number of missing values will include the resulting NaN, and you might get caught out.
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
✔ ggplot2 3.3.6 ✔ purrr 0.3.4
✔ tibble 3.1.7 ✔ dplyr 1.0.9
✔ tidyr 1.2.0 ✔ stringr 1.4.0
✔ readr 2.1.2 ✔ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
library(naniar)
vec <- c(-1:4)
sqrt(vec)
Warning in sqrt(vec): NaNs produced
[1] NaN 0.000000 1.000000 1.414214 1.732051 2.000000
sqrt(vec) %>% n_miss()
Warning in sqrt(vec): NaNs produced
[1] 1
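If you need to tell the two apart, base R's is.nan() is TRUE only for NaN, whereas is.na() is TRUE for both. Here is a minimal sketch using base R only; the vector `x` is made up for illustration:

```r
# Hypothetical vector containing one true NA and one NaN
x <- c(1, NA, NaN, 4)

is.na(x)   # TRUE for both the NA and the NaN
is.nan(x)  # TRUE only for the NaN

# Count each kind separately
n_nan     <- sum(is.nan(x))
n_na_only <- sum(is.na(x) & !is.nan(x))
```

Counting them separately like this can help you spot missing values that were created by a transformation rather than being present in the raw data.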
2.2 NULL vs NA
In R, NULL is an empty value. For example, if we create a vector of NULL values, only one appears:
c(NULL, NULL, NULL)
NULL
Compare this to a vector of NA values:
c(NA, NA, NA)
[1] NA NA NA
Importantly, NULL values are not missing values, but rather just “empty” values. This is subtly different from missing: An empty bucket isn’t missing water.
any_na(NULL)
[1] FALSE
Another way to think about this is if you were recording features of animals - animals are all quite different! So you record horn_length of a mouse as NULL - because mice do not have horns. It’s not that it should have been recorded and wasn’t - it shouldn’t be recorded because it doesn’t exist.
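One practical consequence is that NULL does not hold a place in a vector, whereas NA does. A small base R sketch (the values are made up for illustration):

```r
# NULL entries vanish when combined into a vector; NA entries keep their place
with_null <- c(1, NULL, 3)
with_na   <- c(1, NA, 3)

length(with_null)  # the NULL is dropped
length(with_na)    # the NA still occupies a position

# NULL itself has length zero, so there is nothing to test for missingness
length(NULL)
is.na(NULL)
```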
2.3 Inf vs NA
Inf is an infinite value, and results from expressions like 10/0:
10 / 0
[1] Inf
It is not counted as a missing value:
any_na(Inf)
[1] FALSE
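To find infinite values, base R provides is.infinite() and is.finite(). The sketch below uses a made-up vector `x`; converting Inf to NA at the end is just one possible analysis choice, not a rule:

```r
# Hypothetical vector mixing infinite, missing, and ordinary values
x <- c(10 / 0, -10 / 0, NA, NaN, 1)

is.infinite(x)  # TRUE only for Inf and -Inf
is.finite(x)    # TRUE only for ordinary numbers; FALSE for Inf, -Inf, NA, NaN

# One possible way to treat infinite values as missing
x_clean <- x
x_clean[is.infinite(x_clean)] <- NA
n_missing <- sum(is.na(x_clean))
```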
3 “NA” vs NA
Using the function is.na() will return TRUE for NA:
is.na(NA)
[1] TRUE
But a quoted character string, “NA”, is not missing:
is.na("NA")
[1] FALSE
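If missing values have been stored as the text "NA", you can recode them yourself. A minimal base R sketch, with a made-up vector `x`:

```r
# Hypothetical character vector where a missing value was typed as the text "NA"
x <- c("a", "NA", NA, "b")

is.na(x)  # only the third element is a real NA

# Recode the literal string "NA" to a real missing value.
# which() drops the NA produced by comparing the real NA with "NA".
x[which(x == "NA")] <- NA
is.na(x)
```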
3.1 Conditional statements and NA
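Beware of conditional statements and arithmetic with missing values. The logical results follow from what the missing value could be: NA or TRUE is TRUE whatever the NA stands for, whereas NA or FALSE depends on it. The arithmetic results below depend on the order of the operands, and R’s documentation notes that whether such a computation returns NA or NaN is not guaranteed and may depend on the platform. For example:
- NA or TRUE is TRUE
- NA or FALSE is NA
- NA + NaN is NA
- NaN + NA is NaN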
NA | TRUE
[1] TRUE
NA | FALSE
[1] NA
NA + NaN
[1] NA
NaN + NA
[1] NaN
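The same reasoning applies to `&`, and it matters inside if(), which throws an error when its condition is NA. A small sketch in base R (the vector `x` is made up for illustration), using isTRUE() to guard a scalar condition:

```r
and_false <- NA & FALSE  # FALSE: the result is FALSE whatever the NA stands for
and_true  <- NA & TRUE   # NA: the result depends on the missing value

x <- c(3, NA, 7)

# Comparisons with NA give NA, and `if (NA)` is an error,
# so wrap scalar conditions in isTRUE() when NA is possible
first_is_big  <- isTRUE(x[1] > 5)
second_is_big <- isTRUE(x[2] > 5)  # FALSE, rather than an error inside if()
```

Note that isTRUE() quietly treats NA as "not TRUE", so only use it when that is really what you want.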
3.2 The multiple flavours of NA values
NA values represent missing values in R. There are actually many different flavours of NA values in R:
- NA for logical
- NA_character_ for characters
- NA_integer_ for integer values
- NA_real_ for doubles (values with decimal points)
- NA_complex_ for complex values (like 1i)
So what? What does this mean?
is.na(NA)
[1] TRUE
is.na(NA_character_)
[1] TRUE
is.character(NA_character_)
[1] TRUE
is.double(NA_character_)
[1] FALSE
is.integer(NA_integer_)
[1] TRUE
is.logical(NA)
[1] TRUE
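A compact way to see all the flavours side by side is to ask for the typeof() of each one. This is just a sketch in base R; the name `na_flavours` is made up for illustration:

```r
# Each flavour of NA, collected in a list so the types are not coerced
na_flavours <- list(NA, NA_integer_, NA_real_, NA_character_, NA_complex_)

# The underlying type of each flavour
types <- vapply(na_flavours, typeof, character(1))
types

# Every one of them is still a missing value
all_missing <- all(vapply(na_flavours, is.na, logical(1)))
all_missing
```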
Uhhh-huh. So, neat? Right? NA values are this double entity that have different classes? Yup! And they’re among the special reserved words in R. That’s a fun fact.
OK, so why care about this? Well, in R, when you create a vector, it has to resolve to the same class. Not sure what I mean?
Well, imagine you want to have the values 1:3
c(1,2,3)
[1] 1 2 3
And then you add one that is in quotes, “hello there”:
c(1,2,3, "hello there")
[1] "1" "2" "3" "hello there"
They all get converted to “character”.
Well, it turns out that NA values need to have that feature as well. They aren’t this amorphous value that magically takes on the class. Well, they kind of are actually, and that’s kind of the point: we don’t notice it, and it’s one of the great things about R, that it has native support for NA values.
So, imagine this tiny vector, then:
vec <- c("a", NA)
vec
[1] "a" NA
is.character(vec[1])
[1] TRUE
is.na(vec[1])
[1] FALSE
is.character(vec[2])
[1] TRUE
is.na(vec[2])
[1] TRUE
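You can watch the plain NA take on the type of its neighbours by checking typeof() on a few small vectors. A sketch in base R, with made-up vectors:

```r
# A plain NA is logical on its own...
type_alone <- typeof(NA)

# ...but inside a vector it is converted to match the other elements
type_int <- typeof(c(1L, NA))
type_dbl <- typeof(c(1.5, NA))
type_chr <- typeof(c("a", NA))

c(type_alone, type_int, type_dbl, type_chr)
```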
OK, so, what’s the big deal? What’s the deal with this long lead up? Stay with me, we’re nearly there:
vec <- c(1:5)
vec
[1] 1 2 3 4 5
Now, let’s say we want to replace values greater than 4 with the next line in the song by Feist.
If we use the base R function, ifelse:
ifelse(vec > 4, yes = "tell me that you love me more", no = vec)
[1] "1" "2"
[3] "3" "4"
[5] "tell me that you love me more"
It converts everything to a character. We get what we want here.
Now, if we use dplyr::if_else:
dplyr::if_else(vec > 4, true = "tell me that you love me more", false = vec)
Error in `dplyr::if_else()`:
! `false` must be a character vector, not an integer vector.
Ooo, an error? This is useful because you might have a case where you do something like this:
dplyr::if_else(vec > 4, true = "5", false = vec)
Error in `dplyr::if_else()`:
! `false` must be a character vector, not an integer vector.
Which wouldn’t be protected against in base:
ifelse(vec > 4, yes = "5", no = vec)
[1] "1" "2" "3" "4" "5"
So why does that matter for NA values?
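Well, because if you try to replace values greater than 4 with NA, you’ll get a similar error (at least with the dplyr 1.0.9 loaded above; as far as I know, dplyr 1.1.0 and later are more forgiving about a plain NA here):
dplyr::if_else(vec > 4, true = NA, false = vec)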
Error in `dplyr::if_else()`:
! `false` must be a logical vector, not an integer vector.
But this can be resolved by using the appropriate NA type:
dplyr::if_else(vec > 4, true = NA_integer_, false = vec)
[1] 1 2 3 4 NA
And that’s why it’s important to know about.
It’s one of these somewhat annoying things that you can come across in the tidyverse, but it’s also kind of great. It’s opinionated, and it means that you will almost certainly save yourself a whole world of pain later.
What is kind of fun is that using base R you can get some interesting results playing with the different types of NA values, like so:
ifelse(vec > 4, yes = NA, no = vec)
[1] 1 2 3 4 NA
ifelse(vec > 4, yes = NA_character_, no = vec)
[1] "1" "2" "3" "4" NA
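To see what base R is doing here, check the typeof() of each result. The sketch below also shows one more quirk: when every element takes the `yes` branch, only the plain (logical) NA is ever used, so the result comes back as logical:

```r
vec <- 1:5

# The plain NA is absorbed into the integer values from `no`
type_plain <- typeof(ifelse(vec > 4, yes = NA, no = vec))

# A character NA drags everything along to character
type_chr <- typeof(ifelse(vec > 4, yes = NA_character_, no = vec))

# If the test is TRUE everywhere, `no` is never used and the result is logical
type_all_yes <- typeof(ifelse(vec > 0, yes = NA, no = vec))

c(type_plain, type_chr, type_all_yes)
```

So the type you get back from ifelse() can depend on the data as well as on the arguments, which is exactly the kind of surprise that the stricter dplyr::if_else is designed to avoid.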
It’s also worth knowing that you’ll get an error from the same kind of type mismatch in case_when:
dplyr::case_when(
  vec > 4 ~ NA,
  TRUE ~ vec
)
)
Error in names(message) <- `*vtmp*`: 'names' attribute [1] must be the same length as the vector [0]
But this can be resolved by using the appropriate NA value:
dplyr::case_when(
  vec > 4 ~ NA_integer_,
  TRUE ~ vec
)
)
[1] 1 2 3 4 NA
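For comparison, here is a sketch of the same replacement done with base R subset assignment, which is an alternative to the functions above rather than what they do internally. The plain NA is converted to match the vector, so the result stays an integer vector (the name `vec_na` is made up for illustration):

```r
vec <- 1:5

# Copy the vector, then overwrite the values greater than 4 with NA
vec_na <- vec
vec_na[vec_na > 4] <- NA

vec_na
typeof(vec_na)  # still "integer": the NA was converted to fit the vector
```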