6  Cleaning missing data

This book contains both practical guides on exploring missing data, as well as some of the deeper details of how naniar works to help you better explore your missing data. A large component of this book are the exercises that accompany each section in each chapter.

library(naniar)
library(dplyr)

6.1 Find and replace missing values

In previous sections, we learned how to count, summarise and visualise missing values stored as NA. Often, however, raw data contain missing values that have been recorded as something other than NA. These include things like characters (e.g. “missing”, “N/A”, or “no data”) or impossible values (e.g. “-9999” for a dolphin length variable).

Always take care when working with data, especially if you did not collect or record it yourself:

Never assume that all missing values are stored as NA

The problem is most functions assessing missingness only recognize NA, so they will not recognize other missing value inputs, such as “NA”, or “missing”. That means the first thing we often need to do is search for missing values stored as something other than NA in our data, then replace those non-NA missing values with NA so our assessments of missingness, and subsequent analyses, are accurate.

In this section, we introduce tools and strategies to:

  • Search for missing values stored as something other than NA
  • Replace them with NA

We introduce the following functions to help us:

  • miss_scan_count(): search for missing values stored as something other than NA (e.g. “N/A”, “-999”, “.”, etc.)
  • replace_with_na(): replace non-NA values with NA

We will use a dataset called chaos, shown below, which contains gnarly values like plain whitespace, ” “, full-stops (or periods)”.”, “N/A”, and “missing” - all of which, in this case, should be stored as NA.

chaos <- tibble::tibble(
  score =   c(3L, -99L, 4L, -99L, 7L, 10L, 12L, 16L, 9L),
  grade = c("N/A", "E", "missing", "na", "n/a", " ", ".", NA, "N/a"),
  place = c(-99, 97, 95, 92, -98, "missing", 88, ".", 86)
)

knitr::kable(chaos)
score grade place
3 N/A -99
-99 E 97
4 missing 95
-99 na 92
7 n/a -98
10 missing
12 . 88
16 NA .
9 N/a 86

An Aside: Talk to the people who collect or curate the data

If you have access to the people who collect or curate the data, talk to them! It is amazing how much they can tell you about the data that you might not have ever known. You will get the most out of the conversation if you’ve had a look at the data first, and noted any abnormalities. Asking questions like “what did you do with missing data? How are missing values encoded? How did you collect the data? Did you summarise the data before giving it to me? Is this the most raw form of the data? Are good questions to help you get started. Also, remember to be friendly to these people. I know from experience that it can be very frustrating to have data that is poor or low quality, where missing values are deleted, or the data is summarised to the point of no variation. However it is important to keep in mind that these people who collect or curate the data are often trying to help you by saving you time summarising. Be kind, and be curious. And ask for the data in the rawest form.

6.1.1 Search for missing values

Before we can start replacing unexpected missing values with NA, we should get a sense of how big this missing data problem is by searching for those strange missing values. The miss_scan_count function in naniar allows you to search for likely records of missing values stored as something other than NA. For example, if we want to check for missing values that are input as “N/A”, we can use:

chaos %>%
  miss_scan_count(search = list("N/A"))
# A tibble: 3 × 2
  Variable     n
  <chr>    <int>
1 score        0
2 grade        1
3 place        0

This returns a dataframe with two columns: “Variable” - the variables in the chaos data frame, and “n”, the number of times that string appears in each variable. Here, we see that “N/A” appears once in the grade variable, and never in the score or place variables.

The miss_scan_count function accepts multiple arguments in the search, so you can look for all the strangely recorded missing values you like! Here we see that when searching for capital “N/A” and “N/a”, there are two hits for the variable, grade (and still 0 for both score and place).

chaos %>%
  miss_scan_count(search = list("N/A",
                                "N/a"))
# A tibble: 3 × 2
  Variable     n
  <chr>    <int>
1 score        0
2 grade        2
3 place        0

The naniar package also contains two helpful datasets to explore missingness, common_na_numbers, and common_na_strings:

common_na_numbers
[1]    -9   -99  -999 -9999  9999    66    77    88
common_na_strings
 [1] "NA"     "N A"    "N/A"    "#N/A"   "NA "    " NA"    "N /A"   "N / A" 
 [9] " N / A" "N / A " "na"     "n a"    "n/a"    "na "    " na"    "n /a"  
[17] "n / a"  " a / a" "n / a " "NULL"   "null"   ""       "\\?"    "\\*"   
[25] "\\."   

These can be put inside of miss_scan_count and we can see we’ve got even more matches!

chaos %>%
  miss_scan_count(search = common_na_numbers)
# A tibble: 3 × 2
  Variable     n
  <chr>    <int>
1 score        2
2 grade        0
3 place        3
chaos %>%
  miss_scan_count(search = common_na_strings)
# A tibble: 3 × 2
  Variable     n
  <chr>    <int>
1 score        9
2 grade        8
3 place        9

You can also look for both:

chaos %>%
  miss_scan_count(search = c(common_na_numbers,
                             common_na_strings))
# A tibble: 3 × 2
  Variable     n
  <chr>    <int>
1 score        9
2 grade        8
3 place        9

Note that you do still need to carefully explore the data and metadata to get an idea of how missing values were recorded so that you don’t miss an obscure missing record. Also consider that some o these values might have other meanings - finding a match of the numbers in common_na_numbers might not mean they all match missing values, since, for example, you could conceivably have values -99.

An Aside on \\

Note that in common_na_strings, there are some \\ for values such as . and * and ?. This is because under the hood, miss_scan_count uses a thing called “regular expressions” to search for characters in the data. Briefly, regular expressions allow you to find and extract parts of text from collections of text. For example, the regular expression “*.csv$” means “find words that contain anything up until”.csv”, and “.csv” is also the last thing in the word. So these values, *, ., and ? all have special meaning in regular expressions. We use \\ to “escape” the regular expression. It’s our way of saying, “No really, just look for”*“, or”.”, or “?”. Regular expressions are a really powerful tool, but can take some (sometimes a lot) of time to get your head around. Two places that provide a nice way to test out regular expressions is https://regex101.com/ and https://regexr.com/.

6.1.2 Replacing missing values with NA

Once you’ve explored and searched for missing values stored as something other than NA, you can replace them with NA using the replace_with_na() function. For example, in the chaos dataset we can replace “N/A” and “N/a” entries that appear in the grade variable as follows:

chaos %>%
  replace_with_na(replace = list(grade = c("N/A", "N/a")))
# A tibble: 9 × 3
  score grade     place  
  <int> <chr>     <chr>  
1     3  <NA>     -99    
2   -99 "E"       97     
3     4 "missing" 95     
4   -99 "na"      92     
5     7 "n/a"     -98    
6    10 " "       missing
7    12 "."       88     
8    16  <NA>     .      
9     9  <NA>     86     

The above code can be read as follows:

Start with the chaos data, then within the variable grade replace any existing values of “N/A” and “N/a” with NA.

We can see this has replaced some of the missing values, but note that it only replaces the exact specified strings (“N/A” and “N/a”) - even slight variations (“na” and “n/a”) still exist.

We can even use common_na_strings in replace_with_na - but be warned! This should only be done if you really, truly, 100% for sure know that all the values in common_na_strings should be missing values in your data. Do not apply this without careful thought! You have been warned!

chaos %>%
  replace_with_na(replace = list(grade = common_na_strings))
# A tibble: 9 × 3
  score grade     place  
  <int> <chr>     <chr>  
1     3  <NA>     -99    
2   -99 "E"       97     
3     4 "missing" 95     
4   -99  <NA>     92     
5     7  <NA>     -98    
6    10 " "       missing
7    12 "."       88     
8    16  <NA>     .      
9     9 "N/a"     86     

UP TO HERE

6.1.3 Useful variants of replace_with_na

The replace_with_na function can be repetitive if you need to use it across many variables, for many different values. Or, for more complex cases where you might only want to replace values less than -1, or only treat character columns. To account for these situations, naniar borrows from dplyr’s scoped variants and extends replace_with_na to create three useful functions:

  • replace_with_na_all(): operates on all variables.
  • replace_with_na_at(): operates on a subset of selected variables
  • replace_with_na_if(): operates on a subset of variables that fulfil a condition (e.g. only on numeric variables)

Example: replace_with_na_all

The scoped variants of replace_with_na follow a specific syntax. You provide a condition argument, and pass it a special function that starts with the squiggly line, tilde, ~, and when referring to a variable, you use .x. For example, if we want to replace all cases of -99 in a dataset, we use replace_with_na_all, and write:

chaos %>%
  replace_with_na_all(condition = ~.x == -99)
# A tibble: 9 × 3
  score grade     place  
  <int> <chr>     <chr>  
1     3 "N/A"     <NA>   
2    NA "E"       97     
3     4 "missing" 95     
4    NA "na"      92     
5     7 "n/a"     -98    
6    10 " "       missing
7    12 "."       88     
8    16  <NA>     .      
9     9 "N/a"     86     

We can read the above code as:

start with chaos, THEN replace_with_na_all where any variable (~.x) is equal to -99.

Extending this a bit further, we can replace values “N/A”, “missing”, or “na” with NA across all variables in chaos with the following:

chaos %>% 
  replace_with_na_all(condition = ~.x %in% c("N/A", "missing", "na"))
# A tibble: 9 × 3
  score grade place
  <int> <chr> <chr>
1     3  <NA> -99  
2   -99 "E"   97   
3     4  <NA> 95   
4   -99  <NA> 92   
5     7 "n/a" -98  
6    10 " "   <NA> 
7    12 "."   88   
8    16  <NA> .    
9     9 "N/a" 86   

We can read the code above as:

Start with chaos data, THEN replace_with_na_all across all variables where the existing value is “N/A”, “missing”, or “na”

Example: replace_with_na_at

To select specific columns to apply replace_with_na to slected variables by name, use the scoped variant replace_with_na_at. For example, to only replace values with NA in the place column of chaos, we can use:

chaos %>% 
  replace_with_na_at(
    .vars = "place", 
    condition = ~.x %in% c("missing", "na", ".")
    )
# A tibble: 9 × 3
  score grade     place
  <int> <chr>     <chr>
1     3 "N/A"     -99  
2   -99 "E"       97   
3     4 "missing" 95   
4   -99 "na"      92   
5     7 "n/a"     -98  
6    10 " "       <NA> 
7    12 "."       88   
8    16  <NA>     <NA> 
9     9 "N/a"     86   

We can see that those recorded missings have been replaced with NA only in the place variable.

Example: replace_with_na_if

The replace_with_na_if function allows us to replace values with NA in columns that satisfy a condition (e.g. if I only want to replace with NA in a character column).

For example, to replace values with NA in character columns in chaos, we can use the following code:

chaos %>% 
  replace_with_na_if(
    .predicate = is.character, 
    condition = ~.x %in% c("N/A", "N/a", "na", "n/a", ".", "", "missing")
    )
# A tibble: 9 × 3
  score grade place
  <int> <chr> <chr>
1     3  <NA> -99  
2   -99 "E"   97   
3     4  <NA> 95   
4   -99  <NA> 92   
5     7  <NA> -98  
6    10 " "   <NA> 
7    12  <NA> 88   
8    16  <NA> <NA> 
9     9  <NA> 86   

Note that all of those varied records of missingness in the two character columns (grade and place) have been replaced with NA.

It is worthwhile to think about which records were not replaced with NA in the example above. Perhaps these were incorrectly recorded, or indicate a missing value?

Overall, the scoped variants of replace_with_na provide more control over which values in the data are replaced by NA.

6.1.4 Alternatives to replace_with_na

The replace_with_na function, and scoped variants, provide a high degree of control over what you replace, and over which variables. However, they can sometimes be a bit slow for larger datasets. If you do not need that level of control, and would like to have a bit more speed, several options exist to replace values with NA.

6.1.4.1 dplyr::na_if

If you need to replace a single non-NA entry (e.g. “N/A”) throughout the dataset, you can use the dplyr::na_if function. Note that it only works to replace a single value, like “N/A”, and cannot handle vectors of multiple values (e.g. it breaks with c("N/A", "na", ".")).

The following code replaces all “missing” occurrences in chaos with NA:

chaos %>% 
  na_if("missing")
# A tibble: 9 × 3
  score grade place
  <int> <chr> <chr>
1     3 "N/A" -99  
2   -99 "E"   97   
3     4  <NA> 95   
4   -99 "na"  92   
5     7 "n/a" -98  
6    10 " "   <NA> 
7    12 "."   88   
8    16  <NA> .    
9     9 "N/a" 86   

You can also use na_if with across, but not to same flexibility as replace_with_na:

# in across
chaos %>%
  mutate(
    across(
      .cols = everything(),
      .fns = ~na_if(., -99)
        )
  )

chaos %>%
  mutate(
    across(
      .cols = "place",
      # note that you cannot specify multiple values to replace in `place`
      .fns = ~na_if(., c("missing")
        )
  )

chaos %>%
  mutate(
    across(
      .cols = where(is.character),
      # note again that you cannot specify multiple values to replace with NA
      .fns = ~na_if(., "N/A")
        )
  )

6.1.4.2 Argument na = in readr

Similarly, you can replace a specified non-NA throughout the dataset when reading the dataset into R using the optional na = argument in readr.

For example, if we were reading in a theoretical .csv file in our current folder called ‘hiking’ that contains missing values recorded as “no data” throughout, we could read it in and replace all “no data” with NA as follows:

df <- read_csv("hiking.csv", na = "no data")

One workflow here might be to use the tools in naniar, miss_scan_count(search = list("N/A")) and perhaps replace_with_na to understand and check your missing value replacements, then put all of the values that are missing you have found and confirmed into the na argument of read_csv.

An aside: dplyr and across

When naniar was written, dplyr’s scoped variants were a very new feature, but since writing, this feature has become superceded by the new across feature. We are still working through some bugs in naniar to try and make across work with replace_with_na. Although it is possible to use na_if to some extent with across, because it only accepts single values, it does not really work in the same was as replace_with_na. The idea with replace_with_na and across would be for it to look something like the following instead:

replace_with_na_all(
  data = chaos, 
  condition = ~.x == -99
  )

# in across
chaos %>%
  mutate(
    across(
      .cols = everything(),
      .fns = replace_with_na,
      condition = ~.x == -99
        )
  )

replace_with_na_at(
  chaos
  .vars = "place", 
  condition = ~.x %in% c("missing", "na", ".")
  )

chaos %>%
  mutate(
    across(
      .cols = "place",
      .fns = replace_with_na,
      condition = ~.x %in% c("missing", "na", ".")
        )
  )

replace_with_na_if(
  data = chaos,
  .predicate = is.character, 
  condition = ~.x %in% c("N/A", "N/a", "na", "n/a", ".", "", "missing")
  )

chaos %>%
  mutate(
    across(
      .cols = where(is.character),
      .fns = replace_with_na,
      condition = ~.x %in% c("N/A", "N/a", "na", "n/a", ".", "", "missing")
        )
  )