library(naniar)
library(dplyr)
6 Cleaning missing data
This book contains both practical guides on exploring missing data, as well as some of the deeper details of how naniar
works to help you better explore your missing data. A large component of this book are the exercises that accompany each section in each chapter.
6.1 Find and replace missing values
In previous sections, we learned how to count, summarise and visualise missing values stored as NA
. Often, however, raw data contain missing values that have been recorded as something other than NA
. These include things like characters (e.g. “missing”, “N/A”, or “no data”) or impossible values (e.g. “-9999” for a dolphin length variable).
Always take care when working with data, especially if you did not collect or record it yourself:
Never assume that all missing values are stored as
NA
The problem is most functions assessing missingness only recognize NA
, so they will not recognize other missing value inputs, such as “NA”, or “missing”. That means the first thing we often need to do is search for missing values stored as something other than NA
in our data, then replace those non-NA
missing values with NA
so our assessments of missingness, and subsequent analyses, are accurate.
In this section, we introduce tools and strategies to:
- Search for missing values stored as something other than
NA
- Replace them with
NA
We introduce the following functions to help us:
miss_scan_count()
: search for missing values stored as something other thanNA
(e.g. “N/A”, “-999”, “.”, etc.)replace_with_na()
: replace non-NA
values withNA
We will use a dataset called chaos
, shown below, which contains gnarly values like plain whitespace, ” “, full-stops (or periods)”.”, “N/A”, and “missing” - all of which, in this case, should be stored as NA
.
<- tibble::tibble(
chaos score = c(3L, -99L, 4L, -99L, 7L, 10L, 12L, 16L, 9L),
grade = c("N/A", "E", "missing", "na", "n/a", " ", ".", NA, "N/a"),
place = c(-99, 97, 95, 92, -98, "missing", 88, ".", 86)
)
::kable(chaos) knitr
score | grade | place |
---|---|---|
3 | N/A | -99 |
-99 | E | 97 |
4 | missing | 95 |
-99 | na | 92 |
7 | n/a | -98 |
10 | missing | |
12 | . | 88 |
16 | NA | . |
9 | N/a | 86 |
An Aside: Talk to the people who collect or curate the data
If you have access to the people who collect or curate the data, talk to them! It is amazing how much they can tell you about the data that you might not have ever known. You will get the most out of the conversation if you’ve had a look at the data first, and noted any abnormalities. Asking questions like “what did you do with missing data? How are missing values encoded? How did you collect the data? Did you summarise the data before giving it to me? Is this the most raw form of the data? Are good questions to help you get started. Also, remember to be friendly to these people. I know from experience that it can be very frustrating to have data that is poor or low quality, where missing values are deleted, or the data is summarised to the point of no variation. However it is important to keep in mind that these people who collect or curate the data are often trying to help you by saving you time summarising. Be kind, and be curious. And ask for the data in the rawest form.
6.1.1 Search for missing values
Before we can start replacing unexpected missing values with NA
, we should get a sense of how big this missing data problem is by searching for those strange missing values. The miss_scan_count
function in naniar
allows you to search for likely records of missing values stored as something other than NA. For example, if we want to check for missing values that are input as “N/A”, we can use:
%>%
chaos miss_scan_count(search = list("N/A"))
# A tibble: 3 × 2
Variable n
<chr> <int>
1 score 0
2 grade 1
3 place 0
This returns a dataframe with two columns: “Variable” - the variables in the chaos
data frame, and “n”, the number of times that string appears in each variable. Here, we see that “N/A” appears once in the grade
variable, and never in the score
or place
variables.
The miss_scan_count
function accepts multiple arguments in the search, so you can look for all the strangely recorded missing values you like! Here we see that when searching for capital “N/A” and “N/a”, there are two hits for the variable, grade
(and still 0 for both score
and place
).
%>%
chaos miss_scan_count(search = list("N/A",
"N/a"))
# A tibble: 3 × 2
Variable n
<chr> <int>
1 score 0
2 grade 2
3 place 0
The naniar
package also contains two helpful datasets to explore missingness, common_na_numbers
, and common_na_strings
:
common_na_numbers
[1] -9 -99 -999 -9999 9999 66 77 88
common_na_strings
[1] "NA" "N A" "N/A" "#N/A" "NA " " NA" "N /A" "N / A"
[9] " N / A" "N / A " "na" "n a" "n/a" "na " " na" "n /a"
[17] "n / a" " a / a" "n / a " "NULL" "null" "" "\\?" "\\*"
[25] "\\."
These can be put inside of miss_scan_count
and we can see we’ve got even more matches!
%>%
chaos miss_scan_count(search = common_na_numbers)
# A tibble: 3 × 2
Variable n
<chr> <int>
1 score 2
2 grade 0
3 place 3
%>%
chaos miss_scan_count(search = common_na_strings)
# A tibble: 3 × 2
Variable n
<chr> <int>
1 score 9
2 grade 8
3 place 9
You can also look for both:
%>%
chaos miss_scan_count(search = c(common_na_numbers,
common_na_strings))
# A tibble: 3 × 2
Variable n
<chr> <int>
1 score 9
2 grade 8
3 place 9
Note that you do still need to carefully explore the data and metadata to get an idea of how missing values were recorded so that you don’t miss an obscure missing record. Also consider that some o these values might have other meanings - finding a match of the numbers in common_na_numbers
might not mean they all match missing values, since, for example, you could conceivably have values -99.
An Aside on \\
Note that in
common_na_strings
, there are some\\
for values such as.
and*
and?
. This is because under the hood,miss_scan_count
uses a thing called “regular expressions” to search for characters in the data. Briefly, regular expressions allow you to find and extract parts of text from collections of text. For example, the regular expression “*.csv$
” means “find words that contain anything up until”.csv”, and “.csv” is also the last thing in the word. So these values,*
,.
, and?
all have special meaning in regular expressions. We use\\
to “escape” the regular expression. It’s our way of saying, “No really, just look for”*“, or”.”, or “?”. Regular expressions are a really powerful tool, but can take some (sometimes a lot) of time to get your head around. Two places that provide a nice way to test out regular expressions is https://regex101.com/ and https://regexr.com/.
6.1.2 Replacing missing values with NA
Once you’ve explored and searched for missing values stored as something other than NA
, you can replace them with NA
using the replace_with_na()
function. For example, in the chaos
dataset we can replace “N/A” and “N/a” entries that appear in the grade
variable as follows:
%>%
chaos replace_with_na(replace = list(grade = c("N/A", "N/a")))
# A tibble: 9 × 3
score grade place
<int> <chr> <chr>
1 3 <NA> -99
2 -99 "E" 97
3 4 "missing" 95
4 -99 "na" 92
5 7 "n/a" -98
6 10 " " missing
7 12 "." 88
8 16 <NA> .
9 9 <NA> 86
The above code can be read as follows:
Start with the
chaos
data, then within the variablegrade
replace any existing values of “N/A” and “N/a” withNA
.
We can see this has replaced some of the missing values, but note that it only replaces the exact specified strings (“N/A” and “N/a”) - even slight variations (“na” and “n/a”) still exist.
We can even use common_na_strings
in replace_with_na
- but be warned! This should only be done if you really, truly, 100% for sure know that all the values in common_na_strings
should be missing values in your data. Do not apply this without careful thought! You have been warned!
%>%
chaos replace_with_na(replace = list(grade = common_na_strings))
# A tibble: 9 × 3
score grade place
<int> <chr> <chr>
1 3 <NA> -99
2 -99 "E" 97
3 4 "missing" 95
4 -99 <NA> 92
5 7 <NA> -98
6 10 " " missing
7 12 "." 88
8 16 <NA> .
9 9 "N/a" 86
UP TO HERE
6.1.3 Useful variants of replace_with_na
The replace_with_na
function can be repetitive if you need to use it across many variables, for many different values. Or, for more complex cases where you might only want to replace values less than -1, or only treat character columns. To account for these situations, naniar
borrows from dplyr
’s scoped variants and extends replace_with_na
to create three useful functions:
replace_with_na_all()
: operates on all variables.replace_with_na_at()
: operates on a subset of selected variablesreplace_with_na_if()
: operates on a subset of variables that fulfil a condition (e.g. only on numeric variables)
Example: replace_with_na_all
The scoped variants of replace_with_na
follow a specific syntax. You provide a condition argument, and pass it a special function that starts with the squiggly line, tilde, ~
, and when referring to a variable, you use .x
. For example, if we want to replace all cases of -99 in a dataset, we use replace_with_na_all
, and write:
%>%
chaos replace_with_na_all(condition = ~.x == -99)
# A tibble: 9 × 3
score grade place
<int> <chr> <chr>
1 3 "N/A" <NA>
2 NA "E" 97
3 4 "missing" 95
4 NA "na" 92
5 7 "n/a" -98
6 10 " " missing
7 12 "." 88
8 16 <NA> .
9 9 "N/a" 86
We can read the above code as:
start with
chaos
, THENreplace_with_na_all
where any variable (~.x) is equal to -99.
Extending this a bit further, we can replace values “N/A”, “missing”, or “na” with NA
across all variables in chaos
with the following:
%>%
chaos replace_with_na_all(condition = ~.x %in% c("N/A", "missing", "na"))
# A tibble: 9 × 3
score grade place
<int> <chr> <chr>
1 3 <NA> -99
2 -99 "E" 97
3 4 <NA> 95
4 -99 <NA> 92
5 7 "n/a" -98
6 10 " " <NA>
7 12 "." 88
8 16 <NA> .
9 9 "N/a" 86
We can read the code above as:
Start with
chaos
data, THENreplace_with_na_all
across all variables where the existing value is “N/A”, “missing”, or “na”
Example: replace_with_na_at
To select specific columns to apply replace_with_na
to slected variables by name, use the scoped variant replace_with_na_at
. For example, to only replace values with NA
in the place
column of chaos
, we can use:
%>%
chaos replace_with_na_at(
.vars = "place",
condition = ~.x %in% c("missing", "na", ".")
)
# A tibble: 9 × 3
score grade place
<int> <chr> <chr>
1 3 "N/A" -99
2 -99 "E" 97
3 4 "missing" 95
4 -99 "na" 92
5 7 "n/a" -98
6 10 " " <NA>
7 12 "." 88
8 16 <NA> <NA>
9 9 "N/a" 86
We can see that those recorded missings have been replaced with NA
only in the place
variable.
Example: replace_with_na_if
The replace_with_na_if
function allows us to replace values with NA
in columns that satisfy a condition (e.g. if I only want to replace with NA
in a character column).
For example, to replace values with NA
in character columns in chaos, we can use the following code:
%>%
chaos replace_with_na_if(
.predicate = is.character,
condition = ~.x %in% c("N/A", "N/a", "na", "n/a", ".", "", "missing")
)
# A tibble: 9 × 3
score grade place
<int> <chr> <chr>
1 3 <NA> -99
2 -99 "E" 97
3 4 <NA> 95
4 -99 <NA> 92
5 7 <NA> -98
6 10 " " <NA>
7 12 <NA> 88
8 16 <NA> <NA>
9 9 <NA> 86
Note that all of those varied records of missingness in the two character columns (grade
and place
) have been replaced with NA
.
It is worthwhile to think about which records were not replaced with NA
in the example above. Perhaps these were incorrectly recorded, or indicate a missing value?
Overall, the scoped variants of replace_with_na
provide more control over which values in the data are replaced by NA
.
6.1.4 Alternatives to replace_with_na
The replace_with_na
function, and scoped variants, provide a high degree of control over what you replace, and over which variables. However, they can sometimes be a bit slow for larger datasets. If you do not need that level of control, and would like to have a bit more speed, several options exist to replace values with NA
.
6.1.4.1 dplyr::na_if
If you need to replace a single non-NA
entry (e.g. “N/A”) throughout the dataset, you can use the dplyr::na_if
function. Note that it only works to replace a single value, like “N/A”, and cannot handle vectors of multiple values (e.g. it breaks with c("N/A", "na", ".")
).
The following code replaces all “missing” occurrences in chaos
with NA
:
%>%
chaos na_if("missing")
# A tibble: 9 × 3
score grade place
<int> <chr> <chr>
1 3 "N/A" -99
2 -99 "E" 97
3 4 <NA> 95
4 -99 "na" 92
5 7 "n/a" -98
6 10 " " <NA>
7 12 "." 88
8 16 <NA> .
9 9 "N/a" 86
You can also use na_if
with across
, but not to same flexibility as replace_with_na
:
# in across
%>%
chaos mutate(
across(
.cols = everything(),
.fns = ~na_if(., -99)
)
)
%>%
chaos mutate(
across(
.cols = "place",
# note that you cannot specify multiple values to replace in `place`
.fns = ~na_if(., c("missing")
)
)
%>%
chaos mutate(
across(
.cols = where(is.character),
# note again that you cannot specify multiple values to replace with NA
.fns = ~na_if(., "N/A")
) )
6.1.4.2 Argument na =
in readr
Similarly, you can replace a specified non-NA
throughout the dataset when reading the dataset into R using the optional na =
argument in readr
.
For example, if we were reading in a theoretical .csv file in our current folder called ‘hiking’ that contains missing values recorded as “no data” throughout, we could read it in and replace all “no data” with NA
as follows:
<- read_csv("hiking.csv", na = "no data") df
One workflow here might be to use the tools in naniar
, miss_scan_count(search = list("N/A"))
and perhaps replace_with_na
to understand and check your missing value replacements, then put all of the values that are missing you have found and confirmed into the na
argument of read_csv
.
An aside: dplyr
and across
When
naniar
was written,dplyr
’s scoped variants were a very new feature, but since writing, this feature has become superceded by the newacross
feature. We are still working through some bugs innaniar
to try and makeacross
work withreplace_with_na
. Although it is possible to usena_if
to some extent withacross
, because it only accepts single values, it does not really work in the same was asreplace_with_na
. The idea withreplace_with_na
andacross
would be for it to look something like the following instead:
replace_with_na_all(
data = chaos,
condition = ~.x == -99
)
# in across
%>%
chaos mutate(
across(
.cols = everything(),
.fns = replace_with_na,
condition = ~.x == -99
)
)
replace_with_na_at(
chaos.vars = "place",
condition = ~.x %in% c("missing", "na", ".")
)
%>%
chaos mutate(
across(
.cols = "place",
.fns = replace_with_na,
condition = ~.x %in% c("missing", "na", ".")
)
)
replace_with_na_if(
data = chaos,
.predicate = is.character,
condition = ~.x %in% c("N/A", "N/a", "na", "n/a", ".", "", "missing")
)
%>%
chaos mutate(
across(
.cols = where(is.character),
.fns = replace_with_na,
condition = ~.x %in% c("N/A", "N/a", "na", "n/a", ".", "", "missing")
) )