This book contains both practical guides on exploring missing data, as well as some of the deeper details of how `naniar` works to help you better explore your missing data. A large component of this book are the exercises that accompany each section in each chapter.

``````library(naniar)
library(dplyr)``````

## 6.1 Find and replace missing values

In previous sections, we learned how to count, summarise and visualise missing values stored as `NA`. Often, however, raw data contain missing values that have been recorded as something other than `NA`. These include things like characters (e.g. “missing”, “N/A”, or “no data”) or impossible values (e.g. “-9999” for a dolphin length variable).

Always take care when working with data, especially if you did not collect or record it yourself:

Never assume that all missing values are stored as `NA`

The problem is most functions assessing missingness only recognize `NA`, so they will not recognize other missing value inputs, such as “NA”, or “missing”. That means the first thing we often need to do is search for missing values stored as something other than `NA` in our data, then replace those non-`NA` missing values with `NA` so our assessments of missingness, and subsequent analyses, are accurate.

In this section, we introduce tools and strategies to:

• Search for missing values stored as something other than `NA`
• Replace them with `NA`

We introduce the following functions to help us:

• `miss_scan_count()`: search for missing values stored as something other than `NA` (e.g. “N/A”, “-999”, “.”, etc.)
• `replace_with_na()`: replace non-`NA` values with `NA`

We will use a dataset called `chaos`, shown below, which contains gnarly values like plain whitespace, ” “, full-stops (or periods)”.”, “N/A”, and “missing” - all of which, in this case, should be stored as `NA`.

``````chaos <- tibble::tibble(
score =   c(3L, -99L, 4L, -99L, 7L, 10L, 12L, 16L, 9L),
grade = c("N/A", "E", "missing", "na", "n/a", " ", ".", NA, "N/a"),
place = c(-99, 97, 95, 92, -98, "missing", 88, ".", 86)
)

knitr::kable(chaos)``````
3 N/A -99
-99 E 97
4 missing 95
-99 na 92
7 n/a -98
10 missing
12 . 88
16 NA .
9 N/a 86

An Aside: Talk to the people who collect or curate the data

If you have access to the people who collect or curate the data, talk to them! It is amazing how much they can tell you about the data that you might not have ever known. You will get the most out of the conversation if you’ve had a look at the data first, and noted any abnormalities. Asking questions like “what did you do with missing data? How are missing values encoded? How did you collect the data? Did you summarise the data before giving it to me? Is this the most raw form of the data? Are good questions to help you get started. Also, remember to be friendly to these people. I know from experience that it can be very frustrating to have data that is poor or low quality, where missing values are deleted, or the data is summarised to the point of no variation. However it is important to keep in mind that these people who collect or curate the data are often trying to help you by saving you time summarising. Be kind, and be curious. And ask for the data in the rawest form.

### 6.1.1 Search for missing values

Before we can start replacing unexpected missing values with `NA`, we should get a sense of how big this missing data problem is by searching for those strange missing values. The `miss_scan_count` function in `naniar` allows you to search for likely records of missing values stored as something other than NA. For example, if we want to check for missing values that are input as “N/A”, we can use:

``````chaos %>%
miss_scan_count(search = list("N/A"))``````
``````# A tibble: 3 × 2
Variable     n
<chr>    <int>
1 score        0
3 place        0``````

This returns a dataframe with two columns: “Variable” - the variables in the `chaos` data frame, and “n”, the number of times that string appears in each variable. Here, we see that “N/A” appears once in the `grade` variable, and never in the `score` or `place` variables.

The `miss_scan_count` function accepts multiple arguments in the search, so you can look for all the strangely recorded missing values you like! Here we see that when searching for capital “N/A” and “N/a”, there are two hits for the variable, `grade` (and still 0 for both `score` and `place`).

``````chaos %>%
miss_scan_count(search = list("N/A",
"N/a"))``````
``````# A tibble: 3 × 2
Variable     n
<chr>    <int>
1 score        0
3 place        0``````

The `naniar` package also contains two helpful datasets to explore missingness, `common_na_numbers`, and `common_na_strings`:

``common_na_numbers``
``[1]    -9   -99  -999 -9999  9999    66    77    88``
``common_na_strings``
`````` [1] "NA"     "N A"    "N/A"    "#N/A"   "NA "    " NA"    "N /A"   "N / A"
[9] " N / A" "N / A " "na"     "n a"    "n/a"    "na "    " na"    "n /a"
[17] "n / a"  " a / a" "n / a " "NULL"   "null"   ""       "\\?"    "\\*"
[25] "\\."   ``````

These can be put inside of `miss_scan_count` and we can see we’ve got even more matches!

``````chaos %>%
miss_scan_count(search = common_na_numbers)``````
``````# A tibble: 3 × 2
Variable     n
<chr>    <int>
1 score        2
3 place        3``````
``````chaos %>%
miss_scan_count(search = common_na_strings)``````
``````# A tibble: 3 × 2
Variable     n
<chr>    <int>
1 score        9
3 place        9``````

You can also look for both:

``````chaos %>%
miss_scan_count(search = c(common_na_numbers,
common_na_strings))``````
``````# A tibble: 3 × 2
Variable     n
<chr>    <int>
1 score        9
3 place        9``````

Note that you do still need to carefully explore the data and metadata to get an idea of how missing values were recorded so that you don’t miss an obscure missing record. Also consider that some o these values might have other meanings - finding a match of the numbers in `common_na_numbers` might not mean they all match missing values, since, for example, you could conceivably have values -99.

An Aside on `\\`

Note that in `common_na_strings`, there are some `\\` for values such as `.` and `*` and `?`. This is because under the hood, `miss_scan_count` uses a thing called “regular expressions” to search for characters in the data. Briefly, regular expressions allow you to find and extract parts of text from collections of text. For example, the regular expression “`*.csv\$`” means “find words that contain anything up until”.csv”, and “.csv” is also the last thing in the word. So these values, `*`, `.`, and `?` all have special meaning in regular expressions. We use `\\` to “escape” the regular expression. It’s our way of saying, “No really, just look for”*“, or”.”, or “?”. Regular expressions are a really powerful tool, but can take some (sometimes a lot) of time to get your head around. Two places that provide a nice way to test out regular expressions is https://regex101.com/ and https://regexr.com/.

### 6.1.2 Replacing missing values with `NA`

Once you’ve explored and searched for missing values stored as something other than `NA`, you can replace them with `NA` using the `replace_with_na()` function. For example, in the `chaos` dataset we can replace “N/A” and “N/a” entries that appear in the `grade` variable as follows:

``````chaos %>%
replace_with_na(replace = list(grade = c("N/A", "N/a")))``````
``````# A tibble: 9 × 3
<int> <chr>     <chr>
1     3  <NA>     -99
2   -99 "E"       97
3     4 "missing" 95
4   -99 "na"      92
5     7 "n/a"     -98
6    10 " "       missing
7    12 "."       88
8    16  <NA>     .
9     9  <NA>     86     ``````

The above code can be read as follows:

Start with the `chaos` data, then within the variable `grade` replace any existing values of “N/A” and “N/a” with `NA`.

We can see this has replaced some of the missing values, but note that it only replaces the exact specified strings (“N/A” and “N/a”) - even slight variations (“na” and “n/a”) still exist.

We can even use `common_na_strings` in `replace_with_na` - but be warned! This should only be done if you really, truly, 100% for sure know that all the values in `common_na_strings` should be missing values in your data. Do not apply this without careful thought! You have been warned!

``````chaos %>%
``````# A tibble: 9 × 3
<int> <chr>     <chr>
1     3  <NA>     -99
2   -99 "E"       97
3     4 "missing" 95
4   -99  <NA>     92
5     7  <NA>     -98
6    10 " "       missing
7    12 "."       88
8    16  <NA>     .
9     9 "N/a"     86     ``````

UP TO HERE

### 6.1.3 Useful variants of `replace_with_na`

The `replace_with_na` function can be repetitive if you need to use it across many variables, for many different values. Or, for more complex cases where you might only want to replace values less than -1, or only treat character columns. To account for these situations, `naniar` borrows from `dplyr`’s scoped variants and extends `replace_with_na` to create three useful functions:

• `replace_with_na_all()`: operates on all variables.
• `replace_with_na_at()`: operates on a subset of selected variables
• `replace_with_na_if()`: operates on a subset of variables that fulfil a condition (e.g. only on numeric variables)

Example: `replace_with_na_all`

The scoped variants of `replace_with_na` follow a specific syntax. You provide a condition argument, and pass it a special function that starts with the squiggly line, tilde, `~`, and when referring to a variable, you use `.x`. For example, if we want to replace all cases of -99 in a dataset, we use `replace_with_na_all`, and write:

``````chaos %>%
replace_with_na_all(condition = ~.x == -99)``````
``````# A tibble: 9 × 3
<int> <chr>     <chr>
1     3 "N/A"     <NA>
2    NA "E"       97
3     4 "missing" 95
4    NA "na"      92
5     7 "n/a"     -98
6    10 " "       missing
7    12 "."       88
8    16  <NA>     .
9     9 "N/a"     86     ``````

We can read the above code as:

start with `chaos`, THEN `replace_with_na_all` where any variable (~.x) is equal to -99.

Extending this a bit further, we can replace values “N/A”, “missing”, or “na” with `NA` across all variables in `chaos` with the following:

``````chaos %>%
replace_with_na_all(condition = ~.x %in% c("N/A", "missing", "na"))``````
``````# A tibble: 9 × 3
<int> <chr> <chr>
1     3  <NA> -99
2   -99 "E"   97
3     4  <NA> 95
4   -99  <NA> 92
5     7 "n/a" -98
6    10 " "   <NA>
7    12 "."   88
8    16  <NA> .
9     9 "N/a" 86   ``````

We can read the code above as:

Start with `chaos` data, THEN `replace_with_na_all` across all variables where the existing value is “N/A”, “missing”, or “na”

Example: `replace_with_na_at`

To select specific columns to apply `replace_with_na` to slected variables by name, use the scoped variant `replace_with_na_at`. For example, to only replace values with `NA` in the `place` column of `chaos`, we can use:

``````chaos %>%
replace_with_na_at(
.vars = "place",
condition = ~.x %in% c("missing", "na", ".")
)``````
``````# A tibble: 9 × 3
<int> <chr>     <chr>
1     3 "N/A"     -99
2   -99 "E"       97
3     4 "missing" 95
4   -99 "na"      92
5     7 "n/a"     -98
6    10 " "       <NA>
7    12 "."       88
8    16  <NA>     <NA>
9     9 "N/a"     86   ``````

We can see that those recorded missings have been replaced with `NA` only in the `place` variable.

Example: `replace_with_na_if`

The `replace_with_na_if` function allows us to replace values with `NA` in columns that satisfy a condition (e.g. if I only want to replace with `NA` in a character column).

For example, to replace values with `NA` in character columns in chaos, we can use the following code:

``````chaos %>%
replace_with_na_if(
.predicate = is.character,
condition = ~.x %in% c("N/A", "N/a", "na", "n/a", ".", "", "missing")
)``````
``````# A tibble: 9 × 3
<int> <chr> <chr>
1     3  <NA> -99
2   -99 "E"   97
3     4  <NA> 95
4   -99  <NA> 92
5     7  <NA> -98
6    10 " "   <NA>
7    12  <NA> 88
8    16  <NA> <NA>
9     9  <NA> 86   ``````

Note that all of those varied records of missingness in the two character columns (`grade` and `place`) have been replaced with `NA`.

It is worthwhile to think about which records were not replaced with `NA` in the example above. Perhaps these were incorrectly recorded, or indicate a missing value?

Overall, the scoped variants of `replace_with_na` provide more control over which values in the data are replaced by `NA`.

### 6.1.4 Alternatives to `replace_with_na`

The `replace_with_na` function, and scoped variants, provide a high degree of control over what you replace, and over which variables. However, they can sometimes be a bit slow for larger datasets. If you do not need that level of control, and would like to have a bit more speed, several options exist to replace values with `NA`.

#### 6.1.4.1`dplyr::na_if`

If you need to replace a single non-`NA` entry (e.g. “N/A”) throughout the dataset, you can use the `dplyr::na_if` function. Note that it only works to replace a single value, like “N/A”, and cannot handle vectors of multiple values (e.g. it breaks with `c("N/A", "na", ".")`).

The following code replaces all “missing” occurrences in `chaos` with `NA`:

``````chaos %>%
na_if("missing")``````
``````# A tibble: 9 × 3
<int> <chr> <chr>
1     3 "N/A" -99
2   -99 "E"   97
3     4  <NA> 95
4   -99 "na"  92
5     7 "n/a" -98
6    10 " "   <NA>
7    12 "."   88
8    16  <NA> .
9     9 "N/a" 86   ``````

You can also use `na_if` with `across`, but not to same flexibility as `replace_with_na`:

``````# in across
chaos %>%
mutate(
across(
.cols = everything(),
.fns = ~na_if(., -99)
)
)

chaos %>%
mutate(
across(
.cols = "place",
# note that you cannot specify multiple values to replace in `place`
.fns = ~na_if(., c("missing")
)
)

chaos %>%
mutate(
across(
.cols = where(is.character),
# note again that you cannot specify multiple values to replace with NA
.fns = ~na_if(., "N/A")
)
)``````

#### 6.1.4.2 Argument `na =` in `readr`

Similarly, you can replace a specified non-`NA` throughout the dataset when reading the dataset into R using the optional `na =` argument in `readr`.

For example, if we were reading in a theoretical .csv file in our current folder called ‘hiking’ that contains missing values recorded as “no data” throughout, we could read it in and replace all “no data” with `NA` as follows:

``df <- read_csv("hiking.csv", na = "no data")``

One workflow here might be to use the tools in `naniar`, `miss_scan_count(search = list("N/A"))` and perhaps `replace_with_na` to understand and check your missing value replacements, then put all of the values that are missing you have found and confirmed into the `na` argument of `read_csv`.

An aside: `dplyr` and `across`

When `naniar` was written, `dplyr`’s scoped variants were a very new feature, but since writing, this feature has become superceded by the new `across` feature. We are still working through some bugs in `naniar` to try and make `across` work with `replace_with_na`. Although it is possible to use `na_if` to some extent with `across`, because it only accepts single values, it does not really work in the same was as `replace_with_na`. The idea with `replace_with_na` and `across` would be for it to look something like the following instead:

``````replace_with_na_all(
data = chaos,
condition = ~.x == -99
)

# in across
chaos %>%
mutate(
across(
.cols = everything(),
.fns = replace_with_na,
condition = ~.x == -99
)
)

replace_with_na_at(
chaos
.vars = "place",
condition = ~.x %in% c("missing", "na", ".")
)

chaos %>%
mutate(
across(
.cols = "place",
.fns = replace_with_na,
condition = ~.x %in% c("missing", "na", ".")
)
)

replace_with_na_if(
data = chaos,
.predicate = is.character,
condition = ~.x %in% c("N/A", "N/a", "na", "n/a", ".", "", "missing")
)

chaos %>%
mutate(
across(
.cols = where(is.character),
.fns = replace_with_na,
condition = ~.x %in% c("N/A", "N/a", "na", "n/a", ".", "", "missing")
)
)``````