This book contains both practical guides to exploring missing data and some of the deeper details of how `naniar` works, to help you better explore your missing data. A large component of this book is the set of exercises that accompany each section in each chapter.

In this section, we are going to focus on two areas:

1. Using imputations to understand data structure
2. Visualising and exploring imputed values

The goal is to develop skills in imputing data, tracking missing values, and visualising imputed values against the observed data.

Some of these techniques might look familiar. This is one of the benefits of using `naniar`: the methods for exploring missing values are similar to those for exploring imputations.

## 12.1 Performing and tracking imputation

``````
library(naniar)
library(tidyverse)
``````
One of the goals in exploring missing data is to understand any underlying biases and make the data suitable for analysis. Once we understand our data and the relationships amongst the variables and the missingness, it is a good idea to perform imputation, so that you can conduct analysis with a full dataset.

## 12.2 Using imputations to understand data structure

Previous chapters used `geom_miss_point()` to explore missing values. This “shifted” the missing values below the range of the data so we could see them.

``````
ggplot(airquality,
       aes(x = Ozone,
           y = Solar.R)) +
  geom_miss_point()
``````

This shifting was actually “imputing” the data! Remember, to “impute” means to fill in a missing value. We are going to recreate these visualisations using `impute_below()` from `naniar`, which imputes values below the range of the data. For example, take this vector of the numbers 5 to 10 with one missing value:

``````
vec <- c(5, 6, 7, NA, 9, 10)
impute_below(vec)
``````
``[1]  5.00000  6.00000  7.00000  4.40271  9.00000 10.00000``

It replaces the missing value with about 4.4, which is lower than the smallest observed value in the data, 5.
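To make the idea of "imputing below the range" concrete, here is a minimal sketch in base R. The helper `shift_below()` and its arguments `prop_below` and `jitter` are hypothetical names chosen for illustration; this is one plausible way to get this behaviour, not necessarily how `impute_below()` is implemented.

```r
# Hypothetical helper for illustration only (not naniar's implementation).
# Missing values are placed a fraction of the data range below the minimum,
# with a little random noise so several imputed points do not overlap exactly.
shift_below <- function(x, prop_below = 0.1, jitter = 0.05) {
  x_range <- diff(range(x, na.rm = TRUE))
  x_min <- min(x, na.rm = TRUE)
  # Centre of the band where imputed values are placed
  shifted <- x_min - x_range * prop_below
  # Noise in [-0.5, 0.5] scaled by the range and the jitter proportion
  noise <- (runif(length(x)) - 0.5) * x_range * jitter
  ifelse(is.na(x), shifted + noise, x)
}

vec <- c(5, 6, 7, NA, 9, 10)
set.seed(2024)
out <- shift_below(vec)
out
```

The observed values are returned unchanged, and the missing value is replaced by a number slightly below 5, so it sits visibly apart from the rest of the data in a plot.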

### 12.2.1 `impute_below()`

We can use `impute_below()` in combination with `mutate()` to impute specific variables.

For example:

``````
airquality %>%
  mutate(Ozone = impute_below(Ozone))
``````
However, sometimes you want to do this across many variables. Repeating the same call for every variable in the dataset is at best repetitive, and at worst leads to unintended mistakes. We can avoid this by using `across()`.

If we want to impute all variables, we can use `across()` like so:

``````
airquality %>%
  mutate(across(everything(), impute_below))
``````
Here we use the `everything()` helper function from `dplyr` to select all variables. We can use any type of selection from `dplyr`'s tidy select.

We can also impute only those variables that satisfy a condition by using `where()`. For example, to impute only the numeric columns, use `where(is.numeric)`:

``````
airquality %>%
  mutate(across(where(is.numeric), impute_below))
``````
We can choose specific variables like so:

``````
airquality %>%
  mutate(across(c(Ozone, Solar.R), impute_below))
``````
We can take advantage of selection helpers from `dplyr`'s tidy select:

``````
airquality %>%
  mutate(across(c(Ozone, Solar.R, starts_with("T")), impute_below))
``````
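A quick way to confirm that a selection imputed the variables you intended is to count the remaining `NA` values per column afterwards. The sketch below does this for the `Ozone` and `Solar.R` selection; the object names `aq_partial` and `n_missing` are only illustrative.

```r
library(naniar)
library(dplyr)

# Impute only the two variables that are selected by name
aq_partial <- airquality %>%
  mutate(across(c(Ozone, Solar.R), impute_below))

# Count the NA values left in every column after imputation
n_missing <- aq_partial %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

n_missing
```

Any column that still shows a non-zero count was not covered by the selection passed to `across()`.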
## 12.3 Tracking missing values

We need to track the missing values once we impute them; otherwise we don’t know which values were imputed and which were observed. In this example, once we impute the data, we have no way of recognising which value was imputed.

``````
df <- tibble(var1 = c(5, 6, 7, NA, 9, 10))
df
``````

``````
df %>%
  mutate(across(everything(), impute_below))
``````
We can identify missings by using `nabular()` to turn the data into `nabular` form, which adds a shadow variable, `var1_NA`, recording whether each value of `var1` was missing.

``nabular(df)``
Now when we impute the data, the shadow variable `var1_NA` shows which value was imputed: the row where `var1_NA` is `NA` holds the imputed value, about 4.4.

``````
df %>%
  nabular() %>%
  mutate(across(everything(), impute_below))
``````
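Because the shadow variable survives imputation, it can also be used to pull out just the imputed rows. The following is a small sketch that assumes the shadow column marks originally missing values with the level `"NA"`; the names `df_imp` and `imputed_rows` are illustrative, and only the numeric column is imputed here so the shadow column is left untouched.

```r
library(naniar)
library(dplyr)

df <- tibble(var1 = c(5, 6, 7, NA, 9, 10))

# Add the shadow column first, then impute only the numeric data column
df_imp <- df %>%
  nabular() %>%
  mutate(across(where(is.numeric), impute_below))

# Keep only the rows whose value was originally missing
imputed_rows <- df_imp %>%
  filter(var1_NA == "NA")

imputed_rows
```

The same filter with `"!NA"` would return only the observed values.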
## 12.4 Visualise imputed values against data values using histograms

Using this imputed data, we can explore the number of missings in a single variable, along with its distribution, by drawing a histogram and colouring the missings with `fill = Ozone_NA`.

``````
aq_imp <- airquality %>%
  nabular() %>%
  mutate(across(everything(), impute_below))

ggplot(aq_imp,
       aes(x = Ozone,
           fill = Ozone_NA)) +
  geom_histogram()
``````
`` `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. ``

Here we see the imputed values as two bars to the left of the observed data, each with a count of around 20, so just under 40 values were missing.
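The count read off the histogram can be checked directly by tabulating the shadow variable. This is a minimal sketch: it rebuilds the imputed data so that it runs on its own, imputing only the numeric columns, and the names `ozone_counts` and `n_imputed` are illustrative.

```r
library(naniar)
library(dplyr)

aq_imp <- airquality %>%
  nabular() %>%
  mutate(across(where(is.numeric), impute_below))

# Tabulate how many ozone values were observed ("!NA") or imputed ("NA")
ozone_counts <- aq_imp %>%
  count(Ozone_NA)

# The same number as a single value
n_imputed <- sum(aq_imp$Ozone_NA == "NA")

ozone_counts
```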

## 12.5 Visualise imputed values against data values using facets

We can take this same plot and split it into facets. For example, faceting by month shows that most missing values occur in month 6, which also didn’t have many high values of ozone.

``````
ggplot(aq_imp,
       aes(x = Ozone,
           fill = Ozone_NA)) +
  geom_histogram() +
  facet_wrap(~Month)
``````
`` `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. ``
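The monthly pattern seen in the facets can also be summarised numerically. This sketch counts the missing ozone readings per month in the original `airquality` data; the name `ozone_missing_by_month` is illustrative.

```r
library(dplyr)

# Count missing ozone readings in each month of the original data
ozone_missing_by_month <- airquality %>%
  group_by(Month) %>%
  summarise(n_missing = sum(is.na(Ozone)),
            n_total = n())

ozone_missing_by_month
```

Comparing `n_missing` with `n_total` shows how large a share of each month's readings was missing, which the facet heights alone do not make obvious.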

## 12.6 Visualise imputed values using facets

We can split the plot according to the missingness of solar radiation by faceting on its shadow variable, `Solar.R_NA`:

``````
ggplot(aq_imp,
       aes(x = Ozone,
           fill = Ozone_NA)) +
  geom_histogram() +
  facet_wrap(~Solar.R_NA)
``````
`` `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. ``

This shows us that there aren’t many missing values in ozone when solar radiation is missing.
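The overlap between the two kinds of missingness can be confirmed by cross-tabulating the shadow variables. This is a small sketch that works on the original data in `nabular` form; the names `shadow_counts` and `both_missing` are illustrative.

```r
library(naniar)
library(dplyr)

# Cross-tabulate the shadow variables for solar radiation and ozone
shadow_counts <- airquality %>%
  nabular() %>%
  count(Solar.R_NA, Ozone_NA)

# Rows where both ozone and solar radiation are missing
both_missing <- sum(is.na(airquality$Ozone) & is.na(airquality$Solar.R))

shadow_counts
```

Each row of `shadow_counts` corresponds to one combination of the two shadow variables, so the cell where both are `NA` matches the small bar in the right-hand facet.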

## 12.7 Visualise imputed values against data values using scatterplots

Previously we could identify imputed values by referring to the shadow variable, e.g., `Ozone_NA`. However, if you want to colour the points in a plot of two variables, you just need to know whether either of them was imputed. We can add a column of labels that identifies whether any value in a row was missing. Because the data have already been imputed, this has to be based on the shadow variables: the function `add_label_shadow()` does this for us, adding a column, `any_missing`.

``````
aq_imp <- airquality %>%
  nabular() %>%
  mutate(across(everything(), impute_below)) %>%
  # label each row using the shadow variables
  add_label_shadow()

aq_imp
``````
We can now recreate the same figure as `geom_miss_point()`!

``````
ggplot(aq_imp,
       aes(x = Ozone,
           y = Solar.R,
           colour = any_missing)) +
  geom_point()
``````