10  Visualizing missingness across two variables

This book contains both practical guides on exploring missing data and some of the deeper details of how naniar works to help you better explore your missing data. A large component of this book is the exercises that accompany each section in each chapter.

We have previously discussed the use of nabular data, a way to represent missing data alongside the data itself. This data structure underpins how naniar performs data visualisation and summaries. This chapter discusses how to use the nabular data structure with data visualisation to further explore why data could be missing, looking across two variables.

If you want to explore two variables in a dataset, a scatterplot is a natural graphic to show. Let’s explore ozone and solar radiation like so:

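library(ggplot2)  # plotting
library(dplyr)    # %>%, as_tibble(), summarise()
library(naniar)   # geom_miss_point(), impute_below_at(), nabular()

ggplot(airquality,
       aes(x = Ozone,
           y = Solar.R)) + 
  geom_point()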
Warning: Removed 42 rows containing missing values (geom_point).

However, note the warning message:

Warning message:
Removed 42 rows containing 
missing values (geom_point). 

What does this mean, and why would ggplot2 do this? A point cannot be drawn when either of its coordinates is missing, so ggplot2 drops those rows. It is actually really nice that ggplot2 provides this warning, since missing values are often removed in modelling and other graphics without you being made aware of it.
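
To see where the number in the warning comes from, a minimal sketch in base R is to count the rows that are missing Ozone, Solar.R, or both. The object names here (`xy`, `n_dropped`, and so on) are illustrative only.

```r
# Keep only the two variables used in the scatterplot
xy <- airquality[, c("Ozone", "Solar.R")]

# Count missing values in each variable, and rows missing both
n_missing_ozone <- sum(is.na(xy$Ozone))
n_missing_solar <- sum(is.na(xy$Solar.R))
n_missing_both  <- sum(is.na(xy$Ozone) & is.na(xy$Solar.R))

# A row is dropped from the plot if either variable is missing
n_dropped <- sum(!complete.cases(xy))
n_dropped  # 42, the number reported in the warning
```

Rows missing both variables are counted once, so the number dropped is the two individual counts added together, minus the overlap.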

So, how do you visualise those missing values? How does visualising missingness make sense? This is the focus of this chapter.

10.0.1 The problem of visualizing missing data in two dimensions

ggplot(airquality,
       aes(x = Ozone,
           y = Solar.R)) + 
  geom_point()
Warning: Removed 42 rows containing missing values (geom_point).

The problem with drawing a scatterplot when the data has missing values is that any observation (an entire row) with a missing value in either variable is removed. ggplot2 is actually very nice here and gives a warning that missing values are being dropped. The same cannot be said of all other functions in R!
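
As one example of a function that drops rows quietly, here is a small sketch using `lm()` from base R. It assumes the default `na.action` option (`na.omit`), and the model is only there to show the row counts, not as a recommended analysis.

```r
# Fit a simple linear model; no warning is printed about missing values
fit <- lm(Ozone ~ Solar.R, data = airquality)

n_rows <- nrow(airquality)  # rows in the data
n_used <- nobs(fit)         # rows actually used to fit the model

# Rows silently removed because Ozone or Solar.R was missing
n_rows - n_used
```

The difference matches the number of rows ggplot2 warned about, but here you only find out if you go looking for it.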

10.0.2 Introduction to geom_miss_point()

gg_miss_point <- ggplot(airquality,
                        aes(x = Ozone,
                            y = Solar.R)) + 
  geom_miss_point()

To explore the missings in a scatter plot, we can use geom_miss_point(). geom_miss_point() visualises the missing values by placing them in the margins.

airquality_rect <- airquality %>% 
  as_tibble() %>% 
  impute_below_at(.vars = c("Ozone", "Solar.R")) %>% 
  summarise(xmin = min(Ozone) + min(Ozone)*0.1,
            xmax = 0,
            ymin = 0,
            ymax = max(Solar.R) + 10)

gg_miss_point +
  geom_rect(data = airquality_rect, 
            inherit.aes=FALSE,
            aes(xmin=xmin, xmax=xmax,ymin=ymin,ymax=ymax),
            alpha = 0.4,
            fill = "orange")

On the left, in the region highlighted in orange, the red points show the values of solar radiation when ozone is missing. These values are spread reasonably uniformly across the range of solar radiation.

airquality_rect <- airquality %>% 
  as_tibble() %>% 
  impute_below_at(.vars = c("Ozone", "Solar.R")) %>% 
  summarise(xmin = 0,
            xmax = max(Ozone),
            ymin = min(Solar.R) - 10,
            ymax = 0)

gg_miss_point +
  geom_rect(data = airquality_rect, 
            inherit.aes=FALSE,
            aes(xmin=xmin, xmax=xmax,ymin=ymin,ymax=ymax),
            alpha = 0.4,
            fill = "orange")

The values of ozone when Solar.R is missing are shown in red along the bottom. This shows us that the missing values tend to occur at lower values of ozone.

airquality_rect <- airquality %>% 
  as_tibble() %>% 
  impute_below_at(.vars = c("Ozone", "Solar.R")) %>% 
  summarise(xmin = min(Ozone) - 10,
            xmax = 0,
            ymin = min(Solar.R) - 10,
            ymax = 0)

gg_miss_point +
  geom_rect(data = airquality_rect, 
            inherit.aes=FALSE,
            aes(xmin=xmin, xmax=xmax,ymin=ymin,ymax=ymax),
            alpha = 0.4,
            fill = "orange")

In the bottom left we show cases where there are missings in both ozone and solar radiation. To explain how and why this visualisation works, we are going to take a brief moment to unpack the data transformation that occurs here.

10.0.2.1 Aside: How geom_miss_point() works

geom_miss_point() performs a transformation on the data and actually imputes (fills in, replaces) the values that are missing. Under the hood, the data is represented like so, for the ozone data:

Ozone  Ozone_shift  Ozone_NA
   41     41.00000  !NA
   36     36.00000  !NA
   12     12.00000  !NA
   18     18.00000  !NA
   NA    -19.72321  NA
   28     28.00000  !NA

Notice that we have our nabular data here, with Ozone and Ozone_NA. We also have a new column, Ozone_shift, which contains the imputed data. The missing values are imputed below the minimum value of ozone, shifted down by about 10% of the range of ozone, with a little random jitter so that the points do not sit on top of each other. To keep track of which values were imputed, we can use the Ozone_NA column! We’ll come back to this idea of tracking missing values in the next chapter.
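
The transformation can be sketched in base R. This is an illustration under stated assumptions, not naniar's own code: `shift_below()` is a hypothetical helper, and the defaults (a shift of 10% of the observed range, plus jitter of 5% of the range) are assumptions chosen to mimic the behaviour described above.

```r
# Hypothetical helper (not part of naniar): replace missing values with
# values placed below the observed minimum, so they can be drawn in the margin.
shift_below <- function(x, prop_below = 0.1, jitter = 0.05) {
  x_range  <- diff(range(x, na.rm = TRUE))
  # Centre of the shifted values: a fraction of the range below the minimum
  x_shift  <- min(x, na.rm = TRUE) - x_range * prop_below
  # Small random offsets so imputed points do not overlap exactly
  x_jitter <- (runif(length(x)) - 0.5) * x_range * jitter
  ifelse(is.na(x), x_shift + x_jitter, x)
}

set.seed(2024)  # jitter is random, so fix the seed for reproducibility
ozone <- airquality$Ozone

ozone_demo <- data.frame(
  Ozone       = ozone,
  Ozone_shift = shift_below(ozone),
  # Shadow column recording which values were originally missing
  Ozone_NA    = factor(ifelse(is.na(ozone), "NA", "!NA"),
                       levels = c("!NA", "NA"))
)

head(ozone_demo)
```

Observed values pass through unchanged, while every missing value lands below the smallest observed ozone value. The exact imputed numbers depend on the seed, so they will not match the table above.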

10.0.3 Exploring missingness using facets

Because geom_miss_point() is a defined ggplot2 geometry, it behaves like any other geom. This means you can use ggplot2 features like facets to further explore your missing data. For example, you can facet by Month to explore how the missingness changes across months:

ggplot(airquality,
       aes(x = Wind,
           y = Ozone)) + 
  geom_miss_point() + 
  facet_wrap(~Month)
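
A numeric companion to the faceted plot can help when reading it. As a minimal base R sketch (the object name `ozone_missing_by_month` is illustrative), you can count the missing ozone values that appear in each month's panel:

```r
# Number of missing Ozone values in each month (one count per facet)
ozone_missing_by_month <- tapply(is.na(airquality$Ozone),
                                 airquality$Month,
                                 sum)
ozone_missing_by_month
```

Each count corresponds to the number of points in the left-hand margin of that month's panel.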

You can even use nabular data from the previous lesson to explore missingness in one variable according to whether another variable is missing. For example, you can explore how the missingness in ozone changes when solar radiation is missing:

airquality %>%
  nabular() %>%
  ggplot(aes(x = Wind,
             y = Ozone)) + 
    geom_miss_point() + 
    facet_wrap(~Solar.R_NA)
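
To check how many observations fall into each facet, a small base R sketch is to cross-tabulate the missingness of the two variables. The `solar_na` factor below is built by hand to mirror the "!NA" / "NA" labels of the Solar.R_NA shadow column; it is an illustration rather than a call to naniar.

```r
# Hand-built stand-in for the Solar.R_NA shadow column
solar_na <- factor(ifelse(is.na(airquality$Solar.R), "NA", "!NA"),
                   levels = c("!NA", "NA"))

# Is Ozone missing in each row?
ozone_missing <- is.na(airquality$Ozone)

# Rows: facet (Solar.R observed or missing); columns: Ozone observed or missing
facet_counts <- table(solar_na, ozone_missing)
facet_counts
```

The "NA" row contains only a handful of observations, so any pattern seen in that facet should be interpreted with caution.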