12  Single Imputation of missing data

This book contains practical guides on exploring missing data, as well as some of the deeper details of how naniar works to help you better explore your missing data. The exercises that accompany each section of each chapter are a large component of the book.

In this section, we are going to focus on two areas:

  1. Using imputations to understand data structure
  2. Visualising and exploring imputed values

The goal is to develop skills in imputing data, tracking missing values, and visualising imputed values against the observed data.

Some of these techniques might look familiar. This is one of the benefits of using naniar: the methods used for exploring missing values are similar to those used for exploring imputations.

12.1 Performing and tracking imputation

library(naniar)
library(tidyverse)

One of the goals in exploring missing data is to understand any underlying biases and make the data suitable for analysis. Once we understand our data, the relationships amongst the variables, and the missingness, it is a good idea to perform imputation, so that we can conduct the analysis with a complete dataset.
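As a quick reminder of what "understanding the missingness" can look like before imputing, here is a minimal sketch that summarises the missing values in each variable of airquality with naniar's miss_var_summary(). The object name aq_miss is only illustrative.

```r
library(naniar)

# Count and percentage of missing values for each variable in airquality
aq_miss <- miss_var_summary(airquality)
aq_miss
```

Only the variables that report a non-zero count here will be changed by the imputation steps that follow.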

12.2 Using imputations to understand data structure

Previous chapters used geom_miss_point() to explore missing values. This “shifted” the missing values below the range of the data so we could see them.

ggplot(airquality,
       aes(x = Ozone,
           y = Solar.R)) + 
  geom_miss_point()

This shifting was actually “imputing” the data! Remember, to “impute” means to fill in a missing value. We are going to recreate these visualisations using impute_below() from naniar, which imputes values below the range of the data. For example, take this vector of the numbers 5 to 10 with one missing value:

vec <- c(5,6,7,NA,9,10)
impute_below(vec)
[1]  5.00000  6.00000  7.00000  4.40271  9.00000 10.00000

It imputes the value 4.4 in place of the missing value, since this is lower than the smallest observed value in the data, namely 5.
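The two properties described above can be checked directly. This is a minimal sketch using the same vector; the name imputed is illustrative, and no exact imputed value is assumed beyond what the output above shows.

```r
library(naniar)

vec <- c(5, 6, 7, NA, 9, 10)
imputed <- impute_below(vec)

# The imputed entry sits below the smallest observed value
imputed[is.na(vec)] < min(vec, na.rm = TRUE)

# The observed entries are left as they were
all(imputed[!is.na(vec)] == vec[!is.na(vec)])
```

Both expressions should evaluate to TRUE: only the missing position is filled in, and it is filled with a value that cannot be confused with real data.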

12.2.1 impute_below()

We can use impute_below() in combination with mutate() to impute the missing values in a specific variable.

For example:

airquality %>% 
  mutate(Ozone = impute_below(Ozone))
        Ozone Solar.R Wind Temp Month Day
1    41.00000     190  7.4   67     5   1
2    36.00000     118  8.0   72     5   2
3    12.00000     149 12.6   74     5   3
4    18.00000     313 11.5   62     5   4
5   -19.72321      NA 14.3   56     5   5
6    28.00000      NA 14.9   66     5   6
7    23.00000     299  8.6   65     5   7
8    19.00000      99 13.8   59     5   8
9     8.00000      19 20.1   61     5   9
10  -18.51277     194  8.6   69     5  10
11    7.00000      NA  6.9   74     5  11
12   16.00000     256  9.7   69     5  12
13   11.00000     290  9.2   66     5  13
14   14.00000     274 10.9   68     5  14
15   18.00000      65 13.2   58     5  15
16   14.00000     334 11.5   64     5  16
17   34.00000     307 12.0   66     5  17
18    6.00000      78 18.4   57     5  18
19   30.00000     322 11.5   68     5  19
20   11.00000      44  9.7   62     5  20
21    1.00000       8  9.7   59     5  21
22   11.00000     320 16.6   73     5  22
23    4.00000      25  9.7   61     5  23
24   32.00000      92 12.0   61     5  24
25  -17.81863      66 16.6   57     5  25
26  -19.43853     266 14.9   58     5  26
27  -15.14310      NA  8.0   57     5  27
28   23.00000      13 12.0   67     5  28
29   45.00000     252 14.9   81     5  29
30  115.00000     223  5.7   79     5  30
31   37.00000     279  7.4   76     5  31
32  -16.17315     286  8.6   78     6   1
33  -14.65883     287  9.7   74     6   2
34  -17.85609     242 16.1   67     6   3
35  -13.29299     186  9.2   84     6   4
36  -16.16323     220  8.6   85     6   5
37  -19.60935     264 14.3   79     6   6
38   29.00000     127  9.7   82     6   7
39  -19.65780     273  6.9   87     6   8
40   71.00000     291 13.8   90     6   9
41   39.00000     323 11.5   87     6  10
42  -13.40961     259 10.9   93     6  11
43  -13.53728     250  9.2   92     6  12
44   23.00000     148  8.0   82     6  13
45  -19.65993     332 13.8   80     6  14
46  -16.48342     322 11.5   79     6  15
47   21.00000     191 14.9   77     6  16
48   37.00000     284 20.7   72     6  17
49   20.00000      37  9.2   65     6  18
50   12.00000     120 11.5   73     6  19
51   13.00000     137 10.3   76     6  20
52  -17.17718     150  6.3   77     6  21
53  -16.74073      59  1.7   76     6  22
54  -13.65786      91  4.6   76     6  23
55  -16.78786     250  6.3   76     6  24
56  -12.30098     135  8.0   75     6  25
57  -13.33171     127  8.0   78     6  26
58  -16.77414      47 10.3   73     6  27
59  -17.08225      98 11.5   80     6  28
60  -15.98818      31 14.9   77     6  29
61  -19.17558     138  8.0   83     6  30
62  135.00000     269  4.1   84     7   1
63   49.00000     248  9.2   85     7   2
64   32.00000     236  9.2   81     7   3
65  -14.27138     101 10.9   84     7   4
66   64.00000     175  4.6   83     7   5
67   40.00000     314 10.9   83     7   6
68   77.00000     276  5.1   88     7   7
69   97.00000     267  6.3   92     7   8
70   97.00000     272  5.7   92     7   9
71   85.00000     175  7.4   89     7  10
72  -13.51764     139  8.6   82     7  11
73   10.00000     264 14.3   73     7  12
74   27.00000     175 14.9   81     7  13
75  -13.48998     291 14.9   91     7  14
76    7.00000      48 14.3   80     7  15
77   48.00000     260  6.9   81     7  16
78   35.00000     274 10.3   82     7  17
79   61.00000     285  6.3   84     7  18
80   79.00000     187  5.1   87     7  19
81   63.00000     220 11.5   85     7  20
82   16.00000       7  6.9   74     7  21
83  -16.92150     258  9.7   81     7  22
84  -16.60335     295 11.5   82     7  23
85   80.00000     294  8.6   86     7  24
86  108.00000     223  8.0   85     7  25
87   20.00000      81  8.6   82     7  26
88   52.00000      82 12.0   86     7  27
89   82.00000     213  7.4   88     7  28
90   50.00000     275  7.4   86     7  29
91   64.00000     253  7.4   83     7  30
92   59.00000     254  9.2   81     7  31
93   39.00000      83  6.9   81     8   1
94    9.00000      24 13.8   81     8   2
95   16.00000      77  7.4   82     8   3
96   78.00000      NA  6.9   86     8   4
97   35.00000      NA  7.4   85     8   5
98   66.00000      NA  4.6   87     8   6
99  122.00000     255  4.0   89     8   7
100  89.00000     229 10.3   90     8   8
101 110.00000     207  8.0   90     8   9
102 -14.78907     222  8.6   92     8  10
103 -16.19151     137 11.5   86     8  11
104  44.00000     192 11.5   86     8  12
105  28.00000     273 11.5   82     8  13
106  65.00000     157  9.7   80     8  14
107 -19.73591      64 11.5   79     8  15
108  22.00000      71 10.3   77     8  16
109  59.00000      51  6.3   79     8  17
110  23.00000     115  7.4   76     8  18
111  31.00000     244 10.9   78     8  19
112  44.00000     190 10.3   78     8  20
113  21.00000     259 15.5   77     8  21
114   9.00000      36 14.3   72     8  22
115 -18.92235     255 12.6   75     8  23
116  45.00000     212  9.7   79     8  24
117 168.00000     238  3.4   81     8  25
118  73.00000     215  8.0   86     8  26
119 -14.86296     153  5.7   88     8  27
120  76.00000     203  9.7   97     8  28
121 118.00000     225  2.3   94     8  29
122  84.00000     237  6.3   96     8  30
123  85.00000     188  6.3   94     8  31
124  96.00000     167  6.9   91     9   1
125  78.00000     197  5.1   92     9   2
126  73.00000     183  2.8   93     9   3
127  91.00000     189  4.6   93     9   4
128  47.00000      95  7.4   87     9   5
129  32.00000      92 15.5   84     9   6
130  20.00000     252 10.9   80     9   7
131  23.00000     220 10.3   78     9   8
132  21.00000     230 10.9   75     9   9
133  24.00000     259  9.7   73     9  10
134  44.00000     236 14.9   81     9  11
135  21.00000     259 15.5   76     9  12
136  28.00000     238  6.3   77     9  13
137   9.00000      24 10.9   71     9  14
138  13.00000     112 11.5   71     9  15
139  46.00000     237  6.9   78     9  16
140  18.00000     224 13.8   67     9  17
141  13.00000      27 10.3   76     9  18
142  24.00000     238 10.3   68     9  19
143  16.00000     201  8.0   82     9  20
144  13.00000     238 12.6   64     9  21
145  23.00000      14  9.2   71     9  22
146  36.00000     139 10.3   81     9  23
147   7.00000      49 10.3   69     9  24
148  14.00000      20 16.6   63     9  25
149  30.00000     193  6.9   70     9  26
150 -14.83089     145 13.2   77     9  27
151  14.00000     191 14.3   75     9  28
152  18.00000     131  8.0   76     9  29
153  20.00000     223 11.5   68     9  30

However, sometimes you want to do this across many variables. Repeating the same mutate() assignment for every variable in the dataset is at best repetitive, and at worst leads to unintended mistakes. We can work around this by using across().

If we want to impute all variables, we can use across() like so:

airquality %>% 
  mutate(across(everything(),impute_below))
        Ozone   Solar.R Wind Temp Month Day
1    41.00000 190.00000  7.4   67     5   1
2    36.00000 118.00000  8.0   72     5   2
3    12.00000 149.00000 12.6   74     5   3
4    18.00000 313.00000 11.5   62     5   4
5   -19.72321 -33.57778 14.3   56     5   5
6    28.00000 -33.07810 14.9   66     5   6
7    23.00000 299.00000  8.6   65     5   7
8    19.00000  99.00000 13.8   59     5   8
9     8.00000  19.00000 20.1   61     5   9
10  -18.51277 194.00000  8.6   69     5  10
11    7.00000 -21.37719  6.9   74     5  11
12   16.00000 256.00000  9.7   69     5  12
13   11.00000 290.00000  9.2   66     5  13
14   14.00000 274.00000 10.9   68     5  14
15   18.00000  65.00000 13.2   58     5  15
16   14.00000 334.00000 11.5   64     5  16
17   34.00000 307.00000 12.0   66     5  17
18    6.00000  78.00000 18.4   57     5  18
19   30.00000 322.00000 11.5   68     5  19
20   11.00000  44.00000  9.7   62     5  20
21    1.00000   8.00000  9.7   59     5  21
22   11.00000 320.00000 16.6   73     5  22
23    4.00000  25.00000  9.7   61     5  23
24   32.00000  92.00000 12.0   61     5  24
25  -17.81863  66.00000 16.6   57     5  25
26  -19.43853 266.00000 14.9   58     5  26
27  -15.14310 -24.60954  8.0   57     5  27
28   23.00000  13.00000 12.0   67     5  28
29   45.00000 252.00000 14.9   81     5  29
30  115.00000 223.00000  5.7   79     5  30
31   37.00000 279.00000  7.4   76     5  31
32  -16.17315 286.00000  8.6   78     6   1
33  -14.65883 287.00000  9.7   74     6   2
34  -17.85609 242.00000 16.1   67     6   3
35  -13.29299 186.00000  9.2   84     6   4
36  -16.16323 220.00000  8.6   85     6   5
37  -19.60935 264.00000 14.3   79     6   6
38   29.00000 127.00000  9.7   82     6   7
39  -19.65780 273.00000  6.9   87     6   8
40   71.00000 291.00000 13.8   90     6   9
41   39.00000 323.00000 11.5   87     6  10
42  -13.40961 259.00000 10.9   93     6  11
43  -13.53728 250.00000  9.2   92     6  12
44   23.00000 148.00000  8.0   82     6  13
45  -19.65993 332.00000 13.8   80     6  14
46  -16.48342 322.00000 11.5   79     6  15
47   21.00000 191.00000 14.9   77     6  16
48   37.00000 284.00000 20.7   72     6  17
49   20.00000  37.00000  9.2   65     6  18
50   12.00000 120.00000 11.5   73     6  19
51   13.00000 137.00000 10.3   76     6  20
52  -17.17718 150.00000  6.3   77     6  21
53  -16.74073  59.00000  1.7   76     6  22
54  -13.65786  91.00000  4.6   76     6  23
55  -16.78786 250.00000  6.3   76     6  24
56  -12.30098 135.00000  8.0   75     6  25
57  -13.33171 127.00000  8.0   78     6  26
58  -16.77414  47.00000 10.3   73     6  27
59  -17.08225  98.00000 11.5   80     6  28
60  -15.98818  31.00000 14.9   77     6  29
61  -19.17558 138.00000  8.0   83     6  30
62  135.00000 269.00000  4.1   84     7   1
63   49.00000 248.00000  9.2   85     7   2
64   32.00000 236.00000  9.2   81     7   3
65  -14.27138 101.00000 10.9   84     7   4
66   64.00000 175.00000  4.6   83     7   5
67   40.00000 314.00000 10.9   83     7   6
68   77.00000 276.00000  5.1   88     7   7
69   97.00000 267.00000  6.3   92     7   8
70   97.00000 272.00000  5.7   92     7   9
71   85.00000 175.00000  7.4   89     7  10
72  -13.51764 139.00000  8.6   82     7  11
73   10.00000 264.00000 14.3   73     7  12
74   27.00000 175.00000 14.9   81     7  13
75  -13.48998 291.00000 14.9   91     7  14
76    7.00000  48.00000 14.3   80     7  15
77   48.00000 260.00000  6.9   81     7  16
78   35.00000 274.00000 10.3   82     7  17
79   61.00000 285.00000  6.3   84     7  18
80   79.00000 187.00000  5.1   87     7  19
81   63.00000 220.00000 11.5   85     7  20
82   16.00000   7.00000  6.9   74     7  21
83  -16.92150 258.00000  9.7   81     7  22
84  -16.60335 295.00000 11.5   82     7  23
85   80.00000 294.00000  8.6   86     7  24
86  108.00000 223.00000  8.0   85     7  25
87   20.00000  81.00000  8.6   82     7  26
88   52.00000  82.00000 12.0   86     7  27
89   82.00000 213.00000  7.4   88     7  28
90   50.00000 275.00000  7.4   86     7  29
91   64.00000 253.00000  7.4   83     7  30
92   59.00000 254.00000  9.2   81     7  31
93   39.00000  83.00000  6.9   81     8   1
94    9.00000  24.00000 13.8   81     8   2
95   16.00000  77.00000  7.4   82     8   3
96   78.00000 -30.94374  6.9   86     8   4
97   35.00000 -33.38707  7.4   85     8   5
98   66.00000 -21.48980  4.6   87     8   6
99  122.00000 255.00000  4.0   89     8   7
100  89.00000 229.00000 10.3   90     8   8
101 110.00000 207.00000  8.0   90     8   9
102 -14.78907 222.00000  8.6   92     8  10
103 -16.19151 137.00000 11.5   86     8  11
104  44.00000 192.00000 11.5   86     8  12
105  28.00000 273.00000 11.5   82     8  13
106  65.00000 157.00000  9.7   80     8  14
107 -19.73591  64.00000 11.5   79     8  15
108  22.00000  71.00000 10.3   77     8  16
109  59.00000  51.00000  6.3   79     8  17
110  23.00000 115.00000  7.4   76     8  18
111  31.00000 244.00000 10.9   78     8  19
112  44.00000 190.00000 10.3   78     8  20
113  21.00000 259.00000 15.5   77     8  21
114   9.00000  36.00000 14.3   72     8  22
115 -18.92235 255.00000 12.6   75     8  23
116  45.00000 212.00000  9.7   79     8  24
117 168.00000 238.00000  3.4   81     8  25
118  73.00000 215.00000  8.0   86     8  26
119 -14.86296 153.00000  5.7   88     8  27
120  76.00000 203.00000  9.7   97     8  28
121 118.00000 225.00000  2.3   94     8  29
122  84.00000 237.00000  6.3   96     8  30
123  85.00000 188.00000  6.3   94     8  31
124  96.00000 167.00000  6.9   91     9   1
125  78.00000 197.00000  5.1   92     9   2
126  73.00000 183.00000  2.8   93     9   3
127  91.00000 189.00000  4.6   93     9   4
128  47.00000  95.00000  7.4   87     9   5
129  32.00000  92.00000 15.5   84     9   6
130  20.00000 252.00000 10.9   80     9   7
131  23.00000 220.00000 10.3   78     9   8
132  21.00000 230.00000 10.9   75     9   9
133  24.00000 259.00000  9.7   73     9  10
134  44.00000 236.00000 14.9   81     9  11
135  21.00000 259.00000 15.5   76     9  12
136  28.00000 238.00000  6.3   77     9  13
137   9.00000  24.00000 10.9   71     9  14
138  13.00000 112.00000 11.5   71     9  15
139  46.00000 237.00000  6.9   78     9  16
140  18.00000 224.00000 13.8   67     9  17
141  13.00000  27.00000 10.3   76     9  18
142  24.00000 238.00000 10.3   68     9  19
143  16.00000 201.00000  8.0   82     9  20
144  13.00000 238.00000 12.6   64     9  21
145  23.00000  14.00000  9.2   71     9  22
146  36.00000 139.00000 10.3   81     9  23
147   7.00000  49.00000 10.3   69     9  24
148  14.00000  20.00000 16.6   63     9  25
149  30.00000 193.00000  6.9   70     9  26
150 -14.83089 145.00000 13.2   77     9  27
151  14.00000 191.00000 14.3   75     9  28
152  18.00000 131.00000  8.0   76     9  29
153  20.00000 223.00000 11.5   68     9  30

Here we use the everything() selection helper, available through dplyr, to select all variables. Any other kind of tidy selection that dplyr supports can be used in its place.
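Rather than scanning the printed rows, the effect of imputing every variable can be confirmed by counting missing values per column before and after. A minimal sketch, where aq_imputed is an illustrative name:

```r
library(naniar)
library(dplyr)

aq_imputed <- airquality %>%
  mutate(across(everything(), impute_below))

# Missing values per column before imputation
colSums(is.na(airquality))

# Missing values per column after imputation
colSums(is.na(aq_imputed))
```

The second call should report zero for every column, while the number of rows stays the same.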

We can also impute only those variables that satisfy a condition. For example, to impute only the numeric columns, we wrap is.numeric() in where(), like so:

airquality %>% 
  mutate(across(where(is.numeric),impute_below))
        Ozone   Solar.R Wind Temp Month Day
1    41.00000 190.00000  7.4   67     5   1
2    36.00000 118.00000  8.0   72     5   2
3    12.00000 149.00000 12.6   74     5   3
4    18.00000 313.00000 11.5   62     5   4
5   -19.72321 -33.57778 14.3   56     5   5
6    28.00000 -33.07810 14.9   66     5   6
[Rows 7 to 153 omitted: the printed result is identical to the across(everything(), impute_below) output above, because every column of airquality is numeric.]

This reads as:

Take airquality, then, across the variables that are numeric, impute below.
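Since every column of airquality is numeric, the effect of where(is.numeric) is easier to see on data with mixed column types. The sketch below uses a small made-up data frame (toy, with hypothetical columns reading and site) to show that only the numeric column is imputed.

```r
library(naniar)
library(dplyr)

# Hypothetical toy data: one numeric column and one character column,
# each with a missing value
toy <- data.frame(
  reading = c(2.1, NA, 3.4, 5.0),
  site    = c("a", NA, "c", "d")
)

toy_imputed <- toy %>%
  mutate(across(where(is.numeric), impute_below))

# 'reading' has its NA filled in below the observed range;
# 'site' is not selected, so its NA is left in place
toy_imputed
```

This is the main reason to prefer a condition over everything() when a dataset contains columns that should not be shifted.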

We can choose specific variables like so:

airquality %>% 
  mutate(across(c(Ozone, Solar.R),impute_below))
        Ozone   Solar.R Wind Temp Month Day
1    41.00000 190.00000  7.4   67     5   1
2    36.00000 118.00000  8.0   72     5   2
3    12.00000 149.00000 12.6   74     5   3
4    18.00000 313.00000 11.5   62     5   4
5   -19.72321 -33.57778 14.3   56     5   5
6    28.00000 -33.07810 14.9   66     5   6
[Rows 7 to 153 omitted: the printed result is identical to the across(everything(), impute_below) output above, because Ozone and Solar.R are the only columns of airquality with missing values.]
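Selecting variables inside across() is meant as shorthand for writing one mutate() assignment per variable. Here is a minimal sketch comparing the two forms, with by_hand and with_across as illustrative names. The comparison should return TRUE provided impute_below() gives the same result each time it is applied to the same column, which the repeated outputs in this section suggest.

```r
library(naniar)
library(dplyr)

# One assignment per variable
by_hand <- airquality %>%
  mutate(Ozone = impute_below(Ozone),
         Solar.R = impute_below(Solar.R))

# The same selection expressed with across()
with_across <- airquality %>%
  mutate(across(c(Ozone, Solar.R), impute_below))

# Compare the two results
all.equal(by_hand, with_across)

# Columns outside the selection are unchanged
identical(with_across$Wind, airquality$Wind)
```

The across() form scales better: adding a variable means extending the selection rather than copying another assignment.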

We can also take advantage of the selection helpers from dplyr's tidy selection, such as starts_with():

airquality %>% 
  mutate(across(c(Ozone, Solar.R, starts_with("T")),impute_below))
        Ozone   Solar.R Wind Temp Month Day
1    41.00000 190.00000  7.4   67     5   1
2    36.00000 118.00000  8.0   72     5   2
3    12.00000 149.00000 12.6   74     5   3
4    18.00000 313.00000 11.5   62     5   4
5   -19.72321 -33.57778 14.3   56     5   5
6    28.00000 -33.07810 14.9   66     5   6
7    23.00000 299.00000  8.6   65     5   7
8    19.00000  99.00000 13.8   59     5   8
9     8.00000  19.00000 20.1   61     5   9
10  -18.51277 194.00000  8.6   69     5  10
11    7.00000 -21.37719  6.9   74     5  11
12   16.00000 256.00000  9.7   69     5  12
13   11.00000 290.00000  9.2   66     5  13
14   14.00000 274.00000 10.9   68     5  14
15   18.00000  65.00000 13.2   58     5  15
16   14.00000 334.00000 11.5   64     5  16
17   34.00000 307.00000 12.0   66     5  17
18    6.00000  78.00000 18.4   57     5  18
19   30.00000 322.00000 11.5   68     5  19
20   11.00000  44.00000  9.7   62     5  20
21    1.00000   8.00000  9.7   59     5  21
22   11.00000 320.00000 16.6   73     5  22
23    4.00000  25.00000  9.7   61     5  23
24   32.00000  92.00000 12.0   61     5  24
25  -17.81863  66.00000 16.6   57     5  25
26  -19.43853 266.00000 14.9   58     5  26
27  -15.14310 -24.60954  8.0   57     5  27
28   23.00000  13.00000 12.0   67     5  28
29   45.00000 252.00000 14.9   81     5  29
30  115.00000 223.00000  5.7   79     5  30
31   37.00000 279.00000  7.4   76     5  31
32  -16.17315 286.00000  8.6   78     6   1
33  -14.65883 287.00000  9.7   74     6   2
34  -17.85609 242.00000 16.1   67     6   3
35  -13.29299 186.00000  9.2   84     6   4
36  -16.16323 220.00000  8.6   85     6   5
37  -19.60935 264.00000 14.3   79     6   6
38   29.00000 127.00000  9.7   82     6   7
39  -19.65780 273.00000  6.9   87     6   8
40   71.00000 291.00000 13.8   90     6   9
41   39.00000 323.00000 11.5   87     6  10
42  -13.40961 259.00000 10.9   93     6  11
43  -13.53728 250.00000  9.2   92     6  12
44   23.00000 148.00000  8.0   82     6  13
45  -19.65993 332.00000 13.8   80     6  14
46  -16.48342 322.00000 11.5   79     6  15
47   21.00000 191.00000 14.9   77     6  16
48   37.00000 284.00000 20.7   72     6  17
49   20.00000  37.00000  9.2   65     6  18
50   12.00000 120.00000 11.5   73     6  19
51   13.00000 137.00000 10.3   76     6  20
52  -17.17718 150.00000  6.3   77     6  21
53  -16.74073  59.00000  1.7   76     6  22
54  -13.65786  91.00000  4.6   76     6  23
55  -16.78786 250.00000  6.3   76     6  24
56  -12.30098 135.00000  8.0   75     6  25
57  -13.33171 127.00000  8.0   78     6  26
58  -16.77414  47.00000 10.3   73     6  27
59  -17.08225  98.00000 11.5   80     6  28
60  -15.98818  31.00000 14.9   77     6  29
61  -19.17558 138.00000  8.0   83     6  30
62  135.00000 269.00000  4.1   84     7   1
63   49.00000 248.00000  9.2   85     7   2
64   32.00000 236.00000  9.2   81     7   3
65  -14.27138 101.00000 10.9   84     7   4
66   64.00000 175.00000  4.6   83     7   5
67   40.00000 314.00000 10.9   83     7   6
68   77.00000 276.00000  5.1   88     7   7
69   97.00000 267.00000  6.3   92     7   8
70   97.00000 272.00000  5.7   92     7   9
71   85.00000 175.00000  7.4   89     7  10
72  -13.51764 139.00000  8.6   82     7  11
73   10.00000 264.00000 14.3   73     7  12
74   27.00000 175.00000 14.9   81     7  13
75  -13.48998 291.00000 14.9   91     7  14
76    7.00000  48.00000 14.3   80     7  15
77   48.00000 260.00000  6.9   81     7  16
78   35.00000 274.00000 10.3   82     7  17
79   61.00000 285.00000  6.3   84     7  18
80   79.00000 187.00000  5.1   87     7  19
81   63.00000 220.00000 11.5   85     7  20
82   16.00000   7.00000  6.9   74     7  21
83  -16.92150 258.00000  9.7   81     7  22
84  -16.60335 295.00000 11.5   82     7  23
85   80.00000 294.00000  8.6   86     7  24
86  108.00000 223.00000  8.0   85     7  25
87   20.00000  81.00000  8.6   82     7  26
88   52.00000  82.00000 12.0   86     7  27
89   82.00000 213.00000  7.4   88     7  28
90   50.00000 275.00000  7.4   86     7  29
91   64.00000 253.00000  7.4   83     7  30
92   59.00000 254.00000  9.2   81     7  31
93   39.00000  83.00000  6.9   81     8   1
94    9.00000  24.00000 13.8   81     8   2
95   16.00000  77.00000  7.4   82     8   3
96   78.00000 -30.94374  6.9   86     8   4
97   35.00000 -33.38707  7.4   85     8   5
98   66.00000 -21.48980  4.6   87     8   6
99  122.00000 255.00000  4.0   89     8   7
100  89.00000 229.00000 10.3   90     8   8
101 110.00000 207.00000  8.0   90     8   9
102 -14.78907 222.00000  8.6   92     8  10
103 -16.19151 137.00000 11.5   86     8  11
104  44.00000 192.00000 11.5   86     8  12
105  28.00000 273.00000 11.5   82     8  13
106  65.00000 157.00000  9.7   80     8  14
107 -19.73591  64.00000 11.5   79     8  15
108  22.00000  71.00000 10.3   77     8  16
109  59.00000  51.00000  6.3   79     8  17
110  23.00000 115.00000  7.4   76     8  18
111  31.00000 244.00000 10.9   78     8  19
112  44.00000 190.00000 10.3   78     8  20
113  21.00000 259.00000 15.5   77     8  21
114   9.00000  36.00000 14.3   72     8  22
115 -18.92235 255.00000 12.6   75     8  23
116  45.00000 212.00000  9.7   79     8  24
117 168.00000 238.00000  3.4   81     8  25
118  73.00000 215.00000  8.0   86     8  26
119 -14.86296 153.00000  5.7   88     8  27
120  76.00000 203.00000  9.7   97     8  28
121 118.00000 225.00000  2.3   94     8  29
122  84.00000 237.00000  6.3   96     8  30
123  85.00000 188.00000  6.3   94     8  31
124  96.00000 167.00000  6.9   91     9   1
125  78.00000 197.00000  5.1   92     9   2
126  73.00000 183.00000  2.8   93     9   3
127  91.00000 189.00000  4.6   93     9   4
128  47.00000  95.00000  7.4   87     9   5
129  32.00000  92.00000 15.5   84     9   6
130  20.00000 252.00000 10.9   80     9   7
131  23.00000 220.00000 10.3   78     9   8
132  21.00000 230.00000 10.9   75     9   9
133  24.00000 259.00000  9.7   73     9  10
134  44.00000 236.00000 14.9   81     9  11
135  21.00000 259.00000 15.5   76     9  12
136  28.00000 238.00000  6.3   77     9  13
137   9.00000  24.00000 10.9   71     9  14
138  13.00000 112.00000 11.5   71     9  15
139  46.00000 237.00000  6.9   78     9  16
140  18.00000 224.00000 13.8   67     9  17
141  13.00000  27.00000 10.3   76     9  18
142  24.00000 238.00000 10.3   68     9  19
143  16.00000 201.00000  8.0   82     9  20
144  13.00000 238.00000 12.6   64     9  21
145  23.00000  14.00000  9.2   71     9  22
146  36.00000 139.00000 10.3   81     9  23
147   7.00000  49.00000 10.3   69     9  24
148  14.00000  20.00000 16.6   63     9  25
149  30.00000 193.00000  6.9   70     9  26
150 -14.83089 145.00000 13.2   77     9  27
151  14.00000 191.00000 14.3   75     9  28
152  18.00000 131.00000  8.0   76     9  29
153  20.00000 223.00000 11.5   68     9  30
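
As a quick check on the call above, the following sketch confirms that impute_below() only changed the missing entries and placed them below the observed range. It assumes naniar and dplyr are installed; aq_below and was_missing are names introduced here for illustration.

```r
library(naniar)
library(dplyr)

aq_below <- airquality %>%
  mutate(across(c(Ozone, Solar.R, starts_with("T")), impute_below))

# Positions that were missing in the original Ozone column
was_missing <- is.na(airquality$Ozone)

# No missing values remain in the imputed columns
n_miss(aq_below$Ozone)
n_miss(aq_below$Solar.R)

# Observed values are left untouched
all(aq_below$Ozone[!was_missing] == airquality$Ozone[!was_missing])

# Every imputed value sits below the smallest observed value
all(aq_below$Ozone[was_missing] < min(airquality$Ozone, na.rm = TRUE))
```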

12.3 Tracking missing values

We need to track missing values before we impute them; otherwise, we cannot tell which values were observed and which were imputed. In the example below, once the data is imputed, nothing in the result marks which value was filled in.

df <- tibble(var1 = c(5, 6, 7, NA, 9, 10))
df
# A tibble: 6 × 1
   var1
  <dbl>
1     5
2     6
3     7
4    NA
5     9
6    10
df %>% 
  mutate(across(everything(), impute_below))
# A tibble: 6 × 1
   var1
  <dbl>
1  5   
2  6   
3  7   
4  4.40
5  9   
6 10   

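We can keep track of the missing values by using nabular() to turn the data into nabular form, which adds a shadow column, var1_NA, recording whether each value of var1 was missing (NA) or not (!NA).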

nabular(df)
# A tibble: 6 × 2
   var1 var1_NA
  <dbl> <fct>  
1     5 !NA    
2     6 !NA    
3     7 !NA    
4    NA NA     
5     9 !NA    
6    10 !NA    

Now when we impute the data, the shadow variable, var1_NA, still marks row 4 as NA, which identifies 4.40 as the imputed value.

df %>% 
  nabular() %>% 
  mutate(across(everything(), impute_below))
# A tibble: 6 × 2
   var1 var1_NA
  <dbl> <fct>  
1  5    !NA    
2  6    !NA    
3  7    !NA    
4  4.40 NA     
5  9    !NA    
6 10    !NA    
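
Because the shadow column survives imputation, it can also be used to pull out just the imputed rows. This is a minimal sketch, assuming naniar and dplyr are installed; df_imp and imputed_rows are illustrative names, and only var1 is imputed here to keep the example small.

```r
library(naniar)
library(dplyr)

df <- tibble(var1 = c(5, 6, 7, NA, 9, 10))

# Record missingness first, then impute the numeric column
df_imp <- df %>%
  nabular() %>%
  mutate(var1 = impute_below(var1))

# Keep only the rows whose var1 was originally missing
imputed_rows <- df_imp %>%
  filter(as.character(var1_NA) == "NA")

imputed_rows
```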

12.4 Visualise imputed values against data values using histograms

Using this imputed data, we can explore the number of missings in a single variable, along with its distribution, using a histogram and colouring the missings with fill = Ozone_NA.

aq_imp <- airquality %>%
  nabular() %>% 
  mutate(across(everything(), impute_below))

ggplot(aq_imp,
       aes(x = Ozone,
           fill = Ozone_NA)) + 
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Here the imputed values appear as two bars to the left of the observed data, below zero. Each bar has a count of around 20, so Ozone has just under 40 missing values.
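
Reading counts off a histogram is approximate, so it can help to confirm the number directly. The sketch below counts the levels of the shadow column and compares that with n_miss() on the raw column; it assumes naniar and dplyr are installed, and ozone_counts is an illustrative name.

```r
library(naniar)
library(dplyr)

# Count observed (!NA) and missing (NA) Ozone values via the shadow column
ozone_counts <- airquality %>%
  nabular() %>%
  count(Ozone_NA)

ozone_counts

# The same number of missings, computed directly from the raw column
n_miss(airquality$Ozone)
```

For the built-in airquality data this gives 37 missing values of Ozone, which agrees with the two bars in the histogram.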

12.5 Visualise imputed values against data values using facets

We can take this same plot and split it into facets. For example, faceting by month shows us that most missing values occur in month 6, which also didn’t have many high values of ozone.

ggplot(aq_imp,
       aes(x = Ozone,
           fill = Ozone_NA)) + 
  geom_histogram() + 
  facet_wrap(~Month)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
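
The pattern in the faceted plot can be checked with a grouped summary. This is a small sketch, assuming naniar and dplyr are installed; ozone_by_month and its column names are introduced here for illustration.

```r
library(naniar)
library(dplyr)

# Number of missing Ozone values, and total rows, in each month
ozone_by_month <- airquality %>%
  group_by(Month) %>%
  summarise(n_missing = n_miss(Ozone),
            n_rows = n())

ozone_by_month
```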

12.6 Visualise imputed values using facets

We can also split the plot according to the missingness of solar radiation by faceting on its shadow variable, Solar.R_NA.

ggplot(aq_imp,
       aes(x = Ozone,
           fill = Ozone_NA)) + 
  geom_histogram() + 
  facet_wrap(~Solar.R_NA)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This shows us that there aren’t many missing values in ozone when solar radiation is missing.
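
The same comparison can be made as a table by counting combinations of the two shadow columns. This is a minimal sketch, assuming naniar and dplyr are installed; shadow_counts is an illustrative name.

```r
library(naniar)
library(dplyr)

# Cross-tabulate missingness of Solar.R against missingness of Ozone
shadow_counts <- airquality %>%
  nabular() %>%
  count(Solar.R_NA, Ozone_NA)

shadow_counts

# Rows where both Ozone and Solar.R are missing
sum(is.na(airquality$Ozone) & is.na(airquality$Solar.R))
```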

12.7 Visualise imputed values against data values using scatterplots

Previously we could identify imputed values by referring to the shadow variable, e.g., Ozone_NA. However, when a plot shows two variables, such as a scatterplot, we need to colour each point by whether either variable was imputed. To do this, we can add a column of labels that identifies whether a row contains any missing value. The function add_label_missings() does this for us, adding a column named any_missing.

aq_imp <- airquality %>%
  nabular() %>% 
  add_label_missings() %>%
  mutate(across(everything(), impute_below))

aq_imp
# A tibble: 153 × 13
   Ozone Solar.R  Wind  Temp Month   Day Ozone_NA Solar.R_NA Wind_NA Temp_NA
   <dbl>   <dbl> <dbl> <int> <int> <int> <fct>    <fct>      <fct>   <fct>  
 1  41     190     7.4    67     5     1 !NA      !NA        !NA     !NA    
 2  36     118     8      72     5     2 !NA      !NA        !NA     !NA    
 3  12     149    12.6    74     5     3 !NA      !NA        !NA     !NA    
 4  18     313    11.5    62     5     4 !NA      !NA        !NA     !NA    
 5 -19.7   -33.6  14.3    56     5     5 NA       NA         !NA     !NA    
 6  28     -33.1  14.9    66     5     6 !NA      NA         !NA     !NA    
 7  23     299     8.6    65     5     7 !NA      !NA        !NA     !NA    
 8  19      99    13.8    59     5     8 !NA      !NA        !NA     !NA    
 9   8      19    20.1    61     5     9 !NA      !NA        !NA     !NA    
10 -18.5   194     8.6    69     5    10 NA       !NA        !NA     !NA    
# … with 143 more rows, and 3 more variables: Month_NA <fct>, Day_NA <fct>,
#   any_missing <chr>

We can now recreate the same figure that geom_miss_point() produces!

ggplot(aq_imp,
       aes(x = Ozone,
           y = Solar.R,
           colour = any_missing)) + 
  geom_point()
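
Before relying on the colours in the scatterplot, it may be worth checking how many rows any_missing flags. The sketch below applies add_label_missings() to the raw data (without nabular(), to keep it short) and counts the labels; it assumes naniar and dplyr are installed, and label_counts is an illustrative name.

```r
library(naniar)
library(dplyr)

# Label each row by whether it contains any missing value, then count
label_counts <- airquality %>%
  add_label_missings() %>%
  count(any_missing)

label_counts

# Number of fully observed rows, for comparison
sum(complete.cases(airquality))
```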