Missing values

Let’s take another look at the stocks data.

head(stocks)
  YEAR TBONDS SPSTOCK TBONDS_D SPSTOCK_D
1 1928      1      44        1         1
2 1929      4      -8        1         0
3 1930     NA     -25       NA         0
4 1931     -3     -44        0         0
5 1932      9      -9        1         0
6 1933      2      50        1         1

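You will notice that there are missing values on the TBONDS and TBONDS_D variables, shown as NA. In R, missing values can be represented in different ways: NA is the standard format, but data may also contain non-standard codes for missing responses.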

You can easily search an entire data set for missing values in the standard format using anyNA(). It returns TRUE if at least one missing value is present and FALSE otherwise.

anyNA(stocks)
[1] TRUE

So we know there is at least one missing value in stocks, but we don’t know where. You can find the position of missing values in the data using is.na(). This returns TRUE or FALSE for every cell, row by row and column by column.

is.na(stocks)
       YEAR TBONDS SPSTOCK TBONDS_D SPSTOCK_D
 [1,] FALSE  FALSE   FALSE    FALSE     FALSE
 [2,] FALSE  FALSE   FALSE    FALSE     FALSE
 [3,] FALSE   TRUE   FALSE     TRUE     FALSE
 [4,] FALSE  FALSE   FALSE    FALSE     FALSE
 [5,] FALSE  FALSE   FALSE    FALSE     FALSE
 [6,] FALSE  FALSE   FALSE    FALSE     FALSE
 [7,] FALSE  FALSE   FALSE    FALSE     FALSE
 [8,] FALSE  FALSE   FALSE    FALSE     FALSE
 [9,] FALSE  FALSE   FALSE    FALSE     FALSE
[10,] FALSE  FALSE   FALSE    FALSE     FALSE
[11,] FALSE  FALSE   FALSE    FALSE     FALSE
[12,] FALSE  FALSE   FALSE    FALSE     FALSE
[13,] FALSE  FALSE   FALSE    FALSE     FALSE
[14,] FALSE  FALSE   FALSE    FALSE     FALSE
[15,] FALSE   TRUE   FALSE     TRUE     FALSE
[16,] FALSE  FALSE   FALSE    FALSE     FALSE
[17,] FALSE  FALSE   FALSE    FALSE     FALSE
[18,] FALSE  FALSE   FALSE    FALSE     FALSE
[19,] FALSE  FALSE   FALSE    FALSE     FALSE
[20,] FALSE  FALSE   FALSE    FALSE     FALSE
[21,] FALSE  FALSE   FALSE    FALSE     FALSE
[22,] FALSE  FALSE   FALSE    FALSE     FALSE
[23,] FALSE   TRUE   FALSE     TRUE     FALSE
[24,] FALSE  FALSE   FALSE    FALSE     FALSE
[25,] FALSE  FALSE   FALSE    FALSE     FALSE
[26,] FALSE  FALSE   FALSE    FALSE     FALSE
[27,] FALSE  FALSE   FALSE    FALSE     FALSE
[28,] FALSE  FALSE   FALSE    FALSE     FALSE
[29,] FALSE  FALSE   FALSE    FALSE     FALSE
[30,] FALSE  FALSE   FALSE    FALSE     FALSE
[31,] FALSE  FALSE   FALSE    FALSE     FALSE
[32,] FALSE   TRUE   FALSE     TRUE     FALSE
[33,] FALSE  FALSE   FALSE    FALSE     FALSE
[34,] FALSE  FALSE   FALSE    FALSE     FALSE
[35,] FALSE  FALSE   FALSE    FALSE     FALSE
[36,] FALSE  FALSE   FALSE    FALSE     FALSE
[37,] FALSE  FALSE   FALSE    FALSE     FALSE
[38,] FALSE   TRUE   FALSE     TRUE     FALSE
[39,] FALSE  FALSE   FALSE    FALSE     FALSE
[40,] FALSE  FALSE   FALSE    FALSE     FALSE
[41,] FALSE  FALSE   FALSE    FALSE     FALSE
[42,] FALSE  FALSE   FALSE    FALSE     FALSE
[43,] FALSE  FALSE   FALSE    FALSE     FALSE
[44,] FALSE  FALSE   FALSE    FALSE     FALSE
[45,] FALSE  FALSE   FALSE    FALSE     FALSE
[46,] FALSE  FALSE   FALSE    FALSE     FALSE
[47,] FALSE  FALSE   FALSE    FALSE     FALSE
[48,] FALSE  FALSE   FALSE    FALSE     FALSE
[49,] FALSE  FALSE   FALSE    FALSE     FALSE
[50,] FALSE  FALSE   FALSE    FALSE     FALSE
[51,] FALSE  FALSE   FALSE    FALSE     FALSE
[52,] FALSE  FALSE   FALSE    FALSE     FALSE
[53,] FALSE  FALSE   FALSE    FALSE     FALSE
[54,] FALSE  FALSE   FALSE    FALSE     FALSE
[55,] FALSE  FALSE   FALSE    FALSE     FALSE
[56,] FALSE  FALSE   FALSE    FALSE     FALSE
[57,] FALSE  FALSE   FALSE    FALSE     FALSE
[58,] FALSE  FALSE   FALSE    FALSE     FALSE
[59,] FALSE  FALSE   FALSE    FALSE     FALSE
[60,] FALSE  FALSE   FALSE    FALSE     FALSE
[61,] FALSE  FALSE   FALSE    FALSE     FALSE
[62,] FALSE  FALSE   FALSE    FALSE     FALSE
[63,] FALSE  FALSE   FALSE    FALSE     FALSE
[64,] FALSE  FALSE   FALSE    FALSE     FALSE
[65,] FALSE  FALSE   FALSE    FALSE     FALSE
[66,] FALSE  FALSE   FALSE    FALSE     FALSE
[67,] FALSE  FALSE   FALSE    FALSE     FALSE
[68,] FALSE  FALSE   FALSE    FALSE     FALSE
[69,] FALSE  FALSE   FALSE    FALSE     FALSE
[70,] FALSE  FALSE   FALSE    FALSE     FALSE
[71,] FALSE  FALSE   FALSE    FALSE     FALSE
[72,] FALSE  FALSE   FALSE    FALSE     FALSE
[73,] FALSE  FALSE   FALSE    FALSE     FALSE
[74,] FALSE   TRUE   FALSE     TRUE     FALSE
[75,] FALSE  FALSE   FALSE    FALSE     FALSE
[76,] FALSE  FALSE   FALSE    FALSE     FALSE
[77,] FALSE  FALSE   FALSE    FALSE     FALSE
[78,] FALSE  FALSE   FALSE    FALSE     FALSE
[79,] FALSE  FALSE   FALSE    FALSE     FALSE
[80,] FALSE  FALSE   FALSE    FALSE     FALSE
[81,] FALSE  FALSE   FALSE    FALSE     FALSE
[82,] FALSE  FALSE   FALSE    FALSE     FALSE
[83,] FALSE  FALSE   FALSE    FALSE     FALSE
[84,] FALSE  FALSE   FALSE    FALSE     FALSE
[85,] FALSE  FALSE   FALSE    FALSE     FALSE
[86,] FALSE  FALSE   FALSE    FALSE     FALSE
[87,] FALSE  FALSE   FALSE    FALSE     FALSE
[88,] FALSE  FALSE   FALSE    FALSE     FALSE
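
A full TRUE/FALSE matrix like this is hard to scan. As a minimal sketch, the is.na() result can be condensed into a count per column and a list of row positions. The example assumes a small stand-in data frame built from the six rows printed by head(stocks) above; the names stocks_head, na_per_column and na_rows are only for illustration.

```r
# Stand-in for the first six rows of stocks, copied from the head() output
stocks_head <- data.frame(
  YEAR      = 1928:1933,
  TBONDS    = c(1, 4, NA, -3, 9, 2),
  SPSTOCK   = c(44, -8, -25, -44, -9, 50),
  TBONDS_D  = c(1, 1, NA, 0, 1, 1),
  SPSTOCK_D = c(1, 0, 0, 0, 0, 1)
)

# Number of missing values in each column
# (TRUE counts as 1 and FALSE as 0 when summed)
na_per_column <- colSums(is.na(stocks_head))
na_per_column

# Row positions of the missing values in a single column
na_rows <- which(is.na(stocks_head$TBONDS))
na_rows
```

The same two calls should work on the full stocks data by replacing stocks_head with stocks.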

Both anyNA() and is.na() are general-purpose ways of detecting standard missing values, and their raw output is not very informative on its own. Non-standard missing values are more difficult to find because R doesn’t know that they are missing. Depending on your research question, you may want to convert non-standard missing data responses to NA for smoother data manipulation. You may also decide that these types of responses are informative and that you want to keep them as they are.
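
As a rough illustration of converting a non-standard response to NA, the sketch below uses a made-up vector in which -999 stands for "missing". Both the vector responses and the code -999 are assumptions for this example and do not come from the stocks data.

```r
# Hypothetical responses in which -999 is used as a code for "missing"
responses <- c(12, -999, 7, 3, -999)

# R treats -999 as an ordinary number, so no missing values are detected
anyNA(responses)

# Recode the non-standard value to the standard NA
responses[responses == -999] <- NA

# The recoded values are now detected as missing
anyNA(responses)
is.na(responses)
```

Whether to recode at all remains a judgement call: if the code itself carries information, it may be better to keep the original values in a separate variable.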

Accounting for missing values when calculating summary statistics

If we calculate summary statistics such as the mean or the standard deviation, we have to take missing values into account.

The parameter na.rm in R stands for "NA remove". It tells a function to ignore standard missing values (those that are set to NA) during calculations. By setting na.rm = TRUE, functions like mean() or sd() compute results from the available values only.

The na.rm parameter takes a Boolean value: TRUE or FALSE. Its default is FALSE, in which case these functions return NA if missing values are present in the data. Take a look.

First, we calculate the mean of TBONDS without using na.rm:

mean(stocks$TBONDS)
[1] NA

As there are missing values on the variable, the result of the calculation is NA.

In the next example, we add na.rm:

mean(stocks$TBONDS, na.rm=TRUE)
[1] 5.463415

Now we receive our desired result, the mean of TBONDS for all available values.
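
The same pattern applies to other summary functions such as sd(). The sketch below assumes a short stand-in vector, tbonds_head, holding the six TBONDS values printed by head(stocks), so the numbers differ from those for the full variable.

```r
# First six TBONDS values from the head() output, including one NA
tbonds_head <- c(1, 4, NA, -3, 9, 2)

# Without na.rm the standard deviation is NA
sd(tbonds_head)

# With na.rm = TRUE the NA is dropped before the calculation
sd(tbonds_head, na.rm = TRUE)
mean(tbonds_head, na.rm = TRUE)

# Number of values the results are actually based on
n_used <- sum(!is.na(tbonds_head))
n_used
```

Checking how many values remain is a useful habit, because a statistic computed with na.rm = TRUE describes only the available cases.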

Practice

Try this yourself. Request the mean of TBONDS_D using na.rm.

mean(stocks$TBONDS_D, na.rm=TRUE)
[1] 0.804878