When you perform several transformations in R, your code can quickly become cluttered with nested parentheses or a series of temporary variables. Pipes offer a cleaner, more intuitive alternative: they let you write code that mirrors the logical flow of your analysis. Each step passes its output directly into the next, creating a readable sequence of operations that tells the story of your data transformation from start to finish.
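The basic mechanics are easiest to see in a minimal sketch: the native pipe `|>` passes its left-hand side as the first argument to the function on its right, so a piped chain is just another way of writing nested calls.

```r
# These two expressions are equivalent; the piped version reads
# left to right, in the order the operations actually happen:
sqrt(sum(c(1, 4, 9)))          # nested: sqrt of 14
c(1, 4, 9) |> sum() |> sqrt()  # piped: sum first, then sqrt
```

Both return 3.741657, the square root of 14.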
For example, consider using the following sequence of operations to compute a vector of standard deviations from the bfi dataset:

1. Subset the five scale-score columns: agree, consc, extra, neuro, and open.
2. Compute the covariance matrix of those columns (using pairwise deletion to handle missing values).
3. Extract the diagonal of the covariance matrix to get the variances.
4. Take the square roots of the variances to get the standard deviations.

We could implement these calculations as the following nested function calls.
sd <- sqrt(
diag(
cov(
bfi[c("agree", "consc", "extra", "neuro", "open")],
use = "pairwise"
)
)
)
sd
    agree     consc     extra     neuro      open
0.8984019 0.9513469 1.0609041 1.1963314 0.8083739
This approach does the job, but it’s certainly not ideal (unless you happen to be a big fan of Lisp). The main issue is readability: you need to read the nested function calls from the inside out to understand the sequence of operations. The situation gets even worse if we’re not as careful about formatting our code.
sqrt(diag(cov(bfi[c("agree", "consc", "extra", "neuro", "open")], use = "pairwise")))
    agree     consc     extra     neuro      open
0.8984019 0.9513469 1.0609041 1.1963314 0.8083739
To make things clearer, we could break the analysis into several steps and save the intermediate results as a temporary object that we repeatedly overwrite.
tmp <- bfi[c("agree", "consc", "extra", "neuro", "open")]
tmp <- cov(tmp, use = "pairwise")
tmp <- diag(tmp)
sd <- sqrt(tmp)
sd
    agree     consc     extra     neuro      open
0.8984019 0.9513469 1.0609041 1.1963314 0.8083739
This approach is clearer than using deeply nested functions, but notable problems remain. The repeated assignments multiply the opportunities for typos and bugs. For example, consider a situation where we ran the following code earlier in our analysis to store the five raw agreeableness items.

tmp <- bfi[paste0("a", 1:5)]

Then we tried to execute our focal calculation using the temporary-object approach, but we made a small typo.
Tmp <- bfi[c("agree", "consc", "extra", "neuro", "open")]
tmp <- cov(tmp, use = "pairwise")
tmp <- diag(tmp)
sd2 <- sqrt(tmp)
sd2
      a1       a2       a3       a4       a5
1.407737 1.172020 1.301834 1.479633 1.258512
Now, our vector contains the standard deviations of the five raw agreeableness items, not the intended scale scores. However, our result is still a length-five numeric vector, so we could easily miss the mistake and use the wrong vector in subsequent operations. For example, if we intended to use these standard deviations to standardize the personality scores, R would happily accept either version.
# Standardize using the correct SDs:
zDat1 <- scale(bfi[c("agree", "consc", "extra", "neuro", "open")], scale = sd)
# Standardize using the wrong SDs:
zDat2 <- scale(bfi[c("agree", "consc", "extra", "neuro", "open")], scale = sd2)
# Obviously, we get very different results:
head(zDat1 - zDat2, 10)
            agree       consc        extra        neuro        open
61617 -0.2626168 -0.29008810 -0.060198053 -0.057979301 -0.70203221
61618 -0.1820713 -0.05259196 0.149135917 0.102066091 -0.25956996
61620 -0.3431624 -0.05259196 0.009579937 0.070057012 0.09439985
61621 -0.0209802 -0.25050541 -0.095087048 -0.057979301 -0.61353976
61622 -0.2626168 0.02657341 0.114246922 0.006038856 -0.43655486
61623 -0.0209802 0.26406955 0.253802902 -0.025970223 0.18289230
61624 -0.0209802 0.02657341 0.009579937 -0.282042850 0.35987720
61629 -0.8264357 -0.17134003 -0.304421019 0.166084247 -0.17107750
61630 -0.4237079 -0.05259196 -0.156142790 0.070057012 0.18289230
61633 0.3012020 0.26406955 0.114246922 0.166084247 0.27138475
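A lightweight guard against this kind of silent mix-up is to assert the names we expect before using a vector. The helper below is hypothetical and not part of the original analysis; it is just a defensive sketch:

```r
# A small hypothetical helper: stop unless a vector carries
# exactly the expected names, otherwise return it unchanged.
check_names <- function(x, expected) {
  stopifnot(identical(names(x), expected))
  invisible(x)
}

# The typo'd sd2 from above carries item names (a1-a5), not the
# expected scale names, so this call would signal an error instead
# of letting the wrong SDs flow into later steps:
# check_names(sd2, c("agree", "consc", "extra", "neuro", "open"))
```

An error at this point is far cheaper than a silently mis-standardized dataset downstream.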
This type of error can be difficult to catch in real-world projects, and R won’t do anything to warn you. So, we want to minimize our opportunities for such mistakes, and pipes are a great way to do just that. Using a pipe, we can express the same sequence of calculations in a single expression with a clear, logical structure.
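Assuming the native pipe and the same bfi columns as above, the pipeline version of our standard-deviation calculation reads top to bottom in the order the operations happen:

```r
# Subset, covariance (pairwise deletion), variances, square roots:
sd <- bfi[c("agree", "consc", "extra", "neuro", "open")] |>
  cov(use = "pairwise") |>
  diag() |>
  sqrt()
```

Each step’s output feeds directly into the next, and no intermediate objects are created for us to mistype.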
When we compose multiple function calls using pipes, we call the resulting command a pipeline. Such pipelines solve both of the issues noted above: they avoid deep nesting, and they eliminate the error-prone intermediate assignments.
In practice, pipes are most useful for implementing sequences of complex data processing steps. For example, suppose we want to apply the following data processing steps to the bfi data.

1. Center age on 18.
2. Compute scale scores for extraversion and neuroticism by averaging their respective items.
3. Keep only cases with nonnegative centered age.
4. Retain only the extraversion and neuroticism scores along with age, gender, and education.
5. Sort the cases by extraversion in ascending order.

Using pipes and dplyr functions, we can write:
tmp1 <- bfi |>
  mutate(age = age - 18,
         extra = rowMeans(across(matches("^e\\d$")), na.rm = TRUE),
         neuro = rowMeans(across(matches("^n\\d$")), na.rm = TRUE)) |>
  filter(age >= 0) |>
  select(extra, neuro, age, gender, education) |>
  arrange(extra)
head(tmp1, 20)
   extra neuro age gender education
1 1.0 1.0 5 male some_college
2 1.0 1.0 1 male some_college
3 1.6 3.2 0 female some_college
4 1.8 4.0 18 female high_school_graduate
5 2.0 6.0 32 female some_college
6 2.2 4.4 5 female some_college
7 2.2 4.6 22 male <NA>
8 2.2 4.0 17 female some_college
9 2.2 2.6 22 female some_college
10 2.2 4.2 9 female graduate_degree
11 2.2 3.8 36 female college_graduate
12 2.2 2.8 11 female some_college
13 2.2 6.0 9 female some_college
14 2.2 1.2 3 female some_college
15 2.2 4.2 20 female some_college
16 2.2 1.4 32 female college_graduate
17 2.4 2.0 50 male graduate_degree
18 2.4 3.6 3 male college_graduate
19 2.4 2.8 36 female some_college
20 2.4 1.0 0 female some_high_school
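One detail in the pipeline deserves a note: matches("^e\\d$") selects the columns whose names consist of an "e" followed by a single digit (the items e1 through e5), without matching extra itself. The same regular expression can be checked in isolation with base R's grepl (a standalone sketch with made-up names):

```r
# Only names of the form "e" + one digit match the pattern;
# "extra", "ed", and "e10" do not:
nms <- c("e1", "e5", "extra", "ed", "e10", "n1")
nms[grepl("^e\\d$", nms)]
# [1] "e1" "e5"
```

The anchors ^ and $ are what keep longer names like "e10" out of the selection.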
When we implement these computations as a pipeline, each step is explicit, readable, and self-contained: it’s as clear as possible what happens and in what order. The same data processing written without pipes would be much harder to follow:
tmp2 <- bfi
tmp2$age <- tmp2$age - 18
tmp2$extra <- rowMeans(tmp2[grep("^e\\d$", colnames(tmp2))], na.rm = TRUE)
tmp2$neuro <- rowMeans(tmp2[grep("^n\\d$", colnames(tmp2))], na.rm = TRUE)
tmp2 <- tmp2[tmp2$age >= 0, c("extra", "neuro", "age", "gender", "education")]
tmp2 <- tmp2[order(tmp2$extra), ]
head(tmp2, 20)
      extra neuro age gender education
64642 1.0 1.0 5 male some_college
65974 1.0 1.0 1 male some_college
66676 1.6 3.2 0 female some_college
62336 1.8 4.0 18 female high_school_graduate
65392 2.0 6.0 32 female some_college
63026 2.2 4.4 5 female some_college
63324 2.2 4.6 22 male <NA>
63668 2.2 4.0 17 female some_college
64050 2.2 2.6 22 female some_college
64158 2.2 4.2 9 female graduate_degree
64621 2.2 3.8 36 female college_graduate
64724 2.2 2.8 11 female some_college
64753 2.2 6.0 9 female some_college
66440 2.2 1.2 3 female some_college
67357 2.2 4.2 20 female some_college
67560 2.2 1.4 32 female college_graduate
61661 2.4 2.0 50 male graduate_degree
61761 2.4 3.6 3 male college_graduate
61782 2.4 2.8 36 female some_college
63027 2.4 1.0 0 female some_high_school
Both approaches produce the same result (an equality check on tmp1 and tmp2 returns TRUE), but the pipeline is far clearer.