When you perform several transformations in R, your code can quickly become cluttered with nested parentheses or a series of temporary variables. Pipes offer a cleaner, more intuitive alternative: they let you write code that mirrors the logical flow of your analysis. Each step passes its output directly into the next, creating a readable sequence of operations that tells the story of your data transformation from start to finish.
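The basic mechanics are easiest to see in a minimal sketch: the native pipe `|>` passes its left-hand side as the first argument to the function on its right, so a piped chain is just another way of writing nested calls.

```r
# These two expressions are equivalent; the piped version reads
# left to right, in the order the operations actually happen:
sqrt(sum(c(1, 4, 9)))          # nested: sqrt of 14
c(1, 4, 9) |> sum() |> sqrt()  # piped: sum first, then sqrt
```

Both return 3.741657, the square root of 14.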
For example, consider using the following sequence of operations to compute a vector of standard deviations from the bfi dataset:

1. Subset the five scale-score columns: agree, consc, extra, neuro, and open.
2. Compute the covariance matrix of those columns (using pairwise deletion to handle missing values).
3. Extract the diagonal of the covariance matrix to get the variances.
4. Take the square roots of the variances to get the standard deviations.

We could implement these calculations as the following nested function calls.
sd <- sqrt(
diag(
cov(
bfi[c("agree", "consc", "extra", "neuro", "open")],
use = "pairwise"
)
)
)
sd
    agree     consc     extra     neuro      open
0.8984019 0.9513469 1.0609041 1.1963314 0.8083739
This approach does the job, but it’s certainly not ideal (unless you happen to be a big fan of Lisp). The main issue is readability: you need to read the nested function calls from the inside out to understand the sequence of operations. The situation gets even worse if we’re not as careful about formatting our code.
sqrt(diag(cov(bfi[c("agree", "consc", "extra", "neuro", "open")], use = "pairwise")))
    agree     consc     extra     neuro      open
0.8984019 0.9513469 1.0609041 1.1963314 0.8083739
To make things clearer, we could break the analysis into several steps and save the intermediate results as a temporary object that we repeatedly overwrite.
tmp <- bfi[c("agree", "consc", "extra", "neuro", "open")]
tmp <- cov(tmp, use = "pairwise")
tmp <- diag(tmp)
sd <- sqrt(tmp)
sd
    agree     consc     extra     neuro      open
0.8984019 0.9513469 1.0609041 1.1963314 0.8083739
This approach is clearer than using deeply nested functions, but notable problems remain. The repeated assignments multiply the opportunities for typos and bugs. For example, consider a situation where we ran the following code earlier in our analysis to store the five raw agreeableness items.

tmp <- bfi[paste0("a", 1:5)]

Then we tried to execute our focal calculation using the temporary-object approach, but we made a small typo.
Tmp <- bfi[c("agree", "consc", "extra", "neuro", "open")]
tmp <- cov(tmp, use = "pairwise")
tmp <- diag(tmp)
sd2 <- sqrt(tmp)
sd2
      a1       a2       a3       a4       a5
1.407737 1.172020 1.301834 1.479633 1.258512
Now, our vector contains the standard deviations of the five raw agreeableness items, not the intended scale scores. However, our result is still a length-five numeric vector, so we could easily miss the mistake and use the wrong vector in subsequent operations. For example, if we intended to use these standard deviations to standardize the personality scores, R would happily accept either version.
# Standardize using the correct SDs:
zDat1 <- scale(bfi[c("agree", "consc", "extra", "neuro", "open")], scale = sd)
# Standardize using the wrong SDs:
zDat2 <- scale(bfi[c("agree", "consc", "extra", "neuro", "open")], scale = sd2)
# Obviously, we get very different results:
head(zDat1 - zDat2, 10)
            agree       consc        extra        neuro        open
61617 -0.2626168 -0.29008810 -0.060198053 -0.057979301 -0.70203221
61618 -0.1820713 -0.05259196 0.149135917 0.102066091 -0.25956996
61620 -0.3431624 -0.05259196 0.009579937 0.070057012 0.09439985
61621 -0.0209802 -0.25050541 -0.095087048 -0.057979301 -0.61353976
61622 -0.2626168 0.02657341 0.114246922 0.006038856 -0.43655486
61623 -0.0209802 0.26406955 0.253802902 -0.025970223 0.18289230
61624 -0.0209802 0.02657341 0.009579937 -0.282042850 0.35987720
61629 -0.8264357 -0.17134003 -0.304421019 0.166084247 -0.17107750
61630 -0.4237079 -0.05259196 -0.156142790 0.070057012 0.18289230
61633 0.3012020 0.26406955 0.114246922 0.166084247 0.27138475
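A lightweight guard against this kind of silent mix-up is to assert the names we expect before using a vector. The helper below is hypothetical and not part of the original analysis; it is just a defensive sketch:

```r
# A small hypothetical helper: stop unless a vector carries
# exactly the expected names, otherwise return it unchanged.
check_names <- function(x, expected) {
  stopifnot(identical(names(x), expected))
  invisible(x)
}

# The typo'd sd2 from above carries item names (a1-a5), not the
# expected scale names, so this call would signal an error instead
# of letting the wrong SDs flow into later steps:
# check_names(sd2, c("agree", "consc", "extra", "neuro", "open"))
```

An error at this point is far cheaper than a silently mis-standardized dataset downstream.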
This type of error can be difficult to catch in real-world projects, and R won’t do anything to warn you. So, we want to minimize our opportunities for such mistakes, and pipes are a great way to do just that. Using a pipe, we can express the same sequence of calculations in a single expression with a clear, logical structure.
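Assuming the native pipe and the same bfi columns as above, the pipeline version of our standard-deviation calculation reads top to bottom in the order the operations happen:

```r
# Subset, covariance (pairwise deletion), variances, square roots:
sd <- bfi[c("agree", "consc", "extra", "neuro", "open")] |>
  cov(use = "pairwise") |>
  diag() |>
  sqrt()
```

Each step’s output feeds directly into the next, and no intermediate objects are created for us to mistype.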
When we compose multiple function calls using pipes, we call the resulting command a pipeline. Such pipelines solve both of the issues noted above: they avoid deep nesting, and they eliminate the error-prone intermediate assignments.
In practice, pipes are most useful for implementing sequences of complex data processing steps. For example, suppose we want to apply the following data processing steps to the bfi data.

1. Center age on 18.
2. Compute scale scores for extraversion and neuroticism by averaging their respective items.
3. Keep only cases with nonnegative centered age.
4. Retain only the extraversion and neuroticism scores along with age, gender, and education.
5. Sort the cases by extraversion in ascending order.

Using pipes and dplyr functions, we can write:
tmp1 <- bfi |>
  mutate(age = age - 18,
         extra = rowMeans(across(matches("^e\\d$")), na.rm = TRUE),
         neuro = rowMeans(across(matches("^n\\d$")), na.rm = TRUE)) |>
  filter(age >= 0) |>
  select(extra, neuro, age, gender, education) |>
  arrange(extra)
head(tmp1, 20)
   extra neuro age gender education
1 1.0 1.0 5 male some_college
2 1.0 1.0 1 male some_college
3 1.6 3.2 0 female some_college
4 1.8 4.0 18 female high_school_graduate
5 2.0 6.0 32 female some_college
6 2.2 4.4 5 female some_college
7 2.2 4.6 22 male <NA>
8 2.2 4.0 17 female some_college
9 2.2 2.6 22 female some_college
10 2.2 4.2 9 female graduate_degree
11 2.2 3.8 36 female college_graduate
12 2.2 2.8 11 female some_college
13 2.2 6.0 9 female some_college
14 2.2 1.2 3 female some_college
15 2.2 4.2 20 female some_college
16 2.2 1.4 32 female college_graduate
17 2.4 2.0 50 male graduate_degree
18 2.4 3.6 3 male college_graduate
19 2.4 2.8 36 female some_college
20 2.4 1.0 0 female some_high_school
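One detail in the pipeline deserves a note: matches("^e\\d$") selects the columns whose names consist of an "e" followed by a single digit (the items e1 through e5), without matching extra itself. The same regular expression can be checked in isolation with base R's grepl (a standalone sketch with made-up names):

```r
# Only names of the form "e" + one digit match the pattern;
# "extra", "ed", and "e10" do not:
nms <- c("e1", "e5", "extra", "ed", "e10", "n1")
nms[grepl("^e\\d$", nms)]
# [1] "e1" "e5"
```

The anchors ^ and $ are what keep longer names like "e10" out of the selection.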
When we implement these computations as a pipeline, each step is explicit, readable, and self-contained: it’s as clear as possible what happens and in what order. The same data processing written without pipes would be much harder to follow:
tmp2 <- bfi
tmp2$age <- tmp2$age - 18
tmp2$extra <- rowMeans(tmp2[grep("^e\\d$", colnames(tmp2))], na.rm = TRUE)
tmp2$neuro <- rowMeans(tmp2[grep("^n\\d$", colnames(tmp2))], na.rm = TRUE)
tmp2 <- tmp2[tmp2$age >= 0, c("extra", "neuro", "age", "gender", "education")]
tmp2 <- tmp2[order(tmp2$extra), ]
head(tmp2, 20)
      extra neuro age gender education
64642 1.0 1.0 5 male some_college
65974 1.0 1.0 1 male some_college
66676 1.6 3.2 0 female some_college
62336 1.8 4.0 18 female high_school_graduate
65392 2.0 6.0 32 female some_college
63026 2.2 4.4 5 female some_college
63324 2.2 4.6 22 male <NA>
63668 2.2 4.0 17 female some_college
64050 2.2 2.6 22 female some_college
64158 2.2 4.2 9 female graduate_degree
64621 2.2 3.8 36 female college_graduate
64724 2.2 2.8 11 female some_college
64753 2.2 6.0 9 female some_college
66440 2.2 1.2 3 female some_college
67357 2.2 4.2 20 female some_college
67560 2.2 1.4 32 female college_graduate
61661 2.4 2.0 50 male graduate_degree
61761 2.4 3.6 3 male college_graduate
61782 2.4 2.8 36 female some_college
63027 2.4 1.0 0 female some_high_school
Both approaches produce the same result (an equality check on tmp1 and tmp2 returns TRUE), but the pipeline is far clearer.