Usage

How Pipes Work

The pipe operator passes the value on its left-hand side as the first argument of the function call on its right-hand side. Hence, these two expressions are equivalent:

mean(bfi$age)

[1] 28.78214

bfi$age |> mean()

[1] 28.78214
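The same equivalence can be checked without the bfi data. The following is a minimal sketch that assumes R 4.1.0 or later (where the native pipe is available) and substitutes the built-in mtcars dataset, which is an assumption made only to keep the example self-contained.

```r
# Assumes R >= 4.1.0 (native pipe) and the built-in mtcars data,
# used here in place of bfi so the example is self-contained.
direct <- mean(mtcars$mpg)
piped  <- mtcars$mpg |> mean()

# Both calls return the same value.
identical(direct, piped)

# The parser rewrites the pipeline into an ordinary nested call.
quote(mtcars$mpg |> mean())
```

The last line shows that the pipe is only a different way of writing the call: quoting the pipeline returns the nested expression mean(mtcars$mpg).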

Practice

Create a pipeline to complete the following operations on the mtcars dataset.

Standardize each variable in the mtcars dataset.
Compute the mean of each variable in the standardized mtcars data.

Your solution should be a single pipeline that produces a named numeric vector containing one mean per variable.

You can use the following Base R functions in your solution:

scale(): Standardize variables
colMeans(): Compute the mean of each column in a matrix or data frame.

mtcars |> scale() |> colMeans()

          mpg           cyl          disp            hp          drat 
 7.112366e-17 -1.474515e-17 -9.085614e-17  1.040834e-17 -2.918672e-16 
           wt          qsec            vs            am          gear 
 4.682398e-17  5.299580e-16  6.938894e-18  4.510281e-17 -3.469447e-18 
         carb 
 3.165870e-17
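The means of standardized variables should all be zero, and the tiny nonzero values printed by the pipeline are floating-point rounding error. As a rough sketch, one way to confirm this is to extend the same pipeline with round() or all.equal(); the object name std_means is introduced here only for illustration.

```r
# Column means of the standardized data, computed with the same pipeline.
std_means <- mtcars |> scale() |> colMeans()

# Rounding shows that every mean is zero up to floating-point error.
std_means |> round(10)

# all.equal() compares with a numeric tolerance rather than exactly.
# unname() drops the variable names so only the values are compared.
std_means |> unname() |> all.equal(rep(0, 11))
```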

The data piped into a function only fill the first argument. We’re free to specify any additional inputs in the normal way. Hence, the following two expressions are also equivalent.

var(bfi$a1, na.rm = TRUE)

[1] 1.981724

bfi$a1 |> var(na.rm = TRUE)

[1] 1.981724
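The same pattern can be tried with data that ship with R. The sketch below assumes the built-in airquality dataset (not part of the original example), whose Ozone column contains missing values, and shows that extra arguments are written exactly as they would be in an ordinary call.

```r
# airquality is a built-in dataset; its Ozone column contains missing values.
direct <- mean(airquality$Ozone, na.rm = TRUE)
piped  <- airquality$Ozone |> mean(na.rm = TRUE)
identical(direct, piped)

# Without na.rm = TRUE, the missing values propagate and the result is NA.
airquality$Ozone |> mean()

# Several extra arguments can be supplied in the same way.
airquality$Ozone |> quantile(probs = c(0.25, 0.75), na.rm = TRUE)
```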

Practice

Create a pipeline to complete the following operations in one expression.

Create a random sample of 1000 normally distributed values with:
- Mean of 10
- SD of 5
Compute the range of your sample.
- I.e., the difference between the largest and smallest value.
Check if the range of your sample is at least 1.5 times the population variance.

Your solution should be a single pipeline that produces a length-one logical vector.

You probably want to use the following functions somewhere in your pipeline.

rnorm()
range()
diff()

m <- 10
s <- 5

rnorm(1000, mean = m, sd = s) |>
  range() |>
  diff() > (1.5 * s^2)

[1] FALSE
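The comparison at the end of this pipeline works because the pipe binds more tightly than comparison operators in R's operator precedence, so the whole pipeline is evaluated before > is applied. A small sketch with a fixed, made-up vector (so the result does not depend on random numbers) illustrates the grouping:

```r
x <- c(2, 9, 4)

# The pipeline is evaluated first, then the comparison:
# diff(range(x)) is 9 - 2 = 7, and 7 > 5 is TRUE.
result <- x |> range() |> diff() > 5
result

# Explicit parentheses give the same result and make the grouping visible.
grouped <- (x |> range() |> diff()) > 5
identical(result, grouped)
```

Adding the parentheses is optional, but it can make the intent clearer to readers who are unsure of the precedence rules.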

What if the dataset isn’t the first argument?

Many R functions, particularly those that use a so-called formula interface, don’t take the input data as their first argument. If you pipe a dataset into such a function without any adjustment, the data are matched to the wrong argument and the call fails. For example, in the following code, we try to use the bfi dataset to estimate a linear regression model wherein age and open predict extra.

bfi |> lm(extra ~ age + open)

Error in `as.data.frame.default()`:
! cannot coerce class '"formula"' to a data.frame

In these cases, you can use the special placeholder token _ (available in R 4.2.0 and later) to tell R explicitly where to insert the piped object.

bfi |> lm(extra ~ age + open, data = _)


Call:
lm(formula = extra ~ age + open, data = bfi)

Coefficients:
(Intercept)          age         open  
   2.752007     0.004406     0.276073

Note that we must name the argument to which we assign _. The following won’t work.

bfi |> lm(extra ~ age + open, _)

Error in lm(extra ~ age + open, "_"): pipe placeholder can only be used as a named argument (<input>:1:8)

This trick makes it possible to use nearly any function in a pipeline, even when the dataset isn’t the first argument.
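As a self-contained illustration, the sketch below applies the placeholder to the built-in mtcars data (an assumption made so the code runs without bfi) and compares it with one possible alternative: wrapping the call in an anonymous function. The names coef_placeholder and coef_lambda are hypothetical and used only for this comparison.

```r
# Placeholder version (R >= 4.2.0): the piped data frame fills the
# named data argument.
coef_placeholder <- mtcars |>
  lm(mpg ~ wt, data = _) |>
  coef()

# Alternative (R >= 4.1.0): wrap the call in an anonymous function
# whose first argument receives the piped data.
coef_lambda <- mtcars |>
  (\(d) lm(mpg ~ wt, data = d))() |>
  coef()

# Both approaches fit the same model, so the coefficients agree.
all.equal(coef_placeholder, coef_lambda)
```

The placeholder is usually the shorter option; the anonymous function may be useful when the piped object is needed more than once in the downstream call.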

Practice

Create a pipeline to perform the following operations with the bfi dataset:

Randomly sample \(N = 500\) rows.
- Use dplyr::slice_sample().
Use the sampled rows to regress agree onto extra, open, and gender.
- Use lm().
Extract the residuals from the linear regression model.
Compute the sum of the absolute values of the residuals.

Solution

library(dplyr)
set.seed(235711)

bfi |> 
  slice_sample(n = 500) |>
  lm(agree ~ extra + open + gender, data = _) |>
  resid() |>
  abs() |>
  sum()

[1] 318.6229

NOTES:

If you don’t set the same (or any) seed, your code will produce a different value than the solution shown here because your code will sample a different set of rows.
If you set the same seed by running set.seed(235711) before your pipeline, you should get exactly the same result because slice_sample() will use the same sequence of pseudorandom numbers to pick the rows it “randomly” samples.
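The effect of the seed can be demonstrated without dplyr or the bfi data. In the sketch below, base R's sample() stands in for slice_sample() and mtcars stands in for bfi; both substitutions, and the helper sample_and_sum(), are assumptions made purely for illustration.

```r
# Hypothetical helper: draw 10 random rows of mtcars and sum their mpg values.
sample_and_sum <- function() {
  rows <- sample(nrow(mtcars), size = 10)
  mtcars$mpg[rows] |> sum()
}

# Resetting the seed before each call replays the same pseudorandom sequence.
set.seed(235711)
first <- sample_and_sum()

set.seed(235711)
second <- sample_and_sum()

# Same seed, same sampled rows, same result.
identical(first, second)
```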