Manipulating Data Frames

Accessing Data Frame Elements

Because data frames are just lists, we can access a data frame’s columns using the same methods we would use for lists. To access a single column by name, the most convenient method is typically the $ operator.

d1 <- data.frame(a = sample(c(TRUE, FALSE), 10, replace = TRUE),
                 b = sample(c("foo", "bar"), 10, replace = TRUE),
                 c = runif(10)
                 )
d1$b

 [1] "bar" "foo" "foo" "foo" "bar" "foo" "foo" "bar" "bar" "foo"

data(iris)
iris$Petal.Length

  [1] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 1.5 1.6 1.4 1.1 1.2 1.5 1.3 1.4
 [19] 1.7 1.5 1.7 1.5 1.0 1.7 1.9 1.6 1.6 1.5 1.4 1.6 1.6 1.5 1.5 1.4 1.5 1.2
 [37] 1.3 1.4 1.3 1.5 1.3 1.3 1.3 1.6 1.9 1.4 1.6 1.4 1.5 1.4 4.7 4.5 4.9 4.0
 [55] 4.6 4.5 4.7 3.3 4.6 3.9 3.5 4.2 4.0 4.7 3.6 4.4 4.5 4.1 4.5 3.9 4.8 4.0
 [73] 4.9 4.7 4.3 4.4 4.8 5.0 4.5 3.5 3.8 3.7 3.9 5.1 4.5 4.5 4.7 4.4 4.1 4.0
 [91] 4.4 4.6 4.0 3.3 4.2 4.2 4.2 4.3 3.0 4.1 6.0 5.1 5.9 5.6 5.8 6.6 4.5 6.3
[109] 5.8 6.1 5.1 5.3 5.5 5.0 5.1 5.3 5.5 6.7 6.9 5.0 5.7 4.9 6.7 4.9 5.7 6.0
[127] 4.8 4.9 5.6 5.8 6.1 6.4 5.6 5.1 5.6 6.1 5.6 5.5 4.8 5.4 5.6 5.1 5.1 5.9
[145] 5.7 5.2 5.0 5.2 5.4 5.1

We can also use the single, [], or double, [[]], square bracket operators. As with lists, these operators differ in two respects: how many columns they can select and how they format the resulting selection.

[]: Can select multiple elements and always returns a data frame.
[[]]: Can select only one element and returns the column contents as a vector (or whatever type of object the column contained).

# Return a one-column data frame comprising the 'b' column from d1
d1["b"]

     b
1  bar
2  foo
3  foo
4  foo
5  bar
6  foo
7  foo
8  bar
9  bar
10 foo

# The same as above, but using the column index instead of the column name
d1[2]

     b
1  bar
2  foo
3  foo
4  foo
5  bar
6  foo
7  foo
8  bar
9  bar
10 foo

# Return a two-column data frame comprising the 'a' and 'b' columns from d1
d1[c("a", "b")]

       a   b
1   TRUE bar
2  FALSE foo
3   TRUE foo
4   TRUE foo
5   TRUE bar
6  FALSE foo
7  FALSE foo
8  FALSE bar
9  FALSE bar
10  TRUE foo

# The same as above, but using the column indices instead of the column names
d1[1:2]

       a   b
1   TRUE bar
2  FALSE foo
3   TRUE foo
4   TRUE foo
5   TRUE bar
6  FALSE foo
7  FALSE foo
8  FALSE bar
9  FALSE bar
10  TRUE foo

# Return the 'b' column from d1 as a character vector
d1[["b"]]

 [1] "bar" "foo" "foo" "foo" "bar" "foo" "foo" "bar" "bar" "foo"

# The same as above, but using the column index instead of the column name
d1[[2]]

 [1] "bar" "foo" "foo" "foo" "bar" "foo" "foo" "bar" "bar" "foo"
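One way to confirm the difference between the operators is to inspect the class of each result. The following is only a minimal sketch; the data frame `df` and its columns are hypothetical and not part of the examples above.

```r
# Hypothetical example data frame (not from the examples above)
df <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)

# Single brackets keep the data frame structure
class(df["y"])    # "data.frame"

# Double brackets and $ extract the column itself
class(df[["y"]])  # "character"
class(df$y)       # "character"

# Both extraction methods return the same vector
identical(df[["y"]], df$y)  # TRUE
```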

Matrix-Style Selection

Data frames also support matrix-style subsetting, where we define the selection by specifying both the row and column indices.

d1[1:2, 2:3]

    b         c
1 bar 0.9977583
2 foo 0.4694688

d1[ , 1:2]

       a   b
1   TRUE bar
2  FALSE foo
3   TRUE foo
4   TRUE foo
5   TRUE bar
6  FALSE foo
7  FALSE foo
8  FALSE bar
9  FALSE bar
10  TRUE foo

d1[2:3, ]

      a   b         c
2 FALSE foo 0.4694688
3  TRUE foo 0.1587081

In most cases, matrix-style subsetting behaves the same way as the [] list-style operator (you can select any number of elements, and the selection is returned as a data frame), but there is one exception: if you select a single column using matrix-style subsetting, the selection will be converted to a vector.

d1[ , 1]

 [1]  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE
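If a data frame is needed even when only one column is selected, the `drop` argument of the `[` operator can be set to `FALSE`. The sketch below illustrates this with a hypothetical data frame `df` that is not part of the examples above.

```r
# Hypothetical example data frame (not from the examples above)
df <- data.frame(x = 1:3, y = c(2.5, 3.5, 4.5))

# Default behavior: a single column is simplified to a vector
v <- df[ , 1]
is.data.frame(v)    # FALSE

# With drop = FALSE, the result stays a one-column data frame
sub <- df[ , 1, drop = FALSE]
is.data.frame(sub)  # TRUE
names(sub)          # "x"
```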

Modifying Data Frame Elements

Naturally, we can overwrite the columns of a data frame using the same procedures that we use to modify list slots. When modifying one column at a time, we simply select the column and assign the new values to it.

# View the original data frame
d1

       a   b         c
1   TRUE bar 0.9977583
2  FALSE foo 0.4694688
3   TRUE foo 0.1587081
4   TRUE foo 0.8263225
5   TRUE bar 0.4116577
6  FALSE foo 0.1169888
7  FALSE foo 0.2909142
8  FALSE bar 0.3785526
9  FALSE bar 0.9349888
10  TRUE foo 0.2315520

## Modify some columns
d1$a <- LETTERS[1:10]
d1[[2]] <- rnorm(10)
d1["c"] <- rep(c(TRUE, FALSE), each = 5)

# View the modified data frame
d1

   a          b     c
1  A  0.6267156  TRUE
2  B  1.5459290  TRUE
3  C  0.1925901  TRUE
4  D  0.8329911  TRUE
5  E -0.0529080  TRUE
6  F  1.2524460 FALSE
7  G -0.4795803 FALSE
8  H -0.7851923 FALSE
9  I -1.5581979 FALSE
10 J -0.7658068 FALSE

When modifying multiple columns with the [] operator, it’s best to supply the replacement values as a data frame or list with the same size as the selected columns.

# Replace the first two columns of d1 with an equivalently sized data frame
# extracted from the 'iris' dataset
d1[1:2] <- iris[1:10, 1:2]
d1

     a   b     c
1  5.1 3.5  TRUE
2  4.9 3.0  TRUE
3  4.7 3.2  TRUE
4  4.6 3.1  TRUE
5  5.0 3.6  TRUE
6  5.4 3.9 FALSE
7  4.6 3.4 FALSE
8  5.0 3.4 FALSE
9  4.4 2.9 FALSE
10 4.9 3.1 FALSE

# Replace the 'a' and 'c' columns in d1 with an equivalently sized list
d1[c("a", "c")] <- list(rnorm(10), runif(10))
d1

             a   b          c
1   0.64522752 3.5 0.64463832
2  -0.94403280 3.0 0.68292331
3   1.20767989 3.2 0.39321446
4   1.13161260 3.1 0.06890338
5  -0.94918157 3.6 0.22412153
6  -1.94258303 3.9 0.06160546
7   0.08470501 3.4 0.79903592
8  -0.78728475 3.4 0.31985047
9  -1.82277772 2.9 0.34235694
10  2.04021870 3.1 0.61118128

Matrix-Style Selection

If we only want to replace part of a column, we can use matrix-style selection to choose the target cells.

d1[1:5, 2] <- 41:45
d1

             a    b          c
1   0.64522752 41.0 0.64463832
2  -0.94403280 42.0 0.68292331
3   1.20767989 43.0 0.39321446
4   1.13161260 44.0 0.06890338
5  -0.94918157 45.0 0.22412153
6  -1.94258303  3.9 0.06160546
7   0.08470501  3.4 0.79903592
8  -0.78728475  3.4 0.31985047
9  -1.82277772  2.9 0.34235694
10  2.04021870  3.1 0.61118128

d1[3:6, c("a", "c")] <- list(-99, 888)
d1

              a    b           c
1    0.64522752 41.0   0.6446383
2   -0.94403280 42.0   0.6829233
3  -99.00000000 43.0 888.0000000
4  -99.00000000 44.0 888.0000000
5  -99.00000000 45.0 888.0000000
6  -99.00000000  3.9 888.0000000
7    0.08470501  3.4   0.7990359
8   -0.78728475  3.4   0.3198505
9   -1.82277772  2.9   0.3423569
10   2.04021870  3.1   0.6111813
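The row index in a matrix-style replacement does not have to be numeric. As a small sketch (the data frame `df` and its column names are hypothetical), a logical condition can pick out the cells to overwrite:

```r
# Hypothetical example data frame (not from the examples above)
df <- data.frame(score = c(-2, 5, -1, 8),
                 label = c("a", "b", "c", "d"),
                 stringsAsFactors = FALSE)

# Replace only the negative scores; the other cells are left untouched
df[df$score < 0, "score"] <- 0
df$score  # 0 5 0 8
```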

Recycling

When the replacement size doesn’t match the selection size, R will use recycling to resolve the discrepancy, but it’s not always easy to predict how the replacement will behave.

# Replace the first two columns of d1 by recycling the vector `1:5`
d1[1:2] <- 1:5
d1

   a b           c
1  1 1   0.6446383
2  2 2   0.6829233
3  3 3 888.0000000
4  4 4 888.0000000
5  5 5 888.0000000
6  1 1 888.0000000
7  2 2   0.7990359
8  3 3   0.3198505
9  4 4   0.3423569
10 5 5   0.6111813

# Replace the 'a' and 'c' columns in d1 with a list containing vectors that
# will need to be recycled
d1[c("a", "c")] <- list(c("yes", "no"), 7:8)
d1

     a b c
1  yes 1 7
2   no 2 8
3  yes 3 7
4   no 4 8
5  yes 5 7
6   no 1 8
7  yes 2 7
8   no 3 8
9  yes 4 7
10  no 5 8

As with matrices, R is oddly specific (in a slightly different way) about the kinds of size discrepancies it will automatically resolve when modifying data frames.

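OK
- Replacement length cleanly divides the selection length (the replacement is recycled)
- Replacement length exceeds the selection length (the extra values are dropped, typically with a warning)
- Replacement list contains more slots than columns selected (the extra slots are dropped, typically with a warning)
- Replacement data frame contains more columns than columns selected (the extra columns are dropped, typically with a warning)
Not OK
- Replacement is shorter than the selection, and its length does not cleanly divide the selection length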

# Works: Replace the first two columns of d1 by using the first 20 elements
# from `100:500`
d1[1:2] <- 100:500
d1

     a   b c
1  100 110 7
2  101 111 8
3  102 112 7
4  103 113 8
5  104 114 7
6  105 115 8
7  106 116 7
8  107 117 8
9  108 118 7
10 109 119 8

# Works: Replace the 'a' and 'c' columns in d1 with a list containing vectors
# that are too long
d1[c("a", "c")] <- list(rnorm(100), runif(100))
d1

            a   b          c
1  -1.7190793 110 0.06899411
2   1.4432727 111 0.69843387
3  -0.2982533 112 0.63054671
4   0.3895335 113 0.85015793
5  -0.8727988 114 0.12361847
6  -0.1689203 115 0.94526449
7  -0.9592595 116 0.86566085
8   0.5259255 117 0.64610659
9   1.8416340 118 0.54993571
10  1.0975011 119 0.61342601

# Works: Replace the 'a' and 'c' columns in d1 with the first two slots in a
# length-3 list
d1[c("a", "c")] <- list(1, 2, 3)
d1

# Works: Replace the 'a' and 'c' columns in d1 with the first two columns
# from a three-column data frame
d1[c("a", "c")] <- data.frame("foo", "bar", "baz")
d1

     a   b   c
1  foo 110 bar
2  foo 111 bar
3  foo 112 bar
4  foo 113 bar
5  foo 114 bar
6  foo 115 bar
7  foo 116 bar
8  foo 117 bar
9  foo 118 bar
10 foo 119 bar

# Fails: Replace the first two columns of d1 using the non-conformable vector 1:3
d1[1:2] <- 1:3

Error in `[<-.data.frame`(`*tmp*`, 1:2, value = 1:3): replacement has 3 items, need 20

# Fails: Replace the 'a' and 'c' columns in d1 with a list containing
# non-conformable vectors
d1[c("a", "c")] <- list(letters[4], 1:8)

Error in `[<-.data.frame`(`*tmp*`, c("a", "c"), value = list("d", 1:8)): replacement element 2 has 8 rows, need 10
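The recycling rules above can be checked without interrupting a script by wrapping the assignment in `tryCatch()`. The helper `can_replace()` below is a hypothetical function written only for this illustration. It tries to replace the first two columns of a fresh 10-row data frame and reports whether the assignment succeeded. Warnings are suppressed because an oversized replacement is truncated rather than rejected.

```r
# Hypothetical helper: TRUE if `value` can replace the first two columns of a
# 10-row data frame without raising an error
can_replace <- function(value) {
  df <- data.frame(a = 1:10, b = 1:10, c = 1:10)
  tryCatch({
    suppressWarnings(df[1:2] <- value)  # oversized replacements only warn
    TRUE
  }, error = function(e) FALSE)
}

can_replace(1:5)              # TRUE:  5 cleanly divides the 20 selected cells
can_replace(1:40)             # TRUE:  too long, so the extra values are dropped
can_replace(1:3)              # FALSE: 3 does not cleanly divide 20
can_replace(list(1:2, 1:5))   # TRUE:  2 and 5 both cleanly divide 10 rows
can_replace(list(1:3, 1:5))   # FALSE: 3 does not cleanly divide 10 rows
```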

Adding Columns

As with lists, we can add new columns to an existing data frame using the $ or [[]] operators.

# Create a two-column data frame
(d2 <- data.frame(a = 1:10, b = "bob"))

    a   b
1   1 bob
2   2 bob
3   3 bob
4   4 bob
5   5 bob
6   6 bob
7   7 bob
8   8 bob
9   9 bob
10 10 bob

## Various ways of adding new single columns
d2$c <- letters[1:10]
d2[["d"]] <- runif(10)
d2[[5]] <- rnorm(10)
d2["alice"] <- TRUE
d2

    a   b c          d         V5 alice
1   1 bob a 0.61442041 -0.0105563  TRUE
2   2 bob b 0.99336024 -0.2395205  TRUE
3   3 bob c 0.35083721 -0.6678822  TRUE
4   4 bob d 0.58937418 -1.4522268  TRUE
5   5 bob e 0.09605802  0.3873236  TRUE
6   6 bob f 0.09893925 -0.5918080  TRUE
7   7 bob g 0.51361014 -0.8701455  TRUE
8   8 bob h 0.77551488 -1.3784525  TRUE
9   9 bob i 0.88055553 -0.7888512  TRUE
10 10 bob j 0.44399214  0.5668728  TRUE

We can add multiple columns using the [] operator.

d2[7:8] <- rnorm(20)
d2[c("foo", "bar")] <- list(TRUE, FALSE)
d2

    a   b c          d         V5 alice         V7         V8  foo   bar
1   1 bob a 0.61442041 -0.0105563  TRUE -1.5939374 -1.0793539 TRUE FALSE
2   2 bob b 0.99336024 -0.2395205  TRUE  1.4772136 -1.1600501 TRUE FALSE
3   3 bob c 0.35083721 -0.6678822  TRUE -1.1498589 -1.2191487 TRUE FALSE
4   4 bob d 0.58937418 -1.4522268  TRUE  0.4329690 -0.9700584 TRUE FALSE
5   5 bob e 0.09605802  0.3873236  TRUE  0.5142483 -0.2994792 TRUE FALSE
6   6 bob f 0.09893925 -0.5918080  TRUE -0.5240982  0.1553770 TRUE FALSE
7   7 bob g 0.51361014 -0.8701455  TRUE -0.1236745 -0.4210625 TRUE FALSE
8   8 bob h 0.77551488 -1.3784525  TRUE  0.8173379 -0.3661413 TRUE FALSE
9   9 bob i 0.88055553 -0.7888512  TRUE  0.2495192 -0.1392587 TRUE FALSE
10 10 bob j 0.44399214  0.5668728  TRUE  0.6023962  1.1628371 TRUE FALSE
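Columns added by numeric index receive default names such as V5, which are rarely informative. As a brief sketch (the data frame `df` and the name "score" are hypothetical), the names can be inspected and corrected afterwards with `names()`:

```r
# Hypothetical example data frame (not from the examples above)
df <- data.frame(id = 1:3)

df$flag <- TRUE                   # the single value is recycled to all 3 rows
df[[3]] <- c(0.1, 0.2, 0.3)       # added by index, so it gets a default name
df[c("lo", "hi")] <- list(0, 1)   # two new columns from a length-2 list

names(df)  # "id" "flag" "V3" "lo" "hi"

# Give the index-created column a descriptive name
names(df)[3] <- "score"
names(df)  # "id" "flag" "score" "lo" "hi"
```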

Practice

Run the following code to create an empty data frame containing 10 observations of the 3 variables: a, b, c. Then populate the data frame as described below.

Fill column a with the integer sequence from -9 to 0.
- Use the column name to assign the new values.
Fill column b with the even integers between 1 and 20 (inclusive).
- Use the numeric column index to assign the new values.
Replace the odd rows in column c with the odd integers between 11 and 20 (inclusive).
- Do not overwrite the missing values in the even rows.

Solution

df <- data.frame(a = rep(NA, 10),
                 b = rep(NA, 10),
                 c = rep(NA, 10)
                 )
df

    a  b  c
1  NA NA NA
2  NA NA NA
3  NA NA NA
4  NA NA NA
5  NA NA NA
6  NA NA NA
7  NA NA NA
8  NA NA NA
9  NA NA NA
10 NA NA NA

df$a <- -9:0
df[2] <- seq(2, 20, 2)
df[seq(1, 9, 2), "c"] <- seq(11, 19, 2)
df

    a  b  c
1  -9  2 11
2  -8  4 NA
3  -7  6 13
4  -6  8 NA
5  -5 10 15
6  -4 12 NA
7  -3 14 17
8  -2 16 NA
9  -1 18 19
10  0 20 NA