Because data frames are just lists, we can access a data frame’s columns using the same methods we would use for lists. To access a single column by name, the most efficient method is typically the $ operator.
We can also use the single, [], or double, [[]], square bracket operators. As with lists, these operators differ in two respects: how many columns they can select and how they format the resulting selection.
[]: Can select multiple elements and always returns a data frame.
[[]]: Can select only one element and returns the column contents as a vector (or whatever type of object the column contained).
# Return a one-column data frame comprising the 'b' column from d1
d1["b"]
b
1 bar
2 foo
3 foo
4 foo
5 bar
6 foo
7 foo
8 bar
9 bar
10 foo
# The same as above, but using the column index instead of the column name
d1[2]
b
1 bar
2 foo
3 foo
4 foo
5 bar
6 foo
7 foo
8 bar
9 bar
10 foo
# Return a two-column data frame comprising the 'a' and 'b' columns from d1
d1[c("a", "b")]
a b
1 TRUE bar
2 FALSE foo
3 TRUE foo
4 TRUE foo
5 TRUE bar
6 FALSE foo
7 FALSE foo
8 FALSE bar
9 FALSE bar
10 TRUE foo
# The same as above, but using the column indices instead of the column names
d1[1:2]
a b
1 TRUE bar
2 FALSE foo
3 TRUE foo
4 TRUE foo
5 TRUE bar
6 FALSE foo
7 FALSE foo
8 FALSE bar
9 FALSE bar
10 TRUE foo
# Return the 'b' column from d1 as a character vector
d1[["b"]]
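The difference between the two bracket operators is easiest to see by checking the class of what they return. The following is a minimal sketch using a small hypothetical data frame, `df`, as a stand-in for `d1` (it is not the data shown above). It sets `stringsAsFactors = FALSE` explicitly so the result does not depend on the R version.

```r
# Hypothetical stand-in for d1 (not the data used above)
df <- data.frame(a = c(TRUE, FALSE, TRUE),
                 b = c("bar", "foo", "foo"),
                 stringsAsFactors = FALSE)

# Single brackets keep the data frame structure
is.data.frame(df["b"])       # TRUE

# Double brackets and $ extract the column contents themselves
is.character(df[["b"]])      # TRUE
identical(df$b, df[["b"]])   # TRUE
```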
Data frames also support matrix-style subsetting, where we define the selection by specifying both the row and column indices.
d1[1:2, 2:3]
b c
1 bar 0.9977583
2 foo 0.4694688
d1[ , 1:2]
a b
1 TRUE bar
2 FALSE foo
3 TRUE foo
4 TRUE foo
5 TRUE bar
6 FALSE foo
7 FALSE foo
8 FALSE bar
9 FALSE bar
10 TRUE foo
d1[2:3, ]
a b c
2 FALSE foo 0.4694688
3 TRUE foo 0.1587081
In most cases, matrix-style subsetting behaves the same way as the [] list-style operator (you can select any number of elements, and the selection is returned as a data frame), but there is one exception: if you select a single column using matrix-style subsetting, the selection will be converted to a vector.
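If you want matrix-style subsetting to return a data frame even for a single column, you can pass `drop = FALSE`. The sketch below illustrates this with a hypothetical data frame, `df`, which is an assumption for illustration and not part of the example data above.

```r
# Hypothetical example data
df <- data.frame(a = 1:3, b = c("x", "y", "z"), stringsAsFactors = FALSE)

v <- df[, "b"]                 # single column: simplified to a vector
k <- df[, "b", drop = FALSE]   # drop = FALSE keeps the data frame structure

is.data.frame(v)   # FALSE
is.data.frame(k)   # TRUE
```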
Naturally, we can overwrite the columns of a data frame using the same procedures that we use to modify list slots. When modifying one column at a time, we directly apply the intuitive operations.
# View the original data frame
d1
a b c
1 TRUE bar 0.9977583
2 FALSE foo 0.4694688
3 TRUE foo 0.1587081
4 TRUE foo 0.8263225
5 TRUE bar 0.4116577
6 FALSE foo 0.1169888
7 FALSE foo 0.2909142
8 FALSE bar 0.3785526
9 FALSE bar 0.9349888
10 TRUE foo 0.2315520
## Modify some list elements
d1$a <- LETTERS[1:10]
d1[[2]] <- rnorm(10)
d1["c"] <- rep(c(TRUE, FALSE), each = 5)

# View the modified data frame
d1
a b c
1 A 0.6267156 TRUE
2 B 1.5459290 TRUE
3 C 0.1925901 TRUE
4 D 0.8329911 TRUE
5 E -0.0529080 TRUE
6 F 1.2524460 FALSE
7 G -0.4795803 FALSE
8 H -0.7851923 FALSE
9 I -1.5581979 FALSE
10 J -0.7658068 FALSE
When modifying multiple columns with the [] operator, it’s best to supply the replacement values as a data frame or list with the same size as the selected columns.
# Replace the first two columns of d1 with an equivalently sized data frame
# extracted from the 'iris' dataset
d1[1:2] <- iris[1:10, 1:2]
d1
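As a rough illustration of what this kind of replacement does, the sketch below applies the same idea to a small hypothetical data frame, `df` (an assumption for illustration, not `d1`). The replaced columns take their values from the replacement data frame but keep their original names.

```r
# Hypothetical three-row data frame
df <- data.frame(a = letters[1:3], b = c(TRUE, FALSE, TRUE))

# Replace both columns with the matching block of the built-in iris data
df[1:2] <- iris[1:3, 1:2]

names(df)   # "a" "b": the existing column names are kept
df$a        # 5.1 4.9 4.7: values now come from iris$Sepal.Length
```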
When the replacement size doesn’t match the selection size, R will use recycling to resolve the discrepancy, but it’s not always easy to predict how the replacement will behave.
# Replace the first two columns of d1 by recycling the vector `1:5`
d1[1:2] <- 1:5
d1

# Replace the 'a' and 'c' columns in d1 with a list containing vectors that
# will need to be recycled
d1[c("a", "c")] <- list(c("yes", "no"), 7:8)
d1
a b c
1 yes 1 7
2 no 2 8
3 yes 3 7
4 no 4 8
5 yes 5 7
6 no 1 8
7 yes 2 7
8 no 3 8
9 yes 4 7
10 no 5 8
As with matrices, R is oddly specific (in a slightly different way) about the kinds of size discrepancies it will automatically resolve when modifying data frames.
OK
Replacement length cleanly divides the selection length
Replacement length exceeds the selection length
Replacement list contains more slots than columns selected
Replacement data frame contains more columns than columns selected
Not OK
Replacement length is shorter than the selection length and does not cleanly divide it
# Works: Replace the first two columns of d1 by using the first 20 elements
# from `100:500`
d1[1:2] <- 100:500
d1

# Works: Replace the 'a' and 'c' columns in d1 with the first two columns
# from a three-column data frame
d1[c("a", "c")] <- data.frame("foo", "bar", "baz")
d1
a b c
1 foo 110 bar
2 foo 111 bar
3 foo 112 bar
4 foo 113 bar
5 foo 114 bar
6 foo 115 bar
7 foo 116 bar
8 foo 117 bar
9 foo 118 bar
10 foo 119 bar
# Fails: Replace the first two columns of d1 by using the non-conformable vector 1:3
d1[1:2] <- 1:3
Error in `[<-.data.frame`(`*tmp*`, 1:2, value = 1:3): replacement has 3 items, need 20
# Fails: Replace the 'a' and 'c' columns in d1 with a list containing
# non-conformable vectors
d1[c("a", "c")] <- list(letters[4], 1:8)
Error in `[<-.data.frame`(`*tmp*`, c("a", "c"), value = list("d", 1:8)): replacement element 2 has 8 rows, need 10
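To check how a particular replacement will behave without halting a script, one option is to wrap the assignment in `tryCatch()`. The sketch below assumes a hypothetical 10-row, two-column data frame, `df`, so the selection `df[1:2]` covers 20 cells. The variable `msg` is likewise just an illustrative name.

```r
# Hypothetical data frame: 2 columns x 10 rows = 20 cells
df <- data.frame(a = 1:10, b = 11:20)

# 5 divides 20, so the vector is recycled to fill both columns
df[1:2] <- 1:5

# 3 does not divide 20, so the assignment fails; capture the error message
msg <- tryCatch(
  {
    df[1:2] <- 1:3
    "no error"
  },
  error = function(e) conditionMessage(e)
)
msg
```

Because the failed assignment raises an error before anything is written, `df` still holds the values from the successful `1:5` replacement.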
Adding Columns
As with lists, we can add new columns to an existing data frame using the $ or [[]] operators.
# Create a two-column data frame
(d2 <- data.frame(a = 1:10, b = "bob"))
a b
1 1 bob
2 2 bob
3 3 bob
4 4 bob
5 5 bob
6 6 bob
7 7 bob
8 8 bob
9 9 bob
10 10 bob
## Various ways of adding new single columns
d2$c <- letters[1:10]
d2[["d"]] <- runif(10)
d2[[5]] <- rnorm(10)
d2["alice"] <- TRUE
d2
a b c d V5 alice
1 1 bob a 0.61442041 -0.0105563 TRUE
2 2 bob b 0.99336024 -0.2395205 TRUE
3 3 bob c 0.35083721 -0.6678822 TRUE
4 4 bob d 0.58937418 -1.4522268 TRUE
5 5 bob e 0.09605802 0.3873236 TRUE
6 6 bob f 0.09893925 -0.5918080 TRUE
7 7 bob g 0.51361014 -0.8701455 TRUE
8 8 bob h 0.77551488 -1.3784525 TRUE
9 9 bob i 0.88055553 -0.7888512 TRUE
10 10 bob j 0.44399214 0.5668728 TRUE
We can add multiple columns using the [] operator.
a b c d V5 alice V7 V8 foo bar
1 1 bob a 0.61442041 -0.0105563 TRUE -1.5939374 -1.0793539 TRUE FALSE
2 2 bob b 0.99336024 -0.2395205 TRUE 1.4772136 -1.1600501 TRUE FALSE
3 3 bob c 0.35083721 -0.6678822 TRUE -1.1498589 -1.2191487 TRUE FALSE
4 4 bob d 0.58937418 -1.4522268 TRUE 0.4329690 -0.9700584 TRUE FALSE
5 5 bob e 0.09605802 0.3873236 TRUE 0.5142483 -0.2994792 TRUE FALSE
6 6 bob f 0.09893925 -0.5918080 TRUE -0.5240982 0.1553770 TRUE FALSE
7 7 bob g 0.51361014 -0.8701455 TRUE -0.1236745 -0.4210625 TRUE FALSE
8 8 bob h 0.77551488 -1.3784525 TRUE 0.8173379 -0.3661413 TRUE FALSE
9 9 bob i 0.88055553 -0.7888512 TRUE 0.2495192 -0.1392587 TRUE FALSE
10 10 bob j 0.44399214 0.5668728 TRUE 0.6023962 1.1628371 TRUE FALSE
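The code that produced the output above is not shown, so the following is only a sketch of one way to add several columns at once with the [] operator. It uses a small hypothetical data frame, `df`, rather than `d2`, and the choice of a list of replacement values is an assumption, not necessarily the original approach.

```r
# Hypothetical three-row data frame
df <- data.frame(a = 1:3, b = "bob")

# Add two new columns by name, supplying one list slot per column;
# the length-one values are recycled down the rows
df[c("foo", "bar")] <- list(TRUE, FALSE)

# Add two more columns by position; the indices must directly follow
# the current last column
df[5:6] <- list(rnorm(3), rnorm(3))

names(df)
```

Columns added by position have no name of their own, so R assigns default names based on the column index, as with the V5, V7, and V8 columns in the output above.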
Practice
Run the following code to create an empty data frame containing 10 observations of the 3 variables: a, b, c. Then populate the data frame as described below.
Fill column a with the integer sequence from -9 to 0.
Use the column name to assign the new values.
Fill column b with the even integers between 1 and 20 (inclusive).
Use the numeric column index to assign the new values.
Replace the odd rows in column c with the odd integers between 11 and 20 (inclusive).
Do not overwrite the missing values in the even rows.