Delimited text files are the simplest type of file that you're likely to encounter when loading datasets into R. These files represent data as plain text in which the data values are separated by a specific delimiter character, typically a comma, tab, or space.
While base R provides functions for reading these kinds of files (e.g., read.table(), read.csv()), the readr package from the tidyverse provides a superior alternative. Relative to their base R counterparts, the data-ingest functions in readr are faster, easier to configure, and more transparent in how they parse data.
Space-Delimited
We'll first consider space-delimited files wherein the data values are separated by a single white space character. We can use the readr::read_delim() function to load space-delimited files. When calling read_delim() we have two options for specifying the delimiter:
Let the function detect the delimiter automatically (the default)
Specify the delimiter via the delim argument
library(readr)
# Read the `boys.dat` file using default arguments and store the ingested values
# in a data frame called 'boys'
boys <- read_delim("data/boys.dat")
Rows: 748 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: " "
chr (3): gen, phb, reg
dbl (6): age, hgt, wgt, bmi, hc, tv
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
When we use read_delim() to read a file, the function prints an informative message telling us which delimiter and data types it detected and used. We should check this message to make sure the auto-detection worked correctly.
# Check the result
head(boys)
# A tibble: 6 × 9
age hgt wgt bmi hc gen phb tv reg
<dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <chr>
1 0.035 50.1 3.65 14.5 33.7 -999 -999 -999 south
2 0.038 53.5 3.37 11.8 35 -999 -999 -999 south
3 0.057 50 3.14 12.6 35.2 -999 -999 -999 south
4 0.06 54.5 4.27 14.4 36.7 -999 -999 -999 south
5 0.062 57.5 5.03 15.2 37.3 -999 -999 -999 south
6 0.068 55.5 4.66 15.1 37 -999 -999 -999 south
Notice that read_delim() returns a tibble and not an ordinary data frame. Tibbles are just a slightly fancier flavor of data frame used by tidyverse packages. For our purposes, we can treat tibbles and data frames as equivalent objects.
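To see what treating the two as equivalent can look like in practice, the following sketch reads a tiny made-up file and inspects the result. The file is written to a temporary path, none of its values come from boys.dat, and the object names toy and toyDf are purely illustrative.

```r
library(readr)

# Write a small, made-up space-delimited file to a temporary location
tmp <- tempfile(fileext = ".dat")
writeLines(c("age hgt reg", "0.035 50.1 south", "0.038 53.5 west"), tmp)

# Read it the same way as above (column message silenced for brevity)
toy <- read_delim(tmp, delim = " ", show_col_types = FALSE)

# A tibble carries extra classes on top of "data.frame"
print(class(toy))
print(is.data.frame(toy))

# Convert to a plain data frame if some function insists on one
toyDf <- as.data.frame(toy)
print(class(toyDf))
```

Because a tibble is still a data frame under the hood, most functions that expect a data frame should accept it without conversion.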
# Read the `boys.dat` file with the delimiter specifically defined
boys <- read_delim("data/boys.dat", delim = " ")
Rows: 748 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: " "
chr (3): gen, phb, reg
dbl (6): age, hgt, wgt, bmi, hc, tv
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Check the result
head(boys)
# A tibble: 6 × 9
age hgt wgt bmi hc gen phb tv reg
<dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <chr>
1 0.035 50.1 3.65 14.5 33.7 -999 -999 -999 south
2 0.038 53.5 3.37 11.8 35 -999 -999 -999 south
3 0.057 50 3.14 12.6 35.2 -999 -999 -999 south
4 0.06 54.5 4.27 14.4 36.7 -999 -999 -999 south
5 0.062 57.5 5.03 15.2 37.3 -999 -999 -999 south
6 0.068 55.5 4.66 15.1 37 -999 -999 -999 south
Good Path Habits
In the example above, we explicitly wrote out the path to the data file we wanted to read (i.e., "data/boys.dat"). That approach will work, but we can make our lives a lot easier by taking a few steps to improve the portability of our code. In the following code chunk, we make two important changes:
Create a variable called dataDir that contains the relative file path from the root directory of our project to the data folder.
Use the here() function from the here package to resolve the file path.
Rows: 748 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: " "
chr (3): gen, phb, reg
dbl (6): age, hgt, wgt, bmi, hc, tv
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(boys)
# A tibble: 6 × 9
age hgt wgt bmi hc gen phb tv reg
<dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <chr>
1 0.035 50.1 3.65 14.5 33.7 -999 -999 -999 south
2 0.038 53.5 3.37 11.8 35 -999 -999 -999 south
3 0.057 50 3.14 12.6 35.2 -999 -999 -999 south
4 0.06 54.5 4.27 14.4 36.7 -999 -999 -999 south
5 0.062 57.5 5.03 15.2 37.3 -999 -999 -999 south
6 0.068 55.5 4.66 15.1 37 -999 -999 -999 south
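The two changes listed in this section might look like the following sketch. It assumes the data folder is called data and sits directly under the project root, and the variable name boysPath is illustrative. The file is only read if it actually exists, so the snippet should also run outside the project.

```r
library(readr)
library(here)

# Relative path from the project root to the data folder (assumed to be "data")
dataDir <- "data"

# here() prepends the project root it detected, giving a full path that does
# not depend on the current working directory of the script
boysPath <- here(dataDir, "boys.dat")
print(boysPath)

# Only attempt to read the file when it is present in this project
if (file.exists(boysPath)) {
  boys <- read_delim(boysPath)
  print(head(boys))
}
```

Keeping the folder name in one variable means that moving the data later requires changing a single line rather than every read call.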
Tab-Delimited
In a tab-delimited file the data values are separated by tab characters (\t). In the following code chunk, we load a file named diabetes.txt.
# Let the function auto-detect the delimiter
diabetes <- read_delim(here(dataDir, "diabetes.txt"))
Rows: 442 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (1): sex
dbl (10): age, bmi, bp, tc, ldl, hdl, tch, ltg, glu, progress
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Specify the delimiter explicitly
diabetes <- read_delim(here(dataDir, "diabetes.txt"), delim = "\t")
Rows: 442 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (1): sex
dbl (10): age, bmi, bp, tc, ldl, hdl, tch, ltg, glu, progress
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Alternatively, readr provides a dedicated function for reading tab-delimited files: read_tsv(). This function is just a wrapper around read_delim() with the delimiter preset to the tab character.
Rows: 442 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (1): sex
dbl (10): age, bmi, bp, tc, ldl, hdl, tch, ltg, glu, progress
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
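As a small self-contained check of the wrapper behavior described above, the sketch below writes a made-up tab-delimited file to a temporary path and reads it with both read_tsv() and read_delim(). The data values and the object names viaTsv and viaDelim are invented for illustration.

```r
library(readr)

# Made-up tab-delimited data written to a temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c("age\tsex\tbmi", "59\tmale\t32.1", "48\tfemale\t21.6"), tmp)

# read_tsv() needs no delim argument
viaTsv <- read_tsv(tmp, show_col_types = FALSE)

# The equivalent read_delim() call states the tab delimiter explicitly
viaDelim <- read_delim(tmp, delim = "\t", show_col_types = FALSE)

# Compare the two results column by column
print(mapply(identical, viaTsv, viaDelim))
```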
Rows: 150 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (1): Species
dbl (5): ID, Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Comma-Separated
The comma-separated values (CSV) format is one of the most popular delimited file types for storing tabular data. As the name implies, the data values are separated by commas. We can use the readr::read_delim() function with delim = ",", but it's more convenient to use the readr::read_csv() wrapper function.
New names:
• `` -> `...1`
Rows: 2535 Columns: 33
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): name, team, category, nationality, time, timediff
dbl (3): ...1, bib, rank
time (24): Delevret, St-Gervais, Contamines, La Balme, Bonhomme, Chapieux, C...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
New names:
• `` -> `...1`
Rows: 2535 Columns: 33
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): name, team, category, nationality
dbl (3): ...1, bib, rank
time (26): time, timediff, Delevret, St-Gervais, Contamines, La Balme, Bonho...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
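The "New names" note in the output above appears when a column has no name in the header row, which is common when a row index was written out along with the data. The sketch below reproduces that situation on a made-up CSV file written to a temporary path. The values and the object name toy are illustrative only.

```r
library(readr)

# Made-up CSV whose first header field is empty (an unnamed row-index column)
tmp <- tempfile(fileext = ".csv")
writeLines(c(",name,rank", "1,Anna,3", "2,Ben,1"), tmp)

# read_csv() presets the delimiter to a comma
toy <- read_csv(tmp, show_col_types = FALSE)

# The unnamed column receives a generated name
print(names(toy))
```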
The CSV format was developed in the United States where the period/full-stop character is used as the decimal separator. In countries that use the comma character to denote decimals (e.g., most European countries), it doesn't make much sense to separate data fields with commas. In these countries, CSV files use semicolons as the delimiter (though the file type is still called comma-separated, unfortunately).
boys <- read_csv(here(dataDir, "boys_eu.csv"))
Rows: 748 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): age;hgt;wgt;bmi;hc;gen;phb;tv;reg
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
To read this type of EU-formatted CSV file, we can use the read_csv2() function.
boys <- read_csv2(here(dataDir, "boys_eu.csv"))
ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
Rows: 748 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ";"
chr (3): gen, phb, reg
dbl (6): age, hgt, wgt, bmi, hc, tv
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(boys)
# A tibble: 6 × 9
age hgt wgt bmi hc gen phb tv reg
<dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <chr>
1 0.035 50.1 3.65 14.5 33.7 <NA> <NA> NA south
2 0.038 53.5 3.37 11.8 35 <NA> <NA> NA south
3 0.057 50 3.14 12.6 35.2 <NA> <NA> NA south
4 0.06 54.5 4.27 14.4 36.7 <NA> <NA> NA south
5 0.062 57.5 5.03 15.2 37.3 <NA> <NA> NA south
6 0.068 55.5 4.66 15.1 37 <NA> <NA> NA south
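The read_csv2() message points to read_delim() for more control. One way that might look is sketched below: the semicolon delimiter and the decimal mark are both stated explicitly through the delim and locale arguments. The file contents are made up, and the object name eu is illustrative.

```r
library(readr)

# Made-up EU-style file: semicolons between fields, commas as decimal marks
tmp <- tempfile(fileext = ".csv")
writeLines(c("age;hgt;reg", "0,035;50,1;south", "0,038;53,5;west"), tmp)

# State the delimiter and the number format explicitly
eu <- read_delim(
  tmp,
  delim = ";",
  locale = locale(decimal_mark = ",", grouping_mark = "."),
  show_col_types = FALSE
)

# The comma-decimal values are parsed as numbers
print(eu$age)
```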
Formatting Options
Some files contain known formatting issues that we'd like to correct as quickly as possible. The readr data-ingest functions contain many options that we can use to apply various formatting corrections when reading the data file.
Missing Values
In many datasets, missing values are represented by placeholder codes, such as -999. We can instruct read_delim() to interpret such codes as NA by supplying a vector of missing data codes for the na argument:
boys <- read_delim(here(dataDir, "boys.dat"), na = "-999")
Rows: 748 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: " "
chr (3): gen, phb, reg
dbl (6): age, hgt, wgt, bmi, hc, tv
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(boys)
# A tibble: 6 × 9
age hgt wgt bmi hc gen phb tv reg
<dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <chr>
1 0.035 50.1 3.65 14.5 33.7 <NA> <NA> NA south
2 0.038 53.5 3.37 11.8 35 <NA> <NA> NA south
3 0.057 50 3.14 12.6 35.2 <NA> <NA> NA south
4 0.06 54.5 4.27 14.4 36.7 <NA> <NA> NA south
5 0.062 57.5 5.03 15.2 37.3 <NA> <NA> NA south
6 0.068 55.5 4.66 15.1 37 <NA> <NA> NA south
Notice how all the “-999” values have been replaced by NA (R's native missing data code). R will now correctly recognize these cells as missing values and treat them appropriately.
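Since na accepts a vector, several codes can be supplied at once. One thing to keep in mind is that the default for na is c("", "NA"), so a custom vector replaces that default rather than adding to it. The following sketch uses a made-up file in which both "-999" and the literal string "NA" mark missing cells. The data and the object name toy are illustrative.

```r
library(readr)

# Made-up file that mixes two missing-data codes: -999 and NA
tmp <- tempfile(fileext = ".dat")
writeLines(
  c("age gen tv", "0.035 -999 -999", "0.038 G1 NA", "0.057 -999 2"),
  tmp
)

# List every code that should become NA, including the usual defaults
toy <- read_delim(
  tmp,
  delim = " ",
  na = c("-999", "NA", ""),
  show_col_types = FALSE
)

# Check which cells were flagged as missing
print(is.na(toy))
```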
Selecting Columns
Some files contain columns that we really don't care about and would rather just throw out immediately. For example, the first data column often contains an unnecessary row index. With readr, we can use the col_select argument to drop these columns directly when we read the data.
# Drop the first column when reading the file
utmb <- read_csv(here(dataDir, "utmb_2017.csv"), col_select = -1)
New names:
• `` -> `...1`
Rows: 2535 Columns: 32
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): name, team, category, nationality
dbl (2): bib, rank
time (26): time, timediff, Delevret, St-Gervais, Contamines, La Balme, Bonho...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The col_select argument can do much more than we show in the above example. You can use any feature of the tidyselect selection language to include or exclude columns.
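As a rough illustration of what the selection language allows, the sketch below selects columns by name and by prefix, and then drops a column by name. The file contents, column names, and object names (byName, dropped) are all made up for this example.

```r
library(readr)

# Made-up CSV with an index column and two columns sharing a prefix
tmp <- tempfile(fileext = ".csv")
writeLines(
  c("id,name,split_a,split_b,rank", "1,Anna,10.5,22.1,3", "2,Ben,11.2,23.4,1"),
  tmp
)

# Keep one column by name plus every column whose name starts with "split"
byName <- read_csv(
  tmp,
  col_select = c(name, starts_with("split")),
  show_col_types = FALSE
)
print(names(byName))

# Drop a single column by name instead of by position
dropped <- read_csv(tmp, col_select = -id, show_col_types = FALSE)
print(names(dropped))
```

Selecting by name tends to be more robust than selecting by position, because the code keeps working if the column order in the file changes.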
Column Data Types
By default, readr tries to guess the appropriate data type for each column based on the first 1,000 rows. We can use the spec() function to check how any readr data-ingest function parsed each column (i.e., which data types the function assigned).
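A minimal sketch of inspecting the guessed types is shown below, using a made-up file written to a temporary path. The guess_max argument in the second call controls how many rows are inspected during guessing. The object names toy and toy2 are illustrative.

```r
library(readr)

# Made-up CSV written to a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c("age,gen,reg", "0.035,G1,south", "0.038,G2,west"), tmp)

# Read quietly, then ask for the column specification on demand
toy <- read_csv(tmp, show_col_types = FALSE)
print(spec(toy))

# Limit (or extend) the number of rows used for type guessing
toy2 <- read_csv(tmp, guess_max = 2, show_col_types = FALSE)
```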
While the auto-typing process is often correct, it's not infallible. For more control, we can explicitly define column types using the col_types argument, which accepts a specification created using cols() and type constructors such as col_character(), col_integer(), and col_factor().
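A cols() specification might look like the following sketch. It uses a made-up file rather than the boys data, and the column names and the object name toy are illustrative.

```r
library(readr)

# Made-up CSV written to a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c("age,n,gen,reg", "0.035,1,G1,south", "0.038,2,G2,west"), tmp)

# Name each column and give it an explicit type constructor
toy <- read_csv(
  tmp,
  col_types = cols(
    age = col_double(),
    n   = col_integer(),
    gen = col_factor(),
    reg = col_character()
  )
)

# Confirm the types that were applied
str(toy)
```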
We can also use a compact string format, where each character represents the type of a column (e.g., "d" = double, "f" = factor).
boys <- read_csv2(here(dataDir, "boys.csv"), col_types = "dddddffdf")
ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
head(boys)
# A tibble: 6 × 9
age hgt wgt bmi hc gen phb tv reg
<dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <fct>
1 0.035 50.1 3.65 14.5 33.7 <NA> <NA> NA south
2 0.038 53.5 3.37 11.8 35 <NA> <NA> NA south
3 0.057 50 3.14 12.6 35.2 <NA> <NA> NA south
4 0.06 54.5 4.27 14.4 36.7 <NA> <NA> NA south
5 0.062 57.5 5.03 15.2 37.3 <NA> <NA> NA south
6 0.068 55.5 4.66 15.1 37 <NA> <NA> NA south
str(boys)
spc_tbl_ [748 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ age: num [1:748] 0.035 0.038 0.057 0.06 0.062 0.068 0.068 0.071 0.071 0.073 ...
$ hgt: num [1:748] 50.1 53.5 50 54.5 57.5 55.5 52.5 53 55.1 54.5 ...
$ wgt: num [1:748] 3.65 3.37 3.14 4.27 5.03 ...
$ bmi: num [1:748] 14.5 11.8 12.6 14.4 15.2 ...
$ hc : num [1:748] 33.7 35 35.2 36.7 37.3 37 34.9 35.8 36.8 38 ...
$ gen: Factor w/ 5 levels "G1","G2","G3",..: NA NA NA NA NA NA NA NA NA NA ...
$ phb: Factor w/ 6 levels "P1","P2","P3",..: NA NA NA NA NA NA NA NA NA NA ...
$ tv : num [1:748] NA NA NA NA NA NA NA NA NA NA ...
$ reg: Factor w/ 5 levels "south","west",..: 1 1 1 1 1 1 1 2 2 3 ...
- attr(*, "spec")=
.. cols(
.. age = col_double(),
.. hgt = col_double(),
.. wgt = col_double(),
.. bmi = col_double(),
.. hc = col_double(),
.. gen = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
.. phb = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
.. tv = col_double(),
.. reg = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE)
.. )
- attr(*, "problems")=<externalptr>
Practice
Use readr::read_csv() to load the data stored in “./data/iris.csv”.
Format the Species column as a factor.
Remove the Petal.Width column while reading the file.
Convert “-88” and “-99” to missing values.
# A tibble: 6 × 4
Sepal.Length Sepal.Width Petal.Length Species
<dbl> <dbl> <dbl> <fct>
1 5.1 3.5 NA setosa
2 4.9 3 1.4 setosa
3 4.7 3.2 1.3 setosa
4 4.6 3.1 1.5 setosa
5 5 3.6 NA setosa
6 NA 3.9 1.7 setosa
Base R Options
Base R also includes built-in functions for reading delimited files. The Base R analogue of readr::read_delim() is read.table(). The read.table() function won't attempt to auto-detect the delimiter, so we need to explicitly specify the delimiter character via the sep argument.
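A minimal sketch of such a read.table() call is shown below, using a made-up space-delimited file written to a temporary path. The header and na.strings arguments are included because read.table() does not treat the first row as column names by default. The object name toy is illustrative.

```r
# Made-up space-delimited file written to a temporary location
tmp <- tempfile(fileext = ".dat")
writeLines(
  c("age hgt tv reg", "0.035 50.1 -999 south", "0.038 53.5 2 west"),
  tmp
)

# sep sets the delimiter, header = TRUE takes column names from the first row,
# and na.strings plays the role of readr's na argument
toy <- read.table(tmp, header = TRUE, sep = " ", na.strings = "-999")

str(toy)
```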
# CSV with semicolons as delimiters
boys2 <- read.csv2(here(dataDir, "boys_eu.csv"))
head(boys2)
age hgt wgt bmi hc gen phb tv reg
1 0.035 50.1 3.650 14.54 33.7 NA south
2 0.038 53.5 3.370 11.77 35.0 NA south
3 0.057 50.0 3.140 12.56 35.2 NA south
4 0.060 54.5 4.270 14.37 36.7 NA south
5 0.062 57.5 5.030 15.21 37.3 NA south
6 0.068 55.5 4.655 15.11 37.0 NA south