Delimited Data

Delimited text files are the simplest type of file that you’re likely to encounter when loading datasets into R. These files represent data as plain text in which the data values are separated by a specific delimiter character, most commonly a comma, tab, or space.

While base R provides functions for reading these kinds of files (e.g., read.table(), read.csv()), the readr package from the tidyverse provides a superior alternative. Relative to their base R counterparts, the data-ingest functions in readr are faster, easier to configure, and more transparent in how they parse data.

Space-Delimited

We’ll first consider space-delimited files wherein the data values are separated by a single white space character. We can use the readr::read_delim() function to load space-delimited files. When calling read_delim() we have two options for specifying the delimiter:

Let the function detect the delimiter automatically (the default)
Specify the delimiter via the delim argument

library(readr)

# Read the `boys.dat` file using default arguments and store the ingested values
# in a data frame called 'boys'
boys <- read_delim("data/boys.dat")

Rows: 748 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: " "
chr (3): gen, phb, reg
dbl (6): age, hgt, wgt, bmi, hc, tv

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

When we use read_delim() to read a file, the function prints an informative message telling us which delimiter and which column data types it detected. We should check this message to make sure the auto-detection worked correctly.
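The message points to two follow-ups: spec() and the show_col_types argument. The following is a minimal sketch of both, assuming a small throwaway file written to a temporary location (the file contents and the names demoFile, demo, demoSpec, and colClasses are made up for illustration and are not part of the boys data).

```r
library(readr)

# Hypothetical stand-in for a space-delimited data file
demoFile <- tempfile(fileext = ".dat")
writeLines(c("age hgt reg", "0.035 50.1 south", "0.038 53.5 north"), demoFile)

# Suppress the column-specification message while reading...
demo <- read_delim(demoFile, delim = " ", show_col_types = FALSE)

# ...and retrieve the detected column specification on demand instead
demoSpec <- spec(demo)
print(demoSpec)

# The class of each parsed column offers a quick sanity check
colClasses <- vapply(demo, function(x) class(x)[1], character(1))
print(colClasses)
```

Quieting the message is handy in scripts, but it also hides the information we would otherwise use to spot a mis-detected delimiter or type, so it is worth pairing with an explicit check like the one above.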

# Check the result
head(boys)

# A tibble: 6 × 9
    age   hgt   wgt   bmi    hc gen   phb      tv reg  
  <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <chr>
1 0.035  50.1  3.65  14.5  33.7 -999  -999   -999 south
2 0.038  53.5  3.37  11.8  35   -999  -999   -999 south
3 0.057  50    3.14  12.6  35.2 -999  -999   -999 south
4 0.06   54.5  4.27  14.4  36.7 -999  -999   -999 south
5 0.062  57.5  5.03  15.2  37.3 -999  -999   -999 south
6 0.068  55.5  4.66  15.1  37   -999  -999   -999 south

Notice that read_delim() returns a tibble and not an ordinary data frame. Tibbles are just a slightly fancier flavor of data frame used by tidyverse packages. For our purposes, we can treat tibbles and data frames as equivalent objects.
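To make the relationship between tibbles and data frames concrete, here is a small sketch that reads a made-up two-column file (tmpFile, tbl, and df are illustrative names) and checks how base R sees the result.

```r
library(readr)

# Hypothetical two-column, space-delimited file
tmpFile <- tempfile(fileext = ".dat")
writeLines(c("x y", "1 a", "2 b"), tmpFile)

tbl <- read_delim(tmpFile, delim = " ", show_col_types = FALSE)

# A tibble still counts as a data frame for base R functions
is.data.frame(tbl)
inherits(tbl, "tbl_df")

# If a function insists on a plain data frame, we can convert explicitly
df <- as.data.frame(tbl)
class(df)
```

In most workflows the conversion is unnecessary; it is shown here only as an escape hatch.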

# Read the `boys.dat` file with the delimiter specifically defined
boys <- read_delim("data/boys.dat", delim = " ")

Rows: 748 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: " "
chr (3): gen, phb, reg
dbl (6): age, hgt, wgt, bmi, hc, tv

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Check the result
head(boys)

# A tibble: 6 × 9
    age   hgt   wgt   bmi    hc gen   phb      tv reg  
  <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <chr>
1 0.035  50.1  3.65  14.5  33.7 -999  -999   -999 south
2 0.038  53.5  3.37  11.8  35   -999  -999   -999 south
3 0.057  50    3.14  12.6  35.2 -999  -999   -999 south
4 0.06   54.5  4.27  14.4  36.7 -999  -999   -999 south
5 0.062  57.5  5.03  15.2  37.3 -999  -999   -999 south
6 0.068  55.5  4.66  15.1  37   -999  -999   -999 south

Good Path Habits

In the example above, we explicitly wrote out the path to the data file we wanted to read (i.e., "data/boys.dat"). That approach will work, but we can make our lives a lot easier by taking a few steps to improve the portability of our code. In the following code chunk, we make two important changes:

Create a variable called dataDir that contains the relative file path from the root directory of our project to the data folder.
Use the here() function from the here package to resolve the file path.

library(here)

dataDir <- "data"

boys <- read_delim(here(dataDir, "boys.dat"))

Rows: 748 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: " "
chr (3): gen, phb, reg
dbl (6): age, hgt, wgt, bmi, hc, tv

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(boys)

# A tibble: 6 × 9
    age   hgt   wgt   bmi    hc gen   phb      tv reg  
  <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <chr>
1 0.035  50.1  3.65  14.5  33.7 -999  -999   -999 south
2 0.038  53.5  3.37  11.8  35   -999  -999   -999 south
3 0.057  50    3.14  12.6  35.2 -999  -999   -999 south
4 0.06   54.5  4.27  14.4  36.7 -999  -999   -999 south
5 0.062  57.5  5.03  15.2  37.3 -999  -999   -999 south
6 0.068  55.5  4.66  15.1  37   -999  -999   -999 south
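To see what here() actually produces, the sketch below builds the path separately and inspects it before reading. The file.exists() guard and the boysPath name are additions for illustration; whether the file is found depends on where the project root is on your machine.

```r
library(here)

# Name of the data folder, relative to the project root (as in the text)
dataDir <- "data"

# here() joins its arguments onto the project root
boysPath <- here(dataDir, "boys.dat")
print(boysPath)

# Guarding the read gives a clear message when the file is missing
if (file.exists(boysPath)) {
  boys <- readr::read_delim(boysPath, show_col_types = FALSE)
} else {
  message("No file found at: ", boysPath)
}
```

Because the path is resolved from the project root rather than the current working directory, the same code should work whether it is run from the console, a script in a subfolder, or a rendered document.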

Tab-Delimited

In a tab-delimited file the data values are separated by tab characters (\t). In the following code chunk, we load a file named diabetes.txt.

# Let the function auto-detect the delimiter
diabetes <- read_delim(here(dataDir, "diabetes.txt"))

Rows: 442 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr  (1): sex
dbl (10): age, bmi, bp, tc, ldl, hdl, tch, ltg, glu, progress

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(diabetes)

# A tibble: 6 × 11
    age   bmi    bp    tc   ldl   hdl   tch   ltg   glu progress sex   
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl> <chr> 
1    59  32.1   101   157  93.2    38     4  4.86    87      151 male  
2    48  21.6    87   183 103.     70     3  3.89    69       75 female
3    72  30.5    93   156  93.6    41     4  4.67    85      141 male  
4    24  25.3    84   198 131.     40     5  4.89    89      206 female
5    50  23     101   192 125.     52     4  4.29    80      135 female
6    23  22.6    89   139  64.8    61     2  4.19    68       97 female

# Explicitly specify the delimiter
diabetes <- read_delim(here(dataDir, "diabetes.txt"), delim = "\t")

Rows: 442 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr  (1): sex
dbl (10): age, bmi, bp, tc, ldl, hdl, tch, ltg, glu, progress

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(diabetes)

# A tibble: 6 × 11
    age   bmi    bp    tc   ldl   hdl   tch   ltg   glu progress sex   
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl> <chr> 
1    59  32.1   101   157  93.2    38     4  4.86    87      151 male  
2    48  21.6    87   183 103.     70     3  3.89    69       75 female
3    72  30.5    93   156  93.6    41     4  4.67    85      141 male  
4    24  25.3    84   198 131.     40     5  4.89    89      206 female
5    50  23     101   192 125.     52     4  4.29    80      135 female
6    23  22.6    89   139  64.8    61     2  4.19    68       97 female

Alternatively, readr provides a dedicated function for reading tab-delimited files: read_tsv(). This function is just a wrapper around read_delim() with the delimiter preset to the tab character.

diabetes <- read_tsv(here(dataDir, "diabetes.txt"))

Rows: 442 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr  (1): sex
dbl (10): age, bmi, bp, tc, ldl, hdl, tch, ltg, glu, progress

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(diabetes)

# A tibble: 6 × 11
    age   bmi    bp    tc   ldl   hdl   tch   ltg   glu progress sex   
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl> <chr> 
1    59  32.1   101   157  93.2    38     4  4.86    87      151 male  
2    48  21.6    87   183 103.     70     3  3.89    69       75 female
3    72  30.5    93   156  93.6    41     4  4.67    85      141 male  
4    24  25.3    84   198 131.     40     5  4.89    89      206 female
5    50  23     101   192 125.     52     4  4.29    80      135 female
6    23  22.6    89   139  64.8    61     2  4.19    68       97 female
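As a quick check of the wrapper relationship, the following sketch reads the same made-up tab-delimited file both ways and compares the columns (tsvFile, viaDelim, viaTsv, and sameCols are illustrative names). One caveat to keep in mind: according to the readr documentation, read_delim() defaults to trim_ws = FALSE whereas read_tsv() and read_csv() default to trim_ws = TRUE, so the results can differ for files with stray whitespace around the values. The toy file here has none.

```r
library(readr)

# Hypothetical tab-delimited file
tsvFile <- tempfile(fileext = ".txt")
writeLines(c("age\tbmi\tsex", "59\t32.1\tmale", "48\t21.6\tfemale"), tsvFile)

viaDelim <- read_delim(tsvFile, delim = "\t", show_col_types = FALSE)
viaTsv   <- read_tsv(tsvFile, show_col_types = FALSE)

# Compare the two results column by column
sameCols <- all(mapply(identical, viaDelim, viaTsv))
print(sameCols)
```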

Practice

Use readr::read_delim() to load the data stored in the tab-delimited file “./data/iris.txt”.

iris <- read_delim(here::here("data", "iris.txt"), delim = "\t")

Rows: 150 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (1): Species
dbl (5): ID, Sepal.Length, Sepal.Width, Petal.Length, Petal.Width

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(iris)

# A tibble: 6 × 6
     ID Sepal.Length Sepal.Width Petal.Length Petal.Width Species
  <dbl>        <dbl>       <dbl>        <dbl>       <dbl> <chr>  
1    43          5.1         3.5        -99           0.2 setosa 
2    44          4.9         3            1.4         0.2 setosa 
3    45          4.7         3.2          1.3         0.2 setosa 
4    46          4.6         3.1          1.5         0.2 setosa 
5    47          5           3.6        -99           0.2 setosa 
6    48        -99           3.9          1.7       -99   setosa

Comma-Separated Values

The comma-separated values (CSV) format is one of the most popular delimited file types for storing tabular data. As the name implies, the data values are separated by commas. We can use the readr::read_delim() function with delim = ",", but it’s more convenient to use the readr::read_csv() wrapper function.

utmb <- read_delim(here(dataDir, "utmb_2017.csv"), delim = ",")

New names:
• `` -> `...1`
Rows: 2535 Columns: 33
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (6): name, team, category, nationality, time, timediff
dbl   (3): ...1, bib, rank
time (24): Delevret, St-Gervais, Contamines, La Balme, Bonhomme, Chapieux, C...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(utmb)

# A tibble: 6 × 33
   ...1   bib name      team  category  rank nationality time  timediff Delevret
  <dbl> <dbl> <chr>     <chr> <chr>    <dbl> <chr>       <chr> <chr>    <time>  
1     0     4 D'HAENE … Salo… SE H         1 FR          19:0… 00:00:00 01:11:50
2     1     2 JORNET B… Salo… SE H         2 ES          19:1… 00:15:05 01:10:00
3     2    14 TOLLEFSO… Hoka  SE H         3 US          19:5… 00:51:06 01:15:24
4     3     7 THEVENAR… Asics SE H         4 FR          20:0… 01:01:45 01:11:51
5     4     1 WALMSLEY… Hoka  SE H         5 US          20:1… 01:09:44 01:09:59
6     5    17 CAPELL P… The … SE H         6 ES          20:1… 01:10:49 01:13:16
# ℹ 23 more variables: `St-Gervais` <time>, Contamines <time>,
#   `La Balme` <time>, Bonhomme <time>, Chapieux <time>, `Col Seigne` <time>,
#   `Lac Combal` <time>, `Mt-Favre` <time>, Checruit <time>, Courmayeur <time>,
#   Bertone <time>, Bonatti <time>, Arnouvaz <time>, `Col Ferret` <time>,
#   `La Fouly` <time>, `Champex La` <time>, `La Giète` <time>, Trient <time>,
#   `Les Tseppe` <time>, Vallorcine <time>, `Col Montet` <time>,
#   Flégère <time>, Arrivée <time>

utmb <- read_csv(here(dataDir, "utmb_2017.csv"))

New names:
• `` -> `...1`
Rows: 2535 Columns: 33
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (4): name, team, category, nationality
dbl   (3): ...1, bib, rank
time (26): time, timediff, Delevret, St-Gervais, Contamines, La Balme, Bonho...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(utmb)

# A tibble: 6 × 33
   ...1   bib name   team  category  rank nationality time     timediff Delevret
  <dbl> <dbl> <chr>  <chr> <chr>    <dbl> <chr>       <time>   <time>   <time>  
1     0     4 D'HAE… Salo… SE H         1 FR          19:01:54 00:00:00 01:11:50
2     1     2 JORNE… Salo… SE H         2 ES          19:16:59 00:15:05 01:10:00
3     2    14 TOLLE… Hoka  SE H         3 US          19:53:00 00:51:06 01:15:24
4     3     7 THEVE… Asics SE H         4 FR          20:03:39 01:01:45 01:11:51
5     4     1 WALMS… Hoka  SE H         5 US          20:11:38 01:09:44 01:09:59
6     5    17 CAPEL… The … SE H         6 ES          20:12:43 01:10:49 01:13:16
# ℹ 23 more variables: `St-Gervais` <time>, Contamines <time>,
#   `La Balme` <time>, Bonhomme <time>, Chapieux <time>, `Col Seigne` <time>,
#   `Lac Combal` <time>, `Mt-Favre` <time>, Checruit <time>, Courmayeur <time>,
#   Bertone <time>, Bonatti <time>, Arnouvaz <time>, `Col Ferret` <time>,
#   `La Fouly` <time>, `Champex La` <time>, `La Giète` <time>, Trient <time>,
#   `Les Tseppe` <time>, Vallorcine <time>, `Col Montet` <time>,
#   Flégère <time>, Arrivée <time>

Trouble with Locales

The CSV format was developed in the United States where the period/full-stop character is used as the decimal separator. In countries that use the comma character to denote decimals (e.g., most European countries), it doesn’t make much sense to separate data fields with commas. In these countries, CSV files use semicolons as the delimiter (though the file type is still called comma-separated, unfortunately).

boys <- read_csv(here(dataDir, "boys_eu.csv"))

Rows: 748 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): age;hgt;wgt;bmi;hc;gen;phb;tv;reg

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(boys) # Oops...that doesn't look right

# A tibble: 6 × 1
  `age;hgt;wgt;bmi;hc;gen;phb;tv;reg` 
  <chr>                               
1 0,035;50,1;3,650;14,54;33,7;;;;south
2 0,038;53,5;3,370;11,77;35,0;;;;south
3 0,057;50,0;3,140;12,56;35,2;;;;south
4 0,060;54,5;4,270;14,37;36,7;;;;south
5 0,062;57,5;5,030;15,21;37,3;;;;south
6 0,068;55,5;4,655;15,11;37,0;;;;south

To read this type of EU-formatted CSV file, we can use the read_csv2() function.

boys <- read_csv2(here(dataDir, "boys_eu.csv"))

ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.

Rows: 748 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ";"
chr (3): gen, phb, reg
dbl (6): age, hgt, wgt, bmi, hc, tv

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(boys)

# A tibble: 6 × 9
    age   hgt   wgt   bmi    hc gen   phb      tv reg  
  <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <chr>
1 0.035  50.1  3.65  14.5  33.7 <NA>  <NA>     NA south
2 0.038  53.5  3.37  11.8  35   <NA>  <NA>     NA south
3 0.057  50    3.14  12.6  35.2 <NA>  <NA>     NA south
4 0.06   54.5  4.27  14.4  36.7 <NA>  <NA>     NA south
5 0.062  57.5  5.03  15.2  37.3 <NA>  <NA>     NA south
6 0.068  55.5  4.66  15.1  37   <NA>  <NA>     NA south
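The read_csv2() message suggests read_delim() for more control. A minimal sketch of that route follows, assuming a small made-up semicolon-delimited file (euFile and boysEu are illustrative names). It sets the delimiter explicitly and describes the number format through readr's locale() function.

```r
library(readr)

# Hypothetical EU-formatted file: semicolon delimiter, comma decimal mark
euFile <- tempfile(fileext = ".csv")
writeLines(c("age;hgt;reg", "0,035;50,1;south", "0,038;53,5;south"), euFile)

boysEu <- read_delim(
  euFile,
  delim          = ";",
  locale         = locale(decimal_mark = ",", grouping_mark = "."),
  show_col_types = FALSE
)

# The numeric columns should now be parsed as doubles
print(boysEu)
```

Spelling out the locale is more verbose than read_csv2(), but it lets us handle files that mix conventions, such as a tab delimiter combined with comma decimals.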

Formatting Options

Some files contain known formatting issues that we’d like to correct as quickly as possible. The readr data-ingest functions contain many options that we can use to apply various formatting corrections when reading the data file.

Missing Values

In many datasets, missing values are represented by placeholder codes, such as -999. We can instruct read_delim() to interpret such codes as NA by supplying a vector of missing data codes for the na argument:

boys <- read_delim(here(dataDir, "boys.dat"), na = "-999")

Rows: 748 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: " "
chr (3): gen, phb, reg
dbl (6): age, hgt, wgt, bmi, hc, tv

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(boys)

# A tibble: 6 × 9
    age   hgt   wgt   bmi    hc gen   phb      tv reg  
  <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <chr>
1 0.035  50.1  3.65  14.5  33.7 <NA>  <NA>     NA south
2 0.038  53.5  3.37  11.8  35   <NA>  <NA>     NA south
3 0.057  50    3.14  12.6  35.2 <NA>  <NA>     NA south
4 0.06   54.5  4.27  14.4  36.7 <NA>  <NA>     NA south
5 0.062  57.5  5.03  15.2  37.3 <NA>  <NA>     NA south
6 0.068  55.5  4.66  15.1  37   <NA>  <NA>     NA south

Notice how all the “-999” values have been replaced by NA (R’s native missing data code). R will now correctly recognize these cells as missing values and treat them appropriately.
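The na argument accepts several codes at once. Its documented default is c("", "NA"), and a vector we supply replaces that default, so the sketch below lists the default codes alongside -999. The file contents and the names naFile and demo are made up for illustration.

```r
library(readr)

# Hypothetical file that mixes two missing-data codes: -999 and NA
naFile <- tempfile(fileext = ".dat")
writeLines(c("hgt gen", "50.1 -999", "-999 G1", "53.5 NA"), naFile)

demo <- read_delim(
  naFile,
  delim          = " ",
  na             = c("", "NA", "-999"),
  show_col_types = FALSE
)

# Count the missing values in each column
colSums(is.na(demo))
```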

Selecting Columns

Some files contain columns that we really don’t care about and would rather just throw out immediately. For example, the first data column often contains an unnecessary row index. With readr, we can use the col_select argument to drop these columns directly when we read the data.

# Drop the first column when reading the file
utmb <- read_csv(here(dataDir, "utmb_2017.csv"), col_select = -1)

New names:
• `` -> `...1`
Rows: 2535 Columns: 32
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (4): name, team, category, nationality
dbl   (2): bib, rank
time (26): time, timediff, Delevret, St-Gervais, Contamines, La Balme, Bonho...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(utmb)

# A tibble: 6 × 32
    bib name         team  category  rank nationality time     timediff Delevret
  <dbl> <chr>        <chr> <chr>    <dbl> <chr>       <time>   <time>   <time>  
1     4 D'HAENE Fra… Salo… SE H         1 FR          19:01:54 00:00:00 01:11:50
2     2 JORNET BURG… Salo… SE H         2 ES          19:16:59 00:15:05 01:10:00
3    14 TOLLEFSON T… Hoka  SE H         3 US          19:53:00 00:51:06 01:15:24
4     7 THEVENARD X… Asics SE H         4 FR          20:03:39 01:01:45 01:11:51
5     1 WALMSLEY Jim Hoka  SE H         5 US          20:11:38 01:09:44 01:09:59
6    17 CAPELL Pau   The … SE H         6 ES          20:12:43 01:10:49 01:13:16
# ℹ 23 more variables: `St-Gervais` <time>, Contamines <time>,
#   `La Balme` <time>, Bonhomme <time>, Chapieux <time>, `Col Seigne` <time>,
#   `Lac Combal` <time>, `Mt-Favre` <time>, Checruit <time>, Courmayeur <time>,
#   Bertone <time>, Bonatti <time>, Arnouvaz <time>, `Col Ferret` <time>,
#   `La Fouly` <time>, `Champex La` <time>, `La Giète` <time>, Trient <time>,
#   `Les Tseppe` <time>, Vallorcine <time>, `Col Montet` <time>,
#   Flégère <time>, Arrivée <time>

The col_select argument can do much more than we show in the above example. You can use any feature of the tidyselect selection language to include or exclude columns.
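As a rough illustration of those tidyselect features, the sketch below selects columns by name and by prefix from a made-up CSV file (the file contents and the names csvFile, splits, and noId are hypothetical). The starts_with() helper is called with its tidyselect:: prefix so the example does not depend on which packages are attached.

```r
library(readr)

# Hypothetical race-results file
csvFile <- tempfile(fileext = ".csv")
writeLines(
  c("id,name,split_a,split_b,finish",
    "1,Ann,10.5,21.0,42.2",
    "2,Bob,11.0,22.5,45.1"),
  csvFile
)

# Keep `name` plus every column whose name starts with "split"
splits <- read_csv(
  csvFile,
  col_select     = c(name, tidyselect::starts_with("split")),
  show_col_types = FALSE
)
names(splits)

# Drop a column by name rather than by position
noId <- read_csv(csvFile, col_select = -id, show_col_types = FALSE)
names(noId)
```

Selecting by name tends to be more robust than selecting by position, since the code keeps working if the column order in the file changes.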

Column Data Types

By default, readr tries to guess the appropriate data type for each column from a sample of the data (up to 1,000 rows, as controlled by the guess_max argument). We can use the spec() function to check how any readr data-ingest function parsed each column (i.e., which data types the function assigned).

spec(utmb)

cols(
  ...1 = col_skip(),
  bib = col_double(),
  name = col_character(),
  team = col_character(),
  category = col_character(),
  rank = col_double(),
  nationality = col_character(),
  time = col_time(format = ""),
  timediff = col_time(format = ""),
  Delevret = col_time(format = ""),
  `St-Gervais` = col_time(format = ""),
  Contamines = col_time(format = ""),
  `La Balme` = col_time(format = ""),
  Bonhomme = col_time(format = ""),
  Chapieux = col_time(format = ""),
  `Col Seigne` = col_time(format = ""),
  `Lac Combal` = col_time(format = ""),
  `Mt-Favre` = col_time(format = ""),
  Checruit = col_time(format = ""),
  Courmayeur = col_time(format = ""),
  Bertone = col_time(format = ""),
  Bonatti = col_time(format = ""),
  Arnouvaz = col_time(format = ""),
  `Col Ferret` = col_time(format = ""),
  `La Fouly` = col_time(format = ""),
  `Champex La` = col_time(format = ""),
  `La Giète` = col_time(format = ""),
  Trient = col_time(format = ""),
  `Les Tseppe` = col_time(format = ""),
  Vallorcine = col_time(format = ""),
  `Col Montet` = col_time(format = ""),
  Flégère = col_time(format = ""),
  Arrivée = col_time(format = "")
)

While the auto-typing process is often correct, it’s not infallible. For more control, we can explicitly define column types using the col_types argument, which accepts a specification created using cols() and type constructors such as col_character(), col_integer(), and col_factor():

utmb <- read_csv(
  file       = here(dataDir, "utmb_2017.csv"),
  col_select = -1,
  col_types  = cols(
    bib         = col_character(),
    category    = col_factor(),
    rank        = col_integer(),
    nationality = col_factor()
  )
)

New names:
• `` -> `...1`

head(utmb)

# A tibble: 6 × 32
  bib   name         team  category  rank nationality time     timediff Delevret
  <chr> <chr>        <chr> <fct>    <int> <fct>       <time>   <time>   <time>  
1 4     D'HAENE Fra… Salo… SE H         1 FR          19:01:54 00:00:00 01:11:50
2 2     JORNET BURG… Salo… SE H         2 ES          19:16:59 00:15:05 01:10:00
3 14    TOLLEFSON T… Hoka  SE H         3 US          19:53:00 00:51:06 01:15:24
4 7     THEVENARD X… Asics SE H         4 FR          20:03:39 01:01:45 01:11:51
5 1     WALMSLEY Jim Hoka  SE H         5 US          20:11:38 01:09:44 01:09:59
6 17    CAPELL Pau   The … SE H         6 ES          20:12:43 01:10:49 01:13:16
# ℹ 23 more variables: `St-Gervais` <time>, Contamines <time>,
#   `La Balme` <time>, Bonhomme <time>, Chapieux <time>, `Col Seigne` <time>,
#   `Lac Combal` <time>, `Mt-Favre` <time>, Checruit <time>, Courmayeur <time>,
#   Bertone <time>, Bonatti <time>, Arnouvaz <time>, `Col Ferret` <time>,
#   `La Fouly` <time>, `Champex La` <time>, `La Giète` <time>, Trient <time>,
#   `Les Tseppe` <time>, Vallorcine <time>, `Col Montet` <time>,
#   Flégère <time>, Arrivée <time>

# Check the data structure
str(utmb)

tibble [2,535 × 32] (S3: tbl_df/tbl/data.frame)
 $ bib        : chr [1:2535] "4" "2" "14" "7" ...
 $ name       : chr [1:2535] "D'HAENE François" "JORNET BURGADA Kilian" "TOLLEFSON Tim" "THEVENARD Xavier" ...
 $ team       : chr [1:2535] "Salomon" "Salomon" "Hoka" "Asics" ...
 $ category   : Factor w/ 10 levels "SE H","V1 H",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ rank       : int [1:2535] 1 2 3 4 5 6 7 8 9 10 ...
 $ nationality: Factor w/ 66 levels "FR","ES","US",..: 1 2 3 1 3 2 3 4 3 2 ...
 $ time       : 'hms' num [1:2535] 19:01:54 19:16:59 19:53:00 20:03:39 ...
  ..- attr(*, "units")= chr "secs"
 $ timediff   : 'hms' num [1:2535] 00:00:00 00:15:05 00:51:06 01:01:45 ...
  ..- attr(*, "units")= chr "secs"
 $ Delevret   : 'hms' num [1:2535] 01:11:50 01:10:00 01:15:24 01:11:51 ...
  ..- attr(*, "units")= chr "secs"
 $ St-Gervais : 'hms' num [1:2535] 01:45:05 01:44:21 01:48:38 01:45:08 ...
  ..- attr(*, "units")= chr "secs"
 $ Contamines : 'hms' num [1:2535] 02:41:09 02:41:01 02:45:17 02:41:11 ...
  ..- attr(*, "units")= chr "secs"
 $ La Balme   : 'hms' num [1:2535] 03:33:40 03:33:45 03:41:50 03:33:45 ...
  ..- attr(*, "units")= chr "secs"
 $ Bonhomme   : 'hms' num [1:2535] 04:28:07 04:29:18 04:41:04 04:38:06 ...
  ..- attr(*, "units")= chr "secs"
 $ Chapieux   : 'hms' num [1:2535] 04:53:31 04:54:39 05:10:05 05:07:23 ...
  ..- attr(*, "units")= chr "secs"
 $ Col Seigne : 'hms' num [1:2535] 06:18:02 06:18:04 06:40:51 06:41:10 ...
  ..- attr(*, "units")= chr "secs"
 $ Lac Combal : 'hms' num [1:2535] 06:37:51 06:37:54 07:02:40 07:04:45 ...
  ..- attr(*, "units")= chr "secs"
 $ Mt-Favre   : 'hms' num [1:2535] 07:15:35 07:15:37 07:42:45 07:45:38 ...
  ..- attr(*, "units")= chr "secs"
 $ Checruit   : 'hms' num [1:2535] 07:39:09 07:39:16 08:08:05 08:11:11 ...
  ..- attr(*, "units")= chr "secs"
 $ Courmayeur : 'hms' num [1:2535] 08:02:18 08:02:49 08:33:53 08:37:54 ...
  ..- attr(*, "units")= chr "secs"
 $ Bertone    : 'hms' num [1:2535] 08:54:29 08:57:30 09:29:48 09:38:22 ...
  ..- attr(*, "units")= chr "secs"
 $ Bonatti    : 'hms' num [1:2535] 09:44:00 09:48:28 10:21:27 10:31:58 ...
  ..- attr(*, "units")= chr "secs"
 $ Arnouvaz   : 'hms' num [1:2535] 10:17:44 10:23:53 10:55:21 11:09:38 ...
  ..- attr(*, "units")= chr "secs"
 $ Col Ferret : 'hms' num [1:2535] 11:11:12 11:18:54 NA 12:09:17 ...
  ..- attr(*, "units")= chr "secs"
 $ La Fouly   : 'hms' num [1:2535] 12:04:26 12:12:40 12:46:12 13:00:59 ...
  ..- attr(*, "units")= chr "secs"
 $ Champex La : 'hms' num [1:2535] 13:24:20 13:33:52 14:08:23 14:22:44 ...
  ..- attr(*, "units")= chr "secs"
 $ La Giète   : 'hms' num [1:2535] 14:55:05 15:13:06 15:45:55 15:58:54 ...
  ..- attr(*, "units")= chr "secs"
 $ Trient     : 'hms' num [1:2535] 15:24:59 15:41:22 16:12:00 16:28:53 ...
  ..- attr(*, "units")= chr "secs"
 $ Les Tseppe : 'hms' num [1:2535] 16:06:17 16:23:16 16:56:16 17:12:35 ...
  ..- attr(*, "units")= chr "secs"
 $ Vallorcine : 'hms' num [1:2535] 16:51:13 17:05:14 17:39:45 17:55:20 ...
  ..- attr(*, "units")= chr "secs"
 $ Col Montet : 'hms' num [1:2535] 17:20:02 17:34:21 18:09:03 18:23:24 ...
  ..- attr(*, "units")= chr "secs"
 $ Flégère    : 'hms' num [1:2535] 18:23:09 18:39:27 19:17:41 19:28:04 ...
  ..- attr(*, "units")= chr "secs"
 $ Arrivée    : 'hms' num [1:2535] 19:01:54 19:16:59 19:53:00 20:03:39 ...
  ..- attr(*, "units")= chr "secs"
 - attr(*, "spec")=
  .. cols(
  ..   ...1 = col_skip(),
  ..   bib = col_character(),
  ..   name = col_character(),
  ..   team = col_character(),
  ..   category = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
  ..   rank = col_integer(),
  ..   nationality = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
  ..   time = col_time(format = ""),
  ..   timediff = col_time(format = ""),
  ..   Delevret = col_time(format = ""),
  ..   `St-Gervais` = col_time(format = ""),
  ..   Contamines = col_time(format = ""),
  ..   `La Balme` = col_time(format = ""),
  ..   Bonhomme = col_time(format = ""),
  ..   Chapieux = col_time(format = ""),
  ..   `Col Seigne` = col_time(format = ""),
  ..   `Lac Combal` = col_time(format = ""),
  ..   `Mt-Favre` = col_time(format = ""),
  ..   Checruit = col_time(format = ""),
  ..   Courmayeur = col_time(format = ""),
  ..   Bertone = col_time(format = ""),
  ..   Bonatti = col_time(format = ""),
  ..   Arnouvaz = col_time(format = ""),
  ..   `Col Ferret` = col_time(format = ""),
  ..   `La Fouly` = col_time(format = ""),
  ..   `Champex La` = col_time(format = ""),
  ..   `La Giète` = col_time(format = ""),
  ..   Trient = col_time(format = ""),
  ..   `Les Tseppe` = col_time(format = ""),
  ..   Vallorcine = col_time(format = ""),
  ..   `Col Montet` = col_time(format = ""),
  ..   Flégère = col_time(format = ""),
  ..   Arrivée = col_time(format = "")
  .. )
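The type constructors take arguments of their own. For instance, col_factor() has a levels argument that fixes the set and order of the factor levels instead of leaving them to be discovered from the data. The following sketch uses a made-up file and illustrative names (csvFile, auto, fixed) to contrast the two approaches.

```r
library(readr)

# Hypothetical file with a categorical column
csvFile <- tempfile(fileext = ".csv")
writeLines(c("runner,category", "a,V1 H", "b,SE H", "c,V1 H"), csvFile)

# levels = NULL (the default): levels are discovered from the data
auto <- read_csv(
  csvFile,
  col_types      = cols(category = col_factor()),
  show_col_types = FALSE
)

# Explicit levels: we control which levels exist and their order
fixed <- read_csv(
  csvFile,
  col_types      = cols(category = col_factor(levels = c("SE H", "V1 H"))),
  show_col_types = FALSE
)

levels(auto$category)
levels(fixed$category)
```

Fixing the levels up front is useful when the level order matters later on, for example in plot legends or as the reference category in a model.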

We can also use a compact string format, where each character represents the type of a column (e.g., "d" = double, "f" = factor).

boys <- read_csv2(here(dataDir, "boys.csv"), col_types = "dddddffdf")

ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.

head(boys)

# A tibble: 6 × 9
    age   hgt   wgt   bmi    hc gen   phb      tv reg  
  <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <fct>
1 0.035  50.1  3.65  14.5  33.7 <NA>  <NA>     NA south
2 0.038  53.5  3.37  11.8  35   <NA>  <NA>     NA south
3 0.057  50    3.14  12.6  35.2 <NA>  <NA>     NA south
4 0.06   54.5  4.27  14.4  36.7 <NA>  <NA>     NA south
5 0.062  57.5  5.03  15.2  37.3 <NA>  <NA>     NA south
6 0.068  55.5  4.66  15.1  37   <NA>  <NA>     NA south

str(boys)

spc_tbl_ [748 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ age: num [1:748] 0.035 0.038 0.057 0.06 0.062 0.068 0.068 0.071 0.071 0.073 ...
 $ hgt: num [1:748] 50.1 53.5 50 54.5 57.5 55.5 52.5 53 55.1 54.5 ...
 $ wgt: num [1:748] 3.65 3.37 3.14 4.27 5.03 ...
 $ bmi: num [1:748] 14.5 11.8 12.6 14.4 15.2 ...
 $ hc : num [1:748] 33.7 35 35.2 36.7 37.3 37 34.9 35.8 36.8 38 ...
 $ gen: Factor w/ 5 levels "G1","G2","G3",..: NA NA NA NA NA NA NA NA NA NA ...
 $ phb: Factor w/ 6 levels "P1","P2","P3",..: NA NA NA NA NA NA NA NA NA NA ...
 $ tv : num [1:748] NA NA NA NA NA NA NA NA NA NA ...
 $ reg: Factor w/ 5 levels "south","west",..: 1 1 1 1 1 1 1 2 2 3 ...
 - attr(*, "spec")=
  .. cols(
  ..   age = col_double(),
  ..   hgt = col_double(),
  ..   wgt = col_double(),
  ..   bmi = col_double(),
  ..   hc = col_double(),
  ..   gen = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
  ..   phb = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
  ..   tv = col_double(),
  ..   reg = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE)
  .. )
 - attr(*, "problems")=<externalptr>
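To show how the compact string maps onto the cols() interface, the sketch below reads a made-up file both ways (csvFile, compact, and verbose are illustrative names). In addition to "d" and "f", it uses "i" for integer and "_" to skip a column, both of which are listed in the readr documentation.

```r
library(readr)

# Hypothetical four-column file
csvFile <- tempfile(fileext = ".csv")
writeLines(c("id,score,group,note", "1,2.5,a,x", "2,3.5,b,y"), csvFile)

# One character per column, in file order:
# "i" = integer, "d" = double, "f" = factor, "_" = skip the column
compact <- read_csv(csvFile, col_types = "idf_")

# The same specification spelled out with cols()
verbose <- read_csv(
  csvFile,
  col_types = cols(
    id    = col_integer(),
    score = col_double(),
    group = col_factor(),
    note  = col_skip()
  )
)

str(compact)
```

The compact form is quicker to type, but it depends on column order, so the named cols() form is usually safer when the file layout might change.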

Practice

Use readr::read_csv() to load the data stored in “./data/iris.csv”.

Format the Species column as a factor.
Remove the Petal.Width column when reading the file.
Convert “-88” and “-99” to missing values.

iris <- read_csv(here::here("data", "iris.csv"),
                 col_types = cols(Species = col_factor()),
                 col_select = c(-ID, -Petal.Width),
                 na = c("-88", "-99"))
head(iris)

# A tibble: 6 × 4
  Sepal.Length Sepal.Width Petal.Length Species
         <dbl>       <dbl>        <dbl> <fct>  
1          5.1         3.5         NA   setosa 
2          4.9         3            1.4 setosa 
3          4.7         3.2          1.3 setosa 
4          4.6         3.1          1.5 setosa 
5          5           3.6         NA   setosa 
6         NA           3.9          1.7 setosa

Base R Options

Base R also includes built-in functions for reading delimited files. The Base R analogue of readr::read_delim() is read.table(). The read.table() function won’t attempt to auto-detect the delimiter, so we need to explicitly specify the delimiter character via the sep argument.

diabetes2 <- read.table(here(dataDir, "diabetes.txt"),
                        header = TRUE,
                        sep = "\t")
head(diabetes2)

  age  bmi  bp  tc   ldl hdl tch    ltg glu progress    sex
1  59 32.1 101 157  93.2  38   4 4.8598  87      151   male
2  48 21.6  87 183 103.2  70   3 3.8918  69       75 female
3  72 30.5  93 156  93.6  41   4 4.6728  85      141   male
4  24 25.3  84 198 131.4  40   5 4.8903  89      206 female
5  50 23.0 101 192 125.4  52   4 4.2905  80      135 female
6  23 22.6  89 139  64.8  61   2 4.1897  68       97 female
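The read.table() function has its own counterpart to readr's na argument, called na.strings. The following is a small sketch using a made-up space-delimited file (tabFile and boysBase are illustrative names).

```r
# Hypothetical space-delimited file that uses -999 as a missing-data code
tabFile <- tempfile(fileext = ".dat")
writeLines(c("age hgt gen", "0.035 50.1 -999", "0.038 -999 G1"), tabFile)

boysBase <- read.table(
  tabFile,
  header     = TRUE,              # The first line holds the column names
  sep        = " ",               # The delimiter must be stated explicitly
  na.strings = c("NA", "-999")    # Codes to convert to NA
)

str(boysBase)
```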

For CSV files, the read.csv() and read.csv2() functions are wrappers around read.table() that will read US-formatted and EU-formatted CSV files, respectively.

# Standard comma-separated CSV file
utmb2 <- read.csv(here(dataDir, "utmb_2017.csv"))
head(utmb2)

  X bib                  name                  team category rank nationality
1 0   4      D'HAENE François               Salomon     SE H    1          FR
2 1   2 JORNET BURGADA Kilian               Salomon     SE H    2          ES
3 2  14         TOLLEFSON Tim                  Hoka     SE H    3          US
4 3   7      THEVENARD Xavier                 Asics     SE H    4          FR
5 4   1          WALMSLEY Jim                  Hoka     SE H    5          US
6 5  17            CAPELL Pau The North Face / Buff     SE H    6          ES
      time timediff Delevret St.Gervais Contamines La.Balme Bonhomme Chapieux
1 19:01:54 00:00:00 01:11:50   01:45:05   02:41:09 03:33:40 04:28:07 04:53:31
2 19:16:59 00:15:05 01:10:00   01:44:21   02:41:01 03:33:45 04:29:18 04:54:39
3 19:53:00 00:51:06 01:15:24   01:48:38   02:45:17 03:41:50 04:41:04 05:10:05
4 20:03:39 01:01:45 01:11:51   01:45:08   02:41:11 03:33:45 04:38:06 05:07:23
5 20:11:38 01:09:44 01:09:59   01:42:15   02:39:45 03:33:20 04:27:43 04:53:05
6 20:12:43 01:10:49 01:13:16   01:46:46   02:43:57 03:37:13 04:35:55 05:04:53
  Col.Seigne Lac.Combal Mt.Favre Checruit Courmayeur  Bertone  Bonatti Arnouvaz
1   06:18:02   06:37:51 07:15:35 07:39:09   08:02:18 08:54:29 09:44:00 10:17:44
2   06:18:04   06:37:54 07:15:37 07:39:16   08:02:49 08:57:30 09:48:28 10:23:53
3   06:40:51   07:02:40 07:42:45 08:08:05   08:33:53 09:29:48 10:21:27 10:55:21
4   06:41:10   07:04:45 07:45:38 08:11:11   08:37:54 09:38:22 10:31:58 11:09:38
5   06:18:03   06:37:10 07:10:38 07:33:52   07:58:34 08:51:50 09:44:06 10:17:38
6   06:34:38   06:55:57 07:37:29 08:00:59   08:25:08 09:24:11 10:19:34 10:57:51
  Col.Ferret La.Fouly Champex.La La.Giète   Trient Les.Tseppe Vallorcine
1   11:11:12 12:04:26   13:24:20 14:55:05 15:24:59   16:06:17   16:51:13
2   11:18:54 12:12:40   13:33:52 15:13:06 15:41:22   16:23:16   17:05:14
3            12:46:12   14:08:23 15:45:55 16:12:00   16:56:16   17:39:45
4   12:09:17 13:00:59   14:22:44 15:58:54 16:28:53   17:12:35   17:55:20
5   11:11:10 12:09:51   13:55:02 16:11:03 16:35:32   17:14:48   17:52:03
6   12:01:54 12:59:54   14:22:45 15:58:47 16:28:31   17:12:38   17:57:23
  Col.Montet  Flégère  Arrivée
1   17:20:02 18:23:09 19:01:54
2   17:34:21 18:39:27 19:16:59
3   18:09:03 19:17:41 19:53:00
4   18:23:24 19:28:04 20:03:39
5   18:23:11 19:33:35 20:11:38
6   18:28:03 19:39:00 20:12:43

# CSV with semicolons as delimiters
boys2 <- read.csv2(here(dataDir, "boys_eu.csv"))
head(boys2)

    age  hgt   wgt   bmi   hc gen phb tv   reg
1 0.035 50.1 3.650 14.54 33.7         NA south
2 0.038 53.5 3.370 11.77 35.0         NA south
3 0.057 50.0 3.140 12.56 35.2         NA south
4 0.060 54.5 4.270 14.37 36.7         NA south
5 0.062 57.5 5.030 15.21 37.3         NA south
6 0.068 55.5 4.655 15.11 37.0         NA south
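To illustrate the wrapper relationship, the sketch below reads a made-up EU-formatted file with read.csv2() and with an equivalent read.table() call that sets the semicolon delimiter (sep) and comma decimal mark (dec) by hand. The names euFile, viaCsv2, and viaTable are illustrative; read.csv2() also presets a few other arguments (e.g., fill, quote) that make no difference for this simple file.

```r
# Hypothetical EU-formatted file: semicolon delimiter, comma decimal mark
euFile <- tempfile(fileext = ".csv")
writeLines(c("age;hgt;reg", "0,035;50,1;south", "0,038;53,5;west"), euFile)

viaCsv2  <- read.csv2(euFile)
viaTable <- read.table(euFile, header = TRUE, sep = ";", dec = ",")

# Check that both calls produce the same data frame
all.equal(viaCsv2, viaTable)
```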