Native Data Formats

R provides two native data formats: RData and RDS.

RData Files

Although we can certainly store datasets as RData files, it’s a bit misleading to characterize the RData format as a tool for storing datasets. RData files are actually compressed snapshots of an entire R environment. So, when we “load” an RData file, we’re not really loading a single dataset; rather, we’re “restoring” the contents of the environment stored in the file. Of course, the environment we restore may contain only a single dataset, in which case the end result will be functionally equivalent to loading that dataset. It’s worth remembering, though, that the RData format is a more general tool than we need for this job.

We use the load() function to load RData files. In the following code chunk, we load the "boys.RData" file. This file contains only a single data frame called boys.

dataDir <- "data"

# List the contents of the current environment
ls()

[1] "dataDir"

# Load the objects stored in 'boys.RData'
load(here::here(dataDir, "boys.RData"))

# Now we have a new data frame in our environment
ls()

[1] "boys"    "dataDir"

head(boys)

     age  hgt   wgt   bmi   hc  gen  phb tv   reg
3  0.035 50.1 3.650 14.54 33.7 <NA> <NA> NA south
4  0.038 53.5 3.370 11.77 35.0 <NA> <NA> NA south
18 0.057 50.0 3.140 12.56 35.2 <NA> <NA> NA south
23 0.060 54.5 4.270 14.37 36.7 <NA> <NA> NA south
28 0.062 57.5 5.030 15.21 37.3 <NA> <NA> NA south
36 0.068 55.5 4.655 15.11 37.0 <NA> <NA> NA south

Notice that we don’t assign the return value of load() to an object. When we run load(), R simply adds every object stored in the file to our current environment, under the names those objects had when the file was saved. Hence, RData files are best suited to capturing a snapshot of your current R session so that you can restore the exact state of your environment sometime in the future.
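Because load() writes straight into an environment, it will overwrite any existing object that shares a name with something in the file. The following is a minimal sketch of one way to guard against that, not part of the original example: it uses a temporary file and two made-up objects (x and y) instead of the course data. It relies on two documented features of load(): the function invisibly returns the names of the objects it restored, and its envir argument lets us restore into a separate environment.

```r
# Create a throwaway RData file holding two illustrative objects
tmpFile <- tempfile(fileext = ".RData")
x <- 1:3
y <- "saved value"
save(x, y, file = tmpFile)

# Change 'y' in the current environment after saving
y <- "current value"

# Restore into a fresh environment so nothing in our session is overwritten
snapEnv <- new.env()
restored <- load(tmpFile, envir = snapEnv)

# Names of the objects that load() restored
restored

# The saved copy lives in 'snapEnv'; our session's copy is untouched
snapEnv$y
y
```

Calling load(tmpFile) without the envir argument would instead replace the session's y with the saved value.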

In the following code chunk, we use an RData file to restore the snapshot of a previous R session.

# The environment contains only the objects we loaded in the last example
ls()

[1] "boys"    "dataDir"

# Load the objects stored in 'snapshot.RData'
load(here::here(dataDir, "snapshot.RData"))

# Now, we've added several new objects to our session
ls()

[1] "attitude" "boys"     "dataDir"  "iris"     "out"      "p1"

Restoring the objects stored in “snapshot.RData” adds two new datasets, iris and attitude, to our session. It also restores a set of regression model results, out, and a ggplot object, p1.

str(iris)

'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

str(attitude)

'data.frame':   30 obs. of  7 variables:
 $ rating    : num  43 63 71 61 81 43 58 71 72 67 ...
 $ complaints: num  51 64 70 63 78 55 67 75 82 61 ...
 $ privileges: num  30 51 68 45 56 49 42 50 72 45 ...
 $ learning  : num  39 54 69 47 66 44 56 55 67 47 ...
 $ raises    : num  61 63 76 54 71 54 66 70 71 62 ...
 $ critical  : num  92 73 86 84 83 49 68 66 83 80 ...
 $ advance   : num  45 47 48 35 47 34 35 41 31 41 ...

summary(out)


Call:
lm(formula = Petal.Width ~ Petal.Length + Species, data = iris)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.63706 -0.07779 -0.01218  0.09829  0.47814 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -0.09083    0.05639  -1.611    0.109    
Petal.Length       0.23039    0.03443   6.691 4.41e-10 ***
Speciesversicolor  0.43537    0.10282   4.234 4.04e-05 ***
Speciesvirginica   0.83771    0.14533   5.764 4.71e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1796 on 146 degrees of freedom
Multiple R-squared:  0.9456,    Adjusted R-squared:  0.9445 
F-statistic: 845.5 on 3 and 146 DF,  p-value: < 2.2e-16

print(p1)

We could now pick up with our previous analysis exactly where we left off when we created “snapshot.RData”.
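The text does not show how “snapshot.RData” was created. As a rough sketch of how such a file could be produced, the following example saves a small data frame and a fitted model with save(), removes them to mimic a fresh session, and restores them with load(). The object names (dat, fit) and the temporary file are illustrative assumptions, not the objects used above.

```r
# Build a small "session" (illustrative objects, not those in snapshot.RData)
dat <- data.frame(x = c(1, 2, 3, 4), y = c(2.1, 3.9, 6.2, 7.8))
fit <- lm(y ~ x, data = dat)

# Save the selected objects to a single RData file
snapFile <- tempfile(fileext = ".RData")
save(dat, fit, file = snapFile)

# Mimic a fresh session by removing the objects, then restore them
rm(dat, fit)
load(snapFile)

# The fitted model is available again without refitting
coef(fit)
```

To capture everything in the global environment rather than a chosen set of objects, base R also provides save.image(), which takes the output file path through its file argument.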

RDS Files

RData files are much better for the kind of “back-up and restore” operations that we see in the preceding example than they are for storing individual datasets. Fortunately, R has a second native data format that is ideally suited for storing datasets. RDS files are specifically designed to store single R objects like the data frames we typically use to hold datasets. We use the readRDS() function to load RDS files.

titanic <- readRDS(here::here(dataDir, "titanic.rds"))
str(titanic)

'data.frame':   887 obs. of  8 variables:
 $ survived        : Factor w/ 2 levels "no","yes": 1 2 2 2 1 1 1 1 2 2 ...
 $ class           : Factor w/ 3 levels "1st","2nd","3rd": 3 1 3 1 3 3 1 3 3 2 ...
 $ name            : chr  "Mr. Owen Harris Braund" "Mrs. John Bradley (Florence Briggs Thayer) Cumings" "Miss. Laina Heikkinen" "Mrs. Jacques Heath (Lily May Peel) Futrelle" ...
 $ sex             : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
 $ age             : num  22 38 26 35 35 27 54 2 27 14 ...
 $ siblings_spouses: int  1 1 0 1 0 0 0 3 0 1 ...
 $ parents_children: int  0 0 0 0 0 0 0 1 2 0 ...
 $ fare            : num  7.25 71.28 7.92 53.1 8.05 ...

Notice how the readRDS() call follows the same conventions as the other data-ingest functions we’ve covered. Most importantly, readRDS() returns a single R object, and we assign that object to a variable in our environment.
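The counterpart of readRDS() is saveRDS(), which writes one object to a file. The sketch below round-trips a small, made-up data frame (scores) through a temporary RDS file to show that the object comes back unchanged and that we are free to choose its name when reading it.

```r
# A small illustrative data frame with a factor column
scores <- data.frame(
  id    = 1:3,
  group = factor(c("a", "b", "a"), levels = c("a", "b"))
)

# Write the single object to an RDS file
rdsFile <- tempfile(fileext = ".rds")
saveRDS(scores, rdsFile)

# Read it back under any name we like
scores2 <- readRDS(rdsFile)

# Check that the restored object matches the original
identical(scores, scores2)
```

Unlike load(), nothing is added to the environment unless we assign the result, so an RDS file cannot silently overwrite an existing object.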

Practice

Load the dataset saved as “./data/diabetes.rds”.
Use the str() function to compare the structure of the dataset you loaded to the diabetes0 object loaded below.

Are there any differences between these two objects? If so, what are the differences?

The following dendrogram illustrates the structure of the working directory for this webr session.

diabetes0 <- read.table(here::here("data", "diabetes.txt"), header = TRUE, sep = "\t")
diabetes1 <- readRDS(here::here("data", "diabetes.rds"))
str(diabetes0)

'data.frame':   442 obs. of  11 variables:
 $ age     : int  59 48 72 24 50 23 36 66 60 29 ...
 $ bmi     : num  32.1 21.6 30.5 25.3 23 22.6 22 26.2 32.1 30 ...
 $ bp      : num  101 87 93 84 101 89 90 114 83 85 ...
 $ tc      : int  157 183 156 198 192 139 160 255 179 180 ...
 $ ldl     : num  93.2 103.2 93.6 131.4 125.4 ...
 $ hdl     : num  38 70 41 40 52 61 50 56 42 43 ...
 $ tch     : num  4 3 4 5 4 2 3 4.55 4 4 ...
 $ ltg     : num  4.86 3.89 4.67 4.89 4.29 ...
 $ glu     : int  87 69 85 89 80 68 82 92 94 88 ...
 $ progress: int  151 75 141 206 135 97 138 63 110 310 ...
 $ sex     : chr  "male" "female" "male" "female" ...

str(diabetes1)

'data.frame':   442 obs. of  11 variables:
 $ age     : int  59 48 72 24 50 23 36 66 60 29 ...
 $ bmi     : num  32.1 21.6 30.5 25.3 23 22.6 22 26.2 32.1 30 ...
 $ bp      : num  101 87 93 84 101 89 90 114 83 85 ...
 $ tc      : int  157 183 156 198 192 139 160 255 179 180 ...
 $ ldl     : num  93.2 103.2 93.6 131.4 125.4 ...
 $ hdl     : num  38 70 41 40 52 61 50 56 42 43 ...
 $ tch     : num  4 3 4 5 4 2 3 4.55 4 4 ...
 $ ltg     : num  4.86 3.89 4.67 4.89 4.29 ...
 $ glu     : int  87 69 85 89 80 68 82 92 94 88 ...
 $ progress: int  151 75 141 206 135 97 138 63 110 310 ...
 $ sex     : Factor w/ 2 levels "female","male": 2 1 2 1 1 1 2 2 2 1 ...

The main difference between these two objects is that diabetes1 preserves the R-specific attributes of the data. Notably, the sex variable is stored as a two-level factor in diabetes1. In diabetes0, on the other hand, sex is a character vector, because a plain-text file has no way to record that a column should be treated as a factor.
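The same effect can be reproduced on a small scale. The following sketch writes a made-up data frame (df) to both a tab-delimited text file and an RDS file, then reads each back. The file paths are temporary files, and the example assumes R 4.0.0 or later, where read.table() leaves strings as character vectors by default.

```r
# Illustrative data frame with a factor and an integer column
df <- data.frame(
  sex = factor(c("male", "female", "male")),
  age = c(59L, 48L, 72L)
)

txtFile <- tempfile(fileext = ".txt")
rdsFile <- tempfile(fileext = ".rds")

# Save the same object in both formats
write.table(df, txtFile, sep = "\t", row.names = FALSE, quote = FALSE)
saveRDS(df, rdsFile)

# Read each file back
fromTxt <- read.table(txtFile, header = TRUE, sep = "\t")
fromRds <- readRDS(rdsFile)

# The text round trip loses the factor; the RDS round trip keeps it
class(fromTxt$sex)
class(fromRds$sex)

# If needed, the factor can be rebuilt by hand after reading the text file
sexFactor <- factor(fromTxt$sex)
levels(sexFactor)
```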