R provides two native data formats: RData and RDS.
RData Files
Although we can certainly store datasets as RData files, it’s a bit misleading to characterize the RData format as a tool for storing datasets. RData files are actually compressed snapshots of an entire R environment. So, when we “load” an RData file, we’re not really loading a single dataset; rather, we’re “restoring” the contents of the environment stored in the file. Of course, the environment we restore may contain only a single dataset, in which case the end result will be functionally equivalent to loading that dataset. It’s worth remembering, though, that the RData format is a more general tool than we need for this job.
We use the load() function to load RData files. In the following code chunk, we load the "boys.RData" file. This file contains only a single data frame called boys.
dataDir <- "data"

# List the contents of the current environment
ls()
[1] "dataDir"
# Load the objects stored in 'boys.RData'
load(here::here(dataDir, "boys.RData"))

# Now we have a new data frame in our environment
ls()
[1] "boys" "dataDir"
head(boys)
     age  hgt   wgt   bmi   hc  gen  phb tv   reg
3  0.035 50.1 3.650 14.54 33.7 <NA> <NA> NA south
4  0.038 53.5 3.370 11.77 35.0 <NA> <NA> NA south
18 0.057 50.0 3.140 12.56 35.2 <NA> <NA> NA south
23 0.060 54.5 4.270 14.37 36.7 <NA> <NA> NA south
28 0.062 57.5 5.030 15.21 37.3 <NA> <NA> NA south
36 0.068 55.5 4.655 15.11 37.0 <NA> <NA> NA south
Notice that we don’t assign the return value of load() to an object. When we run load(), we simply add all the objects stored in the file to our current environment, under the names they had when they were saved. Hence, RData files are best suited to capturing an instantaneous snapshot of your current R session so that you can exactly restore the current state of your environment sometime in the future.
In the following code chunk, we use an RData file to restore the snapshot of a previous R session.
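One consequence of this behavior is that load() silently overwrites any existing object that shares a name with an object in the file. The following is a minimal sketch of one way to inspect an RData file without touching the current environment: we restore it into a separate environment via the envir argument of load(). The objects x and y, the temporary file, and the name sandbox are hypothetical and exist only for illustration.

```r
# Create a small RData file in a temporary location (illustrative data only)
tmpFile <- tempfile(fileext = ".RData")
x <- 1:3
y <- letters[1:2]
save(x, y, file = tmpFile)

# Restore the file into its own environment instead of the current one
sandbox <- new.env()

# load() invisibly returns the names of the objects it restored
restored <- load(tmpFile, envir = sandbox)
print(restored)

# The restored objects live in 'sandbox'
ls(sandbox)
sandbox$x
```

Capturing the return value of load() in this way gives us the names of the restored objects, not the objects themselves.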
# The environment contains only the objects we loaded in the last example
ls()
[1] "boys" "dataDir"
# Load the objects stored in 'snapshot.RData'
load(here::here(dataDir, "snapshot.RData"))

# Now, we've added several new objects to our session
ls()
[1] "attitude" "boys" "dataDir" "iris" "out" "p1"
Restoring the objects stored in “snapshot.RData” adds two new datasets, iris and attitude, to our environment, along with a set of regression model results, out, and a ggplot object, p1.
summary(out)

Call:
lm(formula = Petal.Width ~ Petal.Length + Species, data = iris)

Residuals:
     Min       1Q   Median       3Q      Max
-0.63706 -0.07779 -0.01218  0.09829  0.47814

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)       -0.09083    0.05639  -1.611    0.109
Petal.Length       0.23039    0.03443   6.691 4.41e-10 ***
Speciesversicolor  0.43537    0.10282   4.234 4.04e-05 ***
Speciesvirginica   0.83771    0.14533   5.764 4.71e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1796 on 146 degrees of freedom
Multiple R-squared:  0.9456,	Adjusted R-squared:  0.9445
F-statistic: 845.5 on 3 and 146 DF,  p-value: < 2.2e-16
print(p1)
We could now pick up with our previous analysis exactly where we left off when we created “snapshot.RData”.
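The text above does not show how “snapshot.RData” was created. As a rough sketch of one way such a file could be produced, the following code saves a couple of objects with save(), removes them, and restores them with load(). The objects fit and notes and the temporary file are hypothetical stand-ins, not the contents of the original snapshot.

```r
# Hypothetical session objects standing in for real analysis results
fit   <- lm(mpg ~ wt, data = mtcars)
notes <- "model fitted on mtcars"

# Write the snapshot to a temporary file (illustrative location)
snapshotFile <- tempfile(fileext = ".RData")

# Save the named objects; save.image(file = snapshotFile) would instead
# save everything in the global environment
save(fit, notes, file = snapshotFile)

# Simulate a fresh session by removing the objects
rm(fit, notes)

# Restore the objects under their original names
load(snapshotFile)
coef(fit)
```

Listing the objects explicitly in save() keeps the snapshot limited to what we intend to restore, whereas save.image() captures the whole workspace.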
RDS Files
RData files are much better for the kind of “back-up and restore” operations that we see in the preceding example than they are for storing individual datasets. Fortunately, R has a second native data format that is ideally suited for storing datasets. RDS files are specifically designed to store single R objects like the data frames we typically use to hold datasets. We use the readRDS() function to load RDS files.
Notice how the readRDS() call follows the same conventions as the other data-ingest functions we’ve covered. Most importantly, readRDS() returns a single R object, and we assign that object to a variable in our environment.
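To make the contrast with load() concrete, here is a minimal sketch of a round trip through an RDS file using saveRDS() and readRDS(). The data frame scores, its columns, and the temporary file are hypothetical and chosen only for illustration.

```r
# Hypothetical toy dataset; names and values are illustrative only
scores <- data.frame(id = 1:3, score = c(2.5, 3.7, 1.2))

# Write the single object to an RDS file in a temporary location
rdsFile <- tempfile(fileext = ".rds")
saveRDS(scores, file = rdsFile)

# readRDS() returns the stored object, so we choose the name it gets
myScores <- readRDS(rdsFile)

# Check that the restored object matches the original
all.equal(scores, myScores)
```

Because the file stores the object without its name, we are free to assign the result to any variable we like, and nothing in our environment is overwritten unless we do so explicitly.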
Practice
Load the dataset saved as “./data/diabetes.rds”.
Use the str() function to compare the structure of the dataset you loaded to the diabetes0 object loaded below.
Are there any differences between these two objects? If so, what are the differences?
The main difference between these two objects lies in the fact that diabetes1 preserves all the R-specific formatting. Notably, the sex variable is stored as a two-level factor in diabetes1. In diabetes0, on the other hand, sex is a character vector.
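The same kind of difference can be reproduced with a toy example. The sketch below assumes that the non-RDS version of a dataset comes from a plain-text format such as CSV, which may not match how diabetes0 was actually created. The data frame toy, its columns, and the temporary files are hypothetical.

```r
# Hypothetical toy dataset with a two-level factor
toy <- data.frame(
  age = c(34.5, 51.2, 46.8),
  sex = factor(c("female", "male", "female"))
)

csvFile <- tempfile(fileext = ".csv")
rdsFile <- tempfile(fileext = ".rds")

# Store the same data in a text format and in the RDS format
write.csv(toy, csvFile, row.names = FALSE)
saveRDS(toy, rdsFile)

# Read both versions back in
fromCsv <- read.csv(csvFile, stringsAsFactors = FALSE)
fromRds <- readRDS(rdsFile)

# Compare the structure of the two objects
str(fromCsv)
str(fromRds)
```

In this sketch, the CSV file holds only the text labels, so sex comes back as a character vector, while the RDS file keeps the factor and its levels.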