library(haven)
dataDir <- "data"
mtcars1 <- read_spss(here::here(dataDir, "mtcars.sav"))

SPSS Files
In many research areas, especially the social and behavioral sciences, datasets are stored and distributed using the native SPSS data format, SAV. So, there’s a reasonable chance that you’ll need to work with SAV files at some point in your data analytic career. Fortunately, the haven and labelled packages provide a powerful set of tools for working with data stored in SAV files.
We use the haven::read_spss() function to load data from SAV files.
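If you don’t have a SAV file at hand, the round trip can be sketched with a temporary file. This is only an illustration: the toy tibble, its column names, and its labels are made up here, and the file is written with haven::write_sav() so that the example is self-contained.

```r
library(haven)

# Hypothetical toy data; the column names and labels are invented for illustration
toy <- tibble::tibble(
  score = c(10, 20, 30),
  group = labelled(
    c(0, 1, 0),
    labels = c(Control = 0, Treatment = 1),
    label  = "Study arm"
  )
)

# Write a temporary SAV file, then read it back with read_spss()
sav_path <- tempfile(fileext = ".sav")
write_sav(toy, sav_path)
toy_in <- read_spss(sav_path)

# Inspect the class and the variable label of the column we read back
class(toy_in$group)
attr(toy_in$group, "label")
```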
SAV files contain some very useful metadata, like variable labels (i.e., a short description of each variable) and value labels (i.e., meaningful labels for the numeric levels of a variable). These metadata act as a built-in codebook for the dataset, so we’d really like to preserve this information when we read the data into R.
Fortunately, read_spss() preserves this information by representing numeric variables that carry value labels as labelled vectors. For example, when we print the am column in the following code chunk, you’ll notice several additional pieces of information printed alongside the variable’s actual values.
mtcars1$am

<labelled<double>[32]>: Transmission type
 [1] 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1
Labels:
 value     label
     0 Automatic
     1    Manual

Labelled vectors are very similar to factors: the underlying data values are stored as a numeric vector, and each unique numeric value is paired with a descriptive value label. Unlike factors, however, labelled vectors are meant to store numeric data. So, R will treat labelled vectors like numeric vectors for analysis.
# R is happy to analyze labelled vectors as numeric data
mean(mtcars1$am)

[1] 0.40625

# We can't do numeric calculations with factors
am_factor <- as.factor(mtcars$am)
mean(am_factor)

Warning in mean.default(am_factor): argument is not numeric or logical:
returning NA

[1] NA

We can use the attributes() function to list all the attributes attached to the am column.
attributes(mtcars1$am)

$label
[1] "Transmission type"
$format.spss
[1] "F1.0"
$class
[1] "haven_labelled" "vctrs_vctr"     "double"        
$labels
Automatic    Manual 
        0         1 

In this case, we are most interested in the label and labels fields, which show the variable label and the value labels, respectively.
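Because label and labels are ordinary R attributes, they can also be read directly with attr(). The sketch below builds a small labelled vector by hand with haven::labelled() (the values are made up to mimic the am column) and then converts it to a factor with haven::as_factor(), which is one way to move from numeric codes to their labels.

```r
library(haven)

# Hypothetical vector that mimics the am column
am <- labelled(
  c(1, 1, 0, 0, 1),
  labels = c(Automatic = 0, Manual = 1),
  label  = "Transmission type"
)

# The variable label and the value labels are stored as attributes
attr(am, "label")
attr(am, "labels")

# Converting to a factor replaces the numeric codes with their labels
am_fct <- as_factor(am)
levels(am_fct)
```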
Manipulating Labelled Vectors
The labelled package provides a suite of utilities for manipulating the metadata of labelled vectors, including functions to add, remove, or modify labels. We won’t cover these features in detail here, but we’ll consider some of the basic options.
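As a rough sketch of what adding, modifying, and removing labels can look like, the following example uses var_label(), val_label(), and remove_val_labels() from the labelled package on a hand-made vector. The vector and all of its labels are hypothetical.

```r
library(labelled)

# Hypothetical labelled vector, built here so the sketch is self-contained
x <- haven::labelled(c(0, 1, 1, 0), labels = c(No = 0, Yes = 1))

# Add a variable label
var_label(x) <- "Owns a car"

# Modify the label attached to the value 1, then add a label for the value 9
val_label(x, 1) <- "Yes, owns a car"
val_label(x, 9) <- "Refused"
val_labels(x)

# Remove the value labels from a copy of the vector
y <- remove_val_labels(x)
val_labels(y)
```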
We may not care about the individual value labels. If so, we can remove the value labels (but retain the variable labels) with the labelled::unlabelled() function.
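It can help to see what unlabelled() does to individual columns before applying it to a whole dataset. As far as the labelled documentation describes, a labelled vector whose observed values all have labels is converted to a factor, while a partially labelled vector simply loses its labelled class. The toy tibble below is hypothetical and exists only to illustrate those two cases.

```r
library(labelled)

# Hypothetical data: one fully labelled column and one partially labelled column
toy <- tibble::tibble(
  full = haven::labelled(c(0, 1, 0), labels = c(No = 0, Yes = 1),
                         label = "Fully labelled"),
  part = haven::labelled(c(1, 2, 99), labels = c(Missing = 99),
                         label = "Partially labelled")
)

toy2 <- unlabelled(toy)

# Check what each column became
class(toy2$full)
class(toy2$part)

# Check which variable labels remain
var_label(toy2)
```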
library(labelled)
mtcars2 <- unlabelled(mtcars1)

If we compare mtcars1 and mtcars2, we see that all the value labels are gone in mtcars2.

val_labels(mtcars1)

$mpg
NULL
$cyl
NULL
$disp
NULL
$hp
NULL
$drat
NULL
$wt
NULL
$qsec
NULL
$vs
V-Shaped Straight 
       0        1 
$am
Automatic    Manual 
        0         1 
$gear
NULL
$carb
NULL

val_labels(mtcars2)

$mpg
NULL
$cyl
NULL
$disp
NULL
$hp
NULL
$drat
NULL
$wt
NULL
$qsec
NULL
$vs
NULL
$am
NULL
$gear
NULL
$carb
NULL

The variable labels are still present in both datasets, though.

var_label(mtcars1)

$mpg
[1] "Fuel economy (miles/gallon)"
$cyl
[1] "Number of cylinders"
$disp
[1] "Displacement (cubic inches)"
$hp
[1] "Gross horsepower"
$drat
[1] "Rear axle ratio"
$wt
[1] "Weight (1000 pounds)"
$qsec
[1] "1/4 mile time (seconds)"
$vs
[1] "Cylinder geometry"
$am
[1] "Transmission type"
$gear
[1] "Number or forward gears"
$carb
[1] "Number or carburators"

var_label(mtcars2)

$mpg
[1] "Fuel economy (miles/gallon)"
$cyl
[1] "Number of cylinders"
$disp
[1] "Displacement (cubic inches)"
$hp
[1] "Gross horsepower"
$drat
[1] "Rear axle ratio"
$wt
[1] "Weight (1000 pounds)"
$qsec
[1] "1/4 mile time (seconds)"
$vs
[1] "Cylinder geometry"
$am
[1] "Transmission type"
$gear
[1] "Number or forward gears"
$carb
[1] "Number or carburators"

Use the haven::read_spss() function to load the SPSS dataset saved as “./data/starwars.sav”.
- What is the variable label for the birth_year column?
- What are the value labels for the sex column?
The following dendrogram illustrates the structure of the working directory for this webr session.
starwars <- read_spss(here::here("data", "starwars.sav"))
head(starwars)

# A tibble: 6 × 11
  name   height  mass hair_color skin_color eye_color birth_year sex     gender 
  <chr>   <dbl> <dbl> <dbl+lbl>  <dbl+lbl>  <dbl+lbl>      <dbl> <dbl+l> <dbl+l>
1 Luke …    172    77  5 [blond]  7 [fair]   2 [blue]       19   3 [mal… 2 [mas…
2 C-3PO     167    75 NA          9 [gold]  15 [yell…      112   4 [non… 2 [mas…
3 R2-D2      96    32 NA         29 [white… 11 [red]        33   4 [non… 2 [mas…
4 Darth…    202   136 10 [none]  28 [white] 15 [yell…       41.9 3 [mal… 2 [mas…
5 Leia …    150    49  7 [brown] 17 [light]  4 [brow…       19   1 [fem… 1 [fem…
6 Owen …    178   120  8 [brown… 17 [light]  2 [blue]       52   3 [mal… 2 [mas…
# ℹ 2 more variables: homeworld <dbl+lbl>, species <dbl+lbl>

Variable label for birth_year:

var_label(starwars$birth_year)

[1] "Year of birth (BBY = Before Battle of Yavin)"

Value labels for sex:

val_labels(starwars$sex)

        female hermaphroditic           male           none 
             1              2              3              4