Introduction
In introductory R classes we learn how to import data by reading one file at a time. This week, we will “level up” our R skills by learning how to import multiple files at once. This is especially useful when you have several files with a similar structure and want to combine and analyze them together. To accomplish this goal, we’ll work with a new data structure called a list and learn how to iterate over that list.
The Data
Download the data for this lesson here.
Our scenario this week is that we have a folder of registration data for workshops that we want to combine and analyze. We have a separate spreadsheet for each workshop with a list of attendees, their affiliations, and status. This data has been generated and does not contain any names or information of actual workshop participants, but is based on the structure of registration data we use here at UMB.
Let’s load the packages we’ll be working with today.
library(tidyverse)
library(fs)
Review - Importing data
To import one CSV file, we can use the function read_csv() from the readr package, part of the tidyverse. You can also use the Import Dataset widget in the RStudio GUI.
Workshop01 <- read_csv("data/Workshop_01.csv")
The Power of Iteration
To read in multiple files, we could copy and paste that code a bunch of times with different names each time.
Workshop02 <- read_csv("data/Workshop_02.csv")
Workshop03 <- read_csv("data/Workshop_03.csv")
Workshop04 <- read_csv("data/Workshop_04.csv")
But! That’s a really good way to make mistakes, and part of the benefit of learning to code is to avoid the need to repeat things. Most programming languages, including R, have ways of iterating, or repeating, an operation over multiple objects based on some instructions. Often this iteration takes place in some kind of loop operation, e.g.
for (thing in list_of_things) {
  do_some_function(thing)
}
These can be used in R, but it is more common to use one of several available R functions for this task.
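To make the loop pattern above concrete, here is a minimal runnable sketch. The toy vector `values` and the result names are hypothetical and not part of the lesson data; the second half shows the same operation with `vapply()`, one of the base R functions that hides the loop.

```r
# Hypothetical toy data: three numbers whose square roots we want
values <- c(1, 4, 9)

# Pre-allocate the result, then fill it in one element at a time
roots_loop <- numeric(length(values))
for (i in seq_along(values)) {
  roots_loop[i] <- sqrt(values[i])
}

# The same operation with a base R function that does the looping for us;
# numeric(1) declares that each call returns a single number
roots_apply <- vapply(values, sqrt, numeric(1))
```

Both approaches give the same result; the function-based version is shorter and leaves less room for indexing mistakes.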
We’ll use a function called map() from the purrr package, part of the tidyverse. It takes as arguments a vector or list (.x) and a function (.f) that is applied to each element of that vector or list. So the syntax of this function is map(.x, .f).
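Before applying map() to files, it may help to see it on something small. This is a sketch with a hypothetical toy vector, `values`, that is not part of the lesson data.

```r
library(purrr)

# Hypothetical toy input
values <- c(1, 4, 9)

# map() applies .f to each element of .x and always returns a list
roots_list <- map(values, sqrt)

# map_dbl() does the same but returns a numeric vector instead of a list
roots_dbl <- map_dbl(values, sqrt)

str(roots_list)
roots_dbl
```

Because map() always returns a list, it can hold results of any type, which is what lets it hold a whole data frame per file later on.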
So, we have our function read_csv(), and now we want it to repeat for all the file names in our directory. So - how do we accomplish this?
- Create a vector of file names
- Use map() to run read_csv() on each file name
- Combine files into one data frame
- Summarize data frame
Step 1: Create a vector of file names
First, you’ll want to have all the files you need saved in the same directory. Then we can use the base R function list.files() on that directory. We’ll save the result as an object so we can use it later.
files <- list.files("data/", full.names = TRUE)
files
[1] "data//Workshop_01.csv" "data//Workshop_02.csv" "data//Workshop_03.csv"
[4] "data//Workshop_04.csv" "data//Workshop_05.csv" "data//Workshop_06.csv"
[7] "data//Workshop_07.csv" "data//Workshop_08.csv" "data//Workshop_09.csv"
[10] "data//Workshop_10.csv"
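If the folder might also contain files that are not CSVs, list.files() can filter by name with its pattern argument. The sketch below runs against a throwaway folder created under tempdir(); the folder name and file names are hypothetical and used only for illustration.

```r
# Build a throwaway folder with two CSV files and one unrelated file
demo_dir <- file.path(tempdir(), "list_files_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("Workshop_01.csv", "Workshop_02.csv", "notes.txt")))

# pattern is a regular expression; "\\.csv$" matches names ending in .csv
csv_files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Show just the file names, without the temporary directory path
basename(csv_files)
```

Filtering at this step means read_csv() is never asked to parse a file it was not meant for.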
Step 2: Iterate over file names
As mentioned before, map() needs a vector or list as its .x argument, and now we have one: files. The function we want to apply, read_csv, is the .f argument.
workshop_list <- map(files, read_csv)
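The step above can be tried end to end without the lesson data. This sketch writes two small CSV files to a temporary folder and reads them back with map(); the folder, file contents, and object names are all made up for illustration, and show_col_types = FALSE is only there to quiet the column-type messages from read_csv().

```r
library(readr)
library(purrr)

# Create a temporary folder holding two small, made-up workshop files
demo_dir <- file.path(tempdir(), "map_read_demo")
dir.create(demo_dir, showWarnings = FALSE)
write_csv(data.frame(attendees = c("A", "B"), status = c("Staff", "Student")),
          file.path(demo_dir, "Workshop_01.csv"))
write_csv(data.frame(attendees = "C", status = "Faculty"),
          file.path(demo_dir, "Workshop_02.csv"))

# Step 1: vector of file names
demo_files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Step 2: read every file; the result is a list with one data frame per file
demo_list <- map(demo_files, function(path) read_csv(path, show_col_types = FALSE))

length(demo_list)
class(demo_list[[1]])
```

Wrapping read_csv() in a small function is one way to pass it extra arguments while still letting map() supply each file name.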
Step 3: Combine into one data frame
Working with lists
After running our last step, the resulting object is in a structure called a list. In most beginning R workshops we work with two main data structures: vectors and data frames.
| | homogeneous | heterogeneous |
|---|---|---|
| 1d | vector | list |
| 2d | matrix | data frame |
| nd | array | |
Like vectors, lists are made up of elements, but those elements can be of any type, including other vectors, data frames, or even other lists. Often we use lists to work with several data frames at one time.
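A short sketch may make the difference from a vector clearer. The list below is a hypothetical example, not part of the lesson data; it mixes three different kinds of element and shows the two ways of indexing.

```r
# A list can hold elements of different types and sizes
example_list <- list(
  3,                                      # a single number
  c("a", "b"),                            # a character vector
  data.frame(x = 1:2, y = c("u", "v"))    # a data frame
)

# Double brackets extract the element itself
example_list[[2]]
class(example_list[[3]])

# Single brackets return a smaller list that still wraps the element
class(example_list[2])
```

The double-bracket form is the one to reach for when you want to work with a single data frame stored in a list.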
Adding names to lists
Each element of the list is given an index number. But it will be helpful to us to give the elements names. We do this with the names() function.
names(workshop_list) #currently no names
NULL
Let’s name each element after its file. This way, when we combine the data, we can keep track of which rows came from which data frame. We’ll use the fs package, which contains tools for working with file names.
path_file(files) # keeps only file part of path
[1] "Workshop_01.csv" "Workshop_02.csv" "Workshop_03.csv" "Workshop_04.csv"
[5] "Workshop_05.csv" "Workshop_06.csv" "Workshop_07.csv" "Workshop_08.csv"
[9] "Workshop_09.csv" "Workshop_10.csv"
path_ext_remove(files) # keeps the path but drops the file extension
[1] "data/Workshop_01" "data/Workshop_02" "data/Workshop_03" "data/Workshop_04"
[5] "data/Workshop_05" "data/Workshop_06" "data/Workshop_07" "data/Workshop_08"
[9] "data/Workshop_09" "data/Workshop_10"
names(workshop_list) <- path_ext_remove(path_file(files))
names(workshop_list)
[1] "Workshop_01" "Workshop_02" "Workshop_03" "Workshop_04" "Workshop_05"
[6] "Workshop_06" "Workshop_07" "Workshop_08" "Workshop_09" "Workshop_10"
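If fs is not available, the same names can be built with base R. This is an alternative sketch rather than the approach used in this lesson; the paths and the two one-row data frames are hypothetical stand-ins.

```r
# Hypothetical file paths and a matching list of tiny data frames
paths <- c("data/Workshop_01.csv", "data/Workshop_02.csv")
demo_list <- list(data.frame(x = 1), data.frame(x = 2))

# basename() drops the directory; tools::file_path_sans_ext() drops the extension
names(demo_list) <- tools::file_path_sans_ext(basename(paths))

names(demo_list)

# Named elements can now be extracted by name instead of by position
demo_list[["Workshop_02"]]
```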
Binding rows
We’ll use the bind_rows() function from dplyr to paste each element in our list into one data frame. The .id argument will allow us to keep track of which row came from which original dataset. It uses the names of the list elements as values.
all_workshops <- bind_rows(workshop_list, .id = "Workshop") %>%
  arrange(attendees)
head(all_workshops)
# A tibble: 6 × 4
  Workshop    attendees                affiliation status
  <chr>       <chr>                    <chr>       <chr>
1 Workshop_06 Abdul Waahid al-Jaber    Nursing     Faculty
2 Workshop_10 Addison Elkins           Graduate    Staff
3 Workshop_07 Adriana Picazo           Medicine    Staff
4 Workshop_05 Alexander Garcia-Marrufo Nursing     Faculty
5 Workshop_09 Alexandra Willis         Social Work Staff
6 Workshop_06 Alexis Wright            Medicine    Faculty
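To see what .id does in isolation, here is a small sketch with a hypothetical named list of two made-up data frames (not the lesson data).

```r
library(dplyr)

# A named list of two small, made-up data frames with the same columns
demo_list <- list(
  Workshop_01 = data.frame(attendees = c("A", "B"), status = c("Staff", "Student")),
  Workshop_02 = data.frame(attendees = "C", status = "Faculty")
)

# .id adds a first column holding the name of the list element each row came from
combined <- bind_rows(demo_list, .id = "Workshop")
combined
```

This is why naming the list elements first matters: without names, the new column would only hold position numbers, which are harder to trace back to a file.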
Step 4: Summarize the data
Now we can work with our data in the more familiar structure of a data frame. This allows us to do things like count how many attendees of each affiliation and of each status there were across all workshops combined.
all_workshops %>%
  count(affiliation)
# A tibble: 6 × 2
  affiliation     n
  <chr>       <int>
1 Dentistry      21
2 Graduate       25
3 Medicine       30
4 Nursing        35
5 Pharmacy       31
6 Social Work    31
all_workshops %>%
  count(status)
# A tibble: 3 × 2
  status      n
  <chr>   <int>
1 Faculty    62
2 Staff      60
3 Student    51
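count() can also sort its output and count combinations of columns. The sketch below uses a small made-up data frame, `demo`, rather than the lesson data, so the numbers are illustrative only.

```r
library(dplyr)

# Made-up registration data for illustration
demo <- data.frame(
  affiliation = c("Nursing", "Nursing", "Medicine", "Nursing", "Medicine", "Pharmacy"),
  status      = c("Staff", "Student", "Staff", "Staff", "Faculty", "Student")
)

# sort = TRUE puts the most common group first
by_affiliation <- count(demo, affiliation, sort = TRUE)
by_affiliation

# Listing two columns counts each combination that occurs in the data
by_both <- count(demo, affiliation, status)
by_both
```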
Wrapping up
We might want to work with this data again later, so let’s be sure to save it. We can write it back out to a CSV file, or we can save it directly as an R object. Let’s try both ways and save these to our data_output folder.
workshop_breakdown <- all_workshops %>%
  count(affiliation, status)
saveRDS(all_workshops, "data_output/all_workshops.RDS")
saveRDS(workshop_breakdown, "data_output/workshop_breakdown.RDS")
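For the CSV route mentioned above, readr’s write_csv() is one option. The sketch below is an assumption about how that might look, not the lesson’s own code; it uses a made-up data frame and a temporary folder so that it runs anywhere, and it also reads the RDS file back in with readRDS() to show the round trip.

```r
library(readr)

# Temporary output folder and made-up data, for illustration only
out_dir <- file.path(tempdir(), "data_output_demo")
dir.create(out_dir, showWarnings = FALSE)
demo <- data.frame(attendees = c("A", "B"), status = c("Staff", "Student"))

# Option 1: plain-text CSV, readable by other programs
csv_path <- file.path(out_dir, "demo.csv")
write_csv(demo, csv_path)

# Option 2: RDS, which stores the R object itself, including column types
rds_path <- file.path(out_dir, "demo.RDS")
saveRDS(demo, rds_path)

# Reading the RDS file back gives the object as it was saved
restored <- readRDS(rds_path)
```

CSV is the better choice for sharing with people who do not use R, while RDS is convenient when the next step is another R script.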