NNLM + CDABS R Community of Practice - R Community of Practice Week 1

Introduction

In introductory R classes we learn how to import data by reading one file at a time. This week, we will “level up” our R skills by learning how to import multiple files at once. This is especially useful when you have several files with a similar structure and want to combine and analyze them together. To accomplish this goal, we’ll work with a new data structure called a list and learn how to iterate over that list.

The Data

Download the data for this lesson here.

Our scenario this week is that we have a folder of registration data for workshops that we want to combine and analyze. We have a separate spreadsheet for each workshop with a list of attendees, their affiliations, and status. This data has been generated and does not contain any names or information of actual workshop participants, but is based on the structure of registration data we use here at UMB.

Let’s load the packages we’ll be working with today.

library(tidyverse)
library(fs)

Review - Importing data

To import one csv file, we can use the function read_csv() from the readr package, part of the tidyverse. You can also use the Import Dataset widget in the RStudio GUI.

Workshop01 <- read_csv("data/Workshop_01.csv")

The Power of Iteration

To read in multiple files, we could copy and paste that code a bunch of times with different names each time.

Workshop02 <- read_csv("data/Workshop_02.csv")

Workshop03 <- read_csv("data/Workshop_03.csv")

Workshop04 <- read_csv("data/Workshop_04.csv")

But! That’s a really good way to make mistakes, and part of the benefit of learning to code is to avoid the need to repeat things. Most programming languages, including R, have ways of iterating, or repeating, an operation over multiple objects based on some instructions. Often this iteration takes place in some kind of loop operation, e.g.

for (thing in list_of_things) {
  # pseudocode: apply some function to the current element
  do_some_function(thing)
}

Loops like this work in R, but it is more common to use a function that handles the iteration for you, such as the base R apply family or the map() family from the purrr package.
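To make the loop pattern concrete, here is a minimal base R sketch. The vector `words` and the use of nchar() as the repeated function are illustrative assumptions, not part of the lesson data.

```r
# Toy input standing in for list_of_things (hypothetical values)
words <- c("alpha", "beta", "gamma")

# Pre-allocate a list with one slot per element
lengths_list <- vector("list", length(words))

for (i in seq_along(words)) {
  # nchar() plays the role of do_some_function()
  lengths_list[[i]] <- nchar(words[i])
}

lengths_list
```

Note how much bookkeeping the loop needs: an index, a pre-allocated container, and an assignment on every pass.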

We’ll use a function called map() from the purrr package (part of the tidyverse). It takes a vector or list (.x) and a function (.f), applies the function to each element of .x, and returns the results as a list. So the syntax of this function is map(.x, .f).
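As a quick illustration of the map(.x, .f) syntax, the sketch below applies sqrt() to a small numeric vector. The vector `numbers` is made up for this example; map_dbl() is a purrr variant that returns a numeric vector instead of a list.

```r
library(purrr)

numbers <- c(1, 4, 9)   # hypothetical input for .x

# map() always returns a list, one element per input element
roots <- map(numbers, sqrt)

# map_dbl() does the same work but returns a plain numeric vector
roots_vec <- map_dbl(numbers, sqrt)

roots
roots_vec
```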

We already have our function, read_csv(). Now we want to repeat it for every file name in our directory. We can accomplish this in four steps:

1. Create a vector of file names.
2. Use map() to run read_csv() on each file name.
3. Combine files into one data frame.
4. Summarize the data frame.

Step 1: Create a vector of file names

First, you’ll want to have all the files you need saved in the same directory. Then we can use the base R function list.files() on that directory. We’ll save the result as an object so we can use it later.

files <- list.files("data/", full.names = TRUE)

files

 [1] "data//Workshop_01.csv" "data//Workshop_02.csv" "data//Workshop_03.csv"
 [4] "data//Workshop_04.csv" "data//Workshop_05.csv" "data//Workshop_06.csv"
 [7] "data//Workshop_07.csv" "data//Workshop_08.csv" "data//Workshop_09.csv"
[10] "data//Workshop_10.csv"
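If a folder might also hold files that are not CSVs, the `pattern` argument of list.files() can restrict the result. The sketch below is an assumption-laden illustration: it builds a scratch folder under tempdir() with made-up file names rather than using the lesson's data folder.

```r
# Hypothetical scratch folder with two CSVs and one unrelated file
demo_dir <- file.path(tempdir(), "list_files_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("Workshop_01.csv", "Workshop_02.csv", "notes.txt")))

# pattern is a regular expression: keep only names ending in ".csv"
csv_files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

basename(csv_files)
```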

Step 2: Iterate over file names

Now we have both arguments for map(): our vector of file names, files, stands in for .x, and read_csv stands in for .f.

workshop_list <- map(files, read_csv)
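To see what this step produces without needing the lesson files, here is a self-contained sketch that writes two tiny CSVs to a temporary folder and reads them back with map(). The folder name, file contents, and the `show_col_types = FALSE` argument (used only to quiet readr's column messages) are assumptions for illustration.

```r
library(readr)
library(purrr)

# Hypothetical scratch folder and toy files
demo_dir <- file.path(tempdir(), "map_read_demo")
dir.create(demo_dir, showWarnings = FALSE)

write_csv(data.frame(attendees = c("A", "B"), status = c("Staff", "Student")),
          file.path(demo_dir, "Workshop_01.csv"))
write_csv(data.frame(attendees = "C", status = "Faculty"),
          file.path(demo_dir, "Workshop_02.csv"))

demo_files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Wrapping read_csv() in a function lets us pass extra arguments
demo_list <- map(demo_files, function(f) read_csv(f, show_col_types = FALSE))

length(demo_list)      # one element per file
class(demo_list)       # the container is a list
nrow(demo_list[[1]])   # each element is a data frame
```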

Step 3: Combine into one data frame

Working with lists

After running our last step, the resulting object is in a structure called a list. In most beginning R workshops we work with two main data structures: vectors and data frames.

R Data Structures

      homogeneous   heterogeneous
1d    vector        list
2d    matrix        data frame
nd    array

Like vectors, lists are made up of elements, but those elements can be of any type: numbers, text, vectors, data frames, or even other lists. Often we use lists to work with several data frames at one time.
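A small sketch of a list holding different kinds of elements, and of the difference between single and double brackets when pulling elements out. The object `mixed` and its contents are invented for illustration.

```r
# A list can hold elements of different types (hypothetical contents)
mixed <- list(
  c(1, 2, 3),
  "workshops",
  data.frame(id = 1:2, status = c("Staff", "Student"))
)

length(mixed)       # number of elements in the list

# Single brackets return a smaller list
class(mixed[3])

# Double brackets return the element itself
class(mixed[[3]])
```

Double brackets are what you want when you need the data frame itself rather than a list containing it.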

Adding names to lists

Each element of the list is given an index number. But it will be helpful to us to give the elements names. We do this with the names() function.

names(workshop_list) # currently no names

NULL

Let’s name each element after its file. That way, when we combine the data, we can keep track of which rows came from which data frame. We’ll use the fs package, which contains tools for working with file names and paths.

path_file(files) # keeps only file part of path

 [1] "Workshop_01.csv" "Workshop_02.csv" "Workshop_03.csv" "Workshop_04.csv"
 [5] "Workshop_05.csv" "Workshop_06.csv" "Workshop_07.csv" "Workshop_08.csv"
 [9] "Workshop_09.csv" "Workshop_10.csv"

path_ext_remove(files) # removes the file extension, keeps the rest of the path

 [1] "data/Workshop_01" "data/Workshop_02" "data/Workshop_03" "data/Workshop_04"
 [5] "data/Workshop_05" "data/Workshop_06" "data/Workshop_07" "data/Workshop_08"
 [9] "data/Workshop_09" "data/Workshop_10"

names(workshop_list) <- path_ext_remove(path_file(files))

names(workshop_list)

 [1] "Workshop_01" "Workshop_02" "Workshop_03" "Workshop_04" "Workshop_05"
 [6] "Workshop_06" "Workshop_07" "Workshop_08" "Workshop_09" "Workshop_10"
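An alternative some people prefer is to name the vector of file paths before iterating, because map() carries the names of its input through to its output. The sketch below uses set_names() from purrr on two made-up paths and uses nchar() in place of read_csv() so it runs without any files; the as.character() call is a defensive assumption in case the fs helpers return a path class rather than plain text.

```r
library(purrr)
library(fs)

# Hypothetical file paths; nothing is read from disk here
demo_files <- c("data/Workshop_01.csv", "data/Workshop_02.csv")

# Name each path after its file, without directory or extension
named_files <- set_names(
  demo_files,
  as.character(path_ext_remove(path_file(demo_files)))
)

# map() keeps the names of .x on its result
name_lengths <- map(named_files, nchar)

names(name_lengths)
```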

Binding rows

We’ll use the bind_rows() function from dplyr to stack the elements of our list into one data frame. The .id argument adds a new column that lets us keep track of which row came from which original dataset. It fills that column with the names of the list elements.

all_workshops <- bind_rows(workshop_list, .id = "Workshop") %>% 
  arrange(attendees)



head(all_workshops)

# A tibble: 6 × 4
  Workshop    attendees                affiliation status 
  <chr>       <chr>                    <chr>       <chr>  
1 Workshop_06 Abdul Waahid al-Jaber    Nursing     Faculty
2 Workshop_10 Addison Elkins           Graduate    Staff  
3 Workshop_07 Adriana Picazo           Medicine    Staff  
4 Workshop_05 Alexander Garcia-Marrufo Nursing     Faculty
5 Workshop_09 Alexandra Willis         Social Work Staff  
6 Workshop_06 Alexis Wright            Medicine    Faculty
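To see the effect of .id in isolation, here is a toy sketch with a named list of two small data frames. The names and values are invented and are not taken from the lesson data.

```r
library(dplyr)

# Hypothetical named list of data frames
toy_list <- list(
  Workshop_A = data.frame(attendees = c("Ana", "Ben"), status = c("Staff", "Student")),
  Workshop_B = data.frame(attendees = "Cy", status = "Faculty")
)

# The list names become the values of the new "Workshop" column
toy_combined <- bind_rows(toy_list, .id = "Workshop")

toy_combined
```

If the list had no names, the new column would hold index positions instead, which is why naming the elements first is worthwhile.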

Step 4: Summarize the data

Now we can work with our data in the more familiar structure of a data frame. This allows us to do things like count how many people of each affiliation and each status attended across all workshops combined.

all_workshops %>% 
  count(affiliation)

# A tibble: 6 × 2
  affiliation     n
  <chr>       <int>
1 Dentistry      21
2 Graduate       25
3 Medicine       30
4 Nursing        35
5 Pharmacy       31
6 Social Work    31

all_workshops %>% 
  count(status)

# A tibble: 3 × 2
  status      n
  <chr>   <int>
1 Faculty    62
2 Staff      60
3 Student    51
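count() has a couple of options that are handy for summaries like these. The sketch below, on an invented four-row data frame, shows `sort = TRUE` to order groups by frequency and counting by two columns at once.

```r
library(dplyr)

# Hypothetical registration data
toy <- data.frame(
  affiliation = c("Nursing", "Nursing", "Medicine", "Nursing"),
  status      = c("Staff", "Faculty", "Staff", "Staff")
)

# sort = TRUE puts the most frequent group first
by_affiliation <- count(toy, affiliation, sort = TRUE)

# Several columns give one row per observed combination
by_both <- count(toy, affiliation, status)

by_affiliation
by_both
```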

Wrapping up

We might want to work with this data again later, so let’s be sure to save it. We could write it back out to a CSV file with write_csv(), or save it directly as an R object with saveRDS(). Here we save both the combined data and a breakdown by affiliation and status as RDS files in our data_output folder.

workshop_breakdown <- 
  all_workshops %>% 
  count(affiliation, status)

saveRDS(all_workshops, "data_output/all_workshops.RDS")

saveRDS(workshop_breakdown, "data_output/workshop_breakdown.RDS")
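For comparison, the sketch below saves a toy data frame both ways and reads it back. It writes to a scratch folder under tempdir(), and the object and file names are made up. Note that saveRDS() does not create missing folders, so the output folder has to exist first.

```r
library(readr)

# Hypothetical summary table
toy <- data.frame(affiliation = c("Nursing", "Medicine"), n = c(3L, 1L))

# Create the output folder before writing to it
out_dir <- file.path(tempdir(), "data_output_demo")
dir.create(out_dir, showWarnings = FALSE)

saveRDS(toy, file.path(out_dir, "toy.RDS"))     # R-specific, keeps column types
write_csv(toy, file.path(out_dir, "toy.csv"))   # plain text, readable anywhere

from_rds <- readRDS(file.path(out_dir, "toy.RDS"))
from_csv <- read_csv(file.path(out_dir, "toy.csv"), show_col_types = FALSE)

class(from_rds$n)   # type stored in the RDS file
class(from_csv$n)   # type guessed again by read_csv()
```

An RDS file is the better choice when the data will only be reopened in R; a CSV is the better choice when it needs to be shared or opened in other software.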