library(tidyverse)
1 Introduction
This week we will look at the difference between the wide and long data format. This concept is related to the concept of “tidy” data, from which the tidyverse
takes its name.
As a review, with tidy data:
- Each column is a variable
- Each row is an observation
- Each cell is a value.
The tidyverse
packages are so called because they provide a set of tools for both producing and working with tidy data. As Figure 1 illustrates, tidy data is preferable because it is generally more predictable to work with.
ggplot2
generally requires your data to be tidy in order produce plots. In this lesson, we’ll be using the tidyr
package to make our data tidy.
2 The Data
This week we are looking at shelving statistics. We have a small data set which tracks numbers of materials shelved by type over four months.
Let’s open up a new script and load the tidyverse
.
And now let’s load the data:
<- read_csv("data/shelving_wide.csv") shelving_wide
This is a fairly small data set of just 8 rows and 6 columns.
month | shelver | stacks_books | reference_books | bound_journals | unbound_journals |
---|---|---|---|---|---|
1 | A | 0 | 0 | 337 | 0 |
1 | B | 81 | 12 | 0 | 0 |
2 | A | 0 | 0 | 325 | 2 |
2 | B | 62 | 13 | 0 | 0 |
3 | A | 0 | 8 | 258 | 0 |
3 | B | 138 | 8 | 5 | 0 |
4 | A | 0 | 0 | 72 | 0 |
4 | B | 70 | 12 | 0 | 0 |
Our ultimate goal is to be able to summarize and represent this data in a bar plot.
Let’s think about some ways we might visualize this data:
- Number of items shelved by material type
- Number of items shelved by month
- Number of items shelved by type and month
- Number of items shelved by shelver (also broken down by type, month)
Given what we learned about ggplot2
in our last lesson, our code should look something like this.
shelving_wide %>%
ggplot(mapping=aes(x=???, y=???)) +
geom_bar(stat="identity")
It’s challenging to name a column for x and a column for y, because this data is in what’s called a wide format, which we’ll describe more in the next section.
So our goal this week is to:
- Pivot data from the wide to long format
- Summarize data in a bar plot.
3 Wide vs Long data
A wide data set is one in which a variable is distributed across several rows. In our shelving_wide
data frame, we have four columns: stacks_books
, reference_books
, bound_journals
, and unbound_journals
. These all represent the type of material being shelved. So, we could say that those column names represent values of the variable material type
. Another variable of interest is the count
of items being shelved, which is being stored in the cells of those four material type
columns. Other variables in this data frame are month
and shelver
.
Data in this format violates Rule 1 of tidy data, because each column does not represent a single variable, instead we have the variable type of material spread out over 4 columns.
To make our data set tidy we would want to reduce those four columns to two columns, one for material type, and one for the count of items. By reducing the number of columns, we must also increase the number of rows. So, we are lengthening our data set to make it tidy. The final result will look like:
month | shelver | material_type | number_shelved |
---|---|---|---|
1 | A | stacks_books | 0 |
1 | A | reference_books | 0 |
1 | A | bound_journals | 337 |
1 | A | unbound_journals | 0 |
1 | B | stacks_books | 81 |
1 | B | reference_books | 12 |
1 | B | bound_journals | 0 |
1 | B | unbound_journals | 0 |
2 | A | stacks_books | 0 |
2 | A | reference_books | 0 |
2 | A | bound_journals | 325 |
2 | A | unbound_journals | 2 |
2 | B | stacks_books | 62 |
2 | B | reference_books | 13 |
2 | B | bound_journals | 0 |
2 | B | unbound_journals | 0 |
3 | A | stacks_books | 0 |
3 | A | reference_books | 8 |
3 | A | bound_journals | 258 |
3 | A | unbound_journals | 0 |
3 | B | stacks_books | 138 |
3 | B | reference_books | 8 |
3 | B | bound_journals | 5 |
3 | B | unbound_journals | 0 |
4 | A | stacks_books | 0 |
4 | A | reference_books | 0 |
4 | A | bound_journals | 72 |
4 | A | unbound_journals | 0 |
4 | B | stacks_books | 70 |
4 | B | reference_books | 12 |
4 | B | bound_journals | 0 |
4 | B | unbound_journals | 0 |
Wide data is not “bad”, and is very common in real world data sets, because it is easy for humans to track and read data that way. But a tidy, or long, format may make it easier to analyze that data with a computer.
4 pivot_longer()
To lengthen our data, we’ll use the pivot_longer()
function from the tidyr
package. There are four arguments we need to provide:
data
- the data frame to lengthencols
- the columns we want to pivot onnames_to
- the name of a new column which will have our old column names as valuesvalues_to
- the name of a new column which will hold the cell values of the pivoted columns
<-
shelving_long %>%
shelving_wide pivot_longer(cols = stacks_books:unbound_journals,
names_to = "material_type",
values_to = "count")
Notice that our new “long” data frame has 32 observations and 4 variables.
5 Plotting the Data
Now it should be easier to create our bar plot.
%>%
shelving_long ggplot(mapping=aes(x=material_type, y=count)) +
geom_bar(stat="identity") +
labs(title="Number of items shelved by material type",
x="Material Type",
y="Number of items shelved")
6 Challenge
Using what you learned last week, how could you create a plot of the number of items shelved each month by type as shown below. Open the code box to see a solution.
Reveal Solution
%>%
shelving_long ggplot(mapping=aes(x=month, y=count, fill=material_type)) +
geom_bar(stat="identity", position = "dodge") +
labs(title="Number of items shelved by material type",
x="Month",
y="Number of items shelved",
fill="Material Type") +
scale_fill_viridis_d(labels=c('Bound Journals', 'Reference Books', 'Stacks Books', 'Unbound Journals'))
How might we plot all four variables: month
, shelver
, material_type
, and count
? (Hint: check out the ggplot2
documentation on faceting)
Reveal Solution
%>%
shelving_long ggplot(mapping=aes(x=month, y=count, fill=material_type)) +
geom_bar(stat="identity", position = "dodge") +
labs(title="Number of items shelved each month",
subtitle= "By material type and shelver",
x="Month",
y="Number of items shelved",
fill="Material Type") +
scale_fill_viridis_d(labels=c('Bound Journals', 'Reference Books', 'Stacks Books', 'Unbound Journals')) +
facet_wrap(~shelver)
7 Wrapping Up
Use ggsave()
to save the plot you created in Section 5 as a .jpg file.
8 Further Reading
There’s much more that can be done with pivot_longer()
and it’s companion function pivot_wider
and with the tidyr
package in general. Check out the following resources for more information:
Footnotes
“Illustrations from the Openscapes blog Tidy Data for reproducibility, efficiency, and collaboration by Julia Lowndes and Allison Horst”↩︎