Chapter 7 The Tidyverse
You can skip this chapter if:
- You are comfortable using the tidyverse pivot_longer and pivot_wider commands
- You can rename variables
- You can create and change variables in a dataset.
7.1 Opinionated Packages
Throughout this book I’ve been teaching you the tidyverse way of doing things. There’s quite a lot of debate as to whether the tidyverse is the easy or the hard way to learn things. A lot of people think that the tidyverse is more difficult because it sometimes generates more lines of code. However, I really like the way that tidyverse code is easily guessed. If you know you want to change something, you can take a guess at what verb you want to use.
This is because the tidyverse is ‘opinionated’. That means there’s an underlying philosophy behind how each package tries to think about data. I like the underlying theory, and I also like that the packages are explicit about the fact that data science itself comes with its own philosophies.
One of the most important of these philosophies, as everyone online says, is that tidy data has one observation per row.
There are a few things that the tidyverse makes really easy:
- Visualising data with ggplot2
- Making new variables or changing variables with mutate
- ‘Pivoting’ data into tall and wide formats with the pivot functions (pivot_longer and pivot_wider)
We will cover these commands in this chapter.
7.2 Data for this chapter
For this chapter let’s work on an example of student satisfaction data. We’ll use a short, fictional example to avoid embarrassing anyone. Let’s say I questioned my students on two courses: Professional Skills, an undergraduate course, and Research Methods, a postgraduate course. I know how many students (n) were in each class, and I asked each class if they agreed with two statements, “Jill was a good teacher” (good_teacher) and “I learned in this class” (learned). I know what percentage of students disagreed with the statement (disagree), were neutral about the statement (neutral), or agreed with the statement (agree). And I also know which of the two years I asked the question in (year).
Let’s load the data and the tidyverse package first:
library(tidyverse)
# Responses are stored as percentages (0-100), as in the printed output later in the chapter
students <- tibble(
  course   = c("Prof Skills", "Prof Skills", "Prof Skills", "Prof Skills",
               "Research Methods", "Research Methods", "Research Methods", "Research Methods"),
  level    = c("UG", "UG", "UG", "UG",
               "PG", "PG", "PG", "PG"),
  question = c("good_teacher", "learned", "good_teacher", "learned",
               "good_teacher", "learned", "good_teacher", "learned"),
  year     = c(1, 1, 2, 2,
               1, 1, 2, 2),
  disagree = c(80, 30, 80, 20, 70, 50, 60, 30),
  neutral  = c(5, 40, 10, 30, 10, 40, 20, 30),
  agree    = c(15, 30, 10, 50, 20, 10, 20, 40),
  n        = c(121, 121, 140, 140, 50, 50, 57, 57)
)
7.3 Mutating data
We have covered the mutate function in previous chapters, but I’m going to specifically cover a few different forms of it now.
In this section I’m going to create a new dataset, students_tidy, which will leave our original dataset students untouched. This is to demonstrate how much data can be transformed, and you might want to think about the difference between the original dataset and the finished product when you’re thinking about workflows.
7.3.1 Mutate to change a variable type.
Let’s start with an example you’ve seen before. At the moment, year is a numerical variable, which we can prove:
## [1] TRUE
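The call that produced this output is not shown. A minimal sketch of such a check, assuming base R's is.numeric() was used; the tibble students_mini is a hypothetical cut-down stand-in for students:

```r
library(tibble)

# Hypothetical stand-in for the students data: only the year column matters here
students_mini <- tibble(year = c(1, 1, 2, 2))

# is.numeric() returns TRUE when a vector is stored as numbers
year_is_numeric <- is.numeric(students_mini$year)
year_is_numeric
```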
So the first thing we want to do is make year a categorical variable, since there are only two years available to us. We can retain the order of the levels by specifying them with the parse_factor command. parse_factor is really useful, but it only works on character variables, so we need to first change year to a character, and then to a factor.
students_tidy <- students |>
  mutate(year = as.character(year),
         year = parse_factor(year, levels = c("1", "2")))
You can try taking out the year = as.character(year) line to see what happens. What error messages do you get?
And now we can ask:
## [1] FALSE
## [1] TRUE
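The calls behind this output are not shown. Assuming they were is.numeric() followed by is.factor(), a sketch of the whole step on a hypothetical cut-down tibble (students_mini and students_mini_tidy are illustrative names) might look like this:

```r
library(tibble)
library(dplyr)
library(readr)

# Hypothetical stand-in for the students data
students_mini <- tibble(year = c(1, 1, 2, 2))

# Same two-step conversion as in the chapter: number -> character -> factor
students_mini_tidy <- students_mini |>
  mutate(year = as.character(year),
         year = parse_factor(year, levels = c("1", "2")))

is.numeric(students_mini_tidy$year)  # year is no longer stored as a number
is.factor(students_mini_tidy$year)   # year is now a factor
levels(students_mini_tidy$year)      # levels are kept in the order we gave
```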
7.3.2 Mutate to change the contents of data
What if we don’t want to change the data type, but instead change the text of the data? There’s a very cool function called case_when which works like an if statement in Excel.
students_tidy <- students_tidy |>
  mutate(level = case_when(level == "UG" ~ "Undergraduate",
                           level == "PG" ~ "Postgraduate"))
In this code chunk we:
- Create the object students_tidy (which we are overwriting, since it already exists)
- Make the new students_tidy object from the old one, and then …
- Change a variable within students_tidy (mutate)
- Create a new variable level (which we are overwriting, since it already exists)
- When a row of level reads UG, change it to (~) Undergraduate
- When a row of level reads PG, change it to (~) Postgraduate
And we can check to see if this worked by looking at a slice of the data.
## # A tibble: 6 × 8
##   course           level         question     year  disagree neutral agree     n
##   <chr>            <chr>         <chr>        <fct>    <dbl>   <dbl> <dbl> <dbl>
## 1 Prof Skills      Undergraduate good_teacher 1           80       5    15   121
## 2 Prof Skills      Undergraduate learned      1           30      40    30   121
## 3 Prof Skills      Undergraduate good_teacher 2           80      10    10   140
## 4 Prof Skills      Undergraduate learned      2           20      30    50   140
## 5 Research Methods Postgraduate  good_teacher 1           70      10    20    50
## 6 Research Methods Postgraduate  learned      1           50      40    10    50
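One thing to watch with case_when is what happens to values that match none of the conditions: they become NA. The sketch below uses a made-up level code, "PGR", which is not in this chapter's data, to show that behaviour. It also shows one common way around it, a final catch-all condition (TRUE ~ level) that keeps unmatched values as they were.

```r
library(tibble)
library(dplyr)

# Hypothetical data: "PGR" is an invented code that neither condition matches
levels_mini <- tibble(level = c("UG", "PG", "PGR"))

recoded <- levels_mini |>
  mutate(
    # Without a catch-all, unmatched values become NA
    strict = case_when(level == "UG" ~ "Undergraduate",
                       level == "PG" ~ "Postgraduate"),
    # TRUE matches everything left over, so the original value is kept
    lenient = case_when(level == "UG" ~ "Undergraduate",
                        level == "PG" ~ "Postgraduate",
                        TRUE ~ level)
  )

recoded
```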
7.3.3 Mutate to change multiple variables
We can also change multiple variables using the mutate_at function. This can be a little more difficult to master, but is often faster than typing out multiple lines of mutate.
Our disagree, neutral and agree columns are currently expressed as percentages, e.g. row 1 above had 80% of students disagreeing, 5% of students neutral, and 15% of students agreeing with the statement that I was a good teacher. However, we know the number of students in each class, so it might be better to express those values as a proportion (e.g. 0.8, 0.05, 0.15). That’s a simple calculation: we need to take each value and divide by 100.
To do this, we need to use two particularly cool things about the tidyverse: the ability to select multiple variables, and the ability to use a dot (.) to mean ‘whatever I just asked for’.
students_tidy <- students_tidy |>
  mutate_at(.vars = vars(c(disagree, neutral, agree)),
            .funs = ~(. / 100))
In this code chunk we:
- Create the object students_tidy (which we are overwriting, since it already exists)
- Make the new students_tidy object from the old one, and then …
- Change more than one variable within students_tidy (mutate_at)
- Specify what variables we want to change (.vars = vars)
- List those variables, which are a vector of names (c(disagree, neutral, agree))
- Specify the function we want to apply to each of the previously selected variables (.funs =)
- We’re not asking for a named function, so we show this with ~
- We want to divide the previously selected variables by 100 ((. / 100), where . is a placeholder standing in for each of the previously selected variables)
And as always, we can test this by showing a slice of the data:
## # A tibble: 6 × 8
##   course           level         question     year  disagree neutral agree     n
##   <chr>            <chr>         <chr>        <fct>    <dbl>   <dbl> <dbl> <dbl>
## 1 Prof Skills      Undergraduate good_teacher 1          0.8    0.05  0.15   121
## 2 Prof Skills      Undergraduate learned      1          0.3     0.4   0.3   121
## 3 Prof Skills      Undergraduate good_teacher 2          0.8     0.1   0.1   140
## 4 Prof Skills      Undergraduate learned      2          0.2     0.3   0.5   140
## 5 Research Methods Postgraduate  good_teacher 1          0.7     0.1   0.2    50
## 6 Research Methods Postgraduate  learned      1          0.5     0.4   0.1    50
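If you are using dplyr 1.0.0 or later, the documentation marks mutate_at as superseded in favour of across() used inside an ordinary mutate. The older function still works, but as a hedged sketch, the same divide-by-100 step could be written as below. The tibble pct_mini is a hypothetical cut-down version of the data, not the chapter's students_tidy.

```r
library(tibble)
library(dplyr)

# Hypothetical cut-down data, with responses stored as percentages
pct_mini <- tibble(disagree = c(80, 30),
                   neutral  = c(5, 40),
                   agree    = c(15, 30),
                   n        = c(121, 121))

# across() picks the columns; .x stands in for each selected column in turn
prop_mini <- pct_mini |>
  mutate(across(c(disagree, neutral, agree), ~ .x / 100))

prop_mini
```

The column n is left alone because it is not named inside across().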
7.3.4 Summarise as a unique form of mutate
Mutate is a really powerful thing, so unsurprisingly the idea behind it is used in other functions. One that’s really useful to know about (and that we’ll talk more about in descriptive statistics) is summarise.
summarise creates a new data frame with one row per group, so it’s good for grouping things together. For example, we can use it to look at the average proportion for each response by question:
students_summed <- students_tidy |>
  group_by(question) |>
  summarise(mean_disagree = mean(disagree),
            mean_neutral = mean(neutral),
            mean_agree = mean(agree))
And in fact, if we just want to look at this data quickly, we don’t even need to create a new dataset; we can just look at the output in the console:
students_tidy |>
  group_by(question) |>
  summarise(mean_disagree = mean(disagree),
            mean_neutral = mean(neutral),
            mean_agree = mean(agree))
## # A tibble: 2 × 4
##   question     mean_disagree mean_neutral mean_agree
##   <chr>                <dbl>        <dbl>      <dbl>
## 1 good_teacher         0.725        0.112      0.162
## 2 learned              0.325        0.35       0.325
You’ll note that we lose all the other variables (like course, level and n) doing this, so you might want to be careful if you’re using summarise to make a new dataset.
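If you do want to keep some of that information, one option is to add the variable to group_by, since grouping variables survive summarise. A minimal sketch on a hypothetical cut-down tibble (resp_mini and by_course are illustrative names, not part of the chapter's data):

```r
library(tibble)
library(dplyr)

# Hypothetical cut-down data: one question, two courses, two years
resp_mini <- tibble(
  course   = c("Prof Skills", "Prof Skills", "Research Methods", "Research Methods"),
  question = c("good_teacher", "good_teacher", "good_teacher", "good_teacher"),
  year     = c(1, 2, 1, 2),
  disagree = c(0.8, 0.8, 0.7, 0.6)
)

# Grouping by course as well as question keeps course in the summary
by_course <- resp_mini |>
  group_by(course, question) |>
  summarise(mean_disagree = mean(disagree), .groups = "drop")

by_course
```

Here .groups = "drop" simply returns an ungrouped tibble; year is still lost because it was neither grouped on nor summarised.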
7.4 Wide and tall data
If you are looking at older materials they may use the terms gather instead of pivot_longer and spread instead of pivot_wider. In fact you can find a very similar version of the below text on my github page.
This is a notable point about R: it is a language that is being actively used, and it changes as people use it. The idea is that pivot_longer is a more informative verb than gather, and so we should try to use that instead. At the moment both commands still work, but this may change in the years to come.
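To make the old and new verbs concrete, here is a small side-by-side sketch. It assumes a tidyr version that still ships gather alongside pivot_longer, and it uses a hypothetical tibble, wide_mini, rather than the chapter's data. The two functions order their rows differently, so both results are sorted before comparing.

```r
library(tibble)
library(dplyr)
library(tidyr)

# Hypothetical wide data
wide_mini <- tibble(id       = c(1, 2),
                    disagree = c(0.8, 0.3),
                    agree    = c(0.2, 0.7))

# Older style: name the key and value columns, then list the columns to collect
old_way <- wide_mini |>
  gather(key = "response", value = "prop", disagree, agree) |>
  arrange(id, response)

# Newer style: the same reshaping with more descriptive argument names
new_way <- wide_mini |>
  pivot_longer(cols = c(disagree, agree),
               names_to = "response",
               values_to = "prop") |>
  arrange(id, response)

# After sorting, the two results hold the same values
identical(old_way$prop, new_way$prop)
```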
7.4.1 Processing data
Let’s look at students_tidy again.
## # A tibble: 6 × 8
##   course           level         question     year  disagree neutral agree     n
##   <chr>            <chr>         <chr>        <fct>    <dbl>   <dbl> <dbl> <dbl>
## 1 Prof Skills      Undergraduate good_teacher 1          0.8    0.05  0.15   121
## 2 Prof Skills      Undergraduate learned      1          0.3     0.4   0.3   121
## 3 Prof Skills      Undergraduate good_teacher 2          0.8     0.1   0.1   140
## 4 Prof Skills      Undergraduate learned      2          0.2     0.3   0.5   140
## 5 Research Methods Postgraduate  good_teacher 1          0.7     0.1   0.2    50
## 6 Research Methods Postgraduate  learned      1          0.5     0.4   0.1    50
At first glance, this looks tidy. The data is presented with each course on a row - surely I’m observing at the course level?
Well, actually, I probably often want to know what proportion of students agreed (or not) with each statement in each course. The observation in this case is actually the proportion of students, with response, question, course, level, and year all being extra pieces of information I know about that proportion.
I want much taller data.
(I’m using this specific example not because it’s a particularly easy example, but because this is a format you’ll see for data in the real world all the time, and people will make big decisions on this data. It’s a good idea to show you how to tidy it.)
7.4.2 pivot_longer
The pivot_longer command is a quick way to smush this data into a tall (or long) format. It creates two new columns: the names_to column, which collects your old column names, and the values_to column, which collects the values that were stored under those names.
students_tall <- students_tidy |>
  pivot_longer(cols = c(disagree, neutral, agree),
               names_to = "response",
               values_to = "prop")
In the above code block we:
- Create a new dataset called students_tall
- students_tall is based on students_tidy
- We want to squish the data into new columns (pivot_longer)
- We specify the columns we want to stretch into two (cols = c(disagree, neutral, agree))
- We specify the name for the new column which will take the value of the old column headers (names_to = "response"; note we have to put quotation marks around the new name, which is not very common in tidyverse)
- We specify the name for the new column which will store the values of the old rows (values_to = "prop")
And of course, we can see what this has done to the data:
## # A tibble: 6 × 7
##   course           level         question     year      n response  prop
##   <chr>            <chr>         <chr>        <fct> <dbl> <chr>    <dbl>
## 1 Prof Skills      Undergraduate good_teacher 1       121 disagree   0.8
## 2 Prof Skills      Undergraduate good_teacher 1       121 neutral   0.05
## 3 Prof Skills      Undergraduate good_teacher 1       121 agree     0.15
## 4 Prof Skills      Undergraduate learned      1       121 disagree   0.3
## 5 Prof Skills      Undergraduate learned      1       121 neutral    0.4
## 6 Prof Skills      Undergraduate learned      1       121 agree      0.3
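A quick way to convince yourself that nothing was lost in the pivot is to count rows: every original row should become three rows, one per response. The sketch below does this on a hypothetical two-row tibble (tidy_mini and tall_mini are illustrative names), and also checks that the proportions for each question still add up to 1.

```r
library(tibble)
library(dplyr)
library(tidyr)

# Hypothetical cut-down data: two questions, three response columns
tidy_mini <- tibble(question = c("good_teacher", "learned"),
                    disagree = c(0.8, 0.3),
                    neutral  = c(0.05, 0.4),
                    agree    = c(0.15, 0.3))

tall_mini <- tidy_mini |>
  pivot_longer(cols = c(disagree, neutral, agree),
               names_to = "response",
               values_to = "prop")

# Three pivoted columns means three times as many rows
nrow(tall_mini) == nrow(tidy_mini) * 3

# Proportions within each question should still sum to 1
tall_mini |>
  group_by(question) |>
  summarise(total = sum(prop))
```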
It’s very important to think about your variable names
I once spent a whole afternoon trying to recreate an error message I was getting with this, when I realised that I was saying names_to = "question". The variable question already exists in the dataset, and so R was re-writing the variable every time it gathered the data. Unique variable names are really helpful!
7.4.3 pivot_wider
What if, after all that, you realise that you never wanted your data gathered at all? pivot_wider is here to rescue you.
Just as before, pivot_wider wants to know the names and the values (names_from and values_from), but this time it will split those two columns into multiple columns. This time we want all that data to be spread out like marmalade on toast, so we don’t exclude any columns (in fact, try excluding the columns and see what pivot_wider says).
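The code for this step is not shown. A minimal sketch, assuming the call simply reverses the earlier pivot with names_from = response and values_from = prop; the objects tall_mini and wide_again are hypothetical and stand in for the chapter's datasets:

```r
library(tibble)
library(tidyr)

# Hypothetical tall data in the same shape as the pivot_longer output
tall_mini <- tibble(
  question = c("good_teacher", "good_teacher", "good_teacher",
               "learned", "learned", "learned"),
  response = rep(c("disagree", "neutral", "agree"), times = 2),
  prop     = c(0.8, 0.05, 0.15, 0.3, 0.4, 0.3)
)

# names_from supplies the new column headers; values_from fills the cells
wide_again <- tall_mini |>
  pivot_wider(names_from = response, values_from = prop)

wide_again
```

Note that, unlike names_to in pivot_longer, names_from refers to a column that already exists, so it does not need quotation marks.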
And of course we can view this:
## # A tibble: 6 × 8
##   course           level         question     year      n disagree neutral agree
##   <chr>            <chr>         <chr>        <fct> <dbl>    <dbl>   <dbl> <dbl>
## 1 Prof Skills      Undergraduate good_teacher 1       121      0.8    0.05  0.15
## 2 Prof Skills      Undergraduate learned      1       121      0.3     0.4   0.3
## 3 Prof Skills      Undergraduate good_teacher 2       140      0.8     0.1   0.1
## 4 Prof Skills      Undergraduate learned      2       140      0.2     0.3   0.5
## 5 Research Methods Postgraduate  good_teacher 1        50      0.7     0.1   0.2
## 6 Research Methods Postgraduate  learned      1        50      0.5     0.4   0.1