Chapter 7 The Tidyverse

You can skip this chapter if:

  • You are comfortable using the tidyverse pivot_longer and pivot_wider commands

  • You can rename variables

  • You can create and change variables in a dataset.

7.1 Opinionated Packages

Throughout this book I’ve been teaching you the tidyverse way of doing things. There’s quite a lot of debate as to whether tidyverse is the easy or hard way to learn things. A lot of people think that tidyverse is more difficult because it sometimes generates more lines of code. However, I really like the way that tidyverse code is easily guessed. If you know you want to change something, you can take a guess at what verb you want to use.

This is because the tidyverse is ‘opinionated’. That means there’s an underlying philosophy behind how each package tries to think about data. I like the underlying theory, and I also like that the packages are explicit about the fact that data science itself comes with its own philosophies.

One of the most important philosophies, as everyone online says - tidy data has one observation per row.

There are a few things that tidyverse makes really easy:

  • Visualising data with ggplot2

  • Making new variables or changing variables with mutate

  • ‘Pivoting’ data into tall and wide formats with pivot

We will cover these commands in this chapter.

7.2 Data for this chapter

For this chapter let’s work on an example of student satisfaction data. We’ll use a short, fictional example to avoid embarrassing anyone. Let’s say I questioned my students on two courses, Professional Skills, an undergraduate course, and Research Methods, a postgraduate course. I know how many students (n) were in each class, and I asked each class if they agreed with two statements, “Jill was a good teacher” (good_teacher) and “I learned in this class” (learned). I know what percentage of students disagreed with the statement (disagree), were neutral about the statement (neutral), or agreed with the statement (agree). And I also know which of the two years I asked the question in (year).

Let’s load the data and the tidyverse package first:

library(tidyverse)

students <- tibble (course = c("Prof Skills", "Prof Skills", "Prof Skills", "Prof Skills", 
                               "Research Methods", "Research Methods", "Research Methods", "Research Methods"),
                    level = c("UG", "UG", "UG", "UG",
                              "PG", "PG", "PG", "PG"),
                    question = c("good_teacher", "learned","good_teacher", "learned",
                                 "good_teacher", "learned","good_teacher", "learned"),
                    year = c(1, 1, 2, 2,
                             1, 1, 2, 2),
                    disagree = c(0.8, 0.3, 0.8, 0.2, 0.7, 0.5, 0.6, 0.3),
                    neutral = c(0.05, 0.4, 0.1, 0.3, 0.1, 0.4, 0.2, 0.3),
                    agree = c(0.15, 0.3, 0.1, 0.5, 0.2, 0.1, 0.2, 0.4),
                    n = c(121, 121, 140, 140, 50, 50, 57,57))  

7.3 Mutating data

We have covered the mutate function in previous chapters, but I’m going to specifically cover a few different forms of it now.

In this section I’m going to create a new dataset students_tidy which will leave our original dataset students untouched. This is to demonstrate how much data can be transformed, and you might want to think about the difference between the original dataset and the finished product when you’re thinking about workflows.

7.3.1 Mutate to change a variable type.

Let’s start with an example you’ve seen before. At the moment, year is a numerical variable, which we can prove:

is.numeric(students$year)
## [1] TRUE

So the first thing we want to do is make year a categorical variable, since there’s only two years available to us. We can retain the order of the levels by specifying them with the parse_factor command. parse_factor is really useful, but it only works on character variables, so we need to first change year to a character, and then to a factor.

students_tidy <- students  |>  
  mutate(year = as.character(year),
         year = parse_factor(year, levels = c("1", "2")))

You can try taking out the year = as.character(year) line to see what happens. What error messages do you get?

And now we can ask:

is.factor(students$year)
## [1] FALSE
is.factor(students_tidy$year)
## [1] TRUE

7.3.2 Mutate to change the contents of data

What if we don’t want to change data type, but instead change the text of the data? There’s a very cool function called case_when which works like an if statement in Excel.

students_tidy <- students_tidy |> 
  mutate(level = case_when(level == "UG" ~ "Undergraduate",
                           level == "PG" ~ "Postgraduate"))

In this code chunk we:

  1. Create the object students_tidy (which we are overwriting, since it already exists)
  2. Make the new students_tidy object from the old one, and then …
  3. Change a variable within students_tidy (mutate)
  4. Create a new variable level (which we are overwriting, since it already exists)
  5. When a row of level reads UG, change it to (~) Undergraduate
  6. When a row of level reads PG, change it to (~) Postgraduate

And we can check to see if this work by looking at a slice of the data.

head(students_tidy)
## # A tibble: 6 × 8
##   course           level         question     year  disagree neutral agree     n
##   <chr>            <chr>         <chr>        <fct>    <dbl>   <dbl> <dbl> <dbl>
## 1 Prof Skills      Undergraduate good_teacher 1           80       5    15   121
## 2 Prof Skills      Undergraduate learned      1           30      40    30   121
## 3 Prof Skills      Undergraduate good_teacher 2           80      10    10   140
## 4 Prof Skills      Undergraduate learned      2           20      30    50   140
## 5 Research Methods Postgraduate  good_teacher 1           70      10    20    50
## 6 Research Methods Postgraduate  learned      1           50      40    10    50

7.3.3 Mutate to change multiple variables

We can also change multiple variables using the mutate_at function. This can be a little more difficult to master, but is often faster than typing out multiple lines of mutate.

Our disagree, neutral and agree columns are currently expressed as percentages, e.g. row 1 above had 80% of students disagreeing, 5% of students neutral, and 15% of students agreeing with the statement I was a good teacher. However, we know the number of students in each class, so it might be better to express those values as a proportion (e.g. 0.8, 0.05, 0.15). That’s a simple calculation - we need to take each value and divide by 100.

To do this, we need to use two particularly cool things about tidyverse, the ability to select multiple variables, and the ability to use . to mean whatever I just asked for.

students_tidy <- students_tidy |> 
  mutate_at(.vars = vars(c(disagree, neutral, agree)),
            .funs = ~(.  / 100))

In this code chunk we:

  1. Create the object students_tidy (which we are overwriting, since it already exists)
  2. Make the new students_tidy object from the old one, and then …
  3. Change more than one variable within students_tidy (mutate_at)
  4. Specify what variables we want to change (.vars = vars)
  5. List those variables, which are a string of names (c(disagree, neutral, agree))
  6. Specify the function we want to apply to each of the previously selected variables (.funs =)
  7. We’re not asking for a named function so we show this with ~
  8. We want to divide the previously asked for variables by 100 ((./100), where . is a dummy variable standing in for the previously selected variables. )

And as always, we can test this by showing a slice of the data:

head(students_tidy)
## # A tibble: 6 × 8
##   course           level         question     year  disagree neutral agree     n
##   <chr>            <chr>         <chr>        <fct>    <dbl>   <dbl> <dbl> <dbl>
## 1 Prof Skills      Undergraduate good_teacher 1          0.8    0.05  0.15   121
## 2 Prof Skills      Undergraduate learned      1          0.3    0.4   0.3    121
## 3 Prof Skills      Undergraduate good_teacher 2          0.8    0.1   0.1    140
## 4 Prof Skills      Undergraduate learned      2          0.2    0.3   0.5    140
## 5 Research Methods Postgraduate  good_teacher 1          0.7    0.1   0.2     50
## 6 Research Methods Postgraduate  learned      1          0.5    0.4   0.1     50

7.3.4 Summarise as a unique form of mutate

Mutate is really powerful thing, so unsurprisingly the idea behind it is used in other calls. One that’s really useful to know about (and that we’ll talk more about in descriptive statistics) is summarise.

summarise creates a new mutated data frame by default, so its good for grouping together things, for example we can use it to look at the average percentage in each group by question:

students_summed <- students_tidy |> 
  group_by(question) |> 
  summarise(mean_disagree = mean(disagree),
            mean_neutral= mean(neutral),
            mean_agree = mean(agree))

And in fact, if we just want to look at this data quickly, we don’t even need to create a new dataset, we can just look at the output in the console:

students_tidy |> 
  group_by(question) |> 
  summarise(mean_disagree = mean(disagree),
            mean_neutral= mean(neutral),
            mean_agree = mean(agree))
## # A tibble: 2 × 4
##   question     mean_disagree mean_neutral mean_agree
##   <chr>                <dbl>        <dbl>      <dbl>
## 1 good_teacher         0.725        0.112      0.162
## 2 learned              0.325        0.35       0.325

You’ll note that we lose all the other variables (like course, level and n) doing this, so you might want to be careful if you’re using summarise to make a new dataset.

7.4 Wide and tall data

If you are looking at older materials they may use the terms gather instead of pivot_longer and spread instead of pivot_wider. In fact you can find a very similar version of the below text on my github page.

This is a notable point about R - it is a language that is being actively used and changes as people use it. The idea is that pivot_longer is a more informative verb than gather, and so we should try to use that instead. At the moment both commands still work, but this may change in the years to come.

7.4.1 Processing data

Let’s look at students_tidy again.

head(students_tidy)
## # A tibble: 6 × 8
##   course           level         question     year  disagree neutral agree     n
##   <chr>            <chr>         <chr>        <fct>    <dbl>   <dbl> <dbl> <dbl>
## 1 Prof Skills      Undergraduate good_teacher 1          0.8    0.05  0.15   121
## 2 Prof Skills      Undergraduate learned      1          0.3    0.4   0.3    121
## 3 Prof Skills      Undergraduate good_teacher 2          0.8    0.1   0.1    140
## 4 Prof Skills      Undergraduate learned      2          0.2    0.3   0.5    140
## 5 Research Methods Postgraduate  good_teacher 1          0.7    0.1   0.2     50
## 6 Research Methods Postgraduate  learned      1          0.5    0.4   0.1     50

At first glance, this looks tidy. The data is presented with each course on a row - surely I’m observing at the course level?

Well, actually, I probably often want to know what % of students agreed (or not) with each statement in each course. The observation in this case is actually the proportion of students, with question response, question, course, level, and year, all being extra pieces of information I know about the proportion.

I want much taller data.

(I’m using this specific example not because it’s a particularly easy example, but because this is a format you’ll see for data in the real world all the time, and people will make big decisions on this data. It’s a good idea to show you how to tidy it.)

7.4.2 pivot_longer

The pivot_longer command is a quick way to smush this data into a tall (or long) format. It creates two new columns, the names_to column which collects your old column names and your values_to column which collects the row values (fairly self-explanatory).

students_tall <- students_tidy |> 
  pivot_longer(cols = c(disagree, neutral, agree),
               names_to = "response",
               values_to = "prop")

This says:

In the above code block we:

  1. Create a new dataset called students_tall
  2. students_tall is based on students_tidy
  3. We want to squish the data into new columns (pivot_longer)
  4. We specify the columns we want to stretch into two (cols = c(disagree, neutral, agree))
  5. We specify the name for new column which will take the value of the old column headers (names_to = “response” - note we have to put quotation marks around the new name, which is not very common in tidyverse)
  6. We specify the name for the new column which will store the values of the old rows (values_to = “prop”)

And of course, we can see what this has done to the data:

head(students_tall)
## # A tibble: 6 × 7
##   course      level         question     year      n response  prop
##   <chr>       <chr>         <chr>        <fct> <dbl> <chr>    <dbl>
## 1 Prof Skills Undergraduate good_teacher 1       121 disagree  0.8 
## 2 Prof Skills Undergraduate good_teacher 1       121 neutral   0.05
## 3 Prof Skills Undergraduate good_teacher 1       121 agree     0.15
## 4 Prof Skills Undergraduate learned      1       121 disagree  0.3 
## 5 Prof Skills Undergraduate learned      1       121 neutral   0.4 
## 6 Prof Skills Undergraduate learned      1       121 agree     0.3

It’s very important to think about your variable names

I once spent a whole afternoon trying to recreate an error message I was getting with this, when I realised that I was saying names_to = “question”. The variable question already exists in the dataset, and so R was re-writing the variable every time it gathered the data. Unique variable names are really helpful!

7.4.3 pivot_wider

What if, after all that, you realise that you never wanted your data gathered at all? pivot_wider is here to rescue you.

Just as before, pivot_wider wants to know the names and the value, but this time, it will split those two columns into multiple columns. This time we want all that data to be spread out like marmalade on toast, so we don’t exclude any columns (in fact, try excluding the columns and see what spread says. )

students_wide <- students_tall |> 
  pivot_wider(names_from = response,
              values_from = prop)

And of course we can view this:

head(students_wide)
## # A tibble: 6 × 8
##   course           level         question     year      n disagree neutral agree
##   <chr>            <chr>         <chr>        <fct> <dbl>    <dbl>   <dbl> <dbl>
## 1 Prof Skills      Undergraduate good_teacher 1       121      0.8    0.05  0.15
## 2 Prof Skills      Undergraduate learned      1       121      0.3    0.4   0.3 
## 3 Prof Skills      Undergraduate good_teacher 2       140      0.8    0.1   0.1 
## 4 Prof Skills      Undergraduate learned      2       140      0.2    0.3   0.5 
## 5 Research Methods Postgraduate  good_teacher 1        50      0.7    0.1   0.2 
## 6 Research Methods Postgraduate  learned      1        50      0.5    0.4   0.1