Chapter 11 How to choose your stats

This chapter is potentially controversial. I deliberately haven't discussed statistics in great detail in this book, not least because I feel I re-learn stats every few years. This chapter is not setting out to teach you statistics, and it breezes past many statistical assumptions and issues. Instead, I want to give you a quick reference for running statistical analyses in R and, importantly, for interpreting the results.

A much, much better book is Andy Field, Jeremy Miles, and Zoe Field's brilliant Discovering Statistics Using R.

You can skip this chapter if:

  • You are comfortable deciding which statistical test to use

So it's happened. You've collected your data, you've installed R, and now you're ready to run some analyses! But what on earth do you do next? How do you choose and interpret the right statistic?

To be brutally honest, only repeated practice and engaging with statistics can build this skill, and the more you learn, the more uncertain you may feel! But what this chapter can do is give you a bit of a cheat sheet and show you the 'greatest hits' of statistical testing in R. I will use the APA 7th Edition Numbers and Statistics guide to report the statistical tests, and show you how to extract that information from the output, but you should also look up the style guide for wherever your report is headed.

For this chapter I will use in-built R datasets and fake datasets to demonstrate these tests. I will also use the tidyverse package throughout (which offers a few more datasets), so you will want to have that loaded.

I will note here all the packages that are used in this chapter, and highlight them again at the tests where they're needed. Remember, you can always install a package with install.packages("package_name"), or install everything at once as sketched after this list:

  • library(tidyverse)

  • library(vcd)

  • library(broom)

  • library(pgirmess)

  • library(clinfun)
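
If you're starting fresh, a single call can install the lot. A quick sketch (I've added knitr to the vector because we'll borrow its kable() function for a table later in the chapter):

install.packages(c("tidyverse", "vcd", "broom", "pgirmess", "clinfun", "knitr"))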

11.2 My response is a categorical variable

11.2.1 Categorical Response, Historical/Known Comparisons

11.2.1.1 Proportion of responses different from a known/expected result (1 Sample Proportion Test)

Our Data

Let’s say we have 20 German Shepherd dogs, and we know that the prevalence of hip dysplasia, a very painful condition, is usually 18% in this breed. We want to know if these particular dogs (perhaps they are from a new genetic line) are showing less hip dysplasia than we would expect. We have recorded whether or not these dogs are showing signs of dysplasia as a ‘yes/no’ categorical variable, and 3 dogs are showing signs of hip dysplasia.

We can run a one-sample proportion test (here, an exact binomial test) like so:

binom.test(x = 3, n = 20, p = 0.18, alternative = "two.sided")
## 
##  Exact binomial test
## 
## data:  3 and 20
## number of successes = 3, number of trials = 20, p-value = 1
## alternative hypothesis: true probability of success is not equal to 0.18
## 95 percent confidence interval:
##  0.03207094 0.37892683
## sample estimates:
## probability of success 
##                   0.15

Note that we don't even need any real 'data' to run this test - it's simply drawn from statistical probabilities. We would write this up as:

In this study, 15% of the dogs presented with signs of hip dysplasia, which is not significantly different from the historical prevalence of 18% in a binomial test (95% CI [3%-38%], p > .9).
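
As an aside, if you do have the raw yes/no records rather than a pre-counted total, you can compute x and n from them first. A minimal sketch, assuming a hypothetical signs vector holding the 20 dogs' results:

# hypothetical raw records: 3 dogs showing signs, 17 not
signs <- c(rep("yes", 3), rep("no", 17))
binom.test(x = sum(signs == "yes"), n = length(signs), p = 0.18)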

What if we had more dogs?

We've continued running this study, mainly because we're worried that 20 dogs don't capture enough of the natural variation we see in the world. This time we have 50 dogs from this genetic line, and we've seen 4 of them develop dysplasia. As we now have more than 30 responses, we can use a different version of the test.

prop.test(x = 4, n = 50, p = 0.18, alternative = "two.sided")
## 
##  1-sample proportions test with continuity correction
## 
## data:  4 out of 50, null probability 0.18
## X-squared = 2.7439, df = 1, p-value = 0.09763
## alternative hypothesis: true p is not equal to 0.18
## 95 percent confidence interval:
##  0.0259402 0.2011004
## sample estimates:
##    p 
## 0.08

This time the output gives us a little more information to report:

In this study, 8% of the dogs presented with hip dysplasia, which is not significantly different from the historical prevalence of 18% in a one-sample proportion test (95% CI [3%-20%], X2[1, n = 50] = 2.74, p = .098).

11.2.2 Categorical Response, Categorical Explanatory

11.2.2.1 Categorical Response (Repeated Measures), Categorical Explanatory (McNemar's Test)

Our Data

We have 40 German Shepherd dogs, some of whom already present with dysplasia, a painful condition. We’re concerned that after a season of work more dogs will develop dysplasia, and want to see if this is true. For this dataset we will create a contingency table:

work <- matrix(c(2, 5, 38, 35), 
              ncol=2, 
              byrow=TRUE,
              dimnames = list(c("Dysplasia", "No Dysplasia"),
                              c("Before Work Season", "After Work Season")))

We are going to run a specific test which accounts for the fact that we have the same dogs in both conditions (before and after work):

mcnemar.test(work)
## 
##  McNemar's Chi-squared test with continuity correction
## 
## data:  work
## McNemar's chi-squared = 23.814, df = 1, p-value = 1.061e-06

After the work season significantly more dogs developed dysplasia in McNemar's test (X2[1, n = 40] = 23.81, p < .001).
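
If you have the raw per-dog records rather than a pre-built table, base R's table() will cross-classify them for you. A minimal sketch with hypothetical vectors (not the summary data above); note that each cell counts dogs by their combination of before and after status, which is the pairing McNemar's test relies on:

# hypothetical paired records: one entry per dog, in the same dog order in both vectors
before <- c(rep("Dysplasia", 2), rep("No Dysplasia", 38))
after <- c(rep("Dysplasia", 5), rep("No Dysplasia", 35))
mcnemar.test(table(before, after)) # rows = before status, columns = after status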

11.2.2.2 Proportion of responses compared across groups (Fisher's/Chi2 test)

11.2.2.2.1 Fisher’s Exact Test (Small Sample Sizes / Summarised Data)

Our Data

German Shepherd Dogs suffer from hip dysplasia, a very painful condition. We know that more inbred dogs are more susceptible, and so we have followed 20 highly inbred dogs and 20 less inbred dogs and observed whether they developed the disease or not. As the prevalence is low (18% historically), we would expect only around 3.6 dogs in either group to develop dysplasia by chance. Because these expected counts are small (below 5), we should use Fisher's Exact Test, which is more robust for small sample sizes.

Of the inbred dogs, we've seen 4 cases of dysplasia. Of the less inbred dogs, we've seen 2 cases. Like before, we don't need a full raw dataset, but fisher.test does want the counts arranged as a 2x2 contingency table (a matrix) rather than as two separate vectors:

fisher.test(matrix(c(4, 16, 2, 18), ncol = 2, byrow = TRUE))
## 
##  Fisher's Exact Test for Count Data
## 
## data:  matrix(c(4, 16, 2, 18), ncol = 2, byrow = TRUE)
## p-value = 0.6614
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##   0.2730416 27.4885579
## sample estimates:
## odds ratio 
##   2.205531

We do need to calculate one more thing to add to our report, the sample odds ratio (the cross-product of the table):

(4 * 18) / (16 * 2)
## [1] 2.25

And this is reported like so:

A Fisher's Exact Test found no significant difference in the frequency of dysplasia occurring in highly inbred dogs versus less inbred dogs (OR = 2.25, 95% CI [0.27, 27.49], p = .661).

11.2.2.2.2 Chi2 Test (Larger Sample Sizes)

Our Data

This time we have observed more German Shepherd Dogs, a highly inbred line which we think will be more prone to hip dysplasia, and a less inbred line which we think will be less prone to hip dysplasia. For this dataset we will create a contingency table:

gsd <- matrix(c(16, 12, 84, 86), 
              ncol=2, 
              byrow=TRUE,
              dimnames = list(c("Dysplasia", "No Dysplasia"),
                              c("Inbred", "Less Inbred")))

With a Chi2 test we often also want to calculate an effect size, and there's a handy package (vcd) which we can use to calculate Cramer's V. We'll calculate it here too. To interpret V, roughly 0.10 represents a small effect size, 0.30 a medium effect size, and 0.50 a large effect size.

chisq.test(gsd)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  gsd
## X-squared = 0.30714, df = 1, p-value = 0.5794
# Odds Ratio
(16 * 86) / (12 * 84)
## [1] 1.365079
library(vcd)
assocstats(gsd)
##                      X^2 df P(> X^2)
## Likelihood Ratio 0.57672  1  0.44760
## Pearson          0.57481  1  0.44835
## 
## Phi-Coefficient   : 0.054 
## Contingency Coeff.: 0.054 
## Cramer's V        : 0.054

And report like so:

A Chi2 test found no significant difference in the frequency of dysplasia occurring in highly inbred dogs versus less inbred dogs (OR = 1.37, X2[1, n = 198] = 0.307, V = .05, p = .579).

11.2.3 Categorical Response, Numerical Explanatory (Logistic Regression)

Logistic regression is a bit of a 'catch-all' phrase, certainly in this chapter. It's also having its own day in the sun and is now often called a form of machine learning. I have mixed feelings about this that we can't get into right now.

If you’re reading this chapter from top-to-bottom, this subheading will feel like a huge jump in difficulty. It is. We are beginning to step into the land of generalised linear models which are powerful statistical tools. If you’re very new to statistics and this is where your path has led you, you should really seek out advice from others. This is getting into the fun, meaty parts of statistical theory. But I did say I’d walk you through it . . .

Our Data

We have 21 German Shepherd Dogs. We took a measure of inflammation from their bodies, and then later recorded whether or not they were showing signs of hip dysplasia. We want to see if inflammation (a numerical score) can be used to predict whether the dogs will end up with hip dysplasia (a categorical variable, 0 = no signs, 1 = hip dysplasia).

We will need to generate this data in R:

dysp <- tibble(dysplasia = c(1, 1, 1, 1, 1, 1, 1,
                              0, 0, 0, 0, 0, 0, 0,
                              0, 0, 0, 0, 0, 0, 0),
                inflammation =c(0.91, 0.79, 1.40, 0.71, 1.01, 0.77, 0.85,
                                0.42, 1.02, 0.31, 0.05, 1.17, 0.04, 0.36, 
                                0.12, 0.02, 0.05, 0.42, 0.92, 0.72,  1.05)) 

To run a binary logistic regression (predicting yes vs no from a numerical variable) we can use the glm command. We do need to take a few extra steps. For reporting we’ll also make use of the broom package.

logit <- glm(dysplasia ~ inflammation, data = dysp, family = "binomial")
summary(logit)
## 
## Call:
## glm(formula = dysplasia ~ inflammation, family = "binomial", 
##     data = dysp)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5715  -0.5727  -0.3094   1.0397   1.4914  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept)    -3.190      1.484  -2.150   0.0316 *
## inflammation    3.488      1.740   2.004   0.0450 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 26.734  on 20  degrees of freedom
## Residual deviance: 20.534  on 19  degrees of freedom
## AIC: 24.534
## 
## Number of Fisher Scoring iterations: 5
logLik(logit)
## 'log Lik.' -10.26719 (df=2)
exp(cbind(OddsRatio = coef(logit), confint(logit)))
##                OddsRatio        2.5 %       97.5 %
## (Intercept)   0.04118154 0.0008355444    0.4146414
## inflammation 32.71638289 1.9151146706 2735.6121035

And we could report this like so:

For each unit increase in inflammation, the odds of developing hip dysplasia increase by a factor of 32.7 in a binary logistic regression (Table 11.1).

library(broom)
library(knitr) # kable() comes from knitr, not tidyverse

tidy(logit) |> 
  mutate(OddsRatio = exp(coef(logit))) |> 
  select(term, estimate, std.error, statistic, OddsRatio, p.value) |> 
  kable(caption = "Coefficient table for variables in binary logistic regression predicting hip dysplasia\nfrom inflammation markers in 21 German Shepherd Dogs")
Table: 11.1: Coefficient table for variables in binary logistic regression predicting hip dysplasia from inflammation markers in 21 German Shepherd Dogs

| term         |  estimate | std.error | statistic |  OddsRatio |   p.value |
|:-------------|----------:|----------:|----------:|-----------:|----------:|
| (Intercept)  | -3.189765 |   1.48362 | -2.149988 |  0.0411815 | 0.0315561 |
| inflammation |  3.487876 |   1.74028 |  2.004204 | 32.7163829 | 0.0450482 |
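
One handy bonus of a fitted glm is that we can ask it for predicted probabilities at inflammation values we care about; a minimal sketch using predict() (the two inflammation values here are arbitrary examples):

# predicted probability of dysplasia at two example inflammation scores
predict(logit, newdata = tibble(inflammation = c(0.5, 1.0)), type = "response")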

If you want to dig more deeply into logistic regressions, or model more predictor variables, I think this link is good. Remember to talk to your nearest friendly statistician.

11.3 My response is a numerical variable

11.3.1 Numerical Response, Historical/Known Comparisons (One Sample t-Test)

Our Data

The dataset trees gives us the height in feet of 31 black cherry trees. We want to know if these trees differ from a historical average height of 80ft. The t.test command in R is fairly bare-bones, so alongside the test itself (where mu is our comparison value of 80) we need to calculate the mean and standard deviation of height to report properly.

t.test(trees$Height, mu = 80)
## 
##  One Sample t-test
## 
## data:  trees$Height
## t = -3.4952, df = 30, p-value = 0.001496
## alternative hypothesis: true mean is not equal to 80
## 95 percent confidence interval:
##  73.6628 78.3372
## sample estimates:
## mean of x 
##        76
trees |> 
  summarise(mean= mean(Height),
            sd = sd(Height))
##   mean       sd
## 1   76 6.371813

We would report this like so:

The mean height (76ft +/- 6.37 SD) for black cherry trees was smaller than the historical average of 80ft in a one-sample t-test (t(30) = -3.50, p = .001).
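
The one-sample t-test assumes the heights are roughly normally distributed. If you want a quick check before trusting the result, base R has you covered; a minimal sketch:

# Shapiro-Wilk test: a significant p-value would suggest non-normal data
shapiro.test(trees$Height)
# QQ plot: points should sit close to the line if the data are roughly normal
qqnorm(trees$Height)
qqline(trees$Height)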

11.3.2 Numerical Response (Repeated Measures), Categorical Explanatory (Paired t-Test)

Our Data

In this example we have 10 housecats who we put on a special weight-loss diet. We weigh them before the diet and after, and we want to see if they have lost any weight. We can do this with a paired t-test.

diet <- tibble (before = c(5.04, 4.63, 4.04, 5.10, 5.43, 4.83, 3.45, 3.49, 5.02, 4.81),
                after = c( 4.78, 2.49, 4.46, 2.03, 5.13, 7.23, 3.50, 1.89, 3.30, 3.91))

diet |> 
  summarise(b_mean = mean(before),
            a_mean = mean(after),
            b_sd = sd(before),
            a_sd = sd(after))
## # A tibble: 1 × 4
##   b_mean a_mean  b_sd  a_sd
##    <dbl>  <dbl> <dbl> <dbl>
## 1   4.58   3.87 0.689  1.62
t.test(diet$before, diet$after, paired = TRUE, alternative = "two.sided")
## 
##  Paired t-test
## 
## data:  diet$before and diet$after
## t = 1.4615, df = 9, p-value = 0.1779
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  -0.3900511  1.8140511
## sample estimates:
## mean difference 
##           0.712

After being on the weight-loss diet, cats weighed 3.9kg (+/- 1.62kg SD), compared to a mean weight of 4.6kg (+/- 0.69kg SD) before going on the diet, but this difference of 0.71kg (95% CI [-0.39, 1.81]) was not statistically significant in a paired t-test (t(9) = 1.46, p = .178).
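
If your style guide also wants an effect size for a paired comparison, one common choice is Cohen's d calculated on the paired differences (the mean difference divided by the standard deviation of the differences). A minimal sketch:

# Cohen's d for paired samples
with(diet, mean(before - after) / sd(before - after))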

11.3.3 Numerical Response, Categorical Explanatory (ANOVA/GLM)

Traditionally, depending on how many explanatory variables you have and whether your design is balanced or unbalanced, you would select a different type of test, leading up to a general linear model as the most flexible form.

  • One explanatory factor = 1 way ANOVA

  • Two explanatory factors and balanced design = 2-way ANOVA

  • Three or more explanatory factors, or an unbalanced design = general linear model (glm)

I like using the Grafen and Hails approach to statistics, and considering all ANOVAs to be a glm. I am fundamentally lazy, and don't like to remember three separate things when one model will do it all for me. Real statisticians may cry at this.

Our data

Let's use the chickwts data to explore this. chickwts describes a simple experiment with six different feedtypes and the weights of the chicks fed on them. We are looking for the feedtype which produces the heaviest chicks on average.

As an ANOVA

The basic steps of running a one-way ANOVA are to create the model, inspect it, check the model assumptions with a residual plot, and run pairwise comparisons to define the differences between the levels. It's also very difficult to interpret this without a plot, so we'll add one of those to our interpretation.

chk_a <- aov(weight ~ feed, data = chickwts)

plot(chk_a, 1)

summary(chk_a)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## feed         5 231129   46226   15.37 5.94e-10 ***
## Residuals   65 195556    3009                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(chk_a)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = weight ~ feed, data = chickwts)
## 
## $feed
##                            diff         lwr       upr     p adj
## horsebean-casein    -163.383333 -232.346876 -94.41979 0.0000000
## linseed-casein      -104.833333 -170.587491 -39.07918 0.0002100
## meatmeal-casein      -46.674242 -113.906207  20.55772 0.3324584
## soybean-casein       -77.154762 -140.517054 -13.79247 0.0083653
## sunflower-casein       5.333333  -60.420825  71.08749 0.9998902
## linseed-horsebean     58.550000  -10.413543 127.51354 0.1413329
## meatmeal-horsebean   116.709091   46.335105 187.08308 0.0001062
## soybean-horsebean     86.228571   19.541684 152.91546 0.0042167
## sunflower-horsebean  168.716667   99.753124 237.68021 0.0000000
## meatmeal-linseed      58.159091   -9.072873 125.39106 0.1276965
## soybean-linseed       27.678571  -35.683721  91.04086 0.7932853
## sunflower-linseed    110.166667   44.412509 175.92082 0.0000884
## soybean-meatmeal     -30.480519  -95.375109  34.41407 0.7391356
## sunflower-meatmeal    52.007576  -15.224388 119.23954 0.2206962
## sunflower-soybean     82.488095   19.125803 145.85039 0.0038845

We first inspect the residuals-vs-fitted plot to check that the residuals are evenly scattered around zero. If we saw a pattern here we would know that the model is consistently underfitting or overfitting some observations, and that one of the ANOVA's key assumptions had not been met (in this case there are many options going forward, but you really should be talking to a statistician or friendly peer at this point).
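
We can also ask the same plotting method for a QQ plot of the residuals, which is the more direct check that they are normally distributed; a quick sketch:

# the second diagnostic plot for an aov/lm object is a QQ plot of the residuals
plot(chk_a, 2)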

There was a significant effect of feedtype on chick weight in a one-way ANOVA (F(5,65) = 15.37, p < .001, Figure 11.1). Post-hoc comparisons with Tukey's HSD found variety in performance: sunflower feeds significantly outperformed soybean (82g, p = .004), linseed (110g, p < .001), and horsebean (169g, p < .001); casein outperformed horsebean (163g, p < .001), linseed (105g, p < .001), and soybean (77g, p = .008); meatmeal outperformed horsebean (117g, p < .001); and soybean outperformed horsebean (86g, p = .004).

chickwts |> 
  ggplot(aes(x = reorder(feed, -weight), y = weight)) +
  geom_boxplot()+
  theme_classic() +
  labs(y = "Weight (g)",
       x = "Feedtype")
Figure: 11.1: Chick weight (g) per feed type, data from 'chickwts'

As a linear model

The lm function fits linear models and can do ANOVAs and regressions. It's my preferred way of running this kind of analysis, partly because I think the R2 number (a measure of how good a 'fit' the model is) is a really useful way to help present the data. We can also use the broom package for extra reporting and to visualise effects.

chk_l <- lm(weight ~ feed, data = chickwts)
summary(chk_l)
## 
## Call:
## lm(formula = weight ~ feed, data = chickwts)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -123.909  -34.413    1.571   38.170  103.091 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    323.583     15.834  20.436  < 2e-16 ***
## feedhorsebean -163.383     23.485  -6.957 2.07e-09 ***
## feedlinseed   -104.833     22.393  -4.682 1.49e-05 ***
## feedmeatmeal   -46.674     22.896  -2.039 0.045567 *  
## feedsoybean    -77.155     21.578  -3.576 0.000665 ***
## feedsunflower    5.333     22.393   0.238 0.812495    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 54.85 on 65 degrees of freedom
## Multiple R-squared:  0.5417, Adjusted R-squared:  0.5064 
## F-statistic: 15.36 on 5 and 65 DF,  p-value: 5.936e-10
library(broom)
chk_l |> 
  tidy() |> 
  ggplot(aes(x = reorder(term, - estimate), y = estimate)) +
  geom_point(aes(colour = term, shape = term)) +
  geom_errorbar(aes(x = term, ymin = estimate-std.error, ymax = estimate+std.error, colour = term)) +
  theme_classic()

Feedtype had a significant effect in a linear model, explaining 51% of the variance observed in chick weight (F(5,65) = 15.36, p < .001). Meatmeal produced a moderately lower weight than casein (t = -2.04, p = .046), while soybean (t = -3.58, p < .001), linseed (t = -4.68, p < .001), and horsebean (t = -6.96, p < .001) all produced much lower weights. There was no significant difference between the casein diet and the sunflower diet (t = 0.24, p = .812).
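
If you prefer pulling those model-level numbers out programmatically rather than reading them off summary(), broom's glance() returns them as a one-row tibble; a minimal sketch:

# one row containing r.squared, adj.r.squared, statistic (the F value), p.value and more
glance(chk_l)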

11.3.4 Numerical Response, Numerical Explanatory (Correlation & Regression)

Our data

We will use the trusty cars dataset which describes the speed (mph) and stopping distance (ft) of cars in the 1920s to talk about two types of test, Pearson’s correlation and linear regression.

As Pearson’s correlation

As many, many, many folk say, correlation is not causation. In statistical terms, that means that correlation does not aim to predict y from x, only to ask whether the two variables vary together in a structured way.

cor.test(x = cars$speed, y = cars$dist, alternative = "two.sided", method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  cars$speed and cars$dist
## t = 9.464, df = 48, p-value = 1.49e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6816422 0.8862036
## sample estimates:
##       cor 
## 0.8068949

Speed has a strong, positive association with the stopping distance of cars in a Pearson correlation (r = .81, 95% CI [.68, .89], t(48) = 9.46, p < .001).

As a regression

In a linear model, we can be more definitive in our reporting.

carm <- lm(dist ~ speed, data = cars)
summary(carm)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

Stopping distance increased by 3.9ft for every additional mph of speed (R2 = 64%, F(1,48) = 89.6, p < .001).
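
A plot makes this relationship much easier for a reader to grasp, and ggplot can overlay the fitted line for us; a quick sketch:

cars |> 
  ggplot(aes(x = speed, y = dist)) +
  geom_point() +
  geom_smooth(method = "lm") + # adds the fitted regression line with a confidence band
  theme_classic() +
  labs(x = "Speed (mph)", y = "Stopping distance (ft)")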

11.4 My response is an ordinal variable

11.4.1 Ordinal Response, Historical/Known Comparisons (One-Sample Wilcoxon Test)

Our data

In this example we have student ratings of this textbook from 1 (very unsatisfied) to 5 (very satisfied). We're hoping that, at the very least, people are satisfied and have ranked the book a 4. For this we can use the one-sample Wilcoxon test, but we also need to calculate the median for reporting.

ratings <- tibble(satisfaction = c(1,2,1,1,3,2,1,5,1,2,5,1,2,3,3,1,3))

ratings |> 
  summarise(median = median(satisfaction))
## # A tibble: 1 × 1
##   median
##    <dbl>
## 1      2
wilcox.test(ratings$satisfaction, mu = 4)
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  ratings$satisfaction
## V = 7, p-value = 0.0009212
## alternative hypothesis: true location is not equal to 4

The textbook achieved a median score of '2', or 'unsatisfied', from students, which was significantly lower than the desired median score of '4', or 'satisfied', in a one-sample Wilcoxon test (V = 7, p < .001).

11.4.2 Ordinal Response (Repeated Measures), Categorical Explanatory (Paired Wilcoxon Signed-Rank Test)

Our data

In this example we have student ratings of this textbook from 1 (very unsatisfied) to 5 (very satisfied). We made some changes to the book after the first set of reviews, and we're hoping the same students will now score the book higher. For this we can use the Wilcoxon signed-rank test, but we also need to calculate the medians for reporting.

ratings <- tibble(satisfaction1 = c(1,2,1,1,3,2,1,5,1,2,5,1,2,3,3,1,3),
                  satisfaction2 = c(1,4,2,3,3,3,2,5,2,4,5,3,3,3,4,1,3))

ratings |> 
  summarise(satis1 = median(satisfaction1),
            satis2 = median(satisfaction2))
## # A tibble: 1 × 2
##   satis1 satis2
##    <dbl>  <dbl>
## 1      2      3
wilcox.test(x = ratings$satisfaction1, ratings$satisfaction2, paired = TRUE, alternative = "two.sided")
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  ratings$satisfaction1 and ratings$satisfaction2
## V = 0, p-value = 0.004565
## alternative hypothesis: true location shift is not equal to 0

After revisions, the textbook received a median score of '3' (undecided), which was significantly higher than the pre-revision median score of '2' (unsatisfied) in a Wilcoxon signed-rank test (V = 0, p = .005).

11.4.3 Ordinal Response, Categorical Explanatory

Our data

In this example, we have measured student satisfaction with their R teaching on a 5-point scale from 1 (very unsatisfied) to 5 (very satisfied). We have three conditions: students who searched out their own materials (self-taught), students who received this textbook (textbook), and students who received the textbook plus tutorials (textbook plus). We want to know which students were most satisfied. For this we can use a Kruskal-Wallis test. We can also use two more functions, kruskalmc from library(pgirmess) and jonckheere.test from library(clinfun), to help us interpret the data. Jonckheere-Terpstra tests are technically a different type of test, but they are useful in exploring how ordered differences occur; in this case we expect that each intervention will improve student satisfaction, so that textbook-plus will be better than textbook, which itself will be better than self-taught.

library(clinfun)
library(pgirmess)

textbk <- tibble(self_taught = c(1,2,1,1,3,2,1,5,1,2,5,1,2,3,3,1,3),
                 textbook= c(1,4,2,3,3,3,2,5,2,4,5,3,3,3,4,1,3),
                 textbook_plus = c( 3, 2, 3, 3, 3, 3, 2, 4, 4, 5, 3, 4, 4, 5, 4, 5, 3)) |> 
  pivot_longer(cols = c(self_taught:textbook_plus),
               names_to = "condition",
               values_to = "score")



kruskal.test(score ~ condition, textbk)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  score by condition
## Kruskal-Wallis chi-squared = 10.503, df = 2, p-value = 0.00524
kruskalmc(score ~ condition, textbk, cont = "two-tailed")
## Multiple comparison test after Kruskal-Wallis, treatments vs control (two-tailed) 
## alpha: 0.05 
## Comparisons
##                             obs.dif critical.dif stat.signif
## self_taught-textbook       9.705882     11.42896       FALSE
## self_taught-textbook_plus 15.882353     11.42896        TRUE
jonckheere.test(textbk$score, as.numeric(as.factor(textbk$condition))) 
## 
##  Jonckheere-Terpstra test
## 
## data:  
## JT = 616, p-value = 0.001609
## alternative hypothesis: two.sided

Student satisfaction increased significantly with intervention (H(2) = 10.5, p = .005), with textbook-plus showing a significant difference (p < .05) from self-taught, but no difference between self-taught and textbook (p > .05). The directionality of the effect was significant in a Jonckheere-Terpstra test (JT = 616, p = .002, Figure 11.2).

textbk |> 
  ggplot(aes(x = condition, y = score)) +
  geom_boxplot()+
  theme_classic() +
  scale_y_continuous(breaks = seq(1,5,1), limits = c(1,5), labels = c("Very unsatisfied", "Unsatisfied", "Not sure",
                                                                  "Satisfied", "Very satisfied")) +
  scale_x_discrete(labels = c("Self-Taught", "Textbook", "Textbook plus tutorials")) +
  labs(x = "Teaching Method", y = "Score")
Figure: 11.2: Student satisfaction by teaching method

11.4.4 Ordinal Response, Numerical Explanatory (Spearman rank correlation)

Our data

When you have an ordinal response and either a numerical explanatory or an ordinal explanatory variable, a Spearman rank correlation is probably for you. A Spearman rank correlation is exactly the same as a Pearson correlation, but the data is ranked (or ordered) before the calculation is run. Spearman rank correlations are therefore often used for skewed data.

Let’s say we have 15 students who have ranked their satisfaction with the course (from 1, “Very unsatisfied” to 5, “Very satisfied”) and their grades (E, “very poor”, to A “excellent”). We think students who are very satisfied will have higher grades.

grades <- tibble(satisfaction = c(1, 5, 2, 1, 1, 5, 4, 5, 4, 5, 5, 3, 4, 3, 1),
                 grade = c(1, 4, 1, 2, 1, 4, 5, 5, 5, 4, 3, 3, 2, 2, 1))

cor.test(grades$satisfaction, grades$grade, method = "spearman")
## 
##  Spearman's rank correlation rho
## 
## data:  grades$satisfaction and grades$grade
## S = 121.25, p-value = 0.0005493
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.7834775

Student satisfaction was strongly positively associated with grade in a Spearman rank correlation (rs = .78, p < .001).
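
You can check the 'Pearson on ranks' definition for yourself: correlating the ranked values with the default (Pearson) method reproduces rho. A quick sketch:

# rank() converts each variable to ranks (ties receive average ranks)
cor(rank(grades$satisfaction), rank(grades$grade)) # matches the rho above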

11.5 I got all of them (generalised linear model)

If your data is looking extremely complicated, and you have multiple explanatory variables, you likely want a glm. At this point you really should consult a statistician and start thinking seriously about your model assumptions . . .
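
As a taste of what that looks like, here is a minimal sketch of a binary logistic glm with two predictors, using the built-in mtcars data to predict transmission type from weight and horsepower. Treat it as a shape to aim for, not a worked analysis:

# two explanatory variables; each coefficient is interpreted holding the other constant
multi <- glm(am ~ wt + hp, data = mtcars, family = "binomial")
summary(multi)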