Practical statistics in R


I finally learned how to apply statistics after 5 years of university. In my 5th year, I took a 2 month practical statistics course where everything was explained using the programming language R. I’ll walk you through the main concepts, including code, so you can just CTRL+F and don’t have to spend 2 months in class.

This article won’t teach you all the nitty-gritty details of statistics. Other resources on the internet already do an awesome job at that (for example, Khan Academy).

This article is meant to help you quickly find which test to use for your data. It is for people who have forgotten the contents of their statistics course, or for those who remember but don’t know how to apply the test using code.

The topics are presented as follows:
1. Normality assumption
2. One-sample tests (one sample t-test, sign test, Wilcoxon signed rank test)
3. Paired two-sample tests (paired t-test, Pearson correlation, Spearman rank correlation)
4. Independent two-sample tests (unpaired t-test, Mann-Whitney test)
5. Multiple groups, one factor (one-way ANOVA, Kruskal-Wallis)
6. Multiple groups, multiple factors (two-way ANOVA, Friedman)

Statistical tests have prior assumptions about the data. Parametric tests assume that the data follow a particular distribution (usually the normal distribution). Non-parametric tests make no or less strict assumptions about the population distribution, but are generally less powerful. Thus, tests are generally divided by the concept of normality: does your data follow the normal density curve or not? We can use a combination of methods to find out.

[Figure 1: the normal density curve]

Histogram: A bar graph where data is bucketed into columns along the horizontal axis. The vertical axis represents the count of occurrences in the data. The histogram of normally distributed data should look similar to the normal density curve in Figure 1. In R, plotting a histogram is simple:

hist(data)
[Figure 2: histograms of two datasets]

In Figure 2 above, we plotted histograms of two datasets. The histogram on the left looks similar to the bell shape in Figure 1, so we assume normality. The right graph, however, is not symmetric, so we suspect (we don’t draw conclusions yet!) non-normality.
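
If you want to try this without your own dataset, here is a minimal sketch with simulated data (the variable names are made up for illustration): rnorm() draws values from a normal distribution, while rexp() draws values from a right-skewed exponential distribution.

set.seed(1)                                    # make the simulation reproducible
normal_data = rnorm(100, mean = 50, sd = 5)    # roughly bell-shaped
skewed_data = rexp(100, rate = 0.1)            # right-skewed
par(mfrow = c(1, 2))                           # show two plots side by side
hist(normal_data)
hist(skewed_data)
par(mfrow = c(1, 1))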

QQ-plot: In this graph, the theoretical quantiles of the normal distribution (x-axis) are plotted against the quantiles obtained from the sample (y-axis). The quantile value q_α is the median if we choose α = 0.5. If α = 0.25, q_α is the value that splits the data into 25% below and 75% above q_α. The function qnorm() in R returns this boundary value for a given normal distribution. For example, suppose you want to find the 85th percentile of a normal distribution whose mean is 70 and whose standard deviation is 3. In R, you ask for:

qnorm(0.85, mean = 70, sd = 3)
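
For comparison, the corresponding sample quantile of your own data (what ends up on the y-axis of a QQ-plot) can be computed with quantile():

quantile(data, 0.85)   # the value below which 85% of your sample falls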

Regarding the QQ-plot, we can plot this using qqnorm(). Points that fall approximately on a straight line indicate that the data comes from a normal distribution.

qqnorm(data)
[Figure 3: QQ-plots of two datasets]

In Figure 3 above, the left graph shows points on a straight line, whereas the right figure deviates from a straight line. Therefore, we suspect that the data from the right figure might come from a non-normal distribution.
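
A reference line makes this easier to judge. As a small sketch, qqline() (base R) adds a line through the first and third quartiles, so deviations from normality stand out more clearly:

qqnorm(data)
qqline(data)   # points far away from this line suggest non-normality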

Shapiro-Wilk test: This is a test specifically designed to determine if data comes from a normal distribution. From this test, we get a p-value. With a high p-value, there is no evidence against normality and we can assume the data comes from a normal distribution. A general rule of thumb is that when the p-value is below 0.05, we suspect that the data comes from a non-normal distribution.

shapiro.test(data)

It is important to note that it is best practice to test normality using all three of the above methods, and to decide upon normality based on the combination of all three.
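
As a sketch of how you could combine the three checks in practice (the function name check_normality is made up for illustration):

check_normality = function(x) {
  par(mfrow = c(1, 2))
  hist(x)                  # should look roughly bell-shaped
  qqnorm(x); qqline(x)     # points should follow the straight line
  par(mfrow = c(1, 1))
  shapiro.test(x)          # p-value below 0.05 suggests non-normality
}
check_normality(data)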

To summarize, when we decide data is from a normal distribution, we can use parametric tests or non-parametric tests, where parametric tests are preferred as they are more powerful. When data is from a non-normal distribution, we can only use non-parametric tests.

Now that we know how to test for normality, we can move on to the actual tests.

For a one-sample test (is the mean of our data equal to, smaller than, or bigger than a certain value?) we can use the t-test, sign test, or Wilcoxon signed rank test. The t-test assumes that the data comes from a normal distribution; the other two do not. It is important to note that the other two tests can be used for data from a normal distribution as well, but the t-test then has greater power due to its prior assumptions about the data.

Parametric

One-sample t-test: to apply a one-sample t-test, we have to give R our data, our boundary value mu, and whether we want to know if the mean is equal to, smaller than, or bigger than the boundary value. Moreover, we need to give the confidence level (how sure do you want to be about the outcome). Generally, the confidence level is chosen as 0.95: we want to be 95% sure.

t.test(data, mu = 2800, alternative = "greater", conf.level = 0.95)
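
The call returns an object from which you can extract the parts you need, for example the p-value and the confidence interval:

result = t.test(data, mu = 2800, alternative = "greater", conf.level = 0.95)
result$p.value    # below 0.05: the mean is significantly greater than 2800
result$conf.int   # one-sided confidence interval for the mean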

Non-parametric

Sign test: the sign test is very simple: if you think the median of your data is x, you count how many data points are above and below that value; both counts should be approximately half of the data each. We do a binomial test to check this, which gives a p-value (since even if the counts differ from half, the difference could still be non-significant). Below, we first count how many of our data points are at or above 40, and then test whether this count is significantly lower than half of our data. We only print the p-value (using [[3]]). If this value is below 0.05, we can state that the median of our data is significantly lower than 40.

amount = sum(data >= 40)
binom.test(amount, length(data), p=0.5, alternative="less")[[3]]
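
The [[3]] works because the p-value is the third element of the returned test object; extracting it by name is a bit clearer:

binom.test(amount, length(data), p = 0.5, alternative = "less")$p.value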

Wilcoxon signed rank test: Although this test does not assume normality, it does assume a symmetric population (a stronger assumption than the sign test makes). The test works as follows: we subtract our suspected median from all data points, rank the absolute differences, and then sum the ranks of the data points whose value is bigger than the suspected median. A large sum indicates that the true median is higher than the suspected median. If you use this test in R, you get a corresponding p-value, for which a value below 0.05 means the difference is significant.

wilcox.test(data, mu = 40, alternative = "greater")

Two sample data can be divided into paired and independent (unpaired) data. A paired dataset means that each data point in one sample is uniquely paired to a data point in the second sample.

Say we have data of a group of students performing a 100 meter sprint. We let the students run two times, once without ingesting a soft drink before the race and once with. We want to find out whether there is a difference in run time between the two conditions. Data from both samples is uniquely paired by the fact that the same persons do the running test. For this type of data, we can use a paired t-test, a Pearson correlation, or a Spearman rank correlation, where the latter is a non-parametric test.

Parametric

Paired t-test: In R, we pass paired = TRUE to t.test() to indicate that the samples are paired. If the p-value resulting from the t-test is below 0.05, we can assume that there is a significant difference between the two groups.

data = read.table("run.txt")
data = subset(data, data$drink == 'lemo')
t.test(data$before, data$after, paired = TRUE)

We could also calculate the differences between the two measurements, and use a one-sample t-test to check whether the difference is significantly higher than 0:

t.test(data$after - data$before, mu = 0, alternative = "greater", conf.level = 0.95)

Pearson correlation test: A correlation test can be used to test whether there is a relationship between two samples, or whether both samples are independent of each other. A p-value below 0.05 means that the correlation is significant.

cor.test(data$before, data$after, method = "pearson")

Non-parametric

Spearman rank correlation test: This correlation test compares the ordering of the ranks of both samples. If the data are rank correlated, the sequences of ranks will run in parallel or in opposite directions. We use the same command as above, only changing the method:

cor.test(data$before, data$after, method = "spearman")

Data of two samples can also be independent. Say we again have data of a group of students performing a 100 meter sprint two times, where students run once without ingesting a drink before the race and once with. Only here, the students were divided into two groups: one drinking a soft drink, and one drinking an energy drink.

Now, we want to know which group improved their times the most. The data of both groups are independent, as both groups consist of different students.

Parametric

Unpaired t-test:

data = read.table("run.txt")
data_softdrink = subset(data, data$drink == 'lemo')
data_energydrink = subset(data, data$drink == 'energy')
data_softdrink['differences'] = data_softdrink$after - data_softdrink$before
data_energydrink['differences'] = data_energydrink$after - data_energydrink$before
t.test(data_softdrink$differences, data_energydrink$differences, paired = FALSE)
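
Note that by default t.test() performs Welch’s t-test, which does not assume equal variances in the two groups. If you have reason to assume equal variances, you can request the classical pooled two-sample t-test:

t.test(data_softdrink$differences, data_energydrink$differences, var.equal = TRUE)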

Non-parametric

Mann-Whitney test: here, we combine both samples and calculate ranks for all data points together. If there is no significant difference between the samples, the sums of ranks of both samples should be similar. If the sum of ranks of one sample is much higher than that of the other, this indicates that the values in that sample are significantly higher. In R, we call this test with the same function as the earlier Wilcoxon test, this time providing two samples. The reason for the same name is that the Mann-Whitney U test is also called the Wilcoxon rank-sum test (but it is not the same as the Wilcoxon signed rank test from before!).

wilcox.test(data_softdrink$differences, data_energydrink$differences)

If we were to add one more group of students to our experiment, for example a group drinking coffee, we would extend our experiment to a three-group problem, while still testing one factor: the running time difference.

Parametric

One-way ANOVA: An ANOVA test assumes the data from each group to come from a normal distribution, and an equal sample size across groups is preferable.

To use the ANOVA test in R, data should be in two columns: one column containing the data values, and a second column containing the group to which the data belongs.
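
As a made-up illustration of that layout (the values and group labels are invented):

data = data.frame(
  differences = c(1.2, 0.8, 1.5, 2.1, 1.9, 2.4),
  group       = c("lemo", "lemo", "energy", "energy", "coffee", "coffee")
)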

To better test the normality assumption, we can make a QQ-plot of the residuals of the ANOVA, which effectively checks the data of all groups combined. The second plot is used to test for equal variances: we plot the residuals of the fitted linear model against the fitted values.

[Figure 4: QQ-plot of the residuals and residuals plotted against the fitted values]

As before, the QQ-plot should show a straight line. The second plot tests for equal variances across groups: it should show an equal spread of residuals along the horizontal axis (where the groups are presented). In Figure 4 above, both assumptions hold.

# tell R which column contains the factor to test for
data$group = as.factor(data$group)
# fit a linear model
data_model = lm(differences ~ group, data = data)
# apply the ANOVA and get the summary including the p-value and other useful metrics
anova(data_model)
summary(data_model)
# check normality with a QQ-plot of the residuals
qqnorm(residuals(data_model))
# if the spread doesn’t differ over the x-axis, we have equal variances
plot(fitted(data_model), residuals(data_model))

Non-parametric

Kruskal-Wallis: A non-parametric counterpart of ANOVA which does not rely on normality but on ranks. It computes the sum of ranks of each group within the total data (just as the Mann-Whitney test discussed before). This test does assume a sample size bigger than 5 per group. To apply it, we only have to program one line to get the output including the p-value.

kruskal.test(differences ~ group, data = data)

Imagine we have built 3 different websites, and we want to know which site is the most user-friendly. We want to test this by measuring the time it takes a person to search for and find relevant information on the website. However, the time this takes depends heavily on a person’s skill in using a computer. Thus, we want to compare the 3 different websites, while making sure that any differences are not caused by - or dependent on - the skill level of the users. In this case, we already expect a lower time needed for more skilled users. We are not interested in the results of this factor individually, and only want to make sure this factor does not influence our results regarding the other factor. Such a factor is called a ‘block factor’.

Parametric

Two-way ANOVA: To perform the two-way ANOVA, we first have to investigate the interaction effect of the two factors. If the interaction effect is non-significant (p-value higher than 0.05), we can investigate the effects of the factors individually.

First, we tell R that the columns website and skill are our factors. Then, we can apply a linear model testing both factors together by using the * sign:

data = read.table('search.txt')
time = data$time   # assuming the search times are in a column called 'time'
interface = as.factor(data$website)
skill = as.factor(data$skill)
anova(lm(time ~ interface * skill))

When the p-value of the interaction effect is above 0.05, we can continue with the additive model by using the + sign:

addmodel = lm(time ~ interface + skill)
anova(addmodel)
summary(addmodel)

From this model, we probably see a significant p-value for skill (< 0.05), but that is expected of a block factor: every person’s skill level is different. What we are interested in is the p-value of website, taking the block factor into account. If this value is significant, H0 is rejected and we can conclude that search time differs significantly between the websites.

However, before drawing final conclusions, we also have to test for the assumptions of the ANOVA test, just as with the one-way ANOVA!

qqnorm(residuals(addmodel)); plot(fitted(addmodel), residuals(addmodel))

Non-parametric

Friedman test: The non-parametric alternative to the two-way ANOVA is the Friedman test. To use it to investigate the effect of the website on search time while taking the skill level into account, we call the Friedman test with the variables in this order: time (the measurement variable), the website factor (the factor we’re interested in), and skill (the block factor). Again, we look at the p-value to determine significance.

friedman.test(time, interface, skill)
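
friedman.test() also accepts a formula interface, which can be more readable when all variables live in one data frame (here assuming the search times are stored in a column called time):

friedman.test(time ~ website | skill, data = data)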

In this blog post, we covered the essential statistical tests used to draw conclusions about experiments comparing groups (is drink A better than drink B? is website A better than websites B and C?). We covered:

  • The most important assumption by which statistical tests are divided: normality. Parametric tests assume normality, while non-parametric tests do not. When data comes from a normal distribution, we can use parametric or non-parametric tests, with the former having more power due to the normality assumption. When data does not come from a normal distribution, we cannot use a parametric test, only a non-parametric one.
  • Then, we discussed the parametric tests: one sample t-test, paired t-test, Pearson correlation, unpaired t-test, one-way ANOVA, and the two-way ANOVA.
  • Alongside, we discussed the non-parametric tests: sign test, Wilcoxon signed rank test, Spearman rank correlation, Mann-Whitney test, Kruskal-Wallis, and the Friedman test.
  • And for every test, code in R was provided to help you directly apply the tests to your data and problems!

Thank you for reading. If this post helped you in any way, please consider following my account to stay up-to-date on future blog posts.
