Statistics is always tricky especially for the non statistician. I find it easier to follow a flowchart to decide on the correct statistical test. Thie relies on understanding some semantics:
Group=sample ie the population that has been measured. If you have measured from two groups or populations then these are two samples eg between patients from one hospital and another hospital
Observations= the actual measurement eg the height of that patient.
Matching = Dependent= whether the population is matched eg repeated measures from the same subject ()
Variable = the parameter being measured eg height of patient. This is similar to a column in a standard spreadsheet.
There are many statistical tests to choose from so I use the following diagram to help me choose:
In the above figure, populations that are matched are along the y axis and unmatched along the x axis. Subscript letters are non-parametric tests and superscript are parametric as follows:
wrs= Wilcoxon rank sums
wmp=Wilcoxon matched pairs
f=Friedman test
A (if matched)=Repeated measures one way ANOVA
A (if unmatched)=One way ANOVA
k=Kruskall-Wallis
Note the two sample unmatched t test could also be a Welch corrected unpaired t test if the variances are not equal.
Note the above diagram is when one variable is being compared between populations
For example if you want to compare the mean of two unmatched populations around one variable you would use a unpaired t-test if the sampling was parametric or a Mann-Whitney U if it was non parametric.
R accomodates all of the stats tests there are so as a basic run through of them lets create our input dataframe
The tests are as follows:
The following relate the the large letters along the y-axis in the diagram
1 sample:
# one sample t-test
y1 = c(18.2, 20.1, 17.6, 16.8, 18.8, 19.7, 19.1)
t.test(y1,mu=3) # Ho: mu=3
2 sample:
#Create the data frame
y1 = c(18.2, 20.1, 17.6, 16.8, 18.8, 19.7, 19.1)
y2 = c(17.4, 18.7, 19.1, 16.4, 15.9, 18.4, 17.7)
y = c(y1, y2)
n = rep(7, 2)
group = rep(1:2, n)
data = data.frame(y = y, group = factor(group))
# paired t-test
t.test(y1,y2,paired=TRUE) # where y1 & y2 are numeric
##
## Paired t-test
##
## data: y1 and y2
## t = 1.9001, df = 6, p-value = 0.1062
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.2754342 2.1897199
## sample estimates:
## mean of the differences
## 0.9571429
Multiple samples:
#ANOVA for matched samples
y1 = c(18.2, 20.1, 17.6, 16.8, 18.8, 19.7, 19.1)
y2 = c(17.4, 18.7, 19.1, 16.4, 15.9, 18.4, 17.7)
y3 = c(15.2, 18.8, 17.7, 16.5, 15.9, 17.1, 16.7)
y = c(y1, y2, y3)
n = rep(7, 3)
group = rep(1:3, n)
data = data.frame(y = y, group = factor(group))
fit = lm(y ~ group, data)
anova(fit)
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## group 2 11.007 5.5033 3.9683 0.03735 *
## Residuals 18 24.963 1.3868
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Get the confidence intervals:
confint(fit)
## 2.5 % 97.5 %
## (Intercept) 17.679156 19.5494153
## group2 -2.279616 0.3653301
## group3 -3.093902 -0.4489556
#Plot it out in ggplot
new = data.frame(Fitted = fitted(fit),
Residuals = resid(fit), Treatment = data$group)
#and then produce the plot:
library(ggplot2)
ggplot(new, aes(Fitted, Residuals, colour = Treatment)) + geom_point(size=18)
2 sample:
# independent 2-group t-test
t.test(y1,y2) # where y1 and y2 are numeric
Multiple samples:
This is the anova as illustrated above
1 sample:
wilcox.test(y1,y2,paired=TRUE)
2 sample:
wilcox.test(y1,y2,paired=TRUE)
Multiple samples:
#Kruskall Wallis
kruskal.test(y~A) # where y1 is numeric and A is a factor
1 sample:
wilcox.test(immer$Y1, immer$Y2)
2 sample:
# dependent 2-group Wilcoxon Signed Rank Test
wilcox.test(y1,y2,paired=TRUE) # where y1 and y2 are numeric
Multiple samples:
friedman.test(y~A|B)
# where y are the data values, A is a grouping factor
# and B is a blocking factor