Introduction

Analysis of Variance (ANOVA) is a statistical method used to test differences between two or more means. It may seem odd that the technique is called “Analysis of Variance” rather than “Analysis of Means”. As you will see, the name is appropriate because inferences about means are made by analyzing variance.

ANOVA Design

In describing an ANOVA design, the term factor is a synonym of independent variable. An ANOVA conducted on a design in which there is only one factor is called a one-way ANOVA. If an experiment has two factors, then the ANOVA is called a two-way ANOVA. The number of levels of a factor is the number of different values it takes.

For example, suppose an experiment on the effects of age and gender on reading speed were conducted using three age groups (8 years, 10 years, and 12 years) and the two genders (male and female). The factors would be age and gender. Age would have three levels and gender would have two levels.

In this design, there would be a total of six different groups, as shown in the table below.

Group   Gender   Age
1       Female    8
2       Female   10
3       Female   12
4       Male      8
5       Male     10
6       Male     12
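
As a sketch, this 2 × 3 layout can be generated in R with expand.grid (the variable and level names below simply mirror the table):

# Enumerate the six groups of the 2 x 3 design
design <- expand.grid(Age = c(8, 10, 12), Gender = c("Female", "Male"))
design$Group <- seq_len(nrow(design))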

One-Way ANOVA

Hypothesis

The null hypothesis tested by ANOVA is that the population means for all conditions are the same. This can be expressed as follows:

$$ H_0: \mu_1 = \mu_2 = \cdots = \mu_k $$

And the alternative hypothesis is:

$$ H_a: \mu_i \neq \mu_j \text{ for at least one pair } (i, j) $$

Assumptions

The analysis of variance can be presented in terms of a linear model, which makes the following assumptions about the probability distribution of the responses:

  • Independence of observations – this is an assumption of the model that simplifies the statistical analysis.
  • Normality – the distribution of the residuals is normal.
  • Equality (or “homogeneity”) of variances, called homoscedasticity – the variance of the data within each group should be the same. (A quick way to check the last two assumptions in R is sketched below.)
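
As a sketch of how these assumptions can be checked in R (on simulated data; the groups, means, and tests chosen here are illustrative, not the only options):

# Simulate 3 groups of 20 observations with equal variances
set.seed(42)
g <- gl(3, 20)                                   # group factor with 3 levels
y <- rnorm(60, mean = rep(c(5.0, 5.5, 6.0), each = 20))
fit <- aov(y ~ g)
shapiro.test(residuals(fit))   # normality of the residuals
bartlett.test(y ~ g)           # homogeneity of variances across groups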

Why “Variance”?

Recall that analysis of variance is a method for testing differences among means by analyzing variance. The test is based on estimating and comparing the population variance ($\sigma^2$). ANOVA computes two estimates of this variance.

The first estimate is called the mean square error (MSE). MSE is based on the differences among scores within the groups, and it estimates $\sigma^2$ regardless of whether the null hypothesis ($H_0$) is true.

The second estimate is called the mean square between (MSB). MSB is based on the differences among the sample means, and it estimates $\sigma^2$ only when $H_0$ is true. If the population means are not equal, then MSB estimates a quantity larger than $\sigma^2$.

Therefore, if MSB is much larger than MSE, the population means are unlikely to be equal. On the other hand, if MSB is about the same as MSE, the data are consistent with the hypothesis that the population means are equal.
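
A quick simulation sketch illustrates this (the group count, sample size, and population parameters are arbitrary choices of mine). When the population means are equal, MSB and MSE estimate the same quantity; here MSB is computed as $n$ times the variance of the sample means, as derived in the MSB section below.

set.seed(1)
k <- 4; n <- 34
# k groups of n observations drawn from the SAME population, so H0 is true
x <- matrix(rnorm(k * n, mean = 5, sd = 1.6), ncol = k)
MSE <- mean(apply(x, 2, var))    # mean of the within-group variances
MSB <- n * var(colMeans(x))      # n times the variance of the group means
c(MSE = MSE, MSB = MSB)          # both should be close to sigma^2 = 2.56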

Computing Estimates

Suppose we have 34 subjects in each of the four conditions:

Condition   Mean     Variance
Cond1       5.3676   3.3380
Cond2       4.9118   2.8253
Cond3       4.9118   2.1132
Cond4       4.1176   2.3191

Sample Sizes

So we have samples of equal size. We refer to the number of observations in each group as $n$ and the total number of observations as $N$. In this example, we have 4 groups, $n = 34$, and $N = 4 \times 34 = 136$.

MSE

By the homogeneity of variance assumption, MSE is computed as the mean of the sample variances:

$$ MSE = \frac{3.3380 + 2.8253 + 2.1132 + 2.3191}{4} = 2.6489 $$
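
The same computation in R, using the four sample variances from the table above:

vars <- c(3.3380, 2.8253, 2.1132, 2.3191)   # the four condition variances
MSE <- mean(vars)
# [1] 2.6489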

MSB

The formula for MSB is based on the fact that the variance of the sampling distribution of the mean is

$$ \sigma_M^2 = \frac{\sigma^2}{n}, $$

where $n$ is the sample size of each group. So we can compute the population variance as

$$ \sigma^2 = n \sigma_M^2. $$

Although we don’t know the variance of the sampling distribution of the mean, we can estimate it with the variance of the sample means, $s_M^2$. Then

$$ MSB = n s_M^2. $$
Test by Comparing MSE and MSB

Recall that MSE estimates $\sigma^2$ whether or not the population means are equal, whereas MSB estimates $\sigma^2$ only when the population means are equal and estimates a larger quantity when they are not. Therefore, we can test whether the population means are equal by checking whether MSB estimates a larger quantity than MSE.

However, since MSB can be larger than MSE by chance even when the population means are equal, how much larger must MSB be in order to justify the conclusion that the population means differ?

The mathematics necessary to answer this question was worked out by the statistician Ronald Fisher. The test statistic, the ratio of MSB to MSE, is named after him: the F ratio. Under the null hypothesis, this ratio follows an F distribution.

The shape of the F distribution depends on the sample size. More precisely, it depends on two degrees of freedom (df): the numerator degrees of freedom and the denominator degrees of freedom. That is,

$$ F = \frac{MSB}{MSE} \sim F(df_{num}, df_{den}) $$

You can use the following R command to draw an F distribution with (5, 2) degrees of freedom:

# df() is the density function of the F distribution
curve(df(x, 5, 2), from = 0, to = 5)

Partition of Variance

One of the important characteristics of ANOVA is that it partitions the variation into its various sources. ANOVA takes the ratio of two variance estimates and compares it to a critical value of the F distribution to determine statistical significance.

The definitional equation of the sample variance is

$$ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2, $$

where the divisor $n - 1$ is called the degrees of freedom and the summation is called the sum of squares (SS). In ANOVA, we use SS to measure variation, and ANOVA computes 3 sums of squares:

  1. $SS_{total}$: the total variation, based on the deviations of all observations $X_{ij}$ (the $i$-th observation in group $j$) from the grand mean (GM, the mean of all observations). $SS_{total}$ is defined as:

$$ SS_{total} = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (X_{ij} - GM)^2 $$

  2. $SS_{error}$: the error variation, based on the deviations of each observation from its group mean $\bar{X}_j$. $SS_{error}$ is also called the sum of squares within groups (SSW), and is defined as:

$$ SS_{error} = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (X_{ij} - \bar{X}_j)^2 $$

  3. $SS_{between}$: the condition variation, based on the deviations of the group means from the grand mean, multiplied by the number of observations in each group. This is also called the sum of squares between (SSB), defined as:

$$ SS_{between} = n \sum_{j=1}^{k} (\bar{X}_j - GM)^2 $$

If there are unequal sample sizes, use the weighted sum:

$$ SS_{between} = \sum_{j=1}^{k} n_j (\bar{X}_j - GM)^2 $$

The sum of squares error can also be computed by subtraction:

$$ SS_{error} = SS_{total} - SS_{between} $$

Also, the number of degrees of freedom can be partitioned in a similar way:

$$ df_{total} = df_{between} + df_{error}, \qquad N - 1 = (k - 1) + (N - k) $$

F Test

Then we can calculate the F ratio:

$$ F = \frac{MSB}{MSE} \sim F(k - 1, N - k), $$

where $MSB = SS_{between} / df_{between}$ with numerator degrees of freedom $df_{between} = k - 1$, and $MSE = SS_{error} / df_{error}$ with denominator degrees of freedom $df_{error} = N - k$.

In other words, we can calculate the MSB and MSE estimates from the sums of squares and calculate the F ratio as:

$$ F = \frac{MSB}{MSE} $$

The formula for computing $MSB$ is:

$$ MSB = \frac{SS_{between}}{k - 1} $$

and the formula for computing $MSE$ is:

$$ MSE = \frac{SS_{error}}{N - k} $$
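
Plugging in the four-condition example from earlier (my own arithmetic from its table of means and variances, with $n = 34$, $k = 4$, $N = 136$):

$$ MSB = 34 \times 0.2700 = 9.179, \qquad MSE = 2.6489, \qquad F = \frac{9.179}{2.6489} \approx 3.465, $$

with $(k - 1, N - k) = (3, 132)$ degrees of freedom.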

R Example

We extract our test data from the mtcars dataset:

head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

We keep only the three variables we want to study: mpg (miles per gallon), cyl (number of cylinders), and hp (horsepower).

df <- mtcars[, c("mpg", "cyl", "hp")]

And we want to convert cyl from numbers to factor levels:

df$cyl <- as.factor(df$cyl)
levels(df$cyl)
# [1] "4" "6" "8"

So my research question is: does cyl have a significant impact on mpg? We can define the hypotheses as:

$$ H_0: \mu_4 = \mu_6 = \mu_8 \qquad H_a: \text{not all } \mu_i \text{ are equal} $$

Grouping by the condition cyl, the sample can be divided into three groups:

table(df$cyl)
#
#  4  6  8
# 11  7 14

The total number of observations is $N = 32$, the number of groups is $k = 3$, and the sample sizes are not equal.

We compute $SS_{total}$ first:

SST <- sum( (df$mpg - mean(df$mpg))^2 )
# [1] 1126.047

and then compute $SS_{between}$ (the sum of squares between):

# Compute grand mean
GM <- mean(df$mpg)
# Compute sample group means
M <- sapply(levels(df$cyl), function(i) { mean(df$mpg[which(df$cyl == i)]) })
# Compute sum of squares between
SSB <- sum( as.numeric(table(df$cyl)) * (M - GM)^2 )
# [1] 824.7846
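
As a side note, the group means can also be computed more idiomatically with tapply, which agrees with the M computed above:

# Group means via tapply, an alternative to the sapply call above
M2 <- tapply(df$mpg, df$cyl, mean)
all.equal(as.numeric(M), as.numeric(M2))
# [1] TRUE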

Finally, we can compute $SS_{error}$:

SSE <- SST - SSB
# [1] 301.2626
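
As a sanity check, $SS_{error}$ can also be computed directly from the deviations of each observation from its group mean (the indexing works because M is ordered by the levels of cyl):

# Direct computation of SSE from within-group deviations
SSE.direct <- sum( (df$mpg - M[df$cyl])^2 )
# equal to SST - SSB up to floating-point error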

Then we can calculate the F ratio:

# DF total: N - 1 = 31
dft <- nrow(df) - 1
# DF condition: k - 1 = 2
dfb <- length(levels(df$cyl)) - 1
# DF error: N - k = 29
dfe <- nrow(df) - length(levels(df$cyl))
# Mean square error
MSE <- SSE / dfe
# [1] 10.38837
# Mean square condition (between)
MSB <- SSB / dfb
# [1] 412.3923
# F ratio
F.ratio <- MSB / MSE
# [1] 39.69752

Use the F distribution with (2, 29) degrees of freedom to get the probability value of the F ratio:

# Density of the F distribution at F.ratio (a density, not a probability)
Pr1 <- df(F.ratio, dfb, dfe)
# [1] 1.33206e-09
# Probability of F > F.ratio
Pr2 <- pf(F.ratio, dfb, dfe, lower.tail = FALSE)
# [1] 4.978919e-09
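
For comparison, the critical value of the F distribution at the 0.05 level can be obtained with qf; the observed F ratio far exceeds it:

# Critical value at the 0.05 significance level
qf(0.95, dfb, dfe)
# roughly 3.33, far below F.ratio = 39.7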

Therefore, we have the following ANOVA summary table:

Source      df   SS         MS          F ratio    Pr(>F)
Condition    2    824.7846   412.3923   39.69752   4.978919e-09
Error       29    301.2626    10.38837
Total       31   1126.047

R has a built-in ANOVA function, aov, and we can compare its result with our calculation:

fit1 <- aov(mpg ~ cyl, data = df)
summary(fit1)
            Df Sum Sq Mean Sq F value   Pr(>F)
cyl          2  824.8   412.4    39.7 4.98e-09 ***
Residuals   29  301.3    10.4
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

They give exactly the same result, up to rounding. From the ANOVA summary table, the probability under $H_0$ is very small, so we can reject $H_0$ and conclude that cyl has a significant impact on mpg.