When computing the degrees of freedom for ANOVA, how is the between-group DF calculated?

In essence, we now know that we want to break down the TOTAL variation in the data into two components:

  1. a component that is due to the TREATMENT (or FACTOR), and
  2. a component that is due to just RANDOM ERROR.

Let's see what kind of formulas we can come up with for quantifying these components. But first, as always, we need to define some notation. Let's represent our data, the group means, and the grand mean as follows:

| Group | Data | Means |
|---|---|---|
| 1 | \(X_{11}, X_{12}, \ldots, X_{1n_1}\) | \(\bar{X}_{1.}\) |
| 2 | \(X_{21}, X_{22}, \ldots, X_{2n_2}\) | \(\bar{X}_{2.}\) |
| ⋮ | ⋮ | ⋮ |
| \(m\) | \(X_{m1}, X_{m2}, \ldots, X_{mn_m}\) | \(\bar{X}_{m.}\) |
|  |  | Grand Mean: \(\bar{X}_{..}\) |

That is, we'll let:

  1. \(m\) denote the number of groups being compared
  2. \(X_{ij}\) denote the \(j^{th}\) observation in the \(i^{th}\) group, where \(i = 1, 2, \dots , m\) and \(j = 1, 2, \dots, n_i\). The important thing to note here is that \(j\) goes from 1 to \(n_i\), not to \(n\). That is, the number of data points in a group depends on the group \(i\), so the number of data points in each group need not be the same. We could have 5 measurements in one group, and 6 measurements in another.
  3. \(\bar{X}_{i.}=\dfrac{1}{n_i}\sum\limits_{j=1}^{n_i} X_{ij}\) denote the sample mean of the observed data for group i, where \(i = 1, 2, \dots , m\)
  4. \(\bar{X}_{..}=\dfrac{1}{n}\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n_i} X_{ij}\) denote the grand mean of all \(n\) observed data points, where \(n = n_1 + n_2 + \cdots + n_m\)
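
To make the notation concrete, here is a minimal Python sketch (the data values are made up purely for illustration) that computes the group means \(\bar{X}_{i.}\) and the grand mean \(\bar{X}_{..}\) for two groups of unequal sizes:

```python
import numpy as np

# Hypothetical data (values made up purely for illustration):
# m = 2 groups with unequal sizes, n_1 = 5 and n_2 = 6.
groups = [np.array([6.9, 5.4, 5.8, 4.6, 4.0]),
          np.array([8.3, 6.8, 7.8, 9.2, 6.5, 7.1])]

group_means = [g.mean() for g in groups]       # the group means, X-bar_i.
n = sum(len(g) for g in groups)                # total observations, n = n_1 + n_2
grand_mean = sum(g.sum() for g in groups) / n  # the grand mean, X-bar_..

print(group_means)  # [5.34, 7.6166...]
print(grand_mean)   # 6.5818...
```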

Okay, with the notation now defined, let's first consider the total sum of squares, which we'll denote here as SS(TO). Because we want the total sum of squares to quantify the variation in the data regardless of its source, it makes sense that SS(TO) would be the sum of the squared distances of the observations \(X_{ij}\) to the grand mean \(\bar{X}_{..}\). That is:

\(SS(TO)=\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n_i} (X_{ij}-\bar{X}_{..})^2\)

With just a little bit of algebraic work, the total sum of squares can be alternatively calculated as:

\(SS(TO)=\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n_i} X^2_{ij}-n\bar{X}_{..}^2\)

Can you do the algebra?
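
In case you'd like to check your work, here is one way the algebra can go. Expand the square, and use the fact that \(\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n_i} X_{ij}=n\bar{X}_{..}\) (which is just the definition of the grand mean rearranged):

\(SS(TO)=\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n_i} X^2_{ij}-2\bar{X}_{..}\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n_i} X_{ij}+n\bar{X}_{..}^2=\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n_i} X^2_{ij}-2n\bar{X}_{..}^2+n\bar{X}_{..}^2=\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n_i} X^2_{ij}-n\bar{X}_{..}^2\)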

Now, let's consider the treatment sum of squares, which we'll denote SS(T). Because we want the treatment sum of squares to quantify the variation between the treatment groups, it makes sense that SS(T) would be the sum of the squared distances of the treatment means \(\bar{X}_{i.}\) to the grand mean \(\bar{X}_{..}\). That is:

\(SS(T)=\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n_i} (\bar{X}_{i.}-\bar{X}_{..})^2\)

Again, with just a little bit of algebraic work, the treatment sum of squares can be alternatively calculated as:

\(SS(T)=\sum\limits_{i=1}^{m}n_i\bar{X}^2_{i.}-n\bar{X}_{..}^2\)

Can you do the algebra?
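
And again, in case you'd like to check your work: because the summand doesn't depend on \(j\), the inner sum just contributes a factor of \(n_i\), and \(\sum\limits_{i=1}^{m} n_i\bar{X}_{i.}=n\bar{X}_{..}\), so:

\(SS(T)=\sum\limits_{i=1}^{m} n_i(\bar{X}_{i.}-\bar{X}_{..})^2=\sum\limits_{i=1}^{m} n_i\bar{X}^2_{i.}-2\bar{X}_{..}\left(n\bar{X}_{..}\right)+n\bar{X}_{..}^2=\sum\limits_{i=1}^{m} n_i\bar{X}^2_{i.}-n\bar{X}_{..}^2\)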

Finally, let's consider the error sum of squares, which we'll denote SS(E). Because we want the error sum of squares to quantify the variation in the data not otherwise explained by the treatment, it makes sense that SS(E) would be the sum of the squared distances of the observations \(X_{ij}\) to the treatment means \(\bar{X}_{i.}\). That is:

\(SS(E)=\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n_i} (X_{ij}-\bar{X}_{i.})^2\)

As we'll see in just one short minute, the easiest way to calculate the error sum of squares is by subtracting the treatment sum of squares from the total sum of squares. That is:

\(SS(E)=SS(TO)-SS(T)\)

Okay, now, do you remember that part about wanting to break down the total variation SS(TO) into a component due to the treatment SS(T) and a component due to random error SS(E)? Well, some simple algebra leads us to this:

\(SS(TO)=SS(T)+SS(E)\)

and hence the simple way of calculating the error sum of squares. At any rate, here's the simple algebra:

Proof

Well, okay, so the proof does involve a little trick of adding 0 in a special way to the total sum of squares:

\(SS(TO)=\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n_i}\left(X_{ij}\overbrace{-\bar{X}_{i.}+\bar{X}_{i.}}^{\text{adds to }0}-\bar{X}_{..}\right)^2=\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n_i}\left((X_{ij}-\bar{X}_{i.})+(\bar{X}_{i.}-\bar{X}_{..})\right)^2\)

Then, squaring the term in parentheses, as well as distributing the summation signs, we get:

\(SS(TO)=\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n_i} (X_{ij}-\bar{X}_{i.})^2+2\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n_i} (X_{ij}-\bar{X}_{i.})(\bar{X}_{i.}-\bar{X}_{..})+\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n_i} (\bar{X}_{i.}-\bar{X}_{..})^2\)

Now, it's just a matter of recognizing each of the terms:

\(SS(TO)=\overbrace{\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n_i}(X_{ij}-\bar{X}_{i.})^2}^{SS(E)}+2\overbrace{\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n_i}(X_{ij}-\bar{X}_{i.})(\bar{X}_{i.}-\bar{X}_{..})}^{0}+\overbrace{\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n_i}(\bar{X}_{i.}-\bar{X}_{..})^2}^{SS(T)}\)
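
The middle term earns its label of 0 because, within each group, the deviations from the group mean sum to zero. Factoring the constant \((\bar{X}_{i.}-\bar{X}_{..})\) out of the inner sum makes this explicit:

\(\sum\limits_{i=1}^{m}(\bar{X}_{i.}-\bar{X}_{..})\sum\limits_{j=1}^{n_i}(X_{ij}-\bar{X}_{i.})=\sum\limits_{i=1}^{m}(\bar{X}_{i.}-\bar{X}_{..})\left(n_i\bar{X}_{i.}-n_i\bar{X}_{i.}\right)=0\)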

That is, we've shown that:

\(SS(TO)=SS(T)+SS(E)\)

as was to be proved.
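
As a numerical sanity check, here is a minimal Python sketch (reusing the made-up two-group data from the earlier snippet) that computes each sum of squares straight from its definition and confirms both the decomposition and the shortcut formulas:

```python
import numpy as np

# Same hypothetical unequal-size groups as in the earlier sketch.
groups = [np.array([6.9, 5.4, 5.8, 4.6, 4.0]),
          np.array([8.3, 6.8, 7.8, 9.2, 6.5, 7.1])]
n = sum(len(g) for g in groups)
grand_mean = sum(g.sum() for g in groups) / n

# Each sum of squares computed directly from its definition.
ss_to = sum(((g - grand_mean) ** 2).sum() for g in groups)
ss_t  = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_e  = sum(((g - g.mean()) ** 2).sum() for g in groups)

# The decomposition SS(TO) = SS(T) + SS(E) ...
assert np.isclose(ss_to, ss_t + ss_e)

# ... and the two shortcut formulas derived above.
assert np.isclose(ss_to, sum((g ** 2).sum() for g in groups) - n * grand_mean ** 2)
assert np.isclose(ss_t, sum(len(g) * g.mean() ** 2 for g in groups) - n * grand_mean ** 2)
```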

How do you find the between-groups df in ANOVA?

To calculate degrees of freedom for ANOVA:

  1. Subtract 1 from the number of groups to find the degrees of freedom between groups.
  2. Subtract the number of groups from the total number of subjects to find the degrees of freedom within groups.
  3. Subtract 1 from the total number of subjects (values) to find the total degrees of freedom.
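
Here is a quick sketch of these three rules in code (the function name anova_df is just illustrative, not from any library):

```python
# One-way ANOVA degrees of freedom from a list of group sizes.
def anova_df(group_sizes):
    m = len(group_sizes)      # number of groups
    n = sum(group_sizes)      # total number of subjects
    df_between = m - 1        # subtract 1 from the number of groups
    df_within = n - m         # subtract the number of groups from the total
    df_total = n - 1          # subtract 1 from the total number of subjects
    return df_between, df_within, df_total

print(anova_df([5, 6]))  # -> (1, 9, 10); note 1 + 9 = 10
```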

When computing the degrees of freedom for ANOVA, how is the between-group estimate calculated?

In other words, the degrees of freedom between groups is equal to the total number of groups minus one.

How do you calculate DF between and within?

For a repeated-measures design with \(n\) subjects each measured under \(K\) treatments (so \(N = nK\) total scores):

\(df_{\text{total}} = N - 1\)

\(df_{\text{between treatments}} = K - 1\) (notice the name change here)

\(df_{\text{between subjects}} = n - 1\) (notice the formula change here)

\(df_{\text{within}} = N - K\)
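
Plugging in some illustrative numbers, assuming a hypothetical design with n = 10 subjects and K = 3 treatments:

```python
# Hypothetical repeated-measures design: n subjects, K treatments,
# N = n * K total scores.
n, K = 10, 3
N = n * K
print("df_total =", N - 1)               # 29
print("df_between_treatments =", K - 1)  # 2
print("df_between_subjects =", n - 1)    # 9
print("df_within =", N - K)              # 27
```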

How do you find the degrees of freedom (df)?

The most commonly encountered equation for determining degrees of freedom in statistics is df = N - 1. Use this number to look up the critical value for your test statistic in a critical value table, which in turn determines the statistical significance of the results.
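
For instance, here is a small sketch using SciPy (a common alternative to a printed critical value table) to look up the two-tailed critical t value for a hypothetical sample of N = 20 at the 0.05 level:

```python
from scipy import stats

N = 20
df = N - 1                              # df = N - 1 = 19
t_crit = stats.t.ppf(1 - 0.05 / 2, df)  # two-tailed, alpha = 0.05
print(round(t_crit, 3))                 # ~2.093
```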