What Is the Coefficient of Determination?
The coefficient of determination is a statistical measurement that examines how well differences in one variable can be explained by differences in a second variable when predicting the outcome of a given event. In other words, this coefficient, more commonly known as R-squared (or R²), assesses the strength of the linear relationship between two variables and is heavily relied on by researchers conducting trend analysis. As an example of its application, the coefficient might address the following question: if a woman becomes pregnant on a certain day, what is the likelihood that she will deliver her baby on a particular date in the future? In this scenario, the metric aims to measure the strength of the relationship between two related events: conception and birth.
Key Takeaways
- The coefficient of determination is a statistical measure of how well a model explains the variability in a set of data.
- The coefficient of determination is used to explain how much of the variability of one factor can be accounted for by its relationship to another factor.
- This coefficient is commonly known as R-squared (or R²), and is sometimes referred to as the "goodness of fit."
- This measure is represented as a value between 0.0 and 1.0, where a value of 1.0 indicates a perfect fit, and is thus a highly reliable model for future forecasts, while a value of 0.0 indicates that the model fails to fit the data at all.
Understanding the Coefficient of Determination
The coefficient of determination is a measurement used to explain how much of the variability of one factor can be accounted for by its relationship to another related factor. This correlation, known as the "goodness of fit," is represented as a value between 0.0 and 1.0. A value of 1.0 indicates a perfect fit, and is thus a highly reliable model for future forecasts, while a value of 0.0 would indicate that the calculation fails to model the data at all. A value of 0.20, for example, suggests that 20% of the variability in the dependent variable is explained by the independent variable, while a value of 0.50 suggests that 50% of that variability is explained, and so forth.
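As a quick illustration of how such a value arises, the sketch below fits a least squares line to a small, made-up data set and computes R² as the proportion of explained variability (the data and variable names are hypothetical, for illustration only):

```python
# Minimal sketch: computing R-squared for a simple linear fit.
# The data below are hypothetical.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Sums of squares used throughout this section.
ss_xx = sum((x - mean_x) ** 2 for x in xs)
ss_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
ss_yy = sum((y - mean_y) ** 2 for y in ys)

# Least squares slope and intercept.
beta1 = ss_xy / ss_xx
beta0 = mean_y - beta1 * mean_x

# Sum of squared errors about the regression line.
sse = sum((y - (beta1 * x + beta0)) ** 2 for x, y in zip(xs, ys))

# R-squared: the proportion of variability in y explained by the fit.
r_squared = 1 - sse / ss_yy  # close to 1.0 here: a tight linear fit
```

Because these points lie nearly on a straight line, the resulting R² is close to 1.0; scattering the points further from the line would drive it toward 0.0.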
Graphing the Coefficient of Determination
On a graph, the goodness of fit measures the distance between a fitted line and the data points scattered around it. A tight set of data will have a regression line that lies close to the points, indicating a high level of fit: the distance between the line and the data is small. Although a good fit has an R² close to 1.0, this number alone cannot determine whether the data points or predictions are biased. Nor does it tell analysts whether the coefficient of determination value is intrinsically good or bad. It is at the discretion of the user to evaluate the meaning of this correlation, and how it may be applied in the context of future trend analyses.
If the scatter diagram of a set of \((x,y)\) pairs shows neither an upward nor a downward trend, then the horizontal line \(\hat{y} =\overline{y}\) fits it well, as illustrated in Figure \(\PageIndex{1}\). The lack of any upward or downward trend means that when an element of the population is selected at random, knowing the value of the measurement \(x\) for that element is not helpful in predicting the value of the measurement \(y\).
If the scatter diagram shows a linear trend upward or downward then it is useful to compute the least squares regression line
\[\hat{y} =\hat{β}_1x+\hat{β}_0\]
and use it in predicting \(y\). Figure \(\PageIndex{2}\) illustrates this: in each panel we have plotted the height and weight data of Section 10.1, with the average value line \(\hat{y} =\overline{y}\) superimposed in the left panel and the least squares regression line superimposed in the right panel. The errors are indicated graphically by the vertical line segments.
The sum of the squared errors computed for the regression line, \(SSE\), is smaller than the sum of the squared errors computed for any other line. In particular it is less than the sum of the squared errors computed using the line \(\hat{y}=\overline{y}\), which sum is actually the number \(SS_{yy}\) that we have seen several times already. A measure of how useful it is to use the regression equation for prediction of \(y\) is how much smaller \(SSE\) is than \(SS_{yy}\). In particular, the proportion of the sum of the squared errors for the line \(\hat{y} =\overline{y}\) that is eliminated by going over to the least squares regression line is
\[\dfrac{SS_{yy}−SSE}{SS_{yy}}=\dfrac{SS_{yy}}{SS_{yy}}−\dfrac{SSE}{SS_{yy}}=1−\dfrac{SSE}{SS_{yy}}\]
We can think of \(SSE/SS_{yy}\) as the proportion of the variability in \(y\) that cannot be accounted for by the linear relationship between \(x\) and \(y\), since it is still there even when \(x\) is taken into account in the best way possible (using the least squares regression line; remember that \(SSE\) is the smallest the sum of the squared errors can be for any line). Seen in this light, the coefficient of determination, the complementary proportion of the variability in \(y\), is the proportion of the variability in all the \(y\) measurements that is accounted for by the linear relationship between \(x\) and \(y\).
In the context of linear regression the coefficient of determination is always the square of the correlation coefficient \(r\) discussed in Section 10.2. Thus the coefficient of determination is denoted \(r^2\), and we have two additional formulas for computing it.
Definition: coefficient of determination
The coefficient of determination of a collection of \((x,y)\) pairs is the number \(r^2\) computed by any of the following three expressions:
\[r^2=\dfrac{SS_{yy}−SSE}{SS_{yy}}=\dfrac{SS^2_{xy}}{SS_{xx}SS_{yy}}=\hat{β}_1 \dfrac{SS_{xy}}{SS_{yy}}\]
It measures the proportion of the variability in \(y\) that is accounted for by the linear relationship between \(x\) and \(y\).
If the correlation coefficient \(r\) is already known then the coefficient of determination can be computed simply by squaring \(r\), as the notation indicates, \(r^2=(r)^2\).
Example \(\PageIndex{1}\)
The value of used vehicles of the make and model discussed in "Example 10.4.2" in Section 10.4 varies widely. The most expensive automobile in the sample in Table 10.4.3 has value \(\$30,500\), which is nearly half again as much as the least expensive one, which is worth \(\$20,400\). Find the proportion of the variability in value that is accounted for by the linear relationship between age and value.
Solution:
The proportion of the variability in value \(y\) that is accounted for by the linear relationship between it and age \(x\) is given by the coefficient of determination, \(r^2\). Since the correlation coefficient \(r\) was already computed in "Example 10.4.2" in Section 10.4 as
\[r=-0.819\\ r^2=(-0.819)^2=0.671\]
About \(67\%\) of the variability in the value of this vehicle can be explained by its age.
Example \(\PageIndex{2}\)
Use each of the three formulas for the coefficient of determination to compute its value for the example of ages and values of vehicles.
Solution:
In "Example 10.4.2" in Section 10.4 we computed the exact values
\[SS_{xx}=14\\ SS_{xy}=-28.7\\ SS_{yy}=87.781\\ \hat{\beta _1}=-2.05\]
In "Example 10.4.4" in Section 10.4 we computed the exact value
\[SSE=28.946\]
Inserting these values into the formulas in the definition, one after the other, gives
\[r^2=\dfrac{SS_{yy}−SSE}{SS_{yy}}=\dfrac{87.781−28.946}{87.781}=0.6702475479\]
\[r^2= \dfrac{SS^2_{xy}}{SS_{xx}SS_{yy}}=\dfrac{(−28.7)^2}{(14)(87.781)}=0.6702475479\]
\[r^2=\hat{β}_1 \dfrac{SS_{xy}}{SS_{yy}}=(−2.05)\dfrac{−28.7}{87.781}=0.6702475479\]
which rounds to \(0.670\). The discrepancy between the value here and in the previous example arises because a rounded value of \(r\) from "Example 10.4.2" was used there. The actual value of \(r\) before rounding is \(-0.8186864772\), which when squared gives the value for \(r^2\) obtained here.
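These computations can be reproduced in a short script, using the exact sums of squares quoted above from Section 10.4:

```python
# Reproducing the three formulas for r^2 with the exact values
# computed in Section 10.4 for the vehicle age/value data.
ss_xx = 14
ss_xy = -28.7
ss_yy = 87.781
beta1 = -2.05
sse = 28.946

r2_first = (ss_yy - sse) / ss_yy           # (SSyy - SSE) / SSyy
r2_second = ss_xy ** 2 / (ss_xx * ss_yy)   # SSxy^2 / (SSxx * SSyy)
r2_third = beta1 * ss_xy / ss_yy           # beta1 * SSxy / SSyy
# Each expression rounds to 0.670.
```

All three expressions agree (up to floating point roundoff), as the definition requires; in practice one simply picks the formula whose ingredients are already at hand.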
The coefficient of determination \(r^2\) can always be computed by squaring the correlation coefficient \(r\) if it is known. Any one of the defining formulas can also be used. Typically one would make the choice based on which quantities have already been computed. What should be avoided is trying to compute \(r\) by taking the square root of \(r^2\), if it is already known, since it is easy to make a sign error this way. To see what can go wrong, suppose \(r^2=0.64\). Taking the square root of a positive number with any calculating device will always return a positive result. The square root of \(0.64\) is \(0.8\). However, the actual value of \(r\) might be the negative number \(-0.8\).
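The sign pitfall is easy to demonstrate in a few lines (the numbers here are hypothetical):

```python
import math

r = -0.8                  # a negative correlation
r_squared = r ** 2        # 0.64 -- squaring discards the sign

# Taking the square root recovers only the magnitude:
recovered = math.sqrt(r_squared)  # positive, even though r is negative
```

Here `recovered` is approximately 0.8, not the original \(-0.8\), so the square root alone cannot tell us whether the underlying relationship is positive or negative.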
Key Takeaway
- The coefficient of determination \(r^2\) estimates the proportion of the variability in the variable \(y\) that is explained by the linear relationship between \(y\) and the variable \(x\).
- There are several formulas for computing \(r^2\). The choice of which one to use can be based on which quantities have already been computed so far.