

Introduction

Regression analysis is commonly used for modeling the relationship between a single dependent variable Y and one or more predictors.  When we have one predictor, we call this "simple" linear regression:

E[Y] = β0 + β1X

That is, the expected value of Y is a straight-line function of X. The betas are chosen so that they minimize the squared distance between each observed Y value and the line of best fit, i.e., they minimize this expression:

∑i (yi − (β0 + β1xi))²
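For reference (the closed-form solution below is standard, though not shown in the original), minimizing this expression gives:

β̂1 = ∑i (xi − x̄)(yi − ȳ) / ∑i (xi − x̄)²

β̂0 = ȳ − β̂1 x̄

where x̄ and ȳ are the sample means of the X and Y values.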

An instructive graphic I found on the Internet


Source: http://www.unc.edu/~nielsen/soci709/m1/m1005.gif

When we have more than one predictor, we call it multiple linear regression:

 E[Y] = β0 + β1X1 + β2X2 + β3X3 + … + βkXk

The fitted values (i.e., the predicted values) are defined as those values of Y that are generated if we plug our X values into our fitted model.

The residuals are the observed values of Y minus the fitted values.

Here is an example of a linear regression with two predictors and one outcome:

Instead of the "line of best fit," there is a "plane of best fit."


Source: James et al. Introduction to Statistical Learning (Springer 2013)

There are four assumptions associated with a linear regression model:

  1. Linearity: The relationship between X and the mean of Y is linear.
  2. Homoscedasticity: The variance of the residuals is the same for any value of X.
  3. Independence: Observations are independent of each other.
  4. Normality: For any fixed value of X, Y is normally distributed.

We will review how to assess these assumptions later in the module.

Let's start with simple regression. In R, models are typically fitted by calling a model-fitting function, in our case lm(), with a "formula" object describing the model and a "data.frame" object containing the variables used in the formula. A typical call may look like

> myfunction <- lm(formula, data, …)

and it will return a fitted model object, here stored as myfunction. This fitted model can then be printed, summarized, or visualized; moreover, the fitted values and residuals can be extracted, and predictions on new data (new values of X) can be computed using functions such as summary(), residuals(), predict(), etc. Next, we will look at how to fit a simple linear regression.


Building a linear regression model is just the first step. Certain conditions should be met before we draw inferences regarding the model estimates or before we use the model for making predictions.


The Linear Regression model should be validated for all model assumptions including the definition of the functional form. If the assumptions are violated, we need to revisit the model.

In this article, I will explain the key assumptions of linear regression, why they are important, and how we can validate them using Python. I will also talk about remedial measures in case the assumptions are not satisfied.

Assumptions of Linear Regression:

Assumption 1

The functional form of the regression is correctly specified, i.e., there exists a linear relationship between the parameters (the coefficients of the independent variables) and the dependent variable Y.

Assumption 2

The residuals are normally distributed.

Assumption 3

The variance of the residuals is constant across all values of the independent variable X. (also known as ‘Homoscedasticity’).

Assumption 4

There is no autocorrelation between errors.

Assumption 5

There is no (or low) correlation between the independent variables (high correlation is known as ‘multicollinearity’).

Let's understand these assumptions in detail.

Assumption 1: The functional form of regression is correctly specified.

The linear regression algorithm assumes that there is a linear relationship between the parameters of independent variables and the dependent variable Y. If the true relationship is not linear, we cannot use the model as the accuracy will be significantly reduced.

Thus, it becomes important to validate this assumption. This is done by plotting Residual Plots.

In the case of a simple linear regression model with only one independent variable, we plot the residuals vs. the predictor, i.e., the errors vs. the predictor X1. In the case of multiple linear regression, where we have more than one predictor, we plot the residuals vs. the predicted or fitted values ŷ.

The residual plot should not show any pattern, i.e., the residuals should be randomly dispersed around the horizontal axis. If there is a pattern, we can infer that there is a problem with some aspect of the linear model, i.e., an incorrect functional form has been used. This is also referred to as ‘model misspecification’.

In the case of a non-linear association, we can use non-linear transformations of the predictors, such as log X, X², or √X, in the regression model (a minimal sketch follows below).
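As a minimal sketch of such a transformation (synthetic data; names such as df and log_X1 are illustrative, not from the original article):

import numpy as np
import pandas as pd
import statsmodels.api as sm

# synthetic example data where y depends on log(X1) (purely illustrative)
rng = np.random.default_rng(0)
df = pd.DataFrame({'X1': rng.uniform(1, 100, 200)})
df['y'] = 3 + 2 * np.log(df['X1']) + rng.normal(0, 0.5, 200)

df['log_X1'] = np.log(df['X1'])          # log-transform the predictor
X = sm.add_constant(df[['log_X1']])      # add the intercept column
model_log = sm.OLS(df['y'], X).fit()     # refit using the transformed predictor
print(model_log.summary())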

An example of a residual plot with non linear association is shown below.

Plotting Residual Plot in Python:

Calculate the residuals as:

residuals = y_test - y_pred

Simply plot the scatter plot of the residuals and the predicted y values. The code is as shown:
import matplotlib.pyplot as plt
plt.scatter(y_pred, residuals)   # residuals vs. predicted values
plt.axhline(y=0, color='red')    # horizontal reference line at zero
plt.show()

Assumption 2: Residuals are Normally Distributed

Linear Regression model building has two very important steps — estimation and hypothesis testing. Using the Ordinary Least Squares Method (OLS), we are able to estimate the parameters Beta 1, Beta 2 and so on. The value of these parameters will change from sample to sample and hence we can say that these estimators are random variables.

Once we are done with estimation, we need to do hypothesis testing to make inferences about the population. Thus, we would like to know how close beta hat (the estimated beta) is to the true beta, or how close the variance of beta hat is to the true variance.

Thus, we need to find out their probability distributions, without which we will not be able to relate them to their true values.

Using OLS, we get β̂2 = ∑ (ki ∗ Yi), where ki = xi / ∑ xi² and xi = Xi − X̄ is the deviation of Xi from its sample mean.

Here i = 1, 2, 3, ..., n indexes the observations.

Since X is non-stochastic, each ki is a constant. Thus, β̂2 (the estimated beta) is a linear function of Yi.

We can write the above equation as:

β̂2 = ∑ ki ∗ (β1 + β2 ∗ Xi + ui), since Yi = β1 + β2 ∗ Xi + ui (where ui is the error term)

Because ∑ ki = 0 and ∑ ki Xi = 1, this simplifies to β̂2 = β2 + ∑ ki ui. Since the ki and Xi are constants, β̂2 is ultimately a linear function of the random variables ui.
Thus, the probability distribution of β̂2 will depend on the assumption made about the probability distribution of the error term (residual term) ui.

Since the knowledge of the probability distribution of OLS estimators is necessary to draw inference about their population values, the distribution of the residuals assumes an important role in hypothesis testing.

Why normal distribution?

The error term is the combined influence of the many independent variables that are not included in the model. The Central Limit Theorem states that if there are a large number of independent and identically distributed random variables, then the distribution of their sum tends to a normal distribution as the number of variables increases. Thus, the CLT theoretically explains why the ‘normal distribution’ is assumed for the errors.

Making this assumption enables us to derive the probability distribution of OLS estimators since any linear function of a normally distributed variable is itself normally distributed. Thus, OLS estimators are also normally distributed. It further allows us to use t and F tests for hypothesis testing.

How do we determine if the errors are normally distributed?

The easiest way to check if the errors are normally distributed is to use a P-P (probability-probability) plot.

A P-P plot assesses how closely a theoretical distribution models a given data distribution. A normal probability plot of the residuals is a scatter plot with the theoretical percentiles of the normal distribution on the x-axis and the sample percentiles of the residuals on the y-axis.

The comparison line is the 45-degree diagonal, which passes through the lower and upper percentiles of the theoretical distribution and provides a visual aid for assessing whether the relationship between the theoretical and sample percentiles is linear. The two distributions are equal if and only if the points fall on this line; any deviation from the line indicates a difference between the distributions.

Below is an example of the P-P plot where the error distribution is not normal

Here is the P-P plot where the error distribution is approximately normal

Plotting the plot in Python (note that statsmodels' qqplot draws the closely related Q-Q plot of the residuals):

import statsmodels.api as sm
normality_plot = sm.qqplot(residual, line='r')   # 'r' fits a regression line as the reference

Running the above code produces a plot like the ones shown earlier.

In addition to the P-P plot, a more statistical way to check for normality of the errors is to conduct the Anderson-Darling test.

Anderson Darling Test for checking Normality of Errors

The Anderson Darling statistic measures how well the data follows a particular distribution. For a given data set and its distribution, the better the data fits a distribution, the smaller this statistic will be.
The null and alternate hypothesis of the test are as follows:

Null Hypothesis Ho: The data follows a specified distribution

Alternate Hypothesis Ha : The data does not follow a specified distribution

If the p value is less than the chosen alpha (0.05 or 0.10), we reject the Null Hypothesis that the data comes from a specified distribution

Python Code for Anderson Darling Test :

from scipy import stats
# anderson() returns the test statistic and critical values at several significance
# levels; reject normality if the statistic exceeds the critical value at your chosen level
anderson_results = stats.anderson(model.resid, dist='norm')

Assumption 3: The variance of the errors is constant for all values of the independent variable X

The assumption of homoscedasticity (homo: equal, scedasticity: spread) states that the Y populations corresponding to the various X values have the same variance, i.e., the variance neither increases nor decreases as X increases.

Heteroscedasticity is when the variance is unequal i.e. the variance of y population varies with X.

What are the reasons for non constant variance?

One of the most popular examples of heteroscedasticity is savings vs. income data. As incomes grow, people have more discretionary income and thus a wider choice about how to dispose of it. Regressing savings on income is therefore likely to show higher variance as income increases, because people have more choices regarding their saving decisions.

Other reasons for Heteroscedasticity include presence of outliers, omitted important variables, incorrect data transformation or incorrect functional form of the equation.

How does heteroscedasticity impact OLS estimation?

OLS estimation provides the Best Linear Unbiased Estimate (BLUE) of beta if all assumptions of the linear regression are satisfied.

What does BLUE mean? It means that the OLS method gives the estimates that are :

Linear :

It is a linear function of the dependent variable Y of the regression model

Unbiased :

Its expected or average value is equal to the true value

Efficient :

It has minimum variance. An unbiased estimator with the least variance is known as efficient estimator.

If all assumptions of the Linear Regression are satisfied, OLS gives us the best linear unbiased estimates.

However, if the assumption of homoscedasticity is relaxed, the beta estimates are still linear and unbiased, but they are no longer efficient/best. Why? Because the beta estimated in this case will not have minimum variance.

Let's try to understand this better. The most common example used to explain heteroscedasticity is the one-factor linear regression model with savings as the dependent variable and income as the explanatory variable. When income is low, the variance in savings is low. As income increases, disposable income also increases, which gives individuals more saving options; thus, with increasing income, the variance in savings also increases. In homoscedastic data, the variance is uniform across all observations and equal to σ². In heteroscedastic data, the variance becomes σi², where i = 1, 2, 3, ..., n observations.

Ideally, you would like to estimate in such a manner that observations coming from populations with lower variability are given more weight than observations coming from populations with higher variability, as this enables us to estimate the population regression function more accurately. The idea is to give small weights to observations associated with higher variances in order to shrink their squared residuals.

However, OLS does not change the weights of the observation depending on their contribution to residual sum of squares (RSS). It gives equal weight to all observations. Thus when the RSS is minimized to compute the estimate values, the observation with higher variance will have a larger pull in the equation. Thus the beta estimated using OLS for heteroscedastic data will no longer have minimum variance.

The beta estimated with heteroscedastic data will therefore have a higher variance and thus a high standard error.

We know that the t-statistic = beta hat / standard error of beta hat.

A higher standard error of beta leads to a lower t-statistic, which in turn leads to a higher p value, and the chance of committing a Type 2 error increases. That is, with a high p value, variables that are actually significant are concluded to be insignificant.

Heteroscedasticity also affects the confidence intervals, t tests, F tests and other hypothesis tests. Thus, it is crucial to detect heteroscedasticity and apply remedial measures.

How do we detect heteroscedasticity?

Heteroscedasticity produces a distinctive fan or cone shape in residual plots. To check for heteroscedasticity, assess the residuals vs. fitted values plot in the case of multiple linear regression, or the residuals vs. the explanatory variable in the case of simple linear regression. Typically, the pattern for heteroscedasticity is that as the fitted values increase, the variance of the residuals also increases.

In addition to the above plot, certain statistical tests are also done to confirm heteroscedasticity. One of them is the Breusch-Pagan test for normally distributed data.

Breusch Pagan Test for Heteroscedasticity :

Null Hypothesis Ho: Error variances are equal (homoscedasticity)
Alternate Hypothesis Ha: Error variances are not equal (heteroscedasticity)

The code below gives the p value of the hypothesis test. Here X is the matrix of independent variables.

from statsmodels.stats.diagnostic import het_breuschpagan

# X should include a constant column; element [1] of the result is the LM test p value
bptest = het_breuschpagan(model3.resid, X)[1]
print('The p value of the Breusch-Pagan test is', bptest)

If the p value is less than 0.05, we reject the null hypothesis that the error variances are equal, i.e., we conclude that the errors are heteroscedastic.

Remedial Measures for Heteroscedasticity

One way to deal with heteroscedasticity is to transform the response variable Y using a concave function such as log Y or √Y. Such a transformation shrinks the larger responses by a greater amount, which reduces the heteroscedasticity.
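As a minimal sketch of this remedy (synthetic, purely illustrative data; the article does not include this code):

import numpy as np
import statsmodels.api as sm

# synthetic heteroscedastic example: y is positive and its spread grows with x
rng = np.random.default_rng(1)
x = rng.uniform(1, 10, 200)
y = np.exp(0.5 + 0.3 * x + rng.normal(0, 0.2, 200))
X = sm.add_constant(x)

log_model = sm.OLS(np.log(y), X).fit()   # refit with the log-transformed response
print(log_model.summary())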

Assumption 4: There is no (or low) multicollinearity between the independent variables

Multicollinearity occurs in multiple regression model where two or more explanatory variables are closely related to each other.

This can pose a problem since it is difficult to separate out the individual effects of the correlated variables on the response variable i.e. determine how each variable is separately associated with the response variable.

Why do we assume that there is no multicollinearity among the independent variables?

Beta gives the rate of change in the response variable Y as X1 changes by one unit, keeping all other Xs constant. When there is a high correlation between, say, independent variables X1 and X2, it is not possible to keep X2 constant, as it changes along with X1, so the individual effects cannot be separated. In the case of perfect multicollinearity, the beta estimates are indeterminate and their standard errors are infinite; with high (but imperfect) multicollinearity, the estimates can still be computed but their standard errors become very large.

Multicollinearity is a more serious problem when the number of observations is small relative to the number of independent variables.

How does Multicollinearity impact estimation and hypothesis tests in regression model?

The OLS estimators will have large variances and covariances, making precise estimation difficult. The confidence intervals thus become much wider, leading to acceptance of the null hypothesis that a coefficient is zero and hence to Type 2 errors.

Due to high estimated standard errors, the t stats = beta hat / (standard error of beta hat) gets smaller. Thus, multicollinearity leads to insignificant t ratios.

The estimated betas are very sensitive to even small changes in data in case of multicollinearity.

How do we check for multicollinearity?

  1. Scatter Plot

One can use a scatter plot to check the correlation between the independent variables. High correlations are easily visualized from the scatter plot. If the points lie on the diagonal line or close to the diagonal line, we can infer high correlation between two variables.
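As a minimal sketch (synthetic predictors; the name X_train matches the VIF code further below, but the data here is illustrative), pandas can display the pairwise correlations and scatter plots:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# synthetic predictors where X2 is strongly correlated with X1
rng = np.random.default_rng(2)
X1 = rng.normal(size=200)
X_train = pd.DataFrame({'X1': X1, 'X2': X1 + rng.normal(0, 0.1, size=200), 'X3': rng.normal(size=200)})

print(X_train.corr())                                 # pairwise correlation matrix
pd.plotting.scatter_matrix(X_train, figsize=(8, 8))   # scatter plot for every pair of predictors
plt.show()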

2. Variance Inflation Factor

Variance Inflation Factor or VIF is used as an indicator of multicollinearity. The larger the value of VIF, the more correlated the variable is with other regressors. VIF shows how much the variance of a variable is inflated due to presence of multicollinearity. As the extent of collinearity increases, VIF also increases. If there is no collinearity between two variables, VIF will be 1.

It is calculated by taking a predictor and regressing it against every other predictor in the model. This gives the R² value, which is then used in the VIF formula: VIF = 1 / (1 − R²). Usually a variable with a VIF greater than 10 is considered to be troublesome.

Python Code for Checking VIF:

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def get_vif_factors(X):
    # compute the VIF of each column against all the other columns
    X_matrix = X.values
    vif = [variance_inflation_factor(X_matrix, i) for i in range(X_matrix.shape[1])]
    vif_factors = pd.DataFrame()
    vif_factors['column'] = X.columns
    vif_factors['VIF'] = vif
    return vif_factors

vif_factors = get_vif_factors(X_train)
vif_factors

Remedies for Multicollinearity

  • Drop the variables with high VIF
  • Use Principal Component Analysis (PCA) to derive uncorrelated variables (a minimal sketch follows below)
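As a minimal sketch of the PCA remedy (the use of scikit-learn and the synthetic data are assumptions, not from the article):

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# synthetic correlated predictors (purely illustrative)
rng = np.random.default_rng(3)
X1 = rng.normal(size=200)
X_train = pd.DataFrame({'X1': X1, 'X2': X1 + rng.normal(0, 0.1, size=200), 'X3': rng.normal(size=200)})

X_scaled = StandardScaler().fit_transform(X_train)   # standardize before PCA
pca = PCA(n_components=2)                            # keep two uncorrelated components
X_pca = pca.fit_transform(X_scaled)                  # component scores to use as regressors
print(pca.explained_variance_ratio_)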

Assumption 5: There is no autocorrelation of errors

Linear regression model assumes that error terms are independent. This means that the error term of one observation is not influenced by the error term of another observation. In case it is not so, it is termed as autocorrelation.

It is generally observed in time series data. Time series data consists of observations for which data is collected at discrete points in time. Usually, observations at adjacent time intervals will have correlated errors.

How does the presence of autocorrelation influence the OLS estimation?

The estimated standard errors tend to underestimate the true standard errors, so the associated p values are lower than they should be. This leads to the conclusion that a variable is significant even when it is not. The confidence and prediction intervals are also narrower than they should be.

How do we detect autocorrelation?

The Durbin-Watson test is used to check for autocorrelation.
Null Hypothesis Ho : There is no autocorrelation of errors
Alternate Hypothesis Ha : There is autocorrelation of errors

The Durbin-Watson statistic checks for the presence of autocorrelation at lag 1 of the residuals. The statistic is given as follows:

d = ∑ (et − et−1)² / ∑ et²

where et is the residual at time t, the numerator sum runs over t = 2, ..., n and the denominator over t = 1, ..., n.

The value of the statistic lies between 0 and 4. A value between 1.8 and 2.2 indicates no autocorrelation; a value less than 1.8 indicates positive autocorrelation and a value greater than 2.2 indicates negative autocorrelation.

One can also look at a scatter plot with residuals on one axis and the time component on the other axis. If the residuals are randomly distributed, there is no autocorrelation. If a specific pattern is observed, it indicates presence of autocorrelation.
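A minimal sketch of such a plot, assuming a fitted statsmodels result called model1 (the same object used in the Durbin-Watson code below):

import matplotlib.pyplot as plt

# plot the residuals in observation (time) order; look for runs or cycles rather than random scatter
plt.plot(model1.resid.values, marker='o')
plt.axhline(0, color='red')
plt.xlabel('observation order')
plt.ylabel('residual')
plt.show()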

Durbin Watson Test in Python :

from statsmodels.stats.stattools import durbin_watson

durbin_watson(model1.resid)   # values near 2 indicate no autocorrelation in the residuals

Remedial Measures for Autocorrelation

Check whether the autocorrelation is due to misspecification of the model, i.e., either the functional form of the model is incorrect or some important variable has been excluded from the model. In such a case, one will need to revisit the model.
One can also add lags of the dependent variable and/or lags of some of the independent variables, as sketched below.
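As a minimal sketch (the frame df and the columns y and X1 are hypothetical placeholders, not from the article), lagged variables can be created with pandas shift():

import numpy as np
import pandas as pd

# hypothetical time-series frame with outcome y and one predictor X1
rng = np.random.default_rng(4)
df = pd.DataFrame({'y': rng.normal(size=100).cumsum(), 'X1': rng.normal(size=100)})

df['y_lag1'] = df['y'].shift(1)     # lag 1 of the dependent variable
df['X1_lag1'] = df['X1'].shift(1)   # lag 1 of an independent variable
df = df.dropna()                    # drop the first row, which has no lagged values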

Conclusion:

All the above assumptions are important to validate the model. If any of these are violated, then the forecasts, confidence intervals and scientific insights derived from the model would be misleading and biased.

Hope this article has helped you to gain more insight into one of the most useful algorithms.

Here are a few good references to understand the assumptions even better.

  1. ‘Basic Econometrics’ by Damodar N Gujarati, Dawn C Porter, Manoranjan Pal
  2. ‘Business Analytics: The Science of Data-Driven Decision Making’ by U Dinesh Kumar
  3. https://statisticsbyjim.com/
