Normality of errors and residuals in ordinary linear regression

In summary, the normality of errors and residuals in ordinary linear regression is a key assumption behind the exact statistical inferences made from the model. When the residuals (the differences between observed and predicted values) are normally distributed, the usual hypothesis tests and confidence intervals are exactly valid. Violations of this assumption do not bias the coefficient estimates, but they can make those tests and intervals unreliable, particularly in small samples. Techniques such as visual inspections (e.g., Q-Q plots) and statistical tests (e.g., the Shapiro-Wilk test) are commonly used to assess normality. If normality is not met, transformations or alternative methods may be considered.
  • #1
fog37
TL;DR Summary
Checking for normality of errors and residuals in ordinary linear regression
Hello,
In reviewing the classical linear regression assumptions, one of the assumptions is that the residuals have a normal distribution... I also read that this assumption is not very critical and that the residuals don't really have to be Gaussian.
That said, the figure below shows ##Y## values and their residuals with a normal distribution of equal variance at each ##X## value:

[Figure: bell curves of equal variance centered on the regression line at several ##X## values]


To check for residual normality, should we check the distribution of the residuals at each ##X## value (not very practical)? Instead, we usually plot a histogram of ALL the residuals across the different ##X## values... But that is not what the assumption is about (normality of the residuals at each predictor value ##X##)...

Thank you...
 

  • #2
fog37 said:
one of the assumptions is that the residuals have a normal distribution...I also read that this assumption is not very critical
Critical for what? You should probably be careful about any probability statements or confidence intervals that come from a model where the random term is not normal.
fog37 said:
and that the residuals don't really have to be Gaussian.
There are glaring and common examples that violate that assumption. If all the ##Y## values must be positive, then a lot of the negative tail of the normal might be missing. If the random variation is a percentage of the ##Y## values, then a log transformation should be considered.
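For instance (a hypothetical illustrative model, not one stated in the thread), if the noise enters multiplicatively,
$$Y_i = e^{\beta_0 + \beta_1 x_i + \varepsilon_i},\qquad \varepsilon_i \sim N(0,\sigma^2),$$
then the standard deviation of ##Y_i## is roughly proportional to its mean, and taking logs recovers an ordinary linear model with additive normal errors: ##\log Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i##.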
fog37 said:
That said, the figure below shows ##Y## values and their residuals with a normal distribution of equal variance at each ##X## value:

[figure from post #1]

To check for residual normality, should we check the distribution of the residuals at each ##X## value (not very practical)? Instead, we usually plot a histogram of ALL the residuals across the different ##X## values... But that is not what the assumption is about (normality of the residuals at each predictor value ##X##)...
True. A lot depends on the subject matter expertise of the statistician. Does he have a valid reason to model the subject as a linear model with a random normal term?
 
  • #3
The assumption of a Gaussian error structure is not part of the basic regression assumptions. If that assumption is added, then things like the sampling distributions of the least-squares estimates are exact, rather than only approximate as they are without it.

When the Gaussian assumption is made, it is this: the error terms are i.i.d. normal with mean 0 and variance ##\sigma^2##. This links to your picture of the bell curves superimposed on the regression line as follows:
- in this case ##Y_1## through ##Y_n## are each normally distributed, with ##Y_i## having mean ##\beta_0 + \beta_1 x_i## and variance ##\sigma^2##
- the bell curves on the regression plot don't show the distribution of the errors; each one shows the normal distribution of the ##Y## values at that ##x##

This leads to your question: we don't need to check the error distribution at each value of ##x##, since those values don't influence the error distribution. The error terms are, as I mentioned above, i.i.d. with mean 0 and constant variance, so the checks we apply to the pooled residuals are justified.

You should also remember this: no real data are truly normally distributed. Normality is an ideal, and our checks are done simply to see whether the distribution of the collected data is close enough to that ideal to allow us to use normal-based calculations.
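A minimal sketch of this point (assuming Python with numpy and scipy; the model values are made up for illustration): simulate a linear model with i.i.d. normal errors, fit it by least squares, and run a single normality check on the pooled residuals.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical model values, chosen only for illustration
n, b0, b1, sigma = 200, 2.0, 0.5, 1.0
x = rng.uniform(0.0, 10.0, size=n)
y = b0 + b1 * x + rng.normal(0.0, sigma, size=n)  # i.i.d. normal errors

# Ordinary least-squares fit via the normal equations
X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_hat

# Because the error distribution is the same at every x,
# all residuals can be pooled into one normality check.
stat, p_value = stats.shapiro(residuals)
print(f"Shapiro-Wilk: W = {stat:.3f}, p = {p_value:.3f}")
```

Because the error distribution is the same at every ##x##, pooling the residuals loses nothing; a per-##x## check would only split the same i.i.d. sample into tiny pieces.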
 
  • #4
I believe that, via Cochran's theorem, the normality assumption also justifies the distributions of the associated ANOVA statistics.
 

FAQ: Normality of errors and residuals in ordinary linear regression

What is the importance of normality of errors in ordinary linear regression?

Normality of errors in ordinary linear regression is important because it underpins the exactness of various statistical tests and confidence intervals. Under the Gauss-Markov conditions the least-squares coefficient estimates are already unbiased with minimum variance among linear unbiased estimators (BLUE), and that does not require normality. What normality adds is exact finite-sample inference: the usual t-tests and F-tests for the significance of predictors and for the overall model fit, and the associated confidence intervals, are exactly valid rather than only approximate.

How can I check if the residuals are normally distributed?

To check if the residuals are normally distributed, you can use graphical methods and statistical tests. Common graphical methods include Q-Q (quantile-quantile) plots, where the sample quantiles of the residuals are plotted against the quantiles of a normal distribution, and histograms of the residuals. For statistical tests, the Shapiro-Wilk test and the Kolmogorov-Smirnov test are frequently used. These tests provide a p-value indicating whether the residuals significantly deviate from a normal distribution.
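As a rough sketch of these checks (assuming Python with statsmodels, scipy, and matplotlib; the data here are placeholders for your own x and y):

```python
import numpy as np
import scipy.stats as stats
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Placeholder data; x and y would come from your own dataset
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 10.0, size=100)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, size=100)

# Fit ordinary least squares and extract the residuals
model = sm.OLS(y, sm.add_constant(x)).fit()
resid = model.resid

# Graphical checks: histogram and normal Q-Q plot of the residuals
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(resid, bins=20)
axes[0].set_title("Histogram of residuals")
sm.qqplot(resid, line="s", ax=axes[1])
axes[1].set_title("Normal Q-Q plot")
plt.show()

# Formal tests: Shapiro-Wilk, and Kolmogorov-Smirnov against a normal
# whose mean and standard deviation are estimated from the residuals
print(stats.shapiro(resid))
print(stats.kstest(resid, "norm", args=(resid.mean(), resid.std(ddof=1))))
```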

What should I do if the residuals are not normally distributed?

If the residuals are not normally distributed, you can consider several approaches. One common approach is to apply a transformation to the dependent variable, such as a logarithmic, square root, or Box-Cox transformation, to stabilize variance and achieve normality. Another approach is to use robust regression methods that are less sensitive to deviations from normality. Additionally, you can consider using generalized linear models (GLMs) that do not assume normality of errors.
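A minimal sketch of the transformation route (assuming Python with scipy; y stands for a strictly positive response from your own data):

```python
import numpy as np
from scipy import stats

# Placeholder positive-valued, right-skewed response; use your own data here
rng = np.random.default_rng(2)
y = np.exp(rng.normal(loc=1.0, scale=0.5, size=200))

# Simple log transform (only valid when every y is strictly positive)
y_log = np.log(y)

# Box-Cox transform: scipy estimates the lambda that best normalizes y
y_boxcox, lam = stats.boxcox(y)
print(f"Estimated Box-Cox lambda: {lam:.3f}")

# The transformed response is then regressed on the predictors,
# and the residual diagnostics are repeated on the new model.
```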

Does non-normality of residuals affect the coefficients in linear regression?

Non-normality of residuals does not bias the estimated coefficients in linear regression, but it affects the reliability of the usual inference based on them. Specifically, the sampling distributions of the test statistics are no longer exactly t or F, so hypothesis tests and confidence intervals are only approximate, particularly in small samples. This can result in incorrect conclusions about the significance of predictors and the overall model fit.

Is normality of residuals the same as homoscedasticity?

No, normality of residuals and homoscedasticity are different concepts. Normality of residuals refers to the distribution of the error terms, which should ideally follow a normal distribution. Homoscedasticity, on the other hand, refers to the constant variance of the error terms across all levels of the independent variables. Both assumptions are important for the validity of ordinary linear regression, but they address different aspects of the error structure. Violations of either assumption can lead to inefficiencies and inaccuracies in the regression analysis.
