Estimating error in slope of a regression line

In summary, the standard error of the slope can be estimated in three main ways: 1) using the closed-form formula for the standard error of the slope from simple linear regression, 2) running a Monte Carlo simulation that regenerates simulated data from the measurement uncertainties, or 3) carrying out a bootstrap simulation that resamples the original data.
  • #1
OK, I have a question I have no idea how to answer (and all my awful undergrad stats books are useless on the matter). Say I make a number of pairs of measurements (x,y). I plot the data, and it looks strongly positively correlated. I do a linear regression and get an equation for a line of best fit, say y = 0.3x + 0.1 or something. The Pearson coefficient is very close to one, i.e., 0.9995 or so.

Now, say that the quantity I am interested in is the slope of this line, that is (for the above equation) 0.3. I take all my measurements, get the line of best fit, find its slope, and the slope is something I want. For example, with the photoelectric effect, maybe I measure stopping potential vs. frequency of light; the slope can be related to Planck's constant. Or something similar.

The question I have is: how do I estimate the error (uncertainty) in this slope value I get? My professor said to use the "standard deviation in the slope," which doesn't sound sensible to me. I thought to myself: well, maybe it has to do with using the uncertainty in x and the uncertainty in y. But how would you combine these uncertainties to find the uncertainty in dy/dx?

How does one estimate the error range for a parameter obtained from the slope of a line of best fit on a set of (x,y) data?

Thank you so much, this one seems really important and I'm a bit disturbed I haven't the slightest idea what to do.
 
  • #2
The standard assumption is that there is no uncertainty in x. y is the random variable.

If you run a regression in Excel (or any more sophisticated statistics package), it will display the standard errors of both parameters.
 
  • #3
OK, that's a good assumption in my case. It would mean that the uncertainty in the slope is equal to the uncertainty in y, right?

Unfortunately I don't have Excel and I'm doing this all by hand, heh. How do I calculate the standard errors for both parameters by hand?
 
  • #4
I'll assume some familiarity with the linear algebra notation.

The estimated parameter vector is [itex]\hat \beta = (X'X)^{-1}X'y[/itex] where X = [1 x] is the n x 2 data matrix.

Substitute [itex]X\beta + \epsilon[/itex] for y.

Calculate [itex]Var\left[\hat \beta\right]=Var\left[\beta+(X'X)^{-1}X'\epsilon\right][/itex]. Since [itex]\beta[/itex] is constant and [itex]Var[\epsilon]=\sigma^2 I[/itex], this reduces to [itex]Var\left[\hat \beta\right]=\sigma^2(X'X)^{-1}[/itex]; the standard errors are the square roots of the diagonal entries, with [itex]\sigma^2[/itex] estimated from the residuals.
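To make the algebra concrete, here is a minimal Python sketch (the example data and variable names are invented for illustration, not taken from the thread) that builds X = [1 x], computes [itex]\hat\beta[/itex], and reads the standard errors off the diagonal of [itex]\sigma^2(X'X)^{-1}[/itex]:

[code]
import numpy as np

# Invented example data: y is roughly 0.3*x + 0.1 plus noise.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0.42, 0.69, 1.02, 1.28, 1.61, 1.94])
n = len(x)

X = np.column_stack([np.ones(n), x])        # design matrix [1 x]

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y                # [intercept, slope]

resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - 2)        # unbiased estimate of Var[eps]

cov_beta = sigma2_hat * XtX_inv             # Var[beta_hat] = sigma^2 (X'X)^{-1}
se_intercept, se_slope = np.sqrt(np.diag(cov_beta))

print(f"slope     = {beta_hat[1]:.4f} +/- {se_slope:.4f}")
print(f"intercept = {beta_hat[0]:.4f} +/- {se_intercept:.4f}")
[/code]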
 
  • #6
If the dependent or independent variables in your regression have error bars, then one way to estimate the error of the slope (and intercept) is via a Monte Carlo simulation. Let's say you are doing a linear fit of 10 experimental points. The 10 x-values each have some standard deviation, and the 10 y-values each have some standard deviation. Rather than doing a single linear regression, you do many regressions in which you create simulated data where the experimental points have a Gaussian distribution about their nominal x- and y-values according to the standard deviations. Let's say you generate 100 sets of 10 experimental points. You would then get 100 different linear regression results (100 slopes and 100 intercepts). You can then calculate the standard deviations of these slopes and intercepts to give you an estimate of their errors that takes into account the measurement errors on the experimental points. You may have to do more than 100 simulations. I usually vary this number to see where I get very little change in the answer.

It can be computationally intensive. I've been told there are other ways to do this, but I don't know what they are. If you try to blindly apply simple error propagation techniques, you will get absurd numbers, so don't try that.
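A minimal sketch of this procedure, assuming Gaussian measurement errors with known per-point standard deviations (all data, error bars, and simulation counts below are invented for illustration):

[code]
import numpy as np

rng = np.random.default_rng(0)

# Nominal measurements and their assumed per-point standard deviations.
x  = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])
y  = 0.3 * x + 0.1 + rng.normal(0.0, 0.05, x.size)   # fake "observed" data
sx = np.full(x.size, 0.02)                           # uncertainty in each x
sy = np.full(y.size, 0.05)                           # uncertainty in each y

n_sims = 10_000          # increase until the answer stops changing
slopes = np.empty(n_sims)
intercepts = np.empty(n_sims)
for i in range(n_sims):
    x_sim = rng.normal(x, sx)        # jitter each point by its error bar
    y_sim = rng.normal(y, sy)
    slopes[i], intercepts[i] = np.polyfit(x_sim, y_sim, 1)

print(f"slope     = {slopes.mean():.4f} +/- {slopes.std(ddof=1):.4f}")
print(f"intercept = {intercepts.mean():.4f} +/- {intercepts.std(ddof=1):.4f}")
[/code]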
 
  • #7
In simple linear regression the standard deviation of the slope can be estimated as

[tex]
\sqrt{\frac{\frac 1 {n-2} \sum_{i=1}^n (y_i - \hat{y}_i)^2}{\sum_{i=1}^n (x_i - \overline x)^2}}
[/tex]

In comparison to post 6: rather than regenerating random data each time, you can carry out a bootstrap simulation using your original data, and obtain an estimate of the distribution of the slope. You can carry out the work for fixed or random predictors (slightly different setups in the calculations).

However, you'd either have to write the code yourself to use it in Excel, or get some software with real statistics capability: R (or S), SAS, or Minitab (very little work is required in Minitab, too).
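For anyone computing this by hand, here is a short sketch (invented data) that evaluates the formula above directly and cross-checks it against scipy.stats.linregress, which reports the standard error of the slope as .stderr:

[code]
import numpy as np
from scipy import stats

# Invented example data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0.42, 0.69, 1.02, 1.28, 1.61, 1.94])
n = len(x)

slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept                  # fitted (predicted) values

se_slope = np.sqrt(np.sum((y - y_hat) ** 2) / (n - 2)
                   / np.sum((x - x.mean()) ** 2))
print(f"hand-computed SE of slope: {se_slope:.5f}")

# scipy reports the same quantity.
result = stats.linregress(x, y)
print(f"scipy linregress .stderr:  {result.stderr:.5f}")
[/code]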
 
  • #8
Aloha statdad! Thanks for your reply. Can you speak some more about your bootstrap simulation approach? I don't quite understand it, and I am looking for a simpler way to estimate slope and intercept error bars than Monte Carlo, if such exists. In particular: 1) what is the y-hat term? and 2) I'm not seeing how the error estimates in your x- and y-values are taken into account in this approach (perhaps the y-hat term does this, but what about x? x-bar does not encapsulate measurement error in x). I have access to IDL and the Advanced Math and Statistics package, but that doesn't really help if I can't figure out how to utilize that functionality properly. I get Monte Carlo, though it is decidedly brute force. Any clarifying points you can provide would be much appreciated.
 
  • #9
The bootstrap approach is itself a Monte Carlo technique. It involves resampling your n data points over and over with replacement. Each time, you recalculate the slope of the best-fit line, building up a long list of slopes. The standard deviation of the list, multiplied by [itex]\sqrt{n/(n-1)}[/itex], is an estimator for the standard error of the original slope.

For example, if your data points are (1,10), (2,9), (3,7), (4,6), a few bootstrap samples (where you sample with replacement) might be (2,9), (2,9), (3,7), (4,6) (i.e., the first data point was omitted and the second picked twice), or (1,10), (3,7), (3,7), (3,7), or (1,10), (2,9), (3,7), (4,6). You do this a lot of times (perhaps thousands), fitting a slope to each sample, until the standard deviation of the slopes has converged to your desired accuracy. A caveat: the bootstrap technique works better with a larger original data set. Four points wouldn't cut it.
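A minimal sketch of the pairs bootstrap just described, including the [itex]\sqrt{n/(n-1)}[/itex] factor from the previous paragraph (the data and the number of resamples are invented for illustration):

[code]
import numpy as np

rng = np.random.default_rng(0)

# Invented original data; a realistically sized set (four points wouldn't cut it).
x = np.linspace(0.0, 10.0, 25)
y = 0.3 * x + 0.1 + rng.normal(0.0, 0.1, x.size)
n = x.size

n_boot = 5000
slopes = np.empty(n_boot)
for i in range(n_boot):
    idx = rng.integers(0, n, size=n)    # sample n indices with replacement
    slopes[i] = np.polyfit(x[idx], y[idx], 1)[0]

se_slope = slopes.std(ddof=1) * np.sqrt(n / (n - 1))
print(f"bootstrap SE of slope: {se_slope:.4f}")
[/code]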

(Sorry to butt in here, statdad, but I discovered this technique last year and have been using it often in my own research and excitedly telling my colleagues about its usefulness in handling non-Gaussian data. Please let me know if I've made any errors in this explanation.)
 
  • #10
Hmmm...very interesting, Mapes. Thanks for the response! I'm curious, though, it seems this approach would potentially overestimate the error in the slope by a fair amount, since replacing the point (2,9) with the point (3,7) may greatly exceed the actual error in the measurement of the point (2,9). Are there any general rules for how one does this replacement to minimize the chance of gross overestimation? Also, is there a formal name for this approach, such that I can try to find some references to read up on the technique? I don't want to keep bothering you guys when I can get answers on my own, but I don't know where to look for something like this.
 
  • #11
mdmann00 said:
Hmmm...very interesting, Mapes. Thanks for the response! I'm curious, though, it seems this approach would potentially overestimate the error in the slope by a fair amount, since replacing the point (2,9) with the point (3,7) may greatly exceed the actual error in the measurement of the point (2,9).

But that's the meaning of standard error of the slope; when taking data, you might just as well have measured (3,7) instead of (2,9). Lacking additional data, the bootstrap approach simulates additional data by sampling existing data. It might be helpful to try an example with normally distributed data and check that it matches analytical results from equations that assume a Gaussian distribution.
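As a sketch of that check (all parameters invented): simulate data with normally distributed errors, then compare the bootstrap standard error against the closed-form estimate from post 7. The two should agree closely for Gaussian data:

[code]
import numpy as np

rng = np.random.default_rng(1)

x = np.linspace(0.0, 10.0, 30)
y = 0.3 * x + 0.1 + rng.normal(0.0, 0.2, x.size)   # Gaussian errors
n = x.size

# Closed-form standard error (formula from post 7).
slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept
se_formula = np.sqrt(np.sum((y - y_hat) ** 2) / (n - 2)
                     / np.sum((x - x.mean()) ** 2))

# Bootstrap standard error (pairs resampling, as in post 9).
slopes = np.empty(5000)
for i in range(slopes.size):
    idx = rng.integers(0, n, size=n)
    slopes[i] = np.polyfit(x[idx], y[idx], 1)[0]
se_boot = slopes.std(ddof=1)

print(f"analytic SE:  {se_formula:.4f}")
print(f"bootstrap SE: {se_boot:.4f}")   # should be close for Gaussian data
[/code]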

mdmann00 said:
Also, is there a formal name for this approach, such that I can try to find some references to read up on the technique? I don't want to keep bothering you guys when I can get answers on my own, but I don't know where to look for something like this.

Chernick's Bootstrap Methods: A Practitioner's Guide is very clear.
 
  • #12
OK, so if I understood you correctly, you're saying that *if* you don't have data to suggest what the actual x and y measurement errors are, this technique allows you to get *some* kind of estimate of the regression errors from the available data. If so, that makes sense.

I will take your advice and see how the bootstrap technique, in the absence of error data, compares with a Monte Carlo simulation *with* error data.

Thanks for the reference and the help! It is much appreciated.
 
  • #13
mdmann00 said:
OK, so if I understood you correctly, you're saying that *if* you don't have data to suggest what the actual x and y measurement errors are, this technique allows you to get *some* kind of estimate of the regression errors from the available data.

Exactly - good luck!
 
  • #14
A good reference for bootstrapping is Efron & Tibshirani (1993), An Introduction to the Bootstrap. Efron invented the bootstrap. That said, I wish to address the inappropriateness of using a bootstrap to find the standard error of the slope and intercept of a simple linear regression.

The bootstrap is a sophisticated statistical procedure that is frequently used when one wishes to understand the variability and distributional form of some function (e.g., nonlinear combination) of sample estimates. Usually, this function of estimates has an unknown density. The slope and intercept of a simple linear regression have known distributions, and closed forms of their standard errors exist. Therefore, why complicate estimates of standard errors? If one were fitting a Bayesian model, then I could understand the use of MCMC methods. I highly doubt this is the case.

Also, inferences for the slope and intercept of a simple linear regression are robust to violations of normality. Unless the histogram of residuals evidences a strong departure from Normality, I would not be concerned with non-Normal errors. I would be more concerned about homogeneous (equal) variances.

If people lack software to compute standard errors of LS-regression estimates, I recommend using R. It is freeware available at www.r-project.org. It is not a point-and-click interface, but there is sufficient documentation to guide new users. The function lm() should be used for a linear regression. As a statistician, I despise the use of Excel for any statistical analysis!
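For readers working in Python rather than R, an analogous route is a sketch like the following, assuming the third-party statsmodels package is installed (data invented):

[code]
import numpy as np
import statsmodels.api as sm

# Invented example data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0.42, 0.69, 1.02, 1.28, 1.61, 1.94])

X = sm.add_constant(x)      # prepend an intercept column, as lm() does by default
fit = sm.OLS(y, X).fit()    # ordinary least squares

print(fit.params)           # [intercept, slope]
print(fit.bse)              # their standard errors
print(fit.summary())        # full regression table, like summary(lm(...)) in R
[/code]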
 
  • #15
Aloha d3t3rt,

If closed forms of the standard errors in linear regression exist, are these not what Excel uses to estimate the standard errors of the slope and intercept? And if so, why should one not use that tool to do that calculation?

Thanks for the second reference.
 
  • #16
Here is a website outlining many of Excel's shortcomings:

http://www.cs.uiowa.edu/~jcryer/JSMTalk2001.pdf

I am very suspicious of the algorithms that Excel uses to calculate statistics. Very simple statistical summaries have been calculated incorrectly by Excel (e.g., the sample standard deviation).

With respect to computer estimation of b0 and b1, statistics programs usually calculate these through an iterative computer algorithm. For example, when estimating the mean of a Normally distributed random variable, the maximum likelihood estimate is the sample mean; however, a computer may calculate this estimate with an iterative algorithm like Newton-Raphson or a golden-section search. Generally, there is a one-to-one correspondence between the computer estimates of standard errors and their "brute-force" hand calculations.
 
  • #17
"Also, inferences for the slope and intercept of a simple linear regression are robust to violations of normality. Unless the histogram of residuals evidences a strong departure from Normality, I would not be concerned with non-Normal errors. I would be more concerned about homogeneous (equal) variances."

The inferences are not robust to violations of normality; that fact is one of the reasons for the development of non-parametric and robust methods. And histograms themselves can be misleading: the shape is easily influenced by the number of bins, for instance, and for small sample sizes histograms are virtually worthless, so their use in outlier detection is minimal, even if you do graph the residuals.

Further, since high-leverage points can control the entire fit, they will not be detected as outliers, because they do not produce large residuals. Graph the data and residuals several ways, not just the quickest way.

"The slope and intercept of a simple linear regression have known distributions, and closed forms of their standard errors exist."

These distributions are exact only when normality applies perfectly (which is never), and are convenient asymptotic descriptions otherwise. Using them when data are significantly non-normal isn't a good idea.

"I would be more concerned about homogeneous (equal) variances."
I wouldn't say more concerned, but of equal concern.

"The bootstrap is a sophisticated statistical procedure that is frequently used when one wishes to understand the variability and distributional form of some function (e.g., nonlinear combination) of sample estimates."
It can be used with non-linear statistics, but it is not limited to them, and it works very well with regression.

"As a statistician, I despise the use of Excel for any statistical analysis!"
Best point. It was a long struggle at our school to convince the business group to dump Excel for its courses. Many years ago I was optimistic that the group inside Microsoft responsible for Excel would address the complaints; I gave up that hope long ago.
 
  • #18
Statdad, thank you for fixing my statement about known standard errors and distributional forms for the sample slope and intercept.
 
  • #19
statdad said:
In simple linear regression the standard deviation of the slope can be estimated as

[tex]
\sqrt{\frac{\frac 1 {n-2} \sum_{i=1}^n (y_i - \hat{y}_i)^2}{\sum_{i=1}^n (x_i - \overline x)^2}}
[/tex]

In comparison to post 6: rather than regenerating random data each time, you can carry out a bootstrap simulation using your original data, and obtain an estimate of the distribution of the slope. You can carry out the work for fixed or random predictors (slightly different setups in the calculations).

However, you'd either have to write the code yourself to use it in Excel, or get some software with real statistics capability: R (or S), SAS, or Minitab (very little work is required in Minitab, too).



Hello StatDad.
Where did you get this equation from, and what is [itex]\hat y[/itex]?
thank you!
 
  • #20
You can find it in most statistics texts. [itex] \hat y_i [/itex] is the ith predicted value of [itex] y [/itex].
 
  • #21
So I believe that predicted value requires some coding? Or is there a ready-made function for it in Excel?
 
  • #22
I haven't used Excel for statistics in such a long time that I'm afraid I can't answer your second question. If you are looking for inference procedures for the slope, then for simple linear regression the output should contain information about tests for the slope; this is true for least squares (available in nearly every package) and for robust methods (R's MASS package has a very good robust regression option).
 

FAQ: Estimating error in slope of a regression line

1. What is the purpose of estimating error in the slope of a regression line?

The purpose of estimating error in the slope of a regression line is to determine how accurately the slope of the regression line represents the relationship between two variables. It helps to evaluate the reliability and precision of the slope estimate, and to determine if the relationship between the variables is statistically significant.

2. How is error in the slope of a regression line calculated?

Error in the slope of a regression line is calculated using the standard error of the slope formula, which takes into account the variability of the data points and the sample size. The formula is: SE = √[ (Σ(y − ŷ)² / (n − 2)) / Σ(x − x̄)² ].

3. What factors can affect the error in the slope of a regression line?

Several factors can affect the error in the slope of a regression line, including the variability of the data points, the sample size, and the presence of influential outliers. The choice of regression model and the assumptions made about the relationship between the variables can also impact the error in the slope estimate.

4. How do you interpret the error in the slope of a regression line?

The error in the slope of a regression line is typically expressed as a standard error or a confidence interval. It represents the range within which the true slope of the population is likely to fall. A smaller error indicates a more precise estimate, while a larger error suggests a less reliable estimate. Additionally, if the confidence interval includes zero, the slope is not statistically significant.

5. How can you reduce the error in the slope of a regression line?

To reduce the error in the slope of a regression line, you can increase the sample size, which will decrease the standard error. Additionally, you can check for influential outliers and consider using a different regression model if the assumptions are not met. It is also important to carefully consider the variables included in the regression analysis and to ensure that they are appropriately measured and selected.
