Linear Regression of estimated measures / outliers

In summary, the conversation is discussing the theory of identifying outliers in a linear model for house size and sale price data. The speaker suggests using a 95% confidence interval to determine if a data point should be labeled as significant or not. However, it is not appropriate to remove an outlier simply because it is an outlier. It is suggested to investigate the data point and its potential influence on the problem before deciding to remove it. The conversation also mentions using a robust fit or regression diagnostics to further analyze the data.
  • #1
Whenry
Hi all, I would like to understand the theory for determining outliers in the following scenario.

Let's say I want to fit a linear model to data on house size vs. sale price for a particular location.

And let's say I have a fairly good linear relationship, as house size increases, so does price.

But then I have a mansion that only sold for $80,000.

Well, if it is just one house, I could safely ignore it as an outlier. But if, in my data, I have 50 mansions that all sold for under $100,000, I may have to suspect that something about a large house makes it very undesirable to this particular community, and that I should choose a model that reflects this.

My question is: is there a mathematical method for deciding when to label data as significant or not?

I have thought about constructing a 95% confidence interval for a measure; in this case the measure would be the mean price of mansions. Clearly, if I have only one mansion in my data... well, I don't even know how to construct a 95% CI for such a small sample size, but if I had only one, I would leave it out. If I had 50 mansions that sold for a low price, and a fairly tight 95% CI around this low price... then at some point I would say it is significant.

Any help on further understanding this would be much appreciated.
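
To make the idea above concrete, here is a minimal sketch of the interval described, under stated assumptions: the mansion prices are invented for illustration, the helper name `mean_ci` is hypothetical, and the interval uses a normal quantile, which is only a large-sample approximation (a t quantile would give a slightly wider interval). It also shows why a single mansion gives no interval at all: with one observation the sample standard deviation is undefined.

```python
import math
import statistics
from statistics import NormalDist


def mean_ci(prices, level=0.95):
    """Approximate confidence interval for the mean price.

    Uses a normal quantile (large-sample approximation). Returns None when
    there are fewer than two observations, because the sample standard
    deviation cannot be computed from a single value.
    """
    n = len(prices)
    if n < 2:
        return None
    mean = statistics.fmean(prices)
    # Standard error of the mean: sample standard deviation / sqrt(n)
    sem = statistics.stdev(prices) / math.sqrt(n)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return (mean - z * sem, mean + z * sem)


# Hypothetical data, invented for illustration only
one_mansion = [80_000]
fifty_mansions = [70_000 + 500 * i for i in range(50)]  # 70,000 up to 94,500

print(mean_ci(one_mansion))      # no interval can be formed from one sale
print(mean_ci(fifty_mansions))   # a tight interval well below 100,000
```

Note that this is an interval for the mean price of the mansions, not for the price of an individual house; an interval meant to cover a single future sale would need to be wider.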
 
  • #2
I can't tell whether you just want someone to give you a ton of links about methods of removing outliers or whether you are trying to solve a specific problem.

To get an answer to a specific problem, you must have a considerable amount of "given" information (which, in real-world problems, means you must make assumptions). You must also be able to state clearly what you are trying to accomplish.

Based on other threads, many non-statisticians who mention "confidence intervals" in their posts are really talking about "credible intervals" or "prediction intervals". So I hesitate to comment on the method you outlined till that is cleared up.
 
  • #3
It is never appropriate to eliminate an outlier from a problem simply because it is an outlier. You can
* temporarily remove it and rerun the analysis to gauge the influence the outlier is having on the problem
* investigate it to see whether there was some error in transcription (writing $80,000 rather than $800,000, for instance)
* investigate whether the peculiar data point(s) is (are) from a population you don't intend to study
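
The first option in the list, refitting with and without the suspect point, can be sketched roughly as follows. The house sizes and prices are invented for illustration, and `fit_line` is a hypothetical helper that does ordinary least squares for one predictor.

```python
from statistics import fmean


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical data: size in square feet, price in dollars.
# The last point plays the role of the cheap mansion.
sizes = [1000, 1200, 1500, 1800, 2000, 2400, 2800, 8000]
prices = [150_000, 180_000, 225_000, 270_000,
          300_000, 360_000, 420_000, 80_000]

full = fit_line(sizes, prices)                  # all points
without = fit_line(sizes[:-1], prices[:-1])     # mansion temporarily set aside

print("slope with mansion:   ", full[1])
print("slope without mansion:", without[1])
```

With these made-up numbers the single mansion is enough to flip the sign of the fitted slope, which is the kind of influence worth knowing about before deciding anything. The comparison only measures influence; it is not by itself a reason to drop the point.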

If you find an error (transcription, wrong population, problem with the recording, etc.), it is acceptable to remove the outlier and proceed. Absent that, you should leave it in. Removing it simply because you don't like it is not an acceptable statistical practice.
Have you tried a robust fit, or even looking at the regression diagnostics to see what influence the data you mention exhibits?
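
A rough sketch of both suggestions, under stated assumptions: the data are invented, the diagnostics shown are leverage and Cook's distance for a one-predictor least-squares fit, and the robust fit is a Theil-Sen estimator (median of pairwise slopes), chosen here only as one simple example of a robust method. The function names are hypothetical.

```python
from statistics import fmean, median


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def diagnostics(xs, ys):
    """Leverage and Cook's distance for each point of a simple OLS fit."""
    n, p = len(xs), 2  # p = number of fitted coefficients
    intercept, slope = fit_line(xs, ys)
    x_bar = fmean(xs)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    s2 = sum(e * e for e in residuals) / (n - p)  # residual variance estimate
    leverage = [1 / n + (x - x_bar) ** 2 / sxx for x in xs]
    cooks = [e * e / (p * s2) * h / (1 - h) ** 2
             for e, h in zip(residuals, leverage)]
    return leverage, cooks


def theil_sen(xs, ys):
    """Robust line: median of pairwise slopes, then median intercept."""
    slopes = [(ys[j] - ys[i]) / (xs[j] - xs[i])
              for i in range(len(xs))
              for j in range(i + 1, len(xs))
              if xs[j] != xs[i]]
    slope = median(slopes)
    intercept = median(y - slope * x for x, y in zip(xs, ys))
    return intercept, slope


# Hypothetical data: the last point is the cheap mansion
sizes = [1000, 1200, 1500, 1800, 2000, 2400, 2800, 8000]
prices = [150_000, 180_000, 225_000, 270_000,
          300_000, 360_000, 420_000, 80_000]

leverage, cooks = diagnostics(sizes, prices)
robust = theil_sen(sizes, prices)

for size, h, d in zip(sizes, leverage, cooks):
    print(f"size={size:5d}  leverage={h:.3f}  cooks_d={d:.3f}")
print("Theil-Sen (intercept, slope):", robust)
```

In this made-up example the mansion has by far the largest leverage and Cook's distance, and the robust fit follows the other seven houses. With 50 cheap mansions instead of one, neither result would be expected to hold, which is consistent with treating them as real structure in the data rather than as outliers.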
 

FAQ: Linear Regression of estimated measures / outliers

1. What is linear regression and why is it used?

Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables. It is used to make predictions or identify patterns in data and is commonly used in scientific research, business, and other fields.
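
For the single-predictor case discussed in the thread (price against house size), the model and its least-squares estimates can be written as follows. The symbols are notation assumed here, not taken from the thread: $x_i$ is the size of house $i$, $y_i$ its sale price, and $\varepsilon_i$ the error term.

```latex
\begin{align}
y_i &= \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i = 1, \dots, n \\
\hat{\beta}_1 &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \\
\hat{\beta}_0 &= \bar{y} - \hat{\beta}_1 \bar{x}
\end{align}
```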

2. How do outliers affect linear regression?

An outlier is a data point that differs markedly from the rest of the data. Outliers can skew the results of a linear regression analysis by pulling the line of best fit away from the majority of the data points, especially when they lie far from the other points along the predictor axis. It is important to identify outliers and assess their influence as part of a linear regression analysis.

3. How can you identify and handle outliers in linear regression?

Outliers can be identified by visualizing the data with a scatter plot or by examining regression diagnostics. Once identified, they should be investigated rather than discarded automatically: a point can reasonably be removed if it turns out to be an error or to belong to a population outside the study, and otherwise it can be kept while using a transformation of the data or a robust fitting method. It is important to consider carefully how any handling of outliers affects the overall results of the analysis.

4. What are some assumptions of linear regression?

Linear regression relies on several assumptions, including linearity, independence of errors, normality of errors, and homoscedasticity (constant variance). Violating these assumptions can lead to inaccurate or biased results. It is important to assess these assumptions before conducting a linear regression analysis.

5. How do you interpret the results of a linear regression analysis?

The results of a linear regression analysis typically include a regression equation, coefficients, and statistical measures such as R-squared and p-values. These results can be interpreted to determine the strength and direction of the relationship between the variables, the significance of the relationship, and how well the model fits the data. It is important to consider the context of the data and potential limitations of the analysis when interpreting the results.
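
As a small illustration of these quantities, the sketch below computes the fitted equation and R-squared for a one-predictor model by hand. The data are invented and the helper name `fit_and_r2` is hypothetical; p-values are left out because they need a t distribution, which the Python standard library does not provide.

```python
from statistics import fmean


def fit_and_r2(xs, ys):
    """Return (intercept, slope, r_squared) for a simple OLS fit."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # R-squared = 1 - (residual sum of squares / total sum of squares)
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - y_bar) ** 2 for y in ys)
    return intercept, slope, 1 - sse / sst


# Hypothetical data: size in square feet, price in dollars
sizes = [1000, 1500, 2000, 2500, 3000]
prices = [155_000, 220_000, 310_000, 370_000, 455_000]

intercept, slope, r2 = fit_and_r2(sizes, prices)
print(f"price = {intercept:.0f} + {slope:.1f} * size,  R^2 = {r2:.4f}")
```

Here the slope reads as the estimated change in price per additional square foot, and an R-squared close to 1 says the line accounts for most of the variation in these particular prices. Neither number says anything about whether the model's assumptions hold.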
