- #1
Whenry
Hi all, I would like to understand the theory for determining outliers in the following scenario.
Let's say I want to fit a linear model to data on house size vs. sale price for a particular location.
And let's say I have a fairly good linear relationship: as house size increases, so does price.
But then I have a mansion that only sold for $80,000.
Well, if it is just one house, I could safely ignore it as an outlier. But if my data contains 50 mansions that all sold for under $100,000, I would have to suspect that something about a large house makes it very undesirable to this particular community, and that I should choose a model that reflects this.
My question is: is there a mathematical method for deciding when such data points are significant and when they can be dismissed as outliers?
I have thought about constructing a 95% confidence interval for some measure; in this case the measure would be the mean price of mansions. Clearly, if I only have one mansion in my data, I don't even know how to construct a 95% CI for such a small sample, so I would leave that house out. If I had 50 mansions that sold for a low price, and a fairly tight 95% CI around that low price, then at some point I would say the effect is significant.
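To make the idea concrete, here is a minimal sketch of the check I have in mind. Everything in it is an assumption for illustration: the prices are made up, the function name `mean_ci` is hypothetical, and it uses a normal approximation for the interval (a t-based interval would be somewhat wider for small samples).

```python
import math
import statistics

def mean_ci(prices, level=0.95):
    """Approximate confidence interval for the mean of `prices`.

    Returns None when fewer than two prices are given, because the
    sample standard deviation is undefined for a single observation.
    Uses a normal approximation, which is an assumption of this sketch.
    """
    n = len(prices)
    if n < 2:
        return None
    mean = statistics.fmean(prices)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    sem = statistics.stdev(prices) / math.sqrt(n)
    # Two-sided critical value of the standard normal distribution.
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)
    return (mean - z * sem, mean + z * sem)

# Hypothetical data: invented sale prices, not real observations.
one_mansion = [80_000]
fifty_mansions = [80_000 + 400 * (i % 10) for i in range(50)]

print(mean_ci(one_mansion))     # no interval can be formed from one sale
print(mean_ci(fifty_mansions))  # a narrow interval around the low mean
```

With one mansion there is nothing to estimate the spread from, so no interval comes back. With 50 mansions the interval is narrow and sits well below what the linear trend would predict, which is the situation where I would stop calling them outliers.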
Any help on further understanding this would be much appreciated.