How to filter erroneous readings from a distribution of weights

In summary, the conversation discusses finding a method to automatically filter out erroneous data in distributions of weights that are predominantly normal. Some suggested methods include using mean +/- standard deviation, trimmed means, and Huber-estimators. The conversation also mentions the Sn estimator for estimating scale and provides a link to more information on the topic. The conversation ends with the individual planning to compare the different methods and their results.
  • #1
j777
Hi,

I'm working with distributions of weights that are predominantly "normal". The weights on the upper end of the distribution are in error, and I'd like to find a method that I can use to automatically "chop" off this portion of the distribution. Based on my inexperienced inspection of these distributions, it appears that using the mean +/- the standard deviation as the range for "good" data and throwing everything else away yields a fairly accurate distribution, but I'm not convinced that this is the correct/best way of filtering out the bad data.

I'm not a statistics expert so I'm hoping somebody who is can point me in the right direction.


Thanks
 
  • #2
There is no "correct" way to do that without knowing what the true distribution is. And even with a specific normal distribution, it is always possible to have some "outliers". If you are looking for a way to throw away "erroneous" measurements, you will need some method, external to the data itself, to decide which measurements are "erroneous".
 
  • #3
Throwing out values willy-nilly simply because they seem "too high" is an incredibly bad bit of work.
If you are concerned about the influence of values, either high or low, you should try a method that is more robust than the traditional mean + standard deviation (i.e., normal-distribution-based) methods.
You have a variety of choices: trimmed means, medians, Huber-estimators of location and scale, and so on. Noting that you are by admission not a statistics expert, the simplest approach might be to start with a trimmed mean.
More information on your problem and what you are trying to do would be useful.
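The trimmed-mean suggestion could be sketched roughly as follows (a sketch in C, since the poster later mentions working in C; the trimming fraction `p` is a parameter you would have to choose yourself):

```c
#include <stdlib.h>
#include <stddef.h>

static int cmp_double(const void *a, const void *b) {
    double d = *(const double *)a - *(const double *)b;
    return (d > 0) - (d < 0);
}

/* Symmetric trimmed mean: sort a copy, drop a fraction p from
   each tail, and average what remains. Requires 0 <= p < 0.5. */
double trimmed_mean(const double *x, size_t n, double p) {
    double *tmp = malloc(n * sizeof *tmp);
    size_t cut = (size_t)(p * (double)n); /* values dropped per tail */
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        tmp[i] = x[i];
    qsort(tmp, n, sizeof *tmp, cmp_double);
    for (size_t i = cut; i < n - cut; i++)
        s += tmp[i];
    free(tmp);
    return s / (double)(n - 2 * cut);
}
```

For example, with a 20% trim on {1, 2, 3, 4, 1000} the single extreme value is dropped from each tail and the result is the mean of {2, 3, 4}.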
 
  • #4
Plot the data, compute a few statistics, see what you get. What is your data reflective of? Are there any studies regarding this out there that suggest what kind of distribution/regression/etc. is appropriate?
 
  • #5
Thanks everybody for your input.

statdad -- One of the main calculations I'd like to do on the data uses the mean. Since the readings that are generally in error are on the upper end of the distributions I considered calculating the mean of the data within the interquartile range but I didn't find anything in my research that suggested that this is a good approach. Now that you've educated me on trimmed means I realize that this is in fact a common approach. As NoMoreExams suggests I'll do some computations using trimmed means and see what the results look like.
 
  • #6
Good - the primary downfall of deleting outliers based on "experience" or "gut feeling" is that even in the best of situations your biases guide your decisions. As I said, use of a robust methodology, with measures of location and scale designed to work together, will serve you well.
 
  • #7
Actually statdad suggested using trimmed means :-)
 
  • #8
Using a trimmed mean as a measure of location works pretty well, but the SD as a measure of scale doesn't work so well because of the outliers present in the data. I'm trying to understand the calculations involved in using Sn (proposed by Rousseeuw and Croux) to estimate scale. Can anybody walk me through the calculation?
 
  • #9
Does anybody know anything about Sn or Qn estimators?
 
  • #10
The [tex] S_n [/tex] estimate I know of is

[tex]
S_n = 1.1926\,\text{median}_{1 \le i \le n} \left( \text{median}_{1 \le j \le n} |x_i - x_j | \right)
[/tex]

If you don't have software to calculate this you should be able to do it rather easily in a spreadsheet:
Step 1: For each [tex] i [/tex] you calculate the median of [tex] |x_i - x_j | [/tex] over all [tex] j [/tex]
Step 2: The estimate is the median of all the quantities you calculated in step 1, multiplied by [tex] 1.1926 [/tex]

The multiplication at the end makes the estimate consistent for normally distributed data.
This estimate of scale does not require the underlying distribution for the data to be normal (or even symmetric).
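Following the two steps literally, a naive O(n^2) sketch in C might look like this. Plain medians are used throughout, matching the steps above; Rousseeuw and Croux's exact definition uses "low" and "high" medians (which differ only in how even-length medians are resolved), and they also give a faster O(n log n) algorithm.

```c
#include <stdlib.h>
#include <math.h>
#include <stddef.h>

static int cmp_double(const void *a, const void *b) {
    double d = *(const double *)a - *(const double *)b;
    return (d > 0) - (d < 0);
}

/* Median of a buffer; the buffer is sorted in place. */
static double median_inplace(double *v, size_t n) {
    qsort(v, n, sizeof *v, cmp_double);
    return (n % 2) ? v[n / 2] : 0.5 * (v[n / 2 - 1] + v[n / 2]);
}

/* S_n: for each i, the median of |x_i - x_j| over all j (step 1),
   then 1.1926 times the median of those n values (step 2). */
double sn_estimate(const double *x, size_t n) {
    double *inner = malloc(n * sizeof *inner);
    double *outer = malloc(n * sizeof *outer);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++)
            inner[j] = fabs(x[i] - x[j]);
        outer[i] = median_inplace(inner, n); /* step 1 */
    }
    double sn = 1.1926 * median_inplace(outer, n); /* step 2 */
    free(inner);
    free(outer);
    return sn;
}
```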

On a side note, you might find the information at this link

http://www.technion.ac.il/docs/sas/qc/chap1/sect21.htm

helpful. (You might not too, but it can't hurt to check it.)

good luck - keep the questions coming if you have more.
 
  • #11
Thanks statdad. One question though: in step 1, am I right in saying that [tex] x_i [/tex] is a data point and [tex] x_j [/tex] is the previous data point, and I must calculate the median of the absolute value of the difference between the two points?
 
  • #12
No - the [tex] |x_i - x_j| [/tex] means that every difference is used. Since the absolute value is in there, there is some duplication of effort, and the cases where [tex] i = j [/tex] obviously contribute zeros, but unless your data set contains thousands of values the effort you'd expend in looking only at distinct x-values would far exceed the savings in calculation time.
What software are you using?
 
  • #13
I think I mis-understood, or mis-answered, your question. Let me try again.

  • Call your first data value [tex] x_1 [/tex]. Calculate every possible [tex] |x_1 - x_j | [/tex], then get the median of these values
  • Take the second data value and calculate every possible [tex] |x_2 - x_j | [/tex], then get the median of these values
  • Repeat the calculations shown above for every data value, taking the median of each set of differences. This gives you the [tex] \text{median}_{1 \le j \le n} |x_i - x_j| [/tex] terms
  • Find the median of all the values found in the step immediately above this - this gives the [tex] \text{median}_{1 \le i \le n} \left( \text{median}_{1 \le j \le n} |x_i - x_j| \right) [/tex] term
  • The final calculation is to multiply the result found above by [tex] 1.1926 [/tex] - this will give you [tex] S_n [/tex] for your data

Hope this (and all my responses) help.
 
  • #14
I'm writing the algorithm in C. In retrospect my explanation wasn't very clear. I'm trying to understand what step 1 involves as far as writing an algorithm to perform it.
 
  • #15
OK I got it now! That's a lot of computations.
 
  • #16
Thank you so much for the detailed explanation. Now I'm going to see how this measure of scale performs compared to SD.
 
  • #17
Can't help you with the C programming - it's been a loooong time since I did that.
As an aid in interpretation - the standard deviation, as well as the MAD you may have seen references to, both measure variability in terms of the distance data values are from a fixed reference ([tex] \overline X [/tex] for the standard deviation, the median for MAD), while [tex] S_n [/tex] measures variability in terms of (loosely) the distance between pairs of data values.
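To make that comparison concrete, here is a MAD sketch in the same style as the earlier Sn code (an illustrative sketch, not code from the thread; 1.4826 is the usual consistency factor so the MAD estimates the SD for normal data):

```c
#include <stdlib.h>
#include <math.h>
#include <stddef.h>

static int cmp_double(const void *a, const void *b) {
    double d = *(const double *)a - *(const double *)b;
    return (d > 0) - (d < 0);
}

/* Median of a buffer; the buffer is sorted in place. */
static double median_inplace(double *v, size_t n) {
    qsort(v, n, sizeof *v, cmp_double);
    return (n % 2) ? v[n / 2] : 0.5 * (v[n / 2 - 1] + v[n / 2]);
}

/* MAD: median absolute deviation from the median, scaled by
   1.4826 so it estimates the SD for normally distributed data. */
double mad_estimate(const double *x, size_t n) {
    double *buf = malloc(n * sizeof *buf);
    for (size_t i = 0; i < n; i++)
        buf[i] = x[i];
    double med = median_inplace(buf, n);
    for (size_t i = 0; i < n; i++)
        buf[i] = fabs(x[i] - med);
    double mad = 1.4826 * median_inplace(buf, n);
    free(buf);
    return mad;
}
```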
 
  • #18
One more comment - sorry. The (free) statistics software R is very powerful, runs on Windows, Linux, Mac OS X, and others, and has a module to compute the robust estimates we're discussing.
 

FAQ: How to filter erroneous readings from a distribution of weights

What is the purpose of filtering erroneous readings from a distribution of weights?

The purpose of filtering erroneous readings from a distribution of weights is to remove outliers or incorrect data points that may skew the overall distribution and affect the accuracy of the results. By filtering out these erroneous readings, the distribution can better represent the true population and provide more reliable insights.

How do you determine which readings are erroneous?

There are several methods for identifying erroneous readings, such as visual inspection of the distribution, statistical analysis, or using mathematical algorithms. Some common indicators of erroneous readings include extreme values, inconsistent patterns, or being significantly different from the majority of the data points.

What techniques can be used to filter erroneous readings?

There are various techniques that can be used to filter erroneous readings, including trimming, winsorizing, smoothing, and clustering. Trimming involves removing the top and bottom percentage of data points, while winsorizing replaces extreme values with a predetermined percentile. Smoothing techniques, such as moving averages, can also help to reduce the impact of erroneous readings. Clustering involves grouping data points based on their similarity and then removing outliers within each cluster.
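A minimal sketch of the winsorizing idea in C, clamping rather than deleting extreme values (the fraction `p` and the data in the usage note are illustrative assumptions):

```c
#include <stdlib.h>
#include <stddef.h>

static int cmp_double(const void *a, const void *b) {
    double d = *(const double *)a - *(const double *)b;
    return (d > 0) - (d < 0);
}

/* Winsorize in place: clamp the lowest and highest fraction p of
   the values to the nearest retained order statistic. */
void winsorize(double *x, size_t n, double p) {
    double *tmp = malloc(n * sizeof *tmp);
    for (size_t i = 0; i < n; i++)
        tmp[i] = x[i];
    qsort(tmp, n, sizeof *tmp, cmp_double);
    size_t cut = (size_t)(p * (double)n);
    double lo = tmp[cut], hi = tmp[n - 1 - cut];
    free(tmp);
    for (size_t i = 0; i < n; i++) {
        if (x[i] < lo) x[i] = lo;
        else if (x[i] > hi) x[i] = hi;
    }
}
```

For example, winsorizing {1, 2, 3, 4, 100} at 20% replaces the 1 with a 2 and the 100 with a 4, keeping the sample size unchanged.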

Is it necessary to filter all erroneous readings?

No, it is not necessary to filter all erroneous readings. In some cases, these data points may actually provide valuable insights or represent true anomalies in the population. The decision to filter erroneous readings should be based on the specific goals of the analysis and the potential impact of these readings on the overall results.

What are the potential drawbacks of filtering erroneous readings?

Filtering erroneous readings may result in a loss of information and potentially bias the distribution. It can also be a time-consuming process, especially when dealing with large datasets. Additionally, the choice of filtering technique and the threshold for determining erroneous readings may also affect the results. It is important to carefully consider these factors before implementing any filtering methods.
