Should I be treating the data I have as a Population or Sample?

  • B
  • Thread starter SumDood_
  • Tags
    Statistics
In summary, determining whether to treat data as a population or a sample depends on the context of the data collection. A population includes all members of a defined group, while a sample is a subset of that population used to make inferences about the whole. If the data represents the entire group of interest, it should be treated as a population; if it is a portion meant to estimate characteristics of the population, it should be treated as a sample. This distinction affects statistical analysis, conclusions drawn, and the applicability of results.
  • #1
SumDood_
TL;DR Summary
The question is fairly easy. What I am not sure about is whether I am supposed to treat this data as a sample or a population. That changes the standard deviation calculation slightly. For part b, I think the mean and median are appropriate measures but the mode is not.
A study on strength properties of high-performance concrete obtained by using super-plasticizers and certain binders recorded the following data on flexural strength (in mega-pascals, MPa) from 28 tests:
6.1, 5.6, 7.1, 7.3, 6.6, 8.0, 6.8, 6.6, 7.6, 6.8, 6.7, 6.6, 6.8, 7.6, 9.3, 8.2, 8.7, 7.7, 9.3, 6.9, 8.1, 10.0, 7.5, 8.0,
11.6, 11.3, 11.9, 10.3.
a) Find the mean and standard deviation of these 28 strengths.
Mean = 8.04 MPa

b) Discuss which central tendency measures are appropriate for this data set, and which are inappropriate.
Mean and median are appropriate measures, but mode is not. Is this correct? I don't know how to justify my answer.
 
  • #2
SumDood_ said:
TL;DR Summary: The question is fairly easy. What I am not sure about is whether I am supposed to treat this data as a sample or a population.
The data you have is not the entire set of every possible result. It's a sample.
SumDood_ said:
b) Discuss which central tendency measures are appropriate for this data set, and which are inappropriate.
Mean and median are appropriate measures, but mode is not. Is this correct? I don't know how to justify my answer.
A sample can have the mode occur at two widely separated values. It could also have multiple values that are closely tied, so that a few more observations can make the mode jump around significantly.
As an extreme example, consider the uniform distribution on the real line between 0 and 1. What is its mode? What kind of behavior could you expect for the mode of a sample from that distribution?
 
  • Like
Likes PeroK
  • #3
That's sample data. If you knew the population parameters, you wouldn't need to conduct tests.
 
  • Like
Likes SumDood_
  • #4
FactChecker said:
The data you have is not the entire set of every possible result. It's a sample.

A sample can have the mode occur at two widely separated values. It could also have multiple values that are closely tied, so that a few more observations can make the mode jump around significantly.
As an extreme example, consider the uniform distribution on the real line between 0 and 1. What is its mode? What kind of behavior could you expect for the mode of a sample from that distribution?
When taking a sample from a uniform distribution, every real number within 0 and 1 has equal probability of being drawn. I would think that there would be no mode, as no 2 values would be the same.
 
  • #5
SumDood_ said:
When taking a sample from a uniform distribution, every real number within 0 and 1 has equal probability of being drawn. I would think that there would be no mode, as no 2 values would be the same.
Technically the mode is only defined for a discrete distribution. You could consider the mode on a uniform distribution of random numbers to a fixed number of decimal places.
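The fixed-decimal-places idea can be tried numerically. The following is only a rough sketch: the helper name `sample_modes`, the sample size of 28 (chosen to match the concrete data), the seeds, and the rounding to one decimal place are all assumptions made for illustration.

```python
import random
from collections import Counter

def sample_modes(n, decimals=None, seed=0):
    """Draw n values from Uniform(0, 1) and return (modes, count of the mode).

    If decimals is given, each value is rounded to that many places first.
    """
    rng = random.Random(seed)
    values = [rng.random() for _ in range(n)]
    if decimals is not None:
        values = [round(v, decimals) for v in values]
    counts = Counter(values)
    top = max(counts.values())
    modes = sorted(v for v, c in counts.items() if c == top)
    return modes, top

# Unrounded draws: print how often the most frequent value occurs.
print(sample_modes(28)[1])

# Rounded to one decimal place: a mode exists, but compare it across samples.
for seed in range(5):
    print(seed, sample_modes(28, decimals=1, seed=seed))
```

With unrounded draws a repeated value is practically impossible, so every value ties as "the mode". After rounding, some value must repeat, but which one wins changes from sample to sample, which is the behavior the uniform example was meant to expose.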
 
  • #6
WWGD said:
That's sample data. If you knew the population parameters, you wouldn't need to conduct tests.
It's possible to test an entire population.
 
  • #7
PeroK said:
Technically the mode is only defined for a discrete distribution. You could consider the mode on a uniform distribution of random numbers to a fixed number of decimal places.
Then, does that mean that the mode becomes a relevant central tendency measure? I would say it isn't because it does not provide much information about the sample. If we already know the probability of each sample being drawn is equal, then what benefit does the mode value provide?
 
  • #8
SumDood_ said:
Then, does that mean that the mode becomes a relevant central tendency measure? I would say it isn't because it does not provide much information about the sample. If we already know the probability of each sample being drawn is equal, then what benefit does the mode value provide?
The question is whether the mode is relevant to this concrete sample?

Alternatively, you could consider the criteria for which the mode is relevant, and whether your sample meets those criteria?
 
  • #9
PeroK said:
The question is whether the mode is relevant to this concrete sample?

Alternatively, you could consider the criteria for which the mode is relevant, and whether your sample meets those criteria?
Well, yes. That is what I am trying to find out. How does one determine whether any value of central tendency is appropriate or not for a given sample?
 
  • #10
SumDood_ said:
How does one determine whether any value of central tendency is appropriate or not for a given sample?
Perhaps that's a good question. Ultimately, it's a matter of experience and intelligence. In general, if the data is spread out over a relatively large number of possible values, each of which occurs at most a couple of times, then the mode is not very useful.

More generally, I would say the mode is not often that useful. Although, you may want to think of an example where it is.
 
  • Like
Likes SumDood_
  • #11
PS: The point @FactChecker made was that the median and the mean should be similar for any sample of significant size. But, in this case, the mode is fairly random and therefore tells you very little.
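For the concrete data, the three measures can be checked directly. This is a minimal sketch using Python's standard `statistics` module; the variable name `strengths` is just a label chosen here for the 28 values from the first post.

```python
import statistics

# The 28 flexural strengths (MPa) from the original post.
strengths = [
    6.1, 5.6, 7.1, 7.3, 6.6, 8.0, 6.8, 6.6, 7.6, 6.8, 6.7, 6.6,
    6.8, 7.6, 9.3, 8.2, 8.7, 7.7, 9.3, 6.9, 8.1, 10.0, 7.5, 8.0,
    11.6, 11.3, 11.9, 10.3,
]

mean = statistics.mean(strengths)
median = statistics.median(strengths)
modes = statistics.multimode(strengths)  # every value tied for most frequent

print(f"mean   = {mean:.2f} MPa")
print(f"median = {median:.2f} MPa")
print(f"modes  = {modes}")
```

The mean and median come out close to each other, while the mode is a tie between two values that each occur only three times, so one more test result could change it.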
 
  • Like
Likes SumDood_
  • #12
Median is used when you have outliers, as it's not greatly affected by them, unlike the mean. As in, "Bill Gates, you, and I have an average net worth of 500 billion."*

* Not literally.
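A small numerical sketch of the same point. All figures below are made up for illustration (five ordinary net worths plus one extreme value); none of them are real data.

```python
import statistics

# Hypothetical net worths in dollars; purely illustrative numbers.
ordinary = [20_000, 30_000, 40_000, 50_000, 60_000]
with_outlier = ordinary + [100_000_000_000]  # one extremely wealthy person

for label, data in [("without outlier", ordinary), ("with outlier", with_outlier)]:
    print(f"{label}: mean = {statistics.mean(data):,.0f}, "
          f"median = {statistics.median(data):,.0f}")
```

Adding the single extreme value drags the mean up by many orders of magnitude, while the median only moves to the next pair of middle values.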
 
  • Like
Likes FactChecker
  • #13
WWGD said:
Median is used when you have outliers, as it's not greatly affected by them, unlike the mean. As in, "Bill Gates, you, and I have an average net worth of 500 billion."*

* Not literally.
Good point! That is also an advantage of the median. I see the median used often in exactly the situation of your example. For instance, this.

EDIT: Sorry, I read your "median" and thought "mode". I just repeated what you said about median.
This post can be deleted if you want.
 
  • Like
Likes WWGD
  • #14
FactChecker said:
Good point! That is also an advantage of the median. I see the median used often in exactly the situation of your example. For instance, this.

EDIT: Sorry, I read your "median" and thought "mode". I just repeated what you said about median.
This post can be deleted if you want.
On average, I am on track: Bill Gates and some 50,000 of us. By contrast, if the median between FactChecker and WWGD is, say, $30,000 (the median being the halfway value, IIRC), then it won't be much different if we include William Gates III.
 
  • #15
FactChecker said:
A sample can have the mode occur at two widely separated values. It could also have multiple values that are closely tied, so that a few more observations can make the mode jump around significantly.
As an extreme example, consider the uniform distribution on the real line between 0 and 1. What is its mode? What kind of behavior could you expect for the mode of a sample from that distribution?
This is true. However, that can, to a certain point, also be the case of the median when you have few entries in your sample.

Essentially, there are two different issues with "central tendency" measures. The first is whether the concept of central tendency has any significance for the given distribution, and the second is the effect of statistical errors due to a small sample.

To address the first: if one even talks about a "central tendency", most of the time one ASSUMES that the measured quantity is somehow "lumped" around a central value. This usually comes down to assuming that the distribution is a "single bump". If your distribution is made up of several "bumps", then the very notion of central tendency is questionable. For instance, if you're talking about the body size of a mixed population of rats and dogs, where you have two or more bumps, namely one "around the average size of a rat" and then "several around the average sizes of different dog breeds, from a Chihuahua to a Saint Bernard", what conceptual value could a central tendency measure actually have?

So, even before considering WHAT central tendency measure could possibly be useful, the notion itself of central tendency must have a meaning, which includes the hypothesis that values are somehow "lumped around a central value", which comes down to assuming that the distribution is a "lump".

Once we make that hypothesis, three different estimators, namely sample mean, sample median, and sample mode, have different behaviours according to different properties of the original distribution and the sampling method.

If we have a small sample, this means two things:
1) we will have big statistical errors
2) the probability of getting "outliers" is small

In that case, the mean is the most reliable estimator, because it filters the statistical noise best. On a small sample, the heights of the different bins in a histogram are noisy, and the "highest one" could be relatively far away from the "central one" because of these fluctuations, so the mode is not appropriate. Also, the median is one of the sample values (or the average of the two middle ones), and if you don't have many observations, the chance that it lies close to the "good" value is not very high either.

The bigger your sample gets, and the smaller the statistical noise, the better the two other estimators, mode and median, become, and the worse the mean can get if there are outliers (that is, if the original distribution has "long tails"). See the Bill Gates example. Mode and median are not affected by rare outliers.

As to the median versus the mode, this will depend on the actual shape of the "bump". If the shape of the bump is "well-peaked", the mode may be a very good estimator. If however the bump is "flat-topped", then the median will do better. See the "uniform number distribution" example.

The median has the extra advantage of being a sample value, while the precision of the mode depends on the bin size you choose for your histogram. If the distribution is rather symmetric, then both are good estimators. If your distribution is asymmetric, then you should think about why you need a central tendency. The mode will be closer to what has the highest probability of happening; the median will be closer to "half has more, half has less".
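The small-sample behaviour described above can be explored with a quick simulation. This is only a sketch under stated assumptions: the underlying distribution is taken to be a standard normal (a single symmetric bump without long tails), the mode is taken as the centre of the fullest histogram bin, and the function names, bin width, sample size, and seed are all illustrative choices rather than anything from the thread.

```python
import random
import statistics
from collections import Counter

def binned_mode(values, bin_width):
    """Centre of the most populated histogram bin (ties: first bin encountered)."""
    bins = Counter(int(v // bin_width) for v in values)
    top_bin, _ = bins.most_common(1)[0]
    return (top_bin + 0.5) * bin_width

def estimator_spread(n, trials=2000, bin_width=0.5, seed=0):
    """Standard deviation of each estimator over many samples of size n."""
    rng = random.Random(seed)
    results = {"mean": [], "median": [], "mode": []}
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        results["mean"].append(statistics.fmean(sample))
        results["median"].append(statistics.median(sample))
        results["mode"].append(binned_mode(sample, bin_width))
    return {name: statistics.pstdev(vals) for name, vals in results.items()}

spread = estimator_spread(n=15)
for name, value in spread.items():
    print(f"{name:6s} spread over repeated samples: {value:.3f}")
```

Under these assumptions the mean scatters least from sample to sample, the median somewhat more, and the binned mode most. Swapping `rng.gauss` for a long-tailed distribution is one way to see the ranking of mean and median change.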
 
  • Like
Likes SumDood_ and WWGD

FAQ: Should I be treating the data I have as a Population or Sample?

What is the difference between a population and a sample?

A population includes all members from a defined group that we are studying or collecting information on, while a sample consists of a subset of that population, selected for the actual study. The population is the entire pool from which a statistical sample is drawn.

When should I treat my data as a population?

You should treat your data as a population if you have access to data from every member of the group you are studying. This is often feasible in small, well-defined groups but becomes impractical for larger groups.

When should I treat my data as a sample?

You should treat your data as a sample when you only have access to a subset of the entire population. This is common in large populations where it is impractical or impossible to collect data from every member.

How does treating data as a population or sample affect statistical analysis?

Treating data as a population or sample affects the statistical formulas you use. For example, when calculating the standard deviation, you would divide by N (the number of observations) for a population and by N-1 for a sample to correct for bias. This adjustment is known as Bessel's correction.
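As a rough illustration with the 28 flexural strengths from the thread, the two formulas can be compared side by side. The sketch uses `statistics.pstdev` (divide by N) and `statistics.stdev` (divide by N-1) from Python's standard library, and repeats the calculation by hand; the variable names are chosen here for illustration.

```python
import math
import statistics

# The 28 flexural strengths (MPa) from the thread.
strengths = [
    6.1, 5.6, 7.1, 7.3, 6.6, 8.0, 6.8, 6.6, 7.6, 6.8, 6.7, 6.6,
    6.8, 7.6, 9.3, 8.2, 8.7, 7.7, 9.3, 6.9, 8.1, 10.0, 7.5, 8.0,
    11.6, 11.3, 11.9, 10.3,
]

n = len(strengths)
mean = sum(strengths) / n
sum_sq_dev = sum((x - mean) ** 2 for x in strengths)

population_sd = math.sqrt(sum_sq_dev / n)        # divide by N
sample_sd = math.sqrt(sum_sq_dev / (n - 1))      # divide by N-1 (Bessel's correction)

print(f"population SD = {population_sd:.4f} MPa")
print(f"sample SD     = {sample_sd:.4f} MPa")

# The library functions implement the same two conventions.
print(statistics.pstdev(strengths), statistics.stdev(strengths))
```

For this data the two results differ only in the second decimal place (roughly 1.65 versus 1.69 MPa), and the gap shrinks further as N grows.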

What are the implications of incorrectly treating data as a population or sample?

Incorrectly treating data as a population or sample can lead to inaccurate statistical conclusions. For instance, using sample formulas on population data slightly overstates variability, while using population formulas on sample data underestimates variability and can therefore overstate confidence in your results.
