Polling Margin of Error

  • #36
I don't think they make naive mistakes either. A more realistic "mistake" is to deliberately undersample subsets where you think you already know the answer to some degree, so you can oversample the subset where you don't.

This trades uncertainty in one subsample for uncertainty in another. And this can improve the overall error.

The problems start when the assumptions on the "well-known" sample turn out to be incorrect. They are aggravated if the undersampling is sufficient to hide the discrepancy between what is expected and what is observed.

In the physical sciences we would say that one is reducing the statistical error at a cost of increased systematic error.
 
  • #37
Vanadium 50 said:
A more realistic "mistake" is to deliberately undersample subsets where you think you already know the answer to some degree
I don't think that high quality pollsters do that at all. The corrections, stratifications, and so forth are based on the independent variables that go into your statistical model (demographics), not the dependent variable that comes out of the statistical model (opinion). That is why one is not an edge case of the other. They don't make their sampling decisions based on assumed knowledge of the dependent variables.
 
  • #38
Dale said:
I don't think that high quality pollsters do that at all.
Then how do you think they get errors below √N?
 
  • #39
Vanadium 50 said:
A more realistic "mistake" is to deliberately undersample subsets where you think you already know the answer to some degree, so you can oversample the subset where you don't.
Define "undersample". If there is a subset of the population that has a small variation in the dependent variable of interest the data should show that. Is it a mistake to sample that subset less? Why? Spending time and money to drive that subset standard deviation lower than necessary instead of using it on other subsets where the uncertainties are greater would be a mistake. The goal is to get the best estimate for your time and money.
 
  • #40
If you prefer "sample less", I am OK with that.
 
  • #41
Vanadium 50 said:
If you prefer "sample less", I am OK with that.
My point is that it is often the smart thing to do to get the best answer for the time and money. Increasing the sample size is not the only way to improve the result.
 
  • #42
At the risk of being beat up again for strawmen, if I have a box labeled "10,000 white balls" and a box labeled "10,000 red balls" and a third box labeled "10,000 balls, mixed red and white" I only need to sample the last box.

If the first two boxes say "9000 red (white) and 1000 white (red)" I still only need to sample the third box.

I will get in trouble if the contents of the first two boxes don't match the labels.
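The box picture can be sketched numerically. This is only an illustration under assumed numbers: three equal-sized boxes, 3000 draws from the mixed box, and a hypothetical mislabeling in which the "all white" box actually holds 10% red. The function name and all fractions are made up for the example.

```python
import random

def estimate_red_fraction(assumed, actual, n_draws, seed=0):
    """Estimate the overall red fraction across three equal-sized boxes.

    Boxes 0 and 1 are never sampled: their red fractions are taken from
    the labels (assumed[0], assumed[1]). Only box 2 is sampled.
    """
    rng = random.Random(seed)
    reds = sum(rng.random() < actual[2] for _ in range(n_draws))
    sampled = reds / n_draws
    return (assumed[0] + assumed[1] + sampled) / 3

labels = (0.0, 1.0)            # what the labels claim: all white, all red
honest = (0.0, 1.0, 0.5)       # true contents agree with the labels
mislabeled = (0.1, 1.0, 0.5)   # hypothetical: box 0 really holds 10% red

good = estimate_red_fraction(labels, honest, 3000)
bad = estimate_red_fraction(labels, mislabeled, 3000)
true_good = sum(honest) / 3
true_bad = sum(mislabeled) / 3

# The two estimates are identical, because the unsampled boxes never
# get a chance to contradict their labels.
print(f"labels right: estimate {good:.4f}, truth {true_good:.4f}")
print(f"labels wrong: estimate {bad:.4f}, truth {true_bad:.4f}")
```

The statistical error from sampling only the mixed box is small in both cases, but in the mislabeled case the estimate is off by about 0.033 and nothing in the data reveals it. That is the statistical-versus-systematic trade in miniature.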
 
  • #43
Vanadium 50 said:
Then how do you think they get errors below √N?
Through stratified sampling of the independent variables, if indeed they do actually get errors below ##\sigma/\sqrt{N}##.
 
  • #44
Vanadium 50 said:
At the risk of being beat up again for strawmen, if I have a box labeled "10,000 white balls" and a box labeled "10,000 red balls" and a third box labeled "10,000 balls, mixed red and white" I only need to sample the last box.
For what purpose?
Suppose you are studying the emergency stopping distance of car drivers. Suppose that half the cars in the general population have ABS and half do not. Also, suppose that the stopping distance of cars with ABS has a standard deviation of 5 feet, but cars without ABS have a standard deviation of 30 feet because some drivers pump the brakes well, some pump the brakes too slowly, and others don't pump the brakes at all. Every stopping test costs $500, so you can only test 1000 drivers. You should not pick a sample ignoring who has ABS. You will get a more accurate result if you test more without ABS.
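To put numbers on the ABS example, here is a minimal sketch using the standard stratified-sampling variance formula and what textbooks call Neyman allocation (sample each stratum in proportion to its population share times its standard deviation). The post does not name a specific allocation rule, so treat that choice, and the function names, as assumptions.

```python
import math

def stratified_se(weights, sigmas, sizes):
    """Standard error of a stratified mean: sqrt(sum W_h^2 * sigma_h^2 / n_h)."""
    return math.sqrt(
        sum(w * w * s * s / n for w, s, n in zip(weights, sigmas, sizes))
    )

def neyman_allocation(weights, sigmas, n_total):
    """Split n_total across strata in proportion to W_h * sigma_h."""
    total = sum(w * s for w, s in zip(weights, sigmas))
    return [n_total * w * s / total for w, s in zip(weights, sigmas)]

weights = [0.5, 0.5]   # half the cars have ABS, half do not
sigmas = [5.0, 30.0]   # stopping-distance standard deviations in feet
n_total = 1000         # tests the budget allows

proportional = [n_total * w for w in weights]          # 500 / 500
optimal = neyman_allocation(weights, sigmas, n_total)  # fractional sizes

se_prop = stratified_se(weights, sigmas, proportional)
se_opt = stratified_se(weights, sigmas, optimal)

print(f"proportional split {proportional}: SE = {se_prop:.3f} ft")
print(f"Neyman split {[round(n) for n in optimal]}: SE = {se_opt:.3f} ft")
```

With these inputs the split moves to roughly 143 ABS tests and 857 non-ABS tests, and the standard error of the mean drops from about 0.68 ft to about 0.55 ft for the same 1000 tests.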

Suppose you are polling voters. Older voters are more likely to vote for candidate A and younger voters for candidate B. The general population has 30% over 60 years old, but your poll is on smartphones and your sample only had 15% over 60 years old. You should apply stratified sampling techniques to adjust your sample results to better match the general population.
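A small sketch of that adjustment, assuming a sample of 1000 and made-up response counts (60% of the older group and 40% of the younger group backing candidate A). Only the 30% and 15% age shares come from the example above; everything else is hypothetical.

```python
# Population age shares from the example; response counts are hypothetical.
population_share = {"over_60": 0.30, "under_60": 0.70}
sample_counts = {"over_60": 150, "under_60": 850}   # 15% / 85% of 1000
votes_for_a = {"over_60": 90, "under_60": 340}      # assumed responses

n = sum(sample_counts.values())

# Unweighted estimate: every respondent counts equally.
raw = sum(votes_for_a.values()) / n

# Post-stratified estimate: each group's observed support is weighted
# by its share of the population instead of its share of the sample.
weighted = sum(
    population_share[g] * votes_for_a[g] / sample_counts[g]
    for g in sample_counts
)

# Equivalent per-respondent weights (population share / sample share).
weights = {g: population_share[g] / (sample_counts[g] / n) for g in sample_counts}

print(f"raw support for A:      {raw:.3f}")
print(f"weighted support for A: {weighted:.3f}")
print(f"respondent weights:     {weights}")
```

Each over-60 respondent ends up counting about twice and each younger respondent about 0.82 times, which moves the estimate for A from 43% to 46% in this made-up case.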
 
  • #45
Vanadium 50 said:
Then what am I to make of two polls that differ by 2x or 3x the margin of error?
That it is not feasible to obtain an unbiased sample from a population of 160 million voters.
 
  • #46
FactChecker said:
You should apply stratified sampling techniques to adjust your sample results to better match the general population.
Fine. Let me then ask yet again: how do you beat the √N uncertainty, as the CNN poll I mentioned claims to?
 
  • #47
Vanadium 50 said:
Fine. Let me then ask yet again: how do you beat the √N uncertainty, as the CNN poll I mentioned claims to?
It is easy if there are groups that cluster around certain values with small variances within that subsample. The mathematics of it is simple. This example was post #24.

FactChecker said:
Suppose you have a sample from two groups of equal sizes, one clustered closely around 100 and the other clustered closely around -100. By grouping the subsamples, you have two small subsample variances. The end result will be smaller than if you ignored the groups and had a lot of large ##(x_i-0)^2 \approx 100^2## terms to sum.
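The quoted example can be checked with a quick simulation. This is a sketch under stated assumptions: each group is normal with standard deviation 1 around +100 or -100, the two groups are known to be exactly half the population each, and the sample size is 100. The function names are made up.

```python
import random
import statistics

rng = random.Random(1)

def draw(group):
    """One response: group 0 clusters near +100, group 1 near -100."""
    return rng.gauss(100.0 if group == 0 else -100.0, 1.0)

def simple_mean(n):
    """Ignore the groups: each respondent comes from a random group."""
    return statistics.fmean(draw(rng.randrange(2)) for _ in range(n))

def stratified_mean(n):
    """Sample n/2 from each group and combine with the known 50/50 shares."""
    half = n // 2
    mean0 = statistics.fmean(draw(0) for _ in range(half))
    mean1 = statistics.fmean(draw(1) for _ in range(half))
    return 0.5 * mean0 + 0.5 * mean1

repeats = 2000
sd_simple = statistics.stdev(simple_mean(100) for _ in range(repeats))
sd_strat = statistics.stdev(stratified_mean(100) for _ in range(repeats))

print(f"SD of the estimate, groups ignored: {sd_simple:.2f}")
print(f"SD of the estimate, stratified:     {sd_strat:.2f}")
```

The unstratified spread should come out near 100/√100 = 10 and the stratified one near 1/√100 = 0.1. The factor of 100 depends entirely on the group shares being known in advance rather than counted from the sample.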
 
  • #48
FactChecker said:
It is easy if there are groups that cluster around certain values
I don't see it. Take two delta functions. Your uncertainty on the total mean is not zero. It's driven by your uncertainty in counting how many elements are in each distribution. And you are back to √N.

You can beat it if you don't have to count. But again, now we are moving away from polls.
 
  • #49
Vanadium 50 said:
I don't see it. Take two delta functions. Your uncertainty on the total mean is not zero. It's driven by your uncertainty in counting how many elements are in each distribution.
Those are often well known about the general population. Age distributions, wealth, education levels, home locations, etc. are all known fairly well from the government population census. A pollster will probably not rely on sampling to determine those characteristics about the general population. He has better sources for that information. On the other hand, he will record those characteristics about his sample so that he can adjust his sample results, if necessary, to better reflect the general population.
Vanadium 50 said:
And you are back to √N.
No. The individual variances within the subgroups may be greatly reduced.
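The two positions can be written side by side. This is a sketch in standard stratified-sampling notation, none of which appears in the thread: stratum ##h## has population share ##W_h##, mean ##\mu_h##, within-stratum standard deviation ##\sigma_h##, and ##n_h## sampled respondents, with ##N = \sum_h n_h## and proportional allocation ##n_h = W_h N##.

$$\operatorname{Var}(\bar{y}_{\text{strat}}) = \sum_h \frac{W_h^2 \sigma_h^2}{n_h} = \frac{1}{N}\sum_h W_h \sigma_h^2$$

$$\operatorname{Var}(\bar{y}_{\text{simple}}) = \frac{1}{N}\left[\sum_h W_h \sigma_h^2 + \sum_h W_h (\mu_h - \mu)^2\right], \qquad \mu = \sum_h W_h \mu_h$$

The between-group term ##\sum_h W_h(\mu_h - \mu)^2## drops out only when the ##W_h## come from an outside source such as the census. For two delta functions (##\sigma_h = 0##) the stratified variance is then zero. If the ##W_h## have to be estimated by counting within the same sample, the between-group term comes back, which is the √N behaviour described in post #48.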
 
  • #50
Vanadium 50 said:
Fine. Let me then ask yet again: how do you beat the √N uncertainty, as the CNN poll I mentioned claims to?
Vanadium 50 said:
The latest CNN poll has N=2074 and a stated margin of error of 3.0%. It's already hard to reconcile those two numbers, especially at 2σ. It's certainly not the binomial error.
So I did a brief Monte Carlo simulation with a poll result represented as a draw from a binomial distribution with N=2074 and p=0.5. I simulated 1000 such polls. The mean was 0.5002 with a standard deviation of 0.0107. So a margin of error of 3.0% is greater than twice the standard deviation (0.0214). This is not an example of "beat[ing] √N uncertainty".
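A simulation along these lines can be reproduced in a few lines. This is a minimal sketch, not the code actually used for the numbers above; the seed and variable names are arbitrary, and the exact figures will vary slightly with the seed.

```python
import random
import statistics

rng = random.Random(2024)   # arbitrary seed, only for repeatability

N = 2074      # respondents per simulated poll
P = 0.5       # assumed true support
POLLS = 1000  # number of simulated polls

def poll_fraction():
    """One simulated poll: fraction of N respondents answering 'yes'."""
    return sum(rng.random() < P for _ in range(N)) / N

fractions = [poll_fraction() for _ in range(POLLS)]
mean = statistics.fmean(fractions)
sd = statistics.stdev(fractions)

print(f"mean of simulated polls: {mean:.4f}")
print(f"standard deviation:      {sd:.4f}")
print(f"two standard deviations: {2 * sd:.4f}")
```

For comparison, the analytic binomial value is ##\sqrt{0.25/2074} \approx 0.0110##, so two standard deviations is about 2.2%, still below the stated 3.0%.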

Since a lot of polls use Bayesian techniques I also calculated the posterior for a 50/50 split on 2074 responses using a flat Beta distributed prior because the Beta distribution is a conjugate prior for Binomial or Bernoulli data. With that I got a 95% credible interval of plus or minus 0.0215, which is almost identical to the Monte Carlo result above.
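The posterior interval can be approximated without a statistics package by sampling the Beta posterior directly. This is a sketch under the assumptions stated above (flat Beta(1, 1) prior, exactly 1037 of 2074 responses for one side); it estimates the quantiles from random draws rather than computing them exactly, so the last digit can wobble.

```python
import random

rng = random.Random(7)   # arbitrary seed

yes, n = 1037, 2074      # an exact 50/50 split of the responses

# Flat Beta(1, 1) prior plus binomial data gives a Beta posterior.
alpha = 1 + yes
beta = 1 + (n - yes)

draws = sorted(rng.betavariate(alpha, beta) for _ in range(50_000))
lower = draws[int(0.025 * len(draws))]
upper = draws[int(0.975 * len(draws))]
half_width = (upper - lower) / 2

print(f"95% credible interval: [{lower:.4f}, {upper:.4f}]")
print(f"half-width:            {half_width:.4f}")
```

The half-width should land close to the 0.0215 quoted above, since with this much data the posterior is nearly normal with standard deviation ##\sqrt{0.25/2077}##.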

This example does not appear to be an example where the margin of error is lower than what can be justified based on the sample size. In fact, it seems about the opposite. It seems that there is about 1% additional statistical uncertainty included beyond the idealized uncertainty. This could include the fact that the result was not 50/50, and also possibly that the weighting that was needed for this sample increased the overall variance.
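One way to quantify the weighting effect is Kish's approximate design effect, ##n \sum w_i^2 / (\sum w_i)^2##. The thread does not say that CNN or SSRS compute their margin this way, so the sketch below is only one possible reading. The weights are hypothetical: 150 respondents weighted up by 2.0 and 850 weighted down by 0.7/0.85, as in an age correction from 15% to 30%.

```python
import math

def kish_design_effect(weights):
    """Kish's approximation: n * sum(w^2) / (sum(w))^2."""
    n = len(weights)
    return n * sum(w * w for w in weights) / sum(weights) ** 2

N = 2074
ideal_moe = 1.96 * math.sqrt(0.25 / N)   # unweighted 95% margin at 50/50
stated_moe = 0.030                       # margin reported for the poll

# Design effect that would reconcile the two margins, if the whole gap
# were due to weighting (an assumption, not a documented fact).
implied_deff = (stated_moe / ideal_moe) ** 2
effective_n = N / implied_deff

# Hypothetical weights for illustration only.
weights = [2.0] * 150 + [0.7 / 0.85] * 850
example_deff = kish_design_effect(weights)

print(f"ideal margin of error: {ideal_moe:.4f}")
print(f"implied design effect: {implied_deff:.2f}")
print(f"effective sample size: {effective_n:.0f}")
print(f"design effect of the example weights: {example_deff:.3f}")
```

Under that reading, a stated 3.0% margin on 2074 respondents corresponds to an effective sample of a little over 1000.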
 
  • #51
There is another theoretical aspect of polling: How is the usual variance equation influenced by the constraint that the total of the percentages must add up to 100%?
I have no experience with this.
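For reference, the textbook multinomial result covers this constraint. A sketch, assuming ##N## independent respondents and true fractions ##p_i## that sum to 1 (the notation is not from the thread):

$$\operatorname{Var}(\hat{p}_i) = \frac{p_i(1-p_i)}{N}, \qquad \operatorname{Cov}(\hat{p}_i, \hat{p}_j) = -\frac{p_i p_j}{N} \quad (i \neq j)$$

$$\operatorname{Var}(\hat{p}_1 - \hat{p}_2) = \frac{p_1(1-p_1) + p_2(1-p_2) + 2 p_1 p_2}{N} = \frac{p_1 + p_2 - (p_1 - p_2)^2}{N}$$

So the variance of each individual percentage is unchanged, but the percentages are negatively correlated. In a two-way race at 50/50 the variance of the lead is ##1/N##, which makes the standard deviation of the lead twice that of either candidate's share.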
 
  • #52
If you have a sample of N, and the fraction voting for Jones is f, the uncertainty on that number is ##\sqrt{Nf (1-f)}##.
 
  • #53
Vanadium 50 said:
If you have a sample of N, and the fraction voting for Jones is f, the uncertainty on that number is ##\sqrt{Nf (1-f)}##.
Yes. But what we are talking about is the uncertainty on that number divided by ##N## (times 100 %). For the CNN poll that works out to 1.1 %, which is right in line with my Monte Carlo simulation and the Bayesian posterior.
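Spelled out for the CNN numbers, assuming ##f = 0.5##:

$$\frac{\sqrt{N f (1-f)}}{N} = \sqrt{\frac{f(1-f)}{N}}, \qquad \sqrt{2074 \times 0.5 \times 0.5} = \sqrt{518.5} \approx 22.8 \text{ respondents}, \qquad \frac{22.8}{2074} \approx 0.011 = 1.1\%$$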
 
  • #54
I trust @FactChecker 's ability to do algebra to convert that formula to whatever one he is most interested in. (You probably want to divide by Nf and not N in most cases.)
 
  • #55
Vanadium 50 said:
I trust @FactChecker 's ability to do algebra to convert that formula to whatever one he is most interested in.
OK, but you are claiming that the CNN poll "beat the ##\sqrt{N}## uncertainty", which it didn't.
 
  • #56
So, I repeated Dale's Monte Carlo, and got a 1σ variation of 1.2%. I did some things slightly differently (e.g., a 48-47-5 true distribution), but would say we agree. There are also a couple of things I did for expedience's sake that I didn't like. Did you know Excel doesn't have a POISSON.INV function?

So I am convinced.

Even so, I think this number is, if not questionable, at least discussable. It implies that the 1σ uncertainty on the sample correction is 0.9%, which is nine respondents in each column. I'll leave it to people to decide for themselves if they believe that a gigantic poll requiring such a correction would get the correct answer to better than 1%.
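The 0.9% presumably comes from subtracting in quadrature. A sketch of that arithmetic, assuming the stated 3.0% margin is a 2σ figure and that the statistical and correction uncertainties are independent:

$$\sigma_{\text{stated}} = \frac{3.0\%}{2} = 1.5\%, \qquad \sigma_{\text{correction}} = \sqrt{(1.5\%)^2 - (1.2\%)^2} = \sqrt{0.81}\,\% = 0.9\%$$

With a 48-47-5 split of 2074 respondents, the two major columns hold about 996 and 975 people, and 0.9% of roughly 1000 is about nine respondents.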

Insofar as the betting odds people are rational actors, they believe the poll errors are underestimated, or equivalently that this race is even closer than the polls suggest. I'm not saying they are right and I am not saying they are wrong - just that that is what they are betting their own money on.
 
  • #57
Vanadium 50 said:
Insofar as the betting odds people are rational actors, they believe the poll errors are underestimated
I think that is accurate. The betting odds people are making a prediction on behavior, while the pollsters are (in the best case) making a measurement of opinion. So the uncertainty in the behavior prediction is much greater than the uncertainty in the measurement of opinion. And the uncertainty in the measurement of opinion is also greater than just the margin of error.
 
  • #58
Dale said:
making a prediction on behavior, while the pollsters are (in the best case) making a measurement of opinion
That would have sent one of my social sciences professors into a tizzy. He always argued that polls measure behavior - they measure what people say they think, not what they actually think. :smile:

He was also rumored to make his own moonshine. FWIW.

However, I think you're still dealing with a difference in behavior - what people will say and how people will vote.
 
  • #59
The actual results depend on the weather, job demands, attitude regarding whether their vote matters, etc.
Those are sources of variability that are hard to factor in and I am not sure we would want them to try.
 
  • #60
While people did argue "no, this is just opinion", it's pretty clearly really an attempt to prognosticate. Otherwise, why use likely voters? Why not include everyone - resident aliens, illegal aliens, those under 18, and so on. They have opinions as well.

The bigger issue is, of course, that US presidents are elected by the states, not the populace at large. Changing opinions in California or Wyoming makes no difference. So what is being measured is correlated with electoral outcome but not the same.

A must-win district for the Democrats is NE-2, Omaha. This is what got me thinking about this. Harris is polling 11 points ahead of Biden in the latest poll. That's well above the margin of error, and well above the national shift. Maybe they just really dig her in Omaha. But in an election where both candidates have high floors and low ceilings, an 11 point swing cries out for explanation.

BTW, this is also a CNN poll, contemporaneous with the 2074 subject national poll. I am hoping these are two completely separate polls and not that a third of the people surveyed in the national poll are from Omaha.
 
  • #61
Vanadium 50 said:
While people did argue "no, this is just opinion", it's pretty clearly really an attempt to prognosticate. Otherwise, why use likely voters? Why not include everyone - resident aliens, illegal aliens, those under 18, and so on. They have opinions as well.
That is one problem with statistics in general. People like to misuse them. This is similar to how a small p value is often taken to indicate a large or important effect. It is much easier to just get excited over a number than it is to understand what that number actually means.
 
  • #62
Living and registered to vote in a "swing state", I receive a few 2024 election poll requests per business day including CNN. Many of the pollsters ask several demographic questions before the actual poll. Counting or ignoring your (election) response depends on your demographic answers.

My daughter who lives nearby responded to several identical polls. Our demographics coincide except for age, gender and (wait for it) race. She identifies as Asian American while I mark White as there is no Ashkenazi category or Decline to State. I was dropped from all the polls while my daughter was deluged with followup questions and advanced polls.

Published poll results (NYT, WaPo) did not mention these participation filters. Selecting poll participants makes sense depending on criteria but seems fraught with preconceptions and possible malfeasance.
 
  • #63
Vanadium 50 said:
While people did argue "no, this is just opinion", it's pretty clearly really an attempt to prognosticate.
Yes. There is no point otherwise.
Vanadium 50 said:
The bigger issue is, of course, that US presidents are elected by the states, not the populace at large. Changing opinions in California or Wyoming makes no difference.
All states except Nebraska and Maine are "winner-take-all" for their Electoral College votes. California and Wyoming are probably safe for one side or the other and their individual votes are never split.
An interesting agreement: A lot of states have agreed that they will give all their Electoral College votes to the winner of the national popular vote as long as enough states in the agreement can dominate the Electoral College. That would bring the Presidential election back to being decided by the popular vote using a "Rube Goldberg" mechanism.
Vanadium 50 said:
So what is being measured is correlated with electoral outcome but not the same.
Right. But it is the votes of the "swing states" that really matter. So it is the polls in individual states that are important.
Vanadium 50 said:
A must-win district for the Democrats is NE-2, Omaha.
Nebraska is probably a safe state for Trump. It would require an enormously lopsided vote in Omaha to change that.
Vanadium 50 said:
I am hoping these are two completely separate polls and not that a third of the people surveyed in the national poll are from Omaha.
That is a safe bet.
 
  • #64
Klystron said:
Living and registered to vote in a "swing state"
I'm sorry.
Klystron said:
as there is no Ashkenazi category
I met a professor in some liberal arts field or other who went on a tear about how the Ashkenazim are really Polish, there are no Mizrahi, and the Levant was originally settled by Muslims 2000 years before Mohammed. Maybe she wrote the poll. :smile:

I don't see any skulduggery in you not being called any more (and might consider it a plus). If they are already oversampling in your demographic and undersampling in your daughter's, isn't this what you expect? I also don't expect that they would drop a completed poll - just de-weight it.
 
  • #65
Rather than guessing, it would be interesting to see what methods the pollsters actually use. A lot will be proprietary. I have seen some detailed descriptions, but I can not find them with a casual search. Here are a couple of general descriptions of some issues by CNN and the polling company, SSRS, that they are using now.
If anyone can find more detailed descriptions by respected polling companies, I would be interested in seeing them.
CNN: https://www.cnn.com/2021/09/10/politics/cnn-polling-new-methodology/index.html
SSRS: https://ssrs.com/research-areas/political-election-polling/
 
  • #66
Klystron said:
Selecting poll participants makes sense depending on criteria but seems fraught with preconceptions and possible malfeasance.
Truly random sampling of voters is highly impractical. I don't have any better ideas than stratified sampling.

FactChecker: A lot of states have agreed that they will give all their Electoral College votes to the winner of the national popular vote as long as enough states in the agreement can dominate the Electoral College. That would bring the Presidential election back to being decided by the popular vote using a "Rube Goldberg" mechanism.

If such a system ever made a difference the voters in the states whose votes were switched would be justifiably outraged. Such laws would be immediately rescinded. Indeed if there were ever a chance that such a law would cause Hated Enemy to win against the will of the state's voters then I believe it would be rescinded before the election. In short, the whole thing is yet another symbolic gesture.
 
  • #67
I was afraid to bring that up, lest this devolve into a discussion on the Electoral College. The important point is that it is what the election rules are, not what they might be. It is pretty clear it is doing what it is designed to do, which may or may not be what any given citizen wants.

I agree with @Hornbein that holding the election and then changing the rules to change the outcome would provoke outrage, and will never happen.

Given the present estimated distribution of polls and likely state outcomes, we are talking about around a ½% edge for one candidate. Not insignificant, but also not huge - it has been higher in the past.
 
  • #68
FactChecker said:
A lot will be proprietary. I
While I'd like to see this too, this is why I think we never will. They are selling their "secret sauce". I'd be delighted if they did a data plus analysis dump of past polls: let's have a good look at 2016.

The problem with correcting for sampling bias comes about because of correlations. I can easily re-weight the variables so that all the 1-d distributions match expectations, but what about 2-way? That is, my sample might match the parent distribution in male vs. female and young vs. old, but not in young men vs. older women. If you have 10 yes/no variables and 1000 subjects, you have one per bucket. You'll never get that right (which is why they do something else).
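The thread does not say what the "something else" is. One widely used technique that matches only the 1-d margins is raking (iterative proportional fitting), and a small sketch of it shows the 2-way problem directly. The sample counts and target margins below are hypothetical.

```python
def rake(cells, row_targets, col_targets, iterations=100):
    """Iterative proportional fitting on a 2-D table of weighted counts.

    Alternately rescales rows and columns until both sets of margins
    match their targets. Only the margins are constrained.
    """
    table = [row[:] for row in cells]
    for _ in range(iterations):
        for i, target in enumerate(row_targets):
            total = sum(table[i])
            table[i] = [value * target / total for value in table[i]]
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in table)
            for row in table:
                row[j] *= target / total
    return table

# Hypothetical sample counts: rows = (men, women), columns = (young, old).
sample = [[300.0, 100.0], [350.0, 250.0]]
row_targets = [490.0, 510.0]   # assumed population split by sex
col_targets = [450.0, 550.0]   # assumed population split by age

raked = rake(sample, row_targets, col_targets)

def odds_ratio(t):
    """Cross-product ratio, a measure of the 2-way association."""
    return t[0][0] * t[1][1] / (t[0][1] * t[1][0])

for row in raked:
    print([round(value, 1) for value in row])
print(f"odds ratio before: {odds_ratio(sample):.3f}")
print(f"odds ratio after:  {odds_ratio(raked):.3f}")
```

Both margins end up on target, but the odds ratio is unchanged: the association between sex and age in the weighted table is still whatever the sample happened to contain, not what the population has.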
 
  • #69
Vanadium 50 said:
While I'd like to see this too, this is why I think we never will. They are selling their "secret sauce". I'd be delighted if they did a data plus analysis dump of past polls: let's have a good look at 2016.
I have seen descriptions that are detailed enough for my satisfaction. I don't remember what polling organization did that.
Vanadium 50 said:
The problem with correcting for sampling bias comes about because of correlations. I can easily re-weight the variables so that all the 1-d distributions match expectations, but what about 2-way? That is, my sample might match the parent distribution in male vs. female and young vs. old, but not in young men vs. older women. If you have 10 yes/no variables and 1000 subjects, you have one per bucket. You'll never get that right (which is why they do something else).
It seems like correlations of the general population like sex versus age distribution, which can be well established independently of the poll, would be possible to adjust for.
 
  • #70
FactChecker said:
It seems like correlations of the general population like sex versus age distribution, which can be well established independently of the poll, would be possible to adjust for.
Sure. But when I am looking at Pacific Islander females with some college but no degree, in a particular age and income band, I have chopped the data up so finely I may not be able to correct: if I expect 0.5 in my sample, what do I do if I have two? If I have zero?

A sample of 2000 and 11 yes/no questions puts on average one entry per bin. Sometimes you'll get 1, sometimes a few, and sometimes none.
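The bin occupancy is easy to simulate. A minimal sketch, assuming every one of the ##2^{11} = 2048## answer combinations is equally likely; real demographics are far less uniform, which would make the sparsity worse.

```python
import random
from collections import Counter

BINS = 2 ** 11       # 11 yes/no questions
RESPONDENTS = 2000

rng = random.Random(0)   # arbitrary seed

# Assign each respondent to a uniformly random bin (an assumption).
counts = Counter(rng.randrange(BINS) for _ in range(RESPONDENTS))

empty = BINS - len(counts)
singles = sum(1 for c in counts.values() if c == 1)
crowded = sum(1 for c in counts.values() if c >= 3)

# Expected number of empty bins under the same uniform assumption.
expected_empty = BINS * (1 - 1 / BINS) ** RESPONDENTS

print(f"empty bins:             {empty} (expected about {expected_empty:.0f})")
print(f"bins with exactly one:  {singles}")
print(f"bins with three or more: {crowded}")
```

Under the uniform assumption a bit more than a third of the bins are expected to be empty, so there is nothing to re-weight in them at all.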

This is why the pollsters don't do this.
 
