Mean time between lottery wins and probability of fraud by organizers

Jonathan212 · May 31, 2019

It may be simple: out of 1707 draws, 1356 draws produced no winner. That's no winner 79.4% of the time. Was a higher percentage expected given the average K of 6,126,358 and given (50 choose 6) = 15,890,700?

PeroK · May 31, 2019

Jonathan212 said:

It may be simple: out of 1707 draws, 1356 draws produced no winner. That's no winner 79.4% of the time. Was a higher percentage expected given the average K of 6,126,358 and given (50 choose 6) = 15,890,700?

If there is a winner about 20% of the time, then that implies that on average about 20% of the possible sets of numbers are covered. That's about 3.2 million different combinations.

Your figures suggest, therefore, that although 6.1 million tickets are sold, they represent only about 3.2 million combinations. I read yesterday that about 10,000 people play 1, 2, 3, 4, 5, 6 every week, for example. In any case, that would be the likely explanation. With 6 million random tickets I would expect about 5 million different combinations (rough guess). So, these figures are consistent with the hypothesis that players do not choose at random but typically favour certain types of combination.
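A rough check of that guess, as a sketch rather than anything taken from the posted data. It assumes every ticket is picked uniformly at random and independently, in which case the chance of no winner is (1 - 1/N)^K and the expected number of distinct combinations covered is N(1 - (1 - 1/N)^K). The function name is illustrative only.

```python
import math

def coverage_under_random_picks(n_combinations: int, n_tickets: int):
    """Return (P(no winner), expected distinct combinations) for uniform random tickets."""
    # (1 - 1/N)^K, computed via log1p so that large N and K stay numerically stable
    p_no_winner = math.exp(n_tickets * math.log1p(-1.0 / n_combinations))
    expected_distinct = n_combinations * (1.0 - p_no_winner)
    return p_no_winner, expected_distinct

N = math.comb(50, 6)   # 15,890,700 combinations, as quoted in the thread
K = 6_126_358          # average tickets per draw, as quoted in the thread
p_none, distinct = coverage_under_random_picks(N, K)
print(f"P(no winner) ~ {p_none:.3f}, expected distinct combinations ~ {distinct:,.0f}")
```

Under these assumptions the no-winner share comes out at roughly 68% and the distinct combinations at about 5.1 million, so the observed 79.4% is on the high side for purely random picks.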

The only way to verify this, of course, is to obtain figures for the number of combinations typically chosen on a weekly basis.

Note that with these figures, you will have to change your accusation to one where the operators suppress wins - there is no evidence here of excessive wins. It's how few wins there are given the ticket sales that needs to be explained.

PS the above data is consistent with there being an average of 2 winners each time the lottery is won. I.e. as there are 351 weeks when there was a winner there should be about 700 winners in total. Is that data available?

Jonathan212 · May 31, 2019

If you import the above text file to Excel and do the average of W when W > 0, it's 1.32. Not sure why you want that. The number of 1-winner draws is 280, the number of 2-winner draws is 55 etc. It's all in the summary at the beginning and the raw data is further down.

PeroK · May 31, 2019

Jonathan212 said:

If you import the above text file to Excel and do the average of W when W > 0, it's 1.32. Not sure why you want that. The number of 1-winner draws is 280, the number of 2-winner draws is 55 etc. It's all in the summary at the beginning and the raw data is further down.

There are fewer than 500 winners. That suggests that there may be certain combinations - possibly a relatively small number - with a lot of tickets. And that none of these tickets has won yet. At some time, however, one of these tickets will win and create a large number of winners that week. This would bring the average back towards 2 per win.

There may be another explanation. But, if there really are 10,000 people playing 1, 2, 3, 4, 5, 6 every week, then this is a possible explanation.

Jonathan212 · May 31, 2019

If we want to assess a single draw, how extreme a single draw is, given K for this draw, what's the proper way to do it?

(50 choose 6)/K must be ok as a factor for small K's, but it can't be right for K=(50 choose 6) even if people choose with a random number generator because even the random number generator will produce duplicates.

PeroK · May 31, 2019

Jonathan212 said:

If we want to assess a single draw, how extreme a single draw is, given K for this draw, what's the proper way to do it?

(50 choose 6)/K must be ok as a factor for small K's, but it can't be right for K=(50 choose 6) even if people choose with a random number generator because even the random number generator will produce duplicates.

If you only know how many tickets have been sold, but not how widely the tickets are distributed, then there is no way to predict the frequency of a lottery win. But the total number of winners, over a potentially long time, should be more predictable.

Take an example of a lottery with 100 tickets and 50 players. If, for whatever reason, they all have different numbers, then you'll get one win every two weeks on average; and, only ever one winner.

At the other extreme, if they all have the same number, then you will only get one win every 100 weeks, but 50 winners every time.

And, if there is something between the two, with perhaps 40 different numbers, then you will get a win less than once every two weeks but sometimes more than one winner.

The common factor is the total number of winners, which relates only to the total number of tickets sold.
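The three cases in this toy lottery can be worked out exactly. The sketch below assumes one number is drawn uniformly from 100 each week; the helper name and the specific ticket layouts (including which numbers are held) are made up for illustration.

```python
from collections import Counter
from fractions import Fraction

def weekly_stats(tickets, n_numbers):
    """Return (P(at least one winner), expected winners per week) for one uniform draw."""
    holders = Counter(tickets)                  # number -> how many players hold it
    p_win = Fraction(len(holders), n_numbers)   # a win happens iff a held number is drawn
    expected_winners = Fraction(sum(holders.values()), n_numbers)
    return p_win, expected_winners

all_different = list(range(50))             # 50 players, 50 distinct numbers
all_same = [7] * 50                         # 50 players on one number (7 is arbitrary)
mixed = list(range(40)) + list(range(10))   # 50 players spread over 40 numbers

for name, tickets in [("all different", all_different),
                      ("all same", all_same),
                      ("mixed", mixed)]:
    p_win, mean_winners = weekly_stats(tickets, 100)
    print(f"{name}: P(win) = {float(p_win):.2f}, "
          f"expected winners/week = {float(mean_winners):.2f}")
```

The win frequency changes with how the tickets are spread (0.50, 0.01, 0.40), while the expected number of winners per week stays at 0.5 in every case.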

In the real lottery, out of 6.1 million tickets sold, you might have only 3.2 million different numbers. Most of these would be held by only a few players: perhaps 1-5. But, some special "lucky" numbers might be held by thousands of different players. This could result in the pattern from your data. Most weeks there are a small number of winners, but if the lottery is played long enough, eventually one of the commonly held numbers will turn up and you'll get hundreds or thousands of winners.

In this case, it may take a long time for the number of winners to average out to match the ticket sales.

In the meantime, there is no definite, immediate way to know for sure why there are so few winners - given the number of ticket sales.

Jonathan212 · May 31, 2019

"there is no definite, immediate way to know for sure why there are so few winners - given the number of ticket sales."

Alright, I'm with you on this one. Going back to your simplified lottery, the extremes are

1. a win every 2 weeks
2. 50 wins every 100 weeks

So if we observe a win every single week, that's outside the above range and an anomaly, right? An extreme like the drug extremes previously mentioned. Can't we assign it a number like "p<0.001"?

PeroK · May 31, 2019

Jonathan212 said:

"there is no definite, immediate way to know for sure why there are so few winners - given the number of ticket sales."

Alright, I'm with you on this one. Going back to your simplified lottery, the extremes are

1. a win every 2 weeks
2. 50 wins every 100 weeks

So if we observe a win every single week, that's outside the above range and an anomaly, right? An extreme like the drug extremes previously mentioned. Can't we assign it a number like "p<0.001"?

If you had a win every week, then over time your confidence that the lottery was properly administered would reduce.

You're confusing probabilities with confidences.

Jonathan212 · May 31, 2019

Could go in the opposite direction. Assume numbers 1-30 are f times more popular than the rest and calculate f from the observations of W versus K, starting with W = 0.

mfb · Jun 1, 2019

If I interpret the txt right we had 10,457,692,468 tickets sold (that is a lot!) and 465 winners. At 1 in 15,890,700 we would expect 658 winners. To explain this difference with random chance we need a significant share of tickets going to a very small share of combinations. The 8 winners with the very small number of tickets sold (5.8 million) points in this direction, although I would (without calculating) expect more outliers.
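The expected count is just tickets sold divided by the odds. A minimal sketch of that arithmetic, using the totals quoted in this post and the (50 choose 6) odds assumed at this point in the thread:

```python
tickets_sold = 10_457_692_468   # total from the text file, as read in this post
odds = 15_890_700               # (50 choose 6), the odds assumed so far
observed_winners = 465

expected_winners = tickets_sold / odds
ratio = observed_winners / expected_winners
print(f"expected ~ {expected_winners:.1f}, observed/expected ~ {ratio:.2f}")
```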

Jonathan212 · Jun 2, 2019

The 8 winners with the very small number of tickets sold (5.8 million)

Here's the winning numbers at that draw, played in 8 different tickets.

34 27 13 17 6 13

Surprise, it can't be birthday numbers. It's as if someone knew what would happen and bought the same combination 8 times to ensure he wouldn't have to share too much of the prize.

Jonathan212 · Jun 2, 2019

What is the statistical significance of 465 instead of 658 winners? I think that is:

P( number of winners <= 465 | all numbers are equally popular )

PeroK · Jun 2, 2019

Jonathan212 said:

What is the statistical significance of 465 instead of 658 winners? I think that is:

P( number of winners <= 465 | all numbers are equally popular )

If the hypothesis is that ticket numbers were chosen at random (or equally popular), then that hypothesis would be false with almost 100% confidence. The calculated probability above would be close to zero.
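To put a rough number on "close to zero", here is a sketch that treats the total number of winners as Poisson with mean 658. That model is an assumption: it corresponds to independent, uniformly random tickets and the (50 choose 6) odds used at this point in the thread. The function name is illustrative.

```python
import math

def poisson_cdf(k: int, mean: float) -> float:
    """P(X <= k) for X ~ Poisson(mean), with each term computed in log space."""
    log_terms = [i * math.log(mean) - mean - math.lgamma(i + 1) for i in range(k + 1)]
    return math.fsum(math.exp(t) for t in log_terms)

mean_winners = 10_457_692_468 / 15_890_700   # about 658, using the odds assumed so far
p_low = poisson_cdf(465, mean_winners)
print(f"P(winners <= 465) ~ {p_low:.2e}")
```

The result is vanishingly small, which is the sense in which the equal-popularity hypothesis would be rejected under these odds.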

But, we already know that numbers are chosen by people with certain biases. The data, from that point of view, tells us nothing. We would need many more weeks (millions perhaps) to see the full picture.

If you knew the distribution of numbers chosen each week, then you could test the hypothesis that the lottery is fair. Or, you could wait a few hundred million weeks or so.

Jonathan212 · Jun 2, 2019

"Or, you could wait a few hundred million weeks"

But whence that figure of a few hundred million?

PeroK · Jun 2, 2019

Jonathan212 said:

"Or, you could wait a few hundred million weeks"

But whence that figure of a few hundred million?

There are 15 million possible numbers. If a small number are very popular, let's say 10, then one of these most popular numbers comes up only once every 1.5 million weeks.

If, for example, about 10,000 people choose 1, 2, 3, 4, 5, 6 every week, then either you look at the numbers chosen to see this; or, you run the lottery millions of times until this combination comes up and you get the data via the 10,000 winners that week.

Jonathan212 · Jun 2, 2019

Let's say the number 1 is picked 5% of the time, 2 is picked 4% of the time, etc to 50. That's 50 unknowns x1, x2, ..., x50. How do we get 50 equations to solve for these unknowns?

PeroK · Jun 2, 2019

Jonathan212 said:

Let's say the number 1 is picked 5% of the time, 2 is picked 4% of the time, etc to 50. That's 50 unknowns x1, x2, ..., x50. How do we get 50 equations to solve for these unknowns?

I'm not sure what you are learning from this. The question isn't directly how popular each individual number is but how popular different six-number combinations are. I've tended to use "numbers" above as shorthand for "combination of six numbers".

Jonathan212 · Jun 2, 2019

The dependence between numbers played in a ticket must be very weak. Got some data for the frequencies of individual winning numbers and about to post a histogram, unfortunately it's not the numbers played, only the winning numbers which indirectly tell us what people tend to pick.

PeroK · Jun 2, 2019

Jonathan212 said:

The dependence between numbers played in a ticket must be very weak. Got some data for the frequencies of individual winning numbers and about to post a histogram, unfortunately it's not the numbers played, only the winning numbers which indirectly tell us what people tend to pick.

It's too little data. It's only a few hundred winning combinations as a sample of 15 million possibilities.

Jonathan212 · Jun 2, 2019

Oopsa. Looks like this lottery is not as was thought. You play 5 numbers from 1 to 45 and 1 number from 1 to 20. Chances of a ticket winning are then 1 / (45 choose 5) * 1 / 20 = 1 / 24,435,180.
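Both sets of odds are easy to verify with the standard library. The sketch below also recomputes the expected number of winners with the corrected odds, using the ticket total quoted earlier in the thread.

```python
import math

old_odds = math.comb(50, 6)        # 6 numbers from 50, as first assumed
new_odds = math.comb(45, 5) * 20   # 5 numbers from 45, then 1 number from 20
expected_winners = 10_457_692_468 / new_odds

print(f"old odds: 1 in {old_odds:,}")
print(f"new odds: 1 in {new_odds:,}")
print(f"expected winners with new odds ~ {expected_winners:.1f}")
```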

Jonathan212 · Jun 2, 2019

And here are the histograms of winning frequencies for the first 5 numbers and for the 6th number:

Not quite as sloped as expected!

Jonathan212 · Jun 2, 2019

So we had 10,457,692,468 tickets sold (that is a lot!) and 465 winners. At 1 in 24,435,180 we would expect 428 winners. What is the statistical significance of 465 winners when 428 winners are expected in 1707 draws? I want a figure like those "p<0.0021" expressions in drug research.

PeroK · Jun 2, 2019

Jonathan212 said:

So we had 10,457,692,468 tickets sold (that is a lot!) and 465 winners. At 1 in 24,435,180 we would expect 428 winners. What is the statistical significance of 465 winners when 428 winners are expected in 1707 draws? I want a figure like those "p<0.0021" expressions in drug research.

I suppose it depends a lot on to what extent you can trust the data! The numbers do look high, obviously. Are there any other restrictions that we don't know about?

The actual calculation is difficult because of the variations from week to week. You can get an estimate by looking at the probability of getting up to 464 winners in 1707 trials with a probability of 0.25 per trial. This turns out to be 98%.

So, only a 2% chance of 465 or more winners.
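That estimate can be reproduced with a direct binomial sum. This is only a sketch of the simplification described above: 1707 trials with success probability 0.25 each, so multiple winners in one draw are not modelled. The function name is illustrative.

```python
import math

def binomial_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p), with each term computed in log space."""
    def log_pmf(i: int) -> float:
        log_comb = math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
        return log_comb + i * math.log(p) + (n - i) * math.log1p(-p)
    return math.fsum(math.exp(log_pmf(i)) for i in range(k + 1))

p_up_to_464 = binomial_cdf(464, 1707, 0.25)
print(f"P(X <= 464) ~ {p_up_to_464:.3f}, P(X >= 465) ~ {1 - p_up_to_464:.3f}")
```

Working in logs avoids overflowing the binomial coefficients, which are far too large for a float when n = 1707.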

Of course, the data could have been at the other extreme as well.

But, it's clear that 465 or more winners is more likely when you can get multiple winners. As I said, the exact calculation would be very complicated.

My guess is you're somewhere in the range of a probability of 4-5% (including the other extreme).

mfb · Jun 2, 2019

Jonathan212 said:

Here's the winning numbers at that draw, played in 8 different tickets.

34 27 13 17 6 13

Surprise, it can't be birthday numbers. It's as if someone knew what would happen and bought the same combination 8 times to ensure he wouldn't have to share too much of the prize.

If someone knew all the numbers in advance then it wouldn't make sense to buy multiple tickets for the same drawing. Too suspicious if the winners have some connection, and with just 5 million tickets you are likely to be the only winner anyway.

We still don't have the actual time series of drawings.

@PeroK: For the variance of the expected number of winners the week-by-week data is a higher order correction (taking into account the correlation between the tickets).

If everyone picks numbers randomly and we expect 428 winners then the standard deviation is sqrt(428)=20.7 and 465 is 1.8 standard deviations away (p=0.073). Take into account that people favor some numbers and it gets even more likely. No evidence of manipulation from the total number of winners.
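The figures in this post can be reproduced in a few lines. This sketch assumes the Poisson model with a normal approximation and reads the quoted p as two-sided, which is what matches 0.073.

```python
import math

expected, observed = 428, 465
sigma = math.sqrt(expected)                 # Poisson: variance equals the mean
z = (observed - expected) / sigma
p_two_sided = math.erfc(z / math.sqrt(2))   # P(|Z| > z) for a standard normal
print(f"sigma ~ {sigma:.1f}, z ~ {z:.2f}, two-sided p ~ {p_two_sided:.3f}")
```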

Jonathan212 · Jun 2, 2019

If you compare the 464 winners with the 1707 trials it's hopeless, but if you compare them with the 10,457,692,468 tickets it's easy.

Take into account that people favor some numbers

That's exactly what those histograms disprove, whatever effect there is it is very weak. Could give it a value if you want to, the sum of frequencies for numbers 1 to 30 is 66.48% while it should be 30 / 45 = 66.66%. They are a tiny bit LESS popular than higher numbers!

Why is the standard deviation sqrt(428)?

mfb · Jun 2, 2019

The histograms show no preference for specific numbers but they don’t show preferences for specific combinations.

The variance of a Poisson distribution is the same as its mean, the standard deviation is the square root of the variance.

Jonathan212 · Jun 2, 2019

I can't reproduce that 465 winners number. How did you calculate it?

mfb · Jun 2, 2019

I just summed the entries in the second column in the long table. I get the same result if I multiply the second and third row in the first table and then sum the products.

Here is an xls file, which is more convenient than the text file.

Jonathan212 · Jun 3, 2019

The variance of a Poisson distribution is the same as its mean

Isn't Poisson distribution the distribution of the time between wins? I thought it's a binomial distribution we've got here instead, approximated as gaussian.

mfb · Jun 3, 2019

Jonathan212 said:

Isn't Poisson distribution the distribution of the time between wins?

No.

I thought it's a binomial distribution we've got here instead, approximated as gaussian.

That is true as well. A Poisson distribution with a large expectation value is approximately a Gaussian distribution.

Jonathan212 · Jun 12, 2019

Greetings. I'm intending to write this up for a non-expert high-school-level audience. Complete with links for explanations like the origin of "(45 choose 5)", why we look at a normal distribution, etc. But there is one point I haven't yet understood myself. Is it ok to NOT mention Poisson distribution at all and instead say that the number of tickets winning in the 16 years should follow a binomial distribution, which we approximate with a normal distribution like we did in my other question below?

https://www.physicsforums.com/threads/probability-that-1000-coin-flips-results-in-600-tails.965579/

mfb · Jun 12, 2019

Jonathan212 said:

Is it ok to NOT mention Poisson distribution at all and instead say that the number of tickets winning in the 16 years should follow a binomial distribution, which we approximate with a normal distribution like we did in my other question below?

Sure. In that case you need the additional information that the variance of the normal distribution is equal to the mean.

Jonathan212 · Jun 12, 2019

Can't I just ignore that information and instead give the fact that the binomial distribution in

= 1 - BINOMDIST( M - 1 , N , 0.5, 1 )

is approximated by the normal distribution in

= 1 - NORMDIST( M - 1, N * 0.5, SQRT( N * 0.5 * (1-0.5) ), 1 )

where we'd replace 0.5 by 1/24,435,180 and use N = 10,457,692,468 and M = 465 ?

Then the statistical significance of the M = 465 wins (ie the probability of 465 wins or more) is

p = 1 - NORMDIST( 465 - 1, 427.9768951, SQRT( 427.9768776 ), 1 )

p = 0.040816379

That's not the same as your p=0.073 result in #59. Am I doing something wrong?

EDIT: just found the error. You're looking at the "|z| >" value but you should be looking at "z >". And because we want 465 or more, ie > 464, you should have calculated how many standard deviations 464 is from 428, not 465 from 428. That's 1.74129038 standard deviations and we get the same result at z > 1.74129.
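The same calculation outside Excel, as a sketch. It mirrors the NORMDIST call above (normal approximation, no continuity correction, evaluated at M - 1 = 464) and prints both the one-sided and two-sided tails so the two conventions can be compared.

```python
import math

n_tickets = 10_457_692_468
p_ticket = 1 / 24_435_180
m = 465

mean = n_tickets * p_ticket
sd = math.sqrt(n_tickets * p_ticket * (1 - p_ticket))
z = (m - 1 - mean) / sd                           # 464, matching NORMDIST(M - 1, ...)
p_one_sided = 0.5 * math.erfc(z / math.sqrt(2))   # P(Z > z)
p_two_sided = math.erfc(z / math.sqrt(2))         # P(|Z| > z)
print(f"z ~ {z:.4f}, one-sided p ~ {p_one_sided:.4f}, two-sided p ~ {p_two_sided:.4f}")
```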

Jonathan212 · Jun 12, 2019

In drug research the results are stated like this: p<0.01. How can we do the same in this problem? Ie how can we establish an upper bound for p given that the normal we're looking at is only an approximation to the binomial?

Jonathan212 · Jun 12, 2019

Is there any site where you can calculate extreme binomial integrals like this one without the normal approximation?

= 1 - BINOMDIST( 465 - 1, 10457692468, 1/24435180, 1 )
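One way to get this tail without the normal approximation is to sum the binomial probabilities directly, stepping from P(X = k) to P(X = k + 1) with the ratio (n - k)/(k + 1) * p/(1 - p). This is a sketch, not a vetted numerical routine: the function name is illustrative, and it relies on (1 - p)^n being representable as a float, which holds here (about 1e-186) but would underflow for much larger means.

```python
import math

def binomial_upper_tail(m: int, n: int, p: float) -> float:
    """P(X >= m) for X ~ Binomial(n, p), via the pmf recurrence (no normal approximation)."""
    pmf = math.exp(n * math.log1p(-p))     # P(X = 0) = (1 - p)^n
    ratio = p / (1.0 - p)
    terms = []
    for k in range(m):                     # collect P(X = 0) ... P(X = m - 1)
        terms.append(pmf)
        pmf *= (n - k) / (k + 1) * ratio   # step from P(X = k) to P(X = k + 1)
    return 1.0 - math.fsum(terms)

p_exact = binomial_upper_tail(465, 10_457_692_468, 1 / 24_435_180)
print(f"P(X >= 465) ~ {p_exact:.4f}")
```

The value should land in the same neighbourhood as the normal-approximation figure above; any gap between the two gives a feel for how much the approximation matters at this tail.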
