Practical problem: need a distribution

CRGreathouse · Apr 14, 2009

This should be an easy question, but I can't think of how to answer it! I never did take enough stats.

I'm looking at n = 77 returned surveys which come from a population of size 627. (At the moment I'm assuming that response is uncorrelated with the answers; it's actually somewhat reasonable in this case, and I don't have the background I'd need to assume otherwise!)

Each survey contains count data: I have
[ ] 0
[ ] 1
[ ] 2
[ ] 3
foos (for various foo). From this I can of course determine the number of foos in the sample, and the BLUE for the total across the population. But what sort of distribution should I use to determine a (say) 90% confidence range? I was toying with misusing a Poisson model here (each respondent acting like a time interval), but even so I wasn't able to determine a CI (must have been doing something very wrong; when I did a normal approximation of the Poisson I came up with a negative lower bound!). In summary:
1. What sort of distribution is appropriate? A simpler one would be better.
2. Very briefly (one sentence or just drop in a link; I'll work out the details) how do I find a CI with that distribution?

Hurkyl · Apr 14, 2009

I think... what you want to do is to estimate the probability distribution over {0, 1, 2, 3} and then take the mean of that probability distribution?

So, you have the unknown variables X = P(0), Y = P(1), Z = P(2), from which you can compute the quantities:
P(My poll of N people got the frequencies A, B, C, (N-A-B-C) | X = x, Y = y, Z = z)

You could probably do some Bayesian thing to estimate

P(X = x, Y = y, Z = z | My poll of N people got the frequencies A, B, C, (N-A-B-C))

from which you can compute a probability distribution on

mu = y + 2z + 3(1-x-y-z)

At least, these are my first instincts. I don't know if this is the right way to go about it.

If it was the distribution over {0, 1, 2, 3} you wanted, you could do a Chi-square test with three degrees of freedom if each bucket has enough samples.

Maybe the right way to get a distribution on the mean is to start with this?
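A minimal sketch of the Bayesian idea above, under assumptions that are not in the thread: a flat Dirichlet(1, 1, 1, 1) prior on the four box probabilities, and made-up counts that sum to 77. It draws from the posterior of (P(0), P(1), P(2), P(3)) and reads off the induced distribution of mu. The function name and the counts are hypothetical.

```python
import random

def posterior_mean_interval(counts, values, level=0.90, draws=20000, seed=1):
    """Monte Carlo credible interval for mu = sum(values[i] * p[i]).

    counts[i] is how many respondents ticked values[i]. A flat
    Dirichlet(1, ..., 1) prior is ASSUMED, so the posterior is
    Dirichlet(counts[i] + 1).
    """
    rng = random.Random(seed)
    alphas = [c + 1 for c in counts]
    mus = []
    for _ in range(draws):
        # A Dirichlet draw is a vector of independent gammas, normalised.
        g = [rng.gammavariate(a, 1.0) for a in alphas]
        total = sum(g)
        mus.append(sum(v * x / total for v, x in zip(values, g)))
    mus.sort()
    tail = (1.0 - level) / 2.0
    lo = mus[int(tail * draws)]
    hi = mus[int((1.0 - tail) * draws) - 1]
    return lo, hi

# HYPOTHETICAL counts for the boxes 0, 1, 2, 3 (they sum to 77).
counts = [36, 20, 12, 9]
values = [0, 1, 2, 3]
lo, hi = posterior_mean_interval(counts, values)
print(f"90% credible interval for the mean: {lo:.2f} to {hi:.2f}")
```

Multiplying both ends by 627 would give a range for the population total, under the same assumption that respondents resemble non-respondents.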

CRGreathouse · Apr 14, 2009

Hurkyl said:

At least, these are my first instincts. I don't know if this is the right way to go about it.

You have no idea how far that went in making me feel non-stupid for asking the question.

Hurkyl · Apr 14, 2009

CRGreathouse said:

You have no idea how far that went in making me feel non-stupid for asking the question.

Heh! Oh, if you didn't notice, I added something about doing a goodness-of-fit test to the end of my previous post.

CRGreathouse · Apr 14, 2009

Hurkyl said:

If it was the distribution over {0, 1, 2, 3} you wanted, you could do a Chi-square test with three degrees of freedom if each bucket has enough samples.

I have a {0, 1, 2, 3, 4, 5} distribution (number of objects: discrete), a {0, 2.5, 4, 6, 12, 50, 250} (frequency: continuous, but broken into blocks for the survey), and several categorical or binary distributions.

I don't understand what I'd do with a chi-square. I thought that was when you wanted to test if a prior distribution was reasonable given sample data, but I have no prior idea what the distribution was before starting.

Hurkyl · Apr 14, 2009

Well, the thing I'm hoping will work out is that you can somehow obtain a distribution on the mean by integrating over all possible distributions on {0,1,2,3,4,5} that would yield that mean. Maybe goodness-of-fit isn't the way to go about it, but I decided it was a place to start thinking.

mXSCNT · Apr 14, 2009

For n=77, it's probably safe to assume that the mean number of foo is approximately normally distributed, and it's also probably safe to use the sample estimate of the variance when calculating the confidence interval for the mean.

CRGreathouse · Apr 14, 2009

mXSCNT said:

For n=77, it's probably safe to assume that the mean number of foo is approximately normally distributed, and it's also probably safe to use the sample estimate of the variance when calculating the confidence interval for the mean.

Unfortunately that won't do. The sample standard deviation I calculated is greater than the mean, so to get even a 70% confidence interval I'd have to include 0. But the lower bound is the most important part of the estimation, and it can't be that low.

mXSCNT · Apr 14, 2009

You know that you divide the sample std. dev. by sqrt(77) to get the std. dev. of the mean (the standard error)?

CRGreathouse · Apr 14, 2009

mXSCNT said:

You know that you divide the sample std. dev. by sqrt(77) to get the std. dev. of the mean (the standard error)?

My calculation was

S = (0 - mean)^2 * [# of 0 responses] + (1 - mean)^2 * [# of 1 responses] + (2 - mean)^2 * [# of 2 responses]
sigma = sqrt(S/76)

where [# of 0 responses] + [# of 1 responses] + [# of 2 responses] = 77.

Edit: Actually this is simplified, since I have more possibilities than {0, 1, 2}, but you get what I mean. I used 76 rather than 77 because this is a sample and not the population.
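The calculation above, extended by the division by sqrt(n) that was asked about, might look like the following sketch. The counts are hypothetical (the real ones are not given in the thread), and `table_stats` is an illustrative name.

```python
import math

def table_stats(freq):
    """Mean, sample standard deviation and standard error from a
    frequency table {value: number of respondents}."""
    n = sum(freq.values())
    mean = sum(v * k for v, k in freq.items()) / n
    # S is the sum of squared deviations, weighted by response counts.
    s = sum((v - mean) ** 2 * k for v, k in freq.items())
    sd = math.sqrt(s / (n - 1))     # spread of individual answers
    se = sd / math.sqrt(n)          # spread of the sample mean
    return n, mean, sd, se

# HYPOTHETICAL counts, not the survey's real data.
freq = {0: 36, 1: 20, 2: 12, 3: 9}
n, mean, sd, se = table_stats(freq)
print(f"n={n} mean={mean:.3f} sd={sd:.3f} se={se:.3f}")
```

With these made-up counts the standard deviation exceeds the mean, as in the thread, while the standard error is smaller by a factor of sqrt(77), about 8.8. The confidence interval for the mean is built from the standard error, not from the standard deviation.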

mXSCNT · Apr 14, 2009

If that doesn't give you a small enough confidence interval, I don't know then.

CRGreathouse · Apr 14, 2009

mXSCNT said:

If that doesn't give you a small enough confidence interval, I don't know then.

My mean is just under 1, since most people report 0. My standard deviation is a touch over 1.

I tend to think that getting results like this means my approach is wrong and I need to model it differently. What do you think of my proposed (mis)use of the Poisson distribution here?
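As a rough check on the normal approach with the figures quoted so far, here is a sketch under stated assumptions: the mean is the 0.96104 that appears in the R call later in the thread, the standard deviation is taken as 1.05 (the thread only says "a touch over 1"), and a finite population correction is added because 77 of 627 is a sizeable fraction. The correction is an addition, not something proposed in the thread.

```python
import math
from statistics import NormalDist

N, n = 627, 77          # population and sample sizes from the thread
mean = 0.96104          # sample mean, as used in the later R call
sd = 1.05               # ASSUMED: "a touch over 1", exact value not given

z = NormalDist().inv_cdf(0.95)          # about 1.645 for a two-sided 90% interval
se = sd / math.sqrt(n)                  # standard error of the mean
fpc = math.sqrt((N - n) / (N - 1))      # finite population correction (an addition)
half = z * se * fpc

lo_total = N * (mean - half)
hi_total = N * (mean + half)
print(f"standard error: {se:.3f}")
print(f"90% interval for the population total: {lo_total:.0f} to {hi_total:.0f}")
```

Because the interval uses the standard error rather than the standard deviation, its lower end stays well above zero even though the standard deviation is larger than the mean.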

CRGreathouse · Apr 14, 2009

I tried using R (my first time!) to calculate the 5% and 95% quantiles using a Poisson distribution:

Code:

> qpois(.05*(1:19),0.96104 * 627)
[1] 562 571 577 582 586 590 593 596 599 602 605 609 612 615 619 623 628 634 643

I don't think this is a great approach, since it doesn't take into account the degree to which the sample will randomly deviate from the population. But at least it gives a reasonable bound: 562 to 643 with 90% confidence. (The true 90% bound should then be wider, though I don't know how much.)
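One way to gauge how much wider the interval becomes once the rate is treated as an estimate is sketched below. It uses a normal approximation in place of exact Poisson quantiles, and the usual Poisson result that a rate estimated from n observations has variance rate / n. Both choices are assumptions of this sketch rather than anything established in the thread.

```python
import math
from statistics import NormalDist

N, n = 627, 77
rate = 0.96104                      # foos per respondent, from the post above
z = NormalDist().inv_cdf(0.95)      # two-sided 90%

# Interval in the spirit of the qpois call: rate treated as known,
# only Poisson noise in the population total.
known = (N * rate - z * math.sqrt(N * rate),
         N * rate + z * math.sqrt(N * rate))

# Rate estimated from n respondents: under a Poisson model the
# estimate has variance rate / n, which is then scaled up by N.
se_rate = math.sqrt(rate / n)
estimated = (N * (rate - z * se_rate), N * (rate + z * se_rate))

print("rate known:     %.0f to %.0f" % known)
print("rate estimated: %.0f to %.0f" % estimated)
```

The first interval should land close to the 562 to 643 from R, and the second is several times wider, which matches the suspicion that the true 90% range is broader.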

Hurkyl · Apr 14, 2009

I'm mildly confused about what you're actually trying to compute. (Something got lost or obfuscated in the abstraction.) But, assuming I have a good idea about it...

One thing to consider is that maybe you just don't have enough data to compute what you want.

Since you're interested in lower bounds, maybe you shouldn't be doing confidence intervals, but instead one-sided tests; i.e. find a 95% confidence interval of the form "A < X" rather than one of the form "A < X < B".

Since you've revealed that your actual data is mostly zeroes, a few ones, and sporadic higher values, it seems more plausible that the Poisson could be used. Does the thing you're actually testing for have qualities that suggest Poisson is accurate? You could always do a goodness-of-fit test to see if a Poisson distribution with the right mean is a decent description of the data.
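The goodness-of-fit check suggested above could be sketched as follows. The counts are hypothetical, the largest box is treated as "that many or more", and the Poisson mean is taken to be the sample mean. The function name is illustrative. The statistic is compared against standard 5% chi-square critical values rather than converted to a p-value.

```python
import math

def poisson_gof(freq):
    """Pearson chi-square statistic comparing a frequency table
    {value: count} with a Poisson whose mean is the sample mean.
    The largest value is treated as an open-ended tail cell."""
    n = sum(freq.values())
    lam = sum(v * k for v, k in freq.items()) / n
    top = max(freq)
    probs = [math.exp(-lam) * lam ** v / math.factorial(v) for v in range(top)]
    probs.append(1.0 - sum(probs))          # tail cell: P(X >= top)
    stat = 0.0
    for v, p in enumerate(probs):
        expected = n * p
        stat += (freq.get(v, 0) - expected) ** 2 / expected
    dof = len(probs) - 2                    # cells - 1 - one fitted parameter
    return stat, dof

# HYPOTHETICAL counts, not the survey's real data.
freq = {0: 36, 1: 20, 2: 12, 3: 9}
stat, dof = poisson_gof(freq)
# 5% critical values of chi-square for 1, 2 and 3 degrees of freedom.
critical = {1: 3.841, 2: 5.991, 3: 7.815}
print(f"chi-square = {stat:.2f} on {dof} df, 5% cutoff {critical[dof]}")
```

A statistic above the cutoff would suggest the Poisson shape is a poor description of the data. The usual caveat from earlier in the thread applies: each cell needs a reasonable expected count for the test to be trustworthy.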

mXSCNT · Apr 15, 2009

If you do model it as a Poisson distribution, there is a way to estimate it.
http://en.wikipedia.org/wiki/Poisson_distribution#Parameter_estimation
gives an estimator, but not the variance of the estimate--I guess you need to find a book that has that information.

CRGreathouse · Apr 15, 2009

Yes, that says to use the BLUE, which I was already using.

CRGreathouse · Apr 15, 2009

Hurkyl said:

One thing to consider is that maybe you just don't have enough data to compute what you want.

I hope not. I think the main problem is my lack of statistical experience in choosing good models and techniques.

But at least now I have an estimate, even if it is narrower than I think is justified. There's at least one modification I could do to the test, but that would make the problem worse (narrow the range): reducing the population size by the sample size and adding in the known values.
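The modification described above (count the 77 known answers exactly and model only the remaining 550 people) might look like this sketch. The sample total of 74 is inferred from 0.96104 * 77 and is not stated in the thread. The rate is again treated as known, as in the qpois call, and a normal approximation replaces exact Poisson quantiles.

```python
import math
from statistics import NormalDist

N, n = 627, 77
rate = 0.96104
sample_total = round(rate * n)      # INFERRED: 0.96104 * 77 is about 74 foos
unseen = N - n                      # 550 people whose answers are unknown

z = NormalDist().inv_cdf(0.95)      # two-sided 90%
center = sample_total + unseen * rate
# Only the unseen part of the population contributes randomness here.
half = z * math.sqrt(unseen * rate)
print(f"{center - half:.0f} to {center + half:.0f}")
```

The half-width shrinks because the variance now comes from 550 people instead of 627, which is the narrowing anticipated in the post. It does not address the larger source of uncertainty, namely that the rate itself is only an estimate.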
