Is this a Poisson distribution problem?

In summary, the author is considering how to statistically analyze data on the quality of hip replacement prostheses, specifically whether brand A or B is better. The author calculates a Poisson-related λ (average number of events per interval) and finds 500/20 = 25 failures/year for A and 1000/20 = 50 failures/year for B. However, dividing by the number of prostheses leads to the opposite conclusion, because 500/100000 = 0.5% > 1000/300000 = 0.33%. The author has difficulty understanding what these numbers mean and does not know how to combine the number of objects with the element of time and with the randomness of the observation of failure events.
  • #1
lavoisier
Hello,
I have been thinking about this problem for a while, but I can't decide how it should be tackled statistically. I wonder if you can help, please.
Suppose that prostheses for hip replacement are sold mainly by 2 manufacturers, A and B.
Since they started being sold 20 years ago, 100 000 prostheses from A, and 300 000 from B, were implanted in patients.
During these 20 years, mechanical failures that required removal of the prosthesis were recorded for both types, in particular 500 failures for A and 1000 for B. We can assume that failure events were independent from one another, and did not depend on the time after implant: there was just some defect that only became apparent in some prostheses after an essentially random time post-implant.

I don't know what kind of statistics could or should be calculated in such situation, e.g. to make a judgment on the quality of the prostheses, on the propensity of each to break down, on how effective it would be to monitor patients with one prosthesis or the other for possible failures, etc.

I could calculate a Poisson-related λ (average number of events per interval).
It would be 500/20 = 25 failures / year for A, and 1000/20 = 50 failures / year for B.
Then I could calculate the probability of a given number of failures each year (or over different periods) for each type of prosthesis.
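For instance, with those λ values the probability of any particular yearly count follows directly from the Poisson mass function; a minimal sketch in R (the counts 20, 50 and 30 below are arbitrary examples):

```r
lambda_A <- 500 / 20    # 25 failures/year for A
lambda_B <- 1000 / 20   # 50 failures/year for B

dpois(20, lambda = lambda_A)                      # P(exactly 20 failures of A in a year)
dpois(50, lambda = lambda_B)                      # P(exactly 50 failures of B in a year)
ppois(29, lambda = lambda_A, lower.tail = FALSE)  # P(30 or more failures of A in a year)
```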
However, I have quite a few doubts on this approach.
Isn't the number of events dependent on how many prostheses of each type are present in the population at each time? A bit like radioactive decay, but with variable mass of substance?
For instance, suppose that B was implanted mostly during the first 5 years of the 20-year period we're considering (say 50 000 / year for the first 5 years, and the remaining 50 000 at a rate of ~3300 / year for the next 15 years). Then I would expect that the number of failures was not the same each year, but varied all the time, even day by day as new implants were made and some of them failed and got replaced by a new type.
So isn't my 20-year-averaged Poisson λ ineffective in telling me how many failures I can expect in the future, if I don't consider the dependency of the number of failures on the number of prostheses?
Is there any other theory that would better account for this?

Then, concerning the quality of the prostheses: purely looking at the number of failures seems to say that A is better than B, because historically there have been fewer failures for A than for B.
However, if we divide that by the number of prostheses, we reach the opposite conclusion, because 500/100000 = 0.5% > 1000/300000 = 0.33%.
What I have a hard time figuring out is what these numbers mean - if they mean anything at all.
If I want to know the quality of a mass-produced object, I take a sample of N of them, do some measurements or tests, collect the data, and I can do all sorts of nice statistics, e.g. if n pieces out of N are defective, I can estimate what proportion of them would be defective in the whole population, with its standard error, and thus compare different manufacturers of the same object.
Here instead I don't have any control on the sample I'm observing: I only know the total number of prostheses 'active' at each time, and I observe that at random times some of them fail, with each failure going to add to the count.
But indeed, these events are random. I am not taking 100 patients and measuring directly the quality of their prosthesis, to make a nice table with N and n, split by manufacturer.
So what is the meaning of 0.5% and 0.33% above? Is it an estimate of the proportion of defective prostheses of each type? But how would that make sense, considering that if I had taken the same count at a later time I would have most likely found a larger number of failures for both brands?
How can we combine the element of number of objects with the element of time and with the randomness of the observation of the failure event, into metrics of the quality of these objects and equations that allow us to predict the likelihood of future events?
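For what it's worth, the two crude proportions can at least be compared formally, treating each implanted prosthesis as a single trial and ignoring the time element raised above; a minimal sketch in R (the prosthesis-year exposures in the second test are invented, only to show what a rate comparison would look like):

```r
# Crude comparison of failure proportions (ignores time at risk):
# 500 failures out of 100000 (A) vs 1000 failures out of 300000 (B)
prop.test(x = c(500, 1000), n = c(100000, 300000))

# If the true exposure (prosthesis-years at risk) were known for each brand,
# a Poisson rate comparison would be more appropriate; exposures are invented:
poisson.test(x = c(500, 1000), T = c(900000, 2500000))
```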

If you can suggest how I should proceed, I would appreciate it.
Thanks!
L
 
  • #2
I'm not clear on what data you have and don't have; can you provide an example of the raw data you have for some of these events?
 
  • #3
I can't speak much to the math, but here's how I would present the data if I had the number of failures for each company by month. The timechart below uses the attached file 'failures.txt', which contains randomly generated failures by month (500 for Company A and 1000 for Company B), and was created with Splunk, a machine-data analytics tool.

Note 1: The failures for Company A are weighted x3 compared to Company B.
Note 2: The obvious correlation between the two companies has to do with generating the example data from the same RAND() column.

[timechart.jpg: monthly failure counts for Company A and Company B]
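For anyone without Splunk, comparable example data can be generated directly; a rough sketch in R (this is not the original RAND()-based method, just an illustration that spreads the same totals over 240 months):

```r
set.seed(42)

# Spread 500 failures (Company A) and 1000 failures (Company B)
# uniformly at random over 240 months (20 years)
months <- seq(as.Date("1997-01-01"), by = "month", length.out = 240)
fail_A <- as.vector(rmultinom(1, size = 500,  prob = rep(1/240, 240)))
fail_B <- as.vector(rmultinom(1, size = 1000, prob = rep(1/240, 240)))
failures <- data.frame(month = months, A = fail_A, B = fail_B)
head(failures)
```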
 

Attachments

  • failures.txt
  • search.txt
  • #4
Thank you @stoomart, this is very interesting.
I don't actually have data; this is a theoretical problem (for now). Your approach of generating simulated data can be useful for studying this more pragmatically.
My question is how to make a judgment on the quality of prosthesis A compared to B (e.g. an estimate of what percentage of prostheses of each brand is defective), based on the recorded failures, the number of prostheses that are 'active' in patients at each time, and the propensity to fail (e.g. the rate of failure per period per implant).
In fact, as the number of failures is fairly constant on average in your data, I would expect this to correspond to a situation where the number of 'active' prostheses is also more or less constant. In my example I was thinking more of a case where the number of 'active' prostheses increases over time, not necessarily linearly, so I would expect the number of failures to increase with time as well.

Later I thought a bit more about this, and I think it may be described by a system of differential equations (or at least recurrence equations) accounting for the variation in the number of prostheses of each brand (and, within each brand, defective or not defective) as a function of: the rate of implanting new ones, the rate of patient drop-out (deaths etc.), and the rate of failure (for defective ones only). This would be a non-linear system of 4 equations, which I probably shouldn't try to solve analytically. If I had simulated data, however, I could try fitting the equations.
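One possible concrete form, for a single brand (a sketch only, not necessarily the exact system the poster has in mind; here p is the fraction of defective units, i(t) the implant rate, μ the patient drop-out rate, φ the failure rate of defective implants, G and D the numbers of active non-defective and defective implants, and F the cumulative observed failures):

$$\frac{dG}{dt} = (1 - p)\, i(t) - \mu G$$
$$\frac{dD}{dt} = p\, i(t) - \mu D - \varphi D$$
$$\frac{dF}{dt} = \varphi D$$

with an analogous set of equations for the other brand.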
Not as easy as I thought...
 
  • #5
lavoisier said:
Not as easy as I thought...
Not easy, but it seems possible. I think the key is having data that allow you to correctly calculate the company weights for each interval; something like 'number_sold' and 'sales_began' should work. I'm not sure tracking the number of 'active' prostheses will be too helpful; it seems it would be terribly difficult, if not impossible, to maintain accurate data. Definitely an interesting problem to consider.
 
  • #6
My advice is to build a Monte Carlo simulation of it. Even if you think you have a theoretical solution, you should check it against a simulation. These things quickly become too complicated for theoretical analysis. Your problem, with the changing number of remaining devices and a mixed population, will probably make the analysis difficult.
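For what a Monte Carlo version might look like, here is a rough sketch in R; the defective fraction, failure rate, drop-out rate, and the exponential time-to-failure assumption are all invented for illustration, not taken from the thread:

```r
set.seed(1)

# Simulate one brand: each implant is defective with probability p;
# defective implants fail after an Exponential(rate = phi) time,
# and patients drop out after an Exponential(rate = mu) time.
simulate_brand <- function(n_implants, years, p, phi, mu) {
  implant_time <- runif(n_implants, 0, years)          # when each prosthesis is implanted
  defective    <- rbinom(n_implants, 1, p) == 1
  time_to_fail <- ifelse(defective, rexp(n_implants, phi), Inf)
  time_to_drop <- rexp(n_implants, mu)
  # a failure is observed only if it happens before drop-out and before the end of follow-up
  fail_time <- implant_time + time_to_fail
  observed  <- defective & (time_to_fail < time_to_drop) & (fail_time < years)
  sum(observed)
}

# Replicate many times to get the distribution of observed failure counts
fails_A <- replicate(1000, simulate_brand(100000, 20, p = 0.006, phi = 0.1, mu = 0.02))
summary(fails_A)
```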
 
  • #7
Thank you @FactChecker; I was indeed not keen on attempting a symbolic solution, and I don't even think it's possible in this case.
Funnily enough, another problem came up at work that will probably require a numerical solution of a system of differential equations in R, which I have never tried before (I have done it with other software, which essentially took appropriately small dt intervals and solved the equations iteratively as if they were recurrence relations); so it will be interesting...
As for Monte Carlo, my boss is a fan of that method; I will have to find out how that works in practice.
 
  • #8
Found an excellent tool to do this in R: deSolve. Not only can it simulate the time course for many types of differential and even difference equations, given the parameters, but it can also fit experimental data (i.e. find the best estimates for the parameters). I'm going to try it ASAP.
There's also sysBio, but it's in development at the moment. I can write the equations, no problem.
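As a starting point, a minimal deSolve simulation of the single-brand model sketched above might look like this (parameter names and values are invented for illustration; the data-fitting step is not shown):

```r
library(deSolve)

# Single-brand model: G = active non-defective implants, D = active defective
# implants, Fcum = cumulative observed failures (all values illustrative)
model <- function(t, state, parms) {
  with(as.list(c(state, parms)), {
    dG    <- (1 - p) * implant_rate - mu * G
    dD    <- p * implant_rate - mu * D - phi * D
    dFcum <- phi * D
    list(c(dG, dD, dFcum))
  })
}

state <- c(G = 0, D = 0, Fcum = 0)
parms <- c(implant_rate = 5000,  # new implants per year
           p   = 0.01,           # fraction of defective units
           mu  = 0.02,           # patient drop-out rate per year
           phi = 0.05)           # failure rate of defective units per year
times <- seq(0, 20, by = 0.1)

out <- ode(y = state, times = times, func = model, parms = parms)
head(out)
```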
 

FAQ: Is this a Poisson distribution problem?

1. What is a Poisson distribution?

A Poisson distribution is a probability distribution that is used to model the number of occurrences of a specific event within a specific time or space, assuming the events occur independently at a constant rate.
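In symbols, if events occur independently at a constant average rate λ per interval, the probability of observing exactly k events in one interval is

$$P(X = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}, \qquad k = 0, 1, 2, \dots$$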

2. How do I know if my data follows a Poisson distribution?

To check whether your data follow a Poisson distribution, first compare the sample mean and variance: for a Poisson distribution they should be approximately equal. You can also plot a histogram of the observed counts against the Poisson probabilities computed from the sample mean, and perform a goodness-of-fit test such as the chi-square test to compare your data to the expected Poisson frequencies.
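A rough sketch of such a check in R (the data here are themselves simulated, just to make the example self-contained):

```r
counts <- rpois(200, lambda = 4)                  # stand-in for real count data
c(mean = mean(counts), variance = var(counts))    # should be roughly equal

lambda_hat <- mean(counts)                        # the Poisson MLE is the sample mean
observed   <- table(factor(counts, levels = 0:max(counts)))
expected   <- length(counts) * dpois(0:max(counts), lambda_hat)

# Chi-square goodness-of-fit test; probabilities are rescaled to sum to 1.
# (Strictly, sparse cells should be pooled and one degree of freedom dropped
# because lambda was estimated from the data; this is just a quick check.)
chisq.test(x = as.numeric(observed), p = expected / sum(expected))
```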

3. When should I use a Poisson distribution?

A Poisson distribution is commonly used when analyzing count data, such as the number of customers arriving at a store in a given time period, the number of emails received in a day, or the number of accidents occurring on a road in a year. It is also appropriate when the events occur independently and at a constant rate.

4. What are the assumptions of a Poisson distribution?

The main assumptions of a Poisson distribution are that the events occur independently, at a constant rate, and that the probability of an event occurring in one time interval is the same as in any other time interval. Additionally, the events should be rare, meaning that the probability of multiple events occurring at the same time is low.

5. Can I use a Poisson distribution for non-integer data?

No, a Poisson distribution is only applicable for discrete data, meaning that the data can only take on whole number values. If your data is continuous, you may need to use a different distribution such as the normal distribution.
