How can I normalize integer data for a better fit to a normal distribution?

In summary, the original poster has bus inter-arrival times recorded only to the whole minute and asks how to fit a continuous distribution, such as the normal, to this integer data. The replies suggest standardizing with z-scores, treating each recorded value as an interval of possible true times, and fitting the discrete distribution that rounding implies. Several replies also question whether a normal distribution suits inter-arrival times at all, since the exponential distribution is the usual model.
  • #1
Mark J.
Hi
I have a set of observed time intervals, recorded as integers such as 4, 5, ...
I want to fit a normal distribution to this data.
Since the normal distribution is continuous, is there any normalization process I can use to change the data from integers to real numbers?

Regards
 
  • #2
Mark J. said:
Hi
Since the normal distribution is continuous, is there any normalization process I can use to change the data from integers to real numbers?

If the meaning of your question is "Is there a way to fit a normal distribution to discrete data that can be proven correct or optimal by mathematics", the answer is no, not without more information about what the data is.

Have you tried simply converting each integer k to the corresponding "z-score" [itex] z = \frac{k - \mu}{\sigma} [/itex], where [itex] \mu [/itex] is the mean of the data and [itex] \sigma [/itex] is the standard deviation of the data (or the sample standard deviation, which divides by n - 1 instead of n)?
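For what it's worth, the z-score conversion can be sketched in a few lines of Python. This is only an illustration: the function name `z_scores` and the sample values in `waits` are made up, and the sample standard deviation (n - 1 denominator) is one of the two choices mentioned above. Note that standardizing only shifts and rescales the data, so the result still takes just as many distinct values as the original integers.

```python
from statistics import mean, stdev


def z_scores(data):
    """Return (k - mean) / standard deviation for every observation k."""
    mu = mean(data)
    sigma = stdev(data)  # sample standard deviation (n - 1 denominator)
    return [(k - mu) / sigma for k in data]


# Hypothetical inter-arrival times in whole minutes (not real data).
waits = [4, 5, 5, 6, 7, 5, 4, 6]
z = z_scores(waits)

# The standardized values have mean 0 and standard deviation 1,
# but they are still discrete: one distinct z per distinct integer.
print(sorted(set(z)))
```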

If you know that the discrete data comes from measurements of a continuous quantity that are rounded off, you might be able to do a process that essentially "smears out" each discrete data point into a distribution of possible continuous data points. Then you can try to fit a normal distribution to the superposition of these distributions. This is a fairly sophisticated technique and I don't have the details of how to do it fresh in my mind. I think the method is called "using convolution kernels".

Explain more about the data.
 
  • #3
The data are just the times between two arrivals of buses.
They are not measured to the second, just to the minute, for example 5 minutes, 4 minutes, etc.
I want to fit a normal distribution to this data, but since the values are discrete and not real numbers, I think they need some normalization.
Any advice please?
Regards
 
  • #4
I can't resist observing that the exponential distribution is the one most often used for "interarrival times". What motivates your choice of a normal distribution? Have you done some preliminary plotting of the data that suggests it is normally distributed?

You didn't say whether you had tried to fit a normal distribution by using the z-scores.

A simplistic way to represent data with roundoff error to the nearest minute would be to replace each observation of k minutes by a uniform distribution of "fake" observations on the interval [k-0.5, k+0.5] minutes. For example, if you represented the uniform distribution as data to the nearest second, an arrival at time 3 minutes would become a set of fake data points at each second from 180 - 30 to 180 + 29 seconds. Then you would take the mean and standard deviation of the fake data as the parameters of the normal distribution.

That is just my primitive oversimplification of a technique that I've seen used. I haven't looked at your other posts. Are you the poster who is writing a thesis? If you use this technique then you need to find the formal, correct, dignified way to go about it and find the proper terminology for it.

You also must distinguish between "roundoff" and "truncation" error. For example, if a true time of 3 minutes 48 seconds is truncated to a data point of 3 minutes (instead of being rounded to a data point of 4 minutes) then you should represent a datum of 3 minutes as a uniform distribution from 3 to 4 minutes.
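The fake observations do not actually have to be generated, because the mean and standard deviation of the smeared data have a simple closed form. Here is a minimal Python sketch, assuming continuous uniform smearing over a one-minute interval (rather than the second-by-second fake points in the example above, which give almost the same numbers). The function name `smeared_normal_params` and the `mode` argument are hypothetical, made up for this illustration.

```python
import math
from statistics import mean, pvariance


def smeared_normal_params(data, mode="round"):
    """Mean and standard deviation after replacing each integer k by a
    uniform distribution on a one-minute interval.

    mode="round":    interval [k - 0.5, k + 0.5] (rounded to nearest minute)
    mode="truncate": interval [k, k + 1]         (seconds were dropped)
    """
    if mode == "round":
        shift = 0.0  # the interval is centred on k
    elif mode == "truncate":
        shift = 0.5  # the interval is centred on k + 0.5
    else:
        raise ValueError("mode must be 'round' or 'truncate'")
    mu = mean(data) + shift
    # A uniform distribution of width 1 has variance 1/12, and it adds
    # to the variance of the recorded integers.
    var = pvariance(data) + 1.0 / 12.0
    return mu, math.sqrt(var)


# Hypothetical data in whole minutes (not real observations).
print(smeared_normal_params([4, 5, 5, 6], mode="round"))
print(smeared_normal_params([4, 5, 5, 6], mode="truncate"))
```

Rounding leaves the mean alone and only widens the spread, while truncation also moves the mean up by half a minute, which is why the round-versus-truncate question matters.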




 
  • #5
You asked this question in another thread. You really can't model inter-arrival times as a normal distribution because they are not normally distributed. You really do need to think about using the appropriate distribution. The number of arrivals in a given time is a counting process such as Poisson. Then the inter-arrival times that result from that counting process are exponentially distributed.
 
  • #6
Thank you, Stephen and Alan, I appreciate it.
I know from the literature that it should be modeled by a Poisson distribution if counting the number of arrivals, or an exponential distribution if studying inter-arrival times, but the histogram suggests a normal distribution to me.
Anyway, thank you for the advice.
Regards
 
  • #7
Hi Mark,

I was thinking that you might be frustrated that we just keep telling you that your distribution isn't normal. It occurred to me that maybe we're just not understanding what your data represents. Could you elaborate? I ask because you mention bus arrival times which are not random variables at all, they are scheduled events. So I thought maybe you could be looking at the error in actual arrival times or the actual time between two scheduled arrivals given that there is error in the actual arrival times of consecutive buses. These quantities could reasonably be normally distributed. Maybe if we better understood your data we could offer some help.
 
  • #8
If you absolutely want to test whether your data's histogram has a good fit to a normal distribution then you need to apply what is known as a Goodness-Of-Fit test.

For a normal distribution you can use the Shapiro-Wilk test, which gives you a statistic (and a p-value) that measures how well the data agree with normality.

But again I want to give a note of caution to take in what the above posters have said: you need to understand your data not only from a probabilistic or histogram point of view, but more importantly from a process point of view.

Understanding the underlying process, and the effect it has on the final distribution, is going to be a lot more useful than just trying to fit things to distributions. That is especially true if you are studying the process itself, as opposed to using the results for purely statistical purposes such as testing whether the errors of a regression are normally distributed.
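To make the goodness-of-fit idea concrete for data recorded in whole minutes, here is a rough Python sketch. It is not the Shapiro-Wilk test named above. It computes a Pearson chi-square statistic instead, chosen because it needs only the standard library and because its cells can be matched to the one-minute recording intervals. The function name, the plug-in use of the sample mean and standard deviation, and the example data are all assumptions made for illustration.

```python
from collections import Counter
from statistics import NormalDist, mean, stdev


def chi_square_rounded_normal(data, lo=-0.5, hi=0.5):
    """Pearson chi-square statistic for integer data against a normal
    distribution, where a true value x is recorded as k whenever
    k + lo <= x < k + hi (defaults: rounding to the nearest integer).

    Returns (statistic, approximate degrees of freedom).
    """
    n = len(data)
    dist = NormalDist(mean(data), stdev(data))  # plug-in estimates
    counts = Counter(data)
    k_min, k_max = min(data), max(data)
    stat = 0.0
    for k in range(k_min, k_max + 1):
        # The outermost cells absorb the tails so probabilities sum to 1.
        lower = 0.0 if k == k_min else dist.cdf(k + lo)
        upper = 1.0 if k == k_max else dist.cdf(k + hi)
        expected = n * (upper - lower)
        stat += (counts.get(k, 0) - expected) ** 2 / expected
    cells = k_max - k_min + 1
    # One degree of freedom lost for the total, two for the parameters.
    return stat, cells - 3


# Hypothetical data in whole minutes (not real observations).
sample = [3, 4, 4, 5, 5, 5, 5, 6, 6, 7]
stat, dof = chi_square_rounded_normal(sample)
print(stat, dof)
```

The statistic would then be compared with a chi-square table at roughly `dof` degrees of freedom. With a sample this small the expected counts are far too low for the approximation to be trusted, so treat the numbers only as a demonstration of the mechanics.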
 
  • #9
Yes thank you,
Actually I am using Chi-square, Kolmogorov and other tests.
The first one seems fine the others just do not fit.
I am attaching with the email one sample of data collected.
Of course, ideally the buses should be scheduled every 5 minutes, but in practice we see that there are errors, and that is what I am working on.
Thank you for suggestions.
Regards
 
  • #10
I thought that might be what you were actually doing. Now the statisticians can jump in.
 
  • #11
Mark J. said:
Yes thank you,
Actually I am using Chi-square, Kolmogorov and other tests.
The first one seems fine the others just do not fit.
I am attaching with the email one sample of data collected.
Of course, ideally the buses should be scheduled every 5 minutes, but in practice we see that there are errors, and that is what I am working on.
Thank you for suggestions.
Regards

I still don't get it. Am I the only one that doesn't?

Please describe exactly what you want. Then we can help.
 
  • #12
The idea is that the observed data are inter-arrival times.
The observers didn't record them to the exact second but kind of rounded them to minutes, for example 7.04, 7.08, etc.
Now, fitting this data to a commonly used distribution, for example the exponential, is impossible, since it is a continuous distribution.
How should I estimate the error for this data, or is there any other approach to this situation?
Regards
 
  • #13
Mark J. said:
Now, fitting this data to a commonly used distribution, for example the exponential, is impossible, since it is a continuous distribution.

It should be straightforward to fit a continuous distribution to the data once you have decided what criteria to use for a fit. A continuous density function f(x) implies a discrete distribution if the data was rounded. For example, the probability of observing x = 7 minutes given that data was rounded to the nearest minute is [itex] \int_{6.5}^{7.5} {f(x) dx} [/itex].

Fit the data to the discrete distribution that is implied by rounding the continuous distribution.
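One way to carry this out is maximum likelihood: choose the parameters of f so that the interval probabilities above make the observed integers as likely as possible. The Python sketch below does this for a normal density, using only the standard library and a very simple pattern search in place of a proper optimizer. The function names, the starting values, the `lo`/`hi` interval arguments and the example data are all assumptions for illustration, and the data must contain at least two distinct values.

```python
import math
from collections import Counter
from statistics import NormalDist, mean, stdev


def rounded_log_likelihood(mu, sigma, counts, lo=-0.5, hi=0.5):
    """Log-likelihood of integer counts when a Normal(mu, sigma) value x
    is recorded as k whenever k + lo <= x < k + hi."""
    dist = NormalDist(mu, sigma)
    total = 0.0
    for k, n in counts.items():
        p = dist.cdf(k + hi) - dist.cdf(k + lo)
        total += n * math.log(max(p, 1e-300))  # guard against log(0)
    return total


def fit_rounded_normal(data, lo=-0.5, hi=0.5, tol=1e-6):
    """Maximize the rounded-data likelihood by a simple pattern search."""
    counts = Counter(data)
    mu, sigma = mean(data), stdev(data)  # starting point
    best = rounded_log_likelihood(mu, sigma, counts, lo, hi)
    step = sigma / 2
    while step > tol:
        improved = False
        for d_mu, d_sigma in ((step, 0), (-step, 0), (0, step), (0, -step)):
            cand_mu, cand_sigma = mu + d_mu, sigma + d_sigma
            if cand_sigma <= 0:
                continue  # the standard deviation must stay positive
            ll = rounded_log_likelihood(cand_mu, cand_sigma, counts, lo, hi)
            if ll > best:
                mu, sigma, best, improved = cand_mu, cand_sigma, ll, True
        if not improved:
            step /= 2  # no direction helped, so search more finely
    return mu, sigma, best


# Hypothetical data in whole minutes (not real observations).
sample = [3, 4, 4, 5, 5, 5, 5, 6, 6, 7]
mu_hat, sigma_hat, ll_hat = fit_rounded_normal(sample)
print(mu_hat, sigma_hat)
```

For truncated rather than rounded data, the same functions could be called with `lo=0.0, hi=1.0`. Swapping in another continuous distribution, such as the exponential, would only require replacing the CDF used inside `rounded_log_likelihood`.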

You have to define "kind of rounded" precisely. Did they round or truncate?
 
  • #14
Thank you for your explanation.
The data was rounded to the nearest whole minute (up or down), meaning that:

6.7 was taken as 7
4.3 was taken as 4

Regards
 

FAQ: How can I normalize integer data for a better fit to a normal distribution?

What is a normal distribution?

A normal distribution is a statistical distribution that represents a symmetrical bell-shaped curve, with the mean, median, and mode all being equal. It is commonly used in data analysis to describe the distribution of continuous variables.

Why is it important to have data that follows a normal distribution?

Many common statistical tests and models assume normally distributed data (or normally distributed errors), so data that follows a normal distribution can be analyzed with them directly. It also makes the data easier to interpret and compare, since about 95% of observations fall within two standard deviations of the mean.

How do you determine if data follows a normal distribution?

There are several ways to determine if data follows a normal distribution, such as visually inspecting a histogram or using statistical tests like the Shapiro-Wilk test or the Kolmogorov-Smirnov test. These tests compare the observed data with what a normal distribution would produce and report a p-value: the probability, assuming the data really is normal, of seeing a departure from normality at least as large as the one observed. A small p-value is evidence against normality; a large p-value does not prove that the data is normal.

Can data be transformed to follow a normal distribution?

Sometimes. Transformations such as the logarithm or the square root can bring skewed data closer to a normal distribution, but they do not guarantee normality, and they do not turn discrete data into continuous data. It is important to consider the purpose of the analysis and the implications of transforming the data before deciding to do so.

What are the potential consequences of using statistical tests that assume normality on non-normal data?

Using statistical tests that assume normality on non-normal data can lead to inaccurate conclusions and misinterpretation of results. It can also affect the validity and reliability of the findings. Therefore, it is important to assess the normality of data before selecting an appropriate statistical test.
