Can you combine two transition probability matrices?

In summary, the conversation discusses the possibility of combining two transition probability matrices into one in order to obtain a more accurate representation of a stochastic process. It is suggested that the resulting matrix can be obtained by taking the element-wise average of the two matrices, but it is important to ensure that the resulting matrix still adheres to the Kolmogorov axioms, specifically that each row sums to 1. The conversation also mentions the need for a deeper understanding of the process before attempting to combine the matrices. Additionally, it is recommended to create a conditional probability model based on factors such as the type of car being driven. Finally, the conversation suggests obtaining frequency information for each event and then dividing by the relevant total to calculate empirical probabilities.
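As a minimal sketch of the averaging idea (Python, with made-up 3-state matrices rather than anything from the thread), note that the element-wise average of two row-stochastic matrices is automatically row-stochastic:

```python
import numpy as np

# Two transition probability matrices over the same states (made-up 3-state example).
P1 = np.array([[0.7, 0.2, 0.1],
               [0.3, 0.4, 0.3],
               [0.2, 0.3, 0.5]])
P2 = np.array([[0.6, 0.3, 0.1],
               [0.1, 0.8, 0.1],
               [0.4, 0.4, 0.2]])

# Element-wise average; each row still sums to 1 because the average of
# two rows that each sum to 1 also sums to 1.
P = 0.5 * (P1 + P2)
assert np.allclose(P.sum(axis=1), 1.0)
print(P)
```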
  • #36
Okay, thank you for explaining that and for your help.
 
  • #37
Hi Chiro,

A bit off the topic, but just in relation to discrete Markov chains: is there a statistical test relating the sample size of the measurements to whether the transition probability matrix that you construct adequately represents the process that you are trying to model?
 
  • #38
There is a general area of statistics that concerns itself with sample size and power calculation in relation to a variety of statistical tests.

But with regard to probabilities it is a little different, because you are more interested in whether the distribution based on your sample is good enough to be used as some kind of population model, whereas most statistical tests are concerned with estimating specific parameters (like, say, the mean) or non-parametric quantities (i.e. distribution-free ones like the median).

One thing you can do is to treat your probabilities as proportions and then consider an interval that corresponds to that probability given a sample size.

So your probabilities are proportions, just like in a Bernoulli trial, and you can then look at the power and sample-size calculations for getting one particular proportion in some region right (or wrong) and then look at how that applies to the whole distribution and to the individual probabilities.
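As a rough illustration of the proportion view (a minimal Python sketch, using an assumed proportion and margin of error rather than values from the thread):

```python
from scipy.stats import norm

# Sample size needed to estimate a proportion p to within a margin of error E
# at 95% confidence, using the normal approximation (a rough planning figure).
p = 0.5            # assumed (worst-case) proportion
E = 0.05           # desired margin of error
z = norm.ppf(0.975)
n = (z ** 2) * p * (1 - p) / E ** 2
print(round(n))    # roughly 384 observations
```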
 
  • #39
chiro said:
Basically what you would have to do is break it up into a small number of intervals (which you have done) and then consider all the branches to get a complete set of conditional distributions.

So instead of making your conditional distribution based on a continuous variable, you make it based on a discrete one.

So in other words you restrict your parking and journey times to fit into "bins" and then you look at each conditional distribution for each of the branches.

For example if you allow the smallest time interval to be ten minutes: then you consider conditional distributions for total times in terms of lumps of these intervals.

So if you have n of these intervals, you will get 2^n branches. Some branches may have zero probabilities, but in general you will have 2^n individual branches corresponding to all the possibilities that you can take.

So an example might be P(Total Journey Time = 30 minutes| First 10 = Travel, Second 10 = Travel, Last Ten = Park) and any other attributes you need.

To count up total journeys, you basically sum up all the positive branches (i.e. when all the times you have a journey) and for parking you do the same for those.

Basically what this will look like is a dependent binomial variable, and what you do is estimate these probabilities from your sample. From this you will have a distribution for n intervals given a history of what you did and by considering whatever subset of these probabilities you wish, you can find things like the expectation.

Hello Chiro,

Can I ask you a couple of questions regarding this post?

To help visualise the problem, here is a sample of raw data.

https://dl.dropbox.com/u/54057365/All/rawdata.JPG

So the first thing is to bin the journey and parking times. So say I select 5 minutes as the bin size, I'll have 288 bins. This is a lot of bins but I presume 5 minutes is reasonable for short journeys and short parking times (trip to shop etc).

Apologies, I don't want to come across as stupid, but I don't fully understand what you mean by branches, and isn't 2^288 a lot of branches?

You provided an example

So an example might be P(Total Journey Time = 30 minutes| First 10 = Travel, Second 10 = Travel, Last Ten = Park) and any other attributes you need.

Could you explain what you mean by "travel" and "park", a car would be parked during a journey??

From the sample data above, I have "binned" the journeys with a journey time of 30-35 minutes.

https://dl.dropbox.com/u/54057365/All/rawdata1.JPG

Would it be possible to show me by way example what you mean by branches and an example of a probability calculation? I would really appreciate it and it would help me understand it better.

I just can't visualise how to calculate (or generate) a journey time or parking time now that we are assuming that they are dependent events. Assuming independence was easier!

Are we generating or calculating a journey time given the start time, distance, and the previous journey time and parking time?

I like the idea of calculating the probabilities first and then getting the expected values but I just can't grasp what probabilities to calculate and how to do it. I know basic stuff!

Thank you for your time, I know that I have asked a lot of questions on this forum and I am grateful for the help.

John
 
Last edited:
  • #40
So by branches, I mean that for each outcome in chronological order you have two options (i.e. park next or keep moving, based on what you're doing now).

Think of it like modelling whether a stock goes up or down given its current state and its complete history.

In terms of estimating probabilities and getting confidence intervals, what you are doing is estimating a probability distribution with n probabilities, and this is equivalent to estimating the parameters of a binomial if you only have two probabilities per branch point: park or keep moving.

If you can only park or keep moving, then there are only two choices, which means you only have one free probability. You can use the standard frequentist results to get the variance of the estimator, and you can also use software to calculate the power and sample size for a given significance level, amongst other things. Something like this:

http://biostat.mc.vanderbilt.edu/wiki/Main/PowerSampleSize

If you have a distribution with more than two outcomes, you will want to use the multinomial.

Basically, the way I am suggesting you start off is to consider the branch model and then simplify as much as possible without losing the actual nature of your model (i.e. don't make it so simple that its accuracy degrades to the point of being useless).

So with your binned data I'm a little confused, since you have both parking and journey data in a trip: I was under the impression you could only travel or be parked in one indivisible event, so maybe you could clarify what is going on.

The parameters like journey and parking times will be based on your branch distribution, but you can throw in some kind of residual term to take care of the finer details (like say if your branch size is 30 minutes, but you want to take into account variation to get probabilities for say 35 minutes or 46 minutes given that the time is between 30 and 60 minutes).

As for the distance, it would probably make sense for this to be a function of your actual times (even if it's an approximation), and if you had more data (like location: for example, whether you are in a busy city or a wide open space) then you would incorporate this as well.

So now we have to look at calculating a branch.

Basically, if you have the data divided up into the branches, then it's just a matter of getting the mean and variance of the Bernoulli data, and this gives the basic frequentist interval, for which you can use a normal approximation if your sample size is big enough.

So let's say you have a branch for the time t = [120, 150) minutes, where your branch size is 30 minutes. You will have quite a few possibilities leading up to this time level (if a 30-minute interval is assumed, you will have 2^4 = 16 possibilities), so you will have 16 different sets of data saying how frequently you park or move in those next 30 minutes.

You can calculate this stuff from your data to generate a table full of 1's and 0's and these are used to estimate your probabilities.

Now, as you have mentioned, this will be a lot of data, but if you have 30-minute intervals for 12 hours in total that's just under 17 million (2^24) probabilities for the entire distribution, which isn't actually too bad (a computer will calculate and deal with this data pretty quickly).

For fine grained stuff, as mentioned above you can add a distribution for getting values in that particular range.

With this technique you can adjust these coarse and fine-grained distributions (i.e. change the interval from 30 to 40 minutes but then change the fine-grained one to something with more complexity).
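As a minimal sketch of estimating one branch probability with a frequentist interval (Python, with made-up 0/1 outcomes, not the thread's data):

```python
import numpy as np

# 0/1 outcomes for one particular branch history (1 = kept moving, 0 = parked); made-up data.
outcomes = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1])

n = len(outcomes)
p_hat = outcomes.mean()                        # Bernoulli estimate for this branch
se = np.sqrt(p_hat * (1 - p_hat) / n)          # standard error of the estimator
ci = (p_hat - 1.96 * se, p_hat + 1.96 * se)    # normal-approximation 95% interval
print(p_hat, ci)
```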
 
  • #41
Hello Chiro,

Thank you for explaining this. I've spent the day reading examples of binomial distributions and this is a good method. I understand what you're suggesting, well, most of it :)

Just to clarify the issue with the data. Each row represents the travel for one car per day.

https://dl.dropbox.com/u/54057365/All/rawdata.JPG

So the journey time is the length of the journey (minutes) and the parking time (minutes) is the time between the end of journey 1 and start of journey 2. The 1st and 2nd rows in the sample only made 2 journeys that day, whereas the 3rd row had a 3rd journey. It had a 4th journey too but I left it out so it would fit on the screen.

Altogether I have 20,000 rows (days of travel).

In my mind I am treating a journey as an event, and naturally I am thinking that at the end of the event there is only one option, which is to park. Are you considering a journey as an event here? I'm not sure if I'm following you correctly.

So when you say bin the data, do you mean bin the journey times (for example sort the rows from small to large) work out the probabilities and then do the same for the parking times?

Should I keep the order of the journeys as they are (Journey 1, Journey 2, Journey 3, etc.) or just treat them as one list? Like this:

https://dl.dropbox.com/u/54057365/All/rawdatalist.JPG

and then bin them (by journey time first) like this (the journey time column is sorted)?

https://dl.dropbox.com/u/54057365/All/rawdatalistsorted.JPG

If you treat them as one list, won't you lose the ability to capture the relationships between successive journeys? For example, a 2nd journey could be the return leg of a trip and could be of similar length and time.

I have a couple of questions about the remainder of the post but if it is okay I'll wait until I understand these points before continuing.

Thanks for your time

John
 
Last edited:
  • #42
Basically, the way I envisage this is that if you have journey data, you calculate the branch data by recoding it so that each bin gets a 1 or a 0 depending on which binary outcome it falls into, and these indicators become your bins.

So let's say you have a journey that goes for 1 hour and 20 minutes and then goes into parking for an hour. If you broke up this part of the journey, you have 3 bins with journey and then another two bins (all half hour in length) marked as parking.

This would result in a record with 1 1 1 0 0 and then you can take this data and build all the conditional frequencies and estimate the parameter for the probability as well as the confidence interval.

So in terms of your conditional distributions you will have say something like P(X4|X1=a,X2=b,X3=c) and in this example, the above would be a realization if a,b,c all equal 1.

This is the simplest bin structure with a fixed interval size.

So these give all the conditional distributions for each bin (i.e. at some time you not only have a specific journey/parking history, but you can, for example, sum up the journey/parking times and do probability/statistics with them).

So with this you can then use a computer and get distributions for whatever you want (for example: you can find the distributions for all parking times in some interval given a history, basically whatever you want).

If you want finer control, you can add a distribution within each cell or you can have some kind of hierarchical scheme (i.e. with multiple levels), but that's going to be up to you.

Basically, more resolution equals more complexity, so if you need to add complexity, add it; but if you don't, then it's a good idea to review the model, and if its use is adequate then you know where you stand.
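As a minimal sketch of this recoding step (Python, with a hypothetical helper name to_bins and the same 1 h 20 min / 1 h example; each segment is rounded up to whole bins as a simplification):

```python
import math

def to_bins(segments, bin_minutes=30):
    """Convert (activity, duration in minutes) segments into a list of
    1/0 bin indicators: 1 = travelling, 0 = parked."""
    record = []
    for activity, minutes in segments:
        n_bins = math.ceil(minutes / bin_minutes)
        record.extend([1 if activity == "journey" else 0] * n_bins)
    return record

# 1 h 20 min of driving followed by 1 h parked -> [1, 1, 1, 0, 0]
print(to_bins([("journey", 80), ("parking", 60)]))
```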
 
  • #43
Hey,

oh I understand what you mean now regarding the bins. This is what I was thinking a few days ago but I wasn't sure if it was the same thing.

I've done this with some sample data. The bin sizes are 15 minutes.

https://dl.dropbox.com/u/54057365/All/sampletable.JPG

Would you mind me asking a couple of questions about the conditional frequencies and probabilities?

Would it be possible to show me an example of a conditional frequency and how to calculate it? Apologies if this is basic but I've never done something like this before.

For example, say you wanted to build the conditional frequency for parking time given a journey time of 15 minutes? Is this what you're suggesting for the conditional probabilities?

Do you work out the conditional frequency, then the probability and then the expected value?

So basically if you had 3 journeys you would repeat this process 3 times, calculating the expected journey time and parking time?

You suggest making the distances a function of journey times instead of generating distances?

Currently the copula function generates a start time in the morning, the number of trips and the total distance traveled during the day. I'm kinda keen to keep the copula function. I'm guessing that if I calculate the individual journey distances as a function of time then they probably won't add up to the total distance generated by the copula.

I'm nearly there, I think I'll be confident putting this into practice with large dataset once I know how to calculate the probabilities and expected values.

Thanks again your time

John
 
  • #44
If you have journey times and parking times (with their lengths) and you convert this data to the binned branch structure, then you will have a complete distribution for any kind of binned journey and parking history that is accurate to that bin size, and you can increase the accuracy by adding more data within each bin.

So you will have a tonne of entries with, say, n bins (n columns) and a 1 or 0 in each cell corresponding to whether that entry was used for parking or journey. Also, before I forget, another point to note is that if the bin size is too large you may have a lot of events going on in the one bin, and you have to think about that possibility.

So say you wanted a conditional distribution given all the information that you spend at least k units (i.e. the bins) parking. Then what you do is instead of using all the record information, you just select the records that meet that criteria.

Now if you have multiple data, you can form a sub-distribution for this data given the variation in this data-set as opposed to the whole data set.

So let's say you have a database and you say "I want a conditional distribution that represents the constraint that you must park in at least three bins". You then use your database or statistical software to return this data (and this is why using something like SQL is useful if you are doing complicated stuff that your statistical software or Excel can't do easily) and you get your constrained data.

Then just like any distribution, you find the number of unique entries and you give it a distribution, and this new distribution is your conditional distribution where you have P(X|There is at least three binned parking slots).

If you calculate the probabilities using the frequencies in this new subset of data you will get a distribution but the state-space is now this subset and not the whole thing.

So basically all the process really is, is starting with a complete set of data that represents the global distribution of data and then getting the right slice of that data and calculating its distribution.

Each conditional distribution refers to a particular slice and that distribution becomes its own random variable which has events that are a subset of the universal random variable that describes all of your data.

You can then do all the probability stuff, expectations, variances whatever but the only thing that has changed is the state-space: conditional ones always look at a subset of the more general state-space and that's all that is different.
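As a minimal sketch of this slice-then-count idea (Python, with made-up binary records rather than the thread's data):

```python
from collections import Counter

# Each record is one day recoded into 1/0 bins (1 = travelling, 0 = parked); made-up data.
records = [
    (1, 1, 0, 0, 0),
    (1, 0, 0, 0, 1),
    (1, 1, 1, 0, 0),
    (1, 1, 0, 0, 0),
]

# The "slice": keep only the records satisfying the condition (at least three parked bins).
subset = [r for r in records if r.count(0) >= 3]

# The conditional distribution is just the empirical distribution over that slice.
counts = Counter(subset)
cond_dist = {rec: c / len(subset) for rec, c in counts.items()}
print(cond_dist)
```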
 
  • #45
Thanks again,

I'm going to use bin sizes of 5 minutes to get a good level of detail and to ensure that events are not being missed.

Could we just do one example? just to be sure. This sample data has bin size of 15 minutes.

https://dl.dropbox.com/u/54057365/All/sampletable.JPG

So say we take your example:

P(X|There is at least three binned parking slots i.e 45 minutes).

Then this data would be returned from the database.

https://dl.dropbox.com/u/54057365/All/examplesample.JPG

Would you mind demonstrating how you would calculate the probability and the expected value using the frequencies? I understand that this is very basic, but it's all new to me and it would be really helpful to see an example.

One last question: do you think that the time of day needs to be factored into the problem, or is it already factored in, in a way, because ultimately you are working out the parking times and journey times given one or the other?

Thanks for your time
 
Last edited by a moderator:
  • #46
So in your example, you have four records.

Now get all the different possibilities that exist (i.e. all the unique records) and these become your possibilities (i.e. your events) in your probability space.

In this case it seems that P(Bin_n = 1) = 1 for n > 3 (i.e. 4th bin and after) so this implies P(Bin_n = 0) = 0 for those same bins.

You have empty cells so I don't know how you handle them, but you need to handle those.

Then to get probabilities you simply take all the unique records and find the frequency of those in comparison to the others (just like you do with a normal set of unconditional data).

So you get those unique records and then you start to see where the variation is: I showed in the above example that for this data set there is no variation for n > 3 bins, but in general you should see lots of variation.

Now once you have your records you find out where the variation is and these become random variables in your joint distribution.

So to find the frequencies you look at what the variables are and you get the computer to get the frequencies for a particular attribute. So if you want to find P(Attribute1 = 1) then you return the number of times this occurs and divide it by the total number in the database and that's your frequency.

It's up to you to decide what an event is and what a random variable corresponds to: it can be an atomic one (can't be divided up any further) or a conglomerate one (made up of things that can be divided). But basically the way to get probabilities is to fix an event definition, search for how many times that event occurs in your total data set, divide that by the total number of items in the data set, and that is your probability for that event.
 
  • #47
Hello Chiro,

I was talking to you about a Markov chain model of vehicle velocities a few months ago. I'm making the model, but I was wondering if you could comment on something.

You have probably forgotten so just to recap first.

I am recording the velocity of a car every second as it makes journeys. I have about 1000 journeys in total. I created a transition probability matrix of the probabilities of transitioning from one velocity to another. I intend to use the model to generate a synthetic velocity-time profile for a car.

So here is an example of an actual velocity profile for a car.

https://dl.dropbox.com/u/54057365/All/vel1.JPG

and this is an example of a velocity profile generated from the markov chain.

https://dl.dropbox.com/u/54057365/All/vel2.JPG

Visually you will notice that the actual profile is much smoother than the one generated by the markov chain. You can see in the actual profile that when a car is accelerating to a high velocity the profile is smooth as the velocity increases, but in the generated cycle the velocity is fluctuating as if it is slowing down and speeding up while accelerating to a high speed.

When you compute summary statistics for the profiles such as average velocity and standard deviation of the velocity, they appear similar in nature. But I'm curious about the jagged appearance of the generated profile.

Could you offer any insight as to what is happening?

Appreciate your comments.

John
 
  • #48
With regard to the jaggedness, typically this sort of thing is treated as a time series, and there are all kinds of analyses and techniques for that, including ways to "smooth" the data.

The simplest one is known as Moving Average:

http://en.wikipedia.org/wiki/Moving_average

There are, of course, many more but the MA is a basic one.

As for the summary statistics, you might want to specify the constraints that you want to calculate (for example specific periods, conditional constraints, etc).

If you don't add constraints, then you just use integration; for this you can "bin" the values into small intervals with associated probabilities and then use numerical integration techniques to get both the mean and the variance of this approximated distribution.

Depending on the technique, the numerical approach can smooth out the values so that you get a kind of interpolation happening rather than an average or some other approximation.

As an example, let's say you bin the velocities into bins of width 0.05, and the probability for the bin corresponding to [1, 1.05) is 0.03. Then the mid-point approximation to that bin's contribution to the integral of x f(x) dx is ((1 + 1.05)/2) * 0.03, since the bin probability already accounts for the factor f(x) * (1.05 - 1).

In general, you could use any quadrature scheme you wish and depending on the requirements and the data, some might be better than others: just depends on the problem and your needs.

There are other techniques to calculate the mean of a general signal or function not given any probability information if you want to pursue this as well, and you can use this formulation to get the variance of a signal.

By the above, I mean doing it with just the v(t) data instead of with a PDF of v(t) against v(t), like you do with the usual x f(x) dx.

I can dig up the thread if you want.
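As a minimal sketch of both ideas (Python, with made-up velocity samples and stand-in bin probabilities, not the thread's data):

```python
import numpy as np

# Moving-average smoothing of a generated velocity trace (3-sample window).
velocity = np.array([0, 3, 7, 6, 12, 10, 15, 14, 20, 18, 22, 21], dtype=float)
smoothed = np.convolve(velocity, np.ones(3) / 3, mode="valid")

# Mid-point approximation of the mean and variance from a binned velocity distribution.
edges = np.arange(0.0, 2.05, 0.05)                                    # bins of width 0.05 (assumed range)
probs = np.random.default_rng(1).dirichlet(np.ones(len(edges) - 1))   # stand-in bin probabilities
mids = (edges[:-1] + edges[1:]) / 2
mean_v = np.sum(mids * probs)        # e.g. the [1, 1.05) bin contributes 1.025 * P(bin)
var_v = np.sum((mids - mean_v) ** 2 * probs)
print(smoothed, mean_v, var_v)
```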
 
  • #49
Thanks very much for your help Chiro. I was thinking of smoothing the data, I will investigate the different methods. I'll try MA and I've used Kernel smoothing before so I'll try that too.
 
  • #50
Hi Chiro,

I have been working away on the travel pattern model that you helped me with a lot a few weeks ago. I need to bin the data at 5-minute intervals. It's important that the output journey start/stop times have a resolution of 5 minutes, because that is important in electricity grid modelling. A problem I have been having is that I have no observations in the dataset for a lot of the conditional probabilities.

It was suggested to me that a Bayesian probability approach could overcome this.

I don't know much about Bayesian probability, but I have been reading up on it over the last week; however, I have not been able to find an example of what I am trying to do.

Say for example I have the starting times of journeys in the morning. I binned the data into 15 minute bins so I have 96 bins in total (4*24=96). So for example a journey start time of 08:05 am would be in bin number 29.

As an example here is the data for bins numbers 28-50 (8am until 12.30pm).

https://dl.dropbox.com/u/54057365/All/bin.JPG

I've calculated the frequency density of the bins in the last column.

Would you know how I could do the following (this was suggested to me):

Taking a Dirichlet prior distribution over the density of each bin for a multinomial model, you estimate the parameters. This way you get a non-zero probability for each bin. Each parameter is basically some prior parameter plus the frequency of the data in that bin.

Would you know if this could be done in excel?

Appreciate your comments.

Regards

John
 
Last edited by a moderator:
  • #51
Bayesian probability is a fancy way of saying that your parameters have a distribution: in other words, your parameters are a random variable. That's all it is.

It just makes the highest generalization possible and it is useful not just in an abstract way, but in a very applied way as well.

I don't know how you can do the Dirichlet Prior calculations in Excel but you could always create the function from the definition by using either a VBA routine or a single formula entry in the spread-sheet.

Here is the Wikipedia page with the Dirichlet PDF:

http://en.wikipedia.org/wiki/Dirichlet_distribution

If you write a VBA function or some other similar routine to calculate the above, then you can calculate probabilities and moments (like expectation and variance).
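As a minimal sketch of that prior-plus-count idea (Python, with made-up bin counts and an assumed symmetric prior):

```python
import numpy as np

# Observed counts per start-time bin (made-up small sample with empty bins).
counts = np.array([0, 2, 5, 0, 1, 7, 3, 0])

# Symmetric Dirichlet prior: every bin gets a pseudo-count alpha.
alpha = 1.0  # alpha = 1 is a uniform prior; smaller values lean harder on the data

# Posterior mean of the multinomial bin probabilities under the Dirichlet prior:
# (alpha + count_i) / (number_of_bins * alpha + total_count); no bin is ever zero.
posterior_mean = (alpha + counts) / (len(counts) * alpha + counts.sum())
print(posterior_mean, posterior_mean.sum())
```

The same arithmetic is a single formula per bin in a spreadsheet: (alpha + count) / (number_of_bins * alpha + total_count).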
 
  • #52
Hello Chiro,

Could I ask you a question / get your opinion about modelling with a copula function?

I have 3 variables:

1. Time of departure from the home in the morning
2. Total distance traveled in the day
3. The number of journeys made in the day

This is their correlation matrix

https://dl.dropbox.com/u/54057365/All/matrix.JPG
I've seen technical papers modelling variables with correlations of 0.3-0.4.

Variables 1 and 2 are continuous and variable 3 is discrete (1-11 journeys).

My question is, can you model discrete and continuous data with a copula?

I fitted an empirical copula to the data as I found parametric copulas were not modelling the correlation structure well.

Here is a graph of the fitted copula and raw data for variables 2 and 3. The raw data is red and the fitted copula data is blue.

https://dl.dropbox.com/u/54057365/All/copula.JPG

You'll notice 11 "lines" of red data points corresponding to the 11 discrete journey numbers.

My question is would you consider the correlation structure to be well modeled here?

You'll notice how there are no blue data points in the top left (a good thing), but there are blue data points between the red lines.

Appreciate your comments

Thanks

John
 
Last edited by a moderator:
  • #53
In terms of your copula function, I think you'll need to repost the specifics for the Copula function and its associated constraints: it's been a little while since we talked about it (in this and similar threads) so if you could post your query with more specific constraints I'll try and address those.
 
  • #54
Hi,

I tried fitting a Normal copula to the data, but it was not modelling the correlation structure well between the discrete variable (number of journeys) and the 2 continuous variables mentioned above. The Normal copula did, however, model the correlation between the 2 continuous variables well.

Here is the output of the Normal copula for the discrete variable (number of journeys) and a continuous variable (total distance traveled in the day). The correlation structure is not well modeled.

Note the red data are the raw data points and blue points are the generated data by the copula.

https://dl.dropbox.com/u/54057365/All/empirical.JPG


So then I fitted an empirical copula, described here
http://www.vosesoftware.com/ModelRiskHelp/index.htm#Help_on_ModelRisk/Fitting/Vose_Empirical_Copula.htm

Here is a graph of the fitted copula and raw data for the same 2 variables as above. The raw data is red and the fitted copula data is blue.

https://dl.dropbox.com/u/54057365/All/copula.JPG

You can see that it models the correlation structure better.

My main question is, can you model discrete and continuous data within the same copula?

My other question is would you consider the correlation structure between the 2 variables to be well modeled here given their Pearson's correlation is 0.38 from the matrix?

Thanks
 
  • #55
You should be able to do this provided it treats the distribution definitions in the right manner in the algorithm.

Mixed distributions occur quite frequently (particularly in insurance statistics), and if you want to, say, model a multivariate distribution where one variable is discrete and the other continuous, then you do the same sort of thing as with a normal Cartesian product.

As an example, let's say you have a normal distribution and a discrete uniform on five values: then the Cartesian product of these sets would look like a "staircase normal", where you would have five normal distributions side by side, each being one slice for the appropriate discrete event.

Provided your algorithm has treated the data correctly, then this won't be a problem at all.

In fact if this is done correctly, all later statistical techniques should work properly.

You would have to check that the actual algorithm treats the (multivariate) distribution function as it should, whether it deals with mixed distributions (continuous and discrete in the same distribution) or with distributions where you have a mixture of discrete and continuous random variables permuted in all possible combinations.
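As a minimal sketch of sampling a mixed pair through a Gaussian copula (Python; the correlation, marginals and probability vector below are illustrative assumptions, not fitted values from the thread):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Assumed correlation between "total distance" (continuous) and "number of journeys" (discrete).
rho = 0.38
cov = np.array([[1.0, rho], [rho, 1.0]])

# 1) Draw correlated standard normals and map them to uniforms (the copula itself).
z = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=5000)
u = stats.norm.cdf(z)

# 2) Push the uniforms through each marginal's inverse CDF.
#    Continuous marginal: e.g. a lognormal distance in km (assumed shape).
distance = stats.lognorm(s=0.6, scale=20).ppf(u[:, 0])

#    Discrete marginal: 1-11 journeys with an assumed probability vector.
p = np.array([0.20, 0.25, 0.20, 0.12, 0.08, 0.05, 0.04, 0.03, 0.02, 0.006, 0.004])
p = p / p.sum()
journeys = np.searchsorted(np.cumsum(p), u[:, 1]) + 1  # step-function inverse CDF

print(distance[:5], journeys[:5])
```

The key point is that the discrete marginal is applied through its (step-function) inverse CDF, so the copula machinery is the same for both kinds of variables.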
 
  • #56
Hello Chiro,

Can I ask you a quick question? It is a bit unusual.

In my previous post I posted a graph of a fitted copula and raw data for 2 variables, number of journeys and distance travelled.

This was it:

https://dl.dropbox.com/u/54057365/All/copula.JPG

This is raw data on its own.
https://dl.dropbox.com/u/54057365/All/data.JPG

Would you be able to explain what makes the copula generate the values between the red "lines"? One might expect the generated blue data to be in "lines" also similar to the red raw data.

I think I know the answer myself: is it because one of the variables is continuous and those points between the lines actually do model the correlation structure? That is not a very good explanation though, is it?

Thanks

John
 
Last edited by a moderator:
  • #57
Hello Chiro,

I have a problem which I'm finding difficult to solve. Hopefully you could offer some insight.

I have a copula function that generates the total distance traveled in a day (e.g. 40 km) and the number of journeys (e.g. 4).

The question is how to calculate the distances of the individual journeys.

Originally, I was doing the following:

Sampling distance x1 from the distribution of journey distances made on days where the total distance traveled was 40 km:

f(x1 | Distances on days where total distance traveled = 40)

Then I sample distance x2 from f(x2 | Distances on days where total distance traveled = 40 - x1)

Then sample x3 from f(x3 | Distances on days where total distance traveled = 40 - x1 - x2)

Then x4 = total distance - x1 - x2 - x3

The problem with this is that journey distances don't "make sense".

The problem is that, for example, if you travel 2 km to the shop or to work, chances are that your next journey will be 2 km in order to return home. But this is not always the case; you could stop off along the way. For example x1 could be 5 km, x2 could be 2 km and then x3 could be 7 km (5 + 2).

Could you suggest a better approach? Would there be a way to look at the relationships between consecutive journey distances, and some way of sampling them? I have all the data.

Appreciate your comments

J
 
  • #58
For this you will need to consider what the distribution is for an individual journey given by the data.

So you will need to look at conditional expectation with regard to the expected journey for all possible journeys in a single day (you have mentioned four), and this is basically E[X] = E_Y[ E[X | Y] ], which is known as the law of total expectation

http://en.wikipedia.org/wiki/Law_of_total_expectation

So you are trying to find E[X] for all possible conditional information relative to the choice of Y (which is the number of possible journeys in one day given your data), and the formulas for this are just the formulas for expectation (and if this is data in an Excel spreadsheet then convert it to a binned PDF and use that formula).
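As a minimal sketch of the law of total expectation applied to data (Python, with made-up (journeys-per-day, journey distance) pairs):

```python
from collections import defaultdict

# Made-up (number of journeys that day, individual journey distance in km) pairs.
data = [(2, 5.0), (2, 5.5), (3, 2.0), (3, 2.5), (3, 7.0),
        (4, 1.0), (4, 3.0), (4, 4.0), (4, 12.0)]

groups = defaultdict(list)
for y, x in data:
    groups[y].append(x)

n = len(data)
# E[X] = sum over y of P(Y = y) * E[X | Y = y]   (law of total expectation)
e_x = sum((len(xs) / n) * (sum(xs) / len(xs)) for xs in groups.values())
print(e_x)  # the same as the plain mean distance over all journeys
```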
 
  • #59
Hello Chiro,

Could I ask you a question? You have been very helpful in the past.

I'm trying to compare the overall similarity of journeys based on some statistics, for example average velocity, acceleration, etc.

Each journey has for example 4 measurable attributes with equal weights.

I have some baseline statistics and some comparative statistics from other journeys.

The objective is to determine a measure of the how similar the other journeys are to the baseline journey.

Would you be able to suggest a suitable measure?

Can you take an average of the percentage differences?
Could you take the difference vector (journey1 - base) and use its norm?

Appreciate your comments

https://dl.dropbox.com/u/54057365/All/comp1.JPG
 
Last edited by a moderator:
  • #60
I would recommend a couple of things in this instance.

The first would involve a two-sample t-test, or one of its non-parametric forms, to test whether each pair of parameters (i.e. baseline vs other journey) shows evidence of a statistically significant difference.

You should look into techniques like the Bonferroni correction or other mechanisms used for multiple comparisons, where, for example, with four pairwise tests the significance level for each would be alpha/4.

The other thing I would recommend is doing a chi-square test (Pearson's chi-square goodness-of-fit) on the parameters, treating each attribute as a random variable.

I would personally start out looking at 2-sample t-tests and the non-parametric equivalents first.

I would also look at ANOVAs (again checking the non-parametric versions if you need to) to test whether all groups of journeys have the same parameter as the baseline.

So do the ANOVA first and then do the pairwise comparisons after that, while thinking about whether you should apply the Bonferroni correction to the alpha values (i.e. the significance level used to reject or retain the hypothesis that they are the same/different) for the multiple pairwise comparisons.
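As a minimal sketch of that workflow (Python with scipy; the velocity samples are made up, not the thread's journeys):

```python
from scipy import stats

# Made-up average-velocity samples (km/h) for the baseline and two other journey groups.
baseline = [42.1, 39.8, 44.0, 41.2, 40.5, 43.3]
journey1 = [41.0, 38.5, 43.2, 40.1, 39.9, 42.0]
journey2 = [55.2, 57.8, 54.1, 56.3, 58.0, 55.9]

# One-way ANOVA first: do the groups share a common mean?
f_stat, p_anova = stats.f_oneway(baseline, journey1, journey2)
print("ANOVA p =", round(p_anova, 4))

# Then pairwise two-sample t-tests against the baseline, with a
# Bonferroni-adjusted significance level alpha / (number of comparisons).
alpha = 0.05
comparisons = {"journey1": journey1, "journey2": journey2}
alpha_adj = alpha / len(comparisons)

for name, sample in comparisons.items():
    t_stat, p = stats.ttest_ind(baseline, sample)
    print(name, "p =", round(p, 4), "significant?", p < alpha_adj)
```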
 
  • #61
Hello Chiro,

Could I ask you a question? You have been very helpful in the past.

I am trying to quantify the difference between two discrete distributions. I have been reading online and there seem to be a few different ways, such as a Kolmogorov-Smirnov test and a chi-squared test.

My first question is which of these is the correct method for comparing the distributions below?

The distributions are discrete distributions with 24 bins.

My second question is that it's pretty obvious, looking at the distributions, that they will be statistically significantly different, but is there a method to quantify how different they are? I'm not sure, but a percentage or distance perhaps?

I've been told that if you use a two sample Kolmogorov-Smirnov test, a measure of how different the distributions are will be the p-value. Is that correct?

http://www.mathworks.co.uk/help/stats/kstest2.html

I appreciate your help and comments

Kind Regards

https://dl.dropbox.com/u/54057365/All/phy.JPG
 
Last edited by a moderator:
  • #62
What attribute specifically are you trying to see the difference in?

The chi-square test acts a lot like a 2-norm (think of Pythagoras' theorem) for an n-dimensional vector, in the way that you get an analog of "distance" between two vectors.

If you know some kind of attribute (even if it's qualitative, you can find a way to give it a quantitative description with further clarification), then you can mould a norm or a test statistic in that manner.
 
  • #63
Hi,

Well I developed a model which simulates car journeys. The distribution of the arrival times home in the evening simulated by the model is "different" than the actual distribution of the arrival times home observed in actual real world data. The model appears to be not that accurate.

What I would ideally like to say is that the distribution produced by the model is some percentage different from the real-world distribution.

Would a Chi squared or Kolmogorov-Smirnov test quantify the difference?

What would you recommend in this case?

Can these tests be used for discrete data? The times are rounded to the nearest hour.

What would you think of summing up the point-wise absolute values of the differences between the two distributions? Would that be a good idea?

abs(Data_bin1_model - Data_bin1_data) + abs(Data_bin2_model - Data_bin2_data) + ... + abs(Data_bin24_model - Data_bin24_data)

I'd prefer to use a statistical test if there was suitable available.

Thank you for your help.
 
Last edited:
  • #64
I think you will want to go with something like a Pearson Chi-square Goodness-Of-Fit test given what you have said above.
 
  • #65
Hi,

I'm really struggling with this. Is the p-value from the chi-squared test the percentage difference between the 2 distributions? Why did you choose the chi-squared test over the KS test?

Thank you
 
  • #66
It's not a percentage difference but a probability: p-value = P(X^2 > x), where x is the observed value of the X^2 test statistic.

Basically, the larger the deviation between the two distributions, the larger the test statistic, the smaller the p-value, and the smaller the chance that the two distributions are equal.
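As a minimal sketch of running the test (Python with scipy; the hourly counts below are made up, the expected counts are rescaled to the observed total as the test requires, and very small expected counts would ideally be pooled):

```python
import numpy as np
from scipy import stats

# Made-up hourly counts of arrivals home (24 bins): real-world data vs model output.
observed = np.array([1, 0, 0, 0, 0, 1, 2, 3, 5, 4, 3, 4, 5, 6, 8, 12, 20, 35, 30, 22, 15, 10, 6, 3])
model    = np.array([1, 1, 1, 1, 1, 1, 2, 2, 4, 5, 4, 4, 6, 7, 9, 14, 22, 30, 28, 24, 14, 9, 5, 4])

# Rescale the model counts so both sets have the same total, then run the goodness-of-fit test.
expected = model * observed.sum() / model.sum()
chi2, p = stats.chisquare(f_obs=observed, f_exp=expected)
print(chi2, p)  # larger chi2 (smaller p) = larger discrepancy between the two distributions
```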
 