Simple statistical analysis of a limited data set

In summary, the conversation revolves around analyzing trip duration data collected from a commute to and from work. The main factors of interest are start time, day-of-week, and route taken. The speaker is interested in understanding how best to analyze the data and whether normalizing one factor can help in examining another. With limited data points, it is suggested to group the data into weekday and weekend categories and analyze AM and PM separately for each route. The idea of normalizing the data is also proposed, but the reliability of this method may be affected by the low number of data points for alternate routes. It is also noted that data points with unusual circumstances, such as taking a detour, should be excluded from the analysis.
  • #1
DaveC426913
Gold Member
I should have studied statistics in school.

I have been collecting some trip data from my commute to and from work, and I want to determine what factors influence trip duration. There are multiple factors, which I cannot easily isolate.

The factors I'm most interested in are:
- duration of trip as a function of start time
- duration of trip as a function of day-of-week
- duration of trip as a function of route taken

What I want to do is to understand how best to analyze the data. For instance: if I want to examine a single factor, can I meaningfully 'normalize' the other factors?

Here is a snippet of the data I've collected.

Date    Day  AM/PM  Start  Finish  Temp  Route      Conditions  Duration

Feb 11  Th   A      0814   0857    -9    QEW        Cold Dry    43m
Feb 11  Th   P      1823   1859    -9    QEW        Cold Dry    36m
Feb 12  Fr   A      0813   0859    -18   QEW        Freezing    46m
Feb 12  Fr   P      1748   1820    -8    QEW        Cold        32m
Feb 19  Th   A      0811   0847    -16   Lakeshore  Dry         36m
Feb 24  Tu   A      0804   0849    -12   QEW        Dry         45m
Mar 10  Tu   A      0821   0904    +1    Lakeshore  Wet         43m
I've only got about 2 dozen entries, so it may be problematic to chop up the data into small sections and analyze each factor in isolation. Is there a way of averaging the data to make better use of it?

For example, when I'm analyzing duration as a function of day-of-week, the AM/PM parameter is a confounding factor. (Notice that PM times are significantly shorter than AM times, which could throw off my results.) But even if the absolute value of PM times is off, surely the trend is still there. Do I have to throw away all the PM data when analyzing AM data for day-of-week trends? Or can I somehow normalize the PM data to help me see overall trends over AM and PM?

This will be of particular importance in that I have very little data (< 6 entries) for alternate routes (Lakeshore).
 
  • #2
Haven't studied statistics much either but just a couple comments / suggestions

- you have at least 20 different situations (5 days × AM/PM × route) and 24 data points, so this may well prove inconclusive. Maybe a first step would be to assume all days are the same (unless you include weekends, then you need at least two categories, weekday/weekend).

Assuming only weekdays, a priori I would expect am and pm to belong to two different distributions and estimate them separately, and same for different routes so 6 different cases to estimate separately.

- what is the best route in the morning (compare average and stdev of driving times separately for each route, ignoring exact hour)

- for a given route in the morning, what is the best starting time (plot driving time as a function of starting time)

- may not work if best starting time is different for each route, but perhaps try to mix normalized times i.e.
Route 1 -> avg1, std1 of am route1 driving times -> t*=(t-avg1)/std1
Then plot the t*s from all routes against starting times
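The normalization described here can be sketched in Python; the route names match the thread but the durations below are made up for illustration:

```python
# Pool AM driving times from different routes by normalizing each route's
# times to z-scores: t* = (t - avg_route) / std_route, one transformation
# per route. The durations (minutes) are illustrative, not the real data.
from statistics import mean, stdev

route_times = {
    "QEW": [43, 46, 45, 44],
    "Lakeshore": [36, 43],
}

normalized = {}
for route, times in route_times.items():
    avg, std = mean(times), stdev(times)
    normalized[route] = [(t - avg) / std for t in times]

# The t* values from all routes can now be plotted together against
# starting time on a single chart.
```

Each route's t* values have mean 0 and sample standard deviation 1, so they are comparable across routes; with very few points per route, though, the estimated std is noisy, which limits how far the pooled plot can be trusted.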
 
  • #3
wabbit said:
assume all days are the same (unless you include weekends, then you need at least two categories, weekday/weekend).
Yep, weekdays only.

wabbit said:
Assuming only weekdays, a priori I would expect am and pm to belong to two different distributions and estimate them separately, and same for different routes so 6 different cases to estimate separately.
Yeah, agree. AM versus PM will have virtually no useful relation. Trying to force one would skew the data.

wabbit said:
- what is the best route in the morning (compare average and stdev of driving times separately for each route, ignoring exact hour)
Unfortunately, I have very little data on the alternate routes. My choice of route is not randomly selected; it is strongly biased toward QEW.

wabbit said:
- may not work if best starting time is different for each route, but perhaps try to mix normalized times i.e.
Route 1 -> avg1, std1 of am route1 driving times -> t*=(t-avg1)/std1
Then plot the t*s from all routes against starting times
This is what I'm wondering. Unfortunately, your shorthand formulae elude me.
 
  • #4
Ah sorry, ##t^*=\frac{t-avg_1^{AM}}{std_1^{AM}}## is the transformation applied to morning driving times t on route 1, t* the normalized time, ##avg_1^{AM}## the average driving time on route1 in the morning, ##std_1^{AM}## the standard deviation of same.
One transformation per sub-distribution, then mix normalized times.

But you say some routes have very few data points - in that case the above fails (or gives an unreliable result) but I think it just comes from trying to extract information that isn't there. At the extreme, if taking a new route one morning (say one that takes a detour or stops for an errand), the corresponding data point is worthless for estimating usual relations and should just be thrown out.*

* unless you have a model of the difference, e.g. if that single-occurrence route differs by a 5-minute stop, then you can just deduct those 5 minutes and include the point.
You could try this strategy more generally too, though for differences of driving times between routes I can't think of a simple relevant model beyond the one implicit in the transformation above, which is $$t=avg(route,1_{AM})+std(route,1_{AM})*t^*$$
 
  • #5
wabbit said:
Ah sorry, ##t^*=\frac{t-avg_1^{AM}}{std_1^{AM}}## is the transformation applied to morning driving times t on route 1, t* the normalized time, ##avg_1^{AM}## the average driving time on route1 in the morning, ##std_1^{AM}## the standard deviation of same.
One transformation per sub-distribution, then mix normalized times.
Wow. Er. I'll try to absorb that.

wabbit said:
But you say some routes have very few data points - in that case the above fails (or gives an unreliable result) but I think it just comes from trying to extract information that isn't there. At the extreme, if taking a new route one morning (say one that takes a detour or stops for an errand), the corresponding data point is worthless for estimating usual relations and should just be thrown out.*
Yah. I don't record data for days when I run errands. But days when the delays are reasonably attributable to traffic, such as detours or accidents, count, since they are a realistic factor in choosing a route.

wabbit said:
* unless you have a model of the difference, e.g. if that single-occurrence route differs by a 5-minute stop, then you can just deduct those 5 minutes and include the point.
Yes. Only once did I subtract out a known-duration traffic delay.

wabbit said:
You could try this strategy more generally too, though for differences of driving times between routes I can't think of a simple relevant model beyond the one implicit in the transformation above, which is $$t=avg(route,1_{AM})+std(route,1_{AM})*t^*$$
As above: Wow. Er.
 
  • #6
Well, here's my data to-date. And I've made some primitive graphs from it.
High priority:
Break out into separate AM and PM groups.
Generate average data and plot.
Lower priority:
Edit X-axis labels to pull from lookup table in upper right corner. (Or figure out how to group data)

excel.png
 
  • #7
Interesting. I would focus only on the AM data to start, PM is different and has few points.

There seems to be some variation by day of week. If this persists for AM, you might try computing average time separately for each day, and then look at the reduced times obtained by deducting from each drive the average time of the corresponding day.
 
  • #8
I think there's variation by start time - that's really the primary factor of interest - especially since it's directly under my control.
It's hard to see because the chart graphs AM and PM together across 24 hours, so the variance is too small to see.

I'd like to figure out how to filter the data used in the charts. I read a lot about filtering the data in the table, but that's problematic if I want to show two filters on the same data.

What I'd like is to apply some sort of conditional on the datasource equation for the chart (e.g. Y-axis: $K2-$K32 WHERE $D='A') while leaving the raw data alone. Then I can make another chart WHERE $D='P'.
 
  • #9
In Excel at least, and I think in other spreadsheet software too, you can select the data with column headings and apply a filter there; it hides some data (also in the charts) but doesn't lose it, and you can edit the filter to look at different parts of the data. Is this what you're looking for? There are also conditional functions such as SUMIF(), etc., which are useful for this sort of thing.
 
  • #10
I can apply that to the data, yes, but that's interactive. What I was hoping to do was to capture multiple static charts, all from the same data. I can't post an Excel spreadsheet online.
 
  • #11
Hmm I see. You can create several dynamically filtered tables and charts, all linked to the master table, using pivot tables. Actually that's something I use a lot when doing this sort of data analysis.
 
  • #12
OK, I've filtered out PM entries, alternate route entries and 'special circumstance' entries. Less data but more informative.

One thing I'd like to do is make a 'lookup' table to correlate ordinal day-of-week (1,2,3,4,5) with name.

Some interesting trending happening by weekday...
 

Attachments: excel2.png
  • #13
The duration by starting time looks like it has a decent fit to
duration = d0 + max( 0, k·(t-t0) ) for t0 ~ 0730
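That piecewise "hockey stick" model can be written out as a small function; times are in minutes after midnight, and the d0 and k values are illustrative placeholders, not fitted parameters:

```python
# Trip duration vs. departure time: a floor of d0 minutes before the
# threshold t0, then a linear rise with slope k after it.
# d0 and k are made-up guesses for illustration, not fitted to the data.
def duration_model(t, d0=35.0, k=0.5, t0=450):
    """t is departure time in minutes after midnight (450 = 07:30)."""
    return d0 + max(0.0, k * (t - t0))
```

Leaving at or before 07:30 gives the floor: duration_model(450) returns 35.0, while an 08:14 start, duration_model(494), returns 57.0 with these placeholder parameters.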
 
  • #14
So, if I leave the house at 7:30AM, I should be able to make it to work in zero time? :biggrin:
 
  • #15
Just have to deal with that pesky d0 and you could :)
 
  • #16
How do I formulaically plot average?

Basically, $K1-$Kn WHERE $B=(1,2,3,4,5)

So my DoW chart would have a single average point for each day amongst the other data.

I guess I start sprinkling sub total data throughout my page?
 
  • #17
with multiple conditions I think you might need to add an auxiliary column where you put the criterion, say $Gn formula: =AND($Bn>=1, $Bn<=5)

Then use SUMIF($G1:$Gn, TRUE, $K1:$Kn)/COUNTIF($G1:$Gn, TRUE)

Oh not what you meant - you want 5 sub averages.

Not sure about it, except by adding 5 auxiliary columns, one per day. There must be a better way but I can't think of it now.

A pivot table would do that in one shot though.
 
  • #18
I've created 5 cells in a corner of my spreadsheet with array formulae:

{=AVERAGE(IF(B2:B32=<1,2,3,4,5>,K2:K32))}

Hm. These do not render on the graph properly.

Awww. The formulae ignore the filters! My averages are calc'ed from ALL rows, regardless of whether or not they are filtered out.

*sigh* got to explicitly apply filters to the 5 formulae:
{=AVERAGE(IF(B2:B32=<1,2,3,4,5>,IF(D2:D32="A",IF(H2:H32=1,IF(J2:J32="",K2:K32)))))}
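For reference, the filtered per-day average that array formula computes can be sketched in Python; the sample rows are invented, with fields mirroring the spreadsheet columns (B = day, D = AM/PM, J = notes, K = duration):

```python
# Per-weekday average duration, keeping only AM rows with an empty notes
# field — the same filters the array formula applies. Rows are made up.
from collections import defaultdict
from statistics import mean

rows = [  # (day 1=Mon..5=Fri, "A"/"P", notes, duration in minutes)
    (2, "A", "", 45), (2, "A", "", 44), (3, "A", "", 40),
    (4, "A", "", 43), (4, "P", "", 36), (4, "A", "detour", 55),
]

by_day = defaultdict(list)
for day, ampm, notes, dur in rows:
    if ampm == "A" and notes == "":  # filter before averaging
        by_day[day].append(dur)

day_avg = {day: mean(durs) for day, durs in by_day.items()}
```

Filtering before averaging is the key step: the PM row and the "detour" row drop out, so Thursday's average comes from its one clean AM drive only.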
 
  • #19
I find the DoW averages intriguing. I wonder whether it's Wednesday that's the anomaly, or if it's Tuesday and Thursday that are the anomalies.
And I wonder why. A lot of people take Wednesday off?

As for the Departure Time, how can I calculate the trend? What's it called? A 'best fit' curve?
Then I have to figure out if Excel can render it.

commute.PNG
 
  • #20
Found the trendline feature. Cool!

OK, so here's something I'm wondering. In real life, the commute will not decrease linearly with an earlier departure time, so the best fit should asymptotically approach some value of Y - likely somewhere between 0:35 and 0:30. But the best fit that Excel renders only has two points and a straight line between them. I don't see how I could make that line more representative.

commute.PNG
 
  • #21
Ah. A breakdown showing Day-of-Week AND Departure time reveals a very different story.
It shows me that I don't have enough data.

I notice that
- Wednesday's durations are biased toward earlier departure times,
- Tuesday's & Thursday's are biased toward later times.
- 2 data points for Monday do not constitute a trend.

commuteB.PNG
 
  • #22
Indeed it looks like they might be drawn from the same distribution. So best would be to go back to just ignoring DOW distinctions and look only at other variables.
 
  • #23
DaveC426913 said:
Found the trendline feature. Cool!

OK, so here's something I'm wondering. In real life, the commute will not decrease linearly with an earlier departure time, so the best fit should asymptotically approach some value of Y - likely somewhere between 0:35 and 0:30. But the best fit that Excel renders only has two points and a straight line between them. I don't see how I could make that line more representative.

View attachment 81092
Looking at the chart, I don't think you have enough data to discriminate between different models. You can however fit any model that seems reasonably motivated, instead of a trendline (that's the kind of thing I was suggesting above with the d0 model, since where I live, if you leave before the crowd you get no traffic and a constant driving time, then after some tipping point the trip duration increases.)
The levelling off of duration at late time is an expected feature but I do not see it in your chart (it looks like you never leave late enough to encounter it).
In any case, one way to do this is to write your model with some parameters, calculate its standard prediction error, and then use the solver to adjust the parameters so as to minimize this error.
What you describe could be formulated as
duration = d0 + k·min(t-t0, tmax-t0) + error
In this formulation t0 is here for convenience only, and can be set as the earliest time or an arbitrary time like 730; d0 and k are then your parameters, and you minimize
Sum(error^2)
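The fitting step can be sketched without the spreadsheet Solver by brute-force search over a parameter grid, minimizing Sum(error^2) directly. The (start minute, duration) pairs and the grid ranges below are invented for illustration:

```python
# Fit duration = d0 + k*min(t - t0, tmax - t0) by minimizing the sum of
# squared errors over a coarse (d0, k) grid. This stands in for the
# Solver; the data points are illustrative, not the thread's real data.
data = [(494, 43), (493, 46), (491, 36), (484, 45), (501, 43)]
t0, tmax = 450, 540  # 07:30 and 09:00, fixed by hand for this sketch

def sse(d0, k):
    return sum((d0 + k * min(t - t0, tmax - t0) - dur) ** 2
               for t, dur in data)

best = min(((d0, k) for d0 in range(20, 50)
            for k in [i / 100 for i in range(101)]),
           key=lambda p: sse(*p))
d0_fit, k_fit = best
```

A real Solver run would refine the parameters continuously rather than on a grid; the point is just to make "adjust d0 and k to minimize Sum(error^2)" concrete.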
 
  • #24
wabbit said:
Looking at the chart, I don't think you have enough data to discriminate between different models.
Yeah. That's what I was getting at. When examined by day only, it looks like Wednesday is a fast day. In fact, what the data is really telling me is that I tend to leave earlier on Wednesday than on Tuesdays or Thursdays.

BTW, I had to choose the data sets manually. I don't know how to tell it that, for the Monday Series, only choose where B=1, etc.

wabbit said:
You can however fit any model that seems reasonably motivated, instead of a trendline (that's the kind of thing I was suggesting above with the d0 model, since where I live, if you leave before the crowd you get no traffic and a constant driving time, then after some tipping point the trip duration increases.)
Around here, there is no such thing as no traffic. Rush hour starts at 6AM and goes until - well - 9PM. There's no midday lull. But yeah, it should be worse between 7:30 and ... I don't know ... 11AM?

wabbit said:
The levelling off of duration at late time is an expected feature but I do not see it in your chart (it looks like you never leave late enough to encounter it).
In any case, one way to do this is to write your model with some parameters, calculate its standard prediction error, and then use the solver to adjust the parameters so as to minimize this error.
What you describe could be formulated as

In this formulation t0 is here for convenience only, and can be set as the earliest time or an arbitrary time like 730; d0 and k are then your parameters, and you minimize
I am not sure what I would do with that formula. What is 'error'? Or is that what I'm trying to solve for?
 
  • #25
You would calculate the error (difference between model prediction and observation) as
error = ( d0 + k·min(t-t0, tmax-t0) ) - duration
But I don't think your data actually supports that.
This is something you can test:
- fit a model without tmax (or equivalently force tmax = some very large value) : you get a first total error^2
- then fit the model allowing tmax to change : you get a second sum of error^2
If the second total error is significantly* less than the first, then your extra parameter may have earned its keep. Otherwise reject it.

* there are statistical tests for this, but I don't think you need them here. I expect you'll see no reduction in total error, or at best a small one.
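This with/without-tmax comparison can be sketched with a small grid search over invented data points (not the thread's real numbers):

```python
# Fit the capped model twice: once with the cap effectively disabled
# (tmax huge) and once letting tmax vary, then compare total squared
# error. If freeing tmax barely helps, the extra parameter is rejected.
data = [(494, 43), (493, 46), (491, 36), (484, 45), (501, 43)]
t0 = 450

def fit_sse(tmax):
    # best total squared error over a coarse (d0, k) grid for this tmax
    return min(
        sum((d0 + k * min(t - t0, tmax - t0) - dur) ** 2 for t, dur in data)
        for d0 in range(20, 50)
        for k in [i / 100 for i in range(101)]
    )

sse_no_cap = fit_sse(10**6)                                  # cap never binds
sse_capped = min(fit_sse(tm) for tm in range(480, 541, 10))  # tmax free
```

Freeing tmax can only lower (or match) the total error, so the question is whether the reduction is large enough to justify the extra parameter; with these few invented points it isn't.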
 
  • #26
Right. Yes. I was working that out while you were posting.

Although, I don't understand it :frown: because, as I confessed, I never studied stats.
I mean, I get the idea of what's it supposed to do, I'm just kind of hazy on the application to the data and chart.
Would I create a second curve, using the formula you provided, drop it on the first, then tweak it until I got a good match? (Well, except there is no first curve.)
 
  • #27
Yes but use the solver to do the tweaking.
 
  • #28
Okay. Thanks.

Alas, my time has run out. I leave for Cuba in 17 hours. Won't be able to revisit this till well into April.
 

FAQ: Simple statistical analysis of a limited data set

What is simple statistical analysis?

Simple statistical analysis is a method used to summarize and interpret a limited data set in order to gain insights and draw conclusions about a population. It involves calculating basic measures of central tendency (such as mean, median, and mode) and measures of variability (such as range and standard deviation).

Why is it important to conduct simple statistical analysis?

Simple statistical analysis allows us to understand and make sense of a limited data set. It helps us identify patterns and trends, and provides a basis for making decisions or drawing conclusions about a larger population. Without proper analysis, the data could be misinterpreted or lead to incorrect conclusions.

How do you choose the appropriate statistical test for a limited data set?

The appropriate statistical test depends on the type of data and the research question being asked. Some common types of tests include t-tests, ANOVA, and correlation analysis. It is important to understand the nature of the data and the assumptions of each test in order to select the most appropriate one for the research question.

What are some common challenges when conducting simple statistical analysis?

One common challenge is dealing with missing or incomplete data, which can affect the accuracy and representativeness of the results. Another challenge is ensuring that the data is normally distributed, as many statistical tests assume this. Additionally, it is important to consider the limitations of a limited data set and not overgeneralize the results.

How can I effectively communicate the results of simple statistical analysis?

To effectively communicate the results of simple statistical analysis, it is important to present the findings in a clear and concise manner. This can include using visual aids such as charts, graphs, and tables to illustrate the data. It is also important to provide a thorough explanation of the analysis methods used and to avoid making assumptions or drawing conclusions that are not supported by the data.
