kimberley
"Sports Illustrated Jinx": Regression to the Mean
GENERAL BACKGROUND
A few weeks ago, my uncles and others were discussing the so-called "Sports Illustrated Jinx", "Sophomore Jinx", and "Heisman Jinx".
Statisticians have said that the Sports Illustrated Jinx, in particular, is not a jinx at all, but rather an issue of Central Tendency and Regression to the Mean. I found the issue interesting and I told the guys I'd look into it. I would love to be able to discuss it with them on Memorial Day, but after looking at it for the last week or so, I'm not sure of my methods of analysis.
Quoting from elsewhere in relevant part:
"Professional athletes and sportswriters sometimes refer to the 'Sports Illustrated [J]inx', in which bad things happen with a player right after he is featured on the cover of the magazine. Now, a player appears on the cover for spectacular performance, particularly unexpected spectacular performance. ... [A] player who is performing unexpectedly well is probably being lucky, and his luck is not going to last long. When the SI cover is announced, the player's run of luck is probably over, and his performance lapses back to his norm. (The same thing is true of All-Stars, who are frequently selected for being hot in the first half [of the season])." "http://www.visi.com/~thornley/david/philosophy/thinking/mean.html"
CAVEAT
With this definition of the "Sports Illustrated Jinx" in mind, and before stating my question(s), a basic caveat. My degree is in political science. I've taken enough courses and performed some independent study work, however, to at least have a fair understanding of the Central Limit Theorem and, with some help from MS Excel/OpenOffice, of calculating mean, median, mode, standard deviations, confidence intervals about the mean, prediction intervals for individual outcomes (outliers), standard error of the mean, standard error of the estimate, skew, kurtosis, general Linear Models, the Jarque-Bera test for relative "normality" (the latter courtesy of EnumaElish), etc.
CONSTRUCTING THE ISSUE
I've looked at a couple of different examples of players' average performances and exceptional performances, and I have tried to incorporate some of the common things and anomalies that I've witnessed into the hypothetical example below. Okay, so, here's the issue I'm trying to analyze.
Let's say that a player in the NBA with six years of playing experience (approximately 500 games) has a career scoring average of 15 points per game. Therefore, for sake of simplicity, he's scored 7500 total points over 500 games.
During the last 30 games, however, the player's scoring prowess has far exceeded his career average of 15 points per game. In fact, over the last 30 games he's averaged nearly 25 points per game. Over the last 60 games, his average is 22 points per game. Over his last 100 games, his average is 19 points per game. Assume that nothing has really changed (his playing minutes are pretty much the same, the teams they are playing are those they typically play and cycle through, etc.). In other words, our suspicion is that these performances are more than likely individual "outliers", as they are soaring above our typical prediction intervals of 1.96*SD, 2.326*SD, 2.58*SD etc., no matter over what period our Mean is calculated, save the last 30 games.
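To make the nested averages concrete, here is a rough sketch of what they imply for each separate slice of games. It treats "nearly 25" as exactly 25 and the career as exactly 500 games, both simplifications; the function name is made up for illustration.

```python
# Trailing-window averages from the example (window size -> points per game).
# "Nearly 25" is rounded to 25 here, which is an assumption.
trailing_avg = {30: 25, 60: 22, 100: 19, 500: 15}

def slice_averages(trailing):
    """Average for each non-overlapping slice between consecutive window sizes."""
    result = {}
    prev_size, prev_total = 0, 0
    for size in sorted(trailing):
        total = size * trailing[size]  # total points scored in this window
        result[(prev_size + 1, size)] = (total - prev_total) / (size - prev_size)
        prev_size, prev_total = size, total
    return result

for (start, end), avg in slice_averages(trailing_avg).items():
    print(f"games {start}-{end} back: {avg:.1f} points per game")
```

Under those simplifications the slices come out to 25.0, 19.0, 14.5 and 14.0 points per game, so almost all of the rise above the career average sits in the most recent 60 games.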
Now, when you look at the shape of this player's game-by-game scoring distribution over his entire career, it does not appear to be "normal". In fact, the skew & kurtosis are not even close to being 0 respectively over the 500 game period. The player's scoring production has steadily risen on average, with intermittent peaks and valleys. This seems to be consistent with an individual player "cycling" (i.e., being on a scoring streak and then regressing back to SOME mean). Curiously, when you look at his scoring production over the last 152 games, the skew & kurtosis are very close to 0. The average and median over the last 152 games are also very close to being the same--18.1 and 18.19 respectively.
PLAYER MAKES THE S.I. COVER
Our player has performed so far above his career scoring average in the last 30 games that he's been named "Player of the Week" for 3 of the last 4 weeks and made the cover of "Sports Illustrated".
A SIMPLE AVERAGE MODEL (QUESTION 1)
When analyzing this player's scoring production to try to determine how he might perform in his next game (or over some other period), from which average do we construct things like "Prediction Intervals"?
In other words, are his scoring average and standard deviation over the last 30 games, N=30: Average (30), Standard Deviation (30), the best choice? I would think not, since this looks like a period of exceptional performance (the "Sports Illustrated Jinx"). Alternatively, would we use his career average, N=500: Average (500), Standard Deviation (500)? I would think not either, because although a greater sampling period would typically produce the best result if his scoring were "normally distributed", his scoring does not appear to be following a normal distribution over his six year career. In fact, the standard deviation is actually much greater where N=500 as compared to shorter periods of time, the career average is relatively low compared to more recent averages, and prediction intervals about the mean are so wide as to be practically useless.
Thus, should we use the period of N=152, where his scoring takes on the shape of a normal distribution with skew & kurtosis both very close to 0, and where the Mean and Median are almost identical? THIS IS WHY I ASKED ENUMAELISH ABOUT THE JARQUE-BERA TEST. That is, I was trying to incorporate sample size (games), mean, standard deviation, skew & kurtosis all into one formula to find out over which period the player's scoring is most normally distributed so that I could construct useful (i.e., relevant) Prediction Intervals AND to find the Mean to which his scoring was most likely to regress tomorrow night, or over the next week etc.
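The window scan described above can be sketched roughly as follows. This is only an illustration: the function names and the example scores are made up, and the statistic uses the usual 1/n moment definitions, so Excel's SKEW and KURT (which apply small-sample corrections) will give slightly different numbers.

```python
def jarque_bera(scores):
    """Return (skewness, excess kurtosis, JB statistic) for a list of game scores."""
    n = len(scores)
    mean = sum(scores) / n
    # Central moments with 1/n weighting, as in the usual JB definition.
    m2 = sum((x - mean) ** 2 for x in scores) / n
    m3 = sum((x - mean) ** 3 for x in scores) / n
    m4 = sum((x - mean) ** 4 for x in scores) / n
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3
    jb = n / 6 * (skew ** 2 + excess_kurtosis ** 2 / 4)
    return skew, excess_kurtosis, jb

def scan_trailing_windows(scores, window_sizes):
    """JB results for the most recent w games, for each w in window_sizes."""
    return {w: jarque_bera(scores[-w:]) for w in window_sizes}

# Hypothetical scores, oldest game first (not real data).
example_scores = [9, 12, 10, 14, 11, 15, 13, 17, 16, 18, 15, 19, 18, 20, 17, 21]
for w, (s, k, jb) in scan_trailing_windows(example_scores, [8, 12, 16]).items():
    print(f"last {w:>2} games: skew={s:+.2f}, excess kurtosis={k:+.2f}, JB={jb:.2f}")
```

A small JB value only means the test does not reject normality for that window; whether the window with the smallest value is the right one to predict from is exactly the open question here.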
PREDICTION INTERVAL EQUATION I USED FOR DETECTION OF POSSIBLE SCORING OUTLIERS (where scoring over the N games is approximately normally distributed)
Xbar +/- Z * STANDARD DEVIATION * SQRT(1 + 1/152), where Z = 1.96, 2.326, 2.58, 3.291, etc.
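As a quick check of that formula, here is a minimal sketch using the N=152 figures from this example (Mean=18.1, SD=2.6). The function name is made up, and it uses a normal (z) critical value as in the formula above; with an SD estimated from the sample, a t critical value would be slightly wider.

```python
from math import sqrt
from statistics import NormalDist

def prediction_interval(mean, sd, n, confidence=0.95):
    """Two-sided interval for one future game, using a z critical value."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # 0.95 -> about 1.96
    half_width = z * sd * sqrt(1 + 1 / n)
    return mean - half_width, mean + half_width

low, high = prediction_interval(18.1, 2.6, 152)
print(f"95% prediction interval: {low:.2f} to {high:.2f}")  # about 12.99 to 23.21
```

With these inputs, any single game above roughly 23.2 points would be flagged as a possible outlier at the 95% level.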
A SIMPLE LINEAR MODEL (QUESTION 2)
In the few examples I tried, the Linear Model seemed to be the least consistent in predicting individual outcomes. Sometimes it was completely unreliable and other times it was right on the mark. For instance, when I found the period of time with the highest R-squared value, and then constructed Prediction Intervals for single outcomes using a Linear Regression Line and Standard Error of the Estimate for that time period, the model would often completely collapse. That is, the player's scoring in a single game, where he scored just 4 points, would of course drop as much as 5 or 6 Standard Errors from the Linear Regression Line. Other times, it would be dead-on and the player's scoring would hit 2 Standard Errors of the Estimate above the Linear Regression Line or 2 Standard Errors below it, and then regress right back to the Linear Regression Line.
What's the problem with my Linear Model?
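For reference, the linear-model interval can be sketched like this. The function name and the scores are made up for illustration. By default the sketch uses SQRT(1 + 1/N) exactly as in the equation written out below; the standard textbook form adds a further term, (x0 - xbar)^2 / Sxx, under the square root, which widens the interval when predicting beyond the fitted games, and the `leverage` flag switches that on.

```python
from math import sqrt
from statistics import NormalDist

def regression_prediction_interval(scores, confidence=0.95, leverage=False):
    """Fit points-vs-game-number by least squares and predict the next game."""
    n = len(scores)
    games = list(range(1, n + 1))
    x_bar = sum(games) / n
    y_bar = sum(scores) / n
    sxx = sum((x - x_bar) ** 2 for x in games)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(games, scores))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Standard error of the estimate: residual spread with n - 2 degrees of freedom.
    residuals = [y - (intercept + slope * x) for x, y in zip(games, scores)]
    see = sqrt(sum(r * r for r in residuals) / (n - 2))
    x_next = n + 1
    predicted = intercept + slope * x_next
    inflation = 1 + 1 / n
    if leverage:
        inflation += (x_next - x_bar) ** 2 / sxx  # extra textbook term
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * see * sqrt(inflation)
    return predicted - half_width, predicted, predicted + half_width

# Hypothetical scores, oldest game first (not real data).
low, predicted, high = regression_prediction_interval([10, 12, 11, 13, 12, 14])
print(f"next game: {predicted:.1f} points, interval {low:.1f} to {high:.1f}")
```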
PREDICTION INTERVAL EQUATION ABOUT THE LINEAR REGRESSION LINE WITH HIGHEST R-SQUARED VALUE
LINEAR REGRESSION VALUE +/- Z * STANDARD ERROR OF THE ESTIMATE * SQRT(1 + 1/N), where Z = 1.96, 2.326, 2.58, 3.291, etc.
THE SI JINX: REGRESSION TO THE MEAN GENERALLY (QUESTION 3)
Assuming for the sake of argument that the Simple Average Model, where scoring over the N games is approximately normally distributed, is the best way to analyze this issue, with N=152, Mean=18.1, SD=2.6, and that this player has been posting point totals in many of the last 30 games that are AT LEAST 1.96 SD (1.96*2.6) above the Mean of 18.1, how can I determine when his exceptional run of averaging 25 points over the last 30 games is going to meet up with the "Jinx" and revert to games that average 18.1 points?
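One rough way to size up the streak under that model: if every game were an independent draw from a fixed distribution with Mean=18.1 and SD=2.6 (a strong assumption, made here only for the arithmetic), the average of 30 games would have a standard error of SD/SQRT(30). The variable names below are made up for illustration.

```python
from math import sqrt

mean_152, sd_152 = 18.1, 2.6          # figures from the N=152 window
streak_games, streak_avg = 30, 25.0   # "nearly 25" rounded to 25 (assumption)

# Standard error of a 30-game average under the fixed-mean, independent-games model.
standard_error = sd_152 / sqrt(streak_games)
z_of_streak = (streak_avg - mean_152) / standard_error

print(f"standard error of a 30-game average: {standard_error:.3f}")
print(f"streak average is {z_of_streak:.1f} standard errors above the mean")
```

That works out to about 14.5 standard errors, which is far more than the 1.96 used for single games. If that figure holds up, it may be a sign that the fixed-mean assumption itself deserves a second look before asking when the regression will happen.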
CLOSING
I've had fun with this, and learned quite a bit, but am I even on the right track? If not, what's the best way to statistically look at this sort of issue? Thanks a bunch in advance and Happy Memorial Day!
Kimberley