Time Series Models vs Models Used for Cross-sectional Data

  • #1
fog37
Hello,

I understand that a dataset that includes time ##T## as a variable along with other variables ##Variable1##, ##Variable2##, ##Variable3## represents a multivariate time series, i.e., we can plot each of ##Variable1##, ##Variable2##, ##Variable3## against time ##T##.

When dealing with regular cross-sectional datasets (no time involved), we can use statistical models like linear regression, logistic regression, random forests, decision trees, etc., to perform classification and regression...

Can we use these same models if the dataset has the time variable ##T##? I don't think so. For example, let's consider the fictional dataset:

[Attached image: example dataset table]


Or do we need to use specialized time series models like ARMA, SARIMA, Vector Autoregression (VAR), Long Short-Term Memory Networks (LSTMs), Convolutional Neural Networks (CNNs), etc.?

I don't think it would make sense to create a linear regression model like this: ##Variable3 = a \cdot Variable1 + b \cdot Variable2 + c \cdot Time##... Or does it?

What kind of model would I use with a dataset like that?

Thank you!
 

  • #2
fog37 said:
Or do we need to use specialized time series models like ARMA, SARIMA, Vector Autoregression (VAR), Long Short-Term Memory Networks (LSTMs), Convolutional Neural Networks (CNNs), etc.?
In general, these are far superior for analyzing time series.
fog37 said:
I don't think it would make sense to create a linear regression model like this: ##Variable3 = a \cdot Variable1 + b \cdot Variable2 + c \cdot Time##... Or does it?
Very often, one variable is affected by the prior values of other variables, including its own prior values. The influence may be delayed by several time steps. Good time series models take all of that into account.
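As a toy illustration (purely simulated data, numbers made up), here is a series whose current value depends on its own prior value, and how a simple lag regression recovers that dependence:

```python
import numpy as np

# Simulate an AR(1) series: each value depends on its own prior value.
rng = np.random.default_rng(0)
n, phi = 500, 0.8
y = np.zeros(n)
for t in range(1, n):
    y[t] = phi * y[t - 1] + rng.normal()

# Least-squares estimate of the lag-1 coefficient: regress y_t on y_{t-1}.
phi_hat = np.dot(y[:-1], y[1:]) / np.dot(y[:-1], y[:-1])
print(phi_hat)  # should land near the true phi = 0.8
```

Time series models like ARMA or VAR generalize this idea to several lags and several variables at once.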
 
  • #3
fog37 said:
What kind of model would I use with a dataset like that?
A model which is linearly dependent on time.
 
  • #4
pbuk said:
A model which is linearly dependent on time.
The goal of multivariate time series analysis appears to be to build a model that can predict the value of a chosen target variable ##Y## at a future time ##t##, given the known values of ##Y## itself at previous times as well as the values of the other ##X## variables at previous times...
Traditional models like ARMA and VAR assume the series is stationary (ARIMA/SARIMA handle some non-stationarity through differencing). It is also possible to use other models by creating new variables (feature engineering) that are lagged versions of the ##X## variables and using them as inputs for models like random forests, SVMs, and all other ML or DL models...
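For example, a minimal pandas sketch of that lagged feature engineering (column names and values made up to match my example):

```python
import pandas as pd

# Toy multivariate series: one row per (equally spaced) time step.
df = pd.DataFrame({
    "Variable1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Variable2": [10.0, 9.0, 8.0, 7.0, 6.0],
})

# Feature engineering: add lagged copies of each predictor so that a
# generic ML model (random forest, SVM, ...) can see past values.
for col in ["Variable1", "Variable2"]:
    for lag in (1, 2):
        df[f"{col}_lag{lag}"] = df[col].shift(lag)

# Rows whose lags would fall before the start of the series are dropped.
df = df.dropna().reset_index(drop=True)
print(df.columns.tolist())
```

The lagged columns can then be fed to any standard regressor as ordinary features.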

On the other hand, when working with cross-sectional data, we also have a target variable ##Y## and our model predicts its values based on the value of the other predictors ##X##. But the values of the ##X## variables and the values of target variable ##Y## are all at the same time ##t##.
The prediction of the target variable ##Y## is also at time ##t##.

Thank you!
 
  • #5
fog37 said:
On the other hand, when working with cross-sectional data, we also have a target variable ##Y## and our model predicts its values based on the value of the other predictors ##X##. But the values of the ##X## variables and the values of target variable ##Y## are all at the same time ##t##.
The prediction of the target variable ##Y## is also at time ##t##.
I think this statement is too strong. Just because a statistical analysis does not automatically consider the time sequence of different variable values does not mean that they are all from the same time. It is completely valid to use the height of a boy at 10 years old to predict or estimate his height at 20 years old. Time series analysis is specifically designed to account for the timing, but the other techniques can also be used in a limited way, within reason.
 
  • #6
Simple regression is the norm for financial time series, as autocorrelation is close enough to zero. You just need to take log returns to get a stationary series.
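For example (made-up prices), log returns are just first differences of log prices:

```python
import numpy as np

# Made-up daily closing prices.
prices = np.array([100.0, 101.0, 99.5, 102.0, 103.0])

# Log returns: first differences of log prices. Price levels are not
# stationary, but log returns are close enough for simple regression.
log_returns = np.diff(np.log(prices))
print(log_returns)
```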
 
  • #7
BWV said:
Simple regression is the norm for financial time series, as autocorrelation is close enough to zero. You just need to take log returns to get a stationary series.
That surprises me. Certainly autocorrelations dominate all the macroeconomic data that I have analyzed from many countries. The current values of GDP, inflation, employment, job creation, etc. all depend primarily on the prior values. It surprises me that financial data is so different.
 
  • #8
FactChecker said:
That surprises me. Certainly autocorrelations dominate all the macroeconomic data that I have analyzed from many countries. The current values of GDP, inflation, employment, job creation, etc. all depend primarily on the prior values. It surprises me that financial data is so different.
If traders can abnormally profit from asset prices with autocorrelation, then the autocorrelation will cease to exist. You can't trade GDP or employment. Therefore one can safely ignore it in the returns of financial assets. That is also why lognormal distributions work well enough.
 
  • #9
BWV said:
If traders can abnormally profit from asset prices with autocorrelation, then the autocorrelation will cease to exist. You can't trade GDP or employment. Therefore one can safely ignore it in the returns of financial assets. That is also why lognormal distributions work well enough.
Are you talking about the first difference, rather than the value? Traders make profits when the stock price rises or falls. I can see that. But clearly the price of Apple stock today is very highly correlated with its price last month. (Although I'm not sure that I could say that about bitcoin or Truth Social. ;-) )
 
  • #10
FactChecker said:
Are you talking about the first difference, rather than the value? Traders make profits when the stock price rises or falls. I can see that. But clearly the price of Apple stock today is very highly correlated with its price last month. (Although I'm not sure that I could say that about bitcoin or Truth Social. ;-) )
The log return, which is the first difference, is mostly all that matters with stock time series. Stock prices themselves are not stationary (vol is proportional to price), so you can't do OLS on them. But that is true of most macro econ data as well: you can't do OLS with 50 years of US total GDP on the left vs., say, interest rates and unemployment rates on the right; it's nonsensical.
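A quick simulated illustration (random-walk log prices, numbers purely made up): the price level is strongly autocorrelated at lag 1 while the returns are not:

```python
import numpy as np

rng = np.random.default_rng(1)

# Random-walk log price: i.i.d. returns, non-stationary price level.
returns = rng.normal(0.0, 0.01, size=2000)
log_price = np.cumsum(returns)

def lag1_autocorr(x):
    # Sample lag-1 autocorrelation.
    x = x - x.mean()
    return np.dot(x[:-1], x[1:]) / np.dot(x, x)

print(lag1_autocorr(log_price))  # near 1
print(lag1_autocorr(returns))    # near 0
```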
 
  • #11
FactChecker said:
I think this statement is too strong. Just because a statistical analysis does not automatically consider the time sequence of different variable values does not mean that they are all from the same time. It is completely valid to use the height of a boy at 10 years old to predict or estimate his height at 20 years old. Time series analysis is specifically designed to account for the timing, but the other techniques can also be used in a limited way, within reason.
Thank you.

Just to be clear, if we have a dataset with only two columns, hence two variables like height ##H## and weight ##W##, we could create a simple linear regression model to predict ##H## from ##W##. The dataset does not have time ##t## as a variable; this cross-sectional data has been collected at one point in time. The "point in time" could span a week, a month, six months, a year, etc.
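For instance, a sketch with made-up numbers:

```python
import numpy as np

# Hypothetical cross-sectional sample: weight (kg) and height (cm),
# all measured at one point in time -- no time column at all.
W = np.array([60.0, 65.0, 70.0, 75.0, 80.0])
H = np.array([165.0, 168.0, 172.0, 176.0, 179.0])

# Ordinary least squares fit: H = a*W + b.
a, b = np.polyfit(W, H, deg=1)
print(a, b)  # a = 0.72, b = 121.6 for these numbers
```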

Is it correct to think that the prediction of ##H## given by the linear regression model is not tied to time considerations but is a general prediction that does not factor time into the analysis? The assumption is that the height prediction would be acceptable regardless of when in time it is made.

Thank you.
 
  • #12
BWV said:
The log return, which is the first difference, is mostly all that matters with stock time series.
I think that is too strong a statement. There are institutional investors or people who are interested in long-term investments, even to the extent of owning a company. For them, financial indicators like P/E ratios and capitalization are very important.
 
  • #13
fog37 said:
Thank you.

Just to be clear, if we have a dataset with only two columns, hence two variables like height ##H## and weight ##W##, we could create a simple linear regression model to predict ##H## from ##W##. The dataset does not have time ##t## as a variable; this cross-sectional data has been collected at one point in time. The "point in time" could span a week, a month, six months, a year, etc.
Do you mean that the time step varies significantly by unknown amounts from row to row, or that it is a constant but unknown time step? I don't know how to use time series analysis in the first case, but the second case is very typical for time series analysis. The time series analysis that I am familiar with only dealt with a regular, constant time step, but the magnitude of the time step was not considered. I wouldn't be surprised if there are techniques that consider the size of a varying time step.
Certainly, in the example of height versus weight, time series analysis would provide the best predictor if the time-step is regular.
fog37 said:
Is it correct to think that the prediction of ##H## given by the linear regression model is not tied to time considerations but is a general prediction that does not factor time into the analysis? The assumption is that the height prediction would be acceptable regardless of when in time it is made.
That might be the best that you can do with that data if the time steps are irregular and unknown. But sometimes you might be able to do better. For instance, in your height & weight example, suppose you wanted to estimate ##W_{current}## given ##H_{current}## and prior history. I think that a time series analysis would do a good job if you add the prior values ##W_{prior}/H_{prior}## as a variable.

PS. I have done work for a customer who wanted to know the relationship between several variables so that he could predict one of the variables. We had data over time in regular, yearly steps. Clearly, the variables were related. But after intense analysis of every type I could think of, the prior variable value was such a good predictor that no other variables were remotely significant in explaining the residual errors.

PPS. The customer was very unhappy with that answer and had someone else work on it for months. As far as I know, they never got a better result.
 
  • #14
FactChecker said:
I think that is too strong a statement. There are institutional investors or people who are interested in long-term investments, even to the extent of owning a company. For them, financial indicators like P/E ratios and capitalization are very important.
Oh sure, time series of financial ratios that normalize share prices are used all the time relative to log returns and other variables. FWIW E/P is easier to work with than P/E
 
  • #15
FactChecker said:
Do you mean that the time step varies significantly by unknown amounts from row to row, or that it is a constant but unknown time step? I don't know how to use time series analysis in the first case, but the second case is very typical for time series analysis. The time series analysis that I am familiar with only dealt with a regular, constant time step, but the magnitude of the time step was not considered. I wouldn't be surprised if there are techniques that consider the size of a varying time step.
Certainly, in the example of height versus weight, time series analysis would provide the best predictor if the time-step is regular.

That might be the best that you can do with that data if the time steps are irregular and unknown. But sometimes you might be able to do better. For instance, in your height & weight example, suppose you wanted to estimate ##W_{current}## given ##H_{current}## and prior history. I think that a time series analysis would do a good job if you add the prior values ##W_{prior}/H_{prior}## as a variable.

PS. I have done work for a customer who wanted to know the relationship between several variables so that he could predict one of the variables. We had data over time in regular, yearly steps. Clearly, the variables were related. But after intense analysis of every type I could think of, the prior variable value was such a good predictor that no other variables were remotely significant in explaining the residual errors.

PPS. The customer was very unhappy with that answer and had someone else work on it for months. As far as I know, they never got a better result.
Yes, I am considering either a dataset where the time column contains time values separated by equal amounts (e.g., day 1, day 2, day 3, ..., day 10) or a dataset that does not have a time column at all (e.g., weight ##W## and height ##H##)...

Thank you for your helpful input.
 