The effect of cross validation on correlation coefficient

It is unclear why this is happening, but it could be due to the different means at each fold affecting the overall correlation.
  • #1
lyuriedin
I have two variables where the regression line is just the mean as a constant. As such, the correlation is zero. However, when I perform k-fold cross validation (in Weka) this becomes non-zero.

I have no idea why this is. The regression line applied to any test set will always be a constant, so the correlation should be zero. Because some of the data is held out as the validation set at each fold, the training mean will differ from fold to fold, but the correlation of a constant prediction should still be zero no matter what. The only thing I can think of is that it is computing the correlation between the training-fold means and the actual mean, but even then those deviations should sum to zero.

Can anybody clear this up for me?
 
  • #2
A likely explanation: Weka reports the correlation coefficient between the predictions on the held-out instances and their actual values, pooled across all folds, rather than a correlation computed on the training data. Because the training mean differs at each fold, those pooled predictions are not a single constant, so their correlation with the actual values is no longer exactly zero. It typically comes out small and slightly negative, since a fold containing larger-than-average values is trained on a smaller-than-average mean.
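Here is a minimal sketch of that effect, using NumPy and scikit-learn rather than Weka (so the exact numbers will differ, but the mechanism is the same): a mean-only predictor evaluated under 10-fold cross validation gives a small non-zero correlation between the pooled fold predictions and the actual values, even though the same predictor gives zero correlation when fitted on the full data set.

```python
# Sketch: a mean-only ("ZeroR"-style) predictor under 10-fold cross validation.
# Assumes NumPy and scikit-learn; Weka is not involved, so numbers will differ.
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
y = rng.normal(size=100)      # target with no predictable structure

preds = np.empty_like(y)
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(y):
    # The "model" is just the training-fold mean, so every test point in this
    # fold gets the same prediction -- but that constant differs per fold.
    preds[test_idx] = y[train_idx].mean()

# A single constant prediction has zero (undefined) correlation with y, but the
# pooled cross-validation predictions vary across folds, so the Pearson
# correlation is non-zero -- typically small and negative, because a fold that
# holds larger-than-average targets leaves a smaller-than-average training mean.
print(np.corrcoef(preds, y)[0, 1])
```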
 

FAQ: The effect of cross validation on correlation coefficient

What is cross validation and why is it important in relation to correlation coefficient?

Cross validation is a statistical method used to assess how well a predictive model generalizes. It involves partitioning the data into subsets, training the model on all but one subset, and testing it on the held-out subset, repeating the process so that each subset serves as the test set once. This is important in relation to the correlation coefficient because it helps to avoid overfitting: the coefficient is computed on data the model has not seen, which gives a more accurate estimate of the model's performance.
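As an illustration (a sketch assuming NumPy and scikit-learn, not tied to any particular tool), the cross-validated correlation between predictions and actual values is typically lower than the in-sample correlation, because each prediction comes from a model that never saw that point:

```python
# Sketch: in-sample vs. cross-validated correlation for a simple linear model.
# Assumes NumPy and scikit-learn.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X[:, 0] + 0.5 * rng.normal(size=200)   # linear signal in one feature plus noise

model = LinearRegression()

# In-sample fit: the correlation between fitted values and y is optimistic.
in_sample = model.fit(X, y).predict(X)
print("in-sample r:", np.corrcoef(in_sample, y)[0, 1])

# 10-fold cross validation: each prediction comes from a model trained on the
# other folds, so the correlation is a more honest estimate of performance.
cv_preds = cross_val_predict(model, X, y, cv=10)
print("cross-validated r:", np.corrcoef(cv_preds, y)[0, 1])
```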

How does cross validation affect the correlation coefficient?

Cross validation affects the correlation coefficient in two ways. First, it reduces the optimistic bias of the estimate, because the coefficient is computed on held-out data rather than on the data used to fit the model. Second, repeating the evaluation over several folds makes the estimate more stable and more representative of how the model will generalize to new data.

What is the difference between k-fold cross validation and leave-one-out cross validation?

K-fold cross validation involves dividing the data into k subsets and using each subset as a testing set while the remaining subsets are used as training sets. The process is repeated k times, with each subset being used as the testing set once. Leave-one-out cross validation is similar, except it uses all but one data point as the training set and the remaining data point as the testing set. This process is repeated for each data point. The main difference is that k-fold cross validation is less computationally intensive, but leave-one-out cross validation provides a less biased estimate of the model's performance.
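A small sketch of the two schemes (assuming scikit-learn's KFold and LeaveOneOut splitters): on 20 data points, 5-fold cross validation produces 5 train/test splits with 4 test points each, while leave-one-out produces 20 splits with a single test point each.

```python
# Sketch: comparing k-fold and leave-one-out partitioning. Assumes scikit-learn.
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.arange(20).reshape(-1, 1)   # 20 toy samples

# 5-fold: 5 splits, each test fold holds 20 / 5 = 4 points (train on the other 16).
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
print([len(test) for _, test in kfold.split(X)])   # [4, 4, 4, 4, 4]

# Leave-one-out: 20 splits, each test set is a single point (train on the other 19),
# which is why it is the more computationally expensive of the two.
loo = LeaveOneOut()
print(sum(1 for _ in loo.split(X)))                # 20
```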

Can cross validation be applied to any type of data?

Cross validation can be applied to most types of data, including numerical, categorical, and time-series data. However, the specific method of cross validation used may vary depending on the type of data and the model being evaluated.

What are the limitations of cross validation in relation to correlation coefficient?

One limitation of cross validation is that it assumes the data is independent and identically distributed, which may not always be the case in real-world data. Additionally, cross validation may not be suitable for evaluating models with a large number of parameters, as it can be computationally intensive. Finally, cross validation cannot completely eliminate the risk of overfitting, so it should be used in conjunction with other methods of model validation.
