Exploring the Limitations of KNN with Large p: LDA/QDA Performance

  • Thread starter brojesus111
  • Tags
    performance
In summary: with p = 100 features and a 10% window per feature, a test prediction would use only 0.1^100 of the available observations, which is essentially none. This illustrates a drawback of KNN when p is large: there are very few training observations near any given test observation. When the Bayes decision boundary is linear, the more flexible QDA is expected to fit the training set better, while LDA is expected to perform better on the test set thanks to its lower variance. When the Bayes decision boundary is nonlinear, QDA is expected to perform better on both the training and test sets.
  • #1
brojesus111

Homework Statement



1. We have a set of observations on p = 100 features. The observations are uniformly distributed on each feature, and each feature ranges in value from 0 to 1. We wish to predict a test observation’s response using observations within the 10 % of each feature’s range that is closest to that test observation. What fraction of the available observations will we use to make the prediction?

2. Now argue based on the above that a drawback of KNN when p is large is that there are very few training observations near any given test observation.
LDA/QDA

3. If the Bayes decision boundary is linear, do we expect LDA or QDA to perform better on the training set? On the test set?

4. If the Bayes decision boundary is nonlinear, do we expect LDA or QDA to perform better on the training set? On the test set?

The Attempt at a Solution



1. Is it just 0.1^100?

2. For example, even if we widened the window to 99% of each feature's range, we would use only 0.99^100 ≈ 0.366, or roughly 37%, of the available observations. Excluding just 1% of each feature's range still rules out most of the 100-dimensional space, so training observations end up far away from any given test observation. (A quick simulation sketch follows item 4 below.)

3. For the test set, since we know the true boundary is linear, LDA should do better. For the training set, I think QDA, since it is more complex and flexible and can fit the training data more closely.

4. Since the true boundary is nonlinear, QDA should do better on the test set. On the training set, QDA is more flexible than LDA, so it should do better there as well.
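
A quick numerical check of 1. and 2. (a minimal Python sketch, assuming NumPy; for simplicity the 10% window is centered on the test point and edge effects are ignored):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000  # training observations, uniform on [0, 1]^p

for p in (1, 2, 5, 10, 100):
    X = rng.uniform(size=(n, p))
    x_test = rng.uniform(size=p)
    # an observation is "near" if every feature lies within +/- 5%
    # of the test point's value (a 10% window per feature;
    # boundary clipping is ignored for simplicity)
    near = np.all(np.abs(X - x_test) <= 0.05, axis=1)
    print(f"p={p:3d}: empirical {near.mean():.6f}  vs  0.1^p = {0.1**p:.3g}")
```

Already at p = 10 almost nothing falls inside the window, and at p = 100 the theoretical fraction 0.1^100 = 10^-100 means essentially no observation is ever "close".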

Also, out of curiosity, if we have K=1 for the KNN, is our training error rate 0 or close to 0?
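
On the K = 1 question: the training error is exactly 0 whenever each training point is its own nearest neighbor, i.e., as long as there are no duplicated inputs with conflicting labels. A minimal check (a sketch, assuming scikit-learn):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X = rng.uniform(size=(200, 5))
y = rng.integers(0, 2, size=200)  # arbitrary labels

knn = KNeighborsClassifier(n_neighbors=1).fit(X, y)
# each training point's nearest neighbor in the training set is itself,
# so the training error is 0 (barring duplicate inputs with different labels)
print("training error:", 1 - knn.score(X, y))  # -> 0.0
```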
 
  • #2
Actually, the only difference between LDA and QDA is that QDA assumes each class has its own covariance matrix. So does it actually matter whether we're talking about the test or the training data? LDA should do better when the boundary is linear, and QDA should do better when it's nonlinear.

Is this correct?
 
  • #3
One more reason I think my revised post above is right: LDA's boundary is linear and has lower variance, while QDA's boundary is nonlinear and has higher variance.
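
One way to settle the training-vs-test question empirically is a small simulation (a sketch, assuming scikit-learn; the "linear" scenario draws both classes from Gaussians with a shared covariance, so the Bayes boundary is linear, while the "nonlinear" scenario gives class 1 its own covariance, making the Bayes boundary quadratic):

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(0)

def simulate(shared_cov, n=200):
    """Two Gaussian classes; a shared covariance gives a linear Bayes boundary."""
    cov0 = np.eye(2)
    cov1 = np.eye(2) if shared_cov else np.array([[1.0, 0.8], [0.8, 1.0]])
    X = np.vstack([rng.multivariate_normal([0, 0], cov0, n),
                   rng.multivariate_normal([1, 1], cov1, n)])
    y = np.repeat([0, 1], n)
    return X, y

for shared in (True, False):
    X_tr, y_tr = simulate(shared)
    X_te, y_te = simulate(shared)
    for model in (LinearDiscriminantAnalysis(), QuadraticDiscriminantAnalysis()):
        model.fit(X_tr, y_tr)
        print(f"linear boundary={shared}  {type(model).__name__:32s}"
              f"  train acc={model.score(X_tr, y_tr):.3f}"
              f"  test acc={model.score(X_te, y_te):.3f}")
```

Averaged over repeated draws, QDA tends to edge out LDA on the training set even when the true boundary is linear (its extra flexibility fits noise), while LDA wins on the test set; with a nonlinear boundary, QDA tends to win on both.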
 

FAQ: Exploring the Limitations of KNN with Large p: LDA/QDA Performance

What is the purpose of exploring the limitations of KNN with large p?

The purpose of this study is to investigate the performance of the KNN (K-Nearest Neighbors) algorithm when applied to datasets with a large number of predictors (p). This helps in understanding the limitations of KNN and when it may not be the best choice for classification tasks.

What is KNN and how does it work?

KNN is a non-parametric classification algorithm that works by finding the K data points closest to a given test point and assigning the most common class among them as the predicted class for that point. It uses a distance metric, such as Euclidean distance, to measure the similarity between data points.
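
A minimal illustration of that procedure (a sketch, assuming scikit-learn; the toy data are made up):

```python
from sklearn.neighbors import KNeighborsClassifier

# toy training data: two features, two classes
X_train = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]
y_train = [0, 0, 1, 1]

# K = 3: predict by majority vote among the 3 nearest
# training points under Euclidean distance
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
print(knn.predict([[0.85, 0.75]]))  # -> [1]
```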

What are LDA and QDA?

LDA (Linear Discriminant Analysis) and QDA (Quadratic Discriminant Analysis) are two popular classification algorithms that use a linear and a quadratic decision boundary, respectively, to separate the classes in a dataset. Both are parametric methods that model each class with a Gaussian distribution: LDA assumes a single covariance matrix shared by all classes, which yields a linear boundary, while QDA estimates a separate covariance matrix for each class, which yields a quadratic boundary.
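
That structural difference is visible directly in a fitted model (a sketch, assuming scikit-learn; `store_covariance=True` exposes the fitted covariance estimates):

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = rng.integers(0, 2, size=100)  # arbitrary labels, just to fit the models

lda = LinearDiscriminantAnalysis(store_covariance=True).fit(X, y)
qda = QuadraticDiscriminantAnalysis(store_covariance=True).fit(X, y)

# LDA: one covariance matrix pooled across classes -> linear boundary
print(np.shape(lda.covariance_))           # (2, 2)
# QDA: one covariance matrix per class -> quadratic boundary
print([c.shape for c in qda.covariance_])  # [(2, 2), (2, 2)]
```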

How does LDA/QDA performance compare to KNN in high-dimensional datasets?

LDA and QDA are expected to outperform KNN in high-dimensional datasets because they are parametric methods: by assuming a specific distribution for the data, they need far fewer observations per region of feature space than a purely local method like KNN. However, this study explores the limitations of KNN and whether there are scenarios where it may still outperform LDA/QDA in high dimensions.

What are the potential implications of the findings from this study?

The findings of this study can have implications for the choice of classification algorithm in real-world applications. If the limitations of KNN in high-dimensional datasets are identified, it can help in determining when to use KNN and when to use other methods such as LDA/QDA. This can lead to better classification results and more informed decision making in various fields such as medicine, finance, and marketing.
