Outlier identification of multivariate data

In summary, the thread discusses how to pick out the few matrices that differ significantly from the rest in a dataset of more than 1500 matrices. The suggested approach is to treat each matrix as a point in 6D space (for the 2x3 example given) and compute its distance from the average matrix; the largest distances indicate outliers. Refinements such as weighted distances are also mentioned. The thread ends with the decision to apply a z-score to the resulting univariate distances.
  • #1
serbring
Dear all,

I have more than 1500 matrices, each holding the occurrence frequencies of a bivariate dataset, something like the following: [0.1, 0.3, 0.1; 0.4, 0.05, 0.05]. I need to extract the few that are really different from the others. Instead of visually inspecting each one and struggling to spot something, is there a way to do this programmatically, maybe with an index similar to the z-score used for outlier identification in univariate data?

Thanks

Best regards

Serbring
 
  • #2
I do not quite understand what your matrices are: what is in each row exactly? And more importantly, are all the matrices of the same form (2x3 in your example)?

If the latter is true, a generic approach could be to treat all matrices as points in 6D space, and look for points far away from the others.

A basic implementation of this idea would be to compute the average of the 1500 matrices, ##\bar M={1\over N}\sum_{i}M_i##, and then for each matrix, its distance from this average ## d_i=d(M_i,\bar M)=\sqrt{\sum_{j,k}(M_{i;jk}-\bar M_{jk})^2}##.
The largest ##d_i##'s then give you (an interpretation of) the outliers.

You can improve on that by using more information about the matrices and what it means for two of them to be near each other, but perhaps the above might serve as a starting point.
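As a rough illustration of the idea above, here is a minimal Python sketch using only the standard library. The function names and the four toy matrices are made up for the example (they are not the poster's data); each matrix is a list of rows, and the distance is the Euclidean one from the formula for ##d_i##.

```python
import math

def mean_matrix(mats):
    """Element-wise average of equally sized matrices (lists of rows)."""
    n = len(mats)
    rows, cols = len(mats[0]), len(mats[0][0])
    return [[sum(m[j][k] for m in mats) / n for k in range(cols)]
            for j in range(rows)]

def distance(m, ref):
    """Euclidean distance between two equally sized matrices."""
    return math.sqrt(sum((a - b) ** 2
                         for row_m, row_r in zip(m, ref)
                         for a, b in zip(row_m, row_r)))

# Hypothetical toy data: three similar matrices and one that clearly differs.
mats = [
    [[0.10, 0.30, 0.10], [0.40, 0.05, 0.05]],
    [[0.12, 0.28, 0.10], [0.40, 0.05, 0.05]],
    [[0.10, 0.30, 0.12], [0.38, 0.05, 0.05]],
    [[0.40, 0.05, 0.05], [0.10, 0.30, 0.10]],
]

m_bar = mean_matrix(mats)                   # the average matrix
d = [distance(m, m_bar) for m in mats]      # one distance per matrix

# Indices sorted from largest to smallest distance; index 3 comes first here.
ranked = sorted(range(len(mats)), key=lambda i: d[i], reverse=True)
```

With 1500 matrices the same code applies unchanged; an array library would simply make it faster.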
 
  • #3
Hi Wabbit,

Thanks for your reply. Each matrix is the joint probability between two signals (torque and speed of a shaft) and all the matrices have the same size.

That distance is the Euclidean distance, right? What additional information could I add?

Thanks
 
  • #4
It is still unclear to me what the matrices are; I don't see an obvious way in which "joint probability between torque and speed" should take the form of a 2x3 matrix. Also, I assume you mean frequency, not probability.

Other than that, yes, this is the Euclidean distance, but that was just a default choice: any distance will do. This is one of the areas for improvement: for instance, if torque is more important than speed, then you may use a different distance, etc.

Have you tried this basic version to see if it gives you something usable?
 
  • #5
Thanks. Yes, I meant frequency. The two signals are first binned, and then I compute the discrete joint frequency distribution using the Matlab function "hist3".
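For readers without Matlab, the binning step described here can be sketched in plain Python. This is only an illustration under stated assumptions: `joint_frequency`, the bin edges, and the sample values are hypothetical, and the binning convention (explicit edges, half-open bins) is a choice made for the example, not a reproduction of how "hist3" places its bins.

```python
def joint_frequency(x, y, x_edges, y_edges):
    """Relative joint frequencies of paired samples on a 2D bin grid.

    Rows index the x bins and columns index the y bins. Bins are half-open
    [lo, hi), except the last one, which also includes its upper edge.
    Samples outside the edges are ignored.
    """
    def bin_index(v, edges):
        if v < edges[0] or v > edges[-1]:
            return None
        for i in range(len(edges) - 1):
            if v < edges[i + 1]:
                return i
        return len(edges) - 2  # v equals the last edge

    counts = [[0] * (len(y_edges) - 1) for _ in range(len(x_edges) - 1)]
    total = 0
    for xv, yv in zip(x, y):
        i, j = bin_index(xv, x_edges), bin_index(yv, y_edges)
        if i is not None and j is not None:
            counts[i][j] += 1
            total += 1
    if total == 0:
        raise ValueError("no samples fall inside the bin edges")
    # Normalise counts so that all entries of the matrix sum to 1.
    return [[c / total for c in row] for row in counts]

# Hypothetical samples: 2 torque bins x 3 speed bins gives a 2x3 matrix.
torque = [10, 20, 30, 40]
speed = [100, 200, 250, 400]
freq = joint_frequency(torque, speed, [0, 25, 50], [0, 150, 300, 450])
```

Note that the whole matrix sums to 1, not each row, which matches the example matrix in the first post.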

I've tried the method and it is rather effective. What I'm trying to understand now is how to properly set a "threshold" for outlier identification.

How can I weight torque more than speed? It might be useful to me.
 
  • #6
OK your description is now clear. I didn't think it was that because your rows do not sum to 1.

First, the threshold: you could just set the percentage of outliers (that's roughly what a z-score cutoff does under a fancier name), or you could look at a chart of ##d_i## as a function of rank, i.e. sort them first and see whether some points or a group stand out (like a break in the curve...). Actually, you're looking at a univariate distribution now (the ##d_i##'s), so anything you usually do in that case is applicable; that was the reason for introducing them.
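Both threshold ideas can be sketched in a few lines of Python. The distances below are invented for illustration, and the cutoff of 3 standard deviations is a common convention assumed here, not something prescribed in this thread.

```python
import statistics

def z_scores(values):
    """Standard z-scores using the sample mean and sample standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [(v - mu) / sigma for v in values]

def flag_outliers(values, z_threshold=3.0):
    """Indices whose z-score exceeds the threshold (large distances only)."""
    return [i for i, z in enumerate(z_scores(values)) if z > z_threshold]

# Hypothetical distances d_i: eleven ordinary values and one large one.
d = [0.05, 0.06, 0.04, 0.05, 0.07, 0.05,
     0.06, 0.04, 0.05, 0.06, 0.05, 0.45]

outliers = flag_outliers(d, z_threshold=3.0)   # assumed cutoff of 3 sigma

# Rank-based alternative: sort the distances and look for the largest drop
# between consecutive values, i.e. the "break in the curve".
ranked = sorted(d, reverse=True)
gaps = [ranked[i] - ranked[i + 1] for i in range(len(ranked) - 1)]
break_rank = max(range(len(gaps)), key=gaps.__getitem__) + 1
# break_rank is the number of points sitting above the largest gap.
```

One caveat with the plain z-score: a very large ##d_i## also inflates the standard deviation, so with few samples the z-score of even an obvious outlier stays bounded. Plotting the sorted distances is a useful sanity check alongside it.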

A weighted distance would be
##d'_i=d'(M_i,\bar M)=\sqrt{\sum_{j,k}w_{jk}(M_{i;jk}-\bar M_{jk})^2}## where ##w_{jk}\geq 0, \sum_{j,k}w_{jk}=1## is a set of weights you like.

You can pick any weights that make sense for the problem - say the second bin matters more than the others: increase the ##w_{j2}##. Torque distribution matters more than speed: increase ##w_{1k}##, etc.
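A minimal Python sketch of the weighted distance ##d'_i##, assuming matrices stored as lists of rows. The helper names, the weights and the toy matrices are hypothetical; the weights emphasise the second column (the ##w_{j2}## case) and are rescaled so that they sum to 1.

```python
import math

def normalise(w):
    """Rescale non-negative weights so that they sum to 1."""
    s = sum(sum(row) for row in w)
    return [[x / s for x in row] for row in w]

def weighted_distance(m, ref, w):
    """sqrt( sum_jk w_jk * (m_jk - ref_jk)^2 ) for equally sized matrices."""
    total = 0.0
    for row_m, row_r, row_w in zip(m, ref, w):
        for a, b, w_jk in zip(row_m, row_r, row_w):
            total += w_jk * (a - b) ** 2
    return math.sqrt(total)

# Hypothetical reference (e.g. the average matrix) and two test matrices.
ref = [[0.10, 0.30, 0.10], [0.40, 0.05, 0.05]]
a = [[0.10, 0.20, 0.10], [0.40, 0.15, 0.05]]   # differs only in column 2
b = [[0.20, 0.30, 0.10], [0.30, 0.05, 0.05]]   # differs only in column 1

uniform = normalise([[1, 1, 1], [1, 1, 1]])
col2_heavy = normalise([[1, 4, 1], [1, 4, 1]])  # assumed weights, column 2 x4

# With uniform weights a and b are equally far from ref; with column 2
# weighted four times as much, a ends up twice as far from ref as b.
d_a_uniform = weighted_distance(a, ref, uniform)
d_b_uniform = weighted_distance(b, ref, uniform)
d_a_heavy = weighted_distance(a, ref, col2_heavy)
d_b_heavy = weighted_distance(b, ref, col2_heavy)
```

Uniform weights reproduce the unweighted ranking, since all distances are then scaled by the same constant factor.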

There are other ways of getting a better selection, but these would require you to think more about what properties of the matrices are important to you and what criterion for two matrices being "close" makes the most sense - then translating those thoughts into mathematical form.

But as long as this one works, why bother :)
 
  • #7
Oh great, so I can use any univariate outlier identification method. I'll use the z-score then.

Thanks a lot for your help. On Friday I was going crazy trying to visually analyze all those matrices. I'll keep you updated ;)
 

FAQ: Outlier identification of multivariate data

1. What is an outlier in multivariate data?

An outlier in multivariate data is a data point that deviates significantly from the rest of the data points in terms of its values or characteristics. It can be an extreme value or a data point that does not follow the general trend of the data. Outliers can affect the overall analysis and interpretation of the data, and therefore, it is important to identify and handle them appropriately.

2. How do you identify outliers in multivariate data?

There are various methods for identifying outliers in multivariate data, such as visual inspection using scatter plots or box plots, distance-based statistics like the z-score or the Mahalanobis distance, and machine learning approaches based on k-nearest neighbors. Combining several of these methods usually gives a more reliable identification than relying on a single one.

3. Why is it important to identify outliers in multivariate data?

Identifying outliers in multivariate data is important because they can significantly impact the results of statistical analyses and machine learning models. Outliers can skew the data and affect the overall distribution, leading to incorrect conclusions and predictions. By identifying and handling outliers, we can ensure the accuracy and reliability of our data analysis.

4. How can outliers be handled in multivariate data?

Outliers in multivariate data can be handled in various ways, depending on the specific situation and goals of the analysis. Some options include removing the outliers from the dataset, transforming the data, or using robust statistical methods that are less affected by outliers. It is important to carefully consider the potential impact and consequences of each approach before deciding on the best course of action.

5. Can outliers be useful in multivariate data analysis?

In some cases, outliers can provide valuable insights and information about the data. They can represent rare or unusual events that are important to consider in the analysis. However, it is crucial to evaluate the outliers carefully and determine if they are genuine data points or errors before incorporating them into the analysis. Outliers should not be blindly ignored or used without proper consideration.
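As one example of the robust statistical methods mentioned in answer 4, the modified z-score replaces the mean and standard deviation with the median and the median absolute deviation (MAD). The sketch below is an assumption-laden illustration: the constant 0.6745 and the cutoff of 3.5 are the values commonly quoted for this score, and the data are invented.

```python
import statistics

def modified_z_scores(values):
    """Modified z-scores based on the median and the MAD."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; the modified z-score is undefined")
    return [0.6745 * (v - med) / mad for v in values]

# Hypothetical distances: eleven ordinary values and one large one.
d = [0.05, 0.06, 0.04, 0.05, 0.07, 0.05,
     0.06, 0.04, 0.05, 0.06, 0.05, 0.45]

scores = modified_z_scores(d)
# Commonly used cutoff of 3.5 (an assumption, tune it for your data).
flagged = [i for i, s in enumerate(scores) if s > 3.5]
```

Because the median and the MAD barely move when a single extreme value is present, the outlier does not mask itself the way it can with the ordinary z-score.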
