Looking for advice in clusterization

  • Thread starter Frank Einstein
  • Tags
    Time series
  • #1
Frank Einstein
TL;DR Summary
I need to know how to cluster data measured at different time instants.
Hello everyone. I have a machine with a series of sensors. All sensors send a signal each minute. I want to know if any of those sensors are redundant. The data is available as an Excel file, where the columns are the variables and the rows are the measurements. I have 1000 rows.

To do this, I have used DBSCAN in Python as follows:

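Data clusterization:
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# data: the table from the Excel file (rows = measurements, columns = sensors)
scaler = StandardScaler()
data_normalized = scaler.fit_transform(data)  # standardize each sensor column
data_normalized = data_normalized.T  # transpose so that each sensor is one sample
dbscan = DBSCAN(eps=15, min_samples=2)
clusters = dbscan.fit_predict(data_normalized)  # one label per sensor, -1 means noise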

However, I think that there has to be a better way to find relationships between the variables (that is, the sensors, which are the columns of the data file).

Could someone please point me towards a methodology more suitable for my goals?
Any answer is appreciated.
Thanks for reading.
Best regards.
Frank.
 
  • #2
You can just look at the correlation matrix. If two inputs are highly correlated then you can probably drop one.
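A minimal sketch of the correlation-matrix check suggested here, assuming the sensor table is available as a pandas DataFrame. The helper name `highly_correlated_pairs`, the 0.95 threshold, and the synthetic sensor columns are illustrative assumptions, not part of the original post:

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(data: pd.DataFrame, threshold: float = 0.95):
    """Return (col_a, col_b, r) for column pairs with |Pearson r| >= threshold."""
    corr = data.corr()  # Pearson correlation between columns (sensors)
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skip the diagonal
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs


# Synthetic stand-in for the sensor table: 1000 rows, 3 hypothetical sensors.
rng = np.random.default_rng(0)
base = rng.normal(size=1000)
data = pd.DataFrame({
    "sensor_a": base,
    "sensor_b": 2.0 * base + 0.01 * rng.normal(size=1000),  # near-duplicate of sensor_a
    "sensor_c": rng.normal(size=1000),                       # independent signal
})

pairs = highly_correlated_pairs(data)
for col_a, col_b, r in pairs:
    print(f"{col_a} ~ {col_b}: r = {r:.4f}")
```

Pearson correlation only captures linear relationships, so a pair that passes the threshold is a candidate for dropping rather than a proven redundancy.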
 
  • #3
Dale said:
You can just look at the correlation matrix. If two inputs are highly correlated then you can probably drop one.
Thanks. I can calculate them with ease as well.
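As a side note, the DBSCAN attempt in post #1 and the correlation matrix are closely related. After StandardScaler, every column has zero mean and unit (population) variance, so the Euclidean distance d between two standardized columns of length n satisfies d² = 2n(1 − r), where r is their Pearson correlation. The sketch below checks this numerically on synthetic data; the helper name `eps_to_min_correlation` and the data are assumptions made for illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 1000
x = rng.normal(size=n)
y = 0.9 * x + 0.3 * rng.normal(size=n)  # a second signal correlated with x
data = np.column_stack([x, y])

# Distance between the two standardized columns, as DBSCAN would see it
z = StandardScaler().fit_transform(data)
dist = float(np.linalg.norm(z[:, 0] - z[:, 1]))

# The same distance predicted from the Pearson correlation
r = float(np.corrcoef(data[:, 0], data[:, 1])[0, 1])
dist_from_r = float(np.sqrt(2 * n * (1 - r)))
print(dist, dist_from_r)


def eps_to_min_correlation(eps: float, n_rows: int) -> float:
    """Correlation at which two standardized columns are exactly eps apart."""
    return 1.0 - eps**2 / (2.0 * n_rows)


min_r = eps_to_min_correlation(15, 1000)
print(min_r)
```

Under these assumptions, eps=15 with 1000 rows links two sensors directly when their correlation is at least about 0.8875. Strongly negatively correlated sensors end up far apart under this distance, so DBSCAN would not group them, whereas the absolute value of the correlation would flag them.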
 

FAQ: Looking for advice in clusterization

1. What is clusterization in data science?

Clusterization, or clustering, is a technique used in data science to group a set of objects in such a way that objects in the same group (or cluster) are more similar to each other than to those in other groups. It is an unsupervised learning method used for pattern recognition, data compression, and anomaly detection.

2. What are the common algorithms used for clusterization?

Some common clustering algorithms include K-Means, Hierarchical Clustering, DBSCAN (Density-Based Spatial Clustering of Applications with Noise), and Gaussian Mixture Models (GMM). Each algorithm has its strengths and weaknesses, and the choice of algorithm depends on the specific characteristics of the data and the goals of the analysis.
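For a rough side-by-side feel, the four algorithms named above can be run on the same toy data with scikit-learn. This is only a sketch: the blob centers, `eps`, `min_samples`, and the cluster counts are assumptions chosen to suit this synthetic example, not recommended defaults.

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Three well-separated synthetic blobs (illustrative data only)
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

labels = {
    "kmeans": KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X),
    "hierarchical": AgglomerativeClustering(n_clusters=3).fit_predict(X),
    "dbscan": DBSCAN(eps=1.5, min_samples=5).fit_predict(X),
    "gmm": GaussianMixture(n_components=3, random_state=0).fit_predict(X),
}

# Count clusters per algorithm; DBSCAN marks noise points with the label -1
n_found = {name: len(set(lab) - {-1}) for name, lab in labels.items()}
for name, count in n_found.items():
    print(f"{name}: {count} clusters")
```

K-Means, hierarchical clustering, and GMM are told how many groups to find here, while DBSCAN infers the count from the density parameters.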

3. How do I choose the number of clusters in K-Means clustering?

Choosing the number of clusters (k) in K-Means clustering can be done using methods such as the Elbow Method, where you plot the sum of squared distances from each point to its assigned cluster center and look for an "elbow" point where the rate of decrease sharply slows. Another method is the Silhouette Score, which measures how similar an object is to its own cluster compared to other clusters.
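Both approaches can be sketched with scikit-learn, where `inertia_` is the sum of squared distances used by the Elbow Method. The synthetic data and the candidate range of k are assumptions for illustration:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three true groups (illustrative only)
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

inertia = {}
silhouette = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertia[k] = model.inertia_  # sum of squared distances to assigned centers
    silhouette[k] = silhouette_score(X, model.labels_)  # mean silhouette, higher is better

for k in inertia:
    print(f"k={k}  inertia={inertia[k]:.1f}  silhouette={silhouette[k]:.3f}")

# Pick the k with the highest mean silhouette
best_k = max(silhouette, key=silhouette.get)
print("best k by silhouette:", best_k)
```

Plotting `inertia` against k (for example with matplotlib) gives the elbow curve. On real data the elbow is often less sharp than on clean blobs, so the two criteria may not agree.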

4. What are the advantages and disadvantages of hierarchical clustering?

Hierarchical clustering has the advantage of not requiring the number of clusters to be specified in advance, and it produces a dendrogram that can be useful for understanding the data structure. However, it is computationally intensive for large datasets, since standard agglomerative algorithms need at least quadratic time and memory to handle all pairwise distances, and it can be sensitive to noise and outliers.
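Applied to the sensor question from the thread, a hedged sketch could cluster the columns hierarchically using 1 − |r| as the distance between sensors. The synthetic sensors, the average linkage method, and the cut height of 0.1 are all assumptions for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
n = 1000
a = rng.normal(size=n)
b = rng.normal(size=n)
# Five hypothetical sensors: two near-duplicate pairs and one independent signal
sensors = np.column_stack([
    a,
    a + 0.05 * rng.normal(size=n),
    b,
    b + 0.05 * rng.normal(size=n),
    rng.normal(size=n),
])

corr = np.corrcoef(sensors, rowvar=False)  # columns are the variables
dist = 1.0 - np.abs(corr)                  # 0 means perfectly (anti)correlated
np.fill_diagonal(dist, 0.0)                # remove floating-point residue on the diagonal

# linkage expects the condensed (upper-triangle) form of the distance matrix
Z = linkage(squareform(dist, checks=False), method="average")

# Cut the tree so that sensors closer than 0.1 share a group
groups = fcluster(Z, t=0.1, criterion="distance")
print(groups)
```

The same linkage matrix `Z` can be passed to `scipy.cluster.hierarchy.dendrogram` to draw the tree, which makes it easier to choose the cut height by eye instead of fixing it in advance.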

5. How can I evaluate the quality of my clustering results?

Clustering quality can be evaluated using various metrics such as the Silhouette Score, Davies-Bouldin Index, and Dunn Index. Additionally, visual methods like plotting the clusters can provide intuitive insights. Internal validation methods assess the clustering structure without external information, while external validation methods compare the clustering results to a ground truth.
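A short sketch of these metrics follows. scikit-learn provides the Silhouette Score and the Davies-Bouldin Index; it has no Dunn Index, so `dunn_index` below is a hand-written helper implementing one common variant (smallest between-cluster distance over largest cluster diameter). The adjusted Rand index stands in as an example of external validation, and the data are synthetic:

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, davies_bouldin_score, silhouette_score


def dunn_index(X, labels):
    """Smallest between-cluster distance divided by largest cluster diameter."""
    clusters = [X[labels == c] for c in np.unique(labels)]
    max_diameter = max(cdist(c, c).max() for c in clusters)
    min_separation = min(
        cdist(clusters[i], clusters[j]).min()
        for i in range(len(clusters))
        for j in range(i + 1, len(clusters))
    )
    return min_separation / max_diameter


# Synthetic data with known ground-truth labels (illustrative only)
X, y_true = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Internal validation: uses only the data and the predicted labels
sil = silhouette_score(X, labels)      # higher is better, at most 1
dbi = davies_bouldin_score(X, labels)  # lower is better, at least 0
dunn = dunn_index(X, labels)           # higher is better

# External validation: compares predicted labels against a ground truth
ari = adjusted_rand_score(y_true, labels)

print(f"silhouette={sil:.3f}  davies_bouldin={dbi:.3f}  dunn={dunn:.3f}  ARI={ari:.3f}")
```

The Dunn helper computes all pairwise distances, so it is only practical for small datasets.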
