Looking for the most suitable distance for binary clustering

  • #1
Frank Einstein
TL;DR Summary
I have a dataset of people logging into a server, and I need to find the most suitable distance measure for clustering them.
Hello everyone.

I have a pandas DataFrame in Python with n+1 columns and t rows. The first column is a timestamp that advances second by second over a time interval, and each of the other n columns is named after a person who logs into the server. In each of the t rows, those columns hold a "1" if the person is logged in during that exact second and a "0" if not.

I have used hierarchical clustering with the Hamming distance and average linkage.
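For readers who want to reproduce this setup, here is a minimal sketch of what it might look like with SciPy. The toy data, the user names, and the choice of two clusters are all assumptions made for illustration; they are not taken from the original dataset.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Hypothetical toy data: one row per second, one 0/1 column per user.
df = pd.DataFrame(
    {
        "timestamp": np.arange(6),  # seconds since the start of the interval
        "alice": [1, 1, 1, 0, 0, 0],
        "bob":   [1, 1, 0, 0, 0, 0],
        "carol": [0, 0, 0, 1, 1, 1],
        "dave":  [0, 0, 1, 1, 1, 1],
    }
)

# The users are the observations, so transpose: one row per user.
users = df.drop(columns="timestamp").T
X = users.to_numpy(dtype=bool)

# Condensed pairwise Hamming distances
# (fraction of seconds in which two users' states differ).
d = pdist(X, metric="hamming")

# Average-linkage hierarchical clustering on those distances.
Z = linkage(d, method="average")

# Cut the tree into two flat clusters (the value 2 is arbitrary here).
labels = fcluster(Z, t=2, criterion="maxclust")
clusters = dict(zip(users.index, labels))
print(clusters)
```

Note the transpose: the DataFrame stores users as columns, but `pdist` computes distances between rows.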

However, I am not sure whether the Hamming distance is the most suitable measure for clustering the users, especially after reading this article, which compares 76 distance measures.

I am not an expert in clustering, so I would like to know which distance measure other people think would be most suitable for grouping the users.

As far as I know, both positive and negative matches are important in this case, so might the Sokal-Michener distance be suitable?
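One point worth checking here: under the usual definition, the Sokal-Michener (simple matching) similarity is (a + d) / (a + b + c + d), where a and d count the 1/1 and 0/0 matches and b and c count the mismatches. One minus that value is the fraction of mismatches, which is the normalized Hamming distance. The sketch below uses two made-up users to show this, with the Jaccard distance (which ignores 0/0 matches) for contrast. The function names are hypothetical helpers, and implementations under the Sokal-Michener name can weight mismatches differently, so it is worth checking the formula a given library uses.

```python
import numpy as np

def match_counts(x, y):
    """Return (a, b, c, d): counts of 1/1, 1/0, 0/1 and 0/0 positions."""
    x = np.asarray(x, dtype=bool)
    y = np.asarray(y, dtype=bool)
    a = int(np.sum(x & y))
    b = int(np.sum(x & ~y))
    c = int(np.sum(~x & y))
    d = int(np.sum(~x & ~y))
    return a, b, c, d

def simple_matching_distance(x, y):
    """1 - (a + d) / n, i.e. the fraction of mismatching positions."""
    a, b, c, d = match_counts(x, y)
    return (b + c) / (a + b + c + d)

def hamming_distance(x, y):
    """Normalized Hamming distance: fraction of positions that differ."""
    return float(np.mean(np.asarray(x) != np.asarray(y)))

def jaccard_distance(x, y):
    """Mismatches divided by positions where at least one vector is 1."""
    a, b, c, _ = match_counts(x, y)
    # Assumed convention: two all-zero vectors are at distance 0.
    return (b + c) / (a + b + c) if (a + b + c) else 0.0

# Two hypothetical users who are offline most of the time
# and are never online during the same second.
u = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
v = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]

smd = simple_matching_distance(u, v)
ham = hamming_distance(u, v)
jac = jaccard_distance(u, v)
print(smd, ham, jac)
```

In this example the simple matching and Hamming distances are both 0.2, because the eight shared offline seconds count as agreement, while the Jaccard distance is 1.0 because the two users are never online together. Whether shared offline time should count as similarity depends on the goal of the clustering.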

Any recommendation is welcome.
Best regards and thanks for reading.
 
  • #2
I think it would help to start by explaining why you are clustering users. A metric's suitability is defined by what your end objective is.
 

FAQ: Looking for the most suitable distance for binary clustering

What is binary clustering?

In the context of this thread, binary clustering means clustering observations that are described by binary (0/1) features, such as "logged in/not logged in" for each second. The term is also sometimes used for splitting data into exactly two groups, such as "yes/no" or "true/false".

Why is choosing the right distance metric important in binary clustering?

The distance metric determines how the similarity between data points is calculated. Choosing the right distance metric is crucial because it directly impacts the quality and accuracy of the clustering results. Different distance metrics can lead to different cluster formations, which can affect the interpretation and usability of the clusters.

What are some common distance metrics used in binary clustering?

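Some common distance metrics used in binary clustering include Euclidean distance, Manhattan distance, cosine distance, Jaccard distance, and Hamming distance. On 0/1 data, the Manhattan distance is the number of mismatching positions (the unnormalized Hamming distance) and the Euclidean distance is its square root, so these three rank pairs of points identically. Jaccard and cosine distance differ from them by ignoring positions where both values are 0.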

How can I determine the most suitable distance metric for my binary clustering task?

To determine the most suitable distance metric, you should consider the nature of your data and the specific requirements of your clustering task. You can experiment with different distance metrics and evaluate their performance using clustering validation techniques such as silhouette score, Davies-Bouldin index, or by visually inspecting the clusters. Additionally, domain knowledge and prior research can provide insights into which metric might be most appropriate.
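As a rough illustration of that kind of experiment, the sketch below clusters the same synthetic 0/1 data under two distances and reports a silhouette score for each. The data (two groups of five hypothetical users with 10% random noise), the choice of two clusters, and the use of scikit-learn's `silhouette_score` with a precomputed distance matrix are all assumptions made for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Hypothetical data: 10 users observed over 200 seconds.
# Group A is mostly online in the first half, group B in the second half.
n_seconds = 200
pattern_a = np.arange(n_seconds) < n_seconds // 2
pattern_b = ~pattern_a
base = np.vstack([pattern_a] * 5 + [pattern_b] * 5)

# Flip roughly 10% of the entries to simulate irregular behaviour.
noise = rng.random(base.shape) < 0.1
X = base ^ noise

scores = {}
for metric in ("hamming", "jaccard"):
    # Pairwise distances between users under the chosen metric.
    d = pdist(X, metric=metric)
    # Average-linkage clustering, cut into two flat clusters.
    labels = fcluster(linkage(d, method="average"), t=2, criterion="maxclust")
    # Silhouette computed from the full square distance matrix.
    scores[metric] = silhouette_score(squareform(d), labels, metric="precomputed")

print(scores)
```

Each score is measured in its own distance, so the comparison is only indicative. A higher silhouette does not by itself show that a metric matches the purpose of the clustering.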

Can I use multiple distance metrics for binary clustering?

Yes, it is possible to use multiple distance metrics for binary clustering. One approach is to combine different metrics using techniques like weighted averages or ensemble methods. This can help capture different aspects of the data and potentially improve clustering performance. However, combining metrics should be done carefully to ensure that it enhances rather than confuses the clustering results.
