Skip to the content.

Home

Comparison of Clustering Algorithms

Many types of clustering algorithms are available within Python’s sklearn library. Below is a comparison of three types of clustering algorithms: K-means, Hierarchical clustering, and DBSCAN (density) clustering:

Data Prep

Before Cleaning

For our clustering algorithms, we will use the same dataset that was used for PCA. Below is a snippet of this dataset:

Pre-Clustering Data

This is the data that clustering will be performed on, before cleaning. Note that the "label" in this case would be placement. Note also the many quantitative features. Time measurements (game length and time eliminated) are in seconds. Link to data: Pre-Clustering Data

After Cleaning

Cleaned Clustering Data

This is the data that clustering will be performed on, after cleaning. Note that the "placement" column has been removed, and both non-quantitative columns have also been removed. Link to data: Cleaned Clustering Data

Normalization

Scaled Data

This is the data that clustering will be performed on, after standardizing such that the mean of each column is 0 and the standard deviation of each column is 1 - this is also known as normalization. Link to data: Scaled Clustering Data

PCA Reduction

PCA 3D Data Snippet

This is the data that clustering will be performed on, after having performed PCA to reduce the normalized data to three dimensions. Link to data: PCA Clustering Data

Code and Results

All code used can be found here: Code

K-Means

The Silhouette method, which measures how close clusters are to each other, was used to determine the appropriate values of K to perform K means with. Below is an image detailing the output of the silhouette method:

Silhouette Output

This is the output from performing the silhouette method on our PCA-reduced data to find optimal values of K. Although we can see that there are no values which have very high scores, values of 2,5, and 8 have the highest.

Values of 2, 5, and 8 for K were chosen. Intuitively, the values of 2 and 8 both make sense. Recall that our label attribute is “placement”, which is an integer from 1 to 8 detailing the placement a player earned at the time of their elimination (1 means the player scored first place; 8 means the player scored 8th, or last place, in the game). With K=2, hopefully the algorithm would create one cluster containing players who scored in the top four, and another for players who ended the game in the bottom four. With K=8, the goal would be that the algorithm would cluster all players neatly based upon their placement in the game. A K=5 is less interpretable, since this does not divide the number of players in the game evenly, but we will use it since it had a high silhouette score.

Below are the resulting plots created after performing K-Means for values of 2,5, and 8 for K. Centroids are denoted with a large red X, and data points are mostly transparent so the centroids can be seen, but are colored according to their placement label:

Silhouette Output

This is the output from performing the silhouette method on our PCA-reduced data to find optimal values of K. Although we can see that there are no values which have very high scores, values of 2,5, and 8 have the highest.

K=2 Plot

This is the 3D plot after performing K-means with K=2. Note the centroids denoted by large red X's

K=5 Plot

This is the 3D plot after performing K-means with K=5. Note the centroids denoted by large red X's

K=8 Plot

This is the 3D plot after performing K-means with K=8. Note the centroids denoted by large red X's

Overall, the plots are difficult to interpret, because of the large amount of data (over 10000 points, very densely distributed) and the difficulty of seeing how the centroids correspond to different clusters that may exist. We saw in our silhouette scores that no clusters were very far from each other, and, as we increase the numbers of clusters and centroids, we can observe this to be the case (particularly when K=5 and K=8, some centroids cannot be seen due to how close they are in proximity to other centroids). The K=8 plot gives us the least amount of usable information. We see centroids that may roughly correspond to placement, but many are very close together to the point where placement may not be an appropriate division between clusters. Two centroids appear to be much different and lie on a different plane than the other six, but the color distribution of the data points indicate that one or two placement labels may have data distributed between centroids as opposed to one centroid corresponding to one particular color, which may not be very helpful. The K=5 plot similarly has one centroid that appears to lie on a different plane than the other four. We do see that centroids in this plot appear to be more associated with a specific color/placement value, but there is one centroid which must be so close to another that it cannot be directly seen on the plot. The most helpful plot is likely the simplest for K=2. We see two centroids, one that appears to correspond with lower placement (lighter colored data) and one that lies in a center of darker colors (higher placement). These plots indicate that K-means likely is not that helpful of a clustering algorithm in this case, where lots of data is so densely distributed.

Hierarchical Clustering

When performing hierarchical clustering, Ward’s linkage method was used in order to minimize the variance within each cluster. To compare with the most useful result from our previous K-Means, 2 clusters were chosen to be chosen by the algorithm. A dendrogram showing the results of the clustering is below

Dendrogram

This is the dendrogram produced with hierarchical clustering. Note two clusters, one of smaller size and one of large size.

This potentially gives us a clearer picture of what K-means with K=2 showed us. Hierarchical clustering generated two clusters, one small and one quite large. These would need to be related back to original placement labels to determine if these clusters are useful in predicting player placement.

DBSCAN

DBSCAN was performed on the data and 5 clusters were found. An image of the data colored by DBSCAN cluster label is shown below:

DBSCAN Plot

This is the plot produced by visualizing the PCA-reduced 3D data and labeling according to cluster produced by DBSCAN

The image is difficult to decipher, again because of how dense the data is packed together. DBSCAN found 5 clusters, which again is slightly problematic, since 8 players are not neatly divisible by this number. This may mean that some placements of players have very similar metrics regardless of placement, which would be problematic for trying to predict placement. More investigation into relating these clusters back to placement labels would have to be done to find whether the DBSCAN clustering was useful.

Conclusions

Overall, the density of the particular dataset used for clustering proved problematic for these algorithms to distinguish identifiable clusters or groups of data with similar characteristics. It may be possible to group players into broad categories using K-Means, such as those that finished in the top 4 vs those that finished in the bottom 4. Because this data signifies all challenger players, or the highest-ranked players to have been playing the game at the time of data collection, it is likely that there are not very identifiable differences in these players’ performances, regardless of their final placement in a particular game, and it is likely that their metrics reflect this. Or, it is possible that the metrics chosen are simply not related to player placement in any way that clustering algorithms can identify neatly. Clustering is an excellent way to find more information about unlabeled data; in this case, since we have labeled data, clustering may not be as useful.

Home