However, despite being fundamental to clustering tasks and an active research topic, very few internal CVIs are implemented in standard Python libraries (only 3 in [scikit-learn](https://scikit-learn.org/stable/index.html); more were available in R, but few were maintained and kept in CRAN [@Charrad2014nbclust]). Thus, for a given CVI, there is currently no corresponding maintained, public implementation. This is despite the well-known limitations of all existing CVIs [@Arbelaitz2013; @Gagolewski2021; @Gurrutxaga2011; @Theodoridis2009Chap16] and the need to use the right one(s) for the specific dataset at hand, similarly to matching the right clustering method with the given problem. A crucial step towards developing better CVIs would be easy access to an implementation of existing CVIs, in order to facilitate larger comparative studies.
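For illustration, the three internal CVIs that scikit-learn does ship (Silhouette, Calinski-Harabasz and Davies-Bouldin) can be applied to a clustering as follows; the dataset and clustering method here are arbitrary choices for the sketch:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    silhouette_score,
    calinski_harabasz_score,
    davies_bouldin_score,
)

# Synthetic static data with 3 well-separated clusters.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# The three internal CVIs available in scikit-learn.
print(silhouette_score(X, labels))          # higher is better
print(calinski_harabasz_score(X, labels))   # higher is better
print(davies_bouldin_score(X, labels))      # lower is better
```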
In addition, all CVIs rely on the definition of a distance between datapoints, and most of them also on the notion of a cluster center.
For static data, the distance between datapoints is usually the Euclidean distance and the cluster center is defined as the usual average. Libraries such as [scipy](https://docs.scipy.org/doc/scipy/index.html), [numpy](https://numpy.org/doc/stable/), [scikit-learn](https://scikit-learn.org/stable/index.html), etc. offer a large selection of distance measures that are compatible with their main functions.
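A minimal sketch of this static-data convention, with toy values and nothing PyCVI-specific:

```python
import numpy as np
from scipy.spatial.distance import cdist

# A toy cluster of three 2-D datapoints.
cluster = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]])

# For static data, the cluster center is the componentwise mean...
center = cluster.mean(axis=0)  # [1., 1.]

# ...and distances to it are plain Euclidean distances.
dists = cdist(cluster, center[None, :], metric="euclidean").ravel()
print(center, dists)
```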
For time-series data, however, the common distance used is Dynamic Time Warping (DTW) [@Berndt1994UsingDTW], and the barycenter of a group of time series is then not defined as the usual mean but as the DTW Barycentric Average (DBA) [@Petitjean2011global]. Unfortunately, DTW and DBA are not compatible with the libraries mentioned above. This, among other reasons, made additional machine learning libraries specialized in time series data such as [aeon](https://www.aeon-toolkit.org/en/latest/index.html), [sktime](https://www.sktime.net/en/stable/index.html) and [tslearn](https://tslearn.readthedocs.io/en/stable/) necessary.
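The DTW recursion itself is a short dynamic program. The sketch below is a naive O(nm) version for 1-D series, purely illustrative (PyCVI delegates the actual computation to aeon):

```python
import numpy as np

def dtw_distance(s, t):
    """Dynamic Time Warping distance between two 1-D series.

    Classic dynamic program: cost[i, j] is the best alignment
    cost of s[:i] against t[:j].
    """
    n, m = len(s), len(t)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(s[i - 1] - t[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]

# DTW aligns shifted patterns (even of different lengths) that a
# pointwise Euclidean comparison would penalize or cannot handle.
a = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
b = np.array([0.0, 0.0, 1.0, 2.0, 1.0, 0.0])  # same shape, delayed
print(dtw_distance(a, b))  # 0.0: a perfect warping path exists
```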
PyCVI fills that gap by implementing 12 state-of-the-art internal CVIs: Hartigan [@Strauss1975], Calinski-Harabasz [@Calinski1974dendrite], GapStatistic [@Tibshirani2001Estimating], Silhouette [@rousseeuw1987silhouettes], ScoreFunction [@Saitta2007Bounded], Maulik-Bandyopadhyay [@Maulik2002Performance], SD [@Halkidi2000Quality], SDbw [@halkidi2001clustering], Dunn [@Dunn1974Well], Xie-Beni [@Xie1991validity], XB* [@Kim2005New] and Davies-Bouldin [@Davies1979Cluster]. Furthermore, in PyCVI their definition is extended in order to make them compatible with DTW and DBA in addition to static data. Finally, PyCVI is entirely compatible with [scikit-learn](https://scikit-learn.org/stable/index.html), [scikit-learn-extra](https://scikit-learn-extra.readthedocs.io/en/stable/), [aeon](https://www.aeon-toolkit.org/en/latest/index.html) and [sktime](https://www.sktime.net/en/stable/index.html), in order to be easily integrated into any clustering pipeline in Python. To ensure fast and reliable computation of DTW and DBA, PyCVI relies on the [aeon](https://www.aeon-toolkit.org/en/latest/index.html) library.
# Example


We experimented with 3 cases: [static data](https://github.com/deric/clustering-benchmark), time-series data [@UCRArchive2018] with the Euclidean distance, and time-series data with DTW as the distance measure and DBA as the cluster center. In addition, we used different clustering methods from different libraries: KMeans [@lloyd1982least] and AgglomerativeClustering [@Ward1963] from [scikit-learn](https://scikit-learn.org/stable/index.html), TimeSeriesKMeans from [aeon](https://www.aeon-toolkit.org/en/latest/index.html) and KMedoids [@Kaufman1990Partitioning] from [scikit-learn-extra](https://scikit-learn-extra.readthedocs.io/en/stable/), to showcase PyCVI integration with other clustering libraries.
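The experimental loop can be sketched with scikit-learn alone; KMedoids and TimeSeriesKMeans plug into the same pattern, since all of these estimators expose `fit_predict`. The dataset and the range of cluster counts below are arbitrary choices for the sketch:

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=4, random_state=1)

# Generate one clustering per candidate number of clusters and per
# method; a CVI is then used to select the best candidate clustering.
clusterings = {}
for k in range(2, 7):
    clusterings[("KMeans", k)] = KMeans(
        n_clusters=k, n_init=10, random_state=1
    ).fit_predict(X)
    clusterings[("Agglomerative", k)] = AgglomerativeClustering(
        n_clusters=k
    ).fit_predict(X)

print(len(clusterings))  # 10 candidate clusterings
```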
As a first example, we individually ran all CVIs implemented in PyCVI, selected the best clustering according to each CVI and plotted the selected clustering. In addition, we computed the variation of information (VI) between each selected clustering and the true clustering. High VI values mean large distances between the true clustering and the computed clusterings, meaning that the computed clusterings are of poor quality. In \autoref{fig:barton}, we can see the difference in quality, assuming the correct number of clusters, between the AgglomerativeClustering and KMeans clustering methods on static data. This is independent of the CVI used, meaning that a poor clustering quality is due to the clustering method.
In \autoref{fig:barton}, since the quality of the clusterings generated by KMeans is poor due to the clustering method, the poor selection results give us no information about the correct clustering, nor about the quality of the CVIs used. This motivates further research on clustering methods. With AgglomerativeClustering, however, the quality of the clustering is excellent, as indicated by a null VI. The selection results shown in the corresponding histogram tell us that the CVIs used here are not adapted to this dataset. This was expected, since most CVIs rely on the cluster center to compute a good separation between clusters. As the dataset here consists of concentric circles, most CVIs fail to measure how well separated the clusters actually are. This illustrates the need for further research on CVIs, notably in the case of concentric subgroups, which PyCVI facilitates.
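The variation of information used above is defined as VI(A, B) = H(A) + H(B) − 2 I(A, B); a small self-contained helper (illustrative, not PyCVI's API) can be built from scipy and scikit-learn:

```python
import numpy as np
from scipy.stats import entropy
from sklearn.metrics import mutual_info_score

def variation_of_information(labels_a, labels_b):
    """VI(A, B) = H(A) + H(B) - 2 * I(A, B), in nats.

    0 means identical partitions; larger values mean the two
    clusterings are further apart.
    """
    h_a = entropy(np.bincount(labels_a))  # entropy() normalizes counts
    h_b = entropy(np.bincount(labels_b))
    mi = mutual_info_score(labels_a, labels_b)
    return h_a + h_b - 2.0 * mi

truth = np.array([0, 0, 0, 1, 1, 1])
same = np.array([1, 1, 1, 0, 0, 0])  # same partition, relabeled
print(variation_of_information(truth, same))  # 0.0 up to rounding
```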

In \autoref{fig:cvis}, we used `CVIAggregator` first with all CVIs implemented in PyCVI and then with only a subset of them, as could be done in practice when known characteristics of the dataset help identify unadapted CVIs. We see that in both cases, the data was correctly clustered by the clustering method and the best clustering was correctly selected, in spite of clusters of non-convex shapes in the first case and clusters "touching" each other in the second.
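A toy stand-in for this aggregation idea, using only scikit-learn's three CVIs and a majority vote over the candidate numbers of clusters (a sketch of the principle only, not PyCVI's `CVIAggregator` API):

```python
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    silhouette_score,
    calinski_harabasz_score,
    davies_bouldin_score,
)

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)
ks = range(2, 7)
clusterings = {
    k: KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    for k in ks
}

# Each CVI votes for the number of clusters it scores best
# (Davies-Bouldin is negated so "higher is better" holds for all).
cvis = [
    silhouette_score,
    calinski_harabasz_score,
    lambda data, y: -davies_bouldin_score(data, y),
]
votes = Counter(max(ks, key=lambda k: cvi(X, clusterings[k])) for cvi in cvis)
best_k = votes.most_common(1)[0][0]
print(best_k)  # the majority vote recovers k = 3 here
```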
The code of these examples is available on the [GitHub repository](https://github.com/nglm/pycvi) of the package, and its [documentation](https://pycvi.readthedocs.io/en/latest/).