Marcilio C. P. de Souto, Ivan G. Costa, Daniel S. A. de Araujo, Teresa
B. Ludermir, Alexander Schliep
The use of clustering methods for the
discovery of cancer subtypes has drawn a great deal of attention in the
scientific community. While bioinformaticians have
proposed new clustering methods that take advantage of characteristics of
gene expression data, the medical community prefers
“classical” clustering methods. Indeed, so far there has been no
study performing a large-scale evaluation of different clustering
methods in this context.
We present a first large-scale analysis of
seven different clustering methods and four proximity measures for the analysis
of 35 cancer gene expression data sets. Our results showed that the finite
mixture of Gaussians, followed closely by k-means, presented the best
performance in terms of recovering the true structure of the data sets.
These methods also presented, on average, the smallest difference between the
true number of classes in the data sets and the best number of clusters as
indicated by our validation criteria. Furthermore, hierarchical methods, which
have been widely used by the medical community, presented a recovery
performance poorer than that of the other methods evaluated. Moreover, as a
stable basis for the assessment and comparison of different clustering methods
for cancer gene expression data, this study provides a common group of data
sets (benchmark data sets) to be shared among researchers and used for
comparison with new methods.
The filtered datasets are found here. Each dataset link points to a directory
containing two files:
- datasetName_description.htm:
file describing the dataset. This file contains information about the dataset,
such as the abstract of the original paper, a link to the original dataset, and
details about the samples and filtering parameters.
- datasetName_database.txt:
file containing the dataset itself and the labels of the samples and genes. The
first row of the file contains the sample labels used by the
authors of the original work. The second row contains the labels used in the
context of this paper. The first column of the file contains the
original labels of the genes. The dataset is an n x m
matrix, where n is the number of genes and m the number of samples.
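The layout described above can be read with a few lines of Python. This is only a minimal sketch under assumptions the page does not confirm: that fields are tab-separated, that the first cell of each of the two label rows is a corner cell rather than a sample label, and that empty cells mean missing values. The function name parse_dataset and the tiny sample text are invented for illustration and are not taken from any of the benchmark files.

```python
import csv
import io


def parse_dataset(text, delimiter="\t"):
    """Parse text laid out like datasetName_database.txt.

    Assumptions (not confirmed by the source): tab-separated fields, and the
    first cell of each label row is a corner cell, not a sample label.
    """
    rows = [r for r in csv.reader(io.StringIO(text), delimiter=delimiter) if r]
    if len(rows) < 3:
        raise ValueError("expected two label rows and at least one gene row")

    original_labels = rows[0][1:]  # row 1: labels from the original work
    paper_labels = rows[1][1:]     # row 2: labels used in this paper
    m = len(paper_labels)          # number of samples

    genes, matrix = [], []
    for row in rows[2:]:
        values = row[1:]
        if len(values) != m:
            raise ValueError(f"gene {row[0]!r}: expected {m} values, got {len(values)}")
        genes.append(row[0])       # column 1: original gene label
        # Empty cells are treated as missing values (an assumption).
        matrix.append([float(v) if v.strip() else float("nan") for v in values])

    return {
        "original_labels": original_labels,
        "paper_labels": paper_labels,
        "genes": genes,
        "matrix": matrix,          # n rows (genes) x m columns (samples)
    }


# Made-up toy input, only to show the expected shape of the file.
sample = (
    "ID\tS1\tS2\tS3\n"
    "ID\tA\tA\tB\n"
    "g1\t1.0\t2.0\t3.0\n"
    "g2\t4.0\t5.0\t6.0\n"
)
data = parse_dataset(sample)
print(len(data["genes"]), "genes x", len(data["paper_labels"]), "samples")
```

For a real file, the text would come from something like open(path).read(). If the files turn out to use a different delimiter, it can be passed through the delimiter argument.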
Tables with the results for each data set, by
algorithm and number of clusters, are found here.