
Supplementary Material: Clustering Cancer Gene Expression Data: a Comparative Study

Marcilio C. P. de Souto, Ivan G. Costa, Daniel S. A. de Araujo, Teresa B. Ludermir, Alexander Schliep

The use of clustering methods to discover cancer subtypes has drawn a great deal of attention in the scientific community. While bioinformaticians have proposed new clustering methods that take advantage of characteristics of gene expression data, the medical community prefers “classical” clustering methods. Indeed, so far, no study has performed a large-scale evaluation of different clustering methods in this context.

We present a first large-scale analysis of seven different clustering methods and four proximity measures applied to 35 cancer gene expression data sets. Our results show that the finite mixture of Gaussians, followed closely by k-means, performed best at recovering the true structure of the data sets. These methods also showed, on average, the smallest difference between the true number of classes in the data sets and the best number of clusters as indicated by our validation criteria. Furthermore, hierarchical methods, which have been widely used by the medical community, recovered the true structure less well than the other methods evaluated. Moreover, as a stable basis for assessing and comparing clustering methods for cancer gene expression data, this study provides a common group of data sets (benchmark data sets) to be shared among researchers and used for comparison with new methods.

Data Sets

The filtered datasets can be found here. Each dataset link leads to a directory containing two files:

- datasetName_description.htm: a file describing the dataset. It contains information such as the abstract of the original paper, a link to the original dataset, details about the samples, and the filtering parameters.

- datasetName_database.txt: a file containing the dataset itself together with the labels of the samples and genes. The first row contains the sample labels used by the authors of the original work. The second row contains the sample labels used in this paper. The first column contains the original gene labels. The dataset is an n x m matrix, where n is the number of genes and m is the number of samples.
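The layout described above can be read with a few lines of Python. This is only a minimal sketch, not a loader distributed with the data sets: the function name load_expression_table is hypothetical, and the tab delimiter, the presence of a corner cell at the start of each of the two label rows, and the handling of empty cells as missing values are all assumptions that should be checked against the actual files.

```python
import csv
import io


def load_expression_table(handle, delimiter="\t"):
    """Parse a table laid out like datasetName_database.txt.

    Assumed layout (delimiter and corner cells are assumptions):
      row 1:   corner cell, then the original sample labels
      row 2:   corner cell, then the sample labels used in the paper
      rows 3+: gene label, then one expression value per sample
    """
    # Skip completely blank lines, which csv.reader yields as empty lists.
    rows = [row for row in csv.reader(handle, delimiter=delimiter) if row]
    original_labels = rows[0][1:]
    paper_labels = rows[1][1:]
    gene_labels = []
    values = []
    for row in rows[2:]:
        gene_labels.append(row[0])
        # Assumption: an empty cell marks a missing value and becomes NaN.
        values.append(
            [float(cell) if cell.strip() else float("nan") for cell in row[1:]]
        )
    return original_labels, paper_labels, gene_labels, values


# Made-up example with n = 3 genes and m = 4 samples (not real data).
example = (
    "sample\tS1\tS2\tS3\tS4\n"
    "class\tA\tA\tB\tB\n"
    "gene1\t1.0\t2.0\t3.0\t4.0\n"
    "gene2\t0.5\t0.4\t2.5\t2.6\n"
    "gene3\t7.0\t7.1\t1.0\t1.2\n"
)
original, paper, genes, matrix = load_expression_table(io.StringIO(example))
print(len(genes), "genes x", len(paper), "samples")
```

Because genes are rows and samples are columns, clustering the samples requires transposing the matrix first, for example with list(zip(*matrix)).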

Results

Tables with the results for each data set, by algorithm and number of clusters, can be found here.