miRNA: The discriminant power of RNA features for pre-miRNA recognition
Computational discovery of microRNAs (miRNA) is based on pre-determined sets of features from miRNA precursors (pre-miRNA). The feature sets used by current tools for pre-miRNA recognition differ in construction and dimension. Some feature sets are composed of sequence-structure patterns commonly found in pre-miRNAs, while others combine more sophisticated RNA features. Current tools achieve similar predictive performance even though the feature sets used - and their computational cost - differ widely. In this work, we analyze the discriminant power of seven feature sets, which are used in six pre-miRNA prediction tools. The analysis is based on the classification performance achieved with these feature sets for the training algorithms used in these tools. We also evaluate feature discrimination through the F-score and through feature importance in the induction of random forests. More diverse feature sets produce classifiers with significantly higher classification performance than feature sets composed only of sequence-structure patterns. However, only small or non-significant differences were found among the estimated classification performances of classifiers induced from the more diverse feature sets, despite the wide differences in their dimension. Based on these results, we applied a feature selection method to reduce the computational cost of computing the feature set while maintaining discriminant power. We obtained a lower-dimensional feature set, which achieved a sensitivity of 90% and a specificity of 95%. Our feature set achieves a sensitivity and specificity within 0.1% of the maximal values obtained with any feature set (SELECT, Section 2) while it is 34 times faster to compute. Even compared to another feature set (FS2, see Section 2), which is the computationally least expensive feature set of those from the literature which perform within 0.1% of the maximal values, it is 34 times faster to compute.
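As an illustration of the measures named above, the following is a minimal sketch, assuming that the F-score refers to the common Fisher-style per-feature score (squared distances of the class means from the overall mean, divided by the sum of the within-class sample variances). The function names and the toy values are hypothetical and are not taken from the study.

```python
from statistics import mean, variance


def f_score(pos_values, neg_values):
    """Fisher-style F-score of one feature (assumed definition, see lead-in).

    pos_values: the feature's values over positive examples (pre-miRNAs).
    neg_values: the feature's values over negative examples.
    Each class needs at least two values and a non-zero total variance.
    """
    overall_mean = mean(list(pos_values) + list(neg_values))
    # Between-class term: squared distance of each class mean from the overall mean.
    between = ((mean(pos_values) - overall_mean) ** 2
               + (mean(neg_values) - overall_mean) ** 2)
    # Within-class term: sum of the two sample variances (n - 1 denominator).
    within = variance(pos_values) + variance(neg_values)
    return between / within


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, with 1 = pre-miRNA."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy values for a single feature, not data from the study.
toy_pos = [1.0, 2.0, 3.0]
toy_neg = [4.0, 5.0, 6.0]
print(f_score(toy_pos, toy_neg))
print(sensitivity_specificity([1, 1, 0, 0], [1, 0, 0, 0]))
```

A larger score indicates that the feature separates the two classes better when considered on its own. The score ignores interactions between features, which is one reason to complement it with a model-based measure such as random forest feature importance.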
Results obtained with the six reference tools in our experiments showed that five of them have lower sensitivity or specificity. In miRNA discovery the number of putative miRNA loci is in the order of millions. Analysis of putative pre-miRNAs using a computationally expensive feature set would be wasteful or even infeasible for large genomes. Compromising on the false positive rate, or on accuracy, by as little as 5% is not an option, as this would lead to hundreds of thousands of additional pre-miRNA candidates for verification. Consequently, to make the analysis of putative miRNAs using ab-initio tools computationally feasible, new tools with low computational cost and high predictive performance are needed. In this work, we propose a relatively inexpensive feature set and explore most of the learning aspects implemented in current ab-initio pre-miRNA prediction tools, which may lead to the development of efficient ab-initio pre-miRNA discovery tools.
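The scale argument above can be made concrete with a short calculation. This is a sketch only: the candidate count of five million is a hypothetical value chosen to match "in the order of millions", and the function name is illustrative.

```python
def expected_false_positives(n_negative_candidates, specificity):
    """Expected number of non-miRNA loci wrongly reported as pre-miRNAs."""
    false_positive_rate = 1.0 - specificity
    return round(n_negative_candidates * false_positive_rate)


# Hypothetical number of putative loci that are not true pre-miRNAs.
hypothetical_loci = 5_000_000

for spec in (0.95, 0.99):
    count = expected_false_positives(hypothetical_loci, spec)
    print(f"specificity {spec:.2f}: about {count} false candidates")
```

Under these assumed numbers, a 5% false positive rate alone yields a quarter of a million candidates for verification, which illustrates why small losses in specificity matter at genome scale.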
The material to reproduce the main results can be downloaded from here.
Publications
Lopes et al. Automatic learning of pre-miRNAs from different species. Technical report, Jul 2015. arXiv:1508.00412.
Lopes et al. The discriminant power of RNA features for pre-miRNA recognition. BMC Bioinformatics 2014, 15:124.
Lopes et al. The discriminant power of RNA features for pre-miRNA recognition. Technical report, Oct 2013. arXiv:1312.5778.