Computer Science and Engineering
 Gothenburg University | Chalmers



Completed Projects

GeneFinder: Ab initio gene finding

Ab initio gene finding, or gene prediction, remains an open problem. Current gene finders work well, but predictions of eukaryotic genes in particular leave much room for improvement. Our gene finder, based on Anders Krogh's HMMGene, serves as a test bed for ideas to improve these predictions. One idea is to incorporate external data, such as gene expression, to support a prediction that is otherwise driven by signals and sequence composition. The gene finder is implemented in Python using the Python bindings of our own free (LGPL) HMM library, the General Hidden Markov Model library (GHMM).
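
As an illustration of the kind of decoding an HMM-based gene finder relies on, the following is a minimal sketch of log-space Viterbi decoding in plain Python. It does not use GHMM or reproduce HMMGene; the two states ("N" for non-coding, "C" for coding) and all probabilities are hypothetical toy values chosen only to make the example readable.

```python
import math


def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return the most probable state path for obs (log-space Viterbi)."""
    # best[t][s]: log-probability of the best path that ends in state s at position t
    best = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = []  # back[t][s]: predecessor of s on the best path into position t + 1
    for symbol in obs[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][symbol]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


def logs(probs):
    """Convert a dict of probabilities to log-probabilities."""
    return {k: math.log(v) for k, v in probs.items()}


# Hypothetical toy model: N = non-coding (AT-rich), C = coding (GC-rich).
STATES = ("N", "C")
LOG_START = logs({"N": 0.5, "C": 0.5})
LOG_TRANS = {"N": logs({"N": 0.9, "C": 0.1}), "C": logs({"N": 0.1, "C": 0.9})}
LOG_EMIT = {
    "N": logs({"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}),
    "C": logs({"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}),
}

path = viterbi("ATATGCGCGC", STATES, LOG_START, LOG_TRANS, LOG_EMIT)
print("".join(path))
```

A real gene finder uses many more states (exons, introns, splice signals) and trained parameters; the sketch only shows how a state labeling is recovered from a sequence.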

ComplexDiseases: Analysis of complex disease data

Finding the genetic causes of complex diseases such as autism and ADHD is complicated by ambiguities and subjectivity in the diagnostic process and by the simultaneous involvement of multiple genes and environmental factors. We investigate the application of mixture-model-based clustering to fused genotype and phenotype data. This joint analysis might yield further insight into the complex interactions between genotypes and phenotypes that underlie a specific disease pattern.

GenExpTimecourses: Analysis of gene expression time-courses

The molecular processes of life are dynamic over time. Microarray experiments measuring the expression levels of a multitude of genes over time are one way of gaining insight into these dynamic processes. As a first analysis, groups of similar expression patterns are routinely identified. We have developed an approach that allows the use of prior knowledge, is flexible, and is very robust to noise. The method is implemented in the software GQL, which allows control of the analysis process through graphical user interfaces. Currently, we are extending our framework to allow integration of further data related to transcription or protein interactions. Furthermore, we are investigating methodologies for validating clusterings of genes with functional annotation.

ArrayCGH: Analyzing comparative genomic hybridization data

Detecting chromosomal aberrations from ArrayCGH and gene expression experimental data. Chromosomal aberrations such as deletions or duplications of chromosomal regions are a crucial contributing factor to cancer. The aberrations can be detected by observing the relative hybridization intensities of healthy vs. diseased patients for BAC clones covering complete genomes. A Hidden Markov Model with an inhomogeneous Markov chain makes it possible to reflect dependencies between overlapping clones.

SCG: Bioinformatics for Single-Cell Genomics

Genome assembly is one of the fundamental problems in bioinformatics. Assembly can be either reference-guided, when we have a reference genome that is similar to the genome we want to assemble, or de novo, when the genome is reconstructed only from the reads produced by sequencing machines. With sequencing getting cheaper by the day, researchers are interested in assembling the genomes of more and more organisms. The main bottleneck is the lack of reliable de novo assembly tools for Next Generation Sequencing data (the cheaper but shorter reads). We wish to investigate various aspects of the de novo assembly problem, such as read filtering and correction, contig building, and scaffolding.
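
To make the read filtering step concrete, here is a minimal sketch of one common heuristic: discarding reads that contain rare k-mers, which are likely to stem from sequencing errors. This is an assumed illustration, not the project's method; the function names and the thresholds `k` and `min_count` are hypothetical.

```python
from collections import Counter


def kmers(read, k):
    """All overlapping substrings of length k in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def filter_reads(reads, k=4, min_count=2):
    """Keep only reads whose k-mers all occur at least min_count times overall."""
    counts = Counter(kmer for read in reads for kmer in kmers(read, k))
    return [
        read for read in reads
        if all(counts[kmer] >= min_count for kmer in kmers(read, k))
    ]


# Toy data: the last read carries a single-base error (ACGTTC instead of ACGTAC).
reads = ["ACGTAC", "ACGTAC", "CGTACG", "CGTACG", "ACGTTC"]
kept = filter_reads(reads)
print(kept)
```

Reads shorter than `k` contain no k-mers and therefore pass the filter unchanged. With real data, low-coverage regions also produce rare k-mers, so a threshold like this trades lost reads against removed errors.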

CSIMixtures: Context-specific independence mixture modeling for sequence motifs

The modeling and analysis of sequence motifs is a central task in the elucidation of biological processes such as gene regulation. The choice of model class is crucial for obtaining a representation of the motif that suits the biological application. For instance, previous studies showed that for transcription factors which bind to divergent binding sites, mixtures of multiple PWMs increase performance. However, estimating a conventional mixture distribution for each position will in many cases cause overfitting. We avoid this problem by employing a context-specific independence (CSI) framework. In CSI mixtures, model complexity is automatically adapted to match the variability found in a given data set.
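
For readers unfamiliar with PWM mixtures, the following minimal sketch scores a binding site under a conventional mixture of two PWMs. It illustrates the baseline model only, not the CSI framework itself; the PWMs, mixture weights, and function names are hypothetical toy values.

```python
import math


def pwm_prob(site, pwm):
    """Probability of a site under one PWM (positions are treated as independent)."""
    prob = 1.0
    for pos, base in enumerate(site):
        prob *= pwm[pos][base]
    return prob


def mixture_log_likelihood(site, weights, pwms):
    """log P(site) = log sum_k weight_k * P(site | PWM_k)."""
    return math.log(sum(w * pwm_prob(site, m) for w, m in zip(weights, pwms)))


def responsibilities(site, weights, pwms):
    """Posterior probability of each mixture component given the site."""
    joint = [w * pwm_prob(site, m) for w, m in zip(weights, pwms)]
    total = sum(joint)
    return [j / total for j in joint]


def column(preferred):
    """Toy column: 0.7 for the preferred base, 0.1 for each of the others."""
    return {base: (0.7 if base == preferred else 0.1) for base in "ACGT"}


# Two hypothetical components that differ only at position 0 (A vs. T).
PWM_A = [column("A"), column("C"), column("G")]
PWM_T = [column("T"), column("C"), column("G")]
WEIGHTS = [0.5, 0.5]

post = responsibilities("ACG", WEIGHTS, [PWM_A, PWM_T])
print(post, mixture_log_likelihood("ACG", WEIGHTS, [PWM_A, PWM_T]))
```

In this toy model the two components carry identical distributions at positions 1 and 2, yet a conventional mixture estimates them separately for each component, which is the kind of redundancy that can lead to overfitting on small data sets.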

ProteinComplexes: Delineation of protein complexes in yeast

The delineation of protein complexes from protein-protein interaction data is not as trivial as it may seem. We developed a simple probabilistic framework to cluster purifications while preserving the partial order relation among purifications. With a simple graph-based approach motivated by the asymmetric relationship between purifications, we can visualize overlapping components of protein complexes as supported by the experiment.

Tiling: Design of Tiling Arrays

Genomic tiling arrays are universal arrays in the sense that they cover complete genomes or chromosomes uniformly, in contrast to most other types of DNA microarrays, for which specific sites of interest such as genes or splice sites are defined a priori. We define the problem of choosing optimal oligonucleotide probes from large candidate sets and provide efficient algorithms for solving it, which run in linear time in most instances.

MicrorarrayDetection: Detecting biological agents with DNA Microarrays

DNA microarrays, well known for measuring gene expression levels, can be used to detect the presence or absence of biological targets (viruses or bacteria) from the hybridization patterns of oligonucleotide probes and the genomic DNA of agents. Due to the sequence similarity of possible targets, the use of non-unique oligonucleotides becomes necessary. With the use of statistical group testing and phylogenetic information about targets, even the detection of novel targets becomes viable.
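
The idea of decoding with non-unique probes can be sketched with a deliberately simplified, noise-free form of group testing: any target that should hybridize to a probe that stayed negative is ruled out. The project uses statistical group testing, which additionally handles measurement errors; the probe names, target names, and function below are hypothetical.

```python
def candidate_targets(probe_targets, positive_probes):
    """Return targets consistent with the observed positive probes.

    probe_targets maps each probe to the set of targets it hybridizes to.
    Assumes error-free measurements: a negative probe excludes all of its targets.
    """
    all_targets = set().union(*probe_targets.values())
    excluded = set()
    for probe, targets in probe_targets.items():
        if probe not in positive_probes:
            excluded |= targets
    return all_targets - excluded


# Hypothetical design: three of the four probes are non-unique.
PROBE_TARGETS = {
    "p1": {"virusA", "virusB"},
    "p2": {"virusB", "virusC"},
    "p3": {"virusA", "virusC"},
    "p4": {"virusC"},
}

present = candidate_targets(PROBE_TARGETS, positive_probes={"p1", "p2"})
print(present)
```

Here no single probe identifies virusB uniquely, but the combination of positive and negative probes leaves it as the only consistent candidate.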

HomologyClassification: Detecting remote homologs as a classification problem

Detecting whether two proteins are homologs is one of the fundamental problems in bioinformatics. Classically, their sequence similarity is measured with a sequence alignment score and a decision about homology is made using score statistics. How well one can solve this classification problem is strongly influenced by the assumptions necessary for the statistics to hold. We use an approach based on Support Vector Machines to address this problem.

DrosophilaDevelopment: Gene regulation during early Drosophila development

In situ hybridization experiments elucidate the spatial distribution of expressed mRNA in organisms. For Drosophila in particular, large amounts of data are available for several developmental stages, complementing DNA microarray gene expression experiments. We have developed an image processing pipeline and a framework for joint analysis, which makes it possible to detect co-located, co-expressed genes from fused data sets.

RemoteHomologues: Identifying clusters of remote homologues

Detecting proteins which share a common ancestor is an important step in understanding protein structure and function. Multi-domain proteins normally cause problems due to spurious similarities they induce; with a simple graph-based approach based on the concept of asymmetric similarity we were able to clearly outperform PSI-Blast.

Tuberculosis: Image processing and systems biology of macrophage infection

Tuberculosis is one of the most widespread diseases in the world, with about a third of the world's population infected. While most infections are asymptomatic, latent tuberculosis can progress into an acute and life-threatening condition. Most infections occur in third-world countries, where medical practice often remains below standard due to challenging circumstances and where a high prevalence of AIDS leads to more active TB. The prolonged misuse of antibiotics has led to multiresistant strains, and several first-line and second-line antibiotics have been found to be ineffective. This project aims at modeling the macrophage infection mechanism using high-throughput experimentation and at developing novel algorithms for the associated computational challenges, in order to gain a systems-level understanding of the infection process, which might facilitate new hypotheses about potential drug targets.

MASCAAT: Meta-Learning for Selection and Combination of Clustering Algorithms Applied to Gene Expression Analysis

Whether to cluster at all, which clustering method to use, and how many clusters to choose are pressing questions in bioinformatics. Mostly, decisions are made by users of clustering software based on experience, guided by benchmarking or by indicators of the reliability of solutions or of model fit. However, as clustering algorithms always produce solutions, inappropriate methods or parameters are often used and invalid results produced. Meta-learning refers to the application of machine learning techniques to choosing methods and guiding the setting of parameters. We intend to build a computational framework to perform cluster validation and apply meta-learning to the problem of analyzing gene expression time-courses. More information at the Project Page. Joint work funded by CAPES (Brazil) and DAAD (Germany) under the program Probral.

MouseAtlas: Recognition, analysis and visualization of gene expression patterns in optical tomograms (OPT) of embryonal mice

We work in collaboration with Ralf Spörle from the Department of Developmental Genetics, Christian Hege, head of the Visualization Department at the Konrad-Zuse Zentrum (ZIB), and Bernd Fischer, Professor at the University of Lübeck, on the construction of an atlas of gene expression patterns in embryonal mice. The central piece is the construction of a non-linear registration that maps numerous in situ tomograms onto an annotated standard model. This mapping then yields an automatic anatomical annotation of high-resolution 3D spatial expression patterns as well as the fusion of all patterns into one standard model. The mapped expression patterns can then be viewed and analyzed together within the standard model. Analysis of the data involves statistical group testing for functional territories.

miRNA: The discriminant power of RNA features for pre-miRNA recognition

Computational discovery of microRNAs (miRNA) is based on pre-determined sets of features from miRNA precursors (pre-miRNA). The feature sets used by current tools for pre-miRNA recognition differ in construction and dimension. Some feature sets are composed of sequence-structure patterns commonly found in pre-miRNAs, while others are a combination of more sophisticated RNA features. Current tools achieve similar predictive performance even though the feature sets used, and their computational cost, differ widely. In this work, we analyze the discriminant power of seven feature sets, which are used in six pre-miRNA prediction tools. The analysis is based on the classification performance achieved with these feature sets for the training algorithms used in these tools. We also evaluate feature discrimination through the F-score and feature importance in the induction of random forests.
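
The text does not define the F-score. Assuming it refers to the Fisher-style score commonly used for ranking features (in the form popularized by Chen and Lin), and not the F-measure used for evaluating classifiers, a minimal sketch for a single feature looks as follows. The function name and the toy values are hypothetical.

```python
def f_score(pos, neg):
    """Fisher-style F-score of one feature.

    pos, neg: values of the feature in the positive and negative class.
    Higher scores mean the class means are far apart relative to the
    within-class sample variances. Each class needs at least two values.
    """
    n_pos, n_neg = len(pos), len(neg)
    mean_pos = sum(pos) / n_pos
    mean_neg = sum(neg) / n_neg
    mean_all = (sum(pos) + sum(neg)) / (n_pos + n_neg)
    # Between-class separation: squared distance of class means from the overall mean.
    between = (mean_pos - mean_all) ** 2 + (mean_neg - mean_all) ** 2
    # Within-class spread: sum of the two sample variances.
    within = (
        sum((x - mean_pos) ** 2 for x in pos) / (n_pos - 1)
        + sum((x - mean_neg) ** 2 for x in neg) / (n_neg - 1)
    )
    return between / within


# Toy feature values: the first feature separates the classes, the second does not.
informative = f_score([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
uninformative = f_score([1.0, 5.0, 3.0], [2.0, 4.0, 3.0])
print(informative, uninformative)
```

The score treats every feature independently, so it cannot capture features that are only discriminative in combination; this is one reason to complement it with a multivariate measure such as random forest feature importance.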

CellDiff: Understanding transcriptional regulation in cell differentiation

The regulatory processes that govern cell proliferation and differentiation are central to developmental biology. Particularly well studied in this respect is the hematopoietic system. Gene expression data of cells of various distinguishable developmental stages fosters the elucidation of the underlying molecular processes, which change gradually over time and lock cells in certain lineages. We developed a statistical framework for tasks ranging from visualization, querying, and finding clusters of similar genes, to answering detailed questions about the functional roles of individual genes and their similarities and differences.