Development of machine learning methods for the discrimination between coding and non-coding conserved sequences
Doctoral Thesis
Publication date:
2007
Citation:
Development of machine learning methods for the discrimination between coding and non-coding conserved sequences / M. Re' ; supervisor: Carmela Gissi ; coordinator: Giuliana Zanetti. DIPARTIMENTO DI SCIENZE BIOMOLECOLARI E BIOTECNOLOGIE, 2007. 20th cycle, Academic Year 2006/2007.
Abstract:
In the last ten years, numerous complete and nearly complete genome sequences have been made available to the research community, but completing the inventory of coding genes, at least for eukaryotic genomes, has proved an elusive goal. Classical ab initio gene prediction methods have been invaluable in the annotation of genome sequences but show notable weaknesses with respect to genes with unusual structural features, while annotation on the basis of similarity to known genes does not allow the detection of genuinely novel genes.
The identification of sequences under evolutionary constraint by means of comparison of genome sequences is a powerful technique for inferring the locations of functional elements in a genome. As whole-genome sequencing efforts extend beyond traditional model organisms to include a wide diversity of species, comparative genomics analyses will be further empowered to reveal insights into genomes and their evolution. The discovery and annotation of functional genomic elements is a necessary step toward a detailed understanding of genome biology, and sequence comparisons have been demonstrated to be an integral tool for this task.
In recent years, an ever-increasing body of evidence has suggested that, despite initial assumptions, a large proportion of the sequences conserved between related genomes do not represent coding regions. Other experiments also demonstrate that the classical view that long stretches of conserved genomic sequence are predominantly protein-coding regions has to be revised, owing to the presence of long conserved non-coding functional elements such as modular clusters of well-conserved transcription factor binding sites. Thus, the discrimination between conserved coding and non-coding sequences is an important objective for comparative genomics.
Single statistics, such as the synonymous versus non-synonymous substitution ratio, can be used in isolation to establish whether conserved regions are likely to be protein-coding or non-coding. However, such approaches tend to exhibit either low sensitivity (with high specificity) or high sensitivity (at the cost of low specificity). This may be in part because the extent and nature of the selective pressures acting on protein-coding sequences during evolution are inhomogeneous not only between different organisms but also between genes belonging to the same organism.
More than six years after the completion of the human genome sequence, our inability to correctly classify as coding or non-coding the entire set of sequences conserved between human and many other organisms (ranging from closely related mammalian species such as mouse, rat and dog to fish and birds) clearly indicates a lack of understanding of the mechanisms underlying the molecular evolution of many classes of genomic elements. This is particularly true for non-coding functional elements, likely because they play a more diverse range of functional roles (and thus evolve under more diverse and complex constraints) than initially appreciated. Given this lack of knowledge of molecular evolution, the development of reliable methods for the discrimination between conserved coding and non-coding sequences is complicated by the absence of tests designed to detect the evolutionary dynamics associated with conserved non-coding regions.
One possible solution, in the absence of novel insights into the evolution of non-coding sequences, is a more general and effective use of the well-known evolutionary patterns characterizing protein-coding sequences, in combination with methods based on the concept of ‘learning by induction’ which underlies tools such as Neural Networks and Support Vector Machines (SVM), classifiers able to ‘learn’ to discriminate between instances belonging to two or
IRIS type:
13 - Doctoral thesis defended by October 2010
Author list:
M. Re'