More is better, different is helpful : MitoZoa database improvements and the usage of mitochondrial gene order diversity as reannotation criterion
Poster
Publication date:
2011
Citation:
More is better, different is helpful : MitoZoa database improvements and the usage of mitochondrial gene order diversity as reannotation criterion / R. Lupi, P. D’Onorio De Meo, M. D’Antonio, G. Pavesi, F. Griggio, G. Pesole, T. Castrignanò, C. Gissi - In: BITS 2011, VIII Annual Meeting of the Bioinformatics Italian Society / [edited by] F. Geraci, R. Marangoni, M. Pellegrini, M.E. Renda. - Pisa : Edizioni ETS, 2011 Jun 20. - ISBN 978-884673069-5. - pp. 117-118 (Contribution presented at the 8th BITS conference : Annual Meeting of the Bioinformatics Italian Society, held in Pisa in 2011).
Abstract:
Motivation
MitoZoa (MZ) is a specialized database that collects complete and nearly complete mitochondrial genomes (mtDNA) of Metazoa and focuses on correcting the numerous annotation inaccuracies affecting mt entries. These inaccuracies can prevent reliable analyses of some under-investigated mt features, such as gene order (GO) and non-coding regions (NCR), and can hinder the correct retrieval of gene sequences such as those of tRNAs and rRNAs. MZ is coupled to a reannotation pipeline that identifies and corrects entry errors, and to an automatic protocol that standardizes gene and NCR names, since standardization is a prerequisite for fast and easy retrieval of data and sequences on GO, NCR, congeneric species, and other curated mt features. MZ can be queried at www.caspur.it/mitozoa through a user-friendly interface and, since its publication (Lupi et al. 2010), it has been accessed more than 6,000 times in 14 months.
To guarantee regular updates and improve the quality of the stored data, we have refined or added several steps in the reannotation pipeline embedded in MZ. Among the novelties, we have evaluated GO comparisons both to identify annotation errors, especially in congeneric species, and to build a database of “reference GO” that can help investigate the dynamics and evolution of GO at both short and long phylogenetic distances.
Methods
The MZ reannotation pipeline v.2.0 consists of several Python scripts: 14 scripts for the validation/rectification of entry annotation elements and for the MZ update, and 5 scripts for MZ management, statistics calculation, NCR annotation and GO string generation/manipulation.
The bi-monthly MZ update adds newly published mt entries, selected automatically through a specific query to EMBL, and, as a novelty, checks for possible changes in the RefSeq/EMBL entries corresponding to pre-existing MZ entries, especially in the sequence, feature table and organism classification fields.
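The change check on pre-existing entries could be sketched roughly as follows. This is only an illustration under stated assumptions, not the MZ scripts themselves: the `EntrySnapshot` class, its field layout and the `changed_fields` helper are hypothetical names, and the two snapshots are toy data.

```python
from dataclasses import dataclass

# Hypothetical minimal view of a RefSeq/EMBL entry; this is NOT the MZ schema.
@dataclass(frozen=True)
class EntrySnapshot:
    accession: str
    sequence: str
    feature_table: tuple   # assumed layout: (gene name, start, end, strand)
    classification: tuple  # taxonomic lineage, from root to leaf

# The three fields the abstract says are monitored for changes.
CHECKED_FIELDS = ("sequence", "feature_table", "classification")

def changed_fields(stored, current):
    """Return the names of the checked fields that differ between snapshots."""
    if stored.accession != current.accession:
        raise ValueError("snapshots describe different entries")
    return [name for name in CHECKED_FIELDS
            if getattr(stored, name) != getattr(current, name)]

# Toy example: only the feature table was revised upstream.
old = EntrySnapshot("ACC0001", "ACGT", (("cox1", 1, 4, "+"),), ("Metazoa", "Chordata"))
new = EntrySnapshot("ACC0001", "ACGT", (("cox1", 1, 3, "+"),), ("Metazoa", "Chordata"))
print(changed_fields(old, new))
```

An entry whose list of changed fields is non-empty would then be a candidate for a new pass through the reannotation steps.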
Other novelties of the MZ reannotation pipeline v.2.0 are: 1) the identification and standardized annotation of mt pseudogenes and of editing/translational frameshift events in protein genes; 2) the creation of “simulation files” (sim-files), i.e. tab-delimited files with a coherent structure and standard syntax, in which all identified errors and the related reannotation events are recorded before the entries are actually modified. A sim-file can also be easily edited by hand, so it allows the most troublesome annotations to be modified manually and keeps the whole data flow under control; 3) the introduction of GO comparisons for reannotation purposes: all entries of congeneric species, belonging to both the update dataset and the current MZ release, are inspected for differences in gene order/content, and inconsistencies are verified against the literature if necessary. Finally, the full set of GOs has been clustered on the basis of string identity: the resulting “non-redundant” dataset makes it possible to measure the diversity of GO among and within the major metazoan lineages. The implementation of a GO similarity search method is currently in progress.
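A minimal sketch of the two GO steps described above, clustering by string identity and flagging congeneric differences, might look like this. It assumes that each GO is already encoded as a normalized string (same start gene and strand convention), which the abstract does not specify; the function names, the space-separated GO format and the entries are hypothetical toy data.

```python
from collections import defaultdict

def cluster_by_go(entries):
    """Group entry IDs by exact gene-order string (string identity)."""
    clusters = defaultdict(list)
    for entry_id, _species, go in entries:
        clusters[go].append(entry_id)
    return dict(clusters)

def congeneric_conflicts(entries):
    """Return the genera whose entries carry more than one distinct GO string."""
    by_genus = defaultdict(set)
    for _entry_id, species, go in entries:
        genus = species.split()[0]  # assumes "Genus species" binomial names
        by_genus[genus].add(go)
    return {genus: sorted(gos) for genus, gos in by_genus.items() if len(gos) > 1}

# Toy entries: (entry ID, species name, GO string). Not real MZ data.
entries = [
    ("E1", "Genusa alpha", "cox1 cox2 nad1"),
    ("E2", "Genusa beta", "cox1 nad1 cox2"),
    ("E3", "Genusb gamma", "cox1 cox2 nad1"),
]

print(len(cluster_by_go(entries)))    # number of non-redundant GOs
print(congeneric_conflicts(entries))  # genera to verify against the literature
```

Exact string matching only detects identical arrangements; ranking GOs by how similar they are would need a separate measure, which the abstract reports as still in progress.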
Results
MitoZoa has been updated six times since its construction; the current Rel.7 (Jan 2011) contains 2,259 complete and 374 partial mtDNAs, with a total of 755 new entries compared to Rel.1 (Jul 2009). Overall, 62% of the Rel.7 entries have been reannotated, and 76% of all reannotation events involve tRNAs.
Thanks to the integration of the GO comparison procedure into the reannotation pipeline, we have identified several cases of GO differences between congeneric species. Most of them have been confirmed by literature check and
IRIS type:
03 - Contribution in volume
Author list:
R. Lupi, P. D’Onorio De Meo, M. D’Antonio, G. Pavesi, F. Griggio, G. Pesole, T. Castrignanò, C. Gissi
Book title:
BITS 2011, VIII Annual Meeting of the Bioinformatics Italian Society