Results 1 - 20 of 47
1.
BMC Bioinformatics ; 22(1): 561, 2021 Nov 23.
Article in English | MEDLINE | ID: mdl-34814826

ABSTRACT

BACKGROUND: Ab initio prediction of splice sites is an essential step in eukaryotic genome annotation. Recent predictors have exploited Deep Learning algorithms and reliable gene structures from model organisms. However, Deep Learning methods for non-model organisms are lacking. RESULTS: We developed Spliceator to predict splice sites in a wide range of species, including model and non-model organisms. Spliceator uses a convolutional neural network and is trained on carefully validated data from over 100 organisms. We show that Spliceator achieves consistently high accuracy (89-92%) compared to existing methods on independent benchmarks from human, fish, fly, worm, plant and protist organisms. CONCLUSIONS: Spliceator is a new Deep Learning method trained on high-quality data, which can be used to predict splice sites in diverse organisms, ranging from human to protists, with consistently high accuracy.
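
To illustrate the kind of input a convolutional network for splice-site prediction typically consumes, the sketch below one-hot encodes a nucleotide window. The abstract does not describe Spliceator's actual encoding or window size, so the A/C/G/T channel order, the all-zero treatment of ambiguous bases, and the helper name `one_hot` are assumptions made for illustration only.

```python
# Hypothetical helper: one-hot encoding of a DNA window.
# Channel order A, C, G, T is an assumption, not taken from the abstract.
CHANNELS = "ACGT"


def one_hot(seq):
    """Return one row per base; ambiguous bases (e.g. N) become all zeros."""
    return [[1 if base == channel else 0 for channel in CHANNELS]
            for base in seq.upper()]


# Toy window around a GT dinucleotide (a canonical donor-site motif).
window = "CAGGTAAGT"
matrix = one_hot(window)
```

Each row of `matrix` has four entries, so a window of length n becomes an n x 4 array that a 1D convolution can scan.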


Subjects
Algorithms; Neural Networks, Computer; Animals; Genome; Humans
2.
RNA ; 25(12): 1714-1730, 2019 12.
Article in English | MEDLINE | ID: mdl-31506380

ABSTRACT

The origin of the genetic code remains enigmatic five decades after it was elucidated, although there is growing evidence that the code coevolved progressively with the ribosome. A number of primordial codes were proposed as ancestors of the modern genetic code, including comma-free codes such as the RRY, RNY, or GNC codes (R = G or A, Y = C or T, N = any nucleotide), and the X circular code, an error-correcting code that also allows identification and maintenance of the reading frame. It was demonstrated previously that motifs of the X circular code are significantly enriched in the protein-coding genes of most organisms, from bacteria to eukaryotes. Here, we show that imprints of this code also exist in the ribosomal RNA (rRNA). In a large-scale study involving 133 organisms representative of the three domains of life, we identified 32 universal X motifs that are conserved in the rRNA of >90% of the organisms. Intriguingly, most of the universal X motifs are located in rRNA regions involved in important ribosome functions, notably in the peptidyl transferase center and the decoding center that form the original "proto-ribosome." Building on the existing accretion models for ribosome evolution, we propose that error-correcting circular codes represented an important step in the emergence of the modern genetic code. Thus, circular codes would have allowed the simultaneous coding of amino acids and synchronization of the reading frame in primitive translation systems, prior to the emergence of more sophisticated start codon recognition and translation initiation mechanisms.
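
The degenerate patterns named in the abstract (RRY, RNY, GNC) can be expanded into explicit trinucleotide sets using the definitions given there (R = G or A, Y = C or T, N = any nucleotide). The following is a minimal sketch; the function names `expand` and `matches` are hypothetical, and the code only enumerates the patterns without testing the comma-free or circular property.

```python
from itertools import product

# Symbol classes as defined in the abstract.
CLASSES = {
    "R": "AG", "Y": "CT", "N": "ACGT",
    "A": "A", "C": "C", "G": "G", "T": "T",
}


def expand(pattern):
    """Return the sorted trinucleotides matching a pattern such as 'RNY'."""
    return sorted("".join(p) for p in product(*(CLASSES[s] for s in pattern)))


def matches(codon, pattern):
    """True if every base of the codon belongs to the class at that position."""
    return all(base in CLASSES[sym] for base, sym in zip(codon, pattern))


rny = expand("RNY")  # 2 * 4 * 2 = 16 trinucleotides
rry = expand("RRY")  # 2 * 2 * 2 = 8 trinucleotides
gnc = expand("GNC")  # 1 * 4 * 1 = 4 trinucleotides
```

Under these definitions both RRY and GNC are subsets of RNY, since G is a purine and C is a pyrimidine.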


Subjects
Evolution, Molecular; Genetic Code; Nucleotide Motifs; Protein Biosynthesis; Ribosomes/genetics; Ribosomes/metabolism; Models, Biological; Models, Molecular; Molecular Conformation; Nucleic Acid Conformation; RNA, Ribosomal/chemistry; RNA, Ribosomal/genetics; Ribosomes/chemistry; Structure-Activity Relationship
3.
Nucleic Acids Res ; 47(D1): D411-D418, 2019 01 08.
Article in English | MEDLINE | ID: mdl-30380106

ABSTRACT

OrthoInspector is one of the leading software suites for orthology relations inference. In this paper, we describe a major redesign of the OrthoInspector online resource along with a significant increase in the number of species: 4753 organisms are now covered across the three domains of life, making OrthoInspector the most exhaustive orthology resource to date in terms of covered species (excluding viruses). The new website integrates original data exploration and visualization tools in an ergonomic interface. Distributions of protein orthologs are represented by heatmaps summarizing their evolutionary histories, and proteins with similar profiles can be directly accessed. Two novel tools have been implemented for comparative genomics: a phylogenetic profile search that can be used to find proteins with a specific presence-absence profile and investigate their functions and, inversely, a GO profiling tool aimed at deciphering evolutionary histories of molecular functions, processes or cell components. In addition to the re-designed website, the OrthoInspector resource now provides a REST interface for programmatic access. OrthoInspector 3.0 is available at http://lbgi.fr/orthoinspectorv3.
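
The phylogenetic profile search described above can be pictured as a set query over presence-absence data. The sketch below is not the OrthoInspector implementation or its REST interface; `find_by_profile` and the toy `profiles` dictionary are hypothetical and only show the idea of selecting proteins that are present in some species and absent from others.

```python
def find_by_profile(profiles, present, absent):
    """Return ids of proteins with an ortholog in every species of `present`
    and in no species of `absent`."""
    hits = []
    for protein_id, species in profiles.items():
        if present <= species and not (absent & species):
            hits.append(protein_id)
    return sorted(hits)


# Hypothetical toy data: protein -> set of species in which an ortholog is found.
profiles = {
    "protA": {"human", "mouse", "fly"},
    "protB": {"human", "mouse", "yeast"},
    "protC": {"human", "fly", "yeast"},
}

result = find_by_profile(profiles, present={"human", "mouse"}, absent={"yeast"})
```

Here only `protA` satisfies the query, because `protB` has a yeast ortholog and `protC` lacks a mouse ortholog.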


Subjects
Databases, Genetic; Genomics; Algorithms; Bacteria/genetics; Classification; Eukaryota/genetics; Evolution, Molecular; Forecasting; Gene Ontology; Internet; Phylogeny; Proteome; Sequence Homology, Nucleic Acid; Software; Species Specificity
4.
BMC Bioinformatics ; 21(1): 513, 2020 Nov 10.
Article in English | MEDLINE | ID: mdl-33172385

ABSTRACT

BACKGROUND: Recent advances in sequencing technologies have led to an explosion in the number of genomes available, but accurate genome annotation remains a major challenge. The prediction of protein-coding genes in eukaryotic genomes is especially problematic, due to their complex exon-intron structures. Even the best eukaryotic gene prediction algorithms can make serious errors that will significantly affect subsequent analyses. RESULTS: We first investigated the prevalence of gene prediction errors in a large set of 176,478 proteins from ten primate proteomes available in public databases. Using the well-studied human proteins as a reference, a total of 82,305 potential errors were detected, including 44,001 deletions, 27,289 insertions and 11,015 mismatched segments where part of the correct protein sequence is replaced with an alternative erroneous sequence. We then focused on the mismatched sequence errors that cause particular problems for downstream applications. A detailed characterization allowed us to identify the potential causes for the gene misprediction in approximately half (5446) of these cases. As a proof-of-concept, we also developed a simple method which allowed us to propose improved sequences for 603 primate proteins. CONCLUSIONS: Gene prediction errors in primate proteomes affect up to 50% of the sequences. Major causes of errors include undetermined genome regions, genome sequencing or assembly issues, and limitations in the models used to represent gene exon-intron structures. Nevertheless, existing genome sequences can still be exploited to improve protein sequence quality. Perspectives of the work include the characterization of other types of gene prediction errors, as well as the development of a more comprehensive algorithm for protein sequence error correction.
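
The error counts reported in the abstract can be cross-checked with a few lines of arithmetic. The sketch below uses only the figures quoted above; the variable names are illustrative.

```python
# Figures quoted in the abstract.
errors = {
    "deletions": 44_001,
    "insertions": 27_289,
    "mismatched_segments": 11_015,
}

total = sum(errors.values())  # should equal the reported 82,305

# Share of each error type, in percent of all potential errors.
shares = {kind: round(100 * count / total, 1) for kind, count in errors.items()}

# Mismatched segments for which a potential cause was identified.
explained = 5_446
explained_fraction = explained / errors["mismatched_segments"]
```

The three categories sum to the reported total of 82,305, and 5446 of 11,015 mismatched segments is just under one half, consistent with "approximately half".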


Subjects
Open Reading Frames/genetics; Primates/metabolism; Proteome; Amino Acid Sequence; Animals; Databases, Protein; Gene Deletion; Humans; Mutagenesis, Insertional; Receptor-Like Protein Tyrosine Phosphatases/chemistry; Receptor-Like Protein Tyrosine Phosphatases/genetics; Receptor-Like Protein Tyrosine Phosphatases/metabolism; Sequence Alignment
5.
BMC Genomics ; 21(1): 293, 2020 Apr 09.
Article in English | MEDLINE | ID: mdl-32272892

ABSTRACT

BACKGROUND: The draft genome assemblies produced by new sequencing technologies present important challenges for automatic gene prediction pipelines, leading to less accurate gene models. New benchmark methods are needed to evaluate the accuracy of gene prediction methods in the face of incomplete genome assemblies, low genome coverage and quality, complex gene structures, or a lack of suitable sequences for evidence-based annotations. RESULTS: We describe the construction of a new benchmark, called G3PO (benchmark for Gene and Protein Prediction PrOgrams), designed to represent many of the typical challenges faced by current genome annotation projects. The benchmark is based on a carefully validated and curated set of real eukaryotic genes from 147 phylogenetically disperse organisms, and a number of test sets are defined to evaluate the effects of different features, including genome sequence quality, gene structure complexity, protein length, etc. We used the benchmark to perform an independent comparative analysis of the most widely used ab initio gene prediction programs and identified the main strengths and weaknesses of the programs. More importantly, we highlight a number of features that could be exploited in order to improve the accuracy of current prediction tools. CONCLUSIONS: The experiments showed that ab initio gene structure prediction is a very challenging task, which should be further investigated. We believe that the baseline results associated with the complex gene test sets in G3PO provide useful guidelines for future studies.


Subjects
Computational Biology/methods; Eukaryota/genetics; Molecular Sequence Annotation/methods; Animals; Data Curation; Evolution, Molecular; Humans; Phylogeny
6.
RNA Biol ; 17(4): 571-583, 2020 04.
Article in English | MEDLINE | ID: mdl-31960748

ABSTRACT

Three-base periodicity (TBP), where nucleotides and higher order n-tuples are preferentially spaced by 3, 6, 9, etc. bases, is a well-known intrinsic property of protein-coding DNA sequences. However, its origins are still not fully understood. One hypothesis is that the periodicity reflects a primordial coding system that was used before the emergence of the modern standard genetic code (SGC). Recent evidence suggests that the X circular code, a set of 20 trinucleotides allowing the reading frames in genes to be retrieved locally, represents a possible ancestor of the SGC. Motifs from the X circular code have been found in the reading frame of protein-coding regions in extant organisms from bacteria to eukaryotes, in many transfer RNA (tRNA) genes and in important functional regions of the ribosomal RNA (rRNA), notably in the peptidyl transferase centre and the decoding centre. Here, we have used a powerful correlation function to search for periodicity patterns involving the 20 trinucleotides of the X circular code in a large set of bacterial protein-coding genes, as well as in the translation machinery, including rRNA and tRNA sequences. As might be expected, we found a strong circular code periodicity 0 modulo 3 in the protein-coding genes. More surprisingly, we also identified a similar circular code periodicity in a large region of the 16S rRNA. This region includes the 3' major domain corresponding to the primordial proto-ribosome decoding centre and containing numerous sites that interact with the tRNA and messenger RNA (mRNA) during translation. Furthermore, 3D structural analysis shows that the periodicity region surrounds the mRNA channel that lies between the head and the body of the SSU. Our results support the hypothesis that the X circular code may constitute an ancestral translation code involved in reading frame retrieval and maintenance, traces of which persist in modern mRNA, tRNA and rRNA despite their long evolution and adaptation to the SGC.
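
Three-base periodicity can be made concrete with a very simple statistic: the fraction of positions whose base recurs at a given distance. The sketch below is a simplified stand-in, not the correlation function used in the study (which operates on the 20 trinucleotides of the X circular code); the function names and the toy sequence are hypothetical.

```python
def same_base_rate(seq, d):
    """Fraction of positions i for which seq[i] == seq[i + d]."""
    pairs = len(seq) - d
    if pairs <= 0:
        raise ValueError("distance must be shorter than the sequence")
    return sum(seq[i] == seq[i + d] for i in range(pairs)) / pairs


def periodicity_profile(seq, max_d):
    """Map each distance 1..max_d to its same-base rate."""
    return {d: same_base_rate(seq, d) for d in range(1, max_d + 1)}


# Hypothetical toy sequence: every codon has G at position 1 and C at position 3.
toy = "GACGTCGCCGGCGACGTC"
profile = periodicity_profile(toy, 6)
```

On this toy sequence the rate peaks at distances 3 and 6 and is much lower at the other distances, which is the signature of a period-3 signal.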


Subjects
Bacteria/genetics; Bacterial Proteins/genetics; Computational Biology/methods; Ribosomes/genetics; Algorithms; Bacteria/metabolism; Evolution, Molecular; Genetic Code; Periodicity; RNA, Bacterial/genetics; RNA, Ribosomal/genetics; RNA, Transfer/genetics
7.
Bioinformatics ; 34(19): 3390-3392, 2018 10 01.
Article in English | MEDLINE | ID: mdl-29741582

ABSTRACT

Summary: Comparative studies of protein sequences are widely used in evolutionary and comparative genomics studies, but there is a lack of efficient tools to identify conserved regions ab initio within a protein multiple alignment. PROBE provides a fully automatic analysis of protein family conservation, to identify conserved regions, or 'blocks', that may correspond to structural/functional domains or motifs. Conserved blocks are identified at two different levels: (i) family level blocks indicate sites that are probably of central importance to the protein's structure or function, and (ii) sub-family level blocks highlight regions that may signify functional specialization, such as binding partners, etc. All conserved blocks are mapped onto a phylogenetic tree and can also be visualized in the context of the multiple sequence alignment. PROBE thus facilitates in-depth studies of sequence-structure-function-evolution relationships, and opens the way to block-level phylogenetic profiling. Availability and implementation: Freely available on the web at http://www.lbgi.fr/~julie/probe/web.
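
The notion of a conserved block can be sketched as a run of alignment columns in which one residue dominates. This is not PROBE's algorithm, which the abstract does not detail; the identity threshold, the minimum block length, and the name `conserved_blocks` are assumptions chosen for illustration.

```python
def conserved_blocks(alignment, min_identity=1.0, min_length=3):
    """Return (start, end) column ranges (0-based, end exclusive) where the
    most frequent non-gap residue reaches `min_identity` in every column."""
    ncols = len(alignment[0])
    if any(len(row) != ncols for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    blocks, start = [], None
    # Iterate one column past the end so that a trailing run is closed.
    for col in range(ncols + 1):
        conserved = False
        if col < ncols:
            column = [row[col] for row in alignment]
            top = max(set(column), key=column.count)
            conserved = top != "-" and column.count(top) / len(column) >= min_identity
        if conserved and start is None:
            start = col
        elif not conserved and start is not None:
            if col - start >= min_length:
                blocks.append((start, col))
            start = None
    return blocks


# Hypothetical toy alignment of three sequences.
toy = ["MKVLAG-TWY",
       "MKVLSGDTWY",
       "MKVLAGETWY"]
strict = conserved_blocks(toy)
relaxed = conserved_blocks(toy, min_identity=0.6)
```

With full identity required, the toy alignment yields two blocks; relaxing the threshold merges the first block with the neighbouring columns, which loosely mirrors the distinction between family-level and sub-family-level conservation.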


Subjects
Evolution, Molecular; Proteins/genetics; Software; Amino Acid Sequence; Computational Biology; Conserved Sequence; Phylogeny; Sequence Alignment
8.
Mol Biol Evol ; 34(8): 2016-2034, 2017 08 01.
Article in English | MEDLINE | ID: mdl-28460059

ABSTRACT

Cilia (flagella) are important eukaryotic organelles, present in the Last Eukaryotic Common Ancestor, and are involved in cell motility and integration of extracellular signals. Ciliary dysfunction causes a class of genetic diseases, known as ciliopathies; however, current knowledge of the underlying mechanisms is still limited and a better characterization of genes is needed. As cilia have been lost independently several times during evolution and they are subject to important functional variation between species, ciliary genes can be investigated through comparative genomics. We performed phylogenetic profiling by predicting orthologs of human protein-coding genes in 100 eukaryotic species. The analysis integrated three independent methods to predict a consensus set of 274 ciliary genes, including 87 new promising candidates. A fine-grained analysis of the phylogenetic profiles allowed a partitioning of ciliary genes into modules with distinct evolutionary histories and ciliary functions (assembly, movement, centriole, etc.) and thus propagation of potential annotations to previously undocumented genes. The cilia/basal body localization was experimentally confirmed for five of these previously unannotated proteins (LRRC23, LRRC34, TEX9, WDR27, and BIVM), validating the relevance of our approach. Furthermore, our multi-level analysis sheds light on the core gene sets retained in gamete-only flagellates or Ecdysozoa, for instance. By combining gene-centric and species-oriented analyses, this work reveals new ciliary and ciliopathy gene candidates and provides clues about the evolution of ciliary processes in the eukaryotic domain. Additionally, the positive and negative reference gene sets and the phylogenetic profile of human genes constructed during this study can be exploited in future work.
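
One simple way to combine several independent predictors into a consensus gene set is a vote threshold. The abstract does not say how the three methods were integrated, so the majority rule below is an assumption, and the method names, gene identifiers, and `consensus` function are hypothetical.

```python
from collections import Counter


def consensus(predictions, min_votes=2):
    """Return genes predicted by at least `min_votes` of the methods."""
    votes = Counter(gene for genes in predictions.values() for gene in set(genes))
    return sorted(gene for gene, n in votes.items() if n >= min_votes)


# Hypothetical toy predictions from three methods.
predictions = {
    "method_1": {"geneA", "geneB", "geneC"},
    "method_2": {"geneB", "geneC", "geneD"},
    "method_3": {"geneC", "geneE"},
}

majority = consensus(predictions)                 # at least two methods agree
unanimous = consensus(predictions, min_votes=3)   # all three methods agree
```

Raising `min_votes` trades sensitivity for precision: the unanimous set is smaller but every member is supported by all methods.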


Subjects
Cilia/genetics; Ciliopathies/genetics; Animals; Cell Movement/genetics; Cilia/metabolism; Ciliopathies/metabolism; Databases, Nucleic Acid; Eukaryota; Eukaryotic Cells; Evolution, Molecular; Flagella/genetics; Flagella/metabolism; Genomics; Humans; Phylogeny; Sequence Analysis, DNA/methods
9.
BMC Bioinformatics ; 17(1): 271, 2016 Jul 07.
Article in English | MEDLINE | ID: mdl-27387560

ABSTRACT

BACKGROUND: A standard procedure in many areas of bioinformatics is to use a multiple sequence alignment (MSA) as the basis for various types of homology-based inference. Applications include 3D structure modelling, protein functional annotation, prediction of molecular interactions, etc. These applications, however sophisticated, are generally highly sensitive to the alignment used, and neglecting non-homologous or uncertain regions in the alignment can lead to significant bias in the subsequent inferences. RESULTS: Here, we present a new method, LEON-BIS, which uses a robust Bayesian framework to estimate the homologous relations between sequences in a protein multiple alignment. Sequences are clustered into sub-families and relations are predicted at different levels, including 'core blocks', 'regions' and full-length proteins. The accuracy and reliability of the predictions are demonstrated in large-scale comparisons using well annotated alignment databases, where the homologous sequence segments are detected with very high sensitivity and specificity. CONCLUSIONS: LEON-BIS uses robust Bayesian statistics to distinguish the portions of multiple sequence alignments that are conserved either across the whole family or within subfamilies. LEON-BIS should thus be useful for automatic, high-throughput genome annotations, 2D/3D structure predictions, protein-protein interaction predictions etc.


Subjects
Bayes Theorem; Computational Biology/methods; Proteins/chemistry; Sequence Alignment/methods; Amino Acid Sequence; Humans; Proteins/genetics; Sequence Homology, Amino Acid
10.
Bioinformatics ; 31(3): 447-8, 2015 Feb 01.
Article in English | MEDLINE | ID: mdl-25273105

ABSTRACT

SUMMARY: We previously developed OrthoInspector, a package incorporating an original algorithm for the detection of orthology and inparalogy relations between different species. We have added new functionalities to the package. While its original algorithm was not modified, performing similar orthology predictions, we facilitated the prediction of very large databases (thousands of proteomes), refurbished its graphical interface, added new visualization tools for comparative genomics/protein family analysis and facilitated its deployment in a network environment. Finally, we have released three online databases of precomputed orthology relationships. AVAILABILITY: Package and databases are freely available at http://lbgi.fr/orthoinspector with all major browsers supported. CONTACT: odile.lecompte@unistra.fr SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Algorithms; Computer Graphics; Databases, Factual; Proteomics/methods; Sequence Analysis, Protein/methods; Software; Humans; Molecular Sequence Annotation; Phylogeny
11.
Bioinformatics ; 30(17): 2432-9, 2014 Sep 01.
Article in English | MEDLINE | ID: mdl-24825613

ABSTRACT

MOTIVATION: The prediction of protein coding genes is a major challenge that depends on the quality of genome sequencing, the accuracy of the model used to elucidate the exonic structure of the genes and the complexity of the gene splicing process leading to different protein variants. As a consequence, today's protein databases contain a huge amount of inconsistency, due to both natural variants and sequence prediction errors. RESULTS: We have developed a new method, called SIBIS, to detect such inconsistencies based on the evolutionary information in multiple sequence alignments. A Bayesian framework, combined with Dirichlet mixture models, is used to estimate the probability of observing specific amino acids and to detect inconsistent or erroneous sequence segments. We evaluated the performance of SIBIS on a reference set of protein sequences with experimentally validated errors and showed that the sensitivity is significantly higher than previous methods, with only a small loss of specificity. We also assessed a large set of human sequences from the UniProt database and found evidence of inconsistency in 48% of the previously uncharacterized sequences. We conclude that the integration of quality control methods like SIBIS in automatic analysis pipelines will be critical for the robust inference of structural, functional and phylogenetic information from these sequences. AVAILABILITY AND IMPLEMENTATION: Source code, implemented in C on a Linux system, and the datasets of protein sequences are freely available for download at http://www.lbgi.fr/~julie/SIBIS.
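
The idea of estimating amino acid probabilities in an alignment column under a Dirichlet prior can be sketched with a single-component prior, for which the posterior mean has a closed form: (count + alpha) / (total count + sum of alphas). SIBIS uses Dirichlet mixtures, so this one-component version is a deliberate simplification; the uniform pseudocounts and the name `posterior_mean` are assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def posterior_mean(counts, alpha):
    """Posterior mean probability of each residue under a single Dirichlet prior."""
    total = sum(counts.get(a, 0) for a in alpha) + sum(alpha.values())
    return {a: (counts.get(a, 0) + alpha[a]) / total for a in alpha}


# Hypothetical column with 3 alanines and 1 glycine; uniform prior (alpha = 1).
alpha = {a: 1.0 for a in AMINO_ACIDS}
column_counts = {"A": 3, "G": 1}
probs = posterior_mean(column_counts, alpha)
```

A residue that is never observed in the column still receives a small non-zero probability, so an unexpected residue in a new sequence scores as unlikely rather than impossible.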


Subjects
Sequence Analysis, Protein/methods; Algorithms; Animals; Bayes Theorem; Databases, Protein; Humans; Macaca mulatta; Phylogeny; Sequence Alignment; Software
12.
BMC Bioinformatics ; 15: 111, 2014 Apr 17.
Article in English | MEDLINE | ID: mdl-24742296

ABSTRACT

BACKGROUND: Small insertion and deletion polymorphisms (Indels) are the second most common mutations in the human genome, after Single Nucleotide Polymorphisms (SNPs). Recent studies have shown that they have significant influence on genetic variation by altering human traits and can cause multiple human diseases. In particular, many Indels that occur in protein coding regions are known to impact the structure or function of the protein. A major challenge is to predict the effects of these Indels and to distinguish between deleterious and neutral variants. When an Indel occurs within a coding region, it can be either frameshifting (FS) or non-frameshifting (NFS). FS-Indels either modify the complete C-terminal region of the protein or result in premature termination of translation. NFS-Indels insert/delete multiples of three nucleotides leading to the insertion/deletion of one or more amino acids. RESULTS: In order to study the relationships between NFS-Indels and Mendelian diseases, we characterized NFS-Indels according to numerous structural, functional and evolutionary parameters. We then used these parameters to identify specific characteristics of disease-causing and neutral NFS-Indels. Finally, we developed a new machine learning approach, KD4i, that can be used to predict the phenotypic effects of NFS-Indels. CONCLUSIONS: We demonstrate in a large-scale evaluation that the accuracy of KD4i is comparable to existing state-of-the-art methods. However, a major advantage of our approach is that we also provide the reasons for the predictions, in the form of a set of rules. The rules are interpretable by non-expert humans and they thus represent new knowledge about the relationships between the genotype and phenotypes of NFS-Indels and the causative molecular perturbations that result in the disease.
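
The frameshifting versus non-frameshifting distinction described above reduces to whether the length change is a multiple of three. The sketch below assumes a VCF-like pair of reference and alternate allele strings; the function name `classify_indel` is hypothetical and the code is unrelated to KD4i itself.

```python
def classify_indel(ref, alt):
    """Classify a coding indel as ('insertion'|'deletion', 'FS'|'NFS')."""
    if len(ref) == len(alt):
        raise ValueError("alleles have equal length: not an indel")
    kind = "insertion" if len(alt) > len(ref) else "deletion"
    # Non-frameshifting (NFS) indels insert or delete a multiple of three bases.
    frame = "NFS" if abs(len(alt) - len(ref)) % 3 == 0 else "FS"
    return kind, frame


in_frame = classify_indel("A", "ACTG")   # three bases inserted
frameshift = classify_indel("ACT", "A")  # two bases deleted
```

The first call describes a three-base insertion that adds one amino acid without shifting the frame; the second is a two-base deletion that shifts the reading frame downstream.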


Subjects
Artificial Intelligence; INDEL Mutation; Phenotype; Proteins/genetics; Humans; Kinesins/genetics
13.
Nucleic Acids Res ; 40(Web Server issue): W71-5, 2012 Jul.
Article in English | MEDLINE | ID: mdl-22641855

ABSTRACT

A major challenge in the post-genomic era is a better understanding of how human genetic alterations involved in disease affect the gene products. The KD4v (Comprehensible Knowledge Discovery System for Missense Variant) server allows users to characterize and predict the phenotypic effects (deleterious/neutral) of missense variants. The server provides a set of rules learned by Inductive Logic Programming (ILP) on a set of missense variants described by conservation, physico-chemical, functional and 3D structure predicates. These rules are interpretable by non-expert humans and are used to accurately predict the deleterious/neutral status of an unknown mutation. The web server is available at http://decrypthon.igbmc.fr/kd4v.


Subjects
Disease/genetics; Mutation, Missense; Polymorphism, Single Nucleotide; Software; Genetic Association Studies; Humans; Internet; Knowledge Bases; Phenotype; Proteins/chemistry; Proteins/genetics
14.
Genomics ; 101(3): 178-86, 2013 Mar.
Article in English | MEDLINE | ID: mdl-23147676

ABSTRACT

TFIIH is a eukaryotic complex composed of two subcomplexes, the CAK (Cdk activating kinase) and the core-TFIIH. The core-TFIIH, composed of seven subunits (XPB, XPD, P62, P52, P44, P34, and P8), plays a crucial role in transcription and repair. Here, we performed an extended sequence analysis to establish the accurate phylogenetic distribution of the core-TFIIH in 63 eukaryotic organisms. In spite of the high conservation of the seven subunits at the sequence and genomic levels, the non-enzymatic P8, P34, P52 and P62 are absent from one or a few unicellular species. To gain insight into their respective roles, we undertook a comparative genomic analysis of the whole proteome to identify the gene sets sharing similar presence/absence patterns. While little information was inferred for P8 and P62, our studies confirm the known role of P52 in repair and suggest for the first time the implication of the core TFIIH in mRNA splicing via P34.


Subjects
Evolution, Molecular; Multiprotein Complexes/genetics; Phylogeny; Transcription Factor TFIIH/genetics; Animals; Cyclin-Dependent Kinases/genetics; DNA-Binding Proteins; Humans; Protein Subunits/genetics; Transcription, Genetic
15.
Front Bioinform ; 3: 1178926, 2023.
Article in English | MEDLINE | ID: mdl-37151482

ABSTRACT

Protein annotation errors can have significant consequences in a wide range of fields, ranging from protein structure and function prediction to biomedical research, drug discovery, and biotechnology. By comparing the domains of different proteins, scientists can identify common domains, classify proteins based on their domain architecture, and highlight proteins that have evolved differently in one or more species or clades. However, genome-wide identification of different protein domain architectures involves a complex error-prone pipeline that includes genome sequencing, prediction of gene exon/intron structures, and inference of protein sequences and domain annotations. Here we developed an automated fact-checking approach to distinguish true domain loss/gain events from false events caused by errors that occur during the annotation process. Using genome-wide ortholog sets and taking advantage of the high-quality human and Saccharomyces cerevisiae genome annotations, we analyzed the domain gain and loss events in the predicted proteomes of 9 non-human primates (NHP) and 20 non-S. cerevisiae fungi (NSF) as annotated in the UniProt and InterPro databases. Our approach allowed us to quantify the impact of errors on estimates of protein domain gains and losses, and we show that domain losses are over-estimated ten-fold and three-fold in the NHP and NSF proteins respectively. This is in line with previous studies of gene-level losses, where issues with genome sequencing or gene annotation led to genes being falsely inferred as absent. In addition, we show that inconsistent protein domain annotations are a major factor contributing to the false events. For the first time, to our knowledge, we show that domain gains are also over-estimated by three-fold and two-fold respectively in NHP and NSF proteins. Based on our more accurate estimates, we infer that true domain losses and gains in NHP with respect to humans are observed at similar rates, while domain gains in the more divergent NSF are observed twice as frequently as domain losses with respect to S. cerevisiae. This study highlights the need to critically examine the scientific validity of protein annotations, and represents a significant step toward scalable computational fact-checking methods that may one day mitigate the propagation of wrong information in protein databases.

16.
J Fungi (Basel) ; 9(4)2023 Mar 29.
Article in English | MEDLINE | ID: mdl-37108879

ABSTRACT

In fungi, the most abundant transcription factor (TF) class contains a fungal-specific 'GAL4-like' Zn2C6 DNA binding domain (DBD), while the second class contains another fungal-specific domain, known as 'fungal_trans' or middle homology domain (MHD), whose function remains largely uncharacterized. Remarkably, almost a third of MHD-containing TFs in public sequence databases apparently lack DNA binding activity, since they are not predicted to contain a DBD. Here, we reassess the domain organization of these 'MHD-only' proteins using an in silico error-tracking approach. In a large-scale analysis of ~17,000 MHD-only TF sequences present in all fungal phyla except Microsporidia and Cryptomycota, we show that the vast majority (>90%) result from genome annotation errors and we are able to predict a new DBD sequence for 14,261 of them. Most of these sequences correspond to a Zn2C6 domain (82%), with a small proportion of C2H2 domains (4%) found only in Dikarya. Our results contradict previous findings that the MHD-only TF are widespread in fungi. In contrast, we show that they are exceptional cases, and that the fungal-specific Zn2C6-MHD domain pair represents the canonical domain signature defining the most predominant fungal TF family. We call this family CeGAL, after the highly characterized members: Cep3, whose 3D structure is determined, and GAL4, a eukaryotic TF archetype. We believe that this will not only improve the annotation and classification of the Zn2C6 TF but will also provide critical guidance for future fungal gene regulatory network analyses.

17.
BMC Genomics ; 13: 5, 2012 Jan 04.
Article in English | MEDLINE | ID: mdl-22217008

ABSTRACT

BACKGROUND: The data from high throughput genomics technologies provide unique opportunities for studies of complex biological systems, but also pose many new challenges. The shift to the genome scale in evolutionary biology, for example, has led to many interesting, but often controversial studies. It has been suggested that part of the conflict may be due to errors in the initial sequences. Most gene sequences are predicted by bioinformatics programs and a number of quality issues have been raised, concerning DNA sequencing errors or badly predicted coding regions, particularly in eukaryotes. RESULTS: We investigated the impact of these errors on evolutionary studies and specifically on the identification of important genetic events. We focused on the detection of asymmetric evolution after duplication, which has been the subject of controversy recently. Using the human genome as a reference, we established a reliable set of 688 duplicated genes in 13 complete vertebrate genomes, where significantly different evolutionary rates are observed. We estimated the rates at which protein sequence errors occur and are accumulated in the higher-level analyses. We showed that the majority of the detected events (57%) are in fact artifacts due to the putative erroneous sequences and that these artifacts are sufficient to mask the true functional significance of the events. CONCLUSIONS: Initial errors are accumulated throughout the evolutionary analysis, generating artificially high rates of event predictions and leading to substantial uncertainty in the conclusions. This study emphasizes the urgent need for error detection and quality control strategies in order to efficiently extract knowledge from the new genome data.


Subjects
Evolution, Molecular; Genomics; Sequence Analysis, DNA/standards; Amino Acid Sequence; Animals; Artifacts; Computational Biology; Humans; Molecular Sequence Data; Phylogeny; Quality Control; Reproducibility of Results; Sequence Alignment; Sequence Homology
18.
BMC Genomics ; 13: 297, 2012 Jul 02.
Article in English | MEDLINE | ID: mdl-22748146

ABSTRACT

BACKGROUND: Membrane trafficking involves the complex regulation of the intracellular localization of proteins and lipids and is required for metabolic uptake, cell growth and development. Different trafficking pathways passing through the endosomes are coordinated by the ENTH/ANTH/VHS adaptor protein superfamily. The endosomes are crucial for eukaryotes since the acquisition of the endomembrane system was a central process in eukaryogenesis. RESULTS: Our in silico analysis of this ENTH/ANTH/VHS superfamily, consisting of proteins gathered from 84 complete genomes representative of the different eukaryotic taxa, revealed that the genomic distribution of this superfamily makes it possible to discriminate Fungi and Metazoa from Plantae and Protists. Next, in a four-way genome-wide comparison, we showed that this discriminative feature is observed not only for other membrane trafficking effectors, but also for proteins involved in metabolism and in cytokinesis, suggesting that metabolism, cytokinesis and intracellular trafficking pathways co-evolved. Moreover, some of the proteins identified were implicated in multiple functions, in either trafficking and metabolism or trafficking and cytokinesis, suggesting that membrane trafficking is central to this co-evolution process. CONCLUSIONS: Our study suggests that membrane trafficking and compartmentalization were not only key features for the emergence of eukaryotic cells but also drove the separation of the eukaryotes into the different taxa.


Subjects
Cell Membrane/metabolism , Genomics/methods , Protein Transport/physiology , Proteins/metabolism , Biological Evolution , Cytokinesis/physiology , Phylogeny , Proteins/chemistry , Proteins/classification
19.
Mol Syst Biol ; 7: 539, 2011 Oct 11.
Article in English | MEDLINE | ID: mdl-21988835

ABSTRACT

Multiple sequence alignments are fundamental to many sequence analysis methods. Most alignments are computed using the progressive alignment heuristic. These methods are starting to become a bottleneck in some analysis pipelines when faced with data sets of the size of many thousands of sequences. Some methods allow computation of larger data sets while sacrificing quality, and others produce high-quality alignments, but scale badly with the number of sequences. In this paper, we describe a new program called Clustal Omega, which can align virtually any number of protein sequences quickly and that delivers accurate alignments. The accuracy of the package on smaller test cases is similar to that of the high-quality aligners. On larger data sets, Clustal Omega outperforms other packages in terms of execution time and quality. Clustal Omega also has powerful features for adding sequences to and exploiting information in existing alignments, making use of the vast amount of precomputed information in public databases like Pfam.


Subjects
Data Mining/methods , Proteins/analysis , Sequence Alignment/methods , Protein Sequence Analysis/methods , Systems Biology , Algorithms , Amino Acid Sequence , Base Sequence , Factual Databases , Molecular Sequence Data , Proteins/chemistry , Software , Systems Biology/instrumentation , Systems Biology/methods
20.
PLoS Comput Biol ; 7(12): e1002269, 2011 Dec.
Article in English | MEDLINE | ID: mdl-22144877

ABSTRACT

The identification of single copy (1-to-1) orthologs in any group of organisms is important for functional classification and phylogenetic studies. The Metazoa are no exception, but only recently has there been a wide-enough distribution of taxa with sufficiently high quality sequenced genomes to gain confidence in the widespread single copy status of a gene. Here, we present a phylogenetic approach for identifying overlooked single copy orthologs from multigene families and apply it to the Metazoa. Using 18 sequenced metazoan genomes of high quality we identified a robust set of 1,126 orthologous groups that have been retained in single copy since the last common ancestor of Metazoa. We found that the phylogenetic procedure increased the number of single copy orthologs found by over a third compared with standard taxon-count approaches. The orthologs represented a wide range of functional categories, expression profiles and levels of divergence. To demonstrate the value of our set of single copy orthologs, we used them to assess the completeness of 24 currently published metazoan genomes and 62 EST datasets. We found that the annotated genes in published genomes vary in coverage from 79% (Ciona intestinalis) to 99.8% (human) with an average of 92%, suggesting a value for the underlying error rate in genome annotation, and a strategy for identifying single copy orthologs in larger datasets. In contrast, the vast majority of EST datasets with no corresponding genome sequence available are largely under-sampled and probably do not accurately represent the actual genomic complement of the organisms from which they are derived.


Subjects
Gene Dosage , Genome/genetics , Genomics/methods , Phylogeny , Animals , Genetic Databases , Molecular Evolution , Expressed Sequence Tags , Humans , Multigene Family