Search | BVS Search Portal

1.

Spliceator: multi-species splice site prediction using convolutional neural networks.

Scalzitti, Nicolas; Kress, Arnaud; Orhand, Romain; Weber, Thomas; Moulinier, Luc; Jeannin-Girardon, Anne; Collet, Pierre; Poch, Olivier; Thompson, Julie D.

BMC Bioinformatics ; 22(1): 561, 2021 Nov 23.

Article in English | MEDLINE | ID: mdl-34814826

ABSTRACT

BACKGROUND: Ab initio prediction of splice sites is an essential step in eukaryotic genome annotation. Recent predictors have exploited Deep Learning algorithms and reliable gene structures from model organisms. However, Deep Learning methods for non-model organisms are lacking. RESULTS: We developed Spliceator to predict splice sites in a wide range of species, including model and non-model organisms. Spliceator uses a convolutional neural network and is trained on carefully validated data from over 100 organisms. We show that Spliceator achieves consistently high accuracy (89-92%) compared to existing methods on independent benchmarks from human, fish, fly, worm, plant and protist organisms. CONCLUSIONS: Spliceator is a new Deep Learning method trained on high-quality data, which can be used to predict splice sites in diverse organisms, ranging from human to protists, with consistently high accuracy.

Subject(s)

Algorithms , Neural Networks, Computer , Animals , Genome , Humans

2.

Circular code motifs in the ribosome: a missing link in the evolution of translation?

Dila, Gopal; Ripp, Raymond; Mayer, Claudine; Poch, Olivier; Michel, Christian J; Thompson, Julie D.

RNA ; 25(12): 1714-1730, 2019 Dec.

Article in English | MEDLINE | ID: mdl-31506380

ABSTRACT

The origin of the genetic code remains enigmatic five decades after it was elucidated, although there is growing evidence that the code coevolved progressively with the ribosome. A number of primordial codes were proposed as ancestors of the modern genetic code, including comma-free codes such as the RRY, RNY, or GNC codes (R = G or A, Y = C or T, N = any nucleotide), and the X circular code, an error-correcting code that also allows identification and maintenance of the reading frame. It was demonstrated previously that motifs of the X circular code are significantly enriched in the protein-coding genes of most organisms, from bacteria to eukaryotes. Here, we show that imprints of this code also exist in the ribosomal RNA (rRNA). In a large-scale study involving 133 organisms representative of the three domains of life, we identified 32 universal X motifs that are conserved in the rRNA of >90% of the organisms. Intriguingly, most of the universal X motifs are located in rRNA regions involved in important ribosome functions, notably in the peptidyl transferase center and the decoding center that form the original "proto-ribosome." Building on the existing accretion models for ribosome evolution, we propose that error-correcting circular codes represented an important step in the emergence of the modern genetic code. Thus, circular codes would have allowed the simultaneous coding of amino acids and synchronization of the reading frame in primitive translation systems, prior to the emergence of more sophisticated start codon recognition and translation initiation mechanisms.

Subject(s)

Evolution, Molecular , Genetic Code , Nucleotide Motifs , Protein Biosynthesis , Ribosomes/genetics , Ribosomes/metabolism , Models, Biological , Models, Molecular , Molecular Conformation , Nucleic Acid Conformation , RNA, Ribosomal/chemistry , RNA, Ribosomal/genetics , Ribosomes/chemistry , Structure-Activity Relationship
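
As an illustration of the comma-free codes mentioned in the abstract above, the following minimal Python sketch enumerates the RNY trinucleotides from the definitions given there (R = G or A, Y = C or T, N = any nucleotide) and counts code words in each of the three frames of a sequence. The function names and the example sequence are hypothetical and are not taken from the paper; the X circular code itself is not reproduced here.

```python
from itertools import product

PURINES = "AG"        # R
PYRIMIDINES = "CT"    # Y
NUCLEOTIDES = "ACGT"  # N


def rny_codons():
    """Return the set of all trinucleotides of the form R-N-Y."""
    return {r + n + y for r, n, y in product(PURINES, NUCLEOTIDES, PYRIMIDINES)}


def frame_scores(seq, codons):
    """Count, for each of the three frames, the trinucleotides found in `codons`."""
    scores = []
    for frame in range(3):
        hits = sum(seq[i:i + 3] in codons for i in range(frame, len(seq) - 2, 3))
        scores.append(hits)
    return scores


if __name__ == "__main__":
    code = rny_codons()
    # Hypothetical sequence built only from RNY words: AAC GAT GCT
    print(len(code), frame_scores("AACGATGCT", code))
```

For a sequence written entirely in RNY words, the shifted frames read NYR and YRN trinucleotides, none of which can match R-N-Y, so only the original frame scores hits. This is the sense in which such a code lets the reading frame be identified.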

3.

OrthoInspector 3.0: open portal for comparative genomics.

Nevers, Yannis; Kress, Arnaud; Defosset, Audrey; Ripp, Raymond; Linard, Benjamin; Thompson, Julie D; Poch, Olivier; Lecompte, Odile.

Nucleic Acids Res ; 47(D1): D411-D418, 2019 Jan 08.

Article in English | MEDLINE | ID: mdl-30380106

ABSTRACT

OrthoInspector is one of the leading software suites for orthology relations inference. In this paper, we describe a major redesign of the OrthoInspector online resource along with a significant increase in the number of species: 4753 organisms are now covered across the three domains of life, making OrthoInspector the most exhaustive orthology resource to date in terms of covered species (excluding viruses). The new website integrates original data exploration and visualization tools in an ergonomic interface. Distributions of protein orthologs are represented by heatmaps summarizing their evolutionary histories, and proteins with similar profiles can be directly accessed. Two novel tools have been implemented for comparative genomics: a phylogenetic profile search that can be used to find proteins with a specific presence-absence profile and investigate their functions and, inversely, a GO profiling tool aimed at deciphering evolutionary histories of molecular functions, processes or cell components. In addition to the re-designed website, the OrthoInspector resource now provides a REST interface for programmatic access. OrthoInspector 3.0 is available at http://lbgi.fr/orthoinspectorv3.

Subject(s)

Databases, Genetic , Genomics , Algorithms , Bacteria/genetics , Classification , Eukaryota/genetics , Evolution, Molecular , Forecasting , Gene Ontology , Internet , Phylogeny , Proteome , Sequence Homology, Nucleic Acid , Software , Species Specificity
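
The phylogenetic profile search described in the abstract above (finding proteins with a specific presence-absence profile) can be sketched as a simple set query. This is only a toy illustration under stated assumptions: the protein names, species names, and the `profile_search` function are hypothetical, and the sketch does not use or represent the OrthoInspector website or its REST interface.

```python
def profile_search(profiles, present_in, absent_from):
    """Return proteins with orthologs in every species of `present_in`
    and in no species of `absent_from`.

    `profiles` maps a protein name to the set of species in which an
    ortholog was detected (hypothetical toy data structure).
    """
    required = set(present_in)
    forbidden = set(absent_from)
    return sorted(
        protein
        for protein, species in profiles.items()
        if required <= species and not (forbidden & species)
    )


if __name__ == "__main__":
    # Hypothetical presence sets for three made-up proteins
    toy_profiles = {
        "protA": {"human", "mouse", "fly"},
        "protB": {"human", "mouse", "yeast"},
        "protC": {"human", "fly"},
    }
    print(profile_search(toy_profiles, ["human", "fly"], ["yeast"]))
```

Representing each profile as a set keeps the query readable; a real resource covering thousands of species would more likely use bit vectors or a database index.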

4.

Understanding the causes of errors in eukaryotic protein-coding gene prediction: a case study of primate proteomes.

Meyer, Corentin; Scalzitti, Nicolas; Jeannin-Girardon, Anne; Collet, Pierre; Poch, Olivier; Thompson, Julie D.

BMC Bioinformatics ; 21(1): 513, 2020 Nov 10.

Article in English | MEDLINE | ID: mdl-33172385

ABSTRACT

BACKGROUND: Recent advances in sequencing technologies have led to an explosion in the number of genomes available, but accurate genome annotation remains a major challenge. The prediction of protein-coding genes in eukaryotic genomes is especially problematic, due to their complex exon-intron structures. Even the best eukaryotic gene prediction algorithms can make serious errors that will significantly affect subsequent analyses. RESULTS: We first investigated the prevalence of gene prediction errors in a large set of 176,478 proteins from ten primate proteomes available in public databases. Using the well-studied human proteins as a reference, a total of 82,305 potential errors were detected, including 44,001 deletions, 27,289 insertions and 11,015 mismatched segments where part of the correct protein sequence is replaced with an alternative erroneous sequence. We then focused on the mismatched sequence errors that cause particular problems for downstream applications. A detailed characterization allowed us to identify the potential causes for the gene misprediction in approximately half (5446) of these cases. As a proof-of-concept, we also developed a simple method which allowed us to propose improved sequences for 603 primate proteins. CONCLUSIONS: Gene prediction errors in primate proteomes affect up to 50% of the sequences. Major causes of errors include undetermined genome regions, genome sequencing or assembly issues, and limitations in the models used to represent gene exon-intron structures. Nevertheless, existing genome sequences can still be exploited to improve protein sequence quality. Perspectives of the work include the characterization of other types of gene prediction errors, as well as the development of a more comprehensive algorithm for protein sequence error correction.

Subject(s)

Open Reading Frames/genetics , Primates/metabolism , Proteome , Amino Acid Sequence , Animals , Databases, Protein , Gene Deletion , Humans , Mutagenesis, Insertional , Receptor-Like Protein Tyrosine Phosphatases/chemistry , Receptor-Like Protein Tyrosine Phosphatases/genetics , Receptor-Like Protein Tyrosine Phosphatases/metabolism , Sequence Alignment

5.

A benchmark study of ab initio gene prediction methods in diverse eukaryotic organisms.

Scalzitti, Nicolas; Jeannin-Girardon, Anne; Collet, Pierre; Poch, Olivier; Thompson, Julie D.

BMC Genomics ; 21(1): 293, 2020 Apr 09.

Article in English | MEDLINE | ID: mdl-32272892

ABSTRACT

BACKGROUND: The draft genome assemblies produced by new sequencing technologies present important challenges for automatic gene prediction pipelines, leading to less accurate gene models. New benchmark methods are needed to evaluate the accuracy of gene prediction methods in the face of incomplete genome assemblies, low genome coverage and quality, complex gene structures, or a lack of suitable sequences for evidence-based annotations. RESULTS: We describe the construction of a new benchmark, called G3PO (benchmark for Gene and Protein Prediction PrOgrams), designed to represent many of the typical challenges faced by current genome annotation projects. The benchmark is based on a carefully validated and curated set of real eukaryotic genes from 147 phylogenetically disperse organisms, and a number of test sets are defined to evaluate the effects of different features, including genome sequence quality, gene structure complexity, protein length, etc. We used the benchmark to perform an independent comparative analysis of the most widely used ab initio gene prediction programs and identified the main strengths and weaknesses of the programs. More importantly, we highlight a number of features that could be exploited in order to improve the accuracy of current prediction tools. CONCLUSIONS: The experiments showed that ab initio gene structure prediction is a very challenging task, which should be further investigated. We believe that the baseline results associated with the complex gene test sets in G3PO provide useful guidelines for future studies.

Subject(s)

Computational Biology/methods , Eukaryota/genetics , Molecular Sequence Annotation/methods , Animals , Data Curation , Evolution, Molecular , Humans , Phylogeny

6.

Identification of a circular code periodicity in the bacterial ribosome: origin of codon periodicity in genes?

Michel, Christian J; Thompson, Julie D.

RNA Biol ; 17(4): 571-583, 2020 Apr.

Article in English | MEDLINE | ID: mdl-31960748

ABSTRACT

Three-base periodicity (TBP), where nucleotides and higher order n-tuples are preferentially spaced by 3, 6, 9, etc. bases, is a well-known intrinsic property of protein-coding DNA sequences. However, its origins are still not fully understood. One hypothesis is that the periodicity reflects a primordial coding system that was used before the emergence of the modern standard genetic code (SGC). Recent evidence suggests that the X circular code, a set of 20 trinucleotides allowing the reading frames in genes to be retrieved locally, represents a possible ancestor of the SGC. Motifs from the X circular code have been found in the reading frame of protein-coding regions in extant organisms from bacteria to eukaryotes, in many transfer RNA (tRNA) genes and in important functional regions of the ribosomal RNA (rRNA), notably in the peptidyl transferase centre and the decoding centre. Here, we have used a powerful correlation function to search for periodicity patterns involving the 20 trinucleotides of the X circular code in a large set of bacterial protein-coding genes, as well as in the translation machinery, including rRNA and tRNA sequences. As might be expected, we found a strong circular code periodicity 0 modulo 3 in the protein-coding genes. More surprisingly, we also identified a similar circular code periodicity in a large region of the 16S rRNA. This region includes the 3' major domain corresponding to the primordial proto-ribosome decoding centre and containing numerous sites that interact with the tRNA and messenger RNA (mRNA) during translation. Furthermore, 3D structural analysis shows that the periodicity region surrounds the mRNA channel that lies between the head and the body of the SSU. Our results support the hypothesis that the X circular code may constitute an ancestral translation code involved in reading frame retrieval and maintenance, traces of which persist in modern mRNA, tRNA and rRNA despite their long evolution and adaptation to the SGC.

Subject(s)

Bacteria/genetics , Bacterial Proteins/genetics , Computational Biology/methods , Ribosomes/genetics , Algorithms , Bacteria/metabolism , Evolution, Molecular , Genetic Code , Periodicity , RNA, Bacterial/genetics , RNA, Ribosomal/genetics , RNA, Transfer/genetics
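
To make the idea of three-base periodicity in the abstract above concrete, here is a deliberately simplified Python sketch that measures how often a nucleotide recurs at a given spacing. It is not the correlation function used in the paper, which works on the 20 trinucleotides of the X circular code; the function name `match_fraction` and the example sequence are assumptions made for illustration.

```python
def match_fraction(seq, distance):
    """Fraction of positions i where seq[i] equals seq[i + distance]."""
    pairs = len(seq) - distance
    if distance <= 0 or pairs <= 0:
        raise ValueError("distance must be positive and shorter than the sequence")
    matches = sum(seq[i] == seq[i + distance] for i in range(pairs))
    return matches / pairs


if __name__ == "__main__":
    # Hypothetical, perfectly periodic toy sequence
    toy = "ACG" * 10
    for d in range(1, 7):
        print(d, match_fraction(toy, d))
```

On real coding sequences the signal is statistical rather than exact, so one would look for a modest but consistent excess at spacings of 3, 6, 9 and so on relative to the other spacings.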

7.

PROBE: analysis and visualization of protein block-level evolution.

Kress, Arnaud; Lecompte, Odile; Poch, Olivier; Thompson, Julie D.

Bioinformatics ; 34(19): 3390-3392, 2018 Oct 01.

Article in English | MEDLINE | ID: mdl-29741582

ABSTRACT

Summary: Comparative studies of protein sequences are widely used in evolutionary and comparative genomics studies, but there is a lack of efficient tools to identify conserved regions ab initio within a protein multiple alignment. PROBE provides a fully automatic analysis of protein family conservation, to identify conserved regions, or 'blocks', that may correspond to structural/functional domains or motifs. Conserved blocks are identified at two different levels: (i) family level blocks indicate sites that are probably of central importance to the protein's structure or function, and (ii) sub-family level blocks highlight regions that may signify functional specialization, such as binding partners, etc. All conserved blocks are mapped onto a phylogenetic tree and can also be visualized in the context of the multiple sequence alignment. PROBE thus facilitates in-depth studies of sequence-structure-function-evolution relationships, and opens the way to block-level phylogenetic profiling. Availability and implementation: Freely available on the web at http://www.lbgi.fr/~julie/probe/web.

Subject(s)

Evolution, Molecular , Proteins/genetics , Software , Amino Acid Sequence , Computational Biology , Conserved Sequence , Phylogeny , Sequence Alignment

8.

Insights into Ciliary Genes and Evolution from Multi-Level Phylogenetic Profiling.

Nevers, Yannis; Prasad, Megana K; Poidevin, Laetitia; Chennen, Kirsley; Allot, Alexis; Kress, Arnaud; Ripp, Raymond; Thompson, Julie D; Dollfus, Hélène; Poch, Olivier; Lecompte, Odile.

Mol Biol Evol ; 34(8): 2016-2034, 2017 Aug 01.

Article in English | MEDLINE | ID: mdl-28460059

ABSTRACT

Cilia (flagella) are important eukaryotic organelles, present in the Last Eukaryotic Common Ancestor, and are involved in cell motility and integration of extracellular signals. Ciliary dysfunction causes a class of genetic diseases, known as ciliopathies; however, current knowledge of the underlying mechanisms is still limited and a better characterization of genes is needed. As cilia have been lost independently several times during evolution and they are subject to important functional variation between species, ciliary genes can be investigated through comparative genomics. We performed phylogenetic profiling by predicting orthologs of human protein-coding genes in 100 eukaryotic species. The analysis integrated three independent methods to predict a consensus set of 274 ciliary genes, including 87 new promising candidates. A fine-grained analysis of the phylogenetic profiles allowed a partitioning of ciliary genes into modules with distinct evolutionary histories and ciliary functions (assembly, movement, centriole, etc.) and thus propagation of potential annotations to previously undocumented genes. The cilia/basal body localization was experimentally confirmed for five of these previously unannotated proteins (LRRC23, LRRC34, TEX9, WDR27, and BIVM), validating the relevance of our approach. Furthermore, our multi-level analysis sheds light on the core gene sets retained in, for instance, gamete-only flagellates or Ecdysozoa. By combining gene-centric and species-oriented analyses, this work reveals new ciliary and ciliopathy gene candidates and provides clues about the evolution of ciliary processes in the eukaryotic domain. Additionally, the positive and negative reference gene sets and the phylogenetic profile of human genes constructed during this study can be exploited in future work.

Subject(s)

Cilia/genetics , Ciliopathies/genetics , Animals , Cell Movement/genetics , Cilia/metabolism , Ciliopathies/metabolism , Databases, Nucleic Acid , Eukaryota , Eukaryotic Cells , Evolution, Molecular , Flagella/genetics , Flagella/metabolism , Genomics , Humans , Phylogeny , Sequence Analysis, DNA/methods

9.

LEON-BIS: multiple alignment evaluation of sequence neighbours using a Bayesian inference system.

Vanhoutreve, Renaud; Kress, Arnaud; Legrand, Baptiste; Gass, Hélène; Poch, Olivier; Thompson, Julie D.

BMC Bioinformatics ; 17(1): 271, 2016 Jul 07.

Article in English | MEDLINE | ID: mdl-27387560

ABSTRACT

BACKGROUND: A standard procedure in many areas of bioinformatics is to use a multiple sequence alignment (MSA) as the basis for various types of homology-based inference. Applications include 3D structure modelling, protein functional annotation, prediction of molecular interactions, etc. These applications, however sophisticated, are generally highly sensitive to the alignment used, and neglecting non-homologous or uncertain regions in the alignment can lead to significant bias in the subsequent inferences. RESULTS: Here, we present a new method, LEON-BIS, which uses a robust Bayesian framework to estimate the homologous relations between sequences in a protein multiple alignment. Sequences are clustered into sub-families and relations are predicted at different levels, including 'core blocks', 'regions' and full-length proteins. The accuracy and reliability of the predictions are demonstrated in large-scale comparisons using well annotated alignment databases, where the homologous sequence segments are detected with very high sensitivity and specificity. CONCLUSIONS: LEON-BIS uses robust Bayesian statistics to distinguish the portions of multiple sequence alignments that are conserved either across the whole family or within subfamilies. LEON-BIS should thus be useful for automatic, high-throughput genome annotations, 2D/3D structure predictions, protein-protein interaction predictions etc.

Subject(s)

Bayes Theorem , Computational Biology/methods , Proteins/chemistry , Sequence Alignment/methods , Amino Acid Sequence , Humans , Proteins/genetics , Sequence Homology, Amino Acid

10.

OrthoInspector 2.0: Software and database updates.

Linard, Benjamin; Allot, Alexis; Schneider, Raphaël; Morel, Can; Ripp, Raymond; Bigler, Marc; Thompson, Julie D; Poch, Olivier; Lecompte, Odile.

Bioinformatics ; 31(3): 447-8, 2015 Feb 01.

Article in English | MEDLINE | ID: mdl-25273105

ABSTRACT

SUMMARY: We previously developed OrthoInspector, a package incorporating an original algorithm for the detection of orthology and inparalogy relations between different species. We have added new functionalities to the package. While its original algorithm was not modified, performing similar orthology predictions, we facilitated the prediction of very large databases (thousands of proteomes), refurbished its graphical interface, added new visualization tools for comparative genomics/protein family analysis and facilitated its deployment in a network environment. Finally, we have released three online databases of precomputed orthology relationships. AVAILABILITY: Package and databases are freely available at http://lbgi.fr/orthoinspector with all major browsers supported. CONTACT: odile.lecompte@unistra.fr SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Algorithms , Computer Graphics , Databases, Factual , Proteomics/methods , Sequence Analysis, Protein/methods , Software , Humans , Molecular Sequence Annotation , Phylogeny

11.

SIBIS: a Bayesian model for inconsistent protein sequence estimation.

Khenoussi, Walyd; Vanhoutrève, Renaud; Poch, Olivier; Thompson, Julie D.

Bioinformatics ; 30(17): 2432-9, 2014 Sep 01.

Article in English | MEDLINE | ID: mdl-24825613

ABSTRACT

MOTIVATION: The prediction of protein coding genes is a major challenge that depends on the quality of genome sequencing, the accuracy of the model used to elucidate the exonic structure of the genes and the complexity of the gene splicing process leading to different protein variants. As a consequence, today's protein databases contain a huge amount of inconsistency, due to both natural variants and sequence prediction errors. RESULTS: We have developed a new method, called SIBIS, to detect such inconsistencies based on the evolutionary information in multiple sequence alignments. A Bayesian framework, combined with Dirichlet mixture models, is used to estimate the probability of observing specific amino acids and to detect inconsistent or erroneous sequence segments. We evaluated the performance of SIBIS on a reference set of protein sequences with experimentally validated errors and showed that the sensitivity is significantly higher than previous methods, with only a small loss of specificity. We also assessed a large set of human sequences from the UniProt database and found evidence of inconsistency in 48% of the previously uncharacterized sequences. We conclude that the integration of quality control methods like SIBIS in automatic analysis pipelines will be critical for the robust inference of structural, functional and phylogenetic information from these sequences. AVAILABILITY AND IMPLEMENTATION: Source code, implemented in C on a Linux system, and the datasets of protein sequences are freely available for download at http://www.lbgi.fr/~julie/SIBIS.

Subject(s)

Sequence Analysis, Protein/methods , Algorithms , Animals , Bayes Theorem , Databases, Protein , Humans , Macaca mulatta , Phylogeny , Sequence Alignment , Software
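
The abstract above mentions a Bayesian framework with Dirichlet mixture models for estimating the probability of observing specific amino acids. The Python sketch below shows only the simplest related case, a single symmetric Dirichlet prior, whose posterior mean is (count + alpha) / (total + 20 * alpha). It is not the SIBIS implementation (which is written in C and uses mixtures); the function name, the `alpha` value, and the example alignment column are assumptions chosen for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def posterior_probabilities(column, alpha=1.0):
    """Posterior mean amino acid probabilities for one alignment column
    under a single symmetric Dirichlet prior with pseudocount `alpha`."""
    counts = {aa: 0 for aa in AMINO_ACIDS}
    for residue in column:
        if residue in counts:  # gaps and unknown symbols are ignored
            counts[residue] += 1
    total = sum(counts.values()) + alpha * len(AMINO_ACIDS)
    return {aa: (counts[aa] + alpha) / total for aa in AMINO_ACIDS}


if __name__ == "__main__":
    # Hypothetical column: three alanines, one glycine, one gap
    probs = posterior_probabilities("AAAG-")
    print(round(probs["A"], 3), round(probs["G"], 3), round(probs["W"], 3))
```

A residue with a very low posterior probability in its column would be a candidate inconsistency under this toy model; a mixture of Dirichlet components would additionally capture column types such as hydrophobic or charged positions.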

12.

A comprehensive study of small non-frameshift insertions/deletions in proteins and prediction of their phenotypic effects by a machine learning method (KD4i).

Bermejo-Das-Neves, Carlos; Nguyen, Hoan-Ngoc; Poch, Olivier; Thompson, Julie D.

BMC Bioinformatics ; 15: 111, 2014 Apr 17.

Article in English | MEDLINE | ID: mdl-24742296

ABSTRACT

BACKGROUND: Small insertion and deletion polymorphisms (Indels) are the second most common mutations in the human genome, after Single Nucleotide Polymorphisms (SNPs). Recent studies have shown that they have significant influence on genetic variation by altering human traits and can cause multiple human diseases. In particular, many Indels that occur in protein coding regions are known to impact the structure or function of the protein. A major challenge is to predict the effects of these Indels and to distinguish between deleterious and neutral variants. When an Indel occurs within a coding region, it can be either frameshifting (FS) or non-frameshifting (NFS). FS-Indels either modify the complete C-terminal region of the protein or result in premature termination of translation. NFS-Indels insert/delete multiples of three nucleotides leading to the insertion/deletion of one or more amino acids. RESULTS: In order to study the relationships between NFS-Indels and Mendelian diseases, we characterized NFS-Indels according to numerous structural, functional and evolutionary parameters. We then used these parameters to identify specific characteristics of disease-causing and neutral NFS-Indels. Finally, we developed a new machine learning approach, KD4i, that can be used to predict the phenotypic effects of NFS-Indels. CONCLUSIONS: We demonstrate in a large-scale evaluation that the accuracy of KD4i is comparable to existing state-of-the-art methods. However, a major advantage of our approach is that we also provide the reasons for the predictions, in the form of a set of rules. The rules are interpretable by non-expert humans and they thus represent new knowledge about the relationships between the genotype and phenotypes of NFS-Indels and the causative molecular perturbations that result in the disease.

Subject(s)

Artificial Intelligence , INDEL Mutation , Phenotype , Proteins/genetics , Humans , Kinesins/genetics
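
The distinction drawn in the abstract above between frameshifting (FS) and non-frameshifting (NFS) Indels comes down to whether the inserted or deleted length is a multiple of three nucleotides. A minimal Python sketch of that rule follows; the function name and return format are hypothetical and it is unrelated to the KD4i predictor itself.

```python
def classify_indel(length_nt):
    """Classify a coding-region Indel by its length in nucleotides.

    Returns ("NFS", n) where n is the number of amino acids inserted or
    deleted, or ("FS", None) when the reading frame is shifted.
    """
    if length_nt <= 0:
        raise ValueError("Indel length must be a positive number of nucleotides")
    if length_nt % 3 == 0:
        return ("NFS", length_nt // 3)
    return ("FS", None)


if __name__ == "__main__":
    for length in (1, 3, 4, 9):
        print(length, classify_indel(length))
```

The sketch considers length only; it does not model where the Indel falls relative to codon boundaries or splice sites.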

13.

KD4v: Comprehensible Knowledge Discovery System for Missense Variant.

Luu, Tien-Dao; Rusu, Alin; Walter, Vincent; Linard, Benjamin; Poidevin, Laetitia; Ripp, Raymond; Moulinier, Luc; Muller, Jean; Raffelsberger, Wolfgang; Wicker, Nicolas; Lecompte, Odile; Thompson, Julie D; Poch, Olivier; Nguyen, Hoan.

Nucleic Acids Res ; 40(Web Server issue): W71-5, 2012 Jul.

Article in English | MEDLINE | ID: mdl-22641855

ABSTRACT

A major challenge in the post-genomic era is a better understanding of how human genetic alterations involved in disease affect the gene products. The KD4v (Comprehensible Knowledge Discovery System for Missense Variant) server makes it possible to characterize and predict the phenotypic effects (deleterious/neutral) of missense variants. The server provides a set of rules learned by Inductive Logic Programming (ILP) on a set of missense variants described by conservation, physico-chemical, functional and 3D structure predicates. These rules are interpretable by non-expert humans and are used to accurately predict the deleterious/neutral status of an unknown mutation. The web server is available at http://decrypthon.igbmc.fr/kd4v.

Subject(s)

Disease/genetics , Mutation, Missense , Polymorphism, Single Nucleotide , Software , Genetic Association Studies , Humans , Internet , Knowledge Bases , Phenotype , Proteins/chemistry , Proteins/genetics

14.

Functional insights into the core-TFIIH from a comparative survey.

Bedez, Florence; Linard, Benjamin; Brochet, Xavier; Ripp, Raymond; Thompson, Julie D; Moras, Dino; Lecompte, Odile; Poch, Olivier.

Genomics ; 101(3): 178-86, 2013 Mar.

Article in English | MEDLINE | ID: mdl-23147676

ABSTRACT

TFIIH is a eukaryotic complex composed of two subcomplexes, the CAK (Cdk activating kinase) and the core-TFIIH. The core-TFIIH, composed of seven subunits (XPB, XPD, P62, P52, P44, P34, and P8), plays a crucial role in transcription and repair. Here, we performed an extended sequence analysis to establish the accurate phylogenetic distribution of the core-TFIIH in 63 eukaryotic organisms. In spite of the high conservation of the seven subunits at the sequence and genomic levels, the non-enzymatic P8, P34, P52 and P62 are absent from one or a few unicellular species. To gain insight into their respective roles, we undertook a comparative genomic analysis of the whole proteome to identify the gene sets sharing similar presence/absence patterns. While little information was inferred for P8 and P62, our studies confirm the known role of P52 in repair and suggest for the first time the implication of the core TFIIH in mRNA splicing via P34.

Subject(s)

Evolution, Molecular , Multiprotein Complexes/genetics , Phylogeny , Transcription Factor TFIIH/genetics , Animals , Cyclin-Dependent Kinases/genetics , DNA-Binding Proteins , Humans , Protein Subunits/genetics , Transcription, Genetic

15.

Real or fake? Measuring the impact of protein annotation errors on estimates of domain gain and loss events.

Kress, Arnaud; Poch, Olivier; Lecompte, Odile; Thompson, Julie D.

Front Bioinform ; 3: 1178926, 2023.

Article in English | MEDLINE | ID: mdl-37151482

ABSTRACT

Protein annotation errors can have significant consequences in a wide range of fields, ranging from protein structure and function prediction to biomedical research, drug discovery, and biotechnology. By comparing the domains of different proteins, scientists can identify common domains, classify proteins based on their domain architecture, and highlight proteins that have evolved differently in one or more species or clades. However, genome-wide identification of different protein domain architectures involves a complex error-prone pipeline that includes genome sequencing, prediction of gene exon/intron structures, and inference of protein sequences and domain annotations. Here we developed an automated fact-checking approach to distinguish true domain loss/gain events from false events caused by errors that occur during the annotation process. Using genome-wide ortholog sets and taking advantage of the high-quality human and Saccharomyces cerevisiae genome annotations, we analyzed the domain gain and loss events in the predicted proteomes of 9 non-human primates (NHP) and 20 non-S. cerevisiae fungi (NSF) as annotated in the UniProt and InterPro databases. Our approach allowed us to quantify the impact of errors on estimates of protein domain gains and losses, and we show that domain losses are over-estimated ten-fold and three-fold in the NHP and NSF proteins respectively. This is in line with previous studies of gene-level losses, where issues with genome sequencing or gene annotation led to genes being falsely inferred as absent. In addition, we show that inconsistent protein domain annotations are a major factor contributing to the false events. For the first time, to our knowledge, we show that domain gains are also over-estimated by three-fold and two-fold respectively in NHP and NSF proteins. Based on our more accurate estimates, we infer that true domain losses and gains in NHP with respect to humans are observed at similar rates, while domain gains in the more divergent NSF are observed twice as frequently as domain losses with respect to S. cerevisiae. This study highlights the need to critically examine the scientific validity of protein annotations, and represents a significant step toward scalable computational fact-checking methods that may one day mitigate the propagation of wrong information in protein databases.

16.

CeGAL: Redefining a Widespread Fungal-Specific Transcription Factor Family Using an In Silico Error-Tracking Approach.

Mayer, Claudine; Vogt, Arthur; Uslu, Tuba; Scalzitti, Nicolas; Chennen, Kirsley; Poch, Olivier; Thompson, Julie D.

J Fungi (Basel) ; 9(4), 2023 Mar 29.

Article in English | MEDLINE | ID: mdl-37108879

ABSTRACT

In fungi, the most abundant transcription factor (TF) class contains a fungal-specific 'GAL4-like' Zn2C6 DNA binding domain (DBD), while the second class contains another fungal-specific domain, known as 'fungal_trans' or middle homology domain (MHD), whose function remains largely uncharacterized. Remarkably, almost a third of MHD-containing TFs in public sequence databases apparently lack DNA binding activity, since they are not predicted to contain a DBD. Here, we reassess the domain organization of these 'MHD-only' proteins using an in silico error-tracking approach. In a large-scale analysis of ~17,000 MHD-only TF sequences present in all fungal phyla except Microsporidia and Cryptomycota, we show that the vast majority (>90%) result from genome annotation errors and we are able to predict a new DBD sequence for 14,261 of them. Most of these sequences correspond to a Zn2C6 domain (82%), with a small proportion of C2H2 domains (4%) found only in Dikarya. Our results contradict previous findings that the MHD-only TF are widespread in fungi. In contrast, we show that they are exceptional cases, and that the fungal-specific Zn2C6-MHD domain pair represents the canonical domain signature defining the most predominant fungal TF family. We call this family CeGAL, after the highly characterized members: Cep3, whose 3D structure is determined, and GAL4, a eukaryotic TF archetype. We believe that this will not only improve the annotation and classification of the Zn2C6 TF but will also provide critical guidance for future fungal gene regulatory network analyses.

17.

Controversies in modern evolutionary biology: the imperative for error detection and quality control.

Prosdocimi, Francisco; Linard, Benjamin; Pontarotti, Pierre; Poch, Olivier; Thompson, Julie D.

BMC Genomics ; 13: 5, 2012 Jan 04.

Article in English | MEDLINE | ID: mdl-22217008

ABSTRACT

BACKGROUND: The data from high throughput genomics technologies provide unique opportunities for studies of complex biological systems, but also pose many new challenges. The shift to the genome scale in evolutionary biology, for example, has led to many interesting, but often controversial studies. It has been suggested that part of the conflict may be due to errors in the initial sequences. Most gene sequences are predicted by bioinformatics programs and a number of quality issues have been raised, concerning DNA sequencing errors or badly predicted coding regions, particularly in eukaryotes. RESULTS: We investigated the impact of these errors on evolutionary studies and specifically on the identification of important genetic events. We focused on the detection of asymmetric evolution after duplication, which has been the subject of controversy recently. Using the human genome as a reference, we established a reliable set of 688 duplicated genes in 13 complete vertebrate genomes, where significantly different evolutionary rates are observed. We estimated the rates at which protein sequence errors occur and are accumulated in the higher-level analyses. We showed that the majority of the detected events (57%) are in fact artifacts due to the putative erroneous sequences and that these artifacts are sufficient to mask the true functional significance of the events. CONCLUSIONS: Initial errors are accumulated throughout the evolutionary analysis, generating artificially high rates of event predictions and leading to substantial uncertainty in the conclusions. This study emphasizes the urgent need for error detection and quality control strategies in order to efficiently extract knowledge from the new genome data.

Subject(s)

Molecular Evolution, Genomics, DNA Sequence Analysis/standards, Amino Acid Sequence, Animals, Artifacts, Computational Biology, Humans, Molecular Sequence Data, Phylogeny, Quality Control, Reproducibility of Results, Sequence Alignment, Sequence Homology

18.

Evolutionary analysis of the ENTH/ANTH/VHS protein superfamily reveals a coevolution between membrane trafficking and metabolism.

De Craene, Johan-Owen; Ripp, Raymond; Lecompte, Odile; Thompson, Julie D; Poch, Olivier; Friant, Sylvie.

BMC Genomics ; 13: 297, 2012 Jul 02.

Article in English | MEDLINE | ID: mdl-22748146

ABSTRACT

BACKGROUND: Membrane trafficking involves the complex regulation of the intracellular localization of proteins and lipids and is required for metabolic uptake, cell growth and development. Different trafficking pathways passing through the endosomes are coordinated by the ENTH/ANTH/VHS adaptor protein superfamily. The endosomes are crucial for eukaryotes since the acquisition of the endomembrane system was a central process in eukaryogenesis. RESULTS: Our in silico analysis of this ENTH/ANTH/VHS superfamily, consisting of proteins gathered from 84 complete genomes representative of the different eukaryotic taxa, revealed that the genomic distribution of this superfamily discriminates Fungi and Metazoa from Plantae and Protists. Next, in a four-way genome-wide comparison, we showed that this discriminative feature is observed not only for other membrane trafficking effectors, but also for proteins involved in metabolism and in cytokinesis, suggesting that metabolism, cytokinesis and intracellular trafficking pathways co-evolved. Moreover, some of the proteins identified were implicated in multiple functions, in either trafficking and metabolism or trafficking and cytokinesis, suggesting that membrane trafficking is central to this co-evolution process. CONCLUSIONS: Our study suggests that membrane trafficking and compartmentalization were not only key features for the emergence of eukaryotic cells but also drove the separation of the eukaryotes into the different taxa.

Subject(s)

Cell Membrane/metabolism, Genomics/methods, Protein Transport/physiology, Proteins/metabolism, Biological Evolution, Cytokinesis/physiology, Phylogeny, Proteins/chemistry, Proteins/classification

19.

Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega.

Sievers, Fabian; Wilm, Andreas; Dineen, David; Gibson, Toby J; Karplus, Kevin; Li, Weizhong; Lopez, Rodrigo; McWilliam, Hamish; Remmert, Michael; Söding, Johannes; Thompson, Julie D; Higgins, Desmond G.

Mol Syst Biol ; 7: 539, 2011 Oct 11.

Article in English | MEDLINE | ID: mdl-21988835

ABSTRACT

Multiple sequence alignments are fundamental to many sequence analysis methods. Most alignments are computed using the progressive alignment heuristic. These methods are starting to become a bottleneck in some analysis pipelines when faced with data sets of many thousands of sequences. Some methods allow computation of larger data sets while sacrificing quality, and others produce high-quality alignments, but scale badly with the number of sequences. In this paper, we describe a new program called Clustal Omega, which can align virtually any number of protein sequences quickly and delivers accurate alignments. The accuracy of the package on smaller test cases is similar to that of the high-quality aligners. On larger data sets, Clustal Omega outperforms other packages in terms of execution time and quality. Clustal Omega also has powerful features for adding sequences to and exploiting information in existing alignments, making use of the vast amount of precomputed information in public databases like Pfam.

Subject(s)

Data Mining/methods, Proteins/analysis, Sequence Alignment/methods, Protein Sequence Analysis/methods, Systems Biology, Algorithms, Amino Acid Sequence, Base Sequence, Factual Databases, Molecular Sequence Data, Proteins/chemistry, Software, Systems Biology/instrumentation, Systems Biology/methods

20.

Identifying single copy orthologs in Metazoa.

Creevey, Christopher J; Muller, Jean; Doerks, Tobias; Thompson, Julie D; Arendt, Detlev; Bork, Peer.

PLoS Comput Biol ; 7(12): e1002269, 2011 Dec.

Article in English | MEDLINE | ID: mdl-22144877

ABSTRACT

The identification of single copy (1-to-1) orthologs in any group of organisms is important for functional classification and phylogenetic studies. The Metazoa are no exception, but only recently has there been a wide-enough distribution of taxa with sufficiently high quality sequenced genomes to gain confidence in the widespread single copy status of a gene. Here, we present a phylogenetic approach for identifying overlooked single copy orthologs from multigene families and apply it to the Metazoa. Using 18 sequenced metazoan genomes of high quality we identified a robust set of 1,126 orthologous groups that have been retained in single copy since the last common ancestor of Metazoa. We found that the use of the phylogenetic procedure increased the number of single copy orthologs found by over a third compared with standard taxon-count approaches. The orthologs represented a wide range of functional categories, expression profiles and levels of divergence. To demonstrate the value of our set of single copy orthologs, we used them to assess the completeness of 24 currently published metazoan genomes and 62 EST datasets. We found that the annotated genes in published genomes vary in coverage from 79% (Ciona intestinalis) to 99.8% (human) with an average of 92%, suggesting a value for the underlying error rate in genome annotation, and a strategy for identifying single copy orthologs in larger datasets. In contrast, the vast majority of EST datasets with no corresponding genome sequence available are largely under-sampled and probably do not accurately represent the actual genomic complement of the organisms from which they are derived.
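The two ideas this abstract relies on, a taxon-count filter and a coverage percentage, can be illustrated with a minimal Python sketch. It shows only the baseline taxon-count rule (exactly one gene per genome), not the authors' phylogenetic procedure, which the abstract does not describe in detail. The function names, the data layout and the toy groups are all assumptions made for illustration.

```python
from collections import Counter


def single_copy_groups(groups, genomes):
    """Return IDs of groups holding exactly one gene from every genome.

    This is the plain taxon-count rule, used here as a hypothetical baseline.
    `groups` maps a group ID to a list of (genome, gene) pairs.
    """
    selected = []
    for group_id, members in groups.items():
        counts = Counter(genome for genome, _gene in members)
        if all(counts.get(genome, 0) == 1 for genome in genomes):
            selected.append(group_id)
    return selected


def completeness(annotated_groups, reference_groups):
    """Percentage of reference single copy groups found in one annotation."""
    reference = set(reference_groups)
    if not reference:
        return 0.0
    return 100.0 * len(reference & set(annotated_groups)) / len(reference)


# Toy data (invented): OG2 has a human duplicate, OG3 lacks a worm gene.
groups = {
    "OG1": [("human", "h1"), ("fly", "f1"), ("worm", "w1")],
    "OG2": [("human", "h2"), ("human", "h3"), ("fly", "f2"), ("worm", "w2")],
    "OG3": [("human", "h4"), ("fly", "f3")],
    "OG4": [("human", "h5"), ("fly", "f4"), ("worm", "w3")],
}
genomes = ["human", "fly", "worm"]

reference = single_copy_groups(groups, genomes)
print(reference)                         # ['OG1', 'OG4']
print(completeness({"OG1"}, reference))  # 50.0
```

In this toy case a new annotation containing only OG1 recovers one of the two reference groups, so its coverage is 50%. The coverage figures quoted in the abstract (79% to 99.8%) are percentages of this kind, computed against the 1,126 groups reported there.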

Subject(s)

Gene Dosage, Genome/genetics, Genomics/methods, Phylogeny, Animals, Genetic Databases, Molecular Evolution, Expressed Sequence Tags, Humans, Multigene Family
