1.
BMC Bioinformatics ; 25(1): 235, 2024 Jul 11.
Article in English | MEDLINE | ID: mdl-38992593

ABSTRACT

BACKGROUND: SimSpliceEvol is a tool for simulating the evolution of eukaryotic gene sequences that integrates exon-intron structure evolution as well as the evolution of the sets of transcripts produced from genes. It takes a guide gene tree as input and generates a gene sequence with its transcripts for each node of the tree, from the root to the leaves. However, the sets of transcripts simulated at different nodes of the guide gene tree lack evolutionary connections. Consequently, SimSpliceEvol is not suitable for evaluating methods for transcript phylogeny inference or gene phylogeny inference that rely on transcript conservation. RESULTS: Here, we introduce SimSpliceEvol2, which, compared to the first version, incorporates an explicit model of transcript evolution for simulating alternative transcripts along the branches of a guide gene tree, as well as the transcript phylogenies inferred. We offer a comprehensive software package with a graphical user interface and an updated version of the web server, ensuring easy and user-friendly access to the tool. CONCLUSION: SimSpliceEvol2 generates synthetic datasets that are useful for evaluating methods and tools for spliced RNA sequence analysis, such as spliced alignment methods, methods for identifying conserved transcripts, and transcript phylogeny reconstruction methods. The web server is accessible at https://simspliceevol.cobius.usherbrooke.ca, where you can also download the standalone software. Comprehensive documentation for the software is available at the same address. For developers interested in the source code, which requires the installation of all prerequisites to run, it is provided at https://github.com/UdeS-CoBIUS/SimSpliceEvol.


Subject(s)
Alternative Splicing, Molecular Evolution, Phylogeny, Software, Alternative Splicing/genetics, Exons/genetics, RNA Sequence Analysis/methods, Computer Simulation
2.
Bioinformatics ; 40(Suppl 1): i237-i246, 2024 06 28.
Article in English | MEDLINE | ID: mdl-38940169

ABSTRACT

MOTIVATION: Noncoding RNAs (ncRNAs) express their functions by adopting molecular structures. Specifically, RNA secondary structures serve as a relatively stable intermediate step before tertiary structures, offering a reliable signature of molecular function. Consequently, within an RNA functional family, secondary structures are generally more evolutionarily conserved than sequences. Conversely, homologous RNA families grouped within an RNA clan share ancestors but typically exhibit structural differences. Inferring the evolution of RNA structures within RNA families and clans is crucial for gaining insights into functional adaptations over time and providing clues about the Ancient RNA World Hypothesis. RESULTS: We introduce the median problem and the small parsimony problem for ncRNA families, where secondary structures are represented as leaf-labeled trees. We utilize the Robinson-Foulds (RF) tree distance, which corresponds to a specific edit distance between RNA trees, and a new metric called the Internal-Leafset (IL) distance. While the RF tree distance compares sets of leaves descending from internal nodes of two RNA trees, the IL distance compares the collection of leaf-children of internal nodes. The latter is better at capturing differences in structural elements of RNAs than the RF distance, which is more focused on base pairs. We also consider a more general tree edit distance that allows the mapping of base pairs that are not perfectly aligned. We study the theoretical complexity of the median problem and the small parsimony problem under the three distance metrics and various biologically relevant constraints, and we present polynomial-time maximum parsimony algorithms for solving some versions of the problems. Our algorithms are applied to ncRNA families from the RFAM database, illustrating their practical utility. AVAILABILITY AND IMPLEMENTATION: https://github.com/bmarchand/rna_small_parsimony.


Subject(s)
Nucleic Acid Conformation, Untranslated RNA, Untranslated RNA/genetics, Untranslated RNA/chemistry, Algorithms, Molecular Evolution, RNA Sequence Analysis/methods, Computational Biology/methods
3.
J Comput Biol ; 31(4): 277-293, 2024 04.
Article in English | MEDLINE | ID: mdl-38621191

ABSTRACT

Eukaryotic genes undergo a mechanism called alternative processing, resulting in transcriptome diversity by allowing the production of multiple distinct transcripts from a gene. More than half of human genes are affected, and the resulting transcripts are highly conserved among orthologous genes of distinct species. In this work, we present the definition of orthology and paralogy between transcripts of homologous genes, together with an algorithm to compute clusters of conserved orthologous and paralogous transcripts. Gene-level homology relationships are utilized to define various types of homology relationships between transcripts originating from the same ancestral transcript. A Reciprocal Best Hits approach is employed to infer clusters of isoorthologous and recent paralogous transcripts. We applied this method to transcripts from simulated gene families as well as real gene families from the Ensembl-Compara database. The results are consistent with those from previous studies that compared orthologous gene transcripts. Furthermore, our findings provide evidence that searching for conserved transcripts between homologous genes, beyond the scope of orthologous genes, is likely to yield valuable information.


Subject(s)
Algorithms, Humans, Transcriptome/genetics, Genetic Databases, Animals, Messenger RNA/genetics, Computational Biology/methods, Multigene Family
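
The Reciprocal Best Hits step mentioned in the abstract of entry 3 can be sketched as follows. This is a generic RBH illustration, not the authors' implementation: the input format (a dictionary of pairwise similarity scores between transcripts of two genes), the function name, and the tie-breaking behaviour (first best score seen wins) are assumptions.

```python
# Illustrative sketch only: generic Reciprocal Best Hits between two transcript sets.
# `scores` maps (transcript_of_gene_A, transcript_of_gene_B) to a similarity score.

def reciprocal_best_hits(scores):
    """Return sorted pairs (a, b) where a's best hit is b and b's best hit is a."""
    best_for_a, best_for_b = {}, {}
    for (a, b), score in scores.items():
        # Keep the highest-scoring partner seen so far for each transcript.
        if a not in best_for_a or score > scores[(a, best_for_a[a])]:
            best_for_a[a] = b
        if b not in best_for_b or score > scores[(best_for_b[b], b)]:
            best_for_b[b] = a
    return sorted((a, b) for a, b in best_for_a.items() if best_for_b[b] == a)
```

A pair is reported only when the preference is mutual, which is why a transcript whose best partner prefers another transcript is left unpaired.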
4.
Nucleic Acids Res ; 52(D1): D522-D528, 2024 Jan 05.
Article in English | MEDLINE | ID: mdl-37956315

ABSTRACT

The OpenProt proteogenomic resource (https://www.openprot.org/) provides users with a complete and freely accessible set of non-canonical or alternative open reading frames (AltORFs) within the transcriptome of various species, as well as functional annotations of the corresponding protein sequences not found in standard databases. Enhancements in this update are largely the result of user feedback and include the prediction of structure, subcellular localization, and intrinsic disorder, using cutting-edge algorithms based on machine learning techniques. The mass spectrometry pipeline now integrates a machine learning-based peptide rescoring method to improve peptide identification. We continue to help users explore this cryptic proteome by providing OpenCustomDB, a tool that enables users to build their own customized protein databases, and OpenVar, a genomic annotator including genetic variants within AltORFs and protein sequences. A new interface improves the visualization of all functional annotations, including a spectral viewer and the prediction of multicoding genes. All data on OpenProt are freely available and downloadable. Overall, OpenProt continues to establish itself as an important resource for the exploration and study of new proteins.


Subject(s)
Protein Databases, Peptides, Proteomics, Amino Acid Sequence, Genomics, Internet, Peptides/genetics, Proteome/genetics, Proteomics/methods, Humans
5.
Evol Bioinform Online ; 19: 11769343231212075, 2023.
Article in English | MEDLINE | ID: mdl-38046653

ABSTRACT

Background: G-quadruplexes (G4s) are secondary structures in DNA and RNA that impact various cellular processes, such as transcription, splicing, and translation. Due to their numerous functions, G4s are involved in many diseases, making their study important. Yet, G4s evolution remains largely unknown, due to their low sequence similarity and the poor quality of their sequence alignments across several species. To address this, we designed a strategy that avoids direct G4s alignment to study G4s evolution in the 3 species kingdoms. We also explored the coevolution between RBPs and G4s. Methods: We retrieved one-to-one orthologous genes from the Ensembl Compara database and computed groups of one-to-one orthologous genes. For each group, we aligned gene sequences and identified G4 families as groups of overlapping G4s in the alignment. We analyzed these G4 families using Count, a tool to infer feature evolution into a gene or a species tree. Additionally, we utilized these G4 families to predict G4s by homology. To establish a control dataset, we performed mono-, di- and tri-nucleotide shuffling. Results: Only a few conserved G4s occur among all living kingdoms. In eukaryotes, G4s exhibit slight conservation among vertebrates, and few are conserved between plants. In archaea and bacteria, at most, only 2 G4s are common. The G4 homology-based prediction increases the number of conserved G4s in common ancestors. The coevolution between RNA-binding proteins and G4s was investigated and revealed a modest impact of RNA-binding proteins evolution on G4 evolution. However, the details of this relationship remain unclear. Conclusion: Even if G4 evolution still eludes us, the present study provides key information to compute groups of homologous G4 and to reveal the evolution history of G4 families.

6.
Nucleic Acids Res ; 51(D1): D135-D140, 2023 01 06.
Article in English | MEDLINE | ID: mdl-35971612

ABSTRACT

G-quadruplexes (G4) are 3D structures that are found in both DNA and RNA. Interest in this structure has grown over the past few years due to both its implication in diverse biological mechanisms and its potential use as a therapeutic target, to name two examples. G4s in humans have been widely studied; however, the level of their study in other species remains relatively minimal. That said, progress in this field has resulted in the prediction of G4s structures in various species, ranging from bacteria to eukaryotes. These predictions were analysed in a previous study which revealed that G4s are present in all living kingdoms. To date, eleven different databases have grouped the various G4s depending on either their structures, on the proteins that might bind them, or on their location in the various genomes. However, none of these databases contains information on their location in the transcriptome of many of the implicated species. The GAIA database was designed so as to make this data available online in a user-friendly manner. Through its web interface, users can query GAIA to filter G4s, which, we hope, will help the research in this field. GAIA is available at: https://gaia.cobius.usherbrooke.ca.


Subject(s)
G-Quadruplexes, Humans, RNA/chemistry, Bacteria/genetics, Eukaryota/genetics, Eukaryota/metabolism, DNA/chemistry
7.
NAR Genom Bioinform ; 4(1): lqac010, 2022 Mar.
Article in English | MEDLINE | ID: mdl-35261973

ABSTRACT

G-quadruplexes are motifs found in DNA and RNA that can fold into tertiary structures. Until now, they have been studied experimentally mainly in humans and a few other species. Recently, predictions have been made with bacterial and archaeal genomes. Nevertheless, a global comparison of predicted G4s (pG4s) across and within the three living kingdoms has not been addressed. In this study, we aimed to predict G4s in genes and transcripts of all kingdoms of living organisms and investigated the differences in their distributions. The relation of the predictions with GC content was studied. It appears that GC content is not the only parameter impacting G4 predictions and abundance. The distribution of pG4 densities varies depending on the class of transcripts and the group of species. Indeed, we have observed that, in coding transcripts, there are more predicted G4s than expected for eukaryotes but not for archaea and bacteria, while in noncoding transcripts, there are as many or fewer predicted G4s in all species groups. We even noticed that some species with the same GC content presented different pG4 profiles. For instance, Leishmania major and Chlamydomonas reinhardtii both have 60% of GC content, but the former has a pG4 density of 0.07 and the latter 1.16.
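
The GC content that the abstract of entry 7 relates to pG4 predictions can be computed as in the sketch below. The function name and the decision to ignore ambiguous symbols such as `N` are assumptions for illustration; the abstract does not state how the authors computed GC content, nor the unit of the pG4 density, so the density is deliberately not modelled here.

```python
# Illustrative sketch only: GC content of a DNA or RNA sequence.

def gc_content(sequence):
    """Fraction of G and C among unambiguous A/C/G/T/U bases (0.0 if none)."""
    seq = sequence.upper()
    gc = sum(seq.count(base) for base in "GC")
    total = gc + sum(seq.count(base) for base in "ATU")
    return gc / total if total else 0.0
```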

8.
Bioinform Adv ; 2(1): vbab044, 2022.
Article in English | MEDLINE | ID: mdl-36699392

ABSTRACT

Motivation: Alternative splicing is a ubiquitous process in eukaryotes that allows distinct transcripts to be produced from the same gene. Yet, the study of transcript evolution within a gene family is still in its infancy. One prerequisite for this study is the availability of methods to compare sets of transcripts while accounting for their splicing structure. In this context, we generalize the concept of pairwise spliced alignments (PSpAs) to multiple spliced alignments (MSpAs). MSpAs have several important purposes in addition to empowering the study of the evolution of transcripts. For instance, it is a key to improving the prediction of gene models, which is important to solve the growing problem of genome annotation. Despite its essentialness, a formal definition of the concept and methods to compute MSpAs are still lacking. Results: We introduce the MSpA problem and the SplicedFamAlignMulti (SFAM) method, to compute the MSpA of a gene family. Like most multiple sequence alignment (MSA) methods that are generally greedy heuristic methods assembling pairwise alignments, SFAM combines all PSpAs of coding DNA sequences and gene sequences of a gene family into an MSpA. It produces a single structure that represents the superstructure and models of the gene family. Using real vertebrate and simulated gene family data, we illustrate the utility of SFAM for computing accurate gene family superstructures, MSAs, inferring splicing orthologous groups and improving gene-model annotations. Availability and implementation: The supporting data and implementation of SFAM are freely available at https://github.com/UdeS-CoBIUS/SpliceFamAlignMulti. Supplementary information: Supplementary data are available at Bioinformatics Advances online.

9.
Bioinformatics ; 37(Suppl_1): i120-i132, 2021 07 12.
Article in English | MEDLINE | ID: mdl-34252921

ABSTRACT

MOTIVATION: It is largely established that all extant mitochondria originated from a unique endosymbiotic event integrating an α-proteobacterial genome into a eukaryotic cell. Subsequently, eukaryote evolution has been marked by episodes of gene transfer, mainly from the mitochondria to the nucleus, resulting in a significant reduction of the mitochondrial genome, eventually completely disappearing in some lineages. However, in other lineages such as in land plants, a high variability in gene repertoire distribution, including genes encoded in both the nuclear and mitochondrial genome, is an indication of an ongoing process of Endosymbiotic Gene Transfer (EGT). Understanding how both nuclear and mitochondrial genomes have been shaped by gene loss, duplication and transfer is expected to shed light on a number of open questions regarding the evolution of eukaryotes, including rooting of the eukaryotic tree. RESULTS: We address the problem of inferring the evolution of a gene family through duplication, loss and EGT events, the latter considered as a special case of horizontal gene transfer occurring between the mitochondrial and nuclear genomes of the same species (in one direction or the other). We consider both EGT events resulting in maintaining (EGTcopy) or removing (EGTcut) the gene copy in the source genome. We present a linear-time algorithm for computing the DLE (Duplication, Loss and EGT) distance, as well as an optimal reconciled tree, for the unitary cost, and a dynamic programming algorithm that outputs all optimal reconciliations for an arbitrary cost of operations. We illustrate the application of our EndoRex software, analyze different cost settings on a plant dataset, and discuss the resulting reconciled trees. AVAILABILITY AND IMPLEMENTATION: EndoRex implementation and supporting data are available on the GitHub repository via https://github.com/AEVO-lab/EndoRex.


Subject(s)
Molecular Evolution, Horizontal Gene Transfer, Algorithms, Gene Duplication, Genome, Phylogeny, Symbiosis/genetics
10.
Bioinformatics ; 37(13): 1920-1922, 2021 07 27.
Article in English | MEDLINE | ID: mdl-33051656

ABSTRACT

MOTIVATION: A phylogenetic tree reconciliation is a mapping of one phylogenetic tree onto another which represents the co-evolution of two sets of taxa (e.g. parasite-host co-evolution, gene-species co-evolution). The reconciliation framework was extended to allow modeling the co-evolution of three sets of taxa such as transcript-gene-species co-evolutions. Several web-based tools have been developed for the display and manipulation of phylogenetic trees and co-phylogenetic trees involving two trees, but there currently exists no tool for visualizing the joint reconciliation between three phylogenetic trees. RESULTS: Here, we present DoubleRecViz, a web-based tool for visualizing double reconciliations between phylogenetic trees at three levels: transcript, gene and species. DoubleRecViz extends the RecPhyloXML model (developed for gene-species tree reconciliation) to represent joint transcript-gene and gene-species tree reconciliations. It is implemented using the Dash library, which is a toolbox that provides dynamic visualization functionalities for web data visualization in Python. AVAILABILITY AND IMPLEMENTATION: DoubleRecViz is available through a web server at https://doublerecviz.cobius.usherbrooke.ca. The source code and information about installation procedures are also available at https://github.com/UdeS-CoBIUS/DoubleRecViz. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Molecular Evolution, Software, Algorithms, Internet, Phylogeny
11.
Nucleic Acids Res ; 49(D1): D380-D388, 2021 01 08.
Article in English | MEDLINE | ID: mdl-33179748

ABSTRACT

OpenProt (www.openprot.org) is the first proteogenomic resource supporting a polycistronic annotation model for eukaryotic genomes. It provides a deeper annotation of open reading frames (ORFs) while mining experimental data for supporting evidence using cutting-edge algorithms. This update presents the major improvements since the initial release of OpenProt. All species support recent NCBI RefSeq and Ensembl annotations, with changes in annotations being reported in OpenProt. Using the 131 ribosome profiling datasets re-analysed by OpenProt to date, non-AUG initiation starts are reported alongside a confidence score of the initiating codon. From the 177 mass spectrometry datasets re-analysed by OpenProt to date, the unicity of the detected peptides is controlled at each implementation. Furthermore, to guide the users, detectability statistics and protein relationships (isoforms) are now reported for each protein. Finally, to foster access to deeper ORF annotation independently of one's bioinformatics skills or computational resources, OpenProt now offers a data analysis platform. Users can submit their dataset for analysis and receive the results from the analysis by OpenProt. All data on OpenProt are freely available and downloadable for each species, the release-based format ensuring a continuous access to the data. Thus, OpenProt enables a more comprehensive annotation of eukaryotic genomes and fosters functional proteomic discoveries.


Subject(s)
Protein Databases, Eukaryota/genetics, Genome, Molecular Sequence Annotation, Open Reading Frames/genetics, Mass Spectrometry, Protein Isoforms/genetics, Proteogenomics, Ribosomes/metabolism, User-Computer Interface
12.
Genome Biol Evol ; 12(4): 381-395, 2020 04 01.
Article in English | MEDLINE | ID: mdl-32186700

ABSTRACT

Horizontal gene transfer is a common mechanism in Bacteria that has contributed to the genomic content of existing organisms. Traditional methods for estimating bacterial phylogeny, however, assume only vertical inheritance in the evolution of homologous genes, which may result in errors in the estimated phylogenies. We present a new method for estimating bacterial phylogeny that accounts for the presence of genes acquired by horizontal gene transfer between genomes. The method identifies and corrects putative transferred genes in gene families, before applying a gene tree-based summary method to estimate bacterial species trees. The method was applied to estimate the phylogeny of the order Corynebacteriales, which is the largest clade in the phylum Actinobacteria. We report a collection of 14 phylogenetic trees on 360 Corynebacteriales genomes. All estimated trees display each genus as a monophyletic clade. The trees also display several relationships proposed by past studies, as well as new relevant relationships between and within the main genera of Corynebacteriales: Corynebacterium, Mycobacterium, Nocardia, Rhodococcus, and Gordonia. An implementation of the method in Python is available on GitHub at https://github.com/UdeS-CoBIUS/EXECT (last accessed April 2, 2020).


Subject(s)
Corynebacterium/genetics, Molecular Evolution, Horizontal Gene Transfer, Bacterial Genome, Genetic Models, Phylogeny, Corynebacterium/growth & development, Genomics, Species Specificity
13.
NAR Genom Bioinform ; 2(2): lqaa035, 2020 Jun.
Article in English | MEDLINE | ID: mdl-33575590

ABSTRACT

It has been demonstrated that RNA G-quadruplexes (G4s) are structural motifs present in transcriptomes and play important regulatory roles in several post-transcriptional mechanisms. However, the full picture of RNA G4 locations and the extent of their implication remain elusive. Only computational prediction analysis of the whole transcriptome may reveal all potential G4s, since experimental identifications are always limited to specific conditions or specific cell lines. The present study reports the first in-depth computational prediction of potential G4 regions across the complete human transcriptome. Despite the use of a relatively stringent approach based on three prediction scores that account for the composition of G4 sequences, the composition of their neighboring sequences, and the various forms of G4, over 1.1 million potential G4s (pG4s) were predicted. The abundance of G4s was computationally confirmed in both the 5' and 3' UTRs as well as at splicing junctions of mRNAs, and was appreciated for the first time in long ncRNAs, while G4s were almost absent from most small ncRNA families. The present results constitute an important step toward a full understanding of the roles of G4s in post-transcriptional mechanisms.

14.
NAR Genom Bioinform ; 2(4): lqaa086, 2020 Dec.
Article in English | MEDLINE | ID: mdl-33575631

ABSTRACT

Predicting RNA structure is crucial for understanding RNA's mechanism of action. Comparative approaches for the prediction of RNA structures can be classified into four main strategies. The first three (align-and-fold, align-then-fold and fold-then-align) exploit multiple sequence alignments to improve the accuracy of conserved RNA-structure prediction. Align-and-fold methods generally perform better, but are also typically slower than the other alignment-based methods. The fourth strategy, alignment-free, consists in predicting the conserved RNA structure without relying on sequence alignment. This strategy has the advantage of being the fastest, while predicting accurate structures through the use of latent representations of the candidate structures for each sequence. This paper presents aliFreeFoldMulti, an extension of the aliFreeFold algorithm. This algorithm predicts a representative secondary structure of multiple RNA homologs by using a vector representation of their suboptimal structures. aliFreeFoldMulti improves on aliFreeFold by additionally computing the conserved structure for each sequence. aliFreeFoldMulti is assessed by comparing its prediction performance and time efficiency with a set of leading RNA-structure prediction methods. aliFreeFoldMulti has the lowest computing times and the highest maximum accuracy scores. It achieves average structure prediction accuracy comparable to that of other methods, except TurboFoldII, which is the best in terms of average accuracy but has the highest computing times. We present aliFreeFoldMulti as an illustration of the potential of alignment-free approaches to provide fast and accurate RNA-structure prediction methods.

15.
BMC Bioinformatics ; 20(Suppl 20): 640, 2019 Dec 17.
Article in English | MEDLINE | ID: mdl-31842741

ABSTRACT

BACKGROUND: It is now well established that eukaryotic coding genes have the ability to produce more than one type of transcript thanks to the mechanisms of alternative splicing and alternative transcription. Because of the lack of gold standard real data on alternative splicing, simulated data constitute a good option for evaluating the accuracy and the efficiency of methods developed for splice-aware sequence analysis. However, existing sequence evolution simulation methods do not model alternative splicing, and so they cannot be used to test spliced sequence analysis methods. RESULTS: We propose a new method called SimSpliceEvol for simulating the evolution of sets of alternative transcripts along the branches of an input gene tree. In addition to traditional sequence evolution events, the simulation also includes gene exon-intron structure evolution events and alternative splicing events that modify the sets of transcripts produced from genes. SimSpliceEvol was implemented in Python. The source code is freely available at https://github.com/UdeS-CoBIUS/SimSpliceEvol. CONCLUSIONS: Data generated using SimSpliceEvol are useful for testing spliced RNA sequence analysis methods such as methods for spliced alignment of cDNA and genomic sequences, multiple cDNA alignment, orthologous exon identification, splicing orthology inference, and transcript phylogeny inference, all of which require knowledge of the true evolutionary relationships between the sequences.


Subject(s)
Alternative Splicing/genetics, Computer Simulation, Molecular Evolution, Software, Animals, Base Sequence, Complementary DNA/genetics, Exons/genetics, Humans, Introns/genetics, Markov Chains, Probability, Messenger RNA/genetics, Messenger RNA/metabolism
16.
BMC Bioinformatics ; 20(Suppl 3): 133, 2019 Mar 29.
Article in English | MEDLINE | ID: mdl-30925859

ABSTRACT

BACKGROUND: The inference of splicing orthology relationships between gene transcripts is a basic step for the prediction of transcripts and the annotation of gene structures in genomes. The splicing structure of a sequence refers to the exon extremity information in a CDS or the exon-intron extremity information in a gene sequence. Splicing orthologous CDS are pairs of CDS with similar sequences and conserved splicing structures from orthologous genes. Spliced alignment, which consists in aligning a spliced cDNA sequence against an unspliced genomic sequence, constitutes a promising, yet unexplored approach for the identification of splicing orthology relationships. Existing spliced alignment algorithms do not exploit the information on the splicing structure of the input sequences, namely the exon structure of the cDNA sequence and the exon-intron structure of the genomic sequences. Yet, this information is often available for coding DNA sequences (CDS) and gene sequences annotated in databases, and it can help improve the accuracy of the computed spliced alignments. To address this issue, we introduce a new spliced alignment problem and a method called SplicedFamAlign (SFA) for computing the alignment of a spliced CDS against a gene sequence while accounting for the splicing structures of the input sequences, and then inferring transcript splicing orthology groups in a gene family based on spliced alignments. RESULTS: The experimental results show that SFA outperforms existing spliced alignment methods in terms of accuracy and execution time for CDS-to-gene alignment. We also show that the performance of SFA remains high for various levels of sequence similarity between input sequences, thanks to accounting for the splicing structure of the input sequences. It is important to note that unlike all current spliced alignment methods that are meant for cDNA-to-genome alignments and can be used for CDS-to-gene alignments, SFA is the first method specifically designed for CDS-to-gene alignments. CONCLUSION: We show the usefulness of SFA for the comparison of genes and transcripts within a gene family for the purpose of analyzing splicing orthologies. It can also be used for gene structure annotation and alternative splicing analyses. SplicedFamAlign was implemented in Python. Source code is freely available at https://github.com/UdeS-CoBIUS/SpliceFamAlign.


Subject(s)
Algorithms, Alternative Splicing/genetics, Open Reading Frames/genetics, Sequence Alignment/methods, Base Sequence, Computer Simulation, Exons/genetics, Introns/genetics, Molecular Sequence Annotation, Messenger RNA/genetics, Messenger RNA/metabolism
17.
IEEE/ACM Trans Comput Biol Bioinform ; 16(4): 1364-1373, 2019.
Article in English | MEDLINE | ID: mdl-28166504

ABSTRACT

Reconstructing ancestral gene orders in a given phylogeny is a classical problem in comparative genomics. Most existing methods compare conserved features in extant genomes in the phylogeny to define potential ancestral gene adjacencies, and either try to reconstruct all ancestral genomes under a global evolutionary parsimony criterion, or, focusing on a single ancestral genome, use a scaffolding approach to select a subset of ancestral gene adjacencies, generally aiming at reducing the fragmentation of the reconstructed ancestral genome. In this paper, we describe an exact algorithm for the Small Parsimony Problem that combines both approaches. We consider that gene adjacencies at internal nodes of the species phylogeny are weighted, and we introduce an objective function defined as a convex combination of these weights and the evolutionary cost under the Single-Cut-or-Join (SCJ) model. The weights of ancestral gene adjacencies can, e.g., be obtained through the recent availability of ancient DNA sequencing data, which provide a direct hint at the genome structure of the considered ancestor, or through probabilistic analysis of gene adjacency evolution. We show the NP-hardness of our problem variant and propose a Fixed-Parameter Tractable algorithm based on the Sankoff-Rousseau dynamic programming algorithm that also allows sampling of co-optimal solutions. We apply our approach to mammalian and bacterial data providing different degrees of complexity. We show that including adjacency weights in the objective has a significant impact in reducing the fragmentation of the reconstructed ancestral gene orders. An implementation is available at http://github.com/nluhmann/PhySca.


Subject(s)
Algorithms, Computational Biology/methods, Bacterial Genome, Genomics/methods, Animals, Biological Evolution, Computer Simulation, Genetic Databases, Molecular Evolution, Gene Order, Genetic Markers/genetics, Genetic Models, Opossums/genetics, Phylogeny, Plasmids/metabolism, Probability, Reproducibility of Results, Swine/genetics, Yersinia/genetics
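
The evolutionary cost under the Single-Cut-or-Join model named in the abstract of entry 17 can be illustrated for a single pair of genomes. In the SCJ model, the distance between two genomes over the same genes is the number of adjacencies present in one genome but not the other. The sketch below is not taken from the PhySca repository: the adjacency encoding (unordered pairs of gene-extremity labels such as `"1h"` and `"2t"`) and the function name are assumptions, and the weighted objective from the paper is not reproduced because the abstract does not give its exact form.

```python
# Illustrative sketch only: pairwise Single-Cut-or-Join (SCJ) distance.
# An adjacency is an unordered pair of gene extremities, e.g. ("1h", "2t").

def scj_distance(adjacencies_a, adjacencies_b):
    """Number of cuts plus joins needed to turn genome A into genome B."""
    a = {frozenset(adj) for adj in adjacencies_a}
    b = {frozenset(adj) for adj in adjacencies_b}
    cuts = len(a - b)   # adjacencies of A that must be cut
    joins = len(b - a)  # adjacencies of B that must be joined
    return cuts + joins
```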
18.
Nucleic Acids Res ; 47(D1): D403-D410, 2019 01 08.
Article in English | MEDLINE | ID: mdl-30299502

ABSTRACT

Advances in proteomics and sequencing have highlighted many non-annotated open reading frames (ORFs) in eukaryotic genomes. Genome annotations, cornerstones of today's research, mostly rely on protein prior knowledge and on ab initio prediction algorithms. Such algorithms notably enforce an arbitrary criterion of one coding sequence (CDS) per transcript, leading to a substantial underestimation of the coding potential of eukaryotes. Here, we present OpenProt, the first database fully endorsing a polycistronic model of eukaryotic genomes to date. OpenProt contains all possible ORFs longer than 30 codons across 10 species, and cumulates supporting evidence such as protein conservation, translation and expression. OpenProt annotates all known proteins (RefProts), novel predicted isoforms (Isoforms) and novel predicted proteins from alternative ORFs (AltProts). It incorporates cutting-edge algorithms to evaluate protein orthology and re-interrogate publicly available ribosome profiling and mass spectrometry datasets, supporting the annotation of thousands of predicted ORFs. The constantly growing database currently cumulates evidence from 87 ribosome profiling and 114 mass spectrometry studies from several species, tissues and cell lines. All data is freely available and downloadable from a web platform (www.openprot.org) supporting a genome browser and advanced queries for each species. Thus, OpenProt enables a more comprehensive landscape of eukaryotic genomes' coding potential.


Subject(s)
Eukaryota/genetics , Genes/genetics , Genome , Open Reading Frames/genetics , Proteome/genetics , Algorithms , Animals , Humans , Mass Spectrometry , Molecular Sequence Annotation , Protein Isoforms/genetics , Proteomics/methods , Ribosomes/metabolism , Amino Acid Sequence Homology
19.
Bioinformatics ; 34(13): i70-i78, 2018 07 01.
Article in English | MEDLINE | ID: mdl-29949960

ABSTRACT

Motivation: Predicting the conserved secondary structure of homologous ribonucleic acid (RNA) sequences is crucial for understanding RNA functions. However, fast and accurate RNA structure prediction is challenging, especially when the number and the divergence of homologous RNAs increase. To address this challenge, we propose aliFreeFold, based on a novel alignment-free approach which computes a representative structure from a set of homologous RNA sequences using sub-optimal secondary structures generated for each sequence. It is based on a vector representation of sub-optimal structures capturing structure conservation signals by weighting structural motifs according to their conservation across the sub-optimal structures. Results: We demonstrate that aliFreeFold provides a good balance between speed and accuracy regarding predictions of representative structures for sets of homologous RNAs compared to traditional methods based on sequence and structure alignment. We show that aliFreeFold is capable of uncovering conserved structural features quickly and effectively thanks to its weighting scheme, which gives more importance to common structural motifs and less to uncommon ones. The weighting scheme is also shown to be capable of capturing the conservation signal as the number of homologous RNAs increases. These results demonstrate the ability of aliFreeFold to efficiently and accurately provide interesting structural representatives of RNA families. Availability and implementation: aliFreeFold was implemented in C++. The source code and a Linux binary are freely available at https://github.com/UdeS-CoBIUS/aliFreeFold. Supplementary information: Supplementary data are available at Bioinformatics online.


Subject(s)
RNA/chemistry , RNA Sequence Analysis/methods , Software , Algorithms , Nucleic Acid Conformation , RNA/metabolism
20.
IEEE/ACM Trans Comput Biol Bioinform ; 15(5): 1560-1570, 2018.
Article in English | MEDLINE | ID: mdl-28678712

ABSTRACT

The supertree problem, which asks for a tree displaying a set of consistent input trees, has been widely studied for the reconstruction of species trees. Here, we instead explore this framework for reconstructing a gene tree from a set of input gene trees on partial data. In this perspective, the phylogenetic tree for the species containing the genes of interest can be used to choose among the many possible compatible "supergenetrees", the most natural criterion being to minimize a reconciliation cost. We develop a variety of algorithmic solutions for the construction and correction of gene trees using the supertree framework. A dynamic programming supertree algorithm for constructing or correcting gene trees, exponential in the number of input trees, is first developed for the less constrained version of the problem. It is then adapted to gene trees with nodes labeled as duplication or speciation, the additional constraint being to preserve the orthology and paralogy relations between genes. Then, a quadratic time algorithm is developed for efficiently correcting an initial gene tree while preserving a set of "trusted" subtrees, as well as the relative phylogenetic distance between them, for both labeled and unlabeled input trees. By applying these algorithms to the set of Ensembl gene trees, we show that this new correction framework is particularly useful for correcting weakly supported duplication nodes. The C++ source code for the algorithms and simulations described in the paper is available at https://github.com/UdeM-LBIT/SuGeT.


Subject(s)
Algorithms , Computational Biology/methods , Genetic Models , Genes/genetics , Phylogeny