Results 1 - 20 of 33
1.
Nucleic Acids Res ; 52(D1): D522-D528, 2024 Jan 05.
Article in English | MEDLINE | ID: mdl-37956315

ABSTRACT

The OpenProt proteogenomic resource (https://www.openprot.org/) provides users with a complete and freely accessible set of non-canonical or alternative open reading frames (AltORFs) within the transcriptome of various species, as well as functional annotations of the corresponding protein sequences not found in standard databases. Enhancements in this update are largely the result of user feedback and include the prediction of structure, subcellular localization, and intrinsic disorder, using cutting-edge algorithms based on machine learning techniques. The mass spectrometry pipeline now integrates a machine learning-based peptide rescoring method to improve peptide identification. We continue to help users explore this cryptic proteome by providing OpenCustomDB, a tool that enables users to build their own customized protein databases, and OpenVar, a genomic annotator including genetic variants within AltORFs and protein sequences. A new interface improves the visualization of all functional annotations, including a spectral viewer and the prediction of multicoding genes. All data on OpenProt are freely available and downloadable. Overall, OpenProt continues to establish itself as an important resource for the exploration and study of new proteins.


Subjects
Protein Databases , Peptides , Proteomics , Amino Acid Sequence , Genomics , Internet , Peptides/genetics , Proteome/genetics , Proteomics/methods , Humans
2.
Bioinformatics ; 40(Supplement_1): i237-i246, 2024 Jun 28.
Article in English | MEDLINE | ID: mdl-38940169

ABSTRACT

MOTIVATION: Noncoding RNAs (ncRNAs) express their functions by adopting molecular structures. Specifically, RNA secondary structures serve as a relatively stable intermediate step before tertiary structures, offering a reliable signature of molecular function. Consequently, within an RNA functional family, secondary structures are generally more evolutionarily conserved than sequences. Conversely, homologous RNA families grouped within an RNA clan share ancestors but typically exhibit structural differences. Inferring the evolution of RNA structures within RNA families and clans is crucial for gaining insights into functional adaptations over time and providing clues about the Ancient RNA World Hypothesis. RESULTS: We introduce the median problem and the small parsimony problem for ncRNA families, where secondary structures are represented as leaf-labeled trees. We utilize the Robinson-Foulds (RF) tree distance, which corresponds to a specific edit distance between RNA trees, and a new metric called the Internal-Leafset (IL) distance. While the RF tree distance compares sets of leaves descending from internal nodes of two RNA trees, the IL distance compares the collection of leaf-children of internal nodes. The latter is better at capturing differences in structural elements of RNAs than the RF distance, which is more focused on base pairs. We also consider a more general tree edit distance that allows the mapping of base pairs that are not perfectly aligned. We study the theoretical complexity of the median problem and the small parsimony problem under the three distance metrics and various biologically relevant constraints, and we present polynomial-time maximum parsimony algorithms for solving some versions of the problems. Our algorithms are applied to ncRNA families from the RFAM database, illustrating their practical utility. AVAILABILITY AND IMPLEMENTATION: https://github.com/bmarchand/rna_small_parsimony.


Subjects
Nucleic Acid Conformation , Untranslated RNA , Untranslated RNA/genetics , Untranslated RNA/chemistry , Algorithms , Molecular Evolution , RNA Sequence Analysis/methods , Computational Biology/methods
3.
Nucleic Acids Res ; 51(D1): D135-D140, 2023 01 06.
Article in English | MEDLINE | ID: mdl-35971612

ABSTRACT

G-quadruplexes (G4) are 3D structures that are found in both DNA and RNA. Interest in this structure has grown over the past few years due to both its implication in diverse biological mechanisms and its potential use as a therapeutic target, to name two examples. G4s in humans have been widely studied; however, the level of their study in other species remains relatively minimal. That said, progress in this field has resulted in the prediction of G4s structures in various species, ranging from bacteria to eukaryotes. These predictions were analysed in a previous study which revealed that G4s are present in all living kingdoms. To date, eleven different databases have grouped the various G4s depending on either their structures, on the proteins that might bind them, or on their location in the various genomes. However, none of these databases contains information on their location in the transcriptome of many of the implicated species. The GAIA database was designed so as to make this data available online in a user-friendly manner. Through its web interface, users can query GAIA to filter G4s, which, we hope, will help the research in this field. GAIA is available at: https://gaia.cobius.usherbrooke.ca.


Subjects
G-Quadruplexes , Humans , RNA/chemistry , Bacteria/genetics , Eukaryota/genetics , Eukaryota/metabolism , DNA/chemistry
4.
BMC Bioinformatics ; 25(1): 235, 2024 Jul 11.
Article in English | MEDLINE | ID: mdl-38992593

ABSTRACT

BACKGROUND: SimSpliceEvol is a tool for simulating the evolution of eukaryotic gene sequences that integrates exon-intron structure evolution as well as the evolution of the sets of transcripts produced from genes. It takes a guide gene tree as input and generates a gene sequence with its transcripts for each node of the tree, from the root to the leaves. However, the sets of transcripts simulated at different nodes of the guide gene tree lack evolutionary connections. Consequently, SimSpliceEvol is not suitable for evaluating methods for transcript phylogeny inference or gene phylogeny inference that rely on transcript conservation. RESULTS: Here, we introduce SimSpliceEvol2, which, compared to the first version, incorporates an explicit model of transcript evolution for simulating alternative transcripts along the branches of a guide gene tree, as well as the transcript phylogenies inferred. We offer a comprehensive software with a graphical user interface and an updated version of the web server, ensuring easy and user-friendly access to the tool. CONCLUSION: SimSpliceEvol2 generates synthetic datasets that are useful for evaluating methods and tools for spliced RNA sequence analysis, such as spliced alignment methods, methods for identifying conserved transcripts, and transcript phylogeny reconstruction methods. The web server is accessible at https://simspliceevol.cobius.usherbrooke.ca, where you can also download the standalone software. Comprehensive documentation for the software is available at the same address. For developers interested in the source code, which requires the installation of all prerequisites to run, it is provided at https://github.com/UdeS-CoBIUS/SimSpliceEvol.


Subjects
Alternative Splicing , Molecular Evolution , Phylogeny , Software , Alternative Splicing/genetics , Exons/genetics , RNA Sequence Analysis/methods , Computer Simulation
5.
Nucleic Acids Res ; 49(D1): D380-D388, 2021 01 08.
Article in English | MEDLINE | ID: mdl-33179748

ABSTRACT

OpenProt (www.openprot.org) is the first proteogenomic resource supporting a polycistronic annotation model for eukaryotic genomes. It provides a deeper annotation of open reading frames (ORFs) while mining experimental data for supporting evidence using cutting-edge algorithms. This update presents the major improvements since the initial release of OpenProt. All species support recent NCBI RefSeq and Ensembl annotations, with changes in annotations being reported in OpenProt. Using the 131 ribosome profiling datasets re-analysed by OpenProt to date, non-AUG initiation starts are reported alongside a confidence score of the initiating codon. From the 177 mass spectrometry datasets re-analysed by OpenProt to date, the unicity of the detected peptides is controlled at each implementation. Furthermore, to guide the users, detectability statistics and protein relationships (isoforms) are now reported for each protein. Finally, to foster access to deeper ORF annotation independently of one's bioinformatics skills or computational resources, OpenProt now offers a data analysis platform. Users can submit their dataset for analysis and receive the results from the analysis by OpenProt. All data on OpenProt are freely available and downloadable for each species, the release-based format ensuring a continuous access to the data. Thus, OpenProt enables a more comprehensive annotation of eukaryotic genomes and fosters functional proteomic discoveries.


Subjects
Protein Databases , Eukaryota/genetics , Genome , Molecular Sequence Annotation , Open Reading Frames/genetics , Mass Spectrometry , Protein Isoforms/genetics , Proteogenomics , Ribosomes/metabolism , User-Computer Interface
6.
Bioinformatics ; 37(Suppl_1): i120-i132, 2021 07 12.
Article in English | MEDLINE | ID: mdl-34252921

ABSTRACT

MOTIVATION: It is largely established that all extant mitochondria originated from a unique endosymbiotic event integrating an α-proteobacterial genome into a eukaryotic cell. Subsequently, eukaryote evolution has been marked by episodes of gene transfer, mainly from the mitochondria to the nucleus, resulting in a significant reduction of the mitochondrial genome, eventually completely disappearing in some lineages. However, in other lineages such as in land plants, a high variability in gene repertoire distribution, including genes encoded in both the nuclear and mitochondrial genome, is an indication of an ongoing process of Endosymbiotic Gene Transfer (EGT). Understanding how both nuclear and mitochondrial genomes have been shaped by gene loss, duplication and transfer is expected to shed light on a number of open questions regarding the evolution of eukaryotes, including rooting of the eukaryotic tree. RESULTS: We address the problem of inferring the evolution of a gene family through duplication, loss and EGT events, the latter considered as a special case of horizontal gene transfer occurring between the mitochondrial and nuclear genomes of the same species (in one direction or the other). We consider EGT events that either maintain (EGTcopy) or remove (EGTcut) the gene copy in the source genome. We present a linear-time algorithm for computing the DLE (Duplication, Loss and EGT) distance, as well as an optimal reconciled tree, for the unitary cost, and a dynamic programming algorithm that outputs all optimal reconciliations for an arbitrary cost of operations. We illustrate the application of our EndoRex software, analyze different cost settings on a plant dataset, and discuss the resulting reconciled trees. AVAILABILITY AND IMPLEMENTATION: EndoRex implementation and supporting data are available on the GitHub repository via https://github.com/AEVO-lab/EndoRex.


Subjects
Molecular Evolution , Horizontal Gene Transfer , Algorithms , Gene Duplication , Genome , Phylogeny , Symbiosis/genetics
7.
Bioinformatics ; 37(13): 1920-1922, 2021 07 27.
Article in English | MEDLINE | ID: mdl-33051656

ABSTRACT

MOTIVATION: A phylogenetic tree reconciliation is a mapping of one phylogenetic tree onto another which represents the co-evolution of two sets of taxa (e.g. parasite-host co-evolution, gene-species co-evolution). The reconciliation framework was extended to allow modeling the co-evolution of three sets of taxa such as transcript-gene-species co-evolutions. Several web-based tools have been developed for the display and manipulation of phylogenetic trees and co-phylogenetic trees involving two trees, but there currently exists no tool for visualizing the joint reconciliation between three phylogenetic trees. RESULTS: Here, we present DoubleRecViz, a web-based tool for visualizing double reconciliations between phylogenetic trees at three levels: transcript, gene and species. DoubleRecViz extends the RecPhyloXML model, developed for gene-species tree reconciliation, to represent joint transcript-gene and gene-species tree reconciliations. It is implemented using the Dash library, which is a toolbox that provides dynamic visualization functionalities for web data visualization in Python. AVAILABILITY AND IMPLEMENTATION: DoubleRecViz is available through a web server at https://doublerecviz.cobius.usherbrooke.ca. The source code and information about installation procedures are also available at https://github.com/UdeS-CoBIUS/DoubleRecViz. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Molecular Evolution , Software , Algorithms , Internet , Phylogeny
8.
Nucleic Acids Res ; 47(D1): D403-D410, 2019 01 08.
Article in English | MEDLINE | ID: mdl-30299502

ABSTRACT

Advances in proteomics and sequencing have highlighted many non-annotated open reading frames (ORFs) in eukaryotic genomes. Genome annotations, cornerstones of today's research, mostly rely on protein prior knowledge and on ab initio prediction algorithms. Such algorithms notably enforce an arbitrary criterion of one coding sequence (CDS) per transcript, leading to a substantial underestimation of the coding potential of eukaryotes. Here, we present OpenProt, the first database fully endorsing a polycistronic model of eukaryotic genomes to date. OpenProt contains all possible ORFs longer than 30 codons across 10 species, and cumulates supporting evidence such as protein conservation, translation and expression. OpenProt annotates all known proteins (RefProts), novel predicted isoforms (Isoforms) and novel predicted proteins from alternative ORFs (AltProts). It incorporates cutting-edge algorithms to evaluate protein orthology and re-interrogate publicly available ribosome profiling and mass spectrometry datasets, supporting the annotation of thousands of predicted ORFs. The constantly growing database currently cumulates evidence from 87 ribosome profiling and 114 mass spectrometry studies from several species, tissues and cell lines. All data is freely available and downloadable from a web platform (www.openprot.org) supporting a genome browser and advanced queries for each species. Thus, OpenProt enables a more comprehensive landscape of eukaryotic genomes' coding potential.


Subjects
Eukaryota/genetics , Genes/genetics , Genome , Open Reading Frames/genetics , Proteome/genetics , Algorithms , Animals , Humans , Mass Spectrometry , Molecular Sequence Annotation , Protein Isoforms/genetics , Proteomics/methods , Ribosomes/metabolism , Amino Acid Sequence Homology
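
To make the "all possible ORFs longer than 30 codons" criterion in the abstract above concrete, here is a minimal Python sketch of a forward-strand ORF scan. It is not OpenProt's pipeline: the function name `find_orfs`, the ATG-only start rule, the three standard stop codons, and counting codons from the start codon up to (but excluding) the stop codon are all assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=30):
    """Return (start, end, frame) for ORFs with more than `min_codons` codons.

    Coordinates are 0-based, end-exclusive, and include the stop codon.
    Only the three forward frames of the given sequence are scanned.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first in-frame ATG since the last stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3  # assumption: stop codon not counted
                if n_codons > min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


if __name__ == "__main__":
    transcript = "ATG" + "GCT" * 30 + "TAA"
    print(find_orfs(transcript))
```

Because each stop codon closes at most one ORF (the one opened by the first in-frame ATG), nested ORFs sharing a stop are reported once, as the longest candidate.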
9.
BMC Bioinformatics ; 20(Suppl 20): 640, 2019 Dec 17.
Article in English | MEDLINE | ID: mdl-31842741

ABSTRACT

BACKGROUND: It is now well established that eukaryotic coding genes have the ability to produce more than one type of transcript thanks to the mechanisms of alternative splicing and alternative transcription. Because of the lack of gold standard real data on alternative splicing, simulated data constitute a good option for evaluating the accuracy and the efficiency of methods developed for splice-aware sequence analysis. However, existing sequence evolution simulation methods do not model alternative splicing, and so they cannot be used to test spliced sequence analysis methods. RESULTS: We propose a new method called SimSpliceEvol for simulating the evolution of sets of alternative transcripts along the branches of an input gene tree. In addition to traditional sequence evolution events, the simulation also includes gene exon-intron structure evolution events and alternative splicing events that modify the sets of transcripts produced from genes. SimSpliceEvol was implemented in Python. The source code is freely available at https://github.com/UdeS-CoBIUS/SimSpliceEvol. CONCLUSIONS: Data generated using SimSpliceEvol are useful for testing spliced RNA sequence analysis methods such as methods for spliced alignment of cDNA and genomic sequences, multiple cDNA alignment, orthologous exon identification, splicing orthology inference, and transcript phylogeny inference, all of which require knowing the real evolutionary relationships between the sequences.


Subjects
Alternative Splicing/genetics , Computer Simulation , Molecular Evolution , Software , Animals , Base Sequence , Complementary DNA/genetics , Exons/genetics , Humans , Introns/genetics , Markov Chains , Probability , Messenger RNA/genetics , Messenger RNA/metabolism
10.
BMC Bioinformatics ; 20(Suppl 3): 133, 2019 Mar 29.
Article in English | MEDLINE | ID: mdl-30925859

ABSTRACT

BACKGROUND: The inference of splicing orthology relationships between gene transcripts is a basic step for the prediction of transcripts and the annotation of gene structures in genomes. The splicing structure of a sequence refers to the exon extremity information in a CDS or the exon-intron extremity information in a gene sequence. Splicing orthologous CDS are pairs of CDS with similar sequences and conserved splicing structures from orthologous genes. Spliced alignment, which consists in aligning a spliced cDNA sequence against an unspliced genomic sequence, constitutes a promising, yet unexplored approach for the identification of splicing orthology relationships. Existing spliced alignment algorithms do not exploit the information on the splicing structure of the input sequences, namely the exon structure of the cDNA sequence and the exon-intron structure of the genomic sequences. Yet, this information is often available for coding DNA sequences (CDS) and gene sequences annotated in databases, and it can help improve the accuracy of the computed spliced alignments. To address this issue, we introduce a new spliced alignment problem and a method called SplicedFamAlign (SFA) for computing the alignment of a spliced CDS against a gene sequence while accounting for the splicing structures of the input sequences, and then the inference of transcript splicing orthology groups in a gene family based on spliced alignments. RESULTS: The experimental results show that SFA outperforms existing spliced alignment methods in terms of accuracy and execution time for CDS-to-gene alignment. We also show that the performance of SFA remains high for various levels of sequence similarity between input sequences, thanks to accounting for the splicing structure of the input sequences. It is important to notice that unlike all current spliced alignment methods that are meant for cDNA-to-genome alignments and can be used for CDS-to-gene alignments, SFA is the first method specifically designed for CDS-to-gene alignments. CONCLUSION: We show the usefulness of SFA for the comparison of genes and transcripts within a gene family for the purpose of analyzing splicing orthologies. It can also be used for gene structure annotation and alternative splicing analyses. SplicedFamAlign was implemented in Python. Source code is freely available at https://github.com/UdeS-CoBIUS/SpliceFamAlign.


Subjects
Algorithms , Alternative Splicing/genetics , Open Reading Frames/genetics , Sequence Alignment/methods , Base Sequence , Computer Simulation , Exons/genetics , Introns/genetics , Molecular Sequence Annotation , Messenger RNA/genetics , Messenger RNA/metabolism
11.
Bioinformatics ; 34(13): i70-i78, 2018 07 01.
Article in English | MEDLINE | ID: mdl-29949960

ABSTRACT

Motivation: Predicting the conserved secondary structure of homologous ribonucleic acid (RNA) sequences is crucial for understanding RNA functions. However, fast and accurate RNA structure prediction is challenging, especially when the number and the divergence of homologous RNA increase. To address this challenge, we propose aliFreeFold, based on a novel alignment-free approach which computes a representative structure from a set of homologous RNA sequences using sub-optimal secondary structures generated for each sequence. It is based on a vector representation of sub-optimal structures capturing structure conservation signals by weighting structural motifs according to their conservation across the sub-optimal structures. Results: We demonstrate that aliFreeFold provides a good balance between speed and accuracy regarding predictions of representative structures for sets of homologous RNA compared to traditional methods based on sequence and structure alignment. We show that aliFreeFold is capable of uncovering conserved structural features quickly and effectively thanks to its weighting scheme that gives more (resp. less) importance to common (resp. uncommon) structural motifs. The weighting scheme is also shown to be capable of capturing conservation signal as the number of homologous RNA increases. These results demonstrate the ability of aliFreeFold to efficiently and accurately provide interesting structural representatives of RNA families. Availability and implementation: aliFreeFold was implemented in C++. Source code and Linux binary are freely available at https://github.com/UdeS-CoBIUS/aliFreeFold. Supplementary information: Supplementary data are available at Bioinformatics online.


Subjects
RNA/chemistry , RNA Sequence Analysis/methods , Software , Algorithms , Nucleic Acid Conformation , RNA/metabolism
12.
BMC Bioinformatics ; 16 Suppl 14: S4, 2015.
Article in English | MEDLINE | ID: mdl-26451911

ABSTRACT

Combining a set of trees on partial datasets into a single tree is a classical method for inferring large phylogenetic trees. Ideally, the combined tree should display each input partial tree, which is only possible if input trees do not contain contradictory phylogenetic information. The simplest version of the supertree problem is thus to state whether a set of trees is compatible, and if so, construct a tree displaying them all. Classically, supertree methods have been applied to the reconstruction of species trees. Here we rather consider reconstructing a super gene tree in light of a known species tree S. We define the supergenetree problem as finding, among all supertrees displaying a set of input gene trees, one supertree minimizing a reconciliation distance with S. We first show how classical exact methods to the supertree problem can be extended to the supergenetree problem. As all these methods are highly exponential, we also exhibit a natural greedy heuristic for the duplication cost, based on minimizing the set of duplications preceding the first speciation event. We then show that both the supergenetree problem and its restriction to minimizing duplications preceding the first speciation are NP-hard to approximate within a $n^{1-\epsilon}$ factor, for any $0 < \epsilon < 1$. Finally, we show that a restriction of this problem to uniquely labeled speciation gene trees, which is relevant to many biological applications, is also NP-hard. Therefore, we introduce new avenues in the field of supertrees, and set the theoretical basis for the exploration of various algorithmic aspects of the problems.


Subjects
Algorithms , Computational Biology/methods , Molecular Evolution , Genetic Speciation , Phylogeny , Animals , Humans , Genetic Models , Software
13.
BMC Genomics ; 16 Suppl 5: S6, 2015.
Article in English | MEDLINE | ID: mdl-26040958

ABSTRACT

BACKGROUND: In the context of ancestral gene order reconstruction from extant genomes, there exist two main computational approaches: rearrangement-based, and homology-based methods. The rearrangement-based methods consist in minimizing a total rearrangement distance on the branches of a species tree. The homology-based methods consist in the detection of a set of potential ancestral contiguity features, followed by the assembling of these features into Contiguous Ancestral Regions (CARs). RESULTS: In this paper, we present a new homology-based method that uses a progressive approach for both the detection and the assembling of ancestral contiguity features into CARs. The method is based on detecting a set of potential ancestral adjacencies iteratively using the current set of CARs at each step, and constructing CARs progressively using a 2-phase assembling method. CONCLUSION: We show the usefulness of the method through a reconstruction of the boreoeutherian ancestral gene order, and a comparison with three other homology-based methods: AnGeS, InferCARs and GapAdj. The program, written in Python, and the dataset used in this paper are available at http://bioinfo.lifl.fr/procars/.


Subjects
Animal Population Groups/genetics , Computational Biology/methods , Genome/genetics , Genomics/methods , Population Groups/genetics , Algorithms , Animals , Molecular Evolution , Humans , Genetic Models , Phylogeny
14.
J Comput Biol ; 31(4): 277-293, 2024 04.
Article in English | MEDLINE | ID: mdl-38621191

ABSTRACT

Eukaryotic genes undergo a mechanism called alternative processing, resulting in transcriptome diversity by allowing the production of multiple distinct transcripts from a gene. More than half of human genes are affected, and the resulting transcripts are highly conserved among orthologous genes of distinct species. In this work, we present the definition of orthology and paralogy between transcripts of homologous genes, together with an algorithm to compute clusters of conserved orthologous and paralogous transcripts. Gene-level homology relationships are utilized to define various types of homology relationships between transcripts originating from the same ancestral transcript. A Reciprocal Best Hits approach is employed to infer clusters of isoorthologous and recent paralogous transcripts. We applied this method to transcripts from simulated gene families as well as real gene families from the Ensembl-Compara database. The results are consistent with those from previous studies that compared orthologous gene transcripts. Furthermore, our findings provide evidence that searching for conserved transcripts between homologous genes, beyond the scope of orthologous genes, is likely to yield valuable information.


Subjects
Algorithms , Humans , Transcriptome/genetics , Genetic Databases , Animals , Messenger RNA/genetics , Computational Biology/methods , Multigene Family
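
The Reciprocal Best Hits idea named in the abstract above can be sketched in a few lines of Python. This is a generic pairwise illustration, not the authors' clustering algorithm: the function name `reciprocal_best_hits`, the transcript identifiers, and the precomputed similarity scores are hypothetical, and ties are resolved by keeping the first pair seen.

```python
def reciprocal_best_hits(scores):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a.

    `scores` maps (transcript_of_gene_A, transcript_of_gene_B) to a similarity
    value; higher means more similar.
    """
    best_for_a = {}  # best partner in gene B for each transcript of gene A
    best_for_b = {}  # best partner in gene A for each transcript of gene B
    for (a, b), score in scores.items():
        if a not in best_for_a or score > scores[(a, best_for_a[a])]:
            best_for_a[a] = b
        if b not in best_for_b or score > scores[(best_for_b[b], b)]:
            best_for_b[b] = a
    return sorted((a, b) for a, b in best_for_a.items() if best_for_b[b] == a)


if __name__ == "__main__":
    example_scores = {
        ("a1", "b1"): 0.9,
        ("a1", "b2"): 0.4,
        ("a2", "b1"): 0.7,
        ("a2", "b2"): 0.3,
    }
    print(reciprocal_best_hits(example_scores))
```

In the example, `a2` prefers `b1`, but `b1` prefers `a1`, so only the mutual pair is kept; building clusters across more than two genes would require an additional step that this sketch does not attempt.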
15.
Evol Bioinform Online ; 19: 11769343231212075, 2023.
Article in English | MEDLINE | ID: mdl-38046653

ABSTRACT

Background: G-quadruplexes (G4s) are secondary structures in DNA and RNA that impact various cellular processes, such as transcription, splicing, and translation. Due to their numerous functions, G4s are involved in many diseases, making their study important. Yet, G4s evolution remains largely unknown, due to their low sequence similarity and the poor quality of their sequence alignments across several species. To address this, we designed a strategy that avoids direct G4s alignment to study G4s evolution in the 3 species kingdoms. We also explored the coevolution between RBPs and G4s. Methods: We retrieved one-to-one orthologous genes from the Ensembl Compara database and computed groups of one-to-one orthologous genes. For each group, we aligned gene sequences and identified G4 families as groups of overlapping G4s in the alignment. We analyzed these G4 families using Count, a tool to infer feature evolution into a gene or a species tree. Additionally, we utilized these G4 families to predict G4s by homology. To establish a control dataset, we performed mono-, di- and tri-nucleotide shuffling. Results: Only a few conserved G4s occur among all living kingdoms. In eukaryotes, G4s exhibit slight conservation among vertebrates, and few are conserved between plants. In archaea and bacteria, at most, only 2 G4s are common. The G4 homology-based prediction increases the number of conserved G4s in common ancestors. The coevolution between RNA-binding proteins and G4s was investigated and revealed a modest impact of RNA-binding proteins evolution on G4 evolution. However, the details of this relationship remain unclear. Conclusion: Even if G4 evolution still eludes us, the present study provides key information to compute groups of homologous G4 and to reveal the evolution history of G4 families.

16.
Bioinformatics ; 27(19): 2664-71, 2011 Oct 01.
Article in English | MEDLINE | ID: mdl-21846735

ABSTRACT

MOTIVATION: The ancestor of birds and mammals lived approximately 300 million years ago. Inferring its genome organization is key to understanding the differentiated evolution of these two lineages. However, detecting traces of its chromosomal organization in its extant descendants is difficult due to the accumulation of molecular evolution since the bird and mammal lineages diverged. RESULTS: We address several methodological issues for the detection and assembly of ancestral genomic features of ancient vertebrate genomes, which encompass adjacencies, contiguous segments, syntenies and double syntenies in the context of a whole genome duplication. Using generic, but stringent, methods for all these problems, some of them new, we analyze 15 vertebrate genomes, including 12 amniotes and 3 teleost fishes, and infer a high-resolution genome organization of the amniote ancestral genome, composed of 39 ancestral linkage groups at a resolution of 100 kb. We extensively discuss the validity and robustness of the method to variations of data and parameters. We introduce a support value for each of the groups, and show that 36 out of 39 have maximum support. CONCLUSIONS: A single methodological principle cannot currently be used to infer the organization of the amniote ancestral genome, and we demonstrate that it is possible to gather several principles into a computational paleogenomics pipeline. This strategy offers a solid methodological base for the reconstruction of ancient vertebrate genomes. AVAILABILITY: Source code, in C++ and Python, is available at http://www.cecm.sfu.ca/~cchauve/SUPP/AMNIOTE2010/ CONTACT: cedric.chauve@sfu.ca SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Chromosomes/genetics , Molecular Evolution , Genome/genetics , Vertebrates/genetics , Animals , Biological Evolution , Birds/genetics , Genetic Linkage , Mammals/genetics , Synteny
17.
NAR Genom Bioinform ; 4(1): lqac010, 2022 Mar.
Article in English | MEDLINE | ID: mdl-35261973

ABSTRACT

G-quadruplexes are motifs found in DNA and RNA that can fold into tertiary structures. Until now, they have been studied experimentally mainly in humans and a few other species. Recently, predictions have been made with bacterial and archaeal genomes. Nevertheless, a global comparison of predicted G4s (pG4s) across and within the three living kingdoms has not been addressed. In this study, we aimed to predict G4s in genes and transcripts of all kingdoms of living organisms and investigated the differences in their distributions. The relation of the predictions with GC content was studied. It appears that GC content is not the only parameter impacting G4 predictions and abundance. The distribution of pG4 densities varies depending on the class of transcripts and the group of species. Indeed, we have observed that, in coding transcripts, there are more predicted G4s than expected for eukaryotes but not for archaea and bacteria, while in noncoding transcripts, there are as many or fewer predicted G4s in all species groups. We even noticed that some species with the same GC content presented different pG4 profiles. For instance, Leishmania major and Chlamydomonas reinhardtii both have 60% of GC content, but the former has a pG4 density of 0.07 and the latter 1.16.

18.
Bioinform Adv ; 2(1): vbab044, 2022.
Article in English | MEDLINE | ID: mdl-36699392

ABSTRACT

Motivation: Alternative splicing is a ubiquitous process in eukaryotes that allows distinct transcripts to be produced from the same gene. Yet, the study of transcript evolution within a gene family is still in its infancy. One prerequisite for this study is the availability of methods to compare sets of transcripts while accounting for their splicing structure. In this context, we generalize the concept of pairwise spliced alignments (PSpAs) to multiple spliced alignments (MSpAs). MSpAs have several important purposes in addition to empowering the study of the evolution of transcripts. For instance, they are key to improving the prediction of gene models, which is important to solve the growing problem of genome annotation. Despite their importance, a formal definition of the concept and methods to compute MSpAs are still lacking. Results: We introduce the MSpA problem and the SplicedFamAlignMulti (SFAM) method to compute the MSpA of a gene family. Like most multiple sequence alignment (MSA) methods, which are generally greedy heuristics that assemble pairwise alignments, SFAM combines all PSpAs of coding DNA sequences and gene sequences of a gene family into an MSpA. It produces a single structure that represents the superstructure and models of the gene family. Using real vertebrate and simulated gene family data, we illustrate the utility of SFAM for computing accurate gene family superstructures and MSAs, inferring splicing orthologous groups and improving gene-model annotations. Availability and implementation: The supporting data and implementation of SFAM are freely available at https://github.com/UdeS-CoBIUS/SpliceFamAlignMulti. Supplementary information: Supplementary data are available at Bioinformatics Advances online.

19.
BMC Bioinformatics ; 12 Suppl 9: S20, 2011 Oct 05.
Article in English | MEDLINE | ID: mdl-22152053

ABSTRACT

BACKGROUND: Segmental duplications in genomes have been studied for many years. Recently, several studies have highlighted a biological phenomenon called breakpoint-duplication that apparently associates a significant proportion of segmental duplications in mammals and in the Drosophila species group with breakpoints in rearrangement events. RESULTS: In this paper, we introduce and study a combinatorial problem, inspired by the breakpoint-duplication phenomenon, called the Genome Dedoubling Problem. It consists of finding a minimum-length rearrangement scenario required to transform a genome with duplicated segments into a non-duplicated genome such that duplications are caused by rearrangement breakpoints. We show that the problem, in the Double-Cut-and-Join (DCJ) and the reversal rearrangement models, can be reduced to an APX-complete problem, and we provide algorithms for the Genome Dedoubling Problem with 2-approximable parts. We apply the methods to the reconstruction of a non-duplicated ancestor of Drosophila yakuba. CONCLUSIONS: We present the Genome Dedoubling Problem, and describe two algorithms solving the problem in the DCJ model and the reversal model. The usefulness of the problem and the methods is shown through an application to real Drosophila data.


Subjects
Molecular Evolution , Genomics/methods , Genomic Segmental Duplications , Algorithms , Animals , Drosophila/genetics , Genetic Models
20.
NAR Genom Bioinform ; 2(2): lqaa035, 2020 Jun.
Article in English | MEDLINE | ID: mdl-33575590

ABSTRACT

It has been demonstrated that RNA G-quadruplexes (G4) are structural motifs that are present in transcriptomes and play important regulatory roles in several post-transcriptional mechanisms. However, the full picture of RNA G4 locations and the extent of their implication remain elusive. Only a computational prediction analysis of the whole transcriptome may reveal all potential G4, since experimental identifications are always limited to specific conditions or specific cell lines. The present study reports the first in-depth computational prediction of potential G4 regions across the complete human transcriptome. Despite the use of a relatively stringent approach based on three prediction scores that account for the composition of G4 sequences, the composition of their neighboring sequences, and the various forms of G4, over 1.1 million potential G4 (pG4) were predicted. The abundance of G4 was computationally confirmed in both the 5' and 3'UTR as well as at splicing junctions of mRNA, and was appreciated for the first time in long ncRNA, while G4 were almost absent from most small ncRNA families. The present results constitute an important step toward a full understanding of the roles of G4 in post-transcriptional mechanisms.
