Search | BVS Aleitamento Materno

1.

GeneMark-ETP significantly improves the accuracy of automatic annotation of large eukaryotic genomes.

Bruna, Tomás; Lomsadze, Alexandre; Borodovsky, Mark.

Genome Res ; 34(5): 757-768, 2024 06 25.

Article in English | MEDLINE | ID: mdl-38866548

ABSTRACT

Large-scale genomic initiatives, such as the Earth BioGenome Project, require efficient methods for eukaryotic genome annotation. Here we present an automatic gene finder, GeneMark-ETP, integrating genomic-, transcriptomic-, and protein-derived evidence that has been developed with a focus on large plant and animal genomes. GeneMark-ETP first identifies genomic loci where extrinsic data are sufficient for making gene predictions with "high confidence." The genes situated in the genomic space between the high-confidence genes are predicted in the next stage. The set of high-confidence genes serves as an initial training set for the statistical model. Further on, the model parameters are iteratively updated in the rounds of gene prediction and parameter re-estimation. Upon reaching convergence, GeneMark-ETP makes the final predictions and delivers the whole complement of predicted genes. GeneMark-ETP outperforms gene finders using a single type of extrinsic evidence. Comparisons with gene finders MAKER2 and TSEBRA, those that use both transcript- and protein-derived extrinsic evidence, show that GeneMark-ETP delivers state-of-the-art gene-prediction accuracy, with the margin of outperforming existing approaches increasing in its application to larger and more complex eukaryotic genomes.

Subjects

Molecular Sequence Annotation , Molecular Sequence Annotation/methods , Animals , Software , Genome , Genomics/methods , Eukaryota/genetics , Algorithms

2.

BRAKER3: Fully automated genome annotation using RNA-seq and protein evidence with GeneMark-ETP, AUGUSTUS, and TSEBRA.

Gabriel, Lars; Bruna, Tomás; Hoff, Katharina J; Ebel, Matthis; Lomsadze, Alexandre; Borodovsky, Mark; Stanke, Mario.

Genome Res ; 34(5): 769-777, 2024 06 25.

Article in English | MEDLINE | ID: mdl-38866550

ABSTRACT

Gene prediction has remained an active area of bioinformatics research for a long time. Still, gene prediction in large eukaryotic genomes presents a challenge that must be addressed by new algorithms. The amount and significance of the evidence available from transcriptomes and proteomes vary across genomes, between genes, and even along a single gene. User-friendly and accurate annotation pipelines that can cope with such data heterogeneity are needed. The previously developed annotation pipelines BRAKER1 and BRAKER2 use RNA-seq or protein data, respectively, but not both. A further significant performance improvement integrating all three data types was made by the recently released GeneMark-ETP. We here present the BRAKER3 pipeline that builds on GeneMark-ETP and AUGUSTUS, and further improves accuracy using the TSEBRA combiner. BRAKER3 annotates protein-coding genes in eukaryotic genomes using both short-read RNA-seq and a large protein database, along with statistical models learned iteratively and specifically for the target genome. We benchmarked the new pipeline on genomes of 11 species under an assumed level of relatedness of the target species proteome to available proteomes. BRAKER3 outperforms BRAKER1 and BRAKER2. The average transcript-level F1-score is increased by about 20 percentage points on average, whereas the difference is most pronounced for species with large and complex genomes. BRAKER3 also outperforms other existing tools, MAKER2, Funannotate, and FINDER. The code of BRAKER3 is available on GitHub and as a ready-to-run Docker container for execution with Docker or Singularity. Overall, BRAKER3 is an accurate, easy-to-use tool for eukaryotic genome annotation.
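The transcript-level F1-score mentioned in this abstract can be illustrated with a minimal sketch. It assumes the usual definition (harmonic mean of sensitivity and precision over exactly matching transcripts); the function name and the counts below are hypothetical and are not taken from the BRAKER3 benchmark.

```python
def f1_score(true_positives: int, predicted: int, annotated: int) -> float:
    """Return the F1-score as a percentage.

    true_positives: predicted transcripts that match a reference transcript
    predicted:      total number of predicted transcripts
    annotated:      total number of reference (annotated) transcripts
    """
    if predicted == 0 or annotated == 0:
        return 0.0
    sensitivity = true_positives / annotated   # share of reference transcripts found
    precision = true_positives / predicted     # share of predictions that are correct
    if sensitivity + precision == 0:
        return 0.0
    return 100.0 * 2 * sensitivity * precision / (sensitivity + precision)


# Hypothetical counts: a gain in "percentage points" is a plain difference
# between two F1 values, not a relative change.
baseline = f1_score(true_positives=50, predicted=100, annotated=100)
improved = f1_score(true_positives=70, predicted=100, annotated=100)
gain_in_points = improved - baseline
```

With these made-up counts the gain is about 20 percentage points, which is the kind of difference the abstract reports on average.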

Subjects

Molecular Sequence Annotation , Software , Molecular Sequence Annotation/methods , Humans , RNA-Seq/methods , Algorithms , Animals , Genome , Computational Biology/methods , Genomics/methods , Transcriptome

3.

TSEBRA: transcript selector for BRAKER.

Gabriel, Lars; Hoff, Katharina J; Bruna, Tomás; Borodovsky, Mark; Stanke, Mario.

BMC Bioinformatics ; 22(1): 566, 2021 Nov 25.

Article in English | MEDLINE | ID: mdl-34823473

ABSTRACT

BACKGROUND: BRAKER is a suite of automatic pipelines, BRAKER1 and BRAKER2, for the accurate annotation of protein-coding genes in eukaryotic genomes. Each pipeline trains statistical models of protein-coding genes based on provided evidence and, then predicts protein-coding genes in genomic sequences using both the extrinsic evidence and statistical models. For training and prediction, BRAKER1 and BRAKER2 incorporate complementary extrinsic evidence: BRAKER1 uses only RNA-seq data while BRAKER2 uses only a database of cross-species proteins. The BRAKER suite has so far not been able to reliably exceed the accuracy of BRAKER1 and BRAKER2 when incorporating both types of evidence simultaneously. Currently, for a novel genome project where both RNA-seq and protein data are available, the best option is to run both pipelines independently, and to pick one, likely better output. Therefore, one or another type of the extrinsic evidence would remain unexploited. RESULTS: We present TSEBRA, a software that selects gene predictions (transcripts) from the sets generated by BRAKER1 and BRAKER2. TSEBRA uses a set of rules to compare scores of overlapping transcripts based on their support by RNA-seq and homologous protein evidence. We show in computational experiments on genomes of 11 species that TSEBRA achieves higher accuracy than either BRAKER1 or BRAKER2 running alone and that TSEBRA compares favorably with the combiner tool EVidenceModeler. CONCLUSION: TSEBRA is an easy-to-use and fast software tool. It can be used in concert with the BRAKER pipeline to generate a gene prediction set supported by both RNA-seq and homologous protein evidence.
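The idea of choosing among overlapping transcripts by their evidence support can be sketched as a toy example. This is not TSEBRA's actual rule set, which the abstract does not spell out: the `Transcript` class, the intron-based support score, and the greedy selection are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Transcript:
    name: str
    start: int              # genomic start coordinate (inclusive)
    end: int                # genomic end coordinate (inclusive)
    supported_introns: int  # introns backed by RNA-seq or protein evidence
    total_introns: int

    def support(self) -> float:
        # Hypothetical score: fraction of introns with extrinsic support.
        # Single-exon transcripts get 0.0 in this toy model.
        if self.total_introns == 0:
            return 0.0
        return self.supported_introns / self.total_introns


def overlaps(a: Transcript, b: Transcript) -> bool:
    return a.start <= b.end and b.start <= a.end


def select_transcripts(candidates: List[Transcript]) -> List[Transcript]:
    """Greedily keep the best-supported transcript in each overlapping group."""
    kept: List[Transcript] = []
    for t in sorted(candidates, key=lambda t: t.support(), reverse=True):
        if all(not overlaps(t, k) for k in kept):
            kept.append(t)
    return sorted(kept, key=lambda t: t.start)


# Pooled predictions from two hypothetical runs (for example, one RNA-seq based
# and one protein based); "a" and "b" overlap, "c" stands alone.
pooled = [
    Transcript("a", 1, 100, supported_introns=3, total_introns=4),
    Transcript("b", 50, 150, supported_introns=4, total_introns=4),
    Transcript("c", 200, 300, supported_introns=0, total_introns=2),
]
selected_names = [t.name for t in select_transcripts(pooled)]
```

In this sketch the fully supported transcript "b" displaces the overlapping "a", while "c" is kept because nothing competes with it. A real combiner would also need to handle strands, alternative isoforms, and score thresholds.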

Subjects

Genome , Software , Genomics , RNA-Seq , Sequence Analysis, RNA

4.

Modeling leaderless transcription and atypical genes results in more accurate gene prediction in prokaryotes.

Lomsadze, Alexandre; Gemayel, Karl; Tang, Shiyuyun; Borodovsky, Mark.

Genome Res ; 28(7): 1079-1089, 2018 07.

Article in English | MEDLINE | ID: mdl-29773659

ABSTRACT

In a conventional view of the prokaryotic genome organization, promoters precede operons and ribosome binding sites (RBSs) with Shine-Dalgarno consensus precede genes. However, recent experimental research suggesting a more diverse view motivated us to develop an algorithm with improved gene-finding accuracy. We describe GeneMarkS-2, an ab initio algorithm that uses a model derived by self-training for finding species-specific (native) genes, along with an array of precomputed "heuristic" models designed to identify harder-to-detect genes (likely horizontally transferred). Importantly, we designed GeneMarkS-2 to identify several types of distinct sequence patterns (signals) involved in gene expression control, among them the patterns characteristic for leaderless transcription as well as noncanonical RBS patterns. To assess the accuracy of GeneMarkS-2, we used genes validated by COG (Clusters of Orthologous Groups) annotation, proteomics experiments, and N-terminal protein sequencing. We observed that GeneMarkS-2 performed better on average in all accuracy measures when compared with the current state-of-the-art gene prediction tools. Furthermore, the screening of ~5000 representative prokaryotic genomes made by GeneMarkS-2 predicted frequent leaderless transcription in both archaea and bacteria. We also observed that the RBS sites in some species with leadered transcription did not necessarily exhibit the Shine-Dalgarno consensus. The modeling of different types of sequence motifs regulating gene expression prompted a division of prokaryotic genomes into five categories with distinct sequence patterns around the gene starts.

Subjects

Archaea/genetics , Bacteria/genetics , Genes, Bacterial/genetics , Prokaryotic Cells/metabolism , Transcription, Genetic/genetics , Algorithms , Binding Sites/genetics , Computational Biology/methods , Molecular Sequence Annotation/methods , Operon/genetics , Protein Biosynthesis/genetics , Proteomics/methods , Ribosomes/genetics

5.

Prediction of lncRNAs and their interactions with nucleic acids: benchmarking bioinformatics tools.

Antonov, Ivan V; Mazurov, Evgeny; Borodovsky, Mark; Medvedeva, Yulia A.

Brief Bioinform ; 20(2): 551-564, 2019 03 22.

Article in English | MEDLINE | ID: mdl-29697742

ABSTRACT

The genomes of mammalian species are pervasively transcribed producing as many noncoding as protein-coding RNAs. There is a growing body of evidence supporting their functional role. Long noncoding RNA (lncRNA) can bind both nucleic acids and proteins through several mechanisms. A reliable computational prediction of the most probable mechanism of lncRNA interaction can facilitate experimental validation of its function. In this study, we benchmarked computational tools capable to discriminate lncRNA from mRNA and predict lncRNA interactions with other nucleic acids. We assessed the performance of 9 tools for distinguishing protein-coding from noncoding RNAs, as well as 19 tools for prediction of RNA-RNA and RNA-DNA interactions. Our conclusions about the considered tools were based on their performances on the entire genome/transcriptome level, as it is the most common task nowadays. We found that FEELnc and CPAT distinguish between coding and noncoding mammalian transcripts in the most accurate manner. ASSA, RIBlast and LASTAL, as well as Triplexator, turned out to be the best predictors of RNA-RNA and RNA-DNA interactions, respectively. We showed that the normalization of the predicted interaction strength to the transcript length and GC content may improve the accuracy of inferring RNA interactions. Yet, all the current tools have difficulties to make accurate predictions of short-trans RNA-RNA interactions-stretches of sparse contacts. All over, there is still room for improvement in each category, especially for predictions of RNA interactions.

Subjects

Benchmarking , Computational Biology/methods , RNA, Long Noncoding/metabolism , RNA, Messenger/metabolism , Humans , RNA, Long Noncoding/genetics , RNA, Messenger/genetics , Transcriptome

6.

NCBI prokaryotic genome annotation pipeline.

Tatusova, Tatiana; DiCuccio, Michael; Badretdin, Azat; Chetvernin, Vyacheslav; Nawrocki, Eric P; Zaslavsky, Leonid; Lomsadze, Alexandre; Pruitt, Kim D; Borodovsky, Mark; Ostell, James.

Nucleic Acids Res ; 44(14): 6614-24, 2016 08 19.

Article in English | MEDLINE | ID: mdl-27342282

ABSTRACT

Recent technological advances have opened unprecedented opportunities for large-scale sequencing and analysis of populations of pathogenic species in disease outbreaks, as well as for large-scale diversity studies aimed at expanding our knowledge across the whole domain of prokaryotes. To meet the challenge of timely interpretation of structure, function and meaning of this vast genetic information, a comprehensive approach to automatic genome annotation is critically needed. In collaboration with Georgia Tech, NCBI has developed a new approach to genome annotation that combines alignment based methods with methods of predicting protein-coding and RNA genes and other functional elements directly from sequence. A new gene finding tool, GeneMarkS+, uses the combined evidence of protein and RNA placement by homology as an initial map of annotation to generate and modify ab initio gene predictions across the whole genome. Thus, the new NCBI's Prokaryotic Genome Annotation Pipeline (PGAP) relies more on sequence similarity when confident comparative data are available, while it relies more on statistical predictions in the absence of external evidence. The pipeline provides a framework for generation and analysis of annotation on the full breadth of prokaryotic taxonomy. For additional information on PGAP see https://www.ncbi.nlm.nih.gov/genome/annotation_prok/ and the NCBI Handbook, https://www.ncbi.nlm.nih.gov/books/NBK174280/.

Subjects

Genome, Bacterial , Molecular Sequence Annotation , Prokaryotic Cells/metabolism , Bacteria/genetics , Bacterial Proteins/chemistry , Databases, Nucleic Acid , Genes, Bacterial

7.

BRAKER1: Unsupervised RNA-Seq-Based Genome Annotation with GeneMark-ET and AUGUSTUS.

Hoff, Katharina J; Lange, Simone; Lomsadze, Alexandre; Borodovsky, Mark; Stanke, Mario.

Bioinformatics ; 32(5): 767-9, 2016 03 01.

Article in English | MEDLINE | ID: mdl-26559507

ABSTRACT

MOTIVATION: Gene finding in eukaryotic genomes is notoriously difficult to automate. The task is to design a work flow with a minimal set of tools that would reach state-of-the-art performance across a wide range of species. GeneMark-ET is a gene prediction tool that incorporates RNA-Seq data into unsupervised training and subsequently generates ab initio gene predictions. AUGUSTUS is a gene finder that usually requires supervised training and uses information from RNA-Seq reads in the prediction step. Complementary strengths of GeneMark-ET and AUGUSTUS provided motivation for designing a new combined tool for automatic gene prediction. RESULTS: We present BRAKER1, a pipeline for unsupervised RNA-Seq-based genome annotation that combines the advantages of GeneMark-ET and AUGUSTUS. As input, BRAKER1 requires a genome assembly file and a file in bam-format with spliced alignments of RNA-Seq reads to the genome. First, GeneMark-ET performs iterative training and generates initial gene structures. Second, AUGUSTUS uses predicted genes for training and then integrates RNA-Seq read information into final gene predictions. In our experiments, we observed that BRAKER1 was more accurate than MAKER2 when it is using RNA-Seq as sole source for training and prediction. BRAKER1 does not require pre-trained parameters or a separate expert-prepared training step. AVAILABILITY AND IMPLEMENTATION: BRAKER1 is available for download at http://bioinf.uni-greifswald.de/bioinf/braker/ and http://exon.gatech.edu/GeneMark/ CONTACT: katharina.hoff@uni-greifswald.de or borodovsky@gatech.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

Sequence Analysis, RNA , Eukaryota , Genome , RNA , Software

8.

Identification of protein coding regions in RNA transcripts.

Tang, Shiyuyun; Lomsadze, Alexandre; Borodovsky, Mark.

Nucleic Acids Res ; 43(12): e78, 2015 Jul 13.

Article in English | MEDLINE | ID: mdl-25870408

ABSTRACT

Massive parallel sequencing of RNA transcripts by next-generation technology (RNA-Seq) generates critically important data for eukaryotic gene discovery. Gene finding in transcripts can be done by statistical (alignment-free) as well as by alignment-based methods. We describe a new tool, GeneMarkS-T, for ab initio identification of protein-coding regions in RNA transcripts. The algorithm parameters are estimated by unsupervised training which makes unnecessary manually curated preparation of training sets. We demonstrate that (i) the unsupervised training is robust with respect to the presence of transcripts assembly errors and (ii) the accuracy of GeneMarkS-T in identifying protein-coding regions and, particularly, in predicting translation initiation sites in modelled as well as in assembled transcripts compares favourably to other existing methods.

Subjects

Gene Expression Profiling , High-Throughput Nucleotide Sequencing/methods , Open Reading Frames , Sequence Analysis, RNA/methods , Software , Algorithms , Animals , Arabidopsis/genetics , Drosophila melanogaster/genetics , Genes , Mice , Peptide Chain Initiation, Translational , RNA, Messenger/chemistry , Schizosaccharomyces/genetics

9.

UnSplicer: mapping spliced RNA-Seq reads in compact genomes and filtering noisy splicing.

Burns, Paul D; Li, Yang; Ma, Jian; Borodovsky, Mark.

Nucleic Acids Res ; 42(4): e25, 2014 Feb.

Article in English | MEDLINE | ID: mdl-24259430

ABSTRACT

Accurate mapping of spliced RNA-Seq reads to genomic DNA has been known as a challenging problem. Despite significant efforts invested in developing efficient algorithms, with the human genome as a primary focus, the best solution is still not known. A recently introduced tool, TrueSight, has demonstrated better performance compared with earlier developed algorithms such as TopHat and MapSplice. To improve detection of splice junctions, TrueSight uses information on statistical patterns of nucleotide ordering in intronic and exonic DNA. This line of research led to yet another new algorithm, UnSplicer, designed for eukaryotic species with compact genomes where functional alternative splicing is likely to be dominated by splicing noise. Genome-specific parameters of the new algorithm are generated by GeneMark-ES, an ab initio gene prediction algorithm based on unsupervised training. UnSplicer shares several components with TrueSight; the difference lies in the training strategy and the classification algorithm. We tested UnSplicer on RNA-Seq data sets of Arabidopsis thaliana, Caenorhabditis elegans, Cryptococcus neoformans and Drosophila melanogaster. We have shown that splice junctions inferred by UnSplicer are in better agreement with knowledge accumulated on these well-studied genomes than predictions made by earlier developed tools.

Subjects

Algorithms , Alternative Splicing , RNA Splice Sites , Sequence Alignment/methods , Sequence Analysis, RNA/methods , Animals , Genome

10.

Integration of mapped RNA-Seq reads into automatic training of eukaryotic gene finding algorithm.

Lomsadze, Alexandre; Burns, Paul D; Borodovsky, Mark.

Nucleic Acids Res ; 42(15): e119, 2014 Sep.

Article in English | MEDLINE | ID: mdl-24990371

ABSTRACT

We present a new approach to automatic training of a eukaryotic ab initio gene finding algorithm. With the advent of Next-Generation Sequencing, automatic training has become paramount, allowing genome annotation pipelines to keep pace with the speed of genome sequencing. Earlier we developed GeneMark-ES, currently the only gene finding algorithm for eukaryotic genomes that performs automatic training in unsupervised ab initio mode. The new algorithm, GeneMark-ET augments GeneMark-ES with a novel method that integrates RNA-Seq read alignments into the self-training procedure. Use of 'assembled' RNA-Seq transcripts is far from trivial; significant error rate of assembly was revealed in recent assessments. We demonstrated in computational experiments that the proposed method of incorporation of 'unassembled' RNA-Seq reads improves the accuracy of gene prediction; particularly, for the 1.3 GB genome of Aedes aegypti the mean value of prediction Sensitivity and Specificity at the gene level increased over GeneMark-ES by 24.5%. In the current surge of genomic data when the need for accurate sequence annotation is higher than ever, GeneMark-ET will be a valuable addition to the narrow arsenal of automatic gene prediction tools.

Subjects

Algorithms , Genes , High-Throughput Nucleotide Sequencing/methods , Sequence Alignment/methods , Sequence Analysis, RNA/methods , Animals , Culicidae/genetics , Drosophila melanogaster/genetics , Gene Expression Profiling , Genes, Insect

11.

GeneTack database: genes with frameshifts in prokaryotic genomes and eukaryotic mRNA sequences.

Antonov, Ivan; Baranov, Pavel; Borodovsky, Mark.

Nucleic Acids Res ; 41(Database issue): D152-6, 2013 Jan.

Article in English | MEDLINE | ID: mdl-23161689

ABSTRACT

Database annotations of prokaryotic genomes and eukaryotic mRNA sequences pay relatively low attention to frame transitions that disrupt protein-coding genes. Frame transitions (frameshifts) could be caused by sequencing errors or indel mutations inside protein-coding regions. Other observed frameshifts are related to recoding events (that evolved to control expression of some genes). Earlier, we have developed an algorithm and software program GeneTack for ab initio frameshift finding in intronless genes. Here, we describe a database (freely available at http://topaz.gatech.edu/GeneTack/db.html) containing genes with frameshifts (fs-genes) predicted by GeneTack. The database includes 206 991 fs-genes from 1106 complete prokaryotic genomes and 45 295 frameshifts predicted in mRNA sequences from 100 eukaryotic genomes. The whole set of fs-genes was grouped into clusters based on sequence similarity between fs-proteins (conceptually translated fs-genes), conservation of the frameshift position and frameshift direction (-1, +1). The fs-genes can be retrieved by similarity search to a given query sequence via a web interface, by fs-gene cluster browsing, etc. Clusters of fs-genes are characterized with respect to their likely origin, such as pseudogenization, phase variation, etc. The largest clusters contain fs-genes with programed frameshifts (related to recoding events).

Subjects

Databases, Genetic , Frameshift Mutation , Frameshifting, Ribosomal , Eukaryota/genetics , Genes , Genome, Bacterial , Genomics/methods , Internet , Position-Specific Scoring Matrices , RNA, Messenger/chemistry , Software

12.

Identification of the nature of reading frame transitions observed in prokaryotic genomes.

Antonov, Ivan; Coakley, Arthur; Atkins, John F; Baranov, Pavel V; Borodovsky, Mark.

Nucleic Acids Res ; 41(13): 6514-30, 2013 Jul.

Article in English | MEDLINE | ID: mdl-23649834

ABSTRACT

Our goal was to identify evolutionary conserved frame transitions in protein coding regions and to uncover an underlying functional role of these structural aberrations. We used the ab initio frameshift prediction program, GeneTack, to detect reading frame transitions in 206 991 genes (fs-genes) from 1106 complete prokaryotic genomes. We grouped 102 731 fs-genes into 19 430 clusters based on sequence similarity between protein products (fs-proteins) as well as conservation of predicted position of the frameshift and its direction. We identified 4010 pseudogene clusters and 146 clusters of fs-genes apparently using recoding (local deviation from using standard genetic code) due to possessing specific sequence motifs near frameshift positions. Particularly interesting was finding of a novel type of organization of the dnaX gene, where recoding is required for synthesis of the longer subunit, τ. We selected 20 clusters of predicted recoding candidates and designed a series of genetic constructs with a reporter gene or affinity tag whose expression would require a frameshift event. Expression of the constructs in Escherichia coli demonstrated enrichment of the set of candidates with sequences that trigger genuine programmed ribosomal frameshifting; we have experimentally confirmed four new families of programmed frameshifts.

Subjects

Frameshifting, Ribosomal , Genome, Archaeal , Genome, Bacterial , Reading Frames , DNA Polymerase III/genetics , Genes, Archaeal , Genes, Bacterial , Pseudogenes , Transcription, Genetic , Transposases/genetics

13.

TrueSight: a new algorithm for splice junction detection using RNA-seq.

Li, Yang; Li-Byarlay, Hongmei; Burns, Paul; Borodovsky, Mark; Robinson, Gene E; Ma, Jian.

Nucleic Acids Res ; 41(4): e51, 2013 Feb 01.

Article in English | MEDLINE | ID: mdl-23254332

ABSTRACT

RNA-seq has proven to be a powerful technique for transcriptome profiling based on next-generation sequencing (NGS) technologies. However, due to the short length of NGS reads, it is challenging to accurately map RNA-seq reads to splice junctions (SJs), which is a critically important step in the analysis of alternative splicing (AS) and isoform construction. In this article, we describe a new method, called TrueSight, which for the first time combines RNA-seq read mapping quality and coding potential of genomic sequences into a unified model. The model is further utilized in a machine-learning approach to precisely identify SJs. Both simulations and real data evaluations showed that TrueSight achieved higher sensitivity and specificity than other methods. We applied TrueSight to new high coverage honey bee RNA-seq data to discover novel splice forms. We found that 60.3% of honey bee multi-exon genes are alternatively spliced. By utilizing gene models improved by TrueSight, we characterized AS types in honey bee transcriptome. We believe that TrueSight will be highly useful to comprehensively study the biology of alternative splicing.

Subjects

Algorithms , Alternative Splicing , Gene Expression Profiling , High-Throughput Nucleotide Sequencing , RNA Splice Sites , Sequence Analysis, RNA/methods , Animals , Artificial Intelligence , Bees/genetics , Genomics/methods , Humans , Models, Genetic

14.

MetaGeneTack: ab initio detection of frameshifts in metagenomic sequences.

Tang, Shiyuyun; Antonov, Ivan; Borodovsky, Mark.

Bioinformatics ; 29(1): 114-6, 2013 Jan 01.

Article in English | MEDLINE | ID: mdl-23129300

ABSTRACT

SUMMARY: Frameshift (FS) prediction is important for analysis and biological interpretation of metagenomic sequences. Since a genomic context of a short metagenomic sequence is rarely known, there is not enough data available to estimate parameters of species-specific statistical models of protein-coding and non-coding regions. The challenge of ab initio FS detection is, therefore, two fold: (i) to find a way to infer necessary model parameters and (ii) to identify positions of frameshifts (if any). Here we describe a new tool, MetaGeneTack, which uses a heuristic method to estimate parameters of sequence models used in the FS detection algorithm. It is shown on multiple test sets that the MetaGeneTack FS detection performance is comparable or better than the one of earlier developed program FragGeneScan. AVAILABILITY AND IMPLEMENTATION: MetaGeneTack is available as a web server at http://exon.gatech.edu/GeneTack/cgi/metagenetack.cgi. Academic users can download a standalone version of the program from http://exon.gatech.edu/license_download.cgi.

Subjects

Algorithms , Frameshift Mutation , Metagenomics/methods , Models, Statistical , Software

15.

GeneMark-ETP: Automatic Gene Finding in Eukaryotic Genomes in Consistency with Extrinsic Data.

Bruna, Tomas; Lomsadze, Alexandre; Borodovsky, Mark.

bioRxiv ; 2024 Jan 03.

Article in English | MEDLINE | ID: mdl-36711453

ABSTRACT

New large-scale genomic initiatives, such as the Earth BioGenome Project, require efficient methods for eukaryotic genome annotation. A new automatic tool, GeneMark-ETP, presented here, finds genes by integration of genomic-, transcriptomic- and protein-derived evidence. The algorithm was developed with a focus on large plant and animal genomes. GeneMark-ETP first identifies genomic loci where extrinsic data is sufficient for gene prediction with 'high confidence' and then proceeds with finding the remaining genes across the whole genome. The initial set of parameters of the statistical model is estimated on the training set made from the high confidence genes. Subsequently, the model parameters are iteratively updated in the rounds of gene prediction and parameter re-estimation. Upon reaching convergence, GeneMark-ETP makes the final predictions of the whole complement of genes. The GeneMark-ETP performance was expectably better than the performance of GeneMark-ET or GeneMark-EP+, the gene finders using a single type of extrinsic evidence, either short RNA-seq reads or mapped to genome homologous proteins. Subsequently, for comparisons with the tools utilizing both transcript- and protein-derived extrinsic evidence, we have chosen MAKER2 and a more recent tool, TSEBRA, combining BRAKER1 and BRAKER2. The results demonstrated that GeneMark-ETP delivered state-of-the-art gene prediction accuracy with the margin of outperforming existing approaches increasing for larger and more complex eukaryotic genomes.

16.

BRAKER3: Fully automated genome annotation using RNA-seq and protein evidence with GeneMark-ETP, AUGUSTUS and TSEBRA.

Gabriel, Lars; Bruna, Tomás; Hoff, Katharina J; Ebel, Matthis; Lomsadze, Alexandre; Borodovsky, Mark; Stanke, Mario.

bioRxiv ; 2024 Feb 29.

Article in English | MEDLINE | ID: mdl-37398387

ABSTRACT

Gene prediction has remained an active area of bioinformatics research for a long time. Still, gene prediction in large eukaryotic genomes presents a challenge that must be addressed by new algorithms. The amount and significance of the evidence available from transcriptomes and proteomes vary across genomes, between genes and even along a single gene. User-friendly and accurate annotation pipelines that can cope with such data heterogeneity are needed. The previously developed annotation pipelines BRAKER1 and BRAKER2 use RNA-seq or protein data, respectively, but not both. A further significant performance improvement was made by the recently released GeneMark-ETP integrating all three data types. We here present the BRAKER3 pipeline that builds on GeneMark-ETP and AUGUSTUS and further improves accuracy using the TSEBRA combiner. BRAKER3 annotates protein-coding genes in eukaryotic genomes using both short-read RNA-seq and a large protein database, along with statistical models learned iteratively and specifically for the target genome. We benchmarked the new pipeline on genomes of 11 species under assumed level of relatedness of the target species proteome to available proteomes. BRAKER3 outperformed BRAKER1 and BRAKER2. The average transcript-level F1-score was increased by ~20 percentage points on average, while the difference was most pronounced for species with large and complex genomes. BRAKER3 also outperformed other existing tools, MAKER2, Funannotate and FINDER. The code of BRAKER3 is available on GitHub and as a ready-to-run Docker container for execution with Docker or Singularity. Overall, BRAKER3 is an accurate, easy-to-use tool for eukaryotic genome annotation.

17.

The Chlorella variabilis NC64A genome reveals adaptation to photosymbiosis, coevolution with viruses, and cryptic sex.

Blanc, Guillaume; Duncan, Garry; Agarkova, Irina; Borodovsky, Mark; Gurnon, James; Kuo, Alan; Lindquist, Erika; Lucas, Susan; Pangilinan, Jasmyn; Polle, Juergen; Salamov, Asaf; Terry, Astrid; Yamada, Takashi; Dunigan, David D; Grigoriev, Igor V; Claverie, Jean-Michel; Van Etten, James L.

Plant Cell ; 22(9): 2943-55, 2010 Sep.

Article in English | MEDLINE | ID: mdl-20852019

ABSTRACT

Chlorella variabilis NC64A, a unicellular photosynthetic green alga (Trebouxiophyceae), is an intracellular photobiont of Paramecium bursaria and a model system for studying virus/algal interactions. We sequenced its 46-Mb nuclear genome, revealing an expansion of protein families that could have participated in adaptation to symbiosis. NC64A exhibits variations in GC content across its genome that correlate with global expression level, average intron size, and codon usage bias. Although Chlorella species have been assumed to be asexual and nonmotile, the NC64A genome encodes all the known meiosis-specific proteins and a subset of proteins found in flagella. We hypothesize that Chlorella might have retained a flagella-derived structure that could be involved in sexual reproduction. Furthermore, a survey of phytohormone pathways in chlorophyte algae identified algal orthologs of Arabidopsis thaliana genes involved in hormone biosynthesis and signaling, suggesting that these functions were established prior to the evolution of land plants. We show that the ability of Chlorella to produce chitinous cell walls likely resulted from the capture of metabolic genes by horizontal gene transfer from algal viruses, prokaryotes, or fungi. Analysis of the NC64A genome substantially advances our understanding of the green lineage evolution, including the genomic interplay with viruses and symbiosis between eukaryotes.

Subjects

Chlorella/genetics , Molecular Evolution , Plant Genome , Symbiosis , Base Composition , Cell Wall/metabolism , Chlorella/virology , Plant DNA/genetics , Expressed Sequence Tags , Flagella/genetics , Molecular Sequence Data , Multigene Family , Plant Growth Regulators/genetics , Nucleic Acid Repetitive Sequences , Reproduction , DNA Sequence Analysis

18.

Insights into evolution of multicellular fungi from the assembled chromosomes of the mushroom Coprinopsis cinerea (Coprinus cinereus).

Stajich, Jason E; Wilke, Sarah K; Ahrén, Dag; Au, Chun Hang; Birren, Bruce W; Borodovsky, Mark; Burns, Claire; Canbäck, Björn; Casselton, Lorna A; Cheng, C K; Deng, Jixin; Dietrich, Fred S; Fargo, David C; Farman, Mark L; Gathman, Allen C; Goldberg, Jonathan; Guigó, Roderic; Hoegger, Patrick J; Hooker, James B; Huggins, Ashleigh; James, Timothy Y; Kamada, Takashi; Kilaru, Sreedhar; Kodira, Chinnapa; Kües, Ursula; Kupfer, Doris; Kwan, H S; Lomsadze, Alexandre; Li, Weixi; Lilly, Walt W; Ma, Li-Jun; Mackey, Aaron J; Manning, Gerard; Martin, Francis; Muraguchi, Hajime; Natvig, Donald O; Palmerini, Heather; Ramesh, Marilee A; Rehmeyer, Cathy J; Roe, Bruce A; Shenoy, Narmada; Stanke, Mario; Ter-Hovhannisyan, Vardges; Tunlid, Anders; Velagapudi, Rajesh; Vision, Todd J; Zeng, Qiandong; Zolan, Miriam E; Pukkila, Patricia J.

Proc Natl Acad Sci U S A ; 107(26): 11889-94, 2010 Jun 29.

Article in English | MEDLINE | ID: mdl-20547848

ABSTRACT

The mushroom Coprinopsis cinerea is a classic experimental model for multicellular development in fungi because it grows on defined media, completes its life cycle in 2 weeks, produces some 10^8 synchronized meiocytes, and can be manipulated at all stages in development by mutation and transformation. The 37-megabase genome of C. cinerea was sequenced and assembled into 13 chromosomes. Meiotic recombination rates vary greatly along the chromosomes, and retrotransposons are absent in large regions of the genome with low levels of meiotic recombination. Single-copy genes with identifiable orthologs in other basidiomycetes are predominant in low-recombination regions of the chromosome. In contrast, paralogous multicopy genes are found in the highly recombining regions, including a large family of protein kinases (FunK1) unique to multicellular fungi. Analyses of P450 and hydrophobin gene families confirmed that local gene duplications drive the expansions of paralogous copies and the expansions occur in independent lineages of Agaricomycotina fungi. Gene-expression patterns from microarrays were used to dissect the transcriptional program of dikaryon formation (mating). Several members of the FunK1 kinase family are differentially regulated during sexual morphogenesis, and coordinate regulation of adjacent duplications is rare. The genomes of C. cinerea and Laccaria bicolor, a symbiotic basidiomycete, share extensive regions of synteny. The largest syntenic blocks occur in regions with low meiotic recombination rates, no transposable elements, and tight gene spacing, where orthologous single-copy genes are overrepresented. The chromosome assembly of C. cinerea is an essential resource in understanding the evolution of multicellularity in the fungi.

Subjects

Fungal Chromosomes/genetics , Coprinus/genetics , Molecular Evolution , Base Sequence , Chromosome Mapping , Coprinus/cytology , Coprinus/growth & development , Cytochrome P-450 Enzyme System/genetics , DNA Primers/genetics , Fungal Proteins/genetics , Gene Duplication , Fungal Genome , Meiosis/genetics , Molecular Sequence Data , Multigene Family , Phylogeny , Protein Kinases/genetics , Fungal RNA/genetics , Genetic Recombination , Retroelements/genetics

19.

MgCod: Gene Prediction in Phage Genomes with Multiple Genetic Codes.

Pfennig, Aaron; Lomsadze, Alexandre; Borodovsky, Mark.

J Mol Biol ; 435(14): 168159, 2023 Jul 15.

Article in English | MEDLINE | ID: mdl-37244571

ABSTRACT

Massive sequencing of microbiomes has led to the discovery of a large number of phage genomes with intermittent stop codon recoding. We have developed a computational tool, MgCod, that identifies genomic regions (blocks) with distinct stop codon recoding simultaneously with the prediction of protein-coding regions. When MgCod was used to scan a large volume of human metagenomic contigs, hundreds of viral contigs with intermittent stop codon recoding were revealed. Many of these contigs originated from genomes of known crAssphages. Further analyses showed that intermittent recoding was associated with subtle patterns in the organization of protein-coding genes, such as 'single-coding' and 'dual-coding'. The dual-coding genes, clustered into blocks, could be translated by two alternative codes, producing nearly identical proteins. The dual-coded blocks were enriched with early-stage phage genes, while the late-stage genes resided in the single-coded blocks. MgCod can identify types of stop codon recoding in novel genomic sequences in parallel with gene prediction. It is available for download from https://github.com/gatech-genemark/MgCod.

Subjects

Bacteriophages , Terminator Codon , Viral Genome , Humans , Bacteriophages/genetics , Terminator Codon/genetics , Proteins/genetics , Sequence Analysis

20.

A chromosome-length genome assembly and annotation of blackberry (Rubus argutus, cv. "Hillquist").

Bruna, Tomás; Aryal, Rishi; Dudchenko, Olga; Sargent, Daniel James; Mead, Daniel; Buti, Matteo; Cavallini, Andrea; Hytönen, Timo; Andrés, Javier; Pham, Melanie; Weisz, David; Mascagni, Flavia; Usai, Gabriele; Natali, Lucia; Bassil, Nahla; Fernandez, Gina E; Lomsadze, Alexandre; Armour, Mitchell; Olukolu, Bode; Poorten, Thomas; Britton, Caitlin; Davik, Jahn; Ashrafi, Hamid; Aiden, Erez Lieberman; Borodovsky, Mark; Worthington, Margaret.

G3 (Bethesda) ; 13(2), 2023 Feb 09.

Article in English | MEDLINE | ID: mdl-36331334

ABSTRACT

Blackberries (Rubus spp.) are the fourth most economically important berry crop worldwide. Genome assemblies and annotations have been developed for Rubus species in subgenus Idaeobatus, including black raspberry (R. occidentalis), red raspberry (R. idaeus), and R. chingii, but very few genomic resources exist for blackberries and their relatives in subgenus Rubus. Here we present a chromosome-length assembly and annotation of the diploid blackberry germplasm accession "Hillquist" (R. argutus). "Hillquist" is the only known source of primocane-fruiting (annual-fruiting) in tetraploid fresh-market blackberry breeding programs and is represented in the pedigree of many important cultivars worldwide. The "Hillquist" assembly, generated using Pacific Biosciences long reads scaffolded with high-throughput chromosome conformation capture sequencing, consisted of 298 Mb, of which 270 Mb (90%) was placed on 7 chromosome-length scaffolds with an average length of 38.6 Mb. Approximately 52.8% of the genome was composed of repetitive elements. The genome sequence was highly collinear with a novel maternal haplotype-resolved linkage map of the tetraploid blackberry selection A-2551TN and genome assemblies of R. chingii and red raspberry. A total of 38,503 protein-coding genes were predicted, of which 72% were functionally annotated. Eighteen flowering gene homologs within a previously mapped locus aligning to an 11.2 Mb region on chromosome Ra02 were identified as potential candidate genes for primocane-fruiting. The utility of the "Hillquist" genome has been demonstrated here by the development of the first genotyping-by-sequencing-based linkage map of tetraploid blackberry and the identification of possible candidate genes for primocane-fruiting. This chromosome-length assembly will facilitate future studies in Rubus biology, genetics, and genomics and strengthen applied breeding programs.

Subjects

Rubus , Rubus/genetics , Tetraploidy , Plant Breeding , Chromosome Mapping , Plant Chromosomes/genetics , Molecular Sequence Annotation
