Search | BVS IEC

1.

GeneMark-ETP significantly improves the accuracy of automatic annotation of large eukaryotic genomes.

Bruna, Tomás; Lomsadze, Alexandre; Borodovsky, Mark.

Genome Res ; 34(5): 757-768, 2024 06 25.

Article in English | MEDLINE | ID: mdl-38866548

ABSTRACT

Large-scale genomic initiatives, such as the Earth BioGenome Project, require efficient methods for eukaryotic genome annotation. Here we present an automatic gene finder, GeneMark-ETP, integrating genomic-, transcriptomic-, and protein-derived evidence that has been developed with a focus on large plant and animal genomes. GeneMark-ETP first identifies genomic loci where extrinsic data are sufficient for making gene predictions with "high confidence." The genes situated in the genomic space between the high-confidence genes are predicted in the next stage. The set of high-confidence genes serves as an initial training set for the statistical model. Further on, the model parameters are iteratively updated in the rounds of gene prediction and parameter re-estimation. Upon reaching convergence, GeneMark-ETP makes the final predictions and delivers the whole complement of predicted genes. GeneMark-ETP outperforms gene finders using a single type of extrinsic evidence. Comparisons with gene finders MAKER2 and TSEBRA, those that use both transcript- and protein-derived extrinsic evidence, show that GeneMark-ETP delivers state-of-the-art gene-prediction accuracy, with the margin of outperforming existing approaches increasing in its application to larger and more complex eukaryotic genomes.
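The cycle described in this abstract (seed the model from high-confidence genes, then alternate prediction and parameter re-estimation until convergence) can be sketched generically. The following is a toy illustration under stated assumptions, not GeneMark-ETP's code: the names `self_train`, `fit_threshold`, and `predict_threshold` are hypothetical, the "model" is a single score threshold rather than a statistical gene model, and the seed labels stand in for genes supported by extrinsic evidence.

```python
from typing import Callable, Dict, List, Sequence


def self_train(
    scores: Sequence[float],
    seed_labels: Dict[int, bool],
    fit: Callable[[Sequence[float], Sequence[bool]], float],
    predict: Callable[[float, Sequence[float]], List[bool]],
    max_rounds: int = 20,
) -> List[bool]:
    """Alternate parameter estimation and prediction until labels are stable."""

    def pin(labels: List[bool]) -> List[bool]:
        # High-confidence seed labels are trusted and never overwritten.
        for i, lab in seed_labels.items():
            labels[i] = lab
        return labels

    # Initial model: trained on the high-confidence seed items only.
    seed_idx = sorted(seed_labels)
    model = fit([scores[i] for i in seed_idx], [seed_labels[i] for i in seed_idx])
    labels = pin(predict(model, scores))

    for _ in range(max_rounds):
        model = fit(scores, labels)            # re-estimate parameters
        new_labels = pin(predict(model, scores))  # predict again
        if new_labels == labels:               # convergence: nothing changed
            break
        labels = new_labels
    return labels


def fit_threshold(scores: Sequence[float], labels: Sequence[bool]) -> float:
    # Toy "model": midpoint between the class means.
    # Assumes both classes are present in the training labels.
    pos = [s for s, lab in zip(scores, labels) if lab]
    neg = [s for s, lab in zip(scores, labels) if not lab]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def predict_threshold(threshold: float, scores: Sequence[float]) -> List[bool]:
    return [s > threshold for s in scores]


# Made-up scores; items 0 and 5 play the role of high-confidence calls.
scores = [0.1, 0.2, 0.35, 0.6, 0.8, 0.9]
result = self_train(scores, {0: False, 5: True}, fit_threshold, predict_threshold)
```

The stopping rule here (identical labels in two consecutive rounds) is one simple choice of convergence criterion; the abstract does not say which criterion the tool uses.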

Subjects

Molecular Sequence Annotation , Molecular Sequence Annotation/methods , Animals , Software , Genome , Genomics/methods , Eukaryota/genetics , Algorithms

2.

BRAKER3: Fully automated genome annotation using RNA-seq and protein evidence with GeneMark-ETP, AUGUSTUS, and TSEBRA.

Gabriel, Lars; Bruna, Tomás; Hoff, Katharina J; Ebel, Matthis; Lomsadze, Alexandre; Borodovsky, Mark; Stanke, Mario.

Genome Res ; 34(5): 769-777, 2024 06 25.

Article in English | MEDLINE | ID: mdl-38866550

ABSTRACT

Gene prediction has remained an active area of bioinformatics research for a long time. Still, gene prediction in large eukaryotic genomes presents a challenge that must be addressed by new algorithms. The amount and significance of the evidence available from transcriptomes and proteomes vary across genomes, between genes, and even along a single gene. User-friendly and accurate annotation pipelines that can cope with such data heterogeneity are needed. The previously developed annotation pipelines BRAKER1 and BRAKER2 use RNA-seq or protein data, respectively, but not both. A further significant performance improvement integrating all three data types was made by the recently released GeneMark-ETP. We here present the BRAKER3 pipeline that builds on GeneMark-ETP and AUGUSTUS, and further improves accuracy using the TSEBRA combiner. BRAKER3 annotates protein-coding genes in eukaryotic genomes using both short-read RNA-seq and a large protein database, along with statistical models learned iteratively and specifically for the target genome. We benchmarked the new pipeline on genomes of 11 species under an assumed level of relatedness of the target species proteome to available proteomes. BRAKER3 outperforms BRAKER1 and BRAKER2. The average transcript-level F1-score is increased by about 20 percentage points on average, whereas the difference is most pronounced for species with large and complex genomes. BRAKER3 also outperforms other existing tools, MAKER2, Funannotate, and FINDER. The code of BRAKER3 is available on GitHub and as a ready-to-run Docker container for execution with Docker or Singularity. Overall, BRAKER3 is an accurate, easy-to-use tool for eukaryotic genome annotation.
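For readers unfamiliar with the metric, the F1-score is conventionally the harmonic mean of sensitivity and precision (precision is often called "specificity" in the gene-prediction literature). A minimal sketch follows; the function name is hypothetical and the counts are invented for illustration, not taken from the paper.

```python
def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Harmonic mean of sensitivity and precision from raw counts."""
    if true_pos == 0:
        return 0.0  # no correct predictions: both rates are zero
    sensitivity = true_pos / (true_pos + false_neg)  # share of reference found
    precision = true_pos / (true_pos + false_pos)    # share of predictions correct
    return 2 * sensitivity * precision / (sensitivity + precision)


# Invented example: 60 correct transcripts, 20 wrong predictions, 40 missed.
example_f1 = f1_score(true_pos=60, false_pos=20, false_neg=40)
```

Note that a gain of "20 percentage points" is an absolute difference (for instance, from 0.50 to 0.70), not a 20% relative increase.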

Subjects

Molecular Sequence Annotation , Software , Molecular Sequence Annotation/methods , Humans , RNA-Seq/methods , Algorithms , Animals , Genome , Computational Biology/methods , Genomics/methods , Transcriptome

3.

Modeling leaderless transcription and atypical genes results in more accurate gene prediction in prokaryotes.

Lomsadze, Alexandre; Gemayel, Karl; Tang, Shiyuyun; Borodovsky, Mark.

Genome Res ; 28(7): 1079-1089, 2018 07.

Article in English | MEDLINE | ID: mdl-29773659

ABSTRACT

In a conventional view of the prokaryotic genome organization, promoters precede operons and ribosome binding sites (RBSs) with Shine-Dalgarno consensus precede genes. However, recent experimental research suggesting a more diverse view motivated us to develop an algorithm with improved gene-finding accuracy. We describe GeneMarkS-2, an ab initio algorithm that uses a model derived by self-training for finding species-specific (native) genes, along with an array of precomputed "heuristic" models designed to identify harder-to-detect genes (likely horizontally transferred). Importantly, we designed GeneMarkS-2 to identify several types of distinct sequence patterns (signals) involved in gene expression control, among them the patterns characteristic for leaderless transcription as well as noncanonical RBS patterns. To assess the accuracy of GeneMarkS-2, we used genes validated by COG (Clusters of Orthologous Groups) annotation, proteomics experiments, and N-terminal protein sequencing. We observed that GeneMarkS-2 performed better on average in all accuracy measures when compared with the current state-of-the-art gene prediction tools. Furthermore, the screening of ~5000 representative prokaryotic genomes made by GeneMarkS-2 predicted frequent leaderless transcription in both archaea and bacteria. We also observed that the RBS sites in some species with leadered transcription did not necessarily exhibit the Shine-Dalgarno consensus. The modeling of different types of sequence motifs regulating gene expression prompted a division of prokaryotic genomes into five categories with distinct sequence patterns around the gene starts.

Subjects

Archaea/genetics , Bacteria/genetics , Bacterial Genes/genetics , Prokaryotic Cells/metabolism , Genetic Transcription/genetics , Algorithms , Binding Sites/genetics , Computational Biology/methods , Molecular Sequence Annotation/methods , Operon/genetics , Protein Biosynthesis/genetics , Proteomics/methods , Ribosomes/genetics

4.

NCBI prokaryotic genome annotation pipeline.

Tatusova, Tatiana; DiCuccio, Michael; Badretdin, Azat; Chetvernin, Vyacheslav; Nawrocki, Eric P; Zaslavsky, Leonid; Lomsadze, Alexandre; Pruitt, Kim D; Borodovsky, Mark; Ostell, James.

Nucleic Acids Res ; 44(14): 6614-24, 2016 08 19.

Article in English | MEDLINE | ID: mdl-27342282

ABSTRACT

Recent technological advances have opened unprecedented opportunities for large-scale sequencing and analysis of populations of pathogenic species in disease outbreaks, as well as for large-scale diversity studies aimed at expanding our knowledge across the whole domain of prokaryotes. To meet the challenge of timely interpretation of structure, function and meaning of this vast genetic information, a comprehensive approach to automatic genome annotation is critically needed. In collaboration with Georgia Tech, NCBI has developed a new approach to genome annotation that combines alignment based methods with methods of predicting protein-coding and RNA genes and other functional elements directly from sequence. A new gene finding tool, GeneMarkS+, uses the combined evidence of protein and RNA placement by homology as an initial map of annotation to generate and modify ab initio gene predictions across the whole genome. Thus, the new NCBI's Prokaryotic Genome Annotation Pipeline (PGAP) relies more on sequence similarity when confident comparative data are available, while it relies more on statistical predictions in the absence of external evidence. The pipeline provides a framework for generation and analysis of annotation on the full breadth of prokaryotic taxonomy. For additional information on PGAP see https://www.ncbi.nlm.nih.gov/genome/annotation_prok/ and the NCBI Handbook, https://www.ncbi.nlm.nih.gov/books/NBK174280/.

Subjects

Bacterial Genome , Molecular Sequence Annotation , Prokaryotic Cells/metabolism , Bacteria/genetics , Bacterial Proteins/chemistry , Nucleic Acid Databases , Bacterial Genes

5.

BRAKER1: Unsupervised RNA-Seq-Based Genome Annotation with GeneMark-ET and AUGUSTUS.

Hoff, Katharina J; Lange, Simone; Lomsadze, Alexandre; Borodovsky, Mark; Stanke, Mario.

Bioinformatics ; 32(5): 767-9, 2016 03 01.

Article in English | MEDLINE | ID: mdl-26559507

ABSTRACT

MOTIVATION: Gene finding in eukaryotic genomes is notoriously difficult to automate. The task is to design a work flow with a minimal set of tools that would reach state-of-the-art performance across a wide range of species. GeneMark-ET is a gene prediction tool that incorporates RNA-Seq data into unsupervised training and subsequently generates ab initio gene predictions. AUGUSTUS is a gene finder that usually requires supervised training and uses information from RNA-Seq reads in the prediction step. Complementary strengths of GeneMark-ET and AUGUSTUS provided motivation for designing a new combined tool for automatic gene prediction. RESULTS: We present BRAKER1, a pipeline for unsupervised RNA-Seq-based genome annotation that combines the advantages of GeneMark-ET and AUGUSTUS. As input, BRAKER1 requires a genome assembly file and a file in bam-format with spliced alignments of RNA-Seq reads to the genome. First, GeneMark-ET performs iterative training and generates initial gene structures. Second, AUGUSTUS uses predicted genes for training and then integrates RNA-Seq read information into final gene predictions. In our experiments, we observed that BRAKER1 was more accurate than MAKER2 when it is using RNA-Seq as sole source for training and prediction. BRAKER1 does not require pre-trained parameters or a separate expert-prepared training step. AVAILABILITY AND IMPLEMENTATION: BRAKER1 is available for download at http://bioinf.uni-greifswald.de/bioinf/braker/ and http://exon.gatech.edu/GeneMark/ CONTACT: katharina.hoff@uni-greifswald.de or borodovsky@gatech.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

RNA Sequence Analysis , Eukaryota , Genome , RNA , Software

6.

Identification of protein coding regions in RNA transcripts.

Tang, Shiyuyun; Lomsadze, Alexandre; Borodovsky, Mark.

Nucleic Acids Res ; 43(12): e78, 2015 Jul 13.

Article in English | MEDLINE | ID: mdl-25870408

ABSTRACT

Massive parallel sequencing of RNA transcripts by next-generation technology (RNA-Seq) generates critically important data for eukaryotic gene discovery. Gene finding in transcripts can be done by statistical (alignment-free) as well as by alignment-based methods. We describe a new tool, GeneMarkS-T, for ab initio identification of protein-coding regions in RNA transcripts. The algorithm parameters are estimated by unsupervised training which makes unnecessary manually curated preparation of training sets. We demonstrate that (i) the unsupervised training is robust with respect to the presence of transcripts assembly errors and (ii) the accuracy of GeneMarkS-T in identifying protein-coding regions and, particularly, in predicting translation initiation sites in modelled as well as in assembled transcripts compares favourably to other existing methods.

Subjects

Gene Expression Profiling , High-Throughput Nucleotide Sequencing/methods , Open Reading Frames , RNA Sequence Analysis/methods , Software , Algorithms , Animals , Arabidopsis/genetics , Drosophila melanogaster/genetics , Genes , Mice , Translational Peptide Chain Initiation , Messenger RNA/chemistry , Schizosaccharomyces/genetics

7.

Integration of mapped RNA-Seq reads into automatic training of eukaryotic gene finding algorithm.

Lomsadze, Alexandre; Burns, Paul D; Borodovsky, Mark.

Nucleic Acids Res ; 42(15): e119, 2014 Sep.

Article in English | MEDLINE | ID: mdl-24990371

ABSTRACT

We present a new approach to automatic training of a eukaryotic ab initio gene finding algorithm. With the advent of Next-Generation Sequencing, automatic training has become paramount, allowing genome annotation pipelines to keep pace with the speed of genome sequencing. Earlier we developed GeneMark-ES, currently the only gene finding algorithm for eukaryotic genomes that performs automatic training in unsupervised ab initio mode. The new algorithm, GeneMark-ET augments GeneMark-ES with a novel method that integrates RNA-Seq read alignments into the self-training procedure. Use of 'assembled' RNA-Seq transcripts is far from trivial; significant error rate of assembly was revealed in recent assessments. We demonstrated in computational experiments that the proposed method of incorporation of 'unassembled' RNA-Seq reads improves the accuracy of gene prediction; particularly, for the 1.3 GB genome of Aedes aegypti the mean value of prediction Sensitivity and Specificity at the gene level increased over GeneMark-ES by 24.5%. In the current surge of genomic data when the need for accurate sequence annotation is higher than ever, GeneMark-ET will be a valuable addition to the narrow arsenal of automatic gene prediction tools.
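The "mean value of prediction Sensitivity and Specificity at the gene level" can be illustrated with a short sketch. It rests on two assumptions that the abstract does not spell out: a predicted gene counts as correct only if its whole exon structure matches a reference gene exactly, and "Specificity" means the fraction of predicted genes that are correct (the usual convention in gene-finder evaluation). The `Gene` representation, the function name, and the coordinates are hypothetical.

```python
from typing import AbstractSet, Tuple

# A gene is identified by sequence ID, strand, and its ordered exon coordinates.
Gene = Tuple[str, str, Tuple[Tuple[int, int], ...]]


def gene_level_accuracy(
    predicted: AbstractSet[Gene], reference: AbstractSet[Gene]
) -> Tuple[float, float, float]:
    """Return (sensitivity, specificity, mean of the two) using exact matches."""
    correct = len(predicted & reference)  # genes identical in both sets
    sensitivity = correct / len(reference) if reference else 0.0
    specificity = correct / len(predicted) if predicted else 0.0
    return sensitivity, specificity, (sensitivity + specificity) / 2


# Invented coordinates: 4 reference genes, 5 predictions, 3 exact matches.
reference = {
    ("chr1", "+", ((100, 200), (300, 400))),
    ("chr1", "-", ((900, 1200),)),
    ("chr2", "+", ((50, 150), (250, 350), (500, 650))),
    ("chr2", "-", ((2000, 2300),)),
}
predicted = {
    ("chr1", "+", ((100, 200), (300, 400))),            # exact match
    ("chr1", "-", ((900, 1200),)),                      # exact match
    ("chr2", "+", ((50, 150), (250, 350), (500, 650))), # exact match
    ("chr2", "-", ((2000, 2290),)),                     # wrong exon end
    ("chr3", "+", ((10, 90),)),                         # no reference gene
}
sn, sp, mean_acc = gene_level_accuracy(predicted, reference)
```

Exact matching is strict by design: the prediction on `chr2` that misses one exon boundary by ten bases counts as both a false positive and a missed reference gene.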

Subjects

Algorithms , Genes , High-Throughput Nucleotide Sequencing/methods , Sequence Alignment/methods , RNA Sequence Analysis/methods , Animals , Culicidae/genetics , Drosophila melanogaster/genetics , Gene Expression Profiling , Insect Genes

8.

GeneMark-ETP: Automatic Gene Finding in Eukaryotic Genomes in Consistency with Extrinsic Data.

Bruna, Tomas; Lomsadze, Alexandre; Borodovsky, Mark.

bioRxiv ; 2024 Jan 03.

Article in English | MEDLINE | ID: mdl-36711453

ABSTRACT

New large-scale genomic initiatives, such as the Earth BioGenome Project, require efficient methods for eukaryotic genome annotation. A new automatic tool, GeneMark-ETP, presented here, finds genes by integration of genomic-, transcriptomic- and protein-derived evidence. The algorithm was developed with a focus on large plant and animal genomes. GeneMark-ETP first identifies genomic loci where extrinsic data is sufficient for gene prediction with 'high confidence' and then proceeds with finding the remaining genes across the whole genome. The initial set of parameters of the statistical model is estimated on the training set made from the high confidence genes. Subsequently, the model parameters are iteratively updated in the rounds of gene prediction and parameter re-estimation. Upon reaching convergence, GeneMark-ETP makes the final predictions of the whole complement of genes. The GeneMark-ETP performance was expectably better than the performance of GeneMark-ET or GeneMark-EP+, the gene finders using a single type of extrinsic evidence, either short RNA-seq reads or mapped to genome homologous proteins. Subsequently, for comparisons with the tools utilizing both transcript- and protein-derived extrinsic evidence, we have chosen MAKER2 and a more recent tool, TSEBRA, combining BRAKER1 and BRAKER2. The results demonstrated that GeneMark-ETP delivered state-of-the-art gene prediction accuracy with the margin of outperforming existing approaches increasing for larger and more complex eukaryotic genomes.

9.

BRAKER3: Fully automated genome annotation using RNA-seq and protein evidence with GeneMark-ETP, AUGUSTUS and TSEBRA.

Gabriel, Lars; Bruna, Tomás; Hoff, Katharina J; Ebel, Matthis; Lomsadze, Alexandre; Borodovsky, Mark; Stanke, Mario.

bioRxiv ; 2024 Feb 29.

Article in English | MEDLINE | ID: mdl-37398387

ABSTRACT

Gene prediction has remained an active area of bioinformatics research for a long time. Still, gene prediction in large eukaryotic genomes presents a challenge that must be addressed by new algorithms. The amount and significance of the evidence available from transcriptomes and proteomes vary across genomes, between genes and even along a single gene. User-friendly and accurate annotation pipelines that can cope with such data heterogeneity are needed. The previously developed annotation pipelines BRAKER1 and BRAKER2 use RNA-seq or protein data, respectively, but not both. A further significant performance improvement was made by the recently released GeneMark-ETP integrating all three data types. We here present the BRAKER3 pipeline that builds on GeneMark-ETP and AUGUSTUS and further improves accuracy using the TSEBRA combiner. BRAKER3 annotates protein-coding genes in eukaryotic genomes using both short-read RNA-seq and a large protein database, along with statistical models learned iteratively and specifically for the target genome. We benchmarked the new pipeline on genomes of 11 species under assumed level of relatedness of the target species proteome to available proteomes. BRAKER3 outperformed BRAKER1 and BRAKER2. The average transcript-level F1-score was increased by ~20 percentage points on average, while the difference was most pronounced for species with large and complex genomes. BRAKER3 also outperformed other existing tools, MAKER2, Funannotate and FINDER. The code of BRAKER3 is available on GitHub and as a ready-to-run Docker container for execution with Docker or Singularity. Overall, BRAKER3 is an accurate, easy-to-use tool for eukaryotic genome annotation.

10.

Insights into evolution of multicellular fungi from the assembled chromosomes of the mushroom Coprinopsis cinerea (Coprinus cinereus).

Stajich, Jason E; Wilke, Sarah K; Ahrén, Dag; Au, Chun Hang; Birren, Bruce W; Borodovsky, Mark; Burns, Claire; Canbäck, Björn; Casselton, Lorna A; Cheng, C K; Deng, Jixin; Dietrich, Fred S; Fargo, David C; Farman, Mark L; Gathman, Allen C; Goldberg, Jonathan; Guigó, Roderic; Hoegger, Patrick J; Hooker, James B; Huggins, Ashleigh; James, Timothy Y; Kamada, Takashi; Kilaru, Sreedhar; Kodira, Chinnapa; Kües, Ursula; Kupfer, Doris; Kwan, H S; Lomsadze, Alexandre; Li, Weixi; Lilly, Walt W; Ma, Li-Jun; Mackey, Aaron J; Manning, Gerard; Martin, Francis; Muraguchi, Hajime; Natvig, Donald O; Palmerini, Heather; Ramesh, Marilee A; Rehmeyer, Cathy J; Roe, Bruce A; Shenoy, Narmada; Stanke, Mario; Ter-Hovhannisyan, Vardges; Tunlid, Anders; Velagapudi, Rajesh; Vision, Todd J; Zeng, Qiandong; Zolan, Miriam E; Pukkila, Patricia J.

Proc Natl Acad Sci U S A ; 107(26): 11889-94, 2010 Jun 29.

Article in English | MEDLINE | ID: mdl-20547848

ABSTRACT

The mushroom Coprinopsis cinerea is a classic experimental model for multicellular development in fungi because it grows on defined media, completes its life cycle in 2 weeks, produces some 10^8 synchronized meiocytes, and can be manipulated at all stages in development by mutation and transformation. The 37-megabase genome of C. cinerea was sequenced and assembled into 13 chromosomes. Meiotic recombination rates vary greatly along the chromosomes, and retrotransposons are absent in large regions of the genome with low levels of meiotic recombination. Single-copy genes with identifiable orthologs in other basidiomycetes are predominant in low-recombination regions of the chromosome. In contrast, paralogous multicopy genes are found in the highly recombining regions, including a large family of protein kinases (FunK1) unique to multicellular fungi. Analyses of P450 and hydrophobin gene families confirmed that local gene duplications drive the expansions of paralogous copies and the expansions occur in independent lineages of Agaricomycotina fungi. Gene-expression patterns from microarrays were used to dissect the transcriptional program of dikaryon formation (mating). Several members of the FunK1 kinase family are differentially regulated during sexual morphogenesis, and coordinate regulation of adjacent duplications is rare. The genomes of C. cinerea and Laccaria bicolor, a symbiotic basidiomycete, share extensive regions of synteny. The largest syntenic blocks occur in regions with low meiotic recombination rates, no transposable elements, and tight gene spacing, where orthologous single-copy genes are overrepresented. The chromosome assembly of C. cinerea is an essential resource in understanding the evolution of multicellularity in the fungi.

Subjects

Fungal Chromosomes/genetics , Coprinus/genetics , Molecular Evolution , Base Sequence , Chromosome Mapping , Coprinus/cytology , Coprinus/growth & development , Cytochrome P-450 Enzyme System/genetics , DNA Primers/genetics , Fungal Proteins/genetics , Gene Duplication , Fungal Genome , Meiosis/genetics , Molecular Sequence Data , Multigene Family , Phylogeny , Protein Kinases/genetics , Fungal RNA/genetics , Genetic Recombination , Retroelements/genetics

11.

MgCod: Gene Prediction in Phage Genomes with Multiple Genetic Codes.

Pfennig, Aaron; Lomsadze, Alexandre; Borodovsky, Mark.

J Mol Biol ; 435(14): 168159, 2023 07 15.

Article in English | MEDLINE | ID: mdl-37244571

ABSTRACT

Massive sequencing of microbiomes has led to the discovery of a large number of phage genomes with intermittent stop codon recoding. We have developed a computational tool, MgCod, that identifies genomic regions (blocks) with distinct stop codon recoding simultaneously with the prediction of protein-coding regions. When MgCod was used to scan a large volume of human metagenomic contigs hundreds of viral contigs with intermittent stop codon recoding were revealed. Many of these contigs originated from genomes of known crAssphages. Further analyses had shown that intermittent recoding was associated with subtle patterns in the organization of protein-coding genes, such as 'single-coding' and 'dual-coding'. The dual-coding genes, clustered into blocks, could be translated by two alternative codes producing nearly identical proteins. It was observed that the dual-coded blocks were enriched with the early-stage phage genes, while the late-stage genes were residing in the single-coded blocks. MgCod can identify types of stop codon recoding in novel genomic sequences in parallel with gene prediction. It is available for download from https://github.com/gatech-genemark/MgCod.

Subjects

Bacteriophages , Terminator Codon , Viral Genome , Humans , Bacteriophages/genetics , Terminator Codon/genetics , Proteins/genetics , Sequence Analysis

12.

A chromosome-length genome assembly and annotation of blackberry (Rubus argutus, cv. "Hillquist").

Bruna, Tomás; Aryal, Rishi; Dudchenko, Olga; Sargent, Daniel James; Mead, Daniel; Buti, Matteo; Cavallini, Andrea; Hytönen, Timo; Andrés, Javier; Pham, Melanie; Weisz, David; Mascagni, Flavia; Usai, Gabriele; Natali, Lucia; Bassil, Nahla; Fernandez, Gina E; Lomsadze, Alexandre; Armour, Mitchell; Olukolu, Bode; Poorten, Thomas; Britton, Caitlin; Davik, Jahn; Ashrafi, Hamid; Aiden, Erez Lieberman; Borodovsky, Mark; Worthington, Margaret.

G3 (Bethesda) ; 13(2)2023 02 09.

Article in English | MEDLINE | ID: mdl-36331334

ABSTRACT

Blackberries (Rubus spp.) are the fourth most economically important berry crop worldwide. Genome assemblies and annotations have been developed for Rubus species in subgenus Idaeobatus, including black raspberry (R. occidentalis), red raspberry (R. idaeus), and R. chingii, but very few genomic resources exist for blackberries and their relatives in subgenus Rubus. Here we present a chromosome-length assembly and annotation of the diploid blackberry germplasm accession "Hillquist" (R. argutus). "Hillquist" is the only known source of primocane-fruiting (annual-fruiting) in tetraploid fresh-market blackberry breeding programs and is represented in the pedigree of many important cultivars worldwide. The "Hillquist" assembly, generated using Pacific Biosciences long reads scaffolded with high-throughput chromosome conformation capture sequencing, consisted of 298 Mb, of which 270 Mb (90%) was placed on 7 chromosome-length scaffolds with an average length of 38.6 Mb. Approximately 52.8% of the genome was composed of repetitive elements. The genome sequence was highly collinear with a novel maternal haplotype-resolved linkage map of the tetraploid blackberry selection A-2551TN and genome assemblies of R. chingii and red raspberry. A total of 38,503 protein-coding genes were predicted, of which 72% were functionally annotated. Eighteen flowering gene homologs within a previously mapped locus aligning to an 11.2 Mb region on chromosome Ra02 were identified as potential candidate genes for primocane-fruiting. The utility of the "Hillquist" genome has been demonstrated here by the development of the first genotyping-by-sequencing-based linkage map of tetraploid blackberry and the identification of possible candidate genes for primocane-fruiting. This chromosome-length assembly will facilitate future studies in Rubus biology, genetics, and genomics and strengthen applied breeding programs.

Subjects

Rubus , Rubus/genetics , Tetraploidy , Plant Breeding , Chromosome Mapping , Plant Chromosomes/genetics , Molecular Sequence Annotation

13.

Ab initio gene identification in metagenomic sequences.

Zhu, Wenhan; Lomsadze, Alexandre; Borodovsky, Mark.

Nucleic Acids Res ; 38(12): e132, 2010 Jul.

Article in English | MEDLINE | ID: mdl-20403810

ABSTRACT

We describe an algorithm for gene identification in DNA sequences derived from shotgun sequencing of microbial communities. Accurate ab initio gene prediction in a short nucleotide sequence of anonymous origin is hampered by uncertainty in model parameters. While several machine learning approaches could be proposed to bypass this difficulty, one effective method is to estimate parameters from dependencies, formed in evolution, between frequencies of oligonucleotides in protein-coding regions and genome nucleotide composition. Original version of the method was proposed in 1999 and has been used since for (i) reconstructing codon frequency vector needed for gene finding in viral genomes and (ii) initializing parameters of self-training gene finding algorithms. With advent of new prokaryotic genomes en masse it became possible to enhance the original approach by using direct polynomial and logistic approximations of oligonucleotide frequencies, as well as by separating models for bacteria and archaea. These advances have increased the accuracy of model reconstruction and, subsequently, gene prediction. We describe the refined method and assess its accuracy on known prokaryotic genomes split into short sequences. Also, we show that as a result of application of the new method, several thousands of new genes could be added to existing annotations of several human and mouse gut metagenomes.

Subjects

Algorithms , Archaeal Genes , Bacterial Genes , Metagenomics/methods , Animals , Gastrointestinal Tract/microbiology , Archaeal Genome , Bacterial Genome , Humans , Metagenome , Mice , Statistical Models , DNA Sequence Analysis

14.

GeneMark-HM: improving gene prediction in DNA sequences of human microbiome.

Lomsadze, Alexandre; Bonny, Christophe; Strozzi, Francesco; Borodovsky, Mark.

NAR Genom Bioinform ; 3(2): lqab047, 2021 Jun.

Article in English | MEDLINE | ID: mdl-34056597

ABSTRACT

Computational reconstruction of nearly complete genomes from metagenomic reads may identify thousands of new uncultured candidate bacterial species. We have shown that reconstructed prokaryotic genomes along with genomes of sequenced microbial isolates can be used to support more accurate gene prediction in novel metagenomic sequences. We have proposed an approach that used three types of gene prediction algorithms and found for all contigs in a metagenome nearly optimal models of protein-coding regions either in libraries of pre-computed models or constructed de novo. The model selection process and gene annotation were done by the new GeneMark-HM pipeline. We have created a database of the species level pan-genomes for the human microbiome. To create a library of models representing each pan-genome we used a self-training algorithm GeneMarkS-2. Genes initially predicted in each contig served as queries for a fast similarity search through the pan-genome database. The best matches led to selection of the model for gene prediction. Contigs not assigned to pan-genomes were analyzed by crude, but still accurate models designed for sequences with particular GC compositions. Tests of GeneMark-HM on simulated metagenomes demonstrated improvement in gene annotation of human metagenomic sequences in comparison with the current state-of-the-art gene prediction tools.

15.

StartLink and StartLink+: Prediction of Gene Starts in Prokaryotic Genomes.

Gemayel, Karl; Lomsadze, Alexandre; Borodovsky, Mark.

Front Bioinform ; 1: 704157, 2021.

Article in English | MEDLINE | ID: mdl-36303749

ABSTRACT

State-of-the-art algorithms of ab initio gene prediction for prokaryotic genomes were shown to be sufficiently accurate. A pair of algorithms would agree on predictions of gene 3' ends. Nonetheless, predictions of gene starts would not match for 15-25% of genes in a genome. This discrepancy is a serious issue that is difficult to be resolved due to the absence of sufficiently large sets of genes with experimentally verified starts. We have introduced StartLink that infers gene starts from conservation patterns revealed by multiple alignments of homologous nucleotide sequences. We also have introduced StartLink+ combining both ab initio and alignment-based methods. The ability of StartLink to predict the start of a given gene is restricted by the availability of homologs in a database. We observed that StartLink made predictions for 85% of genes per genome on average. The StartLink+ accuracy was shown to be 98-99% on the sets of genes with experimentally verified starts. In comparison with database annotations, we observed that the annotated gene starts deviated from the StartLink+ predictions for ~5% of genes in AT-rich genomes and for 10-15% of genes in GC-rich genomes on average. The use of StartLink+ has a potential to significantly improve gene start annotation in genomic databases.

16.

BRAKER2: automatic eukaryotic genome annotation with GeneMark-EP+ and AUGUSTUS supported by a protein database.

Bruna, Tomás; Hoff, Katharina J; Lomsadze, Alexandre; Stanke, Mario; Borodovsky, Mark.

NAR Genom Bioinform ; 3(1): lqaa108, 2021 Mar.

Article in English | MEDLINE | ID: mdl-33575650

ABSTRACT

The task of eukaryotic genome annotation remains challenging. Only a few genomes could serve as standards of annotation achieved through a tremendous investment of human curation efforts. Still, the correctness of all alternative isoforms, even in the best-annotated genomes, could be a good subject for further investigation. The new BRAKER2 pipeline generates and integrates external protein support into the iterative process of training and gene prediction by GeneMark-EP+ and AUGUSTUS. BRAKER2 continues the line started by BRAKER1 where self-training GeneMark-ET and AUGUSTUS made gene predictions supported by transcriptomic data. Among the challenges addressed by the new pipeline was a generation of reliable hints to protein-coding exon boundaries from likely homologous but evolutionarily distant proteins. In comparison with other pipelines for eukaryotic genome annotation, BRAKER2 is fully automatic. It is favorably compared under equal conditions with other pipelines, e.g. MAKER2, in terms of accuracy and performance. Development of BRAKER2 should facilitate solving the task of harmonization of annotation of protein-coding genes in genomes of different eukaryotic species. However, we fully understand that several more innovations are needed in transcriptomic and proteomic technologies as well as in algorithmic development to reach the goal of highly accurate annotation of eukaryotic genomes.

17.

GeneMark-EP+: eukaryotic gene prediction with self-training in the space of genes and proteins.

Bruna, Tomás; Lomsadze, Alexandre; Borodovsky, Mark.

NAR Genom Bioinform ; 2(2): lqaa026, 2020 Jun.

Article in English | MEDLINE | ID: mdl-32440658

ABSTRACT

We have made several steps toward creating a fast and accurate algorithm for gene prediction in eukaryotic genomes. First, we introduced an automated method for efficient ab initio gene finding, GeneMark-ES, with parameters trained in iterative unsupervised mode. Next, in GeneMark-ET we proposed a method of integration of unsupervised training with information on intron positions revealed by mapping short RNA reads. Now we describe GeneMark-EP, a tool that utilizes another source of external information, a protein database, readily available prior to the start of a sequencing project. A new specialized pipeline, ProtHint, initiates massive protein mapping to the genome and extracts hints to splice sites and translation start and stop sites of potential genes. GeneMark-EP uses the hints to improve estimation of model parameters as well as to adjust coordinates of predicted genes if they disagree with the most reliable hints (the -EP+ mode). Tests of GeneMark-EP and -EP+ demonstrated improvements in gene prediction accuracy in comparison with GeneMark-ES, while GeneMark-EP+ showed higher accuracy than GeneMark-ET. We have observed that the most pronounced improvements in gene prediction accuracy happened in large eukaryotic genomes.

18.

Bioinformatics Pipeline for Human Papillomavirus Short Read Genomic Sequences Classification Using Support Vector Machine.

Lomsadze, Alexandre; Li, Tengguo; Rajeevan, Mangalathu S; Unger, Elizabeth R; Borodovsky, Mark.

Viruses ; 12(7), 2020 06 30.

Article in English | MEDLINE | ID: mdl-32629900

ABSTRACT

We recently developed a test based on the Agilent SureSelect target enrichment system capturing genomic fragments from 191 human papillomavirus (HPV) types for Illumina sequencing. This enriched whole genome sequencing (eWGS) assay provides an approach to identify all HPV types in a sample. Here we present a machine learning algorithm that calls HPV types based on the eWGS output. The algorithm, based on the support vector machine (SVM) technique, was trained on eWGS data from 122 control samples with known HPV types. The new algorithm demonstrated good performance in HPV type detection for designed samples with 25 or greater HPV plasmid copies per sample. We compared the results of HPV typing made by the new algorithm for 261 residual epidemiologic samples with the results of the typing delivered by the standard HPV Linear Array (LA). The agreement between methods (97.4%) was substantial (kappa = 0.783). However, the new algorithm additionally identified 428 instances of HPV types not detectable by the LA assay by design. Overall, we have demonstrated that the bioinformatics pipeline is an accurate tool for calling HPV types by analyzing data generated by eWGS processing of DNA fragments extracted from control and epidemiological samples.

Subjects

Alphapapillomavirus/classification , Alphapapillomavirus/genetics , Computational Biology/methods , Papillomavirus Infections/virology , Algorithms , Alphapapillomavirus/chemistry , Alphapapillomavirus/metabolism , Computational Biology/instrumentation , Genomics , Humans , Support Vector Machine

19.

Development of a workflow for identification of nuclear genotyping markers for Cyclospora cayetanensis.

Houghton, Katelyn A; Lomsadze, Alexandre; Park, Subin; Nascimento, Fernanda S; Barratt, Joel; Arrowood, Michael J; VanRoey, Erik; Talundzic, Eldin; Borodovsky, Mark; Qvarnstrom, Yvonne.

Parasite ; 27: 24, 2020.

Article in English | MEDLINE | ID: mdl-32275020

ABSTRACT

Cyclospora cayetanensis is an intestinal parasite responsible for the diarrheal illness, cyclosporiasis. Molecular genotyping, using targeted amplicon sequencing, provides a complementary tool for outbreak investigations, especially when epidemiological data are insufficient for linking cases and identifying clusters. The goal of this study was to identify candidate genotyping markers using a novel workflow for detection of segregating single nucleotide polymorphisms (SNPs) in C. cayetanensis genomes. Four whole C. cayetanensis genomes were compared using this workflow and four candidate markers were selected for evaluation of their genotyping utility by PCR and Sanger sequencing. These four markers covered 13 SNPs and resolved parasites from 57 stool specimens, differentiating C. cayetanensis into 19 new unique genotypes.

Subjects

Cyclospora/genetics , Protozoan DNA/genetics , Protozoan Genome , Genotyping Techniques , Workflow , Cyclospora/classification , Genetic Markers , Molecular Biology/methods , Single Nucleotide Polymorphism

20.

In silico identification of genes in bacteriophage DNA.

Kropinski, Andrew M; Borodovsky, Mark; Carver, Tim J; Cerdeño-Tárraga, Ana M; Darling, Aaron; Lomsadze, Alexandre; Mahadevan, Padmanabhan; Stothard, Paul; Seto, Donald; Van Domselaar, Gary; Wishart, David S.

Methods Mol Biol ; 502: 57-89, 2009.

Article in English | MEDLINE | ID: mdl-19082552

ABSTRACT

One of the most satisfying aspects of a genome sequencing project is the identification of the genes contained within it. These are of two types: those which encode tRNAs and those which produce proteins. After a general introduction on the properties of protein-encoding genes and the utility of the Basic Local Alignment Search Tool (BLASTX) to identify genes through homologs, a variety of tools are discussed by their creators. These include, for genome annotation: GeneMark, Artemis, and BASys; and, for genome comparisons: Artemis Comparison Tool (ACT), Mauve, CoreGenes, and GeneOrder.

Subjects

Bacteriophages/genetics , Computational Biology/methods , Viral DNA/genetics , Viral DNA/analysis , Software
