1.

Nachtweide, Stefanie; Romoth, Lars; Stanke, Mario.

Methods Mol Biol ; 2802: 165-187, 2024.

Article in English | MEDLINE | ID: mdl-38819560

ABSTRACT

Newly sequenced genomes are being added to the tree of life at an unprecedentedly fast pace. A large proportion of such new genomes are phylogenetically close to previously sequenced and annotated genomes. In other cases, whole clades of closely related species or strains ought to be annotated simultaneously. Often, in subsequent studies, the differences between the closely related species or strains are the focus of research, while shared gene structures prevail. We here review methods for comparative structural genome annotation. The reviewed methods include classical approaches such as the alignment of protein sequences or protein profiles against the genome and comparative gene prediction methods that exploit a genome alignment to annotate either a single target genome or all input genomes simultaneously. We discuss how the methods depend on the phylogenetic placement of genomes, give advice on the choice of methods, and examine the consistency between gene structure annotations in an example. Furthermore, we provide practical advice on genome annotation in general.

Subject(s)

Genomics , Molecular Sequence Annotation , Phylogeny , Molecular Sequence Annotation/methods , Genomics/methods , Computational Biology/methods , Genome/genetics , Sequence Alignment/methods , Software

2.

BRAKER3: Fully automated genome annotation using RNA-seq and protein evidence with GeneMark-ETP, AUGUSTUS and TSEBRA.

Gabriel, Lars; Bruna, Tomás; Hoff, Katharina J; Ebel, Matthis; Lomsadze, Alexandre; Borodovsky, Mark; Stanke, Mario.

bioRxiv ; 2024 Feb 29.

Article in English | MEDLINE | ID: mdl-37398387

ABSTRACT

Gene prediction has remained an active area of bioinformatics research for a long time. Still, gene prediction in large eukaryotic genomes presents a challenge that must be addressed by new algorithms. The amount and significance of the evidence available from transcriptomes and proteomes vary across genomes, between genes and even along a single gene. User-friendly and accurate annotation pipelines that can cope with such data heterogeneity are needed. The previously developed annotation pipelines BRAKER1 and BRAKER2 use RNA-seq or protein data, respectively, but not both. A further significant performance improvement was made by the recently released GeneMark-ETP integrating all three data types. We here present the BRAKER3 pipeline that builds on GeneMark-ETP and AUGUSTUS and further improves accuracy using the TSEBRA combiner. BRAKER3 annotates protein-coding genes in eukaryotic genomes using both short-read RNA-seq and a large protein database, along with statistical models learned iteratively and specifically for the target genome. We benchmarked the new pipeline on genomes of 11 species under an assumed level of relatedness of the target species proteome to available proteomes. BRAKER3 outperformed BRAKER1 and BRAKER2. The transcript-level F1-score was increased by ~20 percentage points on average, while the difference was most pronounced for species with large and complex genomes. BRAKER3 also outperformed other existing tools, MAKER2, Funannotate and FINDER. The code of BRAKER3 is available on GitHub and as a ready-to-run Docker container for execution with Docker or Singularity. Overall, BRAKER3 is an accurate, easy-to-use tool for eukaryotic genome annotation.
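The transcript-level F1-score quoted in the abstract above can be illustrated with a minimal sketch. It assumes that a predicted transcript counts as correct only if its set of coding exons matches a reference transcript exactly, and that F1 is the harmonic mean of sensitivity and precision. The function name `transcript_f1` and the toy coordinates are hypothetical and are not taken from BRAKER3.

```python
def transcript_f1(predicted, reference):
    """Return (sensitivity, precision, F1) for two collections of transcripts.

    Each transcript is a hashable description of its coding structure,
    here a tuple of (start, end) exon coordinates (an assumption of this sketch).
    """
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)  # exact structure matches
    sensitivity = true_positives / len(reference) if reference else 0.0
    precision = true_positives / len(predicted) if predicted else 0.0
    if sensitivity + precision == 0:
        return sensitivity, precision, 0.0
    f1 = 2 * sensitivity * precision / (sensitivity + precision)
    return sensitivity, precision, f1


# Hypothetical toy data: three reference transcripts, two predictions.
reference = [((100, 200), (300, 400)), ((1000, 1200),), ((5000, 5100), (5300, 5400))]
predicted = [((100, 200), (300, 400)), ((1000, 1250),)]  # second one has a wrong exon end

sn, pr, f1 = transcript_f1(predicted, reference)
print(f"sensitivity={sn:.3f} precision={pr:.3f} F1={f1:.3f}")
```

In this toy case one of three reference transcripts is recovered and one of two predictions is correct, so a gain of 20 percentage points would correspond to moving from F1 = 0.40 to F1 = 0.60.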

3.

Galba: genome annotation with miniprot and AUGUSTUS.

Bruna, Tomás; Li, Heng; Guhlin, Joseph; Honsel, Daniel; Herbold, Steffen; Stanke, Mario; Nenasheva, Natalia; Ebel, Matthis; Gabriel, Lars; Hoff, Katharina J.

BMC Bioinformatics ; 24(1): 327, 2023 Aug 31.

Article in English | MEDLINE | ID: mdl-37653395

ABSTRACT

BACKGROUND: The Earth Biogenome Project has rapidly increased the number of available eukaryotic genomes, but most released genomes continue to lack annotation of protein-coding genes. In addition, no transcriptome data is available for some genomes. RESULTS: Various gene annotation tools have been developed but each has its limitations. Here, we introduce GALBA, a fully automated pipeline that utilizes miniprot, a rapid protein-to-genome aligner, in combination with AUGUSTUS to predict genes with high accuracy. Accuracy results indicate that GALBA is particularly strong in the annotation of large vertebrate genomes. We also present use cases in insects, vertebrates, and a land plant. GALBA is fully open source and available as a docker image for easy execution with Singularity in high-performance computing environments. CONCLUSIONS: Our pipeline addresses the critical need for accurate gene annotation in newly sequenced genomes, and we believe that GALBA will greatly facilitate genome annotation for diverse organisms.

Subject(s)

Eukaryota , Eukaryotic Cells , Animals , Molecular Sequence Annotation , Transcriptome

4.

GALBA: Genome Annotation with Miniprot and AUGUSTUS.

Bruna, Tomás; Li, Heng; Guhlin, Joseph; Honsel, Daniel; Herbold, Steffen; Stanke, Mario; Nenasheva, Natalia; Ebel, Matthis; Gabriel, Lars; Hoff, Katharina J.

bioRxiv ; 2023 Apr 10.

Article in English | MEDLINE | ID: mdl-37090650

ABSTRACT

The Earth Biogenome Project has rapidly increased the number of available eukaryotic genomes, but most released genomes continue to lack annotation of protein-coding genes. In addition, no transcriptome data is available for some genomes. Various gene annotation tools have been developed but each has its limitations. Here, we introduce GALBA, a fully automated pipeline that utilizes miniprot, a rapid protein-to-genome aligner, in combination with AUGUSTUS to predict genes with high accuracy. Accuracy results indicate that GALBA is particularly strong in the annotation of large vertebrate genomes. We also present use cases in insects, vertebrates, and a previously unannotated land plant. GALBA is fully open source and available as a docker image for easy execution with Singularity in high-performance computing environments. Our pipeline addresses the critical need for accurate gene annotation in newly sequenced genomes, and we believe that GALBA will greatly facilitate genome annotation for diverse organisms.

5.

learnMSA: learning and aligning large protein families.

Becker, Felix; Stanke, Mario.

Gigascience ; 11, 2022 11 18.

Article in English | MEDLINE | ID: mdl-36399060

ABSTRACT

BACKGROUND: The alignment of large numbers of protein sequences is a challenging task and its importance grows rapidly along with the size of biological datasets. State-of-the-art algorithms have a tendency to produce less accurate alignments with an increasing number of sequences. This is a fundamental problem since many downstream tasks rely on accurate alignments. RESULTS: We present learnMSA, a novel statistical learning approach of profile hidden Markov models (pHMMs) based on batch gradient descent. Fundamentally different from popular aligners, we fit a custom recurrent neural network architecture for (p)HMMs to potentially millions of sequences with respect to a maximum a posteriori objective and decode an alignment. We rely on automatic differentiation of the log-likelihood, and thus, our approach is different from existing HMM training algorithms like Baum-Welch. Our method does not involve progressive, regressive, or divide-and-conquer heuristics. We use uniform batch sampling to adapt to large datasets in linear time without the requirement of a tree. When tested on ultra-large protein families with up to 3.5 million sequences, learnMSA is both more accurate and faster than state-of-the-art tools. On the established benchmarks HomFam and BaliFam with smaller sequence sets, it matches state-of-the-art performance. All experiments were done on a standard workstation with a GPU. CONCLUSIONS: Our results show that learnMSA does not share the counterintuitive drawback of many popular heuristic aligners, which can substantially lose accuracy when many additional homologs are input. LearnMSA is a future-proof framework for large alignments with many opportunities for further improvements.

Subject(s)

Algorithms , Proteins , Sequence Alignment , Amino Acid Sequence , Benchmarking

6.

Global, highly specific and fast filtering of alignment seeds.

Ebel, Matthis; Migliorelli, Giovanna; Stanke, Mario.

BMC Bioinformatics ; 23(1): 225, 2022 Jun 10.

Article in English | MEDLINE | ID: mdl-35689182

ABSTRACT

BACKGROUND: An important initial phase of arguably most homology search and alignment methods, such as those required for genome alignments, is seed finding. The seed finding step is crucial to curb the runtime as potential alignments are restricted to and anchored at the sequence position pairs that constitute the seed. To identify seeds, it is good practice to use sets of spaced seed patterns, a method that locally compares two sequences and requires exact matches at certain positions only. RESULTS: We introduce a new method for filtering alignment seeds that we call geometric hashing. Geometric hashing achieves a high specificity by combining non-local information from different seeds using a simple hash function that only requires a constant and small amount of additional time per spaced seed. Geometric hashing was tested on the task of finding homologous positions in the coding regions of human and mouse genome sequences. Thereby, the number of false positives was decreased about a million-fold over sets of spaced seeds while maintaining a very high sensitivity. CONCLUSIONS: An additional geometric hashing filtering phase could improve the run-time, accuracy or both of programs for various homology-search-and-align tasks.

Subject(s)

Algorithms , Genome , Animals , Mice , Sequence Alignment
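The spaced seed patterns mentioned in the abstract above can be sketched as follows. This minimal example covers only plain spaced-seed matching with a single pattern, in which a `1` marks a position that must match exactly and a `0` marks a position that is ignored. It does not implement the geometric hashing filter of the paper. The pattern `1101`, the two sequences and both function names are hypothetical choices for illustration.

```python
from collections import defaultdict


def spaced_kmer(seq, pos, pattern):
    """Return the characters of seq at the '1' positions of pattern, starting at pos."""
    return "".join(seq[pos + i] for i, c in enumerate(pattern) if c == "1")


def spaced_seed_hits(a, b, pattern):
    """Return sorted (i, j) pairs where a[i:] and b[j:] agree on every '1' position."""
    span = len(pattern)
    index = defaultdict(list)
    # Index all spaced k-mers of the first sequence.
    for i in range(len(a) - span + 1):
        index[spaced_kmer(a, i, pattern)].append(i)
    hits = []
    # Look up each spaced k-mer of the second sequence in the index.
    for j in range(len(b) - span + 1):
        for i in index.get(spaced_kmer(b, j, pattern), []):
            hits.append((i, j))
    return sorted(hits)


a = "ACGTACGT"
b = "ACCTACGA"  # differs from a at positions 2 and 7
hits = spaced_seed_hits(a, b, "1101")
print(hits)  # [(0, 0), (3, 3), (4, 0)]
```

The hit (0, 0) pairs `ACGT` with `ACCT`, which a contiguous 4-mer seed would miss because of the mismatch at the ignored third position. The hit (4, 0) is a spurious match of the kind that a subsequent filtering phase is meant to remove.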

7.

End-to-end learning of evolutionary models to find coding regions in genome alignments.

Mertsch, Darvin; Stanke, Mario.

Bioinformatics ; 38(7): 1857-1862, 2022 03 28.

Article in English | MEDLINE | ID: mdl-35060608

ABSTRACT

MOTIVATION: The comparison of genomes using models of molecular evolution is a powerful approach for finding, or toward understanding, functional elements. In particular, comparative genomics is a fundamental building brick in annotating ever larger sets of alignable genomes completely, accurately and consistently. RESULTS: We here present our new program ClaMSA that classifies multiple sequence alignments using a phylogenetic model. It uses a novel continuous-time Markov chain machine learning layer, named CTMC, whose parameters are learned end-to-end and together with (recurrent) neural networks for a learning task. We trained ClaMSA discriminatively to classify aligned codon sequences that are candidates of coding regions into coding or non-coding and obtained four times fewer false positives for this task on vertebrate and fly alignments than existing methods at the same true positive rate. ClaMSA and the CTMC layer are general tools that could be used for other machine learning tasks on tree-related sequence data. AVAILABILITY AND IMPLEMENTATION: Freely from https://github.com/Gaius-Augustus/clamsa. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Biological Evolution , Evolution, Molecular , Phylogeny , Genomics , Machine Learning
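As background for the continuous-time Markov chain (CTMC) layer named in the abstract above: a CTMC with rate matrix Q has transition probabilities P(t) = exp(Qt) along a branch of length t. The sketch below shows only this relation and is not ClaMSA's implementation. It assumes a fixed 4-state Jukes-Cantor rate matrix, whereas ClaMSA learns its parameters and works on codon alignments, and it uses a truncated Taylor series in pure Python. The names `mat_mul` and `transition_matrix` are hypothetical.

```python
import math


def mat_mul(x, y):
    """Multiply two square matrices given as lists of lists."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def transition_matrix(q, t, terms=30):
    """Approximate P(t) = exp(Q t) by the Taylor series sum over (Q t)^k / k!."""
    n = len(q)
    qt = [[q[i][j] * t for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # k = 0 term
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = mat_mul(term, qt)
        term = [[v / k for v in row] for row in term]  # term is now (Q t)^k / k!
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result


# Jukes-Cantor rate matrix: every substitution has rate alpha, rows sum to zero.
alpha = 1.0
Q = [[-3 * alpha if i == j else alpha for j in range(4)] for i in range(4)]
P = transition_matrix(Q, 0.1)

# Known closed form for this model, used here as a sanity check.
stay = 0.25 + 0.75 * math.exp(-4 * alpha * 0.1)
print(P[0][0], stay)
```

The truncated series is adequate only because Qt has small entries here. For larger rate matrices or longer branches, a numerically robust matrix exponential routine would be the safer choice.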

8.

TSEBRA: transcript selector for BRAKER.

Gabriel, Lars; Hoff, Katharina J; Bruna, Tomás; Borodovsky, Mark; Stanke, Mario.

BMC Bioinformatics ; 22(1): 566, 2021 Nov 25.

Article in English | MEDLINE | ID: mdl-34823473

ABSTRACT

BACKGROUND: BRAKER is a suite of automatic pipelines, BRAKER1 and BRAKER2, for the accurate annotation of protein-coding genes in eukaryotic genomes. Each pipeline trains statistical models of protein-coding genes based on provided evidence and then predicts protein-coding genes in genomic sequences using both the extrinsic evidence and statistical models. For training and prediction, BRAKER1 and BRAKER2 incorporate complementary extrinsic evidence: BRAKER1 uses only RNA-seq data while BRAKER2 uses only a database of cross-species proteins. The BRAKER suite has so far not been able to reliably exceed the accuracy of BRAKER1 and BRAKER2 when incorporating both types of evidence simultaneously. Currently, for a novel genome project where both RNA-seq and protein data are available, the best option is to run both pipelines independently and to pick the one output that is likely better. Therefore, one or the other type of extrinsic evidence would remain unexploited. RESULTS: We present TSEBRA, a software tool that selects gene predictions (transcripts) from the sets generated by BRAKER1 and BRAKER2. TSEBRA uses a set of rules to compare scores of overlapping transcripts based on their support by RNA-seq and homologous protein evidence. We show in computational experiments on genomes of 11 species that TSEBRA achieves higher accuracy than either BRAKER1 or BRAKER2 running alone and that TSEBRA compares favorably with the combiner tool EVidenceModeler. CONCLUSION: TSEBRA is an easy-to-use and fast software tool. It can be used in concert with the BRAKER pipeline to generate a gene prediction set supported by both RNA-seq and homologous protein evidence.

Subject(s)

Genome , Software , Genomics , RNA-Seq , Sequence Analysis, RNA
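The idea of comparing scores of overlapping transcripts, as described in the abstract above, can be illustrated with a deliberately simplified toy. It is not TSEBRA's rule set: it assumes that each transcript carries a single precomputed evidence-support score and that, among overlapping transcripts, only the best-scoring one is kept. This also discards alternative isoforms, which a real combiner need not do. The record fields, identifiers and the function `select_transcripts` are hypothetical.

```python
def overlaps(a, b):
    """Return True if two transcripts on the same sequence share at least one position."""
    return not (a["end"] < b["start"] or b["end"] < a["start"])


def select_transcripts(transcripts):
    """Greedy selection: visit transcripts by descending score and keep
    each one that does not overlap an already kept transcript."""
    kept = []
    for tx in sorted(transcripts, key=lambda t: t["score"], reverse=True):
        if not any(overlaps(tx, k) for k in kept):
            kept.append(tx)
    return sorted(kept, key=lambda t: t["start"])


# Hypothetical predictions from two gene finders; "score" stands for the
# fraction of a transcript's features supported by extrinsic evidence.
candidates = [
    {"id": "b1_t1", "start": 100, "end": 500, "score": 0.9},
    {"id": "b2_t1", "start": 150, "end": 520, "score": 0.6},
    {"id": "b2_t2", "start": 800, "end": 1200, "score": 0.7},
    {"id": "b1_t2", "start": 1100, "end": 1500, "score": 0.4},
]
selected_ids = [tx["id"] for tx in select_transcripts(candidates)]
print(selected_ids)  # ['b1_t1', 'b2_t2']
```

The combined set takes one transcript from each input set, which is the kind of outcome that choosing a single pipeline's output as a whole cannot produce.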

9.

Application of YOLOv4 for Detection and Motion Monitoring of Red Foxes.

Schütz, Anne K; Schöler, Verena; Krause, E Tobias; Fischer, Mareike; Müller, Thomas; Freuling, Conrad M; Conraths, Franz J; Stanke, Mario; Homeier-Bachmann, Timo; Lentz, Hartmut H K.

Animals (Basel) ; 11(6), 2021 Jun 09.

Article in English | MEDLINE | ID: mdl-34207726

ABSTRACT

Animal activity is an indicator of animal welfare, and manual observation is time and cost intensive. Therefore, automatic detection and monitoring of live captive animals is of major importance for assessing animal activity and thereby allowing for early recognition of changes indicative of diseases and animal welfare issues. We demonstrate that machine learning methods can provide gapless monitoring of red foxes in an experimental lab setting, including a classification into activity patterns. To this end, bounding boxes are used to measure fox movements and thus the activity level of the animals. We use computer vision as a non-invasive method for the automatic monitoring of foxes. More specifically, we train the existing algorithm 'you only look once' version 4 (YOLOv4) to detect foxes, and the trained classifier is applied to video data of an experiment involving foxes. As we show, computer evaluation outperforms other evaluation methods. Automatic detection of foxes can be used for detecting different movement patterns. These, in turn, can be used for animal behavioral analysis and, thus, animal welfare monitoring. Once established for a specific animal species, such systems could be used for animal monitoring in real time under experimental conditions or in other areas of animal husbandry.

10.

Pseudomonas Strains Induce Transcriptional and Morphological Changes and Reduce Root Colonization of Verticillium spp.

Harting, Rebekka; Nagel, Alexandra; Nesemann, Kai; Höfer, Annalena M; Bastakis, Emmanouil; Kusch, Harald; Stanley, Claire E; Stöckli, Martina; Kaever, Alexander; Hoff, Katharina J; Stanke, Mario; deMello, Andrew J; Künzler, Markus; Haney, Cara H; Braus-Stromeyer, Susanna A; Braus, Gerhard H.

Front Microbiol ; 12: 652468, 2021.

Article in English | MEDLINE | ID: mdl-34108946

ABSTRACT

Phytopathogenic Verticillia cause Verticillium wilt on numerous economically important crops. Plant infection begins at the roots, where the fungus is confronted with rhizosphere inhabiting bacteria. The effects of different fluorescent pseudomonads, including some known biocontrol agents of other plant pathogens, on fungal growth of the haploid Verticillium dahliae and/or the amphidiploid Verticillium longisporum were compared on pectin-rich medium, in microfluidic interaction channels, allowing visualization of single hyphae, or on Arabidopsis thaliana roots. We found that the potential for formation of bacterial lipopeptide syringomycin resulted in stronger growth reduction effects on saprophytic Aspergillus nidulans compared to Verticillium spp. A more detailed analysis of bacterial-fungal co-cultivation in narrow interaction channels of microfluidic devices revealed that the strongest inhibitory potential was found for Pseudomonas protegens CHA0, with its inhibitory potential depending on the presence of the GacS/GacA system controlling several bacterial metabolites. Hyphal tip polarity was altered when V. longisporum was confronted with pseudomonads in narrow interaction channels, resulting in a curly morphology instead of straight hyphal tip growth. These results support the hypothesis that the fungus attempts to evade the bacterial confrontation. Alterations due to co-cultivation with bacteria could not only be observed in fungal morphology but also in fungal transcriptome. P. protegens CHA0 alters transcriptional profiles of V. longisporum during 2 h liquid media co-cultivation in pectin-rich medium. Genes required for degradation of and growth on the carbon source pectin were down-regulated, whereas transcripts involved in redox processes were up-regulated.
Thus, the secondary metabolite mediated effect of Pseudomonas isolates on Verticillium species results in a complex transcriptional response, leading to decreased growth with precautions for self-protection combined with the initiation of a change in fungal growth direction. This interplay of bacterial effects on the pathogen can be beneficial to protect plants from infection, as shown with A. thaliana root experiments. Treatment of the roots with bacteria prior to infection with V. dahliae resulted in a significant reduction of fungal root colonization. Taken together we demonstrate how pseudomonads interfere with the growth of Verticillium spp. and show that these bacteria could serve in plant protection.

11.

The genomic basis of evolutionary differentiation among honey bees.

Fouks, Bertrand; Brand, Philipp; Nguyen, Hung N; Herman, Jacob; Camara, Francisco; Ence, Daniel; Hagen, Darren E; Hoff, Katharina J; Nachweide, Stefanie; Romoth, Lars; Walden, Kimberly K O; Guigo, Roderic; Stanke, Mario; Narzisi, Giuseppe; Yandell, Mark; Robertson, Hugh M; Koeniger, Nikolaus; Chantawannakul, Panuwan; Schatz, Michael C; Worley, Kim C; Robinson, Gene E; Elsik, Christine G; Rueppell, Olav.

Genome Res ; 31(7): 1203-1215, 2021 Jul.

Article in English | MEDLINE | ID: mdl-33947700

ABSTRACT

In contrast to the western honey bee, Apis mellifera, other honey bee species have been largely neglected despite their importance and diversity. The genetic basis of the evolutionary diversification of honey bees remains largely unknown. Here, we provide a genome-wide comparison of three honey bee species, each representing one of the three subgenera of honey bees, namely the dwarf (Apis florea), giant (A. dorsata), and cavity-nesting (A. mellifera) honey bees with bumblebees as an outgroup. Our analyses resolve the phylogeny of honey bees with the dwarf honey bees diverging first. We find that evolution of increased eusocial complexity in Apis proceeds via increases in the complexity of gene regulation, which is in agreement with previous studies. However, this process seems to be related to pathways other than transcriptional control. Positive selection patterns across Apis reveal a trade-off between maintaining genome stability and generating genetic diversity, with a rapidly evolving piRNA pathway leading to genomes depleted of transposable elements, and a rapidly evolving DNA repair pathway associated with high recombination rates in all Apis species. Diversification within Apis is accompanied by positive selection in several genes whose putative functions present candidate mechanisms for lineage-specific adaptations, such as migration, immunity, and nesting behavior.

12.

A 20-kb lineage-specific genomic region tames virulence in pathogenic amphidiploid Verticillium longisporum.

Harting, Rebekka; Starke, Jessica; Kusch, Harald; Pöggeler, Stefanie; Maurus, Isabel; Schlüter, Rabea; Landesfeind, Manuel; Bulla, Ingo; Nowrousian, Minou; de Jonge, Ronnie; Stahlhut, Gertrud; Hoff, Katharina J; Aßhauer, Kathrin P; Thürmer, Andrea; Stanke, Mario; Daniel, Rolf; Morgenstern, Burkhard; Thomma, Bart P H J; Kronstad, James W; Braus-Stromeyer, Susanna A; Braus, Gerhard H.

Mol Plant Pathol ; 22(8): 939-953, 2021 08.

Article in English | MEDLINE | ID: mdl-33955130

ABSTRACT

Amphidiploid fungal Verticillium longisporum strains Vl43 and Vl32 colonize the plant host Brassica napus but differ in their ability to cause disease symptoms. These strains represent two V. longisporum lineages derived from different hybridization events of haploid parental Verticillium strains. Vl32 and Vl43 carry same-sex mating-type genes derived from both parental lineages. Vl32 and Vl43 similarly colonize and penetrate plant roots, but asymptomatic Vl32 proliferation in planta is lower than virulent Vl43. The highly conserved Vl43 and Vl32 genomes include less than 1% unique genes, and the karyotypes of 15 or 16 chromosomes display changed genetic synteny due to substantial genomic reshuffling. A 20 kb Vl43 lineage-specific (LS) region apparently originating from the Verticillium dahliae-related ancestor is specific for symptomatic Vl43 and encodes seven genes, including two putative transcription factors. Either partial or complete deletion of this LS region in Vl43 did not reduce virulence but led to induction of even more severe disease symptoms in rapeseed. This suggests that the LS insertion in the genome of symptomatic V. longisporum Vl43 mediates virulence-reducing functions, limits damage on the host plant, and therefore tames Vl43 from being even more virulent.

Subject(s)

Plant Diseases , Verticillium , Ascomycota , Genomics , Plant Diseases/genetics , Verticillium/genetics , Virulence/genetics

13.

BRAKER2: automatic eukaryotic genome annotation with GeneMark-EP+ and AUGUSTUS supported by a protein database.

Bruna, Tomás; Hoff, Katharina J; Lomsadze, Alexandre; Stanke, Mario; Borodovsky, Mark.

NAR Genom Bioinform ; 3(1): lqaa108, 2021 Mar.

Article in English | MEDLINE | ID: mdl-33575650

ABSTRACT

The task of eukaryotic genome annotation remains challenging. Only a few genomes could serve as standards of annotation achieved through a tremendous investment of human curation efforts. Still, the correctness of all alternative isoforms, even in the best-annotated genomes, could be a good subject for further investigation. The new BRAKER2 pipeline generates and integrates external protein support into the iterative process of training and gene prediction by GeneMark-EP+ and AUGUSTUS. BRAKER2 continues the line started by BRAKER1 where self-training GeneMark-ET and AUGUSTUS made gene predictions supported by transcriptomic data. Among the challenges addressed by the new pipeline was the generation of reliable hints to protein-coding exon boundaries from likely homologous but evolutionarily distant proteins. In comparison with other pipelines for eukaryotic genome annotation, BRAKER2 is fully automatic. It compares favorably under equal conditions with other pipelines, e.g. MAKER2, in terms of accuracy and performance. Development of BRAKER2 should facilitate solving the task of harmonization of annotation of protein-coding genes in genomes of different eukaryotic species. However, we fully understand that several more innovations are needed in transcriptomic and proteomic technologies as well as in algorithmic development to reach the goal of highly accurate annotation of eukaryotic genomes.

14.

Enhanced genome assembly and a new official gene set for Tribolium castaneum.

Herndon, Nicolae; Shelton, Jennifer; Gerischer, Lizzy; Ioannidis, Panos; Ninova, Maria; Dönitz, Jürgen; Waterhouse, Robert M; Liang, Chun; Damm, Carsten; Siemanowski, Janna; Kitzmann, Peter; Ulrich, Julia; Dippel, Stefan; Oberhofer, Georg; Hu, Yonggang; Schwirz, Jonas; Schacht, Magdalena; Lehmann, Sabrina; Montino, Alice; Posnien, Nico; Gurska, Daniela; Horn, Thorsten; Seibert, Jan; Vargas Jentzsch, Iris M; Panfilio, Kristen A; Li, Jianwei; Wimmer, Ernst A; Stappert, Dominik; Roth, Siegfried; Schröder, Reinhard; Park, Yoonseong; Schoppmeier, Michael; Chung, Ho-Ryun; Klingler, Martin; Kittelmann, Sebastian; Friedrich, Markus; Chen, Rui; Altincicek, Boran; Vilcinskas, Andreas; Zdobnov, Evgeny; Griffiths-Jones, Sam; Ronshaugen, Matthew; Stanke, Mario; Brown, Sue J; Bucher, Gregor.

BMC Genomics ; 21(1): 47, 2020 Jan 14.

Article in English | MEDLINE | ID: mdl-31937263

ABSTRACT

BACKGROUND: The red flour beetle Tribolium castaneum has emerged as an important model organism for the study of gene function in development and physiology, for ecological and evolutionary genomics, for pest control and a plethora of other topics. RNA interference (RNAi), transgenesis and genome editing are well established and the resources for genome-wide RNAi screening have become available in this model. All these techniques depend on a high quality genome assembly and precise gene models. However, the first version of the genome assembly was generated by Sanger sequencing, and with a small set of RNA sequence data limiting annotation quality. RESULTS: Here, we present an improved genome assembly (Tcas5.2) and an enhanced genome annotation resulting in a new official gene set (OGS3) for Tribolium castaneum, which significantly increase the quality of the genomic resources. By adding large-distance jumping library DNA sequencing to join scaffolds and fill small gaps, the gaps in the genome assembly were reduced and the N50 increased to 4753 kbp. The precision of the gene models was enhanced by the use of a large body of RNA-Seq reads of different life history stages and tissue types, leading to the discovery of 1452 novel gene sequences. We also added new features such as alternative splicing, well defined UTRs and microRNA target predictions. For quality control, 399 gene models were evaluated by manual inspection. The current gene set was submitted to Genbank and accepted as a RefSeq genome by NCBI. CONCLUSIONS: The new genome assembly (Tcas5.2) and the official gene set (OGS3) provide enhanced genomic resources for genetic work in Tribolium castaneum. The much improved information on transcription start sites supports transgenic and gene editing approaches. Further, novel types of information such as splice variants and microRNA target genes open additional possibilities for analysis.

Subject(s)

Genes, Insect , Genome, Insect , Genomics , Tribolium/genetics , Animals , Binding Sites , Computational Biology/methods , Genomics/methods , MicroRNAs/genetics , Molecular Sequence Annotation , Phylogeny , RNA Interference , Reproducibility of Results
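The N50 statistic reported in the abstract above has a standard definition: it is the length L such that scaffolds of length at least L together contain at least half of all assembled bases. A minimal sketch follows, in which the function name `n50` and the scaffold lengths are made up for illustration and are unrelated to the Tribolium assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths (0 for an empty input)."""
    total = sum(lengths)
    running = 0
    # Add sequences from longest to shortest until half of the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


# Hypothetical scaffold lengths in kbp; the total is 300 kbp.
scaffolds = [80, 70, 50, 40, 30, 20, 10]
print(n50(scaffolds))  # 70, since 80 + 70 = 150 covers half of 300
```

Joining scaffolds raises N50 without adding sequence. Merging the 80 and 70 kbp scaffolds of this toy set into one 150 kbp scaffold would lift the N50 from 70 to 150.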

15.

VARUS: sampling complementary RNA reads from the sequence read archive.

Stanke, Mario; Bruhn, Willy; Becker, Felix; Hoff, Katharina J.

BMC Bioinformatics ; 20(1): 558, 2019 Nov 08.

Article in English | MEDLINE | ID: mdl-31703556

ABSTRACT

BACKGROUND: Vast amounts of next generation sequencing RNA data have been deposited in archives, accompanying very diverse original studies. The data is readily available also for other purposes such as genome annotation or transcriptome assembly. However, selecting a subset of available experiments, sequencing runs and reads for this purpose is a nontrivial task and complicated by the inhomogeneity of the data. RESULTS: This article presents the software VARUS that selects, downloads and aligns reads from NCBI's Sequence Read Archive, given only the species' binomial name and genome. VARUS automatically chooses runs from among all archived runs to randomly select subsets of reads. The objective of its online algorithm is to cover a large number of transcripts adequately when network bandwidth and computing resources are limited. For most tested species VARUS achieved both a higher sensitivity and specificity with a lower number of downloaded reads than when runs were manually selected. Using twelve eukaryotic genomes as an example, we show that RNA-Seq that was sampled with VARUS is well-suited for fully-automatic genome annotation with BRAKER. CONCLUSIONS: With VARUS, genome annotation can be automatized to the extent that not even the selection and quality control of RNA-Seq has to be done manually. This introduces the possibility to have fully automatized genome annotation loops over potentially many species without incurring a loss of accuracy over a manually supervised annotation process.

Subject(s)

Databases, Genetic , RNA, Complementary/genetics , Sequence Analysis, RNA/methods , Software , Algorithms , Animals , Drosophila melanogaster/genetics , Eukaryota/genetics , High-Throughput Nucleotide Sequencing , Introns/genetics , Molecular Sequence Annotation , Transcriptome/genetics

16.

Whole-Genome Annotation with BRAKER.

Hoff, Katharina J; Lomsadze, Alexandre; Borodovsky, Mark; Stanke, Mario.

Methods Mol Biol ; 1962: 65-95, 2019.

Article in English | MEDLINE | ID: mdl-31020555

ABSTRACT

BRAKER is a pipeline for highly accurate and fully automated gene prediction in novel eukaryotic genomes. It combines two major tools: GeneMark-ES/ET and AUGUSTUS. GeneMark-ES/ET learns its parameters from a novel genomic sequence in a fully automated fashion; if available, it uses extrinsic evidence for model refinement. From the protein-coding genes predicted by GeneMark-ES/ET, we select a set for training AUGUSTUS, one of the most accurate gene finding tools that, in contrast to GeneMark-ES/ET, integrates extrinsic evidence already into the gene prediction step. The first published version, BRAKER1, integrated genomic footprints of unassembled RNA-Seq reads into the training as well as into the prediction steps. The pipeline has since been extended to the integration of data on mapped cross-species proteins, and to the usage of heterogeneous extrinsic evidence, both RNA-Seq and protein alignments. In this book chapter, we briefly summarize the pipeline methodology and describe how to apply BRAKER in environments characterized by various combinations of external evidence.

Subject(s)

Genome , Molecular Sequence Annotation/methods , Software , Amino Acid Sequence , Genomics/methods , Internet , User-Computer Interface

17.

Multi-Genome Annotation with AUGUSTUS.

Nachtweide, Stefanie; Stanke, Mario.

Methods Mol Biol ; 1962: 139-160, 2019.

Article in English | MEDLINE | ID: mdl-31020558

ABSTRACT

Comparing multiple related genomes can help to improve their structural annotation. The accuracy and consistency of the predicted exon-intron structures of the protein coding genes can be higher when considering all genomes at once rather than annotating one genome at a time. The comparative gene prediction algorithm of AUGUSTUS performs such a multi-genome annotation. A multiple alignment of genomes is used to exploit evolutionary clues to conservation and negative selection. Further, AUGUSTUS exploits the fact that orthologous genes typically have congruent exon-intron structures. Comparative AUGUSTUS simultaneously predicts the genes in all input genomes. In this chapter we walk the reader through a small example from eight vertebrate species, including the construction of an alignment of the input genomes and how to integrate RNA-Seq evidence from multiple species for gene finding.

Subject(s)

Algorithms , Genome , Molecular Sequence Annotation/methods , Vertebrates/genetics , Animals , Computational Biology/methods , Databases, Genetic , Evolution, Molecular , Sequence Analysis, RNA/methods , User-Computer Interface

18.

Effects of adult temperature on gene expression in a butterfly: identifying pathways associated with thermal acclimation.

Franke, Kristin; Karl, Isabell; Centeno, Tonatiuh Pena; Feldmeyer, Barbara; Lassek, Christian; Oostra, Vicencio; Riedel, Katharina; Stanke, Mario; Wheat, Christopher W; Fischer, Klaus.

BMC Evol Biol ; 19(1): 32, 2019 01 23.

Article in English | MEDLINE | ID: mdl-30674272

ABSTRACT

BACKGROUND: Phenotypic plasticity is a pervasive property of all organisms and considered to be of key importance for dealing with environmental variation. Plastic responses to temperature, which is one of the most important ecological factors, have received much attention over recent decades. A recurrent pattern of temperature-induced adaptive plasticity includes increased heat tolerance after exposure to warmer temperatures and increased cold tolerance after exposure to cooler temperatures. However, the mechanisms underlying these plastic responses are hitherto not well understood. Therefore, we here investigate effects of adult acclimation on gene expression in the tropical butterfly Bicyclus anynana, using an RNAseq approach. RESULTS: We show that several antioxidant markers (e.g. peroxidase, cytochrome P450) were up-regulated at a higher temperature compared with a lower adult temperature, which might play an important role in the acclimatory responses subsequently providing increased heat tolerance. Furthermore, several metabolic pathways were up-regulated at the higher temperature, likely reflecting increased metabolic rates. In contrast, we found no evidence for a decisive role of the heat shock response. CONCLUSIONS: Although the important role of antioxidant defence mechanisms in alleviating detrimental effects of oxidative stress is firmly established, we speculate that its potentially important role in mediating heat tolerance and survival under stress has been underestimated thus far and thus deserves more attention.

Subject(s)

Acclimatization/genetics , Aging/genetics , Butterflies/genetics , Butterflies/physiology , Gene Expression Regulation , Temperature , Analysis of Variance , Animals , Genetic Variation , Heat-Shock Response , Molecular Sequence Annotation , Quantitative Trait, Heritable , RNA, Messenger/genetics , RNA, Messenger/metabolism

19.

Predicting Genes in Single Genomes with AUGUSTUS.

Hoff, Katharina J; Stanke, Mario.

Curr Protoc Bioinformatics ; 65(1): e57, 2019 03.

Article in English | MEDLINE | ID: mdl-30466165

ABSTRACT

AUGUSTUS is a tool for finding protein-coding genes and their exon-intron structure in genomic sequences. It does not necessarily require additional experimental input, as it can be applied in so-called ab initio mode. However, extrinsic evidence from various sources such as transcriptome sequencing or the annotations of closely related genomes can be integrated in order to improve the accuracy and completeness of the annotation. AUGUSTUS can be applied to single genomes, or simultaneously to several aligned genomes. Here, we describe steps required for training AUGUSTUS for the annotation of individual genomes and the steps to do the actual structural annotation. Further, we describe the generation and integration of evidence from various sources of extrinsic evidence. © 2018 by John Wiley & Sons, Inc.

Subject(s)

Computational Biology/methods , Genome , Molecular Sequence Annotation , Software , Animals , Base Sequence , Databases, Protein , Expressed Sequence Tags , Sequence Analysis, RNA

20.

Sixteen diverse laboratory mouse reference genomes define strain-specific haplotypes and novel functional loci.

Lilue, Jingtao; Doran, Anthony G; Fiddes, Ian T; Abrudan, Monica; Armstrong, Joel; Bennett, Ruth; Chow, William; Collins, Joanna; Collins, Stephan; Czechanski, Anne; Danecek, Petr; Diekhans, Mark; Dolle, Dirk-Dominik; Dunn, Matt; Durbin, Richard; Earl, Dent; Ferguson-Smith, Anne; Flicek, Paul; Flint, Jonathan; Frankish, Adam; Fu, Beiyuan; Gerstein, Mark; Gilbert, James; Goodstadt, Leo; Harrow, Jennifer; Howe, Kerstin; Ibarra-Soria, Ximena; Kolmogorov, Mikhail; Lelliott, Chris J; Logan, Darren W; Loveland, Jane; Mathews, Clayton E; Mott, Richard; Muir, Paul; Nachtweide, Stefanie; Navarro, Fabio C P; Odom, Duncan T; Park, Naomi; Pelan, Sarah; Pham, Son K; Quail, Mike; Reinholdt, Laura; Romoth, Lars; Shirley, Lesley; Sisu, Cristina; Sjoberg-Herrera, Marcela; Stanke, Mario; Steward, Charles; Thomas, Mark; Threadgold, Glen.

Nat Genet ; 50(11): 1574-1583, 2018 11.

Article in English | MEDLINE | ID: mdl-30275530

ABSTRACT

We report full-length draft de novo genome assemblies for 16 widely used inbred mouse strains and find extensive strain-specific haplotype variation. We identify and characterize 2,567 regions on the current mouse reference genome exhibiting the greatest sequence diversity. These regions are enriched for genes involved in pathogen defence and immunity and exhibit enrichment of transposable elements and signatures of recent retrotransposition events. Combinations of alleles and genes unique to an individual strain are commonly observed at these loci, reflecting distinct strain phenotypes. We used these genomes to improve the mouse reference genome, resulting in the completion of 10 new gene structures. Also, 62 new coding loci were added to the reference genome annotation. These genomes identified a large, previously unannotated, gene (Efcab3-like) encoding 5,874 amino acids. Mutant Efcab3-like mice display anomalies in multiple brain regions, suggesting a possible role for this gene in the regulation of brain development.

Subject(s)

Chromosome Mapping , Genetic Loci , Genome , Haplotypes , Mice, Inbred Strains/genetics , Animals , Animals, Laboratory , Chromosome Mapping/veterinary , Haplotypes/genetics , Mice , Mice, Inbred BALB C/genetics , Mice, Inbred C3H/genetics , Mice, Inbred C57BL/genetics , Mice, Inbred CBA/genetics , Mice, Inbred DBA/genetics , Mice, Inbred NOD/genetics , Mice, Inbred Strains/classification , Molecular Sequence Annotation , Phylogeny , Polymorphism, Single Nucleotide , Species Specificity

