1.
Genome Res ; 34(5): 769-777, 2024 06 25.
Article in English | MEDLINE | ID: mdl-38866550

ABSTRACT

Gene prediction has remained an active area of bioinformatics research for a long time. Still, gene prediction in large eukaryotic genomes presents a challenge that must be addressed by new algorithms. The amount and significance of the evidence available from transcriptomes and proteomes vary across genomes, between genes, and even along a single gene. User-friendly and accurate annotation pipelines that can cope with such data heterogeneity are needed. The previously developed annotation pipelines BRAKER1 and BRAKER2 use RNA-seq or protein data, respectively, but not both. A further significant performance improvement integrating all three data types was made by the recently released GeneMark-ETP. We here present the BRAKER3 pipeline that builds on GeneMark-ETP and AUGUSTUS, and further improves accuracy using the TSEBRA combiner. BRAKER3 annotates protein-coding genes in eukaryotic genomes using both short-read RNA-seq and a large protein database, along with statistical models learned iteratively and specifically for the target genome. We benchmarked the new pipeline on genomes of 11 species under an assumed level of relatedness of the target species proteome to available proteomes. BRAKER3 outperforms BRAKER1 and BRAKER2. The transcript-level F1-score increases by about 20 percentage points on average, and the difference is most pronounced for species with large and complex genomes. BRAKER3 also outperforms other existing tools, MAKER2, Funannotate, and FINDER. The code of BRAKER3 is available on GitHub and as a ready-to-run Docker container for execution with Docker or Singularity. Overall, BRAKER3 is an accurate, easy-to-use tool for eukaryotic genome annotation.
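The transcript-level F1-score quoted in this abstract is the harmonic mean of sensitivity and precision. A minimal sketch of the calculation follows. It assumes that a predicted transcript counts as correct only when its whole coding structure matches a reference transcript; the abstract does not state the benchmark's exact matching rule. The function names and toy coordinates are hypothetical.

```python
def f1_score(true_positives, num_predicted, num_reference):
    """Harmonic mean of sensitivity and precision; 0.0 when undefined."""
    if num_predicted == 0 or num_reference == 0 or true_positives == 0:
        return 0.0
    sensitivity = true_positives / num_reference  # share of reference transcripts found
    precision = true_positives / num_predicted    # share of predictions that are correct
    return 2 * sensitivity * precision / (sensitivity + precision)


def transcript_level_f1(predicted, reference):
    """Assumption: a transcript is correct only if its full structure matches exactly."""
    predicted, reference = set(predicted), set(reference)
    return f1_score(len(predicted & reference), len(predicted), len(reference))


# Toy transcripts: (sequence name, strand, coding exon intervals). All values are made up.
reference = {
    ("chr1", "+", ((100, 200), (300, 400))),
    ("chr1", "+", ((100, 200), (350, 400))),
    ("chr1", "-", ((900, 1000),)),
    ("chr2", "+", ((50, 150), (250, 300))),
}
predicted = {
    ("chr1", "+", ((100, 200), (300, 400))),  # exact match
    ("chr1", "-", ((900, 1000),)),            # exact match
    ("chr2", "+", ((50, 150), (260, 300))),   # one exon boundary off, counts as wrong
}

print(round(transcript_level_f1(predicted, reference), 3))  # prints 0.571
```

In this toy case, sensitivity is 2/4 and precision is 2/3, so F1 is 4/7. A gain of "20 percentage points" means a change such as 0.45 to 0.65 on this scale.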


Subject(s)
Molecular Sequence Annotation , Software , Molecular Sequence Annotation/methods , Humans , RNA-Seq/methods , Algorithms , Animals , Genome , Computational Biology/methods , Genomics/methods , Transcriptome
2.
Bioinformatics ; 40(Supplement_2): ii79-ii86, 2024 Sep 01.
Article in English | MEDLINE | ID: mdl-39230690

ABSTRACT

MOTIVATION: For the alignment of large numbers of protein sequences, tools are predominant that decide to align two residues using only simple prior knowledge, e.g. amino acid substitution matrices, and using only part of the available data. The accuracy of state-of-the-art programs declines with decreasing sequence identity and when increasingly large numbers of sequences are aligned. Recently, transformer-based deep-learning models started to harness the vast amount of protein sequence data, resulting in powerful pretrained language models with the main purpose of generating high-dimensional numerical representations, embeddings, for individual sites that agglomerate evolutionary, structural, and biophysical information. RESULTS: We extend the traditional profile hidden Markov model so that it takes as inputs unaligned protein sequences and the corresponding embeddings. We fit the model with gradient descent using our existing differentiable hidden Markov layer. All sequences and their embeddings are jointly aligned to a model of the protein family. We report that our upgraded HMM-based aligner, learnMSA2, combined with the ProtT5-XL protein language model aligns on average almost 6 percentage points more columns correctly than the best amino acid-based competitor and scales well with sequence number. The relative advantage of learnMSA2 over other programs tends to be greater when the sequence identity is lower and when the number of sequences is larger. Our results strengthen the evidence on the rich information contained in protein language models' embeddings and their potential downstream impact on the field of bioinformatics. AVAILABILITY AND IMPLEMENTATION: https://github.com/Gaius-Augustus/learnMSA, PyPI and Bioconda, evaluation: https://github.com/felbecker/snakeMSA.


Subject(s)
Markov Chains , Proteins , Sequence Alignment , Protein Sequence Analysis , Sequence Alignment/methods , Proteins/chemistry , Protein Sequence Analysis/methods , Software , Deep Learning , Algorithms , Computational Biology/methods , Amino Acid Sequence
3.
Genome Res ; 31(7): 1203-1215, 2021 Jul.
Article in English | MEDLINE | ID: mdl-33947700

ABSTRACT

In contrast to the western honey bee, Apis mellifera, other honey bee species have been largely neglected despite their importance and diversity. The genetic basis of the evolutionary diversification of honey bees remains largely unknown. Here, we provide a genome-wide comparison of three honey bee species, each representing one of the three subgenera of honey bees, namely the dwarf (Apis florea), giant (A. dorsata), and cavity-nesting (A. mellifera) honey bees with bumblebees as an outgroup. Our analyses resolve the phylogeny of honey bees with the dwarf honey bees diverging first. We find that evolution of increased eusocial complexity in Apis proceeds via increases in the complexity of gene regulation, which is in agreement with previous studies. However, this process seems to be related to pathways other than transcriptional control. Positive selection patterns across Apis reveal a trade-off between maintaining genome stability and generating genetic diversity, with a rapidly evolving piRNA pathway leading to genomes depleted of transposable elements, and a rapidly evolving DNA repair pathway associated with high recombination rates in all Apis species. Diversification within Apis is accompanied by positive selection in several genes whose putative functions present candidate mechanisms for lineage-specific adaptations, such as migration, immunity, and nesting behavior.

4.
BMC Bioinformatics ; 24(1): 327, 2023 Aug 31.
Article in English | MEDLINE | ID: mdl-37653395

ABSTRACT

BACKGROUND: The Earth Biogenome Project has rapidly increased the number of available eukaryotic genomes, but most released genomes continue to lack annotation of protein-coding genes. In addition, no transcriptome data is available for some genomes. RESULTS: Various gene annotation tools have been developed but each has its limitations. Here, we introduce GALBA, a fully automated pipeline that utilizes miniprot, a rapid protein-to-genome aligner, in combination with AUGUSTUS to predict genes with high accuracy. Accuracy results indicate that GALBA is particularly strong in the annotation of large vertebrate genomes. We also present use cases in insects, vertebrates, and a land plant. GALBA is fully open source and available as a docker image for easy execution with Singularity in high-performance computing environments. CONCLUSIONS: Our pipeline addresses the critical need for accurate gene annotation in newly sequenced genomes, and we believe that GALBA will greatly facilitate genome annotation for diverse organisms.


Subject(s)
Eukaryota , Eukaryotic Cells , Animals , Molecular Sequence Annotation , Transcriptome
5.
Bioinformatics ; 38(7): 1857-1862, 2022 03 28.
Article in English | MEDLINE | ID: mdl-35060608

ABSTRACT

MOTIVATION: The comparison of genomes using models of molecular evolution is a powerful approach for finding, or toward understanding, functional elements. In particular, comparative genomics is a fundamental building brick in annotating ever larger sets of alignable genomes completely, accurately and consistently. RESULTS: We here present our new program ClaMSA that classifies multiple sequence alignments using a phylogenetic model. It uses a novel continuous-time Markov chain machine learning layer, named CTMC, whose parameters are learned end-to-end and together with (recurrent) neural networks for a learning task. We trained ClaMSA discriminatively to classify aligned codon sequences that are candidates of coding regions into coding or non-coding and obtained four times fewer false positives for this task on vertebrate and fly alignments than existing methods at the same true positive rate. ClaMSA and the CTMC layer are general tools that could be used for other machine learning tasks on tree-related sequence data. AVAILABILITY AND IMPLEMENTATION: Freely from https://github.com/Gaius-Augustus/clamsa. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
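As background for the continuous-time Markov chain mentioned in this abstract: given a rate matrix Q, the substitution probabilities along a branch of length t are P(t) = exp(Qt). The sketch below computes this for a toy four-state Jukes-Cantor rate matrix in pure Python. It is not ClaMSA's CTMC layer, which the abstract describes as having learned parameters and operating on codon alignments; all function names here are hypothetical.

```python
import math


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(m, terms=30, squarings=10):
    """Matrix exponential via scaling and squaring with a truncated Taylor series."""
    n = len(m)
    scale = 2 ** squarings
    scaled = [[x / scale for x in row] for row in m]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term becomes scaled^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)  # undo the scaling: exp(m) = exp(m / 2^s)^(2^s)
    return result


def jukes_cantor_rate_matrix(alpha):
    """Toy rate matrix: every substitution has rate alpha, rows sum to zero."""
    return [[-3 * alpha if i == j else alpha for j in range(4)] for i in range(4)]


def transition_probabilities(q, t):
    """P(t) = exp(Q t): entry [i][j] is the probability of state j after time t from i."""
    return mat_exp([[x * t for x in row] for row in q])


P = transition_probabilities(jukes_cantor_rate_matrix(0.1), 2.0)
closed_form_same = 0.25 + 0.75 * math.exp(-4 * 0.1 * 2.0)
print(abs(P[0][0] - closed_form_same) < 1e-9)  # prints True
```

The Jukes-Cantor model has a known closed form, which makes it a convenient check on the numerical exponential. In a learned layer, the entries of Q would be trainable parameters rather than fixed constants.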


Subject(s)
Biological Evolution , Molecular Evolution , Phylogeny , Genomics , Machine Learning
6.
BMC Bioinformatics ; 23(1): 225, 2022 Jun 10.
Article in English | MEDLINE | ID: mdl-35689182

ABSTRACT

BACKGROUND: Seed finding is an important initial phase of arguably most homology search and alignment methods, such as those required for genome alignments. The seed finding step is crucial to curb the runtime as potential alignments are restricted to and anchored at the sequence position pairs that constitute the seed. To identify seeds, it is good practice to use sets of spaced seed patterns, a method that locally compares two sequences and requires exact matches at certain positions only. RESULTS: We introduce a new method for filtering alignment seeds that we call geometric hashing. Geometric hashing achieves a high specificity by combining non-local information from different seeds using a simple hash function that only requires a constant and small amount of additional time per spaced seed. Geometric hashing was tested on the task of finding homologous positions in the coding regions of human and mouse genome sequences. Thereby, the number of false positives was decreased about a million-fold over sets of spaced seeds while maintaining a very high sensitivity. CONCLUSIONS: An additional geometric hashing filtering phase could improve the run-time, accuracy or both of programs for various homology-search-and-align tasks.
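The spaced seed patterns mentioned in the background can be illustrated with a short sketch. A pattern such as `1101` requires the two sequences to agree at the positions marked `1` and ignores the position marked `0`. The code below covers only this seed-finding step; the geometric hashing filter itself is not reproduced, because the abstract does not describe its hash function. Function names and sequences are hypothetical.

```python
from collections import defaultdict


def spaced_kmer(sequence, start, pattern):
    """Collect the characters of a window that fall on the '1' positions of the pattern."""
    return "".join(sequence[start + offset]
                   for offset, flag in enumerate(pattern) if flag == "1")


def spaced_seed_hits(seq_a, seq_b, pattern):
    """Return sorted position pairs (i, j) where both windows agree at every '1' position."""
    span = len(pattern)
    index = defaultdict(list)
    for i in range(len(seq_a) - span + 1):
        index[spaced_kmer(seq_a, i, pattern)].append(i)
    hits = []
    for j in range(len(seq_b) - span + 1):
        for i in index.get(spaced_kmer(seq_b, j, pattern), []):
            hits.append((i, j))
    return sorted(hits)


# The two toy sequences differ only at index 2 (G versus C).
print(spaced_seed_hits("ACGTAC", "ACCTAC", "1101"))  # prints [(0, 0)]
print(spaced_seed_hits("ACGTAC", "ACCTAC", "1111"))  # prints []
```

The contiguous pattern `1111` misses the match because the mismatch falls inside every window, while `1101` tolerates it at the ignored position. On genome-scale input, many such hits are spurious, which is the false-positive problem the abstract's filtering phase addresses.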


Subject(s)
Algorithms , Genome , Animals , Mice , Sequence Alignment
7.
BMC Bioinformatics ; 22(1): 566, 2021 Nov 25.
Article in English | MEDLINE | ID: mdl-34823473

ABSTRACT

BACKGROUND: BRAKER is a suite of automatic pipelines, BRAKER1 and BRAKER2, for the accurate annotation of protein-coding genes in eukaryotic genomes. Each pipeline trains statistical models of protein-coding genes based on provided evidence and then predicts protein-coding genes in genomic sequences using both the extrinsic evidence and statistical models. For training and prediction, BRAKER1 and BRAKER2 incorporate complementary extrinsic evidence: BRAKER1 uses only RNA-seq data while BRAKER2 uses only a database of cross-species proteins. The BRAKER suite has so far not been able to reliably exceed the accuracy of BRAKER1 and BRAKER2 when incorporating both types of evidence simultaneously. Currently, for a novel genome project where both RNA-seq and protein data are available, the best option is to run both pipelines independently and to pick the one output that is likely better. As a result, one type of extrinsic evidence remains unexploited. RESULTS: We present TSEBRA, a software tool that selects gene predictions (transcripts) from the sets generated by BRAKER1 and BRAKER2. TSEBRA uses a set of rules to compare scores of overlapping transcripts based on their support by RNA-seq and homologous protein evidence. We show in computational experiments on genomes of 11 species that TSEBRA achieves higher accuracy than either BRAKER1 or BRAKER2 running alone and that TSEBRA compares favorably with the combiner tool EVidenceModeler. CONCLUSION: TSEBRA is an easy-to-use and fast software tool. It can be used in concert with the BRAKER pipeline to generate a gene prediction set supported by both RNA-seq and homologous protein evidence.
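The general idea of choosing among overlapping transcripts by comparing evidence-support scores can be sketched as follows. This is an illustration only and does not reproduce TSEBRA's actual rules, scores, or thresholds, none of which are given in the abstract. The `Transcript` fields, the averaging score, and the `margin` parameter are all assumptions, and strand is ignored for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    name: str
    start: int
    end: int
    rnaseq_support: float   # hypothetical: fraction of introns backed by RNA-seq
    protein_support: float  # hypothetical: fraction of introns backed by protein evidence


def evidence_score(tx):
    """Assumed score: plain average of the two support fractions."""
    return (tx.rnaseq_support + tx.protein_support) / 2


def overlaps(a, b):
    """True if the two closed coordinate ranges share at least one position."""
    return a.start <= b.end and b.start <= a.end


def select_transcripts(candidates, margin=0.2):
    """Drop a transcript if an overlapping kept transcript outscores it by more than margin."""
    kept = []
    for tx in sorted(candidates, key=evidence_score, reverse=True):
        beaten = any(overlaps(tx, other)
                     and evidence_score(other) - evidence_score(tx) > margin
                     for other in kept)
        if not beaten:
            kept.append(tx)
    return sorted(kept, key=lambda tx: (tx.start, tx.name))


candidates = [
    Transcript("b1_t1", 100, 500, 1.0, 0.2),    # well supported by RNA-seq
    Transcript("b2_t1", 150, 520, 0.2, 0.2),    # overlaps b1_t1, much weaker support
    Transcript("b2_t2", 1000, 1500, 0.5, 0.5),
    Transcript("b1_t2", 1010, 1490, 0.4, 0.5),  # overlaps b2_t2, nearly equal support
]
print([tx.name for tx in select_transcripts(candidates)])
# prints ['b1_t1', 'b2_t2', 'b1_t2']
```

The weakly supported `b2_t1` is removed because it overlaps a much better supported transcript, while the two near-equal transcripts at the second locus are both kept. This shows how a combined set can draw on both input sets instead of discarding one of them wholesale.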


Subject(s)
Genome , Software , Genomics , RNA-Seq , RNA Sequence Analysis
8.
Genome Res ; 28(7): 1029-1038, 2018 07.
Article in English | MEDLINE | ID: mdl-29884752

ABSTRACT

The recent introductions of low-cost, long-read, and read-cloud sequencing technologies coupled with intense efforts to develop efficient algorithms have made affordable, high-quality de novo sequence assembly a realistic proposition. The result is an explosion of new, ultracontiguous genome assemblies. To compare these genomes, we need robust methods for genome annotation. We describe the fully open source Comparative Annotation Toolkit (CAT), which provides a flexible way to simultaneously annotate entire clades and identify orthology relationships. We show that CAT can be used to improve annotations on the rat genome, annotate the great apes, annotate a diverse set of mammals, and annotate personal, diploid human genomes. We demonstrate the resulting discovery of novel genes, isoforms, and structural variants, even in genomes as well studied as rat and the great apes, and how these annotations improve cross-species RNA expression experiments.


Subject(s)
Human Genome/genetics , Algorithms , Animals , High-Throughput Nucleotide Sequencing/methods , Humans , Molecular Sequence Annotation/methods , RNA/genetics , Rats
9.
Genome Res ; 28(4): 448-459, 2018 04.
Article in English | MEDLINE | ID: mdl-29563166

ABSTRACT

Understanding the mechanisms driving lineage-specific evolution in both primates and rodents has been hindered by the lack of sister clades with a similar phylogenetic structure having high-quality genome assemblies. Here, we have created chromosome-level assemblies of the Mus caroli and Mus pahari genomes. Together with the Mus musculus and Rattus norvegicus genomes, this set of rodent genomes is similar in divergence times to the Hominidae (human-chimpanzee-gorilla-orangutan). By comparing the evolutionary dynamics between the Muridae and Hominidae, we identified punctate events of chromosome reshuffling that shaped the ancestral karyotype of Mus musculus and Mus caroli between 3 and 6 million yr ago, but that are absent in the Hominidae. Hominidae show between four- and sevenfold lower rates of nucleotide change and feature turnover in both neutral and functional sequences, suggesting an underlying coherence to the Muridae acceleration. Our system of matched, high-quality genome assemblies revealed how specific classes of repeats can play lineage-specific roles in related species. Recent LINE activity has remodeled protein-coding loci to a greater extent across the Muridae than the Hominidae, with functional consequences at the species level such as reproductive isolation. Furthermore, we charted a Muridae-specific retrotransposon expansion at unprecedented resolution, revealing how a single nucleotide mutation transformed a specific SINE element into an active CTCF binding site carrier specifically in Mus caroli, which resulted in thousands of novel, species-specific CTCF binding sites. Our results show that the comparison of matched phylogenetic sets of genomes will be an increasingly powerful strategy for understanding mammalian biology.


Subject(s)
Molecular Evolution , Genome/genetics , Muridae/genetics , Phylogeny , Animals , Binding Sites , CCCTC-Binding Factor/genetics , Chromosomes/genetics , Karyotyping/methods , Long Interspersed Nucleotide Elements/genetics , Mice , Retroelements/genetics , Species Specificity
10.
BMC Genomics ; 21(1): 47, 2020 Jan 14.
Article in English | MEDLINE | ID: mdl-31937263

ABSTRACT

BACKGROUND: The red flour beetle Tribolium castaneum has emerged as an important model organism for the study of gene function in development and physiology, for ecological and evolutionary genomics, for pest control and a plethora of other topics. RNA interference (RNAi), transgenesis and genome editing are well established and the resources for genome-wide RNAi screening have become available in this model. All these techniques depend on a high-quality genome assembly and precise gene models. However, the first version of the genome assembly was generated by Sanger sequencing, and only a small set of RNA sequence data was available, which limited annotation quality. RESULTS: Here, we present an improved genome assembly (Tcas5.2) and an enhanced genome annotation resulting in a new official gene set (OGS3) for Tribolium castaneum, which significantly increase the quality of the genomic resources. By adding large-distance jumping library DNA sequencing to join scaffolds and fill small gaps, the gaps in the genome assembly were reduced and the N50 increased to 4753 kbp. The precision of the gene models was enhanced by the use of a large body of RNA-Seq reads of different life history stages and tissue types, leading to the discovery of 1452 novel gene sequences. We also added new features such as alternative splicing, well-defined UTRs and microRNA target predictions. For quality control, 399 gene models were evaluated by manual inspection. The current gene set was submitted to Genbank and accepted as a RefSeq genome by NCBI. CONCLUSIONS: The new genome assembly (Tcas5.2) and the official gene set (OGS3) provide enhanced genomic resources for genetic work in Tribolium castaneum. The much improved information on transcription start sites supports transgenic and gene editing approaches. Further, novel types of information such as splice variants and microRNA target genes open additional possibilities for analysis.
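The N50 reported in this abstract is a standard contiguity statistic: the length L such that sequences of length at least L together contain at least half of all assembled bases. A minimal sketch follows, with made-up scaffold lengths showing why joining scaffolds raises the value. The function name is hypothetical.

```python
def n50(lengths):
    """Return the length L such that sequences of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50 is undefined for an empty list of lengths")


before = [50, 40, 30, 20, 10]  # toy scaffold lengths, arbitrary units
after = [50, 50, 40, 10]       # the 30 and 20 scaffolds joined into one of length 50

print(n50(before))  # prints 40
print(n50(after))   # prints 50
```

The total assembly size is 150 in both cases. Only the partition into scaffolds changes, which is what scaffold joining with jumping libraries affects.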


Subject(s)
Insect Genes , Insect Genome , Genomics , Tribolium/genetics , Animals , Binding Sites , Computational Biology/methods , Genomics/methods , MicroRNAs/genetics , Molecular Sequence Annotation , Phylogeny , RNA Interference , Reproducibility of Results
11.
Proc Natl Acad Sci U S A ; 114(23): 6133-6138, 2017 06 06.
Article in English | MEDLINE | ID: mdl-28536194

ABSTRACT

Nicotine, the signature alkaloid of Nicotiana species responsible for the addictive properties of human tobacco smoking, functions as a defensive neurotoxin against attacking herbivores. However, the evolution of the genetic features that contributed to the assembly of the nicotine biosynthetic pathway remains unknown. We sequenced and assembled genomes of two wild tobaccos, Nicotiana attenuata (2.5 Gb) and Nicotiana obtusifolia (1.5 Gb), two ecological models for investigating adaptive traits in nature. We show that after the Solanaceae whole-genome triplication event, a repertoire of rapidly expanding transposable elements (TEs) bloated these Nicotiana genomes, promoted expression divergences among duplicated genes, and contributed to the evolution of herbivory-induced signaling and defenses, including nicotine biosynthesis. The biosynthetic machinery that allows for nicotine synthesis in the roots evolved from the stepwise duplications of two ancient primary metabolic pathways: the polyamine and nicotinamide adenine dinucleotide (NAD) pathways. In contrast to the duplication of the polyamine pathway that is shared among several solanaceous genera producing polyamine-derived tropane alkaloids, we found that lineage-specific duplications within the NAD pathway and the evolution of root-specific expression of the duplicated Solanaceae-specific ethylene response factor that activates the expression of all nicotine biosynthetic genes resulted in the innovative and efficient production of nicotine in the genus Nicotiana. Transcription factor binding motifs derived from TEs may have contributed to the coexpression of nicotine biosynthetic pathway genes and coordinated the metabolic flux. Together, these results provide evidence that TEs and gene duplications facilitated the emergence of a key metabolic innovation relevant to plant fitness.


Subject(s)
Nicotiana/genetics , Nicotine/biosynthesis , Alkaloids/biosynthesis , Base Sequence , Biosynthetic Pathways/genetics , DNA Transposable Elements/genetics , Molecular Evolution , Gene Duplication/genetics , Plant Gene Expression Regulation/drug effects , Nicotine/genetics , Nicotine/metabolism , Plant Proteins/genetics , Plant Roots/metabolism , Genetic Promoter Regions/drug effects , Transcription Factors/metabolism
12.
BMC Bioinformatics ; 20(1): 558, 2019 Nov 08.
Article in English | MEDLINE | ID: mdl-31703556

ABSTRACT

BACKGROUND: Vast amounts of next generation sequencing RNA data have been deposited in archives, accompanying very diverse original studies. The data are readily available also for other purposes such as genome annotation or transcriptome assembly. However, selecting a subset of available experiments, sequencing runs and reads for this purpose is a nontrivial task and complicated by the inhomogeneity of the data. RESULTS: This article presents the software VARUS that selects, downloads and aligns reads from NCBI's Sequence Read Archive, given only the species' binomial name and genome. VARUS automatically chooses runs from among all archived runs to randomly select subsets of reads. The objective of its online algorithm is to cover a large number of transcripts adequately when network bandwidth and computing resources are limited. For most tested species VARUS achieved both a higher sensitivity and specificity with a lower number of downloaded reads than when runs were manually selected. Using twelve eukaryotic genomes as examples, we show that RNA-Seq that was sampled with VARUS is well-suited for fully-automatic genome annotation with BRAKER. CONCLUSIONS: With VARUS, genome annotation can be automated to the extent that not even the selection and quality control of RNA-Seq has to be done manually. This introduces the possibility to have fully automated genome annotation loops over potentially many species without incurring a loss of accuracy over a manually supervised annotation process.


Subject(s)
Genetic Databases , Complementary RNA/genetics , RNA Sequence Analysis/methods , Software , Algorithms , Animals , Drosophila melanogaster/genetics , Eukaryota/genetics , High-Throughput Nucleotide Sequencing , Introns/genetics , Molecular Sequence Annotation , Transcriptome/genetics
13.
BMC Evol Biol ; 19(1): 32, 2019 01 23.
Article in English | MEDLINE | ID: mdl-30674272

ABSTRACT

BACKGROUND: Phenotypic plasticity is a pervasive property of all organisms and considered to be of key importance for dealing with environmental variation. Plastic responses to temperature, which is one of the most important ecological factors, have received much attention over recent decades. A recurrent pattern of temperature-induced adaptive plasticity includes increased heat tolerance after exposure to warmer temperatures and increased cold tolerance after exposure to cooler temperatures. However, the mechanisms underlying these plastic responses are hitherto not well understood. Therefore, we here investigate effects of adult acclimation on gene expression in the tropical butterfly Bicyclus anynana, using an RNAseq approach. RESULTS: We show that several antioxidant markers (e.g. peroxidase, cytochrome P450) were up-regulated at a higher temperature compared with a lower adult temperature, which might play an important role in the acclimatory responses subsequently providing increased heat tolerance. Furthermore, several metabolic pathways were up-regulated at the higher temperature, likely reflecting increased metabolic rates. In contrast, we found no evidence for a decisive role of the heat shock response. CONCLUSIONS: Although the important role of antioxidant defence mechanisms in alleviating detrimental effects of oxidative stress is firmly established, we speculate that its potentially important role in mediating heat tolerance and survival under stress has been underestimated thus far and thus deserves more attention.


Subject(s)
Acclimatization/genetics , Aging/genetics , Butterflies/genetics , Butterflies/physiology , Gene Expression Regulation , Temperature , Analysis of Variance , Animals , Genetic Variation , Heat-Shock Response , Molecular Sequence Annotation , Heritable Quantitative Trait , Messenger RNA/genetics , Messenger RNA/metabolism
14.
BMC Biol ; 15(1): 62, 2017 07 31.
Article in English | MEDLINE | ID: mdl-28756775

ABSTRACT

BACKGROUND: The duplication of genes can occur through various mechanisms and is thought to make a major contribution to the evolutionary diversification of organisms. There is increasing evidence for a large-scale duplication of genes in some chelicerate lineages including two rounds of whole genome duplication (WGD) in horseshoe crabs. To investigate this further, we sequenced and analyzed the genome of the common house spider Parasteatoda tepidariorum. RESULTS: We found pervasive duplication of both coding and non-coding genes in this spider, including two clusters of Hox genes. Analysis of synteny conservation across the P. tepidariorum genome suggests that there has been an ancient WGD in spiders. Comparison with the genomes of other chelicerates, including that of the newly sequenced bark scorpion Centruroides sculpturatus, suggests that this event occurred in the common ancestor of spiders and scorpions, and is probably independent of the WGDs in horseshoe crabs. Furthermore, characterization of the sequence and expression of the Hox paralogs in P. tepidariorum suggests that many have been subject to neo-functionalization and/or sub-functionalization since their duplication. CONCLUSIONS: Our results reveal that spiders and scorpions are likely the descendants of a polyploid ancestor that lived more than 450 MYA. Given the extensive morphological diversity and ecological adaptations found among these animals, rivaling those of vertebrates, our study of the ancient WGD event in Arachnopulmonata provides a new comparative platform to explore common and divergent evolutionary outcomes of polyploidization events across eukaryotes.


Subject(s)
Molecular Evolution , Gene Duplication , Genome , Spiders/genetics , Animals , Female , Male , Synteny
15.
Hum Mutat ; 38(9): 1266-1276, 2017 09.
Article in English | MEDLINE | ID: mdl-28544481

ABSTRACT

The advent of next-generation sequencing has dramatically decreased the cost for whole-genome sequencing and increased the viability for its application in research and clinical care. The Personal Genome Project (PGP) provides unrestricted access to genomes of individuals and their associated phenotypes. This resource enabled the Critical Assessment of Genome Interpretation (CAGI) to create a community challenge to assess the bioinformatics community's ability to predict traits from whole genomes. In the CAGI PGP challenge, researchers were asked to predict whether an individual had a particular trait or profile based on their whole genome. Several approaches were used to assess submissions, including ROC AUC (area under receiver operating characteristic curve), probability rankings, the number of correct predictions, and statistical significance simulations. Overall, we found that prediction of individual traits is difficult, relying on a strong knowledge of trait frequency within the general population, whereas matching genomes to trait profiles relies heavily upon a small number of common traits including ancestry, blood type, and eye color. When a rare genetic disorder is present, profiles can be matched when one or more pathogenic variants are identified. Prediction accuracy has improved substantially over the last 6 years due to improved methodology and a better understanding of features.
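One of the assessment measures listed in this abstract, ROC AUC, equals the probability that a randomly chosen individual with the trait receives a higher predicted score than a randomly chosen individual without it, with ties counted as one half. A minimal sketch of that pairwise definition follows. The scores and labels are made up, and the function is not the challenge's actual assessment code.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))


# Predicted probabilities that four individuals have a trait, with the true status.
print(roc_auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]))  # prints 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation. The quadratic loop is adequate for small cohorts; a rank-based formulation would suit larger ones.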


Subject(s)
High-Throughput Nucleotide Sequencing/methods , Whole Genome Sequencing/methods , Area Under Curve , Genetic Predisposition to Disease , Human Genome Project , Humans , Phenotype , Quantitative Trait Loci
16.
BMC Genomics ; 18(1): 178, 2017 02 16.
Article in English | MEDLINE | ID: mdl-28209133

ABSTRACT

BACKGROUND: Black widow spiders are infamous for their neurotoxic venom, which can cause extreme and long-lasting pain. This unusual venom is dominated by latrotoxins and latrodectins, two protein families virtually unknown outside of the black widow genus Latrodectus, that are difficult to study given the paucity of spider genomes. Using tissue-, sex- and stage-specific expression data, we analyzed the recently sequenced genome of the house spider (Parasteatoda tepidariorum), a close relative of black widows, to investigate latrotoxin and latrodectin diversity, expression and evolution. RESULTS: We discovered at least 47 latrotoxin genes in the house spider genome, many of which are tandem-arrayed. Latrotoxins vary extensively in predicted structural domains and expression, implying their significant functional diversification. Phylogenetic analyses show latrotoxins have substantially duplicated after the Latrodectus/Parasteatoda split and that they are also related to proteins found in endosymbiotic bacteria. Latrodectin genes are less numerous than latrotoxins, but analyses show their recruitment for venom function from neuropeptide hormone genes following duplication, inversion and domain truncation. While latrodectins and other peptides are highly expressed in house spider and black widow venom glands, latrotoxins account for a far smaller percentage of house spider venom gland expression. CONCLUSIONS: The house spider genome sequence provides novel insights into the evolution of venom toxins once considered unique to black widows. Our results greatly expand the size of the latrotoxin gene family, reinforce its narrow phylogenetic distribution, and provide additional evidence for the lateral transfer of latrotoxins between spiders and bacterial endosymbionts. Moreover, we strengthen the evidence for the evolution of latrodectin venom genes from the ecdysozoan Ion Transport Peptide (ITP)/Crustacean Hyperglycemic Hormone (CHH) neuropeptide superfamily. The lower expression of latrotoxins in house spiders relative to black widows, along with the absence of a vertebrate-targeting α-latrotoxin gene in the house spider genome, may account for the extreme potency of black widow venom.


Subject(s)
Black Widow Spider , Molecular Evolution , Gene Expression Profiling , Genetic Variation , Genomics , Insect Proteins/toxicity , Spider Venoms/genetics , Animals , Coxiellaceae/physiology , Female , Insect Proteins/chemistry , Insect Proteins/genetics , Insect Proteins/metabolism , Male , Protein Domains , Sex Characteristics , Symbiosis
17.
Bioinformatics ; 32(22): 3388-3395, 2016 11 15.
Article in English | MEDLINE | ID: mdl-27466621

ABSTRACT

MOTIVATION: As the tree of life is populated with sequenced genomes ever more densely, the new challenge is the accurate and consistent annotation of entire clades of genomes. We address this problem with a new approach to comparative gene finding that takes a multiple genome alignment of closely related species and simultaneously predicts the location and structure of protein-coding genes in all input genomes, thereby exploiting negative selection and sequence conservation. The model prefers potential gene structures in the different genomes that are in agreement with each other, or, if not, where the exon gains and losses are plausible given the species tree. We formulate the multi-species gene finding problem as a binary labeling problem on a graph. The resulting optimization problem is NP hard, but can be efficiently approximated using a subgradient-based dual decomposition approach. RESULTS: The proposed method was tested on whole-genome alignments of 12 vertebrate and 12 Drosophila species. The accuracy was evaluated for human, mouse and Drosophila melanogaster and compared to competing methods. Results suggest that our method is well-suited for annotation of (a large number of) genomes of closely related species within a clade, in particular, when RNA-Seq data are available for many of the genomes. The transfer of existing annotations from one genome to another via the genome alignment is more accurate than previous approaches that are based on protein-spliced alignments, when the genomes are at close to medium distances. AVAILABILITY AND IMPLEMENTATION: The method is implemented in C++ as part of Augustus and available open source at http://bioinf.uni-greifswald.de/augustus/ CONTACT: stefaniekoenig@ymail.com or mario.stanke@uni-greifswald.de SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Genome , Sequence Alignment , Animals , Drosophila melanogaster , Exons , Humans , Mice
18.
Bioinformatics ; 32(5): 767-9, 2016 03 01.
Article in English | MEDLINE | ID: mdl-26559507

ABSTRACT

MOTIVATION: Gene finding in eukaryotic genomes is notoriously difficult to automate. The task is to design a workflow with a minimal set of tools that would reach state-of-the-art performance across a wide range of species. GeneMark-ET is a gene prediction tool that incorporates RNA-Seq data into unsupervised training and subsequently generates ab initio gene predictions. AUGUSTUS is a gene finder that usually requires supervised training and uses information from RNA-Seq reads in the prediction step. Complementary strengths of GeneMark-ET and AUGUSTUS provided motivation for designing a new combined tool for automatic gene prediction. RESULTS: We present BRAKER1, a pipeline for unsupervised RNA-Seq-based genome annotation that combines the advantages of GeneMark-ET and AUGUSTUS. As input, BRAKER1 requires a genome assembly file and a file in BAM format with spliced alignments of RNA-Seq reads to the genome. First, GeneMark-ET performs iterative training and generates initial gene structures. Second, AUGUSTUS uses predicted genes for training and then integrates RNA-Seq read information into final gene predictions. In our experiments, we observed that BRAKER1 was more accurate than MAKER2 when it was using RNA-Seq as the sole source for training and prediction. BRAKER1 does not require pre-trained parameters or a separate expert-prepared training step. AVAILABILITY AND IMPLEMENTATION: BRAKER1 is available for download at http://bioinf.uni-greifswald.de/bioinf/braker/ and http://exon.gatech.edu/GeneMark/ CONTACT: katharina.hoff@uni-greifswald.de or borodovsky@gatech.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Sequence Analysis, RNA , Eukaryota , Genome , RNA , Software
19.
Nature ; 466(7307): 720-6, 2010 Aug 05.
Article in English | MEDLINE | ID: mdl-20686567

ABSTRACT

Sponges are an ancient group of animals that diverged from other metazoans over 600 million years ago. Here we present the draft genome sequence of Amphimedon queenslandica, a demosponge from the Great Barrier Reef, and show that it is remarkably similar to other animal genomes in content, structure and organization. Comparative analysis enabled by the sequencing of the sponge genome reveals genomic events linked to the origin and early evolution of animals, including the appearance, expansion and diversification of pan-metazoan transcription factor, signalling pathway and structural genes. This diverse 'toolkit' of genes correlates with critical aspects of all metazoan body plans, and comprises cell cycle control and growth, development, somatic- and germ-cell specification, cell adhesion, innate immunity and allorecognition. Notably, many of the genes associated with the emergence of animals are also implicated in cancer, which arises from defects in basic processes associated with metazoan multicellularity.


Subject(s)
Evolution, Molecular , Genome/genetics , Porifera/genetics , Animals , Apoptosis/genetics , Cell Adhesion/genetics , Cell Cycle/genetics , Cell Polarity/genetics , Cell Proliferation , Genes/genetics , Genomics , Humans , Immunity, Innate/genetics , Models, Biological , Neurons/metabolism , Phosphotransferases/chemistry , Phosphotransferases/genetics , Phylogeny , Porifera/anatomy & histology , Porifera/cytology , Porifera/immunology , Sequence Analysis, DNA , Signal Transduction/genetics
20.
Proteins ; 83(5): 844-52, 2015 May.
Article in English | MEDLINE | ID: mdl-25663045

ABSTRACT

Large efforts have been made in classifying residues as binding sites in proteins using machine learning methods. The prediction task can be translated into the computational challenge of assigning each residue the label "binding site" or "non-binding site". Observational data comes from various possibly highly correlated sources. It includes the structure of the protein but not the structure of the complex. The model class of conditional random fields (CRFs) has previously been used successfully for protein binding site prediction. Here, a new CRF approach is presented that models the dependencies of residues using a general graphical structure defined as a neighborhood graph, so that our model makes fewer independence assumptions on the labels than sequential labeling approaches. A novel node feature "change in free energy" is introduced into the model, which is then denoted by ΔF-CRF. Parameters are trained with an online large-margin algorithm. Using the standard feature class relative accessible surface area alone, the general graph-structure CRF already achieves higher prediction accuracy than the linear chain CRF of Li et al. ΔF-CRF performs significantly better on a large range of false positive rates than the support-vector-machine-based program PresCont of Zellner et al. on a homodimer set containing 128 chains. ΔF-CRF has a broader scope than PresCont since it is not constrained to protein subgroups and requires no multiple sequence alignment. The improvement is attributed to the advantageous combination of the novel node feature with the standard feature and to the adopted parameter training method.
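
The difference between a neighborhood graph and a linear chain can be made concrete with a small sketch. Everything below is an illustrative assumption rather than the ΔF-CRF implementation: the coordinates, the distance cutoff, the node scores, and the single shared edge weight are hypothetical, and the score is a simplified unnormalized CRF log-score with one node term per residue and one agreement reward per edge.

```python
from itertools import combinations
from math import dist


def neighborhood_graph(coords, cutoff):
    """Return edges (i, j) with i < j joining residues closer than cutoff."""
    return [
        (i, j)
        for (i, p), (j, q) in combinations(enumerate(coords), 2)
        if dist(p, q) < cutoff
    ]


def crf_score(labels, node_scores, edges, edge_weight):
    """Unnormalized log-score: node terms plus a reward for edges whose ends share a label."""
    node_term = sum(node_scores[i][lab] for i, lab in enumerate(labels))
    edge_term = sum(edge_weight for i, j in edges if labels[i] == labels[j])
    return node_term + edge_term


# Hypothetical residue positions (one representative point per residue).
COORDS = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (6.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
# Hypothetical per-residue scores for the labels (non-binding, binding).
NODE_SCORES = [(0.2, 0.8), (0.5, 0.5), (0.9, 0.1), (0.3, 0.7)]

if __name__ == "__main__":
    edges = neighborhood_graph(COORDS, cutoff=5.5)
    print("edges:", edges)
    print("score:", crf_score([1, 1, 0, 1], NODE_SCORES, edges, edge_weight=0.5))
```

With these toy coordinates the graph contains the edges (0, 3) and (1, 3) in addition to the sequence-adjacent pairs, which is the kind of spatial dependency a linear chain over the sequence cannot express.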


Subject(s)
Software , Binding Sites , Computer Simulation , Models, Molecular , Protein Interaction Domains and Motifs , Proteins/chemistry , ROC Curve , Support Vector Machine , Thermodynamics