ABSTRACT
Background: DNA methylation plays a key role in the regulation of gene expression and carcinogenesis. Bisulfite sequencing studies mainly focus on calling single nucleotide polymorphisms, identifying differentially methylated regions, and finding allele-specific DNA methylation. Until now, only a few software tools have focused on virus integration using bisulfite sequencing data. Findings: We have developed a new and easy-to-use software tool, named BS-virus-finder (BSVF, RRID:SCR_015727), to detect viral integration breakpoints in whole human genomes. The tool is hosted at https://github.com/BGI-SZ/BSVF. Conclusions: BS-virus-finder demonstrates high sensitivity and specificity. It is useful in epigenetic studies and for revealing the relationship between viral integration and DNA methylation. BS-virus-finder is the first software tool to detect virus integration loci by using bisulfite sequencing data.
Subjects
Viral DNA/genetics , Human Genome , Hepatitis B Virus/genetics , Hepatocytes/virology , Software , Virus Integration , Base Pairing , Base Sequence , Tumor Cell Line , DNA Methylation , Genetic Epigenesis , Hepatocytes/metabolism , Hepatocytes/pathology , High-Throughput Nucleotide Sequencing , Humans , Sensitivity and Specificity , Sulfites/chemistry , Whole Genome Sequencing
ABSTRACT
Hundreds of thousands of human genomes are now being sequenced to characterize genetic variation and use this information to augment association mapping studies of complex disorders and other phenotypic traits. Genetic variation is identified mainly by mapping short reads to the reference genome or by performing local assembly. However, these approaches are biased against discovery of structural variants and variation in the more complex parts of the genome. Hence, large-scale de novo assembly is needed. Here we show that it is possible to construct excellent de novo assemblies from high-coverage sequencing with mate-pair libraries extending up to 20 kilobases. We report de novo assemblies of 150 individuals (50 trios) from the GenomeDenmark project. The quality of these assemblies is similar to those obtained using the more expensive long-read technology. We use the assemblies to identify a rich set of structural variants including many novel insertions and demonstrate how this variant catalogue enables further deciphering of known association mapping signals. We leverage the assemblies to provide 100 completely resolved major histocompatibility complex haplotypes and to resolve major parts of the Y chromosome. Our study provides a regional reference genome that we expect will improve the power of future association mapping studies and hence pave the way for precision medicine initiatives, which now are being launched in many countries including Denmark.
Subjects
Genetic Variation/genetics , Population Genetics/standards , Human Genome/genetics , Genomics/standards , DNA Sequence Analysis/standards , Adult , Alleles , Child , Human Y Chromosomes/genetics , Denmark , Female , Haplotypes/genetics , Humans , Major Histocompatibility Complex/genetics , Male , Maternal Age , Mutation Rate , Paternal Age , Point Mutation/genetics , Reference Standards
ABSTRACT
A cancer of unknown primary (CUP) is a metastatic cancer for which standard diagnostic tests fail to identify the location of the primary tumor. CUPs account for 3-5% of cancer cases. Using molecular data to determine the location of the primary tumor in such cases can help doctors make the right treatment choice and thus improve the clinical outcome. In this paper, we present a new method for predicting the location of the primary tumor using gene expression data: locating cancers of unknown primary (LoCUP). The method models the data as a mixture of normal and tumor cells and thus allows correct classification even in impure samples, where the tumor biopsy is contaminated by a large fraction of normal cells. We find that our method provides a significant increase in classification accuracy (95.8% over 90.8%) on simulated low-purity metastatic samples and shows potential on a small dataset of real metastasis samples with known origin.
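The mixture idea can be illustrated with a small sketch. This is not the LoCUP implementation: it assumes a simple linear mixing model in which the observed expression vector is p * tumor + (1 - p) * normal, with purity p between 0 and 1. The reference profiles, the function names, and the least-squares criterion are all hypothetical choices made for illustration.

```python
def fit_purity(observed, tumor, normal):
    """Least-squares purity p in [0, 1] for observed ~ p*tumor + (1-p)*normal.

    Assumption: a linear mixing model. Returns (p, sum of squared residuals).
    """
    # Rearranged model: observed - normal = p * (tumor - normal)
    diff = [t - n for t, n in zip(tumor, normal)]
    resid = [o - n for o, n in zip(observed, normal)]
    denom = sum(d * d for d in diff)
    p = sum(d * r for d, r in zip(diff, resid)) / denom if denom else 0.0
    # The objective is a convex quadratic in p, so clamping gives the
    # constrained optimum on [0, 1].
    p = min(1.0, max(0.0, p))
    sse = sum((o - (p * t + (1.0 - p) * n)) ** 2
              for o, t, n in zip(observed, tumor, normal))
    return p, sse


def classify(observed, tumor_profiles, normal):
    """Return (site, purity, sse) for the best-fitting hypothetical tumor profile."""
    best = None
    for site, tumor in tumor_profiles.items():
        p, sse = fit_purity(observed, tumor, normal)
        if best is None or sse < best[2]:
            best = (site, p, sse)
    return best


# Toy data (made up): a sample that is 30% "lung" tumor and 70% normal tissue.
normal = [1.0, 1.0, 1.0]
tumor_profiles = {"lung": [5.0, 1.0, 1.0], "colon": [1.0, 5.0, 1.0]}
observed = [0.3 * t + 0.7 * n for t, n in zip(tumor_profiles["lung"], normal)]
site, purity, _ = classify(observed, tumor_profiles, normal)
```

Fitting the purity jointly with the tissue label is what lets a low-purity sample still be assigned to the right profile in this toy setting. A classifier that ignores the normal fraction would compare the diluted sample directly with pure tumor profiles.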
Subjects
Gene Expression Profiling , Neoplastic Gene Expression Regulation , Unknown Primary Neoplasms/genetics , Unknown Primary Neoplasms/therapy , Biopsy , Humans
ABSTRACT
Expression of bacterial type II toxin-antitoxin (TA) systems is regulated at the transcriptional level through direct binding of the antitoxin to pseudo-palindromic sequences on operator DNA. In this context, the toxin functions as a co-repressor by stimulating DNA binding through direct interaction with the antitoxin. Here, we determine crystal structures of the complete 90 kDa heterooctameric VapBC1 complex from Caulobacter crescentus CB15 both in isolation and bound to its cognate DNA operator sequence at 1.6 and 2.7 Å resolution, respectively. DNA binding is associated with a dramatic architectural rearrangement of conserved TA interactions in which C-terminal extended structures of the antitoxin VapB1 swap positions to interlock the complex in the DNA-bound state. We further show that a pseudo-palindromic protein sequence in the antitoxin is responsible for this interaction and required for binding and inactivation of the VapC1 toxin dimer. Sequence analysis of 4127 orthologous VapB sequences reveals that such palindromic protein sequences are widespread and unique to bacterial and archaeal VapB antitoxins suggesting a general principle governing regulation of VapBC TA systems. Finally, a structure of C-terminally truncated VapB1 bound to VapC1 reveals discrete states of the TA interaction that suggest a structural basis for toxin activation in vivo.
Subjects
Bacterial Proteins/chemistry , Bacterial Toxins/chemistry , Caulobacter crescentus/genetics , Bacterial DNA/chemistry , DNA-Binding Proteins/chemistry , Membrane Glycoproteins/chemistry , Genetic Operator Regions , Amino Acid Motifs , Bacterial Proteins/metabolism , Bacterial Toxins/antagonists & inhibitors , Bacterial Toxins/metabolism , Bacterial DNA/metabolism , DNA-Binding Proteins/metabolism , Membrane Glycoproteins/metabolism , Molecular Models , Nucleic Acid Conformation , Protein Binding , Protein Domains
ABSTRACT
H2 metabolism is proposed to be the most ancient and diverse mechanism of energy conservation. The metalloenzymes mediating this metabolism, hydrogenases, are encoded by over 60 microbial phyla and are present in all major ecosystems. We developed a classification system and web tool, HydDB, for the structural and functional analysis of these enzymes. We show that hydrogenase function can be predicted by primary sequence alone using an expanded classification scheme (comprising 29 [NiFe], 8 [FeFe], and 1 [Fe] hydrogenase classes) that defines 11 new classes with distinct biological functions. Using this scheme, we built a web tool that rapidly and reliably classifies hydrogenase primary sequences using a combination of k-nearest neighbors algorithms and CDD referencing. Demonstrating its capacity, the tool reliably predicted hydrogenase content and function in 12 newly sequenced bacteria, archaea, and eukaryotes. HydDB provides the capacity to browse the amino acid sequences of 3248 annotated hydrogenase catalytic subunits and also contains a detailed repository of physiological, biochemical, and structural information about the 38 hydrogenase classes defined here. The database and classifier are freely and publicly available at http://services.birc.au.dk/hyddb/.
ABSTRACT
MOTIVATION: By using a class of large modular enzymes known as Non-Ribosomal Peptide Synthetases (NRPS), bacteria and fungi are capable of synthesizing a large variety of secondary metabolites, many of which are bioactive and have potential pharmaceutical applications, e.g. as antibiotics. There is thus an interest in predicting the compound synthesized by an NRPS from its primary structure (amino acid sequence) alone, as this would enable an in silico search of whole genomes for NRPS enzymes capable of synthesizing potentially useful compounds. RESULTS: NRPS synthesis happens in a conveyor belt-like fashion where each individual NRPS module is responsible for incorporating a specific substrate (typically an amino acid) into the final product. Here, we present a new method for predicting substrate specificities of individual NRPS modules based on occurrences of motifs in their primary structures. We compare our classifier with existing methods and discuss possible biological explanations of how the motifs might relate to substrate specificity. AVAILABILITY AND IMPLEMENTATION: SEQL-NRPS is available as a web service implemented in Python with Flask at http://services.birc.au.dk/seql-nrps and source code available at https://bitbucket.org/dansondergaard/seql-nrps/. CONTACT: micknudsen@gmail.com or cstorm@birc.au.dk SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
Bacteria/enzymology , Fungi/enzymology , Peptide Synthases/chemistry , Protein Sequence Analysis/methods , Amino Acid Motifs , Computer Simulation , Peptide Synthases/metabolism , Substrate Specificity
ABSTRACT
Building a population-specific catalogue of single nucleotide variants (SNVs), indels and structural variants (SVs) with frequencies, termed a national pan-genome, is critical for further advancing clinical and public health genetics in large cohorts. Here we report a Danish pan-genome obtained from sequencing 10 trios to high depth (50×). We report 536k novel SNVs and 283k novel short indels from mapping approaches and develop a population-wide de novo assembly approach to identify 132k novel indels larger than 10 nucleotides with low false discovery rates. We identify a higher proportion of indels and SVs than previous efforts, showing the merits of high coverage and de novo assembly approaches. In addition, we use trio information to identify de novo mutations and use a probabilistic method to provide direct estimates of 1.27e-8 and 1.5e-9 per nucleotide per generation for SNVs and indels, respectively.
Subjects
Human Genome/genetics , Algorithms , Humans , Mutation Rate , Single Nucleotide Polymorphism/genetics , DNA Sequence Analysis/methods
ABSTRACT
UNLABELLED: tqDist is a software package for computing the triplet and quartet distances between general rooted or unrooted trees, respectively. The program is based on algorithms with running time [Formula: see text] for the triplet distance calculation and [Formula: see text] for the quartet distance calculation, where n is the number of leaves in the trees and d is the degree of the tree with minimum degree. These are currently the fastest algorithms both in theory and in practice. AVAILABILITY AND IMPLEMENTATION: tqDist can be installed on Windows, Linux and Mac OS X. Doing this will install a set of command-line tools together with a Python module and an R package for scripting in Python or R. The software package is freely available under the GNU LGPL licence at http://birc.au.dk/software/tqDist.
Subjects
Phylogeny , Software , Algorithms , Classification/methods
ABSTRACT
BACKGROUND: Structured Logistic Regression (SLR) is a newly developed machine learning tool first proposed in the context of text categorization. Current availability of extensive protein sequence databases calls for an automated method to reliably classify sequences and SLR seems well-suited for this task. The classification of P-type ATPases, a large family of ATP-driven membrane pumps transporting essential cations, was selected as a test-case that would generate important biological information as well as provide a proof-of-concept for the application of SLR to a large scale bioinformatics problem. RESULTS: Using SLR, we have built classifiers to identify and automatically categorize P-type ATPases into one of 11 pre-defined classes. The SLR-classifiers are compared to a Hidden Markov Model approach and shown to be highly accurate and scalable. Representing the bulk of currently known sequences, we analysed 9.3 million sequences in the UniProtKB and attempted to classify a large number of P-type ATPases. To examine the distribution of pumps on organisms, we also applied SLR to 1,123 complete genomes from the Entrez genome database. Finally, we analysed the predicted membrane topology of the identified P-type ATPases. CONCLUSIONS: Using the SLR-based classification tool we are able to run a large scale study of P-type ATPases. This study provides proof-of-concept for the application of SLR to a bioinformatics problem and the analysis of P-type ATPases pinpoints new and interesting targets for further biochemical characterization and structural analysis.
Subjects
Logistic Models , Protein Sequence Analysis/methods , Algorithms , Amino Acid Sequence , Artificial Intelligence , Computational Biology/methods , Protein Databases , Sequence Alignment , Software
ABSTRACT
BACKGROUND: Hidden Markov models are widely used for genome analysis as they combine ease of modelling with efficient analysis algorithms. Calculating the likelihood of a model using the forward algorithm has worst case time complexity linear in the length of the sequence and quadratic in the number of states in the model. For genome analysis, however, the length runs to millions or billions of observations, and when maximising the likelihood hundreds of evaluations are often needed. A time efficient forward algorithm is therefore a key ingredient in an efficient hidden Markov model library. RESULTS: We have built a software library for efficiently computing the likelihood of a hidden Markov model. The library exploits commonly occurring substrings in the input to reuse computations in the forward algorithm. In a pre-processing step our library identifies common substrings and builds a structure over the computations in the forward algorithm which can be reused. This analysis can be saved between uses of the library and is independent of concrete hidden Markov models so one preprocessing can be used to run a number of different models. Using this library, we achieve up to 78 times shorter wall-clock time for realistic whole-genome analyses with a real and reasonably complex hidden Markov model. In one particular case the analysis was performed in less than 8 minutes compared to 9.6 hours for the previously fastest library. CONCLUSIONS: We have implemented the preprocessing procedure and forward algorithm as a C++ library, zipHMM, with Python bindings for use in scripts. The library is available at http://birc.au.dk/software/ziphmm/.
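For orientation, the plain forward algorithm that the abstract takes as its baseline can be sketched as follows. This is the textbook version with per-step scaling, linear in the sequence length and quadratic in the number of states. It does not implement zipHMM's reuse of repeated substrings, and the function name and argument layout are assumptions made for this illustration, not the zipHMM API.

```python
import math


def forward_loglik(obs, init, trans, emit):
    """Log-likelihood of `obs` under an HMM using the scaled forward algorithm.

    obs   : list of observation indices
    init  : init[s]     = P(first state is s)
    trans : trans[r][s] = P(next state is s | current state is r)
    emit  : emit[s][o]  = P(observing o | state s)
    """
    if not obs:
        return 0.0
    k = len(init)
    alpha = [init[s] * emit[s][obs[0]] for s in range(k)]
    loglik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # O(k^2) work per position: every state sums over all predecessors.
            alpha = [
                sum(alpha[r] * trans[r][s] for r in range(k)) * emit[s][obs[t]]
                for s in range(k)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # the observations are impossible under the model
        loglik += math.log(scale)
        # Rescale so the column sums to 1, which avoids underflow on long inputs.
        alpha = [a / scale for a in alpha]
    return loglik
```

Each position costs a fixed amount of work regardless of what came before it, so an input with many repeated substrings repeats identical computations. That redundancy is what the library described above is said to exploit.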
Subjects
Markov Chains , Peptide Library , Software , Algorithms , Animals , Computer Simulation , Gorilla gorilla/genetics , Humans , Likelihood Functions , Observational Studies as Topic , Pan troglodytes/genetics , Pongo/genetics , Probability , Time Factors
ABSTRACT
The triplet distance is a distance measure that compares two rooted trees on the same set of leaves by enumerating all subsets of three leaves and counting how often the induced topologies of the trees are equal or different. We present an algorithm that computes the triplet distance between two rooted binary trees in time O(n log^2 n). The algorithm is related to an algorithm for computing the quartet distance between two unrooted binary trees in time O(n log n). While the quartet distance algorithm has a very severe overhead in the asymptotic time complexity that makes it impractical compared to O(n^2) time algorithms, we show through experiments that the triplet distance algorithm can be implemented to give a competitive wall-clock running time.
Subjects
Algorithms , Computational Biology/methods , Computer Simulation , Phylogeny , Software
ABSTRACT
Comparative methods for RNA secondary structure prediction use evolutionary information from RNA alignments to increase prediction accuracy. The model is often described in terms of stochastic context-free grammars (SCFGs), which generate a probability distribution over secondary structures. It is, however, unclear how this probability distribution changes as a function of the input alignment. As prediction programs typically only return a single secondary structure, better characterisation of the underlying probability space of RNA secondary structures is of great interest. In this work, we show how to efficiently compute the information entropy of the probability distribution over RNA secondary structures produced for RNA alignments by a phylo-SCFG, and implement it for the PPfold model. We also discuss interpretations and applications of this quantity, including how it can clarify reasons for low prediction reliability scores. PPfold and its source code are available from http://birc.au.dk/software/ppfold/.
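As a point of reference for the quantity discussed above, the information entropy of a small, explicitly enumerated distribution can be computed directly. This brute-force sketch is only illustrative: the abstract describes computing the entropy efficiently from the grammar without listing structures, which this code does not attempt. The function name and the toy probabilities are assumptions.

```python
import math


def shannon_entropy(probs):
    """Entropy in bits of a discrete distribution given as non-negative weights.

    The weights are renormalised, so they need not sum exactly to 1.
    """
    total = sum(probs)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    entropy = 0.0
    for p in probs:
        if p < 0:
            raise ValueError("weights must be non-negative")
        if p > 0:  # terms with zero probability contribute nothing
            q = p / total
            entropy -= q * math.log2(q)
    return entropy


# Toy example (made-up numbers): probabilities of four candidate structures.
concentrated = shannon_entropy([0.97, 0.01, 0.01, 0.01])
spread_out = shannon_entropy([0.25, 0.25, 0.25, 0.25])
```

A low value indicates that the probability mass sits on a few structures, while a high value indicates that many structures are nearly equally likely. In the second case a single returned structure says less about the distribution as a whole.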
Subjects
Algorithms , Theoretical Models , Nucleic Acid Conformation , RNA/chemistry , Base Sequence , Computational Biology/methods , Entropy , Probability , Software
ABSTRACT
Developing new medical drugs is expensive. Among the first steps is a screening process, in which molecules in existing chemical libraries are tested for activity against a given target. This requires a lot of resources and manpower. Therefore it has become common to perform a virtual screening, where computers are used for predicting the activity of very large libraries of molecules, to identify the most promising leads for further laboratory experiments. Since computer simulations generally require fewer resources than physical experimentation this can lower the cost of medical and biological research significantly. In this paper we review practically fast algorithms for screening databases of molecules in order to find molecules that are sufficiently similar to a query molecule.
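A baseline for such a similarity search can be sketched as a linear scan. The abstract does not name a similarity measure or a molecule representation, so this sketch assumes binary fingerprints stored as Python integers and the Tanimoto coefficient, a common choice in cheminformatics. The function names, the toy fingerprints, and the threshold are hypothetical, and the scan is the naive approach rather than one of the fast algorithms the review covers.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints encoded as non-negative ints."""
    union = bin(a | b).count("1")
    if union == 0:
        return 1.0  # convention chosen here: two empty fingerprints are identical
    return bin(a & b).count("1") / union


def screen(query, database, threshold):
    """Return (name, similarity) for entries at least `threshold` similar to `query`."""
    hits = []
    for name, fingerprint in database.items():
        similarity = tanimoto(query, fingerprint)
        if similarity >= threshold:
            hits.append((name, similarity))
    # Most similar first, so the most promising candidates lead the list.
    hits.sort(key=lambda item: item[1], reverse=True)
    return hits


# Toy database with made-up 8-bit fingerprints.
database = {"mol_a": 0b11110000, "mol_b": 0b11100001, "mol_c": 0b00001111}
hits = screen(0b11110000, database, threshold=0.5)
```

The scan touches every database entry once, so its cost grows linearly with the size of the library.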
ABSTRACT
Distance measures between trees are useful for comparing trees in a systematic manner, and several different distance measures have been proposed. The triplet and quartet distances, for rooted and unrooted trees, respectively, are defined as the number of subsets of three or four leaves, respectively, where the topologies of the induced subtrees differ. These distances can trivially be computed by explicitly enumerating all sets of three or four leaves and testing if the topologies are different, but this leads to time complexities at least of the order n^3 or n^4 just for enumerating the sets. The different topologies can be counted implicitly, however, and in this paper, we review a series of algorithmic improvements that have been used during the last decade to develop more efficient algorithms by exploiting two different strategies for this: one based on dynamic programming and another based on coloring leaves in one tree and updating a hierarchical decomposition of the other.
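The trivial enumeration described above can be written down directly for the triplet distance. The sketch below is the naive approach, not one of the efficient algorithms under review, and its running time is at least cubic in the number of leaves. It assumes that rooted trees are given as nested Python tuples with string leaf labels; that representation and all function names are choices made for this illustration.

```python
from itertools import combinations


def leaf_paths(tree):
    """Map each leaf to the tuple of internal-node ids on its path from the root."""
    paths = {}
    counter = [0]

    def walk(node, prefix):
        if isinstance(node, tuple):  # internal node: give it a fresh id
            counter[0] += 1
            here = prefix + (counter[0],)
            for child in node:
                walk(child, here)
        else:  # leaf label
            paths[node] = prefix

    walk(tree, ())
    return paths


def lca_depth(p, q):
    """Number of shared ancestors, i.e. the depth of the lowest common ancestor."""
    depth = 0
    for x, y in zip(p, q):
        if x != y:
            break
        depth += 1
    return depth


def topology(paths, a, b, c):
    """Return the pair of leaves grouped together, or None if unresolved."""
    ab = lca_depth(paths[a], paths[b])
    ac = lca_depth(paths[a], paths[c])
    bc = lca_depth(paths[b], paths[c])
    if ab > ac and ab > bc:
        return (a, b)
    if ac > ab and ac > bc:
        return (a, c)
    if bc > ab and bc > ac:
        return (b, c)
    return None  # all three leaves split at the same node


def triplet_distance(t1, t2):
    """Count the three-leaf subsets whose induced topologies differ."""
    p1, p2 = leaf_paths(t1), leaf_paths(t2)
    if set(p1) != set(p2):
        raise ValueError("trees must have the same leaf set")
    return sum(
        topology(p1, *triple) != topology(p2, *triple)
        for triple in combinations(sorted(p1), 3)
    )


# Two toy trees on leaves a-d that disagree only on the triple {a, b, c}.
t1 = ((("a", "b"), "c"), "d")
t2 = ((("a", "c"), "b"), "d")
distance = triplet_distance(t1, t2)
```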
ABSTRACT
Hidden Markov Models (HMMs) are widely used probabilistic models, particularly for annotating sequential data with an underlying hidden structure. Patterns in the annotation are often more relevant to study than the hidden structure itself. A typical HMM analysis consists of annotating the observed data using a decoding algorithm and analyzing the annotation to study patterns of interest. For example, given an HMM modeling genes in DNA sequences, the focus is on occurrences of genes in the annotation. In this paper, we define a pattern through a regular expression and present a restriction of three classical algorithms to take the number of occurrences of the pattern in the hidden sequence into account. We present a new algorithm to compute the distribution of the number of pattern occurrences, and we extend the two most widely used existing decoding algorithms to employ information from this distribution. We show experimentally that the expectation of the distribution of the number of pattern occurrences gives a highly accurate estimate, while the typical procedure can be biased in the sense that the identified number of pattern occurrences does not correspond to the true number. We furthermore show that using this distribution in the decoding algorithms improves the predictive power of the model.
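The "typical procedure" mentioned above, decoding first and then counting pattern occurrences in the resulting annotation, has a counting step that is easy to sketch. This is only that final step, under the assumption that the decoded hidden sequence is available as a string with one character per state. The state labels, the example annotation, and the function name are hypothetical, and the paper's algorithm for the full distribution of the count is not reproduced here.

```python
import re


def count_pattern(annotation, pattern):
    """Count non-overlapping matches of a regular expression in a state string.

    Note: a pattern that can match the empty string would also count empty
    matches, so patterns should require at least one state.
    """
    return sum(1 for _ in re.finditer(pattern, annotation))


# Toy annotation: N = non-coding state, C = coding state (labels are made up).
annotation = "NNCCCNNNCCNN"
gene_runs = count_pattern(annotation, r"C+")   # maximal runs of coding states
gene_starts = count_pattern(annotation, r"NC")  # transitions into a coding state
```

Counting on a single decoded path ignores the uncertainty in the decoding, which is the source of the bias the abstract reports for this procedure.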
ABSTRACT
UNLABELLED: PPfold is a multi-threaded implementation of the Pfold algorithm for RNA secondary structure prediction. Here we present a new version of PPfold, which extends the evolutionary analysis with a flexible probabilistic model for incorporating auxiliary data, such as data from structure probing experiments. Our tests show that the accuracy of single-sequence secondary structure prediction using experimental data in PPfold 3.0 is comparable to RNAstructure. Furthermore, alignment structure prediction quality is improved even further by the addition of experimental data. PPfold 3.0 therefore has the potential to produce more accurate predictions than was previously possible. AVAILABILITY AND IMPLEMENTATION: PPfold 3.0 is available as a platform-independent Java application and can be downloaded from http://birc.au.dk/software/ppfold.
Subjects
RNA/chemistry , Software , Algorithms , Statistical Models , Nucleic Acid Conformation , Phylogeny
ABSTRACT
UNLABELLED: High-throughput sequencing currently generates a wealth of small RNA (sRNA) data, making data mining a topical issue. Processing of these large data sets is inherently multidimensional as length, abundance, sequence composition, and genomic location all hold clues to sRNA function. Analysis can be challenging because the formulation and testing of complex hypotheses requires combined use of visualization, annotation and abundance profiling. To allow flexible generation and querying of these disparate types of information, we have developed the shortran pipeline for analysis of plant or animal short RNA sequencing data. It comprises nine modules and produces both graphical and MySQL format output. AVAILABILITY: shortran is freely available and can be downloaded from http://users-mb.au.dk/pmgrp/shortran/.
Subjects
Small Untranslated RNA/chemistry , Software , Arabidopsis/genetics , Data Mining , Genomics , Molecular Sequence Annotation , RNA Sequence Analysis/methods
ABSTRACT
SUMMARY: UniMoG is a software tool combining five genome rearrangement models: double cut and join (DCJ), restricted DCJ, Hannenhalli and Pevzner (HP), inversion and translocation. It can compute the pairwise genomic distances and a corresponding optimal sorting scenario for an arbitrary number of genomes. All five models can be unified through the DCJ model, so the implementation is based on DCJ and, where reasonable, uses the most efficient existing algorithms for each distance and sorting problem. Both textual and graphical output is possible for visualizing the operations. AVAILABILITY AND IMPLEMENTATION: The software is available through the Bielefeld University Bioinformatics Web Server at http://bibiserv.techfak.uni-bielefeld.de/dcj with instructions and example data. CONTACT: rhilker@cebitec.uni-bielefeld.de.
Subjects
Algorithms , Computational Biology/methods , Genomics/methods , Software , Internet , Genetic Models , User-Computer Interface
ABSTRACT
Five organisms having completely sequenced genomes and belonging to all major branches of green plants (Viridiplantae) were analyzed with respect to their content of P-type ATPase-encoding genes. These were the chlorophytes Ostreococcus tauri and Chlamydomonas reinhardtii, and the streptophytes Physcomitrella patens (a non-vascular moss), Selaginella moellendorffii (a primitive vascular plant), and Arabidopsis thaliana (a model flowering plant). Each organism contained sequences for all five subfamilies of P-type ATPases. Whereas Na(+) and H(+) pumps seem to mutually exclude each other in flowering plants and animals, they co-exist in chlorophytes, which show representatives for two kinds of Na(+) pumps (P2C and P2D ATPases) as well as a primitive H(+)-ATPase. Both Na(+) and H(+) pumps also co-exist in the moss P. patens, which has a P2D Na(+)-ATPase. In contrast to the primitive H(+)-ATPases in chlorophytes and P. patens, the H(+)-ATPases from vascular plants all have a large C-terminal regulatory domain as well as a conserved Arg in transmembrane segment 5 that is predicted to function as part of a backflow protection mechanism. Together these features are predicted to enable H(+) pumps in vascular plants to create large electrochemical gradients that can be modulated in response to diverse physiological cues. The complete inventory of P-type ATPases in the major branches of Viridiplantae is an important starting point for elucidating the evolution in plants of these important pumps.
ABSTRACT
A novel approach to incorporate water molecules in protein-ligand docking is proposed. In this method, the water molecules display the same flexibility during the docking simulation as the ligand. The method solvates the ligand with the maximum number of water molecules, and these are then retained or displaced depending on energy contributions during the docking simulation. Instead of being a static part of the receptor, each water molecule is a flexible on/off part of the ligand and is treated with the same flexibility as the ligand itself. To favor exclusion of the water molecules, a constant entropy penalty is added for each included water molecule. The method was evaluated using 12 structurally diverse protein-ligand complexes from the PDB, where several water molecules bridge the ligand and the protein. A considerable improvement in successful docking simulations was found when including flexible water molecules solvating hydrogen bonding groups of the ligand. The method has been implemented in the docking program Molegro Virtual Docker (MVD).