Results 1 - 20 of 31
1.
Nature ; 548(7665): 87-91, 2017 08 03.
Article in English | MEDLINE | ID: mdl-28746312

ABSTRACT

Hundreds of thousands of human genomes are now being sequenced to characterize genetic variation and use this information to augment association mapping studies of complex disorders and other phenotypic traits. Genetic variation is identified mainly by mapping short reads to the reference genome or by performing local assembly. However, these approaches are biased against discovery of structural variants and variation in the more complex parts of the genome. Hence, large-scale de novo assembly is needed. Here we show that it is possible to construct excellent de novo assemblies from high-coverage sequencing with mate-pair libraries extending up to 20 kilobases. We report de novo assemblies of 150 individuals (50 trios) from the GenomeDenmark project. The quality of these assemblies is similar to those obtained using the more expensive long-read technology. We use the assemblies to identify a rich set of structural variants including many novel insertions and demonstrate how this variant catalogue enables further deciphering of known association mapping signals. We leverage the assemblies to provide 100 completely resolved major histocompatibility complex haplotypes and to resolve major parts of the Y chromosome. Our study provides a regional reference genome that we expect will improve the power of future association mapping studies and hence pave the way for precision medicine initiatives, which now are being launched in many countries including Denmark.


Subject(s)
Genetic Variation/genetics, Population Genetics/standards, Human Genome/genetics, Genomics/standards, DNA Sequence Analysis/standards, Adult, Alleles, Child, Human Y Chromosomes/genetics, Denmark, Female, Haplotypes/genetics, Humans, Major Histocompatibility Complex/genetics, Male, Maternal Age, Mutation Rate, Paternal Age, Point Mutation/genetics, Reference Standards
2.
Nucleic Acids Res ; 45(5): 2875-2886, 2017 03 17.
Article in English | MEDLINE | ID: mdl-27998932

ABSTRACT

Expression of bacterial type II toxin-antitoxin (TA) systems is regulated at the transcriptional level through direct binding of the antitoxin to pseudo-palindromic sequences on operator DNA. In this context, the toxin functions as a co-repressor by stimulating DNA binding through direct interaction with the antitoxin. Here, we determine crystal structures of the complete 90 kDa heterooctameric VapBC1 complex from Caulobacter crescentus CB15 both in isolation and bound to its cognate DNA operator sequence at 1.6 and 2.7 Å resolution, respectively. DNA binding is associated with a dramatic architectural rearrangement of conserved TA interactions in which C-terminal extended structures of the antitoxin VapB1 swap positions to interlock the complex in the DNA-bound state. We further show that a pseudo-palindromic protein sequence in the antitoxin is responsible for this interaction and required for binding and inactivation of the VapC1 toxin dimer. Sequence analysis of 4127 orthologous VapB sequences reveals that such palindromic protein sequences are widespread and unique to bacterial and archaeal VapB antitoxins suggesting a general principle governing regulation of VapBC TA systems. Finally, a structure of C-terminally truncated VapB1 bound to VapC1 reveals discrete states of the TA interaction that suggest a structural basis for toxin activation in vivo.


Subject(s)
Bacterial Proteins/chemistry, Bacterial Toxins/chemistry, Caulobacter crescentus/genetics, Bacterial DNA/chemistry, DNA-Binding Proteins/chemistry, Membrane Glycoproteins/chemistry, Genetic Operator Regions, Amino Acid Motifs, Bacterial Proteins/metabolism, Bacterial Toxins/antagonists & inhibitors, Bacterial Toxins/metabolism, Bacterial DNA/metabolism, DNA-Binding Proteins/metabolism, Membrane Glycoproteins/metabolism, Molecular Models, Nucleic Acid Conformation, Protein Binding, Protein Domains
3.
Bioinformatics ; 32(3): 325-9, 2016 Feb 01.
Article in English | MEDLINE | ID: mdl-26471456

ABSTRACT

MOTIVATION: By using a class of large modular enzymes known as Non-Ribosomal Peptide Synthetases (NRPS), bacteria and fungi are capable of synthesizing a large variety of secondary metabolites, many of which are bioactive and have potential pharmaceutical applications, e.g. as antibiotics. There is thus an interest in predicting the compound synthesized by an NRPS from its primary structure (amino acid sequence) alone, as this would enable an in silico search of whole genomes for NRPS enzymes capable of synthesizing potentially useful compounds. RESULTS: NRPS synthesis happens in a conveyor belt-like fashion where each individual NRPS module is responsible for incorporating a specific substrate (typically an amino acid) into the final product. Here, we present a new method for predicting substrate specificities of individual NRPS modules based on occurrences of motifs in their primary structures. We compare our classifier with existing methods and discuss possible biological explanations of how the motifs might relate to substrate specificity. AVAILABILITY AND IMPLEMENTATION: SEQL-NRPS is available as a web service implemented in Python with Flask at http://services.birc.au.dk/seql-nrps and source code available at https://bitbucket.org/dansondergaard/seql-nrps/. CONTACT: micknudsen@gmail.com or cstorm@birc.au.dk SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Bacteria/enzymology, Fungi/enzymology, Peptide Synthases/chemistry, Protein Sequence Analysis/methods, Amino Acid Motifs, Computer Simulation, Peptide Synthases/metabolism, Substrate Specificity
4.
Bioinformatics ; 30(14): 2079-80, 2014 Jul 15.
Article in English | MEDLINE | ID: mdl-24651968

ABSTRACT

UNLABELLED: tqDist is a software package for computing the triplet and quartet distances between general rooted or unrooted trees, respectively. The program is based on algorithms with running time [Formula: see text] for the triplet distance calculation and [Formula: see text] for the quartet distance calculation, where n is the number of leaves in the trees and d is the degree of the tree with minimum degree. These are currently the fastest algorithms both in theory and in practice. AVAILABILITY AND IMPLEMENTATION: tqDist can be installed on Windows, Linux and Mac OS X. Doing this will install a set of command-line tools together with a Python module and an R package for scripting in Python or R. The software package is freely available under the GNU LGPL licence at http://birc.au.dk/software/tqDist.


Subject(s)
Phylogeny, Software, Algorithms, Classification/methods
5.
BMC Bioinformatics ; 14: 339, 2013 Nov 22.
Article in English | MEDLINE | ID: mdl-24266924

ABSTRACT

BACKGROUND: Hidden Markov models are widely used for genome analysis as they combine ease of modelling with efficient analysis algorithms. Calculating the likelihood of a model using the forward algorithm has worst case time complexity linear in the length of the sequence and quadratic in the number of states in the model. For genome analysis, however, the length runs to millions or billions of observations, and when maximising the likelihood hundreds of evaluations are often needed. A time efficient forward algorithm is therefore a key ingredient in an efficient hidden Markov model library. RESULTS: We have built a software library for efficiently computing the likelihood of a hidden Markov model. The library exploits commonly occurring substrings in the input to reuse computations in the forward algorithm. In a pre-processing step our library identifies common substrings and builds a structure over the computations in the forward algorithm which can be reused. This analysis can be saved between uses of the library and is independent of concrete hidden Markov models so one preprocessing can be used to run a number of different models. Using this library, we achieve up to 78 times shorter wall-clock time for realistic whole-genome analyses with a real and reasonably complex hidden Markov model. In one particular case the analysis was performed in less than 8 minutes compared to 9.6 hours for the previously fastest library. CONCLUSIONS: We have implemented the preprocessing procedure and forward algorithm as a C++ library, zipHMM, with Python bindings for use in scripts. The library is available at http://birc.au.dk/software/ziphmm/.
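
For orientation, the plain forward algorithm that the abstract takes as its baseline can be sketched as follows. This is a minimal illustration and not the zipHMM API: the function name, the argument layout (initial distribution `pi`, transition matrix `A`, emission matrix `B`) and the per-step rescaling are assumptions made for the example, and none of the substring-reuse preprocessing described above is shown.

```python
import math


def forward_log_likelihood(pi, A, B, obs):
    """Return log P(obs | model) using the scaled forward algorithm.

    pi[s]   : initial probability of state s
    A[s][t] : probability of moving from state s to state t
    B[s][x] : probability that state s emits symbol x
    obs     : non-empty list of observed symbol indices

    Hypothetical helper for illustration; assumes every prefix of obs
    has non-zero probability under the model.
    """
    n_states = len(pi)
    # Initialisation: joint probability of the first symbol and each state.
    alpha = [pi[s] * B[s][obs[0]] for s in range(n_states)]
    scale = sum(alpha)
    alpha = [a / scale for a in alpha]
    log_lik = math.log(scale)
    # Recursion: one pass over the sequence (linear in its length); each
    # step sums over all state pairs (quadratic in the number of states).
    for x in obs[1:]:
        alpha = [
            B[t][x] * sum(alpha[s] * A[s][t] for s in range(n_states))
            for t in range(n_states)
        ]
        scale = sum(alpha)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
        log_lik += math.log(scale)
    return log_lik
```

The nested loop over `s` and `t` inside the loop over `obs` is where the linear-times-quadratic cost quoted in the abstract comes from.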


Subject(s)
Markov Chains, Peptide Library, Software, Algorithms, Animals, Computer Simulation, Gorilla gorilla/genetics, Humans, Likelihood Functions, Observational Studies as Topic, Pan troglodytes/genetics, Pongo/genetics, Probability, Time Factors
6.
BMC Bioinformatics ; 14 Suppl 2: S18, 2013.
Article in English | MEDLINE | ID: mdl-23368759

ABSTRACT

The triplet distance is a distance measure that compares two rooted trees on the same set of leaves by enumerating all subsets of three leaves and counting how often the induced topologies of the tree are equal or different. We present an algorithm that computes the triplet distance between two rooted binary trees in time O(n log² n). The algorithm is related to an algorithm for computing the quartet distance between two unrooted binary trees in time O(n log n). While the quartet distance algorithm has a very severe overhead in the asymptotic time complexity that makes it impractical compared to O(n²) time algorithms, we show through experiments that the triplet distance algorithm can be implemented to give a competitive wall-clock running time.
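
The definition in the first sentence can be made concrete with a brute-force sketch. It enumerates every three-leaf subset directly, so it is far slower than the algorithm the abstract presents and is only meant to show what is being counted. The nested-tuple tree encoding and all function names are assumptions made for the example.

```python
from itertools import combinations


def clades(tree):
    """Return (leaf set of tree, list of leaf sets of all internal nodes).

    A tree is a nested tuple of children; a leaf is any non-tuple label.
    """
    if not isinstance(tree, tuple):
        return frozenset([tree]), []
    leaves = frozenset()
    found = []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found.extend(child_clades)
    found.append(leaves)
    return leaves, found


def triplet_topology(clade_list, a, b, c):
    """Return the pair grouped apart from the third leaf, or None if unresolved."""
    for x, y, z in ((a, b, c), (a, c, b), (b, c, a)):
        # x and y are grouped if some internal node covers both but not z.
        if any(x in cl and y in cl and z not in cl for cl in clade_list):
            return frozenset((x, y))
    return None


def triplet_distance(t1, t2):
    """Count the three-leaf subsets whose induced topologies differ."""
    leaves1, clades1 = clades(t1)
    leaves2, clades2 = clades(t2)
    if leaves1 != leaves2:
        raise ValueError("trees must have the same leaf set")
    return sum(
        triplet_topology(clades1, a, b, c) != triplet_topology(clades2, a, b, c)
        for a, b, c in combinations(sorted(leaves1), 3)
    )
```

For example, `triplet_distance((("a", "b"), ("c", "d")), (("a", "c"), ("b", "d")))` compares two balanced four-leaf trees in which every one of the four triplets is resolved differently.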


Subject(s)
Algorithms, Computational Biology/methods, Computer Simulation, Phylogeny, Software
7.
BMC Bioinformatics ; 14 Suppl 2: S22, 2013.
Article in English | MEDLINE | ID: mdl-23368905

ABSTRACT

Comparative methods for RNA secondary structure prediction use evolutionary information from RNA alignments to increase prediction accuracy. The model is often described in terms of stochastic context-free grammars (SCFGs), which generate a probability distribution over secondary structures. It is, however, unclear how this probability distribution changes as a function of the input alignment. As prediction programs typically only return a single secondary structure, better characterisation of the underlying probability space of RNA secondary structures is of great interest. In this work, we show how to efficiently compute the information entropy of the probability distribution over RNA secondary structures produced for RNA alignments by a phylo-SCFG, and implement it for the PPfold model. We also discuss interpretations and applications of this quantity, including how it can clarify reasons for low prediction reliability scores. PPfold and its source code are available from http://birc.au.dk/software/ppfold/.


Subject(s)
Algorithms, Theoretical Models, Nucleic Acid Conformation, RNA/chemistry, Base Sequence, Computational Biology/methods, Entropy, Probability, Software
8.
Bioinformatics ; 28(20): 2691-2, 2012 Oct 15.
Article in English | MEDLINE | ID: mdl-22877864

ABSTRACT

UNLABELLED: PPfold is a multi-threaded implementation of the Pfold algorithm for RNA secondary structure prediction. Here we present a new version of PPfold, which extends the evolutionary analysis with a flexible probabilistic model for incorporating auxiliary data, such as data from structure probing experiments. Our tests show that the accuracy of single-sequence secondary structure prediction using experimental data in PPfold 3.0 is comparable to RNAstructure. Furthermore, alignment structure prediction quality is improved even further by the addition of experimental data. PPfold 3.0 therefore has the potential of producing more accurate predictions than was previously possible. AVAILABILITY AND IMPLEMENTATION: PPfold 3.0 is available as a platform-independent Java application and can be downloaded from http://birc.au.dk/software/ppfold.


Subject(s)
RNA/chemistry, Software, Algorithms, Statistical Models, Nucleic Acid Conformation, Phylogeny
9.
Bioinformatics ; 28(19): 2509-11, 2012 Oct 01.
Article in English | MEDLINE | ID: mdl-22815356

ABSTRACT

SUMMARY: UniMoG is a software tool combining five genome rearrangement models: double cut and join (DCJ), restricted DCJ, Hannenhalli and Pevzner (HP), inversion and translocation. It can compute the pairwise genomic distances and a corresponding optimal sorting scenario for an arbitrary number of genomes. All five models can be unified through the DCJ model; thus the implementation is based on DCJ and, where reasonable, uses the most efficient existing algorithms for each distance and sorting problem. Both textual and graphical output is possible for visualizing the operations. AVAILABILITY AND IMPLEMENTATION: The software is available through the Bielefeld University Bioinformatics Web Server at http://bibiserv.techfak.uni-bielefeld.de/dcj with instructions and example data. CONTACT: rhilker@cebitec.uni-bielefeld.de.


Subject(s)
Algorithms, Computational Biology/methods, Genomics/methods, Software, Internet, Genetic Models, User-Computer Interface
10.
Bioinformatics ; 28(20): 2698-700, 2012 Oct 15.
Article in English | MEDLINE | ID: mdl-22914220

ABSTRACT

UNLABELLED: High-throughput sequencing currently generates a wealth of small RNA (sRNA) data, making data mining a topical issue. Processing of these large data sets is inherently multidimensional as length, abundance, sequence composition, and genomic location all hold clues to sRNA function. Analysis can be challenging because the formulation and testing of complex hypotheses requires combined use of visualization, annotation and abundance profiling. To allow flexible generation and querying of these disparate types of information, we have developed the shortran pipeline for analysis of plant or animal short RNA sequencing data. It comprises nine modules and produces both graphical and MySQL format output. AVAILABILITY: shortran is freely available and can be downloaded from http://users-mb.au.dk/pmgrp/shortran/.


Subject(s)
Small Untranslated RNA/chemistry, Software, Arabidopsis/genetics, Data Mining, Genomics, Molecular Sequence Annotation, RNA Sequence Analysis/methods
11.
J Chem Inf Model ; 51(3): 597-600, 2011 Mar 28.
Article in English | MEDLINE | ID: mdl-21332133

ABSTRACT

The ever growing size of chemical databases calls for the development of novel methods for representing and comparing molecules. One such method called LINGO is based on fragmenting the SMILES string representation of molecules. Comparison of molecules can then be performed by calculating the Tanimoto coefficient, which is called LINGOsim when used on LINGO multisets. This paper introduces a verbose representation for storing LINGO multisets, which makes it possible to transform them into sparse fingerprints such that fingerprint data structures and algorithms can be used to accelerate queries. The previous best method for rapidly calculating the LINGOsim similarity matrix required specialized hardware to yield a significant speedup over existing methods. By representing LINGO multisets in the verbose representation and using inverted indices, it is possible to calculate LINGOsim similarity matrices roughly 2.6 times faster than existing methods without relying on specialized hardware.
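
The comparison step described above can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's method: the fragment length `q = 4` is assumed here as a default, any SMILES preprocessing that a real LINGO implementation performs is omitted, and neither the verbose representation nor the inverted indices from the abstract are shown. The function names are hypothetical.

```python
from collections import Counter


def lingos(smiles, q=4):
    """Return the multiset of overlapping length-q substrings of a SMILES string.

    q = 4 is an assumed default for this sketch.
    """
    return Counter(smiles[i:i + q] for i in range(len(smiles) - q + 1))


def multiset_tanimoto(x, y):
    """Tanimoto coefficient on multisets: |intersection| / |union|.

    Counter's & keeps the minimum count per fragment and | the maximum,
    which are the multiset intersection and union.
    """
    inter = sum((x & y).values())
    union = sum((x | y).values())
    # Two empty multisets are treated as identical by convention here.
    return inter / union if union else 1.0
```

A call such as `multiset_tanimoto(lingos("CCCCO"), lingos("CCCCC"))` compares two short strings that share one fragment; computing this for every pair of molecules yields the similarity matrix the abstract refers to.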


Subject(s)
Computational Biology, Chemical Databases, Algorithms
12.
J Chem Inf Model ; 51(4): 909-17, 2011 Apr 25.
Article in English | MEDLINE | ID: mdl-21452852

ABSTRACT

A novel approach to incorporate water molecules in protein-ligand docking is proposed. In this method, the water molecules display the same flexibility during the docking simulation as the ligand. The method solvates the ligand with the maximum number of water molecules, and these are then retained or displaced depending on energy contributions during the docking simulation. Instead of being a static part of the receptor, each water molecule is a flexible on/off part of the ligand and is treated with the same flexibility as the ligand itself. To favor exclusion of the water molecules, a constant entropy penalty is added for each included water molecule. The method was evaluated using 12 structurally diverse protein-ligand complexes from the PDB, where several water molecules bridge the ligand and the protein. A considerable improvement in successful docking simulations was found when including flexible water molecules solvating hydrogen bonding groups of the ligand. The method has been implemented in the docking program Molegro Virtual Docker (MVD).


Subject(s)
Binding Sites, Computer Simulation, Ligands, Proteins/chemistry, Water/chemistry, Algorithms, Carrier Proteins, Hydrogen Bonding, Molecular Models, Protein Binding, Protein Conformation, Thermodynamics
13.
BMC Bioinformatics ; 10 Suppl 1: S74, 2009 Jan 30.
Article in English | MEDLINE | ID: mdl-19208179

ABSTRACT

BACKGROUND: Identifying the genetic components of common diseases has long been an important area of research. Recently, genotyping technology has reached the level where it is cost effective to genotype single nucleotide polymorphism (SNP) markers covering the entire genome, in thousands of individuals, and analyse such data for markers associated with a disease. The statistical power to detect association, however, is limited when markers are analysed one at a time. This can be alleviated by considering multiple markers simultaneously. The Haplotype Pattern Mining (HPM) method is a machine learning approach to do exactly this. RESULTS: We present a new, faster algorithm for the HPM method. The new approach uses patterns of haplotype diversity in the genome: locally in the genome, the number of observed haplotypes is much smaller than the total number of possible haplotypes. We show that the new approach speeds up the HPM method by a factor of 2 on a genome-wide dataset with 5009 individuals typed in 491208 markers using default parameters, and by more if the pattern length is increased. CONCLUSION: The new algorithm speeds up the HPM method and we show that it is feasible to apply HPM to whole genome association mapping with thousands of individuals and hundreds of thousands of markers.


Subject(s)
Algorithms, Computational Biology/methods, Human Genome, Haplotypes/genetics, Genetic Databases, Genetic Markers, Genetic Predisposition to Disease, Genetic Variation, Humans, Single Nucleotide Polymorphism
14.
Gigascience ; 7(1): 1-7, 2018 01 01.
Article in English | MEDLINE | ID: mdl-29267855

ABSTRACT

Background: DNA methylation plays a key role in the regulation of gene expression and carcinogenesis. Bisulfite sequencing studies mainly focus on calling single nucleotide polymorphisms, identifying differentially methylated regions, and finding allele-specific DNA methylation. Until now, only a few software tools have focused on virus integration using bisulfite sequencing data. Findings: We have developed a new and easy-to-use software tool, named BS-virus-finder (BSVF, RRID:SCR_015727), to detect viral integration breakpoints in whole human genomes. The tool is hosted at https://github.com/BGI-SZ/BSVF. Conclusions: BS-virus-finder demonstrates high sensitivity and specificity. It is useful in epigenetic studies and to reveal the relationship between viral integration and DNA methylation. BS-virus-finder is the first software tool to detect virus integration loci by using bisulfite sequencing data.


Subject(s)
Viral DNA/genetics, Human Genome, Hepatitis B virus/genetics, Hepatocytes/virology, Software, Virus Integration, Base Pairing, Base Sequence, Tumor Cell Line, DNA Methylation, Genetic Epigenesis, Hepatocytes/metabolism, Hepatocytes/pathology, High-Throughput Nucleotide Sequencing, Humans, Sensitivity and Specificity, Sulfites/chemistry, Whole Genome Sequencing
15.
Bioinformatics ; 22(18): 2317-8, 2006 Sep 15.
Article in English | MEDLINE | ID: mdl-16632491

ABSTRACT

UNLABELLED: GeneRecon is a tool for fine-scale association mapping using a coalescence model. GeneRecon takes as input case-control data from phased or unphased SNP and microsatellite genotypes. The posterior distribution of disease locus position is obtained by Metropolis-Hastings sampling in the state space of genealogies. Input format, search strategy and the sampled statistics can be configured through the Guile Scheme programming language embedded in GeneRecon, making GeneRecon highly configurable. AVAILABILITY: The source code for GeneRecon, written in C++ and Scheme, is available under the GNU General Public License (GPL) at http://www.birc.au.dk/~mailund/GeneRecon CONTACT: mailund@birc.au.dk.


Subject(s)
Chromosome Mapping/methods, DNA Mutational Analysis/methods, Genetic Predisposition to Disease/genetics, Population Genetics, Linkage Disequilibrium/genetics, Genetic Models, Software, Algorithms, Animals, Genetic Linkage/genetics, Humans, Statistical Models, DNA Sequence Analysis/methods
16.
J Integr Bioinform ; 14(2)2017 Jul 07.
Article in English | MEDLINE | ID: mdl-28686574

ABSTRACT

A cancer of unknown primary (CUP) is a metastatic cancer for which standard diagnostic tests fail to identify the location of the primary tumor. CUPs account for 3-5% of cancer cases. Using molecular data to determine the location of the primary tumor in such cases can help doctors make the right treatment choice and thus improve the clinical outcome. In this paper, we present a new method for predicting the location of the primary tumor using gene expression data: locating cancers of unknown primary (LoCUP). The method models the data as a mixture of normal and tumor cells and thus allows correct classification even in impure samples, where the tumor biopsy is contaminated by a large fraction of normal cells. We find that our method provides a significant increase in classification accuracy (95.8% over 90.8%) on simulated low-purity metastatic samples and shows potential on a small dataset of real metastasis samples with known origin.


Subject(s)
Gene Expression Profiling, Neoplastic Gene Expression Regulation, Unknown Primary Neoplasms/genetics, Unknown Primary Neoplasms/therapy, Biopsy, Humans
17.
BMC Bioinformatics ; 7: 29, 2006 Jan 19.
Article in English | MEDLINE | ID: mdl-16423304

ABSTRACT

BACKGROUND: The neighbor-joining method by Saitou and Nei is a widely used method for constructing phylogenetic trees. The formulation of the method gives rise to a canonical Θ(n³) algorithm upon which all existing implementations are based. RESULTS: In this paper we present techniques for speeding up the canonical neighbor-joining method. Our algorithms construct the same phylogenetic trees as the canonical neighbor-joining method. The best-case running time of our algorithms is O(n²), but the worst case remains O(n³). We empirically evaluate the performance of our algorithms on distance matrices obtained from the Pfam collection of alignments. The experiments indicate that the running time of our algorithms grows as Θ(n²) on the examined instance collection. We also compare the running time with that of the QuickTree tool, a widely used efficient implementation of the canonical neighbor-joining method. CONCLUSION: The experiments show that our algorithms also yield a significant speed-up, already for medium sized instances.
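
To show where the cubic cost of the canonical method comes from, one selection step of textbook neighbor-joining can be sketched as below. This is the standard Saitou and Nei criterion, not the accelerated search the abstract proposes; the function name and the return layout are assumptions made for the example.

```python
def nj_select_pair(d):
    """Perform one selection step of canonical neighbor-joining.

    d is a symmetric distance matrix (list of lists) over n >= 3 taxa.
    Returns (i, j, q, li, lj): the pair to join, its Q value, and the
    branch lengths from i and j to the new internal node.
    Hypothetical helper for illustration only.
    """
    n = len(d)
    row_sums = [sum(row) for row in d]
    best = None
    # Scanning all pairs costs Theta(n^2); the canonical method repeats
    # this for each of the n - 2 joins, hence Theta(n^3) overall.
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * d[i][j] - row_sums[i] - row_sums[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    q, i, j = best
    # Branch lengths from the joined taxa to their new parent node.
    li = d[i][j] / 2 + (row_sums[i] - row_sums[j]) / (2 * (n - 2))
    lj = d[i][j] - li
    return i, j, q, li, lj
```

A full implementation would then replace rows `i` and `j` with a single row for the new node and repeat; a faster method has to avoid rescanning every pair at each step while still choosing the same minimum.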


Subject(s)
Computational Biology/methods, Algorithms, Animals, Cluster Analysis, Computer Simulation, Molecular Evolution, Humans, Likelihood Functions, Statistical Models, Phylogeny, Sequence Alignment, Protein Sequence Analysis, Software
18.
Sci Rep ; 6: 34212, 2016 Sep 27.
Article in English | MEDLINE | ID: mdl-27670643

ABSTRACT

H2 metabolism is proposed to be the most ancient and diverse mechanism of energy-conservation. The metalloenzymes mediating this metabolism, hydrogenases, are encoded by over 60 microbial phyla and are present in all major ecosystems. We developed a classification system and web tool, HydDB, for the structural and functional analysis of these enzymes. We show that hydrogenase function can be predicted by primary sequence alone using an expanded classification scheme (comprising 29 [NiFe], 8 [FeFe], and 1 [Fe] hydrogenase classes) that defines 11 new classes with distinct biological functions. Using this scheme, we built a web tool that rapidly and reliably classifies hydrogenase primary sequences using a combination of k-nearest neighbors' algorithms and CDD referencing. Demonstrating its capacity, the tool reliably predicted hydrogenase content and function in 12 newly-sequenced bacteria, archaea, and eukaryotes. HydDB provides the capacity to browse the amino acid sequences of 3248 annotated hydrogenase catalytic subunits and also contains a detailed repository of physiological, biochemical, and structural information about the 38 hydrogenase classes defined here. The database and classifier are freely and publicly available at http://services.birc.au.dk/hyddb/.

19.
BMC Bioinformatics ; 6: 252, 2005 Oct 14.
Article in English | MEDLINE | ID: mdl-16225674

ABSTRACT

BACKGROUND: Coalescent simulations are playing a large role in interpreting large scale intra-specific sequence or polymorphism surveys and for planning and evaluating association studies. Coalescent simulations of data sets under different models can be compared to the actual data to test the importance of different evolutionary factors and thus get insight into these. RESULTS: We have created the CoaSim application as a flexible environment for Monte Carlo simulation of various types of genetic data under equilibrium and non-equilibrium coalescent processes for a variety of applications. Interaction with the tool is through the Guile version of the Scheme scripting language. Scheme scripts for many standard and advanced applications are provided and these can easily be modified by the user for a much wider range of applications. A graphical user interface with less functionality and flexibility is also included. It is primarily intended as an exploratory and educational tool. CONCLUSION: CoaSim is a powerful tool because of its flexibility and ease of use. This is illustrated through very varied uses of the application, e.g. evaluation of association mapping methods, parametric bootstrapping, and design and choice of markers for specific questions.
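
As a rough idea of what a Monte Carlo coalescent simulation involves, the sketch below draws the waiting times between coalescence events under the basic equilibrium (constant population size, no recombination) coalescent. It is not CoaSim's Scheme interface and covers none of the non-equilibrium processes mentioned above; the function name is hypothetical and time is measured in standard coalescent units.

```python
import random


def coalescent_waiting_times(n, rng):
    """Draw waiting times between coalescence events for a sample of size n.

    While k lineages remain, the time until the next pair merges is
    exponentially distributed with rate k * (k - 1) / 2 (coalescent units).
    Returns the n - 1 waiting times, from k = n lineages down to k = 2.
    """
    times = []
    for k in range(n, 1, -1):
        rate = k * (k - 1) / 2
        times.append(rng.expovariate(rate))
    return times


# Usage: estimate the expected tree height for a sample of five sequences.
rng = random.Random(42)  # fixed seed so the run is repeatable
replicates = [sum(coalescent_waiting_times(5, rng)) for _ in range(10000)]
mean_height = sum(replicates) / len(replicates)
```

Under this model the expected tree height is 2(1 - 1/n), so `mean_height` should come out near 1.6 for n = 5; comparing such simulated summaries with observed data is the kind of use the abstract describes.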


Subject(s)
Computer Simulation, Genetic Markers, Genetic Predisposition to Disease/genetics, Software, Case-Control Studies, Statistical Data Interpretation, Disease Susceptibility, Humans, Genetic Models, Monte Carlo Method, Genetic Polymorphism, Single Nucleotide Polymorphism
20.
Nat Commun ; 6: 5969, 2015 Jan 19.
Article in English | MEDLINE | ID: mdl-25597990

ABSTRACT

Building a population-specific catalogue of single nucleotide variants (SNVs), indels and structural variants (SVs) with frequencies, termed a national pan-genome, is critical for further advancing clinical and public health genetics in large cohorts. Here we report a Danish pan-genome obtained from sequencing 10 trios to high depth (50×). We report 536k novel SNVs and 283k novel short indels from mapping approaches and develop a population-wide de novo assembly approach to identify 132k novel indels larger than 10 nucleotides with low false discovery rates. We identify a higher proportion of indels and SVs than previous efforts showing the merits of high coverage and de novo assembly approaches. In addition, we use trio information to identify de novo mutations and use a probabilistic method to provide direct estimates of 1.27e-8 and 1.5e-9 per nucleotide per generation for SNVs and indels, respectively.


Subject(s)
Human Genome/genetics, Algorithms, Humans, Mutation Rate, Single Nucleotide Polymorphism/genetics, DNA Sequence Analysis/methods