Results 1 - 20 of 21
1.
bioRxiv ; 2024 May 10.
Article in English | MEDLINE | ID: mdl-38766004

ABSTRACT

Haplotype phasing, the process of determining which genetic variants are physically located on the same chromosome, is crucial for various genetic analyses. In this study, we first benchmark SHAPEIT and Beagle, two state-of-the-art phasing methods, on two large datasets: >8 million diverse, research-consented 23andMe, Inc. customers and the UK Biobank (UKB). We find that both perform exceptionally well. Beagle's median switch error rate (SER) (after excluding single SNP switches) in white British trios from UKB is 0.026% compared to 0.00% for European ancestry 23andMe research participants; 55.6% of European ancestry 23andMe research participants have zero non-single SNP switches, compared to 42.4% of white British trios. South Asian ancestry 23andMe research participants have the highest median SER amongst the 23andMe populations, but it is still remarkably low at 0.46%. We also investigate the relationship between identity-by-descent (IBD) and SER, finding that switch errors tend to occur in regions of little or no IBD segment coverage. SHAPEIT and Beagle excel at 'intra-chromosomal' phasing, but lack the ability to phase across chromosomes, motivating us to develop an inter-chromosomal phasing method, called HAPTIC (HAPlotype TIling and Clustering), that assigns paternal and maternal variants discretely genome-wide. Our approach uses identity-by-descent (IBD) segments to phase blocks of variants on different chromosomes. HAPTIC represents the segments a focal individual shares with their relatives as nodes in a signed graph and performs bipartite clustering on the signed graph using spectral clustering. We test HAPTIC on 1022 UKB trios, yielding a median phase error of 0.08% in regions covered by IBD segments (33.5% of sites). We also ran HAPTIC in the 23andMe database and found a median phase error rate (the rate of mismatching alleles between the inferred and true phase) of 0.92% in Europeans (93.8% of sites) and 0.09% in admixed Africans (92.7% of sites). HAPTIC's precision depends heavily on data from relatives, so will increase as datasets grow larger and more diverse. HAPTIC enables analyses that require the parent-of-origin of variants, such as association studies and ancestry inference of untyped parents.
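
The signed-graph clustering step described in the abstract above can be illustrated with a small sketch. This is not the HAPTIC implementation: the edge list, the weights, the sign convention (a positive edge means two IBD segments are assumed to come through the same parent, a negative edge means different parents) and the function names are all assumptions made for illustration. The sketch builds a signed Laplacian and splits the nodes by the sign of the eigenvector with the smallest eigenvalue, which is found here by shifted power iteration in plain Python.

```python
def signed_laplacian(n_nodes, edges):
    """Signed Laplacian L = D - A, where D holds the sums of |weights|."""
    lap = [[0.0] * n_nodes for _ in range(n_nodes)]
    for i, j, weight in edges:
        lap[i][j] -= weight
        lap[j][i] -= weight
        lap[i][i] += abs(weight)
        lap[j][j] += abs(weight)
    return lap


def bipartition(n_nodes, edges, iterations=500):
    """Split nodes into two clusters by the sign of the eigenvector of the
    signed Laplacian that has the smallest eigenvalue.

    Power iteration on (shift * I - L) converges to that eigenvector because
    the shift (twice the largest degree) bounds every eigenvalue of L.
    """
    lap = signed_laplacian(n_nodes, edges)
    shift = 2.0 * max(lap[i][i] for i in range(n_nodes))
    vec = [1.0] + [0.0] * (n_nodes - 1)  # deterministic starting vector
    for _ in range(iterations):
        nxt = [
            shift * vec[i] - sum(lap[i][j] * vec[j] for j in range(n_nodes))
            for i in range(n_nodes)
        ]
        norm = sum(x * x for x in nxt) ** 0.5
        vec = [x / norm for x in nxt]
    # Node 0 defines cluster 0; nodes with the opposite sign go to cluster 1.
    return [0 if (v >= 0) == (vec[0] >= 0) else 1 for v in vec]


# Hypothetical toy graph: four IBD-segment nodes, weights are illustrative.
toy_edges = [
    (0, 1, 2.0),   # assumed same parent
    (2, 3, 1.5),   # assumed same parent
    (0, 2, -1.0),  # assumed different parents
    (1, 3, -1.0),  # assumed different parents
]
toy_labels = bipartition(4, toy_edges)
```

In this toy graph the signs are mutually consistent, so the two clusters separate cleanly; which cluster is paternal and which is maternal cannot be read from the graph alone and would need outside information.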

2.
Nat Commun ; 15(1): 3238, 2024 Apr 15.
Article in English | MEDLINE | ID: mdl-38622117

ABSTRACT

Great efforts are being made to develop advanced polygenic risk scores (PRS) to improve the prediction of complex traits and diseases. However, most existing PRS are primarily trained on European ancestry populations, limiting their transferability to non-European populations. In this article, we propose a novel method for generating multi-ancestry Polygenic Risk scOres based on enSemble of PEnalized Regression models (PROSPER). PROSPER integrates genome-wide association studies (GWAS) summary statistics from diverse populations to develop ancestry-specific PRS with improved predictive power for minority populations. The method uses a combination of L1 (lasso) and L2 (ridge) penalty functions, a parsimonious specification of the penalty parameters across populations, and an ensemble step to combine PRS generated across different penalty parameters. We evaluate the performance of PROSPER and other existing methods on large-scale simulated and real datasets, including those from 23andMe Inc., the Global Lipids Genetics Consortium, and All of Us. Results show that PROSPER can substantially improve multi-ancestry polygenic prediction compared to alternative methods across a wide variety of genetic architectures. In real data analyses, for example, PROSPER increased out-of-sample prediction R² for continuous traits by an average of 70% compared to a state-of-the-art Bayesian method (PRS-CSx) in the African ancestry population. Further, PROSPER is computationally highly scalable for the analysis of large SNP contents and many diverse populations.


Subject(s)
Genome-Wide Association Study; Population Health; Humans; Bayes Theorem; Multifactorial Inheritance/genetics; Black People/genetics; Genetic Risk Score; Risk Factors
3.
Cell Genom ; 4(4): 100539, 2024 Apr 10.
Article in English | MEDLINE | ID: mdl-38604127

ABSTRACT

Polygenic risk scores (PRSs) are now showing promising predictive performance on a wide variety of complex traits and diseases, but there exists a substantial performance gap across populations. We propose MUSSEL, a method for ancestry-specific polygenic prediction that borrows information in summary statistics from genome-wide association studies (GWASs) across multiple ancestry groups via Bayesian hierarchical modeling and ensemble learning. In our simulation studies and data analyses across four distinct studies, totaling 5.7 million participants with a substantial ancestral diversity, MUSSEL shows promising performance compared to alternatives. For example, MUSSEL has an average gain in prediction R² across 11 continuous traits of 40.2% and 49.3% compared to PRS-CSx and CT-SLEB, respectively, in the African ancestry population. The best-performing method, however, varies by GWAS sample size, target ancestry, trait architecture, and linkage disequilibrium reference samples; thus, ultimately a combination of methods may be needed to generate the most robust PRSs across diverse populations.


Subject(s)
Bivalvia; Multifactorial Inheritance; Humans; Animals; Multifactorial Inheritance/genetics; Genome-Wide Association Study/methods; Bayes Theorem; Phenotype; Genetic Risk Score
4.
bioRxiv ; 2024 Apr 10.
Article in English | MEDLINE | ID: mdl-36993331

ABSTRACT

Great efforts are being made to develop advanced polygenic risk scores (PRS) to improve the prediction of complex traits and diseases. However, most existing PRS are primarily trained on European ancestry populations, limiting their transferability to non-European populations. In this article, we propose a novel method for generating multi-ancestry Polygenic Risk scOres based on enSemble of PEnalized Regression models (PROSPER). PROSPER integrates genome-wide association studies (GWAS) summary statistics from diverse populations to develop ancestry-specific PRS with improved predictive power for minority populations. The method uses a combination of ℒ1 (lasso) and ℒ2 (ridge) penalty functions, a parsimonious specification of the penalty parameters across populations, and an ensemble step to combine PRS generated across different penalty parameters. We evaluate the performance of PROSPER and other existing methods on large-scale simulated and real datasets, including those from 23andMe Inc., the Global Lipids Genetics Consortium, and All of Us. Results show that PROSPER can substantially improve multi-ancestry polygenic prediction compared to alternative methods across a wide variety of genetic architectures. In real data analyses, for example, PROSPER increased out-of-sample prediction R² for continuous traits by an average of 70% compared to a state-of-the-art Bayesian method (PRS-CSx) in the African ancestry population. Further, PROSPER is computationally highly scalable for the analysis of large SNP contents and many diverse populations.

5.
Cell ; 186(21): 4514-4527.e14, 2023 Oct 12.
Article in English | MEDLINE | ID: mdl-37757828

ABSTRACT

Autozygosity is associated with rare Mendelian disorders and clinically relevant quantitative traits. We investigated associations between the fraction of the genome in runs of homozygosity (FROH) and common diseases in Genes & Health (n = 23,978 British South Asians), UK Biobank (n = 397,184), and 23andMe. We show that restricting analysis to offspring of first cousins is an effective way of reducing confounding due to social/environmental correlates of FROH. Within this group in G&H+UK Biobank, we found experiment-wide significant associations between FROH and twelve common diseases. We replicated associations with type 2 diabetes (T2D) and post-traumatic stress disorder via within-sibling analysis in 23andMe (median n = 480,282). We estimated that autozygosity due to consanguinity accounts for 5%-18% of T2D cases among British Pakistanis. Our work highlights the possibility of widespread non-additive genetic effects on common diseases and has important implications for global populations with high rates of consanguinity.


Subject(s)
Consanguinity; Diabetes Mellitus, Type 2; Humans; Diabetes Mellitus, Type 2/genetics; Homozygote; Phenotype; Polymorphism, Single Nucleotide; Biological Specimen Banks; Genome, Human; Genetic Predisposition to Disease; United Kingdom
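
The FROH quantity used in the abstract above (the fraction of the genome in runs of homozygosity) can be sketched as a simple ratio. The segment coordinates, the genome length, the minimum-length filter and the function name below are hypothetical and only illustrate the arithmetic; the study's own ROH calling pipeline and thresholds are not described in this listing.

```python
def froh(roh_segments, genome_length_bp, min_length_bp=0):
    """Fraction of the genome covered by runs of homozygosity (ROH).

    roh_segments: iterable of (start_bp, end_bp) pairs, assumed not to overlap.
    genome_length_bp: length of the genome region that was scanned.
    min_length_bp: segments shorter than this are ignored (an assumed filter).
    """
    if genome_length_bp <= 0:
        raise ValueError("genome_length_bp must be positive")
    covered = 0
    for start, end in roh_segments:
        length = end - start
        if length < 0:
            raise ValueError("segment end precedes segment start")
        if length >= min_length_bp:
            covered += length
    return covered / genome_length_bp


# Hypothetical toy input: three segments on a 100 Mb scanned region.
toy_segments = [
    (0, 10_000_000),
    (50_000_000, 55_000_000),
    (70_000_000, 70_400_000),  # 0.4 Mb, dropped by the 1 Mb filter below
]
toy_froh = froh(toy_segments, 100_000_000, min_length_bp=1_000_000)
```

For scale, the expected inbreeding coefficient for offspring of first cousins is 1/16 (0.0625), so FROH values in that group are expected to scatter around that figure.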
6.
Nat Genet ; 55(10): 1757-1768, 2023 Oct.
Article in English | MEDLINE | ID: mdl-37749244

ABSTRACT

Polygenic risk scores (PRSs) increasingly predict complex traits; however, suboptimal performance in non-European populations raises concerns about clinical applications and health inequities. We developed CT-SLEB, a powerful and scalable method to calculate PRSs, using ancestry-specific genome-wide association study summary statistics from multiancestry training samples, integrating clumping and thresholding, empirical Bayes and superlearning. We evaluated CT-SLEB and nine alternative methods with large-scale simulated genome-wide association studies (~19 million common variants) and datasets from 23andMe, Inc., the Global Lipids Genetics Consortium, All of Us and UK Biobank, involving 5.1 million individuals of diverse ancestry, with 1.18 million individuals from four non-European populations across 13 complex traits. Results demonstrated that CT-SLEB significantly improves PRS performance in non-European populations compared with simple alternatives, with comparable or superior performance to a recent, computationally intensive method. Moreover, our simulation studies offered insights into sample size requirements and SNP density effects on multiancestry risk prediction.


Subject(s)
Multifactorial Inheritance; Population Health; Humans; Multifactorial Inheritance/genetics; Genome-Wide Association Study; Bayes Theorem; Polymorphism, Single Nucleotide/genetics; Risk Factors; Genetic Predisposition to Disease
7.
bioRxiv ; 2023 Sep 21.
Article in English | MEDLINE | ID: mdl-37090648

ABSTRACT

Polygenic risk scores (PRS) are now showing promising predictive performance on a wide variety of complex traits and diseases, but there exists a substantial performance gap across different populations. We propose MUSSEL, a method for ancestry-specific polygenic prediction that borrows information in the summary statistics from genome-wide association studies (GWAS) across multiple ancestry groups. MUSSEL conducts Bayesian hierarchical modeling under a MUltivariate Spike-and-Slab model for effect-size distribution and incorporates an Ensemble Learning step using super learner to combine information across different tuning parameter settings and ancestry groups. In our simulation studies and data analyses of 16 traits across four distinct studies, totaling 5.7 million participants with a substantial ancestral diversity, MUSSEL shows promising performance compared to alternatives. The method, for example, has an average gain in prediction R² across 11 continuous traits of 40.2% and 49.3% compared to PRS-CSx and CT-SLEB, respectively, in the African ancestry population. The best-performing method, however, varies by GWAS sample size, target ancestry, underlying trait architecture, and the choice of reference samples for LD estimation, and thus ultimately, a combination of methods may be needed to generate the most robust PRS across diverse populations.

8.
Commun Biol ; 4(1): 1269, 2021 Nov 05.
Article in English | MEDLINE | ID: mdl-34741098

ABSTRACT

There is currently a dearth of accessible whole genome sequencing (WGS) data for individuals residing in the Americas with Sub-Saharan African ancestry. We generated whole genome sequencing data at intermediate (15×) coverage for 2,294 individuals with large amounts of Sub-Saharan African ancestry, predominantly Atlantic African admixed with varying amounts of European and American ancestry. We performed extensive comparisons of variant callers, phasing algorithms, and variant filtration on these data to construct a high quality imputation panel containing data from 2,269 unrelated individuals. With the exception of the TOPMed imputation server (which notably cannot be downloaded), our panel substantially outperformed other available panels when imputing African American individuals. The raw sequencing data, variant calls and imputation panel for this cohort are all freely available via dbGaP and should prove an invaluable resource for further study of admixed African genetics.


Subject(s)
Genome, Human; Genotype; Adult; Black or African American; Aged; Aged, 80 and over; Humans; Middle Aged; United States; Whole Genome Sequencing; Young Adult
9.
Sci Adv ; 7(12), 2021 Mar.
Article in English | MEDLINE | ID: mdl-33731350

ABSTRACT

The role of the nuclear genome in maintaining the stability of the mitochondrial genome (mtDNA) is incompletely known. mtDNA sequence variants can exist in a state of heteroplasmy, which denotes the coexistence of organellar genomes with different sequences. Heteroplasmic variants that impair mitochondrial capacity cause disease, and the state of heteroplasmy itself is deleterious. However, mitochondrial heteroplasmy may provide an intermediate state in the emergence of novel mitochondrial haplogroups. We used genome-wide genotyping data from 982,072 European ancestry individuals to evaluate variation in mitochondrial heteroplasmy and to identify the regions of the nuclear genome that affect it. Age, sex, and mitochondrial haplogroup were associated with the extent of heteroplasmy. GWAS identified 20 loci for heteroplasmy that exceeded genome-wide significance. This included a region overlapping mitochondrial transcription factor A (TFAM), which has multiple roles in mtDNA packaging, replication, and transcription. These results show that mitochondrial heteroplasmy has a heritable nuclear component.


Subject(s)
Genome, Mitochondrial; Mitochondrial Diseases; Cell Nucleus/genetics; DNA, Mitochondrial/genetics; Genome-Wide Association Study; Heteroplasmy; Humans; Mitochondrial Diseases/genetics
10.
PLoS Genet ; 16(3): e1008552, 2020 Mar.
Article in English | MEDLINE | ID: mdl-32150539

ABSTRACT

The genetic diversity of humans, like many species, has been shaped by a complex pattern of population separations followed by isolation and subsequent admixture. This pattern, reaching at least as far back as the appearance of our species in the paleontological record, has left its traces in our genomes. Reconstructing a population's history from these traces is a challenging problem. Here we present a novel approach based on the Multiple Sequentially Markovian Coalescent (MSMC) to analyze the separation history between populations. Our approach, called MSMC-IM, uses an improved implementation of the MSMC (MSMC2) to estimate coalescence rates within and across pairs of populations, and then fits a continuous Isolation-Migration model to these rates to obtain a time-dependent estimate of gene flow. We show, using simulations, that our method can identify complex demographic scenarios involving post-split admixture or archaic introgression. We apply MSMC-IM to whole genome sequences from 15 worldwide populations, tracking the process of human genetic diversification. We detect traces of extremely deep ancestry between some African populations, with around 1% of ancestry dating to divergences older than a million years ago.


Subject(s)
Gene Flow/genetics; Genome, Human/genetics; Black People/genetics; Genetic Variation/genetics; Haplotypes/genetics; Human Migration; Humans; Models, Genetic; Population Density; Whole Genome Sequencing/methods
11.
Nature ; 562(7726): 203-209, 2018 Oct.
Article in English | MEDLINE | ID: mdl-30305743

ABSTRACT

The UK Biobank project is a prospective cohort study with deep genetic and phenotypic data collected on approximately 500,000 individuals from across the United Kingdom, aged between 40 and 69 at recruitment. The open resource is unique in its size and scope. A rich variety of phenotypic and health-related information is available on each participant, including biological measurements, lifestyle indicators, biomarkers in blood and urine, and imaging of the body and brain. Follow-up information is provided by linking health and medical records. Genome-wide genotype data have been collected on all participants, providing many opportunities for the discovery of new genetic associations and the genetic bases of complex traits. Here we describe the centralized analysis of the genetic data, including genotype quality, properties of population structure and relatedness of the genetic data, and efficient phasing and genotype imputation that increases the number of testable variants to around 96 million. Classical allelic variation at 11 human leukocyte antigen genes was imputed, resulting in the recovery of signals with known associations between human leukocyte antigen alleles and many diseases.


Subject(s)
Databases, Factual; Genomics; Phenotype; Adult; Aged; Alleles; Biomarkers/blood; Biomarkers/urine; Body Height/genetics; Brain/diagnostic imaging; Cohort Studies; Databases, Genetic; Electronic Health Records; Family; Female; Genome-Wide Association Study; Haplotypes/genetics; Humans; Life Style; Major Histocompatibility Complex/genetics; Male; Middle Aged; Quality Control; Racial Groups/genetics; United Kingdom
12.
Bioinformatics ; 33(1): 142-144, 2017 Jan 01.
Article in English | MEDLINE | ID: mdl-27634946

ABSTRACT

MOTIVATION: Ancestry and Kinship Toolkit (AKT) is a statistical genetics tool for analysing large cohorts of whole-genome sequenced samples. It can rapidly detect related samples, characterize sample ancestry, calculate correlation between variants, check Mendel consistency and perform data clustering. AKT brings together the functionality of many state-of-the-art methods, with a focus on speed and a unified interface. We believe it will be an invaluable tool for the curation of large WGS datasets. AVAILABILITY AND IMPLEMENTATION: The source code is available at https://illumina.github.io/akt CONTACTS: joconnell@illumina.com or rudy.d.arthur@gmail.com SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Genome, Human; Pedigree; Sequence Analysis, DNA/methods; Software; Cluster Analysis; Family; Female; Humans; Male; Phylogeny
13.
Cell ; 167(5): 1415-1429.e19, 2016 Nov 17.
Article in English | MEDLINE | ID: mdl-27863252

ABSTRACT

Many common variants have been associated with hematological traits, but identification of causal genes and pathways has proven challenging. We performed a genome-wide association analysis in the UK Biobank and INTERVAL studies, testing 29.5 million genetic variants for association with 36 red cell, white cell, and platelet properties in 173,480 European-ancestry participants. This effort yielded hundreds of low frequency (<5%) and rare (<1%) variants with a strong impact on blood cell phenotypes. Our data highlight general properties of the allelic architecture of complex traits, including the proportion of the heritable component of each blood trait explained by the polygenic signal across different genome regulatory domains. Finally, through Mendelian randomization, we provide evidence of shared genetic pathways linking blood cell indices with complex pathologies, including autoimmune diseases, schizophrenia, and coronary heart disease and evidence suggesting previously reported population associations between blood cell indices and cardiovascular disease may be non-causal.


Subject(s)
Genetic Variation; Genome-Wide Association Study; Hematopoietic Stem Cells/metabolism; Immune System Diseases/genetics; Alleles; Cell Differentiation; Genetic Predisposition to Disease; Hematopoietic Stem Cells/pathology; Humans; Immune System Diseases/pathology; Polymorphism, Single Nucleotide; Quantitative Trait Loci; White People/genetics
14.
Nat Genet ; 48(7): 817-20, 2016 Jul.
Article in English | MEDLINE | ID: mdl-27270105

ABSTRACT

The UK Biobank (UKB) has recently released genotypes on 152,328 individuals together with extensive phenotypic and lifestyle information. We present a new phasing method, SHAPEIT3, that can handle such biobank-scale data sets and results in switch error rates as low as ∼0.3%. The method exhibits O(N log N) scaling with sample size N, enabling fast and accurate phasing of even larger cohorts.


Subject(s)
Algorithms; Biological Specimen Banks; Computational Biology/methods; Genetics, Population; Haplotypes/genetics; Cohort Studies; Datasets as Topic; Genome, Human; Genomics; High-Throughput Nucleotide Sequencing/statistics & numerical data; Humans; Polymorphism, Single Nucleotide/genetics; Sequence Analysis, DNA/methods; United Kingdom; White People
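
The switch error rate quoted in the abstract above can be illustrated with a short sketch. It assumes the usual definition (the fraction of consecutive heterozygous-site pairs whose relative phase disagrees with the truth) and takes one true and one inferred haplotype as lists of 0/1 alleles at heterozygous sites; the function name and the toy data are hypothetical, and this is not code from SHAPEIT3.

```python
def switch_error_rate(true_hap, inferred_hap):
    """Fraction of adjacent heterozygous-site pairs with a phase switch.

    Both arguments list the allele (0 or 1) carried by one haplotype at each
    heterozygous site, in genomic order. A switch is counted whenever the
    inferred haplotype changes from agreeing with the true haplotype to
    disagreeing with it, or the reverse, between two neighbouring sites.
    """
    if len(true_hap) != len(inferred_hap):
        raise ValueError("haplotypes must cover the same sites")
    if len(true_hap) < 2:
        raise ValueError("at least two heterozygous sites are required")
    agrees = [t == i for t, i in zip(true_hap, inferred_hap)]
    switches = sum(1 for a, b in zip(agrees, agrees[1:]) if a != b)
    return switches / (len(agrees) - 1)


# Hypothetical toy haplotypes over six heterozygous sites.
toy_true = [0, 1, 1, 0, 1, 0]
toy_inferred = [0, 1, 0, 1, 0, 0]
toy_ser = switch_error_rate(toy_true, toy_inferred)
```

Here the inferred haplotype agrees at sites 1-2, disagrees at sites 3-5 and agrees again at site 6, which gives two switches over five adjacent pairs. A haplotype that is wrong at every site has no switches under this definition, which is why a separate phase error rate is needed for genome-wide parent assignment.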
15.
Bioinformatics ; 32(15): 2306-12, 2016 Aug 01.
Article in English | MEDLINE | ID: mdl-27153730

ABSTRACT

MOTIVATION: Whole-genome low-coverage sequencing has been combined with linkage-disequilibrium (LD)-based genotype refinement to accurately and cost-effectively infer genotypes in large cohorts of individuals. Most genotype refinement methods are based on hidden Markov models, which are accurate but computationally expensive. We introduce an algorithm that models LD using a simple multivariate Gaussian distribution. The key feature of our algorithm is its speed. RESULTS: Our method is hundreds of times faster than other methods on the same data set and its scaling behaviour is linear in the number of samples. We demonstrate the performance of the method on both low- and high-coverage samples. AVAILABILITY AND IMPLEMENTATION: The source code is available at https://github.com/illumina/marvin CONTACT: rarthur@illumina.com SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Genotype; Linkage Disequilibrium; Software; Algorithms; Humans; Normal Distribution
16.
Lancet Respir Med ; 3(10): 769-81, 2015 Oct.
Article in English | MEDLINE | ID: mdl-26423011

ABSTRACT

BACKGROUND: Understanding the genetic basis of airflow obstruction and smoking behaviour is key to determining the pathophysiology of chronic obstructive pulmonary disease (COPD). We used UK Biobank data to study the genetic causes of smoking behaviour and lung health. METHODS: We sampled individuals of European ancestry from UK Biobank, from the middle and extremes of the forced expiratory volume in 1 s (FEV1) distribution among heavy smokers (mean 35 pack-years) and never smokers. We developed a custom array for UK Biobank to provide optimum genome-wide coverage of common and low-frequency variants, dense coverage of genomic regions already implicated in lung health and disease, and to assay rare coding variants relevant to the UK population. We investigated whether there were shared genetic causes between different phenotypes defined by extremes of FEV1. We also looked for novel variants associated with extremes of FEV1 and smoking behaviour and assessed regions of the genome that had already shown evidence for a role in lung health and disease. We set genome-wide significance at p < 5 × 10⁻⁸. FINDINGS: UK Biobank participants were recruited from March 15, 2006, to July 7, 2010. Sample selection for the UK BiLEVE study started on Nov 22, 2012, and was completed on Dec 20, 2012. We selected 50,008 unique samples: 10,002 individuals with low FEV1, 10,000 with average FEV1, and 5,002 with high FEV1 from each of the heavy smoker and never smoker groups. We noted a substantial sharing of genetic causes of low FEV1 between heavy smokers and never smokers (p = 2.29 × 10⁻¹⁶) and between individuals with and without doctor-diagnosed asthma (p = 6.06 × 10⁻¹¹). We discovered six novel genome-wide significant signals of association with extremes of FEV1, including signals at four novel loci (KANSL1, TSEN54, TET2, and RBM19/TBX5) and independent signals at two previously reported loci (NPNT and HLA-DQB1/HLA-DQA2). These variants also showed association with COPD, including in individuals with no history of smoking. The number of copies of a 150 kb region containing the 5' end of KANSL1, a gene that is important for epigenetic gene regulation, was associated with extremes of FEV1. We also discovered five new genome-wide significant signals for smoking behaviour, including a variant in NCAM1 (chromosome 11) and a variant on chromosome 2 (between TEX41 and PABPC1P2) that has a trans effect on expression of NCAM1 in brain tissue. INTERPRETATION: By sampling from the extremes of the lung function distribution in UK Biobank, we identified novel genetic causes of lung function and smoking behaviour. These results provide new insight into the specific mechanisms underlying airflow obstruction, COPD, and tobacco addiction, and show substantial shared genetic architecture underlying airflow obstruction across individuals, irrespective of smoking behaviour and other airway disease. FUNDING: Medical Research Council.


Subject(s)
Lung/physiopathology; Pulmonary Disease, Chronic Obstructive/genetics; Smoking/genetics; Adolescent; Adult; Aged; Aged, 80 and over; Biological Specimen Banks; Case-Control Studies; Female; Forced Expiratory Volume/genetics; Genetic Association Studies; Humans; Male; Middle Aged; Polymorphism, Single Nucleotide; Risk Factors; United Kingdom; Young Adult
17.
Nat Commun ; 6: 7846, 2015 Aug 05.
Article in English | MEDLINE | ID: mdl-26242864

ABSTRACT

Several studies have reported that the number of crossovers increases with maternal age in humans, but others have found the opposite. Resolving the true effect has implications for understanding the maternal age effect on aneuploidies. Here, we revisit this question in the largest sample to date using single nucleotide polymorphism (SNP)-chip data, comprising over 6,000 meioses from nine cohorts. We develop and fit a hierarchical model to allow for differences between cohorts and between mothers. We estimate that over 10 years, the expected number of maternal crossovers increases by 2.1% (95% credible interval (0.98%, 3.3%)). Our results are not consistent with the larger positive and negative effects previously reported in smaller cohorts. We see heterogeneity between cohorts that is likely due to chance effects in smaller samples, or possibly to confounders, emphasizing that care should be taken when interpreting results from any specific cohort about the effect of maternal age on recombination.


Subject(s)
Crossing Over, Genetic; Maternal Age; Recombination, Genetic; Aneuploidy; Bayes Theorem; Cohort Studies; Humans; Linear Models; Models, Genetic
18.
PeerJ ; 3: e996, 2015.
Article in English | MEDLINE | ID: mdl-26056623

ABSTRACT

Scaffolding errors and incorrect repeat disambiguation during de novo assembly can result in large scale misassemblies in draft genomes. Nextera mate pair sequencing data provide additional information to resolve assembly ambiguities during scaffolding. Here, we introduce NxRepair, an open source toolkit for error correction in de novo assemblies that uses Nextera mate pair libraries to identify and correct large-scale errors. We show that NxRepair can identify and correct large scaffolding errors, without use of a reference sequence, resulting in quantitative improvements in the assembly quality. NxRepair can be downloaded from GitHub or PyPI, the Python Package Index; a tutorial and user documentation are also available.

19.
Bioinformatics ; 31(12): 2035-7, 2015 Jun 15.
Article in English | MEDLINE | ID: mdl-25661542

ABSTRACT

MOTIVATION: Mate pair protocols add to the utility of paired-end sequencing by boosting the genomic distance spanned by each pair of reads, potentially allowing larger repeats to be bridged and resolved. The Illumina Nextera Mate Pair (NMP) protocol uses a circularization-based strategy that leaves behind 38-bp adapter sequences, which must be computationally removed from the data. While 'adapter trimming' is a well-studied area of bioinformatics, existing tools do not fully exploit the particular properties of NMP data and discard more data than is necessary. RESULTS: We present NxTrim, a tool that strives to discard as little sequence as possible from NMP reads. NxTrim makes full use of the sequence on both sides of the adapter site to build 'virtual libraries' of mate pairs, paired-end reads and single-ended reads. For bacterial data, we show that aggregating these datasets allows a single NMP library to yield an assembly whose quality compares favourably to that obtained from regular paired-end reads. AVAILABILITY AND IMPLEMENTATION: The source code is available at https://github.com/sequencing/NxTrim


Subject(s)
Bacteria/genetics; Computational Biology/methods; Genome, Bacterial; High-Throughput Nucleotide Sequencing/methods; Sequence Analysis, DNA/methods; Software; Gene Library
20.
PLoS Genet ; 10(4): e1004234, 2014 Apr.
Article in English | MEDLINE | ID: mdl-24743097

ABSTRACT

Many existing cohorts contain a range of relatedness between genotyped individuals, either by design or by chance. Haplotype estimation in such cohorts is a central step in many downstream analyses. Using genotypes from six cohorts from isolated populations and two cohorts from non-isolated populations, we have investigated the performance of different phasing methods designed for nominally 'unrelated' individuals. We find that SHAPEIT2 produces much lower switch error rates in all cohorts compared to other methods, including those designed specifically for isolated populations. In particular, when large amounts of IBD sharing are present, SHAPEIT2 infers close to perfect haplotypes. Based on these results we have developed a general strategy for phasing cohorts with any level of implicit or explicit relatedness between individuals. First SHAPEIT2 is run ignoring all explicit family information. We then apply a novel HMM method (duoHMM) to combine the SHAPEIT2 haplotypes with any family information to infer the inheritance pattern of each meiosis at all sites across each chromosome. This allows the correction of switch errors, detection of recombination events and genotyping errors. We show that the method detects numbers of recombination events that align very well with expectations based on genetic maps, and that it infers far fewer spurious recombination events than Merlin. The method can also detect genotyping errors and infer recombination events in otherwise uninformative families, such as trios and duos. The detected recombination events can be used in association scans for recombination phenotypes. The method provides a simple and unified approach to haplotype estimation that will be of interest to researchers in the fields of human, animal and plant genetics.


Subject(s)
Haplotypes/genetics; Chromosome Mapping/methods; Cohort Effect; Family; Genotype; Humans; Models, Genetic; Pedigree; Phenotype; Recombination, Genetic/genetics