Results 1 - 20 of 22
1.
Gigascience ; 13, 2024 Jan 02.
Article in English | MEDLINE | ID: mdl-39347649

ABSTRACT

The large amount and diversity of viral genomic datasets generated by next-generation sequencing technologies poses a set of challenges for computational data analysis workflows, including rigorous quality control, scaling to large sample sizes, and tailored steps for specific applications. Here, we present V-pipe 3.0, a computational pipeline designed for analyzing next-generation sequencing data of short viral genomes. It is developed to enable reproducible, scalable, adaptable, and transparent inference of genetic diversity of viral samples. By presenting 2 large-scale data analysis projects, we demonstrate the effectiveness of V-pipe 3.0 in supporting sustainable viral genomic data science.


Subject(s)
Genetic Variation , Genome, Viral , High-Throughput Nucleotide Sequencing , Software , High-Throughput Nucleotide Sequencing/methods , Computational Biology/methods , Genomics/methods , Viruses/genetics , Humans
2.
F1000Res ; 13: 556, 2024.
Article in English | MEDLINE | ID: mdl-38984017

ABSTRACT

Background: Determining the appropriate computational requirements and software performance is essential for efficient genomic surveillance. The lack of standardized benchmarking complicates software selection, especially with limited resources. Methods: We developed a containerized benchmarking pipeline to evaluate seven long-read assemblers (Canu, GoldRush, MetaFlye, Strainline, HaploDMF, iGDA, and RVHaplo) for viral haplotype reconstruction, using both simulated and experimental Oxford Nanopore sequencing data of HIV-1 and other viruses. Benchmarking was conducted on three computational systems to assess each assembler's performance, utilizing QUAST and BLASTN for quality assessment. Results: Our findings show that assembler choice significantly impacts assembly time, with CPU and memory usage having minimal effect. Assembler selection also influences the size of the contigs, with a minimum read length of 2,000 nucleotides required for quality assembly. A 4,000-nucleotide read length improves quality further. Canu was efficient among de novo assemblers but not suitable for multi-strain mixtures, while GoldRush produced only consensus assemblies. Strainline and MetaFlye were suitable for metagenomic sequencing data, with Strainline requiring high memory and MetaFlye operable on low-specification machines. Among reference-based assemblers, iGDA had high error rates, RVHaplo showed the best runtime and accuracy but became ineffective with similar sequences, and HaploDMF, utilizing machine learning, had fewer errors with a slightly longer runtime. Conclusions: The HIV-64148 pipeline, containerized using Docker, facilitates easy deployment and offers flexibility to select from a range of assemblers to match computational systems or study requirements. This tool aids in genome assembly and provides valuable information on HIV-1 sequences, enhancing viral evolution monitoring and understanding.


Subject(s)
Computational Biology , Genomics , HIV-1 , Software , HIV-1/genetics , Computational Biology/methods , Genomics/methods , Humans , Genome, Viral/genetics
3.
Annu Rev Virol ; 11(1): 67-87, 2024 Sep.
Article in English | MEDLINE | ID: mdl-38848592

ABSTRACT

The arrival of novel sequencing technologies throughout the past two decades has led to a paradigm shift in our understanding of herpesvirus genomic diversity. Previously, herpesviruses were seen as a family of DNA viruses with low genomic diversity. However, a growing body of evidence now suggests that herpesviruses exist as dynamic populations that possess standing variation and evolve at much faster rates than previously assumed. In this review, we explore how strategies such as deep sequencing, long-read sequencing, and haplotype reconstruction are allowing scientists to dissect the genomic composition of herpesvirus populations. We also discuss the challenges that need to be addressed before a detailed picture of herpesvirus diversity can emerge.


Subject(s)
Genetic Variation , Genome, Viral , Herpesviridae , High-Throughput Nucleotide Sequencing , Herpesviridae/genetics , Herpesviridae/classification , Humans , High-Throughput Nucleotide Sequencing/methods , Genomics/methods , Herpesviridae Infections/virology , Haplotypes , Evolution, Molecular , Animals , Sequence Analysis, DNA/methods , Phylogeny
4.
Genet Epidemiol ; 48(1): 3-26, 2024 Feb.
Article in English | MEDLINE | ID: mdl-37830494

ABSTRACT

Advances in DNA sequencing technologies have enabled genotyping of complex genetic regions exhibiting copy number variation and high allelic diversity, yet it is impossible to derive exact genotypes in all cases, often resulting in ambiguous genotype calls, that is, partially missing data. An example of such a gene region is the killer-cell immunoglobulin-like receptor (KIR) genes. These genes are of special interest in the context of allogeneic hematopoietic stem cell transplantation. For such complex gene regions, current haplotype reconstruction methods are not feasible as they cannot cope with the complexity of the data. We present an expectation-maximization (EM)-algorithm to estimate haplotype frequencies (HTFs) which deals with the missing data components and takes into account linkage disequilibrium (LD) between genes. To cope with the exponential increase in the number of haplotypes as genes are added, we add three components to a standard EM-algorithm implementation. First, reconstruction is performed iteratively, adding one gene at a time. Second, after each step, haplotypes with frequencies below a threshold are collapsed into a rare haplotype group. Third, the HTF of the rare haplotype group is profiled in subsequent iterations to improve estimates. A simulation study evaluates the effect of combining information of multiple genes on the estimates of these frequencies. We show that estimated HTFs are approximately unbiased. Our simulation study shows that the EM-algorithm is able to combine information from multiple genes when LD is high, whereas increased ambiguity levels increase bias. Linear regression models based on this EM-algorithm show that a large number of haplotypes can be problematic for unbiased effect size estimation and that models need to be sparse. In a real data analysis of KIR genotypes, we compare HTFs to those obtained in an independent study.
Our new EM-algorithm-based method is the first to account for the full genetic architecture of complex gene regions, such as the KIR gene region. This algorithm can handle the numerous observed ambiguities, and allows for the collapsing of haplotypes to perform implicit dimension reduction. Combining information from multiple genes improves haplotype reconstruction.
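
The abstract above rests on the classic EM idea for estimating haplotype frequencies from ambiguous genotypes. As a minimal sketch of that underlying idea only, and not the authors' method (it omits their iterative gene-by-gene extension, rare-haplotype collapsing, and profiling), the following assumes two biallelic loci, unphased genotypes coded as alternate-allele counts, and Hardy-Weinberg proportions. All function names are hypothetical.

```python
# Hypothetical sketch: textbook EM for two-locus haplotype frequencies.
# Assumptions (not from the source): two biallelic loci, genotypes given as
# alternate-allele counts (0, 1, or 2) per locus, random mating.

HAPLOTYPES = [(0, 0), (0, 1), (1, 0), (1, 1)]


def compatible_pairs(genotype):
    """Return all unordered haplotype pairs whose allele sums equal the genotype."""
    pairs = []
    for i, h1 in enumerate(HAPLOTYPES):
        for h2 in HAPLOTYPES[i:]:
            if tuple(a + b for a, b in zip(h1, h2)) == tuple(genotype):
                pairs.append((h1, h2))
    return pairs


def em_haplotype_frequencies(genotypes, n_iter=200, tol=1e-10):
    """Estimate haplotype frequencies from unphased genotypes by EM."""
    if not genotypes:
        raise ValueError("at least one genotype is required")
    freqs = {h: 1.0 / len(HAPLOTYPES) for h in HAPLOTYPES}
    for _ in range(n_iter):
        counts = {h: 0.0 for h in HAPLOTYPES}
        # E-step: split each genotype across its compatible haplotype pairs,
        # weighted by the pair probability under the current frequencies.
        for g in genotypes:
            pairs = compatible_pairs(g)
            weights = [(1 if h1 == h2 else 2) * freqs[h1] * freqs[h2]
                       for h1, h2 in pairs]
            total = sum(weights)
            if total == 0:
                continue
            for (h1, h2), w in zip(pairs, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # M-step: renormalise expected haplotype counts into frequencies.
        n = sum(counts.values())
        new_freqs = {h: c / n for h, c in counts.items()}
        delta = max(abs(new_freqs[h] - freqs[h]) for h in HAPLOTYPES)
        freqs = new_freqs
        if delta < tol:
            break
    return freqs
```

Only the double heterozygote (1, 1) is ambiguous in this two-locus setting; with more loci or multi-allelic genes the number of compatible pairs grows quickly, which is the scaling problem the abstract addresses.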


Subject(s)
DNA Copy Number Variations , Models, Genetic , Humans , Haplotypes , Gene Frequency , Genotype
5.
Front Microbiol ; 14: 1182695, 2023.
Article in English | MEDLINE | ID: mdl-37396376

ABSTRACT

Nervous necrosis virus, NNV, is a neurotropic virus that causes viral nervous necrosis disease in a wide range of fish species, including European sea bass (Dicentrarchus labrax). NNV has a bisegmented (+) ssRNA genome consisting of RNA1, which encodes the RNA polymerase, and RNA2, encoding the capsid protein. The most prevalent NNV species in sea bass is red-spotted grouper nervous necrosis virus (RGNNV), causing high mortality in larvae and juveniles. Reverse genetics studies have associated amino acid 270 of the RGNNV capsid protein with RGNNV virulence in sea bass. NNV infection generates quasispecies and reassortants able to adapt to various selective pressures, such as host immune response or switching between host species. To better understand the variability of RGNNV populations and their association with RGNNV virulence, sea bass specimens were infected with two RGNNV recombinant viruses, a wild-type, rDl965, highly virulent to sea bass, and a single-mutant virus, Mut270Dl965, less virulent to this host. Both viral genome segments were quantified in brain by RT-qPCR, and genetic variability of whole-genome quasispecies was studied by Next Generation Sequencing (NGS). Copies of RNA1 and RNA2 in brains of fish infected with the low-virulence virus were 1,000-fold lower than those in brains of fish infected with the virulent virus. In addition, differences between the two experimental groups in the Ts/Tv ratio, recombination frequency and genetic heterogeneity of the mutant spectra in the RNA2 segment were found. These results show that the entire quasispecies of a bisegmented RNA virus changes as a consequence of a single point mutation in the consensus sequence of one of its segments. Sea bream (Sparus aurata) is an asymptomatic carrier for RGNNV, thus rDl965 is considered a low-virulence isolate in this species.
To assess whether the quasispecies characteristics of rDl965 were conserved in another host showing different susceptibility, juvenile sea bream were infected with rDl965 and analyzed as described above. Interestingly, both viral load and genetic variability of rDl965 in sea bream were similar to those of Mut270Dl965 in sea bass. This result suggests that the genetic variability and evolution of RGNNV mutant spectra may be associated with its virulence.

6.
Virus Evol ; 8(2): veac093, 2022.
Article in English | MEDLINE | ID: mdl-36478783

ABSTRACT

Longitudinal deep sequencing of viruses can provide detailed information about intra-host evolutionary dynamics including how viruses interact with and transmit between hosts. Many analyses require haplotype reconstruction, identifying which variants are co-located on the same genomic element. Most current methods to perform this reconstruction are based on a high density of variants and cannot perform this reconstruction for slowly evolving viruses. We present a new approach, HaROLD (HAplotype Reconstruction Of Longitudinal Deep sequencing data), which performs this reconstruction based on identifying co-varying variant frequencies using a probabilistic framework. We illustrate HaROLD on both RNA and DNA viruses with synthetic Illumina paired read data created from mixed human cytomegalovirus (HCMV) and norovirus genomes, and clinical datasets of HCMV and norovirus samples, demonstrating high accuracy, especially when longitudinal samples are available.

7.
Malar J ; 20(1): 311, 2021 Jul 10.
Article in English | MEDLINE | ID: mdl-34246273

ABSTRACT

BACKGROUND: Malaria patients can have two or more haplotypes in their blood sample, making it challenging to identify which haplotypes they carry. In addition, there are challenges in measuring the type and frequency of resistant haplotypes in populations. This study presents a novel statistical method, a Gibbs sampler algorithm, to investigate this issue. RESULTS: The performance of the algorithm is evaluated on simulated datasets consisting of patient blood samples characterized by their multiplicity of infection (MOI) and malaria genotype. The simulation used different resistance allele frequencies (RAF) at each single nucleotide polymorphism (SNP) and different limits of detection (LoD) of the SNPs and the MOI. The Gibbs sampler algorithm was validated, showed higher accuracy at high LoD of the SNPs or the MOI, and handled missing MOI better than previous related statistical approaches. CONCLUSIONS: The Gibbs sampler algorithm provided robust results when faced with genotyping errors caused by LoDs and functioned well even in the absence of MOI data on individual patients.


Subject(s)
Algorithms , Malaria/blood , Plasmodium/genetics , Haplotypes , Humans , Markov Chains , Monte Carlo Method
8.
Mol Biol Evol ; 38(6): 2660-2672, 2021 05 19.
Article in English | MEDLINE | ID: mdl-33547786

ABSTRACT

DNA sequencing technologies provide unprecedented opportunities to analyze within-host evolution of microorganism populations. Often, within-host populations are analyzed via pooled sequencing of the population, which contains multiple individuals or "haplotypes." However, current next-generation sequencing instruments, in conjunction with single-molecule barcoded linked-reads, cannot distinguish long haplotypes directly. Computational reconstruction of haplotypes from pooled sequencing has been attempted in virology, bacterial genomics, metagenomics, and human genetics, using algorithms based on either cross-host genetic sharing or within-host genomic reads. Here, we describe PoolHapX, a flexible computational approach that integrates information from both genetic sharing and genomic sequencing. We demonstrated that PoolHapX outperforms state-of-the-art tools tailored to specific organismal systems, and is robust to within-host evolution. Importantly, together with barcoded linked-reads, PoolHapX can infer whole-chromosome-scale haplotypes from 50 pools each containing 12 different haplotypes. By analyzing real data, we uncovered dynamic variations in the evolutionary processes of within-patient HIV populations previously unobserved in single position-based analysis.


Subject(s)
Genetic Techniques , Genetics, Microbial/methods , Haplotypes , Software , Algorithms , Biological Evolution , HIV/genetics , Humans , Plasmodium vivax/genetics
9.
Mol Ecol Resour ; 21(1): 93-109, 2021 Jan.
Article in English | MEDLINE | ID: mdl-32810339

ABSTRACT

Shifting from the analysis of single nucleotide polymorphisms to the reconstruction of selected haplotypes greatly facilitates the interpretation of evolve and resequence (E&R) experiments. Merging highly correlated hitchhiker SNPs into haplotype blocks reduces thousands of candidates to few selected regions. Current methods of haplotype reconstruction from Pool-seq data need a variety of data-specific parameters that are typically defined ad hoc and require haplotype sequences for validation. Here, we introduce haplovalidate, a tool which detects selected haplotypes in Pool-seq time series data without the need for sequenced haplotypes. Haplovalidate makes data-driven choices of two key parameters for the clustering procedure, the minimum correlation between SNPs constituting a cluster and the window size. Applying haplovalidate to simulated E&R data reliably detects selected haplotype blocks with low false discovery rates. Importantly, our analyses identified a restriction of the haplotype block-based approach to describe the genomic architecture of adaptation. We detected a substantial fraction of haplotypes containing multiple selection targets. These blocks were considered as one region of selection and therefore led to underestimation of the number of selection targets. We demonstrate that the separate analysis of earlier time points can significantly increase the separation of selection targets into individual haplotype blocks. We conclude that the analysis of selected haplotype blocks has great potential for the characterization of the adaptive architecture with E&R experiments.


Subject(s)
Genomics , Haplotypes , Models, Genetic , Polymorphism, Single Nucleotide , Genome , Linkage Disequilibrium
10.
Infect Genet Evol ; 82: 104277, 2020 08.
Article in English | MEDLINE | ID: mdl-32151775

ABSTRACT

Currently, the standard practice for assembling next-generation sequencing (NGS) reads of viral genomes is to summarize thousands of individual short reads into a single consensus sequence, thus confounding useful intra-host diversity information for molecular phylodynamic inference. It is hypothesized that a few viral strains may dominate the intra-host genetic diversity with a variety of lower frequency strains comprising the rest of the population. Several software tools currently exist to convert NGS sequence variants into haplotypes. Previous benchmarks of viral haplotype reconstruction programs used simulation scenarios that are useful from a mathematical perspective but do not reflect viral evolution and epidemiology. Here, we tested twelve NGS haplotype reconstruction methods using viral populations simulated under realistic evolutionary dynamics. We simulated coalescent-based populations that spanned known levels of viral genetic diversity, including mutation rates, sample size and effective population size, to test the limits of the haplotype reconstruction methods and to ensure coverage of predicted intra-host viral diversity levels (especially HIV-1). All twelve investigated haplotype callers showed variable performance and produced drastically different results that were mainly driven by differences in mutation rate and, to a lesser extent, in effective population size. Most methods were able to accurately reconstruct haplotypes when genetic diversity was low. However, under higher levels of diversity (e.g., those seen in intra-host HIV-1 infections), haplotype reconstruction quality was highly variable and, on average, poor. All haplotype reconstruction tools, except QuasiRecomb and ShoRAH, greatly underestimated intra-host diversity and the true number of haplotypes.
PredictHaplo outperformed the other haplotype reconstruction tools, with the highest precision and recall and the lowest UniFrac distance values, followed by CliqueSNV, which, given more computational time, may have outperformed PredictHaplo. Here, we present an extensive comparison of available viral haplotype reconstruction tools and provide insights for future improvements in haplotype reconstruction tools using both short-read and long-read technologies.


Subject(s)
Computational Biology/methods , Genome, Viral , Haplotypes , High-Throughput Nucleotide Sequencing/methods , Genetic Variation , HIV Infections/virology , HIV-1/genetics , Host-Pathogen Interactions/genetics , Humans , Mutation Rate , Population Density
11.
Microorganisms ; 8(1), 2020 Jan 17.
Article in English | MEDLINE | ID: mdl-31963512

ABSTRACT

The diversity of RNA viruses dictates their evolution in a particular host, community or environment. Here, we reported within- and between-host pH1N1 virus diversity at consensus and sub-consensus levels over a three-year period (2015-2017) and its implications on disease severity. A total of 90 nasal samples positive for the pH1N1 virus were deep-sequenced and analyzed to detect low-frequency variants (LFVs) and haplotypes. Parallel evolution of LFVs was seen in the hemagglutinin (HA) gene across three scales: among patients (33%), across years (22%), and at global scale. Remarkably, investigating the emergence of LFVs at the consensus level demonstrated that within-host virus evolution recapitulates evolutionary dynamics seen at the global scale. Analysis of virus diversity at the HA haplotype level revealed the clustering of low-frequency haplotypes from early 2015 with dominant strains of 2016, indicating rapid haplotype evolution. Haplotype sharing was also noticed in all years, strongly suggesting haplotype transmission among patients infected during a specific influenza season. Finally, more than half of patients with severe symptoms harbored a larger number of haplotypes, mostly in patients under the age of five. Therefore, patient age, haplotype diversity, and the presence of certain LFVs should be considered when interpreting illness severity. In addition to its importance in understanding virus evolution, sub-consensus virus diversity together with whole genome sequencing is essential to explain variabilities in clinical outcomes that cannot be explained by either analysis alone.

12.
Brief Bioinform ; 21(5): 1766-1775, 2020 09 25.
Article in English | MEDLINE | ID: mdl-31697321

ABSTRACT

Deep sequencing of viral genomes is a powerful tool to study RNA virus complexity. However, the analysis of next-generation sequencing data might be challenging for researchers who have never approached the study of viral quasispecies by this methodology. In this work we present a suitable and affordable guide to explore the sub-consensus variability and to reconstruct viral quasispecies from Illumina sequencing data. The guide includes a complete analysis pipeline along with user-friendly descriptions of software and file formats. In addition, we assessed the feasibility of the proposed workflow by analyzing a set of foot-and-mouth disease viruses (FMDV) with different degrees of variability. This guide introduces the analysis of quasispecies of FMDV and other viruses through this kind of approach.


Subject(s)
Foot-and-Mouth Disease Virus/genetics , Haplotypes , High-Throughput Nucleotide Sequencing/methods , Quasispecies , Animals , Foot-and-Mouth Disease Virus/classification , Genes, Viral
13.
BMC Bioinformatics ; 19(1): 389, 2018 Oct 22.
Article in English | MEDLINE | ID: mdl-30348075

ABSTRACT

BACKGROUND: Pooling techniques, where multiple sub-samples are mixed in a single sample, are widely used to take full advantage of high-throughput DNA sequencing. Recently, Ranjard et al. (PLoS ONE 13:0195090, 2018) proposed a pooling strategy without the use of barcodes. Three sub-samples were mixed in different known proportions (i.e. 62.5%, 25% and 12.5%), and a method was developed to use these proportions to reconstruct the three haplotypes effectively. RESULTS: HaploJuice provides an alternative haplotype reconstruction algorithm for Ranjard et al.'s pooling strategy. HaploJuice significantly increases the accuracy by first identifying the empirical proportions of the three mixed sub-samples and then assembling the haplotypes using a dynamic programming approach. HaploJuice was evaluated against five different assembly algorithms, Hmmfreq (Ranjard et al., PLoS ONE 13:0195090, 2018), ShoRAH (Zagordi et al., BMC Bioinformatics 12:119, 2011), SAVAGE (Baaijens et al., Genome Res 27:835-848, 2017), PredictHaplo (Prabhakaran et al., IEEE/ACM Trans Comput Biol Bioinform 11:182-91, 2014) and QuRe (Prosperi and Salemi, Bioinformatics 28:132-3, 2012). Using simulated and real data sets, HaploJuice reconstructed the true sequences with the highest coverage and the lowest error rate. CONCLUSION: HaploJuice provides high accuracy in haplotype reconstruction, making Ranjard et al.'s pooling strategy more efficient, feasible, and applicable, with the benefit of reducing the sequencing cost.


Subject(s)
Algorithms , Haplotypes/genetics , Base Sequence , Computer Simulation , Databases, Genetic , Humans
14.
Comput Biol Chem ; 72: 1-10, 2018 Feb.
Article in English | MEDLINE | ID: mdl-29289750

ABSTRACT

In this paper, a method for single individual haplotype (SIH) reconstruction using asexual reproduction optimization (ARO) is proposed. Haplotypes, as a set of genetic variations in each chromosome, contain vital information such as the relationship between the human genome and diseases. Finding haplotypes in diploid organisms is a challenging task. Experimental methods are expensive and require special equipment. In the SIH problem, we encounter several fragments, each of which covers some part of the desired haplotype. The main goal is bi-partitioning of the fragments with minimum error correction (MEC). This problem is known to be NP-hard, and several attempts have been made to solve it using heuristic methods. The current method, AROHap, has two main phases. In the first phase, most of the fragments are clustered based on a practical metric distance. In the second phase, the ARO algorithm, a fast-converging bio-inspired method, is used to improve the initial bi-partitioning of the fragments from the previous step. AROHap is evaluated on several benchmark datasets. The experimental results demonstrate that satisfactory results were obtained, indicating that AROHap can be used for the SIH reconstruction problem.
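
To make the minimum error correction (MEC) objective mentioned in the abstract concrete, here is a minimal sketch that scores one given bi-partition of fragments. It is an illustration of the objective only, not AROHap itself. The fragment encoding ('0', '1', and '-' for an uncovered site) and the function name are assumptions.

```python
# Hypothetical sketch: MEC score of a fixed bi-partition of fragments.
# Assumption (not from the source): fragments are equal-length strings over
# '0', '1', and '-' (site not covered); assignment[i] is 0 or 1.

def mec_score(fragments, assignment):
    """Return (MEC, consensus haplotypes) for the given bi-partition."""
    n_sites = len(fragments[0])
    total_corrections = 0
    haplotypes = []
    for group in (0, 1):
        members = [f for f, a in zip(fragments, assignment) if a == group]
        consensus = []
        for site in range(n_sites):
            zeros = sum(f[site] == "0" for f in members)
            ones = sum(f[site] == "1" for f in members)
            # The majority allele becomes the consensus (ties go to '0');
            # every minority call is one correction.
            consensus.append("0" if zeros >= ones else "1")
            total_corrections += min(zeros, ones)
        haplotypes.append("".join(consensus))
    return total_corrections, haplotypes


fragments = ["00-", "001", "11-", "-10", "011"]
score, haps = mec_score(fragments, [0, 0, 1, 1, 0])
print(score, haps)  # 1 ['001', '110']
```

A heuristic search then looks for the assignment that minimises this score; evaluating every assignment exhaustively is only feasible for a handful of fragments.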


Subject(s)
Algorithms , Haplotypes , Models, Biological , Computational Biology , Humans , Reproduction, Asexual
15.
Virus Res ; 239: 17-32, 2017 07 15.
Article in English | MEDLINE | ID: mdl-27693290

ABSTRACT

Rapidly evolving RNA viruses prevail within a host as a collection of closely related variants, referred to as viral quasispecies. Advances in high-throughput sequencing (HTS) technologies have facilitated the assessment of the genetic diversity of such virus populations at an unprecedented level of detail. However, analysis of HTS data from virus populations is challenging due to short, error-prone reads. In order to account for uncertainties originating from these limitations, several computational and statistical methods have been developed for studying the genetic heterogeneity of virus population. Here, we review methods for the analysis of HTS reads, including approaches to local diversity estimation and global haplotype reconstruction. Challenges posed by aligning reads, as well as the impact of reference biases on diversity estimates are also discussed. In addition, we address some of the experimental approaches designed to improve the biological signal-to-noise ratio. In the future, computational methods for the analysis of heterogeneous virus populations are likely to continue being complemented by technological developments.


Subject(s)
Genetic Variation , Viruses/classification , Viruses/genetics , Animals , Computational Biology/methods , Genome, Viral , Genomics/methods , Haplotypes , High-Throughput Nucleotide Sequencing , Humans , Polymorphism, Single Nucleotide , RNA Viruses/classification , RNA Viruses/genetics , Sequence Analysis, DNA , Software
16.
Mol Biol Evol ; 34(1): 174-184, 2017 01.
Article in English | MEDLINE | ID: mdl-27702776

ABSTRACT

The genetic analysis of experimentally evolving populations typically relies on short reads from pooled individuals (Pool-Seq). While this method provides reliable allele frequency estimates, the underlying haplotype structure remains poorly characterized. With small population sizes and adaptive variants that start from low frequencies, the interpretation of selection signatures in most Evolve and Resequencing studies remains challenging. To facilitate the characterization of selection targets, we propose a new approach that reconstructs selected haplotypes from replicated time series, using Pool-Seq data. We identify selected haplotypes through the correlated frequencies of alleles carried by them. Computer simulations indicate that selected haplotype-blocks of several Mb can be reconstructed with high confidence and low error rates, even when allele frequencies change only by 20% across three replicates. Applying this method to real data from D. melanogaster populations adapting to a hot environment, we identify a selected haplotype-block of 6.93 Mb. We confirm the presence of this haplotype-block in evolved populations by experimental haplotyping, demonstrating the power and accuracy of our haplotype reconstruction from Pool-Seq data. We propose that the combination of allele frequency estimates with haplotype information will provide the key to understanding the dynamics of adaptive alleles.
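
The idea of grouping alleles by correlated frequency change can be sketched as follows. This is a simplified illustration under stated assumptions, not the published method: it links SNPs whose allele-frequency trajectories have a Pearson correlation at or above a threshold and reports the connected groups. The threshold value, data layout, and function names are hypothetical, and genomic window size is ignored.

```python
# Hypothetical sketch: group SNPs by correlated allele-frequency trajectories.
# Assumption (not from the source): trajectories maps a SNP name to a list of
# allele frequencies over the same time points and replicates.
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def cluster_snps(trajectories, min_corr=0.9):
    """Single-linkage grouping: SNPs are joined if correlation >= min_corr."""
    names = list(trajectories)
    parent = {name: name for name in names}

    def find(name):
        # Union-find lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if pearson(trajectories[a], trajectories[b]) >= min_corr:
                parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(clusters.values())


example = {
    "snp1": [0.10, 0.20, 0.40, 0.60],
    "snp2": [0.12, 0.22, 0.41, 0.63],
    "snp3": [0.50, 0.45, 0.50, 0.48],
}
print(cluster_snps(example))  # [['snp1', 'snp2'], ['snp3']]
```

In practice the choice of correlation threshold strongly affects the resulting blocks, which is why a fixed default such as 0.9 here should be read as a placeholder.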


Subject(s)
Biological Evolution , Directed Molecular Evolution/methods , Drosophila melanogaster/genetics , Alleles , Animals , Computer Simulation , Female , Founder Effect , Gene Frequency , Genetic Association Studies/methods , Genetic Linkage , Genetics, Population , Haplotypes , Sequence Analysis, DNA/methods
17.
Malar J ; 15(1): 430, 2016 08 25.
Article in English | MEDLINE | ID: mdl-27557806

ABSTRACT

BACKGROUND: Haplotypes are important in anti-malarial drug resistance because genes encoding drug resistance may accumulate mutations at several codons in the same gene, each mutation increasing the level of drug resistance and, possibly, reducing the metabolic costs of previous mutations. Patients often have two or more haplotypes in their blood sample, which may make it impossible to identify exactly which haplotypes they carry, and hence to measure the type and frequency of resistant haplotypes in the malaria population. RESULTS: This study presents two novel statistical methods, expectation-maximization (EM) and Markov chain Monte Carlo (MCMC) algorithms, to investigate this issue. The performance of the algorithms is evaluated on simulated datasets consisting of patient blood characterized by their multiplicity of infection (MOI) and malaria genotype. The datasets are generated using different resistance allele frequencies (RAF) at each single nucleotide polymorphism (SNP) and different limits of detection (LoD) of the SNPs and the MOI. The EM and the MCMC algorithms are validated and appear more accurate, faster and slightly less affected by LoD of the SNPs and the MOI compared to previous related statistical approaches. CONCLUSIONS: The EM and the MCMC algorithms perform well when analysing malaria genetic data obtained from infected human blood samples. The results are robust to genotyping errors caused by LoDs, and the algorithms function well even in the absence of MOI data on individual patients.


Subject(s)
Coinfection/epidemiology , Coinfection/parasitology , Haplotypes , Malaria/epidemiology , Malaria/parasitology , Plasmodium/genetics , Plasmodium/isolation & purification , Algorithms , Biostatistics , Humans , Markov Chains , Plasmodium/classification
18.
Genetics ; 200(4): 1073-87, 2015 Aug.
Article in English | MEDLINE | ID: mdl-26048018

ABSTRACT

We present a general hidden Markov model framework called Reconstructing Ancestry Blocks BIT by bit (RABBIT) for reconstructing genome ancestry blocks from single-nucleotide polymorphism (SNP) array data, a required step for quantitative trait locus (QTL) mapping. The framework can be applied to a wide range of mapping populations such as the Arabidopsis multiparent advanced generation intercross (MAGIC), the mouse Collaborative Cross (CC), and the diversity outcross (DO) for both autosomes and X chromosomes if they exist. The model underlying RABBIT accounts for the joint pattern of recombination breakpoints between two homologous chromosomes and missing data and allelic typing errors in the genotype data of both sampled individuals and founders. Studies on simulated data of the MAGIC and the CC and real data of the MAGIC, the DO, and the CC demonstrate that RABBIT is more robust and accurate in reconstructing recombination bin maps than some commonly used methods.


Subject(s)
Genomics/methods , Models, Genetic , Animals , Arabidopsis/genetics , Chromosome Mapping , Markov Chains , Mice , Polymorphism, Single Nucleotide , Quantitative Trait Loci/genetics , Software
19.
Genetics ; 198(1): 59-73, 2014 Sep.
Article in English | MEDLINE | ID: mdl-25236449

ABSTRACT

Massively parallel RNA sequencing (RNA-seq) has yielded a wealth of new insights into transcriptional regulation. A first step in the analysis of RNA-seq data is the alignment of short sequence reads to a common reference genome or transcriptome. Genetic variants that distinguish individual genomes from the reference sequence can cause reads to be misaligned, resulting in biased estimates of transcript abundance. Fine-tuning of read alignment algorithms does not correct this problem. We have developed Seqnature software to construct individualized diploid genomes and transcriptomes for multiparent populations and have implemented a complete analysis pipeline that incorporates other existing software tools. We demonstrate in simulated and real data sets that alignment to individualized transcriptomes increases read mapping accuracy, improves estimation of transcript abundance, and enables the direct estimation of allele-specific expression. Moreover, when applied to expression QTL mapping we find that our individualized alignment strategy corrects false-positive linkage signals and unmasks hidden associations. We recommend the use of individualized diploid genomes over reference sequence alignment for all applications of high-throughput sequencing technology in genetically diverse populations.


Subject(s)
Sequence Alignment/methods , Sequence Analysis, RNA/methods , Software , Transcriptome , Animals , Female , Genome , Male , Mice , Quantitative Trait Loci
20.
G3 (Bethesda) ; 4(9): 1623-33, 2014 Sep 18.
Article in English | MEDLINE | ID: mdl-25237114

ABSTRACT

Genetic mapping studies in the mouse and other model organisms are used to search for genes underlying complex phenotypes. Traditional genetic mapping studies that employ single-generation crosses have poor mapping resolution and limit discovery to loci that are polymorphic between the two parental strains. Multiparent outbreeding populations address these shortcomings by increasing the density of recombination events and introducing allelic variants from multiple founder strains. However, multiparent crosses present new analytical challenges and require specialized software to take full advantage of these benefits. Each animal in an outbreeding population is genetically unique and must be genotyped using a high-density marker set; regression models for mapping must accommodate multiple founder alleles, and complex breeding designs give rise to polygenic covariance among related animals that must be accounted for in mapping analysis. The Diversity Outbred (DO) mice combine the genetic diversity of eight founder strains in a multigenerational breeding design that has been maintained for >16 generations. The large population size and randomized mating ensure the long-term genetic stability of this population. We present a complete analytical pipeline for genetic mapping in DO mice, including algorithms for probabilistic reconstruction of founder haplotypes from genotyping array intensity data, and mapping methods that accommodate multiple founder haplotypes and account for relatedness among animals. Power analysis suggests that studies with as few as 200 DO mice can detect loci with large effects, but loci that account for <5% of trait variance may require a sample size of up to 1000 animals. The methods described here are implemented in the freely available R package DOQTL.


Subject(s)
Animals, Outbred Strains/genetics , Chromosome Mapping/methods , Quantitative Trait Loci , Animals , Computer Simulation , Genotype , Leukocyte Count , Mice , Models, Genetic , Neutrophils/cytology , Phenotype , Polymorphism, Single Nucleotide , Software