ABSTRACT
Summary: Reconstructing haplotypes of an organism from a set of sequencing reads is a computationally challenging (NP-hard) problem. In reference-guided settings, at the core of haplotype assembly is the task of clustering reads according to their origin, i.e. grouping together reads that sample the same haplotype. Read length limitations and sequencing errors render this problem difficult even for diploids; the complexity of the problem grows with the ploidy of the organism. We present XHap, a novel method for haplotype assembly that aims to learn correlations between pairs of sequencing reads, including those that do not overlap but may be separated by large genomic distances, and utilize the learned correlations to assemble the haplotypes. This is accomplished by leveraging transformers, a powerful deep-learning technique that relies on the attention mechanism to discover dependencies between non-overlapping reads. Experiments on semi-experimental and real data demonstrate that the proposed method significantly outperforms state-of-the-art techniques in diploid and polyploid haplotype assembly tasks on both short and long sequencing reads. Availability and implementation: The code for XHap and the included experiments is available at https://github.com/shoryaconsul/XHap.
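The attention mechanism that XHap leverages can be illustrated with a minimal sketch. This is plain scaled dot-product attention over hypothetical read feature vectors, not XHap's actual architecture (which learns the embeddings and stacks many such layers): the attention weight between two reads grows with the similarity of their embeddings, which is what lets non-overlapping reads influence one another.

```python
import math

def attention_scores(read_embeddings):
    """Scaled dot-product attention over read embeddings (toy sketch).

    Returns a row-stochastic matrix W where W[i][j] is how strongly
    read i attends to read j.
    """
    d = len(read_embeddings[0])
    scores = []
    for q in read_embeddings:
        # dot products scaled by sqrt(d), then a numerically safe softmax
        row = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
               for k in read_embeddings]
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        z = sum(exps)
        scores.append([e / z for e in exps])
    return scores

# Toy example: reads 0 and 1 have similar (hypothetical) embeddings,
# read 2 does not, so read 0 attends more to read 1 than to read 2.
emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
W = attention_scores(emb)
```

In the full method, high mutual attention between reads serves as evidence that they sample the same haplotype.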
ABSTRACT
Understanding the patterns of viral disease transmissions helps establish public health policies and aids in controlling and ending a disease outbreak. Classical methods for studying disease transmission dynamics that rely on epidemiological data, such as times of sample collection and duration of exposure intervals, struggle to provide desired insight due to limited informativeness of such data. A more precise characterization of disease transmissions may be acquired from sequencing data that reveal genetic distance between viral genomes in patient samples. Indeed, genetic distance between viral strains present in hosts contains valuable information about transmission history, thus motivating the design of methods that rely on genomic data to reconstruct a directed disease transmission network, detect transmission clusters, and identify significant network nodes (e.g., super-spreaders). In this article, we present a novel end-to-end framework for the analysis of viral transmissions utilizing viral genomic (sequencing) data. The proposed framework groups infected hosts into transmission clusters based on the reconstructed viral strains infecting them; the genetic distance between a pair of hosts is calculated using Earth Mover's Distance, and further used to infer transmission direction between the hosts. To quantify the significance of a host in the transmission network, the importance score is calculated by a graph convolutional autoencoder. The viral transmission network is represented by a directed minimum spanning tree obtained via Edmonds' algorithm, modified to incorporate constraints on the importance scores of the hosts. The proposed framework outperforms state-of-the-art techniques for the analysis of viral transmission dynamics in several experiments on semi-experimental as well as experimental data.
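In one dimension, Earth Mover's Distance has a simple closed form that makes the idea concrete: for two equal-size empirical samples it is the average absolute difference between the sorted values. The sketch below uses hypothetical scalar summaries per host; the actual framework compares full reconstructed strain populations.

```python
def earth_movers_distance_1d(a, b):
    """1-D Earth Mover's Distance between two equal-size empirical samples.

    Sorting both samples and averaging the pairwise gaps is the optimal
    transport plan in one dimension (toy stand-in for the framework's
    distance between per-host viral populations).
    """
    assert len(a) == len(b)
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)

# Moving two unit masses by one unit each costs 1.0 on average.
d = earth_movers_distance_1d([0.0, 0.0], [1.0, 1.0])
```

A small pairwise-distance matrix built this way could then feed a directed spanning-tree construction over the hosts.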
Subjects
Viral Genome, Genomics, Humans, Algorithms
ABSTRACT
BACKGROUND: Haplotypes, the ordered lists of single nucleotide variations that distinguish chromosomal sequences from their homologous pairs, may reveal an individual's susceptibility to hereditary and complex diseases and affect how our bodies respond to therapeutic drugs. Reconstructing haplotypes of an individual from short sequencing reads is an NP-hard problem that becomes even more challenging in the case of polyploids. While increasing lengths of sequencing reads and insert sizes helps improve accuracy of reconstruction, it also exacerbates computational complexity of the haplotype assembly task. This has motivated the pursuit of algorithmic frameworks capable of accurate yet efficient assembly of haplotypes from high-throughput sequencing data. RESULTS: We propose a novel graphical representation of sequencing reads and pose the haplotype assembly problem as an instance of community detection on a spatial random graph. To this end, we construct a graph where each read is a node with an unknown community label associating the read with the haplotype it samples. Haplotype reconstruction can then be thought of as a two-step procedure: first, one recovers the community labels on the nodes (i.e., the reads), and then uses the estimated labels to assemble the haplotypes. Based on this observation, we propose ComHapDet, a novel assembly algorithm for diploid and polyploid haplotypes that allows both biallelic and multi-allelic variants. CONCLUSIONS: Performance of the proposed algorithm is benchmarked on simulated as well as experimental data obtained by sequencing Chromosome 5 of tetraploid biallelic Solanum tuberosum (potato). The results demonstrate the efficacy of the proposed method and that it compares favorably with the existing techniques.
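The second step of the two-step procedure admits a compact sketch: once every read carries a community label, each haplotype follows by per-position majority vote over the reads in that community. The 0/1/None read encoding below is a toy convention, not ComHapDet's actual input format.

```python
def assemble_from_labels(reads, labels, n_hap):
    """Assemble haplotypes from community-labeled reads by majority vote.

    `reads` are lists over SNP positions with alleles 0/1 and None where
    a read does not cover a position (hypothetical encoding).
    """
    n_pos = len(reads[0])
    haplotypes = []
    for h in range(n_hap):
        hap = []
        for j in range(n_pos):
            votes = [r[j] for r, lab in zip(reads, labels)
                     if lab == h and r[j] is not None]
            # majority allele at position j, or None if uncovered
            hap.append(max(set(votes), key=votes.count) if votes else None)
        haplotypes.append(hap)
    return haplotypes

reads = [[0, 1, None], [0, 1, 1], [None, 0, 0], [1, 0, 0]]
haps = assemble_from_labels(reads, labels=[0, 0, 1, 1], n_hap=2)
```

The hard part, of course, is the first step, recovering the labels, which is where the community-detection machinery comes in.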
Subjects
Algorithms, High-Throughput Nucleotide Sequencing, Diploidy, Haplotypes, Humans, Polyploidy, DNA Sequence Analysis
ABSTRACT
The emergence of pathogens resistant to existing antimicrobial drugs is a growing worldwide health crisis that threatens a return to the pre-antibiotic era. To decrease the overuse of antibiotics, molecular diagnostics systems are needed that can rapidly identify pathogens in a clinical sample and determine the presence of mutations that confer drug resistance at the point of care. We developed a fully integrated, miniaturized semiconductor biochip and closed-tube detection chemistry that performs multiplex nucleic acid amplification and sequence analysis. The approach had a high dynamic range of quantification of microbial load and was able to perform comprehensive mutation analysis on up to 1,000 sequences or strands simultaneously in <2 h. We detected and quantified multiple DNA and RNA respiratory viruses in clinical samples with complete concordance to a commercially available test. We also identified 54 drug-resistance-associated mutations that were present in six genes of Mycobacterium tuberculosis, all of which were confirmed by next-generation sequencing.
Subjects
DNA Viruses/drug effects, Genotype, Mycobacterium tuberculosis/drug effects, RNA Viruses/drug effects, Semiconductors, Microbial Colony Count, DNA Probes, DNA Viruses/genetics, DNA Viruses/isolation & purification, Viral DNA/analysis, Bacterial Drug Resistance/genetics, Viral Drug Resistance/genetics, Feasibility Studies, Bacterial Genome, Humans, Miniaturization, Mutation, Mycobacterium tuberculosis/genetics, Mycobacterium tuberculosis/isolation & purification, Nucleic Acid Amplification Techniques, RNA Viruses/genetics, RNA Viruses/isolation & purification, Viral RNA/analysis
ABSTRACT
Motivation: As RNA viruses mutate and adapt to environmental changes, often developing resistance to anti-viral vaccines and drugs, they form an ensemble of viral strains referred to as a viral quasispecies. While high-throughput sequencing (HTS) has enabled in-depth studies of viral quasispecies, sequencing errors and limited read lengths render the problem of reconstructing the strains and estimating their spectrum challenging. Inference of viral quasispecies is difficult due to generally non-uniform frequencies of the strains, and is further exacerbated when the genetic distances between the strains are small. Results: This paper presents TenSQR, an algorithm that utilizes a tensor factorization framework to analyze HTS data and reconstruct a viral quasispecies characterized by highly uneven frequencies of its components. Fundamentally, TenSQR performs clustering with successive data removal to infer strains in a quasispecies in order from the most to the least abundant one; every time a strain is inferred, sequencing reads generated from that strain are removed from the dataset. The proposed successive strain reconstruction and data removal enables discovery of rare strains in a population and facilitates detection of deletions in such strains. Results on simulated datasets demonstrate that TenSQR can reconstruct full-length strains having widely different abundances, generally outperforming state-of-the-art methods at diversities 1-10% and detecting long deletions even in rare strains. A study on a real HIV-1 dataset demonstrates that TenSQR outperforms competing methods in experimental settings as well. Finally, we apply TenSQR to analyze a Zika virus sample and reconstruct the full-length strains it contains. Availability and implementation: TenSQR is available at https://github.com/SoYeonA/TenSQR. Supplementary information: Supplementary data are available at Bioinformatics online.
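The successive strain reconstruction and data removal loop can be sketched in a few lines. Here a simple per-position consensus stands in for the tensor-factorization step (an assumption for brevity; TenSQR's actual inference is far more involved): the consensus of the remaining reads approximates the most abundant strain, and reads close to it are removed before inferring the next strain.

```python
def successive_strains(reads, max_strains, radius=1):
    """Infer strains from most to least abundant by repeated
    consensus-then-removal (toy sketch of the TenSQR idea).

    Reads are equal-length allele lists; `radius` is the Hamming
    distance within which a read is attributed to the current strain.
    """
    remaining = list(reads)
    strains = []
    while remaining and len(strains) < max_strains:
        n_pos = len(remaining[0])
        # consensus over remaining reads ~ most abundant remaining strain
        consensus = []
        for j in range(n_pos):
            col = [r[j] for r in remaining]
            consensus.append(max(set(col), key=col.count))
        strains.append(consensus)
        # remove reads explained by the inferred strain
        remaining = [r for r in remaining
                     if sum(x != y for x, y in zip(r, consensus)) > radius]
    return strains

reads = [[0, 0, 0, 0]] * 5 + [[1, 1, 1, 1]] * 2
strains = successive_strains(reads, max_strains=3)
```

Because the dominant strain is peeled off first, the rare strain becomes the majority of what remains, which is exactly why the scheme helps discover low-abundance strains.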
Subjects
Algorithms, Viral Genome, High-Throughput Nucleotide Sequencing/methods, Quasispecies, RNA Viruses/genetics, RNA Sequence Analysis/methods, Cluster Analysis, HIV-1/genetics, Software, Zika virus/genetics
ABSTRACT
BACKGROUND: Haplotype assembly is the task of reconstructing haplotypes of an individual from a mixture of sequenced chromosome fragments. Haplotype information enables studies of the effects of genetic variations on an organism's phenotype. Most of the mathematical formulations of haplotype assembly are known to be NP-hard and haplotype assembly becomes even more challenging as the sequencing technology advances and the length of the paired-end reads and inserts increases. Assembly of haplotypes of polyploid organisms is considerably more difficult than in the case of diploids. Hence, scalable and accurate schemes with provable performance are desired for haplotype assembly of both diploid and polyploid organisms. RESULTS: We propose a framework that formulates haplotype assembly from sequencing data as a sparse tensor decomposition. We cast the problem as that of decomposing a tensor having special structural constraints and missing a large fraction of its entries into a product of two factors, U and V; tensor V reveals haplotype information while U is a sparse matrix encoding the origin of erroneous sequencing reads. An algorithm, AltHap, which reconstructs haplotypes of either diploid or polyploid organisms by iteratively solving this decomposition problem is proposed. The performance and convergence properties of AltHap are theoretically analyzed and, in doing so, guarantees on the achievable minimum error correction scores and correct phasing rate are established. The developed framework is applicable to diploid, biallelic and polyallelic polyploid species. The code for AltHap is freely available from https://github.com/realabolfazl/AltHap. CONCLUSION: AltHap was tested in a number of different scenarios and was shown to compare favorably to state-of-the-art methods in applications to haplotype assembly of diploids, and significantly outperforms existing techniques when applied to haplotype assembly of polyploids.
Subjects
Algorithms, Diploidy, Human Genome, Haplotypes, Polyploidy, DNA Sequence Analysis/methods, Humans, Genetic Models, Phenotype, Single Nucleotide Polymorphism
ABSTRACT
RNA viruses replicate with high mutation rates, creating closely related viral populations. The heterogeneous virus populations, referred to as viral quasispecies, rapidly adapt to environmental changes thus adversely affecting the efficiency of antiviral drugs and vaccines. Therefore, studying the underlying genetic heterogeneity of viral populations plays a significant role in the development of effective therapeutic treatments. Recent high-throughput sequencing technologies have provided an invaluable opportunity for uncovering the structure of quasispecies populations. However, accurate reconstruction of viral quasispecies remains difficult due to limited read lengths and presence of sequencing errors. The problem is particularly challenging when the strains in a population are highly similar, that is, the sequences are characterized by low mutual genetic distances, and further exacerbated if some of those strains are relatively rare; this is the setting where state-of-the-art methods struggle. In this article, we present a novel viral quasispecies reconstruction algorithm, aBayesQR, that uses a maximum-likelihood framework to infer individual sequences in a mixture from high-throughput sequencing data. The search for the most likely quasispecies is conducted on long contigs that our method constructs from the set of short reads via agglomerative hierarchical clustering; operating on contigs rather than short reads enables identification of close strains in a population and provides computational tractability of the Bayesian method. Results on both simulated and real HIV-1 data demonstrate that the proposed algorithm generally outperforms state-of-the-art methods; aBayesQR particularly stands out when reconstructing a set of closely related viral strains (e.g., quasispecies characterized by low diversity).
Subjects
Genetic Heterogeneity, Viral Genome/genetics, RNA Viruses/genetics, Virus Replication/genetics, Algorithms, Bayes Theorem, Cluster Analysis, Genetic Variation, HIV-1/genetics, High-Throughput Nucleotide Sequencing/methods, Phylogeny, RNA Sequence Analysis, Software
ABSTRACT
RNA viruses are characterized by high mutation rates that give rise to populations of closely related genomes, known as viral quasispecies. Underlying heterogeneity enables the quasispecies to adapt to changing conditions and proliferate over the course of an infection. Determining genetic diversity of a virus (i.e., inferring haplotypes and their proportions in the population) is essential for understanding its mutation patterns, and for effective drug development. Here, we present QSdpR, a method and software for the reconstruction of quasispecies from short sequencing reads. The reconstruction is achieved by solving a correlation clustering problem on a read-similarity graph and the results of the clustering are used to estimate frequencies of sub-species; the number of sub-species is determined using the pseudo-F index. Extensive tests on both synthetic datasets and experimental HIV-1 and Zika virus data demonstrate that QSdpR compares favorably to existing methods in terms of various performance metrics.
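The pseudo-F index used for model selection is the Calinski-Harabasz criterion: the ratio of between-cluster to within-cluster dispersion, each normalized by its degrees of freedom. A minimal 1-D version (toy scalar points standing in for read-similarity clusterings) looks like this:

```python
def pseudo_f(points, labels):
    """Pseudo-F (Calinski-Harabasz) index for a 1-D clustering.

    Higher values indicate tighter, better-separated clusters; sweeping
    the number of clusters and picking the maximizer is the standard way
    to choose k.
    """
    n = len(points)
    clusters = sorted(set(labels))
    k = len(clusters)
    overall = sum(points) / n
    between = within = 0.0
    for c in clusters:
        members = [p for p, lab in zip(points, labels) if lab == c]
        mu = sum(members) / len(members)
        between += len(members) * (mu - overall) ** 2
        within += sum((p - mu) ** 2 for p in members)
    return (between / (k - 1)) / (within / (n - k))

# Two tight, well-separated clusters score highly.
score = pseudo_f([0.0, 1.0, 10.0, 11.0], [0, 0, 1, 1])
```
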
Subjects
Viral Genome, High-Throughput Nucleotide Sequencing/methods, Quasispecies, RNA Viruses/genetics, RNA Sequence Analysis/methods, Software, HIV-1/genetics, Zika virus/genetics
ABSTRACT
At the core of Illumina's high-throughput DNA sequencing platforms lies bridge amplification, a biophysical surface process that results in a random geometry of clusters of homogeneous short DNA fragments, typically hundreds of base pairs long. The statistical properties of this random process and the lengths of the fragments are critical as they affect the information that can be subsequently extracted, that is, the density of successfully inferred DNA fragment reads. The ensembles of overlapping DNA fragment reads are then used to computationally reconstruct the much longer target genome sequence. The success of the reconstruction in turn depends on having a sufficiently large ensemble of DNA fragments that are sufficiently long. In this article, using stochastic geometry, we model and optimize the end-to-end flow cell synthesis and target genome sequencing process, linking and partially controlling the statistics of the physical processes to the success of the final computational step. Based on a rough calibration of our model, we provide, for the first time, a mathematical framework capturing the salient features of the sequencing platform that serves as a basis for optimizing cost, performance, and/or sensitivity analysis to various parameters.
Subjects
Computational Biology/methods, High-Throughput Nucleotide Sequencing/methods, Theoretical Models, DNA Sequence Analysis/methods, Algorithms, High-Throughput Nucleotide Sequencing/standards, Humans, DNA Sequence Analysis/standards
ABSTRACT
High-throughput DNA sequencing technologies allow fast and affordable sequencing of individual genomes and thus enable unprecedented studies of genetic variations. Information about variations in the genome of an individual is provided by haplotypes, ordered collections of single nucleotide polymorphisms. Knowledge of haplotypes is instrumental in finding genes associated with diseases, drug development, and evolutionary studies. Haplotype assembly from high-throughput sequencing data is challenging due to errors and limited lengths of sequencing reads. The key observation made in this paper is that the minimum error-correction formulation of the haplotype assembly problem is identical to the task of deciphering a coded message received over a noisy channel, a classical problem in the mature field of communication theory. Exploiting this connection, we develop novel haplotype assembly schemes that rely on the bit-flipping and belief propagation algorithms often used in communication systems. The latter algorithm is then adapted to the haplotype assembly of polyploids. We demonstrate on both simulated and experimental data that the proposed algorithms compare favorably with state-of-the-art haplotype assembly methods in terms of accuracy, while being scalable and computationally efficient.
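A greedy bit-flipping decoder for the diploid minimum-error-correction (MEC) objective can be sketched compactly. This is a generic illustration of the decoding idea borrowed from communication theory, not the paper's exact scheme: each read is explained by either the haplotype or its complement, and we flip any haplotype position that lowers the total number of mismatches.

```python
def mec(reads, haplotype):
    """MEC cost: each read (alleles 0/1, None = uncovered) is charged the
    smaller of its mismatches against the haplotype or its complement."""
    cost = 0
    for r in reads:
        d = sum(1 for x, h in zip(r, haplotype) if x is not None and x != h)
        c = sum(1 for x, h in zip(r, haplotype) if x is not None and x == h)
        cost += min(d, c)
    return cost

def bit_flip(reads, haplotype):
    """Greedy bit-flipping: keep any single-position flip that reduces
    the MEC score, until no flip helps (local search, may not be optimal)."""
    hap = list(haplotype)
    best = mec(reads, hap)
    improved = True
    while improved:
        improved = False
        for j in range(len(hap)):
            hap[j] ^= 1
            cost = mec(reads, hap)
            if cost < best:
                best = cost
                improved = True
            else:
                hap[j] ^= 1  # revert the unhelpful flip
    return hap, best

# Error-free toy reads from haplotype [0,1,0,1] and its complement.
reads = [[0, 1, 0, 1], [0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, None]]
hap, cost = bit_flip(reads, [0, 0, 0, 0])
```

Since the MEC objective is symmetric under complementation, the decoder may return either the haplotype or its complement; both are correct phasings.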
Assuntos
Variação Genética/genética , Haplótipos/genética , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Análise de Sequência de DNA/métodos , Algoritmos , Genoma Humano/genética , Humanos , Modelos GenéticosRESUMO
Soft-constraint semi-supervised affinity propagation (SCSSAP) adds supervision to the affinity propagation (AP) clustering algorithm without strictly enforcing instance-level constraints. Constraint violations lead to an adjustment of the AP similarity matrix at every iteration of the proposed algorithm and to addition of a penalty to the objective function. This formulation is particularly advantageous in the presence of noisy labels or noisy constraints since the penalty parameter of SCSSAP can be tuned to express our confidence in instance-level constraints. When the constraints are noiseless, SCSSAP outperforms unsupervised AP and performs at least as well as the previously proposed semi-supervised AP and constrained expectation maximization. In the presence of label and constraint noise, SCSSAP results in a more accurate clustering than either of the aforementioned established algorithms. Finally, we present an extension of SCSSAP which incorporates metric learning in the optimization objective and can further improve the performance of clustering.
Subjects
Algorithms, Cluster Analysis, Machine Learning, Computational Biology, Factual Databases, Humans
ABSTRACT
BACKGROUND: Genetic variations predispose individuals to hereditary diseases, play an important role in the development of complex diseases, and impact drug metabolism. The full information about the DNA variations in the genome of an individual is given by haplotypes, the ordered lists of single nucleotide polymorphisms (SNPs) located on chromosomes. Affordable high-throughput DNA sequencing technologies enable routine acquisition of data needed for the assembly of single individual haplotypes. However, state-of-the-art high-throughput sequencing platforms generate data that is erroneous, which induces uncertainty in the SNP and genotype calling procedures and, ultimately, adversely affects the accuracy of haplotyping. When inferring haplotype phase information, the vast majority of the existing techniques for haplotype assembly assume that the genotype information is correct. This motivates the development of methods capable of joint genotype calling and haplotype assembly. RESULTS: We present a haplotype assembly algorithm, ParticleHap, that relies on a probabilistic description of the sequencing data to jointly infer genotypes and assemble the most likely haplotypes. Our method employs a deterministic sequential Monte Carlo algorithm that associates single nucleotide polymorphisms with haplotypes by exhaustively exploring all possible extensions of the partial haplotypes. The algorithm relies on genotype likelihoods rather than on often erroneously called genotypes, thus ensuring a more accurate assembly of the haplotypes. Results on both the 1000 Genomes Project experimental data as well as simulation studies demonstrate that the proposed approach enables highly accurate solutions to the haplotype assembly problem while being computationally efficient and scalable, generally outperforming existing methods in terms of both accuracy and speed.
CONCLUSIONS: The developed probabilistic framework and sequential Monte Carlo algorithm enable joint haplotype assembly and genotyping in a computationally efficient manner. Our results demonstrate fast and highly accurate haplotype assembly aided by the re-examination of erroneously called genotypes. A C code implementation of ParticleHap will be available for download from https://sites.google.com/site/asynoeun/particlehap.
Subjects
Algorithms, DNA Sequence Analysis/methods, Animals, Genome, Genotype, Haplotypes, High-Throughput Nucleotide Sequencing, Humans, Internet, Monte Carlo Method, Single Nucleotide Polymorphism, User-Computer Interface
ABSTRACT
Many in-hospital mortality risk prediction scores dichotomize predictive variables to simplify the score calculation. However, hard thresholding in these additive stepwise scores of the form "add x points if variable v is above/below threshold t" may lead to critical failures. In this paper, we seek to develop risk prediction scores that preserve clinical knowledge embedded in features and structure of the existing additive stepwise scores while addressing limitations caused by variable dichotomization. To this end, we propose a novel score structure that relies on a transformation of predictive variables by means of nonlinear logistic functions facilitating smooth differentiation between critical and normal values of the variables. We develop an optimization framework for inferring parameters of the logistic functions for a given patient population via cyclic block coordinate descent. The parameters may readily be updated as the patient population and standards of care evolve. We tested the proposed methodology on two populations: (1) brain trauma patients admitted to the intensive care unit of the Dell Children's Medical Center of Central Texas between 2007 and 2012, and (2) adult ICU patient data from the MIMIC II database. The results are compared with those obtained by the widely used PRISM III and SOFA scores. The prediction power of a score is evaluated using the area under the ROC curve, Youden's index, and precision-recall balance in a cross-validation study. The results demonstrate that the new framework enables significant performance improvements over PRISM III and SOFA in terms of all three criteria.
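The logistic replacement for a hard-thresholded score term is easy to make concrete. The sketch below (with made-up point values and steepness; the paper infers such parameters by cyclic block coordinate descent) smoothly ramps the contribution from 0 to the full point value around the threshold instead of jumping:

```python
import math

def smooth_score(value, points, threshold, steepness=1.0):
    """Smooth version of 'add `points` if variable is above `threshold`'.

    A logistic function of the variable contributes ~0 far below the
    threshold, ~`points` far above it, and exactly points/2 at the
    threshold, so near-threshold patients are no longer all-or-nothing.
    """
    return points / (1.0 + math.exp(-steepness * (value - threshold)))
```

For example, with 3 points and a threshold of 50, a reading of 50 contributes 1.5 points rather than flipping between 0 and 3 on tiny measurement changes.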
Subjects
Hospital Mortality, Medical Informatics/methods, Risk Assessment/methods, Adult, Algorithms, Brain Injuries/epidemiology, Child, Critical Care, Factual Databases, Humans, Intensive Care Units, Pediatric Intensive Care Units, Statistical Models, Outcome Assessment in Health Care, Predictive Value of Tests, Prognosis, ROC Curve, Regression Analysis
ABSTRACT
BACKGROUND: The goal of haplotype assembly is to infer haplotypes of an individual from a mixture of sequenced chromosome fragments. Limited lengths of paired-end sequencing reads and inserts render haplotype assembly computationally challenging; in fact, most of the problem formulations are known to be NP-hard. Dimensions (and, therefore, difficulty) of the haplotype assembly problems keep increasing as the sequencing technology advances and the length of reads and inserts grow. The computational challenges are even more pronounced in the case of polyploid haplotypes, whose assembly is considerably more difficult than in the case of diploids. Fast, accurate, and scalable methods for haplotype assembly of diploid and polyploid organisms are needed. RESULTS: We develop a novel framework for diploid/polyploid haplotype assembly from high-throughput sequencing data. The method formulates the haplotype assembly problem as a semi-definite program and exploits its special structure - namely, the low rank of the underlying solution - to solve it rapidly and with high accuracy. The developed framework is applicable to both diploid and polyploid species. The code for SDhaP is freely available at https://sourceforge.net/projects/sdhap. CONCLUSION: Extensive benchmarking tests on both real and simulated data show that the proposed algorithms outperform several well-known haplotype assembly methods in terms of either accuracy or speed or both. Useful recommendations for coverages needed to achieve near-optimal solutions are also provided.
Subjects
Algorithms, Diploidy, Polyploidy, Software, Human Genome, Haplotypes, Homozygote, Humans
ABSTRACT
The dynamics of complex diseases are governed by intricate interactions of myriad factors. Drug combinations, formed by mixing several single-drug treatments at various doses, can enhance the effectiveness of the therapy by targeting multiple contributing factors. The main challenge in designing drug combinations is the highly nonlinear interaction of the constituent drugs. Prior work focused on guided space-exploratory heuristics that require discretization of drug doses. While being more efficient than random sampling, these methods are impractical if the drug space is high dimensional or if the drug sensitivity is unknown. Furthermore, the effectiveness of the obtained combinations may decrease if the resolution of the discretization grid is not sufficiently fine. In this paper, we model the biological system response to a continuous combination of drug doses by a Gaussian process (GP). We perform closed-loop experiments that rely on the expected improvement criterion to efficiently guide the exploration process toward drug combinations with the optimal response. When computing the criterion, we marginalize out the GP hyperparameters in a fully Bayesian manner using a particle filter. Finally, we employ a hybrid Monte Carlo algorithm to rapidly explore the high-dimensional continuous search space. We demonstrate the effectiveness of our approach on a fully factorial Drosophila dataset, an antiviral drug dataset for Herpes simplex virus type 1, and simulated human Apoptosis networks. The results show that our approach significantly reduces the number of required trials compared to existing methods.
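The expected improvement criterion that guides the closed-loop exploration has a well-known closed form for a Gaussian posterior. The sketch below computes it for a single candidate whose GP-predicted response is N(mu, sigma^2) under a maximization convention (the GP posterior and the hyperparameter marginalization are assumed given; this is the acquisition formula only, not the paper's full pipeline):

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(mu, sigma, best):
    """Expected improvement of a candidate over the best observed response.

    EI = (mu - best) * Phi(z) + sigma * phi(z), with z = (mu - best) / sigma.
    It balances exploitation (high predicted mean) against exploration
    (high predictive uncertainty).
    """
    if sigma <= 0.0:
        return max(mu - best, 0.0)
    z = (mu - best) / sigma
    return (mu - best) * normal_cdf(z) + sigma * normal_pdf(z)
```

Selecting the dose combination that maximizes this quantity, then re-fitting the GP with the new measurement, is one iteration of the closed-loop experiment.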
Subjects
Computational Biology/methods, Drug Combinations, Theoretical Models, Pharmacology/methods, Algorithms, Animals, Antiviral Agents/pharmacology, Apoptosis/drug effects, Bayes Theorem, Genetic Databases, Drosophila, Human Herpesvirus 1/drug effects, Humans, Monte Carlo Method, Normal Distribution
ABSTRACT
BACKGROUND: Next-generation DNA sequencing platforms are capable of generating millions of reads in a matter of days at rapidly reducing costs. Despite its proliferation and technological improvements, the performance of next-generation sequencing remains adversely affected by the imperfections in the underlying biochemical and signal acquisition procedures. To this end, various techniques, including statistical methods, are used to improve read lengths and accuracy of these systems. Development of high-performing base calling algorithms that are computationally efficient and scalable is an ongoing challenge. RESULTS: We develop model-based statistical methods for fast and accurate base calling in Illumina's next-generation sequencing platforms. In particular, we propose a computationally tractable parametric model which enables dynamic programming formulation of the base calling problem. Forward-backward and soft-output Viterbi algorithms are developed, and their performance and complexity are investigated and compared with the existing state-of-the-art base calling methods for this platform. A C code implementation of our algorithm named Softy can be downloaded from https://sourceforge.net/projects/dynamicprog. CONCLUSION: We demonstrate high accuracy and speed of the proposed methods on reads obtained using Illumina's Genome Analyzer II and HiSeq2000. In addition to performing reliable and fast base calling, the developed algorithms enable incorporation of prior knowledge which can be utilized for parameter estimation and is potentially beneficial in various downstream applications.
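The dynamic programming at the heart of such base callers is easiest to see in the textbook hard-output Viterbi algorithm, the simpler cousin of the soft-output variant developed in the paper (toy two-state HMM; real base calling would have one state per nucleotide with a far richer signal model):

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden state sequence for `obs` under a discrete HMM.

    V[t][s] holds the probability of the best path ending in state s at
    time t; backpointers recover the path itself.
    """
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][p] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states
            )
            V[t][s] = prob
            back[t][s] = prev
    last = max(V[-1], key=V[-1].get)
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        last = back[t][last]
        path.append(last)
    return path[::-1]

# Toy HMM: hidden bases 'A'/'C' emitting noisy signals 'a'/'c'.
states = ('A', 'C')
path = viterbi(
    obs=['a', 'a', 'c'],
    states=states,
    start_p={'A': 0.5, 'C': 0.5},
    trans_p={'A': {'A': 0.8, 'C': 0.2}, 'C': {'A': 0.2, 'C': 0.8}},
    emit_p={'A': {'a': 0.9, 'c': 0.1}, 'C': {'a': 0.1, 'c': 0.9}},
)
```

The forward-backward algorithm mentioned in the abstract replaces the max with a sum to obtain per-position posteriors, which is what yields soft (confidence-annotated) base calls.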
Subjects
High-Throughput Nucleotide Sequencing/methods, DNA Sequence Analysis/methods, Algorithms, Statistical Models
ABSTRACT
BACKGROUND: Next-generation sequencing systems are capable of rapid and cost-effective DNA sequencing, thus enabling routine sequencing tasks and taking us one step closer to personalized medicine. Accuracy and lengths of their reads, however, are yet to surpass those provided by the conventional Sanger sequencing method. This motivates the search for computationally efficient algorithms capable of reliable and accurate detection of the order of nucleotides in short DNA fragments from the acquired data. RESULTS: In this paper, we consider Illumina's sequencing-by-synthesis platform which relies on reversible terminator chemistry and describe the acquired signal by reformulating its mathematical model as a Hidden Markov Model. Relying on this model and sequential Monte Carlo methods, we develop a parameter estimation and base calling scheme called ParticleCall. ParticleCall is tested on a data set obtained by sequencing phiX174 bacteriophage using Illumina's Genome Analyzer II. The results show that the developed base calling scheme is significantly more computationally efficient than the best performing unsupervised method currently available, while achieving the same accuracy. CONCLUSIONS: The proposed ParticleCall provides more accurate calls than Illumina's base calling algorithm, Bustard. At the same time, ParticleCall is significantly more computationally efficient than other recent schemes with similar performance, rendering it more feasible for high-throughput sequencing data analysis. Improvement of base calling accuracy will have immediate beneficial effects on the performance of downstream applications such as SNP and genotype calling.
Subjects
Algorithms, DNA Sequence Analysis/methods, Markov Chains, Monte Carlo Method, Single Nucleotide Polymorphism, Software
ABSTRACT
MOTIVATION: Next-generation DNA sequencing platforms are becoming increasingly cost-effective and capable of providing an enormous number of reads in a relatively short time. However, their accuracy and read lengths are still lagging behind those of conventional Sanger sequencing method. Performance of next-generation sequencing platforms is fundamentally limited by various imperfections in the sequencing-by-synthesis and signal acquisition processes. This drives the search for accurate, scalable and computationally tractable base calling algorithms capable of accounting for such imperfections. RESULTS: Relying on a statistical model of the sequencing-by-synthesis process and signal acquisition procedure, we develop a computationally efficient base calling method for Illumina's sequencing technology (specifically, Genome Analyzer II platform). Parameters of the model are estimated via a fast unsupervised online learning scheme, which uses the generalized expectation-maximization algorithm and requires only 3 s of running time per tile (on an Intel i7 machine at 3.07 GHz, single core), a three-orders-of-magnitude speed-up over existing parametric model-based methods. To minimize the latency between the end of the sequencing run and the generation of the base calling reports, we develop a fast online scalable decoding algorithm, which requires only 9 s/tile and achieves significantly lower error rates than Illumina's base calling software. Moreover, it is demonstrated that the proposed online parameter estimation scheme efficiently computes tile-dependent parameters, which can thereafter be provided to the base calling algorithm, resulting in significant improvements over previously developed base calling methods for the considered platform in terms of performance, time/complexity and latency. AVAILABILITY: A C code implementation of our algorithm can be downloaded from http://www.cerc.utexas.edu/OnlineCall/.
Subjects
High-Throughput Nucleotide Sequencing/methods, DNA Sequence Analysis/methods, Algorithms, Statistical Models, Software
ABSTRACT
We present a quantification method for affinity-based DNA microarrays which is based on the real-time measurements of hybridization kinetics. This method, i.e., real-time DNA microarrays, enhances the detection dynamic range of conventional systems by being impervious to probe saturation, washing artifacts, microarray spot-to-spot variations, and other intensity-affecting impediments. We demonstrate in both theory and practice that the time-constant of target capturing is inversely proportional to the concentration of the target analyte, which we exploit as the fundamental parameter for estimating the concentration of the analytes. Furthermore, to experimentally validate the capabilities of this method in practical applications, we present a FRET-based assay which enables the real-time detection in gene expression DNA microarrays.
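The time-constant estimation step can be sketched under a first-order kinetics assumption, y(t) = A*(1 - exp(-t/tau)): taking ln(1 - y/A) linearizes the model, so tau follows from a simple linear regression. The saturation level A and the calibration constant relating tau to concentration are assumed known here; the paper's estimation procedure is more general.

```python
import math

def estimate_time_constant(times, signals, saturation):
    """Estimate the capture time-constant tau from real-time measurements.

    Under y(t) = A*(1 - exp(-t/tau)), ln(1 - y/A) = -t/tau, so tau is
    the negative reciprocal of the regression slope. Since tau is
    inversely proportional to target concentration, concentration then
    follows as k/tau for a calibration constant k.
    """
    xs, ys = [], []
    for t, y in zip(times, signals):
        frac = 1.0 - y / saturation
        if frac > 0.0:            # keep only pre-saturation samples
            xs.append(t)
            ys.append(math.log(frac))
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -1.0 / slope

# Noiseless toy data generated with tau = 2.0, A = 1.0.
times = [0.5, 1.0, 1.5, 2.0]
signals = [1.0 - math.exp(-t / 2.0) for t in times]
tau = estimate_time_constant(times, signals, saturation=1.0)
```

Because only the early kinetics are needed, the estimate is available long before the spot saturates, which is the key to the extended dynamic range.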
Subjects
Fluorescence Resonance Energy Transfer, Oligonucleotide Array Sequence Analysis/methods, Algorithms, Animals, Calibration, DNA Probes/chemical synthesis, Gene Expression Profiling/methods, Kinetics, Mice, Oligonucleotide Array Sequence Analysis/standards, Reference Standards
ABSTRACT
Affinity-based biosensors rely on chemical attraction between analytes (targets) and their molecular complements (probes) to detect the presence and quantify the amounts of the analytes of interest. Real-time DNA microarrays acquire multiple temporal samples of the target-probe binding process. In this paper, estimation of the amount of targets based on early kinetics of the binding reaction is studied. A dual Kalman filter for the parameter-state estimation is proposed. Computational studies demonstrate efficacy of the proposed method.
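The filtering building block is easiest to see in the scalar case. The sketch below is a plain one-dimensional Kalman filter tracking a nearly constant quantity from noisy samples, a stripped-down stand-in for the dual filter (which runs two such filters, one for the state and one for the kinetic parameters):

```python
def kalman_1d(measurements, process_var, meas_var, x0=0.0, p0=1.0):
    """Scalar Kalman filter with a random-walk state model.

    At each step: predict (inflate variance by process noise), compute
    the Kalman gain, then correct the estimate with the innovation.
    """
    x, p = x0, p0
    estimates = []
    for z in measurements:
        p = p + process_var        # predict
        k = p / (p + meas_var)     # Kalman gain
        x = x + k * (z - x)        # update with innovation z - x
        p = (1.0 - k) * p
        estimates.append(x)
    return estimates

# Noiseless repeated measurements of a constant level 5.0:
# the estimate marches from the prior toward the measurements.
est = kalman_1d([5.0] * 20, process_var=0.0, meas_var=1.0)
```

In the dual-filter arrangement, one filter's state estimate feeds the other's parameter update at every time step, and vice versa.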