Búsqueda | Portal Regional de la BVS

1.

MCPNet: a parallel maximum capacity-based genome-scale gene network construction framework.

Pan, Tony C; Chockalingam, Sriram P; Aluru, Maneesha; Aluru, Srinivas.

Bioinformatics ; 39(6)2023 06 01.

Artículo en Inglés | MEDLINE | ID: mdl-37289522

RESUMEN

MOTIVATION: Gene network reconstruction from gene expression profiles is a compute- and data-intensive problem. Numerous methods based on diverse approaches including mutual information, random forests, Bayesian networks, correlation measures, as well as their transforms and filters such as data processing inequality, have been proposed. However, an effective gene network reconstruction method that performs well in all three aspects of computational efficiency, data size scalability, and output quality remains elusive. Simple techniques such as Pearson correlation are fast to compute but ignore indirect interactions, while more robust methods such as Bayesian networks are prohibitively time consuming to apply to tens of thousands of genes. RESULTS: We developed maximum capacity path (MCP) score, a novel maximum-capacity-path-based metric to quantify the relative strengths of direct and indirect gene-gene interactions. We further present MCPNet, an efficient, parallelized gene network reconstruction software based on MCP score, to reverse engineer networks in unsupervised and ensemble manners. Using synthetic and real Saccharomyces cervisiae datasets as well as real Arabidopsis thaliana datasets, we demonstrate that MCPNet produces better quality networks as measured by AUPRC, is significantly faster than all other gene network reconstruction software, and also scales well to tens of thousands of genes and hundreds of CPU cores. Thus, MCPNet represents a new gene network reconstruction tool that simultaneously achieves quality, performance, and scalability requirements. AVAILABILITY AND IMPLEMENTATION: Source code freely available for download at https://doi.org/10.5281/zenodo.6499747 and https://github.com/AluruLab/MCPNet, implemented in C++ and supported on Linux.

Asunto(s)

Algoritmos , Arabidopsis , Redes Reguladoras de Genes , Teorema de Bayes , Programas Informáticos , Genoma , Arabidopsis/genética

2.

On the Hardness of Sequence Alignment on De Bruijn Graphs.

Gibney, Daniel; Thankachan, Sharma V; Aluru, Srinivas.

J Comput Biol ; 29(12): 1377-1396, 2022 12.

Artículo en Inglés | MEDLINE | ID: mdl-36450127

RESUMEN

The problem of aligning a sequence to a walk in a labeled graph is of fundamental importance to Computational Biology. For an arbitrary graph G=(V,E) and a pattern P of length m, a lower bound based on the Strong Exponential Time Hypothesis implies that an algorithm for finding a walk in G exactly matching P significantly faster than O(|E|m) time is unlikely. However, for many special graphs, such as de Bruijn graphs, the problem can be solved in linear time. For approximate matching, the picture is more complex. When edits (substitutions, insertions, and deletions) are only allowed to the pattern, or when the graph is acyclic, the problem is solvable in O(|E|m) time. When edits are allowed to arbitrary cyclic graphs, the problem becomes NP-complete, even on binary alphabets. Moreover, NP-completeness continues to hold even when edits are restricted to only substitutions. Despite the popularity of the de Bruijn graphs in Computational Biology, the complexity of approximate pattern matching on the de Bruijn graphs remained unknown. We investigate this problem and show that the properties that make the de Bruijn graphs amenable to efficient exact pattern matching do not extend to approximate matching, even when restricted to the substitutions only case with alphabet size four. Specifically, we prove that determining the existence of a matching walk in a de Bruijn graph is NP-complete when substitutions are allowed to the graph. We also demonstrate that an algorithm significantly faster than O(|E|m) is unlikely for the de Bruijn graphs in the case where substitutions are only allowed to the pattern. This stands in contrast to pattern-to-text matching where exact matching is solvable in linear time, such as on the de Bruijn graphs, but approximate matching under substitutions is solvable in subquadratic Õ(nm) time, where n is the text's length.

Asunto(s)

Algoritmos , Biología Computacional , Alineación de Secuencia , Análisis de Secuencia de ADN , Dureza

3.

GRNUlar: A Deep Learning Framework for Recovering Single-Cell Gene Regulatory Networks.

Shrivastava, Harsh; Zhang, Xiuwei; Song, Le; Aluru, Srinivas.

J Comput Biol ; 29(1): 27-44, 2022 01.

Artículo en Inglés | MEDLINE | ID: mdl-35050715

RESUMEN

We propose GRNUlar, a novel deep learning framework for supervised learning of gene regulatory networks (GRNs) from single-cell RNA-Sequencing (scRNA-Seq) data. Our framework incorporates two intertwined models. First, we leverage the expressive ability of neural networks to capture complex dependencies between transcription factors and the corresponding genes they regulate, by developing a multitask learning framework. Second, to capture sparsity of GRNs observed in the real world, we design an unrolled algorithm technique for our framework. Our deep architecture requires supervision for training, for which we repurpose existing synthetic data simulators that generate scRNA-Seq data guided by an underlying GRN. Experimental results demonstrate that GRNUlar outperforms state-of-the-art methods on both synthetic and real data sets. Our study also demonstrates the novel and successful use of expression data simulators for supervised learning of GRN inference.

Asunto(s)

Aprendizaje Profundo , Redes Reguladoras de Genes , Análisis de la Célula Individual/estadística & datos numéricos , Algoritmos , Animales , Sesgo , Biología Computacional , Simulación por Computador , Bases de Datos de Ácidos Nucleicos/estadística & datos numéricos , Escherichia coli/genética , Humanos , Ratones , Redes Neurales de la Computación , RNA-Seq/estadística & datos numéricos , Saccharomyces cerevisiae/genética , Aprendizaje Automático Supervisado

4.

EnGRaiN: a supervised ensemble learning method for recovery of large-scale gene regulatory networks.

Aluru, Maneesha; Shrivastava, Harsh; Chockalingam, Sriram P; Shivakumar, Shruti; Aluru, Srinivas.

Bioinformatics ; 38(5): 1312-1319, 2022 02 07.

Artículo en Inglés | MEDLINE | ID: mdl-34888624

RESUMEN

MOTIVATION: Reconstruction of genome-scale networks from gene expression data is an actively studied problem. A wide range of methods that differ between the types of interactions they uncover with varying trade-offs between sensitivity and specificity have been proposed. To leverage benefits of multiple such methods, ensemble network methods that combine predictions from resulting networks have been developed, promising results better than or as good as the individual networks. Perhaps owing to the difficulty in obtaining accurate training examples, these ensemble methods hitherto are unsupervised. RESULTS: In this article, we introduce EnGRaiN, the first supervised ensemble learning method to construct gene networks. The supervision for training is provided by small training datasets of true edge connections (positives) and edges known to be absent (negatives) among gene pairs. We demonstrate the effectiveness of EnGRaiN using simulated datasets as well as a curated collection of Arabidopsis thaliana datasets we created from microarray datasets available from public repositories. EnGRaiN shows better results not only in terms of receiver operating characteristic and PR characteristics for both real and simulated datasets compared with unsupervised methods for ensemble network construction, but also generates networks that can be mined for elucidating complex biological interactions. AVAILABILITY AND IMPLEMENTATION: EnGRaiN software and the datasets used in the study are publicly available at the github repository: https://github.com/AluruLab/EnGRaiN. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Asunto(s)

Arabidopsis , Redes Reguladoras de Genes , Programas Informáticos , Genoma , Arabidopsis/genética , Aprendizaje Automático

5.

Fast alignment and preprocessing of chromatin profiles with Chromap.

Zhang, Haowen; Song, Li; Wang, Xiaotao; Cheng, Haoyu; Wang, Chenfei; Meyer, Clifford A; Liu, Tao; Tang, Ming; Aluru, Srinivas; Yue, Feng; Liu, X Shirley; Li, Heng.

Nat Commun ; 12(1): 6566, 2021 11 12.

Artículo en Inglés | MEDLINE | ID: mdl-34772935

RESUMEN

As sequencing depth of chromatin studies continually grows deeper for sensitive profiling of regulatory elements or chromatin spatial structures, aligning and preprocessing of these sequencing data have become the bottleneck for analysis. Here we present Chromap, an ultrafast method for aligning and preprocessing high throughput chromatin profiles. Chromap is comparable to BWA-MEM and Bowtie2 in alignment accuracy and is over 10 times faster than traditional workflows on bulk ChIP-seq/Hi-C profiles and than 10x Genomics' CellRanger v2.0.0 pipeline on single-cell ATAC-seq profiles.

Asunto(s)

Cromatina , Genómica/métodos , Secuenciación de Inmunoprecipitación de Cromatina , Biología Computacional , Perfilación de la Expresión Génica/métodos , Humanos , Secuencias Reguladoras de Ácidos Nucleicos , Análisis de Secuencia de ADN/métodos , Análisis de la Célula Individual

6.

Reply to: "Re-evaluating the evidence for a universal genetic boundary among microbial species".

Rodriguez-R, Luis M; Jain, Chirag; Conrad, Roth E; Aluru, Srinivas; Konstantinidis, Konstantinos T.

Nat Commun ; 12(1): 4060, 2021 07 07.

Artículo en Inglés | MEDLINE | ID: mdl-34234115

Asunto(s)

Parto , Femenino , Humanos , Embarazo

7.

Real-time mapping of nanopore raw signals.

Zhang, Haowen; Li, Haoran; Jain, Chirag; Cheng, Haoyu; Au, Kin Fai; Li, Heng; Aluru, Srinivas.

Bioinformatics ; 37(Suppl_1): i477-i483, 2021 07 12.

Artículo en Inglés | MEDLINE | ID: mdl-34252938

RESUMEN

MOTIVATION: Oxford Nanopore Technologies sequencing devices support adaptive sequencing, in which undesired reads can be ejected from a pore in real time. This feature allows targeted sequencing aided by computational methods for mapping partial reads, rather than complex library preparation protocols. However, existing mapping methods either require a computationally expensive base-calling procedure before using aligners to map partial reads or work well only on small genomes. RESULTS: In this work, we present a new streaming method that can map nanopore raw signals for real-time selective sequencing. Rather than converting read signals to bases, we propose to convert reference genomes to signals and fully operate in the signal space. Our method features a new way to index reference genomes using k-d trees, a novel seed selection strategy and a seed chaining algorithm tailored toward the current signal characteristics. We implemented the method as a tool Sigmap. Then we evaluated it on both simulated and real data and compared it to the state-of-the-art nanopore raw signal mapper Uncalled. Our results show that Sigmap yields comparable performance on mapping yeast simulated raw signals, and better mapping accuracy on mapping yeast real raw signals with a 4.4× speedup. Moreover, our method performed well on mapping raw signals to genomes of size >100 Mbp and correctly mapped 11.49% more real raw signals of green algae, which leads to a significantly higher F1-score (0.9354 versus 0.8660). AVAILABILITY AND IMPLEMENTATION: Sigmap code is accessible at https://github.com/haowenz/sigmap. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Asunto(s)

Nanoporos , Algoritmos , Genoma , Secuenciación de Nucleótidos de Alto Rendimiento , Análisis de Secuencia de ADN , Programas Informáticos

8.

A variant selection framework for genome graphs.

Jain, Chirag; Tavakoli, Neda; Aluru, Srinivas.

Bioinformatics ; 37(Suppl_1): i460-i467, 2021 07 12.

Artículo en Inglés | MEDLINE | ID: mdl-34252945

RESUMEN

MOTIVATION: Variation graph representations are projected to either replace or supplement conventional single genome references due to their ability to capture population genetic diversity and reduce reference bias. Vast catalogues of genetic variants for many species now exist, and it is natural to ask which among these are crucial to circumvent reference bias during read mapping. RESULTS: In this work, we propose a novel mathematical framework for variant selection, by casting it in terms of minimizing variation graph size subject to preserving paths of length α with at most Î´ differences. This framework leads to a rich set of problems based on the types of variants [e.g. single nucleotide polymorphisms (SNPs), indels or structural variants (SVs)], and whether the goal is to minimize the number of positions at which variants are listed or to minimize the total number of variants listed. We classify the computational complexity of these problems and provide efficient algorithms along with their software implementation when feasible. We empirically evaluate the magnitude of graph reduction achieved in human chromosome variation graphs using multiple α and Î´ parameter values corresponding to short and long-read resequencing characteristics. When our algorithm is run with parameter settings amenable to long-read mapping (α = 10 kbp, Î´ = 1000), 99.99% SNPs and 73% SVs can be safely excluded from human chromosome 1 variation graph. The graph size reduction can benefit downstream pan-genome analysis. AVAILABILITY AND IMPLEMENTATION: : https://github.com/AT-CG/VF. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Asunto(s)

Genoma , Programas Informáticos , Algoritmos , Genoma Humano , Humanos , Polimorfismo de Nucleótido Simple , Análisis de Secuencia de ADN

9.

A comprehensive evaluation of long read error correction methods.

Zhang, Haowen; Jain, Chirag; Aluru, Srinivas.

BMC Genomics ; 21(Suppl 6): 889, 2020 Dec 21.

Artículo en Inglés | MEDLINE | ID: mdl-33349243

RESUMEN

BACKGROUND: Third-generation single molecule sequencing technologies can sequence long reads, which is advancing the frontiers of genomics research. However, their high error rates prohibit accurate and efficient downstream analysis. This difficulty has motivated the development of many long read error correction tools, which tackle this problem through sampling redundancy and/or leveraging accurate short reads of the same biological samples. Existing studies to asses these tools use simulated data sets, and are not sufficiently comprehensive in the range of software covered or diversity of evaluation measures used. RESULTS: In this paper, we present a categorization and review of long read error correction methods, and provide a comprehensive evaluation of the corresponding long read error correction tools. Leveraging recent real sequencing data, we establish benchmark data sets and set up evaluation criteria for a comparative assessment which includes quality of error correction as well as run-time and memory usage. We study how trimming and long read sequencing depth affect error correction in terms of length distribution and genome coverage post-correction, and the impact of error correction performance on an important application of long reads, genome assembly. We provide guidelines for practitioners for choosing among the available error correction tools and identify directions for future research. CONCLUSIONS: Despite the high error rate of long reads, the state-of-the-art correction tools can achieve high correction quality. When short reads are available, the best hybrid methods outperform non-hybrid methods in terms of correction quality and computing resource usage. When choosing tools for use, practitioners are suggested to be careful with a few correction tools that discard reads, and check the effect of error correction tools on downstream analysis. Our evaluation code is available as open-source at https://github.com/haowenz/LRECE .

Asunto(s)

Algoritmos , Secuenciación de Nucleótidos de Alto Rendimiento , Genómica , Análisis de Secuencia de ADN , Programas Informáticos

10.

An alignment-free heuristic for fast sequence comparisons with applications to phylogeny reconstruction.

Chockalingam, Sriram P; Pannu, Jodh; Hooshmand, Sahar; Thankachan, Sharma V; Aluru, Srinivas.

BMC Bioinformatics ; 21(Suppl 6): 404, 2020 Nov 18.

Artículo en Inglés | MEDLINE | ID: mdl-33203364

RESUMEN

BACKGROUND: Alignment-free methods for sequence comparisons have become popular in many bioinformatics applications, specifically in the estimation of sequence similarity measures to construct phylogenetic trees. Recently, the average common substring measure, ACS, and its k-mismatch counterpart, ACSk, have been shown to produce results as effective as multiple-sequence alignment based methods for reconstruction of phylogeny trees. Since computing ACSk takes O(n logkn) time and hence impractical for large datasets, multiple heuristics that can approximate ACSk have been introduced. RESULTS: In this paper, we present a novel linear-time heuristic to approximate ACSk, which is faster than computing the exact ACSk while being closer to the exact ACSk values compared to previously published linear-time greedy heuristics. Using four real datasets, containing both DNA and protein sequences, we evaluate our algorithm in terms of accuracy, runtime and demonstrate its applicability for phylogeny reconstruction. Our algorithm provides better accuracy than previously published heuristic methods, while being comparable in its applications to phylogeny reconstruction. CONCLUSIONS: Our method produces a better approximation for ACSk and is applicable for the alignment-free comparison of biological sequences at highly competitive speed. The algorithm is implemented in Rust programming language and the source code is available at https://github.com/srirampc/adyar-rs .

Asunto(s)

Biología Computacional , Heurística , Filogenia , Algoritmos , Alineación de Secuencia , Programas Informáticos

11.

Fast de Bruijn Graph Compaction in Distributed Memory Environments.

Pan, Tony; Nihalani, Rahul; Aluru, Srinivas.

IEEE/ACM Trans Comput Biol Bioinform ; 17(1): 136-148, 2020.

Artículo en Inglés | MEDLINE | ID: mdl-30072337

RESUMEN

De Bruijn graph based genome assembly has gained popularity as short read sequencers become ubiquitous. A core assembly operation is the generation of unitigs, which are sequences corresponding to chains in the graph. Unitigs are used as building blocks for generating longer sequences in many assemblers, and can facilitate graph compression. Chain compaction, by which unitigs are generated, remains a critical computational task. In this paper, we present a distributed memory parallel algorithm for simultaneous compaction of all chains in bi-directed de Bruijn graphs. The key advantages of our algorithm include bounding the chain compaction run-time to logarithmic number of iterations in the length of the longest chain, and ability to differentiate cycles from chains within logarithmic number of iterations in the length of the longest cycle. Our algorithm scales to thousands of computational cores, and can compact a whole genome de Bruijn graph from a human sequence read set in 7.3 seconds using 7680 distributed memory cores, and in 12.9 minutes using 64 shared memory cores. It is 3.7× and 2.0× faster than equivalent steps in the state-of-the-art tools for distributed and shared memory environments, respectively. An implementation of the algorithm is available at https://github.com/ParBLiSS/bruno.

Asunto(s)

Algoritmos , Genómica/métodos , Análisis de Secuencia de ADN/métodos , Gráficos por Computador , Bases de Datos Genéticas , Genoma/genética , Humanos

12.

Kmerind: A Flexible Parallel Library for K-mer Indexing of Biological Sequences on Distributed Memory Systems.

Pan, Tony; Flick, Patrick; Jain, Chirag; Liu, Yongchao; Aluru, Srinivas.

IEEE/ACM Trans Comput Biol Bioinform ; 16(4): 1117-1131, 2019.

Artículo en Inglés | MEDLINE | ID: mdl-28991750

RESUMEN

Counting and indexing fixed length substrings, or $k$k-mers, in biological sequences is a key step in many bioinformatics tasks including genome alignment and mapping, genome assembly, and error correction. While advances in next generation sequencing technologies have dramatically reduced the cost and improved latency and throughput, few bioinformatics tools can efficiently process the datasets at the current generation rate of 1.8 terabases per 3-day experiment from a single sequencer. We present Kmerind, a high performance parallel $k$k-mer indexing library for distributed memory environments. The Kmerind library provides a set of simple and consistent APIs with sequential semantics and parallel implementations that are designed to be flexible and extensible. Kmerind's $k$k-mer counter performs similarly or better than the best existing $k$k-mer counting tools even on shared memory systems. In a distributed memory environment, Kmerind counts $k$k-mers in a 120 GB sequence read dataset in less than 13 seconds on 1024 Xeon CPU cores, and fully indexes their positions in approximately 17 seconds. Querying for 1 percent of the $k$k-mers in these indices can be completed in 0.23 seconds and 28 seconds, respectively. Kmerind is the first $k$k-mer indexing library for distributed memory environments, and the first extensible library for general $k$k-mer indexing and counting. Kmerind is available at https://github.com/ParBLiSS/kmerind.

Asunto(s)

Biología Computacional/métodos , Genoma Humano , Secuenciación de Nucleótidos de Alto Rendimiento , Algoritmos , Redes de Comunicación de Computadores , Computadores , Biblioteca de Genes , Genoma , Humanos , Lenguajes de Programación , Semántica , Programas Informáticos

13.

High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries.

Jain, Chirag; Rodriguez-R, Luis M; Phillippy, Adam M; Konstantinidis, Konstantinos T; Aluru, Srinivas.

Nat Commun ; 9(1): 5114, 2018 11 30.

Artículo en Inglés | MEDLINE | ID: mdl-30504855

RESUMEN

A fundamental question in microbiology is whether there is continuum of genetic diversity among genomes, or clear species boundaries prevail instead. Whole-genome similarity metrics such as Average Nucleotide Identity (ANI) help address this question by facilitating high resolution taxonomic analysis of thousands of genomes from diverse phylogenetic lineages. To scale to available genomes and beyond, we present FastANI, a new method to estimate ANI using alignment-free approximate sequence mapping. FastANI is accurate for both finished and draft genomes, and is up to three orders of magnitude faster compared to alignment-based approaches. We leverage FastANI to compute pairwise ANI values among all prokaryotic genomes available in the NCBI database. Our results reveal clear genetic discontinuity, with 99.8% of the total 8 billion genome pairs analyzed conforming to >95% intra-species and <83% inter-species ANI values. This discontinuity is manifested with or without the most frequently sequenced species, and is robust to historic additions in the genome databases.

Asunto(s)

Células Procariotas/metabolismo , Bases de Datos Factuales , Variación Genética/genética , Genoma Bacteriano/genética , Filogenia , Análisis de Secuencia de ADN

14.

An SVM-based method for assessment of transcription factor-DNA complex models.

Corona, Rosario I; Sudarshan, Sanjana; Aluru, Srinivas; Guo, Jun-Tao.

BMC Bioinformatics ; 19(Suppl 20): 506, 2018 Dec 21.

Artículo en Inglés | MEDLINE | ID: mdl-30577740

RESUMEN

BACKGROUND: Atomic details of protein-DNA complexes can provide insightful information for better understanding of the function and binding specificity of DNA binding proteins. In addition to experimental methods for solving protein-DNA complex structures, protein-DNA docking can be used to predict native or near-native complex models. A docking program typically generates a large number of complex conformations and predicts the complex model(s) based on interaction energies between protein and DNA. However, the prediction accuracy is hampered by current approaches to model assessment, especially when docking simulations fail to produce any near-native models. RESULTS: We present here a Support Vector Machine (SVM)-based approach for quality assessment of the predicted transcription factor (TF)-DNA complex models. Besides a knowledge-based protein-DNA interaction potential DDNA3, we applied several structural features that have been shown to play important roles in binding specificity between transcription factors and DNA molecules to quality assessment of complex models. To address the issue of unbalanced positive and negative cases in the training dataset, we applied hard-negative mining, an iterative training process that selects an initial training dataset by combining all of the positive cases and a random sample from the negative cases. Results show that the SVM model greatly improves prediction accuracy (84.2%) over two knowledge-based protein-DNA interaction potentials, orientation potential (60.8%) and DDNA3 (68.4%). The improvement is achieved through reducing the number of false positive predictions, especially for the hard docking cases, in which a docking algorithm fails to produce any near-native complex models. CONCLUSIONS: A learning-based SVM scoring model with structural features for specific protein-DNA binding and an atomic-level protein-DNA interaction potential DDNA3 significantly improves prediction accuracy of complex models by successfully identifying cases without near-native structural models.

Asunto(s)

ADN/metabolismo , Modelos Moleculares , Máquina de Vectores de Soporte , Factores de Transcripción/metabolismo , Algoritmos , ADN/química , Unión Proteica

15.

A fast adaptive algorithm for computing whole-genome homology maps.

Jain, Chirag; Koren, Sergey; Dilthey, Alexander; Phillippy, Adam M; Aluru, Srinivas.

Bioinformatics ; 34(17): i748-i756, 2018 09 01.

Artículo en Inglés | MEDLINE | ID: mdl-30423094

RESUMEN

Motivation: Whole-genome alignment is an important problem in genomics for comparing different species, mapping draft assemblies to reference genomes and identifying repeats. However, for large plant and animal genomes, this task remains compute and memory intensive. In addition, current practical methods lack any guarantee on the characteristics of output alignments, thus making them hard to tune for different application requirements. Results: We introduce an approximate algorithm for computing local alignment boundaries between long DNA sequences. Given a minimum alignment length and an identity threshold, our algorithm computes the desired alignment boundaries and identity estimates using kmer-based statistics, and maintains sufficient probabilistic guarantees on the output sensitivity. Further, to prioritize higher scoring alignment intervals, we develop a plane-sweep based filtering technique which is theoretically optimal and practically efficient. Implementation of these ideas resulted in a fast and accurate assembly-to-genome and genome-to-genome mapper. As a result, we were able to map an error-corrected whole-genome NA12878 human assembly to the hg38 human reference genome in about 1 min total execution time and <4 GB memory using eight CPU threads, achieving significant improvement in memory-usage over competing methods. Recall accuracy of computed alignment boundaries was consistently found to be >97% on multiple datasets. Finally, we performed a sensitive self-alignment of the human genome to compute all duplications of length ≥1 Kbp and ≥90% identity. The reported output achieves good recall and covers twice the number of bases than the current UCSC browser's segmental duplication annotation. Availability and implementation: https://github.com/marbl/MashMap.

Asunto(s)

Algoritmos , Genómica , Secuenciación de Nucleótidos de Alto Rendimiento/métodos , Secuencia de Bases , Mapeo Cromosómico , Genoma Humano , Genómica/métodos , Humanos , Duplicaciones Segmentarias en el Genoma , Alineación de Secuencia , Programas Informáticos , Factores de Tiempo

16.

A Fast Approximate Algorithm for Mapping Long Reads to Large Reference Databases.

Jain, Chirag; Dilthey, Alexander; Koren, Sergey; Aluru, Srinivas; Phillippy, Adam M.

J Comput Biol ; 25(7): 766-779, 2018 07.

Artículo en Inglés | MEDLINE | ID: mdl-29708767

RESUMEN

Emerging single-molecule sequencing technologies from Pacific Biosciences and Oxford Nanopore have revived interest in long-read mapping algorithms. Alignment-based seed-and-extend methods demonstrate good accuracy, but face limited scalability, while faster alignment-free methods typically trade decreased precision for efficiency. In this article, we combine a fast approximate read mapping algorithm based on minimizers with a novel MinHash identity estimation technique to achieve both scalability and precision. In contrast to prior methods, we develop a mathematical framework that defines the types of mapping targets we uncover, establish probabilistic estimates of p-value and sensitivity, and demonstrate tolerance for alignment error rates up to 20%. With this framework, our algorithm automatically adapts to different minimum length and identity requirements and provides both positional and identity estimates for each mapping reported. For mapping human PacBio reads to the hg38 reference, our method is 290 × faster than Burrows-Wheeler Aligner-MEM with a lower memory footprint and recall rate of 96%. We further demonstrate the scalability of our method by mapping noisy PacBio reads (each ≥5 kbp in length) to the complete NCBI RefSeq database containing 838 Gbp of sequence and >60,000 genomes.

Asunto(s)

Genoma Humano/genética , Secuenciación de Nucleótidos de Alto Rendimiento/métodos , Programas Informáticos , Algoritmos , Bases de Datos Factuales , Humanos , Alineación de Secuencia/métodos , Análisis de Secuencia de ADN

17.

A greedy alignment-free distance estimator for phylogenetic inference.

Thankachan, Sharma V; Chockalingam, Sriram P; Liu, Yongchao; Krishnan, Ambujam; Aluru, Srinivas.

BMC Bioinformatics ; 18(Suppl 8): 238, 2017 Jun 07.

Artículo en Inglés | MEDLINE | ID: mdl-28617225

RESUMEN

BACKGROUND: Alignment-free sequence comparison approaches have been garnering increasing interest in various data- and compute-intensive applications such as phylogenetic inference for large-scale sequences. While k-mer based methods are predominantly used in real applications, the average common substring (ACS) approach is emerging as one of the prominent alignment-free approaches. This ACS approach has been further generalized by some recent work, either greedily or exactly, by allowing a bounded number of mismatches in the common substrings. RESULTS: We present ALFRED-G, a greedy alignment-free distance estimator for phylogenetic tree reconstruction based on the concept of the generalized ACS approach. In this algorithm, we have investigated a new heuristic to efficiently compute the lengths of common strings with mismatches allowed, and have further applied this heuristic to phylogeny reconstruction. Performance evaluation using real sequence datasets shows that our heuristic is able to reconstruct comparable, or even more accurate, phylogenetic tree topologies than the kmacs heuristic algorithm at highly competitive speed. CONCLUSIONS: ALFRED-G is an alignment-free heuristic for evolutionary distance estimation between two biological sequences. This algorithm is implemented in C++ and has been incorporated into our open-source ALFRED software package ( http://alurulab.cc.gatech.edu/phylo ).

Asunto(s)

Algoritmos , Biología Computacional/métodos , Filogenia , Análisis de Secuencia/métodos

18.

Efficient detection of viral transmissions with Next-Generation Sequencing data.

Rytsareva, Inna; Campo, David S; Zheng, Yueli; Sims, Seth; Thankachan, Sharma V; Tetik, Cansu; Chirag, Jain; Chockalingam, Sriram P; Sue, Amanda; Aluru, Srinivas; Khudyakov, Yury.

BMC Genomics ; 18(Suppl 4): 372, 2017 05 24.

Artículo en Inglés | MEDLINE | ID: mdl-28589864

RESUMEN

BACKGROUND: Hepatitis C is a major public health problem in the United States and worldwide. Outbreaks of hepatitis C virus (HCV) infections associated with unsafe injection practices, drug diversion, and other exposures to blood are difficult to detect and investigate. Molecular analysis has been frequently used in the study of HCV outbreaks and transmission chains; helping identify a cluster of sequences as linked by transmission if their genetic distances are below a previously defined threshold. However, HCV exists as a population of numerous variants in each infected individual and it has been observed that minority variants in the source are often the ones responsible for transmission, a situation that precludes the use of a single sequence per individual because many such transmissions would be missed. The use of Next-Generation Sequencing immensely increases the sensitivity of transmission detection but brings a considerable computational challenge because all sequences need to be compared among all pairs of samples. METHODS: We developed a three-step strategy that filters pairs of samples according to different criteria: (i) a k-mer bloom filter, (ii) a Levenhstein filter and (iii) a filter of identical sequences. We applied these three filters on a set of samples that cover the spectrum of genetic relationships among HCV cases, from being part of the same transmission cluster, to belonging to different subtypes. RESULTS: Our three-step filtering strategy rapidly removes 85.1% of all the pairwise sample comparisons and 91.0% of all pairwise sequence comparisons, accurately establishing which pairs of HCV samples are below the relatedness threshold. CONCLUSIONS: We present a fast and efficient three-step filtering strategy that removes most sequence comparisons and accurately establishes transmission links of any threshold-based method. This highly efficient workflow will allow a faster response and molecular detection capacity, improving the rate of detection of viral transmissions with molecular data.

Asunto(s)

Hepacivirus/genética , Hepacivirus/fisiología , Secuenciación de Nucleótidos de Alto Rendimiento , Algoritmos , Estadística como Asunto

19.

RD26 mediates crosstalk between drought and brassinosteroid signalling pathways.

Ye, Huaxun; Liu, Sanzhen; Tang, Buyun; Chen, Jiani; Xie, Zhouli; Nolan, Trevor M; Jiang, Hao; Guo, Hongqing; Lin, Hung-Ying; Li, Lei; Wang, Yanqun; Tong, Hongning; Zhang, Mingcai; Chu, Chengcai; Li, Zhaohu; Aluru, Maneesha; Aluru, Srinivas; Schnable, Patrick S; Yin, Yanhai.

Nat Commun ; 8: 14573, 2017 02 24.

Artículo en Inglés | MEDLINE | ID: mdl-28233777

RESUMEN

Brassinosteroids (BRs) regulate plant growth and stress responses via the BES1/BZR1 family of transcription factors, which regulate the expression of thousands of downstream genes. BRs are involved in the response to drought, however the mechanistic understanding of interactions between BR signalling and drought response remains to be established. Here we show that transcription factor RD26 mediates crosstalk between drought and BR signalling. When overexpressed, BES1 target gene RD26 can inhibit BR-regulated growth. Global gene expression studies suggest that RD26 can act antagonistically to BR to regulate the expression of a subset of BES1-regulated genes, thereby inhibiting BR function. We show that RD26 can interact with BES1 protein and antagonize BES1 transcriptional activity on BR-regulated genes and that BR signalling can also repress expression of RD26 and its homologues and inhibit drought responses. Our results thus reveal a mechanism coordinating plant growth and drought tolerance.

Asunto(s)

Proteínas de Arabidopsis/metabolismo , Arabidopsis/fisiología , Brasinoesteroides/metabolismo , Proteínas Nucleares/metabolismo , Reguladores del Crecimiento de las Plantas/fisiología , Factores de Transcripción/metabolismo , Adaptación Fisiológica , Proteínas de Arabidopsis/genética , Proteínas de Unión al ADN , Sequías , Regulación de la Expresión Génica de las Plantas/fisiología , Mutación con Pérdida de Función , Fosforilación , Plantas Modificadas Genéticamente , Transducción de Señal/fisiología , Factores de Transcripción/genética

20.

Microarray Data Processing Techniques for Genome-Scale Network Inference from Large Public Repositories.

Chockalingam, Sriram; Aluru, Maneesha; Aluru, Srinivas.

Microarrays (Basel) ; 5(3)2016 Sep 19.

Artículo en Inglés | MEDLINE | ID: mdl-27657141

RESUMEN

Pre-processing of microarray data is a well-studied problem. Furthermore, all popular platforms come with their own recommended best practices for differential analysis of genes. However, for genome-scale network inference using microarray data collected from large public repositories, these methods filter out a considerable number of genes. This is primarily due to the effects of aggregating a diverse array of experiments with different technical and biological scenarios. Here we introduce a pre-processing pipeline suitable for inferring genome-scale gene networks from large microarray datasets. We show that partitioning of the available microarray datasets according to biological relevance into tissue- and process-specific categories significantly extends the limits of downstream network construction. We demonstrate the effectiveness of our pre-processing pipeline by inferring genome-scale networks for the model plant Arabidopsis thaliana using two different construction methods and a collection of 11,760 Affymetrix ATH1 microarray chips. Our pre-processing pipeline and the datasets used in this paper are made available at http://alurulab.cc.gatech.edu/microarray-pp.

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

ENVIAR RESULTADO:

SELECCIÓN DE REFERENCIAS

DETALLE DE LA BÚSQUEDA