Search | Biblioteca Virtual en Salud (Virtual Health Library)

MCPNet: a parallel maximum capacity-based genome-scale gene network construction framework.

Pan, Tony C; Chockalingam, Sriram P; Aluru, Maneesha; Aluru, Srinivas.

Bioinformatics ; 39(6), 2023 06 01.

Article in English | MEDLINE | ID: mdl-37289522

ABSTRACT

MOTIVATION: Gene network reconstruction from gene expression profiles is a compute- and data-intensive problem. Numerous methods based on diverse approaches including mutual information, random forests, Bayesian networks, correlation measures, as well as their transforms and filters such as data processing inequality, have been proposed. However, an effective gene network reconstruction method that performs well in all three aspects of computational efficiency, data size scalability, and output quality remains elusive. Simple techniques such as Pearson correlation are fast to compute but ignore indirect interactions, while more robust methods such as Bayesian networks are prohibitively time-consuming to apply to tens of thousands of genes. RESULTS: We developed the maximum capacity path (MCP) score, a novel maximum-capacity-path-based metric to quantify the relative strengths of direct and indirect gene-gene interactions. We further present MCPNet, an efficient, parallelized gene network reconstruction software based on MCP score, to reverse engineer networks in unsupervised and ensemble manners. Using synthetic and real Saccharomyces cerevisiae datasets as well as real Arabidopsis thaliana datasets, we demonstrate that MCPNet produces better quality networks as measured by AUPRC, is significantly faster than all other gene network reconstruction software, and also scales well to tens of thousands of genes and hundreds of CPU cores. Thus, MCPNet represents a new gene network reconstruction tool that simultaneously achieves quality, performance, and scalability requirements. AVAILABILITY AND IMPLEMENTATION: Source code freely available for download at https://doi.org/10.5281/zenodo.6499747 and https://github.com/AluruLab/MCPNet, implemented in C++ and supported on Linux.

Subject(s)

Algorithms; Arabidopsis; Gene Regulatory Networks; Bayes Theorem; Software; Genome; Arabidopsis/genetics
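
The abstract above names a maximum-capacity-path metric but does not define it. As a rough illustration of the underlying graph idea only, the sketch below computes all-pairs maximum capacity (widest path) values on a small weighted gene graph, where the capacity of a path is its weakest edge. The function name, the toy weights, and the Floyd-Warshall-style update are assumptions made for this example; this is not the MCPNet implementation or its scoring formula.

```python
def max_capacity_paths(weights):
    """All-pairs maximum capacity (widest path) values.

    weights[i][j] is a non-negative edge strength (0 means no edge).
    A path's capacity is the weight of its weakest edge; the result holds,
    for every pair, the best capacity over all paths, direct or indirect.
    Hypothetical helper, not taken from MCPNet.
    """
    n = len(weights)
    cap = [row[:] for row in weights]  # start from the direct edge strengths
    for k in range(n):                 # allow gene k as an intermediate hop
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                through_k = min(cap[i][k], cap[k][j])
                if through_k > cap[i][j]:
                    cap[i][j] = through_k
    return cap


# Toy symmetric graph over genes 0, 1, 2 (made-up strengths).
toy = [
    [0.0, 0.9, 0.3],
    [0.9, 0.0, 0.8],
    [0.3, 0.8, 0.0],
]
capacity = max_capacity_paths(toy)
# Direct strength 0 -> 2 is 0.3, but the route 0 -> 1 -> 2 has capacity 0.8,
# so the indirect route is the stronger of the two for this pair.
print(toy[0][2], capacity[0][2])
```

The cubic loop is only practical for small graphs; the record describes a parallel C++ implementation for genome-scale inputs.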

EnGRaiN: a supervised ensemble learning method for recovery of large-scale gene regulatory networks.

Aluru, Maneesha; Shrivastava, Harsh; Chockalingam, Sriram P; Shivakumar, Shruti; Aluru, Srinivas.

Bioinformatics ; 38(5): 1312-1319, 2022 02 07.

Article in English | MEDLINE | ID: mdl-34888624

ABSTRACT

MOTIVATION: Reconstruction of genome-scale networks from gene expression data is an actively studied problem. A wide range of methods that differ between the types of interactions they uncover with varying trade-offs between sensitivity and specificity have been proposed. To leverage benefits of multiple such methods, ensemble network methods that combine predictions from resulting networks have been developed, promising results better than or as good as the individual networks. Perhaps owing to the difficulty in obtaining accurate training examples, these ensemble methods hitherto are unsupervised. RESULTS: In this article, we introduce EnGRaiN, the first supervised ensemble learning method to construct gene networks. The supervision for training is provided by small training datasets of true edge connections (positives) and edges known to be absent (negatives) among gene pairs. We demonstrate the effectiveness of EnGRaiN using simulated datasets as well as a curated collection of Arabidopsis thaliana datasets we created from microarray datasets available from public repositories. EnGRaiN shows better results not only in terms of receiver operating characteristic and PR characteristics for both real and simulated datasets compared with unsupervised methods for ensemble network construction, but also generates networks that can be mined for elucidating complex biological interactions. AVAILABILITY AND IMPLEMENTATION: EnGRaiN software and the datasets used in the study are publicly available at the github repository: https://github.com/AluruLab/EnGRaiN. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Arabidopsis; Gene Regulatory Networks; Software; Genome; Arabidopsis/genetics; Machine Learning

An alignment-free heuristic for fast sequence comparisons with applications to phylogeny reconstruction.

Chockalingam, Sriram P; Pannu, Jodh; Hooshmand, Sahar; Thankachan, Sharma V; Aluru, Srinivas.

BMC Bioinformatics ; 21(Suppl 6): 404, 2020 Nov 18.

Article in English | MEDLINE | ID: mdl-33203364

ABSTRACT

BACKGROUND: Alignment-free methods for sequence comparisons have become popular in many bioinformatics applications, specifically in the estimation of sequence similarity measures to construct phylogenetic trees. Recently, the average common substring measure, ACS, and its k-mismatch counterpart, ACSk, have been shown to produce results as effective as multiple-sequence alignment based methods for reconstruction of phylogeny trees. Since computing ACSk takes O(n log^k n) time and is hence impractical for large datasets, multiple heuristics that can approximate ACSk have been introduced. RESULTS: In this paper, we present a novel linear-time heuristic to approximate ACSk, which is faster than computing the exact ACSk while being closer to the exact ACSk values compared to previously published linear-time greedy heuristics. Using four real datasets, containing both DNA and protein sequences, we evaluate our algorithm in terms of accuracy and runtime, and demonstrate its applicability for phylogeny reconstruction. Our algorithm provides better accuracy than previously published heuristic methods, while being comparable in its applications to phylogeny reconstruction. CONCLUSIONS: Our method produces a better approximation for ACSk and is applicable for the alignment-free comparison of biological sequences at highly competitive speed. The algorithm is implemented in Rust programming language and the source code is available at https://github.com/srirampc/adyar-rs.

Subject(s)

Computational Biology; Heuristics; Phylogeny; Algorithms; Sequence Alignment; Software
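
To make the ACSk measure discussed above concrete, the following is a brute-force sketch under the usual definition: for each position of one sequence, take the longest match into the other sequence that tolerates at most k mismatches, then average those lengths. The function names and the toy sequences are assumptions for illustration. This quadratic-or-worse loop is neither the exact O(n log^k n) algorithm nor the linear-time heuristic from the record, and both inputs are assumed to be non-empty.

```python
def lcp_with_mismatches(x, i, y, j, k):
    """Length of the longest common prefix of x[i:] and y[j:] with at most k mismatches."""
    length = 0
    mismatches = 0
    while i + length < len(x) and j + length < len(y):
        if x[i + length] != y[j + length]:
            mismatches += 1
            if mismatches > k:
                break  # the (k+1)-th mismatch ends the match
        length += 1
    return length


def acs_k(x, y, k=0):
    """Brute-force ACS_k(x, y): mean, over positions i of x, of the longest
    k-mismatch match of x[i:] against any suffix of y. Hypothetical helper."""
    total = 0
    for i in range(len(x)):
        total += max(lcp_with_mismatches(x, i, y, j, k) for j in range(len(y)))
    return total / len(x)


# Toy sequences (made up): they differ only in the last character.
exact = acs_k("ACGT", "ACGA", k=0)
one_mismatch = acs_k("ACGT", "ACGA", k=1)
print(exact, one_mismatch)
```

Allowing one mismatch lengthens every per-position match in this toy case, which is why ACSk values grow with k.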

A greedy alignment-free distance estimator for phylogenetic inference.

Thankachan, Sharma V; Chockalingam, Sriram P; Liu, Yongchao; Krishnan, Ambujam; Aluru, Srinivas.

BMC Bioinformatics ; 18(Suppl 8): 238, 2017 Jun 07.

Article in English | MEDLINE | ID: mdl-28617225

ABSTRACT

BACKGROUND: Alignment-free sequence comparison approaches have been garnering increasing interest in various data- and compute-intensive applications such as phylogenetic inference for large-scale sequences. While k-mer based methods are predominantly used in real applications, the average common substring (ACS) approach is emerging as one of the prominent alignment-free approaches. This ACS approach has been further generalized by some recent work, either greedily or exactly, by allowing a bounded number of mismatches in the common substrings. RESULTS: We present ALFRED-G, a greedy alignment-free distance estimator for phylogenetic tree reconstruction based on the concept of the generalized ACS approach. In this algorithm, we have investigated a new heuristic to efficiently compute the lengths of common strings with mismatches allowed, and have further applied this heuristic to phylogeny reconstruction. Performance evaluation using real sequence datasets shows that our heuristic is able to reconstruct comparable, or even more accurate, phylogenetic tree topologies than the kmacs heuristic algorithm at highly competitive speed. CONCLUSIONS: ALFRED-G is an alignment-free heuristic for evolutionary distance estimation between two biological sequences. This algorithm is implemented in C++ and has been incorporated into our open-source ALFRED software package ( http://alurulab.cc.gatech.edu/phylo ).

Subject(s)

Algorithms; Computational Biology/methods; Phylogeny; Sequence Analysis/methods

Efficient detection of viral transmissions with Next-Generation Sequencing data.

Rytsareva, Inna; Campo, David S; Zheng, Yueli; Sims, Seth; Thankachan, Sharma V; Tetik, Cansu; Chirag, Jain; Chockalingam, Sriram P; Sue, Amanda; Aluru, Srinivas; Khudyakov, Yury.

BMC Genomics ; 18(Suppl 4): 372, 2017 05 24.

Article in English | MEDLINE | ID: mdl-28589864

ABSTRACT

BACKGROUND: Hepatitis C is a major public health problem in the United States and worldwide. Outbreaks of hepatitis C virus (HCV) infections associated with unsafe injection practices, drug diversion, and other exposures to blood are difficult to detect and investigate. Molecular analysis has been frequently used in the study of HCV outbreaks and transmission chains, helping identify a cluster of sequences as linked by transmission if their genetic distances are below a previously defined threshold. However, HCV exists as a population of numerous variants in each infected individual and it has been observed that minority variants in the source are often the ones responsible for transmission, a situation that precludes the use of a single sequence per individual because many such transmissions would be missed. The use of Next-Generation Sequencing immensely increases the sensitivity of transmission detection but brings a considerable computational challenge because all sequences need to be compared among all pairs of samples. METHODS: We developed a three-step strategy that filters pairs of samples according to different criteria: (i) a k-mer Bloom filter, (ii) a Levenshtein filter and (iii) a filter of identical sequences. We applied these three filters on a set of samples that cover the spectrum of genetic relationships among HCV cases, from being part of the same transmission cluster, to belonging to different subtypes. RESULTS: Our three-step filtering strategy rapidly removes 85.1% of all the pairwise sample comparisons and 91.0% of all pairwise sequence comparisons, accurately establishing which pairs of HCV samples are below the relatedness threshold. CONCLUSIONS: We present a fast and efficient three-step filtering strategy that removes most sequence comparisons and accurately establishes transmission links of any threshold-based method. This highly efficient workflow will allow a faster response and molecular detection capacity, improving the rate of detection of viral transmissions with molecular data.

Subject(s)

Hepacivirus/genetics; Hepacivirus/physiology; High-Throughput Nucleotide Sequencing; Algorithms; Statistics as Topic
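
The abstract above lists three filters (k-mer, Levenshtein, identical sequences) without describing how each one works, so the sketch below is only one plausible reading of such a cascade: cheap checks run first and the expensive edit-distance comparison runs last. Everything here is an assumption for illustration: the function names, the order of the steps, the parameters `k`, `min_shared`, and `limit`, and the use of a plain Python set in place of a Bloom filter. The k-mer step as written is a heuristic and can reject pairs that a full comparison would accept.

```python
def kmers(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def levenshtein_within(a, b, limit):
    """True if the edit distance between a and b is at most limit.
    Row-by-row dynamic programming with an early exit."""
    if abs(len(a) - len(b)) > limit:
        return False
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        if min(cur) > limit:  # row minima never decrease, so stop early
            return False
        prev = cur
    return prev[-1] <= limit


def samples_linked(sample_a, sample_b, k=4, min_shared=1, limit=1):
    """Decide whether two samples (lists of sequences) contain any pair of
    sequences within `limit` edits. Hypothetical cascade, cheapest test first."""
    set_a, set_b = set(sample_a), set(sample_b)
    if set_a & set_b:  # an identical sequence in both samples: distance 0
        return True
    kmers_a = set().union(*(kmers(s, k) for s in set_a))
    kmers_b = set().union(*(kmers(s, k) for s in set_b))
    if len(kmers_a & kmers_b) < min_shared:  # too little k-mer overlap: skip
        return False
    return any(levenshtein_within(a, b, limit) for a in set_a for b in set_b)


# Toy samples (made-up sequences).
near = samples_linked(["ACGTACGT", "ACGTTCGT"], ["ACGTACGA"])
far = samples_linked(["ACGTACGT", "ACGTTCGT"], ["TTTTTTTT"])
print(near, far)
```

The point of the ordering is that most sample pairs are settled by the set operations, so the quadratic edit-distance loop runs only on the pairs that survive.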

A survey of error-correction methods for next-generation sequencing.

Yang, Xiao; Chockalingam, Sriram P; Aluru, Srinivas.

Brief Bioinform ; 14(1): 56-66, 2013 Jan.

Article in English | MEDLINE | ID: mdl-22492192

ABSTRACT

Error correction is important for most next-generation sequencing applications because highly accurate sequenced reads will likely lead to higher quality results. Many techniques for error correction of sequencing data from next-gen platforms have been developed in recent years. However, compared with the fast development of sequencing technologies, there is a lack of a standardized evaluation procedure for different error-correction methods, making it difficult to assess their relative merits and demerits. In this article, we provide a comprehensive review of many error-correction methods, and establish a common set of benchmark data and evaluation criteria to provide a comparative assessment. We present experimental results on quality, run-time, memory usage and scalability of several error-correction methods. Apart from providing explicit recommendations useful to practitioners, the review serves to identify the current state of the art and promising directions for future research. AVAILABILITY: All error-correction programs used in this article are downloaded from hosting websites. The evaluation tool kit is publicly available at: http://aluru-sun.ece.iastate.edu/doku.php?id=ecr.

Subject(s)

Sequence Analysis, DNA/trends; Software; Algorithms; Animals; Chromosome Mapping/statistics & numerical data; Chromosome Mapping/trends; Computational Biology; Databases, Genetic/statistics & numerical data; Databases, Genetic/trends; Forecasting; Humans; Sequence Alignment/statistics & numerical data; Sequence Alignment/trends; Sequence Analysis, DNA/statistics & numerical data

Loss of immune cell identity with age inferred from large atlases of single cell transcriptomes.

Connolly, Erin; Pan, Tony; Aluru, Maneesha; Chockalingam, Sriram; Dhere, Vishal; Gibson, Greg.

Aging Cell ; : e14306, 2024 Aug 14.

Article in English | MEDLINE | ID: mdl-39143696

ABSTRACT

By analyzing two large atlases of almost 4 million cells, we show that immune-senescence involves a gradual loss of cellular identity, reflecting increased cellular heterogeneity, for effector and cytotoxic immune cells. The effects are largely similar in both males and females and were robustly reproduced in two atlases, one assembled from 35 diverse studies including 678 adults, the other the OneK1K study of 982 adults. Since the mean transcriptional differences among cell-types remain constant across age deciles, there is little evidence for the alternative mechanism of convergence of cell-type identity. Key pathways promoting activation and stemness are down-regulated in aged T cells, while CD8 TEM and CD4 CTLs exhibited elevated inflammation and cytotoxicity in older individuals. Elevated inflammatory signaling pathways, such as MAPK and TNF-alpha signaling via NF-kB, also occur across all aged immune cells, particularly amongst effector immune cells. This finding of lost transcriptional identity with age carries several implications, spanning from a fundamental biological understanding of aging mechanisms to clinical perspectives on the efficacy of immunomodulation in elderly people.

Microarray Data Processing Techniques for Genome-Scale Network Inference from Large Public Repositories.

Chockalingam, Sriram; Aluru, Maneesha; Aluru, Srinivas.

Microarrays (Basel) ; 5(3), 2016 Sep 19.

Article in English | MEDLINE | ID: mdl-27657141

ABSTRACT

Pre-processing of microarray data is a well-studied problem. Furthermore, all popular platforms come with their own recommended best practices for differential analysis of genes. However, for genome-scale network inference using microarray data collected from large public repositories, these methods filter out a considerable number of genes. This is primarily due to the effects of aggregating a diverse array of experiments with different technical and biological scenarios. Here we introduce a pre-processing pipeline suitable for inferring genome-scale gene networks from large microarray datasets. We show that partitioning of the available microarray datasets according to biological relevance into tissue- and process-specific categories significantly extends the limits of downstream network construction. We demonstrate the effectiveness of our pre-processing pipeline by inferring genome-scale networks for the model plant Arabidopsis thaliana using two different construction methods and a collection of 11,760 Affymetrix ATH1 microarray chips. Our pre-processing pipeline and the datasets used in this paper are made available at http://alurulab.cc.gatech.edu/microarray-pp.

ALFRED: A Practical Method for Alignment-Free Distance Computation.

Thankachan, Sharma V; Chockalingam, Sriram P; Liu, Yongchao; Apostolico, Alberto; Aluru, Srinivas.

J Comput Biol ; 23(6): 452-60, 2016 06.

Article in English | MEDLINE | ID: mdl-27138275

ABSTRACT

Alignment-free approaches are gaining persistent interest in many sequence analysis applications such as phylogenetic inference and metagenomic classification/clustering, especially for large-scale sequence datasets. Besides the widely used k-mer methods, the average common substring (ACS) approach has emerged as one of the well-known alignment-free approaches. Two recent works further generalize this ACS approach by allowing a bounded number k of mismatches in the common substrings, relying on approximation (linear time) and exact computation, respectively. Albeit having a good worst-case time complexity [Formula: see text], the exact approach is complex and unlikely to be efficient in practice. Herein, we present ALFRED, an alignment-free distance computation method, which solves the generalized common substring search problem via exact computation. Compared to the theoretical approach, our algorithm is easier to implement and more practical to use, while still providing highly competitive theoretical performance with an expected run-time of [Formula: see text]. By applying our program to phylogenetic inference as a case study, we find that our program exactly reconstructs the topology of the reference phylogenetic tree for a set of 27 primate mitochondrial genomes, at reasonably acceptable speed. ALFRED is implemented in the C++ programming language and the source code is freely available online.

Subject(s)

Computational Biology/methods; Primates/genetics; Sequence Alignment/methods; Algorithms; Animals; Genome, Mitochondrial; Metagenomics; Phylogeny
