Results 1 - 20 of 47
1.
Bioinformatics ; 39(6), 2023 06 01.
Article in English | MEDLINE | ID: mdl-37289522

ABSTRACT

MOTIVATION: Gene network reconstruction from gene expression profiles is a compute- and data-intensive problem. Numerous methods based on diverse approaches including mutual information, random forests, Bayesian networks, correlation measures, as well as their transforms and filters such as data processing inequality, have been proposed. However, an effective gene network reconstruction method that performs well in all three aspects of computational efficiency, data size scalability, and output quality remains elusive. Simple techniques such as Pearson correlation are fast to compute but ignore indirect interactions, while more robust methods such as Bayesian networks are prohibitively time consuming to apply to tens of thousands of genes. RESULTS: We developed the maximum capacity path (MCP) score, a novel maximum-capacity-path-based metric to quantify the relative strengths of direct and indirect gene-gene interactions. We further present MCPNet, an efficient, parallelized gene network reconstruction software based on the MCP score, to reverse engineer networks in unsupervised and ensemble manners. Using synthetic and real Saccharomyces cerevisiae datasets as well as real Arabidopsis thaliana datasets, we demonstrate that MCPNet produces better quality networks as measured by AUPRC, is significantly faster than all other gene network reconstruction software, and also scales well to tens of thousands of genes and hundreds of CPU cores. Thus, MCPNet represents a new gene network reconstruction tool that simultaneously achieves quality, performance, and scalability requirements. AVAILABILITY AND IMPLEMENTATION: Source code is freely available for download at https://doi.org/10.5281/zenodo.6499747 and https://github.com/AluruLab/MCPNet, implemented in C++ and supported on Linux.


Subjects
Algorithms , Arabidopsis , Gene Regulatory Networks , Bayes Theorem , Software , Genome , Arabidopsis/genetics
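
Illustrative note on entry 1: the abstract names a maximum-capacity-path-based score but does not define it. The sketch below only shows the classical maximum-capacity (widest) path computation that such a score could build on, using a Floyd-Warshall style update. The function name, the toy matrix, and the idea of feeding pairwise association scores as edge weights are assumptions made for illustration; this is not MCPNet's implementation, and how capacities are turned into a score comparing direct and indirect interactions is not shown.

```python
def max_capacity_paths(weights):
    """All-pairs maximum-capacity (widest) path values.

    weights: square list of lists of non-negative edge strengths,
    for example pairwise association scores between genes (assumed input).
    Returns cap, where cap[i][j] is the largest bottleneck value over
    all paths from i to j.
    """
    n = len(weights)
    cap = [row[:] for row in weights]  # direct edges are the starting capacities
    for k in range(n):                 # allow node k as an intermediate stop
        for i in range(n):
            for j in range(n):
                # A path through k is only as strong as its weaker half.
                through_k = min(cap[i][k], cap[k][j])
                if through_k > cap[i][j]:
                    cap[i][j] = through_k
    return cap


# Toy symmetric association matrix for three hypothetical genes A, B, C.
w = [
    [0.0, 0.9, 0.2],
    [0.9, 0.0, 0.8],
    [0.2, 0.8, 0.0],
]
cap = max_capacity_paths(w)
# The direct A-C edge is weak (0.2), but the indirect path A-B-C has
# bottleneck min(0.9, 0.8) = 0.8, so cap[0][2] is 0.8.
```

Comparing cap[i][j] with the direct weight w[i][j] is one way to see whether a pair is better explained by an indirect route; the cubic-time loop is meant for small examples only.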
2.
Bioinformatics ; 38(5): 1312-1319, 2022 02 07.
Article in English | MEDLINE | ID: mdl-34888624

ABSTRACT

MOTIVATION: Reconstruction of genome-scale networks from gene expression data is an actively studied problem. A wide range of methods that differ between the types of interactions they uncover with varying trade-offs between sensitivity and specificity have been proposed. To leverage benefits of multiple such methods, ensemble network methods that combine predictions from resulting networks have been developed, promising results better than or as good as the individual networks. Perhaps owing to the difficulty in obtaining accurate training examples, these ensemble methods hitherto are unsupervised. RESULTS: In this article, we introduce EnGRaiN, the first supervised ensemble learning method to construct gene networks. The supervision for training is provided by small training datasets of true edge connections (positives) and edges known to be absent (negatives) among gene pairs. We demonstrate the effectiveness of EnGRaiN using simulated datasets as well as a curated collection of Arabidopsis thaliana datasets we created from microarray datasets available from public repositories. EnGRaiN shows better results not only in terms of receiver operating characteristic and PR characteristics for both real and simulated datasets compared with unsupervised methods for ensemble network construction, but also generates networks that can be mined for elucidating complex biological interactions. AVAILABILITY AND IMPLEMENTATION: EnGRaiN software and the datasets used in the study are publicly available at the github repository: https://github.com/AluruLab/EnGRaiN. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Arabidopsis , Gene Regulatory Networks , Software , Genome , Arabidopsis/genetics , Machine Learning
3.
Bioinformatics ; 37(Suppl_1): i460-i467, 2021 07 12.
Article in English | MEDLINE | ID: mdl-34252945

ABSTRACT

MOTIVATION: Variation graph representations are projected to either replace or supplement conventional single genome references due to their ability to capture population genetic diversity and reduce reference bias. Vast catalogues of genetic variants for many species now exist, and it is natural to ask which among these are crucial to circumvent reference bias during read mapping. RESULTS: In this work, we propose a novel mathematical framework for variant selection, by casting it in terms of minimizing variation graph size subject to preserving paths of length α with at most δ differences. This framework leads to a rich set of problems based on the types of variants [e.g. single nucleotide polymorphisms (SNPs), indels or structural variants (SVs)], and whether the goal is to minimize the number of positions at which variants are listed or to minimize the total number of variants listed. We classify the computational complexity of these problems and provide efficient algorithms along with their software implementation when feasible. We empirically evaluate the magnitude of graph reduction achieved in human chromosome variation graphs using multiple α and δ parameter values corresponding to short and long-read resequencing characteristics. When our algorithm is run with parameter settings amenable to long-read mapping (α = 10 kbp, δ = 1000), 99.99% of SNPs and 73% of SVs can be safely excluded from the human chromosome 1 variation graph. The graph size reduction can benefit downstream pan-genome analysis. AVAILABILITY AND IMPLEMENTATION: https://github.com/AT-CG/VF. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Genome , Software , Algorithms , Human Genome , Humans , Single Nucleotide Polymorphism , DNA Sequence Analysis
4.
Bioinformatics ; 37(Suppl_1): i477-i483, 2021 07 12.
Article in English | MEDLINE | ID: mdl-34252938

ABSTRACT

MOTIVATION: Oxford Nanopore Technologies sequencing devices support adaptive sequencing, in which undesired reads can be ejected from a pore in real time. This feature allows targeted sequencing aided by computational methods for mapping partial reads, rather than complex library preparation protocols. However, existing mapping methods either require a computationally expensive base-calling procedure before using aligners to map partial reads or work well only on small genomes. RESULTS: In this work, we present a new streaming method that can map nanopore raw signals for real-time selective sequencing. Rather than converting read signals to bases, we propose to convert reference genomes to signals and fully operate in the signal space. Our method features a new way to index reference genomes using k-d trees, a novel seed selection strategy and a seed chaining algorithm tailored toward the current signal characteristics. We implemented the method as a tool Sigmap. Then we evaluated it on both simulated and real data and compared it to the state-of-the-art nanopore raw signal mapper Uncalled. Our results show that Sigmap yields comparable performance on mapping yeast simulated raw signals, and better mapping accuracy on mapping yeast real raw signals with a 4.4× speedup. Moreover, our method performed well on mapping raw signals to genomes of size >100 Mbp and correctly mapped 11.49% more real raw signals of green algae, which leads to a significantly higher F1-score (0.9354 versus 0.8660). AVAILABILITY AND IMPLEMENTATION: Sigmap code is accessible at https://github.com/haowenz/sigmap. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Nanopores , Algorithms , Genome , High-Throughput Nucleotide Sequencing , DNA Sequence Analysis , Software
5.
BMC Bioinformatics ; 21(Suppl 6): 404, 2020 Nov 18.
Article in English | MEDLINE | ID: mdl-33203364

ABSTRACT

BACKGROUND: Alignment-free methods for sequence comparisons have become popular in many bioinformatics applications, specifically in the estimation of sequence similarity measures to construct phylogenetic trees. Recently, the average common substring measure, ACS, and its k-mismatch counterpart, ACSk, have been shown to produce results as effective as multiple-sequence alignment based methods for reconstruction of phylogeny trees. Since computing ACSk takes O(n log^k n) time and is hence impractical for large datasets, multiple heuristics that can approximate ACSk have been introduced. RESULTS: In this paper, we present a novel linear-time heuristic to approximate ACSk, which is faster than computing the exact ACSk while being closer to the exact ACSk values compared to previously published linear-time greedy heuristics. Using four real datasets, containing both DNA and protein sequences, we evaluate our algorithm in terms of accuracy and runtime, and demonstrate its applicability for phylogeny reconstruction. Our algorithm provides better accuracy than previously published heuristic methods, while being comparable in its applications to phylogeny reconstruction. CONCLUSIONS: Our method produces a better approximation for ACSk and is applicable for the alignment-free comparison of biological sequences at highly competitive speed. The algorithm is implemented in the Rust programming language and the source code is available at https://github.com/srirampc/adyar-rs.


Subjects
Computational Biology , Heuristics , Phylogeny , Algorithms , Sequence Alignment , Software
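
Illustrative note on entry 5: the average common substring (ACS) measure mentioned in the abstract can be sketched naively as follows, assuming the usual definition in which, for every start position in one sequence, one takes the length of the longest substring beginning there that also occurs somewhere in the other sequence, and then averages. The function name is hypothetical, the brute-force search is far slower than the suffix-structure methods used in practice, and the k-mismatch variant (ACSk) and the paper's linear-time heuristic are not shown.

```python
def acs(x, y):
    """Naive average common substring length of x with respect to y."""
    total = 0
    for i in range(len(x)):
        length = 0
        # Grow the substring starting at i while it still occurs in y.
        while i + length < len(x) and x[i:i + length + 1] in y:
            length += 1
        total += length
    return total / len(x)


# Per-position lengths for x = "ACGT" against y = "CGTA" are 1, 3, 2, 1.
example = acs("ACGT", "CGTA")
```

Note that the measure is not symmetric: acs(x, y) and acs(y, x) generally differ, so distance estimators built on it usually combine both directions.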
6.
BMC Genomics ; 21(Suppl 6): 889, 2020 Dec 21.
Article in English | MEDLINE | ID: mdl-33349243

ABSTRACT

BACKGROUND: Third-generation single molecule sequencing technologies can sequence long reads, which is advancing the frontiers of genomics research. However, their high error rates prohibit accurate and efficient downstream analysis. This difficulty has motivated the development of many long read error correction tools, which tackle this problem through sampling redundancy and/or leveraging accurate short reads of the same biological samples. Existing studies to assess these tools use simulated data sets, and are not sufficiently comprehensive in the range of software covered or diversity of evaluation measures used. RESULTS: In this paper, we present a categorization and review of long read error correction methods, and provide a comprehensive evaluation of the corresponding long read error correction tools. Leveraging recent real sequencing data, we establish benchmark data sets and set up evaluation criteria for a comparative assessment which includes quality of error correction as well as run-time and memory usage. We study how trimming and long read sequencing depth affect error correction in terms of length distribution and genome coverage post-correction, and the impact of error correction performance on an important application of long reads, genome assembly. We provide guidelines for practitioners for choosing among the available error correction tools and identify directions for future research. CONCLUSIONS: Despite the high error rate of long reads, the state-of-the-art correction tools can achieve high correction quality. When short reads are available, the best hybrid methods outperform non-hybrid methods in terms of correction quality and computing resource usage. When choosing tools for use, practitioners are advised to be careful with a few correction tools that discard reads, and to check the effect of error correction tools on downstream analysis. Our evaluation code is available as open-source at https://github.com/haowenz/LRECE.


Subjects
Algorithms , High-Throughput Nucleotide Sequencing , Genomics , DNA Sequence Analysis , Software
7.
Bioinformatics ; 34(17): i748-i756, 2018 09 01.
Article in English | MEDLINE | ID: mdl-30423094

ABSTRACT

Motivation: Whole-genome alignment is an important problem in genomics for comparing different species, mapping draft assemblies to reference genomes and identifying repeats. However, for large plant and animal genomes, this task remains compute and memory intensive. In addition, current practical methods lack any guarantee on the characteristics of output alignments, thus making them hard to tune for different application requirements. Results: We introduce an approximate algorithm for computing local alignment boundaries between long DNA sequences. Given a minimum alignment length and an identity threshold, our algorithm computes the desired alignment boundaries and identity estimates using kmer-based statistics, and maintains sufficient probabilistic guarantees on the output sensitivity. Further, to prioritize higher scoring alignment intervals, we develop a plane-sweep based filtering technique which is theoretically optimal and practically efficient. Implementation of these ideas resulted in a fast and accurate assembly-to-genome and genome-to-genome mapper. As a result, we were able to map an error-corrected whole-genome NA12878 human assembly to the hg38 human reference genome in about 1 min total execution time and <4 GB memory using eight CPU threads, achieving significant improvement in memory usage over competing methods. Recall accuracy of computed alignment boundaries was consistently found to be >97% on multiple datasets. Finally, we performed a sensitive self-alignment of the human genome to compute all duplications of length ≥1 Kbp and ≥90% identity. The reported output achieves good recall and covers twice as many bases as the current UCSC browser's segmental duplication annotation. Availability and implementation: https://github.com/marbl/MashMap.


Subjects
Algorithms , Genomics , High-Throughput Nucleotide Sequencing/methods , Base Sequence , Chromosome Mapping , Human Genome , Genomics/methods , Humans , Genomic Segmental Duplications , Sequence Alignment , Software , Time Factors
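
Illustrative note on entry 7: the abstract says identity is estimated from k-mer based statistics without giving the formula. One common way to do this, shown here as an assumption rather than as the tool's actual method, is to compute the Jaccard similarity J of the two k-mer sets and convert it with the Mash-style distance D = -(1/k) ln(2J / (1 + J)), taking identity as 1 - D. The function names are hypothetical, and real tools typically estimate J from small sketches instead of full k-mer sets.

```python
import math


def kmers(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b, k):
    """Jaccard similarity of the k-mer sets of sequences a and b."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0


def estimated_identity(j, k):
    """Identity estimate 1 - D, with D = -(1/k) * ln(2j / (1 + j))."""
    if j <= 0.0:
        return 0.0
    # Clamp at zero because the formula goes negative for very small j.
    return max(0.0, 1.0 + math.log(2.0 * j / (1.0 + j)) / k)


# "ACGT" and "CGTA" share the 2-mers CG and GT out of four distinct 2-mers.
j = jaccard("ACGT", "CGTA", 2)
identity = estimated_identity(j, 2)
```

With such short toy strings the estimate is not meaningful; the conversion is intended for long sequences and larger k.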
8.
BMC Bioinformatics ; 19(Suppl 20): 506, 2018 Dec 21.
Article in English | MEDLINE | ID: mdl-30577740

ABSTRACT

BACKGROUND: Atomic details of protein-DNA complexes can provide insightful information for better understanding of the function and binding specificity of DNA binding proteins. In addition to experimental methods for solving protein-DNA complex structures, protein-DNA docking can be used to predict native or near-native complex models. A docking program typically generates a large number of complex conformations and predicts the complex model(s) based on interaction energies between protein and DNA. However, the prediction accuracy is hampered by current approaches to model assessment, especially when docking simulations fail to produce any near-native models. RESULTS: We present here a Support Vector Machine (SVM)-based approach for quality assessment of the predicted transcription factor (TF)-DNA complex models. Besides a knowledge-based protein-DNA interaction potential DDNA3, we applied several structural features that have been shown to play important roles in binding specificity between transcription factors and DNA molecules to quality assessment of complex models. To address the issue of unbalanced positive and negative cases in the training dataset, we applied hard-negative mining, an iterative training process that selects an initial training dataset by combining all of the positive cases and a random sample from the negative cases. Results show that the SVM model greatly improves prediction accuracy (84.2%) over two knowledge-based protein-DNA interaction potentials, orientation potential (60.8%) and DDNA3 (68.4%). The improvement is achieved through reducing the number of false positive predictions, especially for the hard docking cases, in which a docking algorithm fails to produce any near-native complex models. 
CONCLUSIONS: A learning-based SVM scoring model with structural features for specific protein-DNA binding and an atomic-level protein-DNA interaction potential DDNA3 significantly improves prediction accuracy of complex models by successfully identifying cases without near-native structural models.


Subjects
DNA/metabolism , Molecular Models , Support Vector Machine , Transcription Factors/metabolism , Algorithms , DNA/chemistry , Protein Binding
9.
BMC Bioinformatics ; 18(Suppl 8): 238, 2017 Jun 07.
Article in English | MEDLINE | ID: mdl-28617225

ABSTRACT

BACKGROUND: Alignment-free sequence comparison approaches have been garnering increasing interest in various data- and compute-intensive applications such as phylogenetic inference for large-scale sequences. While k-mer based methods are predominantly used in real applications, the average common substring (ACS) approach is emerging as one of the prominent alignment-free approaches. This ACS approach has been further generalized by some recent work, either greedily or exactly, by allowing a bounded number of mismatches in the common substrings. RESULTS: We present ALFRED-G, a greedy alignment-free distance estimator for phylogenetic tree reconstruction based on the concept of the generalized ACS approach. In this algorithm, we have investigated a new heuristic to efficiently compute the lengths of common strings with mismatches allowed, and have further applied this heuristic to phylogeny reconstruction. Performance evaluation using real sequence datasets shows that our heuristic is able to reconstruct comparable, or even more accurate, phylogenetic tree topologies than the kmacs heuristic algorithm at highly competitive speed. CONCLUSIONS: ALFRED-G is an alignment-free heuristic for evolutionary distance estimation between two biological sequences. This algorithm is implemented in C++ and has been incorporated into our open-source ALFRED software package ( http://alurulab.cc.gatech.edu/phylo ).


Subjects
Algorithms , Computational Biology/methods , Phylogeny , Sequence Analysis/methods
10.
BMC Genomics ; 18(Suppl 4): 372, 2017 05 24.
Article in English | MEDLINE | ID: mdl-28589864

ABSTRACT

BACKGROUND: Hepatitis C is a major public health problem in the United States and worldwide. Outbreaks of hepatitis C virus (HCV) infections associated with unsafe injection practices, drug diversion, and other exposures to blood are difficult to detect and investigate. Molecular analysis has been frequently used in the study of HCV outbreaks and transmission chains, helping identify a cluster of sequences as linked by transmission if their genetic distances are below a previously defined threshold. However, HCV exists as a population of numerous variants in each infected individual and it has been observed that minority variants in the source are often the ones responsible for transmission, a situation that precludes the use of a single sequence per individual because many such transmissions would be missed. The use of Next-Generation Sequencing immensely increases the sensitivity of transmission detection but brings a considerable computational challenge because all sequences need to be compared among all pairs of samples. METHODS: We developed a three-step strategy that filters pairs of samples according to different criteria: (i) a k-mer Bloom filter, (ii) a Levenshtein filter and (iii) a filter of identical sequences. We applied these three filters on a set of samples that cover the spectrum of genetic relationships among HCV cases, from being part of the same transmission cluster, to belonging to different subtypes. RESULTS: Our three-step filtering strategy rapidly removes 85.1% of all the pairwise sample comparisons and 91.0% of all pairwise sequence comparisons, accurately establishing which pairs of HCV samples are below the relatedness threshold. CONCLUSIONS: We present a fast and efficient three-step filtering strategy that removes most sequence comparisons and accurately establishes transmission links of any threshold-based method. This highly efficient workflow will allow a faster response and molecular detection capacity, improving the rate of detection of viral transmissions with molecular data.


Subjects
Hepacivirus/genetics , Hepacivirus/physiology , High-Throughput Nucleotide Sequencing , Algorithms , Statistics as Topic
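
Illustrative note on entry 10: the abstract lists a Levenshtein filter and a filter of identical sequences but does not say how either is applied. The sketch below shows only the standard edit-distance recurrence and two hypothetical checks built on it. The function names, the fraction-of-length threshold rule, and the sample representation as lists of sequences are assumptions for illustration; the k-mer Bloom filter step is not shown.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (two-row dynamic programming)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def within_threshold(a, b, max_fraction):
    """Assumed rule: distance at most max_fraction of the longer length."""
    return levenshtein(a, b) <= max_fraction * max(len(a), len(b))


def share_identical_sequence(sample_a, sample_b):
    """True if two samples (lists of sequences) have any sequence in common."""
    return not set(sample_a).isdisjoint(sample_b)


d = levenshtein("kitten", "sitting")
close = within_threshold("ACGTACGTAC", "ACGTACGTAA", 0.1)
linked = share_identical_sequence(["ACGT", "ACGA"], ["TTTT", "ACGA"])
```

Checking for shared identical sequences first is cheap and can settle a pair of samples before any quadratic-time distance computation is needed.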
11.
BMC Bioinformatics ; 16 Suppl 17: S7, 2015.
Article in English | MEDLINE | ID: mdl-26679555

ABSTRACT

BACKGROUND: Continued advances in next generation short-read sequencing technologies are increasing throughput and read lengths, while driving down error rates. Taking advantage of the high coverage sampling used in many applications, several error correction algorithms have been developed to improve data quality further. However, correcting errors in high coverage sequence data requires significant computing resources. METHODS: We propose a different approach to handle erroneous sequence data. Presently, error rates of high-throughput platforms such as the Illumina HiSeq are within 1%. Moreover, the errors are not uniformly distributed in all reads, and a large percentage of reads are indeed error-free. The ability to predict such perfect reads can significantly impact the run-time complexity of applications. We present a simple and fast k-spectrum analysis based method to identify error-free reads. The filtration process to identify and weed out erroneous reads can be customized at several levels of stringency depending upon the downstream application need. RESULTS: Our experiments show that if around 80% of the reads in a dataset are perfect, then our method retains almost 99.9% of them with a precision of more than 90%. Though filtering out reads identified as erroneous by our method reduces the average coverage by about 7%, we found the remaining reads provide as uniform a coverage as the original dataset. We demonstrate the effectiveness of our approach on an example downstream application: we show that when an error correction algorithm, Reptile, which relies on collectively analyzing the reads in a dataset to identify and correct erroneous bases, instead uses reads predicted to be perfect by our method to correct the other reads, the overall accuracy improves further by up to 10%. CONCLUSIONS: Thanks to the continuous technological improvements, the coverage and accuracy of reads from dominant sequencing platforms have now reached an extent where we can envision just filtering out reads with errors, thus making error correction less important. Our algorithm is a first attempt to propose and demonstrate this new paradigm. Moreover, our demonstration is applicable to any error correction algorithm as a downstream application, which in turn gives a new class of error-correcting algorithms as a by-product.


Subjects
Algorithms , DNA Sequence Analysis/methods , Nucleic Acid Databases , High-Throughput Nucleotide Sequencing/methods , Sequence Alignment
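
Illustrative note on entry 11: the abstract describes a k-spectrum based filter for error-free reads without giving its rule. A minimal sketch of the general idea, assuming that a read is predicted error-free when every one of its k-mers occurs at least min_count times across the dataset, is given below. The function names, the threshold rule, and the toy reads are assumptions for illustration and are not the paper's actual criteria or stringency levels.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer occurrence across all reads (the k-spectrum)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def looks_error_free(read, counts, k, min_count):
    """Assumed rule: every k-mer of the read is seen at least min_count times."""
    return all(counts[read[i:i + k]] >= min_count
               for i in range(len(read) - k + 1))


# Three identical reads plus one read with a single substituted base.
reads = ["ACGTAC", "ACGTAC", "ACGTAC", "ACGTTC"]
counts = kmer_counts(reads, 3)
kept = [r for r in reads if looks_error_free(r, counts, 3, 2)]
# The k-mers GTT and TTC occur only once, so "ACGTTC" is filtered out.
```

Raising min_count makes the filter stricter, which corresponds to the trade-off between retained reads and precision that the abstract reports.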
12.
Brief Bioinform ; 14(1): 56-66, 2013 Jan.
Article in English | MEDLINE | ID: mdl-22492192

ABSTRACT

Error correction is important for most next-generation sequencing applications because highly accurate sequenced reads will likely lead to higher quality results. Many techniques for error correction of sequencing data from next-gen platforms have been developed in recent years. However, compared with the fast development of sequencing technologies, there is a lack of a standardized evaluation procedure for different error-correction methods, making it difficult to assess their relative merits and demerits. In this article, we provide a comprehensive review of many error-correction methods, and establish a common set of benchmark data and evaluation criteria to provide a comparative assessment. We present experimental results on quality, run-time, memory usage and scalability of several error-correction methods. Apart from providing explicit recommendations useful to practitioners, the review serves to identify the current state of the art and promising directions for future research. AVAILABILITY: All error-correction programs used in this article were downloaded from their hosting websites. The evaluation tool kit is publicly available at: http://aluru-sun.ece.iastate.edu/doku.php?id=ecr.


Subjects
DNA Sequence Analysis/trends , Software , Algorithms , Animals , Chromosome Mapping/statistics & numerical data , Chromosome Mapping/trends , Computational Biology , Genetic Databases/statistics & numerical data , Genetic Databases/trends , Forecasting , Humans , Sequence Alignment/statistics & numerical data , Sequence Alignment/trends , DNA Sequence Analysis/statistics & numerical data
13.
Nucleic Acids Res ; 41(1): e24, 2013 Jan 07.
Article in English | MEDLINE | ID: mdl-23042249

ABSTRACT

Reverse engineering the whole-genome networks of complex multicellular organisms remains a challenge. While simpler models easily scale to large numbers of genes and gene expression datasets, more accurate models are compute intensive, limiting their scale of applicability. To enable fast and accurate reconstruction of large networks, we developed Tool for Inferring Network of Genes (TINGe), a parallel mutual information (MI)-based program. The novel features of our approach include: (i) B-spline-based formulation for linear-time computation of MI, (ii) a novel algorithm for direct permutation testing and (iii) development of parallel algorithms to reduce run-time and facilitate construction of large networks. We assess the quality of our method by comparison with ARACNe (Algorithm for the Reconstruction of Accurate Cellular Networks) and GeneNet and demonstrate its unique capability by reverse engineering the whole-genome network of Arabidopsis thaliana from 3137 Affymetrix ATH1 GeneChips in just 9 min on a 1024-core cluster. We further report on the development of a new software tool, Gene Network Analyzer (GeNA), for extracting context-specific subnetworks from a given set of seed genes. Using TINGe and GeNA, we performed analysis of 241 Arabidopsis AraCyc 8.0 pathways, and the results are made available through the web.


Subjects
Arabidopsis/genetics , Gene Regulatory Networks , Genomics/methods , Algorithms , Arabidopsis/metabolism , Carotenoids/biosynthesis , Cell Respiration/genetics , Cellulose/biosynthesis , Plant Genome , Genomics/standards , Software , Transcriptome
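
Illustrative note on entry 13: the abstract describes a B-spline formulation of mutual information (MI) and a direct permutation test, neither of which is specified in enough detail to reproduce here. The sketch below swaps in a much simpler technique, a plain equal-width histogram estimate of MI, together with a generic shuffle-based permutation p-value. All function names, the binning scheme, and the toy profiles are assumptions for illustration; this is not TINGe's estimator or its testing algorithm.

```python
import math
import random
from collections import Counter


def mutual_information(x, y, bins):
    """Histogram (equal-width bins) estimate of MI in nats; not B-spline based."""
    def discretize(values):
        lo, hi = min(values), max(values)
        width = (hi - lo) / bins or 1.0  # avoid zero width for constant input
        return [min(int((v - lo) / width), bins - 1) for v in values]

    bx, by = discretize(x), discretize(y)
    n = len(x)
    joint, px, py = Counter(zip(bx, by)), Counter(bx), Counter(by)
    # Sum of p(a,b) * log(p(a,b) / (p(a) * p(b))) over observed bin pairs.
    return sum((c / n) * math.log((c * n) / (px[a] * py[b]))
               for (a, b), c in joint.items())


def permutation_p_value(x, y, bins, trials, seed=0):
    """Share of shuffles of y whose MI with x reaches the observed MI."""
    rng = random.Random(seed)
    observed = mutual_information(x, y, bins)
    shuffled = list(y)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if mutual_information(x, shuffled, bins) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (trials + 1)  # add-one correction keeps p above zero


# Two toy expression profiles: identical, then statistically independent.
mi_same = mutual_information([0, 0, 1, 1], [0, 0, 1, 1], 2)
mi_indep = mutual_information([0, 0, 1, 1], [0, 1, 0, 1], 2)
p = permutation_p_value([0, 0, 1, 1], [0, 0, 1, 1], 2, 50)
```

An edge between two genes would be kept only when its MI is high and its permutation p-value is small; with four samples, as in the toy data, no such conclusion is possible.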
14.
Plant J ; 65(4): 634-46, 2011 Feb.
Article in English | MEDLINE | ID: mdl-21214652

ABSTRACT

Brassinosteroids (BRs) are important regulators for plant growth and development. BRs signal to control the activities of the BES1 and BZR1 family transcription factors. The transcriptional network through which BES1 and BZR1 regulate a large number of target genes is mostly unknown. By combining chromatin immunoprecipitation coupled with Arabidopsis tiling arrays (ChIP-chip) and gene expression studies, we have identified 1609 putative BES1 target genes, 404 of which are regulated by BRs and/or in the gain-of-function bes1-D mutant. BES1 targets contribute to BR responses and interactions with other hormonal or light signaling pathways. Computational modeling of gene expression data using Algorithm for the Reconstruction of Accurate Cellular Networks (ARACNe) reveals that BES1-targeted transcription factors form a gene regulatory network (GRN). Mutants of many genes in the network displayed defects in BR responses. Moreover, we found that BES1 functions to inhibit chloroplast development by repressing the expression of GLK1 and GLK2 transcription factors, confirming a hypothesis generated from the GRN. Our results thus provide a global view of BR-regulated gene expression and a GRN that guides future studies in understanding BR-regulated plant growth.


Subjects
Arabidopsis/genetics , Gene Regulatory Networks , Plant Growth Regulators/metabolism , Steroids/metabolism , Algorithms , Arabidopsis/metabolism , Arabidopsis Proteins/genetics , Arabidopsis Proteins/metabolism , Binding Sites , Chromatin Immunoprecipitation , Computational Biology , DNA-Binding Proteins , Gene Expression Profiling , Plant Gene Expression Regulation , Plant Genome , Nuclear Proteins/genetics , Nuclear Proteins/metabolism , Transcription Factors/genetics , Transcription Factors/metabolism
15.
PLoS Genet ; 5(11): e1000728, 2009 Nov.
Article in English | MEDLINE | ID: mdl-19936048

ABSTRACT

Most of our understanding of plant genome structure and evolution has come from the careful annotation of small (e.g., 100 kb) sequenced genomic regions or from automated annotation of complete genome sequences. Here, we sequenced and carefully annotated a contiguous 22 Mb region of maize chromosome 4 using an improved pseudomolecule for annotation. The sequence segment was comprehensively ordered, oriented, and confirmed using the maize optical map. Nearly 84% of the sequence is composed of transposable elements (TEs) that are mostly nested within each other, of which most families are low-copy. We identified 544 gene models using multiple levels of evidence, as well as five miRNA genes. Gene fragments, many captured by TEs, are prevalent within this region. Elimination of gene redundancy from a tetraploid maize ancestor that originated a few million years ago is responsible in this region for most disruptions of synteny with sorghum and rice. Consistent with other sub-genomic analyses in maize, small RNA mapping showed that many small RNAs match TEs and that most TEs match small RNAs. These results, performed on approximately 1% of the maize genome, demonstrate the feasibility of refining the B73 RefGen_v1 genome assembly by incorporating optical map, high-resolution genetic map, and comparative genomic data sets. Such improvements, along with those of gene and repeat annotation, will serve to promote future functional genomic and phylogenomic research in maize and other grasses.


Subjects
Base Pairing/genetics , Plant Genome/genetics , Zea mays/genetics , Base Sequence , Plant Chromosomes/genetics , DNA Transposable Elements/genetics , Molecular Evolution , Gene Duplication , Gene Rearrangement/genetics , Plant Genes , Genetic Loci/genetics , Molecular Sequence Data , Mutation/genetics , Open Reading Frames/genetics , Oryza/genetics , Physical Chromosome Mapping , Plant RNA/genetics , Nucleic Acid Sequence Homology , Sorghum/genetics , Synteny/genetics
16.
J Comput Biol ; 29(1): 27-44, 2022 01.
Article in English | MEDLINE | ID: mdl-35050715

ABSTRACT

We propose GRNUlar, a novel deep learning framework for supervised learning of gene regulatory networks (GRNs) from single-cell RNA-Sequencing (scRNA-Seq) data. Our framework incorporates two intertwined models. First, we leverage the expressive ability of neural networks to capture complex dependencies between transcription factors and the corresponding genes they regulate, by developing a multitask learning framework. Second, to capture sparsity of GRNs observed in the real world, we design an unrolled algorithm technique for our framework. Our deep architecture requires supervision for training, for which we repurpose existing synthetic data simulators that generate scRNA-Seq data guided by an underlying GRN. Experimental results demonstrate that GRNUlar outperforms state-of-the-art methods on both synthetic and real data sets. Our study also demonstrates the novel and successful use of expression data simulators for supervised learning of GRN inference.


Subjects
Deep Learning , Gene Regulatory Networks , Single-Cell Analysis/statistics & numerical data , Algorithms , Animals , Bias , Computational Biology , Computer Simulation , Nucleic Acid Databases/statistics & numerical data , Escherichia coli/genetics , Humans , Mice , Computer Neural Networks , RNA-Seq/statistics & numerical data , Saccharomyces cerevisiae/genetics , Supervised Machine Learning
17.
J Comput Biol ; 29(12): 1377-1396, 2022 12.
Article in English | MEDLINE | ID: mdl-36450127

ABSTRACT

The problem of aligning a sequence to a walk in a labeled graph is of fundamental importance to Computational Biology. For an arbitrary graph G=(V,E) and a pattern P of length m, a lower bound based on the Strong Exponential Time Hypothesis implies that an algorithm for finding a walk in G exactly matching P significantly faster than O(|E|m) time is unlikely. However, for many special graphs, such as de Bruijn graphs, the problem can be solved in linear time. For approximate matching, the picture is more complex. When edits (substitutions, insertions, and deletions) are only allowed to the pattern, or when the graph is acyclic, the problem is solvable in O(|E|m) time. When edits are allowed to arbitrary cyclic graphs, the problem becomes NP-complete, even on binary alphabets. Moreover, NP-completeness continues to hold even when edits are restricted to only substitutions. Despite the popularity of de Bruijn graphs in Computational Biology, the complexity of approximate pattern matching on de Bruijn graphs remained unknown. We investigate this problem and show that the properties that make de Bruijn graphs amenable to efficient exact pattern matching do not extend to approximate matching, even when restricted to the substitutions-only case with alphabet size four. Specifically, we prove that determining the existence of a matching walk in a de Bruijn graph is NP-complete when substitutions are allowed to the graph. We also demonstrate that an algorithm significantly faster than O(|E|m) is unlikely for de Bruijn graphs in the case where substitutions are only allowed to the pattern. This stands in contrast to pattern-to-text matching, where exact matching is solvable in linear time, as it is on de Bruijn graphs, but approximate matching under substitutions is solvable in subquadratic Õ(n√m) time, where n is the text's length.
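
To make the O(|E|m) bound for exact matching on arbitrary graphs concrete, here is a minimal sketch. It is not the authors' code. It assumes a node-labeled directed graph in which each vertex carries a single character; the function name `has_matching_walk` and the dictionary plus edge-list representation are illustrative choices only.

```python
def has_matching_walk(labels, edges, pattern):
    """Return True if some walk in the graph spells `pattern` exactly.

    labels:  dict mapping vertex -> single-character label (assumed model)
    edges:   list of directed (u, v) pairs
    pattern: string to match against the labels along a walk
    """
    if not pattern:
        return True
    # Vertices at which a walk spelling the first character can end.
    frontier = {v for v, c in labels.items() if c == pattern[0]}
    # For each further character, scan every edge once: O(|E|) per step,
    # hence O(|V| + |E|m) overall for a pattern of length m.
    for ch in pattern[1:]:
        next_frontier = set()
        for u, v in edges:
            if u in frontier and labels[v] == ch:
                next_frontier.add(v)
        frontier = next_frontier
        if not frontier:
            return False
    return bool(frontier)


if __name__ == "__main__":
    labels = {0: "A", 1: "C", 2: "G"}
    edges = [(0, 1), (1, 2), (2, 0)]  # a 3-cycle, so walks may revisit vertices
    print(has_matching_walk(labels, edges, "ACGAC"))
    print(has_matching_walk(labels, edges, "AG"))
```

Because a walk may revisit vertices, the frontier is recomputed from the full edge list at every pattern position rather than marking vertices as visited.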


Subjects
Algorithms , Computational Biology , Sequence Alignment , DNA Sequence Analysis , Hardness
18.
BMC Bioinformatics ; 12 Suppl 1: S52, 2011 Feb 15.
Article in English | MEDLINE | ID: mdl-21342585

ABSTRACT

BACKGROUND: High-throughput short read sequencing is revolutionizing genomics and systems biology research by enabling cost-effective deep coverage sequencing of genomes and transcriptomes. Error detection and correction are crucial to many short read sequencing applications including de novo genome sequencing, genome resequencing, and digital gene expression analysis. Short read error detection is typically carried out by counting the observed frequencies of kmers in reads and validating those with frequencies exceeding a threshold. In genomes with high repeat content, an erroneous kmer may be frequently observed if it has few nucleotide differences from valid kmers with multiple occurrences in the genome. Error detection and correction have mostly been applied to genomes with low repeat content, and this remains a challenging problem for genomes with high repeat content. RESULTS: We develop a statistical model and a computational method for error detection and correction in the presence of genomic repeats. We propose a method to infer genomic frequencies of kmers from their observed frequencies by analyzing the misread relationships among observed kmers. We also propose a method to estimate the threshold useful for validating kmers whose estimated genomic frequency exceeds the threshold. We demonstrate that superior error detection is achieved using these methods. Furthermore, we break away from the common assumption of uniformly distributed errors within a read, and provide a framework to model position-dependent error occurrence frequencies common to many short read platforms. Lastly, we achieve better error correction in genomes with high repeat content. AVAILABILITY: The software is implemented in C++ and is freely available under GNU GPL3 license and Boost Software V1.0 license at http://aluru-sun.ece.iastate.edu/doku.php?id=redeem. CONCLUSIONS: We introduce a statistical framework to model sequencing errors in next-generation reads, which led to promising results in detecting and correcting errors for genomes with high repeat content.
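
The conventional baseline that the abstract describes (count observed kmer frequencies, then validate those exceeding a threshold) can be sketched as follows. This is only that simple baseline, not the statistical model the paper proposes; the function names `count_kmers` and `split_by_threshold` and the toy reads are assumptions made for illustration.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count how often each length-k substring occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def split_by_threshold(counts, threshold):
    """Treat kmers seen more than `threshold` times as valid, the rest as suspect."""
    valid = {kmer for kmer, c in counts.items() if c > threshold}
    suspect = set(counts) - valid
    return valid, suspect


if __name__ == "__main__":
    reads = ["ACGTA", "ACGTA", "ACGTT"]  # toy data; the last read ends differently
    counts = count_kmers(reads, k=4)
    valid, suspect = split_by_threshold(counts, threshold=1)
    print(sorted(valid), sorted(suspect))
```

As the abstract notes, a fixed observed-frequency threshold of this kind is what breaks down in repeat-rich genomes, where an erroneous kmer close to a highly repeated valid kmer can itself exceed the threshold.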


Subjects
Computational Biology/methods , Genomics/methods , High-Throughput Nucleotide Sequencing/methods , Statistical Models , Software , Algorithms , Likelihood Functions
19.
Bioinformatics ; 26(20): 2526-33, 2010 Oct 15.
Article in English | MEDLINE | ID: mdl-20834037

ABSTRACT

MOTIVATION: Error correction is critical to the success of next-generation sequencing applications, such as resequencing and de novo genome sequencing. It is especially important for high-throughput short-read sequencing, where reads are much shorter and more abundant, and errors more frequent than in traditional Sanger sequencing. Processing massive numbers of short reads with existing error correction methods is both compute- and memory-intensive, yet the results are far from satisfactory when applied to real datasets. RESULTS: We present a novel approach, termed Reptile, for error correction in short-read data from next-generation sequencing. Reptile works with the spectrum of k-mers from the input reads, and corrects errors by simultaneously examining: (i) Hamming distance-based correction possibilities for potentially erroneous k-mers; and (ii) neighboring k-mers from the same read for correct contextual information. By not needing to store input data, Reptile has the favorable property that it can handle data that does not fit in main memory. In addition to sequence data, Reptile can make use of available quality score information. Our experiments show that Reptile outperforms previous methods in the percentage of errors removed from the data and the accuracy in true base assignment. In addition, a significant reduction in run time and memory usage has been achieved compared with previous methods, making it more practical for short-read error correction when sampling larger genomes. AVAILABILITY: Reptile is implemented in C++ and is available through the link: http://aluru-sun.ece.iastate.edu/doku.php?id=software CONTACT: aluru@iastate.edu.
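
Point (i) of the abstract, Hamming distance-based correction possibilities, can be illustrated with a small sketch. It is a deliberate simplification and not Reptile's actual algorithm: it considers only single-base substitutions and ignores the contextual information from neighboring k-mers and quality scores. The names `hamming_neighbors` and `correction_candidates`, and the notion of a precomputed `trusted` set, are assumptions for illustration.

```python
def hamming_neighbors(kmer, alphabet="ACGT"):
    """Yield every k-mer at Hamming distance exactly 1 from `kmer`."""
    for i, original in enumerate(kmer):
        for base in alphabet:
            if base != original:
                yield kmer[:i] + base + kmer[i + 1:]


def correction_candidates(kmer, trusted):
    """Return the trusted k-mers reachable by one substitution, in sorted order."""
    return sorted(n for n in hamming_neighbors(kmer) if n in trusted)


if __name__ == "__main__":
    trusted = {"ACGT", "CGTA"}  # hypothetical set of k-mers believed correct
    print(correction_candidates("ACGA", trusted))
```

When more than one candidate is returned, the choice is ambiguous from the k-mer alone, which is the situation in which context from adjacent k-mers in the same read becomes useful.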


Subjects
Genomics/methods , DNA Sequence Analysis/methods , Software , Algorithms
20.
Nat Commun ; 12(1): 6566, 2021 11 12.
Article in English | MEDLINE | ID: mdl-34772935

ABSTRACT

As the sequencing depth of chromatin studies continues to grow for sensitive profiling of regulatory elements or chromatin spatial structures, aligning and preprocessing these sequencing data has become the bottleneck for analysis. Here we present Chromap, an ultrafast method for aligning and preprocessing high-throughput chromatin profiles. Chromap is comparable to BWA-MEM and Bowtie2 in alignment accuracy. It is over 10 times faster than traditional workflows on bulk ChIP-seq/Hi-C profiles, and over 10 times faster than the 10x Genomics CellRanger v2.0.0 pipeline on single-cell ATAC-seq profiles.


Subjects
Chromatin , Genomics/methods , Chromatin Immunoprecipitation Sequencing , Computational Biology , Gene Expression Profiling/methods , Humans , Nucleic Acid Regulatory Sequences , DNA Sequence Analysis/methods , Single-Cell Analysis