Results 1 - 20 of 23
1.
PLoS Comput Biol ; 20(5): e1012095, 2024 May.
Article in English | MEDLINE | ID: mdl-38753877

ABSTRACT

Dictionary learning (DL), implemented via matrix factorization (MF), is commonly used in computational biology to tackle ubiquitous clustering problems. The method is favored due to its conceptual simplicity and relatively low computational complexity. However, DL algorithms produce results that lack interpretability in terms of real biological data. Additionally, they are not optimized for graph-structured data and hence often fail to handle them in a scalable manner. To address these limitations, we propose a novel DL algorithm called online convex network dictionary learning (online cvxNDL). Like classical DL algorithms, online cvxNDL is implemented via MF, but it is designed to handle extremely large datasets by virtue of its online nature. Importantly, it enables the interpretation of dictionary elements, which serve as cluster representatives, through convex combinations of real measurements. Moreover, the algorithm can be applied to data with a network structure by incorporating specialized subnetwork sampling techniques. To demonstrate the utility of our approach, we apply online cvxNDL to 3D-genome RNAPII ChIA-Drop data with the goal of identifying important long-range interaction patterns (long-range dictionary elements). ChIA-Drop probes higher-order interactions and produces data in the form of hypergraphs whose nodes represent genomic fragments. The hyperedges represent observed physical contacts. Our hypergraph analysis aims to create an interpretable dictionary of long-range interaction patterns that accurately represent global chromatin physical contact maps. Through the use of dictionary information, one can also associate the contact maps with RNA transcripts and infer cellular functions. To accomplish this task, we focus on RNAPII-enriched ChIA-Drop data from Drosophila melanogaster S2 cell lines. Our results offer two key insights.
First, we demonstrate that online cvxNDL retains the accuracy of classical DL (MF) methods while simultaneously ensuring unique interpretability and scalability. Second, we identify distinct collections of proximal and distal interaction patterns involving chromatin elements shared by related processes across different chromosomes, as well as patterns unique to specific chromosomes. To associate the dictionary elements with biological properties of the corresponding chromatin regions, we employ Gene Ontology (GO) enrichment analysis and perform multiple RNA coexpression studies.
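The interpretability claim above — cluster representatives that are convex combinations of real measurements — can be illustrated with a toy sketch. This is plain clustering standing in for the authors' online MF algorithm, not cvxNDL itself; all names are illustrative:

```python
def convex_representatives(points, k, iters=25):
    """Toy sketch: cluster points and expose each representative explicitly
    as a convex combination (nonnegative weights summing to 1) of real data
    points, mirroring the convex-combination constraint of cvxNDL. This is
    NOT the authors' online algorithm, only the interpretability idea."""
    reps = list(points[:k])                                  # naive init
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for idx, p in enumerate(points):
            j = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, reps[j])))
            clusters[j].append(idx)
        # a cluster mean is a convex combination of its members (uniform weights)
        reps = [tuple(sum(points[i][c] for i in cl) / len(cl)
                      for c in range(len(points[0]))) if cl else reps[j]
                for j, cl in enumerate(clusters)]
    weights = [{i: 1.0 / len(cl) for i in cl} if cl else {} for cl in clusters]
    return reps, weights
```

Because each representative is an average of actual cluster members, its weight vector is an explicit convex combination over real data points — the property that makes the resulting dictionary interpretable.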


Subjects
Algorithms, Chromatin, Computational Biology, Drosophila melanogaster, Chromatin/genetics, Chromatin/chemistry, Chromatin/metabolism, Computational Biology/methods, Drosophila melanogaster/genetics, Animals, Machine Learning
2.
BMC Bioinformatics ; 25(1): 195, 2024 May 17.
Article in English | MEDLINE | ID: mdl-38760692

ABSTRACT

BACKGROUND: Pathogenic infections pose a significant threat to global health, affecting millions of people every year and presenting substantial challenges to healthcare systems worldwide. Efficient and timely testing plays a critical role in disease control and transmission prevention. Group testing is a well-established method for reducing the number of tests needed to screen large populations when the disease prevalence is low. However, it does not fully utilize the quantitative information provided by qPCR methods, nor is it able to accommodate a wide range of pathogen loads. RESULTS: To address these issues, we introduce a novel adaptive semi-quantitative group testing (SQGT) scheme to efficiently screen populations via two-stage qPCR testing. The SQGT method quantizes cycle threshold (Ct) values into multiple bins, leveraging the information from the first stage of screening to improve the detection sensitivity. Dynamic Ct threshold adjustments mitigate dilution effects and enhance test accuracy. Comparisons with traditional binary outcome GT methods show that SQGT reduces the number of tests by 24% on the only complete real-world qPCR group testing dataset from Israel, while maintaining a negligible false negative rate. CONCLUSION: In conclusion, our adaptive SQGT approach, utilizing qPCR data and dynamic threshold adjustments, offers a promising solution for efficient population screening. With a reduction in the number of tests and minimal false negatives, SQGT holds potential to enhance disease control and testing strategies on a global scale.
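The two core steps described above — quantizing Ct values into bins and relaxing the pooled-test cutoff to compensate for dilution — can be sketched as follows. The bin edges and cutoff values here are illustrative assumptions, not the parameters used in the paper:

```python
import math

def quantize_ct(ct, bins=(25.0, 30.0, 35.0, 40.0)):
    """Map a cycle-threshold value to a semi-quantitative bin index.
    Lower Ct means higher viral load, so bin 0 is 'strongly positive';
    values beyond the last edge are treated as negative."""
    for i, edge in enumerate(bins):
        if ct < edge:
            return i
    return len(bins)

def pooled_ct_threshold(individual_cutoff, pool_size):
    """Diluting one positive sample into a pool of size n raises its Ct by
    roughly log2(n) cycles (each qPCR cycle doubles the template), so the
    pooled-test cutoff must be relaxed accordingly."""
    return individual_cutoff + math.log2(pool_size)
```

For example, with an individual cutoff of 37 cycles, a pool of 8 samples would use a relaxed cutoff of 40 cycles.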


Subjects
Real-Time Polymerase Chain Reaction, Real-Time Polymerase Chain Reaction/methods, Humans
3.
Nano Lett ; 22(5): 1905-1914, 2022 03 09.
Article in English | MEDLINE | ID: mdl-35212544

ABSTRACT

DNA is a promising next-generation data storage medium, but challenges remain with synthesis costs and recording latency. Here, we describe a prototype of a DNA data storage system that uses an extended molecular alphabet combining natural and chemically modified nucleotides. Our results show that MspA nanopores can discriminate different combinations and ordered sequences of natural and chemically modified nucleotides in custom-designed oligomers. We further demonstrate single-molecule sequencing of the extended alphabet using a neural network architecture that classifies raw current signals generated by Oxford Nanopore sequencers with an average accuracy exceeding 60% (39× larger than random guessing). Molecular dynamics simulations show that the majority of modified nucleotides lead to only minor perturbations of the DNA double helix. Overall, the extended molecular alphabet may potentially offer a nearly 2-fold increase in storage density and potentially the same order of reduction in the recording latency, thereby enabling new implementations of molecular recorders.
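The "nearly 2-fold increase in storage density" follows directly from information capacity per sequence position: a position drawn from an alphabet of size q carries log2(q) bits. Assuming, for illustration only, an extended alphabet of 16 letters (the paper does not commit to this exact size):

```python
import math

def bits_per_symbol(alphabet_size):
    """Information capacity of one sequence position, in bits."""
    return math.log2(alphabet_size)

# Natural DNA has 4 letters -> 2 bits per position. A hypothetical extended
# alphabet of 16 letters (natural plus chemically modified nucleotides; the
# size 16 is an assumption chosen to illustrate the claim) doubles that:
density_gain = bits_per_symbol(16) / bits_per_symbol(4)  # 4 bits / 2 bits
```

The same factor applies to recording latency, since half as many positions need to be synthesized for a given payload.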


Subjects
Nanopores, DNA/genetics, Data Systems, Information Storage and Retrieval, Neural Networks (Computer), Nucleotides/chemistry, Nucleotides/genetics, Sequence Analysis, DNA/methods
4.
Bioinformatics ; 34(15): 2654-2656, 2018 08 01.
Article in English | MEDLINE | ID: mdl-29528370

ABSTRACT

Motivation: DNA methylation is one of the most important epigenetic mechanisms in cells and plays a significant role in controlling gene expression. Abnormal methylation patterns have been associated with cancer, imprinting disorders and repeat-instability diseases. As inexpensive bisulfite sequencing approaches have led to significant efforts in acquiring methylation data, problems of data storage and management have become increasingly important. The de facto compression method for methylation data is gzip, a general-purpose compression algorithm that does not cater to the special format of methylation files. We propose METHCOMP, a new compression scheme tailor-made for bedMethyl files, which supports random access. Results: We tested the METHCOMP algorithm on 24 bedMethyl files retrieved from four randomly selected ENCODE assays. Our findings reveal that METHCOMP offers an average compression ratio improvement over gzip of up to 7.5x. As an example, METHCOMP compresses a 48 GB file to only 0.9 GB, which corresponds to a 98% reduction in size. Availability and implementation: METHCOMP is freely available at https://github.com/jianhao2016/METHCOMP. Supplementary information: Supplementary data are available at Bioinformatics online.
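The general advantage of a format-aware scheme over generic gzip can be sketched by splitting bedMethyl-like records into homogeneous per-column streams and delta-encoding the sorted positions before entropy coding. This is a minimal sketch of the idea, not the METHCOMP algorithm, and the four-column record layout is a simplification:

```python
import zlib

def compress_bedmethyl(rows):
    """Format-aware sketch: split (chrom, start, coverage, pct) rows into
    per-column streams, delta-encode the sorted start positions, and
    compress each stream separately. Homogeneous, small-valued streams
    compress far better than the interleaved text representation."""
    chroms, starts, covs, pcts = zip(*rows)
    deltas = [starts[0]] + [b - a for a, b in zip(starts, starts[1:])]
    streams = [
        "\n".join(chroms),
        "\n".join(map(str, deltas)),
        "\n".join(map(str, covs)),
        "\n".join(map(str, pcts)),
    ]
    return [zlib.compress(s.encode()) for s in streams]
```

Compressing each stream independently also opens the door to random access, since a position query only needs the (small) delta stream to be decoded.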


Subjects
DNA Methylation, Data Compression/methods, High-Throughput Nucleotide Sequencing/methods, Sequence Analysis, DNA/methods, Software, Algorithms, Genomics/methods, Humans
5.
Bioinformatics ; 34(6): 911-919, 2018 03 15.
Article in English | MEDLINE | ID: mdl-29087447

ABSTRACT

Motivation: Chromatin immunoprecipitation sequencing (ChIP-seq) experiments are inexpensive and time-efficient, but they produce massive datasets that introduce significant storage and maintenance challenges. To address the resulting Big Data problems, we propose a lossless and lossy compression framework specifically designed for ChIP-seq Wig data, termed ChIPWig. ChIPWig enables random access and summary-statistics lookups, and is based on the asymptotic theory of optimal point density design for nonuniform quantizers. Results: We tested the ChIPWig compressor on 10 ChIP-seq datasets generated by the ENCODE consortium. On average, lossless ChIPWig reduced the file sizes to merely 6% of the original and offered a 6-fold compression rate improvement compared to bigWig. The lossy mode further reduced file sizes 2-fold compared to the lossless mode, with little or no effect on peak calling and motif discovery using specialized NarrowPeaks methods. The compression and decompression speeds are of the order of 0.2 sec/MB on general-purpose computers. Availability and implementation: The source code and binaries are freely available for download at https://github.com/vidarmehr/ChIPWig-v2, implemented in C++. Contact: milenkov@illinois.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
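The lossy mode rests on nonuniform quantization matched to the empirical data distribution. A minimal one-dimensional Lloyd-Max sketch of that general idea (not the asymptotic point-density design that ChIPWig actually uses):

```python
def lloyd_max(samples, k, iters=50):
    """1-D Lloyd-Max quantizer design: alternate nearest-level assignment
    and centroid updates. The resulting reconstruction levels crowd
    together where the data is dense, i.e. a nonuniform quantizer matched
    to the signal's empirical distribution."""
    pts = sorted(samples)
    # initialize levels at empirical quantiles so they follow the density
    levels = [pts[(i + 1) * len(pts) // (k + 1)] for i in range(k)]
    for _ in range(iters):
        cells = [[] for _ in range(k)]
        for x in pts:
            j = min(range(k), key=lambda j: abs(x - levels[j]))
            cells[j].append(x)
        levels = sorted(sum(c) / len(c) if c else levels[j]
                        for j, c in enumerate(cells))
    return levels
```

Quantizing Wig coverage values to k such levels replaces each value with a short level index, which is where the extra 2-fold size reduction of a lossy mode comes from.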


Subjects
Chromatin Immunoprecipitation/methods, Data Compression/methods, High-Throughput Nucleotide Sequencing/methods, Sequence Analysis, DNA/methods, Software
6.
Bioinformatics ; 32(2): 173-80, 2016 Jan 15.
Article in English | MEDLINE | ID: mdl-26424856

ABSTRACT

CONTRIBUTIONS: We developed a new lossless compression method for WIG data, named smallWig, offering the best known compression rates for RNA-seq data and featuring random access functionalities that enable visualization, summary statistics analysis and fast queries from the compressed files. Our approach yields order-of-magnitude improvements compared with bigWig and achieves compression rates that are only a fraction of those produced by cWig. The key features of the smallWig algorithm are statistical data analysis and a combination of source coding methods that ensure high flexibility and make the algorithm suitable for different applications. Furthermore, for general-purpose file compression, the compression rate of smallWig approaches the empirical entropy of the tested WIG data. For compression with random query features, smallWig uses a simple block-based compression scheme that introduces only a minor overhead in the compression rate. For archival or storage space-sensitive applications, the method relies on context mixing techniques that lead to further improvements of the compression rate. Implementations of smallWig can be executed in parallel on different sets of chromosomes using multiple processors, thereby enabling desirable scaling for future transcriptome Big Data platforms. MOTIVATION: The development of next-generation sequencing technologies has led to a dramatic decrease in the cost of DNA/RNA sequencing and expression profiling. RNA-seq has emerged as an important and inexpensive technology that provides information about whole transcriptomes of various species and organisms, as well as different organs and cellular communities. The vast volume of data generated by RNA-seq experiments has significantly increased data storage costs and communication bandwidth requirements.
Current compression tools for RNA-seq data such as bigWig and cWig either use general-purpose compressors (gzip) or suboptimal compression schemes that leave significant room for improvement. To substantiate this claim, we performed a statistical analysis of expression data in different transform domains and developed accompanying entropy coding methods that bridge the gap between theoretical and practical WIG file compression rates. RESULTS: We tested different variants of the smallWig compression algorithm on a number of integer- and real-valued (floating point) RNA-seq WIG files generated by the ENCODE project. The results reveal that, on average, smallWig offers 18-fold compression rate improvements, up to 2.5-fold compression time improvements, and 1.5-fold decompression time improvements when compared with bigWig. On the tested files, the memory usage of the algorithm never exceeded 90 KB. When more elaborate context mixing compressors were used within smallWig, the obtained compression rates were as much as 23 times better than those of bigWig. For smallWig used in the random query mode, which also supports retrieval of the summary statistics, an overhead in the compression rate of roughly 3-17% was introduced, depending on the chosen system parameters. Enabling random data access additionally increased encoding and decoding times by 30% and 55%, respectively. We also implemented smallWig using multi-processor programming. This parallelization feature decreases the encoding delay 2-3.4 times compared with that of a single-processor implementation, with the number of processors used ranging from 2 to 8; in the same parameter regime, the decoding delay decreased 2-5.2 times. AVAILABILITY AND IMPLEMENTATION: The smallWig software can be downloaded from: http://stanford.edu/~zhiyingw/smallWig/smallwig.html, http://publish.illinois.edu/milenkovic/, http://web.stanford.edu/~tsachy/.
CONTACT: zhiyingw@stanford.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
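The block-based random-access scheme described above trades a small compression-rate overhead for fast queries: each block is compressed independently, so answering a query only requires decompressing one block. A minimal sketch (class name and block size are illustrative, not smallWig's):

```python
import zlib

class BlockCompressor:
    """Block-based compression with random access: each fixed-size block of
    values is compressed independently, so a lookup decompresses a single
    block instead of the whole file. The per-block headers and reset
    dictionaries are the minor rate overhead mentioned above."""
    def __init__(self, values, block_size=1024):
        self.block_size = block_size
        self.blocks = [
            zlib.compress(",".join(map(str, values[i:i + block_size])).encode())
            for i in range(0, len(values), block_size)
        ]

    def get(self, i):
        """Random access: decompress only the block containing index i."""
        block = zlib.decompress(self.blocks[i // self.block_size]).decode()
        return float(block.split(",")[i % self.block_size])
```

Shrinking the block size speeds up queries but raises the rate overhead, which matches the 3-17% range reported above as a parameter-dependent trade-off.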


Subjects
Algorithms, Data Compression/methods, High-Throughput Nucleotide Sequencing/methods, Sequence Analysis, RNA/methods, Humans, Software
7.
Bioinformatics ; 32(24): 3717-3728, 2016 12 15.
Article in English | MEDLINE | ID: mdl-27540270

ABSTRACT

MOTIVATION: Cancer genomes exhibit a large number of different alterations that affect many genes in a diverse manner. An improved understanding of the generative mechanisms behind the mutation rules and their influence on gene community behavior is of great importance for the study of cancer. RESULTS: To expand our capability to analyze combinatorial patterns of cancer alterations, we developed a rigorous methodology for cancer mutation pattern discovery based on a new, constrained form of correlation clustering. Our new algorithm, named C3 (Cancer Correlation Clustering), leverages mutual exclusivity of mutations, patient coverage and driver network concentration principles. To test C3, we performed a detailed analysis on TCGA breast cancer and glioblastoma data and showed that our algorithm outperforms the state-of-the-art CoMEt method in terms of discovering mutually exclusive gene modules and identifying biologically relevant driver genes. The proposed agnostic clustering method represents a unique tool for efficient and reliable identification of mutation patterns and driver pathways in large-scale cancer genomics studies, and it may also be used for other clustering problems on biological graphs. AVAILABILITY AND IMPLEMENTATION: The source code for the C3 method can be found at https://github.com/jackhou2/C3. CONTACTS: jianma@cs.cmu.edu or milenkov@illinois.edu. Supplementary information: Supplementary data are available at Bioinformatics online.


Subjects
Algorithms, Breast Neoplasms/genetics, Cluster Analysis, Computational Biology/methods, DNA Mutational Analysis/methods, Glioblastoma/genetics, Female, Gene Regulatory Networks, Humans, Mutation
8.
IEEE ACM Trans Netw ; 25(5): 3219-3234, 2017 Oct.
Article in English | MEDLINE | ID: mdl-30473608

ABSTRACT

We propose a new latent Boolean feature model for complex networks that captures different types of node interactions and network communities. The model is based on a new concept in graph theory, termed the Boolean intersection representation of a graph, which generalizes the notion of an intersection representation. We mostly focus on one form of Boolean intersection, termed cointersection, and describe how to use this representation to deduce node feature sets and their communities. We derive several general bounds on the minimum number of features used in cointersection representations and discuss graph families for which exact cointersection characterizations are possible. Our results also include algorithms for finding optimal and approximate cointersection representations of a graph.
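For intuition, the classical (non-Boolean) intersection representation that the paper generalizes can be verified mechanically: an assignment of feature sets to nodes represents a graph exactly when two nodes are adjacent iff their feature sets intersect. The Boolean cointersection replaces this adjacency rule with a more general Boolean condition on shared features:

```python
from itertools import combinations

def is_intersection_representation(features, edges, nodes):
    """Check whether assigning the feature set features[v] to each node v
    is an intersection representation of the graph (nodes, edges):
    u and v must be adjacent exactly when features[u] and features[v]
    share at least one element."""
    edgeset = {frozenset(e) for e in edges}
    for u, v in combinations(nodes, 2):
        share = bool(features[u] & features[v])
        if share != (frozenset((u, v)) in edgeset):
            return False
    return True
```

Minimizing the total number of distinct features used across all nodes is the optimization problem the paper bounds and solves for special graph families.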

9.
BMC Bioinformatics ; 17: 94, 2016 Feb 19.
Article in English | MEDLINE | ID: mdl-26895947

ABSTRACT

BACKGROUND: Metagenomics is a genomics research discipline devoted to the study of microbial communities in environmental samples and human and animal organs and tissues. Sequenced metagenomic samples usually comprise reads from a large number of different bacterial communities and hence tend to result in large file sizes, typically ranging between 1-10 GB. This leads to challenges in analyzing, transferring and storing metagenomic data. In order to overcome these data processing issues, we introduce MetaCRAM, the first de novo, parallelized software suite specialized for FASTA and FASTQ format metagenomic read processing and lossless compression. RESULTS: MetaCRAM integrates algorithms for taxonomy identification and assembly, and introduces parallel execution methods; furthermore, it enables genome reference selection and CRAM based compression. MetaCRAM also uses novel reference-based compression methods designed through extensive studies of integer compression techniques and through fitting of empirical distributions of metagenomic read-reference positions. MetaCRAM is a lossless method compatible with standard CRAM formats, and it allows for fast selection of relevant files in the compressed domain via maintenance of taxonomy information. The performance of MetaCRAM as a stand-alone compression platform was evaluated on various metagenomic samples from the NCBI Sequence Read Archive, suggesting 2- to 4-fold compression ratio improvements compared to gzip. On average, the compressed file sizes were 2-13 percent of the original raw metagenomic file sizes. CONCLUSIONS: We described the first architecture for reference-based, lossless compression of metagenomic data. The compression scheme proposed offers significantly improved compression ratios as compared to off-the-shelf methods such as zip programs. 
Furthermore, it enables running different components in parallel and it provides the user with taxonomic and assembly information generated during execution of the compression pipeline. AVAILABILITY: The MetaCRAM software is freely available at http://web.engr.illinois.edu/~mkim158/metacram.html. The website also contains a README file and other relevant instructions for running the code. Note that running the code requires a minimum of 16 GB of RAM. In addition, a VirtualBox image set up for a 4 GB RAM machine is available for users to run a simple demonstration.


Subjects
Classification/methods, Data Compression/methods, Genomics/methods, High-Throughput Nucleotide Sequencing/methods, Metagenomics/methods, Humans
10.
Bioinformatics ; 31(7): 1034-43, 2015 Apr 01.
Article in English | MEDLINE | ID: mdl-25411330

ABSTRACT

UNLABELLED: Gene prioritization refers to a family of computational techniques for inferring disease genes through a set of training genes and carefully chosen similarity criteria. Test genes are scored based on their average similarity to the training set, and the rankings of genes under various similarity criteria are aggregated via statistical methods. The contributions of our work are threefold: (i) first, based on the realization that there is no unique way to define an optimal aggregate for rankings, we investigate the predictive quality of a number of new aggregation methods and known fusion techniques from machine learning and social choice theory. Within this context, we quantify the influence of the number of training genes and similarity criteria on the diagnostic quality of the aggregate and perform in-depth cross-validation studies; (ii) second, we propose a new approach to genomic data aggregation, termed HyDRA (Hybrid Distance-score Rank Aggregation), which combines the advantages of score-based and combinatorial aggregation techniques. We also propose incorporating a new top-versus-bottom (TvB) weighting feature into the hybrid schemes. The TvB feature ensures that aggregates are more reliable at the top of the list than at the bottom, since only top candidates are tested experimentally; (iii) third, we propose an iterative procedure for gene discovery that operates via successive augmentation of the set of training genes by genes discovered in previous rounds and checked for consistency. MOTIVATION: Fundamental results from social choice theory, political and computer sciences, and statistics have shown that there exists no consistent, fair and unique way to aggregate rankings. Instead, one has to decide on an aggregation approach using a predefined set of desirable properties for the aggregate. The aggregation methods fall into two categories, score- and distance-based approaches, each of which has its own drawbacks and advantages.
This work is motivated by the observation that by merging these two techniques in a computationally efficient manner and incorporating additional constraints, one can ensure that the predictive quality of the resulting aggregation algorithm is very high. RESULTS: We tested HyDRA on a number of gene sets, including autism, breast cancer, colorectal cancer, endometriosis, ischaemic stroke, leukemia, lymphoma and osteoarthritis. Furthermore, we performed iterative gene discovery for glioblastoma, meningioma and breast cancer, using a sequentially augmented list of training genes related to the Turcot syndrome, Li-Fraumeni condition and other diseases. The methods outperform state-of-the-art software tools such as ToppGene and Endeavour. Despite this finding, we recommend as best practice to take the union of top-ranked items produced by different methods for the final aggregated list. AVAILABILITY AND IMPLEMENTATION: The HyDRA software may be downloaded from: http://web.engr.illinois.edu/~mkim158/HyDRA.zip. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
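The score-based half of such hybrid schemes, combined with a top-versus-bottom-style weighting, can be sketched with a Borda-style aggregator. This is an illustration of the weighting idea only, not the HyDRA algorithm; the gene names are made up:

```python
def weighted_borda(rankings, weight=lambda pos, n: n - pos):
    """Score-based rank aggregation. With the default linear weight this is
    plain Borda counting; passing a convex weight such as
    lambda pos, n: (n - pos) ** 2 rewards agreement near the top of the
    input lists more than at the bottom, mimicking the top-versus-bottom
    (TvB) emphasis described above."""
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for pos, item in enumerate(ranking):
            scores[item] = scores.get(item, 0) + weight(pos, n)
    return sorted(scores, key=scores.get, reverse=True)
```

A distance-based aggregator would instead search for the permutation minimizing total disagreement with the inputs; the hybrid idea is to combine the cheap score step with such combinatorial refinement.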


Subjects
Algorithms, Artificial Intelligence, Disease/genetics, Genes, Genetic Predisposition to Disease, Genomics/methods, Software, Databases, Genetic, Humans
11.
Angew Chem Int Ed Engl ; 55(36): 10722-5, 2016 08 26.
Article in English | MEDLINE | ID: mdl-27484303

ABSTRACT

A 2D approach was studied for the design of polymer-based molecular barcodes. Uniform oligo(alkoxyamine amide)s, containing a monomer-coded binary message, were synthesized by orthogonal solid-phase chemistry. Sets of oligomers with different chain lengths were prepared. The physical mixture of these uniform oligomers leads to an intentional dispersity (1st-dimension fingerprint), which is measured by electrospray mass spectrometry. Furthermore, the monomer sequence of each component of the mass distribution can be analyzed by tandem mass spectrometry (2nd-dimension sequencing). By summing the sequence information of all components, a binary message can be read. A 4-byte extended-ASCII-coded message was written on a set of six uniform oligomers. Alternatively, a 3-byte sequence was written on a set of five oligomers. In both cases, the coded binary information was recovered.
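The write/read logic — a bit string split across several oligomers that are later identified by their distinguishing chain lengths and concatenated back in order — can be sketched as follows. Integer indices stand in for the chain-length fingerprint; this is purely illustrative, not the chemistry:

```python
def write_message(message, n_oligomers):
    """Split a byte message's bits across several 'oligomers'. Each oligomer
    carries an index (standing in for its distinguishing chain length) plus
    a chunk of 0/1 'monomers', mirroring the mass-then-sequence readout."""
    bits = "".join(f"{b:08b}" for b in message)
    chunk = -(-len(bits) // n_oligomers)          # ceiling division
    return [(i, bits[i * chunk:(i + 1) * chunk]) for i in range(n_oligomers)]

def read_message(oligomers):
    """Reassemble: order the components by their length tag (1st dimension),
    concatenate their sequenced chunks (2nd dimension), decode to bytes."""
    bits = "".join(chunk for _, chunk in sorted(oligomers))
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
```

Because the length tag orders the components, the mixture can be read back correctly even though the oligomers are recovered in arbitrary order.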

12.
bioRxiv ; 2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38585764

ABSTRACT

Cohesin is required for chromatin loop formation. However, its precise role in regulating gene transcription remains largely unknown. We investigated the relationship between cohesin and RNA Polymerase II (RNAPII) using single-molecule mapping and live-cell imaging methods in human cells. Cohesin-mediated transcriptional loops were highly correlated with those of RNAPII and followed the direction of gene transcription. Depleting RAD21, a subunit of cohesin, resulted in the loss of long-range (>100 kb) loops between distal (super-)enhancers and promoters of cell-type-specific genes. By contrast, the short-range (<50 kb) loops were insensitive to RAD21 depletion and connected genes that are mostly housekeeping. This result explains why only a small fraction of genes are affected by the loss of long-range chromatin interactions due to cohesin depletion. Remarkably, RAD21 depletion appeared to up-regulate genes located in early initiation zones (EIZ) of DNA replication, and the EIZ signals were amplified drastically without RAD21. Our results reveal new mechanistic insights into cohesin's multifaceted roles in establishing transcriptional loops, preserving long-range chromatin interactions for cell-type-specific genes, and maintaining the timely order of DNA replication.

13.
Article in English | MEDLINE | ID: mdl-35385386

ABSTRACT

We consider the problem of determining the mutational support and distribution of the SARS-CoV-2 viral genome in the small-sample regime. The mutational support refers to the unknown number of sites that may eventually mutate in the SARS-CoV-2 genome while mutational distribution refers to the distribution of point mutations in the viral genome across a population. The mutational support may be used to assess the virulence of the virus and guide primer selection for real-time RT-PCR testing. Estimating the distribution of mutations in the genome of different subpopulations while accounting for the unseen may also aid in discovering new variants. To estimate the mutational support in the small-sample regime, we use GISAID sequencing data and our state-of-the-art polynomial estimation techniques based on new weighted and regularized Chebyshev approximation methods. For distribution estimation, we adapt the well-known Good-Turing estimator. Our analysis reveals several findings: First, the mutational supports exhibit significant differences in the ORF6 and ORF7a regions (older versus younger patients), ORF1b and ORF10 regions (females versus males) and in almost all ORFs (Asia/Europe/North America). Second, even though the N region of SARS-CoV-2 has a predicted 10% mutational support, mutations fall outside of the primer regions recommended by the CDC.
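The Good-Turing correction used for the distribution estimate assigns unseen outcomes a total probability mass of N1/N, where N1 is the number of species (here, mutation sites) observed exactly once and N is the sample size. A minimal sketch of that estimator:

```python
from collections import Counter

def good_turing_unseen_mass(counts):
    """Good-Turing estimate of the total probability of unseen outcomes:
    N1 / N, where N1 is the number of species seen exactly once and N is
    the total number of observations. Species seen only once are the best
    evidence for how much of the distribution remains unobserved."""
    freq_of_freq = Counter(counts.values())
    n = sum(counts.values())
    return freq_of_freq.get(1, 0) / n
```

In the small-sample regime discussed above, this unseen mass is exactly the part of the mutational distribution that naive empirical frequencies would miss.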


Subjects
COVID-19, SARS-CoV-2, Male, Female, Humans, SARS-CoV-2/genetics, COVID-19/genetics, Mutation/genetics, Point Mutation, Genome, Viral/genetics
14.
Nat Commun ; 13(1): 2984, 2022 05 27.
Article in English | MEDLINE | ID: mdl-35624096

ABSTRACT

DNA-based data storage platforms traditionally encode information only in the nucleotide sequence of the molecule. Here we report on a two-dimensional molecular data storage system that records information in both the sequence and the backbone structure of DNA and performs nontrivial joint data encoding, decoding and processing. Our 2DDNA method efficiently stores images in synthetic DNA and embeds pertinent metadata as nicks in the DNA backbone. To avoid costly worst-case redundancy for correcting sequencing/rewriting errors and to mitigate issues associated with mismatched decoding parameters, we develop machine learning techniques for automatic discoloration detection and image inpainting. The 2DDNA platform is experimentally tested by reconstructing a library of images with undetectable or small visual degradation after readout processing, and by erasing and rewriting copyright metadata encoded in nicks. Our results demonstrate that DNA can serve as both a write-once and a rewritable memory for heterogeneous data and that data can be erased in a permanent, privacy-preserving manner. Moreover, the storage system can be made robust to degrading channel qualities while avoiding global error-correction redundancy.


Subjects
DNA, Machine Learning, DNA/genetics, Gene Library, Information Storage and Retrieval, Metadata
15.
Microsyst Nanoeng ; 8: 27, 2022.
Article in English | MEDLINE | ID: mdl-35310513

ABSTRACT

On-chip manipulation of charged particles using electrophoresis or electroosmosis is widely used for many applications, including optofluidic sensing, bioanalysis and macromolecular data storage. We hereby demonstrate a technique for the capture, localization, and release of charged particles and DNA molecules in an aqueous solution using tubular structures enabled by a strain-induced self-rolled-up nanomembrane (S-RuM) platform. Cuffed-in 3D electrodes that are embedded in cylindrical S-RuM structures and biased by a constant DC voltage are used to provide a uniform electrical field inside the microtubular devices. Efficient charged-particle manipulation is achieved at a bias voltage of <2-4 V, which is ~3 orders of magnitude lower than the required potential in traditional DC electrophoretic devices. Furthermore, Poisson-Boltzmann multiphysics simulation validates the feasibility and advantage of our microtubular charge manipulation devices over planar and other 3D variations of microfluidic devices. This work lays the foundation for on-chip DNA manipulation for data storage applications.

16.
Bioinformatics ; 25(13): 1686-93, 2009 Jul 01.
Article in English | MEDLINE | ID: mdl-19401400

ABSTRACT

MOTIVATION: The problem of reverse engineering the dynamics of gene expression profiles is of focal importance in systems biology. Due to noise and the inherent lack of sufficiently large datasets generated via high-throughput measurements, known reconstruction frameworks based on dynamical systems models fail to provide adequate settings for network analysis. This motivates the study of new approaches that produce stochastic lists of explanations for the observed network dynamics that can be efficiently inferred from small sample sets and in the presence of errors. RESULTS: We introduce a novel algebraic modeling framework, termed stochastic polynomial dynamical systems (SPDSs), that can capture the dynamics of regulatory networks based on microarray expression data. Here, we refer to dynamics of the network as the trajectories of gene expression profiles over time. The model assumes that the expression data is quantized in a manner that allows for imposing a finite field structure on the observations, and the existence of polynomial update functions for each gene in the network. The underlying reverse engineering algorithm is based on ideas borrowed from coding theory, and in particular, list-decoding methods for so-called Reed-Muller codes. The list-decoding method was tested on synthetic data and on microarray expression measurements from the M(3D) database, corresponding to a subnetwork of the Escherichia coli SOS repair system, as well as on the complete transcription factor network, available at RegulonDB. The results show that SPDSs constructed via list-decoders significantly outperform other algebraic reverse engineering methods, and that they also provide good guidelines for estimating the influence of genes on the dynamics of the network. AVAILABILITY: Software codes for list-decoding algorithms suitable for direct application to quantized expression data will be publicly available on the authors' web pages.


Subjects
Computational Biology/methods, Gene Regulatory Networks, Models, Genetic, Gene Expression Profiling/methods, Systems Biology
17.
Sci Rep ; 10(1): 7026, 2020 Apr 22.
Article in English | MEDLINE | ID: mdl-32321929

ABSTRACT

An amendment to this paper has been published and can be accessed via a link at the top of the paper.

18.
Nat Commun ; 11(1): 1742, 2020 04 08.
Article in English | MEDLINE | ID: mdl-32269230

ABSTRACT

Synthetic DNA-based data storage systems have received significant attention due to the promise of ultrahigh storage density and long-term stability. However, all known platforms suffer from high cost, read-write latency and error-rates that render them noncompetitive with modern storage devices. One means to avoid the above problems is using readily available native DNA. As the sequence content of native DNA is fixed, one can modify the topology instead to encode information. Here, we introduce DNA punch cards, a macromolecular storage mechanism in which data is written in the form of nicks at predetermined positions on the backbone of native double-stranded DNA. The platform accommodates parallel nicking on orthogonal DNA fragments and enzymatic toehold creation that enables single-bit random-access and in-memory computations. We use Pyrococcus furiosus Argonaute to punch files into the PCR products of Escherichia coli genomic DNA and accurately reconstruct the encoded data through high-throughput sequencing and read alignment.


Subjects
Argonaute Proteins/metabolism, DNA/genetics, Sequence Analysis, DNA, Base Sequence, Pyrococcus furiosus/enzymology
19.
Nat Commun ; 10(1): 3, 2019 01 02.
Article in English | MEDLINE | ID: mdl-30602774

ABSTRACT

In addition to their use in DNA sequencing, ultrathin nanopore membranes have potential applications in detecting topological variations in deoxyribonucleic acid (DNA). This is due to the fact that when topologically edited DNA molecules, driven by electrophoretic forces, translocate through a narrow orifice, the transient residence of edited segments inside the orifice modulates the ionic flow. Here we utilize two programmable barcoding methods based on base-pairing, namely forming a gap in dsDNA and creating protrusion sites in ssDNA for generating a hybrid DNA complex. We integrate a discriminative noise analysis for ds and ss DNA topologies into the threshold detection, resulting in improved multi-level signal detection and consequent extraction of reliable information about topological variations. Moreover, the positional information of the barcode along the template sequence can be determined unambiguously. All methods may be further modified to detect nicks in DNA, and thereby detect DNA damage and repair sites.


Subjects
DNA Barcoding, Taxonomic/methods, DNA/chemistry, Disulfides, Molybdenum, Nanopores
20.
Sci Rep ; 7(1): 5011, 2017 07 10.
Article in English | MEDLINE | ID: mdl-28694453

ABSTRACT

DNA-based data storage is an emerging nonvolatile memory technology of potentially unprecedented density, durability, and replication efficiency. The basic system implementation steps include synthesizing DNA strings that contain user information and subsequently retrieving them via high-throughput sequencing technologies. Existing architectures enable reading and writing but do not offer random-access and error-free data recovery from low-cost, portable devices, which is crucial for making the storage technology competitive with classical recorders. Here we show for the first time that a portable, random-access platform may be implemented in practice using nanopore sequencers. The novelty of our approach is to design an integrated processing pipeline that encodes data to avoid costly synthesis and sequencing errors, enables random access through addressing, and leverages efficient portable sequencing via new iterative alignment and deletion error-correcting codes. Our work represents the only known random access DNA-based data storage system that uses error-prone nanopore sequencers, while still producing error-free readouts with the highest reported information rate/density. As such, it represents a crucial step towards practical employment of DNA molecules as storage media.


Subjects
DNA/genetics, High-Throughput Nucleotide Sequencing/instrumentation, Sequence Analysis, DNA/instrumentation, Algorithms, Databases, Nucleic Acid, Information Storage and Retrieval, Nanopores