ABSTRACT
With the exponential growth of digital data, there is a pressing need for innovative storage media and techniques. DNA molecules, due to their stability, storage capacity, and density, offer a promising solution for information storage. However, DNA storage also faces numerous challenges, such as complex biochemical constraints and encoding efficiency. This paper presents Explorer, a high-efficiency DNA coding algorithm based on the De Bruijn graph, which leverages its capability to characterize local sequences. Explorer enables coding under various biochemical constraints, such as homopolymers, GC content, and undesired motifs. This paper also introduces Codeformer, a fast decoding algorithm based on the transformer architecture, to further enhance decoding efficiency. Numerical experiments indicate that, compared with other advanced algorithms, Explorer not only achieves stable encoding and decoding under various biochemical constraints but also increases the encoding efficiency and bit rate by more than 10%. Additionally, Codeformer demonstrates the ability to efficiently decode large quantities of DNA sequences. Under different parameter settings, its decoding efficiency exceeds that of traditional algorithms by more than two-fold. When Codeformer is combined with a Reed-Solomon code, its decoding accuracy exceeds 99%, making it a good choice for high-speed decoding applications. These advancements are expected to contribute to the development of DNA-based data storage systems and the broader exploration of DNA as a novel information storage medium.
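The De Bruijn-graph idea of choosing each next base from the locally valid continuations can be sketched in a few lines. This is not Explorer's actual construction; the run limit, the forbidden motif, and the bit-to-base mapping below are invented for illustration:

```python
# Illustrative sketch only: encode a bit string by always branching among the
# bases that keep the sequence free of long homopolymers and a forbidden motif.
# MAX_RUN and FORBIDDEN are made-up example parameters, not Explorer's.
import math

BASES = "ACGT"
MAX_RUN = 3                # allow at most 3 identical bases in a row
FORBIDDEN = {"GGC"}        # example undesired motif

def allowed_next(seq):
    """Bases that may extend seq without violating the constraints."""
    out = []
    for b in BASES:
        cand = seq + b
        if cand.endswith(b * (MAX_RUN + 1)):      # homopolymer too long
            continue
        if any(cand.endswith(m) for m in FORBIDDEN):
            continue
        out.append(b)
    return out

def encode(bits):
    """Map a '0'/'1' string to DNA, consuming floor(log2(#choices)) bits per step."""
    seq, i = "", 0
    while i < len(bits):
        choices = allowed_next(seq)
        take = int(math.log2(len(choices)))       # bits consumed this step
        idx = int(bits[i:i + take] or "0", 2)
        seq += choices[idx]
        i += take
    return seq
```

On this toy encoder, unconstrained positions carry 2 bits per base and constrained positions fewer, which is the trade-off the abstract's efficiency figures quantify.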
Subjects
Algorithms, DNA, DNA/genetics, DNA/chemistry, Software, DNA Sequence Analysis/methods, Computational Biology/methods
ABSTRACT
DNA molecules as storage media are characterized by high encoding density and low energy consumption, making DNA storage a highly promising storage method. However, DNA storage has shortcomings, especially when storing multimedia data: image reconstruction fails when address errors occur, resulting in complete data loss. Therefore, we propose a parity encoding and local mean iteration (PELMI) scheme to achieve robust DNA storage of images. The proposed parity encoding scheme satisfies the common biochemical constraints of DNA sequences and limits the undesired-motif content. It accounts for the varying weights of pixel bits at different positions in the binary data, thus optimizing the utilization of Reed-Solomon error correction. Then, for lost and erroneous sequences, data supplementation and local mean iteration are employed to enhance robustness. The encoding results show that the undesired-motif content is reduced by 23%-50% compared with representative schemes, which improves sequence stability. PELMI achieves image reconstruction under general errors (insertion, deletion, substitution) and enhances DNA sequence quality. In particular, under a 1% error rate, compared with other advanced encoding schemes, the peak signal-to-noise ratio and the multiscale structural similarity increased by 10%-13% and 46.8%-122%, respectively, and the mean squared error decreased by 113%-127%. This demonstrates that the reconstructed images had better clarity, fidelity, and similarity in structure, texture, and detail. In summary, PELMI ensures robustness and stability of image storage in DNA and achieves relatively high-quality image reconstruction under general errors.
Subjects
Algorithms, DNA, DNA/genetics, Computer-Assisted Image Processing/methods, Information Storage and Retrieval/methods
ABSTRACT
DNA is an incredibly dense storage medium for digital data. However, computing on the stored information is expensive and slow, requiring rounds of sequencing, in silico computation, and DNA synthesis. Prior work on accessing and modifying data using DNA hybridization or enzymatic reactions had limited computation capabilities. Inspired by the computational power of "DNA strand displacement," we augment DNA storage with "in-memory" molecular computation using strand displacement reactions to algorithmically modify data in a parallel manner. We show programs for binary counting and Turing universal cellular automaton Rule 110, the latter of which is, in principle, capable of implementing any computer algorithm. Information is stored in the nicks of DNA, and a secondary sequence-level encoding allows high-throughput sequencing-based readout. We conducted multiple rounds of computation on 4-bit data registers, as well as random access of data (selective access and erasure). We demonstrate that large strand displacement cascades with 244 distinct strand exchanges (sequential and in parallel) can use naturally occurring DNA sequence from M13 bacteriophage without stringent sequence design, which has the potential to improve the scale of computation and decrease cost. Our work merges DNA storage and DNA computing, setting the foundation of entirely molecular algorithms for parallel manipulation of digital information preserved in DNA.
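Rule 110 is fully specified by an eight-entry truth table, so its software analogue is compact. The sketch below is a conventional simulation under an assumed circular register, not the molecular strand-displacement implementation described above:

```python
# Rule 110 lookup table: new cell value from (left, center, right) neighbors.
RULE110 = {(1, 1, 1): 0, (1, 1, 0): 1, (1, 0, 1): 1, (1, 0, 0): 0,
           (0, 1, 1): 1, (0, 1, 0): 1, (0, 0, 1): 1, (0, 0, 0): 0}

def rule110_step(cells):
    """One synchronous update of a circular register of 0/1 cells."""
    n = len(cells)
    return [RULE110[(cells[(i - 1) % n], cells[i], cells[(i + 1) % n])]
            for i in range(n)]
```

For example, one step takes the 4-bit register [0, 0, 0, 1] to [0, 0, 1, 1]; iterating this update is what makes Rule 110 capable, in principle, of universal computation.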
Subjects
Molecular Computers, DNA, DNA Replication, Algorithms, Bacteriophage M13
ABSTRACT
BACKGROUND: Efficient DNA-based storage systems offer substantial capacity and longevity at reduced costs, addressing anticipated data growth. However, encoding data into DNA sequences is limited by two key constraints: 1) a maximum of h consecutive identical bases (homopolymer constraint h), and 2) a GC ratio within [0.5 - c_GC, 0.5 + c_GC] (GC-content constraint c_GC). Sequencing or synthesis errors tend to increase when these constraints are violated. RESULTS: In this research, we address a pure source coding problem in the context of DNA storage, considering both homopolymer and GC-content constraints. We introduce a novel coding technique that adheres to these constraints while maintaining linear complexity for increased block lengths and achieving near-optimal rates. We demonstrate the effectiveness of the proposed method through experiments on both randomly generated data and existing files. For example, when h = 4 and c_GC = 0.05, the rate reached 1.988, close to the theoretical limit of 1.990. The associated code can be accessed at GitHub. CONCLUSION: We propose a variable-to-variable-length encoding method that does not rely on concatenating short predefined sequences and that achieves near-optimal rates.
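The homopolymer limit alone already pins down a rate ceiling close to the figures quoted above: counting quaternary strings whose runs never exceed h gives the recurrence a(n) = 3[a(n-1) + ... + a(n-h)], and the capacity in bits per base is log2 of its growth ratio. The sketch below is our own back-of-the-envelope check of that ceiling, ignoring the GC constraint:

```python
import math

def homopolymer_capacity(h=4, steps=200):
    """log2 growth rate of quaternary strings with runs of at most h bases."""
    a = [1.0] * h                      # seed values; only the ratio matters
    for _ in range(steps):
        a.append(3 * sum(a[-h:]))      # extend run-limited strings by one base
        a = [x / a[-1] for x in a]     # renormalize to avoid overflow
    return math.log2(1.0 / a[-2])      # growth ratio = a[n] / a[n-1]
```

For h = 4 this gives about 1.996 bits per base; adding the GC-content window lowers the limit toward the 1.990 quoted above.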
Subjects
Base Composition, DNA, DNA/chemistry, DNA Sequence Analysis/methods, Algorithms, Information Storage and Retrieval/methods
ABSTRACT
Conventional cryptographic methods rely on increased computational complexity to counteract the threat posed by growing computing power for sustainable protection. DNA cryptography circumvents this threat by leveraging complex DNA recognition to maintain information security. Specifically, DNA origami has been repurposed for cryptography, using programmable folding of the long scaffold strand carrying additional tagged strands for information encryption. Herein, a subtraction-based cryptographic strategy is presented that uses structural defects on DNA origami to contain encrypted information. Designated staple strands are removed from the staple pool with "hook" strands to create active defect sites on DNA origami for information encryption. These defects can be filled by incubating the structures with the intact pool of biotinylated staple strands, resulting in biotin patterns that can be used for protein-binding steganography. The yields of individual protein pixels reached over 91%, and self-correction codes are implemented to aid the information recovery. Furthermore, the encrypted organization of defective DNA origami structures is investigated to explore the potential of this method for scalable information storage. This method uses DNA origami to encrypt information in hidden structural features, utilizing subtraction for robust cryptography while ensuring the safety and recovery of data.
ABSTRACT
Deoxyribonucleic acid (DNA)-based data storage is a promising new storage technology with the advantages of high storage capacity and long storage time compared with traditional storage media. However, the synthesis and sequencing of DNA can randomly introduce many types of errors, which makes it more difficult to cluster DNA sequences to recover the stored information. Currently available DNA clustering algorithms target DNA sequences in the biological domain; they not only fail to adapt to the characteristics of sequences in DNA storage but also tend to be unacceptably time-consuming for the billions of DNA sequences involved. In this paper, we propose an efficient DNA clustering method termed Clover for DNA storage with linear computational complexity and low memory usage. Clover avoids computing the Levenshtein distance by using a tree structure for interval-specific retrieval. We argue through theoretical proofs that Clover has standard linear computational complexity and low space complexity. Experiments show that our method can cluster 10 million DNA sequences into 50,000 classes in 10 s with an accuracy rate of over 99%. Furthermore, we have successfully completed an unprecedented clustering of 10 billion DNA sequences on a single home computer, with the running time still scaling linearly. Clover is freely available at https://github.com/Guanjinqu/Clover.
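The key trick, replacing pairwise Levenshtein comparisons with tree-indexed lookup, can be caricatured in a few lines. Clover's actual data structure is more refined and error-tolerant; this illustrative sketch merely buckets reads by an exact shared prefix:

```python
# Illustrative only: cluster reads by an exact prefix key instead of computing
# O(n^2) pairwise Levenshtein distances. Clover's tree structure tolerates
# errors in the index region; an exact prefix is the simplest stand-in.
from collections import defaultdict

def cluster_by_prefix(reads, k=8):
    buckets = defaultdict(list)
    for r in reads:
        buckets[r[:k]].append(r)   # one pass, linear in the number of reads
    return list(buckets.values())
```

Reads sharing their first k bases land in the same cluster in a single linear pass, which is what makes throughput on billions of reads plausible.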
Subjects
Algorithms, Information Storage and Retrieval, Cluster Analysis, DNA/genetics, Sequence Analysis
ABSTRACT
Polymerase Chain Reaction (PCR) amplification is widely used for retrieving information from DNA storage. During the PCR amplification process, nonspecific pairing between the 3' end of the primer and the DNA sequence can cause cross-talk in the amplification reaction, leading to the generation of interfering sequences and reduced amplification accuracy. To address this issue, we propose an efficient coding algorithm for PCR amplification information retrieval (ECA-PCRAIR). This algorithm employs variable-length scanning and pruning optimization to construct a codebook that maximizes storage density while satisfying traditional biological constraints. Subsequently, a codeword search tree is constructed based on the primer library to optimize the codebook, and a variable-length interleaver is used for constraint detection and correction, thereby minimizing the likelihood of nonspecific pairing. Experimental results demonstrate that ECA-PCRAIR can reduce the probability of nonspecific pairing between the 3' end of the primer and the DNA sequence to 2-25%, enhancing the robustness of the DNA sequences. Additionally, ECA-PCRAIR achieves a storage density of 2.14-3.67 bits per nucleotide (bits/nt), significantly improving storage capacity.
Subjects
Algorithms, Polymerase Chain Reaction, Polymerase Chain Reaction/methods, DNA/genetics, Information Storage and Retrieval/methods, DNA Primers/genetics, Base Sequence
ABSTRACT
DNA molecules, as a storage medium, possess unique advantages. Not only does DNA storage exhibit significantly higher storage density than electromagnetic storage media, but it also features low energy consumption and extremely long storage times. However, the integration of DNA storage into daily life remains distant due to challenges such as low storage density in practice, high latency, and inevitable errors during the storage process. Therefore, this paper proposes constructing a DNA storage coding set based on the Levy Sooty Tern Optimization Algorithm (LSTOA) to achieve an efficient random-access DNA storage system. First, addressing the slow iteration speed and susceptibility to local optima of the Sooty Tern Optimization Algorithm (STOA), this paper introduces Levy flight operations and proposes the LSTOA. Second, utilizing the LSTOA, this paper constructs a DNA storage encoding set that facilitates random access while meeting combinatorial constraints. To demonstrate the performance of the LSTOA, this paper analyzes it on 13 benchmark test functions, showcasing its superiority. Furthermore, under the same combinatorial constraints, the LSTOA constructs larger DNA storage coding sets, effectively reducing the read-write latency and error rate of DNA storage.
ABSTRACT
Synchronization (insertions-deletions) errors are still a major challenge for reliable information retrieval in DNA storage. Unlike traditional error correction codes (ECC) that add redundancy in the stored information, multiple sequence alignment (MSA) solves this problem by searching the conserved subsequences. In this paper, we conduct a comprehensive simulation study on the error correction capability of a typical MSA algorithm, MAFFT. Our results reveal that its capability exhibits a phase transition when there are around 20% errors. Below this critical value, increasing sequencing depth can eventually allow it to approach complete recovery. Otherwise, its performance plateaus at some poor levels. Given a reasonable sequencing depth (≤ 70), MSA could achieve complete recovery in the low error regime, and effectively correct 90% of the errors in the medium error regime. In addition, MSA is robust to imperfect clustering. It could also be combined with other means such as ECC, repeated markers, or any other code constraints. Furthermore, by selecting an appropriate sequencing depth, this strategy could achieve an optimal trade-off between cost and reading speed. MSA could be a competitive alternative for future DNA storage.
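The recovery mechanism these simulations probe is, at heart, per-column voting over aligned reads. A minimal sketch follows, assuming substitution errors only; MAFFT's actual contribution is producing the alignment despite insertions and deletions:

```python
# Toy consensus caller: majority vote in each column of an alignment.
# Sequencing depth (number of reads) is what drives recovery accuracy.
from collections import Counter

def consensus(aligned_reads):
    """Return the per-column majority base across equal-length aligned reads."""
    return "".join(Counter(col).most_common(1)[0][0]
                   for col in zip(*aligned_reads))
```

With three reads of which any one column holds at most one error, the vote already restores the stored sequence; higher depth pushes recovery toward completeness in the low-error regime described above.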
Subjects
Algorithms, DNA, Sequence Alignment, DNA/genetics, Computer Simulation, DNA Sequence Analysis
ABSTRACT
With the informatization of social processes, the amount of related data has greatly increased, making traditional storage media unable to meet current requirements for data storage. Due to its high storage capacity and persistence, deoxyribonucleic acid (DNA) has been considered the most promising storage medium for solving the data storage problem. Synthesis is an important process in DNA storage, and low-quality DNA coding can increase errors during sequencing, which affects storage efficiency. To reduce errors caused by the poor stability of DNA sequences during storage, this paper proposes a method that uses double-matching and error-pairing constraints to improve the quality of the DNA coding set. First, the double-matching and error-pairing constraints are defined to address the problem that sequences with self-complementary structures in solution are prone to mismatches at the 3' end. In addition, two strategies are introduced into the arithmetic optimization algorithm: a random perturbation of the elementary function and a double adaptive weighting strategy. The resulting improved arithmetic optimization algorithm (IAOA) is used to construct DNA coding sets. Experimental results for the IAOA on 13 benchmark functions show a significant improvement in its exploration and exploitation capabilities over existing algorithms. Moreover, the IAOA is applied to DNA encoding design under both the traditional and the new constraints. The resulting DNA coding sets are tested to estimate their quality in terms of the number of hairpins and the melting temperature. The DNA storage coding sets constructed in this study improve the lower bound by 77.7% compared with existing algorithms. The DNA sequences in the storage sets show a reduction of 9.7-84.1% in melting temperature variance, and the hairpin structure ratio is reduced by 2.1-80%. The results indicate that the stability of the DNA coding sets is improved under the two proposed constraints compared with traditional constraints.
ABSTRACT
Due to its high information density, DNA is very attractive as a data storage system. However, a major obstacle is the high cost and long turnaround time for retrieving DNA data with next-generation sequencing. Herein, the use of a microfluidic very large-scale integration (mVLSI) platform is described to perform highly parallel and rapid readout of data stored in DNA. Additionally, it is demonstrated that multi-state data encoded in DNA can be deciphered with on-chip melt-curve analysis, thereby further increasing the data content that can be analyzed. The pairing of mVLSI network architecture with exquisitely specific DNA recognition gives rise to a scalable platform for rapid DNA data reading.
ABSTRACT
BACKGROUND: Using DNA as a storage medium is appealing due to the information density and longevity of DNA, especially in the era of data explosion. A significant challenge in DNA data storage is dealing with the noise introduced in the channel and controlling the trade-off between the redundancy of error correction codes and the information storage density. As running DNA data storage experiments in vitro is still expensive and time-consuming, a simulation model is needed to systematically optimize the redundancy to combat the channel's particular noise structure. RESULTS: Here, we present DeSP, a systematic DNA storage error Simulation Pipeline, which simulates the errors generated in all DNA storage stages and systematically guides the optimization of encoding redundancy. It covers both sequence loss and within-sequence errors in the particular context of the data storage channel. With this model, we explained how errors are generated and passed through different stages to form the final sequencing results, analyzed the influence of error rate and sampling depth on final error rates, and demonstrated how to systematically optimize the redundancy design in silico with the simulation model. These error simulation results are consistent with the in vitro experiments. CONCLUSIONS: DeSP, implemented in Python, is freely available on GitHub ( https://github.com/WangLabTHU/DeSP ). It is a flexible framework for systematic error simulation in DNA storage and can be adapted to a wide range of experiment pipelines.
Subjects
DNA, Information Storage and Retrieval, Computer Simulation, DNA/genetics, DNA Sequence Analysis/methods
ABSTRACT
Large-scale DNA oligo pools are emerging as a novel material for a variety of advanced applications. However, GC content and length cause significant bias in the amplification of oligos. We systematically explored the amplification of an oligo pool comprising over ten thousand distinct strands with moderate GC content in the range of 35-65%. Unequal amplification of oligos led to an increased Gini index of the oligo distribution, and a few oligos greatly increased their proportion after 60 cycles of PCR. Notably, the significantly enriched oligos all had relatively high GC content. Further thermodynamic analysis demonstrated that high values of both GC content and Gibbs free energy could improve the replication of specific oligos during biased amplification. Therefore, this double-G (GC content and Gibbs free energy) driven replication advantage can be used as a guiding principle for sequence design in a variety of applications, particularly data storage.
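The Gini index used above as the bias measure is straightforward to compute from per-oligo read counts. The hedged sketch below uses the standard sorted-counts formula: 0 for a perfectly uniform pool, approaching 1 when a few oligos dominate:

```python
def gini(counts):
    """Gini index of an oligo-count distribution (counts need not be sorted)."""
    xs = sorted(counts)
    n, total = len(xs), sum(xs)
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n
```

A uniform pool scores 0, while a pool in which one oligo holds all the reads scores (n-1)/n, so a rising Gini index across PCR cycles directly reflects the enrichment bias described above.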
Subjects
DNA, Base Composition, DNA/genetics, Polymerase Chain Reaction, Thermodynamics
ABSTRACT
Traditional storage media have gradually become unable to meet the world's data storage needs, and one solution to this problem is DNA storage. However, errors easily arise in the subsequent sequencing and reading processes of DNA storage coding. To reduce error rates, a method to enhance the robustness of the DNA storage coding set is proposed. First, to reduce the likelihood of secondary structure in DNA coding sets, a repeat tandem sequence constraint is proposed, and an improved DTW distance constraint is proposed to address the issue that the traditional distance constraint cannot accurately evaluate non-specific hybridization between DNA sequences. Second, an algorithm that combines random opposition-based learning and an eddy jump strategy with the Aquila Optimizer (AO), called ROEAO, is proposed. Finally, the ROEAO algorithm is used to construct coding sets under the traditional constraints and the enhanced constraints, respectively. The quality of the two coding sets is evaluated by testing the number of hairpin structures and the melting temperature stability; the data show that the coding set constructed with ROEAO under the enhanced constraints achieves a larger lower bound while improving the coding quality.
ABSTRACT
BACKGROUND: DNA is a promising storage medium for high-density long-term digital data storage. Since DNA synthesis and sequencing are still relatively expensive tasks, the coding methods used to store digital data in DNA should correct errors and avoid unstable or error-prone DNA sequences. Near-optimal rateless erasure codes, also called fountain codes, are particularly interesting codes to realize high-capacity and low-error DNA storage systems, as shown by Erlich and Zielinski in their approach based on the Luby transform (LT) code. Since LT is the most basic fountain code, there is a large untapped potential for improvement in using near-optimal erasure codes for DNA storage. RESULTS: We present NOREC4DNA, a software framework to use, test, compare, and improve near-optimal rateless erasure codes (NORECs) for DNA storage systems. These codes can effectively be used to store digital information in DNA and cope with the restrictions of the DNA medium. Additionally, they can adapt to possible variable lengths of DNA strands and have nearly zero overhead. We describe the design and implementation of NOREC4DNA. Furthermore, we present experimental results demonstrating that NOREC4DNA can flexibly be used to evaluate the use of NORECs in DNA storage systems. In particular, we show that NORECs that apparently have not yet been used for DNA storage, such as Raptor and Online codes, can achieve significant improvements over LT codes that were used in previous work. NOREC4DNA is available on https://github.com/umr-ds/NOREC4DNA . CONCLUSION: NOREC4DNA is a flexible and extensible software framework for using, evaluating, and comparing NORECs for DNA storage systems.
Subjects
Algorithms, DNA, DNA/genetics, Information Storage and Retrieval, DNA Sequence Analysis, Software
ABSTRACT
A convenient method for the synthesis of first-generation PAMAM dendrimers based on thiacalix[4]arene has been developed for the first time. Three new PAMAM-calix-dendrimers with the macrocyclic core in cone, partial cone, and 1,3-alternate conformations were obtained in high yields. The interaction of the obtained compounds with salmon sperm DNA resulted in the formation of associates up to 200 nm in size, as shown by UV-Vis spectroscopy, DLS, and TEM. Circular dichroism (CD) showed that the structure of the DNA did not undergo significant changes upon binding. The PAMAM-calix-dendrimer based on the macrocycle in the cone conformation stabilized DNA and prevented its degradation.
Subjects
DNA/chemistry, DNA/metabolism, Dendrimers/chemistry, Phenols/chemistry, Sulfides/chemistry, Animals, Male, Molecular Conformation, Salmon, Spermatozoa/metabolism
ABSTRACT
Due to the properties of DNA data storage, the errors that occur in DNA strands make error correction an important and challenging task. In this paper, a new quaternary code design suitable for DNA storage is proposed to correct at most two consecutive deletion or insertion errors. Decoding algorithms for the proposed codes are presented for the cases of one and of two deletion or insertion errors, and it is proved that the proposed code can correct at most two consecutive errors. Moreover, lower and upper bounds on the cardinality of the proposed quaternary codes are evaluated, and the redundancy of the proposed code is shown to be roughly 2 log4(8n).
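To put the quoted redundancy in perspective, 2 log4(8n) grows only logarithmically with the block length n. A quick numerical reading of the formula (our own illustration, not the paper's code):

```python
import math

def redundancy_symbols(n):
    """Approximate redundancy, in quaternary symbols, of a length-n block."""
    return 2 * math.log(8 * n, 4)
```

For n = 32 this is 2 * log4(256) = 8 symbols, and doubling n adds only one more symbol, so the overhead fraction shrinks as blocks grow.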
ABSTRACT
Solid-state nanopores are powerful tools for reading the three-dimensional shape of molecules, allowing for the translation of molecular structure information into electric signals. Here, we show a high-resolution integrated nanopore system for identifying DNA nanostructures that has the capability of distinguishing attached short DNA hairpins with only a stem length difference of 8 bp along a DNA double strand named the DNA carrier. Using our platform, we can read up to 112 DNA hairpins with a separating distance of 114 bp attached on a DNA carrier that carries digital information. Our encoding strategy allows for the creation of a library of molecules with a size of up to 5 × 10^33 (2^112) that is only built from a few hundred types of base molecules for data storage and has the potential to be extended by linking multiple DNA carriers. Our platform provides a nanopore- and DNA nanostructure-based data storage method with convenient access and the potential for miniature-scale integration.
Subjects
DNA/chemistry, Information Storage and Retrieval/methods, Nanopores, Nanostructures/chemistry, Nanotechnology/methods, Base Sequence, Electricity, Gene Library, Nanopores/ultrastructure, Nanostructures/ultrastructure
ABSTRACT
The high density, large capacity, and long-term stability of DNA molecules make them an emerging storage medium that is especially suitable for the long-term storage of large datasets. The DNA sequences used in storage must satisfy relevant constraints to avoid nonspecific hybridization reactions, such as the No-runlength constraint, GC content, and the Hamming distance. In this work, a new nonlinear control parameter strategy and a random opposition-based learning strategy were used to improve the Harris hawks optimization algorithm, yielding the improved algorithm NOL-HHO, in order to prevent it from falling into local optima. Experimental testing was performed on 23 widely used benchmark functions, and the proposed algorithm was used to obtain better coding lower bounds for DNA storage. The results show that our algorithm better maintains a smooth transition between exploration and exploitation and has stronger global exploration capabilities than other algorithms. At the same time, the improvement of the lower bound directly affects the storage capacity and code rate, which promotes the further development of DNA storage technology.
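The three constraints named above translate directly into membership tests for candidate codewords. A hedged sketch follows; thresholds such as the GC window and the minimum Hamming distance are illustrative, not this work's settings:

```python
def no_runlength(seq):
    """No-runlength constraint: no two identical adjacent bases."""
    return all(a != b for a, b in zip(seq, seq[1:]))

def gc_ok(seq, low=0.4, high=0.6):
    """GC-content constraint with an illustrative window."""
    return low <= sum(c in "GC" for c in seq) / len(seq) <= high

def hamming(a, b):
    """Hamming distance between equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

def admissible(seq, code_set, d_min=3):
    """A candidate joins the coding set only if it passes all three tests."""
    return (no_runlength(seq) and gc_ok(seq)
            and all(hamming(seq, c) >= d_min for c in code_set))
```

The optimizer's job is then to find as many mutually admissible codewords as possible, which is exactly the lower bound being maximized above.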
Subjects
Artificial Intelligence, DNA/chemistry, Algorithms, Base Composition, Nucleic Acid Databases
ABSTRACT
Long-term storage of DNA is routine practice in biomedical research, diagnostics, and drug discovery. Periodic monitoring is important for early detection of changes in DNA quality and quantity. Existing methods include agarose gels, ultraviolet (UV) absorbance, fluorometric readings, and qPCR. However, these methods are either limited in sensitivity or depend on DNA standards, which face the same storage challenges. In this paper, we tested the state-of-the-art droplet digital PCR (ddPCR) technology, which can quantify absolute DNA copy number with no need for a standard curve. We found that ddPCR was very accurate in determining the level of a plasmid DNA standard and was sensitive to DNA loss due to degradation or adsorption. With the ddPCR technology, we found a gradual process of DNA adsorption to several types of low-binding tubes, which had gone unnoticed before. Although modest, adsorption significantly affected the recovery of highly diluted DNA (<0.2 µg/mL), which could be rescued by the addition of carrier DNA. In conclusion, this paper not only demonstrates that ddPCR is an ideal method for monitoring DNA storage but also provides an effective approach to improving the recovery of highly diluted DNA, which may have broad implications in assay development, diagnostics, and forensic sciences.
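The standard-curve-free quantification rests on Poisson statistics over droplets; the sketch below shows the usual correction (a textbook formula, not code from this study), in which the mean copies per droplet follow from the fraction of droplets containing no template:

```python
import math

def copies_per_droplet(n_negative, n_total):
    """Poisson estimate: lambda = -ln(fraction of negative droplets)."""
    return -math.log(n_negative / n_total)
```

With half the droplets negative, the estimate is ln 2 ≈ 0.69 copies per droplet; multiplying by the droplet count and the dilution factor then yields the absolute copy number with no external standard.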