ABSTRACT
BACKGROUND: Efficient DNA-based storage systems offer substantial capacity and longevity at reduced costs, addressing anticipated data growth. However, encoding data into DNA sequences is limited by two key constraints: 1) a maximum of h consecutive identical bases (homopolymer constraint h), and 2) a GC ratio within [0.5 - c_GC, 0.5 + c_GC] (GC-content constraint c_GC). Sequencing or synthesis errors tend to increase when these constraints are violated. RESULTS: In this research, we address a pure source coding problem in the context of DNA storage, considering both homopolymer and GC-content constraints. We introduce a novel coding technique that adheres to these constraints while maintaining linear complexity for increasing block lengths and achieving near-optimal rates. We demonstrate the effectiveness of the proposed method through experiments on both randomly generated data and existing files. For example, when h = 4 and c_GC = 0.05, the rate reached 1.988, close to the theoretical limit of 1.990. The associated code can be accessed at GitHub. CONCLUSION: We propose a variable-to-variable-length encoding method that does not rely on concatenating short predefined sequences and achieves near-optimal rates.
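For concreteness, the two constraints can be checked mechanically; below is a minimal Python sketch, with function names, parameter defaults, and example sequences that are ours rather than the paper's (the paper's actual encoder is not shown).

```python
# Illustrative check of the two encoding constraints described above:
# at most h consecutive identical bases, and a GC ratio within
# [0.5 - c_gc, 0.5 + c_gc]. Names and examples are illustrative.

def max_homopolymer_run(seq: str) -> int:
    """Length of the longest run of identical bases in seq."""
    longest, run = 1, 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    return longest

def satisfies_constraints(seq: str, h: int = 4, c_gc: float = 0.05) -> bool:
    """True if seq obeys both the homopolymer and GC-content constraints."""
    gc_ratio = (seq.count("G") + seq.count("C")) / len(seq)
    return max_homopolymer_run(seq) <= h and abs(gc_ratio - 0.5) <= c_gc

print(satisfies_constraints("ACGTACGGCTAGCTAGGCTA"))  # True (GC = 0.55)
print(satisfies_constraints("AAAAACGT"))              # False: run of 5 A's
```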
Subjects
Base Composition, DNA, DNA/chemistry, Sequence Analysis, DNA/methods, Algorithms, Information Storage and Retrieval/methods
ABSTRACT
Conventional cryptographic methods rely on increased computational complexity to counteract the threat posed by growing computing power for sustainable protection. DNA cryptography circumvents this threat by leveraging complex DNA recognition to maintain information security. Specifically, DNA origami has been repurposed for cryptography, using programmable folding of the long scaffold strand carrying additional tagged strands for information encryption. Herein, a subtraction-based cryptographic strategy is presented that uses structural defects on DNA origami to carry encrypted information. Designated staple strands are removed from the staple pool with "hook" strands to create active defect sites on DNA origami for information encryption. These defects can be filled by incubating the structures with the intact pool of biotinylated staple strands, resulting in biotin patterns that can be used for protein-binding steganography. The yields of individual protein pixels reached over 91%, and self-correction codes are implemented to aid information recovery. Furthermore, the encrypted organization of defective DNA origami structures is investigated to explore the potential of this method for scalable information storage. This method uses DNA origami to encrypt information in hidden structural features, utilizing subtraction for robust cryptography while ensuring the safety and recovery of data.
ABSTRACT
DNA molecules, as a storage medium, possess unique advantages. Not only does DNA storage exhibit significantly higher storage density compared to electromagnetic storage media, but it also features low energy consumption and extremely long storage times. However, the integration of DNA storage into daily life remains distant due to challenges such as low storage density, high latency, and inevitable errors during the storage process. Therefore, this paper proposes constructing a DNA storage coding set based on the Lévy Sooty Tern Optimization Algorithm (LSTOA) to achieve an efficient random-access DNA storage system. Firstly, addressing the slow iteration speed and susceptibility to local optima of the Sooty Tern Optimization Algorithm (STOA), this paper introduces Lévy flight operations and proposes the LSTOA. Secondly, utilizing the LSTOA, this paper constructs a DNA storage encoding set that facilitates random access while meeting combinatorial constraints. To demonstrate the coding performance of the LSTOA, this paper analyzes it on 13 benchmark test functions, showcasing its superior performance. Furthermore, under the same combinatorial constraints, the LSTOA constructs larger DNA storage coding sets, effectively reducing the read-write latency and error rate of DNA storage.
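For illustration, Lévy-flight steps are commonly drawn with Mantegna's algorithm; the sketch below shows the kind of heavy-tailed perturbation LSTOA adds to STOA, with beta and the step scale as illustrative assumptions (the paper's exact operator may differ).

```python
# A minimal sketch of a Lévy-flight perturbation via Mantegna's algorithm.
import math
import random

def levy_step(beta: float = 1.5) -> float:
    """Draw one Lévy-distributed step using Mantegna's algorithm."""
    sigma_u = (math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
               / (math.gamma((1 + beta) / 2) * beta
                  * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = random.gauss(0.0, sigma_u)
    v = random.gauss(0.0, 1.0)
    return u / abs(v) ** (1 / beta)

# Perturb a candidate solution: mostly small moves, with occasional long
# jumps that help the search escape local optima.
position = [0.3, -1.2, 0.8]
position = [x + 0.01 * levy_step() for x in position]
print(position)
```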
ABSTRACT
DNA amplification technologies have significantly advanced biotechnology, particularly in DNA storage. However, adaptation of these technologies to DNA storage poses substantial challenges. Key bottlenecks include achieving high throughput to manage large data sets, ensuring rapid and efficient DNA amplification, and minimizing bias to maintain data fidelity. This perspective begins with an overview of natural and artificial amplification strategies, such as polymerase chain reaction and isothermal amplification, highlighting their respective advantages and limitations. It then explores the prospective applications of these techniques in DNA storage, emphasizing the need to optimize protocols for scalability and robustness in handling diverse digital data. Concurrently, we identify promising avenues, including advancements in enzymatic processes and novel amplification methodologies, poised to mitigate existing constraints and propel the field forward. Ultimately, we provide insights into how to utilize advanced DNA amplification strategies poised to revolutionize the efficiency and feasibility of data storage, ushering in enhanced approaches to data retrieval in the digital age.
Subjects
DNA, Nucleic Acid Amplification Techniques, Nucleic Acid Amplification Techniques/methods, DNA/chemistry, DNA/genetics, Information Storage and Retrieval/methods, Polymerase Chain Reaction/methods, Humans
ABSTRACT
DNA molecules as storage media are characterized by high encoding density and low energy consumption, making DNA storage a highly promising storage method. However, DNA storage has shortcomings, especially when storing multimedia data, wherein image reconstruction fails when address errors occur, resulting in complete data loss. Therefore, we propose a parity encoding and local mean iteration (PELMI) scheme to achieve robust DNA storage of images. The proposed parity encoding scheme satisfies the common biochemical constraints of DNA sequences and limits undesired motif content. It addresses the varying weights of pixels at different positions in the binary data, thus optimizing the utilization of Reed-Solomon error correction. For lost and erroneous sequences, data supplementation and local mean iteration are then employed to enhance robustness. The encoding results show that the undesired motif content is reduced by 23%-50% compared with representative schemes, which improves sequence stability. PELMI achieves image reconstruction under general errors (insertion, deletion, substitution) and enhances the quality of the DNA sequences. In particular, under 1% error, compared with other advanced encoding schemes, the peak signal-to-noise ratio and the multiscale structural similarity metric increased by 10%-13% and 46.8%-122%, respectively, and the mean squared error decreased by 113%-127%. This demonstrates that the reconstructed images had better clarity, fidelity, and similarity in structure, texture, and detail. In summary, PELMI ensures robustness and stability of image storage in DNA and achieves relatively high-quality image reconstruction under general errors.
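As a toy illustration of parity protection over DNA codewords (PELMI's actual parity encoding and its Reed-Solomon integration are more elaborate), the following sketch appends one parity base that detects any single substitution.

```python
# Toy parity protection for a DNA codeword; purely illustrative, not PELMI.
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = {v: k for k, v in BASE_TO_BITS.items()}

def add_parity_base(codeword: str) -> str:
    """Append one base encoding the XOR of all 2-bit base values."""
    parity = 0
    for base in codeword:
        parity ^= BASE_TO_BITS[base]
    return codeword + BITS_TO_BASE[parity]

def parity_ok(protected: str) -> bool:
    """Any single substitution changes the XOR, so the check fails."""
    parity = 0
    for base in protected:
        parity ^= BASE_TO_BITS[base]
    return parity == 0

word = add_parity_base("ACGTGC")   # -> "ACGTGCT"
assert parity_ok(word)
assert not parity_ok("T" + word[1:])  # substitution detected
```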
Subjects
Algorithms, DNA, DNA/genetics, Image Processing, Computer-Assisted/methods, Information Storage and Retrieval/methods
ABSTRACT
Recent advancements in synthesis and sequencing techniques have made deoxyribonucleic acid (DNA) a promising alternative for next-generation digital storage. As it approaches practical application, ensuring the security of DNA-stored information has become a critical problem. Deniable encryption allows the decryption of different information from the same ciphertext, ensuring that "plausible" fake information can be provided when users are coerced to reveal the real information. In this paper, we propose a deniable encryption method that uniquely leverages DNA noise channels. Specifically, true and fake messages are encrypted by two similar modulation carriers and subsequently obfuscated by inherent errors. Experimental results demonstrate that our method not only conceals true information among fake ones indistinguishably but also allows both the coercive adversary and the legitimate receiver to decrypt their intended information accurately. Further security analysis validates the resistance of our method against various typical attacks. Compared with conventional DNA cryptography methods based on complex biological operations, our method offers superior practicality and reliability, positioning it as an ideal solution for data encryption in future large-scale DNA storage applications.
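The deniability principle, one ciphertext yielding either the true or the fake message depending on which key is revealed, can be sketched with XOR pads; the paper's construction instead relies on modulation carriers and DNA channel noise, so this is purely conceptual.

```python
# Conceptual sketch of deniable decryption with one-time pads.
import os

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

true_msg = b"meet at dawn"
fake_msg = b"nothing here"            # same length as the true message

true_key = os.urandom(len(true_msg))
ciphertext = xor(true_msg, true_key)
fake_key = xor(ciphertext, fake_msg)  # crafted so the fake message "decrypts"

assert xor(ciphertext, true_key) == true_msg  # legitimate receiver's view
assert xor(ciphertext, fake_key) == fake_msg  # view revealed under coercion
```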
Subjects
DNA, Information Storage and Retrieval/methods
ABSTRACT
With digital transformation and the general application of new technologies, data storage is facing new challenges with the demand for high-density loading of massive information. In response, DNA storage technology has emerged as a promising research direction. Efficient and reliable data retrieval is critical for DNA storage, and the development of random access technology plays a key role in its practicality and reliability. However, achieving fast and accurate random access functions has proven difficult for existing DNA storage efforts, which limits its practical applications in industry. In this review, we summarize the recent advances in DNA storage technology that enable random access functionality, as well as the challenges that need to be overcome and the current solutions. This review aims to help researchers in the field of DNA storage better understand the importance of the random access step and its impact on the overall development of DNA storage. Furthermore, the remaining challenges and future research trends in random access technology of DNA storage are discussed, with the goal of providing a solid foundation for achieving random access in DNA storage under large-scale data conditions.
Subjects
DNA, Information Storage and Retrieval, DNA/chemistry, Information Storage and Retrieval/methods, Humans
ABSTRACT
With the exponential growth of digital data, there is a pressing need for innovative storage media and techniques. DNA molecules, due to their stability, storage capacity, and density, offer a promising solution for information storage. However, DNA storage also faces numerous challenges, such as complex biochemical constraints and encoding efficiency. This paper presents Explorer, a high-efficiency DNA coding algorithm based on the De Bruijn graph, which leverages its capability to characterize local sequences. Explorer enables coding under various biochemical constraints, such as homopolymers, GC content, and undesired motifs. This paper also introduces Codeformer, a fast decoding algorithm based on the transformer architecture, to further enhance decoding efficiency. Numerical experiments indicate that, compared with other advanced algorithms, Explorer not only achieves stable encoding and decoding under various biochemical constraints but also increases the encoding efficiency and bit rate by more than 10%. Additionally, Codeformer demonstrates the ability to efficiently decode large quantities of DNA sequences. Under different parameter settings, its decoding efficiency exceeds that of traditional algorithms by more than two-fold. When Codeformer is combined with Reed-Solomon code, its decoding accuracy exceeds 99%, making it a good choice for high-speed decoding applications. These advancements are expected to contribute to the development of DNA-based data storage systems and the broader exploration of DNA as a novel information storage medium.
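A minimal sketch of how a De Bruijn graph characterizes local sequences: nodes are k-mers and an edge exists only when the (k+1)-mer it spells violates no constraint. Here k, the no-3-base-homopolymer rule, and all names are illustrative assumptions; Explorer's actual graph and traversal are more sophisticated.

```python
# Build a constrained De Bruijn graph over 2-mers: an edge (kmer -> base)
# is kept only if the 3-mer window it spells contains no 3-base homopolymer.
from itertools import product

K = 2
BASES = "ACGT"
nodes = ["".join(p) for p in product(BASES, repeat=K)]

def edge_allowed(kmer: str, base: str) -> bool:
    window = kmer + base                      # the (k+1)-mer being spelled
    return not any(window[i] == window[i + 1] == window[i + 2]
                   for i in range(len(window) - 2))

graph = {n: [b for b in BASES if edge_allowed(n, b)] for n in nodes}
print(graph["AA"])   # ['C', 'G', 'T'] -- extending with 'A' would give 'AAA'
print(graph["AC"])   # ['A', 'C', 'G', 'T']
```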
Subjects
Algorithms, DNA, DNA/genetics, DNA/chemistry, Software, Sequence Analysis, DNA/methods, Computational Biology/methods
ABSTRACT
The exponential growth in data volume has necessitated the adoption of alternative storage solutions, and DNA storage stands out as the most promising. However, the exorbitant costs associated with synthesis and sequencing have impeded its development. Pre-compressing the data is recognized as one of the most effective approaches for reducing storage costs. However, different compression methods yield varying compression ratios for the same file, and compressing a large number of files with a single method may not achieve the maximum compression ratio. This study proposes a multi-file dynamic compression method based on machine-learning classification algorithms that selects the appropriate compression method for each file to minimize the amount of data stored in DNA. Firstly, four different compression methods are applied to the collected files. Subsequently, the optimal compression method is selected as the label, and the file type and size are used as features; these are fed into seven machine-learning classification algorithms for training. The results demonstrate that k-nearest neighbor outperforms the other machine-learning algorithms on the validation and test sets most of the time, achieving an accuracy rate of over 85% with less volatility. Additionally, a compression rate of 30.85% can be achieved with the k-nearest-neighbor model, more than 4.5% better than the traditional single-method compression, resulting in significant cost savings for DNA storage in the range of $0.48 to $3 billion/TB. In comparison to traditional compression, the multi-file dynamic compression method demonstrates a more significant compression effect when compressing multiple files. Therefore, it can considerably decrease the cost of DNA storage and facilitate the widespread implementation of DNA storage technology.
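The label-generation step described above, compressing each file with several codecs and recording the winner, can be sketched with standard-library compressors; the study's four methods and feature engineering are not specified here, so the codec set below is an assumption.

```python
# Compress a file with several codecs and record which one wins; the winner
# becomes the training label, paired with file type/size features.
import bz2
import lzma
import zlib

CODECS = {"zlib": zlib.compress, "bz2": bz2.compress, "lzma": lzma.compress}

def best_codec(data: bytes) -> tuple[str, float]:
    """Return the codec with the smallest output and its compression ratio."""
    sizes = {name: len(fn(data)) for name, fn in CODECS.items()}
    winner = min(sizes, key=sizes.get)
    return winner, sizes[winner] / len(data)

sample = b"ACGT" * 1000
print(best_codec(sample))  # winners vary with file content and size
```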
ABSTRACT
Polymerase Chain Reaction (PCR) amplification is widely used for retrieving information from DNA storage. During the PCR amplification process, nonspecific pairing between the 3' end of the primer and the DNA sequence can cause cross-talk in the amplification reaction, leading to the generation of interfering sequences and reduced amplification accuracy. To address this issue, we propose an efficient coding algorithm for PCR amplification information retrieval (ECA-PCRAIR). This algorithm employs variable-length scanning and pruning optimization to construct a codebook that maximizes storage density while satisfying traditional biological constraints. Subsequently, a codeword search tree is constructed based on the primer library to optimize the codebook, and a variable-length interleaver is used for constraint detection and correction, thereby minimizing the likelihood of nonspecific pairing. Experimental results demonstrate that ECA-PCRAIR can reduce the probability of nonspecific pairing between the 3' end of the primer and the DNA sequence to 2-25%, enhancing the robustness of the DNA sequences. Additionally, ECA-PCRAIR achieves a storage density of 2.14-3.67 bits per nucleotide (bits/nt), significantly improving storage capacity.
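A sketch of the kind of 3'-end cross-talk test that motivates this codebook optimization: flag payload sequences containing a binding site for a primer's 3' tail. The window length and exact-match criterion are our simplifications, not ECA-PCRAIR's actual detector.

```python
# Flag payload sequences that contain a perfect binding site for the last
# `window` bases of a primer, since 3'-end pairing seeds mispriming.
COMP = str.maketrans("ACGT", "TGCA")

def revcomp(seq: str) -> str:
    return seq.translate(COMP)[::-1]

def three_prime_crosstalk(primer: str, payload: str, window: int = 6) -> bool:
    """True if the primer's 3' end can pair somewhere inside the payload."""
    tail = primer[-window:]               # 3'-terminal bases of the primer
    return revcomp(tail) in payload       # payload site the tail would bind

print(three_prime_crosstalk("ACGGATTCAG", "TTTCTGAATTTT"))  # True
print(three_prime_crosstalk("ACGGATTCAG", "TTTTTTTTTTTT"))  # False
```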
Subjects
Algorithms, Polymerase Chain Reaction, Polymerase Chain Reaction/methods, DNA/genetics, Information Storage and Retrieval/methods, DNA Primers/genetics, Base Sequence
ABSTRACT
Robust encapsulation and controllable release of biomolecules have wide biomedical applications ranging from biosensing and drug delivery to information storage. However, conventional biomolecule encapsulation strategies suffer from complicated operations, optical instability, and difficult decapsulation. Here, we report a simple, robust, and solvent-free biomolecule encapsulation strategy based on gallium liquid metal featuring low-temperature phase transition, self-healing, highly hermetic sealing, and intrinsic resistance to optical damage. We sandwiched the biomolecules between solid gallium films followed by low-temperature welding of the films for direct sealing. The gallium can not only protect DNA and enzymes from various physical and chemical damages but also allow the on-demand release of biomolecules by applying vibration to break the liquid gallium. We demonstrated that a DNA-encoded image file can be recovered with up to 99.9% sequence retention after an accelerated aging test. We also showed the practical applications of the controllable release of bioreagents in a one-pot RPA-CRISPR/Cas12a reaction for SARS-CoV-2 screening with a low detection limit of 10 copies within 40 min. This work may facilitate the development of robust and stimuli-responsive biomolecule capsules by using low-melting metals for biotechnology.
Subjects
Biosensing Techniques, Phase Transition, SARS-CoV-2, Biosensing Techniques/methods, SARS-CoV-2/isolation & purification, COVID-19/virology, Gallium/chemistry, Humans, DNA/chemistry, CRISPR-Cas Systems, Capsules/chemistry
ABSTRACT
Today's digital data storage systems typically offer advanced data recovery solutions to address the problem of catastrophic data loss, such as software-based disk sector analysis or physical-level data retrieval methods for conventional hard disk drives. However, DNA-based data storage currently relies solely on the inherent error correction properties of the methods used to encode digital data into strands of DNA. Any error that cannot be corrected utilizing the redundancy added by DNA encoding methods results in permanent data loss. To provide data recovery for DNA storage systems, we present a method to automatically reconstruct corrupted or missing data stored in DNA using fountain codes. Our method exploits the relationships between packets encoded with fountain codes to identify and rectify corrupted or lost data. Furthermore, we present file type-specific and content-based data recovery methods for three file types, illustrating how a fusion of fountain encoding-specific redundancy and knowledge about the data can effectively recover information in a corrupted DNA storage system, both in an automatic and in a guided manual manner. To demonstrate our approach, we introduce DR4DNA, a software toolkit that contains all methods presented. We evaluate DR4DNA using both in-silico and in-vitro experiments.
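The packet relationships that fountain codes induce can be sketched with a peeling decoder over XOR packets; DR4DNA builds on this mechanism and adds file-type-specific and content-based recovery not shown in this sketch.

```python
# Minimal peeling decoder over XOR packets. Each packet is
# (set_of_block_ids, XOR_of_those_blocks); real LT/fountain codes draw the
# id sets from a degree distribution.
def peel(packets, num_blocks):
    blocks = [None] * num_blocks
    packets = [(set(ids), bytes(payload)) for ids, payload in packets]
    progress = True
    while progress:
        progress = False
        for ids, payload in packets:
            pending = [i for i in ids if blocks[i] is None]
            if len(pending) == 1:                     # degree-one packet
                solved = payload
                for i in ids:                         # strip known blocks
                    if blocks[i] is not None:
                        solved = bytes(a ^ b
                                       for a, b in zip(solved, blocks[i]))
                blocks[pending[0]] = solved
                progress = True
    return blocks

b0, b1 = b"AB", b"CD"
xor01 = bytes(a ^ b for a, b in zip(b0, b1))
# Block 1 was lost; it is rebuilt from block 0 and the combined packet.
print(peel([({0}, b0), ({0, 1}, xor01)], 2))  # [b'AB', b'CD']
```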
ABSTRACT
Environmental DNA (eDNA) workflows contain many familiar molecular-lab techniques but also employ several unique methodologies. When working with eDNA, it is essential to avoid contamination from the point of collection through preservation and to select a meaningful negative control. As eDNA can be obtained from a variety of samples and habitats (e.g., soil, water, air, or tissue), protocols will vary depending on usage. Samples may require additional steps to dilute, block, or remove inhibitors, or to physically break up samples or filters. Thereafter, standard DNA isolation techniques (kit-based or phenol:chloroform:isoamyl [PCI]) are employed. Once DNA is extracted, it is typically quantified using a fluorometer. Yields vary greatly but are important to know prior to amplification of the gene(s) of interest. Long-term storage of both the sampled material and the extracted DNA is encouraged, as it provides a backup for spilled/contaminated samples, lost data, reanalysis, and future studies using newer technology. Storage in a freezer is often ideal; however, some storage buffers (e.g., Longmire's) require that filters or swabs be kept at room temperature to prevent precipitation of buffer-related solutes. These baseline methods for eDNA isolation, validation, and preservation are detailed in this protocol chapter. In addition, we outline a cost-effective homebrew extraction protocol optimized to extract eDNA.
Subjects
Environmental DNA, Environmental DNA/isolation & purification, Environmental DNA/analysis, Environmental DNA/genetics, Preservation, Biological/methods, Specimen Handling/methods
ABSTRACT
In the absence of a DNA template, the ab initio production of long double-stranded DNA molecules of predefined sequences is particularly challenging. The DNA synthesis step remains a bottleneck for many applications such as functional assessment of ancestral genes, analysis of alternative splicing or DNA-based data storage. In this report we propose a fully in vitro protocol to generate very long double-stranded DNA molecules starting from commercially available short DNA blocks in less than 3 days using Golden Gate assembly. This innovative application allowed us to streamline the process to produce a 24 kb-long DNA molecule storing part of the Declaration of the Rights of Man and of the Citizen of 1789. The DNA molecule produced can be readily cloned into a suitable host/vector system for amplification and selection.
Subjects
DNA, DNA/genetics, DNA/chemistry, Information Storage and Retrieval/methods, Humans, Base Sequence/genetics, Cloning, Molecular/methods
ABSTRACT
The characterization of DNA methylation patterns to identify epigenetic markers for complex human diseases is an important and rapidly evolving part of biomedical research. DNA samples collected and stored in clinical biobanks over the past years are an important source for future epigenetic studies. Isolated gDNA is considered stable when stored at low temperatures for several years. However, the effect of multiple use, and the associated repeated thawing, of long-term stored DNA samples on DNA methylation patterns has not yet been investigated. In this study, we examined the influence of up to 10 freeze-thaw cycles on global DNA methylation by comparing genome-wide methylation profiles. DNA samples from 19 healthy volunteers were either frozen at -80°C or subjected to up to 10 freeze-thaw cycles. Genome-wide DNA methylation was analyzed after 0, 1, 3, 5, or 10 thaw cycles using the Illumina Infinium MethylationEPIC BeadChip. Evaluation of the global DNA methylation profile by beta-value density plots and multidimensional scaling plots revealed the expected clear participant-dependent variability but very low variability depending on the freeze-thaw cycles. Accordingly, the statistical analyses detected no significant difference at any of the methylated cytosine/guanine sites studied. Our results suggest that long-term frozen DNA samples are still suitable for epigenetic studies after multiple thaw cycles.
Assuntos
Metilação de DNA , DNA , Humanos , Congelamento , DNA/genética , Voluntários Saudáveis , GenômicaRESUMO
Due to its high information density, DNA is very attractive as a data storage system. However, a major obstacle is the high cost and long turnaround time for retrieving DNA data with next-generation sequencing. Herein, the use of a microfluidic very large-scale integration (mVLSI) platform is described to perform highly parallel and rapid readout of data stored in DNA. Additionally, it is demonstrated that multi-state data encoded in DNA can be deciphered with on-chip melt-curve analysis, thereby further increasing the data content that can be analyzed. The pairing of mVLSI network architecture with exquisitely specific DNA recognition gives rise to a scalable platform for rapid DNA data reading.
ABSTRACT
BACKGROUND: In single-stranded DNAs/RNAs, secondary structures are very common, especially in long sequences. It has been recognized that a high degree of secondary structure in DNA sequences can interfere with the correct writing and reading of information in DNA storage. However, how to circumvent this side effect is seldom studied. METHOD: As the degree of secondary structure of DNA sequences is closely related to the magnitude of the free energy released in the complicated folding process, we first investigate the free-energy distribution at different encoding lengths based on randomly generated DNA sequences. Then, we construct a bidirectional long short-term memory (BiLSTM)-attention deep learning model to predict the free energy of sequences. RESULTS: Our simulation results indicate that the free energy of DNA sequences at a specific length follows a right-skewed distribution and that the mean increases with length. Given a tolerable free-energy threshold of 20 kcal/mol, we could control the ratio of serious secondary structures in the encoding sequences to within a 1% significance level by selecting a feasible encoding length of 100 nt. Compared with traditional deep learning models, the proposed model achieves better prediction performance in both the mean relative error (MRE) and the coefficient of determination (R2), reaching MRE = 0.109 and R2 = 0.918 in the simulation experiment. The combination of the BiLSTM and attention modules can handle long-term dependencies and capture base-pairing features. Furthermore, the prediction has linear time complexity, which is suitable for detecting sequences with severe secondary structures in future large-scale applications. Finally, 70 of 94 sequences could be screened out by predicted free energy on a real dataset, demonstrating that the proposed model can screen out highly suspicious sequences that are prone to produce more errors and low sequencing copies.
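A minimal PyTorch sketch of a BiLSTM-attention regressor of this kind is given below; the layer sizes, one-hot input encoding, and attention pooling are our assumptions rather than the authors' exact architecture.

```python
# Sketch of a BiLSTM-attention free-energy regressor; sizes are assumptions.
import torch
import torch.nn as nn

class BiLSTMAttention(nn.Module):
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=4, hidden_size=hidden,
                            batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)   # scores each position
        self.head = nn.Linear(2 * hidden, 1)   # free-energy regression

    def forward(self, x):                      # x: (batch, length, 4) one-hot
        h, _ = self.lstm(x)                    # (batch, length, 2*hidden)
        weights = torch.softmax(self.attn(h), dim=1)
        context = (weights * h).sum(dim=1)     # attention-pooled summary
        return self.head(context).squeeze(-1)  # predicted free energy

model = BiLSTMAttention()
batch = torch.randn(8, 100, 4)                 # 8 sequences of length 100 nt
print(model(batch).shape)                      # torch.Size([8])
```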
ABSTRACT
DNA, as the storage medium in organisms, can address the shortcomings of existing electromagnetic storage media, such as low information density, high maintenance power consumption, and short storage times. Current research on DNA storage mainly focuses on designing encoders that convert binary data into DNA base data meeting biological constraints. We have created a new Chinese character code table that enables exceptionally high information storage density for storing Chinese characters (compared to traditional UTF-8 encoding). To meet the biological constraints, we have devised a DNA shift coding scheme with low algorithmic complexity, which can encode any strand of DNA, even one with excessively long homopolymers. The designed DNA sequence is stored in a double-stranded plasmid of 744 bp, ensuring high reliability during storage. Additionally, the plasmid's resistance to environmental interference ensures long-term stable information storage, and it can be replicated at low cost.
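The shift-coding idea can be sketched as a rotating code, a plausible reading of the scheme described above: each ternary digit selects one of the three bases differing from the previous base, so adjacent identical bases never occur. Function names and the trit alphabet are ours; the paper's exact scheme may differ.

```python
# Rotating ("shift") code sketch: homopolymer-free by construction.
BASES = "ACGT"

def encode_trits(trits, prev="A"):
    """Map ternary digits to bases, never repeating the previous base."""
    out = []
    for t in trits:                  # each t is in {0, 1, 2}
        choices = [b for b in BASES if b != prev]
        prev = choices[t]
        out.append(prev)
    return "".join(out)

def decode_trits(seq, prev="A"):
    """Invert encode_trits by recomputing the choice list at each step."""
    trits = []
    for base in seq:
        choices = [b for b in BASES if b != prev]
        trits.append(choices.index(base))
        prev = base
    return trits

assert encode_trits([0, 0, 0, 0]) == "CACA"   # no homopolymers by design
assert decode_trits(encode_trits([2, 0, 1, 2])) == [2, 0, 1, 2]
```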
ABSTRACT
DNA storage systems have begun to attract considerable attention as next-generation storage technologies due to their high density and longevity. However, efficient primer design for random access in synthesized DNA strands is still an issue that needs to be solved. Although previous studies have explored various constraints for primer design in DNA storage systems, no attention has been paid to the combination of weakly mutually uncorrelated codes with the maximum run length constraint. In this paper, we first propose a code design combining weakly mutually uncorrelated codes with the maximum run length constraint. Moreover, we also explore weakly mutually uncorrelated codes satisfying combinations of the maximum run length constraint with further constraints, such as being almost balanced and having large Hamming distance, which are also efficient constraints for random access in DNA storage systems. To guarantee that the proposed codes can be adapted to primer design with variable length, we present modified code construction methods to achieve different code lengths. Then, we provide an analysis of the size of the proposed codes, which indicates their capacity to support primer design. Finally, we compare the codes with those of previous works to show that the proposed codes always guarantee the maximum run length constraint, which is helpful for random access in DNA storage.
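For reference, both properties can be checked directly; the sketch below uses the usual definition from the WMU-code literature (no proper prefix of length at least k of any codeword appears as a suffix of any codeword), which we assume matches the paper's.

```python
# Direct checks for the maximum run length constraint and k-weak mutual
# uncorrelation; definitions follow the common literature convention.
def run_length_ok(word: str, max_run: int) -> bool:
    run = 1
    for prev, cur in zip(word, word[1:]):
        run = run + 1 if cur == prev else 1
        if run > max_run:
            return False
    return True

def weakly_mutually_uncorrelated(code: list[str], k: int) -> bool:
    for u in code:
        for l in range(k, len(u)):          # proper prefixes of length >= k
            prefix = u[:l]
            if any(v.endswith(prefix) for v in code):
                return False
    return True

code = ["ACGT", "AGCT"]
print(run_length_ok("ACGT", 3), weakly_mutually_uncorrelated(code, 2))
```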
Subjects
DNA, Salaries and Fringe Benefits
ABSTRACT
DNA is an incredibly dense storage medium for digital data. However, computing on the stored information is expensive and slow, requiring rounds of sequencing, in silico computation, and DNA synthesis. Prior work on accessing and modifying data using DNA hybridization or enzymatic reactions had limited computation capabilities. Inspired by the computational power of "DNA strand displacement," we augment DNA storage with "in-memory" molecular computation using strand displacement reactions to algorithmically modify data in a parallel manner. We show programs for binary counting and Turing universal cellular automaton Rule 110, the latter of which is, in principle, capable of implementing any computer algorithm. Information is stored in the nicks of DNA, and a secondary sequence-level encoding allows high-throughput sequencing-based readout. We conducted multiple rounds of computation on 4-bit data registers, as well as random access of data (selective access and erasure). We demonstrate that large strand displacement cascades with 244 distinct strand exchanges (sequential and in parallel) can use naturally occurring DNA sequence from M13 bacteriophage without stringent sequence design, which has the potential to improve the scale of computation and decrease cost. Our work merges DNA storage and DNA computing, setting the foundation of entirely molecular algorithms for parallel manipulation of digital information preserved in DNA.
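For reference, Rule 110 itself, the Turing-universal update rule the strand-displacement cascades implement, is easy to state in ordinary code; this simulates the automaton in silico and does not model the molecular system.

```python
# Rule 110 cellular automaton: each new cell depends on its left/center/right
# neighbors via the rule's lookup table (110 = 01101110 in binary).
RULE_110 = {(1, 1, 1): 0, (1, 1, 0): 1, (1, 0, 1): 1, (1, 0, 0): 0,
            (0, 1, 1): 1, (0, 1, 0): 1, (0, 0, 1): 1, (0, 0, 0): 0}

def step(cells):
    n = len(cells)
    return [RULE_110[(cells[(i - 1) % n], cells[i], cells[(i + 1) % n])]
            for i in range(n)]

row = [0] * 15 + [1]          # a single live cell on a ring of 16
for _ in range(8):
    print("".join(".#"[c] for c in row))
    row = step(row)
```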