ABSTRACT
DNA, with its high storage density and long-term stability, is a potential candidate for a next-generation storage device. The DNA data storage channel, composed of synthesis, amplification, storage, and sequencing, exhibits error probabilities and error profiles specific to the components of the channel. Here, we present Autoturbo-DNA, a PyTorch framework for training error-correcting, overcomplete autoencoders specifically tailored for the DNA data storage channel. It allows training different architecture combinations and using a wide variety of channel component models for noise generation during training. It further supports training the encoder to generate DNA sequences that adhere to user-defined constraints. Autoturbo-DNA exhibits error-correction capabilities close to non-neural-network state-of-the-art error correction and constrained codes for DNA data storage. Our results indicate that neural-network-based codes can be a viable alternative to traditionally designed codes for the DNA data storage channel.
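As an illustration of what "user-defined constraints" on encoder output might look like, the sketch below checks two constraints commonly used in DNA data storage: a GC-content window and a maximum homopolymer run length. The abstract does not say which constraints Autoturbo-DNA supports, so the function names, thresholds, and the choice of these two constraints are assumptions for illustration only and are not taken from the framework.

```python
# Hypothetical constraint checks for DNA sequences; NOT Autoturbo-DNA's API.
# Thresholds (40-60 % GC, homopolymer runs of at most 3) are assumed defaults.

def gc_content(seq):
    """Fraction of bases in seq that are G or C (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def max_homopolymer(seq):
    """Length of the longest run of identical consecutive bases."""
    longest = 0
    run = 0
    prev = None
    for base in seq:
        run = run + 1 if base == prev else 1
        longest = max(longest, run)
        prev = base
    return longest


def satisfies_constraints(seq, gc_min=0.4, gc_max=0.6, max_run=3):
    """True if seq lies inside the GC window and has no overly long run."""
    return gc_min <= gc_content(seq) <= gc_max and max_homopolymer(seq) <= max_run
```

A check of this kind could, for example, be turned into a penalty term during training, although whether the framework does so is not stated in the abstract.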
ABSTRACT
Microbes are essential for element cycling and ecosystem functioning. However, many questions central to understanding the role of microbes in ecology are still open. Here, we analyze the relationship between lake microbiomes and the lakes' land cover. By applying machine learning methods, we quantify the covariance between land cover categories and the microbial community composition recorded in the largest amplicon sequencing dataset of European lakes available to date. Our results show that the aggregation of environmental features or microbial taxa before analysis can obscure ecologically relevant patterns. We observe a comparatively high covariation of the lakes' microbial community with herbaceous and open spaces surrounding the lake; nevertheless, the microbial covariation with land cover categories is generally lower than the covariation with physico-chemical parameters. Combining land cover and physico-chemical bioindicators identified from the same amplicon sequencing dataset, we develop analytical data structures that facilitate insights into the ecology of the lake microbiome. Among these, a list of the environmental parameters sorted by the number of microbial bioindicators we have identified for them points towards apparent environmental drivers of the lake microbial community composition, such as the altitude, the conductivity, and the area covered by herbaceous vegetation surrounding the lake. Furthermore, the response map, a similarity matrix calculated from the Jaccard similarity of the environmental parameters' lists of bioindicators, allows us to study the ecosystem's structure from the standpoint of the microbiome. More specifically, we identify multiple clusters of highly similar and possibly functionally linked ecological parameters, including one that highlights the importance of the calcium-bicarbonate equilibrium for lake ecology.
Taken together, we demonstrate the use of machine learning approaches in studying the interplay between microbial diversity and environmental factors and introduce novel approaches to integrate environmental molecular diversity into monitoring and water quality assessments.
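The two data structures described in this abstract, the ranked parameter list and the response map, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the bioindicator sets, the taxon labels (`otu1`, ...), and the function names are hypothetical, and the convention that two empty sets have similarity 0 is our own choice.

```python
# Hypothetical example data: environmental parameter -> set of bioindicator taxa.
# The taxon labels and set memberships are invented for illustration.
bioindicators = {
    "altitude": {"otu1", "otu2", "otu3", "otu4"},
    "conductivity": {"otu2", "otu3", "otu5"},
    "calcium": {"otu5", "otu6"},
    "bicarbonate": {"otu5", "otu6", "otu7"},
}


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B|; defined here as 0.0 if both are empty."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def response_map(indicators):
    """Pairwise Jaccard similarity matrix as a nested dict keyed by parameter."""
    params = sorted(indicators)
    return {
        p: {q: jaccard(indicators[p], indicators[q]) for q in params}
        for p in params
    }


def rank_parameters(indicators):
    """Parameters sorted by number of bioindicators (descending), ties by name."""
    return sorted(indicators, key=lambda p: (-len(indicators[p]), p))


rmap = response_map(bioindicators)
ranked = rank_parameters(bioindicators)
```

In this toy data, `calcium` and `bicarbonate` share two of three taxa, so their similarity is 2/3, the highest off-diagonal entry; clusters of such high-similarity parameters are what the abstract refers to as possibly functionally linked.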
Subject(s)
Lakes, Microbiota, Environmental Biomarkers, Water Quality
ABSTRACT
As the amount of available genomic information grows rapidly, the need for accessible and easy-to-use analysis tools is increasing. To facilitate eukaryotic genome annotation, we created MOSGA. In this work, we present MOSGA 2, which extends MOSGA with several advanced analyses for genomic data. Since the quality of the genomic data greatly impacts annotation quality, we included multiple tools to validate user-submitted genome assemblies and ensure their quality. Moreover, thanks to the integration of comparative genomics methods, users can benefit from a broader genomic view by analyzing multiple genomic data sets simultaneously. Further, we demonstrate the new functionalities of MOSGA 2 through different use cases and practical examples. MOSGA 2 extends the already established application with quality control of the genomic data and integrates and analyzes multiple genomes in a larger context, e.g., by phylogenetics.