ABSTRACT
Grouping gene expression into gene set activity scores (GSAS) provides better biological insights than studying individual genes. However, existing gene set projection methods fail to return GSAS that are simultaneously representative, robust, and interpretable. We developed NetActivity, a machine learning framework that generates GSAS based on a sparsely-connected autoencoder in which each neuron of the inner layer represents a gene set, and we proposed a three-tier training scheme that yields representative, robust, and interpretable GSAS. The NetActivity model was trained on 1518 gene sets from GO Biological Process terms and KEGG pathways, using all GTEx samples. NetActivity generates GSAS that are robust to the initialization parameters and representative of the original transcriptome, and it assigns higher importance to the more biologically relevant genes. Moreover, NetActivity returns GSAS with a more consistent definition and higher interpretability than GSVA and hipathia, two state-of-the-art gene set projection methods. Finally, NetActivity enables combining bulk RNA-seq and microarray datasets, as shown in a meta-analysis of prostate cancer progression that highlighted gene sets related to cell division, a process key for disease progression. When applied to metastatic prostate cancer, NetActivity revealed that gene sets associated with cancer progression were also altered due to drug resistance, whereas a classical enrichment analysis identified gene sets irrelevant to the phenotype. NetActivity is publicly available on Bioconductor and GitHub.
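The core architectural idea, an encoder whose hidden neurons are each wired only to the genes of one gene set, can be sketched in a few lines. Below is a minimal PyTorch illustration assuming a binary gene-set membership mask; the layer sizes, tanh activation, and dense decoder are illustrative choices, not the published NetActivity model or its three-tier training.

```python
# Minimal sketch of a sparsely-connected autoencoder in the spirit of
# NetActivity: each hidden neuron is connected only to the genes of one
# gene set. Mask construction and layer choices are illustrative.
import torch
import torch.nn as nn

class GeneSetAutoencoder(nn.Module):
    def __init__(self, n_genes, gene_sets):
        # gene_sets: list of gene-index lists, one per gene set (e.g. GO/KEGG)
        super().__init__()
        n_sets = len(gene_sets)
        # Binary mask: mask[s, g] = 1 iff gene g belongs to gene set s
        mask = torch.zeros(n_sets, n_genes)
        for s, genes in enumerate(gene_sets):
            mask[s, genes] = 1.0
        self.register_buffer("mask", mask)
        self.encoder = nn.Linear(n_genes, n_sets)   # weights masked in forward
        self.decoder = nn.Linear(n_sets, n_genes)   # dense reconstruction

    def forward(self, x):
        # Zero out connections to genes outside each set before projecting
        w = self.encoder.weight * self.mask
        gsas = torch.tanh(nn.functional.linear(x, w, self.encoder.bias))
        return self.decoder(gsas), gsas

model = GeneSetAutoencoder(n_genes=1000, gene_sets=[[0, 1, 2], [2, 5, 9]])
recon, scores = model(torch.randn(4, 1000))  # scores: one GSAS per gene set
```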
Subject(s)
Prostatic Neoplasms; Humans; Prostatic Neoplasms/genetics; Prostatic Neoplasms/pathology; Prostatic Neoplasms/metabolism; Male; Machine Learning; Gene Expression Profiling/methods; Transcriptome/genetics; Gene Expression Regulation, Neoplastic; RNA-Seq/methods; Algorithms

ABSTRACT
MOTIVATION: Drug-target interaction (DTI) prediction is a relevant but challenging task in the drug repurposing field. In-silico approaches have drawn particular attention as they can reduce the costs and time commitment associated with traditional methodologies. Yet, current state-of-the-art methods present two main limitations: they are computationally expensive, which hinders the use of large networks and the exploitation of available datasets; and their generalization to unseen datasets remains unexplored, even though addressing it could improve the accuracy and robustness of DTI inference approaches. RESULTS: In this work, we introduce GeNNius (Graph Embedding Neural Network Interaction Uncovering System), a Graph Neural Network (GNN)-based method that outperforms state-of-the-art models in terms of both accuracy and time efficiency across a variety of datasets. We also demonstrated its power to uncover new interactions by evaluating previously unknown DTIs for each dataset. We further assessed the generalization capability of GeNNius by training and testing it on different datasets, showing that this framework can potentially improve DTI prediction by training on large datasets and testing on smaller ones. Finally, we qualitatively investigated the embeddings generated by GeNNius, revealing that the GNN encoder maintains biological information after the graph convolutions while diffusing this information through nodes, eventually distinguishing protein families in the node embedding space. AVAILABILITY AND IMPLEMENTATION: GeNNius code is available at https://github.com/ubioinformat/GeNNius.
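As a rough illustration of the encoder-plus-decoder pattern that GNN-based DTI predictors follow, the sketch below builds node embeddings by mean-aggregating neighbor features (a GraphSAGE-style update) and scores drug-protein pairs with a dot product. All architectural details here are stand-ins; GeNNius's actual layers, features, and training procedure are described in its paper and repository.

```python
# Illustrative GNN link-prediction sketch: neighbor aggregation produces
# node embeddings over a drug-protein graph; a dot-product decoder scores
# candidate interactions. Not the published GeNNius architecture.
import torch
import torch.nn as nn

def mean_aggregate(x, edge_index, n_nodes):
    # edge_index: (2, E) LongTensor of (src, dst) pairs; average neighbors
    src, dst = edge_index
    out = torch.zeros(n_nodes, x.size(1))
    out.index_add_(0, dst, x[src])
    deg = torch.zeros(n_nodes).index_add_(0, dst, torch.ones(dst.size(0)))
    return out / deg.clamp(min=1).unsqueeze(1)

class DTIEncoder(nn.Module):
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.lin1 = nn.Linear(2 * in_dim, hid_dim)
        self.lin2 = nn.Linear(2 * hid_dim, hid_dim)

    def forward(self, x, edge_index):
        n = x.size(0)
        h = torch.relu(self.lin1(torch.cat(
            [x, mean_aggregate(x, edge_index, n)], dim=1)))
        return self.lin2(torch.cat(
            [h, mean_aggregate(h, edge_index, n)], dim=1))

    def score(self, h, drug_idx, prot_idx):
        # Probability that a drug-protein edge exists
        return torch.sigmoid((h[drug_idx] * h[prot_idx]).sum(dim=1))
```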
Subject(s)
Drug Delivery Systems; Drug Repositioning; Drug Interactions; Diffusion; Neural Networks, Computer

ABSTRACT
The adult liver has an exceptional ability to regenerate, but how it maintains its specialized functions during regeneration is unclear. Here, we used partial hepatectomy (PHx) in tandem with single-cell transcriptomics to track cellular transitions and heterogeneities of ~22,000 liver cells through the initiation, progression, and termination phases of mouse liver regeneration. Our results uncovered that, following PHx, a subset of hepatocytes transiently reactivates an early-postnatal-like gene expression program to proliferate, while a distinct population of metabolically hyperactive cells appears to compensate for any temporary deficits in liver function. Cumulative EdU labeling and immunostaining of metabolic, portal, and central vein-specific markers revealed that hepatocyte proliferation after PHx initiates in the midlobular region before proceeding toward the periportal and pericentral areas. We further demonstrate that portal and central vein proximal hepatocytes retain their metabolically active state to preserve essential liver functions while midlobular cells proliferate nearby. Through combined analysis of gene regulatory networks and cell-cell interaction maps, we found that regenerating hepatocytes redeploy key developmental regulons, which are guided by extensive ligand-receptor-mediated signaling events between hepatocytes and nonparenchymal cells. Altogether, our study offers a detailed blueprint of the intercellular crosstalk and cellular reprogramming that balances the metabolic and proliferative requirements of a regenerating liver.
Subject(s)
Cell Plasticity; Liver Regeneration; Liver/cytology; Liver/metabolism; Animals; Cell Proliferation; Hepatectomy; Hepatocytes/cytology; Hepatocytes/metabolism; Mice; Single-Cell Analysis; Transcriptome

ABSTRACT
Inflammation is a common feature in neurodegenerative diseases that contributes to neuronal loss. Previously, we demonstrated that the basal inflammatory tone differed between brain regions and, consequently, that the reaction generated in response to a pro-inflammatory stimulus was different. In this study, we assessed the innate immune reaction in the midbrain and in the striatum using an experimental model of Parkinson's disease. An adeno-associated virus serotype 9 expressing the α-synuclein and mCherry genes, or the mCherry gene alone, was administered into the substantia nigra. Myeloid cells (CD11b+) and astrocytes (ACSA2+) were purified from the midbrain and striatum for bulk RNA sequencing. In the parkinsonian midbrain, CD11b+ cells presented a unique anti-inflammatory transcriptomic profile that differed from the degenerative microglia signatures described in experimental models of other neurodegenerative conditions. By contrast, striatal CD11b+ cells showed a pro-inflammatory state and were similar to disease-associated microglia. In the midbrain, a prominent increase of infiltrated monocytes/macrophages was observed that, together with microglia, participated actively in the phagocytosis of dopaminergic neuronal bodies. Although striatal microglia presented a phagocytic transcriptomic profile, morphology and cell density were preserved and no active phagocytosis was detected. Interestingly, astrocytes presented a pro-inflammatory fingerprint in the midbrain and a low number of differentially expressed transcripts in the striatum. During α-synuclein-dependent degeneration, microglia and astrocytes experience context-dependent activation states and make different contributions to the inflammatory reaction. Our results point towards the relevance of selecting appropriate cell targets for designing neuroprotective strategies aimed at modulating the innate immune system during the active phase of dopaminergic degeneration.
Subject(s)
Neurodegenerative Diseases; Parkinson Disease; Mice; Animals; Parkinson Disease/genetics; alpha-Synuclein/genetics; alpha-Synuclein/metabolism; Microglia/metabolism; Astrocytes/metabolism; Mesencephalon/metabolism; Inflammation

ABSTRACT
MOTIVATION: An important step in the transcriptomic analysis of individual cells involves manually determining the cellular identities. To ease this labor-intensive annotation of cell types, there has been a growing interest in automated cell annotation, which can be achieved by training classification algorithms on previously annotated datasets. Existing pipelines employ dataset integration methods to remove potential batch effects between the source (annotated) and target (unannotated) datasets. However, the integration and classification steps are usually independent of each other and performed by different tools. We propose JIND (joint integration and discrimination for automated single-cell annotation), a neural-network-based framework for automated cell-type identification that performs integration in a space suitably chosen to facilitate cell classification. To account for batch effects, JIND performs a novel asymmetric alignment in which unseen cells are mapped onto the previously learned latent space, avoiding the need to retrain the classification model for new datasets. JIND also learns cell-type-specific confidence thresholds to identify cells that cannot be reliably classified. RESULTS: We show on several batched datasets that JIND's joint approach to integration and classification outperforms existing pipelines in accuracy, and that, thanks to the cell-type-specific confidence thresholds, a smaller fraction of cells is rejected as unlabeled. Moreover, we investigate cells misclassified by JIND and provide evidence suggesting that these misclassifications could be due to outliers in the annotated datasets or errors in the original approach used to annotate the target batch. AVAILABILITY AND IMPLEMENTATION: Implementation for JIND is available at https://github.com/mohit1997/JIND and the data underlying this article can be accessed at https://doi.org/10.5281/zenodo.6246322. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
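The cell-type-specific rejection step can be illustrated independently of the network itself. In the sketch below, a per-type confidence threshold is set from a quantile of the confidences of correctly classified source cells, and target cells falling below the threshold of their predicted type are reported as unassigned; the quantile rule is an illustrative stand-in for the thresholds JIND actually learns.

```python
# Sketch of cell-type-specific rejection: a cell is labeled "Unassigned"
# when classifier confidence falls below a threshold fit per predicted type.
import numpy as np

def fit_thresholds(probs, labels, quantile=0.05):
    # probs: (n_cells, n_types) posteriors on the source set;
    # labels: true type index per cell
    thr = {}
    pred = probs.argmax(axis=1)
    for t in np.unique(labels):
        ok = (pred == t) & (labels == t)      # correctly classified cells
        conf = probs[ok, t]
        thr[t] = np.quantile(conf, quantile) if conf.size else 0.5
    return thr

def predict_with_rejection(probs, thr):
    pred = probs.argmax(axis=1)
    conf = probs.max(axis=1)
    return [p if conf[i] >= thr.get(p, 0.5) else "Unassigned"
            for i, p in enumerate(pred)]
```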
Subject(s)
Algorithms; Gene Expression Profiling

ABSTRACT
MOTIVATION: Gene regulatory networks describe the regulatory relationships among genes, and developing methods for reverse engineering these networks remains an ongoing challenge in computational biology. The majority of the initially proposed methods for gene regulatory network discovery create a network of genes and then mine it to uncover previously unknown regulatory processes. More recent approaches have focused on inferring modules of co-regulated genes, linking these modules with regulatory genes, and then mining them to discover new molecular biology. RESULTS: In this work, we analyze module-based network approaches to building gene regulatory networks and compare their performance to single-gene network approaches. In the process, we propose a novel approach to estimate gene regulatory networks that draws from the module-based methods. We show that generating modules of co-expressed genes predicted by a sparse set of regulators using a variational Bayes method, and then building a bipartite graph on the generated modules using sparse regression, yields more informative networks than previous single-gene and module-based network approaches, as measured by (i) the rate of enriched gene sets, (ii) a network topology assessment, (iii) ChIP-Seq evidence and (iv) the KnowEnG Knowledge Network collection of previously characterized gene-gene interactions. AVAILABILITY AND IMPLEMENTATION: The code is written in R and can be downloaded from https://github.com/mikelhernaez/linker. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
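A simplified version of the module-to-regulator step looks as follows: each module is summarized by its first principal component (an eigengene), and a sparse regression selects the module's candidate regulators, producing the bipartite regulator-module graph. This Python sketch with scikit-learn's Lasso is a stand-in; the actual method uses a variational Bayes formulation and is implemented in R.

```python
# Sketch: summarize each co-expression module by its first principal
# component, then select a sparse set of regulators per module with Lasso,
# yielding a bipartite regulator -> module graph.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso

def regulators_per_module(expr, modules, regulator_idx, alpha=0.05):
    # expr: (samples, genes) matrix; modules: list of gene-index lists
    edges = {}
    R = expr[:, regulator_idx]                  # candidate regulators
    for m, genes in enumerate(modules):
        eigengene = PCA(n_components=1).fit_transform(expr[:, genes]).ravel()
        coefs = Lasso(alpha=alpha).fit(R, eigengene).coef_
        edges[m] = [regulator_idx[j] for j in np.flatnonzero(coefs)]
    return edges  # module -> sparse list of predicted regulators
```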
Subject(s)
Algorithms; Gene Regulatory Networks; Bayes Theorem; Computational Biology; Gene Expression Profiling

ABSTRACT
MOTIVATION: Sequencing data are often summarized at different annotation levels for further analysis, generally using the general feature format (GFF) or its descendants, the gene transfer format (GTF) and GFF3. Existing utilities for accessing these files, like gffutils and gffread, do not focus on reducing the storage space, and in some cases significantly increase it. We propose GPress, a framework for querying GFF files in compressed form. GPress can also incorporate and compress expression files from both bulk and single-cell RNA-Seq experiments, supporting simultaneous queries on the GFF and expression files. In brief, GPress applies transformations to the data, which are then compressed with the general-purpose lossless compressor BSC. To support queries, GPress compresses the data in blocks and creates several index tables for fast retrieval. RESULTS: We tested GPress on several GFF files of different organisms and showed that it achieves on average a 61% reduction in size with respect to gzip (the current de facto compressor for GFF files) while retrieving all annotations for a given identifier or range of coordinates in a few seconds (when run on a commodity laptop). In contrast, gffutils provides faster retrieval but doubles the size of the GFF files. When additionally linking an expression file, we show that GPress can reduce its size by more than 68% with respect to gzip (for both bulk and single-cell RNA-Seq experiments), while still retrieving the information within seconds. Finally, applying BSC to the data streams generated by GPress, instead of to the original file, shows a size reduction of more than 44% on average. AVAILABILITY AND IMPLEMENTATION: GPress is freely available at https://github.com/qm2/gpress. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
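The block-plus-index mechanism that enables queries on compressed data can be sketched briefly. In the illustration below, features are compressed in fixed-size blocks and an index records each block's coordinate span, so a range query decompresses only the overlapping blocks; zlib stands in for the BSC compressor GPress uses, and feature lines of a single sequence (headers stripped) are assumed for simplicity.

```python
# Sketch of block-wise compression with an index table for random access.
# zlib stands in for BSC; assumes GFF feature lines of one sequence.
import zlib

def compress_blocked(lines, block_size=1000):
    blocks, index = [], []
    for i in range(0, len(lines), block_size):
        chunk = lines[i:i + block_size]
        starts = [int(l.split("\t")[3]) for l in chunk]  # GFF column 4: start
        index.append((min(starts), max(starts)))         # block's span
        blocks.append(zlib.compress("\n".join(chunk).encode()))
    return blocks, index

def query(blocks, index, lo, hi):
    # Decompress only blocks whose coordinate span overlaps [lo, hi]
    hits = []
    for (bmin, bmax), blob in zip(index, blocks):
        if bmin <= hi and bmax >= lo:
            hits += [l for l in zlib.decompress(blob).decode().split("\n")
                     if lo <= int(l.split("\t")[3]) <= hi]
    return hits
```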
Subject(s)
Data Compression; High-Throughput Nucleotide Sequencing; RNA-Seq; Software; Exome Sequencing

ABSTRACT
MOTIVATION: In response to the ever-expanding generation of genomic data, the International Organization for Standardization (ISO) is designing a new solution for the representation, compression and management of genomic sequencing data: the Moving Picture Experts Group (MPEG)-G standard. This paper discusses the first implementation of an MPEG-G-compliant entropy codec: GABAC. GABAC combines proven coding technologies, such as context-adaptive binary arithmetic coding, binarization schemes and transformations, into a straightforward solution for the compression of sequencing data. RESULTS: We demonstrate that GABAC outperforms well-established (entropy) codecs in a significant set of cases and thus can serve as an extension for existing genomic compression solutions, such as CRAM. AVAILABILITY AND IMPLEMENTATION: The GABAC library is written in C++. We also provide a command line application which exercises all features provided by the library. GABAC can be downloaded from https://github.com/mitogen/gabac. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
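Only the modeling half of a context-adaptive binary coder is easy to show compactly. The toy below keeps one adaptive probability estimate per context and accumulates the ideal code length of each bit under that estimate; a real codec such as GABAC couples this kind of model to a binary arithmetic coder and adds binarization and transformation stages, none of which are reproduced here.

```python
# Toy illustration of context-adaptive binary modeling: each context keeps
# a running estimate of P(bit == 1) that adapts after every bit; the cost
# of a bit is its ideal code length under that estimate.
import math
from collections import defaultdict

def adaptive_cost(bits, context_of, rate=0.05):
    p1 = defaultdict(lambda: 0.5)     # per-context P(bit == 1)
    total = 0.0
    for i, b in enumerate(bits):
        ctx = context_of(bits, i)     # e.g. the previous bit
        p = p1[ctx] if b else 1.0 - p1[ctx]
        total += -math.log2(max(p, 1e-12))    # ideal code length in bits
        p1[ctx] += rate * (b - p1[ctx])       # adapt toward observed bit
    return total  # bits an ideal arithmetic coder would need

bits = [1, 1, 1, 0, 1, 1, 1, 1]
print(adaptive_cost(bits, lambda s, i: s[i - 1] if i else 0))
```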
Subject(s)
Data Compression; High-Throughput Nucleotide Sequencing; Genome; Genomics; Software

ABSTRACT
MOTIVATION: High-throughput sequencing technologies produce huge amounts of data in the form of short genomic reads, associated quality values and read identifiers. Because of the significant structure present in these FASTQ datasets, general-purpose compressors are unable to completely exploit much of the inherent redundancy. Although there has been a lot of work on designing FASTQ compressors, most of them lack support for one or more crucial properties, such as variable-length reads, scalability to high-coverage datasets, pairing-preserving compression and lossless compression. RESULTS: In this work, we propose SPRING, a reference-free compressor for FASTQ files. SPRING supports a wide variety of compression modes and features, including lossless compression, pairing-preserving compression, lossy compression of quality values, long-read compression and random access. SPRING achieves substantially better compression than existing tools; for example, SPRING compresses 195 GB of 25× coverage whole-genome human FASTQ data from Illumina's NovaSeq sequencer to less than 7 GB, around 1.6× smaller than the previous state-of-the-art FASTQ compressors. SPRING achieves this improvement while using comparable computational resources. AVAILABILITY AND IMPLEMENTATION: SPRING can be downloaded from https://github.com/shubhamchandak94/SPRING. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
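The structural redundancy mentioned above is commonly exploited by splitting a FASTQ file into separate streams before compression, since identifiers, bases, and quality values have very different statistics. The sketch below shows only this stream-separation step, with bz2 as a stand-in for the specialized per-stream codecs SPRING applies (including its read reordering and quality modes).

```python
# Sketch of FASTQ stream separation: identifiers, bases, and qualities go
# into separate streams that compress better individually than interleaved.
import bz2

def compress_fastq(path):
    streams = {"id": [], "seq": [], "qual": []}
    with open(path) as f:
        for i, line in enumerate(f):
            field = ("id", "seq", "plus", "qual")[i % 4]
            if field != "plus":                 # drop the '+' separator line
                streams[field].append(line.rstrip())
    # Compress each stream independently (bz2 as a stand-in codec)
    return {k: bz2.compress("\n".join(v).encode()) for k, v in streams.items()}
```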
Subject(s)
Data Compression; High-Throughput Nucleotide Sequencing; Algorithms; Genome, Human; Genomics; Humans; Sequence Analysis, DNA; Software

ABSTRACT
The regulation of gene expression occurs through complex relationships between transcription, processing, turnover, and translation, which are only beginning to be elucidated. We know that, at least for certain messenger RNAs (mRNAs), processing, modifications, and sequence elements can greatly influence their translational output through recognition by the translation and turnover machinery. Recently, we and others have combined high-throughput sequencing technologies with traditional biochemical methods of studying translation to extend our understanding of these relationships. Additionally, growing importance is being given to how these processes may be regulated across varied cell types as a means to achieve tissue-specific expression of proteins. Here, we provide an in-depth methodology for polysome profiling to dissect the composition of mRNAs and proteins that make up the translatome, from both whole tissues and a specific cell type isolated from mammalian tissue. We also provide a detailed computational workflow for the analysis of the next-generation sequencing data generated by these experiments.
Subject(s)
Computational Biology/methods; Polyribosomes/genetics; Protein Biosynthesis; RNA, Messenger/genetics; Sequence Analysis, RNA/statistics & numerical data; Animals; Brain/cytology; Brain/metabolism; Cell Fractionation/methods; Centrifugation, Density Gradient/methods; Gene Ontology; Gene Regulatory Networks; Hepatocytes/cytology; Hepatocytes/metabolism; High-Throughput Nucleotide Sequencing; Liver/cytology; Liver/metabolism; Mice; Molecular Sequence Annotation; Myocardium/cytology; Myocardium/metabolism; Myocytes, Cardiac/cytology; Myocytes, Cardiac/metabolism; Neurons/cytology; Neurons/metabolism; Organ Specificity; Polyribosomes/classification; Polyribosomes/metabolism; RNA, Messenger/metabolism

ABSTRACT
Recent advancements in sequencing technology have led to a drastic reduction in genome sequencing costs. This development has generated an unprecedented amount of data that must be stored, processed, and communicated. To facilitate this effort, compression of genomic files has been proposed. Specifically, lossy compression of quality scores is emerging as a natural candidate for reducing the growing costs of storage. A main goal of performing DNA sequencing in population studies and clinical settings is to identify genetic variation. Though the field agrees that smaller files are advantageous, the cost of lossy compression, in terms of variant discovery, is unclear. Bioinformatic algorithms to identify SNPs and INDELs use base quality score information; here, we evaluate the effect of lossy compression of quality scores on SNP and INDEL detection. Specifically, we investigate how the output of the variant caller when using the original data differs from that obtained when quality scores are replaced by those generated by a lossy compressor. Using gold-standard genomic datasets and simulated data, we analyze how accurate the output of the variant calling is, both for the original data and for the lossily compressed data. We show that lossy compression can significantly alleviate the storage burden while maintaining variant-calling performance comparable to that with the original data. Further, in some cases lossy compression can even lead to variant-calling performance that is superior to that obtained with the original file. We envisage our findings and framework serving as a benchmark in future development and analyses of lossy genomic data compressors.
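The comparison itself reduces to set arithmetic over variant calls. A minimal sketch is shown below, assuming simplified VCF parsing with no normalization or genotype-level matching: calls obtained from lossily compressed quality scores are scored against a gold-standard call set via precision and recall.

```python
# Sketch of the evaluation: compare a variant call set against a gold
# standard, keyed by (chrom, pos, ref, alt). Parsing is simplified.
def load_variants(vcf_path):
    with open(vcf_path) as f:
        return {tuple(l.split("\t")[i] for i in (0, 1, 3, 4))
                for l in f if not l.startswith("#")}

def precision_recall(calls, truth):
    tp = len(calls & truth)                      # true positive calls
    precision = tp / len(calls) if calls else 0.0
    recall = tp / len(truth) if truth else 0.0
    return precision, recall
```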
Subject(s)
Databases, Genetic; Algorithms; Data Compression; Genome; Genomics; Humans; Sequence Analysis, DNA

ABSTRACT
Motivation: Recent advancements in high-throughput sequencing technology have led to a rapid growth of genomic data. Several lossless compression schemes have been proposed for the coding of such data, present in the form of raw FASTQ files and aligned SAM/BAM files. However, due to their high entropy, losslessly compressed quality values account for about 80% of the size of compressed files. For the quality values, we present a novel lossy compression scheme named CALQ. By controlling the coarseness of quality-value quantization with a statistical genotyping model, we minimize the impact of the introduced distortion on downstream analyses. Results: We analyze the performance of several lossy compressors for quality values in terms of the trade-off between the achieved compressed size (in bits per quality value) and the precision and recall achieved after running a variant-calling pipeline over sequencing data of the well-known NA12878 individual. By compressing and reconstructing quality values with CALQ, we observe a better average variant-calling performance than with the original data, while achieving a size reduction of about one order of magnitude with respect to the state-of-the-art lossless compressors. Furthermore, we show that CALQ performs as well as or better than the state-of-the-art lossy compressors in terms of variant-calling recall and precision for most of the analyzed datasets. Availability and implementation: CALQ is written in C++ and can be downloaded from https://github.com/voges/calq. Contact: voges@tnt.uni-hannover.de or mhernaez@illinois.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
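Quantization is the mechanism that trades size for distortion. The sketch below applies a fixed Illumina-style 8-level binning to Phred scores, which illustrates the rate reduction only; CALQ instead selects the quantizer coarseness per genomic position from a statistical genotyping model.

```python
# Sketch of quality-value quantization with a fixed Illumina-style 8-level
# binning (bin edges and representatives are illustrative).
BINS = [(0, 1, 0), (2, 9, 6), (10, 19, 15), (20, 24, 22),
        (25, 29, 27), (30, 34, 33), (35, 39, 37), (40, 93, 40)]

def quantize(qualities):
    # Map each Phred score to its bin representative (8 levels -> 3 bits)
    out = []
    for q in qualities:
        for lo, hi, rep in BINS:
            if lo <= q <= hi:
                out.append(rep)
                break
    return out

print(quantize([2, 11, 28, 38]))  # -> [6, 15, 27, 37]
```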
Subject(s)
Data Compression/methods; Genomics/methods; High-Throughput Nucleotide Sequencing/methods; Software; Algorithms; Humans; Models, Statistical; Sequence Alignment; Sequence Analysis, DNA/methods

ABSTRACT
Motivation: The affordability of DNA sequencing has led to the generation of unprecedented volumes of raw sequencing data. These data must be stored, processed and transmitted, which poses significant challenges. To facilitate this effort, we introduce FaStore, a specialized compressor for FASTQ files. FaStore does not use any reference sequences for compression and permits the user to choose from several lossy modes to improve the overall compression ratio, depending on the specific needs. Results: FaStore in the lossless mode achieves a significant improvement in compression ratio with respect to previously proposed algorithms. We perform an analysis on the effect that the different lossy modes have on variant calling, the most widely used application for clinical decision making, especially important in the era of precision medicine. We show that lossy compression can offer significant compression gains, while preserving the essential genomic information and without affecting the variant calling performance. Availability and implementation: FaStore can be downloaded from https://github.com/refresh-bio/FaStore. Supplementary information: Supplementary data are available at Bioinformatics online.
Subject(s)
Data Compression/methods; Genomics/methods; High-Throughput Nucleotide Sequencing/methods; Sequence Analysis, DNA/methods; Software; Algorithms; Humans

ABSTRACT
MOTIVATION: The dramatic decrease in the cost of sequencing has resulted in the generation of huge amounts of genomic data, as evidenced by projects such as the UK10K and the Million Veteran Program, with the number of sequenced genomes ranging in the order of 10 K to 1 M. Due to the large redundancies among the genomic sequences of individuals from the same species, most medical research deals with the variants in the sequences with respect to a reference sequence, rather than with the complete genomic sequences. Consequently, millions of genomes represented as variants are stored in databases. These databases are constantly updated and queried to extract information such as the common variants among individuals or groups of individuals. Previous algorithms for the compression of this type of database lack efficient random-access capabilities, rendering queries for particular variants and/or individuals extremely inefficient, to the point where compression is often relinquished altogether. RESULTS: We present a new algorithm for this task, called GTRAC, that achieves significant compression ratios while allowing fast random access over the compressed database. For example, GTRAC is able to compress a Homo sapiens dataset containing 1092 samples into 1.1 GB (a compression ratio of 160), while allowing for the decompression of specific samples in less than a second and the decompression of specific variants in 17 ms. GTRAC uses and adapts techniques from information theory, such as a specialized Lempel-Ziv compressor and tailored succinct data structures. AVAILABILITY AND IMPLEMENTATION: The GTRAC algorithm is available for download at https://github.com/kedartatwawadi/GTRAC. CONTACT: kedart@stanford.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
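The layout that makes such random access possible can be sketched simply: each sample's bit vector over the variant dictionary is compressed independently, so recovering one sample touches a single blob. In the illustration below, zlib and a byte-per-variant encoding stand in for GTRAC's tailored Lempel-Ziv compressor and succinct data structures.

```python
# Sketch of a random-access layout: per-sample presence/absence vectors
# over the variant dictionary, each compressed independently.
import zlib

def build(db):
    # db: dict sample_name -> bytes (1 byte per variant, values 0/1)
    return {s: zlib.compress(bits) for s, bits in db.items()}

def get_sample(store, sample):
    return zlib.decompress(store[sample])   # touches one blob only

def get_variant(store, variant_idx):
    # Which samples carry variant j? Test one position per sample row.
    return [s for s, blob in store.items()
            if zlib.decompress(blob)[variant_idx] == 1]

store = build({"NA12878": bytes([1, 0, 1]), "NA12891": bytes([0, 0, 1])})
print(get_variant(store, 2))  # -> ['NA12878', 'NA12891']
```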
Subject(s)
Algorithms; Data Compression; Genomics; Sequence Analysis, DNA; Databases, Genetic; Genome; High-Throughput Nucleotide Sequencing; Humans

ABSTRACT
MOTIVATION: Data compression is crucial for the effective handling of genomic data. Among several recently published algorithms, ERGC seems to be surprisingly good, easily beating all of the competitors. RESULTS: We evaluated ERGC and the previously proposed algorithms GDC and iDoComp, the ones used for comparison in the original paper, on a wide dataset including 12 assemblies of the human genome (instead of only four in the original paper). ERGC wins only when one of the genomes (reference or target) contains mixed-case letters (which is the case only for the two Korean genomes). In all other cases, ERGC is on average an order of magnitude worse than GDC and iDoComp. CONTACT: sebastian.deorowicz@polsl.pl, iochoa@stanford.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Data Compression; Sequence Analysis, DNA; Algorithms; Genome; Genome, Human; Genomics; Humans

ABSTRACT
MOTIVATION: With the release of the latest next-generation sequencing (NGS) machine, the HiSeq X by Illumina, the cost of sequencing a human genome has dropped to a mere $4000. Thus we are approaching a milestone in sequencing history, known as the $1000-genome era, where the sequencing of individuals is affordable, opening the doors to effective personalized medicine. Massive generation of genomic data, including assembled genomes, is expected in the following years. There is a crucial need for compression of genomes that is guaranteed to perform well simultaneously on different species, from simple bacteria to humans, to ease their transmission, dissemination and analysis. Further, most of the new genomes to be compressed will correspond to individuals of a species for which a reference already exists in the database. Thus, it is natural to propose compression schemes that assume and exploit the availability of such references. RESULTS: We propose iDoComp, a compressor of assembled genomes presented in FASTA format that compresses an individual genome using a reference genome for both compression and decompression. In terms of compression efficiency, iDoComp outperforms previously proposed algorithms in most of the studied cases, with comparable or better running time. For example, we observe compression gains of up to 60% in several cases, including H. sapiens data, when comparing with the best compression performance among the previously proposed algorithms. AVAILABILITY: iDoComp is written in C and can be downloaded from http://www.stanford.edu/~iochoa/iDoComp.html (we also provide a full explanation of how to run the program and an example with all the necessary files).
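The essence of reference-based compression is to encode the target as copy instructions against the reference plus occasional literals. The naive greedy parser below illustrates this; iDoComp itself builds the mapping with suffix arrays and entropy-codes the resulting instructions, neither of which is reproduced here.

```python
# Sketch of reference-based parsing: emit (copy, ref_pos, length)
# instructions for matches against the reference and literals otherwise.
def parse_against_reference(target, reference, min_match=4):
    ops, i = [], 0
    while i < len(target):
        pos = (reference.find(target[i:i + min_match])
               if len(target) - i >= min_match else -1)
        if pos >= 0:
            length = min_match
            # Greedily extend the match as far as it holds
            while (i + length < len(target) and pos + length < len(reference)
                   and target[i + length] == reference[pos + length]):
                length += 1
            ops.append(("copy", pos, length))
            i += length
        else:
            ops.append(("literal", target[i]))
            i += 1
    return ops

print(parse_against_reference("ACGTACGTTT", "ACGTACGTAA"))
# -> [('copy', 0, 8), ('literal', 'T'), ('literal', 'T')]
```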
Subject(s)
Data Compression/methods; Databases, Factual; Genome, Human; Genomics/methods; High-Throughput Nucleotide Sequencing/methods; Sequence Analysis, DNA/methods; Software; Algorithms; Humans; Precision Medicine

ABSTRACT
MOTIVATION: Recent advancements in sequencing technology have led to a drastic reduction in the cost of sequencing a genome. This has generated an unprecedented amount of genomic data that must be stored, processed and transmitted. To facilitate this effort, we propose a new lossy compressor for the quality values present in genomic data files (e.g., FASTQ and SAM files), which comprise roughly half of the storage space (in the uncompressed domain). Lossy compression allows for compression of data beyond its lossless limit. RESULTS: The proposed algorithm QVZ exhibits better rate-distortion performance than previously proposed algorithms, for several distortion metrics and for the lossless case. Moreover, it allows the user to define any quasi-convex distortion function to be minimized, a feature not supported by the previous algorithms. Finally, we show that QVZ-compressed data exhibit better genotyping performance than data compressed with previously proposed algorithms, in the sense that, for a similar rate, the genotyping obtained is closer to that achieved with the original quality values. AVAILABILITY AND IMPLEMENTATION: QVZ is written in C and can be downloaded from https://github.com/mikelhernaez/qvz. CONTACT: mhernaez@stanford.edu, gmalysa@stanford.edu or iochoa@stanford.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Algorithms; Data Compression/standards; Animals; Databases, Genetic; Genotype; Genotyping Techniques; Humans; Polymorphism, Single Nucleotide/genetics

ABSTRACT
GA4GH has proposed the Beacon architecture as an interface for retrieving genomic information while protecting the privacy of individuals. In this paper, we propose to adapt the Beacon Reference Implementation to the use case of a study comparing susceptibility to the carcinogenic effects of tobacco. This analysis compares the germline of heavy smokers who have either never developed lung cancer or, on the contrary, have developed it at a young age. To adapt the Beacon Reference Implementation to this use case, we added filtering capabilities and a new grouping of the information that allows the data to be retrieved by affected gene.
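A minimal sketch of the two extensions, filtering and grouping by affected gene, is shown below on an illustrative in-memory record layout (the actual Beacon Reference Implementation exposes these through its REST API and database schema, which are not reproduced here).

```python
# Sketch of a Beacon-style query with filtering and per-gene grouping.
# The record layout and filter names are illustrative assumptions.
from collections import defaultdict

def beacon_query(records, gene=None, cohort=None):
    # records: list of dicts with keys "gene", "cohort", "variant"
    by_gene = defaultdict(list)
    for r in records:
        if (gene is None or r["gene"] == gene) and \
           (cohort is None or r["cohort"] == cohort):
            by_gene[r["gene"]].append(r["variant"])
    # Beacon-style boolean answer plus the per-gene grouping
    return {"exists": bool(by_gene), "results": dict(by_gene)}

records = [{"gene": "TP53", "cohort": "smokers_no_cancer",
            "variant": "17:7675088C>T"}]
print(beacon_query(records, gene="TP53"))
```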
Subject(s)
Genomics; Lung Neoplasms; Humans; Lung Neoplasms/genetics; Genetic Predisposition to Disease; Smoking/genetics; Information Storage and Retrieval

ABSTRACT
For the last two decades, the amount of genomic data produced by scientific and medical applications has been growing at a rapid pace. To enable software solutions that analyze, process, and transmit these data in an efficient and interoperable way, ISO and IEC released the first version of the compression standard MPEG-G in 2019. However, no non-proprietary implementations of the standard have been openly available so far, which limits fair scientific assessment of the standard and therefore hinders its broad adoption. In this paper, we present Genie, to the best of our knowledge the first open-source encoder that compresses genomic data according to the MPEG-G standard. We demonstrate that Genie reaches state-of-the-art compression ratios while offering interoperability with any other standard-compliant decoder, independently of its manufacturer. Finally, the ISO/IEC ecosystem ensures the long-term sustainability and decodability of the compressed data through the ISO/IEC-supported reference decoder.