Results 1 - 20 of 37
1.
Biomolecules ; 14(6)2024 Jun 03.
Article in English | MEDLINE | ID: mdl-38927057

ABSTRACT

Whole-tissue transcriptomic analyses have been helpful in characterizing molecular subtypes of hepatocellular carcinoma (HCC). Metabolic subtypes of human HCC have been defined, yet whether these metabolic classes are clinically relevant or give rise to actionable cancer vulnerabilities remains an open question. Publicly available gene sets or gene signatures have been used to infer functional changes through gene set enrichment methods. However, metabolism-related gene signatures are poorly co-expressed when applied to a biological context. Here, we apply a simple method to infer highly consistent signatures using graph-based statistics. Using The Cancer Genome Atlas Liver Hepatocellular Carcinoma cohort (LIHC), we describe the main metabolic clusters and their relationship with commonly used molecular classes, and with the presence of TP53 or CTNNB1 driver mutations. We find similar results in our validation cohort, the LIRI-JP cohort. We describe how previously described metabolic subtypes may lack therapeutic relevance because of their overall downregulation compared with non-tumoral liver, and identify the N-glycan, mevalonate and sphingolipid biosynthetic pathways as the hallmark of the oncogenic shift in the use of acetyl-coenzyme A in HCC metabolism. Finally, using DepMap data, we demonstrate metabolic vulnerabilities in HCC cell lines.


Subject(s)
Hepatocellular Carcinoma , Liver Neoplasms , Transcriptome , Humans , Hepatocellular Carcinoma/metabolism , Hepatocellular Carcinoma/genetics , Hepatocellular Carcinoma/pathology , Liver Neoplasms/metabolism , Liver Neoplasms/genetics , Liver Neoplasms/pathology , Transcriptome/genetics , Neoplastic Gene Expression Regulation , Gene Expression Profiling , Metabolic Networks and Pathways/genetics , Tumor Suppressor Protein p53/metabolism , Tumor Suppressor Protein p53/genetics , Tumor Cell Line , beta Catenin/metabolism , beta Catenin/genetics , Mutation
2.
Nat Commun ; 15(1): 5272, 2024 Jun 20.
Article in English | MEDLINE | ID: mdl-38902243

ABSTRACT

While myelodysplastic syndromes with del(5q) (del(5q) MDS) comprise a well-defined hematological subgroup, the molecular basis underlying their origin remains unknown. Using single-cell RNA-seq (scRNA-seq) on CD34+ progenitors from del(5q) MDS patients, we have identified cells harboring the deletion, characterizing the transcriptional impact of this genetic insult on disease pathogenesis and treatment response. Interestingly, both del(5q) and non-del(5q) cells present similar transcriptional lesions, indicating that all cells, and not only those harboring the deletion, may contribute to aberrant hematopoietic differentiation. However, gene regulatory network (GRN) analyses reveal a group of regulons showing aberrant activity that could trigger altered hematopoiesis exclusively in del(5q) cells, pointing to a more prominent role of these cells in disease phenotype. In del(5q) MDS patients achieving hematological response upon lenalidomide treatment, the drug reverts several transcriptional alterations in both del(5q) and non-del(5q) cells, but other lesions remain, which may be responsible for potential future relapses. Moreover, lack of hematological response is associated with the inability of lenalidomide to reverse transcriptional alterations. Collectively, this study reveals transcriptional alterations that could contribute to the pathogenesis and treatment response of del(5q) MDS.


Subject(s)
CD34 Antigens , Chromosome Deletion , Human Chromosome Pair 5 , Hematopoietic Stem Cells , Lenalidomide , Myelodysplastic Syndromes , Single-Cell Analysis , Humans , Lenalidomide/pharmacology , Lenalidomide/therapeutic use , Myelodysplastic Syndromes/genetics , Myelodysplastic Syndromes/drug therapy , Myelodysplastic Syndromes/pathology , Myelodysplastic Syndromes/metabolism , Hematopoietic Stem Cells/drug effects , Hematopoietic Stem Cells/metabolism , CD34 Antigens/metabolism , Human Chromosome Pair 5/genetics , Male , Female , Aged , Gene Regulatory Networks/drug effects , Middle Aged , Hematopoiesis/drug effects , Hematopoiesis/genetics , Transcriptome , Aged 80 and over , RNA-Seq , Gene Expression Profiling
3.
Bioinformatics ; 40(6)2024 Jun 03.
Article in English | MEDLINE | ID: mdl-38748994

ABSTRACT

MOTIVATION: The identification of minimal genetic interventions that modulate metabolic processes constitutes one of the most relevant applications of genome-scale metabolic models (GEMs). The concept of Minimal Cut Sets (MCSs) and its extension at the gene level, genetic Minimal Cut Sets (gMCSs), have attracted increasing interest in the field of Systems Biology to address this task. Different computational tools have been developed to calculate MCSs and gMCSs using both commercial and open-source software. RESULTS: Here, we present gMCSpy, an efficient Python package to calculate gMCSs in GEMs using both commercial and non-commercial optimization solvers. We show that gMCSpy substantially outperforms our previous computational tool GMCS, which exclusively relied on commercial software. Moreover, we compared gMCSpy with recently published competing algorithms in the literature, finding significant improvements in both accuracy and computation time. All these advances make gMCSpy an attractive tool for researchers in the field of Systems Biology for different applications in health and biotechnology. AVAILABILITY AND IMPLEMENTATION: The Python package gMCSpy and the data underlying this manuscript can be accessed at: https://github.com/PlanesLab/gMCSpy.


Subject(s)
Algorithms , Software , Systems Biology , Systems Biology/methods , Genome , Computational Biology/methods
4.
Nucleic Acids Res ; 52(9): e44, 2024 May 22.
Article in English | MEDLINE | ID: mdl-38597610

ABSTRACT

Grouping gene expression into gene set activity scores (GSAS) provides better biological insights than studying individual genes. However, existing gene set projection methods cannot return representative, robust, and interpretable GSAS. We developed NetActivity, a machine learning framework that generates GSAS based on a sparsely-connected autoencoder, where each neuron in the inner layer represents a gene set. We proposed a three-tier training that yielded representative, robust, and interpretable GSAS. The NetActivity model was trained on 1518 GO biological process terms and KEGG pathways using all GTEx samples. NetActivity generates GSAS that are robust to the initialization parameters and representative of the original transcriptome, and assigns higher importance to more biologically relevant genes. Moreover, NetActivity returns GSAS with a more consistent definition and higher interpretability than GSVA and hipathia, state-of-the-art gene set projection methods. Finally, NetActivity enables combining bulk RNA-seq and microarray datasets in a meta-analysis of prostate cancer progression, highlighting gene sets related to cell division, key for disease progression. When applied to metastatic prostate cancer, gene sets associated with cancer progression were also altered due to drug resistance, while a classical enrichment analysis identified gene sets irrelevant to the phenotype. NetActivity is publicly available in Bioconductor and GitHub.


Subject(s)
Prostatic Neoplasms , Humans , Prostatic Neoplasms/genetics , Prostatic Neoplasms/pathology , Prostatic Neoplasms/metabolism , Male , Machine Learning , Gene Expression Profiling/methods , Transcriptome/genetics , Neoplastic Gene Expression Regulation , RNA-Seq/methods , Algorithms
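
As a rough illustration of the sparsely connected inner layer described in the NetActivity abstract, the sketch below computes one activity score per gene set from only the genes that belong to that set. The gene names, weights, and membership table are invented for illustration, and wiring the sparsity by gene-set membership is an assumption drawn from the abstract's wording, not the package's actual implementation.

```python
def gene_set_scores(expression, weights, membership):
    """Compute one score per gene set using only that set's member genes.

    expression: dict gene -> expression value for one sample
    weights:    dict gene_set -> dict gene -> weight (hypothetical values)
    membership: dict gene_set -> list of member genes
    """
    scores = {}
    for gene_set, genes in membership.items():
        # Each "neuron" sees only the genes of its own gene set.
        scores[gene_set] = sum(weights[gene_set][g] * expression[g] for g in genes)
    return scores


# Hypothetical toy data: three genes, two overlapping gene sets.
expression = {"A": 1.0, "B": 2.0, "C": 3.0}
membership = {"set1": ["A", "B"], "set2": ["B", "C"]}
weights = {
    "set1": {"A": 0.5, "B": 0.5},
    "set2": {"B": 1.0, "C": -1.0},
}

scores = gene_set_scores(expression, weights, membership)
print(scores)  # {'set1': 1.5, 'set2': -1.0}
```

Because every score depends on a known, small set of genes, each weight can be read as the contribution of one gene to one gene set, which is the kind of interpretability the abstract emphasizes.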
5.
EBioMedicine ; 102: 105048, 2024 Apr.
Article in English | MEDLINE | ID: mdl-38484556

ABSTRACT

BACKGROUND: Tobacco is the main risk factor for developing lung cancer. Yet, while some heavy smokers develop lung cancer at a young age, other heavy smokers never develop it, even at an advanced age, suggesting a remarkable variability in the individual susceptibility to the carcinogenic effects of tobacco. We characterized the germline profile of subjects presenting these extreme phenotypes with Whole Exome Sequencing (WES) and Machine Learning (ML). METHODS: We sequenced germline DNA from heavy smokers who either developed lung adenocarcinoma at an early age (extreme cases) or who did not develop lung cancer at an advanced age (extreme controls), selected from databases including over 6600 subjects. We selected individual coding genetic variants and variant-rich genes showing a significantly different distribution between extreme cases and controls. We validated the results from our discovery cohort, in which we analysed by WES extreme cases and controls presenting similar phenotypes. We developed ML models using both cohorts. FINDINGS: Mean age for extreme cases and controls was 50.7 and 79.1 years respectively, and mean tobacco consumption was 34.6 and 62.3 pack-years. We validated 16 individual variants and 33 variant-rich genes. The gene harbouring the most validated variants was HLA-A in extreme controls (4 variants in the discovery cohort, p = 3.46E-07; and 4 in the validation cohort, p = 1.67E-06). We trained ML models using as input the 16 individual variants in the discovery cohort and tested them on the validation cohort, obtaining an accuracy of 76.5% and an AUC-ROC of 83.6%. Functions of validated genes included candidate oncogenes, tumour-suppressors, DNA repair, HLA-mediated antigen presentation and regulation of proliferation, apoptosis, inflammation and immune response. INTERPRETATION: Individuals presenting extreme phenotypes of high and low risk of developing tobacco-associated lung adenocarcinoma show different germline profiles. Our strategy may allow the identification of high-risk subjects and the development of new therapeutic approaches. FUNDING: See a detailed list of funding bodies in the Acknowledgements section at the end of the manuscript.


Subject(s)
Adenocarcinoma of Lung , Lung Neoplasms , Humans , Middle Aged , Aged , Exome Sequencing , Genetic Predisposition to Disease , Adenocarcinoma of Lung/genetics , Lung Neoplasms/genetics , Lung Neoplasms/pathology , Phenotype , Germ Cells/pathology
6.
Bioinformatics ; 40(3)2024 Mar 04.
Article in English | MEDLINE | ID: mdl-37963064

ABSTRACT

MOTIVATION: Single-nucleotide variants (SNVs) are the most common type of genetic variation in the human genome. Accurate and efficient detection of SNVs from next-generation sequencing (NGS) data is essential for various applications in genomics and personalized medicine. However, SNV calling methods usually suffer from high computational complexity and limited accuracy. In this context, there is a need for new methods that overcome these limitations and provide fast, reliable results. RESULTS: We present EMVC-2, a novel method for SNV calling from NGS data. EMVC-2 uses a multi-class ensemble classification approach based on the expectation-maximization algorithm that infers at each locus the most likely genotype from multiple labels provided by different learners. The inferred variants are then validated by a decision tree that filters out unlikely ones. We evaluate EMVC-2 on several publicly available real human NGS datasets for which the set of SNVs is available, and demonstrate that, on average, it outperforms state-of-the-art variant callers in terms of accuracy and speed. AVAILABILITY AND IMPLEMENTATION: EMVC-2 is coded in C and Python, and is freely available for download at: https://github.com/guilledufort/EMVC-2. EMVC-2 is also available in Bioconda.


Subject(s)
Motivation , Single Nucleotide Polymorphism , Humans , Genomics/methods , Algorithms , High-Throughput Nucleotide Sequencing/methods , Nucleotides
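
The EMVC-2 abstract describes inferring the most likely genotype at each locus from labels supplied by several learners, using expectation-maximization. The following is a minimal sketch of that general idea under strong simplifying assumptions: each learner is modeled by a single accuracy value, errors are spread evenly over the other genotypes, and the prior over genotypes is uniform. The function name, the toy votes, and the model itself are illustrative choices, not the published method.

```python
def em_ensemble(votes, classes, iterations=20):
    """Estimate true labels and per-learner accuracies with a simple EM loop.

    votes:   list of tuples, one tuple of learner labels per locus
    classes: list of possible genotype labels
    """
    n_learners = len(votes[0])
    k = len(classes)
    accuracy = [0.7] * n_learners  # initial guess (an assumption)
    posteriors = []
    for _ in range(iterations):
        # E-step: posterior over the true genotype at each locus.
        posteriors = []
        for row in votes:
            scores = []
            for c in classes:
                p = 1.0
                for j, v in enumerate(row):
                    p *= accuracy[j] if v == c else (1 - accuracy[j]) / (k - 1)
                scores.append(p)
            total = sum(scores)
            posteriors.append([s / total for s in scores])
        # M-step: a learner's accuracy is the expected fraction of loci
        # where its vote matches the (soft) true genotype.
        for j in range(n_learners):
            expected = sum(
                post[classes.index(row[j])] for row, post in zip(votes, posteriors)
            )
            # Clamp to keep probabilities strictly inside (0, 1).
            accuracy[j] = min(max(expected / len(votes), 0.01), 0.99)
    calls = [classes[max(range(k), key=lambda i: post[i])] for post in posteriors]
    return calls, accuracy


# Hypothetical votes: learners 1 and 2 agree, learner 3 is often wrong.
classes = ["0/0", "0/1", "1/1"]
votes = [
    ("0/1", "0/1", "0/0"),
    ("0/0", "0/0", "0/0"),
    ("1/1", "1/1", "0/1"),
    ("0/1", "0/1", "1/1"),
    ("0/0", "0/0", "0/1"),
    ("1/1", "1/1", "1/1"),
]
calls, accuracy = em_ensemble(votes, classes)
print(calls)
print([round(a, 2) for a in accuracy])
```

Unlike a plain majority vote, the loop learns to down-weight the learner that disagrees most often, so its votes count for less at ambiguous loci.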
7.
Acta Ophthalmol ; 102(5): e831-e841, 2024 Aug.
Article in English | MEDLINE | ID: mdl-38131161

ABSTRACT

PURPOSE: To assess the suitability of machine learning (ML) techniques for predicting the development of fibrosis and atrophy in patients with neovascular age-related macular degeneration (nAMD) receiving anti-VEGF treatment over a 36-month period. METHODS: An extensive analysis was conducted on the use of ML to predict fibrosis and atrophy development in nAMD patients at 36 months from the start of anti-VEGF treatment, using only data from the first 12 months. We used data collected according to real-world practice, which include clinical and genetic factors. RESULTS: The ML analysis consistently identified ETDRS as a relevant factor for predicting the development of atrophy and fibrosis, confirming previous statistical analyses. In addition, genetic variables did not show statistical relevance in the prediction. Despite the complexity of predicting macular degeneration, our model obtained a balanced accuracy of 63% and an AUC of 0.72 when predicting the development of atrophy or fibrosis at 36 months. CONCLUSION: This study demonstrates the potential of ML techniques in predicting the development of fibrosis and atrophy in nAMD patients receiving long-term anti-VEGF treatment. The findings highlight the importance of clinical factors, particularly the ETDRS (Early Treatment Diabetic Retinopathy Study) visual acuity test, in predicting these outcomes. The lessons learned from this research can guide future ML-based prediction tasks in the field of ophthalmology and contribute to the design of data collection processes.


Subject(s)
Angiogenesis Inhibitors , Fibrosis , Intravitreal Injections , Machine Learning , Vascular Endothelial Growth Factor A , Visual Acuity , Wet Macular Degeneration , Humans , Angiogenesis Inhibitors/therapeutic use , Male , Wet Macular Degeneration/diagnosis , Wet Macular Degeneration/drug therapy , Female , Aged , Vascular Endothelial Growth Factor A/antagonists & inhibitors , Optical Coherence Tomography/methods , Atrophy , Follow-Up Studies , Aged 80 and over , Retrospective Studies , Ranibizumab/administration & dosage , Ranibizumab/therapeutic use , Fluorescein Angiography/methods , Fundus Oculi
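
The nAMD abstract reports balanced accuracy rather than plain accuracy. For a binary outcome, balanced accuracy is the mean of sensitivity and specificity, which keeps a majority class from dominating the score. The sketch below computes both metrics on made-up labels; the data are hypothetical and unrelated to the study.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = event)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Hypothetical imbalanced example: 2 events among 10 patients.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

plain = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
balanced = balanced_accuracy(y_true, y_pred)
print(plain, balanced)  # 0.8 0.6875
```

Here plain accuracy looks high because most patients have no event, while balanced accuracy exposes that only half of the events were detected.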
8.
Bioinformatics ; 40(1)2024 01 02.
Article in English | MEDLINE | ID: mdl-38134424

ABSTRACT

MOTIVATION: Drug-target interaction (DTI) prediction is a relevant but challenging task in the drug repurposing field. In-silico approaches have drawn particular attention as they can reduce the costs and time commitment associated with traditional methodologies. Yet, current state-of-the-art methods present several limitations: existing DTI prediction approaches are computationally expensive, which hinders the use of large networks and the exploitation of available datasets, and the generalization of DTI prediction methods to unseen datasets remains unexplored, although studying it could improve the development of DTI inference approaches in terms of accuracy and robustness. RESULTS: In this work, we introduce GeNNius (Graph Embedding Neural Network Interaction Uncovering System), a Graph Neural Network (GNN)-based method that outperforms state-of-the-art models in terms of both accuracy and time efficiency across a variety of datasets. We also demonstrated its prediction power to uncover new interactions by evaluating previously unknown DTIs for each dataset. We further assessed the generalization capability of GeNNius by training and testing it on different datasets, showing that this framework can potentially improve the DTI prediction task by training on large datasets and testing on smaller ones. Finally, we investigated qualitatively the embeddings generated by GeNNius, revealing that the GNN encoder maintains biological information after the graph convolutions while diffusing this information through nodes, eventually distinguishing protein families in the node embedding space. AVAILABILITY AND IMPLEMENTATION: GeNNius code is available at https://github.com/ubioinformat/GeNNius.


Subject(s)
Drug Delivery Systems , Drug Repositioning , Drug Interactions , Diffusion , Computer Neural Networks
9.
PLoS Comput Biol ; 19(10): e1011544, 2023 10.
Article in English | MEDLINE | ID: mdl-37819942

ABSTRACT

Emerging ultra-low coverage single-cell DNA sequencing (scDNA-seq) technologies have enabled high resolution evolutionary studies of copy number aberrations (CNAs) within tumors. While these sequencing technologies are well suited for identifying CNAs due to the uniformity of sequencing coverage, the sparsity of coverage poses challenges for the study of single-nucleotide variants (SNVs). In order to maximize the utility of increasingly available ultra-low coverage scDNA-seq data and obtain a comprehensive understanding of tumor evolution, it is important to also analyze the evolution of SNVs from the same set of tumor cells. We present Phertilizer, a method to infer a clonal tree from ultra-low coverage scDNA-seq data of a tumor. Based on a probabilistic model, our method recursively partitions the data by identifying key evolutionary events in the history of the tumor. We demonstrate the performance of Phertilizer on simulated data as well as on two real datasets, finding that Phertilizer effectively utilizes the copy-number signal inherent in the data to more accurately uncover clonal structure and genotypes compared to previous methods.


Subject(s)
Neoplasms , Trees , Humans , DNA Copy Number Variations/genetics , Neoplasms/genetics , DNA Sequence Analysis , High-Throughput Nucleotide Sequencing/methods , Single-Cell Analysis
10.
Commun Biol ; 5(1): 351, 2022 04 12.
Article in English | MEDLINE | ID: mdl-35414121

ABSTRACT

Single-cell RNA-Sequencing has the potential to provide deep biological insights by revealing complex regulatory interactions across diverse cell phenotypes at single-cell resolution. However, current single-cell gene regulatory network inference methods produce a single regulatory network per input dataset, limiting their capability to uncover complex regulatory relationships across related cell phenotypes. We present SimiC, a single-cell gene regulatory inference framework that overcomes this limitation by jointly inferring distinct, but related, gene regulatory dynamics per phenotype. We show that SimiC uncovers key regulatory dynamics missed by previously proposed methods across a range of systems, both model and non-model alike. In particular, SimiC was able to uncover CAR T cell dynamics after tumor recognition and key regulatory patterns on a regenerating liver, and was able to implicate glial cells in the generation of distinct behavioral states in honeybees. SimiC hence establishes a new approach to quantitating regulatory architectures between distinct cellular phenotypes, with far-reaching implications for systems biology.


Subject(s)
Gene Regulatory Networks , Neoplasms , Animals , Bees , Gene Expression Regulation , Phenotype , Systems Biology
11.
Bioinformatics ; 38(9): 2488-2495, 2022 04 28.
Article in English | MEDLINE | ID: mdl-35253844

ABSTRACT

MOTIVATION: An important step in the transcriptomic analysis of individual cells involves manually determining the cellular identities. To ease this labor-intensive annotation of cell-types, there has been a growing interest in automated cell annotation, which can be achieved by training classification algorithms on previously annotated datasets. Existing pipelines employ dataset integration methods to remove potential batch effects between source (annotated) and target (unannotated) datasets. However, the integration and classification steps are usually independent of each other and performed by different tools. We propose JIND (joint integration and discrimination for automated single-cell annotation), a neural-network-based framework for automated cell-type identification that performs integration in a space suitably chosen to facilitate cell classification. To account for batch effects, JIND performs a novel asymmetric alignment in which unseen cells are mapped onto the previously learned latent space, avoiding the need to retrain the classification model for new datasets. JIND also learns cell-type-specific confidence thresholds to identify cells that cannot be reliably classified. RESULTS: We show on several batched datasets that the joint approach to integration and classification of JIND outperforms existing pipelines in accuracy, and that a smaller fraction of cells is rejected as unlabeled as a result of the cell-type-specific confidence thresholds. Moreover, we investigate cells misclassified by JIND and provide evidence suggesting that they could be due to outliers in the annotated datasets or errors in the original approach used for annotation of the target batch. AVAILABILITY AND IMPLEMENTATION: Implementation for JIND is available at https://github.com/mohit1997/JIND and the data underlying this article can be accessed at https://doi.org/10.5281/zenodo.6246322. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Algorithms , Gene Expression Profiling
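
The JIND abstract mentions cell-type-specific confidence thresholds for rejecting cells that cannot be reliably classified. A minimal sketch of that rejection rule follows. The abstract does not say how the thresholds are learned, so the values below are fixed, hypothetical numbers, and the function name and the "Unassigned" label are illustrative.

```python
def classify_with_rejection(probabilities, thresholds, labels):
    """Assign the most probable label, or "Unassigned" if it is not confident enough.

    probabilities: list of per-cell probability lists (one value per label)
    thresholds:    per-label minimum probability (hypothetical values)
    labels:        label names, in the same order as the probabilities
    """
    results = []
    for probs in probabilities:
        best = max(range(len(probs)), key=lambda k: probs[k])
        # The threshold depends on the predicted cell type, not on the cell.
        if probs[best] >= thresholds[best]:
            results.append(labels[best])
        else:
            results.append("Unassigned")
    return results


labels = ["T cell", "B cell"]
thresholds = [0.9, 0.6]  # stricter for T cells in this toy example
probabilities = [[0.95, 0.05], [0.80, 0.20], [0.30, 0.70]]

assigned = classify_with_rejection(probabilities, thresholds, labels)
print(assigned)  # ['T cell', 'Unassigned', 'B cell']
```

A single global threshold would either reject the third cell or accept the second one; per-type thresholds allow both decisions to differ by class.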
12.
Bioinform Adv ; 2(1): vbac054, 2022.
Article in English | MEDLINE | ID: mdl-36699360

ABSTRACT

Motivation: The use of high precision for representing quality scores in nanopore sequencing data makes these scores hard to compress and, thus, responsible for most of the information stored in losslessly compressed FASTQ files. This motivates the investigation of the effect of quality score information loss on downstream analysis from nanopore sequencing FASTQ files. Results: We polished de novo assemblies for a mock microbial community and a human genome, and we called variants on a human genome. We repeated these experiments using various pipelines, under various coverage level scenarios and various quality score quantizers. In all cases, we found that the quantization of quality scores makes little difference to (and sometimes even improves on) the results obtained with the original (non-quantized) data. This suggests that the precision that is currently used for nanopore quality scores may be unnecessarily high, and motivates the use of lossy compression algorithms for this kind of data. Moreover, we show that even a non-specialized compressor, such as gzip, yields large storage space savings after the quantization of quality scores. Availability and supplementary information: Quantizers are freely available for download at: https://github.com/mrivarauy/QS-Quantizer.
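
To make the storage argument above concrete, the sketch below quantizes a synthetic Phred+33 quality string into coarse bins and compares gzip output sizes before and after. The uniform 8-wide binning and the random quality values are assumptions chosen for illustration; they are not the quantizers evaluated in the paper, and the synthetic string is not real nanopore data.

```python
import gzip
import random


def quantize(quality, bin_width=8, offset=33):
    """Replace each Phred+33 quality character with its bin's midpoint value."""
    out = []
    for ch in quality:
        q = ord(ch) - offset
        # Lower edge of the bin plus half a bin width.
        rep = (q // bin_width) * bin_width + bin_width // 2
        out.append(chr(rep + offset))
    return "".join(out)


random.seed(0)
# Synthetic quality string with scores drawn uniformly from 0..40.
quality = "".join(chr(random.randint(0, 40) + 33) for _ in range(20000))
quantized = quantize(quality)

original_size = len(gzip.compress(quality.encode("ascii")))
quantized_size = len(gzip.compress(quantized.encode("ascii")))
print(original_size, quantized_size)
```

The quantized string uses at most six distinct symbols instead of forty-one, so a general-purpose compressor needs far fewer bits per score; whether downstream results survive the information loss is the empirical question the abstract addresses.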

13.
Bioinformatics ; 37(21): 3923-3925, 2021 11 05.
Article in English | MEDLINE | ID: mdl-34478503

ABSTRACT

MOTIVATION: Mass spectrometry (MS) data, used for proteomics and metabolomics analyses, have seen considerable growth in recent years. Aiming at reducing the associated storage costs, dedicated compression algorithms for MS data have been proposed, such as MassComp and MSNumpress. However, these algorithms focus on either lossless or lossy compression, respectively, and do not exploit the additional redundancy existing across scans contained in a single file. We introduce mspack, a compression algorithm for MS data that exploits this additional redundancy and that supports both lossless and lossy compression, as well as the mzML and the legacy mzXML formats. mspack applies several preprocessing lossless transforms and optional lossy transforms with a configurable error, followed by the general purpose compressors gzip or bsc to achieve a higher compression ratio. RESULTS: We tested mspack on several datasets generated by commonly used MS instruments. When used with the bsc compression backend, mspack achieves on average 76% smaller file sizes for lossless compression and 94% smaller file sizes for lossy compression, as compared with the original files. Lossless mspack achieves 10-60% smaller file sizes than MassComp, and lossy mspack compresses 36-60% better than the lossy MSNumpress, for the same error, while exhibiting comparable accuracy and running time. AVAILABILITY AND IMPLEMENTATION: mspack is implemented in C++ and freely available at https://github.com/fhanau/mspack under the Apache license. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Data Compression , Data Compression/methods , Software , High-Throughput Nucleotide Sequencing/methods , Algorithms , Mass Spectrometry
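
The mspack abstract describes lossless preprocessing transforms followed by a general-purpose compressor. As a generic illustration of why such a pipeline helps, the sketch below delta-encodes a sorted list of integer values before passing it to zlib. Delta coding is a stand-in chosen for this example; the abstract does not list mspack's actual transforms, and the synthetic values (fixed-point m/z with nearly regular spacing) are an assumption.

```python
import struct
import zlib


def delta_encode(values):
    """Store each value as the difference from its predecessor."""
    prev = 0
    deltas = []
    for v in values:
        deltas.append(v - prev)
        prev = v
    return deltas


def delta_decode(deltas):
    """Invert delta_encode by accumulating the differences."""
    total = 0
    values = []
    for d in deltas:
        total += d
        values.append(total)
    return values


def pack(ints):
    """Serialize integers as little-endian signed 64-bit values."""
    return struct.pack(f"<{len(ints)}q", *ints)


# Synthetic, increasing values with almost constant spacing.
mz = [100_000 + 137 * i + (i % 3) for i in range(5000)]

raw_size = len(zlib.compress(pack(mz)))
delta_size = len(zlib.compress(pack(delta_encode(mz))))
print(raw_size, delta_size)
```

The transform is exactly invertible, so no information is lost; it only rearranges the data so that the compressor sees a short repeating pattern instead of ever-changing values.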
14.
Bioinformatics ; 37(24): 4862-4864, 2021 12 11.
Article in English | MEDLINE | ID: mdl-34128963

ABSTRACT

MOTIVATION: Nanopore sequencing technologies are rapidly gaining popularity, in part, due to the massive amounts of genomic data they produce in short periods of time (up to 8.5 TB of data in <72 h). To reduce the costs of transmission and storage, efficient compression methods for this type of data are needed. RESULTS: We introduce RENANO, a reference-based lossless data compressor specifically tailored to FASTQ files generated with nanopore sequencing technologies. RENANO improves on its predecessor ENANO, currently the state of the art, by providing a more efficient base call sequence compression component. Two compression algorithms are introduced, corresponding to the following scenarios: (1) a reference genome is available without cost to both the compressor and the decompressor and (2) the reference genome is available only on the compressor side, and a compacted version of the reference is included in the compressed file. We compare the compression performance of RENANO against ENANO on several publicly available nanopore datasets. RENANO improves the base call sequences compression of ENANO by 39.8% in scenario (1), and by 33.5% in scenario (2), on average, over all the datasets. As for total file compression, the average improvements are 12.7% and 10.6%, respectively. We also show that RENANO consistently outperforms the recent general-purpose genomic compressor Genozip. AVAILABILITY AND IMPLEMENTATION: RENANO is freely available for download at: https://github.com/guilledufort/RENANO. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Data Compression , Nanopores , Software , High-Throughput Nucleotide Sequencing/methods , Algorithms , Data Compression/methods
15.
Nat Commun ; 12(1): 2204, 2021 04 13.
Article in English | MEDLINE | ID: mdl-33850139

ABSTRACT

Intra-tumor heterogeneity renders the identification of somatic single-nucleotide variants (SNVs) a challenging problem. In particular, low-frequency SNVs are hard to distinguish from sequencing artifacts. While the increasing availability of multi-sample tumor DNA sequencing data holds the potential for more accurate variant calling, there is a lack of high-sensitivity multi-sample SNV callers that utilize these data. Here we report Moss, a method to identify low-frequency SNVs that recur in multiple sequencing samples from the same tumor. Moss provides any existing single-sample SNV caller the ability to support multiple samples with little additional time overhead. We demonstrate that Moss improves recall while maintaining high precision in a simulated dataset. On multi-sample hepatocellular carcinoma, acute myeloid leukemia and colorectal cancer datasets, Moss identifies new low-frequency variants that meet manual review criteria and are consistent with the tumor's mutational signature profile. In addition, Moss detects the presence of variants in more samples of the same tumor than reported by the single-sample caller. Moss' improved sensitivity in SNV calling will enable more detailed downstream analyses in cancer genomics.


Subject(s)
Neoplasm DNA/genetics , Liver Neoplasms/genetics , Nucleotides , Algorithms , Hepatocellular Carcinoma , Colorectal Neoplasms/genetics , Gene Frequency , Genomics/methods , Humans , Acute Myeloid Leukemia/genetics , Mutation , Single Nucleotide Polymorphism
16.
J Bioinform Comput Biol ; 18(6): 2050031, 2020 12.
Article in English | MEDLINE | ID: mdl-32938284

ABSTRACT

The amount of sequencing data is growing at a fast pace due to a rapid revolution in sequencing technologies. Quality scores, which indicate the reliability of each of the called nucleotides, take a significant portion of the sequencing data. In addition, quality scores are more challenging to compress than nucleotides, and they are often noisy. Hence, a natural solution to further decrease the size of the sequencing data is to apply lossy compression to the quality scores. Lossy compression may result in a loss of precision; however, it has been shown that when operating at some specific rates, lossy compression can achieve performance on variant calling similar to that achieved with the losslessly compressed data (i.e. the original data). We propose Coding with Random Orthogonal Matrices for quality scores (CROMqs), the first lossy compressor designed for the quality scores with the "infinitesimal successive refinability" property. With this property, the encoder needs to compress the data only once, at a high rate, while the decoder can decompress it iteratively. The decoder can reconstruct the set of quality scores at each step with reduced distortion each time. This characteristic is specifically useful in sequencing data compression, since the encoder does not generally know what the most appropriate rate of compression is, e.g. for not degrading variant calling accuracy. CROMqs removes the need to compress the data at multiple rates, which saves time. In addition to this property, we show that CROMqs obtains a rate-distortion performance comparable to that of the state-of-the-art lossy compressors. Moreover, we also show that it achieves a performance on variant calling comparable to that of the losslessly compressed data while achieving more than 50% reduction in size.


Subject(s)
Algorithms , Data Compression/methods , High-Throughput Nucleotide Sequencing/methods , Human Chromosome Pair 20/genetics , Computational Biology , Computer Simulation , Data Compression/standards , Data Compression/statistics & numerical data , Genetic Databases/statistics & numerical data , Fourier Analysis , High-Throughput Nucleotide Sequencing/standards , High-Throughput Nucleotide Sequencing/statistics & numerical data , Humans , Software
17.
Bioinformatics ; 36(18): 4810-4812, 2020 09 15.
Article in English | MEDLINE | ID: mdl-32609343

ABSTRACT

MOTIVATION: Sequencing data are often summarized at different annotation levels for further analysis, generally using the general feature format (GFF) or its descendants, gene transfer format (GTF) and GFF3. Existing utilities for accessing these files, like gffutils and gffread, do not focus on reducing the storage space, significantly increasing it in some cases. We propose GPress, a framework for querying GFF files in a compressed form. GPress can also incorporate and compress expression files from both bulk and single-cell RNA-Seq experiments, supporting simultaneous queries on both the GFF and expression files. In brief, GPress applies transformations to the data which are then compressed with the general lossless compressor BSC. To support queries, GPress compresses the data in blocks and creates several index tables for fast retrieval. RESULTS: We tested GPress on several GFF files of different organisms, and showed that it achieves on average a 61% reduction in size with respect to gzip (the current de facto compressor for GFF files) while being able to retrieve all annotations for a given identifier or a range of coordinates in a few seconds (when run on a common laptop). In contrast, gffutils provides faster retrieval but doubles the size of the GFF files. When additionally linking an expression file, we show that GPress can reduce its size by more than 68% when compared to gzip (for both bulk and single-cell RNA-Seq experiments), while still retrieving the information within seconds. Finally, applying BSC to the data streams generated by GPress instead of to the original file shows a size reduction of more than 44% on average. AVAILABILITY AND IMPLEMENTATION: GPress is freely available at https://github.com/qm2/gpress. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Data Compression , High-Throughput Nucleotide Sequencing , RNA-Seq , Software , Exome Sequencing
18.
Bioinformatics ; 36(16): 4506-4507, 2020 08 15.
Article in English | MEDLINE | ID: mdl-32470109

ABSTRACT

MOTIVATION: The amount of genomic data generated globally is growing explosively, increasing the need for processing, storage and transmission resources and motivating the development of efficient compression tools for these data. Work so far has focused mainly on the compression of data generated by short-read technologies. However, nanopore sequencing technologies are rapidly gaining popularity due to the advantages offered by the large increase in the average size of the produced reads, the reduction in their cost and the portability of the sequencing technology. We present ENANO (Encoder for NANOpore), a novel lossless compression algorithm especially designed for nanopore sequencing FASTQ files. RESULTS: The main focus of ENANO is on the compression of the quality scores, as they dominate the size of the compressed file. ENANO offers two modes, Maximum Compression and Fast (default), which trade off compression efficiency and speed. We tested ENANO, the current state-of-the-art compressor SPRING and the general compressor pigz on several publicly available nanopore datasets. The results show that the proposed algorithm consistently achieves the best compression performance (in both modes) on every considered nanopore dataset, with an average improvement over pigz and SPRING of >24.7% and 6.3%, respectively. In addition, in terms of encoding and decoding speeds, ENANO is 2.9× and 1.7× faster than SPRING, respectively, with memory consumption up to 0.2 GB. AVAILABILITY AND IMPLEMENTATION: ENANO is freely available for download at: https://github.com/guilledufort/EnanoFASTQ. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Data Compression , Nanopores , Algorithms , High-Throughput Nucleotide Sequencing , Sequence Analysis, DNA , Software
19.
Bioinformatics ; 36(7): 2275-2277, 2020 04 01.
Article in English | MEDLINE | ID: mdl-31830243

ABSTRACT

MOTIVATION: In an effort to provide a response to the ever-expanding generation of genomic data, the International Organization for Standardization (ISO) is designing a new solution for the representation, compression and management of genomic sequencing data: the Moving Picture Experts Group (MPEG)-G standard. This paper discusses the first implementation of an MPEG-G compliant entropy codec: GABAC. GABAC combines proven coding technologies, such as context-adaptive binary arithmetic coding, binarization schemes and transformations, into a straightforward solution for the compression of sequencing data. RESULTS: We demonstrate that GABAC outperforms well-established (entropy) codecs in a significant set of cases and thus can serve as an extension for existing genomic compression solutions, such as CRAM. AVAILABILITY AND IMPLEMENTATION: The GABAC library is written in C++. We also provide a command line application which exercises all features provided by the library. GABAC can be downloaded from https://github.com/mitogen/gabac. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
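
As a small illustration of what a binarization scheme does before binary arithmetic coding, the sketch below maps non-negative integers to bit strings with the order-0 Exp-Golomb code. The abstract does not name the binarizations GABAC uses, so the choice of Exp-Golomb here is an assumption, and the function names are hypothetical; the arithmetic-coding stage itself is not shown.

```python
def exp_golomb_encode(n):
    """Binarize a non-negative integer with the order-0 Exp-Golomb code.

    The codeword is the binary form of n + 1, preceded by one '0' for
    every bit after its leading '1'.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    body = bin(n + 1)[2:]
    return "0" * (len(body) - 1) + body


def exp_golomb_decode(bits):
    """Decode a concatenation of order-0 Exp-Golomb codewords.

    Assumes well-formed input (no truncated trailing codeword).
    """
    values, pos = [], 0
    while pos < len(bits):
        zeros = 0
        while bits[pos] == "0":  # count the zero prefix
            zeros += 1
            pos += 1
        values.append(int(bits[pos:pos + zeros + 1], 2) - 1)
        pos += zeros + 1
    return values
```

Small values get short codewords, which is why such codes suit symbol streams skewed toward zero; in a CABAC-style codec each resulting bit would then be coded with an adaptive probability model.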


Subject(s)
Data Compression , High-Throughput Nucleotide Sequencing , Genome , Genomics , Software
20.
Bioinformatics ; 36(8): 2328-2336, 2020 04 15.
Article in English | MEDLINE | ID: mdl-31873730

ABSTRACT

MOTIVATION: The variant sets produced by current genomic analysis pipelines contain many incorrectly called variants. These can potentially be eliminated by applying state-of-the-art filtering tools, such as Variant Quality Score Recalibration (VQSR) or Hard Filtering (HF). However, these methods are very user-dependent and fail to run in some cases. We propose VEF, a variant filtering tool based on decision tree ensemble methods that overcomes the main drawbacks of VQSR and HF. Contrary to these methods, we treat filtering as a supervised learning problem, using variant call data with known 'true' variants, i.e. a gold standard, for training. Once trained, VEF can be directly applied to filter the variants contained in a given Variant Call Format (VCF) file (we consider training and testing VCF files generated with the same tools, as we assume they will share feature characteristics). RESULTS: For the analysis, we used human whole genome sequencing (WGS) datasets for which gold standards are available. We show on these data that the proposed filtering tool VEF consistently outperforms VQSR and HF. In addition, we show that VEF generalizes well even when some features have missing values, when the training and testing datasets differ in coverage, and when sequencing pipelines other than GATK are used. Finally, since the training needs to be performed only once, there is a significant saving in running time when compared with VQSR (approximately 4 versus 50 min for filtering the single nucleotide polymorphisms of a human WGS sample). AVAILABILITY AND IMPLEMENTATION: Code and scripts available at: github.com/ChuanyiZ/vef. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Genomics , Software , Genome , High-Throughput Nucleotide Sequencing , Humans , Polymorphism, Single Nucleotide , Whole Genome Sequencing