ABSTRACT
T cell immunity is central for the control of viral infections. To characterize T cell immunity, but also for the development of vaccines, identification of exact viral T cell epitopes is fundamental. Here we identify and characterize multiple dominant and subdominant SARS-CoV-2 HLA class I and HLA-DR peptides as potential T cell epitopes in COVID-19 convalescent and unexposed individuals. SARS-CoV-2-specific peptides enabled detection of post-infectious T cell immunity, even in seronegative convalescent individuals. Cross-reactive SARS-CoV-2 peptides revealed pre-existing T cell responses in 81% of unexposed individuals and validated similarity with common cold coronaviruses, providing a functional basis for heterologous immunity in SARS-CoV-2 infection. Diversity of SARS-CoV-2 T cell responses was associated with mild symptoms of COVID-19, providing evidence that immunity requires recognition of multiple epitopes. Together, the proposed SARS-CoV-2 T cell epitopes enable identification of heterologous and post-infectious T cell immunity and facilitate development of diagnostic, preventive and therapeutic measures for COVID-19.
Subject(s)
COVID-19/immunology , Epitopes, T-Lymphocyte/immunology , Peptides/immunology , SARS-CoV-2/immunology , T-Lymphocytes/immunology , Viral Vaccines/immunology , COVID-19/prevention & control , COVID-19/virology , Cross Reactions/immunology , HLA-DR Antigens/immunology , HLA-DR Antigens/metabolism , Histocompatibility Antigens Class I/immunology , Histocompatibility Antigens Class I/metabolism , Humans , Immunologic Memory/immunology , SARS-CoV-2/physiology , T-Lymphocytes/metabolism , Viral Vaccines/administration & dosage
ABSTRACT
The ancient city of Chichén Itzá in Yucatán, Mexico, was one of the largest and most influential Maya settlements during the Late and Terminal Classic periods (AD 600-1000) and it remains one of the most intensively studied archaeological sites in Mesoamerica [1-4]. However, many questions about the social and cultural use of its ceremonial spaces, as well as its population's genetic ties to other Mesoamerican groups, remain unanswered [2]. Here we present genome-wide data obtained from 64 subadult individuals dating to around AD 500-900 that were found in a subterranean mass burial near the Sacred Cenote (sinkhole) in the ceremonial centre of Chichén Itzá. Genetic analyses showed that all analysed individuals were male and several individuals were closely related, including two pairs of monozygotic twins. Twins feature prominently in Mayan and broader Mesoamerican mythology, where they embody qualities of duality among deities and heroes [5], but until now they had not been identified in ancient Mayan mortuary contexts. Genetic comparison to present-day people in the region shows genetic continuity with the ancient inhabitants of Chichén Itzá, except at certain genetic loci related to human immunity, including the human leukocyte antigen complex, suggesting signals of adaptation due to infectious diseases introduced to the region during the colonial period.
Subject(s)
Ceremonial Behavior , DNA, Ancient , Genome, Human , Humans , Mexico , Genome, Human/genetics , Male , DNA, Ancient/analysis , History, Ancient , Female , Burial/history , Archaeology , Twins/genetics , History, Medieval
ABSTRACT
Differentiation of human embryonic stem cells (hESCs) provides a unique opportunity to study the regulatory mechanisms that facilitate cellular transitions in a human context. To that end, we performed comprehensive transcriptional and epigenetic profiling of populations derived through directed differentiation of hESCs representing each of the three embryonic germ layers. Integration of whole-genome bisulfite sequencing, chromatin immunoprecipitation sequencing, and RNA sequencing reveals unique events associated with specification toward each lineage. Lineage-specific dynamic alterations in DNA methylation and H3K4me1 are evident at putative distal regulatory elements that are frequently bound by pluripotency factors in the undifferentiated hESCs. In addition, we identified germ-layer-specific H3K27me3 enrichment at sites exhibiting high DNA methylation in the undifferentiated state. A better understanding of these initial specification events will facilitate identification of deficiencies in current approaches, leading to more faithful differentiation strategies as well as providing insights into the rewiring of human regulatory programs during cellular transitions.
Subject(s)
Embryonic Stem Cells/metabolism , Epigenesis, Genetic , Transcription, Genetic , Acetylation , Cell Differentiation , Chromatin/chemistry , Chromatin/metabolism , DNA Methylation , Enhancer Elements, Genetic , Histones/metabolism , Humans , Methylation
ABSTRACT
DEAD box (DDX) RNA helicases are a large family of ATPases, many of which have unknown functions. There is emerging evidence that besides their role in RNA biology, DDX proteins may stimulate protein kinases. To investigate if protein kinase-DDX interaction is a more widespread phenomenon, we conducted three orthogonal large-scale screens, including proteomics analysis with 32 RNA helicases, protein array profiling, and kinome-wide in vitro kinase assays. We retrieved Ser/Thr protein kinases as prominent interactors of RNA helicases and report hundreds of binary interactions. We identified members of ten protein kinase families, which bind to, and are stimulated by, DDX proteins, including CDK, CK1, CK2, DYRK, MARK, NEK, PRKC, SRPK, STE7/MAP2K, and STE20/PAK family members. We identified MARK1 in all screens and validated that DDX proteins accelerate the MARK1 catalytic rate. These findings indicate pervasive interactions between protein kinases and DEAD box RNA helicases, and provide a rich resource to explore their regulatory relationships.
Subject(s)
DEAD-box RNA Helicases , DEAD-box RNA Helicases/metabolism , DEAD-box RNA Helicases/genetics , Humans , Protein Binding , Proteomics/methods , Protein Kinases/metabolism , Protein Kinases/genetics , Protein Serine-Threonine Kinases/metabolism , Protein Serine-Threonine Kinases/genetics
ABSTRACT
Top-down proteomics using mass spectrometry facilitates the identification of intact proteoforms, that is, all molecular forms of proteins. Multiple past advances have led to the development of numerous sample preparation workflows. Here we systematically investigated the influence of different sample preparation steps on proteoform and protein identifications, including cell lysis, reduction and alkylation, proteoform enrichment, purification and fractionation. We found that all steps in sample preparation influence the subset of proteoforms identified (for example, their number, confidence, physicochemical properties and artificially generated modifications). The various sample preparation strategies resulted in complementary identifications, substantially increasing the proteome coverage. Overall, we identified 13,975 proteoforms from 2,720 proteins of human Caco-2 cells. The results presented can serve as suggestions for designing and adapting top-down proteomics sample preparation strategies to particular research questions. Moreover, we expect that the sampling bias and modifications identified at the intact protein level will also be useful in improving bottom-up proteomics approaches.
ABSTRACT
The volume of public proteomics data is rapidly increasing, causing a computational challenge for large-scale reanalysis. Here, we introduce quantms (https://quantms.org/), an open-source cloud-based pipeline for massively parallel proteomics data analysis. We used quantms to reanalyze 83 public ProteomeXchange datasets, comprising 29,354 instrument files from 13,132 human samples, to quantify 16,599 proteins based on 1.03 million unique peptides. quantms is based on standard file formats, improving the reproducibility, submission and dissemination of the data to ProteomeXchange.
Subject(s)
Cloud Computing , Proteomics , Software , Proteomics/methods , Humans , Databases, Protein , Proteome/analysis , Reproducibility of Results , Computational Biology/methods , Peptides/analysis , Peptides/chemistry
ABSTRACT
MOTIVATION: Cross-linking mass spectrometry has made remarkable advancements in the high-throughput characterization of protein structures and interactions. The resulting pairs of cross-linked peptides typically require geometric assessment and validation, given the availability of their corresponding structures. RESULTS: CLAUDIO (Cross-linking Analysis Using Distances and Overlaps) is an open-source software tool designed for the automated analysis and validation of different varieties of large-scale cross-linking experiments. Many of the otherwise manual processes for structural validation (i.e. structure retrieval and mapping) are performed fully automatically to simplify and accelerate the data interpretation process. In addition, CLAUDIO has the ability to remap intra-protein links as inter-protein links and discover evidence for homo-multimers. AVAILABILITY AND IMPLEMENTATION: CLAUDIO is available as open-source software under the MIT license at https://github.com/KohlbacherLab/CLAUDIO.
Subject(s)
Peptides , Software , Peptides/chemistry , Mass Spectrometry , Cross-Linking Reagents/chemistry
ABSTRACT
Top-down proteomics (TDP) directly analyzes intact proteins and thus provides more comprehensive qualitative and quantitative proteoform-level information than conventional bottom-up proteomics (BUP) that relies on digested peptides and protein inference. While significant advancements have been made in TDP in sample preparation, separation, instrumentation, and data analysis, reliable and reproducible data analysis still remains one of the major bottlenecks in TDP. A key step for robust data analysis is the establishment of an objective estimation of proteoform-level false discovery rate (FDR) in proteoform identification. The most widely used FDR estimation scheme is based on the target-decoy approach (TDA), which has primarily been established for BUP. We present evidence that the TDA-based FDR estimation may not work at the proteoform-level due to an overlooked factor, namely the erroneous deconvolution of precursor masses, which leads to incorrect FDR estimation. We argue that the conventional TDA-based FDR in proteoform identification is in fact protein-level FDR rather than proteoform-level FDR unless precursor deconvolution error rate is taken into account. To address this issue, we propose a formula to correct for proteoform-level FDR bias by combining TDA-based FDR and precursor deconvolution error rate.
Subject(s)
Peptides , Proteomics , DNA-Binding Proteins
ABSTRACT
In protein-RNA cross-linking mass spectrometry, UV or chemical cross-linking introduces stable bonds between amino acids and nucleic acids in protein-RNA complexes that are then analyzed and detected in mass spectra. This analytical tool delivers valuable information about RNA-protein interactions and RNA docking sites in proteins, both in vitro and in vivo. The identification of cross-linked peptides with oligonucleotides of different length leads to a combinatorial increase in search space. We demonstrate that the peptide retention time prediction tasks can be transferred to the task of cross-linked peptide retention time prediction using a simple amino acid composition encoding, yielding improved identification rates when the prediction error is included in rescoring. For the more challenging task of including fragment intensity prediction of cross-linked peptides in the rescoring, we obtain, on average, a similar improvement. Further improvement in the encoding and fine-tuning of retention time and intensity prediction models might lead to further gains, and merit further research.
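A simple amino acid composition encoding can be sketched as a vector of residue counts. The version below is an assumption about what such an encoding might look like: it covers only the 20 standard residues and ignores the cross-linked nucleotide moiety and any modifications, which the authors' encoding may treat differently. The function name and the example peptide are illustrative.

```python
from collections import Counter

# The 20 standard amino acids in a fixed order, so that every peptide
# maps to a vector of the same length.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_encoding(peptide: str) -> list[int]:
    """Return the count of each standard residue in the peptide.

    Sequence order is discarded; non-standard characters are ignored.
    """
    counts = Counter(peptide.upper())
    return [counts.get(aa, 0) for aa in AMINO_ACIDS]


vector = composition_encoding("PEPTIDEK")
print(dict(zip(AMINO_ACIDS, vector)))
```

Such a fixed-length vector can be fed to any regression model for retention time; because it drops sequence order, it trades some accuracy for simplicity.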
Subject(s)
Nucleic Acids , RNA , Amino Acids , Mass Spectrometry , Peptides
ABSTRACT
Accurate quantification of individual proteoforms is a crucial step in identifying proteome-wide alterations in different biological conditions. Intact proteoforms have been analyzed predominantly by liquid chromatography-mass spectrometry (LC-MS)-based top-down proteomics (TDP) and quantified primarily by the label-free quantification (LFQ) method, as it requires no additional costly labeling. In TDP, due to frequent coelution and complex signal structures, overlapping signals deriving from multiple proteoforms complicate accurate quantification. Here, we introduce FLASHQuant for MS1-level LFQ analysis in TDP, which is capable of automatically resolving and quantifying coeluting proteoforms. In benchmark tests performed with both spike-in proteins and proteome-level mixture data sets, FLASHQuant was shown to perform highly accurate and reproducible quantification in short runtimes of just a few minutes per LC-MS run. In particular, it was demonstrated that resolving overlapping proteoforms boosts the quantification accuracy. FLASHQuant is publicly available as platform-independent open-source software at https://openms.org/flashquant/, accompanied by the simple alignment algorithm ConsensusFeatureGroupDetector for multiple LC-MS runs.
Subject(s)
Algorithms , Proteomics , Proteomics/methods , Chromatography, Liquid/methods , Software , Humans , Proteome/analysis , Mass Spectrometry/methods , Tandem Mass Spectrometry/methods
ABSTRACT
Healthcare data are an important resource in applied medical research and are available across multiple centers. However, it remains a challenge to enable standardized data exchange processes across federal states, each with its own laws and regulations. The Medical Informatics Initiative (MII) was founded in 2016 to implement processes that enable cross-clinic access to healthcare data in Germany. Several working groups (WGs) have been set up to coordinate standardized data structures (WG Interoperability), patient information and declarations of consent (WG Consent), and regulations on data exchange (WG Data Sharing). Here we present the most important results of the Data Sharing working group, which include agreed terms of use, legal regulations, and data access processes. They are already being implemented by the established Data Integration Centers (DIZ) and Use and Access Committees (UACs). We describe the services that are necessary to provide researchers with standardized data access. They are implemented with the Research Data Portal for Health, among others. Since the pilot phase, 385 active researchers have used these processes, which, as of April 2024, has resulted in 19 registered projects and 31 submitted research applications.
Subject(s)
Electronic Health Records , Information Dissemination , Humans , Biomedical Research , Electronic Health Records/statistics & numerical data , Germany , Health Services Research , Medical Informatics , Medical Record Linkage/methods , Models, Organizational
ABSTRACT
BACKGROUND: Personalized oncology represents a shift in cancer treatment from conventional methods to target-specific therapies, where decisions are made based on the patient-specific tumor profile. Selection of the optimal therapy relies on a complex interdisciplinary analysis and interpretation of genomic variants by experts in molecular tumor boards. With up to hundreds of somatic variants identified in a tumor, this process requires visual analytics tools to guide and accelerate the annotation process. RESULTS: The Personal Cancer Network Explorer (PeCaX) is a visual analytics tool supporting the efficient annotation, navigation, and interpretation of somatic genomic variants through functional annotation, drug target annotation, and visual interpretation within the context of biological networks. Starting with somatic variants in a VCF file, PeCaX enables users to explore these variants through a web-based graphical user interface. The most prominent feature of PeCaX is the combination of clinical variant annotation and gene-drug networks with an interactive visualization. This reduces the time and effort the user needs to invest to get to a treatment suggestion and helps to generate new hypotheses. PeCaX is provided as a platform-independent containerized software package for local or institution-wide deployment. PeCaX is available for download at https://github.com/KohlbacherLab/PeCaX-docker.
Subject(s)
Neoplasms , Software , Humans , Genomics/methods , Neoplasms/genetics , Medical Oncology
ABSTRACT
Human expansion in the course of the Neolithic transition in western Eurasia has been one of the major topics in ancient DNA research in the last 10 years. Multiple studies have shown that the spread of agriculture and animal husbandry from the Near East across Europe was accompanied by large-scale human expansions. Moreover, changes in subsistence and migration associated with the Neolithic transition have been hypothesized to involve genetic adaptation. Here, we present high-quality genome-wide data from the Linear Pottery Culture site Derenburg-Meerenstieg II (DER) (N = 32 individuals) in Central Germany. Population genetic analyses show that the DER individuals carried predominantly Anatolian Neolithic-like ancestry and a very limited degree of local hunter-gatherer admixture, similar to other early European farmers. Increasing the Linear Pottery culture cohort size to ~100 individuals allowed us to perform various frequency- and haplotype-based analyses to investigate signatures of selection associated with changes following the adoption of the Neolithic lifestyle. In addition, we developed a new method called Admixture-informed Maximum-likelihood Estimation for Selection Scans that allowed us to test for selection signatures in an admixture-aware fashion. Focusing on the intersection of results from these selection scans, we identified various loci associated with immune function (JAK1, HLA-DQB1) and metabolism (LMF1, LEPR, SORBS1), as well as skin color (SLC24A5, CD82) and folate synthesis (MTHFR, NBPF3). Our findings shed light on the evolutionary pressures, such as infectious disease and changing diet, that were faced by the early farmers of Western Eurasia.
Subject(s)
Farmers , Human Migration , Agriculture , DNA, Ancient , DNA, Mitochondrial/genetics , Europe , Genetics, Population , History, Ancient , Humans
ABSTRACT
BACKGROUND: The immune peptidome of OPSCC has not previously been studied. Cancer-antigen specific vaccination may improve clinical outcome and efficacy of immune checkpoint inhibitors such as PD1/PD-L1 antibodies. METHODS: Mapping of the OPSCC HLA ligandome was performed by mass spectrometry (MS) based analysis of naturally presented HLA ligands isolated from tumour tissue samples (n = 40) using immunoaffinity purification. The cohort included 22 HPV-positive (primarily HPV-16) and 18 HPV-negative samples. A benign reference dataset comprised of the HLA ligandomes of benign haematological and tissue datasets was used to identify tumour-associated antigens. RESULTS: MS analysis led to the identification of naturally HLA-presented peptides in OPSCC tumour tissue. In total, 22,769 peptides from 9485 source proteins were detected on HLA class I. For HLA class II, 15,203 peptides from 4634 source proteins were discovered. By comparative profiling against the benign HLA ligandomic datasets, 29 OPSCC-associated HLA class I ligands covering 11 different HLA allotypes and nine HLA class II ligands were selected to create a peptide warehouse. CONCLUSION: Tumour-associated peptides are HLA-presented on the cell surfaces of OPSCCs. The established warehouse of OPSCC-associated peptides can be used for downstream immunogenicity testing and peptide-based immunotherapy in (semi)personalised strategies.
Subject(s)
HLA Antigens , Otorhinolaryngologic Neoplasms , Papillomavirus Infections , Squamous Cell Carcinoma of Head and Neck , Humans , Papillomavirus Infections/immunology , Peptides/immunology , Vaccination , Otorhinolaryngologic Neoplasms/immunology , HLA Antigens/immunology , Antigens, Neoplasm/immunology , Human papillomavirus 16 , Human papillomavirus 18
ABSTRACT
MOTIVATION: Diagnosis and treatment decisions based on genomic data have become widespread as the cost of genome sequencing gradually decreases. In this context, disease-gene association studies are of great importance. However, genomic data are very sensitive compared to other data types and contain information about individuals and their relatives. Many studies have shown that this information can be obtained from the query-response pairs on genomic databases. In this work, we propose a method that uses secure multi-party computation to query genomic databases in a privacy-protected manner. The proposed solution privately outsources genomic data from arbitrarily many sources to two non-colluding proxies and allows genomic databases to be safely stored in semi-honest cloud environments. It provides data privacy, query privacy and output privacy by using XOR-based sharing and, unlike previous solutions, allows queries to run efficiently on hundreds of thousands of genomic records. RESULTS: We measure the performance of our solution with parameters similar to real-world applications. It is possible to query a genomic database with 3 000 000 variants with five genomic query predicates in under 400 ms. Querying 1 048 576 genomes, each containing 1 000 000 variants, for the presence of five different query variants can be achieved in approximately 6 min with a small amount of dedicated hardware and connectivity. These execution times are in the right range to enable real-world applications in medical research and healthcare. Unlike previous studies, it is possible to query multiple databases with response times fast enough for practical application. To the best of our knowledge, this is the first solution that provides this performance for querying large-scale genomic data. AVAILABILITY AND IMPLEMENTATION: https://gitlab.com/DIFUTURE/privacy-preserving-variant-queries. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
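The XOR-based sharing mentioned above can be illustrated with a minimal two-party sketch. This shows only the generic secret-sharing primitive, not the authors' protocol: the bit-vector representation of variants, the function names, and the example values are all assumptions made for illustration.

```python
import secrets


def share(value: int, n_bits: int) -> tuple[int, int]:
    """Split value into two XOR shares; each share alone looks random."""
    mask = secrets.randbits(n_bits)
    return mask, value ^ mask


def reconstruct(share_a: int, share_b: int) -> int:
    """XOR the two shares back together to recover the secret."""
    return share_a ^ share_b


# Hypothetical 8-bit variant vectors (1 = variant present).
genome_x = 0b10110010
genome_y = 0b01100110

xa, xb = share(genome_x, n_bits=8)  # xa goes to proxy A, xb to proxy B
ya, yb = share(genome_y, n_bits=8)

# Each proxy can XOR its own shares locally, without communication,
# and the results are valid shares of genome_x ^ genome_y.
za, zb = xa ^ ya, xb ^ yb
print(reconstruct(za, zb) == genome_x ^ genome_y)
```

XOR on shares is free in this scheme because it needs no interaction between the proxies; operations such as AND require additional protocol steps that are not shown here.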
Subject(s)
Computer Security , Privacy , Humans , Genomics , Databases, Factual
ABSTRACT
BACKGROUND: Although of high individual and socioeconomic relevance, a reliable prediction model for the prognosis of juvenile stroke (18-55 years) is missing. Therefore, the study presented in this protocol aims to prospectively validate the discriminatory power of a prediction score for the 3 months functional outcome after juvenile stroke or transient ischemic attack (TIA) that has been derived from an independent retrospective study using standard clinical workup data. METHODS: PREDICT-Juvenile-Stroke is a multi-centre (n = 4) prospective observational cohort study collecting standard clinical workup data and data on treatment success at 3 months after acute ischemic stroke or TIA that aims to validate a new prediction score for juvenile stroke. The prediction score has been developed upon single center retrospective analysis of 340 juvenile stroke patients. The score determines the patient's individual probability for treatment success defined by a modified Rankin Scale (mRS) 0-2 or return to pre-stroke baseline mRS 3 months after stroke or TIA. This probability will be compared to the observed clinical outcome at 3 months using the area under the receiver operating characteristic curve. The primary endpoint is to validate the clinical potential of the new prediction score for a favourable outcome 3 months after juvenile stroke or TIA. Secondary outcomes are to determine to what extent predictive factors in juvenile stroke or TIA patients differ from those in older patients and to determine the predictive accuracy of the juvenile stroke prediction score on other clinical and paraclinical endpoints. A minimum of 430 juvenile patients (< 55 years) with acute ischemic stroke or TIA, and the same number of older patients will be enrolled for the prospective validation study. 
DISCUSSION: The juvenile stroke prediction score has the potential to enable personalisation of counselling, provision of appropriate information regarding the prognosis and identification of patients who benefit from specific treatments. TRIAL REGISTRATION: The study has been registered at https://drks.de on March 31, 2022 ( DRKS00024407 ).
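The comparison of predicted probabilities with observed outcomes by the area under the receiver operating characteristic curve can be sketched as follows. This is a generic pairwise (Mann-Whitney) computation of the AUC, not the study's analysis code; the function name and the toy data are hypothetical.

```python
def roc_auc(scores: list[float], outcomes: list[int]) -> float:
    """AUC as the probability that a randomly chosen patient with a
    favourable outcome (1) received a higher predicted probability than
    a randomly chosen patient with an unfavourable outcome (0).
    Ties count as one half."""
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    if not pos or not neg:
        raise ValueError("both outcome classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data: predicted probability of treatment success and the
# observed outcome at 3 months (1 = success, 0 = no success).
predicted = [0.9, 0.8, 0.4, 0.3]
observed = [1, 1, 0, 1]
print(roc_auc(predicted, observed))
```

An AUC of 0.5 corresponds to no discrimination and 1.0 to perfect discrimination; the pairwise form is quadratic in the number of patients, which is acceptable for cohorts of the size described here.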
Subject(s)
Ischemic Attack, Transient , Ischemic Stroke , Stroke , Humans , Young Adult , Aged , Ischemic Attack, Transient/diagnosis , Ischemic Attack, Transient/epidemiology , Ischemic Attack, Transient/complications , Ischemic Stroke/complications , Retrospective Studies , Stroke/diagnosis , Stroke/epidemiology , Stroke/complications , Prognosis , Predictive Value of Tests , Observational Studies as Topic
ABSTRACT
Today it is the norm that all relevant proteomics data that support the conclusions in scientific publications are made available in public proteomics data repositories. However, given the increase in the number of clinical proteomics studies, an important emerging topic is the management and dissemination of clinical, and thus potentially sensitive, human proteomics data. Both in the United States and in the European Union, there are legal frameworks protecting the privacy of individuals. Implementing privacy standards for publicly released research data in genomics and transcriptomics has led to processes to control who may access the data, so-called "controlled access" data. In parallel with the technological developments in the field, it is clear that the privacy risks of sharing proteomics data need to be properly assessed and managed. In our view, the proteomics community must be proactive in addressing these issues. Yet a careful balance must be kept. On the one hand, neglecting to address the potential of identifiability in human proteomics data could lead to reputational damage of the field, while on the other hand, erecting barriers to open access to clinical proteomics data will inevitably reduce reuse of proteomics data and could substantially delay critical discoveries in biomedical research. In order to balance these apparently conflicting requirements for data privacy and efficient use and reuse of research efforts through the sharing of clinical proteomics data, development efforts will be needed at different levels including bioinformatics infrastructure, policymaking, and mechanisms of oversight.
Subject(s)
Data Management , Proteomics , Confidentiality , Humans , Information Dissemination
ABSTRACT
Liver fibrosis interferes with normal liver function and facilitates hepatocellular carcinoma (HCC) development, representing a major threat to human health. Here, we present a comprehensive perspective of microRNA (miRNA) function on targeting the fibrotic microenvironment. Starting from a murine HCC model, we identify a miRNA network composed of 8 miRNA hubs and 54 target genes. We show that let-7, miR-30, miR-29c, miR-335, and miR-338 (collectively termed antifibrotic microRNAs [AF-miRNAs]) down-regulate key structural, signaling, and remodeling components of the extracellular matrix. During fibrogenic transition, these miRNAs are transcriptionally regulated by the transcription factor Pparγ and thus we identify a role of Pparγ as regulator of a functionally related class of AF-miRNAs. The miRNA network is active in human HCC, breast, and lung carcinomas, as well as in 2 independent mouse liver fibrosis models. Therefore, we identify a miRNA:mRNA network that contributes to formation of fibrosis in tumorous and nontumorous organs of mice and humans.
Subject(s)
Carcinoma, Hepatocellular/genetics , Gene Expression Regulation, Neoplastic , Liver Cirrhosis/pathology , Liver Neoplasms/genetics , MicroRNAs/genetics , PPAR gamma/metabolism , Animals , Breast Neoplasms/genetics , Breast Neoplasms/pathology , Carcinoma, Hepatocellular/pathology , CpG Islands/genetics , DNA Methylation , Datasets as Topic , Disease Models, Animal , Epigenesis, Genetic , Extracellular Matrix/pathology , Female , Hepatic Stellate Cells/pathology , Humans , Liver/cytology , Liver/pathology , Liver Neoplasms/pathology , Lung Neoplasms/genetics , Lung Neoplasms/pathology , Mice , Primary Cell Culture , Promoter Regions, Genetic/genetics , RNA-Seq , Tumor Microenvironment/genetics
ABSTRACT
BACKGROUND: With a growing amount of (multi-)omics data being available, the extraction of knowledge from these datasets is still a difficult problem. Classical enrichment-style analyses require predefined pathways or gene sets that are tested for significant deregulation to assess whether the pathway is functionally involved in the biological process under study. De novo identification of these pathways can reduce the bias inherent in predefined pathways or gene sets. At the same time, the definition and efficient identification of these pathways de novo from large biological networks is a challenging problem. RESULTS: We present a novel algorithm, DeRegNet, for the identification of maximally deregulated subnetworks on directed graphs based on deregulation scores derived from (multi-)omics data. DeRegNet can be interpreted as maximum likelihood estimation given a certain probabilistic model for de-novo subgraph identification. We use fractional integer programming to solve the resulting combinatorial optimization problem. We can show that the approach outperforms related algorithms on simulated data with known ground truths. On a publicly available liver cancer dataset we can show that DeRegNet can identify biologically meaningful subgraphs suitable for patient stratification. DeRegNet can also be used to find explicitly multi-omics subgraphs which we demonstrate by presenting subgraphs with consistent methylation-transcription patterns. DeRegNet is freely available as open-source software. CONCLUSION: The proposed algorithmic framework and its available implementation can serve as a valuable heuristic hypothesis generation tool contextualizing omics data within biomolecular networks.
Subject(s)
Algorithms , Software , Bias , Humans , Models, Statistical
ABSTRACT
Machine learning is increasingly applied in proteomics and metabolomics to predict molecular structure, function, and physicochemical properties, including behavior in chromatography, ion mobility, and tandem mass spectrometry. These must be described in sufficient detail to apply or evaluate the performance of trained models. Here we look at and interpret the recently published and general DOME (Data, Optimization, Model, Evaluation) recommendations for conducting and reporting on machine learning in the specific context of proteomics and metabolomics.