Results 1 - 20 of 77
1.
Bioinformatics ; 39(9)2023 09 02.
Article in English | MEDLINE | ID: mdl-37672022

ABSTRACT

MOTIVATION: Genome-wide association studies (GWAS) present several computational and statistical challenges for their data analysis, including knowledge discovery, interpretability, and translation to clinical practice. RESULTS: We develop, apply, and comparatively evaluate an automated machine learning (AutoML) approach, customized for genomic data that delivers reliable predictive and diagnostic models, the set of genetic variants that are important for predictions (called a biosignature), and an estimate of the out-of-sample predictive power. This AutoML approach discovers variants with higher predictive performance compared to standard GWAS methods, computes an individual risk prediction score, generalizes to new, unseen data, is shown to better differentiate causal variants from other highly correlated variants, and enhances knowledge discovery and interpretability by reporting multiple equivalent biosignatures. AVAILABILITY AND IMPLEMENTATION: Code for this study is available at: https://github.com/mensxmachina/autoML-GWAS. JADBio offers a free version at: https://jadbio.com/sign-up/. SNP data can be downloaded from the EGA repository (https://ega-archive.org/). PRS data are found at: https://www.aicrowd.com/challenges/opensnp-height-prediction. Simulation data to study population structure can be found at: https://easygwas.ethz.ch/data/public/dataset/view/1/.


Subject(s)
Genome-Wide Association Study , Polymorphism, Single Nucleotide , Humans , Phenotype , Computer Simulation , Machine Learning
2.
Oncologist ; 27(4): 272-284, 2022 04 05.
Article in English | MEDLINE | ID: mdl-35380712

ABSTRACT

Within the last decade, the science of molecular testing has evolved from single gene and single protein analysis to broad molecular profiling as a standard of care, quickly transitioning from research to practice. Terms such as genomics, transcriptomics, proteomics, circulating omics, and artificial intelligence are now commonplace, and this rapid evolution has left us with a significant knowledge gap within the medical community. In this paper, we attempt to bridge that gap and prepare the physician in oncology for multiomics, a group of technologies that have gone from looming on the horizon to become a clinical reality. The era of multiomics is here, and we must prepare ourselves for this exciting new age of cancer medicine.


Subject(s)
Artificial Intelligence , Neoplasms , Genomics , Humans , Medical Oncology , Neoplasms/genetics , Neoplasms/therapy , Proteomics
3.
PLoS Biol ; 17(4): e2006506, 2019 04.
Article in English | MEDLINE | ID: mdl-30978178

ABSTRACT

The differentiation of self-renewing progenitor cells requires not only the regulation of lineage- and developmental stage-specific genes but also the coordinated adaptation of housekeeping functions from a metabolically active, proliferative state toward quiescence. How metabolic and cell-cycle states are coordinated with the regulation of cell type-specific genes is an important question, because dissociation between differentiation, cell cycle, and metabolic states is a hallmark of cancer. Here, we use a model system to systematically identify key transcriptional regulators of Ikaros-dependent B cell-progenitor differentiation. We find that the coordinated regulation of housekeeping functions and tissue-specific gene expression requires a feedforward circuit whereby Ikaros down-regulates the expression of Myc. Our findings show how coordination between differentiation and housekeeping states can be achieved by interconnected regulators. Similar principles likely coordinate differentiation and housekeeping functions during progenitor cell differentiation in other cell lineages.


Subject(s)
B-Lymphocytes/cytology , Genes, myc , Precursor Cells, B-Lymphoid/cytology , Animals , B-Lymphocytes/metabolism , Cell Cycle/physiology , Cell Differentiation/genetics , Cell Lineage , Databases, Genetic , Down-Regulation , Gene Expression Regulation , Genes, Essential , Humans , Ikaros Transcription Factor/metabolism , Lymphocyte Activation , Mice , Precursor Cells, B-Lymphoid/metabolism , Transcription Factors/metabolism
4.
Int J Mol Sci ; 23(6)2022 Mar 09.
Article in English | MEDLINE | ID: mdl-35328380

ABSTRACT

Tissue-specific gene methylation events are key to the pathogenesis of several diseases and can be utilized for diagnosis and monitoring. Here, we established an in silico pipeline to analyze high-throughput methylome datasets to identify specific methylation fingerprints in three pathological entities of major burden, i.e., breast cancer (BrCa), osteoarthritis (OA) and diabetes mellitus (DM). Differential methylation analysis was conducted to compare tissues/cells related to the pathology and different types of healthy tissues, revealing Differentially Methylated Genes (DMGs). Highly performing and low feature number biosignatures were built with automated machine learning, including: (1) a five-gene biosignature discriminating BrCa tissue from healthy tissues (AUC 0.987 and precision 0.987), (2) three equivalent OA cartilage-specific biosignatures containing four genes each (AUC 0.978 and precision 0.986) and (3) a four-gene pancreatic β-cell-specific biosignature (AUC 0.984 and precision 0.995). Next, the BrCa biosignature was validated using an independent ccfDNA dataset showing an AUC and precision of 1.000, verifying the biosignature's applicability in liquid biopsy. Functional and protein interaction prediction analysis revealed that most DMGs identified are involved in pathways known to be related to the studied diseases or pointed to new ones. Overall, our data-driven approach contributes to the maximum exploitation of high-throughput methylome readings, helping to establish specific disease profiles to be applied in clinical practice and to understand human pathology.


Subject(s)
Breast Neoplasms , Osteoarthritis , Breast Neoplasms/metabolism , DNA Methylation , Epigenome , Female , Humans , Osteoarthritis/metabolism
5.
Bioinformatics ; 35(18): 3387-3396, 2019 09 15.
Article in English | MEDLINE | ID: mdl-30715136

ABSTRACT

MOTIVATION: Temporal variations in biological systems and more generally in natural sciences are typically modeled as a set of ordinary, partial or stochastic differential or difference equations. Algorithms for learning the structure and the parameters of a dynamical system are distinguished based on whether time is discrete or continuous, whether observations are time-series or time-course, and whether the system is deterministic or stochastic; however, there is no approach able to handle the various types of dynamical systems simultaneously. RESULTS: In this paper, we present a unified approach to infer both the structure and the parameters of non-linear dynamical systems of any type under the restriction of being linear with respect to the unknown parameters. Our approach, which is named Unified Sparse Dynamics Learning (USDL), consists of two steps. First, an atemporal system of equations is derived through the application of the weak formulation. Then, assuming a sparse representation for the dynamical system, we show that the inference problem can be expressed as a sparse signal recovery problem, allowing the application of an extensive body of algorithms and theoretical results. Results on simulated data demonstrate the efficacy and superiority of the USDL algorithm under multiple interventions and/or stochasticity. Additionally, USDL's accuracy significantly correlates with theoretical metrics such as the exact recovery coefficient. On real single-cell data, the proposed approach is able to induce high-confidence subgraphs of the signaling pathway. AVAILABILITY AND IMPLEMENTATION: Source code is available at Bioinformatics online. The USDL algorithm has also been integrated in SCENERY (http://scenery.csd.uoc.gr/), an online tool for single-cell mass cytometry analytics. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Algorithms , Nonlinear Dynamics , Signal Transduction , Software
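
The two steps named in the abstract above (weak formulation, then sparse recovery) can be illustrated for the simplest case of a deterministic ODE that is linear in its unknown parameters. This is a generic sketch, not the authors' derivation: the dictionary functions \theta_j, the test functions \varphi_k, and the tolerance \varepsilon are assumed notation that does not appear in the abstract.

```latex
% Assumed model: component x_i evolves as a linear combination of
% known dictionary functions \theta_j with unknown weights w_{ij}.
\frac{dx_i}{dt} = \sum_{j=1}^{p} w_{ij}\,\theta_j\bigl(x(t)\bigr)

% Weak formulation: multiply by a test function \varphi_k with
% \varphi_k(0) = \varphi_k(T) = 0 and integrate by parts, which
% removes the time derivative of the (possibly noisy) data.
-\int_0^T \varphi_k'(t)\,x_i(t)\,dt
  = \sum_{j=1}^{p} w_{ij} \int_0^T \varphi_k(t)\,\theta_j\bigl(x(t)\bigr)\,dt ,
  \qquad k = 1,\dots,K

% Stacking the K equations gives an atemporal linear system b = A w_i with
% b_k = -\int_0^T \varphi_k' x_i \,dt, \quad A_{kj} = \int_0^T \varphi_k\,\theta_j(x)\,dt .

% Sparse signal recovery: seek the sparsest weights consistent with the data.
\min_{w_i} \|w_i\|_0 \quad \text{subject to} \quad \|A w_i - b\|_2 \le \varepsilon
```

The nonzero entries of each recovered w_i would then indicate which dictionary terms, and hence which variables, drive x_i.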
6.
Cytometry A ; 97(3): 241-252, 2020 03.
Article in English | MEDLINE | ID: mdl-32100455

ABSTRACT

Single-cell platforms provide statistically large samples of snapshot observations capable of resolving intercellular heterogeneity. Currently, there is a growing literature on algorithms that exploit this attribute in order to infer the trajectory of biological mechanisms, such as cell proliferation and differentiation. Despite the efforts, the trajectory inference methodology has not yet been used for addressing the challenging problem of learning the dynamics of protein signaling systems. In this work, we assess this prospect by testing the performance of this class of algorithms on four proteomic temporal datasets. To evaluate the learning quality, we design new general-purpose evaluation metrics that are able to quantify performance on (i) the biological meaning of the output, (ii) the consistency of the inferred trajectory, (iii) the algorithm robustness, (iv) the correlation of the learning output with the initial dataset, and (v) the roughness of the cell parameter levels through the inferred trajectory. We show that experimental time alone is insufficient to provide knowledge about the order of proteins during signal transduction. Accordingly, we show that the inferred trajectories provide richer information about the underlying dynamics. We learn that established methods tested on high-dimensional data with small sample size, slow dynamics, and complex structures (e.g. bifurcations) cannot always work in the signaling setting. Among the methods we evaluate, Scorpius and a newly introduced approach that combines Diffusion Maps and Principal Curves were found to perform adequately in recovering the progression of signal transduction, although their performance on some metrics varies from one dataset to another. The novel metrics we devise highlight that it is difficult to conclude which single method is universally applicable for the task. Arguably, there are still many challenges and open problems to resolve. © 2020 The Authors. Cytometry Part A published by Wiley Periodicals, Inc. on behalf of International Society for Advancement of Cytometry.


Subject(s)
Algorithms , Proteomics , Humans
7.
Cytometry A ; 95(11): 1178-1190, 2019 11.
Article in English | MEDLINE | ID: mdl-31692248

ABSTRACT

Cytometry by time-of-flight (CyTOF) has emerged as a high-throughput single cell technology able to provide large samples of protein readouts. Already, there exists a large pool of advanced high-dimensional analysis algorithms that explore the observed heterogeneous distributions, making intriguing biological inferences. A fact largely overlooked by these methods, however, is the effect of the established data preprocessing pipeline on the distributions of the measured quantities. In this article, we focus on randomization, a transformation used for improving data visualization, which can negatively affect multivariate data analysis methods such as dimensionality reduction, clustering, and network reconstruction algorithms. Our results indicate that randomization should be used only for visualization purposes, but not in conjunction with high-dimensional analytical tools. © 2019 The Authors. Cytometry Part A published by Wiley Periodicals, Inc. on behalf of International Society for Advancement of Cytometry.


Subject(s)
Algorithms , Flow Cytometry/methods , Leukocytes, Mononuclear/cytology , B-Lymphocytes/cytology , B-Lymphocytes/metabolism , Blood Buffy Coat/cytology , Blood Buffy Coat/metabolism , Cluster Analysis , Humans , Leukocytes, Mononuclear/metabolism , Multivariate Analysis , Neural Networks, Computer , Random Allocation , Single-Cell Analysis , T-Lymphocytes/cytology , T-Lymphocytes/metabolism
8.
Nucleic Acids Res ; 45(W1): W270-W275, 2017 07 03.
Article in English | MEDLINE | ID: mdl-28525568

ABSTRACT

Flow and mass cytometry technologies can probe proteins as biological markers in thousands of individual cells simultaneously, providing unprecedented opportunities for reconstructing networks of protein interactions through machine learning algorithms. The network reconstruction (NR) problem has been well-studied by the machine learning community. However, the potentials of available methods remain largely unknown to the cytometry community, mainly due to their intrinsic complexity and the lack of comprehensive, powerful and easy-to-use NR software implementations specific for cytometry data. To bridge this gap, we present Single CEll NEtwork Reconstruction sYstem (SCENERY), a web server featuring several standard and advanced cytometry data analysis methods coupled with NR algorithms in a user-friendly, on-line environment. In SCENERY, users may upload their data and set their own study design. The server offers several data analysis options categorized into three classes of methods: data (pre)processing, statistical analysis and NR. The server also provides interactive visualization and download of results as ready-to-publish images or multimedia reports. Its core is modular and based on the widely-used and robust R platform allowing power users to extend its functionalities by submitting their own NR methods. SCENERY is available at scenery.csd.uoc.gr or http://mensxmachina.org/en/software/.


Subject(s)
Flow Cytometry/methods , Protein Interaction Mapping/methods , Software , Humans , Internet , Machine Learning , Mass Spectrometry/methods , T-Lymphocytes, Regulatory/metabolism
9.
BMC Bioinformatics ; 19(1): 17, 2018 01 23.
Article in English | MEDLINE | ID: mdl-29357817

ABSTRACT

BACKGROUND: Feature selection is commonly employed for identifying collectively-predictive biomarkers and biosignatures; it facilitates the construction of small statistical models that are easier to verify, visualize, and comprehend while providing insight to the human expert. In this work we extend established constraint-based feature-selection methods to high-dimensional "omics" temporal data, where the number of measurements is orders of magnitude larger than the sample size. The extension required the development of conditional independence tests for temporal and/or static variables conditioned on a set of temporal variables. RESULTS: The algorithm is able to return multiple, equivalent solution subsets of variables, scale to tens of thousands of features, and outperform or be on par with existing methods depending on the analysis task specifics. CONCLUSIONS: The use of this algorithm is suggested for variable selection with high-dimensional temporal data.


Subject(s)
Algorithms , Genomics , Linear Models
10.
BMC Bioinformatics ; 17 Suppl 5: 194, 2016 Jun 06.
Article in English | MEDLINE | ID: mdl-27294826

ABSTRACT

BACKGROUND: We address the problem of integratively analyzing multiple gene expression microarray datasets in order to reconstruct gene-gene interaction networks. Integrating multiple datasets is generally believed to provide increased statistical power and to lead to a better characterization of the system under study. However, the presence of systematic variation across different studies makes network reverse-engineering tasks particularly challenging. We contrast two approaches that have been frequently used in the literature for addressing systematic biases: meta-analysis methods, which first calculate appropriate statistics on single datasets and subsequently summarize them, and data-merging methods, which directly analyze the pooled data after removing possible biases. This comparative evaluation is performed on both synthetic and real data, the latter consisting of two manually curated microarray compendia comprising several E. coli and Yeast studies, respectively. Furthermore, the reconstruction of the regulatory network of the transcription factor Ikaros in human Peripheral Blood Mononuclear Cells (PBMCs) is presented as a case-study. RESULTS: The meta-analysis and data-merging methods included in our experimentations provided comparable performances on both synthetic and real data. Furthermore, both approaches outperformed (a) the naïve solution of merging data together ignoring possible biases, and (b) the results that are expected when only one dataset out of the available ones is analyzed in isolation. Using correlation statistics proved to be more effective than using p-values for correctly ranking candidate interactions. The results from the PBMC case-study indicate that the findings of the present study generalize to different types of network reconstruction algorithms. CONCLUSIONS: Ignoring the systematic variations that differentiate heterogeneous studies can produce results that are statistically indistinguishable from random guessing. Meta-analysis and data merging methods have proved equally effective in addressing this issue, and thus researchers may safely select the approach that best suits their specific application.


Subject(s)
Algorithms , Gene Regulatory Networks/genetics , Area Under Curve , Escherichia coli/genetics , Escherichia coli/metabolism , Humans , Leukocytes, Mononuclear/cytology , Leukocytes, Mononuclear/metabolism , Meta-Analysis as Topic , ROC Curve , Saccharomyces cerevisiae/genetics , Saccharomyces cerevisiae/metabolism
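
The contrast drawn in the abstract above between meta-analysis, data merging, and naive pooling can be sketched on toy data. The sketch rests on assumptions: Fisher's z-transform is used as one common way to combine correlations (the abstract does not say which statistics were used), per-study mean-centering stands in for bias removal, and the two "studies" are invented numbers with a constant offset between them.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def fisher_meta(studies):
    """Meta-analysis: combine per-study correlations via Fisher's z,
    weighting each study by (n - 3)."""
    num = den = 0.0
    for xs, ys in studies:
        weight = len(xs) - 3
        num += weight * math.atanh(pearson(xs, ys))
        den += weight
    return math.tanh(num / den)

def pooled(studies, center=False):
    """Data merging: pool all samples, optionally after removing each
    study's mean (a simple stand-in for bias correction)."""
    all_x, all_y = [], []
    for xs, ys in studies:
        off_x = sum(xs) / len(xs) if center else 0.0
        off_y = sum(ys) / len(ys) if center else 0.0
        all_x += [x - off_x for x in xs]
        all_y += [y - off_y for y in ys]
    return pearson(all_x, all_y)

# Invented expression values for two genes in two studies; study B
# carries a constant +10 offset (a systematic, study-level bias).
study_a = ([1, 2, 3, 4, 5], [4, 5, 2, 3, 1])
study_b = ([11, 12, 13, 14, 15], [14, 15, 12, 13, 11])
studies = [study_a, study_b]

meta_r = fisher_meta(studies)             # summarizes per-study statistics
naive_r = pooled(studies)                 # merges while ignoring the offset
merged_r = pooled(studies, center=True)   # merges after removing the offset
```

On these toy numbers the within-study correlation is -0.8, and both the meta-analysis and the centered merge recover it, whereas naive pooling yields roughly +0.87 because the study offset dominates.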
11.
J Transl Med ; 14(1): 295, 2016 10 19.
Article in English | MEDLINE | ID: mdl-27756323

ABSTRACT

The goal of biomarker research is to identify clinically valid markers. Despite decades of research there have been disappointingly few molecules or techniques that are in use today. The "1st International NTNU Symposium on Current and Future Clinical Biomarkers of Cancer: Innovation and Implementation" was held June 16th and 17th 2016, at the Knowledge Center of the St. Olavs Hospital in Trondheim, Norway, under the auspices of the Norwegian University of Science and Technology (NTNU) and the HUNT biobank and research center. The Symposium attracted approximately 100 attendees and invited speakers from 12 countries and 4 continents. In this Symposium original research and overviews on diagnostic, predictive and prognostic cancer biomarkers in serum, plasma, urine, pleural fluid and tumor, circulating tumor cells and bioinformatics as well as how to implement biomarkers in clinical trials were presented. Senior researchers and young investigators presented, reviewed and vividly discussed important new developments in the field of clinical biomarkers of cancer, with the goal of accelerating biomarker research and implementation. The excerpts of this symposium aim to give a cutting-edge overview and insight on some highly important aspects of clinical cancer biomarkers to date, to connect molecular innovation with clinical implementation, and to eventually improve patient care.


Subject(s)
Biomarkers, Tumor/metabolism , Internationality , Biomarkers, Tumor/blood , Biomarkers, Tumor/urine , Databases as Topic , Humans , Neoplasms/blood , Neoplasms/pathology , Neoplasms/urine , Norway , Reproducibility of Results
12.
Nucleic Acids Res ; 41(9): 4938-48, 2013 May.
Article in English | MEDLINE | ID: mdl-23519611

ABSTRACT

We report the genomic occupancy profiles of the key hematopoietic transcription factor GATA-1 in pro-erythroblasts and mature erythroid cells fractionated from day E12.5 mouse fetal liver cells. Integration of GATA-1 occupancy profiles with available genome-wide transcription factor and epigenetic profiles assayed in fetal liver cells enabled us to evaluate GATA-1 involvement in modulating local chromatin structure of target genes during erythroid differentiation. Our results suggest that GATA-1 associates preferentially with changes of specific epigenetic modifications, such as H4K16, H3K27 acetylation and H3K4 di-methylation. Furthermore, we used random forest (RF) non-linear regression to predict changes in the expression levels of GATA-1 target genes based on the genomic features available for pro-erythroblasts and mature fetal liver-derived erythroid cells. Remarkably, our prediction model explained a high proportion (62%) of the variation in gene expression. Hierarchical clustering of the proximity values calculated by the RF model produced a clear separation of upregulated versus downregulated genes and a further separation of downregulated genes in two distinct groups. Thus, our study of GATA-1 genome-wide occupancy profiles in mouse primary erythroid cells and their integration with global epigenetic marks reveals three clusters of GATA-1 gene targets that are associated with specific epigenetic signatures and functional characteristics.


Subject(s)
Epigenesis, Genetic , Erythropoiesis/genetics , GATA1 Transcription Factor/metabolism , Liver/metabolism , Animals , Cells, Cultured , Erythroid Cells/metabolism , Fetus , Genome , Histones/metabolism , Liver/cytology , Liver/embryology , Mice
13.
JTO Clin Res Rep ; 5(4): 100660, 2024 Apr.
Article in English | MEDLINE | ID: mdl-38586302

ABSTRACT

Background: Improving the method for selecting participants for lung cancer (LC) screening is an urgent need. Here, we compared the performance of the Helseundersøkelsen i Nord-Trøndelag (HUNT) Lung Cancer Model (HUNT LCM) versus the Dutch-Belgian lung cancer screening trial (Nederlands-Leuvens Longkanker Screenings Onderzoek (NELSON)) and 2021 United States Preventive Services Task Force (USPSTF) criteria regarding LC risk prediction and efficiency. Methods: We used linked data from 10 Norwegian prospective population-based cohorts, Cohort of Norway. The study included 44,831 ever-smokers, of whom 686 (1.5%) developed LC; the median follow-up time was 11.6 years (0.01-20.8 years). Results: Within 6 years, 222 (0.5%) individuals developed LC. The NELSON and 2021 USPSTF criteria predicted 37.4% and 59.5% of the LC cases, respectively. By considering the same number of individuals as the NELSON and 2021 USPSTF criteria selected, the HUNT LCM increased the LC prediction rate by 41.0% and 12.1%, respectively. The HUNT LCM significantly increased sensitivity (p < 0.001 and p = 0.028), and reduced the number needed to predict one LC case (29 versus 40, p < 0.001 and 36 versus 40, p = 0.02), respectively. Applying the HUNT LCM 6-year 0.98% risk score as a cutoff (14.0% of ever-smokers) predicted 70.7% of all LC, increasing the LC prediction rate by 89.2% and 18.9% versus the NELSON and 2021 USPSTF criteria, respectively (both p < 0.001). Conclusions: The HUNT LCM was significantly more efficient than the NELSON and 2021 USPSTF criteria, improving the prediction of LC diagnosis, and may be used as a validated clinical tool for screening selection.
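
The relationship between the sensitivities and the relative increases reported for the 0.98% cutoff can be checked with a short calculation. The case counts below (83, 132, 157) are not given in the abstract; they are back-calculated here from the reported percentages of the 222 six-year cases and should be read as an assumption.

```python
# Assumed counts of correctly predicted LC cases, back-calculated from
# the reported 37.4%, 59.5% and 70.7% of 222 cases (not stated in the abstract).
TOTAL_CASES = 222
predicted = {"NELSON": 83, "USPSTF 2021": 132, "HUNT LCM cutoff": 157}

def sensitivity(n_predicted, n_cases=TOTAL_CASES):
    """Fraction of all LC cases that a criterion selects."""
    return n_predicted / n_cases

def relative_increase(new, old):
    """Relative gain of `new` over `old`, as a fraction of `old`."""
    return (new - old) / old

# Sensitivities in percent, rounded to one decimal as in the abstract.
sens = {name: round(100 * sensitivity(n), 1) for name, n in predicted.items()}

# Relative increase in predicted cases for the HUNT LCM cutoff.
gain_vs_nelson = round(100 * relative_increase(157, 83), 1)
gain_vs_uspstf = round(100 * relative_increase(157, 132), 1)
```

Under these assumed counts the sketch reproduces the reported sensitivities (37.4%, 59.5%, 70.7%) and relative increases (89.2% and 18.9%), which suggests the latter are relative rather than absolute gains.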

14.
J Cancer Res Clin Oncol ; 150(7): 355, 2024 Jul 20.
Article in English | MEDLINE | ID: mdl-39031255

ABSTRACT

INTRODUCTION: Blood biomarkers for early detection of lung cancer (LC) are in demand. There are few studies of the full microRNome in serum of asymptomatic subjects that later develop LC. Here we searched for novel microRNA biomarkers in blood from non-cancer, ever-smoker populations up to eight years before diagnosis. METHODS: Serum samples from 98,737 subjects from two prospective population studies, HUNT2 and HUNT3, were considered initially. Inclusion criteria for cases were: ever-smokers; no known cancer at study entrance; 0-8 years from blood sampling to LC diagnosis. Each future LC case had one control matched to sex, age at study entrance, pack-years, smoking cessation time, and similar HUNT Lung Cancer Model risk score. A total of 240 and 72 serum samples were included in the discovery (HUNT2) and validation (HUNT3) datasets, respectively, and analysed by next-generation sequencing. The validated serum microRNAs were also tested in two pre-diagnostic plasma datasets from the prospective population studies NOWAC (n = 266) and NSHDS (n = 258). A new model adding clinical variables was also developed and validated. RESULTS: Fifteen unique microRNAs were discovered and validated in the pre-diagnostic serum datasets when all cases were contrasted against all controls, all with AUC > 0.60. In combination as a 15-microRNA signature, the AUC reached 0.708 (discovery) and 0.703 (validation). A non-small cell lung cancer signature of six microRNAs showed AUC 0.777 (discovery) and 0.806 (validation). Combined with clinical variables of the HUNT Lung Cancer Model (age, gender, pack-years, daily cough parts of the year, hours of indoor smoke exposure, quit time in years, number of cigarettes daily, body mass index (BMI)) the AUC reached 0.790 (discovery) and 0.833 (validation). These results could not be validated in the plasma samples. CONCLUSION: There were a few significantly differentially expressed microRNAs in serum up to eight years before diagnosis. These promising microRNAs alone, in concert, or combined with clinical variables have the potential to serve as early diagnostic LC biomarkers. Plasma is not suitable for this analysis. Further validation in larger prospective serum datasets is needed.


Subject(s)
Biomarkers, Tumor , Early Detection of Cancer , Lung Neoplasms , MicroRNAs , Humans , Lung Neoplasms/blood , Lung Neoplasms/genetics , Lung Neoplasms/diagnosis , Female , Male , Middle Aged , Biomarkers, Tumor/blood , Biomarkers, Tumor/genetics , MicroRNAs/blood , MicroRNAs/genetics , Prospective Studies , Early Detection of Cancer/methods , Aged , Case-Control Studies , Smoking/blood , Smoking/adverse effects , Adult
16.
Mach Learn ; 112(11): 4257-4287, 2023.
Article in English | MEDLINE | ID: mdl-37900054

ABSTRACT

Molecular gene-expression datasets consist of samples with tens of thousands of measured quantities (i.e., high dimensional data). However, lower-dimensional representations that retain the useful biological information do exist. We present a novel algorithm for such dimensionality reduction called Pathway Activity Score Learning (PASL). The major novelty of PASL is that the constructed features directly correspond to known molecular pathways (genesets in general) and can be interpreted as pathway activity scores. Hence, unlike PCA and similar methods, PASL's latent space has a fairly straightforward biological interpretation. PASL is shown to outperform in predictive performance the state-of-the-art method (PLIER) on two collections of breast cancer and leukemia gene expression datasets. PASL is also trained on a large corpus of 50000 gene expression samples to construct a universal dictionary of features across different tissues and pathologies. The dictionary is validated on 35643 held-out samples for reconstruction error. It is then applied on 165 held-out datasets spanning a diverse range of diseases. The AutoML tool JADBio is employed to show that the predictive information in the PASL-created feature space is retained after the transformation. The code is available at https://github.com/mensxmachina/PASL.

17.
IBRO Neurosci Rep ; 15: 77-89, 2023 Dec.
Article in English | MEDLINE | ID: mdl-38025660

ABSTRACT

Background: Transcriptomic profile differences between patients with bipolar disorder and healthy controls can be identified using machine learning and can provide information about the potential role of the cerebellum in the pathogenesis of bipolar disorder. With this aim, user-friendly, fully automated machine learning algorithms can achieve extremely high classification scores and disease-related predictive biosignature identification, in short time frames and scaled down to small datasets. Method: A fully automated machine learning platform, based on the most suitable algorithm selection and relevant set of hyper-parameter values, was applied on a preprocessed transcriptomics dataset, in order to produce a model for biosignature selection and to classify subjects into groups of patients and controls. The parent GEO datasets were originally produced from the cerebellar and parietal lobe tissue of deceased bipolar patients and healthy controls, using Affymetrix Human Gene 1.0 ST Array. Results: Patients and controls were classified into two separate groups, with no close-to-the-boundary cases, and this classification was based on the cerebellar transcriptomic biosignature of 25 features (genes), with Area Under Curve 0.929 and Average Precision 0.955. The biosignature includes both genes previously connected to bipolar disorder, depression, psychosis or epilepsy, as well as genes not previously linked to any psychiatric disease. Kyoto Encyclopedia of Genes and Genomes (KEGG) analysis revealed participation of 4 identified features in 6 pathways which have also been associated with bipolar disorder. Conclusion: Automated machine learning (AutoML) managed to identify accurately 25 genes that can jointly - in a multivariate fashion - separate bipolar patients from healthy controls with high predictive power. The discovered features lead to new biological insights. Machine Learning (ML) analysis considers the features in combination (in contrast to standard differential expression analysis), removing both irrelevant as well as redundant markers, and thus focusing on biological interpretation.

19.
Sci Adv ; 9(45): eadi2095, 2023 11 10.
Article in English | MEDLINE | ID: mdl-37939182

ABSTRACT

Co-transcriptional RNA-DNA hybrids can not only cause DNA damage threatening genome integrity but also regulate gene activity in a mechanism that remains unclear. Here, we show that the nucleotide excision repair factor XPF interacts with the insulator binding protein CTCF and the cohesin subunits SMC1A and SMC3, leading to R-loop-dependent DNA looping upon transcription activation. To facilitate R-loop processing, XPF interacts with and recruits TOP2B on active gene promoters, leading to double-strand break accumulation and the activation of a DNA damage response. Abrogation of TOP2B leads to the diminished recruitment of XPF, CTCF, and the cohesin subunits to promoters of actively transcribed genes and R-loops and the concurrent impairment of CTCF-mediated DNA looping. Together, our findings disclose an essential role for XPF with TOP2B and the CTCF/cohesin complex in R-loop processing for transcription activation with important ramifications for DNA repair-deficient syndromes associated with transcription-associated DNA damage.


Subject(s)
DNA-Binding Proteins , R-Loop Structures , CCCTC-Binding Factor/genetics , CCCTC-Binding Factor/metabolism , DNA-Binding Proteins/genetics , DNA-Binding Proteins/metabolism , Chromosomes , DNA Repair , Chromatin
20.
Patterns (N Y) ; 3(12): 100612, 2022 Dec 09.
Article in English | MEDLINE | ID: mdl-36569551

ABSTRACT

In a typical predictive modeling task, we are asked to produce a final predictive model to employ operationally for predictions, as well as an estimate of its out-of-sample predictive performance. Typically, analysts hold out a portion of the available data, called a Test set, to estimate the model predictive performance on unseen (out-of-sample) records, thus "losing these samples to estimation." However, this practice is unacceptable when the total sample size is low. To avoid losing data to estimation, we need a shift in our perspective: we do not estimate the performance of a specific model instance; we estimate the performance of the pipeline that produces the model. This pipeline is applied on all available samples to produce the final model; no samples are lost to estimation. An estimate of its performance is provided by training the same pipeline on subsets of the samples. When multiple pipelines are tried, additional considerations that correct for the "winner's curse" need to be in place.
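
The shift in perspective described above, estimating the performance of the pipeline rather than of one model instance, can be sketched with plain k-fold cross-validation. Everything named here is illustrative: the `pipeline` function is a hypothetical one-feature threshold learner, the data are simulated, and the sketch covers a single pipeline only, so it does not include the "winner's curse" correction the abstract calls for when several pipelines are compared.

```python
import random
from statistics import mean

def pipeline(train):
    """Hypothetical learning pipeline: from (x, y) pairs with y in {0, 1},
    build a classifier that thresholds x at the midpoint of the class means."""
    m0 = mean(x for x, y in train if y == 0)
    m1 = mean(x for x, y in train if y == 1)
    threshold = (m0 + m1) / 2
    sign = 1 if m1 >= m0 else -1
    return lambda x: int(sign * (x - threshold) > 0)

def accuracy(model, data):
    """Fraction of (x, y) pairs the model labels correctly."""
    return mean(int(model(x) == y) for x, y in data)

def estimate_and_fit(data, k=5, seed=0):
    """Return (final_model, estimate).

    The estimate comes from re-running the same pipeline on k training
    subsets and scoring each on its held-out fold; the final model is
    produced by the pipeline on ALL samples, so none are set aside."""
    rows = data[:]
    random.Random(seed).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        scores.append(accuracy(pipeline(train), folds[i]))
    final_model = pipeline(rows)
    return final_model, mean(scores)

# Simulated low-sample dataset: 30 controls around 0, 30 cases around 3.
rng = random.Random(1)
data = [(rng.gauss(0, 1), 0) for _ in range(30)]
data += [(rng.gauss(3, 1), 1) for _ in range(30)]

model, estimate = estimate_and_fit(data)
```

Here `estimate` describes how well models produced by this pipeline tend to perform on unseen samples, and it is reported alongside `model`, which was trained on all 60 samples.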
