Results 1 - 20 of 4,285
1.
Cell ; 173(7): 1692-1704.e11, 2018 06 14.
Article in English | MEDLINE | ID: mdl-29779949

ABSTRACT

Heritability is essential for understanding the biological causes of disease but requires laborious patient recruitment and phenotype ascertainment. Electronic health records (EHRs) passively capture a wide range of clinically relevant data and provide a resource for studying the heritability of traits that are not typically accessible. EHRs contain next-of-kin information collected via patient emergency contact forms, but until now, these data have gone unused in research. We mined emergency contact data at three academic medical centers and identified 7.4 million familial relationships while maintaining patient privacy. Identified relationships were consistent with genetically derived relatedness. We used EHR data to compute heritability estimates for 500 disease phenotypes. Overall, estimates were consistent with the literature and between sites. Inconsistencies were indicative of limitations and opportunities unique to EHR research. These analyses provide a validation of the use of EHRs for genetics and disease research.


Subject(s)
Electronic Health Records; Genetic Diseases, Inborn/genetics; Algorithms; Databases, Factual; Family Relations; Genetic Diseases, Inborn/pathology; Genotype; Humans; Pedigree; Phenotype; Quantitative Trait, Heritable
2.
Immunity ; 55(6): 1105-1117.e4, 2022 06 14.
Article in English | MEDLINE | ID: mdl-35397794

ABSTRACT

Global research to combat the COVID-19 pandemic has led to the isolation and characterization of thousands of human antibodies to the SARS-CoV-2 spike protein, providing an unprecedented opportunity to study the antibody response to a single antigen. Using the information derived from 88 research publications and 13 patents, we assembled a dataset of ∼8,000 human antibodies to the SARS-CoV-2 spike protein from >200 donors. By analyzing immunoglobulin V and D gene usages, complementarity-determining region H3 sequences, and somatic hypermutations, we demonstrated that the common (public) responses to different domains of the spike protein were quite different. We further used these sequences to train a deep-learning model to accurately distinguish between the human antibodies to SARS-CoV-2 spike protein and those to influenza hemagglutinin protein. Overall, this study provides an informative resource for antibody research and enhances our molecular understanding of public antibody responses.


Subject(s)
COVID-19; SARS-CoV-2; Antibodies, Neutralizing; Antibodies, Viral; Antibody Formation; Humans; Pandemics; Spike Glycoprotein, Coronavirus
3.
Brief Bioinform ; 25(2), 2024 Jan 22.
Article in English | MEDLINE | ID: mdl-38493292

ABSTRACT

Computational predictors of immunogenic peptides, or epitopes, are traditionally built based on data from a broad range of pathogens without consideration for taxonomic information. While this approach may be reasonable if one aims to develop one-size-fits-all models, it may be counterproductive if the proteins for which the model is expected to generalize are known to come from a specific subset of phylogenetically related pathogens. There is mounting evidence that, for these cases, taxon-specific models can outperform generalist ones, even when trained with substantially smaller amounts of data. In this comment, we provide some perspective on the current state of taxon-specific modelling for the prediction of linear B-cell epitopes, and the challenges faced when building and deploying these predictors.


Subject(s)
Peptides; Proteins; Amino Acid Sequence; Epitopes, B-Lymphocyte
4.
Brief Bioinform ; 25(2), 2024 Jan 22.
Article in English | MEDLINE | ID: mdl-38314912

ABSTRACT

Increasing volumes of biomedical data are amassing in databases. Large-scale analyses of these data have wide-ranging applications in biology and medicine. Such analyses require tools to characterize and process entries at scale. However, existing tools, mainly centered on extracting predefined fields, often fail to comprehensively process database entries or correct evident errors, a task humans can easily perform. These tools also lack the ability to reason like domain experts, hindering their robustness and analytical depth. Recent advances with large language models (LLMs) provide a fundamentally new way to query databases. But while a tool such as ChatGPT is adept at answering questions about manually input records, challenges arise when scaling up this process. First, interactions with the LLM need to be automated. Second, limitations on input length may require a record pruning or summarization pre-processing step. Third, to behave reliably as desired, the LLM needs either well-designed, short, 'few-shot' examples, or fine-tuning based on a larger set of well-curated examples. Here, we report ChIP-GPT, based on fine-tuning of the generative pre-trained transformer (GPT) model Llama and on a program that prompts the model iteratively and handles the answer text it generates. This model is designed to extract metadata from the Sequence Read Archive, emphasizing the identification of chromatin immunoprecipitation (ChIP) targets and cell lines. When trained with 100 examples, ChIP-GPT demonstrates 90-94% accuracy. Notably, it can seamlessly extract data from records with typos or absent field labels. Our proposed method is easily adaptable to customized questions and different databases.


Subject(s)
Medicine; Humans; Cell Line; Chromatin Immunoprecipitation; Databases, Factual; Language
5.
RNA ; 29(12): 1896-1909, 2023 12.
Article in English | MEDLINE | ID: mdl-37793790

ABSTRACT

The characterization of the conformational landscape of the RNA backbone is rather complex due to the ability of RNA to assume a large variety of conformations. These backbone conformations can be depicted by pseudotorsional angles linking RNA backbone atoms, from which Ramachandran-like plots can be built. We explore here different definitions of these pseudotorsional angles, finding that the most accurate ones are the traditional η (eta) and θ (theta) angles, which represent the relative position of RNA backbone atoms P and C4'. We explore the distribution of η - θ in known experimental structures, comparing the pseudotorsional space generated with structures determined exclusively by one experimental technique. We found that the complete picture only appears when combining data from different sources. The maps provide a quite comprehensive representation of the RNA accessible space, which can be used in RNA-structural predictions. Finally, our results highlight that protein interactions lead to significant changes in the population of the η - θ space, pointing toward the role of induced-fit mechanisms in protein-RNA recognition.


Subject(s)
Proteins; RNA; RNA/genetics; RNA/chemistry; Proteins/chemistry; Nucleic Acid Conformation
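
The η and θ pseudotorsions mentioned in the abstract above are ordinary dihedral angles computed over P and C4' atoms. The sketch below assumes the conventional definitions (η: C4'(i-1), P(i), C4'(i), P(i+1); θ: P(i), C4'(i), P(i+1), C4'(i+1)) and plain (x, y, z) coordinate tuples. The function names are illustrative and are not taken from the paper.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def dihedral(p0, p1, p2, p3):
    """Signed torsion angle, in degrees, defined by four 3D points."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)   # normals of the two planes
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

def eta(c4_prev, p, c4, p_next):
    """Pseudotorsion eta: C4'(i-1), P(i), C4'(i), P(i+1) (assumed definition)."""
    return dihedral(c4_prev, p, c4, p_next)

def theta(p, c4, p_next, c4_next):
    """Pseudotorsion theta: P(i), C4'(i), P(i+1), C4'(i+1) (assumed definition)."""
    return dihedral(p, c4, p_next, c4_next)
```

Collecting (η, θ) pairs for every nucleotide in a set of structures and binning them in two dimensions would give a Ramachandran-like map of the kind the abstract describes.
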
6.
Brief Bioinform ; 24(4), 2023 07 20.
Article in English | MEDLINE | ID: mdl-37401369

ABSTRACT

As the volume of protein sequence and structure data grows rapidly, the functions of the overwhelming majority of proteins cannot be experimentally determined. Automated annotation of protein function at a large scale is becoming increasingly important. Existing computational prediction methods are typically based on expanding the relatively small number of experimentally determined functions to large collections of proteins with various clues, including sequence homology, protein-protein interaction, gene co-expression, etc. Although there has been some progress in protein function prediction in recent years, the development of accurate and reliable solutions still has a long way to go. Here we exploit AlphaFold predicted three-dimensional structural information, together with other non-structural clues, to develop a large-scale approach termed PredGO to annotate Gene Ontology (GO) functions for proteins. We use a pre-trained language model, geometric vector perceptrons and attention mechanisms to extract heterogeneous features of proteins and fuse these features for function prediction. The computational results demonstrate that the proposed method outperforms other state-of-the-art approaches for predicting GO functions of proteins in terms of both coverage and accuracy. Coverage improves both because AlphaFold greatly increases the number of available predicted structures and because PredGO can make extensive use of non-structural information for functional prediction. Moreover, we show that over 205 000 (~100%) entries in UniProt for human are annotated by PredGO, over 186 000 (~90%) of which are based on predicted structure. The webserver and database are available at http://predgo.denglab.org/.


Subject(s)
Computational Biology; Proteins; Humans; Computational Biology/methods; Proteins/chemistry; Amino Acid Sequence; Neural Networks, Computer; Databases, Factual; Databases, Protein
7.
Proteomics ; 24(12-13): e2300105, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38458994

ABSTRACT

Peptides have a plethora of activities in biological systems that can potentially be exploited biotechnologically. Several peptides are used clinically, as well as in industry and agriculture. The increase in available 'omics data has recently provided a large opportunity for mining novel enzymes, biosynthetic gene clusters, and molecules. While these data primarily consist of DNA sequences, other types of data provide important complementary information. Due to their size, the approaches proven successful at discovering novel proteins of canonical size cannot be naïvely applied to the discovery of peptides. Peptides can be encoded directly in the genome as short open reading frames (smORFs), or they can be derived from larger proteins by proteolysis. Both of these peptide classes pose challenges as simple methods for their prediction result in large numbers of false positives. Similarly, functional annotation of larger proteins, traditionally based on sequence similarity to infer orthology and then transferring functions between characterized proteins and uncharacterized ones, cannot be applied for short sequences. The use of these techniques is much more limited and alternative approaches based on machine learning are used instead. Here, we review the limitations of traditional methods as well as the alternative methods that have recently been developed for discovering novel bioactive peptides with a focus on prokaryotic genomes and metagenomes.


Subject(s)
Computational Biology; Peptides; Peptides/chemistry; Peptides/metabolism; Peptides/genetics; Computational Biology/methods; Proteomics/methods; Humans; Open Reading Frames/genetics; Machine Learning
8.
Proteomics ; e2300280, 2024 May 14.
Article in English | MEDLINE | ID: mdl-38742951

ABSTRACT

Mass spectrometry proteomics data are typically evaluated against publicly available annotated sequences, but the proteogenomics approach is a useful alternative. A single genome is commonly utilized in custom proteomic and proteogenomic data analysis. We pose the question of whether utilizing numerous different genome assemblies in a search database would be beneficial. We reanalyzed raw data from the exoprotein fraction of four reference Enterobacterial Repetitive Intergenic Consensus (ERIC) I-IV genotypes of the honey bee bacterial pathogen Paenibacillus larvae and evaluated them against three reference databases (from NCBI-protein, RefSeq, and UniProt) together with an array of protein sequences generated by six-frame direct translation of 15 genome assemblies from GenBank. The wide search yielded 453 protein hits/groups, which UpSet analysis categorized into 50 groups based on the success of protein identification by the 18 database components. Nine hits that were not identified by a unique peptide were not considered for marker selection, which discarded the only protein that was not identified by the reference databases. We propose that the variability in successful identifications between genome assemblies is useful for marker mining. The results suggest that various strains of P. larvae can exhibit specific traits that set them apart from the established genotypes ERIC I-V.
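
The six-frame direct translation mentioned in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes the standard genetic code (the bacterial table assigns the same amino acids), reports unknown or ambiguous codons as "X", and does not split the output at stop codons. All names are illustrative.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]

def translate(dna):
    """Translate complete codons only; '*' marks stops, 'X' unknown codons."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def six_frame(dna):
    """Translate a DNA sequence in all three forward and three reverse frames."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames
```

For a search database, each frame would typically be cut at stop codons and short fragments discarded. That step is omitted here because the abstract does not state the thresholds used.
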

9.
BMC Bioinformatics ; 25(1): 23, 2024 Jan 12.
Article in English | MEDLINE | ID: mdl-38216898

ABSTRACT

BACKGROUND: With the exponential growth of high-throughput technologies, multiple pathway analysis methods have been proposed to estimate pathway activities from gene expression profiles. These pathway activity inference methods can be divided into two main categories: non-Topology-Based (non-TB) and Pathway Topology-Based (PTB) methods. Although some review and survey articles discussed the topic from different aspects, there is a lack of systematic assessment and comparisons on the robustness of these approaches. RESULTS: Thus, this study presents comprehensive robustness evaluations of seven widely used pathway activity inference methods using six cancer datasets based on two assessments. The first assessment seeks to investigate the robustness of pathway activity in pathway activity inference methods, while the second assessment aims to assess the robustness of risk-active pathways and genes predicted by these methods. The mean reproducibility power and total number of identified informative pathways and genes were evaluated. Based on the first assessment, the mean reproducibility power of pathway activity inference methods generally decreased as the number of pathway selections increased. Entropy-based Directed Random Walk (e-DRW) distinctly outperformed other methods in exhibiting the greatest reproducibility power across all cancer datasets. On the other hand, the second assessment shows that no methods provide satisfactory results across datasets. CONCLUSION: However, PTB methods generally appear to perform better in producing greater reproducibility power and identifying potential cancer markers compared to non-TB methods.


Subject(s)
Neoplasms; Humans; Reproducibility of Results; Neoplasms/genetics; Entropy; Gene Expression
10.
J Proteome Res ; 23(2): 523-531, 2024 02 02.
Article in English | MEDLINE | ID: mdl-38096378

ABSTRACT

The trends of the last 20 years in biotechnology were revealed using artificial intelligence and natural language processing (NLP) of publicly available data. Implementing this "science-of-science" approach, we capture convergent trends in the field of proteomics in both technology development and application across the phylogenetic tree of life. With major gaps in our knowledge about protein composition, structure, and location over time, we report trends in persistent, popular approaches and emerging technologies across 94 ideas from a corpus of 29 journals in PubMed over two decades. New metrics for clusters of these ideas reveal the progression and popularity of emerging approaches like single-cell, spatial, compositional, and chemical proteomics designed to better capture protein-level chemistry and biology. This analysis of the proteomics literature with advanced analytic tools quantifies the Rate of Rise for a next generation of technologies to better define, quantify, and visualize the multiple dimensions of the proteome that will transform our ability to measure and understand proteins in the coming decade.


Subject(s)
Artificial Intelligence; Proteomics; Proteomics/methods; Phylogeny; Proteome/metabolism; Technology
11.
Curr Issues Mol Biol ; 46(5): 4133-4146, 2024 Apr 30.
Article in English | MEDLINE | ID: mdl-38785522

ABSTRACT

Today, colorectal cancer (CRC) diagnosis is performed using colonoscopy, which is the current, most effective screening method. However, colonoscopy poses risks of harm to the patient and is an invasive process. Recent research has proven metabolomics as a potential, non-invasive detection method, which can use identified biomarkers to detect potential cancer in a patient's body. The aim of this study is to develop a machine-learning (ML) model based on chemical descriptors that will recognize CRC-associated metabolites. We selected a set of metabolites found as the biomarkers of CRC, confirmed that they participate in cancer-related pathways, and used them for training a machine-learning model for the diagnostics of CRC. Using a set of selective metabolites and random compounds, we developed a range of ML models. The best performing ML model trained on Stage 0-2 CRC metabolite data predicted a metabolite class with 89.55% accuracy. The best performing ML model trained on Stage 3-4 CRC metabolite data predicted a metabolite class with 95.21% accuracy. Lastly, the best-performing ML model trained on Stage 0-4 CRC metabolite data predicted a metabolite class with 93.04% accuracy. These models were then tested on independent datasets, including random and unrelated-disease metabolites. In addition, six pathways related to these CRC metabolites were also distinguished: aminoacyl-tRNA biosynthesis; glyoxylate and dicarboxylate metabolism; glycine, serine, and threonine metabolism; phenylalanine, tyrosine, and tryptophan biosynthesis; arginine biosynthesis; and alanine, aspartate, and glutamate metabolism. Thus, in this research study, we created machine-learning models based on metabolite-related descriptors that may be helpful in developing a non-invasive diagnosis method for CRC.

12.
Oncologist ; 29(5): 415-421, 2024 May 03.
Article in English | MEDLINE | ID: mdl-38330451

ABSTRACT

PURPOSE: Immune checkpoint inhibitors (ICIs) have significantly improved the survival of patients with cancer and provided long-term durable benefit. However, ICI-treated patients develop a range of toxicities known as immune-related adverse events (irAEs), which could compromise clinical benefits from these treatments. As the incidence and spectrum of irAEs differs across cancer types and ICI agents, it is imperative to characterize the incidence and spectrum of irAEs in a pan-cancer cohort to aid clinical management. DESIGN: We queried >400 000 trials registered at ClinicalTrials.gov and retrieved a comprehensive pan-cancer database of 71 087 ICI-treated participants from 19 cancer types and 7 ICI agents. We performed data harmonization and cleaning of these trial results into 293 harmonized adverse event categories using Medical Dictionary for Regulatory Activities. RESULTS: We developed irAExplorer (https://irae.tanlab.org), an interactive database that focuses on adverse events in patients administered with ICIs from big data mining. irAExplorer encompasses 71 087 distinct clinical trial participants from 343 clinical trials across 19 cancer types with well-annotated ICI treatment regimens and harmonized adverse event categories. We demonstrated a few of the irAE analyses through irAExplorer and highlighted some associations between treatment- or cancer-specific irAEs. CONCLUSION: The irAExplorer is a user-friendly resource that offers exploration, validation, and discovery of treatment- or cancer-specific irAEs across pan-cancer cohorts. We envision that irAExplorer can serve as a valuable resource to cross-validate users' internal datasets to increase the robustness of their findings.


Subject(s)
Clinical Trials as Topic; Data Mining; Drug-Related Side Effects and Adverse Reactions; Immune Checkpoint Inhibitors; Neoplasms; Humans; Immune Checkpoint Inhibitors/adverse effects; Immune Checkpoint Inhibitors/therapeutic use; Neoplasms/drug therapy; Drug-Related Side Effects and Adverse Reactions/epidemiology; Big Data; Databases, Factual/statistics & numerical data
13.
Biochem Biophys Res Commun ; 695: 149420, 2024 Feb 05.
Article in English | MEDLINE | ID: mdl-38154263

ABSTRACT

Addressing drug resistance poses a significant challenge in cancer treatment, as cancer cells develop diverse mechanisms to evade chemotherapy drugs, leading to treatment failure and disease relapse. Three-dimensional (3D) cell culture has emerged as a valuable model for studying drug resistance, although the underlying mechanisms remain elusive. By obtaining a better understanding of drug resistance within the 3D culture environment, we can develop more effective strategies to overcome it and improve the success of cancer treatments. The physical structure undergoes notable changes in 3D culture, with mechanical effects believed to play a pivotal role in drug resistance. Hence, our study aimed to explore the influence of mechanical effects on drug resistance by analyzing data related to "drug resistance" and "mechanobiology". Through this analysis, we identified β-catenin and JNK1 as potential factors, which were further examined in MCF-7 cells cultivated under both 2D and 3D culture conditions. Our findings demonstrate that β-catenin is activated through canonical and non-canonical pathways and is associated with drug resistance, particularly in organoids obtained under 3D culture.


Subject(s)
Wnt Signaling Pathway; beta Catenin; Humans; MCF-7 Cells; beta Catenin/metabolism; Drug Resistance, Neoplasm; Organoids/metabolism
14.
Brief Bioinform ; 23(4), 2022 07 18.
Article in English | MEDLINE | ID: mdl-35788820

ABSTRACT

Complex biomedical data generated during clinical, omics and mechanism-based experiments have increasingly been exploited through cloud- and visualization-based data mining techniques. However, the scientific community still lacks an easy-to-use web service for the comprehensive visualization of biomedical data, particularly high-quality and publication-ready graphics that allow easy scaling and updatability according to user demands. Therefore, we propose a community-driven modern web service, Hiplot (https://hiplot.org), with concise and top-quality data visualization applications for the life sciences and biomedical fields. This web service permits users to conveniently and interactively complete a few specialized visualization tasks that previously could only be conducted by senior bioinformatics or biostatistics researchers. It covers most of the daily demands of biomedical researchers with its equipped 240+ biomedical data visualization functions, involving basic statistics, multi-omics, regression, clustering, dimensional reduction, meta-analysis, survival analysis, risk modelling, etc. Moreover, to improve the efficiency in use and development of plugins, we introduced some core advantages on the client-/server-side of the website, such as spreadsheet-based data importing, cross-platform command-line controller (Hctl), multi-user plumber workers, JavaScript Object Notation-based plugin system, easy data/parameters, results and errors reproduction and real-time updates mode. Meanwhile, using demo/real data sets and benchmark tests, we explored statistical parameters, cancer genomic landscapes, disease risk factors and the performance of website based on selected native plugins. The statistics of visits and user numbers could further reflect the potential impact of this web service on relevant fields. Thus, researchers devoted to life and data sciences would benefit from this emerging and free web service.


Subject(s)
Software; User-Computer Interface; Computational Biology/methods; Data Visualization; Genomics; Humans
15.
Brief Bioinform ; 23(1), 2022 01 17.
Article in English | MEDLINE | ID: mdl-34791019

ABSTRACT

The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is responsible for millions of deaths around the world. To contribute to the understanding of SARS-CoV-2 and human protein interactions and to generate new hypotheses about them, we make use of the information-abundant Biomine probabilistic database and extend the experimentally identified SARS-CoV-2-human protein-protein interaction (PPI) network in silico. We generate an extended network by integrating information from the Biomine database, the PPI network and other experimentally validated results. To generate novel hypotheses, we focus on the high-connectivity sub-communities that overlap most with the integrated experimentally validated results in the extended network. Therefore, we propose a new data analysis pipeline that can efficiently compute core decomposition on the extended network and identify dense subgraphs. We then evaluate the identified dense subgraph and the generated hypotheses in three contexts: literature validation for uncovered virus targeting genes and proteins, gene function enrichment analysis on subgraphs and literature support on drug repurposing for identified tissues and diseases related to COVID-19. The major types of the generated hypotheses are proteins with their encoding genes, and we rank them by sorting their connections to the integrated experimentally validated nodes. In addition, we compile a comprehensive list of novel genes and proteins potentially related to COVID-19, as well as novel diseases which might be comorbidities. Together with the generated hypotheses, our results provide novel knowledge relevant to COVID-19 for further validation.


Subject(s)
COVID-19; Computer Simulation; Models, Biological; Protein Interaction Maps; COVID-19/genetics; COVID-19/metabolism; Humans; SARS-CoV-2/chemistry; SARS-CoV-2/genetics; SARS-CoV-2/metabolism
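
The core decomposition named in the abstract above assigns each node the largest k for which it belongs to a subgraph where every node has at least k neighbours. The sketch below uses the textbook peeling procedure on an undirected, unweighted adjacency dictionary. It is a simple quadratic-time illustration under those assumptions, not the efficient pipeline the authors describe, and it ignores the edge probabilities of the Biomine database.

```python
def core_numbers(adj):
    """Return {node: core number} for an undirected graph.

    adj maps each node to the set of its neighbours. Nodes are peeled in
    order of their current degree; a node's core number is the largest
    minimum degree seen up to the moment it is removed.
    """
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        v = min(remaining, key=degree.__getitem__)  # lowest current degree
        k = max(k, degree[v])
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                degree[u] -= 1
    return core

def k_core(adj, k):
    """Nodes of the k-core: the candidate dense subgraph at level k."""
    return {v for v, c in core_numbers(adj).items() if c >= k}
```

For example, in a triangle with one extra node attached to a single corner, the three triangle nodes get core number 2 and the extra node gets core number 1, so `k_core(adj, 2)` returns the triangle.
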
16.
J Transl Med ; 22(1): 159, 2024 02 16.
Article in English | MEDLINE | ID: mdl-38365731

ABSTRACT

BACKGROUND: Proximal tubular cells (PTCs) play a critical role in the progression of diabetic kidney disease (DKD). As an important progenitor marker, CD133 was reported to indicate the regeneration of dedifferentiated PTCs in acute kidney disease. However, its role in chronic DKD is unclear. Therefore, we aimed to investigate the expression patterns of CD133 and elucidate its functional significance in DKD. METHODS: Data mining was employed to illustrate the expression and molecular function of CD133 in PTCs in human DKD. Subsequently, rat models representing various stages of DKD progression were established. The expression of CD133 was confirmed in DKD rats, as well as in human PTCs (HK-2 cells) and rat PTCs (NRK-52E cells) exposed to high glucose. Immunofluorescence and flow cytometry were used to determine the expression patterns of CD133, together with proliferative and injury indicators. After overexpression or knockdown of CD133 in HK-2 cells, cell proliferation and apoptosis were detected by EdU assay, real-time cell analysis and flow analysis. Additionally, epithelial, progenitor cell, and apoptotic indices were evaluated by western blot and quantitative RT-PCR analyses. RESULTS: The expression of CD133 was notably elevated in both human and rat PTCs in DKD, and this expression increased as DKD progressed. CD133 was found to be co-expressed with CD24, KIM-1, SOX9, and PCNA, suggesting that CD133+ cells were damaged and associated with proliferation. In terms of functionality, the knockdown of CD133 resulted in a significant reduction in proliferation and an increase in apoptosis in HK-2 cells compared to the high glucose stimulus group. Conversely, the overexpression of CD133 significantly mitigated high glucose-induced cell apoptosis, but had no impact on cellular proliferation. Furthermore, the Nephroseq database provided additional evidence to support the correlation between CD133 expression and the progression of DKD. Analysis of single-cell RNA-sequencing data revealed that CD133+ PTCs potentially play a role in the advancement of DKD through multiple mechanisms, including heat damage, cell microtubule stabilization, cell growth inhibition and tumor necrosis factor-mediated signaling pathway. CONCLUSION: Our study demonstrates that the upregulation of CD133 is linked to cellular proliferation and protects PTCs from apoptosis in DKD and in high glucose-induced PTC injury. We propose that heightened CD133 expression may facilitate cellular self-protective responses during the initial stages of high glucose exposure. However, its sustained increase is associated with the pathological progression of DKD. In conclusion, CD133 exhibits dual roles in the advancement of DKD, necessitating further investigation.


Subject(s)
AC133 Antigen; Diabetes Mellitus; Diabetic Nephropathies; Animals; Humans; Rats; Cell Line; Cell Proliferation; Diabetes Mellitus/pathology; Diabetic Nephropathies/metabolism; Epithelial Cells/pathology; Glucose/metabolism; Hyperplasia/pathology; AC133 Antigen/genetics; AC133 Antigen/metabolism
17.
J Transl Med ; 22(1): 185, 2024 02 20.
Article in English | MEDLINE | ID: mdl-38378565

ABSTRACT

Clinical data mining of predictive models offers significant advantages for re-evaluating and leveraging large amounts of complex clinical real-world data and experimental comparison data for tasks such as risk stratification, diagnosis, classification, and survival prediction. However, its translational application is still limited. One challenge is that the proposed clinical requirements and data mining are not synchronized. Additionally, the exotic predictions of data mining are difficult to apply directly in local medical institutions. Hence, it is necessary to incisively review the translational application of clinical data mining, providing an analytical workflow for developing and validating prediction models to ensure the scientific validity of analytic workflows in response to clinical questions. This review systematically revisits the purpose, process, and principles of clinical data mining and discusses the key causes contributing to the detachment from practice and the misuse of model verification in developing predictive models for research. Based on this, we propose a niche-targeting framework of four principles: Clinical Contextual, Subgroup-Oriented, Confounder- and False Positive-Controlled (CSCF), to provide guidance for clinical data mining prior to the model's development in clinical settings. Eventually, it is hoped that this review can help guide future research and develop personalized predictive models to achieve the goal of discovering subgroups with varied remedial benefits or risks and ensuring that precision medicine can deliver its full potential.


Subject(s)
Data Mining; Precision Medicine
18.
BMC Cancer ; 24(1): 52, 2024 Jan 10.
Article in English | MEDLINE | ID: mdl-38200421

ABSTRACT

BACKGROUND: As biomarkers, microRNAs (miRNAs) are closely associated with the occurrence, progression, and prognosis of non-small cell lung cancer (NSCLC). However, the prognostic predictive value of miRNAs in NSCLC has rarely been explored. In this study, we mined the prognostic value of clinical data and plasma miRNA biomarkers in NSCLC using data mining models. METHODS: A total of 69 patients were included in this prospective cohort study. After informed consent, they filled out questionnaires and had their peripheral blood collected. The expression of plasma miRNAs was examined by quantitative polymerase chain reaction (qPCR). The Mann-Whitney U test was used to analyze non-normally distributed data. The Kaplan-Meier method was used to plot survival curves, the log-rank test was used to compare overall survival curves, and the Cox proportional hazards model was used to screen the factors related to the prognosis of lung cancer. Data mining techniques were utilized to predict the prognostic status of patients. RESULTS: We identified that smoking (HR = 2.406, 95% CI = 1.256-4.611), clinical stage III + IV (HR = 5.389, 95% CI = 2.290-12.684), the high expression group of miR-20a (HR = 4.420, 95% CI = 1.760-11.100), the high expression group of miR-197 (HR = 3.828, 95% CI = 1.778-8.245), the low expression group of miR-145 (HR = 0.286, 95% CI = 0.116-0.709), and the low expression group of miR-30a (HR = 0.307, 95% CI = 0.133-0.706) were associated with worse prognosis. Among the five data mining models, the decision tree (DT) C5.0 model performed best, with accuracy and area under the curve (AUC) of 93.75% and 0.929 (0.685, 0.997), respectively. CONCLUSION: The results showed that high expression levels of miR-20a and miR-197 and low expression levels of miR-145 and miR-30a were strongly associated with poorer prognosis in NSCLC patients, and the DT C5.0 model may serve as a novel, accurate method for predicting the prognosis of NSCLC.


Subject(s)
Non-Small-Cell Lung Carcinoma , Lung Neoplasms , MicroRNAs , Humans , MicroRNAs/genetics , Non-Small-Cell Lung Carcinoma/genetics , Prognosis , Prospective Studies , Lung Neoplasms/genetics , Data Mining , Biomarkers
19.
Anal Biochem ; 688: 115474, 2024 May.
Article in English | MEDLINE | ID: mdl-38286352

ABSTRACT

This study aimed to investigate the role of CFHR1 in bile duct carcinoma (BDC; cholangiocarcinoma) and its mechanism of action, with the goal of contributing to a better understanding of BDC genesis and progression and to the development of new therapeutic strategies. A prognostic receiver operating characteristic (ROC) curve for CFHR1 was generated using survival ROC. The ROC curve showed that CFHR1 expression correlated with clinicopathological parameters and was associated with poor prognosis. STRING was used to predict the protein-protein interaction network of the identified genes, and the Microenvironment Cell Populations-counter algorithm was used to analyze immune cell infiltration in BDC. The combined analysis showed that CFHR1 was upregulated in BDC tissues, along with 20 related differentially expressed genes (DEGs; 8 downregulated and 12 upregulated). CFHR1 expression also correlated with tumor immune cell infiltration and with immune cell markers in BDC (P < 0.05). In addition, we experimentally verified the biological function of CFHR1. These findings suggest that CFHR1 may be a prognostic marker and a potential therapeutic target for BDC. Detailed information on the roles of CFHR1 in BDC could be valuable for improving the diagnosis and treatment of this rare cancer.


Subject(s)
Bile Duct Neoplasms , Cholangiocarcinoma , Humans , Bile Duct Neoplasms/genetics , Bile Duct Neoplasms/diagnosis , Bile Duct Neoplasms/pathology , Cholangiocarcinoma/genetics , Biomarkers , Prognosis , Intrahepatic Bile Ducts/pathology , Tumor Microenvironment , Complement C3b Inactivator Proteins
20.
Int J Legal Med ; 138(3): 961-970, 2024 May.
Article in English | MEDLINE | ID: mdl-38240839

ABSTRACT

This study aimed to explore and develop data mining models for adult age estimation based on CT reconstruction images of the sternum. Maximum intensity projection (MIP) images from chest CT were retrospectively collected from a modern Chinese population, yielding data from 2700 patients (1349 males and 1351 females) aged 20 to 70 years. A staging technique based on four indicators was applied. Several data mining models were established, with mean absolute error (MAE) as the primary comparison parameter. Intraobserver and interobserver agreement were good. In internal validation, the optimal data mining model achieved the lowest MAE: 9.08 years in males and 10.41 years in females. In external validation (N = 200), the MAEs were 7.09 years in males and 7.15 years in females. In conclusion, the accuracy of our model for adult age estimation was comparable to that of similar studies. MIP images of the sternum could be a potential age indicator. However, they should be combined with other indicators, since the accuracy is still unsatisfactory.
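
For readers unfamiliar with the comparison parameter, MAE is the average of the absolute differences between true and estimated ages. The sketch below illustrates the calculation only; the ages are hypothetical values invented for the example and are not data from the study, and the function name is illustrative.

```python
def mean_absolute_error(true_ages, estimated_ages):
    """Average absolute difference between true and estimated ages (same units as the input)."""
    if not true_ages or len(true_ages) != len(estimated_ages):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(t - e) for t, e in zip(true_ages, estimated_ages))
    return total / len(true_ages)


# Hypothetical ages in years, NOT taken from the study.
true_ages = [25, 40, 55, 68]
estimated_ages = [30, 35, 60, 58]
mae = mean_absolute_error(true_ages, estimated_ages)
print(mae)
```

Because MAE is expressed in the same units as the estimate, a lower value means the predicted ages sit closer to the true ages on average.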


Subject(s)
Sternum , X-Ray Computed Tomography , Adult , Male , Female , Humans , Retrospective Studies , X-Ray Computed Tomography/methods , Sternum/diagnostic imaging , Data Mining , China