Results 1 - 11 of 11
1.
NPJ Digit Med ; 7(1): 106, 2024 May 01.
Article in English | MEDLINE | ID: mdl-38693429

ABSTRACT

Existing natural language processing (NLP) methods to convert free-text clinical notes into structured data often require problem-specific annotations and model training. This study aims to evaluate ChatGPT's capacity to extract information from free-text medical notes efficiently and comprehensively. We developed a large language model (LLM)-based workflow, utilizing systems engineering methodology and a spiral "prompt engineering" process, leveraging OpenAI's API for batch querying ChatGPT. We evaluated the effectiveness of this method using a dataset of more than 1000 lung cancer pathology reports and a dataset of 191 pediatric osteosarcoma pathology reports, comparing the ChatGPT-3.5 (gpt-3.5-turbo-16k) outputs with expert-curated structured data. ChatGPT-3.5 demonstrated the ability to extract pathological classifications with an overall accuracy of 89% in the lung cancer dataset, outperforming two traditional NLP methods. The performance is influenced by the design of the instructive prompt. Our case analysis shows that most misclassifications were due to the lack of highly specialized pathology terminology and erroneous interpretation of TNM staging rules. Reproducibility analysis shows the relatively stable performance of ChatGPT-3.5 over time. In the pediatric osteosarcoma dataset, ChatGPT-3.5 accurately classified both grades and margin status, with accuracies of 98.6% and 100%, respectively. Our study shows the feasibility of using ChatGPT to process large volumes of clinical notes for structured information extraction without requiring extensive task-specific human annotation and model training. The results underscore the potential role of LLMs in transforming unstructured healthcare data into structured formats, thereby supporting research and aiding clinical decision-making.
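
The evaluation step in this abstract, comparing extracted labels with expert-curated ones, can be sketched as below. This is only an illustration: the function name `accuracy`, the normalization rule, and the example labels are assumptions and do not come from the study.

```python
def accuracy(predicted, curated):
    """Fraction of reports where the extracted label matches the curated one."""
    if len(predicted) != len(curated):
        raise ValueError("label lists must have the same length")
    if not curated:
        raise ValueError("no labels to compare")
    # Normalize case and whitespace so "t3" and "T3 " count as a match
    # (an assumed convention, not one stated in the abstract).
    matches = sum(p.strip().lower() == c.strip().lower()
                  for p, c in zip(predicted, curated))
    return matches / len(curated)


# Hypothetical stage labels for four reports.
llm_stage = ["T1a", "T2b", "t3", "T1b"]
expert_stage = ["T1a", "T2a", "T3", "T1b"]
print(accuracy(llm_stage, expert_stage))  # 0.75
```

In practice each extracted field (for example stage, grade, or margin status) would get its own accuracy figure, since the abstract reports them separately.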

2.
Cancers (Basel) ; 14(17), 2022 Aug 30.
Article in English | MEDLINE | ID: mdl-36077736

ABSTRACT

Nearly all tumors have multiple mutations in cancer-causing genes. Which of these mutations act in tandem with other mutations to drive malignancy and also provide therapeutic vulnerability? To address this fundamental question, we conducted a pan-cancer screen of co-mutation enrichment (looking for two genes mutated together in the same tumor at a statistically significant rate) using the AACR-GENIE 11.0 data (AACR, Philadelphia, PA, USA). We developed a web tool for users to review results and perform ad hoc analyses. From our screen, we identified a number of such co-mutations and their associated lineages. Here, we focus on the RB1/TP53 co-mutation, which we discovered was the most frequently observed co-mutation across diverse cancer types, with particular enrichment in small cell carcinomas, neuroendocrine carcinomas, and sarcomas. Furthermore, in many cancers with a substantial fraction of co-mutant tumors, the presence of concurrent RB1/TP53 mutations is associated with poor clinical outcomes. From pan-cancer cell line multi-omics and functional screening datasets, we identified many targetable co-mutant-specific molecular alterations. Overall, our analyses revealed the prevalence, cancer type-specificity, clinical significance, and therapeutic vulnerabilities of the RB1/TP53 co-mutation in the pan-cancer landscape and provide a roadmap forward for future clinical translational research.
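
A co-mutation enrichment test of the kind described here can be sketched with a one-sided Fisher exact test on a 2x2 table of tumors. The abstract does not say which test the authors used, so the choice of Fisher's exact test, the function name `comutation_pvalue`, and the counts below are all assumptions for illustration.

```python
from math import comb


def comutation_pvalue(both, only_a, only_b, neither):
    """One-sided Fisher exact p-value for co-occurrence of mutations in genes A and B."""
    n = both + only_a + only_b + neither
    a_total = both + only_a  # tumors with gene A mutated
    b_total = both + only_b  # tumors with gene B mutated
    max_both = min(a_total, b_total)
    # Sum hypergeometric probabilities of observing at least `both` co-mutant tumors.
    numerator = sum(comb(a_total, k) * comb(n - a_total, b_total - k)
                    for k in range(both, max_both + 1))
    return numerator / comb(n, b_total)


# Hypothetical counts: 3 co-mutant tumors, 3 tumors with neither mutation.
print(comutation_pvalue(3, 0, 0, 3))  # 0.05
```

A pan-cancer screen would repeat this for every gene pair and then correct for multiple testing; that step is omitted here.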

3.
Stat Med ; 41(23): 4647-4665, 2022 10 15.
Article in English | MEDLINE | ID: mdl-35871762

ABSTRACT

A recent technology breakthrough in spatial molecular profiling (SMP) has enabled comprehensive molecular characterization of single cells while preserving spatial information. It provides new opportunities to delineate how cells from different origins form tissues with distinctive structures and functions. One immediate question in SMP data analysis is to identify genes whose expressions exhibit spatially correlated patterns, called spatially variable (SV) genes. Most current methods to identify SV genes are built upon the geostatistical model with a Gaussian process to capture the spatial patterns. However, the Gaussian process models rely on ad hoc kernels that could limit the models' ability to identify complex spatial patterns. In order to overcome this challenge and capture more types of spatial patterns, we introduce a Bayesian approach to identify SV genes via a modified Ising model. The key idea is to use the energy interaction parameter of the Ising model to characterize spatial expression patterns. We use auxiliary variable Markov chain Monte Carlo algorithms to sample from the posterior distribution with an intractable normalizing constant in the model. Simulation studies using both simulated and synthetic data showed that the energy-based modeling approach led to higher accuracy in detecting SV genes than kernel-based methods. When applied to two real spatial transcriptomics (ST) datasets, the proposed method discovered novel spatial patterns that shed light on the biological mechanisms. In summary, the proposed method presents a new perspective for analyzing ST data.


Subjects
Algorithms , Transcriptome , Bayes Theorem , Humans , Markov Chains , Monte Carlo Method , Transcriptome/genetics
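
For entry 3, the intuition behind an Ising interaction parameter can be sketched with the standard Ising neighbor statistic on a binarized expression grid. This is not the paper's modified model: the function name `neighbor_agreement`, the rectangular lattice, and the two toy grids are assumptions chosen only to show why clustered expression yields a high value.

```python
def neighbor_agreement(grid):
    """Sum over adjacent lattice pairs of +1 (same state) or -1 (different state)."""
    rows, cols = len(grid), len(grid[0])
    total = 0
    for r in range(rows):
        for c in range(cols):
            # Count each pair once by looking only right and down.
            for dr, dc in ((0, 1), (1, 0)):
                rr, cc = r + dr, c + dc
                if rr < rows and cc < cols:
                    total += 1 if grid[r][c] == grid[rr][cc] else -1
    return total


# Toy grids: 1 = high expression, 0 = low expression (binarization is assumed).
clustered = [[1, 1, 0, 0],
             [1, 1, 0, 0]]
checker = [[1, 0, 1, 0],
           [0, 1, 0, 1]]
print(neighbor_agreement(clustered))  # 6
print(neighbor_agreement(checker))    # -10
```

In a standard Ising model the probability of a configuration is proportional to the exponential of the interaction parameter times this statistic, which is where the intractable normalizing constant mentioned in the abstract arises.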
4.
Brief Bioinform ; 20(1): 178-189, 2019 01 18.
Article in English | MEDLINE | ID: mdl-28968705

ABSTRACT

Rank aggregation (RA), the process of combining multiple ranked lists into a single ranking, has played an important role in integrating information from individual genomic studies that address the same biological question. In previous research, attention has been focused on aggregating full lists. However, partial and/or top ranked lists are prevalent because of the great heterogeneity of genomic studies and limited resources for follow-up investigation. To be able to handle such lists, some ad hoc adjustments have been suggested in the past, but how RA methods perform on them (after the adjustments) has never been fully evaluated. In this article, a systematic framework is proposed to define different situations that may occur based on the nature of individually ranked lists. A comprehensive simulation study is conducted to examine the performance characteristics of a collection of existing RA methods that are suitable for genomic applications under various settings simulated to mimic practical situations. A non-small cell lung cancer data example is provided for further comparison. Based on our numerical results, general guidelines about which methods perform the best/worst, and under what conditions, are provided. Also, we discuss key factors that substantially affect the performance of the different methods.


Subjects
Computational Biology/methods , Genomics/statistics & numerical data , Bayes Theorem , Non-Small-Cell Lung Carcinoma/genetics , Computer Simulation , Statistical Data Interpretation , Genetic Databases/statistics & numerical data , Humans , Lung Neoplasms/genetics , Markov Chains , Statistical Models , Software
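
For entry 4, one simple rank aggregation scheme for top-ranked lists is a Borda-style mean rank, with an ad hoc adjustment that gives every unlisted item the rank just past the end of the list. The article evaluates a collection of methods that the abstract does not name, so this particular scheme, the function name `borda_aggregate`, and the gene lists are assumptions for illustration.

```python
def borda_aggregate(ranked_lists):
    """Aggregate top-k lists by mean rank; unlisted items get rank len(list) + 1."""
    items = sorted({item for lst in ranked_lists for item in lst})
    mean_rank = {}
    for item in items:
        ranks = [lst.index(item) + 1 if item in lst else len(lst) + 1
                 for lst in ranked_lists]
        mean_rank[item] = sum(ranks) / len(ranks)
    # Sort by mean rank, breaking ties alphabetically for determinism.
    return sorted(items, key=lambda g: (mean_rank[g], g))


# Hypothetical top-ranked gene lists from three studies of different lengths.
studies = [["EGFR", "KRAS", "TP53"],
           ["KRAS", "EGFR"],
           ["EGFR", "ALK", "KRAS"]]
print(borda_aggregate(studies))  # ['EGFR', 'KRAS', 'ALK', 'TP53']
```

The treatment of unlisted items is exactly the kind of adjustment whose effect the article sets out to evaluate; other choices would change the aggregate ranking.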
5.
Biostatistics ; 20(4): 565-581, 2019 10 01.
Article in English | MEDLINE | ID: mdl-29788035

ABSTRACT

Digital pathology imaging of tumor tissues, which captures histological details in high resolution, is fast becoming a routine clinical procedure. Recent developments in deep-learning methods have enabled the identification, characterization, and classification of individual cells from pathology image analysis at a large scale. This creates new opportunities to study the spatial patterns of and interactions among different types of cells. Reliable statistical approaches to modeling such spatial patterns and interactions can provide insight into tumor progression and shed light on the biological mechanisms of cancer. In this article, we consider the problem of modeling a pathology image with irregular locations of three different types of cells: lymphocyte, stromal, and tumor cells. We propose a novel Bayesian hierarchical model, which incorporates a hidden Potts model to project the irregularly distributed cells to a square lattice and a Markov random field prior model to identify regions in a heterogeneous pathology image. The model allows us to quantify the interactions between different types of cells, some of which are clinically meaningful. We use Markov chain Monte Carlo sampling techniques, combined with a double Metropolis-Hastings algorithm, in order to simulate samples approximately from a distribution with an intractable normalizing constant. The proposed model was applied to the pathology images of 205 lung cancer patients from the National Lung Screening Trial, and the results show that the interaction strength between tumor and stromal cells predicts patient prognosis (P = 0.005). This statistical methodology provides a new perspective for understanding the role of cell-cell interactions in cancer progression.


Subjects
Algorithms , Computer-Assisted Image Interpretation , Lung Neoplasms/diagnostic imaging , Lung Neoplasms/pathology , Statistical Models , Bayes Theorem , Humans , Markov Chains , Monte Carlo Method
6.
Methods Mol Biol ; 1552: 177-184, 2017.
Article in English | MEDLINE | ID: mdl-28224499

ABSTRACT

RNA-binding proteins play important roles in the various stages of RNA maturation through binding to their target RNAs. Cross-linking immunoprecipitation coupled with high-throughput sequencing (CLIP-Seq) has made it possible to identify the targeting sites of RNA-binding proteins in various cell culture systems and tissue types on a genome-wide scale. Several hidden Markov model (HMM)-based approaches have been suggested to identify protein-RNA binding sites from CLIP-Seq datasets. In this chapter, we describe how HMMs can be applied to analyze CLIP-Seq datasets, including the bioinformatics preprocessing steps to extract count information from the sequencing data before HMM analysis and the downstream analysis steps following peak-calling.


Subjects
Algorithms , Computational Biology/methods , Markov Chains , RNA-Binding Proteins/metabolism , RNA/metabolism , RNA Sequence Analysis/methods , Binding Sites , High-Throughput Nucleotide Sequencing , Humans , Immunoprecipitation
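
For entry 6, HMM-based peak calling on per-bin read counts can be sketched with a two-state model (background versus bound) and Viterbi decoding. The chapter's actual model is not given in the abstract, so the Poisson emissions, the rates, the `stay` probability, and the function names here are assumptions for illustration.

```python
import math


def poisson_logpmf(k, rate):
    """Log probability of observing count k under a Poisson distribution."""
    return k * math.log(rate) - rate - math.lgamma(k + 1)


def viterbi_peaks(counts, rates=(1.0, 8.0), stay=0.9):
    """Most likely state path (0 = background, 1 = bound) for per-bin read counts."""
    log_stay, log_switch = math.log(stay), math.log(1.0 - stay)
    # Start with equal prior probability for the two states.
    score = [math.log(0.5) + poisson_logpmf(counts[0], r) for r in rates]
    back = []
    for k in counts[1:]:
        new_score, pointers = [], []
        for s in (0, 1):
            cands = [score[p] + (log_stay if p == s else log_switch) for p in (0, 1)]
            best = 0 if cands[0] >= cands[1] else 1
            pointers.append(best)
            new_score.append(cands[best] + poisson_logpmf(k, rates[s]))
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical read counts in eight consecutive genomic bins.
print(viterbi_peaks([0, 1, 0, 9, 10, 7, 1, 0]))  # [0, 0, 0, 1, 1, 1, 0, 0]
```

Runs of state 1 in the decoded path correspond to called peaks; in a real analysis the emission and transition parameters would be estimated from the data rather than fixed.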
7.
Biometrics ; 70(2): 430-40, 2014 Jun.
Article in English | MEDLINE | ID: mdl-24571656

ABSTRACT

The photoactivatable ribonucleoside enhanced cross-linking immunoprecipitation (PAR-CLIP) has been increasingly used for the global mapping of RNA-protein interaction sites. There are two key features of the PAR-CLIP experiments: The sequence read tags are likely to form an enriched peak around each RNA-protein interaction site; and the cross-linking procedure is likely to introduce a specific mutation in each sequence read tag at the interaction site. Several ad hoc methods have been developed to identify the RNA-protein interaction sites using either sequence read counts or mutation counts alone; however, rigorous statistical methods for analyzing PAR-CLIP are still lacking. In this article, we propose an integrative model to establish a joint distribution of observed read and mutation counts. To pinpoint the interaction sites at single base-pair resolution, we developed a novel modeling approach that adopts non-homogeneous hidden Markov models to incorporate the nucleotide sequence at each genomic location. Both simulation studies and data application showed that our method outperforms the ad hoc methods, and provides reliable inferences for the RNA-protein binding sites from PAR-CLIP data.


Subjects
Statistical Models , Proteins/chemistry , RNA/chemistry , Bayes Theorem , Binding Sites , Biometry/methods , Computer Simulation , Cross-Linking Reagents , Fragile X Mental Retardation Protein/chemistry , Fragile X Mental Retardation Protein/metabolism , High-Throughput Nucleotide Sequencing , Humans , Immunoprecipitation , Markov Chains , Nucleic Acid Conformation , Secondary Protein Structure , Proteins/metabolism , RNA/genetics , RNA/metabolism , RNA-Binding Protein FUS/chemistry , RNA-Binding Protein FUS/metabolism , RNA Sequence Analysis
8.
Stat Med ; 32(13): 2292-307, 2013 Jun 15.
Article in English | MEDLINE | ID: mdl-23097332

ABSTRACT

Epigenetics is the study of changes to the genome that can switch genes on or off and determine which proteins are transcribed without altering the DNA sequence. Recently, epigenetic changes have been linked to the development and progression of diseases such as psychiatric disorders. High-throughput epigenetic experiments have enabled researchers to measure genome-wide epigenetic profiles and yield data consisting of intensity ratios of immunoprecipitation versus reference samples. The intensity ratios can provide a view of genomic regions where protein binding occurs under one experimental condition and further allow us to detect epigenetic alterations through comparison between two different conditions. However, such experiments can be expensive, with only a few replicates available. Moreover, epigenetic data are often spatially correlated with high noise levels. In this paper, we develop a Bayesian hierarchical model, combined with hidden Markov processes with four states for modeling spatial dependence, to detect genomic sites with epigenetic changes from two-sample experiments with paired internal control. One attractive feature of the proposed method is that the four states of the hidden Markov process have well-defined biological meanings and allow us to directly call the change patterns based on the corresponding posterior probabilities. In contrast, none of the existing methods can offer this advantage. In addition, the proposed method offers great power in statistical inference by spatial smoothing (via hidden Markov modeling) and information pooling (via hierarchical modeling). Both simulation studies and real data analysis in a cocaine addiction study illustrate the reliability and success of this method.


Subjects
Bayes Theorem , Genetic Epigenesis , Genome , Statistical Models , Animals , CREB-Binding Protein/genetics , Cocaine-Related Disorders/genetics , Computer Simulation , Humans , Markov Chains , Mice , Oligonucleotide Array Sequence Analysis
9.
J Bioinform Comput Biol ; 9(1): 131-48, 2011 Feb.
Article in English | MEDLINE | ID: mdl-21328710

ABSTRACT

DNA copy number (DCN) is the number of copies of DNA at a region of a genome. The alterations of DCN are highly associated with the development of different tumors. Recently, microarray technologies have been employed to detect DCN changes at many loci at the same time in tumor samples. The resulting DCN data are often very noisy, and the tumor sample is often contaminated by normal cells. The goal of computational analysis of array-based DCN data is to infer the underlying DCNs from raw DCN data. Previous methods for this task do not model the tumor/normal cell mixture ratio explicitly, and they cannot output segments with DCN annotations. We developed a novel model-based method using the minimum description length (MDL) principle for DCN data segmentation. Our new method can output the underlying DCN for each chromosomal segment and, at the same time, infer the underlying tumor proportion in the test samples. Empirical results show that our method achieves higher accuracy on average than three previous methods, namely Circular Binary Segmentation, Hidden Markov Model and Ultrasome.


Subjects
DNA Copy Number Variations , Algorithms , Computational Biology , Computer Simulation , Neoplasm DNA/genetics , Statistical Data Interpretation , Nucleic Acid Databases/statistics & numerical data , Humans , Markov Chains , Statistical Models , Neoplasms/genetics , Oligonucleotide Array Sequence Analysis/statistics & numerical data , Software
10.
Stat Med ; 29(4): 489-503, 2010 Feb 20.
Article in English | MEDLINE | ID: mdl-20049751

ABSTRACT

The genome-wide DNA-protein-binding data, DNA sequence data and gene expression data represent complementary means to deciphering global and local transcriptional regulatory circuits. Combining these different types of data can not only improve the statistical power, but also provide a more comprehensive picture of gene regulation. In this paper, we propose a novel statistical model to augment protein-DNA-binding data with gene expression and DNA sequence data when available. We specify a hierarchical Bayes model and use Markov chain Monte Carlo simulations to draw inferences. Both simulation studies and an analysis of an experimental data set show that the proposed joint modeling method can significantly improve the specificity and sensitivity of identifying target genes as compared with conventional approaches relying on a single data source.


Subjects
DNA/genetics , Escherichia coli/genetics , Bacterial Gene Expression Regulation , Leucine-Responsive Regulatory Protein/metabolism , Protein Binding , DNA Sequence Analysis/statistics & numerical data , Bayes Theorem , Computer Simulation , DNA/metabolism , Escherichia coli/metabolism , Gene Expression , Leucine-Responsive Regulatory Protein/genetics , Markov Chains , Statistical Models , Monte Carlo Method , Regulon
11.
NMR Biomed ; 22(7): 779-86, 2009 Aug.
Article in English | MEDLINE | ID: mdl-19388006

ABSTRACT

Cerebrovascular reactivity (CVR) reflects the capacity of blood vessels to dilate and is an important marker for brain vascular reserve. It may provide a useful addition to the traditional baseline blood flow measurement when assessing vascular factors in brain disorders. Blood-oxygenation-level-dependent MRI under CO2 inhalation offers a non-invasive and quantitative means to estimate CVR in humans. In this study, we investigated several important methodological aspects of this technique with the goal of optimizing the experimental and data processing strategies for clinical use. Comparing 4 min of 5% CO2 inhalation (less comfortable) to a 1 min inhalation (more comfortable), it was found that the CVR values were 0.31 ± 0.05%/mmHg (N = 11) and 0.31 ± 0.08%/mmHg (N = 9), respectively, showing no significant differences between the two breathing paradigms. Therefore, the 1 min paradigm is recommended for future application studies for patient comfort and tolerability. Furthermore, we have found that end-tidal CO2 recording was useful for accurate quantification of CVR because it provided both timing and amplitude information regarding the input function to the brain vascular system, which can be subject-dependent. Finally, we show that inter-subject variations in CVR are of physiologic origin and affect the whole brain in a similar fashion. Based on this, it is proposed that relative CVR (normalized against the CVR of the whole brain or a reference tissue) may be a more sensitive biomarker than absolute CVR in clinical applications as it minimizes inter-subject variations. With these technological optimizations, CVR mapping may become a useful method for studies of neurological and psychiatric diseases.


Subjects
Cerebrovascular Circulation/physiology , Hypercapnia/blood , Hypercapnia/physiopathology , Magnetic Resonance Imaging/methods , Oxygen/blood , Adult , Brain Mapping , Carbon Dioxide/administration & dosage , Female , Humans , Male , Middle Aged , Respiration , Time Factors , Young Adult
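
For entry 11, the units in the abstract (%/mmHg) suggest CVR computed as percent BOLD signal change per mmHg change in end-tidal CO2, and relative CVR as a regional value divided by a whole-brain or reference value. The sketch below follows that reading; the function names and all numeric inputs are assumptions, and the study's actual processing pipeline is not reproduced.

```python
def cvr_percent_per_mmhg(bold_baseline, bold_hypercapnia,
                         etco2_baseline, etco2_hypercapnia):
    """CVR as percent BOLD signal change per mmHg rise in end-tidal CO2."""
    percent_change = 100.0 * (bold_hypercapnia - bold_baseline) / bold_baseline
    return percent_change / (etco2_hypercapnia - etco2_baseline)


def relative_cvr(regional_cvr, reference_cvr):
    """Regional CVR normalized by whole-brain (or reference-tissue) CVR."""
    return regional_cvr / reference_cvr


# Hypothetical signals: a 3.1% BOLD increase for a 10 mmHg rise in end-tidal CO2.
whole_brain = cvr_percent_per_mmhg(1000.0, 1031.0, 40.0, 50.0)
region = cvr_percent_per_mmhg(1000.0, 1024.8, 40.0, 50.0)
print(round(whole_brain, 3))                        # 0.31
print(round(relative_cvr(region, whole_brain), 3))  # 0.8
```

Because the same end-tidal CO2 change divides both values, relative CVR cancels subject-wide scaling, which matches the abstract's argument that it minimizes inter-subject variation.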