ABSTRACT
The versatility of ChatGPT in performing a diverse range of tasks has elicited considerable interest in its potential applications within professional fields. Taking drug discovery as a testbed, this paper provides a comprehensive evaluation of ChatGPT's ability in molecule property prediction. The study focuses on three aspects: 1) Effects of different prompt settings, where we investigate the impact of varying prompts on the prediction outcomes of ChatGPT; 2) Comprehensive evaluation on molecule property prediction, where we evaluate 53 ADMET-related endpoints; 3) Analysis of ChatGPT's potential and limitations, where we make comparisons with models tailored for molecule property prediction, thus gaining a more accurate understanding of ChatGPT's capabilities and limitations in this area. Through this evaluation, we find that 1) with appropriate prompt settings, ChatGPT can attain satisfactory prediction outcomes that are competitive with specialized models designed for those tasks; 2) prompt settings significantly affect ChatGPT's performance: among all prompt settings, the strategy for selecting few-shot examples has the greatest impact on results, and scaffold sampling greatly outperforms random sampling; 3) the capacity of ChatGPT to accomplish high-precision predictions is significantly influenced by the quality of the examples provided, which may constrain its practical applicability in real-world scenarios. This work highlights ChatGPT's potential and limitations in molecule property prediction, which we hope can inspire future design and evaluation of Large Language Models within scientific domains.
Subjects
Drug Discovery, Research Design
ABSTRACT
Data selection has been shown to significantly improve the effective use of training data by extracting sentences from large general-domain corpora to adapt statistical machine translation (SMT) systems to in-domain data. This paper performs an in-depth analysis of three different sentence selection techniques. The first is cosine tf-idf, which comes from the realm of information retrieval (IR). The second is a perplexity-based approach, which comes from the field of language modeling. These two data selection techniques have already been applied to SMT in the literature. The third, edit distance, is proposed for this task for the first time in this paper. After investigating the individual models, a combination of all three techniques is proposed at both the corpus level and the model level. Comparative experiments are conducted on a Hong Kong law Chinese-English corpus, and the results indicate the following: (i) the constraint degree of similarity measuring is not monotonically related to domain-specific translation quality; (ii) the individual selection models fail to perform effectively and robustly; (iii) bilingual resources and combination methods help to balance out-of-vocabulary (OOV) words and irrelevant data; and (iv) our method consistently boosts overall translation performance, which can help ensure the quality of a real-life SMT system.
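As an illustration of two of the similarity criteria named in the abstract above, the following is a minimal sketch that ranks general-domain sentences by cosine tf-idf similarity or by word-level edit distance to an in-domain sample. It is not the authors' implementation: the function names, the whitespace tokenization, the smoothed idf formula, and the choice to score each candidate by its best match against the in-domain sentences are all assumptions made for this example. The perplexity-based criterion is omitted because it requires a trained language model.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse tf-idf vectors (dicts) for a list of tokenized sentences."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed idf (an assumption; other idf variants would work as well).
    idf = {tok: math.log((1 + n) / (1 + cnt)) + 1.0 for tok, cnt in df.items()}
    return [{tok: tf * idf[tok] for tok, tf in Counter(doc).items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def edit_similarity(a, b):
    """Edit distance normalized to a similarity in [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 - edit_distance(a, b) / longest if longest else 1.0


def select_sentences(in_domain, general, k, criterion="cosine"):
    """Return the k general-domain sentences most similar to the in-domain sample.

    Hypothetical helper: each candidate is scored by its best match against
    any in-domain sentence.
    """
    in_tok = [s.lower().split() for s in in_domain]
    gen_tok = [s.lower().split() for s in general]
    if criterion == "cosine":
        vecs = tfidf_vectors(in_tok + gen_tok)
        in_vecs, gen_vecs = vecs[:len(in_tok)], vecs[len(in_tok):]
        scores = [max(cosine(g, q) for q in in_vecs) for g in gen_vecs]
    else:
        scores = [max(edit_similarity(g, q) for q in in_tok) for g in gen_tok]
    ranked = sorted(zip(scores, general), key=lambda pair: pair[0], reverse=True)
    return [sentence for _, sentence in ranked[:k]]
```

Edit distance is the stricter of the two criteria, since it penalizes differences in word order as well as in vocabulary, whereas cosine tf-idf treats each sentence as a bag of words.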
Subjects
Artificial Intelligence, Theoretical Models
ABSTRACT
OBJECTIVE: This study aimed to evaluate the association between four single nucleotide polymorphisms (SNPs) of the interleukin-6 (IL-6) gene and gastric cancer (GC), and the impact of the interaction between IL-6 SNPs and Helicobacter pylori (H. pylori) infection on susceptibility to GC. METHODS: Logistic regression was used to test the relationships between the four SNPs of the IL-6 gene and GC susceptibility. A generalized multifactor dimensionality reduction (GMDR) model was employed to assess the interaction effect between the IL-6 gene and H. pylori infection on GC risk. RESULTS: Logistic regression analysis indicated that the rs1800795-C allele was associated with increased GC risk; the adjusted ORs (95% CI) were 1.80 (1.21-2.41) (CC vs. GG) and 1.68 (1.09-2.30) (C vs. G), respectively. The rs10499563-C allele was associated with decreased risk of GC; the adjusted ORs (95% CI) were 0.62 (0.31-0.93) (TC vs. TT), 0.52 (0.18-0.89) (CC vs. TT) and 0.60 (0.29-0.92) (C vs. T), respectively. GMDR analysis found that a two-dimensional model (including rs1800795 and H. pylori infection) was statistically significant. The selected model had a testing balanced accuracy of 59.85% and the best cross-validation consistency of 10/10 (P = 0.0107). Compared with H. pylori-negative subjects with the rs1800795-GG genotype, H. pylori-positive participants with the GC or CC genotype had the highest risk of GC; the OR (95% CI) was 3.34 (1.78-4.97). CONCLUSION: The rs1800795-C allele was associated with increased GC risk and the rs10499563-C allele was associated with decreased GC risk. The interaction between rs1800795 and H. pylori infection was also correlated with increased risk of GC.
Subjects
Helicobacter Infections, Helicobacter pylori, Stomach Neoplasms, Humans, Interleukin-6/genetics, Genetic Predisposition to Disease, Stomach Neoplasms/epidemiology, Stomach Neoplasms/genetics, Helicobacter Infections/complications, Helicobacter Infections/diagnosis, Helicobacter Infections/genetics, Genotype, Single Nucleotide Polymorphism/genetics, Case-Control Studies
ABSTRACT
BACKGROUND: Many diseases can mimic the symptoms of gastric cancer (GC). Therefore, misdiagnosis of GC is common. Our preliminary sequencing analysis revealed altered expression of circSLIT2 in GC. In this study, we further explored the role of circSLIT2 in GC. METHODS: The research subjects included GC patients, patients with irritable bowel syndrome (IBS), patients with gastric ulcer (GU), patients with gastric tuberculosis (GT), patients with Crohn's disease (CD), and healthy controls (HC). Accumulation of circSLIT2 RNA in both tissue and plasma samples was determined with RT-qPCR. The diagnostic and prognostic values of circSLIT2 for GC were explored by performing ROC and survival curve analysis. The χ2-test was applied for association analysis. RESULTS: Increased circSLIT2 RNA accumulation was observed in GC tissues compared to non-tumor tissues. Compared to the HC group, increased plasma circSLIT2 RNA accumulation was observed only in the GC group, and not in the IBS, GU, GT, and CD groups. Plasma circSLIT2 showed a positive correlation with circSLIT2 in GC tissues but not with circSLIT2 in non-tumor tissues. Using increased plasma circSLIT2 as a biomarker, GC patients were effectively separated from the other disease groups and the HC group. Survival curve analysis revealed that most patients who died during the 5-year follow-up had high levels of circSLIT2 accumulation in GC tissues and plasma. CircSLIT2 in plasma and GC tissue was closely associated only with distant tumor metastasis, and not with other clinical factors. CONCLUSION: Increased circSLIT2 accumulation may serve as a novel diagnostic and prognostic biomarker for GC.
Subjects
Irritable Bowel Syndrome, Stomach Neoplasms, Humans, Circular RNA/genetics, Stomach Neoplasms/diagnosis, Stomach Neoplasms/genetics, Prognosis, Tumor Biomarkers/genetics, RNA/genetics, RNA/metabolism
ABSTRACT
Aim: In this study, we aimed to evaluate the associations of vascular endothelial growth factor (VEGF) gene single nucleotide polymorphisms (SNPs), and their interaction with current smoking, with gastric cancer (GC) risk in the Chinese Han population. Methods: We used a logistic regression model to test the association between VEGF gene polymorphisms and the risk of GC. The association strength was evaluated by the odds ratio (OR) and 95% confidence interval (CI) calculated using logistic regression. Generalized multifactor dimensionality reduction (GMDR) was used to analyze the effect of the interaction between the VEGF gene and current smoking on GC risk. Results: Logistic regression analysis showed that the risk of GC was significantly higher in rs10434-G allele carriers than in AA genotype carriers (AG + GG vs. AA), with an adjusted OR (95% CI) of 1.64 (1.24-2.08). In addition, we found a significantly higher GC risk in subjects with the rs833061-T allele than in those with the CC genotype (CT + TT vs. CC), with an adjusted OR (95% CI) of 1.43 (1.10-1.87). We also found a statistically significant two-locus model (p = 0.018), including rs10434 and current smoking, indicating a significant interaction between rs10434 and current smoking on the risk of GC. Hierarchical analysis found that current smokers with the AG or GG genotype had the highest GC risk compared to never-smokers with the AA genotype, OR (95% CI) = 2.43 (1.64-3.28). Conclusion: We found that the rs10434-G and rs833061-T alleles, and the gene-environment interaction between rs10434 and current smoking, were all related to increased GC risk.
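For readers unfamiliar with how an odds ratio and its 95% confidence interval are obtained, the following minimal sketch computes an unadjusted OR with a Wald (Woolf) interval from a 2x2 table of carrier status against case status. It is an illustration only: the abstract above reports adjusted ORs from logistic regression with covariates, which this sketch does not reproduce, and the function name and its arguments are hypothetical.

```python
import math


def odds_ratio_ci(exposed_cases, exposed_controls,
                  unexposed_cases, unexposed_controls, z=1.96):
    """Unadjusted odds ratio with a Wald confidence interval.

    All four counts must be positive; z=1.96 gives an approximate 95% interval.
    """
    a, b, c, d = exposed_cases, exposed_controls, unexposed_cases, unexposed_controls
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) (Woolf's method).
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper
```

An interval that excludes 1, as for the estimates reported above, indicates an association that is statistically significant at roughly the 5% level.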