Search | BVS Violência e Saúde

1.

Mining human microbiomes reveals an untapped source of peptide antibiotics.

Torres, Marcelo D T; Brooks, Erin F; Cesaro, Angela; Sberro, Hila; Gill, Matthew O; Nicolaou, Cosmos; Bhatt, Ami S; de la Fuente-Nunez, Cesar.

Cell ; 2024 Aug 16.

Article in English | MEDLINE | ID: mdl-39163860

ABSTRACT

Drug-resistant bacteria are outpacing traditional antibiotic discovery efforts. Here, we computationally screened 444,054 previously reported putative small protein families from 1,773 human metagenomes for antimicrobial properties, identifying 323 candidates encoded in small open reading frames (smORFs). To test our computational predictions, 78 peptides were synthesized and screened for antimicrobial activity in vitro, with 70.5% displaying antimicrobial activity. As these compounds were different compared with previously reported antimicrobial peptides, we termed them smORF-encoded peptides (SEPs). SEPs killed bacteria by targeting their membrane, synergizing with each other, and modulating gut commensals, indicating a potential role in reconfiguring microbiome communities in addition to counteracting pathogens. The lead candidates were anti-infective in both murine skin abscess and deep thigh infection models. Notably, prevotellin-2 from Prevotella copri presented activity comparable to the commonly used antibiotic polymyxin B. Our report supports the existence of hundreds of antimicrobials in the human microbiome amenable to clinical translation.

2.

An explainable language model for antibody specificity prediction using curated influenza hemagglutinin antibodies.

Wang, Yiquan; Lv, Huibin; Teo, Qi Wen; Lei, Ruipeng; Gopal, Akshita B; Ouyang, Wenhao O; Yeung, Yuen-Hei; Tan, Timothy J C; Choi, Danbi; Shen, Ivana R; Chen, Xin; Graham, Claire S; Wu, Nicholas C.

Immunity ; 2024 Aug 15.

Article in English | MEDLINE | ID: mdl-39163866

ABSTRACT

Despite decades of antibody research, it remains challenging to predict the specificity of an antibody solely based on its sequence. Two major obstacles are the lack of appropriate models and the inaccessibility of datasets for model training. In this study, we curated >5,000 influenza hemagglutinin (HA) antibodies by mining research publications and patents, which revealed many distinct sequence features between antibodies to HA head and stem domains. We then leveraged this dataset to develop a lightweight memory B cell language model (mBLM) for sequence-based antibody specificity prediction. Model explainability analysis showed that mBLM could identify key sequence features of HA stem antibodies. Additionally, by applying mBLM to HA antibodies with unknown epitopes, we discovered and experimentally validated many HA stem antibodies. Overall, this study not only advances our molecular understanding of the antibody response to the influenza virus but also provides a valuable resource for applying deep learning to antibody research.

3.

Disease Heritability Inferred from Familial Relationships Reported in Medical Records.

Polubriaginof, Fernanda C G; Vanguri, Rami; Quinnies, Kayla; Belbin, Gillian M; Yahi, Alexandre; Salmasian, Hojjat; Lorberbaum, Tal; Nwankwo, Victor; Li, Li; Shervey, Mark M; Glowe, Patricia; Ionita-Laza, Iuliana; Simmerling, Mary; Hripcsak, George; Bakken, Suzanne; Goldstein, David; Kiryluk, Krzysztof; Kenny, Eimear E; Dudley, Joel; Vawdrey, David K; Tatonetti, Nicholas P.

Cell ; 173(7): 1692-1704.e11, 2018 06 14.

Article in English | MEDLINE | ID: mdl-29779949

ABSTRACT

Heritability is essential for understanding the biological causes of disease but requires laborious patient recruitment and phenotype ascertainment. Electronic health records (EHRs) passively capture a wide range of clinically relevant data and provide a resource for studying the heritability of traits that are not typically accessible. EHRs contain next-of-kin information collected via patient emergency contact forms, but until now, these data have gone unused in research. We mined emergency contact data at three academic medical centers and identified 7.4 million familial relationships while maintaining patient privacy. Identified relationships were consistent with genetically derived relatedness. We used EHR data to compute heritability estimates for 500 disease phenotypes. Overall, estimates were consistent with the literature and between sites. Inconsistencies were indicative of limitations and opportunities unique to EHR research. These analyses provide a validation of the use of EHRs for genetics and disease research.

Subjects

Electronic Health Records; Genetic Diseases, Inborn/genetics; Algorithms; Databases, Factual; Family Relations; Genetic Diseases, Inborn/pathology; Genotype; Humans; Pedigree; Phenotype; Quantitative Trait, Heritable

4.

A large-scale systematic survey reveals recurring molecular features of public antibody responses to SARS-CoV-2.

Wang, Yiquan; Yuan, Meng; Lv, Huibin; Peng, Jian; Wilson, Ian A; Wu, Nicholas C.

Immunity ; 55(6): 1105-1117.e4, 2022 06 14.

Article in English | MEDLINE | ID: mdl-35397794

ABSTRACT

Global research to combat the COVID-19 pandemic has led to the isolation and characterization of thousands of human antibodies to the SARS-CoV-2 spike protein, providing an unprecedented opportunity to study the antibody response to a single antigen. Using the information derived from 88 research publications and 13 patents, we assembled a dataset of ~8,000 human antibodies to the SARS-CoV-2 spike protein from >200 donors. By analyzing immunoglobulin V and D gene usages, complementarity-determining region H3 sequences, and somatic hypermutations, we demonstrated that the common (public) responses to different domains of the spike protein were quite different. We further used these sequences to train a deep-learning model to accurately distinguish between the human antibodies to SARS-CoV-2 spike protein and those to influenza hemagglutinin protein. Overall, this study provides an informative resource for antibody research and enhances our molecular understanding of public antibody responses.

Subjects

COVID-19; SARS-CoV-2; Antibodies, Neutralizing; Antibodies, Viral; Antibody Formation; Humans; Pandemics; Spike Glycoprotein, Coronavirus

5.

A basal-level activity of ATR links replication fork surveillance and stress response.

Yin, Yandong; Lee, Wei Ting Chelsea; Gupta, Dipika; Xue, Huijun; Tonzi, Peter; Borowiec, James A; Huang, Tony T; Modesti, Mauro; Rothenberg, Eli.

Mol Cell ; 81(20): 4243-4257.e6, 2021 10 21.

Article in English | MEDLINE | ID: mdl-34473946

ABSTRACT

Mammalian cells use diverse pathways to prevent deleterious consequences during DNA replication, yet the mechanism by which cells survey individual replisomes to detect spontaneous replication impediments at the basal level, and their accumulation during replication stress, remain undefined. Here, we used single-molecule localization microscopy coupled with high-order-correlation image-mining algorithms to quantify the composition of individual replisomes in single cells during unperturbed replication and under replicative stress. We identified a basal-level activity of ATR that monitors and regulates the amounts of RPA at forks during normal replication. Replication-stress amplifies the basal activity through the increased volume of ATR-RPA interaction and diffusion-driven enrichment of ATR at forks. This localized crowding of ATR enhances its collision probability, stimulating the activation of its replication-stress response. Finally, we provide a computational model describing how the basal activity of ATR is amplified to produce its canonical replication stress response.

Subjects

Ataxia Telangiectasia Mutated Proteins/metabolism; DNA Replication; DNA, Neoplasm/biosynthesis; Algorithms; Ataxia Telangiectasia Mutated Proteins/genetics; Cell Line, Tumor; Checkpoint Kinase 1/genetics; Checkpoint Kinase 1/metabolism; DNA, Neoplasm/genetics; Humans; Image Processing, Computer-Assisted; Kinetics; Mutation; Phosphorylation; Replication Protein A/genetics; Replication Protein A/metabolism; Single Molecule Imaging

6.

Literature-based predictions of Mendelian disease therapies.

Deisseroth, Cole A; Lee, Won-Seok; Kim, Jiyoen; Jeong, Hyun-Hwan; Dhindsa, Ryan S; Wang, Julia; Zoghbi, Huda Y; Liu, Zhandong.

Am J Hum Genet ; 110(10): 1661-1672, 2023 10 05.

Article in English | MEDLINE | ID: mdl-37741276

ABSTRACT

In the effort to treat Mendelian disorders, correcting the underlying molecular imbalance may be more effective than symptomatic treatment. Identifying treatments that might accomplish this goal requires extensive and up-to-date knowledge of molecular pathways, including drug-gene and gene-gene relationships. To address this challenge, we present "parsing modifiers via article annotations" (PARMESAN), a computational tool that searches PubMed and PubMed Central for information to assemble these relationships into a central knowledge base. PARMESAN then predicts putatively novel drug-gene relationships, assigning an evidence-based score to each prediction. We compare PARMESAN's drug-gene predictions to all of the drug-gene relationships displayed by the Drug-Gene Interaction Database (DGIdb) and show that higher-scoring relationship predictions are more likely to match the directionality (up- versus down-regulation) indicated by this database. PARMESAN had more than 200,000 drug predictions scoring above 8 (as one example cutoff), for more than 3,700 genes. Among these predicted relationships, 210 were registered in DGIdb and 201 (96%) had matching directionality. This publicly available tool provides an automated way to prioritize drug screens to target the most-promising drugs to test, thereby saving time and resources in the development of therapeutics for genetic disorders.
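As a quick check of the directionality figure quoted in this abstract, the reported 96% can be reproduced from the two counts given (210 predictions registered in DGIdb, 201 with matching direction). This is only a sketch of the arithmetic; the function and variable names are illustrative and are not part of PARMESAN.

```python
# Counts quoted in the abstract (predictions scoring above the example cutoff of 8).
REGISTERED_IN_DGIDB = 210  # predicted relationships that also appear in DGIdb
MATCHING_DIRECTION = 201   # of those, predictions whose up/down direction agrees


def concordance(matching: int, total: int) -> float:
    """Return the fraction of overlapping predictions with matching directionality."""
    if total <= 0:
        raise ValueError("total must be positive")
    return matching / total


rate = concordance(MATCHING_DIRECTION, REGISTERED_IN_DGIDB)
print(f"{rate:.1%}")  # 95.7%, which the abstract rounds to 96%
```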

Subjects

PubMed; Humans; Databases, Factual

7.

The rise of taxon-specific epitope predictors.

Campelo, Felipe; Lobo, Francisco P.

Brief Bioinform ; 25(2)2024 Jan 22.

Article in English | MEDLINE | ID: mdl-38493292

ABSTRACT

Computational predictors of immunogenic peptides, or epitopes, are traditionally built based on data from a broad range of pathogens without consideration for taxonomic information. While this approach may be reasonable if one aims to develop one-size-fits-all models, it may be counterproductive if the proteins for which the model is expected to generalize are known to come from a specific subset of phylogenetically related pathogens. There is mounting evidence that, for these cases, taxon-specific models can outperform generalist ones, even when trained with substantially smaller amounts of data. In this comment, we provide some perspective on the current state of taxon-specific modelling for the prediction of linear B-cell epitopes, and the challenges faced when building and deploying these predictors.

Subjects

Peptides; Proteins; Amino Acid Sequence; Epitopes, B-Lymphocyte

8.

Multi-omic analysis tools for microbial metabolites prediction.

Wu, Shengbo; Zhou, Haonan; Chen, Danlei; Lu, Yutong; Li, Yanni; Qiao, Jianjun.

Brief Bioinform ; 25(4)2024 May 23.

Article in English | MEDLINE | ID: mdl-38859767

ABSTRACT

How to resolve the metabolic dark matter of microorganisms has long been a challenging problem in discovering active molecules. Diverse omics tools have been developed to guide the discovery and characterization of various microbial metabolites, gradually making it possible to predict the overall metabolites of individual strains. Combining multi-omic analysis tools effectively compensates for the shortcomings of current studies that focus only on single omics or a broad class of metabolites. In this review, we systematically update, categorize and sort the analysis tools for microbial metabolite prediction from the last five years, and call for multi-omic combination in understanding the metabolic nature of microbes. First, we provide a general survey of updated prediction databases, webservers, and software based on genomics, transcriptomics, proteomics, and metabolomics, respectively. Then, we discuss the importance of integrating multi-omics data to predict the metabolites of different microbial strains and communities, and stress the value of combining other techniques, such as systems biology methods and data-driven algorithms. Finally, we identify key challenges and trends in developing multi-omic analysis tools for more comprehensive prediction of the diverse microbial metabolites that contribute to human health and disease treatment.

Subjects

Metabolomics; Software; Metabolomics/methods; Genomics/methods; Proteomics/methods; Humans; Computational Biology/methods; Bacteria/metabolism; Bacteria/genetics; Bacteria/classification; Metabolome; Algorithms; Multiomics

9.

ChIP-GPT: a managed large language model for robust data extraction from biomedical database records.

Cinquin, Olivier.

Brief Bioinform ; 25(2)2024 Jan 22.

Article in English | MEDLINE | ID: mdl-38314912

ABSTRACT

Increasing volumes of biomedical data are amassing in databases. Large-scale analyses of these data have wide-ranging applications in biology and medicine. Such analyses require tools to characterize and process entries at scale. However, existing tools, mainly centered on extracting predefined fields, often fail to comprehensively process database entries or correct evident errors, a task humans can easily perform. These tools also lack the ability to reason like domain experts, hindering their robustness and analytical depth. Recent advances with large language models (LLMs) provide a fundamentally new way to query databases. But while a tool such as ChatGPT is adept at answering questions about manually input records, challenges arise when scaling up this process. First, interactions with the LLM need to be automated. Second, limitations on input length may require a record pruning or summarization pre-processing step. Third, to behave reliably as desired, the LLM needs either well-designed, short, 'few-shot' examples, or fine-tuning based on a larger set of well-curated examples. Here, we report ChIP-GPT, based on fine-tuning of the generative pre-trained transformer (GPT) model Llama and on a program prompting the model iteratively and handling its generation of answer text. This model is designed to extract metadata from the Sequence Read Archive, emphasizing the identification of chromatin immunoprecipitation (ChIP) targets and cell lines. When trained with 100 examples, ChIP-GPT demonstrates 90-94% accuracy. Notably, it can seamlessly extract data from records with typos or absent field labels. Our proposed method is easily adaptable to customized questions and different databases.

Subjects

Medicine; Humans; Cell Line; Chromatin Immunoprecipitation; Databases, Factual; Language

10.

IDPpub: Illuminating the Dark Phosphoproteome Through PubMed Mining.

Savage, Sara R; Zhang, Yaoyun; Jaehnig, Eric J; Liao, Yuxing; Shi, Zhiao; Pham, Huy Anh; Xu, Hua; Zhang, Bing.

Mol Cell Proteomics ; 23(1): 100682, 2024 Jan.

Article in English | MEDLINE | ID: mdl-37993103

ABSTRACT

Global phosphoproteomics experiments quantify tens of thousands of phosphorylation sites. However, data interpretation is hampered by our limited knowledge on functions, biological contexts, or precipitating enzymes of the phosphosites. This study establishes a repository of phosphosites with associated evidence in biomedical abstracts, using deep learning-based natural language processing techniques. Our model for illuminating the dark phosphoproteome through PubMed mining (IDPpub) was generated by fine-tuning BioBERT, a deep learning tool for biomedical text mining. Trained using sentences containing protein substrates and phosphorylation site positions from 3000 abstracts, the IDPpub model was then used to extract phosphorylation sites from all MEDLINE abstracts. The extracted proteins were normalized to gene symbols using the National Center for Biotechnology Information gene query, and sites were mapped to human UniProt sequences using ProtMapper and mouse UniProt sequences by direct match. Precision and recall were calculated using 150 curated abstracts, and utility was assessed by analyzing the CPTAC (Clinical Proteomics Tumor Analysis Consortium) pan-cancer phosphoproteomics datasets and the PhosphoSitePlus database. Using 10-fold cross validation, pairs of correct substrates and phosphosite positions were extracted with an average precision of 0.93 and recall of 0.94. After entity normalization and site mapping to human reference sequences, an independent validation achieved a precision of 0.91 and recall of 0.77. The IDPpub repository contains 18,458 unique human phosphorylation sites with evidence sentences from 58,227 abstracts and 5918 mouse sites in 14,610 abstracts. This included evidence sentences for 1803 sites identified in CPTAC studies that are not covered by manually curated functional information in PhosphoSitePlus. Evaluation results demonstrate the potential of IDPpub as an effective biomedical text mining tool for collecting phosphosites. Moreover, the repository (http://idppub.ptmax.org), which can be automatically updated, can serve as a powerful complement to existing resources.
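The two precision/recall pairs reported in this abstract can be combined into a single F1 score (the harmonic mean of precision and recall) to make the cross-validation and independent-validation results easier to compare. The abstract itself does not report F1; the values below are derived here purely as an illustration, and the function name is hypothetical.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision/recall pairs quoted in the abstract.
cross_validation = f1_score(0.93, 0.94)        # 10-fold cross validation
independent_validation = f1_score(0.91, 0.77)  # after normalization and site mapping

print(f"cross-validation F1:       {cross_validation:.3f}")
print(f"independent validation F1: {independent_validation:.3f}")
```

The drop between the two derived scores is driven almost entirely by recall (0.94 versus 0.77), which is consistent with sites being lost during entity normalization and mapping to reference sequences.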

Subjects

Data Mining; Natural Language Processing; Humans; Data Mining/methods; Databases, Factual; PubMed

11.

Mining in space could spur sustainable growth.

Fleming, Maxwell; Lange, Ian; Shojaeinia, Sayeh; Stuermer, Martin.

Proc Natl Acad Sci U S A ; 120(43): e2221345120, 2023 Oct 24.

Article in English | MEDLINE | ID: mdl-37844231

ABSTRACT

Growth models with resources and environmental externalities typically assume that planet Earth is a closed economy. However, private firms like Blue Origin and SpaceX have reduced the cost of rocket launches by a factor of 20 over the last decade. What if these costs continue to decline, making mining from asteroids or the moon feasible? What would be the implications for economic growth and the environment? This paper provides stylized facts about cost trends, geology, and the environmental impact of mining on Earth and potentially in Space. We extend a neoclassical growth model to investigate the transition from mining on Earth to Space. We find that such a transition could potentially allow for continued growth of metal use, while limiting environmental and social costs on Earth. Acknowledging the high uncertainty around the topic, our paper provides a starting point for research on how Space mining could contribute to sustainable growth on Earth.

12.

Resistance gene-guided genome mining reveals the roseopurpurins as inhibitors of cyclin-dependent kinases.

Dunbar, Kyle L; Perlatti, Bruno; Liu, Nicholas; Cornelius, Amber; Mummau, Daniel; Chiang, Yi-Ming; Hon, Lawrence; Nimavat, Monika; Pallas, Jason; Kordes, Sina; Ng, Ho Leung; Harvey, Colin J B.

Proc Natl Acad Sci U S A ; 120(48): e2310522120, 2023 Nov 28.

Article in English | MEDLINE | ID: mdl-37983497

ABSTRACT

With the significant increase in the availability of microbial genome sequences in recent years, resistance gene-guided genome mining has emerged as a powerful approach for identifying natural products with specific bioactivities. Here, we present the use of this approach to reveal the roseopurpurins as potent inhibitors of cyclin-dependent kinases (CDKs), a class of cell cycle regulators implicated in multiple cancers. We identified a biosynthetic gene cluster (BGC) with a putative resistance gene with homology to human CDK2. Using targeted gene disruption and transcription factor overexpression in Aspergillus uvarum, and heterologous expression of the BGC in Aspergillus nidulans, we demonstrated that roseopurpurin C (1) is produced by this cluster and characterized its biosynthesis. We determined the potency, specificity, and mechanism of action of 1 as well as multiple intermediates and shunt products produced from the BGC. We show that 1 inhibits human CDK2 with a Kiapp of 44 nM, demonstrates selectivity for clinically relevant members of the CDK family, and induces G1 cell cycle arrest in HCT116 cells. Structural analysis of 1 complexed with CDK2 revealed the molecular basis of ATP-competitive inhibition.

Subjects

Cyclin-Dependent Kinases; Neoplasms; Humans; Cyclin-Dependent Kinases/metabolism; Cyclin-Dependent Kinase 2/genetics; Cyclins/metabolism; Cell Cycle/genetics; Enzyme Inhibitors

13.

China, the Democratic Republic of the Congo, and artisanal cobalt mining from 2000 through 2020.

Gulley, Andrew L.

Proc Natl Acad Sci U S A ; 120(26): e2212037120, 2023 06 27.

Article in English | MEDLINE | ID: mdl-37339197

ABSTRACT

From 2000 through 2020, demand for cobalt to manufacture batteries grew 26-fold. Eighty-two percent of this growth occurred in China and China's cobalt refinery production increased 78-fold. Diminished industrial cobalt mine production in the early-to-mid 2000s led many Chinese companies to purchase ores from artisanal cobalt miners in the Democratic Republic of the Congo (DRC), many of whom have been found to be children. Despite extensive research on artisanal cobalt mining, fundamental questions about its production remain unanswered. This gap is addressed here by estimating artisanal cobalt production, processing, and trade. The results show that, while total DRC cobalt mine production grew from 11,000 metric tons (t) in 2000 to 98,000 t in 2020, artisanal production only grew from 1,000 to 2,000 t in 2000 to 9,000 to 11,000 t in 2020 (with a peak of 17,000 to 21,000 t in 2018). Artisanal production's share of world and DRC cobalt mine production peaked around 2008 at 18 to 23% and 40 to 53%, respectively, before trending down to 6 to 8% and 9 to 11% in 2020, respectively. Artisanal production was chiefly exported to China or processed within the DRC by Chinese firms. An average of 72 to 79% of artisanal production was processed at facilities within the DRC from 2016 through 2020. As such, these facilities may be potential monitoring points for artisanal production and its downstream consumers. This finding may help to support responsible sourcing initiatives and better address abuses related to artisanal cobalt mining by focusing local efforts at the artisanal processing facilities through which most artisanal cobalt production flows.

Subjects

Cobalt; Mining; Humans; Child; Democratic Republic of the Congo; Industry; China

14.

The pseudotorsional space of RNA.

Grille, Leandro; Gallego, Diego; Darré, Leonardo; da Rosa, Gabriela; Battistini, Federica; Orozco, Modesto; Dans, Pablo D.

RNA ; 29(12): 1896-1909, 2023 12.

Article in English | MEDLINE | ID: mdl-37793790

ABSTRACT

The characterization of the conformational landscape of the RNA backbone is rather complex due to the ability of RNA to assume a large variety of conformations. These backbone conformations can be depicted by pseudotorsional angles linking RNA backbone atoms, from which Ramachandran-like plots can be built. We explore here different definitions of these pseudotorsional angles, finding that the most accurate ones are the traditional η (eta) and θ (theta) angles, which represent the relative position of RNA backbone atoms P and C4'. We explore the distribution of η-θ in known experimental structures, comparing the pseudotorsional space generated with structures determined exclusively by one experimental technique. We found that the complete picture only appears when combining data from different sources. The maps provide a quite comprehensive representation of the RNA accessible space, which can be used in RNA-structural predictions. Finally, our results highlight that protein interactions lead to significant changes in the population of the η-θ space, pointing toward the role of induced-fit mechanisms in protein-RNA recognition.
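The pseudotorsional angles described in this abstract are ordinary dihedral angles computed over four backbone atoms. The sketch below assumes the conventional definitions, which the abstract does not spell out: η over C4'(i-1), P(i), C4'(i), P(i+1) and θ over P(i), C4'(i), P(i+1), C4'(i+1). Coordinates are plain (x, y, z) tuples; all function names are illustrative.

```python
import math

Vec = tuple[float, float, float]


def _sub(a: Vec, b: Vec) -> Vec:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a: Vec, b: Vec) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a: Vec, b: Vec) -> Vec:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0: Vec, p1: Vec, p2: Vec, p3: Vec) -> float:
    """Dihedral angle in degrees, in (-180, 180], about the p1-p2 axis."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    if norm == 0:
        raise ValueError("central atoms coincide; dihedral is undefined")
    axis = (b1[0] / norm, b1[1] / norm, b1[2] / norm)
    # Project the outer bonds onto the plane perpendicular to the central axis.
    v = _sub(b0, tuple(_dot(b0, axis) * c for c in axis))
    w = _sub(b2, tuple(_dot(b2, axis) * c for c in axis))
    x = _dot(v, w)
    y = _dot(_cross(axis, v), w)
    return math.degrees(math.atan2(y, x))


def eta_theta(c4_prev: Vec, p_i: Vec, c4_i: Vec,
              p_next: Vec, c4_next: Vec) -> tuple[float, float]:
    """Return (eta, theta) for nucleotide i under the assumed atom definitions."""
    eta = dihedral(c4_prev, p_i, c4_i, p_next)
    theta = dihedral(p_i, c4_i, p_next, c4_next)
    return eta, theta
```

Because each angle needs atoms from neighbouring residues, η is undefined for the first nucleotide of a chain and both angles are undefined for the last; a caller iterating over a structure would skip those positions.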

Subjects

Proteins; RNA; RNA/genetics; RNA/chemistry; Proteins/chemistry; Nucleic Acid Conformation

15.

OTTM: an automated classification tool for translational drug discovery from omics data.

Yang, Xiaobo; Zhang, Bei; Wang, Siqi; Lu, Ye; Chen, Kaixian; Luo, Cheng; Sun, Aihua; Zhang, Hao.

Brief Bioinform ; 24(5)2023 09 20.

Article in English | MEDLINE | ID: mdl-37594310

ABSTRACT

Omics data from clinical samples are the predominant source of target discovery and drug development. Typically, hundreds or thousands of differentially expressed genes or proteins can be identified from omics data. This scale of possibilities is overwhelming for target discovery and validation using biochemical or cellular experiments. Most of these proteins and genes have no corresponding drugs or even active compounds. Moreover, a proportion of them may have been previously reported as being relevant to the disease of interest. To facilitate translational drug discovery from omics data, we have developed a new classification tool named Omics and Text driven Translational Medicine (OTTM). This tool can markedly narrow the range of proteins or genes that merit further validation via drug availability assessment and literature mining. For the 4489 candidate proteins identified in our previous proteomics study, OTTM recommended 40 FDA-approved or clinical trial drugs. Of these, 15 are available commercially and were tested on hepatocellular carcinoma Hep-G2 cells. Two drugs, tafenoquine succinate (an FDA-approved antimalarial drug targeting CYC1) and branaplam (a Phase 3 clinical drug targeting SMN1 for the treatment of spinal muscular atrophy), showed potent inhibitory activity against Hep-G2 cell viability, suggesting that CYC1 and SMN1 may be potential therapeutic target proteins for hepatocellular carcinoma. In summary, OTTM is an efficient classification tool that can accelerate the discovery of effective drugs and targets using thousands of candidate proteins identified from omics data. The online and local versions of OTTM are available at http://otter-simm.com/ottm.html.

Subjects

Carcinoma, Hepatocellular; Liver Neoplasms; Humans; Translational Science, Biomedical; Proteomics; Drug Discovery

16.

Large-scale predicting protein functions through heterogeneous feature fusion.

Zheng, Rongtao; Huang, Zhijian; Deng, Lei.

Brief Bioinform ; 24(4)2023 07 20.

Article in English | MEDLINE | ID: mdl-37401369

ABSTRACT

As the volume of protein sequence and structure data grows rapidly, the functions of the overwhelming majority of proteins cannot be experimentally determined. Automated annotation of protein function at a large scale is becoming increasingly important. Existing computational prediction methods are typically based on expanding the relatively small number of experimentally determined functions to large collections of proteins with various clues, including sequence homology, protein-protein interaction, gene co-expression, etc. Although there has been some progress in protein function prediction in recent years, the development of accurate and reliable solutions still has a long way to go. Here we exploit AlphaFold predicted three-dimensional structural information, together with other non-structural clues, to develop a large-scale approach termed PredGO to annotate Gene Ontology (GO) functions for proteins. We use a pre-trained language model, geometric vector perceptrons and attention mechanisms to extract heterogeneous features of proteins and fuse these features for function prediction. The computational results demonstrate that the proposed method outperforms other state-of-the-art approaches for predicting GO functions of proteins in terms of both coverage and accuracy. The improvement of coverage is because the number of structures predicted by AlphaFold is greatly increased, and on the other hand, PredGO can extensively use non-structural information for functional prediction. Moreover, we show that over 205 000 (~100%) entries in UniProt for human are annotated by PredGO, over 186 000 (~90%) of which are based on predicted structure. The webserver and database are available at http://predgo.denglab.org/.

Subjects

Computational Biology; Proteins; Humans; Computational Biology/methods; Proteins/chemistry; Amino Acid Sequence; Neural Networks, Computer; Databases, Factual; Databases, Protein

17.

Triangulating evidence in health sciences with Annotated Semantic Queries.

Liu, Yi; Gaunt, Tom R.

Bioinformatics ; 2024 Aug 22.

Article in English | MEDLINE | ID: mdl-39171832

ABSTRACT

MOTIVATION: Integrating information from data sources representing different study designs has the potential to strengthen evidence in population health research. However, this concept of evidence "triangulation" presents a number of challenges for systematically identifying and integrating relevant information. These include the harmonization of heterogeneous evidence with common semantic concepts and properties, as well as the prioritization of the retrieved evidence for triangulation with the question of interest. RESULTS: We present ASQ (Annotated Semantic Queries), a natural language query interface to the integrated biomedical entities and epidemiological evidence in EpiGraphDB, which enables users to extract "claims" from a piece of unstructured text, and then investigate the evidence that could support or contradict the claims, or offer additional information to the query. This approach has the potential to support the rapid review of preprints, grant applications, conference abstracts and articles submitted for peer review. ASQ implements strategies to harmonize biomedical entities in different taxonomies and evidence from different sources, to facilitate evidence triangulation and interpretation. AVAILABILITY AND IMPLEMENTATION: ASQ is openly available at https://asq.epigraphdb.org and its source code is available at https://github.com/mrcieu/epigraphdb-asq under the GPL-3.0 license. SUPPLEMENTARY INFORMATION: Further information can be found in the Supplementary Materials as well as on the ASQ platform via https://asq.epigraphdb.org/docs.

18.

Extracting structured information from unstructured histopathology reports using generative pre-trained transformer 4 (GPT-4).

Truhn, Daniel; Loeffler, Chiara Ml; Müller-Franzes, Gustav; Nebelung, Sven; Hewitt, Katherine J; Brandner, Sebastian; Bressem, Keno K; Foersch, Sebastian; Kather, Jakob Nikolas.

J Pathol ; 262(3): 310-319, 2024 03.

Article in English | MEDLINE | ID: mdl-38098169

ABSTRACT

Deep learning applied to whole-slide histopathology images (WSIs) has the potential to enhance precision oncology and alleviate the workload of experts. However, developing these models necessitates large amounts of data with ground truth labels, which can be both time-consuming and expensive to obtain. Pathology reports are typically unstructured or poorly structured texts, and efforts to implement structured reporting templates have been unsuccessful, as these efforts lead to perceived extra workload. In this study, we hypothesised that large language models (LLMs), such as the generative pre-trained transformer 4 (GPT-4), can extract structured data from unstructured plain language reports using a zero-shot approach without requiring any re-training. We tested this hypothesis by utilising GPT-4 to extract information from histopathological reports, focusing on two extensive sets of pathology reports for colorectal cancer and glioblastoma. We found a high concordance between LLM-generated structured data and human-generated structured data. Consequently, LLMs could potentially be employed routinely to extract ground truth data for machine learning from unstructured pathology reports in the future. © 2023 The Authors. The Journal of Pathology published by John Wiley & Sons Ltd on behalf of The Pathological Society of Great Britain and Ireland.

Subjects

Glioblastoma, Precision Medicine, Humans, Machine Learning, United Kingdom

19.

RDscan: Extracting RNA-disease relationship from the literature based on pre-training model.

Zhang, Yang; Yang, Yu; Ren, Liping; Ning, Lin; Zou, Quan; Luo, Nanchao; Zhang, Yinghui; Liu, Ruijun.

Methods ; 228: 48-54, 2024 Aug.

Article in English | MEDLINE | ID: mdl-38789016

ABSTRACT

With the rapid advancements in molecular biology and genomics, a multitude of connections between RNA and diseases has been unveiled, making the efficient and accurate extraction of RNA-disease (RD) relationships from extensive biomedical literature crucial for advancing research in this field. This study introduces RDscan, a novel text mining method developed based on the pre-training and fine-tuning strategy, aimed at automatically extracting RD-related information from a vast corpus of literature using pre-trained biomedical large language models (LLMs). Initially, we constructed a dedicated RD corpus by manual curation of the literature, comprising 2,082 positive and 2,000 negative sentences, alongside an independent test dataset (comprising 500 positive and 500 negative sentences) for training and evaluating RDscan. Subsequently, by fine-tuning the Bioformer and BioBERT pre-trained models, RDscan demonstrated exceptional performance in text classification and named entity recognition (NER) tasks. In 5-fold cross-validation, RDscan significantly outperformed traditional machine learning methods (Support Vector Machine, Logistic Regression and Random Forest). In addition, we have developed an accessible webserver that assists users in extracting RD relationships from text. In summary, RDscan represents the first text mining tool specifically designed for RD relationship extraction, and is poised to emerge as an invaluable tool for researchers dedicated to exploring the intricate interactions between RNA and diseases. The RDscan webserver is freely available at https://cellknowledge.com.cn/RDscan/.

Subjects

Data Mining, RNA, Data Mining/methods, RNA/genetics, Humans, Machine Learning, Disease/genetics, Support Vector Machine, Software

20.

A pantropical assessment of deforestation caused by industrial mining.

Giljum, Stefan; Maus, Victor; Kuschnig, Nikolas; Luckeneder, Sebastian; Tost, Michael; Sonter, Laura J; Bebbington, Anthony J.

Proc Natl Acad Sci U S A ; 119(38): e2118273119, 2022 09 20.

Article in English | MEDLINE | ID: mdl-36095187

ABSTRACT

Growing demand for minerals continues to drive deforestation worldwide. Tropical forests are particularly vulnerable to the environmental impacts of mining and mineral processing. Many local- to regional-scale studies document extensive, long-lasting impacts of mining on biodiversity and ecosystem services. However, the full scope of deforestation induced by industrial mining across the tropics is yet unknown. Here, we present a biome-wide assessment to show where industrial mine expansion has caused the most deforestation from 2000 to 2019. We find that 3,264 km² of forest was directly lost due to industrial mining, with 80% occurring in only four countries: Indonesia, Brazil, Ghana, and Suriname. Additionally, controlling for other nonmining determinants of deforestation, we find that mining caused indirect forest loss in two-thirds of the investigated countries. Our results illustrate significant yet unevenly distributed and often unmanaged impacts on these biodiverse ecosystems. Impact assessments and mitigation plans of industrial mining activities must address direct and indirect impacts to support conservation of the world's tropical forests.

Subjects

Biodiversity, Conservation of Natural Resources, Forests, Mining, Conservation of Natural Resources/methods
