Results 1 - 20 of 10,904
1.
Cell ; 187(19): 5453-5467.e15, 2024 Sep 19.
Article in English | MEDLINE | ID: mdl-39163860

ABSTRACT

Drug-resistant bacteria are outpacing traditional antibiotic discovery efforts. Here, we computationally screened 444,054 previously reported putative small protein families from 1,773 human metagenomes for antimicrobial properties, identifying 323 candidates encoded in small open reading frames (smORFs). To test our computational predictions, 78 peptides were synthesized and screened for antimicrobial activity in vitro, with 70.5% displaying antimicrobial activity. As these compounds were different compared with previously reported antimicrobial peptides, we termed them smORF-encoded peptides (SEPs). SEPs killed bacteria by targeting their membrane, synergizing with each other, and modulating gut commensals, indicating a potential role in reconfiguring microbiome communities in addition to counteracting pathogens. The lead candidates were anti-infective in both murine skin abscess and deep thigh infection models. Notably, prevotellin-2 from Prevotella copri presented activity comparable to the commonly used antibiotic polymyxin B. Our report supports the existence of hundreds of antimicrobials in the human microbiome amenable to clinical translation.


Subject(s)
Anti-Bacterial Agents , Antimicrobial Peptides , Microbiota , Humans , Animals , Mice , Anti-Bacterial Agents/pharmacology , Microbiota/drug effects , Antimicrobial Peptides/pharmacology , Antimicrobial Peptides/chemistry , Metagenome , Female , Open Reading Frames , Bacteria/drug effects , Bacteria/genetics , Bacteria/classification , Prevotella/drug effects
2.
Immunity ; 57(10): 2453-2465.e7, 2024 Oct 08.
Article in English | MEDLINE | ID: mdl-39163866

ABSTRACT

Despite decades of antibody research, it remains challenging to predict the specificity of an antibody solely based on its sequence. Two major obstacles are the lack of appropriate models and the inaccessibility of datasets for model training. In this study, we curated >5,000 influenza hemagglutinin (HA) antibodies by mining research publications and patents, which revealed many distinct sequence features between antibodies to HA head and stem domains. We then leveraged this dataset to develop a lightweight memory B cell language model (mBLM) for sequence-based antibody specificity prediction. Model explainability analysis showed that mBLM could identify key sequence features of HA stem antibodies. Additionally, by applying mBLM to HA antibodies with unknown epitopes, we discovered and experimentally validated many HA stem antibodies. Overall, this study not only advances our molecular understanding of the antibody response to the influenza virus but also provides a valuable resource for applying deep learning to antibody research.


Subject(s)
Antibodies, Viral , Antibody Specificity , Hemagglutinin Glycoproteins, Influenza Virus , Hemagglutinin Glycoproteins, Influenza Virus/immunology , Humans , Antibodies, Viral/immunology , Antibody Specificity/immunology , Influenza, Human/immunology , Epitopes/immunology , Animals , Deep Learning
3.
Cell ; 173(7): 1692-1704.e11, 2018 06 14.
Article in English | MEDLINE | ID: mdl-29779949

ABSTRACT

Heritability is essential for understanding the biological causes of disease but requires laborious patient recruitment and phenotype ascertainment. Electronic health records (EHRs) passively capture a wide range of clinically relevant data and provide a resource for studying the heritability of traits that are not typically accessible. EHRs contain next-of-kin information collected via patient emergency contact forms, but until now, these data have gone unused in research. We mined emergency contact data at three academic medical centers and identified 7.4 million familial relationships while maintaining patient privacy. Identified relationships were consistent with genetically derived relatedness. We used EHR data to compute heritability estimates for 500 disease phenotypes. Overall, estimates were consistent with the literature and between sites. Inconsistencies were indicative of limitations and opportunities unique to EHR research. These analyses provide a validation of the use of EHRs for genetics and disease research.


Subject(s)
Electronic Health Records , Genetic Diseases, Inborn/genetics , Algorithms , Databases, Factual , Family Relations , Genetic Diseases, Inborn/pathology , Genotype , Humans , Pedigree , Phenotype , Quantitative Trait, Heritable
4.
Immunity ; 55(6): 1105-1117.e4, 2022 06 14.
Article in English | MEDLINE | ID: mdl-35397794

ABSTRACT

Global research to combat the COVID-19 pandemic has led to the isolation and characterization of thousands of human antibodies to the SARS-CoV-2 spike protein, providing an unprecedented opportunity to study the antibody response to a single antigen. Using the information derived from 88 research publications and 13 patents, we assembled a dataset of ∼8,000 human antibodies to the SARS-CoV-2 spike protein from >200 donors. By analyzing immunoglobulin V and D gene usages, complementarity-determining region H3 sequences, and somatic hypermutations, we demonstrated that the common (public) responses to different domains of the spike protein were quite different. We further used these sequences to train a deep-learning model to accurately distinguish between the human antibodies to SARS-CoV-2 spike protein and those to influenza hemagglutinin protein. Overall, this study provides an informative resource for antibody research and enhances our molecular understanding of public antibody responses.


Subject(s)
COVID-19 , SARS-CoV-2 , Antibodies, Neutralizing , Antibodies, Viral , Antibody Formation , Humans , Pandemics , Spike Glycoprotein, Coronavirus
5.
Mol Cell ; 81(20): 4243-4257.e6, 2021 10 21.
Article in English | MEDLINE | ID: mdl-34473946

ABSTRACT

Mammalian cells use diverse pathways to prevent deleterious consequences during DNA replication, yet the mechanism by which cells survey individual replisomes to detect spontaneous replication impediments at the basal level, and their accumulation during replication stress, remain undefined. Here, we used single-molecule localization microscopy coupled with high-order-correlation image-mining algorithms to quantify the composition of individual replisomes in single cells during unperturbed replication and under replicative stress. We identified a basal-level activity of ATR that monitors and regulates the amounts of RPA at forks during normal replication. Replication-stress amplifies the basal activity through the increased volume of ATR-RPA interaction and diffusion-driven enrichment of ATR at forks. This localized crowding of ATR enhances its collision probability, stimulating the activation of its replication-stress response. Finally, we provide a computational model describing how the basal activity of ATR is amplified to produce its canonical replication stress response.


Subject(s)
Ataxia Telangiectasia Mutated Proteins/metabolism , DNA Replication , DNA, Neoplasm/biosynthesis , Algorithms , Ataxia Telangiectasia Mutated Proteins/genetics , Cell Line, Tumor , Checkpoint Kinase 1/genetics , Checkpoint Kinase 1/metabolism , DNA, Neoplasm/genetics , Humans , Image Processing, Computer-Assisted , Kinetics , Mutation , Phosphorylation , Replication Protein A/genetics , Replication Protein A/metabolism , Single Molecule Imaging
6.
Proc Natl Acad Sci U S A ; 121(45): e2417688121, 2024 Nov 05.
Article in English | MEDLINE | ID: mdl-39475648

ABSTRACT

Mining of electronic health records (EHR) promises to automate the identification of comprehensive disease phenotypes. However, the realization of this promise is hindered by the unavailability of generalizable ground-truth information, data incompleteness and heterogeneity, and the lack of generalization to multiple cohorts. We present here a data-driven approach to identify clinical states that we implement for 585 critical care patients with suspected pneumonia recruited by the SCRIPT study, which we compare to and integrate with 9,918 pneumonia patients from the MIMIC-IV dataset. We extract and curate from their structured EHRs a primary set of clinical features (53 and 59 features for SCRIPT and MIMIC-IV, respectively), including disease severity scores, vital signs, and so on, at various degrees of completeness. We aggregate irregular time series into daily frequency, resulting in 12,495 and 94,684 patient-day pairs for SCRIPT and MIMIC, respectively. We define a "common-sense" ground truth that we then use in a semisupervised pipeline to optimize choices for data preprocessing, and reduce the feature space to four principal components. We describe and validate an ensemble-based clustering method that enables us to robustly identify five clinical states, and use a Gaussian mixture model to quantify uncertainty in cluster assignment. Demonstrating the clinical relevance of the identified states, we find that three states are strongly associated with disease outcomes (dying vs. recovering), while the other two reflect disease etiology. The outcome associated clinical states provide significantly increased discrimination of mortality rates over standard approaches.


Subject(s)
Data Mining , Electronic Health Records , Pneumonia , Humans , Pneumonia/mortality , Pneumonia/epidemiology , Data Mining/methods , Male , Female , Cluster Analysis
7.
Am J Hum Genet ; 110(10): 1661-1672, 2023 10 05.
Article in English | MEDLINE | ID: mdl-37741276

ABSTRACT

In the effort to treat Mendelian disorders, correcting the underlying molecular imbalance may be more effective than symptomatic treatment. Identifying treatments that might accomplish this goal requires extensive and up-to-date knowledge of molecular pathways, including drug-gene and gene-gene relationships. To address this challenge, we present "parsing modifiers via article annotations" (PARMESAN), a computational tool that searches PubMed and PubMed Central for information to assemble these relationships into a central knowledge base. PARMESAN then predicts putatively novel drug-gene relationships, assigning an evidence-based score to each prediction. We compare PARMESAN's drug-gene predictions to all of the drug-gene relationships displayed by the Drug-Gene Interaction Database (DGIdb) and show that higher-scoring relationship predictions are more likely to match the directionality (up- versus down-regulation) indicated by this database. PARMESAN had more than 200,000 drug predictions scoring above 8 (as one example cutoff), for more than 3,700 genes. Among these predicted relationships, 210 were registered in DGIdb and 201 (96%) had matching directionality. This publicly available tool provides an automated way to prioritize drug screens to target the most-promising drugs to test, thereby saving time and resources in the development of therapeutics for genetic disorders.


Subject(s)
PubMed , Humans , Databases, Factual
8.
Brief Bioinform ; 25(2)2024 Jan 22.
Article in English | MEDLINE | ID: mdl-38493292

ABSTRACT

Computational predictors of immunogenic peptides, or epitopes, are traditionally built based on data from a broad range of pathogens without consideration for taxonomic information. While this approach may be reasonable if one aims to develop one-size-fits-all models, it may be counterproductive if the proteins for which the model is expected to generalize are known to come from a specific subset of phylogenetically related pathogens. There is mounting evidence that, for these cases, taxon-specific models can outperform generalist ones, even when trained with substantially smaller amounts of data. In this comment, we provide some perspective on the current state of taxon-specific modelling for the prediction of linear B-cell epitopes, and the challenges faced when building and deploying these predictors.


Subject(s)
Peptides , Proteins , Amino Acid Sequence , Epitopes, B-Lymphocyte
9.
Brief Bioinform ; 25(4)2024 May 23.
Article in English | MEDLINE | ID: mdl-38859767

ABSTRACT

How to resolve the metabolic dark matter of microorganisms has long been a challenging problem in discovering active molecules. Diverse omics tools have been developed to guide the discovery and characterization of various microbial metabolites, making it increasingly possible to predict the overall metabolites of individual strains. Combining multi-omic analysis tools effectively compensates for the shortcomings of current studies that focus only on a single omics layer or a broad class of metabolites. In this review, we systematically update, categorize, and sort the analysis tools for microbial metabolite prediction from the last five years, and argue for multi-omic combination in understanding the metabolic nature of microbes. First, we survey updated prediction databases, webservers, and software based on genomics, transcriptomics, proteomics, and metabolomics, respectively. Then, we discuss why integrating multi-omics data is essential for predicting the metabolites of different microbial strains and communities, and stress the combination with other techniques, such as systems biology methods and data-driven algorithms. Finally, we identify key challenges and trends in developing multi-omic analysis tools for more comprehensive prediction of diverse microbial metabolites that contribute to human health and disease treatment.


Subject(s)
Metabolomics , Software , Metabolomics/methods , Genomics/methods , Proteomics/methods , Humans , Computational Biology/methods , Bacteria/metabolism , Bacteria/genetics , Bacteria/classification , Metabolome , Algorithms , Multiomics
10.
Brief Bioinform ; 25(2)2024 Jan 22.
Article in English | MEDLINE | ID: mdl-38314912

ABSTRACT

Increasing volumes of biomedical data are amassing in databases. Large-scale analyses of these data have wide-ranging applications in biology and medicine. Such analyses require tools to characterize and process entries at scale. However, existing tools, mainly centered on extracting predefined fields, often fail to comprehensively process database entries or correct evident errors, a task humans can easily perform. These tools also lack the ability to reason like domain experts, hindering their robustness and analytical depth. Recent advances with large language models (LLMs) provide a fundamentally new way to query databases. But while a tool such as ChatGPT is adept at answering questions about manually input records, challenges arise when scaling up this process. First, interactions with the LLM need to be automated. Second, limitations on input length may require a record pruning or summarization pre-processing step. Third, to behave reliably as desired, the LLM needs either well-designed, short, 'few-shot' examples, or fine-tuning based on a larger set of well-curated examples. Here, we report ChIP-GPT, based on fine-tuning of the generative pre-trained transformer (GPT) model Llama and on a program prompting the model iteratively and handling its generation of answer text. This model is designed to extract metadata from the Sequence Read Archive, emphasizing the identification of chromatin immunoprecipitation (ChIP) targets and cell lines. When trained with 100 examples, ChIP-GPT demonstrates 90-94% accuracy. Notably, it can seamlessly extract data from records with typos or absent field labels. Our proposed method is easily adaptable to customized questions and different databases.


Subject(s)
Medicine , Humans , Cell Line , Chromatin Immunoprecipitation , Databases, Factual , Language
11.
Mol Cell Proteomics ; 23(1): 100682, 2024 Jan.
Article in English | MEDLINE | ID: mdl-37993103

ABSTRACT

Global phosphoproteomics experiments quantify tens of thousands of phosphorylation sites. However, data interpretation is hampered by our limited knowledge on functions, biological contexts, or precipitating enzymes of the phosphosites. This study establishes a repository of phosphosites with associated evidence in biomedical abstracts, using deep learning-based natural language processing techniques. Our model for illuminating the dark phosphoproteome through PubMed mining (IDPpub) was generated by fine-tuning BioBERT, a deep learning tool for biomedical text mining. Trained using sentences containing protein substrates and phosphorylation site positions from 3000 abstracts, the IDPpub model was then used to extract phosphorylation sites from all MEDLINE abstracts. The extracted proteins were normalized to gene symbols using the National Center for Biotechnology Information gene query, and sites were mapped to human UniProt sequences using ProtMapper and mouse UniProt sequences by direct match. Precision and recall were calculated using 150 curated abstracts, and utility was assessed by analyzing the CPTAC (Clinical Proteomics Tumor Analysis Consortium) pan-cancer phosphoproteomics datasets and the PhosphoSitePlus database. Using 10-fold cross validation, pairs of correct substrates and phosphosite positions were extracted with an average precision of 0.93 and recall of 0.94. After entity normalization and site mapping to human reference sequences, an independent validation achieved a precision of 0.91 and recall of 0.77. The IDPpub repository contains 18,458 unique human phosphorylation sites with evidence sentences from 58,227 abstracts and 5918 mouse sites in 14,610 abstracts. This included evidence sentences for 1803 sites identified in CPTAC studies that are not covered by manually curated functional information in PhosphoSitePlus. Evaluation results demonstrate the potential of IDPpub as an effective biomedical text mining tool for collecting phosphosites. Moreover, the repository (http://idppub.ptmax.org), which can be automatically updated, can serve as a powerful complement to existing resources.


Subject(s)
Data Mining , Natural Language Processing , Humans , Data Mining/methods , Databases, Factual , PubMed
12.
Proc Natl Acad Sci U S A ; 120(43): e2221345120, 2023 Oct 24.
Article in English | MEDLINE | ID: mdl-37844231

ABSTRACT

Growth models with resources and environmental externalities typically assume that planet Earth is a closed economy. However, private firms like Blue Origin and SpaceX have reduced the cost of rocket launches by a factor of 20 over the last decade. What if these costs continue to decline, making mining from asteroids or the moon feasible? What would be the implications for economic growth and the environment? This paper provides stylized facts about cost trends, geology, and the environmental impact of mining on Earth and potentially in Space. We extend a neoclassical growth model to investigate the transition from mining on Earth to Space. We find that such a transition could potentially allow for continued growth of metal use, while limiting environmental and social costs on Earth. Acknowledging the high uncertainty around the topic, our paper provides a starting point for research on how Space mining could contribute to sustainable growth on Earth.

13.
Proc Natl Acad Sci U S A ; 120(48): e2310522120, 2023 Nov 28.
Article in English | MEDLINE | ID: mdl-37983497

ABSTRACT

With the significant increase in the availability of microbial genome sequences in recent years, resistance gene-guided genome mining has emerged as a powerful approach for identifying natural products with specific bioactivities. Here, we present the use of this approach to reveal the roseopurpurins as potent inhibitors of cyclin-dependent kinases (CDKs), a class of cell cycle regulators implicated in multiple cancers. We identified a biosynthetic gene cluster (BGC) with a putative resistance gene with homology to human CDK2. Using targeted gene disruption and transcription factor overexpression in Aspergillus uvarum, and heterologous expression of the BGC in Aspergillus nidulans, we demonstrated that roseopurpurin C (1) is produced by this cluster and characterized its biosynthesis. We determined the potency, specificity, and mechanism of action of 1 as well as multiple intermediates and shunt products produced from the BGC. We show that 1 inhibits human CDK2 with a Kiapp of 44 nM, demonstrates selectivity for clinically relevant members of the CDK family, and induces G1 cell cycle arrest in HCT116 cells. Structural analysis of 1 complexed with CDK2 revealed the molecular basis of ATP-competitive inhibition.


Subject(s)
Cyclin-Dependent Kinases , Neoplasms , Humans , Cyclin-Dependent Kinases/metabolism , Cyclin-Dependent Kinase 2/genetics , Cyclins/metabolism , Cell Cycle/genetics , Enzyme Inhibitors
14.
Proc Natl Acad Sci U S A ; 120(26): e2212037120, 2023 06 27.
Article in English | MEDLINE | ID: mdl-37339197

ABSTRACT

From 2000 through 2020, demand for cobalt to manufacture batteries grew 26-fold. Eighty-two percent of this growth occurred in China and China's cobalt refinery production increased 78-fold. Diminished industrial cobalt mine production in the early-to-mid 2000s led many Chinese companies to purchase ores from artisanal cobalt miners in the Democratic Republic of the Congo (DRC), many of whom have been found to be children. Despite extensive research on artisanal cobalt mining, fundamental questions about its production remain unanswered. This gap is addressed here by estimating artisanal cobalt production, processing, and trade. The results show that, while total DRC cobalt mine production grew from 11,000 metric tons (t) in 2000 to 98,000 t in 2020, artisanal production only grew from 1,000 to 2,000 t in 2000 to 9,000 to 11,000 t in 2020 (with a peak of 17,000 to 21,000 t in 2018). Artisanal production's share of world and DRC cobalt mine production peaked around 2008 at 18 to 23% and 40 to 53%, respectively, before trending down to 6 to 8% and 9 to 11% in 2020, respectively. Artisanal production was chiefly exported to China or processed within the DRC by Chinese firms. An average of 72 to 79% of artisanal production was processed at facilities within the DRC from 2016 through 2020. As such, these facilities may be potential monitoring points for artisanal production and its downstream consumers. This finding may help to support responsible sourcing initiatives and better address abuses related to artisanal cobalt mining by focusing local efforts at the artisanal processing facilities through which most artisanal cobalt production flows.


Subject(s)
Cobalt , Mining , Humans , Child , Democratic Republic of the Congo , Industry , China
15.
RNA ; 29(12): 1896-1909, 2023 12.
Article in English | MEDLINE | ID: mdl-37793790

ABSTRACT

The characterization of the conformational landscape of the RNA backbone is rather complex due to the ability of RNA to assume a large variety of conformations. These backbone conformations can be depicted by pseudotorsional angles linking RNA backbone atoms, from which Ramachandran-like plots can be built. We explore here different definitions of these pseudotorsional angles, finding that the most accurate ones are the traditional η (eta) and θ (theta) angles, which represent the relative position of RNA backbone atoms P and C4'. We explore the distribution of η - θ in known experimental structures, comparing the pseudotorsional space generated with structures determined exclusively by one experimental technique. We found that the complete picture only appears when combining data from different sources. The maps provide a quite comprehensive representation of the RNA accessible space, which can be used in RNA-structural predictions. Finally, our results highlight that protein interactions lead to significant changes in the population of the η - θ space, pointing toward the role of induced-fit mechanisms in protein-RNA recognition.


Subject(s)
Proteins , RNA , RNA/genetics , RNA/chemistry , Proteins/chemistry , Nucleic Acid Conformation
16.
Brief Bioinform ; 24(4)2023 07 20.
Article in English | MEDLINE | ID: mdl-37401369

ABSTRACT

As the volume of protein sequence and structure data grows rapidly, the functions of the overwhelming majority of proteins cannot be experimentally determined. Automated annotation of protein function at a large scale is becoming increasingly important. Existing computational prediction methods are typically based on expanding the relatively small number of experimentally determined functions to large collections of proteins with various clues, including sequence homology, protein-protein interaction, gene co-expression, etc. Although there has been some progress in protein function prediction in recent years, the development of accurate and reliable solutions still has a long way to go. Here we exploit AlphaFold predicted three-dimensional structural information, together with other non-structural clues, to develop a large-scale approach termed PredGO to annotate Gene Ontology (GO) functions for proteins. We use a pre-trained language model, geometric vector perceptrons and attention mechanisms to extract heterogeneous features of proteins and fuse these features for function prediction. The computational results demonstrate that the proposed method outperforms other state-of-the-art approaches for predicting GO functions of proteins in terms of both coverage and accuracy. The improvement in coverage arises because the number of structures predicted by AlphaFold has greatly increased and because PredGO can make extensive use of non-structural information for functional prediction. Moreover, we show that over 205 000 (~100%) entries in UniProt for human are annotated by PredGO, over 186 000 (~90%) of which are based on predicted structure. The webserver and database are available at http://predgo.denglab.org/.


Subject(s)
Computational Biology , Proteins , Humans , Computational Biology/methods , Proteins/chemistry , Amino Acid Sequence , Neural Networks, Computer , Databases, Factual , Databases, Protein
17.
Brief Bioinform ; 24(5)2023 09 20.
Article in English | MEDLINE | ID: mdl-37594310

ABSTRACT

Omics data from clinical samples are the predominant source of target discovery and drug development. Typically, hundreds or thousands of differentially expressed genes or proteins can be identified from omics data. This scale of possibilities is overwhelming for target discovery and validation using biochemical or cellular experiments. Most of these proteins and genes have no corresponding drugs or even active compounds. Moreover, a proportion of them may have been previously reported as being relevant to the disease of interest. To facilitate translational drug discovery from omics data, we have developed a new classification tool named Omics and Text driven Translational Medicine (OTTM). This tool can markedly narrow the range of proteins or genes that merit further validation via drug availability assessment and literature mining. For the 4489 candidate proteins identified in our previous proteomics study, OTTM recommended 40 FDA-approved or clinical trial drugs. Of these, 15 are available commercially and were tested on hepatocellular carcinoma Hep-G2 cells. Two drugs, tafenoquine succinate (an FDA-approved antimalarial drug targeting CYC1) and branaplam (a Phase 3 clinical drug targeting SMN1 for the treatment of spinal muscular atrophy), showed potent inhibitory activity against Hep-G2 cell viability, suggesting that CYC1 and SMN1 may be potential therapeutic target proteins for hepatocellular carcinoma. In summary, OTTM is an efficient classification tool that can accelerate the discovery of effective drugs and targets using thousands of candidate proteins identified from omics data. The online and local versions of OTTM are available at http://otter-simm.com/ottm.html.


Subject(s)
Carcinoma, Hepatocellular , Liver Neoplasms , Humans , Translational Science, Biomedical , Proteomics , Drug Discovery
18.
J Pathol ; 262(3): 310-319, 2024 03.
Article in English | MEDLINE | ID: mdl-38098169

ABSTRACT

Deep learning applied to whole-slide histopathology images (WSIs) has the potential to enhance precision oncology and alleviate the workload of experts. However, developing these models necessitates large amounts of data with ground truth labels, which can be both time-consuming and expensive to obtain. Pathology reports are typically unstructured or poorly structured texts, and efforts to implement structured reporting templates have been unsuccessful, as these efforts lead to perceived extra workload. In this study, we hypothesised that large language models (LLMs), such as the generative pre-trained transformer 4 (GPT-4), can extract structured data from unstructured plain language reports using a zero-shot approach without requiring any re-training. We tested this hypothesis by utilising GPT-4 to extract information from histopathological reports, focusing on two extensive sets of pathology reports for colorectal cancer and glioblastoma. We found a high concordance between LLM-generated structured data and human-generated structured data. Consequently, LLMs could potentially be employed routinely to extract ground truth data for machine learning from unstructured pathology reports in the future. © 2023 The Authors. The Journal of Pathology published by John Wiley & Sons Ltd on behalf of The Pathological Society of Great Britain and Ireland.


Subject(s)
Glioblastoma , Precision Medicine , Humans , Machine Learning , United Kingdom
19.
Methods ; 228: 48-54, 2024 Aug.
Article in English | MEDLINE | ID: mdl-38789016

ABSTRACT

With the rapid advancements in molecular biology and genomics, a multitude of connections between RNA and diseases has been unveiled, making the efficient and accurate extraction of RNA-disease (RD) relationships from extensive biomedical literature crucial for advancing research in this field. This study introduces RDscan, a novel text mining method developed based on the pre-training and fine-tuning strategy, aimed at automatically extracting RD-related information from a vast corpus of literature using pre-trained biomedical large language models (LLMs). Initially, we constructed a dedicated RD corpus by manual curation of the literature, comprising 2,082 positive and 2,000 negative sentences, alongside an independent test dataset (comprising 500 positive and 500 negative sentences) for training and evaluating RDscan. Subsequently, by fine-tuning the Bioformer and BioBERT pre-trained models, RDscan demonstrated exceptional performance in text classification and named entity recognition (NER) tasks. In 5-fold cross-validation, RDscan significantly outperformed traditional machine learning methods (Support Vector Machine, Logistic Regression and Random Forest). In addition, we have developed an accessible webserver that assists users in extracting RD relationships from text. In summary, RDscan represents the first text mining tool specifically designed for RD relationship extraction, and is poised to emerge as an invaluable tool for researchers dedicated to exploring the intricate interactions between RNA and diseases. The RDscan webserver is freely available at https://cellknowledge.com.cn/RDscan/.


Subject(s)
Data Mining , RNA , Data Mining/methods , RNA/genetics , Humans , Machine Learning , Disease/genetics , Support Vector Machine , Software
20.
Proc Natl Acad Sci U S A ; 119(38): e2118273119, 2022 09 20.
Article in English | MEDLINE | ID: mdl-36095187

ABSTRACT

Growing demand for minerals continues to drive deforestation worldwide. Tropical forests are particularly vulnerable to the environmental impacts of mining and mineral processing. Many local- to regional-scale studies document extensive, long-lasting impacts of mining on biodiversity and ecosystem services. However, the full scope of deforestation induced by industrial mining across the tropics is yet unknown. Here, we present a biome-wide assessment to show where industrial mine expansion has caused the most deforestation from 2000 to 2019. We find that 3,264 km² of forest was directly lost due to industrial mining, with 80% occurring in only four countries: Indonesia, Brazil, Ghana, and Suriname. Additionally, controlling for other nonmining determinants of deforestation, we find that mining caused indirect forest loss in two-thirds of the investigated countries. Our results illustrate significant yet unevenly distributed and often unmanaged impacts on these biodiverse ecosystems. Impact assessments and mitigation plans of industrial mining activities must address direct and indirect impacts to support conservation of the world's tropical forests.


Subject(s)
Biodiversity , Conservation of Natural Resources , Forests , Mining , Conservation of Natural Resources/methods