Results 1 - 20 of 191
1.
Methods Mol Biol ; 2834: 115-130, 2025.
Article in English | MEDLINE | ID: mdl-39312162

ABSTRACT

Recent advances in machine learning and the availability of large chemical datasets have made the development of tools and protocols for computational chemistry a topic of high interest. In this chapter, a standard procedure to develop Quantitative Structure-Activity Relationship (QSAR) models is presented and implemented in two freely available, easy-to-use workflows. The first workflow helps the user retrieve chemical data (SMILES) from the web, check their correctness, and curate them to produce consistent, ready-to-use datasets for cheminformatics. The second workflow implements six machine learning methods to develop classification QSAR models. The models can additionally be used to predict external chemicals. Calculation and selection of chemical descriptors, tuning of model hyperparameters, and methods to handle data imbalance are also incorporated in the workflow. Both workflows are implemented in KNIME and represent a useful tool for computational scientists, as well as an intuitive and straightforward introduction to QSAR.


Subject(s)
Data Curation , Machine Learning , Quantitative Structure-Activity Relationship , Workflow , Data Curation/methods , Software , Cheminformatics/methods , Computational Biology/methods
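The data-curation step described in the first workflow can be sketched in a few lines of Python. This is an illustrative simplification with a hypothetical helper name, not the KNIME workflow itself; a production pipeline would canonicalise SMILES with a cheminformatics toolkit such as RDKit rather than use the crude string checks below.

```python
def curate_smiles(records):
    """Deduplicate and sanity-check (SMILES, label) pairs for a QSAR dataset."""
    seen = set()
    curated = []
    for smiles, label in records:
        smiles = smiles.strip()
        # drop empty entries and multi-fragment SMILES (crude mixture/salt filter)
        if not smiles or "." in smiles:
            continue
        # drop duplicates so no compound is counted twice
        if smiles in seen:
            continue
        seen.add(smiles)
        curated.append((smiles, label))
    return curated

raw = [("CCO", 1), ("CCO ", 1), ("", 0), ("CCN.Cl", 0), ("c1ccccc1", 0)]
print(curate_smiles(raw))  # [('CCO', 1), ('c1ccccc1', 0)]
```

The output of such a step is a consistent dataset in which each compound appears once, ready for descriptor calculation and model training.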
2.
Wellcome Open Res ; 9: 523, 2024.
Article in English | MEDLINE | ID: mdl-39360219

ABSTRACT

Background: Data reusability is the driving force of the research data life cycle. However, implementing strategies to generate reusable data from the data creation to the sharing stages is still a significant challenge. Even when datasets supporting a study are publicly shared, the outputs are often incomplete and/or not reusable. The FAIR (Findable, Accessible, Interoperable, Reusable) principles were published as general guidance to promote data reusability in research, but the practical implementation of FAIR principles in research groups is still falling behind. In biology, the lack of standard practices for a large diversity of data types, data storage and preservation issues, and the lack of familiarity among researchers are some of the main factors impeding FAIR data. Past literature describes biological curation from the perspective of data resources that aggregate data, often from publications. Methods: Our team works alongside data-generating, experimental researchers, so our perspective aligns with publication authors rather than aggregators. We detail the processes for organizing datasets for publication, showcasing practical examples from data curation to data sharing. We also recommend strategies, tools and web resources to maximize data reusability while maintaining research productivity. Conclusion: We propose a simple approach to address research data management challenges for experimentalists, designed to promote FAIR data sharing. This strategy not only simplifies data management, but also enhances data visibility, recognition and impact, ultimately benefiting the entire scientific community.


Researchers should openly share data associated with their publications unless there is a valid reason not to. Additionally, datasets have to be described with enough detail to ensure that they are reproducible and reusable by others. Since most research institutions offer limited professional support in this area, the responsibility for data sharing largely falls to researchers themselves. However, many research groups still struggle to follow data reusability principles in practice. In this work, we describe our data curation (data organization and management) efforts working directly with the researchers who create the data. We show the steps we took to organize, standardize, and share several datasets in biological sciences, pointing out the main challenges we faced. Finally, we suggest simple and practical data management actions, as well as tools that experimentalists can integrate into their daily work, to make sharing data easier and more effective.

3.
Front Big Data ; 7: 1446071, 2024.
Article in English | MEDLINE | ID: mdl-39314986

ABSTRACT

Data volume has been one of the fastest-growing assets of most real-world applications. This increases the rate of human errors such as duplication of records, misspellings, and erroneous transpositions, among other data quality issues. Entity Resolution is an ETL process that aims to resolve data inconsistencies by ensuring that entities refer to the same real-world objects. One of the main challenges of most traditional Entity Resolution systems is ensuring their scalability to meet rising data needs. This research aims to refactor a working proof-of-concept entity resolution system, the Data Washing Machine, to be highly scalable using the Apache Spark distributed data processing framework. We solve the single-threaded design problem of the legacy Data Washing Machine by using PySpark's Resilient Distributed Dataset and improve the Data Washing Machine design to use intrinsic metadata information from references. We prove that our system achieves the same results as the legacy Data Washing Machine using 18 synthetically generated datasets. We also test the scalability of our system using a variety of real-world benchmark ER datasets ranging from a few thousand to millions of records. Our experimental results show that our proposed system performs better than a MapReduce-based Data Washing Machine. We also compared our system with Famer and concluded that our system can find more clusters when given optimal starting parameters for clustering.
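The core idea of entity resolution, grouping references that denote the same real-world object, can be sketched in plain Python. This toy normalise-and-key example is illustrative only; it does not reproduce the Data Washing Machine's unsupervised scoring or its Spark-based parallelism, and the function name is hypothetical.

```python
import re
from collections import defaultdict

def resolve(records):
    """Toy entity resolution: cluster records whose normalised name keys match."""
    clusters = defaultdict(list)
    for rec_id, name in records:
        # crude normalisation: lowercase, drop punctuation, sort tokens
        tokens = sorted(re.sub(r"[^a-z0-9 ]", "", name.lower()).split())
        clusters[" ".join(tokens)].append(rec_id)
    return list(clusters.values())

recs = [(1, "John A. Smith"), (2, "smith, john a"), (3, "Jane Doe")]
print(resolve(recs))  # [[1, 2], [3]]
```

In a distributed setting the same pattern maps naturally onto key-based shuffles (e.g., grouping by the normalised key across partitions), which is what makes the approach amenable to an RDD-based implementation.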

5.
Stud Health Technol Inform ; 317: 160-170, 2024 Aug 30.
Article in English | MEDLINE | ID: mdl-39234719

ABSTRACT

INTRODUCTION: 16 million German-language free-text laboratory test results form the basis of the daily diagnostic routine of 17 laboratories within the University Hospital Erlangen. As part of the Medical Informatics Initiative, the local data integration centre is responsible for making routine care data accessible for medical research. Following the core data set, international interoperability standards such as FHIR and the English-language medical terminology SNOMED CT are used to create harmonised data. To represent each non-numeric laboratory test result within the base module profile ObservationLab, the need for a map and supporting tooling arose. STATE OF THE ART: Due to the requirement of an n:n map and a data-safety-compliant local instance, publicly available tools (e.g., SNAP2SNOMED) were insufficient. CONCEPT AND IMPLEMENTATION: We therefore developed (1) an incremental mapping-validation process with different iteration cycles and (2) a customised mapping tool in Microsoft Access. Time, labour, and cost efficiency played a decisive role. The first iterations were used to define requirements (e.g., multiple user access). LESSONS LEARNED: The successful process and tool implementation and the described lessons learned (e.g., a cheat sheet) will assist other German hospitals in creating local maps for inter-consortia data exchange and research. In the future, qualitative and quantitative analysis results will be published.


Subject(s)
Systematized Nomenclature of Medicine , Germany , Humans , Electronic Health Records , Systems Integration
6.
Chemosphere ; 364: 143078, 2024 Sep.
Article in English | MEDLINE | ID: mdl-39181462

ABSTRACT

The US EPA ECOTOX database provides key ecotoxicological data that are crucial in environmental risk assessment. It can be used for computational predictions of toxicity or indications of hazard in a wide range of situations. There is no standardised or formalised method for extracting and subsetting data from the database for these purposes. Consequently, results in such meta-analyses are difficult to reproduce. The present study introduces the software package ECOTOXr, which provides the means to formalise data retrieval from the ECOTOX database in the R scripting language. Three cases are presented to evaluate the performance of the package in relation to earlier data extractions and searches on the website. These cases demonstrate that the package can reproduce data sets relatively well. Furthermore, they illustrate how future studies can further improve traceability and reproducibility by applying the package and adhering to some simple guidelines. This contributes to the FAIR principles, credibility and acceptance of research that uses data from the ECOTOX database.


Subject(s)
Databases, Factual , Software , United States Environmental Protection Agency , United States , Ecotoxicology/methods , Risk Assessment/methods , Reproducibility of Results
7.
J Med Libr Assoc ; 112(2): 81-87, 2024 Apr 01.
Article in English | MEDLINE | ID: mdl-39119170

ABSTRACT

Background: NYU Langone Health offers a collaborative research block for PGY3 Primary Care residents that employs a secondary data analysis methodology. As discussions of data reuse and secondary data analysis have grown in the data library literature, we sought to understand the attitudes of internal medicine residents at a large urban academic medical center toward secondary data analysis. This case report describes a novel survey on resident attitudes around data sharing. Methods: We surveyed internal medicine residents in three tracks, Primary Care (PC), Categorical, and Clinician-Investigator (CI), as part of a larger pilot study on implementation of a research block. All three tracks are in our institution's internal medicine program. In discussions with residency directors and the chief resident, the term "secondary data analysis" was chosen over "data reuse" as more familiar to clinicians, but examples were given to define the concept. Results: Of the 162 residents surveyed, 67 responded, a 41.36% response rate. Strong majorities of residents held positive views of secondary data analysis. Moreover, in our sample, residents with curricular exposure to secondary data analysis rated it as less time-consuming and less difficult to conduct than did residents without such exposure. Discussion: The survey indicates that residents consider secondary data analysis worthwhile, which highlights opportunities for data librarians. As current residents matriculate into professional roles as clinicians, educators, and researchers, libraries have an opportunity to bolster support for data curation and education.


Subject(s)
Attitude of Health Personnel , Internal Medicine , Internship and Residency , Internship and Residency/statistics & numerical data , Humans , Internal Medicine/education , Surveys and Questionnaires , Male , Female , Adult , Information Dissemination/methods
8.
Front Pharmacol ; 15: 1444733, 2024.
Article in English | MEDLINE | ID: mdl-39170704

ABSTRACT

Background and Objective: Chronic atrophic gastritis (CAG) is a complex, multifactorial chronic disease that is frequently encountered in the clinic, and its worldwide prevalence is high. Interestingly, clinical CAG patients often present with a variety of symptom phenotypes, which makes treatment more difficult for clinicians. There is therefore an urgent need to improve our understanding of the complexity of the clinical CAG population, obtain more accurate disease subtypes, and explore the relationship between clinical symptoms and medication. To this end, based on an integrated platform of complex networks and clinical research, we classified the collected CAG patients according to their clinical characteristics and conducted correlation analysis on the classification results to identify more accurate disease subtypes and aid personalized clinical treatment. Method: Traditional Chinese medicine (TCM) offers an empirical understanding of the clinical subtypes of complicated disorders, since TCM therapy is tailored to the patient's symptom profile. We gathered 6,253 TCM clinical electronic medical records (EMRs) from CAG patients and manually annotated, extracted, and preprocessed the data. A shared-symptom patient similarity network (PSN) was created. CAG patient subgroups were established, and their clinical features were determined through enrichment analysis employing community detection methods. Different clinical features of the relevant subgroups were correlated on the basis of treatment effectiveness to identify symptom-botanical drug correspondences. Moreover, network pharmacology was employed to identify possible biological relationships between the screened symptoms and medications, and to characterise clinical and molecular aspects of the key subtypes using Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis.
Results: 5,132 patients were included in the study: 2,699 males (52.60%) and 2,433 females (47.41%). The population was divided into 176 modules. We selected the first three modules (M29, M3, and M0) to illustrate the characteristic phenotypes and genotypes of CAG disease subtypes. The M29 subgroup was characterized by gastric fullness disease and internal syndrome of turbidity and poison. The M3 subgroup was characterized by epigastric pain and disharmony between the liver and stomach. The M0 subgroup was characterized by epigastric pain and dampness-heat syndrome. In the symptom analysis, the symptoms that improved most in all three subgroups were stomach pain, bloating, insomnia, poor appetite, and heartburn. However, the three subgroups differed. The M29 subgroup was more likely to have stomach distention, anorexia, and palpitations; Citrus medica, Solanum nigrum, Jiangcan, Shan ci mushrooms, and Dillon were the most frequently used botanical drugs. The M3 subgroup had a higher incidence of yellow urine, a bitter taste in the mouth, and stomachaches; Smilax glabra, Cyperus rotundus, Angelica sinensis, Conioselinum anthriscoides, and Paeonia lactiflora were the botanical drugs used. Vomiting, nausea, stomach pain, and loss of appetite were common in the M0 subgroup; the primary medications were Scutellaria baicalensis, Smilax glabra, Picrorhiza kurroa, Lilium lancifolium, and Artemisia scoparia. Through GO and KEGG pathway analysis, we found that in the M29 subgroup, Citrus medica, Solanum nigrum, Jiangcan, Shan ci mushrooms, and Dillon may exert their therapeutic effects on the symptoms of gastric distension, anorexia, and palpitations by modulating apoptosis and NF-κB signaling pathways. In the M3 subgroup, Smilax glabra, Cyperus rotundus, Angelica sinensis, Conioselinum anthriscoides, and Paeonia lactiflora may act through the NF-κB and JAK-STAT signaling pathways to treat stomach pain, a bitter taste in the mouth, and yellow urine. In the M0 subgroup, Scutellaria baicalensis, Smilax glabra, Picrorhiza kurroa, Lilium lancifolium, and Artemisia scoparia may exert their therapeutic effects on poor appetite, stomach pain, vomiting, and nausea through the PI3K-Akt signaling pathway. Conclusion: Based on PSN identification and community detection analysis, this division of the CAG population can provide useful recommendations for clinical CAG treatment. The method is useful for CAG disease classification and genotyping investigations and can be applied to other complicated chronic diseases.

9.
BMC Med ; 22(1): 288, 2024 Jul 10.
Article in English | MEDLINE | ID: mdl-38987774

ABSTRACT

BACKGROUND: Ethnicity is known to be an important correlate of health outcomes, particularly during the COVID-19 pandemic, when some ethnic groups were shown to be at higher risk of infection and adverse outcomes. The recording of patients' ethnic groups in primary care can support research and efforts to achieve equity in service provision and outcomes; however, the coding of ethnicity is known to present complex challenges. We therefore set out to describe ethnicity coding in detail with a view to supporting the use of these data in a wide range of settings, as part of wider efforts to robustly describe and define methods of using administrative data. METHODS: We describe the completeness and consistency of primary care ethnicity recording in the OpenSAFELY-TPP database, containing linked primary care and hospital records for > 25 million patients in England. We also compared the ethnic breakdown in OpenSAFELY-TPP with that of the 2021 UK census. RESULTS: 78.2% of patients registered in OpenSAFELY-TPP on 1 January 2022 had their ethnicity recorded in primary care records, rising to 92.5% when supplemented with hospital data. The completeness of ethnicity recording was higher for women than for men. The rate of primary care ethnicity recording ranged from 77% in the South East of England to 82.2% in the West Midlands. Ethnicity recording rates were higher in patients with chronic or other serious health conditions. For each of the five broad ethnicity groups, primary care recorded ethnicity was within 2.9 percentage points of the population rate as recorded in the 2021 Census for England as a whole. For patients with multiple ethnicity records, 98.7% of the latest recorded ethnicities matched the most frequently coded ethnicity. Patients whose latest recorded ethnicity was categorised as Other were most likely to have a discordant ethnicity recording (32.2%).
CONCLUSIONS: Primary care ethnicity data in OpenSAFELY is present for over three quarters of all patients, and combined with data from other sources can achieve a high level of completeness. The overall distribution of ethnicities across all English OpenSAFELY-TPP practices was similar to the 2021 Census, with some regional variation. This report identifies the best available codelist for use in OpenSAFELY and similar electronic health record data.


Subject(s)
Ethnicity , Primary Health Care , State Medicine , Adult , Aged , Female , Humans , Male , Middle Aged , Cohort Studies , England , Ethnicity/statistics & numerical data , Primary Health Care/statistics & numerical data , Infant, Newborn , Infant , Child, Preschool , Child , Adolescent , Young Adult , Aged, 80 and over
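The completeness rates reported above are simple proportions over a cohort; the sketch below shows how a primary-care-only rate and a hospital-supplemented rate might be computed. The record layout and the figures are invented toy data, not OpenSAFELY-TPP data or its codebase.

```python
def completeness(patients):
    """Share of patients with ethnicity recorded: primary care alone,
    then supplemented with hospital records."""
    n = len(patients)
    primary = sum(1 for p in patients if p.get("primary_eth"))
    either = sum(1 for p in patients if p.get("primary_eth") or p.get("hospital_eth"))
    return primary / n, either / n

cohort = [
    {"primary_eth": "White"},                        # primary care only
    {"hospital_eth": "Asian"},                       # hospital record only
    {"primary_eth": "Black", "hospital_eth": "Black"},
    {},                                              # no ethnicity recorded
]
print(completeness(cohort))  # (0.5, 0.75)
```

The supplemented rate can only rise relative to the primary-care rate, mirroring the 78.2% to 92.5% increase described in the abstract.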
10.
J Cheminform ; 16(1): 82, 2024 Jul 19.
Article in English | MEDLINE | ID: mdl-39030583

ABSTRACT

PURPOSE: Reaction databases are a key resource for a wide variety of applications in computational chemistry and biochemistry, including Computer-aided Synthesis Planning (CASP) and the large-scale analysis of metabolic networks. The full potential of these resources can only be realized if datasets are accurate and complete. Missing co-reactants and co-products, i.e., unbalanced reactions, are however the rule rather than the exception. The curation and correction of such incomplete entries is thus an urgent need. METHODS: The SynRBL framework addresses this issue with a dual strategy: a rule-based method for non-carbon compounds, using atomic symbols and counts for prediction, alongside a Maximum Common Subgraph (MCS)-based technique for carbon compounds, aimed at aligning reactants and products to infer missing entities. RESULTS: The rule-based method exceeded 99% accuracy, while MCS-based accuracy varied from 81.19% to 99.33%, depending on reaction properties. Furthermore, an applicability domain and a machine learning scoring function were devised to quantify prediction confidence. The overall efficacy of the framework is delineated by its success rate and accuracy, which spanned 89.83-99.75% and 90.85-99.05%, respectively. CONCLUSION: The SynRBL framework offers a novel solution for rebalancing chemical reactions, significantly enhancing reaction completeness. With rigorous validation, it achieved groundbreaking accuracy in reaction rebalancing. This sets the stage for future improvements, in particular to atom-atom mapping techniques and to downstream tasks such as automated synthesis planning. SCIENTIFIC CONTRIBUTION: SynRBL features a novel computational approach to correcting unbalanced entries in chemical reaction databases.
By combining heuristic rules for inferring non-carbon compounds with common subgraph searches to address carbon imbalance, SynRBL successfully resolves most instances of this problem, which affects the majority of data in most large-scale resources. Compared to alternative solutions, SynRBL achieves a dramatic increase in both success rate and accuracy, and provides the first freely available open-source solution for this problem.
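The rule-based strategy works from atomic symbols and counts: if the two sides of a reaction do not balance, the atom deficit points to the missing co-reactant or co-product. The sketch below (hypothetical helper names, not SynRBL's API) computes such a deficit from molecular formulas; a rule-based balancer would then match the deficit to a known small molecule such as water.

```python
import re
from collections import Counter

def formula_counts(formula):
    """Parse a simple molecular formula like 'H2O' into element counts."""
    counts = Counter()
    for elem, num in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if elem:
            counts[elem] += int(num) if num else 1
    return counts

def missing_atoms(reactants, products):
    """Atom deficit on the product side of an unbalanced reaction."""
    left = sum((formula_counts(f) for f in reactants), Counter())
    right = sum((formula_counts(f) for f in products), Counter())
    return dict(left - right)  # Counter subtraction keeps positive counts only

# Esterification recorded without the water co-product:
# CH3COOH (C2H4O2) + CH3OH (CH4O) -> CH3COOCH3 (C3H6O2) + ???
print(missing_atoms(["C2H4O2", "CH4O"], ["C3H6O2"]))  # {'H': 2, 'O': 1}
```

Here the deficit H2O1 corresponds to water, so the entry can be rebalanced by adding H2O to the product side.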

11.
Front Med (Lausanne) ; 11: 1455319, 2024.
Article in English | MEDLINE | ID: mdl-39045419

ABSTRACT

[This corrects the article DOI: 10.3389/fmed.2024.1365501.].

12.
Health Inf Manag ; : 18333583241256049, 2024 Jul 24.
Article in English | MEDLINE | ID: mdl-39045683

ABSTRACT

In 2022, the Australian Data Availability and Transparency Act (DATA) commenced, enabling accredited "data users" to access data from "accredited data service providers." However, the DATA Scheme lacks guidance on the "trustworthiness" of the data to be utilised for reuse purposes. Objectives: To determine: (i) Do researchers using government health datasets trust the data? (ii) What factors influence their perceptions of data trustworthiness? and (iii) What are the implications for government and data custodians? Method: Authors of published studies (2008-2020) that utilised Victorian government health datasets were surveyed via a case study approach. Twenty-eight trust constructs (identified via literature review) were grouped into data factors, management properties and provider factors. Results: Fifty experienced health researchers responded. Most (88%) believed that Victorian government health data were trustworthy. When grouped, data factors and management properties were more important than data provider factors in building trust. The most important individual trust constructs were "compliant with ethical regulation" (100%) and "monitoring privacy and confidentiality" (98%). The constructs of least importance were knowledge of "participant consent" (56%) and "major focus of the data provider was research" (50%). Conclusion: Overall, the researchers trusted government health data, but data factors and data management properties were more important than data provider factors in building trust. Implications: Government should ensure that the DATA Scheme incorporates mechanisms to validate that data utilised by accredited data users and data providers are of sufficient intrinsic and extrinsic quality to meet the requirements of "trustworthiness," and that evidentiary documentation is provided to support these "accredited data."

13.
Genomics Inform ; 22(1): 7, 2024 Jun 17.
Article in English | MEDLINE | ID: mdl-38907285

ABSTRACT

This study evaluated large language models (LLMs), particularly GPT-4 with vision (GPT-4V) and GPT-4 Turbo, for annotating biomedical figures, focusing on cellular senescence. We assessed the ability of LLMs to categorize and annotate complex biomedical images in order to enhance annotation accuracy and efficiency. Our experiments employed prompt engineering with figures from review articles, achieving more than 70% accuracy for label extraction and approximately 80% accuracy for node-type classification. Challenges were noted in correctly annotating directionality and inhibitory relationships, and these were exacerbated as the number of nodes increased. Using figure legends enabled more precise identification of sources and targets than using captions, but legends sometimes lacked pathway details. This study underscores the potential of LLMs in decoding biological mechanisms from text and outlines avenues for improving the representation of inhibitory relationships in biomedical informatics.

14.
J Cheminform ; 16(1): 74, 2024 Jun 27.
Article in English | MEDLINE | ID: mdl-38937840

ABSTRACT

This paper presents AutoTemplate, an innovative data preprocessing protocol, addressing the crucial need for high-quality chemical reaction datasets in the realm of machine learning applications in organic chemistry. Recent advances in artificial intelligence have expanded the application of machine learning in chemistry, particularly in yield prediction, retrosynthesis, and reaction condition prediction. However, the effectiveness of these models hinges on the integrity of chemical reaction datasets, which are often plagued by inconsistencies like missing reactants, incorrect atom mappings, and outright erroneous reactions. AutoTemplate introduces a two-stage approach to refine these datasets. The first stage involves extracting meaningful reaction transformation rules and formulating generic reaction templates using a simplified SMARTS representation. This simplification broadens the applicability of templates across various chemical reactions. The second stage is template-guided reaction curation, where these templates are systematically applied to validate and correct the reaction data. This process effectively amends missing reactant information, rectifies atom-mapping errors, and eliminates incorrect data entries. A standout feature of AutoTemplate is its capability to concurrently identify and correct false chemical reactions. It operates on the premise that most reactions in datasets are accurate, using these as templates to guide the correction of flawed entries. The protocol demonstrates its efficacy across a range of chemical reactions, significantly enhancing dataset quality. This advancement provides a more robust foundation for developing reliable machine learning models in chemistry, thereby improving the accuracy of forward and retrosynthetic predictions. 
AutoTemplate marks a significant progression in the preprocessing of chemical reaction datasets, bridging a vital gap and facilitating more precise and efficient machine learning applications in organic synthesis. SCIENTIFIC CONTRIBUTION: The proposed automated preprocessing tool for chemical reaction data aims to identify errors within chemical databases. Specifically, if the errors involve atom mapping or the absence of reactant types, corrections can be systematically applied using reaction templates, ultimately elevating the overall quality of the database.

15.
Methods Cell Biol ; 186: 107-130, 2024.
Article in English | MEDLINE | ID: mdl-38705596

ABSTRACT

Mass cytometry permits the high dimensional analysis of cellular systems at single-cell resolution with high throughput in various areas of biomedical research. Here, we provide a state-of-the-art protocol for the analysis of human peripheral blood mononuclear cells (PBMC) by mass cytometry. We focus on the implementation of measures promoting the harmonization of large and complex studies to aid robustness and reproducibility of immune phenotyping data.


Subject(s)
Flow Cytometry , Leukocytes, Mononuclear , Humans , Leukocytes, Mononuclear/cytology , Leukocytes, Mononuclear/immunology , Flow Cytometry/methods , Flow Cytometry/standards , Immunophenotyping/methods , Single-Cell Analysis/methods
16.
J Bioinform Comput Biol ; 22(2): 2450005, 2024 Apr.
Article in English | MEDLINE | ID: mdl-38779780

ABSTRACT

Enzymes catalyze diverse biochemical reactions and are building blocks of cellular and metabolic pathways. Data and metadata of enzymes are distributed across databases and are archived in various formats. The enzyme databases provide utilities for efficient searches and for downloading enzyme records in batch mode, but do not support organism-specific extraction of subsets of data. Users are required to write scripts to parse entries for customized data extraction prior to downstream analysis. Integrated Customized Extraction of Enzyme Data (iCEED) has been developed to provide organism-specific customized data extraction utilities for seven commonly used enzyme databases and brings these resources under an integrated portal. iCEED provides dropdown menus and search boxes using a typeahead utility for submission of queries, as well as an enzyme class-based browsing utility. A utility to facilitate mapping and visualization of functionally important features on the three-dimensional (3D) structures of enzymes is integrated. The customized data extraction utilities provided in iCEED are expected to be useful for biochemists, biotechnologists, computational biologists, and life science researchers to build curated datasets of their choice through an easy-to-navigate web-based interface. The integrated feature visualization system is useful for a fine-grained understanding of enzyme structure-function relationships. Desired subsets of data, extracted and curated using iCEED, can subsequently be used for downstream processing, analyses, and knowledge discovery. iCEED can also be used for training and teaching purposes.


Subject(s)
Databases, Protein , Enzymes , Software , Enzymes/chemistry , Enzymes/metabolism , Computational Biology/methods , User-Computer Interface , Internet
17.
Front Med (Lausanne) ; 11: 1365501, 2024.
Article in English | MEDLINE | ID: mdl-38813389

ABSTRACT

The emerging European Health Data Space (EHDS) Regulation opens new prospects for large-scale sharing and re-use of health data. Yet, the proposed regulation suffers from two important limitations: it is designed to benefit the whole population with limited consideration for individuals, and the generation of secondary datasets from heterogeneous, unlinked patient data will remain burdensome. AIDAVA, a Horizon Europe project that started in September 2022, proposes to address both shortcomings by providing patients with an AI-based virtual assistant that maximises automation in the integration and transformation of their health data into an interoperable, longitudinal health record. This personal record can then be used to inform patient-related decisions at the point of care, whether this is the usual point of care or a possible cross-border point of care. The personal record can also be used to generate population datasets for research and policymaking. The proposed solution will enable a much-needed paradigm shift in health data management, implementing a 'curate once at patient level, use many times' approach, primarily for the benefit of patients and their care providers, but also for more efficient generation of high-quality secondary datasets. After 15 months, the project shows promising preliminary results in achieving automation in the integration and transformation of heterogeneous data of each individual patient, once the content of the data sources managed by the data holders has been formally described. Additionally, the conceptualization phase of the project identified a set of recommendations for the development of a patient-centric EHDS, significantly facilitating the generation of data for secondary use.

18.
Drug Discov Today ; 29(7): 104025, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38762089

ABSTRACT

In the past 40 years, therapeutic antibody discovery and development have advanced considerably, with machine learning (ML) offering a promising way to speed up the process by reducing costs and the number of experiments required. Recent progress in ML-guided antibody design and development (D&D) has been hindered by the diversity of data sets and evaluation methods, which makes it difficult to conduct comparisons and assess utility. Establishing standards and guidelines will be crucial for the wider adoption of ML and the advancement of the field. This perspective critically reviews current practices, highlights common pitfalls and proposes method development and evaluation guidelines for various ML-based techniques in therapeutic antibody D&D. Addressing challenges across the ML process, best practices are recommended for each stage to enhance reproducibility and progress.


Subject(s)
Drug Development , Drug Discovery , Machine Learning , Humans , Drug Discovery/methods , Drug Development/methods , Antibodies , Animals , Reproducibility of Results
19.
Methods Mol Biol ; 2744: 7-32, 2024.
Article in English | MEDLINE | ID: mdl-38683309

ABSTRACT

This chapter on the history of the DNA barcoding enterprise attempts to set the stage for the more scholarly contributions in this volume by addressing the following questions. How did the DNA barcoding enterprise begin? What were its goals, how did it develop, and to what degree are its goals being realized? We have taken a keen interest in the barcoding movement and its relationship to taxonomy, collections, and biodiversity informatics more broadly considered. This chapter integrates our two different perspectives on barcoding. DES was the Executive Secretary of the Consortium for the Barcode of Life from 2004 to 2017, with the mission to support the success of DNA barcoding without being directly involved in generating barcode data. RDMP viewed barcoding as an important entry into the landscape of biodiversity data, with many potential linkages to other components of that landscape. We also saw it as a critical step toward the era of international genomic research that was sure to follow. Like the Mercury Program that paved the way for lunar landings by the Apollo Program, we saw DNA barcoding as the proving grounds for the interdisciplinary and international cooperation that would be needed for success of whole-genome research.


Subject(s)
Biodiversity , DNA Barcoding, Taxonomic , DNA Barcoding, Taxonomic/methods , Entrepreneurship , Humans , Inventions
20.
MethodsX ; 12: 102676, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38617899

ABSTRACT

Identifying biogeographic regions through cluster analysis of species distribution data is a common method for partitioning ecosystems. Selecting the appropriate cluster analysis method requires a comparison of multiple algorithms. In this study, we demonstrate a data-driven process to select a method for bioregionalization based on community data and test its robustness to data variability following these steps:
• We aggregated and curated zooplankton community observations from expeditions in the Northeast Pacific.
• We determined the best bioregionalization approach by comparing nine cluster analysis methods using ten goodness-of-clustering indices.
• We evaluated the robustness of the bioregionalization to different sources of sampling and taxonomic variability by comparing the bioregionalization of the overall dataset with bioregionalizations of subsets of the data.
The K-means clustering of the log-chord transformed abundance was selected as the optimal method for bioregionalization of the zooplankton dataset. This clustering resulted in the emergence of four bioregions along the cross-shelf gradient: the Offshore, Deep Shelf, Nearshore, and Deep Fjord bioregions. The robustness analyses demonstrated that the bioregionalization was consistent despite variability in the spatial and temporal frequency of sampling, sampling methodology, and taxonomic coverage.
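The log-chord transformation named above can be sketched in a few lines of Python. This is an illustrative version, not the authors' code: it applies log(x+1) to dampen dominant taxa, then scales each sample (row) to unit Euclidean length so that Euclidean distances between samples become chord distances, a common preparation before K-means on community data.

```python
import math

def log_chord(matrix):
    """Log-chord transform a samples-by-taxa abundance matrix:
    log(x+1), then normalise each row to unit Euclidean length."""
    logged = [[math.log1p(x) for x in row] for row in matrix]
    out = []
    for row in logged:
        norm = math.sqrt(sum(v * v for v in row)) or 1.0  # guard empty samples
        out.append([v / norm for v in row])
    return out

abundance = [[0, 10, 100], [5, 0, 0]]  # two samples, three taxa
rows = log_chord(abundance)
# each transformed sample has unit length, ready for K-means clustering
print([round(sum(v * v for v in r), 6) for r in rows])  # [1.0, 1.0]
```

In practice the same transformation is available in standard ecology and ML toolkits (e.g., vegan in R, or a log step followed by scikit-learn's Normalizer), which would be preferred for real datasets.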
