Results 1 - 20 of 2,022
1.
Brief Bioinform ; 25(4)2024 May 23.
Article in English | MEDLINE | ID: mdl-38935069

ABSTRACT

MOTIVATION: In the past decade, single-cell RNA sequencing (scRNA-seq) has emerged as a pivotal method for transcriptomic profiling in biomedical research. Precise cell-type identification is crucial for downstream analysis of single-cell data, and the integration and refinement of annotated data are essential for building comprehensive databases. However, prevailing annotation techniques often overlook the hierarchical organization of cell types, resulting in inconsistent annotations. Meanwhile, most existing integration approaches cannot integrate datasets with different annotation depths, and none can upgrade the labels of older data with lower annotation resolution using more finely annotated datasets or novel biological findings. RESULTS: Here, we introduce scPLAN, a hierarchical computational framework for scRNA-seq data analysis. scPLAN annotates unlabeled scRNA-seq data using a reference dataset structured along a hierarchical cell-type tree, identifying potential novel cell types in a systematic, layer-by-layer manner. Additionally, scPLAN integrates annotated scRNA-seq datasets with varying annotation depths, ensuring consistent refinement of cell-type labels across lower-resolution datasets. Extensive annotation and novel-cell-detection experiments demonstrate the efficacy of scPLAN, and two case studies showcase how it integrates datasets with diverse cell-type label resolutions and refines their labels. AVAILABILITY: https://github.com/michaelGuo1204/scPLAN.
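For illustration, a minimal sketch of the layer-by-layer annotation idea described above. This is not the scPLAN implementation: the k-NN classifier, confidence threshold, and data layout are hypothetical stand-ins.

```python
# Minimal sketch of layer-by-layer annotation along a cell-type tree.
# Not the scPLAN implementation: classifier, threshold, and data layout
# are illustrative stand-ins.
from sklearn.neighbors import KNeighborsClassifier

def annotate_hierarchically(X_ref, labels_by_level, X_query, threshold=0.7):
    """labels_by_level: one reference label array per tree depth (coarse -> fine)."""
    annotations = []
    for level_labels in labels_by_level:
        clf = KNeighborsClassifier(n_neighbors=15).fit(X_ref, level_labels)
        proba = clf.predict_proba(X_query)
        conf = proba.max(axis=1)
        pred = clf.classes_[proba.argmax(axis=1)]
        # Low-confidence cells are flagged as potential novel types at this
        # level instead of being forced deeper into the tree.
        annotations.append([p if c >= threshold else "potential_novel"
                            for p, c in zip(pred, conf)])
    return annotations
```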


Subjects
Computational Biology; Gene Expression Profiling; Single-Cell Analysis; Single-Cell Analysis/methods; Gene Expression Profiling/methods; Computational Biology/methods; Humans; Software; Transcriptome; Sequence Analysis, RNA/methods; RNA-Seq/methods; Molecular Sequence Annotation/methods
2.
Brief Bioinform ; 25(4)2024 May 23.
Article in English | MEDLINE | ID: mdl-38888457

ABSTRACT

Large sample datasets have been regarded as the primary basis for innovative discoveries and for resolving missing heritability in genome-wide association studies. However, the computational complexity of existing methods prevents them from modeling all comprehensive effects and full polygenic backgrounds, which reduces the effectiveness of large datasets. To address these challenges, we included all effects and polygenic backgrounds in a mixed logistic model for binary traits and compressed the four variance components into two. Building on the compressed model, we combined three computational algorithms, tailored to sample size, computational speed, and reduced memory requirements, into an innovative method for large data analysis called FastBiCmrMLM. To mine additional genes, linkage disequilibrium markers were replaced by bin-based haplotypes, analyzed by FastBiCmrMLM in a variant named FastBiCmrMLM-Hap. Simulation studies highlighted the superiority of FastBiCmrMLM over GMMAT, SAIGE, and fastGWA-GLMM in identifying dominant, small-α (allele substitution effect), and rare variants. In the UK Biobank-scale dataset, we demonstrated that FastBiCmrMLM could detect variants with frequencies as low as 0.03% and with α ≈ 0. In re-analyses of seven diseases in the WTCCC datasets, 29 candidate genes with both functional and TWAS evidence, located around 36 variants identified only by the new methods, strongly validated these methods. These methods offer a new way to decipher the genetic architecture of binary traits and to address the challenges outlined above.
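For orientation, a sketch of the plain single-marker logistic scan that methods like FastBiCmrMLM accelerate and extend; the mixed-model polygenic background and variance-component compression that constitute the paper's contribution are not reproduced here.

```python
# Plain single-marker logistic association scan, for orientation only.
# FastBiCmrMLM's contributions (polygenic background, compressed variance
# components, fast algorithms) are not shown.
import numpy as np
import statsmodels.api as sm

def logistic_gwas(genotypes, phenotype):
    """genotypes: (n_samples, n_markers) matrix coded 0/1/2; phenotype: 0/1 vector."""
    pvals = []
    for j in range(genotypes.shape[1]):
        X = sm.add_constant(genotypes[:, j])
        fit = sm.Logit(phenotype, X).fit(disp=0)
        pvals.append(fit.pvalues[1])  # p-value for the marker effect
    return np.array(pvals)
```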


Subjects
Algorithms; Genome-Wide Association Study; Genome-Wide Association Study/methods; Humans; Logistic Models; Case-Control Studies; Linkage Disequilibrium; Polymorphism, Single Nucleotide; Genomics/methods; Computer Simulation; Haplotypes; Models, Genetic
3.
Brief Bioinform ; 24(6)2023 09 22.
Article in English | MEDLINE | ID: mdl-37798252

ABSTRACT

The emergence of massive datasets exploring the multiple levels of molecular biology has made their analysis and knowledge transfer more complex. Flexible tools for managing big biological datasets could greatly help standardize the usage of data visualization and integration methods. Business intelligence (BI) tools have been used in many fields as exploratory tools. They offer numerous connectors to link data repositories with a unified graphic interface, providing an overview of data and facilitating interpretation for decision makers. BI tools could thus be a flexible and user-friendly way of handling molecular biological data with interactive visualizations, yet it is rather uncommon to see them used for exploring massive and complex datasets in biology. We believe two main obstacles are the reason. First, the ways data are imported into BI tools are often incompatible with biological databases. Second, BI tools may not be adapted to certain particularities of complex biological data, namely the size and variability of datasets and the need for specialized visualizations. This paper highlights the use of five BI tools (Elastic Kibana, Siren Investigate, Microsoft Power BI, Salesforce Tableau, and Apache Superset) that are compatible with the massive data management repository engine Elasticsearch. Four case studies are discussed in which these BI tools were applied to biological datasets with different characteristics. We conclude that the performance of the tools depends on the complexity of the biological questions and the size of the datasets.
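As a pointer to how such pipelines typically begin, a sketch of bulk-loading tabular biological records into Elasticsearch with the official Python client, after which BI tools such as Kibana can connect to the index. The index name and fields are hypothetical.

```python
# Sketch: bulk-load tabular biological records into Elasticsearch so that
# BI tools (e.g. Kibana) can connect to the index. Index name and fields
# are hypothetical.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

records = [
    {"gene": "TP53", "log2_fc": 1.8, "p_adj": 1e-6, "condition": "tumor"},
    {"gene": "BRCA1", "log2_fc": -0.4, "p_adj": 0.03, "condition": "tumor"},
]

actions = ({"_index": "expression-results", "_source": r} for r in records)
helpers.bulk(es, actions)  # the bulk API scales this pattern to millions of rows
```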


Subjects
Datasets as Topic; Software; Data Visualization
4.
Brief Bioinform ; 24(1)2023 01 19.
Article in English | MEDLINE | ID: mdl-36528804

ABSTRACT

The rapid progress of machine learning (ML) in predicting molecular properties has made high-precision predictions routinely achievable. However, many ML models, such as those based on the conventional molecular graph, cannot differentiate stereoisomers of certain types, particularly conformational and chiral ones that share the same bonding connectivity but differ in spatial arrangement. Here, we designed a hybrid molecular graph network, the Chemical Feature Fusion Network (CFFN), to address this issue by integrating planar and stereo information of molecules in an interweaved fashion. The three-dimensional (3D, i.e., stereo) modality guarantees precision and completeness by providing unabridged information, while the two-dimensional (2D, i.e., planar) modality brings in chemical intuition as prior knowledge for guidance. The zipper-like arrangement of 2D and 3D information processing promotes cooperativity between them, and their synergy is the key to our model's success. Experiments on various molecular and conformational datasets, including a newly created chiral-molecule dataset comprising various configurations and conformations, demonstrate the superior performance of CFFN. The advantage of CFFN is even more pronounced on datasets with small sample sizes. Ablation experiments confirm that fusing 2D and 3D molecular graphs into unambiguous molecular descriptors not only effectively distinguishes molecules and their conformations but also yields more accurate and robust predictions of quantum chemical properties.
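For context, a sketch of extracting the two modalities that CFFN-style models fuse, a 2D bond graph and 3D conformer coordinates, using RDKit; the fusion network itself is not shown.

```python
# Sketch: the two modalities a CFFN-style model fuses, extracted with RDKit.
# The fusion network itself is not shown.
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.AddHs(Chem.MolFromSmiles("C[C@H](N)C(=O)O"))  # L-alanine (one chiral center)

# 2D (planar) modality: bond connectivity of the molecular graph
edges = [(b.GetBeginAtomIdx(), b.GetEndAtomIdx()) for b in mol.GetBonds()]

# 3D (stereo) modality: coordinates of an embedded conformer
AllChem.EmbedMolecule(mol, randomSeed=42)
coords = mol.GetConformer().GetPositions()  # (n_atoms, 3) numpy array
```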


Subjects
Machine Learning; Stereoisomerism; Molecular Conformation
5.
Brief Bioinform ; 24(4)2023 07 20.
Article in English | MEDLINE | ID: mdl-37200157

ABSTRACT

Single-cell omics technologies have made it possible to analyze the individual cells within a biological sample, providing a more detailed understanding of biological systems. Accurately determining the type of each cell is a crucial goal in single-cell RNA-seq (scRNA-seq) analysis. Apart from overcoming batch effects arising from various factors, single-cell annotation methods also face the challenge of processing large-scale datasets effectively. With the increasing availability of scRNA-seq datasets, integrating multiple datasets and addressing batch effects originating from diverse sources are further challenges for cell-type annotation. To overcome these challenges, we developed CIForm, a supervised, Transformer-based method for cell-type annotation of large-scale scRNA-seq data. To assess its effectiveness and robustness, we compared CIForm with leading tools on benchmark datasets. Systematic comparisons across various cell-type annotation scenarios show that the effectiveness of CIForm is particularly pronounced in large-scale cell-type annotation. The source code and data are available at https://github.com/zhanglab-wbgcas/CIForm.
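As a rough illustration of this kind of approach, a minimal Transformer-encoder classifier that treats fixed-size sub-vectors of the expression profile as tokens (PyTorch; the published CIForm architecture and preprocessing differ in detail).

```python
# Minimal Transformer-encoder cell-type classifier in the spirit of CIForm:
# fixed-size sub-vectors of the expression profile serve as tokens. The
# published architecture and preprocessing differ in detail.
import torch
import torch.nn as nn

class CellTypeTransformer(nn.Module):
    def __init__(self, n_types, token_dim=128, n_heads=4, n_layers=2):
        super().__init__()
        self.token_dim = token_dim
        self.proj = nn.Linear(token_dim, token_dim)
        layer = nn.TransformerEncoderLayer(d_model=token_dim, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(token_dim, n_types)

    def forward(self, x):                           # x: (batch, n_genes)
        b, g = x.shape
        x = x[:, : g - g % self.token_dim]           # trim to a multiple of token_dim
        tokens = self.proj(x.reshape(b, -1, self.token_dim))
        pooled = self.encoder(tokens).mean(dim=1)    # mean-pool over tokens
        return self.head(pooled)                     # cell-type logits

logits = CellTypeTransformer(n_types=10)(torch.rand(8, 2000))  # 8 cells, 2000 genes
```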


Subjects
Gene Expression Profiling; Single-Cell Gene Expression Analysis; Gene Expression Profiling/methods; Sequence Analysis, RNA/methods; Single-Cell Analysis/methods; Software
6.
Methods ; 224: 1-9, 2024 Apr.
Article in English | MEDLINE | ID: mdl-38295891

ABSTRACT

The Major Histocompatibility Complex (MHC) is a critical element of the vertebrate cellular immune system, responsible for presenting peptides derived from intracellular proteins. MHC-I presentation is pivotal in the immune response and holds considerable potential for vaccine development and cancer immunotherapy. This study examines the limitations of current methods and benchmarks for MHC-I presentation. We introduce a novel benchmark designed to assess generalization properties and the reliability of models on unseen MHC molecules and peptides, with a focus on the Human Leukocyte Antigen (HLA), the specific subset of MHC genes present in humans. Finally, we introduce HLABERT, a pretrained language model that significantly outperforms previous methods on our benchmark and establishes a new state of the art on existing benchmarks.
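For illustration, one plausible way to encode an HLA/peptide pair as a token sequence for a BERT-style model; the vocabulary and input format here are hypothetical and need not match HLABERT's actual tokenizer.

```python
# One plausible encoding of an HLA/peptide pair for a BERT-style model.
# Vocabulary and input format are hypothetical; HLABERT's may differ.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
VOCAB = {"[CLS]": 0, "[SEP]": 1, **{aa: i + 2 for i, aa in enumerate(AMINO_ACIDS)}}

def encode_pair(hla_pseudo_seq, peptide):
    """Token ids for: [CLS] HLA-pseudo-sequence [SEP] peptide [SEP]."""
    tokens = ["[CLS]", *hla_pseudo_seq, "[SEP]", *peptide, "[SEP]"]
    return [VOCAB[t] for t in tokens]

ids = encode_pair("YFAVLTWYGEKVHTHVDTLYVRYHYYTWAVLAYTWY", "SIINFEKL")
```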


Subjects
Peptides; Proteins; Humans; Reproducibility of Results; Peptides/chemistry; Proteins/metabolism; Major Histocompatibility Complex/genetics; Protein Binding
7.
Proteomics ; : e2400005, 2024 Mar 31.
Article in English | MEDLINE | ID: mdl-38556628

ABSTRACT

Here we present a chatbot assistant infrastructure (https://www.ebi.ac.uk/pride/chatbot/) that simplifies user interactions with the PRIDE database's documentation and dataset search functionality. The framework utilizes multiple Large Language Models (LLMs): llama2, chatglm, mixtral (mistral), and openhermes. It also includes a web service API (Application Programming Interface), a web interface, and components for indexing and managing vector databases. A benchmark component based on an Elo rating system allows the performance of each LLM to be evaluated and the PRIDE documentation to be improved. The chatbot not only lets users interact with the PRIDE documentation but can also search and find PRIDE datasets via an LLM-based recommendation system, improving dataset discoverability. Importantly, while our infrastructure is exemplified in the context of the PRIDE database, its modular and adaptable nature makes it a valuable tool for improving user experiences across a spectrum of bioinformatics and proteomics tools and resources, among other domains. The integration of advanced LLMs, vector-based retrieval, the benchmarking framework, and optimized documentation collectively form a robust and transferable chatbot assistant infrastructure. The framework is open source (https://github.com/PRIDE-Archive/pride-chatbot).
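For reference, a minimal Elo update of the kind such a ranking component applies after each pairwise comparison of LLM answers (a generic sketch, not the PRIDE chatbot's code).

```python
# Minimal Elo update after one pairwise comparison of two LLM answers.
# A generic sketch, not the PRIDE chatbot's code.
def elo_update(r_winner, r_loser, k=32.0):
    expected_win = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
    return r_winner + k * (1.0 - expected_win), r_loser - k * (1.0 - expected_win)

ratings = {"llama2": 1000.0, "mixtral": 1000.0}
# A user preferred mixtral's answer over llama2's:
ratings["mixtral"], ratings["llama2"] = elo_update(ratings["mixtral"], ratings["llama2"])
```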

8.
BMC Bioinformatics ; 25(1): 183, 2024 May 09.
Article in English | MEDLINE | ID: mdl-38724908

ABSTRACT

BACKGROUND: In recent years, gene clustering analysis has become a widely used tool for studying gene function, efficiently categorizing genes with similar expression patterns. Caenorhabditis elegans is commonly used in embryonic research because of its invariant cell lineage from fertilized egg to adulthood, and biologists use 4D confocal imaging to observe gene expression dynamics at the single-cell level. However, the observed tree-shaped time-series datasets have non-pairwise data points between individuals, and cell-type heterogeneity should also be considered during clustering to obtain more biologically meaningful results. RESULTS: We propose a biclustering model for tree-shaped single-cell gene expression data of Caenorhabditis elegans. Specifically, a tree-shaped piecewise polynomial function is first employed to fit the non-pairwise gene expression time series. Four factors are then combined in the objective function: Pearson correlation coefficients capturing gene correlations, p-values from the Kolmogorov-Smirnov test measuring similarity between cells, gene expression size, and bicluster overlap size. A genetic algorithm is used to optimize the objective function. CONCLUSION: Results on a small-scale dataset validate the feasibility and effectiveness of our model, which outperforms existing classical biclustering models. Gene enrichment analysis of the results on the complete real dataset confirms that the discovered biclusters hold significant biological relevance.
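For illustration, a sketch of two of the four objective terms using SciPy, mean pairwise Pearson correlation across genes and mean Kolmogorov-Smirnov p-value across cells; the paper's exact weighting and the genetic-algorithm optimizer are not reproduced.

```python
# Two of the four objective terms, sketched with SciPy. The paper's
# weighting and genetic-algorithm optimizer are not reproduced.
from itertools import combinations
import numpy as np
from scipy.stats import pearsonr, ks_2samp

def bicluster_scores(expr):
    """expr: (n_genes, n_cells) fitted expression values for one bicluster."""
    gene_corr = np.mean([pearsonr(expr[i], expr[j])[0]
                         for i, j in combinations(range(expr.shape[0]), 2)])
    cell_sim = np.mean([ks_2samp(expr[:, i], expr[:, j]).pvalue
                        for i, j in combinations(range(expr.shape[1]), 2)])
    return gene_corr, cell_sim
```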


Subjects
Caenorhabditis elegans; Single-Cell Analysis; Caenorhabditis elegans/genetics; Caenorhabditis elegans/metabolism; Animals; Single-Cell Analysis/methods; Cluster Analysis; Gene Expression Profiling/methods; Algorithms
9.
BMC Genomics ; 25(1): 318, 2024 Mar 28.
Article in English | MEDLINE | ID: mdl-38549092

ABSTRACT

BACKGROUND: Detecting structural variations (SVs) at the population level using next-generation sequencing (NGS) requires substantial computational resources and processing time. Here, we compared the performance of 11 SV callers: Delly, Manta, GridSS, Wham, Sniffles, Lumpy, SvABA, Canvas, CNVnator, MELT, and INSurVeyor. These SV callers have been published recently and are widely employed for processing massive whole-genome sequencing datasets. We evaluated the accuracy, sequence-depth requirements, running time, and memory usage of each caller. RESULTS: Notably, several callers exhibited better calling performance for deletions than for duplications, inversions, and insertions. Among the SV callers, Manta identified deletion SVs with the best performance and efficient use of computing resources, and both Manta and MELT demonstrated relatively good precision in calling insertions. We confirmed that the copy number variation callers, Canvas and CNVnator, exhibited better performance in identifying long duplications, as they employ the read-depth approach. Finally, we verified the genotypes inferred by each SV caller against a phased long-read assembly dataset, and Manta showed the highest concordance for deletions and insertions. CONCLUSIONS: Our findings provide a comprehensive understanding of the accuracy and computational efficiency of SV callers, thereby facilitating integrative analysis of SV profiles across diverse large-scale genomic datasets.
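For orientation, a sketch of the usual matching criterion behind such benchmarks, 50% reciprocal overlap between called and truth-set deletion intervals; thresholds and matching rules vary between studies.

```python
# Benchmarking deletion calls against a truth set via 50% reciprocal
# overlap. Illustrative; matching rules vary between studies.
def reciprocal_overlap(a, b, min_frac=0.5):
    ov = min(a[1], b[1]) - max(a[0], b[0])
    return ov > 0 and ov / (a[1] - a[0]) >= min_frac and ov / (b[1] - b[0]) >= min_frac

def precision_recall(calls, truth):
    tp = sum(any(reciprocal_overlap(c, t) for t in truth) for c in calls)
    matched = sum(any(reciprocal_overlap(c, t) for c in calls) for t in truth)
    return tp / len(calls), matched / len(truth)

calls = [(1000, 5000), (9000, 9500)]   # (start, end) deletion intervals
truth = [(1100, 4900), (20000, 21000)]
print(precision_recall(calls, truth))  # (0.5, 0.5)
```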


Subjects
DNA Copy Number Variations; Genomics; Humans; Whole Genome Sequencing; High-Throughput Nucleotide Sequencing; Sequence Analysis, DNA; Genome, Human; Genomic Structural Variation
10.
Neuroimage ; 292: 120603, 2024 Apr 15.
Article in English | MEDLINE | ID: mdl-38588833

ABSTRACT

Fetal brain development is a complex process involving different stages of growth and organization which are crucial for the development of brain circuits and neural connections. Fetal atlases and labeled datasets are promising tools to investigate prenatal brain development. They support the identification of atypical brain patterns, providing insights into potential early signs of clinical conditions. In a nutshell, prenatal brain imaging and post-processing via modern tools constitute a cutting-edge field that will significantly contribute to advancing our understanding of fetal development. In this work, we first provide terminological clarification for specific terms (i.e., "brain template" and "brain atlas"), highlighting potentially misleading interpretations arising from inconsistent use of these terms in the literature. We then discuss the major structures and neurodevelopmental milestones characterizing fetal brain ontogenesis. Our main contribution is a systematic review of 18 prenatal brain atlases and 3 datasets. We also touch on the clinical, research, and ethical implications of prenatal neuroimaging.


Subjects
Atlases as Topic; Brain; Magnetic Resonance Imaging; Neuroimaging; Female; Humans; Pregnancy; Brain/diagnostic imaging; Brain/embryology; Datasets as Topic; Fetal Development/physiology; Fetus/diagnostic imaging; Magnetic Resonance Imaging/methods; Neuroimaging/methods
11.
J Cell Sci ; 135(7)2022 04 01.
Article in English | MEDLINE | ID: mdl-35420128

ABSTRACT

For the past century, the nucleus has been the focus of extensive investigations in cell biology. However, many questions remain about how its shape and size are regulated during development, in different tissues, or during disease and aging. To track these changes, microscopy has long been the tool of choice. Image analysis has revolutionized this field of research by providing computational tools that can be used to translate qualitative images into quantitative parameters. Many tools have been designed to delimit objects in 2D and, more recently, in 3D in order to define their shapes, their number or their position in nuclear space. Today, the field is driven by deep-learning methods, most of which take advantage of convolutional neural networks. These techniques are remarkably adapted to biomedical images when trained using large datasets and powerful computer graphics cards. To promote these innovative and promising methods to cell biologists, this Review summarizes the main concepts and terminologies of deep learning. Special emphasis is placed on the availability of these methods. We highlight why the quality and characteristics of training image datasets are important and where to find them, as well as how to create, store and share image datasets. Finally, we describe deep-learning methods well-suited for 3D analysis of nuclei and classify them according to their level of usability for biologists. Out of more than 150 published methods, we identify fewer than 12 that biologists can use, and we explain why this is the case. Based on this experience, we propose best practices to share deep-learning methods with biologists.


Subjects
Deep Learning; Cell Nucleus; Image Processing, Computer-Assisted/methods; Imaging, Three-Dimensional; Microscopy/methods; Neural Networks, Computer
12.
Brief Bioinform ; 23(5)2022 09 20.
Article in English | MEDLINE | ID: mdl-35849818

ABSTRACT

Automated relation extraction (RE) from biomedical literature is critical for many downstream text mining applications in both research and real-world settings. However, most existing benchmarking datasets for biomedical RE only focus on relations of a single type (e.g. protein-protein interactions) at the sentence level, greatly limiting the development of RE systems in biomedicine. In this work, we first review commonly used named entity recognition (NER) and RE datasets. Then, we present a first-of-its-kind biomedical relation extraction dataset (BioRED) with multiple entity types (e.g. gene/protein, disease, chemical) and relation pairs (e.g. gene-disease; chemical-chemical) at the document level, on a set of 600 PubMed abstracts. Furthermore, we label each relation as describing either a novel finding or previously known background knowledge, enabling automated algorithms to differentiate between novel and background information. We assess the utility of BioRED by benchmarking several existing state-of-the-art methods, including Bidirectional Encoder Representations from Transformers (BERT)-based models, on the NER and RE tasks. Our results show that while existing approaches can reach high performance on the NER task (F-score of 89.3%), there is much room for improvement for the RE task, especially when extracting novel relations (F-score of 47.7%). Our experiments also demonstrate that such a rich dataset can successfully facilitate the development of more accurate, efficient and robust RE systems for biomedicine. Availability: The BioRED dataset and annotation guidelines are freely available at https://ftp.ncbi.nlm.nih.gov/pub/lu/BioRED/.


Subjects
Algorithms; Data Mining; Proteins; PubMed
13.
Cytometry A ; 105(7): 501-520, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38563259

ABSTRACT

Deep learning approaches have frequently been used for the classification and segmentation of human peripheral blood cells. Previous studies typically used more than one dataset but analyzed each separately; to our knowledge, no study has pooled multiple datasets for combined use. Here, five types of white blood cells were classified using a mixture of four different datasets. For segmentation, four types of white blood cells were considered, and three neural networks, a CNN (Convolutional Neural Network), UNet, and SegNet, were applied. The classification results were compared with those of related studies: balanced accuracy was 98.03%, and test accuracy on a train-independent dataset was 97.27%. For segmentation of both nucleus and cytoplasm, the proposed CNN achieved accuracies of 98.9% on the train-dependent dataset and 92.82% on the train-independent dataset. The proposed method thus detects white blood cells from a train-independent dataset with high accuracy and, given its successful classification and segmentation results, is promising as a diagnostic tool for clinical use.
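For scale, a generic baseline CNN for five-class white-blood-cell classification; this is an illustrative sketch, as the paper's exact architecture is not specified here.

```python
# Generic baseline CNN for five-class white-blood-cell classification.
# Illustrative only; not the paper's exact architecture.
import torch.nn as nn

wbc_classifier = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(128, 5),  # neutrophil, lymphocyte, monocyte, eosinophil, basophil
)
```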


Subjects
Deep Learning; Leukocytes; Neural Networks, Computer; Humans; Leukocytes/cytology; Leukocytes/classification; Image Processing, Computer-Assisted/methods; Data Analysis; Cell Nucleus; Cytoplasm
14.
Histopathology ; 85(3): 418-436, 2024 Sep.
Article in English | MEDLINE | ID: mdl-38719547

ABSTRACT

BACKGROUND AND OBJECTIVES: Current national and regional guidelines for pathology reporting on invasive breast cancer differ in certain aspects, resulting in divergent reporting practice and a lack of comparability of data. Here we report on a new international dataset for the pathology reporting of resection specimens with invasive cancer of the breast. The dataset was produced under the auspices of the International Collaboration on Cancer Reporting (ICCR), a global alliance of major (inter-)national pathology and cancer organizations. METHODS AND RESULTS: The established ICCR process for dataset development was followed. An international expert panel consisting of breast pathologists, a surgeon, and an oncologist prepared a draft set of core and noncore data items based on a critical review and discussion of current evidence. Commentary was provided for each data item to explain the rationale for selecting it as a core or noncore element and its clinical relevance, and to highlight potential areas of disagreement or lack of evidence, in which case a consensus position was formulated. Following international public consultation, the document was finalized and ratified, and the dataset, which includes a synoptic reporting guide, was published on the ICCR website. CONCLUSIONS: This first international dataset for invasive cancer of the breast is intended to promote high-quality, standardized pathology reporting. Its widespread adoption will improve consistency of reporting, facilitate multidisciplinary communication, and enhance comparability of data, all of which will help to improve the management of invasive breast cancer patients.


Subjects
Breast Neoplasms; Humans; Breast Neoplasms/pathology; Female; Pathology, Clinical/standards; Datasets as Topic/standards
15.
BMC Cancer ; 24(1): 333, 2024 Mar 12.
Article in English | MEDLINE | ID: mdl-38475762

ABSTRACT

BACKGROUND: The paucity of high-level evidence on proton therapy (PT) is one of the main obstacles to establishing solid indications in the PT setting. The aim of the present registry, the POWER registry, is to provide a tool for systematic, prospective, harmonized, and multidimensional high-quality data collection to promote knowledge in the field of PT, with a particular focus on the use of hypofractionation. METHODS: All patients with any type of oncologic disease (benign or malignant) eligible for PT at the European Institute of Oncology (IEO), Milan, Italy, will be included in the registry. Three levels of data collection will be implemented: Level 1, clinical research (patient outcomes and toxicity, quality of life, and cost-effectiveness analysis); Level 2, radiological and radiobiological research (radiomic and dosiomic analysis, as well as biological modeling); Level 3, biological and translational research (biological biomarkers and genomic data analysis). Endpoints and outcome measures of hypofractionation schedules will be evaluated in terms of both treatment efficacy (tumor response rate; time to progression; percentage of survivors and median survival; and clinical, biological, and radiological biomarker changes identified as surrogate endpoints of cancer survival and treatment response) and toxicity. The study protocol has been approved by the IEO Ethical Committee (IEO 1885). Beyond patients treated at IEO, additional PT facilities (equipped with Proteus®ONE or Proteus®PLUS technologies by IBA, Ion Beam Applications, Louvain-la-Neuve, Belgium) are planned to join the registry, which will also be fully integrated into international PT data collection networks.


Subjects
Neoplasms; Proton Therapy; Humans; Biomarkers; Prospective Studies; Quality of Life; Registries; Multicenter Studies as Topic
16.
Mult Scler ; 30(3): 396-418, 2024 Mar.
Article in English | MEDLINE | ID: mdl-38140852

ABSTRACT

BACKGROUND: As of September 2022, there was no globally recommended set of core data elements for use in multiple sclerosis (MS) healthcare and research. As a result, data harmonisation across observational data sources and scientific collaboration are limited. OBJECTIVES: To define and agree upon a core dataset for real-world data (RWD) in MS from observational registries and cohorts. METHODS: A three-phase process combining a landscaping exercise with dedicated discussions within a global multi-stakeholder task force of 20 experts in MS and its RWD was conducted to define the core dataset. RESULTS: A core dataset for MS consisting of 44 variables in eight categories was translated into a data dictionary that has been published and disseminated for emerging and existing registries and cohorts to use. Categories include demographics and comorbidities (patient-specific data) as well as disease history, disease status, relapses, magnetic resonance imaging (MRI), and treatment data (disease-specific data). CONCLUSION: The MS Data Alliance Core Dataset guides emerging registries in their dataset definitions and speeds up and supports harmonisation across registries and initiatives. The straightforward, time-efficient process using a dedicated global multi-stakeholder task force has proven effective for defining a concise core dataset.


Subjects
Multiple Sclerosis; Humans; Registries
17.
Mol Pharm ; 2024 Aug 12.
Article in English | MEDLINE | ID: mdl-39132855

ABSTRACT

We present a novel computational approach for predicting human pharmacokinetics (PK) that addresses the challenges of early-stage drug design. Our study introduces and describes a large-scale dataset of 11 clinical PK endpoints, encompassing over 2700 unique chemical structures, used to train machine learning models. To that end, multiple advanced training strategies are compared, including the integration of in vitro data and a novel self-supervised pretraining task. In addition to the predictions, our final model provides meaningful epistemic uncertainty estimates for every data point. This allows us to identify regions of exceptional predictive performance, with an absolute average fold error (AAFE, also known as geometric mean fold error) of less than 2.5 across multiple endpoints. Together, these advancements represent a significant step toward actionable PK predictions, which can be used early in the drug design process to expedite development and reduce reliance on nonclinical studies.
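For reference, the AAFE metric as it is commonly defined for PK predictions, the geometric mean of absolute fold errors (a sketch under that standard definition).

```python
# AAFE (absolute average fold error / geometric mean fold error) as commonly
# defined for PK predictions: 10 ** mean(|log10(predicted / observed)|).
import numpy as np

def aafe(predicted, observed):
    ratios = np.asarray(predicted, float) / np.asarray(observed, float)
    return 10 ** np.mean(np.abs(np.log10(ratios)))

print(aafe([2.0, 5.0, 0.8], [1.0, 4.0, 1.0]))  # ~1.46, i.e. within 2-fold on average
```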

18.
Biometrics ; 80(2)2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38768225

ABSTRACT

Conventional supervised learning usually operates under the premise that data are collected from the same underlying population. However, challenges may arise when integrating new data from different populations, resulting in a phenomenon known as dataset shift. This paper focuses on prior probability shift, where the distribution of the outcome varies across datasets but the conditional distribution of features given the outcome remains the same. To tackle the challenges posed by such shift, we propose an estimation algorithm that can efficiently combine information from multiple sources. Unlike existing methods that are restricted to discrete outcomes, the proposed approach accommodates both discrete and continuous outcomes. It also handles high-dimensional covariate vectors through variable selection using an adaptive least absolute shrinkage and selection operator penalty, producing efficient estimates that possess the oracle property. Moreover, a novel semiparametric likelihood ratio test is proposed to check the validity of prior probability shift assumptions by embedding the null conditional density function into Neyman's smooth alternatives (Neyman, 1937) and testing study-specific parameters. We demonstrate the effectiveness of our proposed method through extensive simulations and a real data example. The proposed methods serve as a useful addition to the repertoire of tools for dealing with dataset shift.
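For intuition, the defining property of prior probability shift lets posteriors from a source-trained classifier be reweighted by the ratio of class priors, as in the sketch below; this is a textbook adjustment, and the paper's estimator and test go well beyond it.

```python
# Under prior probability shift p(x|y) is unchanged, so posteriors from a
# source-trained classifier can be reweighted by the ratio of class priors.
# A textbook adjustment; the paper's estimator and test go well beyond this.
import numpy as np

def adjust_for_prior_shift(posteriors, source_prior, target_prior):
    """posteriors: (n, k) outputs of a classifier trained under source_prior."""
    adjusted = posteriors * (np.asarray(target_prior) / np.asarray(source_prior))
    return adjusted / adjusted.sum(axis=1, keepdims=True)

p = np.array([[0.7, 0.3]])
print(adjust_for_prior_shift(p, source_prior=[0.5, 0.5], target_prior=[0.2, 0.8]))
# -> [[0.368..., 0.631...]]
```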


Subjects
Algorithms; Computer Simulation; Models, Statistical; Probability; Humans; Likelihood Functions; Biometry/methods; Data Interpretation, Statistical; Supervised Machine Learning
19.
Article in English | MEDLINE | ID: mdl-38886295

ABSTRACT

BACKGROUND: Preterm birth (before 37 completed weeks of gestation) is associated with an increased risk of adverse health and developmental outcomes relative to birth at term. Existing guidelines for data collection in cohort studies of individuals born preterm are either limited in scope, have not been developed using formal consensus methodology, or did not involve a range of stakeholders in their development. Recommendations meeting these criteria would facilitate data pooling and harmonisation across studies. OBJECTIVES: To develop a Core Dataset for use in longitudinal cohort studies of individuals born preterm. METHODS: This work was carried out as part of the RECAP Preterm project. A systematic review of variables included in existing core outcome sets was combined with a scoping exercise conducted with experts on preterm birth, and the results were used to generate a draft core dataset. A modified Delphi process was implemented in two stages of three rounds each, with three stakeholder groups participating: RECAP Preterm project partners, external experts in the field, and people with lived experience of preterm birth. The Delphi used a 9-point Likert scale, with higher values indicating greater importance for inclusion. Participants also suggested additional variables they considered important, which were voted on in later rounds. RESULTS: An initial list of 140 data items was generated. Ninety-six participants across 22 countries took part in the Delphi, of whom 29% were individuals with lived experience of preterm birth. Consensus was reached on 160 data items covering Antenatal and Birth Information, Neonatal Care, Mortality, Administrative Information, Organisational Level Information, Socio-economic and Demographic Information, Physical Health, Education and Learning, Neurodevelopmental Outcomes, Social, Lifestyle and Leisure, Healthcare Utilisation and Quality of Life. CONCLUSIONS: This core dataset includes 160 data items covering antenatal care through outcomes in adulthood. Its use will guide data collection in new studies and facilitate pooling and harmonisation of existing data internationally.

20.
Methods ; 219: 1-7, 2023 11.
Article in English | MEDLINE | ID: mdl-37689121

ABSTRACT

With the increasing availability of large-scale QSAR (Quantitative Structure-Activity Relationship) datasets, collaborative analysis has become a promising approach for drug discovery. Traditional centralized analysis, which typically concentrates data on a central server for training, faces challenges such as data privacy and security. Distributed approaches such as federated learning offer a solution by enabling collaborative model training without sharing raw data, but they may fail when the training data on local devices are non-independent and identically distributed (non-IID). In this paper, we propose a novel framework for collaborative drug discovery using federated learning on non-IID datasets. We address the difficulty of training on non-IID data by globally sharing a small subset of data among all institutions. Our framework allows multiple institutions to jointly train a robust predictive model while preserving the privacy of their individual data, leveraging the federated learning paradigm to distribute the model training process across local devices and eliminating the need for raw-data exchange. Experimental results on 15 benchmark datasets demonstrate that the proposed method achieves predictive accuracy competitive with centralized analysis while respecting data privacy. Moreover, our framework reduces data transmission and enhances scalability, making it suitable for large-scale collaborative drug discovery efforts.
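For orientation, a sketch of the FedAvg aggregation step at the heart of such frameworks; this is generic federated averaging, whereas the paper's protocol additionally shares a small global data subset to cope with non-IID data.

```python
# FedAvg aggregation step: the server averages locally trained weights
# (e.g. PyTorch state-dict tensors), weighted by local dataset size,
# without ever seeing raw data. Generic sketch, not the paper's protocol.
def fedavg(local_state_dicts, n_samples):
    total = float(sum(n_samples))
    avg = {}
    for key in local_state_dicts[0]:
        avg[key] = sum(sd[key] * (n / total)
                       for sd, n in zip(local_state_dicts, n_samples))
    return avg  # load into the global model via model.load_state_dict(avg)
```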


Subjects
Benchmarking; Drug Discovery