ABSTRACT
MOTIVATION: In the past decade, single-cell RNA sequencing (scRNA-seq) has emerged as a pivotal method for transcriptomic profiling in biomedical research. Precise cell-type identification is crucial for downstream analysis of single-cell data, and the integration and refinement of annotated data are essential for building comprehensive databases. However, prevailing annotation techniques often overlook the hierarchical organization of cell types, resulting in inconsistent annotations. Meanwhile, most existing integration approaches fail to integrate datasets with different annotation depths, and none can refine the labels of older, coarsely annotated data using more finely annotated datasets or novel biological findings. RESULTS: Here, we introduce scPLAN, a hierarchical computational framework for scRNA-seq data analysis. scPLAN annotates unlabeled scRNA-seq data using a reference dataset organized along a hierarchical cell-type tree, identifying potential novel cell types in a systematic, layer-by-layer manner. Additionally, scPLAN effectively integrates annotated scRNA-seq datasets with varying levels of annotation depth, consistently refining the cell-type labels of lower-resolution datasets. scPLAN has demonstrated its efficacy through extensive annotation and novel-cell-detection experiments, and two case studies showcase how it integrates datasets with diverse cell-type label resolutions and refines their cell-type labels. AVAILABILITY: https://github.com/michaelGuo1204/scPLAN.
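As an illustration of the layer-by-layer annotation idea, the sketch below descends a cell-type tree, training a per-node classifier on the reference cells of each subtree and rejecting low-confidence query cells as candidate novel types. The function names, the logistic-regression classifier, and the 0.7 threshold are illustrative assumptions, not scPLAN's actual interface.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def annotate(X_ref, ref_paths, X_query, depth=0, threshold=0.7):
    """Recursively label query cells along a hierarchical cell-type tree.

    ref_paths holds one label path per reference cell, e.g.
    ("Immune", "T cell", "CD4+ T cell"). Each query cell receives its
    deepest confident label, or "<parent>/unassigned" when flagged as a
    candidate novel type at that resolution.
    """
    has_level = np.array([len(p) > depth for p in ref_paths])
    if len(X_query) == 0 or not has_level.any():
        return np.full(len(X_query), "", dtype=object)  # nothing deeper
    y = np.array([p[depth] for p in ref_paths if len(p) > depth])
    if len(set(y)) == 1:                   # single child: nothing to test
        labels = np.full(len(X_query), y[0], dtype=object)
    else:
        clf = LogisticRegression(max_iter=1000).fit(X_ref[has_level], y)
        proba = clf.predict_proba(X_query)
        labels = clf.classes_[proba.argmax(axis=1)].astype(object)
        labels[proba.max(axis=1) < threshold] = "unassigned"  # candidate novel
    out = labels.copy()
    for child in set(labels) - {"unassigned"}:
        q_idx = np.where(labels == child)[0]
        r_sel = np.array([len(p) > depth and p[depth] == child
                          for p in ref_paths])
        deeper = annotate(X_ref[r_sel],
                          [p for p in ref_paths
                           if len(p) > depth and p[depth] == child],
                          X_query[q_idx], depth + 1, threshold)
        refined = deeper != ""
        out[q_idx[refined]] = [child + "/" + d for d in deeper[refined]]
    return out
```

Reference cells whose paths stop early simply contribute to the upper levels only, which is also what lets coarsely labeled datasets be refined against more finely annotated ones.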
Subjects
Computational Biology, Gene Expression Profiling, Single-Cell Analysis, Single-Cell Analysis/methods, Gene Expression Profiling/methods, Computational Biology/methods, Humans, Software, Transcriptome, RNA Sequence Analysis/methods, RNA-Seq/methods, Molecular Sequence Annotation/methods

ABSTRACT
Large sample datasets have been regarded as the primary basis for innovative discoveries and the solution to missing heritability in genome-wide association studies. However, the computational complexity of existing methods prevents them from modeling all relevant effects together with the full polygenic background, which reduces the effectiveness of large datasets. To address these challenges, we included all effects and polygenic backgrounds in a mixed logistic model for binary traits and compressed four variance components into two. The compressed model was combined with three computational algorithms to develop an innovative method, called FastBiCmrMLM, for large-data analysis. These algorithms were tailored to sample size, computational speed, and reduced memory requirements. To mine additional genes, linkage disequilibrium markers were replaced by bin-based haplotypes, which are analyzed by a variant of FastBiCmrMLM named FastBiCmrMLM-Hap. Simulation studies highlighted the superiority of FastBiCmrMLM over GMMAT, SAIGE, and fastGWA-GLMM in identifying dominant, small-α (allele substitution effect), and rare variants. In the UK Biobank-scale dataset, we demonstrated that FastBiCmrMLM could detect variants as small as 0.03% and with α ≈ 0. In re-analyses of seven diseases in the WTCCC datasets, 29 candidate genes with both functional and TWAS evidence, located around 36 variants identified only by the new methods, strongly validated these methods. These methods offer a new way to decipher the genetic architecture of binary traits and address the challenges outlined above.
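A generic form of such a model, written in assumed notation (the abstract does not spell out the exact parameterization of the four-to-two compression), is:

```latex
\[
  \operatorname{logit} \Pr(y_i = 1)
    = \mathbf{x}_i^{\top}\boldsymbol{\beta} + a_i \alpha + d_i \delta + g_i,
  \qquad
  \mathbf{g} \sim \mathcal{N}\!\bigl(\mathbf{0},\;
      \sigma_a^2 \mathbf{K}_a + \sigma_d^2 \mathbf{K}_d\bigr),
\]
```

where y_i is the binary trait of individual i, x_i collects covariates, a_i and d_i code the additive and dominance effects of the tested variant (alpha being the allele substitution effect referred to above), and K_a, K_d are kinship matrices whose weights sigma_a^2 and sigma_d^2 are the two retained variance components.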
Subjects
Algorithms, Genome-Wide Association Study, Genome-Wide Association Study/methods, Humans, Logistic Models, Case-Control Studies, Linkage Disequilibrium, Single Nucleotide Polymorphism, Genomics/methods, Computer Simulation, Haplotypes, Genetic Models

ABSTRACT
Plant long noncoding RNAs (lncRNAs) exhibit features such as tissue-specific expression, spatiotemporal regulation, and stress responsiveness. Although diverse studies support the regulatory role of lncRNAs in model plants, our knowledge about lncRNAs in crops is limited. We employed a custom pipeline on a dataset of over 1000 RNA-seq samples across nine representative species of the family Cucurbitaceae to predict 91,209 nonredundant lncRNAs. The lncRNAs were characterized according to three confidence levels and classified by their genomic context into intergenic, natural antisense, intronic, and sense-overlapping. Compared with protein-coding genes, lncRNAs were, on average, expressed at low levels and displayed significantly higher specificity with respect to tissue, developmental stage, and stress responsiveness. Evolutionary analysis indicates higher positional than sequence conservation, probably linked to the conserved modular motifs within syntenic lncRNAs. Moreover, a positive correlation was observed between the expression of intergenic/natural antisense lncRNAs and their closest/parental genes. For the intergenic lncRNAs, the correlation decreases with the distance to the neighboring gene, suggesting that their potential cis-regulatory effect acts over a short range. Furthermore, analysis of developmental studies showed that a conserved NAT-lncRNA family is differentially expressed in a coordinated way with its cognate sense protein-coding genes. These genes encode proteins associated with phloem development, providing insights into the potential involvement of some of the identified lncRNAs in a developmental process. We expect this extensive inventory to constitute a valuable resource for further research focused on elucidating the regulatory mechanisms mediated by lncRNAs in cucurbits.
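The distance-dependence claim lends itself to a simple computation; the sketch below (with hypothetical column names) correlates each intergenic lncRNA with its closest protein-coding gene across samples and summarizes how the correlation decays with genomic distance.

```python
import pandas as pd
from scipy.stats import pearsonr

def distance_decay(expr, pairs, bins=(0, 2_000, 10_000, 50_000, 250_000)):
    """expr: DataFrame (features x samples) of log-normalized expression.
    pairs: DataFrame with columns 'lncRNA', 'gene', 'distance' (bp)."""
    rows = []
    for _, p in pairs.iterrows():
        r, _ = pearsonr(expr.loc[p["lncRNA"]], expr.loc[p["gene"]])
        rows.append((p["distance"], r))
    df = pd.DataFrame(rows, columns=["distance", "r"])
    df["bin"] = pd.cut(df["distance"], list(bins))
    # mean correlation per distance bin: a short-range cis effect shows
    # up as a monotonic decay across bins
    return df.groupby("bin", observed=True)["r"].agg(["mean", "count"])
```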
Subjects
Plant Gene Expression Regulation, Long Noncoding RNA, Plant RNA, Long Noncoding RNA/genetics, Plant RNA/genetics, Cucurbitaceae/genetics

ABSTRACT
The emergence of massive datasets exploring the multiple levels of molecular biology has made their analysis and knowledge transfer more complex. Flexible tools to manage big biological datasets could greatly help standardize the use of data visualization and integration methods. Business intelligence (BI) tools have been used in many fields as exploratory tools. They offer numerous connectors to link diverse data repositories with a unified graphic interface, providing an overview of the data and facilitating interpretation for decision makers. BI tools could thus be a flexible and user-friendly way of handling molecular biological data with interactive visualizations. However, such tools are rarely used to explore massive and complex datasets in the biological fields. We believe two main obstacles are the reason. First, the mechanisms for importing data into BI tools are often incompatible with biological databases. Second, BI tools may not be adapted to certain particularities of complex biological data, namely dataset size and variability and the need for specialized visualizations. This paper highlights the use of five BI tools (Elastic Kibana, Siren Investigate, Microsoft Power BI, Salesforce Tableau and Apache Superset), all of which are compatible with the massive data-management engine Elasticsearch. Four case studies are discussed in which these BI tools were applied to biological datasets with different characteristics. We conclude that the performance of the tools depends on the complexity of the biological questions and the size of the datasets.
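As a taste of what such a setup looks like from the data side, the sketch below runs an aggregation against Elasticsearch with the official Python client; the index name and fields ('samples', 'tissue', 'expression') are hypothetical placeholders, not a real biological index.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="samples",
    size=0,  # aggregations only, no individual hits
    aggs={
        "by_tissue": {
            "terms": {"field": "tissue.keyword", "size": 20},
            "aggs": {"mean_expr": {"avg": {"field": "expression"}}},
        }
    },
)
# The same aggregation is what a BI dashboard issues behind a bar chart.
for bucket in resp["aggregations"]["by_tissue"]["buckets"]:
    print(bucket["key"], bucket["doc_count"], bucket["mean_expr"]["value"])
```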
Subjects
Datasets as Topic, Software, Data Visualization

ABSTRACT
The rapid progress of machine learning (ML) in predicting molecular properties has made high-precision predictions routinely achievable. However, many ML representations, such as the conventional molecular graph, cannot differentiate stereoisomers of certain types, particularly conformational and chiral ones that share the same bonding connectivity but differ in spatial arrangement. Here, we designed a hybrid molecular graph network, the Chemical Feature Fusion Network (CFFN), to address this issue by integrating planar and stereo information of molecules in an interweaved fashion. The three-dimensional (3D, i.e., stereo) modality guarantees precision and completeness by providing unabridged information, while the two-dimensional (2D, i.e., planar) modality brings in chemical intuition as prior knowledge for guidance. The zipper-like arrangement of 2D and 3D information processing promotes cooperativity between them, and their synergy is the key to our model's success. Experiments on various molecular and conformational datasets, including a newly created chiral-molecule dataset comprising various configurations and conformations, demonstrate the superior performance of CFFN. The advantage of CFFN is even more pronounced on datasets with small sample sizes. Ablation experiments confirm that fusing 2D and 3D molecular graphs into unambiguous molecular descriptors not only effectively distinguishes molecules and their conformations, but also achieves more accurate and robust prediction of quantum chemical properties.
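To make the two-modality idea concrete, here is a deliberately simplified PyTorch stand-in (not the published CFFN architecture): a 2D connectivity descriptor and a 3D geometry descriptor are encoded separately and fused before the property head, so two conformers with identical connectivity can still receive different predictions.

```python
import torch
import torch.nn as nn

class TwoThreeDFusion(nn.Module):
    def __init__(self, d2_dim=2048, d3_dim=256, hidden=128):
        super().__init__()
        self.enc2d = nn.Sequential(nn.Linear(d2_dim, hidden), nn.ReLU())
        self.enc3d = nn.Sequential(nn.Linear(d3_dim, hidden), nn.ReLU())
        self.mix = nn.Linear(2 * hidden, hidden)   # fuse the two views
        self.head = nn.Linear(hidden, 1)           # one predicted property

    def forward(self, x2d, x3d):
        h = torch.cat([self.enc2d(x2d), self.enc3d(x3d)], dim=-1)
        return self.head(torch.relu(self.mix(h)))

# x2d: e.g. a fingerprint or graph summary (identical across conformers);
# x3d: e.g. sorted interatomic distances (differ between conformers).
model = TwoThreeDFusion()
y = model(torch.randn(8, 2048), torch.randn(8, 256))
```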
Subjects
Machine Learning, Stereoisomerism, Molecular Conformation

ABSTRACT
Single-cell omics technologies have made it possible to analyze the individual cells within a biological sample, providing a more detailed understanding of biological systems. Accurately determining the cell type of each cell is a crucial goal in single-cell RNA-seq (scRNA-seq) analysis. Apart from overcoming the batch effects arising from various factors, single-cell annotation methods also face the challenge of effectively processing large-scale datasets. With the increasing availability of scRNA-seq datasets, integrating multiple datasets and addressing batch effects originating from diverse sources are additional challenges in cell-type annotation. To overcome these challenges, we developed CIForm, a supervised Transformer-based method for cell-type annotation of large-scale scRNA-seq data. To assess the effectiveness and robustness of CIForm, we compared it with leading tools on benchmark datasets. Systematic comparisons under various cell-type annotation scenarios show that CIForm performs particularly well. The source code and data are available at https://github.com/zhanglab-wbgcas/CIForm.
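One common way to apply a Transformer to expression vectors, sketched below under assumed dimensions (a generic illustration rather than CIForm's exact design), is to cut each cell's expression profile into fixed-length sub-vectors and treat those as tokens.

```python
import torch
import torch.nn as nn

class TokenizedCellClassifier(nn.Module):
    def __init__(self, n_genes=16000, token_len=1000, d_model=128, n_types=20):
        super().__init__()
        assert n_genes % token_len == 0
        self.n_tokens, self.token_len = n_genes // token_len, token_len
        self.embed = nn.Linear(token_len, d_model)      # sub-vector -> token
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_types)

    def forward(self, x):                               # x: (batch, n_genes)
        tokens = x.view(x.size(0), self.n_tokens, self.token_len)
        h = self.encoder(self.embed(tokens))            # (batch, tokens, d)
        return self.head(h.mean(dim=1))                 # pooled class logits

logits = TokenizedCellClassifier()(torch.randn(4, 16000))
```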
Subjects
Gene Expression Profiling, Single-Cell Gene Expression Analysis, Gene Expression Profiling/methods, RNA Sequence Analysis/methods, Single-Cell Analysis/methods, Software

ABSTRACT
The Major Histocompatibility Complex (MHC) is a critical element of the vertebrate cellular immune system, responsible for presenting peptides derived from intracellular proteins. MHC-I presentation is pivotal in the immune response and holds considerable potential for vaccine development and cancer immunotherapy. This study examines the limitations of current methods and benchmarks for MHC-I presentation. We introduce a novel benchmark designed to assess generalization properties and the reliability of models on unseen MHC molecules and peptides, with a focus on the Human Leukocyte Antigen (HLA), the subset of MHC genes present in humans. Finally, we introduce HLABERT, a pretrained language model that significantly outperforms previous methods on our benchmark and establishes a new state of the art on existing benchmarks.
Subjects
Peptides, Proteins, Humans, Reproducibility of Results, Peptides/chemistry, Proteins/metabolism, Major Histocompatibility Complex/genetics, Protein Binding

ABSTRACT
We present a chatbot assistant infrastructure (https://www.ebi.ac.uk/pride/chatbot/) that simplifies user interactions with the PRIDE database's documentation and dataset search functionality. The framework utilizes multiple Large Language Models (LLMs): llama2, chatglm, mixtral (mistral), and openhermes. It also includes a web service API (Application Programming Interface), a web interface, and components for indexing and managing vector databases. A benchmark component based on an Elo rating system is also included, which allows the performance of each LLM to be evaluated and the PRIDE documentation to be improved. The chatbot not only lets users interact with the PRIDE documentation but can also be used to search for and find PRIDE datasets through an LLM-based recommendation system, enabling dataset discoverability. Importantly, while our infrastructure is exemplified through its application in the PRIDE database context, the modular and adaptable nature of our approach makes it a valuable tool for improving user experiences across a spectrum of bioinformatics and proteomics tools and resources, among other domains. The integration of advanced LLMs, vector-database construction, the benchmarking framework, and optimized documentation collectively form a robust and transferable chatbot assistant infrastructure. The framework is open source (https://github.com/PRIDE-Archive/pride-chatbot).
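The Elo update underlying such a ranking is compact enough to show in full: after each pairwise comparison of two models' answers, rating moves from loser to winner in proportion to how surprising the outcome was.

```python
def elo_update(r_a, r_b, score_a, k=32):
    """score_a: 1.0 if A's answer wins, 0.0 if B's wins, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# e.g. two models start at 1500; 'mixtral' beats 'llama2' in one comparison:
r_mixtral, r_llama2 = elo_update(1500, 1500, score_a=1.0)  # 1516.0, 1484.0
```

Repeated over many human or automated judgments, the ratings converge toward a stable ordering of the LLMs on the documentation tasks.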
ABSTRACT
BACKGROUND: Literature-based discovery (LBD) aims to help researchers identify relations between concepts that are worthy of further investigation by text-mining the biomedical literature. While the LBD literature is rich and the field is considered mature, standard practice in the evaluation of LBD methods is methodologically poor and has not progressed on par with the domain. The lack of properly designed, adequately sized benchmark datasets hinders the progress of the field and its development into applications usable by biomedical experts. RESULTS: This work presents a method for mining past discoveries from the biomedical literature. It leverages the impact made by a discovery, using descriptive statistics to detect surges in the prevalence of a relation across time. The validity of the method is tested against a baseline representing the state-of-the-art "time-sliced" method. CONCLUSIONS: This method allows the collection of a large number of time-stamped discoveries. These can be used for LBD evaluation, alleviating the long-standing issue of inadequate evaluation. It may also pave the way for more fine-grained LBD methods, which could exploit the diversity of these past discoveries to train supervised models. Finally, the dataset (or some future version of it inspired by our method) could be used as a methodological tool for systematic reviews. We provide an online exploration tool in this perspective, available at https://brainmend.adaptcentre.ie/ .
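One simple way to operationalize surge detection on a relation's yearly prevalence (the particular statistic here is an assumption for illustration, not necessarily the paper's choice) is to flag the first year whose count exceeds the running history by several standard deviations:

```python
import numpy as np

def first_surge_year(years, counts, z_thresh=3.0, min_history=5):
    """Return the first year whose count is a z_thresh-sigma outlier
    relative to all preceding years, or None if no surge is found."""
    counts = np.asarray(counts, dtype=float)
    for i in range(min_history, len(counts)):
        mu, sd = counts[:i].mean(), counts[:i].std(ddof=1)
        if sd > 0 and (counts[i] - mu) / sd >= z_thresh:
            return years[i]          # candidate discovery year
    return None

# yearly co-occurrence counts of a hypothetical concept pair:
print(first_surge_year(list(range(1990, 2000)),
                       [2, 1, 3, 2, 2, 3, 2, 2, 25, 40]))  # -> 1998
```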
Subjects
Data Mining, Data Mining/methods

ABSTRACT
BACKGROUND: In recent years, gene clustering analysis has become a widely used tool for studying gene functions, efficiently categorizing genes with similar expression patterns to aid in identifying gene functions. Caenorhabditis elegans is commonly used in embryonic research due to its invariant cell lineage from fertilized egg to adulthood. Biologists use 4D confocal imaging to observe gene expression dynamics at the single-cell level. However, the observed tree-shaped time-series datasets have non-pairwise data points across individuals; moreover, cell-type heterogeneity should be considered during clustering to obtain more biologically meaningful results. RESULTS: A biclustering model is proposed for tree-shaped single-cell gene expression data of Caenorhabditis elegans. Specifically, a tree-shaped piecewise polynomial function is first employed to fit the non-pairwise gene expression time-series data. Four factors are then considered in the objective function: Pearson correlation coefficients capturing gene correlations, p-values from the Kolmogorov-Smirnov test measuring the similarity between cells, the gene expression magnitude, and the bicluster overlap size. A genetic algorithm is then used to optimize this function. CONCLUSION: Results on a small-scale dataset validate the feasibility and effectiveness of our model, which outperforms existing classical biclustering models. Gene enrichment analysis of the results on the complete real dataset confirms that the discovered biclusters hold significant biological relevance.
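The four objective terms map directly onto standard SciPy calls; the sketch below combines them with assumed weights and an assumed combination rule, since the abstract does not give the exact formula.

```python
import numpy as np
from itertools import combinations
from scipy.stats import pearsonr, ks_2samp

def bicluster_score(expr, genes, cells, others, w=(1.0, 1.0, 0.1, 0.5)):
    """expr: genes x cells matrix; genes/cells: index lists (>= 2 each);
    others: list of (gene_set, cell_set) for already-found biclusters."""
    sub = expr[np.ix_(genes, cells)]
    gene_corr = np.mean([abs(pearsonr(sub[i], sub[j])[0])          # term 1
                         for i, j in combinations(range(len(genes)), 2)])
    cell_sim = np.mean([ks_2samp(sub[:, i], sub[:, j]).pvalue      # term 2
                        for i, j in combinations(range(len(cells)), 2)])
    size = np.log(sub.size)                                        # term 3
    overlap = sum(len(set(genes) & g) * len(set(cells) & c)        # term 4
                  for g, c in others)
    return w[0]*gene_corr + w[1]*cell_sim + w[2]*size - w[3]*overlap
```

A genetic algorithm then evolves the gene and cell membership sets to maximize this score, penalizing biclusters that overlap ones already reported.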
Subjects
Caenorhabditis elegans, Single-Cell Analysis, Caenorhabditis elegans/genetics, Caenorhabditis elegans/metabolism, Animals, Single-Cell Analysis/methods, Cluster Analysis, Gene Expression Profiling/methods, Algorithms

ABSTRACT
BACKGROUND: Deep learning-based drug-target affinity (DTA) prediction methods have shown impressive performance, despite a high number of training parameters relative to the available data. Previous studies have highlighted the presence of dataset bias by suggesting that models trained solely on protein or ligand structures may perform similarly to those trained on complex structures. However, these studies did not propose solutions and focused solely on analyzing complex structure-based models. Even when ligands are excluded, protein-only models trained on complex structures still incorporate some ligand information at the binding sites. Therefore, it is unclear whether binding affinity can be accurately predicted using only compound or protein features due to potential dataset bias. In this study, we expanded our analysis to comprehensive databases and investigated dataset bias through compound and protein feature-based methods using multilayer perceptron models. We assessed the impact of this bias on current prediction models and proposed the binding affinity similarity explorer (BASE) web service, which provides bias-reduced datasets. RESULTS: By analyzing eight binding affinity databases using multilayer perceptron models, we confirmed a bias where the compound-protein binding affinity can be accurately predicted using compound features alone. This bias arises because most compounds show consistent binding affinities due to high sequence or functional similarity among their target proteins. Our Uniform Manifold Approximation and Projection analysis based on compound fingerprints further revealed that low and high variation compounds do not exhibit significant structural differences. This suggests that the primary factor driving the consistent binding affinities is protein similarity rather than compound structure. We addressed this bias by creating datasets with progressively reduced protein similarity between the training and test sets, observing significant changes in model performance. We developed the BASE web service to allow researchers to download and utilize these datasets. Feature importance analysis revealed that previous models heavily relied on protein features. However, using bias-reduced datasets increased the importance of compound and interaction features, enabling a more balanced extraction of key features. CONCLUSIONS: We propose the BASE web service, providing both the affinity prediction results of existing models and bias-reduced datasets. These resources contribute to the development of generalized and robust predictive models, enhancing the accuracy and reliability of DTA predictions in the drug discovery process. BASE is freely available online at https://synbi2024.kaist.ac.kr/base .
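The bias probe described above is easy to reproduce in outline: train a multilayer perceptron on compound features alone and check how far that already goes under a random split versus a protein-dissimilar split. The sketch below uses placeholder arrays; real fingerprints and affinities (e.g. from the BASE datasets) would be substituted.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

X = np.random.rand(2000, 1024)   # placeholder compound fingerprints
y = np.random.rand(2000) * 10    # placeholder binding affinities (e.g. pKd)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
mlp = MLPRegressor(hidden_layer_sizes=(256, 64), max_iter=300,
                   random_state=0).fit(X_tr, y_tr)
r, _ = pearsonr(y_te, mlp.predict(X_te))
print(f"compound-only Pearson r = {r:.2f}")
# A high r with no protein input at all signals leakage through protein
# similarity; rerunning on a split with reduced train/test protein
# similarity should lower it markedly.
```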
Subjects
Protein Binding, Proteins, Proteins/chemistry, Proteins/metabolism, Ligands, Internet, Protein Databases, Binding Sites, Deep Learning, Software

ABSTRACT
BACKGROUND: Detecting structural variations (SVs) at the population level using next-generation sequencing (NGS) requires substantial computational resources and processing time. Here, we compared the performances of 11 SV callers: Delly, Manta, GridSS, Wham, Sniffles, Lumpy, SvABA, Canvas, CNVnator, MELT, and INSurVeyor. These SV callers have been recently published and have been widely employed for processing massive whole-genome sequencing datasets. We evaluated the accuracy, sequence depth, running time, and memory usage of the SV callers. RESULTS: Notably, several callers exhibited better calling performance for deletions than for duplications, inversions, and insertions. Among the SV callers, Manta identified deletion SVs with better performance and efficient computing resources, and both Manta and MELT demonstrated relatively good precision regarding calling insertions. We confirmed that the copy number variation callers, Canvas and CNVnator, exhibited better performance in identifying long duplications as they employ the read-depth approach. Finally, we also verified the genotypes inferred from each SV caller using a phased long-read assembly dataset, and Manta showed the highest concordance in terms of the deletions and insertions. CONCLUSIONS: Our findings provide a comprehensive understanding of the accuracy and computational efficiency of SV callers, thereby facilitating integrative analysis of SV profiles in diverse large-scale genomic datasets.
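For context on how such comparisons are scored, deletion calls are commonly matched to a truth set by reciprocal overlap; the 50% threshold below is a widespread convention assumed here, not a figure quoted from the paper.

```python
def reciprocal_overlap(a, b, min_ro=0.5):
    """a, b: (chrom, start, end) deletion intervals."""
    if a[0] != b[0]:
        return False
    inter = min(a[2], b[2]) - max(a[1], b[1])
    if inter <= 0:
        return False
    return inter / (a[2] - a[1]) >= min_ro and inter / (b[2] - b[1]) >= min_ro

def precision_recall(calls, truth):
    tp = sum(any(reciprocal_overlap(c, t) for t in truth) for c in calls)
    found = sum(any(reciprocal_overlap(t, c) for c in calls) for t in truth)
    return tp / len(calls), found / len(truth)

calls = [("chr1", 100, 1100), ("chr1", 5000, 5600)]
truth = [("chr1", 150, 1050)]
print(precision_recall(calls, truth))   # (0.5, 1.0)
```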
Subjects
DNA Copy Number Variations, Genomics, Humans, Whole Genome Sequencing, High-Throughput Nucleotide Sequencing, DNA Sequence Analysis, Human Genome, Genomic Structural Variation

ABSTRACT
Fetal brain development is a complex process involving different stages of growth and organization that are crucial for the development of brain circuits and neural connections. Fetal atlases and labeled datasets are promising tools for investigating prenatal brain development. They support the identification of atypical brain patterns, providing insights into potential early signs of clinical conditions. In short, prenatal brain imaging and modern post-processing tools form a cutting-edge field that will significantly advance our understanding of fetal development. In this work, we first provide terminological clarification for specific terms (i.e., "brain template" and "brain atlas"), highlighting potentially misleading interpretations arising from inconsistent use of these terms in the literature. We discuss the major structures and neurodevelopmental milestones characterizing fetal brain ontogenesis. Our main contribution is a systematic review of 18 prenatal brain atlases and 3 datasets. We also briefly address the clinical, research, and ethical implications of prenatal neuroimaging.
Subjects
Atlases as Topic, Brain, Magnetic Resonance Imaging, Neuroimaging, Female, Humans, Pregnancy, Brain/diagnostic imaging, Brain/embryology, Datasets as Topic, Fetal Development/physiology, Fetus/diagnostic imaging, Magnetic Resonance Imaging/methods, Neuroimaging/methods

ABSTRACT
For the past century, the nucleus has been the focus of extensive investigations in cell biology. However, many questions remain about how its shape and size are regulated during development, in different tissues, or during disease and aging. To track these changes, microscopy has long been the tool of choice. Image analysis has revolutionized this field of research by providing computational tools that can be used to translate qualitative images into quantitative parameters. Many tools have been designed to delimit objects in 2D and, eventually, in 3D in order to define their shapes, their number or their position in nuclear space. Today, the field is driven by deep-learning methods, most of which take advantage of convolutional neural networks. These techniques are remarkably adapted to biomedical images when trained using large datasets and powerful computer graphics cards. To promote these innovative and promising methods to cell biologists, this Review summarizes the main concepts and terminologies of deep learning. Special emphasis is placed on the availability of these methods. We highlight why the quality and characteristics of training image datasets are important and where to find them, as well as how to create, store and share image datasets. Finally, we describe deep-learning methods well-suited for 3D analysis of nuclei and classify them according to their level of usability for biologists. Out of more than 150 published methods, we identify fewer than 12 that biologists can use, and we explain why this is the case. Based on this experience, we propose best practices to share deep-learning methods with biologists.
Subjects
Deep Learning, Cell Nucleus, Computer-Assisted Image Processing/methods, Three-Dimensional Imaging, Microscopy/methods, Neural Networks (Computer)

ABSTRACT
Automated relation extraction (RE) from biomedical literature is critical for many downstream text mining applications in both research and real-world settings. However, most existing benchmarking datasets for biomedical RE only focus on relations of a single type (e.g. protein-protein interactions) at the sentence level, greatly limiting the development of RE systems in biomedicine. In this work, we first review commonly used named entity recognition (NER) and RE datasets. Then, we present a first-of-its-kind biomedical relation extraction dataset (BioRED) with multiple entity types (e.g. gene/protein, disease, chemical) and relation pairs (e.g. gene-disease; chemical-chemical) at the document level, on a set of 600 PubMed abstracts. Furthermore, we label each relation as describing either a novel finding or previously known background knowledge, enabling automated algorithms to differentiate between novel and background information. We assess the utility of BioRED by benchmarking several existing state-of-the-art methods, including Bidirectional Encoder Representations from Transformers (BERT)-based models, on the NER and RE tasks. Our results show that while existing approaches can reach high performance on the NER task (F-score of 89.3%), there is much room for improvement for the RE task, especially when extracting novel relations (F-score of 47.7%). Our experiments also demonstrate that such a rich dataset can successfully facilitate the development of more accurate, efficient and robust RE systems for biomedicine. Availability: The BioRED dataset and annotation guidelines are freely available at https://ftp.ncbi.nlm.nih.gov/pub/lu/BioRED/.
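The reported F-scores are computed over extracted relation triples; a minimal scorer of the usual form (the exact matching criteria are an assumption here) looks like this:

```python
def f1(pred, gold):
    """pred, gold: sets of (doc_id, entity_1, entity_2, relation_type)."""
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = {("PMID1", "geneA", "diseaseB", "association")}
pred = {("PMID1", "geneA", "diseaseB", "association"),
        ("PMID1", "chemC", "chemD", "interaction")}
print(f1(pred, gold))   # 0.667: one correct triple, one spurious
```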
Subjects
Algorithms, Data Mining, Proteins, PubMed

ABSTRACT
In recent years, digital dentistry has increasingly utilized advanced image analysis techniques, such as image classification and disease diagnosis, to improve clinical outcomes. Despite these advances, the lack of comprehensive benchmark datasets is a significant barrier. To address this gap, our research team developed LMCD-OR, a substantial collection of oral radiograph images designed to support extensive artificial intelligence (AI)-driven diagnostics. LMCD-OR comprises 3,818 digital imaging and communications in medicine (DICOM) oral X-ray images from local medical institutions that are meticulously annotated to provide broad category information for both primary dental outpatient services and detailed secondary disease diagnoses. This dataset is engineered to train and validate multiclass models that improve the precision and scope of oral disease diagnostics. To ensure robust dataset validation, we employ four cutting-edge visual neural network classification models as benchmarks. These models are tested against rigorous performance metrics, demonstrating the ability of the dataset to support advanced image classification and disease diagnosis tasks. LMCD-OR is publicly available at http://dentaldataset.zeroacademy.net .
Subjects
Neural Networks (Computer), Humans, Datasets as Topic, Artificial Intelligence

ABSTRACT
Deep learning approaches have frequently been used in the classification and segmentation of human peripheral blood cells. Previous studies have typically used more than one dataset but analyzed them separately; no study was found that combines more than two datasets for joint use. Here, five types of white blood cells were classified using a mixture of four different datasets. For segmentation, four types of white blood cells were delineated using three different neural networks: a CNN (convolutional neural network), UNet, and SegNet. The classification results of the present study were compared with those of related studies. The balanced accuracy was 98.03%, and the test accuracy on the train-independent dataset was 97.27%. For segmentation, the proposed CNN achieved accuracies of 98.9% on the train-dependent dataset and 92.82% on the train-independent dataset for both nucleus and cytoplasm detection. The proposed method thus detects white blood cells in a train-independent dataset with high accuracy, and its strong classification and segmentation results make it promising as a diagnostic tool for clinical use.
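Balanced accuracy, the headline metric above, is the unweighted mean of per-class recalls, which matters because white-blood-cell differentials are heavily imbalanced; a quick check with scikit-learn:

```python
from sklearn.metrics import balanced_accuracy_score

# toy differential: 80 neutrophils, 15 lymphocytes, 5 basophils
y_true = ["neutrophil"] * 80 + ["lymphocyte"] * 15 + ["basophil"] * 5
y_pred = (["neutrophil"] * 80 + ["lymphocyte"] * 10
          + ["neutrophil"] * 5 + ["basophil"] * 5)

# raw accuracy is 0.95, but balanced accuracy ~= 0.889 because the
# lymphocyte recall (10/15) drags the per-class average down
print(balanced_accuracy_score(y_true, y_pred))
```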
Subjects
Deep Learning, Leukocytes, Neural Networks (Computer), Humans, Leukocytes/cytology, Leukocytes/classification, Computer-Assisted Image Processing/methods, Data Analysis, Cell Nucleus, Cytoplasm

ABSTRACT
In this manuscript, an attentive dual residual generative adversarial network optimized with the wild horse optimization algorithm for brain tumor detection (ADRGAN-WHOA-BTD) is proposed. Input images are gathered from the BraTS, RemBRANDT, and Figshare datasets. The images are first preprocessed with the dual-tree complex wavelet transform (DTCWT) to improve image quality and remove unwanted noise. Image features such as geodesic data, along with texture features such as contrast, energy, correlation, homogeneity, and entropy, are extracted using multilayer DenseNet methods. The extracted features are then passed to the attentive dual residual generative adversarial network (ADRGAN) classifier to classify the brain images. The ADRGAN weight parameters are tuned with the wild horse optimization algorithm (WHOA). The proposed method is implemented in MATLAB. On the BraTS dataset, the ADRGAN-WHOA-BTD method achieved accuracy, sensitivity, specificity, F-measure, precision, and error rates of 99.85%, 99.82%, 98.92%, 99.76%, 99.45%, and 0.15%, respectively. The proposed technique also demonstrated a runtime of 13 s, significantly outperforming existing methods.
ABSTRACT
BACKGROUND AND OBJECTIVES: Current national or regional guidelines for the pathology reporting on invasive breast cancer differ in certain aspects, resulting in divergent reporting practice and a lack of comparability of data. Here we report on a new international dataset for the pathology reporting of resection specimens with invasive cancer of the breast. The dataset was produced under the auspices of the International Collaboration on Cancer Reporting (ICCR), a global alliance of major (inter-)national pathology and cancer organizations. METHODS AND RESULTS: The established ICCR process for dataset development was followed. An international expert panel consisting of breast pathologists, a surgeon, and an oncologist prepared a draft set of core and noncore data items based on a critical review and discussion of current evidence. Commentary was provided for each data item to explain the rationale for selecting it as a core or noncore element, its clinical relevance, and to highlight potential areas of disagreement or lack of evidence, in which case a consensus position was formulated. Following international public consultation, the document was finalized and ratified, and the dataset, which includes a synoptic reporting guide, was published on the ICCR website. CONCLUSIONS: This first international dataset for invasive cancer of the breast is intended to promote high-quality, standardized pathology reporting. Its widespread adoption will improve consistency of reporting, facilitate multidisciplinary communication, and enhance comparability of data, all of which will help to improve the management of invasive breast cancer patients.
Subjects
Breast Neoplasms, Humans, Breast Neoplasms/pathology, Female, Clinical Pathology/standards, Datasets as Topic/standards

ABSTRACT
BACKGROUND: Paucity and low evidence-level data on proton therapy (PT) represent one of the main issues for the establishment of solid indications in the PT setting. The aim of the present registry, the POWER registry, is to provide a tool for systematic, prospective, harmonized, and multidimensional high-quality data collection to promote knowledge in the field of PT, with a particular focus on the use of hypofractionation. METHODS: All patients with any type of oncologic disease (benign or malignant) eligible for PT at the European Institute of Oncology (IEO), Milan, Italy, will be included in the present registry. Three levels of data collection will be implemented: Level 1, clinical research (patient outcomes and toxicity, quality of life, and cost-effectiveness analysis); Level 2, radiological and radiobiological research (radiomic and dosiomic analysis, as well as biological modeling); and Level 3, biological and translational research (biological biomarkers and genomic data analysis). Endpoints and outcome measures of hypofractionation schedules will be evaluated in terms of both treatment efficacy (tumor response rate; time to progression, percentage of survivors, and median survival; and changes in clinical, biological, and radiological biomarkers identified as surrogate endpoints of cancer survival and treatment response) and toxicity. The study protocol has been approved by the IEO Ethical Committee (IEO 1885). Beyond patients treated at IEO, additional PT facilities (equipped with Proteus®ONE or Proteus®PLUS technologies by IBA, Ion Beam Applications, Louvain-la-Neuve, Belgium) are expected to join the registry data collection. Moreover, the registry will also be fully integrated into international PT data collection networks.