ABSTRACT
Proteogenomics combines proteomic and genetic data to gain new insights into molecular mechanisms. Here, we extend this approach toward structural biology from a tool perspective. The chapter starts with tools that can be used to explore genetic information and then enrich it with proteomic data. Based on the corresponding identifiers, three-dimensional structures of proteins are identified and embedded in their molecular environment, here the surrounding membrane. This membrane is then mapped onto the surface of an interpretative three-dimensional cell model. The embedded protein and the cell environment are then associated with a metabolic pathway, again based on the identifiers provided by biomedical databases. Alongside the individual sections, related work that can be used as an alternative is discussed. Finally, an outlook toward immersive analytics is given.
Subject(s)
Proteomics, Proteomics/methods, Software, Humans, Computational Biology/methods, Models, Molecular, Metabolic Networks and Pathways
ABSTRACT
Integrating diverse measurement platforms can yield profound insights. This study examined Brazilian Canephora coffees from Rondônia (Western Amazon) and Espírito Santo (southeast), hypothesizing that geographical and climatic differences, along with botanical varieties, significantly impact coffee characteristics. To test this, Capixaba, indigenous, and non-indigenous Amazonian Canephora coffees were analyzed using nine distinct platforms, including both spectroscopic techniques and sensory evaluations, to obtain results that are more informative and complementary than conventional single-method analyses. By applying multi-block Path-ComDim analysis to the multiple data sets, we uncovered crucial correlations between instrumental and sensory measurements. This integrated approach not only confirmed the hypothesis but also demonstrated that combining multiple data sets provides a more nuanced understanding of coffee profiles than traditional single-method analyses. The results underscore the value of multiplatform approaches in enhancing coffee quality evaluation, offering a more detailed and comprehensive view of coffee characteristics that can drive future research and improve industry standards.
ABSTRACT
Introduction: Kidney transplantation is the optimal treatment for end-stage kidney disease; however, premature allograft loss remains a serious issue. While many high-throughput omics studies have analyzed patient allograft biospecimens, integration of these datasets is challenging, which represents a considerable barrier to advancing our understanding of the mechanisms of allograft loss. Methods: To facilitate integration, we have created a curated database containing all open-access high-throughput datasets from human kidney transplant studies, termed NephroDIP (Nephrology Data Integration Portal). PubMed was searched for high-throughput transcriptomic, proteomic, single nucleotide variant, metabolomic, and epigenomic studies in kidney transplantation, which yielded 9,964 studies. Results: From these, 134 studies with available data detailing 260 comparisons and 83,262 molecules were included in NephroDIP v1.0. To illustrate the capabilities of NephroDIP, we have used the database to identify common gene, protein, and microRNA networks that are disrupted in patients with chronic antibody-mediated rejection, the most important cause of late allograft loss. We have also explored the role of the immunomodulatory protein galectin-1 (LGALS1), along with its interactors and transcriptional regulators, in kidney allograft injury. We highlight the pathways enriched among LGALS1 interactors and transcriptional regulators in kidney fibrosis and during immunosuppression. Discussion: NephroDIP is an open-access data portal that facilitates data visualization and will help provide new insights into existing kidney transplant data through integration of distinct studies and modules (https://ophid.utoronto.ca/NephroDIP).
Subject(s)
Graft Rejection, Kidney Transplantation, Humans, Kidney Transplantation/adverse effects, Graft Rejection/immunology, Graft Rejection/genetics, Allografts/immunology, Databases, Factual, Kidney/metabolism, Kidney/pathology, Kidney/immunology, Proteomics/methods
ABSTRACT
Despite recent advances in chronic obstructive pulmonary disease (COPD) research, few studies have systematically identified potential therapeutic targets by integrating multi-omics datasets. This project aimed to develop a systems biology pipeline to identify biologically relevant genes and potential therapeutic targets that could be exploited to discover novel COPD treatments via drug repurposing or de novo drug discovery. A computational method was implemented by integrating multi-omics COPD data from unpaired human samples of more than half a million subjects. The outcomes from genome, transcriptome, proteome, and metabolome COPD studies were included, followed by an in silico interactome and drug-target information analysis. The potential candidate genes were ranked by a distance-based network computational model. Ninety-two genes were identified as COPD signature genes based on their overall proximity to signature genes on all omics levels. They encode proteins involved in extracellular matrix structural constituents, collagen binding, protease binding, actin binding, and other functions. Among them, 70 signature genes were determined to be druggable targets. The in silico validation identified that the knockout or over-expression of SPP1, APOA1, CTSD, TIMP1, RXFP1, and SMAD3 genes may drive the cell transcriptomics to a status similar to or contrasting with COPD. While some genes identified in our pipeline have been previously associated with COPD pathology, others represent possible new targets for COPD therapy development. In conclusion, we have identified promising therapeutic targets for COPD. This hypothesis-generating pipeline was supported by unbiased information from available omics datasets and took into consideration disease relevance and development feasibility.
Subject(s)
Drug Repositioning, Pulmonary Disease, Chronic Obstructive, Pulmonary Disease, Chronic Obstructive/drug therapy, Pulmonary Disease, Chronic Obstructive/genetics, Pulmonary Disease, Chronic Obstructive/metabolism, Humans, Drug Repositioning/methods, Transcriptome, Computational Biology/methods, Proteome/metabolism, Smad3 Protein/metabolism, Smad3 Protein/genetics, Genomics/methods, Tissue Inhibitor of Metalloproteinase-1/genetics, Tissue Inhibitor of Metalloproteinase-1/metabolism, Receptors, G-Protein-Coupled/genetics, Receptors, G-Protein-Coupled/metabolism, Multiomics
ABSTRACT
Multi-omics data integration is a term that refers to the process of combining and analyzing data from different omic experimental sources, such as genomics, transcriptomics, methylation assays, and microRNA sequencing, among others. Such data integration approaches have the potential to provide a more comprehensive functional understanding of biological systems and have numerous applications in areas such as disease diagnosis, prognosis, and therapy. However, quantitative integration of multi-omic data is a complex task that requires the use of highly specialized methods and approaches. Here, we discuss a number of data integration methods that have been developed with multi-omics data in view, including statistical methods, machine learning approaches, and network-based approaches. We also discuss the challenges and limitations of such methods and provide examples of their applications in the literature. Overall, this review aims to provide an overview of the current state of the field and highlight potential directions for future research.
ABSTRACT
This article investigates the application of an improved three-dimensional convolutional neural network (3D CNN) for sparse data-based reconstruction of radiation fields. Sparse radiation data points are consolidated into structured three-dimensional matrices and fed into a self-attention integrated CNN, enabling the network to interpolate and produce complete radiation distribution grids. The model's validity is assessed through experiments with randomly sourced radiation in scenarios both with and without shielding, as well as in refined grid configurations. Results indicate that in unshielded environments, a mere 5% (15 points) sampling yields an average relative error of 4%, while in shielded settings, a 7% (21 points) sampling maintains the error around 11%. In refined grid contexts, a 2% sampling rate suffices to limit the error to 6.58%. Thus, the improved 3D CNN is demonstrated to be highly effective for precise three-dimensional radiation field reconstruction in sparse data scenarios.
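The consolidation step, sparse sample points into a structured 3D matrix, can be sketched as follows (a minimal illustration under our own naming assumptions, not the paper's code):

```python
# Sketch of the data-preparation step only (our illustration): sparse
# detector readings are consolidated into a dense 3D grid, with
# unsampled voxels zero-filled and a companion mask marking which
# voxels hold real measurements.

def build_input_grid(samples, shape):
    """samples: iterable of ((i, j, k), dose); shape: (nx, ny, nz)."""
    nx, ny, nz = shape
    grid = [[[0.0] * nz for _ in range(ny)] for _ in range(nx)]
    mask = [[[0] * nz for _ in range(ny)] for _ in range(nx)]
    for (i, j, k), dose in samples:
        grid[i][j][k] = dose   # measured dose at this voxel
        mask[i][j][k] = 1      # 1 = sampled, 0 = to be interpolated
    return grid, mask
```

The mask channel lets a network distinguish a genuinely zero dose from an unsampled voxel; feeding both grids as input channels is one common convention for this kind of sparse-to-dense interpolation.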
ABSTRACT
Transcriptomic data is often expensive and difficult to generate in large cohorts relative to genomic data; therefore, it is often important to integrate transcriptomic datasets from both microarray and next-generation sequencing (NGS) platforms across similar experiments or clinical trials to improve analytical power and the discovery of novel transcripts and genes. However, transcriptomic data integration presents a few challenges, including reannotation and batch-effect removal. We developed the Gene Expression Data Integration (GEDI) R package to enable transcriptomic data integration by combining existing R packages. With just four functions, the GEDI R package makes constructing a transcriptomic data integration pipeline straightforward. Together, the functions overcome the complications in transcriptomic data integration by automatically reannotating the data and removing the batch effect. The removal of the batch effect is verified with principal component analysis, and the data integration is verified using a logistic regression model with forward stepwise feature selection. To demonstrate the functionalities of the GEDI package, we integrated five bovine endometrial transcriptomic datasets from the NCBI Gene Expression Omnibus. These transcriptomic datasets were from multiple high-throughput platforms, namely, array-based Affymetrix and Agilent platforms, and the NGS-based Illumina paired-end RNA-seq platform. Furthermore, we compared the GEDI package to existing tools and found that GEDI is the only tool that provides a full transcriptomic data integration pipeline, including verification of both batch-effect removal and data integration, for downstream genomic and bioinformatics applications. © 2024 The Author(s). Current Protocols published by Wiley Periodicals LLC.
Basic Protocol 1: ReadGE, a function to import gene expression datasets
Basic Protocol 2: GEDI, a function to reannotate and merge gene expression datasets
Basic Protocol 3: BatchCorrection, a function to remove batch effects from gene expression data
Basic Protocol 4: VerifyGEDI, a function to confirm successful integration of gene expression data.
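As a minimal illustration of what batch-effect removal means here (our sketch, not GEDI's method, which wraps established R correction procedures): additive location shifts between batches can be removed by centering each gene's expression within its batch.

```python
# Per-batch mean-centering for one gene across samples (illustrative
# only; real pipelines such as GEDI use established batch-correction
# methods and verify the result with principal component analysis).

def center_by_batch(values, batches):
    """values: expression of one gene per sample; batches: batch labels."""
    sums = {}
    for v, b in zip(values, batches):
        sums.setdefault(b, []).append(v)
    means = {b: sum(vs) / len(vs) for b, vs in sums.items()}
    # Subtract each sample's batch mean, removing batch-level shifts
    return [v - means[b] for v, b in zip(values, batches)]
```

After centering, both batches share a common location, so a downstream PCA should no longer separate samples primarily by batch, which is the sanity check the protocol above performs.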
Subject(s)
Computational Biology, Gene Expression Profiling, Transcriptome, Computational Biology/methods, Gene Expression Profiling/methods, Software, Animals, High-Throughput Nucleotide Sequencing, Cattle
ABSTRACT
The management and exchange of electronic health records (EHRs) remain critical challenges in healthcare, with fragmented systems, varied standards, and security concerns hindering seamless interoperability. These challenges compromise patient care and operational efficiency. This paper proposes a novel solution to address these issues by leveraging distributed ledger technology (DLT), including blockchain, to enhance data security, integrity, and transparency in healthcare systems. The decentralized and immutable nature of DLT enables more efficient and secure information exchange across platforms, improving decision-making and coordination of care. This paper outlines a strategic implementation approach, detailing timelines, resource requirements, and stakeholder involvement while addressing crucial privacy and security concerns like encryption and access control. In addition, it explores standards and protocols necessary for achieving interoperability, offering case studies that demonstrate the framework's effectiveness. This work contributes by introducing a DLT-based solution to the persistent issue of EHR interoperability, providing a novel pathway to secure and efficient health data exchanges. It also identifies the standards and protocols essential for integrating DLT with existing health information systems, thereby facilitating a smoother transition toward enhanced interoperability.
ABSTRACT
Population-based cancer registry databases are critical resources to bridge the information gap that results from a lack of sufficient statistical power from primary cohort data with small to moderate sample sizes. Although comprehensive data associated with tumor biomarkers often remain either unavailable or inconsistently measured in these registry databases, aggregate survival information sourced from these repositories has been well documented and publicly accessible. An appealing option is to integrate the aggregate survival information from the registry data with the primary cohort to enhance the evaluation of treatment impacts or prediction of survival outcomes across distinct tumor subtypes. Nevertheless, for rare types of cancer, even the sample sizes of cancer registries remain modest. The variability linked to the aggregated statistics could be non-negligible compared with the sample variation of the primary cohort. In response, we propose an externally informed likelihood approach, which facilitates the linkage between the primary cohort and external aggregate data, with consideration of the variation from aggregate information. We establish the asymptotic properties of the estimators and evaluate the finite sample performance via simulation studies. Through the application of our proposed method, we integrate data from the cohort of inflammatory breast cancer (IBC) patients at the University of Texas MD Anderson Cancer Center with aggregate survival data from the National Cancer Data Base, enabling us to appraise the effect of tri-modality treatment on survival across various tumor subtypes of IBC.
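As a sketch of the general idea (our notation and the Gaussian form are assumptions, not necessarily the authors' exact formulation), an externally informed log-likelihood augments the primary-cohort likelihood with a term weighted by the aggregate statistic's own variance:

```latex
\ell(\theta) \;=\; \underbrace{\sum_{i=1}^{n} \log f\!\left(t_i \mid x_i;\, \theta\right)}_{\text{primary cohort}}
\;-\; \underbrace{\frac{\left(\hat{S}_{\mathrm{ext}} - S(\theta)\right)^{2}}{2\,\hat{\sigma}_{\mathrm{ext}}^{2}}}_{\text{external aggregate survival}}
```

Because \(\hat{\sigma}_{\mathrm{ext}}^{2}\) appears explicitly, the variability of the registry-based aggregate is not treated as negligible relative to the primary cohort, which is the point emphasized above for rare cancers.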
Subject(s)
Registries, Humans, Survival Analysis, Registries/statistics & numerical data, Likelihood Functions, Computer Simulation, Female, Breast Neoplasms/mortality, Uncertainty, Models, Statistical, Data Interpretation, Statistical, Cohort Studies
ABSTRACT
BACKGROUND: Cannabis use has been causally linked to violent behaviors in experimental and case studies, but its association with homicide victimization has not been rigorously assessed through epidemiologic research. METHODS: We performed a case-control analysis using two national data systems. Cases were homicide victims from the National Violent Death Reporting System (NVDRS), and controls were participants from the National Survey on Drug Use and Health (NSDUH). While the NVDRS contained toxicological testing data on cannabis use, the NSDUH only collected self-reported data, and thus the potential misclassification in the self-reported data needed to be corrected. We took a data fusion approach by concatenating the NSDUH with a third data system, the National Roadside Survey of Alcohol and Drug Use by Drivers (NRS), which collected toxicological testing and self-reported data on cannabis use for drivers. The data fusion approach provided multiple imputations (MIs) of toxicological testing results on cannabis use for the participants in the NSDUH, which were then used in the case-control analysis. Bootstrap was used to obtain valid statistical inference. RESULTS: The analyses revealed that cannabis use was associated with 3.55-fold (95% CI: 2.75-4.35) increased odds of homicide victimization. Alcohol use, being Black, male, aged 21-34 years, and having less than a high school education were also significantly associated with increased odds of homicide victimization. CONCLUSIONS: Cannabis use is a major risk factor for homicide victimization. The data fusion with MI method is useful in integrative data analysis for harmonizing measures between different data sources.
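The case-control estimate itself reduces to an odds ratio on a 2x2 exposure table, and the bootstrap used for inference can be sketched in a few lines (illustrative counts and helper names are ours; the actual analysis additionally pools results over the multiple imputations of toxicological status):

```python
# Illustrative sketch: odds ratio from a 2x2 table, plus a percentile
# bootstrap over resampled case/control exposure indicators.
import random

def odds_ratio(case_exp, case_unexp, ctrl_exp, ctrl_unexp):
    """Cross-product ratio (ad)/(bc) of a 2x2 case-control table."""
    return (case_exp * ctrl_unexp) / (case_unexp * ctrl_exp)

def bootstrap_or(cases, controls, n_boot=1000, seed=0):
    """cases/controls: lists of 0/1 exposure indicators; returns sorted ORs."""
    rng = random.Random(seed)
    ors = []
    for _ in range(n_boot):
        cs = [rng.choice(cases) for _ in cases]        # resample cases
        ct = [rng.choice(controls) for _ in controls]  # resample controls
        a, b = sum(cs), len(cs) - sum(cs)
        c, d = sum(ct), len(ct) - sum(ct)
        if a and b and c and d:  # skip degenerate tables
            ors.append((a * d) / (b * c))
    return sorted(ors)
```

A 95% interval is then read off the 2.5th and 97.5th percentiles of the sorted bootstrap odds ratios; with imputed exposure, the resampling is repeated within each imputed dataset before pooling.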
ABSTRACT
Spatial transcriptomics technologies have been widely applied to decode cellular distribution by resolving gene expression profiles in tissue. However, sequencing techniques still limit the ability to create a finely resolved spatial cell-type map. To this end, we develop a novel deep-learning-based approach, STASCAN, which predicts the spatial cellular distribution of captured or uncharted areas where only histology images are available, by learning cell features that integrate gene expression profiles and histology images. STASCAN is successfully applied across diverse datasets from different spatial transcriptomics technologies and displays significant advantages in deciphering higher-resolution cellular distribution and resolving enhanced organizational structures.
Subject(s)
Deep Learning, Transcriptome, Humans, Gene Expression Profiling/methods, Animals
ABSTRACT
Precision medicine has gained considerable popularity since the "one-size-fits-all" approach did not seem very effective or reflective of the complexity of the human body. Subsequently, since single-omics does not reflect the complexity of the human body's inner workings, it did not result in the expected advancement in the medical field. Therefore, the multi-omics approach has emerged. The multi-omics approach involves integrating data from different omics technologies, such as DNA sequencing, RNA sequencing, mass spectrometry, and others, using computational methods and then analyzing the integrated result for different downstream analysis applications such as survival analysis, cancer classification, or biomarker identification. Most recent reviews were constrained to discussing one aspect of the multi-omics analysis pipeline, such as the dimensionality reduction step, the integration methods, or the interpretability aspect; however, very few provide a comprehensive review of every step of the analysis. This study aims to give an overview of the multi-omics analysis pipeline: it starts with the most popular multi-omics databases used in recent literature, surveys dimensionality reduction techniques, details the different types of data integration techniques and their downstream analysis applications, describes the most commonly used evaluation metrics, highlights the importance of model interpretability, and lastly discusses the challenges and potential future work for multi-omics data integration in precision medicine.
ABSTRACT
There are myriad types of biomedical data-molecular, clinical images, and others. When a group of patients with the same underlying disease exhibits similarities across multiple types of data, this is called a subtype. Existing subtyping approaches struggle to handle diverse data types with missing information. To improve subtype discovery, we exploited changes in the correlation-structure between different data types to create iSubGen, an algorithm for integrative subtype generation. iSubGen can accommodate any feature that can be compared with a similarity metric, making subtype creation highly versatile. It can combine arbitrary data types for subtype discovery, such as merging genetic, transcriptomic, proteomic, and pathway data. iSubGen recapitulates known subtypes across multiple cancers even with substantial missing data and identifies subtypes with distinct clinical behaviors. It performs on par with or better than other subtyping methods, offering greater stability and robustness to missing data and flexibility to new data types. It is available at https://cran.r-project.org/web/packages/iSubGen.
ABSTRACT
In this perspective paper, we propose a novel tech-driven method to evaluate body representations (BRs) in autistic individuals. Our goal is to deepen understanding of this complex condition by gaining continuous and real-time insights through digital phenotyping into the behavior of autistic adults. Our innovative method combines cross-sectional and longitudinal data gathering techniques to investigate and identify digital phenotypes related to BRs in autistic adults, diverging from traditional approaches. We incorporate ecological momentary assessment and time series data to capture the dynamic nature of real-life events for these individuals. Statistical techniques, including multivariate regression, time series analysis, and machine learning algorithms, offer a detailed comprehension of the complex elements that influence BRs. Ethical considerations and participant involvement in the development of this method are emphasized, while challenges, such as varying technological adoption rates and usability concerns, are acknowledged. This innovative method not only introduces a novel vision for evaluating BRs but also shows promise in integrating traditional and dynamic assessment approaches, fostering a more supportive atmosphere for autistic individuals during assessments compared to conventional methods.
Subject(s)
Autistic Disorder, Machine Learning, Humans, Autistic Disorder/physiopathology, Autistic Disorder/psychology, Adult, Phenotype, Body Image/psychology, Algorithms, Cross-Sectional Studies, Male
ABSTRACT
Technical or biologically irrelevant differences caused by different experiments, times, or sequencing platforms can generate batch effects that mask the true biological information. Therefore, batch effects are typically removed when analyzing single-cell RNA sequencing (scRNA-seq) datasets for downstream tasks. Existing batch correction methods usually mitigate batch effects by reducing the data from different batches to a lower-dimensional space before clustering, potentially leading to the loss of rare cell types. To address this problem, we introduce a novel single-cell batch-effect correction model, termed BDACL, that uses a Biological-noise Decoupling Autoencoder (BDA) and a Central-cross Loss. The model initially reconstructs raw data using an auto-encoder and conducts preliminary clustering. We then construct a similarity matrix and a hierarchical clustering tree to delineate relationships within and between different batches. Finally, we introduce a Central-cross Loss (CL). This loss leverages cross-entropy loss to prompt the model to better distinguish between different cluster labels. Additionally, it employs the Central Loss to encourage samples to form more compact clusters in the embedding space, thereby enhancing the consistency and interpretability of clustering results and mitigating differences between batches. The primary innovation of this model lies in reconstructing data with an auto-encoder and gradually merging smaller clusters into larger ones using a hierarchical clustering tree. By using reallocated cluster labels as training labels and employing the Central-cross Loss, the model effectively eliminates batch effects in an unsupervised manner. Compared to current methods, BDACL can mitigate batch effects without losing rare cell types.
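Although the paper's exact definition is not reproduced here, a loss of the kind described, cross-entropy plus a central term that pulls embeddings toward their cluster centers, is commonly written as (the symbols \(\lambda\), \(\mathbf{e}_i\), and \(\mathbf{c}_{y_i}\) are our notation):

```latex
\mathcal{L} \;=\; \underbrace{-\sum_{i}\log\frac{e^{z_{i,y_i}}}{\sum_{j} e^{z_{i,j}}}}_{\text{cross-entropy term}}
\;+\; \lambda\,\underbrace{\sum_{i}\bigl\lVert \mathbf{e}_i - \mathbf{c}_{y_i} \bigr\rVert_2^2}_{\text{central term}}
```

where \(z_{i,j}\) are the cluster logits for cell \(i\), \(\mathbf{e}_i\) its embedding, \(\mathbf{c}_{y_i}\) the center of its reallocated cluster label \(y_i\), and \(\lambda\) balances discrimination between labels against compactness within clusters.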
ABSTRACT
Exclusive enteral nutrition (EEN) is a first-line therapy for pediatric Crohn's disease (CD), but protective mechanisms remain unknown. We established a prospective pediatric cohort to characterize the function of fecal microbiota and metabolite changes of treatment-naive CD patients in response to EEN (German Clinical Trials DRKS00013306). Integrated multi-omics analysis identified network clusters from individually variable microbiome profiles, with Lachnospiraceae and medium-chain fatty acids as protective features. Bioorthogonal non-canonical amino acid tagging selectively identified bacterial species in response to medium-chain fatty acids. Metagenomic analysis identified high strain-level dynamics in response to EEN. Functional changes in diet-exposed fecal microbiota were further validated using gut chemostat cultures and microbiota transfer into germ-free Il10-deficient mice. Dietary model conditions induced individual patient-specific strain signatures to prevent or cause inflammatory bowel disease (IBD)-like inflammation in gnotobiotic mice. Hence, we provide evidence that EEN therapy operates through explicit functional changes of temporally and individually variable microbiome profiles.
ABSTRACT
Background: With its high and increasing lifetime prevalence, back pain represents a contemporary challenge for patients and healthcare providers. Monitored exercise therapy is a commonly prescribed treatment to relieve pain and functional limitations. However, the benefits of exercise are often gradual, subtle, and evaluated by subjective self-reported scores. Back pain pathogenesis is interlinked with epigenetically mediated processes that modify gene expression without altering the DNA sequence. Therefore, we hypothesize that therapy effects can be objectively evaluated by measurable epigenetic histone posttranslational modifications and proteome expression. Because epigenetic modifications are dynamic and responsive to environmental exposure, lifestyle choices, such as physical activity, can alter epigenetic profiles, subsequent gene expression, and health traits. Instead of invasive sampling (e.g., muscle biopsy), we collect easily accessible buccal swabs and plasma. The plasma proteome provides a systemic understanding of a person's current health state and is an ideal snapshot of downstream, epigenetically regulated changes upon therapy. This study investigates how molecular profiles evolve in response to standardized sport therapy and non-controlled lifestyle choices. Results: We report that the therapy improves agility, attenuates back pain, and triggers healthier habits. We find that a subset of participants' histone methylation and acetylation profiles cluster samples according to their therapy status, before or after therapy. Integrating epigenetic reprogramming of both buccal cells and peripheral blood mononuclear cells (PBMCs) reveals that these concomitant changes are concordant with higher levels of self-rated back pain improvement and agility gain. Additionally, epigenetic changes correlate with changes in immune response plasma factors, reflecting their comparable ability to rate therapy effects at the molecular level.
We also performed an exploratory analysis to confirm the usability of molecular profiles in (1) mapping lifestyle choices and (2) evaluating the distance of a given participant to an optimal health state. Conclusion: This pre-post cohort study highlights the potential of integrated molecular profiles to score therapy efficiency. Our findings reflect the complex interplay of an individual's background and lifestyle upon therapeutic exposure. Future studies are needed to provide mechanistic insights into back pain pathogenesis and lifestyle-based epigenetic reprogramming upon sport therapy intervention to maintain therapeutic effects in the long run.
ABSTRACT
Omics data generated from high-throughput technologies and clinical features jointly impact many complex human diseases. Identifying key biomarkers and clinical risk factors is essential for understanding disease mechanisms and advancing early disease diagnosis and precision medicine. However, the high-dimensionality and intricate associations between disease outcomes and omics profiles present significant analytical challenges. To address these, we propose an ensemble data-driven biomarker identification tool, Hybrid Feature Screening (HFS), to construct a candidate feature set for downstream advanced machine learning models. The pre-screened candidate features from HFS are further refined using a computationally efficient permutation-based feature importance test, forming the comprehensive High-dimensional Feature Importance Test (HiFIT) framework. Through extensive numerical simulations and real-world applications, we demonstrate HiFIT's superior performance in both outcome prediction and feature importance identification. An R package implementing HiFIT is available on GitHub (https://github.com/BZou-lab/HiFIT).
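A permutation-based feature importance test of the general kind described can be sketched as follows (a minimal illustration of the technique, not HiFIT's implementation; all names are ours):

```python
# Permutation importance: score a feature by how much shuffling its
# column degrades a fitted model's accuracy on labeled data.
import random

def permutation_importance(predict, X, y, feat, n_perm=100, seed=0):
    """predict: fitted model as a row -> label function.
    X: list of feature rows (lists); y: labels; feat: column index.
    Returns the mean accuracy drop over n_perm shuffles of column feat."""
    rng = random.Random(seed)

    def acc(rows):
        return sum(predict(r) == t for r, t in zip(rows, y)) / len(y)

    base = acc(X)
    drops = []
    for _ in range(n_perm):
        col = [row[feat] for row in X]
        rng.shuffle(col)  # break the feature-outcome association
        Xp = [row[:feat] + [v] + row[feat + 1:] for row, v in zip(X, col)]
        drops.append(base - acc(Xp))
    return sum(drops) / n_perm
```

Features whose permutation barely changes accuracy score near zero; the computational appeal, relative to refitting-based tests, is that the model is trained once and only predictions are recomputed.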
ABSTRACT
Erythropoiesis is a finely regulated and complex process that involves multiple transformations from hematopoietic stem cells to mature red blood cells at hematopoietic sites from the embryonic to the adult stages. Investigations into its molecular mechanisms have generated a wealth of expression data, including bulk and single-cell RNA sequencing data. A comprehensively integrated and well-curated erythropoiesis-specific database will greatly facilitate the mining of gene expression data and enable large-scale research of erythropoiesis and erythroid-related diseases. Here, we present EryDB, an open-access and comprehensive database dedicated to the collection, integration, analysis, and visualization of transcriptomic data for erythropoiesis and erythroid-related diseases. Currently, the database includes expertly curated quality-assured data of 3803 samples and 1,187,119 single cells derived from 107 public studies of three species (Homo sapiens, Mus musculus, and Danio rerio), nine tissue types, and five diseases. EryDB provides users with the ability to not only browse the molecular features of erythropoiesis between tissues and species, but also perform computational analyses of single-cell and bulk RNA sequencing data, thus serving as a convenient platform for customized queries and analyses. EryDB v1.0 is freely accessible at https://ngdc.cncb.ac.cn/EryDB/home.
ABSTRACT
This paper introduces a novel deep learning model for grape disease detection that integrates multimodal data and parallel heterogeneous activation functions, significantly enhancing detection accuracy and robustness. Through experiments, the model demonstrated excellent performance in grape disease detection, achieving an accuracy of 91%, a precision of 93%, a recall of 90%, a mean average precision (mAP) of 91%, and a detection speed of 56 frames per second (FPS), outperforming traditional deep learning models such as YOLOv3, YOLOv5, DEtection TRansformer (DETR), TinySegformer, and Tranvolution-GAN. To meet the demands of rapid on-site detection, this study also developed a lightweight model for mobile devices, successfully deployed on the iPhone 15. Techniques such as structural pruning, quantization, and depthwise separable convolution were used to significantly reduce the model's computational complexity and resource consumption, ensuring efficient operation and real-time performance. These achievements not only advance the development of smart agricultural technologies but also provide new technical solutions and practical tools for disease detection.
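The saving from depthwise separable convolution mentioned above follows a standard parameter-count argument (a general result, not figures from this paper): a standard convolution with \(D_K \times D_K\) kernels, \(M\) input channels, and \(N\) output channels uses \(D_K^2 M N\) parameters, while its depthwise separable counterpart uses \(D_K^2 M + M N\), a ratio of

```latex
\frac{D_K^{2} M + M N}{D_K^{2} M N} \;=\; \frac{1}{N} + \frac{1}{D_K^{2}}
```

so with 3 x 3 kernels and many output channels, the parameter and multiply-accumulate count drops by roughly a factor of eight to nine per layer.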