Pesquisa | Portal Regional da BVS

1.

A New Paradigm for High-dimensional Data: Distance-Based Semiparametric Feature Aggregation Framework via Between-Subject Attributes.

Liu, Jinyuan; Zhang, Xinlian; Lin, Tuo; Chen, Ruohui; Zhong, Yuan; Chen, Tian; Wu, Tsungchin; Liu, Chenyu; Huang, Anna; Nguyen, Tanya T; Lee, Ellen E; Jeste, Dilip V; Tu, Xin M.

Scand Stat Theory Appl ; 51(2): 672-696, 2024 Jun.

Artigo em Inglês | MEDLINE | ID: mdl-39101047

RESUMO

This article proposes a distance-based framework incentivized by the paradigm shift towards feature aggregation for high-dimensional data, which does not rely on the sparse-feature assumption or the permutation-based inference. Focusing on distance-based outcomes that preserve information without truncating any features, a class of semiparametric regression has been developed, which encapsulates multiple sources of high-dimensional variables using pairwise outcomes of between-subject attributes. Further, we propose a strategy to address the interlocking correlations among pairs via the U-statistics-based estimating equations (UGEE), which correspond to their unique efficient influence function (EIF). Hence, the resulting semiparametric estimators are robust to distributional misspecification while enjoying root-n consistency and asymptotic optimality to facilitate inference. In essence, the proposed approach not only circumvents information loss due to feature selection but also improves the model's interpretability and computational feasibility. Simulation studies and applications to the human microbiome and wearables data are provided, where the feature dimensions are tens of thousands.

2.

Enhancing spatio-temporal environmental analyses: A machine learning superpixel-based approach.

Estefania-Salazar, Enrique; Iglesias, Eva.

Heliyon ; 10(14): e34711, 2024 Jul 30.

Artigo em Inglês | MEDLINE | ID: mdl-39130414

RESUMO

The progressive evolution of the spatial and temporal resolutions of Earth observation satellites has brought multiple benefits to scientific research. The increasing volume of data with higher frequencies and spatial resolutions offers precise and timely information, making it an invaluable tool for environmental analysis and enhanced decision-making. However, this presents a formidable challenge for large-scale environmental analyses and socioeconomic applications based on spatial time series, often compelling researchers to resort to lower-resolution imagery, which can introduce uncertainty and impact results. In response to this, our key contribution is a novel machine learning approach for dense geospatial time series rooted in superpixel segmentation, which serves as a preliminary step in mitigating the high dimensionality of data in large-scale applications. This approach, while effectively reducing dimensionality, preserves valuable information to the maximum extent, thereby substantially enhancing data accuracy and subsequent environmental analyses. This method was empirically applied within the context of a comprehensive case study encompassing the 2002-2022 period with 8-d-frequency-normalized difference vegetation index data at 250-m resolution in an area spanning 43,470 km2. The efficacy of this methodology was assessed through a comparative analysis, comparing our results with those derived from 1000-m-resolution satellite data and an existing superpixel algorithm for time series data. An evaluation of the time-series deviations revealed that using coarser-resolution pixels introduced an error that exceeded that of the proposed algorithm by 25 % and that the proposed methodology outperformed other algorithms by more than 9 %. Notably, this methodological innovation concurrently facilitates the aggregation of pixels sharing similar land-cover classifications, thus mitigating subpixel heterogeneity within the dataset. Further, the proposed methodology, which is used as a preprocessing step, improves the clustering of pixels according to their time series and can enhance large-scale environmental analyses across a wide range of applications.

3.

Riemannian geometry for efficient analysis of protein dynamics data.

Diepeveen, Willem; Esteve-Yagüe, Carlos; Lellmann, Jan; Öktem, Ozan; Schönlieb, Carola-Bibiane.

Proc Natl Acad Sci U S A ; 121(33): e2318951121, 2024 Aug 13.

Artigo em Inglês | MEDLINE | ID: mdl-39121160

RESUMO

An increasingly common viewpoint is that protein dynamics datasets reside in a nonlinear subspace of low conformational energy. Ideal data analysis tools should therefore account for such nonlinear geometry. The Riemannian geometry setting can be suitable for a variety of reasons. First, it comes with a rich mathematical structure to account for a wide range of geometries that can be modeled after an energy landscape. Second, many standard data analysis tools developed for data in Euclidean space can be generalized to Riemannian manifolds. In the context of protein dynamics, a conceptual challenge comes from the lack of guidelines for constructing a smooth Riemannian structure based on an energy landscape. In addition, computational feasibility in computing geodesics and related mappings poses a major challenge. This work considers these challenges. The first part of the paper develops a local approximation technique for computing geodesics and related mappings on Riemannian manifolds in a computationally feasible manner. The second part constructs a smooth manifold and a Riemannian structure that is based on an energy landscape for protein conformations. The resulting Riemannian geometry is tested on several data analysis tasks relevant for protein dynamics data. In particular, the geodesics with given start- and end-points approximately recover corresponding molecular dynamics trajectories for proteins that undergo relatively ordered transitions with medium-sized deformations. The Riemannian protein geometry also gives physically realistic summary statistics and retrieves the underlying dimension even for large-sized deformations within seconds on a laptop.

Assuntos

Conformação Proteica , Proteínas , Proteínas/química , Algoritmos , Simulação de Dinâmica Molecular

4.

Hilbert-curve assisted structure embedding method.

Zahoránszky-Kohalmi, Gergely; Wan, Kanny K; Godfrey, Alexander G.

J Cheminform ; 16(1): 87, 2024 Jul 29.

Artigo em Inglês | MEDLINE | ID: mdl-39075547

RESUMO

MOTIVATION: Chemical space embedding methods are widely utilized in various research settings for dimensional reduction, clustering and effective visualization. The maps generated by the embedding process can provide valuable insight to medicinal chemists in terms of the relationships between structural, physicochemical and biological properties of compounds. However, these maps are known to be difficult to interpret, and the ''landscape'' on the map is prone to ''rearrangement'' when embedding different sets of compounds. RESULTS: In this study we present the Hilbert-Curve Assisted Space Embedding (HCASE) method which was designed to create maps by organizing structures according to a logic familiar to medicinal chemists. First, a chemical space is created with the help of a set of ''reference scaffolds''. These scaffolds are sorted according to the medicinal chemistry inspired Scaffold-Key algorithm found in prior art. Next, the ordered scaffolds are mapped to a line which is folded into a higher dimensional (here: 2D) space. The intricately folded line is referred to as a pseudo-Hilbert-Curve. The embedding of a compound happens by locating its most similar reference scaffold in the pseudo-Hilbert-Curve and assuming the respective position. Through a series of experiments, we demonstrate the properties of the maps generated by the HCASE method. Subjects of embeddings were compounds of the DrugBank and CANVASS libraries, and the chemical spaces were defined by scaffolds extracted from the ChEMBL database. SCIENTIFIC CONTRIBUTION: The novelty of HCASE method lies in generating robust and intuitive chemical space embeddings that are reflective of a medicinal chemist's reasoning, and the precedential use of space filling (Hilbert) curve in the process. AVAILABILITY: https://github.com/ncats/hcase.

5.

A novel deep machine learning algorithm with dimensionality and size reduction approaches for feature elimination: thyroid cancer diagnoses with randomly missing data.

Tutsoy, Onder; Sumbul, Hilmi Erdem.

Brief Bioinform ; 25(4)2024 May 23.

Artigo em Inglês | MEDLINE | ID: mdl-39007597

RESUMO

Thyroid cancer incidences endure to increase even though a large number of inspection tools have been developed recently. Since there is no standard and certain procedure to follow for the thyroid cancer diagnoses, clinicians require conducting various tests. This scrutiny process yields multi-dimensional big data and lack of a common approach leads to randomly distributed missing (sparse) data, which are both formidable challenges for the machine learning algorithms. This paper aims to develop an accurate and computationally efficient deep learning algorithm to diagnose the thyroid cancer. In this respect, randomly distributed missing data stemmed singularity in learning problems is treated and dimensionality reduction with inner and target similarity approaches are developed to select the most informative input datasets. In addition, size reduction with the hierarchical clustering algorithm is performed to eliminate the considerably similar data samples. Four machine learning algorithms are trained and also tested with the unseen data to validate their generalization and robustness abilities. The results yield 100% training and 83% testing preciseness for the unseen data. Computational time efficiencies of the algorithms are also examined under the equal conditions.

Assuntos

Algoritmos , Aprendizado Profundo , Neoplasias da Glândula Tireoide , Neoplasias da Glândula Tireoide/diagnóstico , Humanos , Aprendizado de Máquina , Análise por Conglomerados

6.

Bayesian estimation of covariate assisted principal regression for brain functional connectivity.

Park, Hyung G.

Biostatistics ; 2024 Jul 09.

Artigo em Inglês | MEDLINE | ID: mdl-38981041

RESUMO

This paper presents a Bayesian reformulation of covariate-assisted principal regression for covariance matrix outcomes to identify low-dimensional components in the covariance associated with covariates. By introducing a geometric approach to the covariance matrices and leveraging Euclidean geometry, we estimate dimension reduction parameters and model covariance heterogeneity based on covariates. This method enables joint estimation and uncertainty quantification of relevant model parameters associated with heteroscedasticity. We demonstrate our approach through simulation studies and apply it to analyze associations between covariates and brain functional connectivity using data from the Human Connectome Project.

7.

Similarity measure method of near-infrared spectrum combined with multi-attribute information.

Zhang, Jinfeng; Qin, Yuhua; Tian, Rongkun; Bai, Xiaoli; Liu, Jing.

Spectrochim Acta A Mol Biomol Spectrosc ; 322: 124783, 2024 Jul 04.

Artigo em Inglês | MEDLINE | ID: mdl-38972098

RESUMO

Due to the high-dimensionality, redundancy, and non-linearity of the near-infrared (NIR) spectra data, as well as the influence of attributes such as producing area and grade of the sample, which can all affect the similarity measure between samples. This paper proposed a t-distributed stochastic neighbor embedding algorithm based on Sinkhorn distance (St-SNE) combined with multi-attribute data information. Firstly, the Sinkhorn distance was introduced which can solve problems such as KL divergence asymmetry and sparse data distribution in high-dimensional space, thereby constructing probability distributions that make low-dimensional space similar to high-dimensional space. In addition, to address the impact of multi-attribute features of samples on similarity measure, a multi-attribute distance matrix was constructed using information entropy, and then combined with the numerical matrix of spectral data to obtain a mixed data matrix. In order to validate the effectiveness of the St-SNE algorithm, dimensionality reduction projection was performed on NIR spectral data and compared with PCA, LPP, and t-SNE algorithms. The results demonstrated that the St-SNE algorithm effectively distinguishes samples with different attribute information, and produced more distinct projection boundaries of sample category in low-dimensional space. Then we tested the classification performance of St-SNE for different attributes by using the tobacco and mango datasets, and compared it with LPP, t-SNE, UMAP, and Fisher t-SNE algorithms. The results showed that St-SNE algorithm had the highest classification accuracy for different attributes. Finally, we compared the results of searching the most similar sample with the target tobacco for cigarette formulas, and experiments showed that the St-SNE had the highest consistency with the recommendation of the experts than that of the other algorithms. It can provide strong support for the maintenance and design of the product formula.

8.

Multi-layer Bundling as a New Approach for Determining Multi-scale Correlations Within a High-Dimensional Dataset.

Fazli, Mehran; Bertram, Richard; Striegel, Deborah A.

Bull Math Biol ; 86(9): 105, 2024 Jul 12.

Artigo em Inglês | MEDLINE | ID: mdl-38995438

RESUMO

The growing complexity of biological data has spurred the development of innovative computational techniques to extract meaningful information and uncover hidden patterns within vast datasets. Biological networks, such as gene regulatory networks and protein-protein interaction networks, hold critical insights into biological features' connections and functions. Integrating and analyzing high-dimensional data, particularly in gene expression studies, stands prominent among the challenges in deciphering these networks. Clustering methods play a crucial role in addressing these challenges, with spectral clustering emerging as a potent unsupervised technique considering intrinsic geometric structures. However, spectral clustering's user-defined cluster number can lead to inconsistent and sometimes orthogonal clustering regimes. We propose the Multi-layer Bundling (MLB) method to address this limitation, combining multiple prominent clustering regimes to offer a comprehensive data view. We call the outcome clusters "bundles". This approach refines clustering outcomes, unravels hierarchical organization, and identifies bridge elements mediating communication between network components. By layering clustering results, MLB provides a global-to-local view of biological feature clusters enabling insights into intricate biological systems. Furthermore, the method enhances bundle network predictions by integrating the bundle co-cluster matrix with the affinity matrix. The versatility of MLB extends beyond biological networks, making it applicable to various domains where understanding complex relationships and patterns is needed.

Assuntos

Algoritmos , Biologia Computacional , Redes Reguladoras de Genes , Conceitos Matemáticos , Mapas de Interação de Proteínas , Análise por Conglomerados , Humanos , Modelos Biológicos , Perfilação da Expressão Gênica/estatística & dados numéricos , Perfilação da Expressão Gênica/métodos

9.

nipalsMCIA: Flexible Multi-Block Dimensionality Reduction in R via Non-linear Iterative Partial Least Squares.

Mattessich, Max; Reyna, Joaquin; Aron, Edel; Ay, Ferhat; Kilmer, Misha; Kleinstein, Steven H; Konstorum, Anna.

bioRxiv ; 2024 Jun 10.

Artigo em Inglês | MEDLINE | ID: mdl-38915554

RESUMO

Motivation: With the increased reliance on multi-omics data for bulk and single cell analyses, the availability of robust approaches to perform unsupervised analysis for clustering, visualization, and feature selection is imperative. Joint dimensionality reduction methods can be applied to multi-omics datasets to derive a global sample embedding analogous to single-omic techniques such as Principal Components Analysis (PCA). Multiple co-inertia analysis (MCIA) is a method for joint dimensionality reduction that maximizes the covariance between block- and global-level embeddings. Current implementations for MCIA are not optimized for large datasets such such as those arising from single cell studies, and lack capabilities with respect to embedding new data. Results: We introduce nipalsMCIA, an MCIA implementation that solves the objective function using an extension to Non-linear Iterative Partial Least Squares (NIPALS), and shows significant speed-up over earlier implementations that rely on eigendecompositions for single cell multi-omics data. It also removes the dependence on an eigendecomposition for calculating the variance explained, and allows users to perform out-of-sample embedding for new data. nipalsMCIA provides users with a variety of pre-processing and parameter options, as well as ease of functionality for down-stream analysis of single-omic and global-embedding factors. Availability: nipalsMCIA is available as a BioConductor package at https://bioconductor.org/packages/release/bioc/html/nipalsMCIA.html, and includes detailed documentation and application vignettes. Supplementary Materials are available online.

10.

An encoding generative modeling approach to dimension reduction and covariate adjustment in causal inference with observational studies.

Liu, Qiao; Chen, Zhongren; Wong, Wing Hung.

Proc Natl Acad Sci U S A ; 121(23): e2322376121, 2024 Jun 04.

Artigo em Inglês | MEDLINE | ID: mdl-38809705

RESUMO

In this article, we develop CausalEGM, a deep learning framework for nonlinear dimension reduction and generative modeling of the dependency among covariate features affecting treatment and response. CausalEGM can be used for estimating causal effects in both binary and continuous treatment settings. By learning a bidirectional transformation between the high-dimensional covariate space and a low-dimensional latent space and then modeling the dependencies of different subsets of the latent variables on the treatment and response, CausalEGM can extract the latent covariate features that affect both treatment and response. By conditioning on these features, one can mitigate the confounding effect of the high dimensional covariate on the estimation of the causal relation between treatment and response. In a series of experiments, the proposed method is shown to achieve superior performance over existing methods in both binary and continuous treatment settings. The improvement is substantial when the sample size is large and the covariate is of high dimension. Finally, we established excess risk bounds and consistency results for our method, and discuss how our approach is related to and improves upon other dimension reduction approaches in causal inference.

11.

A comprehensive benchmarking of machine learning algorithms and dimensionality reduction methods for drug sensitivity prediction.

Eckhart, Lea; Lenhof, Kerstin; Rolli, Lisa-Marie; Lenhof, Hans-Peter.

Brief Bioinform ; 25(4)2024 May 23.

Artigo em Inglês | MEDLINE | ID: mdl-38797968

RESUMO

A major challenge of precision oncology is the identification and prioritization of suitable treatment options based on molecular biomarkers of the considered tumor. In pursuit of this goal, large cancer cell line panels have successfully been studied to elucidate the relationship between cellular features and treatment response. Due to the high dimensionality of these datasets, machine learning (ML) is commonly used for their analysis. However, choosing a suitable algorithm and set of input features can be challenging. We performed a comprehensive benchmarking of ML methods and dimension reduction (DR) techniques for predicting drug response metrics. Using the Genomics of Drug Sensitivity in Cancer cell line panel, we trained random forests, neural networks, boosting trees and elastic nets for 179 anti-cancer compounds with feature sets derived from nine DR approaches. We compare the results regarding statistical performance, runtime and interpretability. Additionally, we provide strategies for assessing model performance compared with a simple baseline model and measuring the trade-off between models of different complexity. Lastly, we show that complex ML models benefit from using an optimized DR strategy, and that standard models-even when using considerably fewer features-can still be superior in performance.

Assuntos

Algoritmos , Antineoplásicos , Benchmarking , Aprendizado de Máquina , Humanos , Antineoplásicos/farmacologia , Antineoplásicos/uso terapêutico , Neoplasias/tratamento farmacológico , Neoplasias/genética , Redes Neurais de Computação , Linhagem Celular Tumoral

12.

Knowledge-guided learning methods for integrative analysis of multi-omics data.

Li, Wenrui; Ballard, Jenna; Zhao, Yize; Long, Qi.

Comput Struct Biotechnol J ; 23: 1945-1950, 2024 Dec.

Artigo em Inglês | MEDLINE | ID: mdl-38736693

RESUMO

Integrative analysis of multi-omics data has the potential to yield valuable and comprehensive insights into the molecular mechanisms underlying complex diseases such as cancer and Alzheimer's disease. However, a number of analytical challenges complicate multi-omics data integration. For instance, -omics data are usually high-dimensional, and sample sizes in multi-omics studies tend to be modest. Furthermore, when genes in an important pathway have relatively weak signal, it can be difficult to detect them individually. There is a growing body of literature on knowledge-guided learning methods that can address these challenges by incorporating biological knowledge such as functional genomics and functional proteomics into multi-omics data analysis. These methods have been shown to outperform their counterparts that do not utilize biological knowledge in tasks including prediction, feature selection, clustering, and dimension reduction. In this review, we survey recently developed methods and applications of knowledge-guided multi-omics data integration methods and discuss future research directions.

13.

Lost in space: what single-cell RNA sequencing cannot tell you.

Adema, Kelvin; Schon, Michael A; Nodine, Michael D; Kohlen, Wouter.

Trends Plant Sci ; 2024 Apr 02.

Artigo em Inglês | MEDLINE | ID: mdl-38570278

RESUMO

Plant scientists are rapidly integrating single-cell RNA sequencing (scRNA-seq) into their workflows. Maximizing the potential of scRNA-seq requires a proper understanding of the spatiotemporal context of cells. However, positional information is inherently lost during scRNA-seq, limiting its potential to characterize complex biological systems. In this review we highlight how current single-cell analysis pipelines cannot completely recover spatial information, which confounds biological interpretation. Various strategies exist to identify the location of RNA, from classical RNA in situ hybridization to spatial transcriptomics. Herein we discuss the possibility of utilizing this spatial information to supervise single-cell analyses. An integrative approach will maximize the potential of each technology, and lead to insights which go beyond the capability of each individual technology.

14.

Optimizing Driving Parameters of the Jumbo Drill Efficiently with XGBoost-DRWIACO Framework: Applied to Increase the Feed Speed.

Guo, Hao; Lin, Lin; Wu, Jinlei; Lv, Yancheng; Tong, Changsheng.

Sensors (Basel) ; 24(8)2024 Apr 18.

Artigo em Inglês | MEDLINE | ID: mdl-38676217

RESUMO

The jumbo drill is a commonly used driving equipment in tunnel engineering. One of the key decision-making issues for reducing tunnel construction costs is to optimize the main driving parameters to increase the feed speed of the jumbo drill. The optimization of the driving parameters is supposed to meet the requirements of high reliability and efficiency due to the high risk and complex working conditions in tunnel engineering. The flaws of the existing optimization algorithms for driving parameter optimization lie in the low accuracy of the evaluation functions under complex working conditions and the low efficiency of the algorithms. To address the above problems, a driving parameter optimization method based on the XGBoost-DRWIACO framework with high accuracy and efficiency is proposed. A data-driven prediction model for feed speed based on XGBoost is established as the evaluation function, which has high accuracy under complex working conditions and ensures the high reliability of the optimized results. Meanwhile, an improved ant colony algorithm based on dimension reduction while iterating strategy (DRWIACO) is proposed. DRWIACO is supposed to improve efficiency by resolving inefficient iterations of the ant colony algorithm (ACO), which is manifested as falling into local optimum, converging slowly and converging with a slight fluctuation in a certain dimension. Experimental results show that the error by the proposed framework is less than 10%, and the efficiency is increased by over 30% compared with the comparison methods, which meets the requirements of high reliability and efficiency for tunnel construction. More importantly, the construction cost is reduced by 19% compared with the actual feed speed, which improves the economic benefits.

15.

Analysis and Visualization of Single-Cell Sequencing Data with Scanpy and MetaCell: A Tutorial.

Li, Yanjun; Sun, Chaoyue; Romanova, Daria Y; Wu, Dapeng O; Fang, Ruogu; Moroz, Leonid L.

Methods Mol Biol ; 2757: 383-445, 2024.

Artigo em Inglês | MEDLINE | ID: mdl-38668977

RESUMO

The emergence and development of single-cell RNA sequencing (scRNA-seq) techniques enable researchers to perform large-scale analysis of the transcriptomic profiling at cell-specific resolution. Unsupervised clustering of scRNA-seq data is central for most studies, which is essential to identify novel cell types and their gene expression logics. Although an increasing number of algorithms and tools are available for scRNA-seq analysis, a practical guide for users to navigate the landscape remains underrepresented. This chapter presents an overview of the scRNA-seq data analysis pipeline, quality control, batch effect correction, data standardization, cell clustering and visualization, cluster correlation analysis, and marker gene identification. Taking the two broadly used analysis packages, i.e., Scanpy and MetaCell, as examples, we provide a hands-on guideline and comparison regarding the best practices for the above essential analysis steps and data visualization. Additionally, we compare both packages and algorithms using a scRNA-seq dataset of the ctenophore Mnemiopsis leidyi, which is representative of one of the earliest animal lineages, critical to understanding the origin and evolution of animal novelties. This pipeline can also be helpful for analyses of other taxa, especially prebilaterian animals, where these tools are under development (e.g., placozoan and Porifera).

Assuntos

Algoritmos , Perfilação da Expressão Gênica , Análise de Célula Única , Software , Análise de Célula Única/métodos , Animais , Perfilação da Expressão Gênica/métodos , Análise de Sequência de RNA/métodos , Biologia Computacional/métodos , Análise por Conglomerados , Transcriptoma/genética

16.

An enriched approach to combining high-dimensional genomic and low-dimensional phenotypic data.

Cabrera, Javier; Emir, Birol; Cheng, Ge; Duan, Yajie; Alemayehu, Demissie; Cherkas, Yauheniya.

J Biopharm Stat ; : 1-7, 2024 Apr 05.

Artigo em Inglês | MEDLINE | ID: mdl-38578223

RESUMO

We describe an approach for combining and analyzing high-dimensional genomic and low-dimensional phenotypic data. The approach leverages a scheme of weights applied to the variables instead of observations and, hence, permits incorporation of the information provided by the low dimensional data source. It can also be incorporated into commonly used downstream techniques, such as random forest or penalized regression. Finally, the simulated lupus studies involving genetic and clinical data are used to illustrate the overall idea and show that the proposed enriched penalized method can select significant genetic variables while keeping several important clinical variables in the final model.

17.

Multiple phenotype association tests based on sliced inverse regression.

Sun, Wenyuan; Jon, Kyongson; Zhu, Wensheng.

BMC Bioinformatics ; 25(1): 144, 2024 Apr 04.

Artigo em Inglês | MEDLINE | ID: mdl-38575890

RESUMO

BACKGROUND: Joint analysis of multiple phenotypes in studies of biological systems such as Genome-Wide Association Studies is critical to revealing the functional interactions between various traits and genetic variants, but growth of data in dimensionality has become a very challenging problem in the widespread use of joint analysis. To handle the excessiveness of variables, we consider the sliced inverse regression (SIR) method. Specifically, we propose a novel SIR-based association test that is robust and powerful in testing the association between multiple predictors and multiple outcomes. RESULTS: We conduct simulation studies in both low- and high-dimensional settings with various numbers of Single-Nucleotide Polymorphisms and consider the correlation structure of traits. Simulation results show that the proposed method outperforms the existing methods. We also successfully apply our method to the genetic association study of ADNI dataset. Both the simulation studies and real data analysis show that the SIR-based association test is valid and achieves a higher efficiency compared with its competitors. CONCLUSION: Several scenarios with low- and high-dimensional responses and genotypes are considered in this paper. Our SIR-based method controls the estimated type I error at the pre-specified level α .

Assuntos

Estudo de Associação Genômica Ampla , Polimorfismo de Nucleotídeo Único , Estudo de Associação Genômica Ampla/métodos , Fenótipo , Genótipo , Simulação por Computador , Estudos de Associação Genética , Modelos Genéticos

18.

EpiDiP/NanoDiP: a versatile unsupervised machine learning edge computing platform for epigenomic tumour diagnostics.

Hench, Jürgen; Hultschig, Claus; Brugger, Jon; Mariani, Luigi; Guzman, Raphael; Soleman, Jehuda; Leu, Severina; Benton, Miles; Stec, Irenäus Maria; Hench, Ivana Bratic; Hoffmann, Per; Harter, Patrick; Weber, Katharina J; Albers, Anne; Thomas, Christian; Hasselblatt, Martin; Schüller, Ulrich; Restelli, Lisa; Capper, David; Hewer, Ekkehard; Diebold, Joachim; Kolenc, Danijela; Schneider, Ulf C; Rushing, Elisabeth; Della Monica, Rosa; Chiariotti, Lorenzo; Sill, Martin; Schrimpf, Daniel; von Deimling, Andreas; Sahm, Felix; Kölsche, Christian; Tolnay, Markus; Frank, Stephan.

Acta Neuropathol Commun ; 12(1): 51, 2024 Apr 04.

Artigo em Inglês | MEDLINE | ID: mdl-38576030

RESUMO

DNA methylation analysis based on supervised machine learning algorithms with static reference data, allowing diagnostic tumour typing with unprecedented precision, has quickly become a new standard of care. Whereas genome-wide diagnostic methylation profiling is mostly performed on microarrays, an increasing number of institutions additionally employ nanopore sequencing as a faster alternative. In addition, methylation-specific parallel sequencing can generate methylation and genomic copy number data. Given these diverse approaches to methylation profiling, to date, there is no single tool that allows (1) classification and interpretation of microarray, nanopore and parallel sequencing data, (2) direct control of nanopore sequencers, and (3) the integration of microarray-based methylation reference data. Furthermore, no software capable of entirely running in routine diagnostic laboratory environments lacking high-performance computing and network infrastructure exists. To overcome these shortcomings, we present EpiDiP/NanoDiP as an open-source DNA methylation and copy number profiling suite, which has been benchmarked against an established supervised machine learning approach using in-house routine diagnostics data obtained between 2019 and 2021. Running locally on portable, cost- and energy-saving system-on-chip as well as gpGPU-augmented edge computing devices, NanoDiP works in offline mode, ensuring data privacy. It does not require the rigid training data annotation of supervised approaches. Furthermore, NanoDiP is the core of our public, free-of-charge EpiDiP web service which enables comparative methylation data analysis against an extensive reference data collection. We envision this versatile platform as a useful resource not only for neuropathologists and surgical pathologists but also for the tumour epigenetics research community. In daily diagnostic routine, analysis of native, unfixed biopsies by NanoDiP delivers molecular tumour classification in an intraoperative time frame.

Assuntos

Epigenômica , Neoplasias , Humanos , Aprendizado de Máquina não Supervisionado , Computação em Nuvem , Neoplasias/diagnóstico , Neoplasias/genética , Metilação de DNA

19.

Extended Poisson Gaussian-Process Latent Variable Model for Unsupervised Neural Decoding.

Luo, Della Daiyi; Giri, Bapun; Diba, Kamran; Kemere, Caleb.

bioRxiv ; 2024 Mar 07.

Artigo em Inglês | MEDLINE | ID: mdl-38496669

RESUMO

Dimension reduction on neural activity paves a way for unsupervised neural decoding by dissociating the measurement of internal neural state repetition from the measurement of external variable tuning. With assumptions only on the smoothness of latent dynamics and of internal tuning curves, the Poisson Gaussian-process latent variable model (P-GPLVM) (Wu et al., 2017) is a powerful tool to discover the low-dimensional latent structure for high-dimensional spike trains. However, when given novel neural data, the original model lacks a method to infer their latent trajectories in the learned latent space, limiting its ability for estimating the internal state repetition. Here, we extend the P-GPLVM to enable the latent variable inference of new data constrained by previously learned smoothness and mapping information. We also describe a principled approach for the constrained latent variable inference for temporally-compressed patterns of activity, such as those found in population burst events (PBEs) during hippocampal sharp-wave ripples, as well as metrics for assessing whether the inferred new latent variables are congruent with a previously learned manifold in the latent space. Applying these approaches to hippocampal ensemble recordings during active maze exploration, we replicate the result that P-GPLVM learns a latent space encoding the animal's position. We further demonstrate that this latent space can differentiate one maze context from another. By inferring the latent variables of new neural data during running, certain internal neural states are observed to repeat, which is in accordance with the similarity of experiences encoded by its nearby neural trajectories in the training data manifold. Finally, repetition of internal neural states can be estimated for neural activity during PBEs as well, allowing the identification for replay events of versatile behaviors and more general experiences. Thus, our extension of the P-GPLVM framework for unsupervised analysis of neural activity can be used to answer critical questions related to scientific discovery.

20.

A Fast and Efficient Approach to Strength Prediction for Carbon/Epoxy Composites with Resin-Missing Defects.

Li, Hongfeng; Li, Feng; Zhu, Lingxue.

Polymers (Basel) ; 16(6)2024 Mar 08.

Artigo em Inglês | MEDLINE | ID: mdl-38543347

RESUMO

A novel method is proposed to quickly predict the tensile strength of carbon/epoxy composites with resin-missing defects. The univariate Chebyshev prediction model (UCPM) was developed using the dimension reduction method and Chebyshev polynomials. To enhance the computational efficiency and reduce the manual modeling workload, a parameterization script for the finite element model was established using Python during the model construction process. To validate the model, specimens with different defect sizes were prepared using the vacuum assistant resin infusion (VARI) process, the mechanical properties of the specimens were tested, and the model predictions were analyzed in comparison with the experimental results. Additionally, the impact of the order (second-ninth) on the predictive accuracy of the UCPM was examined, and the performance of the model was evaluated using statistical errors. The results demonstrate that the prediction model has a high prediction accuracy, with a maximum prediction error of 5.20% compared to the experimental results. A low order resulted in underfitting, while increasing the order can improve the prediction accuracy of the UCPM. However, if the order is too high, overfitting may occur, leading to a decrease in the prediction accuracy.

RESUMO

RESUMO

RESUMO

Assuntos

RESUMO

RESUMO

Assuntos

RESUMO

RESUMO

RESUMO

Assuntos

RESUMO

RESUMO

RESUMO

Assuntos

RESUMO

RESUMO

RESUMO

RESUMO

Assuntos

RESUMO

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

RESUMO

ENVIAR RESULTADO:

SELEÇÃO DE REFERÊNCIAS

DETALHE DA PESQUISA