Results 1 - 20 of 8,642
1.
Nat Methods ; 21(8): 1501-1513, 2024 Aug.
Article in English | MEDLINE | ID: mdl-38783067

ABSTRACT

Spatially resolved transcriptomics (SRT) technologies have significantly advanced biomedical research, but their data analysis remains challenging due to the discrete nature of the data and the high levels of noise, compounded by complex spatial dependencies. Here, we propose spaVAE, a dependency-aware, deep generative spatial variational autoencoder model that probabilistically characterizes count data while capturing spatial correlations. spaVAE introduces a hybrid embedding combining a Gaussian process prior with a Gaussian prior to explicitly capture spatial correlations among spots. It then optimizes the parameters of deep neural networks to approximate the distributions underlying the SRT data. With the approximated distributions, spaVAE can contribute to several analytical tasks that are essential for SRT data analysis, including dimensionality reduction, visualization, clustering, batch integration, denoising, differential expression, spatial interpolation, resolution enhancement and identification of spatially variable genes. Moreover, we have extended spaVAE to spaPeakVAE and spaMultiVAE to characterize spatial ATAC-seq (assay for transposase-accessible chromatin using sequencing) data and spatial multi-omics data, respectively.
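The Gaussian process prior mentioned above encodes the idea that nearby spots should have correlated latent values. A minimal sketch of that idea follows, assuming a squared-exponential (RBF) kernel over 2D spot coordinates; the abstract does not name the kernel, and `rbf_kernel`, `length_scale` and the toy coordinates are illustrative assumptions rather than spaVAE's actual code.

```python
import math

def rbf_kernel(coords, length_scale=1.0):
    """Squared-exponential covariance between every pair of spots.

    Assumption: an RBF kernel is used here only for illustration.
    """
    n = len(coords)
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(coords[i], coords[j]))
            kernel[i][j] = math.exp(-sq_dist / (2.0 * length_scale ** 2))
    return kernel

# Hypothetical spot coordinates: two neighbours and one distant spot.
spots = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
K = rbf_kernel(spots, length_scale=2.0)
```

Under this kernel the two neighbouring spots receive a much larger prior covariance than the distant pair, which is the kind of spatial dependency a GP prior is meant to express.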


Subject(s)
Algorithms , Humans , Neural Networks, Computer , Deep Learning , Gene Expression Profiling/methods , Chromatin Immunoprecipitation Sequencing/methods , Transcriptome , Normal Distribution , Cluster Analysis , Computational Biology/methods
2.
Nat Methods ; 20(9): 1379-1387, 2023 Sep.
Article in English | MEDLINE | ID: mdl-37592182

ABSTRACT

Spatially resolved genomic technologies have allowed us to study the physical organization of cells and tissues, and promise an understanding of local interactions between cells. However, it remains difficult to precisely align spatial observations across slices, samples, scales, individuals and technologies. Here, we propose a probabilistic model that aligns spatially-resolved samples onto a known or unknown common coordinate system (CCS) with respect to phenotypic readouts (for example, gene expression). Our method, Gaussian Process Spatial Alignment (GPSA), consists of a two-layer Gaussian process: the first layer maps observed samples' spatial locations onto a CCS, and the second layer maps from the CCS to the observed readouts. Our approach enables complex downstream spatially aware analyses that are impossible or inaccurate with unaligned data, including an analysis of variance, creation of a dense three-dimensional (3D) atlas from sparse two-dimensional (2D) slices or association tests across data modalities.


Subject(s)
Genomics , Models, Statistical , Humans , Normal Distribution
3.
Nat Methods ; 20(10): 1581-1592, 2023 Oct.
Article in English | MEDLINE | ID: mdl-37723246

ABSTRACT

Here we report SUPPORT (statistically unbiased prediction utilizing spatiotemporal information in imaging data), a self-supervised learning method for removing Poisson-Gaussian noise in voltage imaging data. SUPPORT is based on the insight that a pixel value in voltage imaging data is highly dependent on its spatiotemporal neighboring pixels, even when its temporally adjacent frames alone do not provide useful information for statistical prediction. Such dependency is captured and used by a convolutional neural network with a spatiotemporal blind spot to accurately denoise voltage imaging data in which the existence of the action potential in a time frame cannot be inferred by the information in other frames. Through simulations and experiments, we show that SUPPORT enables precise denoising of voltage imaging data and other types of microscopy image while preserving the underlying dynamics within the scene.


Subject(s)
Microscopy , Neural Networks, Computer , Signal-To-Noise Ratio , Normal Distribution , Image Processing, Computer-Assisted/methods
4.
Brief Bioinform ; 25(6), 2024 Sep 23.
Article in English | MEDLINE | ID: mdl-39428128

ABSTRACT

We introduce a groundbreaking approach: the minimum free energy-based Gaussian Self-Benchmarking (MFE-GSB) framework, designed to combat the many biases inherent in RNA-seq data. Central to our methodology is the MFE concept, facilitating the adoption of a Gaussian distribution model tailored to effectively mitigate all co-existing biases within a k-mer counting scheme. The MFE-GSB framework operates on a dual-model system, juxtaposing modeling data of uniform k-mer distribution against the real, observed sequencing data characterized by nonuniform k-mer distributions. The framework applies a Gaussian function, guided by predetermined parameters (mean and SD) derived from modeling data, to fit unknown sequencing data. This dual comparison allows for the accurate prediction of k-mer abundances across MFE categories, enabling simultaneous correction of biases at the single k-mer level. Through validation with both engineered RNA constructs and human tissue RNA samples, its wide-ranging efficacy and applicability are demonstrated.


Subject(s)
RNA-Seq , Humans , RNA-Seq/methods , Benchmarking , Sequence Analysis, RNA/methods , RNA/chemistry , RNA/genetics , Algorithms , Normal Distribution , Computational Biology/methods , Bias
5.
Proc Natl Acad Sci U S A ; 120(35): e1813976120, 2023 Aug 29.
Article in English | MEDLINE | ID: mdl-37624752

ABSTRACT

We investigated whether celebrated cases of evolutionary radiations of passerine birds on islands have produced exceptional morphological diversity relative to comparable-aged radiations globally. Based on eight external measurements, we calculated the disparity in size and shape within clades, each of which was classified as being tropical or temperate and as having diversified in a continental or an island/archipelagic setting. We found that the distribution of disparity among all clades does not differ substantively from a normal distribution, which would be consistent with a common underlying process of morphological diversification that is largely independent of latitude and occurrence on islands. Disparity is slightly greater in island clades than in those from continents or clades consisting of island and noninsular taxa, revealing a small, but significant, effect of island occurrence on evolutionary divergence. Nonetheless, the number of highly disparate clades overall is no greater than expected from a normal distribution, calling into question the need to invoke key innovations, ecological opportunity, or other factors as stimuli for adaptive radiations in passerine birds.


Subject(s)
Biological Evolution , Passeriformes , Animals , Normal Distribution , Passeriformes/genetics
6.
Biostatistics ; 25(4): 962-977, 2024 Oct 01.
Article in English | MEDLINE | ID: mdl-38669589

ABSTRACT

There is an increasing interest in the use of joint models for the analysis of longitudinal and survival data. While random effects models have been extensively studied, these models can be hard to implement and the fixed effect regression parameters must be interpreted conditional on the random effects. Copulas provide a useful alternative framework for joint modeling. One advantage of using copulas is that practitioners can directly specify marginal models for the outcomes of interest. We develop a joint model using a Gaussian copula to characterize the association between multivariate longitudinal and survival outcomes. Rather than using an unstructured correlation matrix in the copula model to characterize dependence structure as is common, we propose a novel decomposition that allows practitioners to impose structure (e.g., auto-regressive) which provides efficiency gains in small to moderate sample sizes and reduces computational complexity. We develop a Markov chain Monte Carlo model fitting procedure for estimation. We illustrate the method's value using a simulation study and present a real data analysis of longitudinal quality of life and disease-free survival data from an International Breast Cancer Study Group trial.
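The point that a copula lets practitioners specify marginal models directly can be illustrated with a small sketch. It draws dependent uniforms from a bivariate Gaussian copula and then pushes them through two unrelated marginals. The correlation value, the normal "quality of life" marginal and the exponential survival marginal are illustrative assumptions, not the model or data of the paper.

```python
import math
import random
from statistics import NormalDist

def sample_gaussian_copula(rho, n, seed=0):
    """Draw n pairs of uniforms whose dependence follows a Gaussian copula."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        # Build a second standard normal with correlation rho to the first.
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        # The normal CDF maps each coordinate to a Uniform(0, 1) margin.
        pairs.append((std_normal.cdf(z1), std_normal.cdf(z2)))
    return pairs

def exponential_quantile(u, rate):
    """Inverse CDF of an exponential distribution (hypothetical survival marginal)."""
    return -math.log(1.0 - u) / rate

uniform_pairs = sample_gaussian_copula(rho=0.7, n=2000)

# Marginals are chosen separately from the dependence structure:
# a normal longitudinal score and an exponential survival time.
score_marginal = NormalDist(mu=50.0, sigma=10.0)
outcomes = [
    (score_marginal.inv_cdf(u1), exponential_quantile(u2, rate=0.1))
    for u1, u2 in uniform_pairs
]
```

Changing either marginal leaves the copula correlation untouched, which is the separation of concerns the abstract describes.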


Subject(s)
Bayes Theorem , Models, Statistical , Humans , Longitudinal Studies , Survival Analysis , Markov Chains , Breast Neoplasms/mortality , Monte Carlo Method , Normal Distribution , Female , Data Interpretation, Statistical , Biostatistics/methods
7.
Brief Bioinform ; 24(1), 2023 Jan 19.
Article in English | MEDLINE | ID: mdl-36592058

ABSTRACT

The progress of single-cell RNA sequencing (scRNA-seq) has produced a large amount of scRNA-seq data, which are widely used in biomedical research. The noise in the raw data and the tens of thousands of genes make it challenging to capture the real structure and effective information of scRNA-seq data. Most existing single-cell analysis methods assume that the low-dimensional embedding of the raw data follows a Gaussian distribution or lies in a low-dimensional nonlinear space without any prior information, which greatly limits the flexibility and controllability of the model. In addition, many existing methods incur a high computational cost, which makes them difficult to apply to large-scale datasets. Here, we design and develop a deep generative model named Gaussian mixture adversarial autoencoders (scGMAAE), which assumes that the low-dimensional embeddings of different cell types follow different Gaussian distributions and integrates Bayesian variational inference and adversarial training, so as to give an interpretable latent representation of complex data and discover the statistical distribution of different cell types. scGMAAE offers good controllability, interpretability and scalability. It can therefore process large-scale datasets in a short time and give competitive results. scGMAAE outperforms existing methods in several tasks, including dimensionality reduction visualization, cell clustering, differential expression analysis and batch effect removal. Importantly, compared with most deep learning methods, scGMAAE requires fewer iterations to generate the best results.


Subject(s)
Gene Expression Profiling , Single-Cell Gene Expression Analysis , Gene Expression Profiling/methods , Sequence Analysis, RNA/methods , Normal Distribution , Bayes Theorem , Single-Cell Analysis/methods , Cluster Analysis
8.
PLoS Comput Biol ; 20(9): e1011632, 2024 Sep.
Article in English | MEDLINE | ID: mdl-39331673

ABSTRACT

Thermal proteome profiling (TPP) is a proteome-wide technology that enables unbiased detection of protein-drug interactions as well as changes in the post-translational state of proteins between different biological conditions. Statistical analysis of temperature range TPP (TPP-TR) datasets relies on comparing protein melting curves, which describe the amount of non-denatured protein as a function of temperature, between different conditions (e.g. presence or absence of a drug). However, state-of-the-art models are restricted to sigmoidal melting behaviours, while unconventional melting curves, representing up to 50% of TPP-TR datasets, have recently been shown to carry important biological information. We present a novel statistical framework, based on hierarchical Gaussian process models and named GPMelt, to make the analysis of TPP-TR datasets unbiased with respect to the melting profiles of proteins. GPMelt scales to multiple conditions, and extending the model to deeper hierarchies (i.e. with additional sub-levels) makes it possible to handle complex TPP-TR protocols. Collectively, our statistical framework extends the analysis of TPP-TR datasets for both protein and peptide level melting curves, offering access to thousands of previously excluded melting curves and thus substantially increasing the coverage and the ability of TPP to uncover new biology.


Subject(s)
Proteome , Proteome/metabolism , Normal Distribution , Computational Biology/methods , Proteomics/methods , Models, Statistical , Algorithms
9.
PLoS Comput Biol ; 20(9): e1012448, 2024 Sep.
Article in English | MEDLINE | ID: mdl-39259748

ABSTRACT

Large-scale studies of gene expression are commonly influenced by biological and technical sources of expression variation, including batch effects, sample characteristics, and environmental impacts. Learning the causal relationships between observable variables may be challenging in the presence of unobserved confounders. Furthermore, many high-dimensional regression techniques may perform worse in that setting. Controlling for unobserved confounding variables is therefore essential, and many deconfounding methods have been suggested for application in a variety of situations. The main contribution of this article is the development of a two-stage deconfounding procedure based on a Bow-free Acyclic Paths (BAP) search developed within the framework of Structural Equation Models (SEM), called SEMbap(). In the first stage, an exhaustive search of missing edges with significant covariance is performed via Shipley d-separation tests; then, in the second stage, a Constrained Gaussian Graphical Model (CGGM) is fitted, or a low-dimensional representation of the bow-free edge structure is obtained via Graph Laplacian Principal Component Analysis (gLPCA). We compare four popular deconfounding methods to the BAP search approach with applications on simulated and observed expression data. In the former, different structures of the hidden covariance matrix have been replicated. Compared to existing methods, the BAP search algorithm is able to correctly identify hidden confounding whilst controlling the false positive rate and achieving good fitting and perturbation metrics.


Subject(s)
Algorithms , Computational Biology , Computational Biology/methods , Humans , Principal Component Analysis , Computer Simulation , Gene Expression Profiling/methods , Gene Expression Profiling/statistics & numerical data , Models, Statistical , Correlation of Data , Normal Distribution
10.
Proc Natl Acad Sci U S A ; 119(32): e2204453119, 2022 Aug 09.
Article in English | MEDLINE | ID: mdl-35914159

ABSTRACT

Changes in the geometry and topology of self-assembled membranes underlie diverse processes across cellular biology and engineering. Similar to lipid bilayers, monolayer colloidal membranes have in-plane fluid-like dynamics and out-of-plane bending elasticity. Their open edges and micrometer-length scale provide a tractable system to study the equilibrium energetics and dynamic pathways of membrane assembly and reconfiguration. Here, we find that doping colloidal membranes with short miscible rods transforms disk-shaped membranes into saddle-shaped surfaces with complex edge structures. The saddle-shaped membranes are well approximated by Enneper's minimal surfaces. Theoretical modeling demonstrates that their formation is driven by increasing the positive Gaussian modulus, which in turn, is controlled by the fraction of short rods. Further coalescence of saddle-shaped surfaces leads to diverse topologically distinct structures, including shapes similar to catenoids, trinoids, four-noids, and higher-order structures. At long timescales, we observe the formation of a system-spanning, sponge-like phase. The unique features of colloidal membranes reveal the topological transformations that accompany coalescence pathways in real time. We enhance the functionality of these membranes by making their shape responsive to external stimuli. Our results demonstrate a pathway toward control of thin elastic sheets' shape and topology, a pathway driven by the emergent elasticity induced by compositional heterogeneity.


Subject(s)
Lipid Bilayers , Elasticity , Lipid Bilayers/chemistry , Membranes/metabolism , Normal Distribution
11.
PLoS Genet ; 18(4): e1010151, 2022 Apr.
Article in English | MEDLINE | ID: mdl-35442943

ABSTRACT

With the advent of high-throughput genetic data, there have been attempts to estimate heritability from genome-wide SNP data on a cohort of distantly related individuals using a linear mixed model (LMM). Fitting such an LMM in a large-scale cohort study, however, is tremendously challenging due to its high-dimensional linear algebraic operations. In this paper, we propose a new method named PredLMM that approximates the aforementioned LMM, motivated by the concepts of genetic coalescence and the Gaussian predictive process. PredLMM has substantially better computational complexity than most existing LMM-based methods and thus provides a fast alternative for estimating heritability in large-scale cohort studies. Theoretically, we show that under a model of genetic coalescence, the limiting form of our approximation is the celebrated predictive process approximation of large Gaussian process likelihoods that has well-established accuracy standards. We illustrate our approach with extensive simulation studies and use it to estimate the heritability of multiple quantitative traits from the UK Biobank cohort.


Subject(s)
Genome-Wide Association Study , Models, Genetic , Cohort Studies , Genome-Wide Association Study/methods , Humans , Linear Models , Normal Distribution , Phenotype , Polymorphism, Single Nucleotide/genetics
12.
J Proteome Res ; 23(10): 4467-4479, 2024 Oct 04.
Article in English | MEDLINE | ID: mdl-39262370

ABSTRACT

Complexome profiling is an experimental approach to identify interactions by integrating native separation of protein complexes and quantitative mass spectrometry. In a typical complexome profile, thousands of proteins are detected across typically ≤100 fractions. This relatively low resolution leads to similar abundance profiles between proteins that are not necessarily interaction partners. To address this challenge, we introduce the Gaussian Interaction Profiler (GIP), a Gaussian mixture modeling-based clustering workflow that assigns protein clusters by modeling the migration profile of each cluster. Uniquely, the GIP offers a way to prioritize actual interactors over spuriously comigrating proteins. Using previously analyzed human fibroblast complexome profiles, we show good performance of the GIP compared to other state-of-the-art tools. We further demonstrate GIP utility by applying it to complexome profiles from the transmissible lifecycle stage of malaria parasites. We unveil promising novel associations for future experimental verification, including an interaction between the vaccine target Pfs47 and the hypothetical protein PF3D7_0417000. Taken together, the GIP provides methodological advances that facilitate more accurate and automated detection of protein complexes, setting the stage for more varied and nuanced analyses in the field of complexome profiling. The complexome profiling data have been deposited to the ProteomeXchange Consortium with the dataset identifier PXD050751.


Subject(s)
Plasmodium falciparum , Protozoan Proteins , Plasmodium falciparum/metabolism , Plasmodium falciparum/chemistry , Protozoan Proteins/chemistry , Protozoan Proteins/metabolism , Protozoan Proteins/analysis , Humans , Proteomics/methods , Normal Distribution , Mass Spectrometry/methods , Protein Interaction Mapping/methods , Cluster Analysis , Proteome/analysis
13.
J Cell Mol Med ; 28(19): e18590, 2024 Oct.
Article in English | MEDLINE | ID: mdl-39347925

ABSTRACT

Long non-coding RNAs (lncRNAs) and microRNAs (miRNAs) are two typical types of non-coding RNAs that interact and play important regulatory roles in many animal organisms. Exploring the unknown interactions between lncRNAs and miRNAs contributes to a better understanding of their functional involvement. Currently, studying the interactions between lncRNAs and miRNAs heavily relies on laborious biological experiments. Therefore, it is necessary to design a computational method for predicting lncRNA-miRNA interactions. In this work, we propose a method called MPGK-LMI, which utilizes a graph attention network (GAT) to predict lncRNA-miRNA interactions in animals. First, we construct a meta-path similarity matrix based on known lncRNA-miRNA interaction information. Then, we use GAT to aggregate the constructed meta-path similarity matrix and the computed Gaussian kernel similarity matrix to update the feature matrix with neighbourhood information. Finally, a scoring module is used for prediction. By comparing with three state-of-the-art algorithms, MPGK-LMI achieves the best results in terms of performance, with AUC value of 0.9077, AUPR of 0.9327, ACC of 0.9080, F1-score of 0.9143 and precision of 0.8739. These results validate the effectiveness and reliability of MPGK-LMI. Additionally, we conduct detailed case studies to demonstrate the effectiveness and feasibility of our approach in practical applications. Through these empirical results, we gain deeper insights into the functional roles and mechanisms of lncRNA-miRNA interactions, providing significant breakthroughs and advancements in this field of research. In summary, our method not only outperforms others in terms of performance but also establishes its practicality and reliability in biological research through real-case analysis, offering strong support and guidance for future studies and applications.
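The "computed Gaussian kernel similarity matrix" is not defined in the abstract. One common construction in interaction-prediction work is a Gaussian interaction profile kernel, in which two lncRNAs are similar when their known miRNA interaction profiles are close. The sketch below assumes that construction; the function name, the bandwidth normalisation and the toy interaction matrix are hypothetical and may differ from what MPGK-LMI actually computes.

```python
import math

def gaussian_kernel_similarity(profiles, bandwidth=1.0):
    """Pairwise Gaussian kernel similarity between binary interaction profiles.

    Assumption: gamma is the bandwidth divided by the mean squared profile norm,
    a normalisation often used for interaction profile kernels.
    """
    squared_norms = [sum(v * v for v in row) for row in profiles]
    mean_norm = sum(squared_norms) / len(squared_norms)
    gamma = bandwidth / mean_norm if mean_norm > 0 else bandwidth
    n = len(profiles)
    similarity = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(profiles[i], profiles[j]))
            similarity[i][j] = math.exp(-gamma * sq_dist)
    return similarity

# Hypothetical lncRNA x miRNA interaction matrix (1 = known interaction).
interactions = [
    [1, 0, 1],
    [1, 0, 1],
    [0, 1, 0],
]
S = gaussian_kernel_similarity(interactions)
```

In this toy matrix the first two lncRNAs share an identical profile and get similarity 1, while the third, which shares no partners with them, scores much lower.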


Subject(s)
Algorithms , Computational Biology , MicroRNAs , RNA, Long Noncoding , RNA, Long Noncoding/genetics , MicroRNAs/genetics , Computational Biology/methods , Animals , Humans , Gene Regulatory Networks , Normal Distribution
14.
Proteins ; 92(9): 1113-1126, 2024 Sep.
Article in English | MEDLINE | ID: mdl-38687146

ABSTRACT

An explicit analytic solution is given for the Langevin equation applied to the Gaussian Network Model of a protein subjected to both a random and a deterministic periodic force. Synchronous and asynchronous components of time correlation functions are derived and an expression for phase differences in the time correlations of residue pairs is obtained. The synchronous component enables the determination of dynamic communities within the protein structure. The asynchronous component reveals causality, where the time correlation function between residues i and j differs depending on whether i is observed before j or vice versa, resulting in directional information flow. Driver and driven residues in the allosteric process of cyclophilin A and human NAD-dependent isocitrate dehydrogenase are determined by a perturbation-scanning technique. Factors affecting phase differences between fluctuations of residues, such as network topology, connectivity, and residue centrality, are identified. Within the constraints of the isotropic Gaussian Network Model, our results show that asynchronicity increases with viscosity and distance between residues, decreases with increasing connectivity, and decreases with increasing levels of eigenvector centrality.
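The network topology and connectivity referred to above come from the Gaussian Network Model's connectivity (Kirchhoff) matrix, which links residues whose positions lie within a cutoff distance. A minimal sketch of building that matrix follows. The 7.0 Å cutoff and the toy coordinates are assumptions for illustration, and the Langevin analysis in the abstract is not reproduced here.

```python
import math

def kirchhoff_matrix(coords, cutoff=7.0):
    """Connectivity matrix of a Gaussian Network Model.

    Off-diagonal entries are -1 for residue pairs within the cutoff and 0
    otherwise; each diagonal entry is the number of contacts of that residue.
    """
    n = len(coords)
    gamma = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                gamma[i][j] = gamma[j][i] = -1
                gamma[i][i] += 1
                gamma[j][j] += 1
    return gamma

# Hypothetical alpha-carbon positions (in angstroms) along a straight chain.
ca_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
contacts = kirchhoff_matrix(ca_coords)
```

Every row sums to zero by construction, and the diagonal gives each residue's degree, one of the connectivity measures the abstract relates to phase differences.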


Subject(s)
Cyclophilin A , Humans , Cyclophilin A/chemistry , Cyclophilin A/metabolism , Isocitrate Dehydrogenase/chemistry , Isocitrate Dehydrogenase/metabolism , Isocitrate Dehydrogenase/genetics , Allosteric Regulation , Proteins/chemistry , Proteins/metabolism , Models, Molecular , Protein Conformation , Normal Distribution
15.
BMC Genomics ; 25(1): 904, 2024 Sep 30.
Article in English | MEDLINE | ID: mdl-39350040

ABSTRACT

BACKGROUND: RNA sequencing is a vital technique for analyzing RNA behavior in cells, but it often suffers from various biases that distort the data. Traditional methods to address these biases are typically empirical and handle them individually, limiting their effectiveness. Our study introduces the Gaussian Self-Benchmarking (GSB) framework, a novel approach that leverages the natural distribution patterns of guanine (G) and cytosine (C) content in RNA to mitigate multiple biases simultaneously. This method is grounded in a theoretical model, organizing k-mers based on their GC content and applying a Gaussian model for alignment to ensure empirical sequencing data closely match their theoretical distribution. RESULTS: The GSB framework demonstrated superior performance in mitigating sequencing biases compared to existing methods. Testing with synthetic RNA constructs and real human samples showed that the GSB approach not only addresses individual biases more effectively but also manages co-existing biases jointly. The framework's reliance on accurately pre-determined parameters like mean and standard deviation of GC content distribution allows for a more precise representation of RNA samples. This results in improved accuracy and reliability of RNA sequencing data, enhancing our understanding of RNA behavior in health and disease. CONCLUSIONS: The GSB framework presents a significant advancement in RNA sequencing analysis by providing a well-validated, multi-bias mitigation strategy. It functions independently from previously identified dataset flaws and sets a new standard for unbiased RNA sequencing results. This development enhances the reliability of RNA studies, broadening the potential for scientific breakthroughs in medicine and biology, particularly in genetic disease research and the development of targeted treatments.
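One way to picture the idea of organizing k-mers by GC content and aligning counts to a Gaussian with pre-determined mean and standard deviation is the sketch below. It is only an illustrative reading of the abstract, not the published GSB algorithm: the per-bin correction factor, the function names, and the toy counts and parameters are all assumptions.

```python
import math
from collections import defaultdict

def gc_count(kmer):
    """Number of G or C bases in a k-mer."""
    return sum(1 for base in kmer if base in "GC")

def gaussian_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def gc_correction_factors(kmer_counts, mean, sd):
    """Per-GC-bin factors that rescale observed totals to a Gaussian target.

    Assumption: the target share of each GC bin is proportional to a Gaussian
    density evaluated at that bin's GC count.
    """
    observed = defaultdict(int)
    for kmer, count in kmer_counts.items():
        observed[gc_count(kmer)] += count
    total = sum(observed.values())
    weights = {gc: gaussian_pdf(gc, mean, sd) for gc in observed}
    weight_sum = sum(weights.values())
    factors = {}
    for gc, obs in observed.items():
        if obs == 0:
            continue  # nothing to rescale in an empty bin
        expected = total * weights[gc] / weight_sum
        factors[gc] = expected / obs
    return factors

# Hypothetical 4-mer counts and hypothetical pre-determined parameters.
counts = {"AAAA": 10, "AGCA": 40, "ACGT": 20, "GCGA": 20, "GGCC": 10}
factors = gc_correction_factors(counts, mean=2.0, sd=1.0)
corrected = {kmer: c * factors[gc_count(kmer)] for kmer, c in counts.items()}
```

The rescaling preserves the total count while shifting mass between GC bins, so a bin that is over-represented relative to the Gaussian target is scaled down and an under-represented bin is scaled up.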


Subject(s)
Base Composition , RNA-Seq , Humans , RNA-Seq/methods , Normal Distribution , Sequence Analysis, RNA/methods , Bias , RNA/genetics
16.
Hum Brain Mapp ; 45(7): e26692, 2024 May.
Article in English | MEDLINE | ID: mdl-38712767

ABSTRACT

In neuroimaging studies, combining data collected from multiple study sites or scanners is becoming common to increase the reproducibility of scientific discoveries. At the same time, unwanted variations arise by using different scanners (inter-scanner biases), which need to be corrected before downstream analyses to facilitate replicable research and prevent spurious findings. While statistical harmonization methods such as ComBat have become popular in mitigating inter-scanner biases in neuroimaging, recent methodological advances have shown that harmonizing heterogeneous covariances results in higher data quality. In vertex-level cortical thickness data, heterogeneity in spatial autocorrelation is a critical factor that affects covariance heterogeneity. Our work proposes a new statistical harmonization method called spatial autocorrelation normalization (SAN) that preserves homogeneous covariance vertex-level cortical thickness data across different scanners. We use an explicit Gaussian process to characterize scanner-invariant and scanner-specific variations to reconstruct spatially homogeneous data across scanners. SAN is computationally feasible, and it easily allows the integration of existing harmonization methods. We demonstrate the utility of the proposed method using cortical thickness data from the Social Processes Initiative in the Neurobiology of the Schizophrenia(s) (SPINS) study. SAN is publicly available as an R package.


Subject(s)
Cerebral Cortex , Magnetic Resonance Imaging , Schizophrenia , Humans , Magnetic Resonance Imaging/standards , Magnetic Resonance Imaging/methods , Schizophrenia/diagnostic imaging , Schizophrenia/pathology , Cerebral Cortex/diagnostic imaging , Cerebral Cortex/anatomy & histology , Neuroimaging/methods , Neuroimaging/standards , Image Processing, Computer-Assisted/methods , Image Processing, Computer-Assisted/standards , Male , Female , Adult , Normal Distribution , Cerebral Cortical Thickness
17.
Brief Bioinform ; 23(5), 2022 Sep 20.
Article in English | MEDLINE | ID: mdl-35953081

ABSTRACT

Posttranslational modification of lysine residues, K-PTM, is one of the most common PTMs. Some lysine residues in proteins can undergo successive or cascaded covalent modifications, such as acetylation, crotonylation, methylation and succinylation. The covalent modification of lysine residues may have special functions in basic research and drug development. Although many computational methods have been developed to predict lysine PTMs, up to now K-PTM prediction methods have each modeled and learned only a single class of K-PTM modification. In view of this, this study aims to fill this gap by building a multi-label computational model that can be directly used to predict multiple K-PTMs in proteins. In this study, a multi-label prediction model, MLysPRED, is proposed to identify multiple lysine sites using features generated from human protein sequences. In MLysPRED, three kinds of multi-label sequence encoding algorithms (MLDBPB, MLPSDAAP, MLPSTAAP) are proposed and combined with three encoding strategies (CHHAA, DR and Kmer) to convert preprocessed lysine sequences into effective numerical features. A multidimensional normal distribution oversampling technique and a graph-based multi-view clustering under-sampling algorithm are proposed for the first time and incorporated to reduce the proportion of the original training samples, and a multi-label nearest neighbor algorithm is used for classification. MLysPRED achieved an Aiming of 92.21%, Coverage of 94.98%, Accuracy of 89.63%, Absolute-True of 81.46% and Absolute-False of 0.0682 on the independent datasets. Additionally, comparison with five existing predictors indicates that MLysPRED is a promising tool for predicting multiple K-PTMs in proteins. For the convenience of experimental scientists, 'MLysPRED' has been deployed as a user-friendly web-server at http://47.100.136.41:8181.


Subject(s)
Lysine , Proteins , Algorithms , Cluster Analysis , Computational Biology/methods , Humans , Lysine/metabolism , Normal Distribution , Protein Processing, Post-Translational , Proteins/chemistry
18.
Bioinformatics ; 39(5), 2023 May 04.
Article in English | MEDLINE | ID: mdl-37137236

ABSTRACT

MOTIVATION: There is a need for easily accessible implementations that measure the strength of both linear and non-linear relationships between metabolites in biological systems as an approach for data-driven network development. While multiple tools implement linear Pearson and Spearman methods, there are no such tools that assess distance correlation. RESULTS: We present here SIgned Distance COrrelation (SiDCo). SiDCo is a GUI platform for calculation of distance correlation in omics data, measuring linear and non-linear dependencies between variables, as well as correlation between vectors of different lengths, e.g. different sample sizes. By combining the sign of the overall trend from Pearson's correlation with distance correlation values, we further provide a novel "signed distance correlation" of particular use in metabolomic and lipidomic analyses. Distance correlations can be selected as one-to-one or one-to-all correlations, showing relationships between each feature and all other features one at a time or in combination. Additionally, we implement "partial distance correlation," calculated using the Gaussian Graphical model approach adapted to distance covariance. Our platform provides an easy-to-use software implementation that can be applied to the investigation of any dataset. AVAILABILITY AND IMPLEMENTATION: The SiDCo software application is freely available at https://complimet.ca/sidco. Supplementary help pages are provided at https://complimet.ca/sidco. Supplementary Material shows an example of an application of SiDCo in metabolomics.
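The signed distance correlation described above combines two standard quantities: the sample distance correlation, which is non-negative and detects non-linear dependence, and the sign of Pearson's correlation. The sketch below is a plain-Python illustration of that combination for equal-length vectors. It is not SiDCo's code, and treating a zero Pearson correlation as a positive sign is an assumption made here.

```python
import math

def _centered_distances(values):
    """Double-centered matrix of pairwise absolute differences."""
    n = len(values)
    dist = [[abs(a - b) for b in values] for a in values]
    row_means = [sum(row) / n for row in dist]
    grand_mean = sum(row_means) / n
    return [
        [dist[i][j] - row_means[i] - row_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]

def distance_correlation(x, y):
    """Sample distance correlation of two equal-length sequences."""
    n = len(x)
    a, b = _centered_distances(x), _centered_distances(y)
    dcov2 = sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / n ** 2
    dvar_x = sum(a[i][j] ** 2 for i in range(n) for j in range(n)) / n ** 2
    dvar_y = sum(b[i][j] ** 2 for i in range(n) for j in range(n)) / n ** 2
    denom = math.sqrt(dvar_x * dvar_y)
    if denom == 0.0:
        return 0.0  # at least one variable is constant
    return math.sqrt(max(dcov2, 0.0) / denom)

def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    denom = math.sqrt(var_x * var_y)
    return cov / denom if denom > 0.0 else 0.0

def signed_distance_correlation(x, y):
    """Distance correlation carrying the sign of the overall Pearson trend."""
    sign = -1.0 if pearson(x, y) < 0.0 else 1.0
    return sign * distance_correlation(x, y)

decreasing = signed_distance_correlation([1, 2, 3, 4, 5], [10, 8, 6, 4, 2])
parabola = signed_distance_correlation([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])
```

For the perfectly decreasing linear pair the value is -1, while the parabola has zero Pearson correlation but a clearly non-zero distance correlation, which is the kind of non-linear dependency the abstract says Pearson and Spearman tools miss.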


Subject(s)
Metabolomics , Software , Lipidomics , Normal Distribution , Sample Size
19.
Bioinformatics ; 39(9), 2023 Sep 02.
Article in English | MEDLINE | ID: mdl-37572301

ABSTRACT

MOTIVATION: Learning low-dimensional representations of single-cell transcriptomics has become instrumental to its downstream analysis. The state of the art is currently represented by neural network models, such as variational autoencoders, which use a variational approximation of the likelihood for inference. RESULTS: We here present the Deep Generative Decoder (DGD), a simple generative model that computes model parameters and representations directly via maximum a posteriori estimation. The DGD handles complex parameterized latent distributions naturally unlike variational autoencoders, which typically use a fixed Gaussian distribution, because of the complexity of adding other types. We first show its general functionality on a commonly used benchmark set, Fashion-MNIST. Secondly, we apply the model to multiple single-cell datasets. Here, the DGD learns low-dimensional, meaningful, and well-structured latent representations with sub-clustering beyond the provided labels. The advantages of this approach are its simplicity and its capability to provide representations of much smaller dimensionality than a comparable variational autoencoder. AVAILABILITY AND IMPLEMENTATION: scDGD is available as a python package at https://github.com/Center-for-Health-Data-Science/scDGD. The remaining code is made available here: https://github.com/Center-for-Health-Data-Science/dgd.


Subject(s)
Neural Networks, Computer , RNA , Gene Expression Profiling , Probability , Normal Distribution , Single-Cell Analysis
20.
Bioinformatics ; 39(5), 2023 May 04.
Article in English | MEDLINE | ID: mdl-37018147

ABSTRACT

MOTIVATION: Three-way data structures, characterized by three entities, the units, the variables and the occasions, are frequent in biological studies. In RNA sequencing, three-way data structures are obtained when high-throughput transcriptome sequencing data are collected for n genes across p conditions at r occasions. Matrix variate distributions offer a natural way to model three-way data and mixtures of matrix variate distributions can be used to cluster three-way data. Clustering of gene expression data is carried out as a means of discovering gene co-expression networks. RESULTS: In this work, a mixture of matrix variate Poisson-log normal distributions is proposed for clustering read counts from RNA sequencing. By considering the matrix variate structure, full information on the conditions and occasions of the RNA sequencing dataset is simultaneously considered, and the number of covariance parameters to be estimated is reduced. We propose three different frameworks for parameter estimation: a Markov chain Monte Carlo-based approach, a variational Gaussian approximation-based approach, and a hybrid approach. Various information criteria are used for model selection. The models are applied to both real and simulated data, and we demonstrate that the proposed approaches can recover the underlying cluster structure in both cases. In simulation studies where the true model parameters are known, our proposed approach shows good parameter recovery. AVAILABILITY AND IMPLEMENTATION: The GitHub R package for this work is available at https://github.com/anjalisilva/mixMVPLN and is released under the open source MIT license.
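The building block of the proposed mixture is the Poisson-log normal distribution: a count is Poisson with a rate whose logarithm is normally distributed. The sketch below simulates the univariate case only, to show the overdispersion this produces. It is not the matrix variate mixture of the paper or the mixMVPLN package, and the parameter values and the simple multiplication-based Poisson sampler (suitable only for moderate rates) are illustrative assumptions.

```python
import math
import random

def sample_poisson(rate, rng):
    """Poisson draw by sequential multiplication; fine for moderate rates only."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def sample_poisson_log_normal(mu, sigma, n, seed=0):
    """Draw n counts whose log-rate is Normal(mu, sigma)."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n):
        log_rate = rng.gauss(mu, sigma)
        counts.append(sample_poisson(math.exp(log_rate), rng))
    return counts

# Hypothetical parameters standing in for one gene under one condition.
counts = sample_poisson_log_normal(mu=2.0, sigma=0.5, n=5000)
mean_count = sum(counts) / len(counts)
variance = sum((c - mean_count) ** 2 for c in counts) / (len(counts) - 1)
```

The sampled variance exceeds the mean, unlike a plain Poisson model, which is one reason log normal rate layers are attractive for RNA sequencing read counts.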


Subject(s)
Transcriptome , Normal Distribution , Computer Simulation , Statistical Distributions , Sequence Analysis, RNA