Results 1 - 20 of 38
1.
PLoS Genet ; 19(5): e1010760, 2023 05.
Article in English | MEDLINE | ID: mdl-37200393

ABSTRACT

Heterozygous variants in the glucocerebrosidase (GBA) gene are common and potent risk factors for Parkinson's disease (PD). GBA also causes the autosomal recessive lysosomal storage disorder (LSD), Gaucher disease, and emerging evidence from human genetics implicates many other LSD genes in PD susceptibility. We have systematically tested 86 conserved fly homologs of 37 human LSD genes for requirements in the aging adult Drosophila brain and for potential genetic interactions with neurodegeneration caused by α-synuclein (αSyn), which forms Lewy body pathology in PD. Our screen identifies 15 genetic enhancers of αSyn-induced progressive locomotor dysfunction, including knockdown of fly homologs of GBA and other LSD genes with independent support as PD susceptibility factors from human genetics (SCARB2, SMPD1, CTSD, GNPTAB, SLC17A5). For several genes, results from multiple alleles suggest dose-sensitivity and context-dependent pleiotropy in the presence or absence of αSyn. Homologs of two genes causing cholesterol storage disorders, Npc1a / NPC1 and Lip4 / LIPA, were independently confirmed as loss-of-function enhancers of αSyn-induced retinal degeneration. The enzymes encoded by several modifier genes are upregulated in αSyn transgenic flies, based on unbiased proteomics, revealing a possible, albeit ineffective, compensatory response. Overall, our results reinforce the important role of lysosomal genes in brain health and PD pathogenesis, and implicate several metabolic pathways, including cholesterol homeostasis, in αSyn-mediated neurotoxicity.


Subjects
Parkinson Disease , alpha-Synuclein , Animals , Humans , alpha-Synuclein/genetics , alpha-Synuclein/metabolism , Animals, Genetically Modified , Drosophila/genetics , Drosophila/metabolism , Glucosylceramidase/genetics , Glucosylceramidase/metabolism , Lysosomes/metabolism , Parkinson Disease/pathology , Transferases (Other Substituted Phosphate Groups)/metabolism , Aging/metabolism
2.
Bioinformatics ; 39(10), 2023 10 03.
Article in English | MEDLINE | ID: mdl-37792497

ABSTRACT

MOTIVATION: Nuclear magnetic resonance spectroscopy (NMR) is widely used to analyze metabolites in biological samples, but the analysis requires specific expertise, is time-consuming, and can be inaccurate. Here, we present a powerful automated tool, SPatial clustering Algorithm-Statistical TOtal Correlation SpectroscopY (SPA-STOCSY), which overcomes challenges faced when analyzing NMR data and identifies metabolites in a sample with high accuracy. RESULTS: As a data-driven method, SPA-STOCSY estimates all parameters from the input dataset. It first investigates the covariance pattern among datapoints and then calculates the optimal threshold with which to cluster datapoints belonging to the same structural unit, i.e. the metabolite. Generated clusters are then automatically linked to a metabolite library to identify candidates. To assess SPA-STOCSY's efficiency and accuracy, we applied it to synthesized spectra and spectra acquired on Drosophila melanogaster tissue and human embryonic stem cells. In the synthesized spectra, SPA outperformed Statistical Recoupling of Variables (SRV), an existing method for clustering spectral peaks, by capturing a higher percentage of the signal regions and the close-to-zero noise regions. In the biological data, SPA-STOCSY performed comparably to the operator-based Chenomx analysis while avoiding operator bias, and it required <7 min of total computation time. Overall, SPA-STOCSY is a fast, accurate, and unbiased tool for untargeted analysis of metabolites in the NMR spectra. It may thus accelerate the use of NMR for scientific discoveries, medical diagnostics, and patient-specific decision making. AVAILABILITY AND IMPLEMENTATION: The codes of SPA-STOCSY are available at https://github.com/LiuzLab/SPA-STOCSY.
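
The clustering step described in this abstract (group spectral datapoints whose intensities co-vary across samples) can be illustrated with a minimal sketch. This is not the SPA-STOCSY implementation: the function names, the toy spectra, and the fixed correlation cutoff are all assumptions made for illustration, whereas the published method estimates its threshold from the data.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def cluster_adjacent_points(spectra, threshold):
    """Group neighbouring spectral points that co-vary across samples.

    spectra:   list of samples, each a list of intensities on a shared grid.
    threshold: hypothetical fixed correlation cutoff (an assumption here;
               the published tool derives its threshold from the data).
    Returns a list of clusters, each a list of consecutive point indices.
    """
    n_points = len(spectra[0])
    columns = [[s[j] for s in spectra] for j in range(n_points)]
    clusters = [[0]]
    for j in range(1, n_points):
        # Extend the current cluster while adjacent points stay correlated.
        if pearson(columns[j - 1], columns[j]) >= threshold:
            clusters[-1].append(j)
        else:
            clusters.append([j])
    return clusters

# Toy data: points 0-2 scale with one compound, points 3-4 with another.
a = [1.0, 2.0, 3.0, 4.0]  # amount of hypothetical compound A in 4 samples
b = [4.0, 1.0, 3.0, 2.0]  # amount of hypothetical compound B
spectra = [[ai, 2 * ai, 0.5 * ai, bi, 3 * bi] for ai, bi in zip(a, b)]
clusters = cluster_adjacent_points(spectra, threshold=0.9)
print(clusters)
```

In this toy case the two groups of points are recovered as separate clusters; each cluster would then be the unit matched against a compound library.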


Subjects
Drosophila melanogaster , Magnetic Resonance Imaging , Animals , Humans , Magnetic Resonance Spectroscopy/methods , Cluster Analysis , Metabolomics/methods
3.
PLoS Comput Biol ; 18(10): e1010577, 2022 10.
Article in English | MEDLINE | ID: mdl-36191044

ABSTRACT

Consensus clustering has been widely used in bioinformatics and other applications to improve the accuracy, stability and reliability of clustering results. This approach ensembles cluster co-occurrences from multiple clustering runs on subsampled observations. For application to large-scale bioinformatics data, such as discovering cell types from single-cell sequencing data, consensus clustering has two significant drawbacks: (i) computational inefficiency due to repeatedly applying clustering algorithms, and (ii) lack of interpretability into the important features for differentiating clusters. In this paper, we address these two challenges by developing IMPACC: Interpretable MiniPatch Adaptive Consensus Clustering. Our approach adopts three major innovations. We ensemble cluster co-occurrences from tiny subsets of both observations and features, termed minipatches, thus dramatically reducing computation time. Additionally, we develop adaptive sampling schemes for observations, which result in both improved reliability and computational savings, as well as adaptive sampling schemes of features, which lead to interpretable solutions by quickly learning the most relevant features that differentiate clusters. We study our approach on synthetic data and a variety of real large-scale bioinformatics data sets; results show that our approach not only yields more accurate and interpretable cluster solutions, but it also substantially improves computational efficiency compared to standard consensus clustering approaches.
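
The baseline idea named in this abstract, ensembling cluster co-occurrences over runs on subsampled observations, can be sketched as follows. This is a sketch of standard consensus clustering only, not of IMPACC: it does not subsample features or sample adaptively, and the tiny 1-D two-cluster routine, the function names, and the toy data are assumptions chosen to keep the example self-contained.

```python
import random

def two_means_1d(values, iters=20):
    """Very small 1-D k-means with k=2; returns a 0/1 label per value."""
    c0, c1 = min(values), max(values)
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [0 if abs(v - c0) <= abs(v - c1) else 1 for v in values]
        g0 = [v for v, lab in zip(values, labels) if lab == 0]
        g1 = [v for v, lab in zip(values, labels) if lab == 1]
        if g0:
            c0 = sum(g0) / len(g0)
        if g1:
            c1 = sum(g1) / len(g1)
    return labels

def consensus_matrix(data, n_runs=50, frac=0.7, seed=0):
    """Fraction of runs in which each pair of observations co-clusters,
    counted only over runs where both were in the subsample."""
    rng = random.Random(seed)
    n = len(data)
    together = [[0] * n for _ in range(n)]
    sampled = [[0] * n for _ in range(n)]
    m = max(2, int(frac * n))
    for _ in range(n_runs):
        idx = rng.sample(range(n), m)  # subsample observations
        labels = two_means_1d([data[i] for i in idx])
        for a, i in enumerate(idx):
            for b, j in enumerate(idx):
                sampled[i][j] += 1
                if labels[a] == labels[b]:
                    together[i][j] += 1
    return [[together[i][j] / sampled[i][j] if sampled[i][j] else 0.0
             for j in range(n)] for i in range(n)]

# Toy data with two well-separated groups (indices 0-2 and 3-5).
data = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
C = consensus_matrix(data)
print(C[0][1], C[0][3])
```

Entries of the consensus matrix near 1 indicate pairs that are consistently grouped together, and entries near 0 indicate pairs that are consistently separated; a final clustering is typically obtained from this matrix.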


Subjects
Algorithms , Computational Biology , Cluster Analysis , Computational Biology/methods , Consensus , Reproducibility of Results
4.
Biometrics ; 79(4): 3846-3858, 2023 12.
Article in English | MEDLINE | ID: mdl-36950906

ABSTRACT

Clustering has long been a popular unsupervised learning approach to identify groups of similar objects and discover patterns from unlabeled data in many applications. Yet, coming up with meaningful interpretations of the estimated clusters has often been challenging precisely due to their unsupervised nature. Meanwhile, in many real-world scenarios, there are some noisy supervising auxiliary variables, for instance, subjective diagnostic opinions, that are related to the observed heterogeneity of the unlabeled data. By leveraging information from both supervising auxiliary variables and unlabeled data, we seek to uncover more scientifically interpretable group structures that may be hidden by completely unsupervised analyses. In this work, we propose and develop a new statistical pattern discovery method named supervised convex clustering (SCC) that borrows strength from both information sources and guides towards finding more interpretable patterns via a joint convex fusion penalty. We develop several extensions of SCC to integrate different types of supervising auxiliary variables, to adjust for additional covariates, and to find biclusters. We demonstrate the practical advantages of SCC through simulations and a case study on Alzheimer's disease genomics. Specifically, we discover new candidate genes as well as new subtypes of Alzheimer's disease that can potentially lead to better understanding of the underlying genetic mechanisms responsible for the observed heterogeneity of cognitive decline in older adults.


Subjects
Alzheimer Disease , Humans , Aged , Alzheimer Disease/genetics , Genomics , Cluster Analysis
5.
Neuroimage ; 197: 330-343, 2019 08 15.
Article in English | MEDLINE | ID: mdl-31029870

ABSTRACT

Advanced brain imaging techniques make it possible to measure individuals' structural connectomes in large cohort studies non-invasively. Given the availability of large scale data sets, it is extremely interesting and important to build a set of advanced tools for structural connectome extraction and statistical analysis that emphasize both interpretability and predictive power. In this paper, we developed and integrated a set of toolboxes, including an advanced structural connectome extraction pipeline and a novel tensor network principal components analysis (TN-PCA) method, to study relationships between structural connectomes and various human traits such as alcohol and drug use, cognition and motion abilities. The structural connectome extraction pipeline produces a set of connectome features for each subject that can be organized as a tensor network, and TN-PCA maps the high-dimensional tensor network data to a lower-dimensional Euclidean space. Combined with classical hypothesis testing, canonical correlation analysis and linear discriminant analysis techniques, we analyzed over 1100 scans of 1076 subjects from the Human Connectome Project (HCP) and the Sherbrooke test-retest data set, as well as 175 human traits measuring different domains including cognition, substance use, motor, sensory and emotion. The test-retest data validated the developed algorithms. With the HCP data, we found that structural connectomes are associated with a wide range of traits, e.g., fluid intelligence, language comprehension, and motor skills are associated with increased cortical-cortical brain structural connectivity, while the use of alcohol, tobacco, and marijuana is associated with decreased cortical-cortical connectivity. We also demonstrated that our extracted structural connectomes and analysis method can give superior prediction accuracies compared with alternative connectome constructions and other tensor and network regression methods.


Subjects
Brain/anatomy & histology , Connectome/methods , Diffusion Tensor Imaging/methods , Image Processing, Computer-Assisted/methods , Personality/physiology , Brain/diagnostic imaging , Data Interpretation, Statistical , Female , Humans , Male , Models, Neurological , Neural Pathways/anatomy & histology , Principal Component Analysis
6.
Bioinformatics ; 34(7): 1141-1147, 2018 04 01.
Article in English | MEDLINE | ID: mdl-29617963

ABSTRACT

Motivation: Batch effects are one of the major sources of technical variation that affect the measurements in high-throughput studies such as RNA sequencing. It has been well established that batch effects can be caused by different experimental platforms, laboratory conditions, different sources of samples and personnel differences. These differences can confound the outcomes of interest and lead to spurious results. A critical input for batch correction algorithms is the knowledge of batch factors, which in many cases are unknown or inaccurate. Hence, the primary motivation of our paper is to detect hidden batch factors that can be used in standard techniques to accurately capture the relationship between gene expression and other modeled variables of interest. Results: We introduce a new algorithm based on data-adaptive shrinkage and semi-Non-negative Matrix Factorization for the detection of unknown batch effects. We test our algorithm on three different datasets: (i) Sequencing Quality Control, (ii) Topotecan RNA-Seq and (iii) Single-cell RNA sequencing (scRNA-Seq) on Glioblastoma Multiforme. We have demonstrated a superior performance in identifying hidden batch effects as compared to existing algorithms for batch detection in all three datasets. In the Topotecan study, we were able to identify a new batch factor that had been missed by the original study, leading to under-representation of differentially expressed genes. For scRNA-Seq, we demonstrated the power of our method in detecting subtle batch effects. Availability and implementation: DASC R package is available via Bioconductor or at https://github.com/zhanglabNKU/DASC. Contact: zhanghan@nankai.edu.cn or zhandonl@bcm.edu. Supplementary information: Supplementary data are available at Bioinformatics online.


Subjects
Algorithms , Gene Expression Profiling/methods , Quality Control , Research Design , Sequence Analysis, RNA/methods , Glioblastoma/genetics , Humans , Topotecan/pharmacology
7.
BMC Bioinformatics ; 18(Suppl 11): 405, 2017 Oct 03.
Article in English | MEDLINE | ID: mdl-28984189

ABSTRACT

The 2016 International Conference on Intelligent Biology and Medicine (ICIBM 2016) was held on December 8-10, 2016 in Houston, Texas, USA. ICIBM included eight scientific sessions, four tutorials, one poster session, four highlighted talks and four keynotes that covered topics on 3D genomics structural analysis, next generation sequencing (NGS) analysis, computational drug discovery, medical informatics, cancer genomics, and systems biology. Here, we present a summary of the nine research articles selected from the ICIBM 2016 program for publication in BMC Bioinformatics.


Subjects
Biology , Congresses as Topic , Internationality , Medicine , Statistics as Topic , Algorithms , DNA Copy Number Variations/genetics , Humans , Machine Learning , Neural Networks, Computer , RNA Splicing/genetics , Sequence Analysis, RNA
8.
BMC Genomics ; 18(Suppl 6): 703, 2017 Oct 03.
Article in English | MEDLINE | ID: mdl-28984207

ABSTRACT

In this editorial, we first summarize the 2016 International Conference on Intelligent Biology and Medicine (ICIBM 2016) that was held on December 8-10, 2016 in Houston, Texas, USA, and then briefly introduce the ten research articles included in this supplement issue. ICIBM 2016 included four workshops or tutorials, four keynote lectures, four conference invited talks, eight concurrent scientific sessions and a poster session for 53 accepted abstracts, covering current topics in bioinformatics, systems biology, intelligent computing, and biomedical informatics. Through our call for papers, a total of 77 original manuscripts were submitted to ICIBM 2016. After peer review, 11 articles were selected in this special issue, covering topics such as single cell RNA-seq analysis method, genome sequence and variation analysis, bioinformatics method for vaccine development, and cancer genomics.


Subjects
Genomics , Inventions , Medicine
9.
Bioinformatics ; 32(6): 952-4, 2016 03 15.
Article in English | MEDLINE | ID: mdl-26568634

ABSTRACT

MOTIVATION: Massive amounts of high-throughput genomics data profiled from tumor samples were made publicly available by the Cancer Genome Atlas (TCGA). RESULTS: We have developed an open source software package, TCGA2STAT, to obtain the TCGA data, wrangle it, and pre-process it into a format ready for multivariate and integrated statistical analysis in the R environment. In a user-friendly format with one single function call, our package downloads and fully processes the desired TCGA data to be seamlessly integrated into a computational analysis pipeline. No further technical or biological knowledge is needed to utilize our software, thus making TCGA data easily accessible to data scientists without specific domain knowledge. AVAILABILITY AND IMPLEMENTATION: TCGA2STAT is available from https://cran.r-project.org/web/packages/TCGA2STAT/index.html. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. CONTACT: zhandong.liu@bcm.edu.


Subjects
Software , Genomics , Humans , Neoplasms
10.
Biometrics ; 73(1): 10-19, 2017 03.
Article in English | MEDLINE | ID: mdl-27163413

ABSTRACT

In the biclustering problem, we seek to simultaneously group observations and features. While biclustering has applications in a wide array of domains, ranging from text mining to collaborative filtering, the problem of identifying structure in high-dimensional genomic data motivates this work. In this context, biclustering enables us to identify subsets of genes that are co-expressed only within a subset of experimental conditions. We present a convex formulation of the biclustering problem that possesses a unique global minimizer and an iterative algorithm, COBRA, that is guaranteed to identify it. Our approach generates an entire solution path of possible biclusters as a single tuning parameter is varied. We also show how to reduce the problem of selecting this tuning parameter to solving a trivial modification of the convex biclustering problem. The key contributions of our work are its simplicity, interpretability, and algorithmic guarantees, features that arguably are lacking in the current alternative algorithms. We demonstrate the advantages of our approach, which include stably and reproducibly identifying biclusterings, on simulated and real microarray data.


Subjects
Cluster Analysis , Data Interpretation, Statistical , Gene Regulatory Networks , Algorithms , Computational Biology/methods , Databases, Genetic , Gene Expression Profiling/methods , Oligonucleotide Array Sequence Analysis
11.
Alzheimers Dement ; 12(6): 645-53, 2016 06.
Article in English | MEDLINE | ID: mdl-27079753

ABSTRACT

Identifying accurate biomarkers of cognitive decline is essential for advancing early diagnosis and prevention therapies in Alzheimer's disease. The Alzheimer's disease DREAM Challenge was designed as a computational crowdsourced project to benchmark the current state-of-the-art in predicting cognitive outcomes in Alzheimer's disease based on high dimensional, publicly available genetic and structural imaging data. This meta-analysis failed to identify a meaningful predictor developed from either data modality, suggesting that alternate approaches should be considered for prediction of cognitive performance.


Subjects
Alzheimer Disease/complications , Cognition Disorders/diagnosis , Cognition Disorders/etiology , Alzheimer Disease/genetics , Apolipoproteins E/genetics , Biomarkers , Cognition Disorders/genetics , Computational Biology , Databases, Bibliographic/statistics & numerical data , Humans , Predictive Value of Tests
12.
Hum Brain Mapp ; 36(11): 4566-81, 2015 Nov.
Article in English | MEDLINE | ID: mdl-26304096

ABSTRACT

Neurofibromatosis type I (NF1) is a genetic disorder caused by mutations in the neurofibromin 1 gene at locus 17q11.2. Individuals with NF1 have an increased incidence of learning disabilities, attention deficits, and autism spectrum disorders. As a single-gene disorder, NF1 represents a valuable model for understanding gene-brain-behavior relationships. While mouse models have elucidated molecular and cellular mechanisms underlying learning deficits associated with this mutation, little is known about functional brain architecture in human subjects with NF1. To address this question, we used resting state functional connectivity magnetic resonance imaging (rs-fcMRI) to elucidate the intrinsic network structure of 30 NF1 participants compared with 30 healthy demographically matched controls during an eyes-open rs-fcMRI scan. Novel statistical methods were employed to quantify differences in local connectivity (edge strength) and modularity structure, in combination with traditional global graph theory applications. Our findings suggest that individuals with NF1 have reduced anterior-posterior connectivity, weaker bilateral edges, and altered modularity clustering relative to healthy controls. Further, edge strength and modular clustering indices were correlated with IQ and internalizing symptoms. These findings suggest that Ras signaling disruption may lead to abnormal functional brain connectivity; further investigation into the functional consequences of these alterations in both humans and in animal models is warranted.


Subjects
Brain/physiopathology , Functional Neuroimaging/methods , Nerve Net/physiopathology , Neurofibromatosis 1/physiopathology , Adolescent , Adult , Child , Female , Humans , Magnetic Resonance Imaging/methods , Male , Middle Aged , Young Adult
13.
Biometrics ; 71(4): 905-17, 2015 Dec.
Article in English | MEDLINE | ID: mdl-26295449

ABSTRACT

Technological advances have led to a proliferation of structured big data that have matrix-valued covariates. We are specifically motivated to build predictive models for multi-subject neuroimaging data based on each subject's brain imaging scans. This is an ultra-high-dimensional problem that consists of a matrix of covariates (brain locations by time points) for each subject; few methods currently exist to fit supervised models directly to this tensor data. We propose a novel modeling and algorithmic strategy to apply generalized linear models (GLMs) to this massive tensor data in which one set of variables is associated with locations. Our method begins by fitting GLMs to each location separately, and then builds an ensemble by blending information across locations through regularization with what we term an aggregating penalty. Our so-called Local-Aggregate Model can be fit in a completely distributed manner over the locations using an Alternating Direction Method of Multipliers (ADMM) strategy, and thus greatly reduces the computational burden. Furthermore, we propose to select the appropriate model through a novel sequence of faster algorithmic solutions that is similar to regularization paths. We demonstrate both the computational and predictive modeling advantages of our methods via simulations and an EEG classification problem.


Subjects
Neuroimaging/statistics & numerical data , Algorithms , Biometry/methods , Computer Simulation , Electroencephalography/statistics & numerical data , Humans , Linear Models , Machine Learning/statistics & numerical data , Regression Analysis
14.
J Neurosci ; 33(35): 14098-106, 2013 Aug 28.
Article in English | MEDLINE | ID: mdl-23986245

ABSTRACT

Synesthesia is a condition in which normal stimuli can trigger anomalous associations. In this study, we exploit synesthesia to understand how the synesthetic experience can be explained by subtle changes in network properties. Of the many forms of synesthesia, we focus on colored sequence synesthesia, a form in which colors are associated with overlearned sequences, such as numbers and letters (graphemes). Previous studies have characterized synesthesia using resting-state connectivity or stimulus-driven analyses, but it remains unclear how network properties change as synesthetes move from one condition to another. To address this gap, we used functional MRI in humans to identify grapheme-specific brain regions, thereby constructing a functional "synesthetic" network. We then explored functional connectivity of color and grapheme regions during a synesthesia-inducing fMRI paradigm involving rest, auditory grapheme stimulation, and audiovisual grapheme stimulation. Using Markov networks to represent direct relationships between regions, we found that synesthetes had more connections during rest and auditory conditions. We then expanded the network space to include 90 anatomical regions, revealing that synesthetes tightly cluster in visual regions, whereas controls cluster in parietal and frontal regions. Together, these results suggest that synesthetes have increased connectivity between grapheme and color regions, and that synesthetes use visual regions to a greater extent than controls when presented with dynamic grapheme stimulation. These data suggest that synesthesia is better characterized by studying global network dynamics than by individual properties of a single brain region.


Subjects
Color Perception , Nerve Net/physiopathology , Perceptual Disorders/physiopathology , Acoustic Stimulation , Adult , Brain/physiopathology , Brain Mapping , Case-Control Studies , Female , Humans , Language , Magnetic Resonance Imaging , Male , Markov Chains , Photic Stimulation , Synesthesia
15.
J Am Stat Assoc ; 119(547): 2282-2293, 2024.
Article in English | MEDLINE | ID: mdl-39328784

ABSTRACT

In this paper, we investigate the Gaussian graphical model inference problem in a novel setting that we call erose measurements, referring to irregularly measured or observed data. For graphs, this results in different node pairs having vastly different sample sizes, a situation that frequently arises in data integration, genomics, neuroscience, and sensor networks. Existing works characterize the graph selection performance using the minimum pairwise sample size, which provides little insight for erosely measured data, and no existing inference method is applicable. We aim to fill this gap by proposing the first inference method that characterizes the different uncertainty levels over the graph caused by the erose measurements, named GI-JOE (Graph Inference when Joint Observations are Erose). Specifically, we develop an edge-wise inference method and an affiliated FDR control procedure, where the variance of each edge depends on the sample sizes associated with corresponding neighbors. We prove statistical validity under erose measurements, thanks to careful localized edge-wise analysis and disentangling the dependencies across the graph. Finally, through simulation studies and a real neuroscience data example, we demonstrate the advantages of our inference methods for graph selection from erosely measured data.

16.
BMC Genomics ; 14 Suppl 8: S7, 2013.
Article in English | MEDLINE | ID: mdl-24564637

ABSTRACT

BACKGROUND: Selecting genes and pathways indicative of disease is a central problem in computational biology. This problem is especially challenging when parsing multi-dimensional genomic data. A number of tools, such as L1-norm based regularization and its extensions elastic net and fused lasso, have been introduced to deal with this challenge. However, these approaches tend to ignore the vast amount of a priori biological network information curated in the literature. RESULTS: We propose the use of graph Laplacian regularized logistic regression to integrate biological networks into disease classification and pathway association problems. Simulation studies demonstrate that the performance of the proposed algorithm is superior to elastic net and lasso analyses. Utility of this algorithm is also validated by its ability to reliably differentiate breast cancer subtypes using a large breast cancer dataset recently generated by the Cancer Genome Atlas (TCGA) consortium. Many of the protein-protein interaction modules identified by our approach are further supported by evidence published in the literature. Source code of the proposed algorithm is freely available at http://www.github.com/zhandong/Logit-Lapnet. CONCLUSION: Logistic regression with graph Laplacian regularization is an effective algorithm for identifying key pathways and modules associated with disease subtypes. With the rapid expansion of our knowledge of biological regulatory networks, this approach will become more accurate and increasingly useful for mining transcriptomic, epi-genomic, and other types of genome wide association studies.
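
As an illustration of the technique named in this abstract, the sketch below fits a logistic regression by gradient descent with a graph Laplacian penalty, minimizing the mean logistic loss plus lam * beta^T L beta. It is a minimal sketch under stated assumptions, not the code from the linked repository: the function names, the two-feature toy data, the step size, and the absence of an intercept or any additional penalty terms are all choices made here for illustration, and the published method may differ.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def laplacian(n, edges):
    """Unnormalised graph Laplacian L = D - A for an undirected graph."""
    L = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        L[i][i] += 1.0
        L[j][j] += 1.0
        L[i][j] -= 1.0
        L[j][i] -= 1.0
    return L

def fit(X, y, L, lam, lr=0.1, steps=2000):
    """Gradient descent on mean logistic loss + lam * beta^T L beta."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(steps):
        grad = [0.0] * p
        for xi, yi in zip(X, y):
            err = sigmoid(sum(b * v for b, v in zip(beta, xi))) - yi
            for k in range(p):
                grad[k] += err * xi[k] / n
        for k in range(p):
            # Gradient of the Laplacian penalty is 2 * lam * (L @ beta).
            grad[k] += 2.0 * lam * sum(L[k][m] * beta[m] for m in range(p))
        beta = [b - lr * g for b, g in zip(beta, grad)]
    return beta

# Toy data (hypothetical): feature 0 predicts the label 3 times out of 4,
# feature 1 carries no signal; the two features are linked in the network.
points = [((1.0, 1.0), [1, 1, 1, 0]), ((1.0, -1.0), [1, 1, 1, 0]),
          ((-1.0, 1.0), [0, 0, 0, 1]), ((-1.0, -1.0), [0, 0, 0, 1])]
X = [list(x) for x, labels in points for _ in labels]
y = [float(lab) for _, labels in points for lab in labels]
L = laplacian(2, [(0, 1)])

b_plain = fit(X, y, L, lam=0.0)  # ordinary logistic regression
b_graph = fit(X, y, L, lam=1.0)  # coefficients smoothed over the edge
print(b_plain, b_graph)
```

With the penalty switched on, the coefficients of the two linked features are pulled toward each other, which is the mechanism by which network information enters the fit.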


Subjects
Algorithms , Biomarkers, Tumor/metabolism , Breast Neoplasms/metabolism , Computational Biology/methods , Gene Regulatory Networks , Computer Simulation , Female , Humans , Logistic Models , Models, Biological , Reproducibility of Results
17.
bioRxiv ; 2023 Feb 22.
Article in English | MEDLINE | ID: mdl-36865102

ABSTRACT

Nuclear Magnetic Resonance (NMR) spectroscopy is widely used to analyze metabolites in biological samples, but the analysis can be cumbersome and inaccurate. Here, we present a powerful automated tool, SPA-STOCSY (Spatial Clustering Algorithm - Statistical Total Correlation Spectroscopy), which overcomes the challenges by identifying metabolites in each sample with high accuracy. As a data-driven method, SPA-STOCSY estimates all parameters from the input dataset, first investigating the covariance pattern and then calculating the optimal threshold with which to cluster data points belonging to the same structural unit, i.e. metabolite. The generated clusters are then automatically linked to a compound library to identify candidates. To assess SPA-STOCSY’s efficiency and accuracy, we applied it to synthesized and real NMR data obtained from Drosophila melanogaster brains and human embryonic stem cells. In the synthesized spectra, SPA outperforms Statistical Recoupling of Variables, an existing method for clustering spectral peaks, by capturing a higher percentage of the signal regions and the close-to-zero noise regions. In the real spectra, SPA-STOCSY performs comparably to operator-based Chenomx analysis but avoids operator bias and performs the analyses in less than seven minutes of total computation time. Overall, SPA-STOCSY is a fast, accurate, and unbiased tool for untargeted analysis of metabolites in the NMR spectra. As such, it might accelerate the utilization of NMR for scientific discoveries, medical diagnostics, and patient-specific decision making.

18.
Bioinformatics ; 27(21): 3029-35, 2011 Nov 01.
Article in English | MEDLINE | ID: mdl-21930672

ABSTRACT

MOTIVATION: Nuclear magnetic resonance (NMR) spectroscopy has been used to study mixtures of metabolites in biological samples. This technology produces a spectrum for each sample depicting the chemical shifts at which an unknown number of latent metabolites resonate. The interpretation of this data with common multivariate exploratory methods such as principal components analysis (PCA) is limited due to high-dimensionality, non-negativity of the underlying spectra and dependencies at adjacent chemical shifts. RESULTS: We develop a novel modification of PCA that is appropriate for analysis of NMR data, entitled Sparse Non-Negative Generalized PCA. This method yields interpretable principal components and loading vectors that select important features and directly account for both the non-negativity of the underlying spectra and dependencies at adjacent chemical shifts. Through the reanalysis of experimental NMR data on five purified neural cell types, we demonstrate the utility of our methods for dimension reduction, pattern recognition, sample exploration and feature selection. Our methods lead to the identification of novel metabolites that reflect the differences between these cell types. AVAILABILITY: www.stat.rice.edu/~gallen/software.html. CONTACT: gallen@rice.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Magnetic Resonance Spectroscopy , Metabolomics/methods , Principal Component Analysis , Algorithms
19.
J Comput Biol ; 29(5): 465-482, 2022 05.
Article in English | MEDLINE | ID: mdl-35325552

ABSTRACT

Recent advances in single-cell RNA sequencing (scRNA-seq) technologies have yielded a powerful tool to measure gene expression of individual cells. One major challenge of the scRNA-seq data is that it usually contains a large amount of zero expression values, which often impairs the effectiveness of downstream analyses. Numerous data imputation methods have been proposed to deal with these "dropout" events, but this is a difficult task for such high-dimensional and sparse data. Furthermore, there have been debates on the nature of the sparsity, about whether the zeros are due to technological limitations or represent actual biology. To address these challenges, we propose Single-cell RNA-seq Correlation completion by ENsemble learning and Auxiliary information (SCENA), a novel approach that imputes the correlation matrix of the data of interest instead of the data itself. SCENA obtains a gene-by-gene correlation estimate by ensembling various individual estimates, some of which are based on known auxiliary information about gene expression networks. Our approach is a reliable method that makes no assumptions on the nature of sparsity in scRNA-seq data or the data distribution. By extensive simulation studies and real data applications, we demonstrate that SCENA is not only superior in gene correlation estimation, but also improves the accuracy and reliability of downstream analyses, including cell clustering, dimension reduction, and graphical model estimation to learn the gene expression network.


Subjects
Gene Expression Profiling , Single-Cell Analysis , Cluster Analysis , Computer Simulation , RNA-Seq , Reproducibility of Results , Sequence Analysis, RNA/methods , Single-Cell Analysis/methods
20.
J Mach Learn Res ; 22, 2021 Jan.
Article in English | MEDLINE | ID: mdl-34744522

ABSTRACT

In mixed multi-view data, multiple sets of diverse features are measured on the same set of samples. By integrating all available data sources, we seek to discover common group structure among the samples that may be hidden in individualistic cluster analyses of a single data view. While several techniques for such integrative clustering have been explored, we propose and develop a convex formalization that enjoys strong empirical performance and inherits the mathematical properties of increasingly popular convex clustering methods. Specifically, our Integrative Generalized Convex Clustering Optimization (iGecco) method employs different convex distances, losses, or divergences for each of the different data views with a joint convex fusion penalty that leads to common groups. Additionally, integrating mixed multi-view data is often challenging when each data source is high-dimensional. To perform feature selection in such scenarios, we develop an adaptive shifted group-lasso penalty that selects features by shrinking them towards their loss-specific centers. Our so-called iGecco+ approach selects features from each data view that are best for determining the groups, often leading to improved integrative clustering. To solve our problem, we develop a new type of generalized multi-block ADMM algorithm using sub-problem approximations that more efficiently fits our model for big data sets. Through a series of numerical experiments and real data examples on text mining and genomics, we show that iGecco+ achieves superior empirical performance for high-dimensional mixed multi-view data.
