Results 1 - 20 of 37
1.
PLoS Genet ; 19(5): e1010760, 2023 05.
Article in English | MEDLINE | ID: mdl-37200393

ABSTRACT

Heterozygous variants in the glucocerebrosidase (GBA) gene are common and potent risk factors for Parkinson's disease (PD). GBA also causes the autosomal recessive lysosomal storage disorder (LSD), Gaucher disease, and emerging evidence from human genetics implicates many other LSD genes in PD susceptibility. We have systematically tested 86 conserved fly homologs of 37 human LSD genes for requirements in the aging adult Drosophila brain and for potential genetic interactions with neurodegeneration caused by α-synuclein (αSyn), which forms Lewy body pathology in PD. Our screen identifies 15 genetic enhancers of αSyn-induced progressive locomotor dysfunction, including knockdown of fly homologs of GBA and other LSD genes with independent support as PD susceptibility factors from human genetics (SCARB2, SMPD1, CTSD, GNPTAB, SLC17A5). For several genes, results from multiple alleles suggest dose-sensitivity and context-dependent pleiotropy in the presence or absence of αSyn. Homologs of two genes causing cholesterol storage disorders, Npc1a / NPC1 and Lip4 / LIPA, were independently confirmed as loss-of-function enhancers of αSyn-induced retinal degeneration. The enzymes encoded by several modifier genes are upregulated in αSyn transgenic flies, based on unbiased proteomics, revealing a possible, albeit ineffective, compensatory response. Overall, our results reinforce the important role of lysosomal genes in brain health and PD pathogenesis, and implicate several metabolic pathways, including cholesterol homeostasis, in αSyn-mediated neurotoxicity.


Subject(s)
Parkinson Disease , alpha-Synuclein , Animals , Humans , alpha-Synuclein/genetics , alpha-Synuclein/metabolism , Animals, Genetically Modified , Drosophila/genetics , Drosophila/metabolism , Glucosylceramidase/genetics , Glucosylceramidase/metabolism , Lysosomes/metabolism , Parkinson Disease/pathology , Transferases (Other Substituted Phosphate Groups)/metabolism , Aging/metabolism
2.
Bioinformatics ; 39(10), 2023 10 03.
Article in English | MEDLINE | ID: mdl-37792497

ABSTRACT

MOTIVATION: Nuclear magnetic resonance spectroscopy (NMR) is widely used to analyze metabolites in biological samples, but the analysis requires specific expertise, is time-consuming, and can be inaccurate. Here, we present a powerful automated tool, SPatial clustering Algorithm-Statistical TOtal Correlation SpectroscopY (SPA-STOCSY), which overcomes challenges faced when analyzing NMR data and identifies metabolites in a sample with high accuracy. RESULTS: As a data-driven method, SPA-STOCSY estimates all parameters from the input dataset. It first investigates the covariance pattern among datapoints and then calculates the optimal threshold with which to cluster datapoints belonging to the same structural unit, i.e. the metabolite. Generated clusters are then automatically linked to a metabolite library to identify candidates. To assess SPA-STOCSY's efficiency and accuracy, we applied it to synthesized spectra and spectra acquired on Drosophila melanogaster tissue and human embryonic stem cells. In the synthesized spectra, SPA outperformed Statistical Recoupling of Variables (SRV), an existing method for clustering spectral peaks, by capturing a higher percentage of the signal regions and the close-to-zero noise regions. In the biological data, SPA-STOCSY performed comparably to the operator-based Chenomx analysis while avoiding operator bias, and it required <7 min of total computation time. Overall, SPA-STOCSY is a fast, accurate, and unbiased tool for untargeted analysis of metabolites in NMR spectra. It may thus accelerate the use of NMR for scientific discoveries, medical diagnostics, and patient-specific decision making. AVAILABILITY AND IMPLEMENTATION: The SPA-STOCSY code is available at https://github.com/LiuzLab/SPA-STOCSY.


Subject(s)
Drosophila melanogaster , Magnetic Resonance Imaging , Animals , Humans , Magnetic Resonance Spectroscopy/methods , Cluster Analysis , Metabolomics/methods
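
The clustering idea in the abstract above (grouping neighbouring datapoints whose intensities co-vary across samples) might be sketched as follows. This is only an illustration under stated assumptions, not the SPA-STOCSY implementation: the function name `cluster_adjacent_points` is hypothetical, the threshold is fixed by hand rather than estimated from the data as the abstract describes, and the spectra are simulated.

```python
import numpy as np

def cluster_adjacent_points(spectra, threshold=0.8):
    """Group neighbouring spectral points whose intensities co-vary across samples.

    spectra: array of shape (n_samples, n_points); no column may be constant.
    Returns a list of (start, end) index pairs, with end exclusive.
    Hypothetical helper; the fixed threshold is an assumption of this sketch.
    """
    # Pearson correlation between each point and its right-hand neighbour.
    centered = spectra - spectra.mean(axis=0)
    norms = np.sqrt((centered ** 2).sum(axis=0))
    adjacent_corr = (centered[:, :-1] * centered[:, 1:]).sum(axis=0) / (
        norms[:-1] * norms[1:]
    )

    clusters, start = [], 0
    for i, r in enumerate(adjacent_corr):
        if r < threshold:  # correlation drops: close the current cluster
            clusters.append((start, i + 1))
            start = i + 1
    clusters.append((start, spectra.shape[1]))
    return clusters

# Simulated data: two "metabolites", each spanning 10 adjacent points,
# with independent concentrations across 50 samples.
rng = np.random.default_rng(0)
shape = np.linspace(0.5, 1.0, 10)
a = rng.uniform(1.0, 5.0, size=(50, 1))
b = rng.uniform(1.0, 5.0, size=(50, 1))
spectra = np.hstack([a * shape, b * shape]) + rng.normal(0, 0.01, size=(50, 20))
clusters = cluster_adjacent_points(spectra)
```

Each returned pair marks a run of adjacent points that would then be matched against a metabolite library; that matching step is outside this sketch.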
3.
PLoS Comput Biol ; 18(10): e1010577, 2022 10.
Article in English | MEDLINE | ID: mdl-36191044

ABSTRACT

Consensus clustering has been widely used in bioinformatics and other applications to improve the accuracy, stability and reliability of clustering results. This approach ensembles cluster co-occurrences from multiple clustering runs on subsampled observations. For application to large-scale bioinformatics data, such as to discover cell types from single-cell sequencing data, for example, consensus clustering has two significant drawbacks: (i) computational inefficiency due to repeatedly applying clustering algorithms, and (ii) lack of interpretability into the important features for differentiating clusters. In this paper, we address these two challenges by developing IMPACC: Interpretable MiniPatch Adaptive Consensus Clustering. Our approach adopts three major innovations. We ensemble cluster co-occurrences from tiny subsets of both observations and features, termed minipatches, thus dramatically reducing computation time. Additionally, we develop adaptive sampling schemes for observations, which result in both improved reliability and computational savings, as well as adaptive sampling schemes of features, which lead to interpretable solutions by quickly learning the most relevant features that differentiate clusters. We study our approach on synthetic data and a variety of real large-scale bioinformatics data sets; results show that our approach not only yields more accurate and interpretable cluster solutions, but it also substantially improves computational efficiency compared to standard consensus clustering approaches.


Subject(s)
Algorithms , Computational Biology , Cluster Analysis , Computational Biology/methods , Consensus , Reproducibility of Results
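
A minimal sketch of the minipatch idea from the abstract above: cluster many tiny random subsets of observations and features, then average how often each pair of observations lands in the same cluster. The assumptions are worth stating plainly. Sampling here is uniform, not the adaptive scheme the abstract describes; k-means is used as the base clusterer; and the names `kmeans_labels` and `minipatch_consensus` are hypothetical, not taken from the IMPACC software.

```python
import numpy as np

def kmeans_labels(X, k, rng, n_iter=20):
    """Plain Lloyd's k-means; returns one cluster label per row of X."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def minipatch_consensus(X, k, n_patches=200, n_obs=20, n_feat=3, seed=0):
    """Consensus matrix from clustering on uniformly sampled minipatches."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    together = np.zeros((n, n))  # times a pair was clustered together
    sampled = np.zeros((n, n))   # times a pair appeared in the same minipatch
    for _ in range(n_patches):
        rows = rng.choice(n, size=n_obs, replace=False)
        cols = rng.choice(p, size=n_feat, replace=False)
        labels = kmeans_labels(X[np.ix_(rows, cols)], k, rng)
        same = (labels[:, None] == labels[None, :]).astype(float)
        sampled[np.ix_(rows, rows)] += 1.0
        together[np.ix_(rows, rows)] += same
    # Pairs never sampled together are left at 0.
    return np.divide(together, sampled, out=np.zeros_like(together),
                     where=sampled > 0)

# Simulated data: two well-separated groups of 30 observations, 5 features.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (30, 5)), rng.normal(8, 1, (30, 5))])
consensus = minipatch_consensus(X, k=2)
```

Because each run touches only 20 observations and 3 features, the per-run cost is small. A final clustering would typically be read off the consensus matrix, for example by hierarchical clustering, which is omitted here.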
4.
Biometrics ; 79(4): 3846-3858, 2023 12.
Article in English | MEDLINE | ID: mdl-36950906

ABSTRACT

Clustering has long been a popular unsupervised learning approach to identify groups of similar objects and discover patterns from unlabeled data in many applications. Yet, coming up with meaningful interpretations of the estimated clusters has often been challenging precisely due to their unsupervised nature. Meanwhile, in many real-world scenarios, there are some noisy supervising auxiliary variables, for instance, subjective diagnostic opinions, that are related to the observed heterogeneity of the unlabeled data. By leveraging information from both supervising auxiliary variables and unlabeled data, we seek to uncover more scientifically interpretable group structures that may be hidden by completely unsupervised analyses. In this work, we propose and develop a new statistical pattern discovery method named supervised convex clustering (SCC) that borrows strength from both information sources and guides towards finding more interpretable patterns via a joint convex fusion penalty. We develop several extensions of SCC to integrate different types of supervising auxiliary variables, to adjust for additional covariates, and to find biclusters. We demonstrate the practical advantages of SCC through simulations and a case study on Alzheimer's disease genomics. Specifically, we discover new candidate genes as well as new subtypes of Alzheimer's disease that can potentially lead to better understanding of the underlying genetic mechanisms responsible for the observed heterogeneity of cognitive decline in older adults.


Subject(s)
Alzheimer Disease , Humans , Aged , Alzheimer Disease/genetics , Genomics , Cluster Analysis
5.
Neuroimage ; 197: 330-343, 2019 08 15.
Article in English | MEDLINE | ID: mdl-31029870

ABSTRACT

Advanced brain imaging techniques make it possible to measure individuals' structural connectomes in large cohort studies non-invasively. Given the availability of large scale data sets, it is extremely interesting and important to build a set of advanced tools for structural connectome extraction and statistical analysis that emphasize both interpretability and predictive power. In this paper, we developed and integrated a set of toolboxes, including an advanced structural connectome extraction pipeline and a novel tensor network principal components analysis (TN-PCA) method, to study relationships between structural connectomes and various human traits such as alcohol and drug use, cognition and motion abilities. The structural connectome extraction pipeline produces a set of connectome features for each subject that can be organized as a tensor network, and TN-PCA maps the high-dimensional tensor network data to a lower-dimensional Euclidean space. Combined with classical hypothesis testing, canonical correlation analysis and linear discriminant analysis techniques, we analyzed over 1100 scans of 1076 subjects from the Human Connectome Project (HCP) and the Sherbrooke test-retest data set, as well as 175 human traits measuring different domains including cognition, substance use, motor, sensory and emotion. The test-retest data validated the developed algorithms. With the HCP data, we found that structural connectomes are associated with a wide range of traits, e.g., fluid intelligence, language comprehension, and motor skills are associated with increased cortical-cortical brain structural connectivity, while the use of alcohol, tobacco, and marijuana are associated with decreased cortical-cortical connectivity. We also demonstrated that our extracted structural connectomes and analysis method can give superior prediction accuracies compared with alternative connectome constructions and other tensor and network regression methods.


Subject(s)
Brain/anatomy & histology , Connectome/methods , Diffusion Tensor Imaging/methods , Image Processing, Computer-Assisted/methods , Personality/physiology , Brain/diagnostic imaging , Data Interpretation, Statistical , Female , Humans , Male , Models, Neurological , Neural Pathways/anatomy & histology , Principal Component Analysis
6.
Bioinformatics ; 34(7): 1141-1147, 2018 04 01.
Article in English | MEDLINE | ID: mdl-29617963

ABSTRACT

Motivation: Batch effects are one of the major sources of technical variation that affect the measurements in high-throughput studies such as RNA sequencing. It has been well established that batch effects can be caused by different experimental platforms, laboratory conditions, different sources of samples and personnel differences. These differences can confound the outcomes of interest and lead to spurious results. A critical input for batch correction algorithms is the knowledge of batch factors, which in many cases are unknown or inaccurate. Hence, the primary motivation of our paper is to detect hidden batch factors that can be used in standard techniques to accurately capture the relationship between gene expression and other modeled variables of interest. Results: We introduce a new algorithm based on data-adaptive shrinkage and semi-Non-negative Matrix Factorization for the detection of unknown batch effects. We test our algorithm on three different datasets: (i) Sequencing Quality Control, (ii) Topotecan RNA-Seq and (iii) Single-cell RNA sequencing (scRNA-Seq) on Glioblastoma Multiforme. We have demonstrated a superior performance in identifying hidden batch effects as compared to existing algorithms for batch detection in all three datasets. In the Topotecan study, we were able to identify a new batch factor that had been missed by the original study, leading to under-representation of differentially expressed genes. For scRNA-Seq, we demonstrated the power of our method in detecting subtle batch effects. Availability and implementation: DASC R package is available via Bioconductor or at https://github.com/zhanglabNKU/DASC. Contact: zhanghan@nankai.edu.cn or zhandonl@bcm.edu. Supplementary information: Supplementary data are available at Bioinformatics online.


Subject(s)
Algorithms , Gene Expression Profiling/methods , Quality Control , Research Design , Sequence Analysis, RNA/methods , Glioblastoma/genetics , Humans , Topotecan/pharmacology
7.
BMC Bioinformatics ; 18(Suppl 11): 405, 2017 Oct 03.
Article in English | MEDLINE | ID: mdl-28984189

ABSTRACT

The 2016 International Conference on Intelligent Biology and Medicine (ICIBM 2016) was held on December 8-10, 2016 in Houston, Texas, USA. ICIBM included eight scientific sessions, four tutorials, one poster session, four highlighted talks and four keynotes that covered topics on 3D genomics structural analysis, next generation sequencing (NGS) analysis, computational drug discovery, medical informatics, cancer genomics, and systems biology. Here, we present a summary of the nine research articles selected from the ICIBM 2016 program for publication in BMC Bioinformatics.


Subject(s)
Biology , Congresses as Topic , Internationality , Medicine , Statistics as Topic , Algorithms , DNA Copy Number Variations/genetics , Humans , Machine Learning , Neural Networks, Computer , RNA Splicing/genetics , Sequence Analysis, RNA
8.
BMC Genomics ; 18(Suppl 6): 703, 2017 Oct 03.
Article in English | MEDLINE | ID: mdl-28984207

ABSTRACT

In this editorial, we first summarize the 2016 International Conference on Intelligent Biology and Medicine (ICIBM 2016) that was held on December 8-10, 2016 in Houston, Texas, USA, and then briefly introduce the ten research articles included in this supplement issue. ICIBM 2016 included four workshops or tutorials, four keynote lectures, four conference invited talks, eight concurrent scientific sessions and a poster session for 53 accepted abstracts, covering current topics in bioinformatics, systems biology, intelligent computing, and biomedical informatics. Through our call for papers, a total of 77 original manuscripts were submitted to ICIBM 2016. After peer review, 11 articles were selected for this special issue, covering topics such as single-cell RNA-seq analysis methods, genome sequence and variation analysis, bioinformatics methods for vaccine development, and cancer genomics.


Subject(s)
Genomics , Inventions , Medicine
9.
Bioinformatics ; 32(6): 952-4, 2016 03 15.
Article in English | MEDLINE | ID: mdl-26568634

ABSTRACT

MOTIVATION: Massive amounts of high-throughput genomics data profiled from tumor samples were made publicly available by the Cancer Genome Atlas (TCGA). RESULTS: We have developed an open source software package, TCGA2STAT, to obtain the TCGA data, wrangle it, and pre-process it into a format ready for multivariate and integrated statistical analysis in the R environment. In a user-friendly format with one single function call, our package downloads and fully processes the desired TCGA data to be seamlessly integrated into a computational analysis pipeline. No further technical or biological knowledge is needed to utilize our software, thus making TCGA data easily accessible to data scientists without specific domain knowledge. AVAILABILITY AND IMPLEMENTATION: TCGA2STAT is available from https://cran.r-project.org/web/packages/TCGA2STAT/index.html. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. CONTACT: zhandong.liu@bcm.edu.


Subject(s)
Software , Genomics , Humans , Neoplasms
10.
Biometrics ; 73(1): 10-19, 2017 03.
Article in English | MEDLINE | ID: mdl-27163413

ABSTRACT

In the biclustering problem, we seek to simultaneously group observations and features. While biclustering has applications in a wide array of domains, ranging from text mining to collaborative filtering, the problem of identifying structure in high-dimensional genomic data motivates this work. In this context, biclustering enables us to identify subsets of genes that are co-expressed only within a subset of experimental conditions. We present a convex formulation of the biclustering problem that possesses a unique global minimizer and an iterative algorithm, COBRA, that is guaranteed to identify it. Our approach generates an entire solution path of possible biclusters as a single tuning parameter is varied. We also show how to reduce the problem of selecting this tuning parameter to solving a trivial modification of the convex biclustering problem. The key contributions of our work are its simplicity, interpretability, and algorithmic guarantees, features that arguably are lacking in the current alternative algorithms. We demonstrate the advantages of our approach, which include stably and reproducibly identifying biclusterings, on simulated and real microarray data.


Subject(s)
Cluster Analysis , Data Interpretation, Statistical , Gene Regulatory Networks , Algorithms , Computational Biology/methods , Databases, Genetic , Gene Expression Profiling/methods , Oligonucleotide Array Sequence Analysis
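
For orientation, a convex biclustering problem of the kind the abstract above describes is commonly written as a least-squares fit with fusion penalties on both rows and columns. The form below is a sketch in notation assumed here, not quoted from the abstract: X is the p × n data matrix, U its estimate, γ the single tuning parameter, and w and w̃ are non-negative pair weights.

```latex
\min_{U \in \mathbb{R}^{p \times n}} \;
\frac{1}{2} \lVert X - U \rVert_F^2
+ \gamma \Bigl[
    \sum_{i<j} w_{ij} \, \lVert U_{\cdot i} - U_{\cdot j} \rVert_2
  + \sum_{k<l} \tilde{w}_{kl} \, \lVert U_{k \cdot} - U_{l \cdot} \rVert_2
\Bigr]
```

At γ = 0 the solution is U = X. As γ grows, columns and rows of U fuse into shared values, and the fused blocks can be read as biclusters. Sweeping γ traces a solution path like the one the abstract mentions.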
11.
Alzheimers Dement ; 12(6): 645-53, 2016 06.
Article in English | MEDLINE | ID: mdl-27079753

ABSTRACT

Identifying accurate biomarkers of cognitive decline is essential for advancing early diagnosis and prevention therapies in Alzheimer's disease. The Alzheimer's disease DREAM Challenge was designed as a computational crowdsourced project to benchmark the current state-of-the-art in predicting cognitive outcomes in Alzheimer's disease based on high dimensional, publicly available genetic and structural imaging data. This meta-analysis failed to identify a meaningful predictor developed from either data modality, suggesting that alternate approaches should be considered for prediction of cognitive performance.


Subject(s)
Alzheimer Disease/complications , Cognition Disorders/diagnosis , Cognition Disorders/etiology , Alzheimer Disease/genetics , Apolipoproteins E/genetics , Biomarkers , Cognition Disorders/genetics , Computational Biology , Databases, Bibliographic/statistics & numerical data , Humans , Predictive Value of Tests
12.
Hum Brain Mapp ; 36(11): 4566-81, 2015 Nov.
Article in English | MEDLINE | ID: mdl-26304096

ABSTRACT

Neurofibromatosis type I (NF1) is a genetic disorder caused by mutations in the neurofibromin 1 gene at locus 17q11.2. Individuals with NF1 have an increased incidence of learning disabilities, attention deficits, and autism spectrum disorders. As a single-gene disorder, NF1 represents a valuable model for understanding gene-brain-behavior relationships. While mouse models have elucidated molecular and cellular mechanisms underlying learning deficits associated with this mutation, little is known about functional brain architecture in human subjects with NF1. To address this question, we used resting state functional connectivity magnetic resonance imaging (rs-fcMRI) to elucidate the intrinsic network structure of 30 NF1 participants compared with 30 healthy demographically matched controls during an eyes-open rs-fcMRI scan. Novel statistical methods were employed to quantify differences in local connectivity (edge strength) and modularity structure, in combination with traditional global graph theory applications. Our findings suggest that individuals with NF1 have reduced anterior-posterior connectivity, weaker bilateral edges, and altered modularity clustering relative to healthy controls. Further, edge strength and modular clustering indices were correlated with IQ and internalizing symptoms. These findings suggest that Ras signaling disruption may lead to abnormal functional brain connectivity; further investigation into the functional consequences of these alterations in both humans and in animal models is warranted.


Subject(s)
Brain/physiopathology , Functional Neuroimaging/methods , Nerve Net/physiopathology , Neurofibromatosis 1/physiopathology , Adolescent , Adult , Child , Female , Humans , Magnetic Resonance Imaging/methods , Male , Middle Aged , Young Adult
13.
Biometrics ; 71(4): 905-17, 2015 Dec.
Article in English | MEDLINE | ID: mdl-26295449

ABSTRACT

Technological advances have led to a proliferation of structured big data that have matrix-valued covariates. We are specifically motivated to build predictive models for multi-subject neuroimaging data based on each subject's brain imaging scans. This is an ultra-high-dimensional problem that consists of a matrix of covariates (brain locations by time points) for each subject; few methods currently exist to fit supervised models directly to this tensor data. We propose a novel modeling and algorithmic strategy to apply generalized linear models (GLMs) to this massive tensor data in which one set of variables is associated with locations. Our method begins by fitting GLMs to each location separately, and then builds an ensemble by blending information across locations through regularization with what we term an aggregating penalty. Our so-called Local-Aggregate Model can be fit in a completely distributed manner over the locations using an Alternating Direction Method of Multipliers (ADMM) strategy, and thus greatly reduces the computational burden. Furthermore, we propose to select the appropriate model through a novel sequence of faster algorithmic solutions that is similar to regularization paths. We demonstrate both the computational and predictive modeling advantages of our methods via simulations and an EEG classification problem.


Subject(s)
Neuroimaging/statistics & numerical data , Algorithms , Biometry/methods , Computer Simulation , Electroencephalography/statistics & numerical data , Humans , Linear Models , Machine Learning/statistics & numerical data , Regression Analysis
14.
J Neurosci ; 33(35): 14098-106, 2013 Aug 28.
Article in English | MEDLINE | ID: mdl-23986245

ABSTRACT

Synesthesia is a condition in which normal stimuli can trigger anomalous associations. In this study, we exploit synesthesia to understand how the synesthetic experience can be explained by subtle changes in network properties. Of the many forms of synesthesia, we focus on colored sequence synesthesia, a form in which colors are associated with overlearned sequences, such as numbers and letters (graphemes). Previous studies have characterized synesthesia using resting-state connectivity or stimulus-driven analyses, but it remains unclear how network properties change as synesthetes move from one condition to another. To address this gap, we used functional MRI in humans to identify grapheme-specific brain regions, thereby constructing a functional "synesthetic" network. We then explored functional connectivity of color and grapheme regions during a synesthesia-inducing fMRI paradigm involving rest, auditory grapheme stimulation, and audiovisual grapheme stimulation. Using Markov networks to represent direct relationships between regions, we found that synesthetes had more connections during rest and auditory conditions. We then expanded the network space to include 90 anatomical regions, revealing that synesthetes tightly cluster in visual regions, whereas controls cluster in parietal and frontal regions. Together, these results suggest that synesthetes have increased connectivity between grapheme and color regions, and that synesthetes use visual regions to a greater extent than controls when presented with dynamic grapheme stimulation. These data suggest that synesthesia is better characterized by studying global network dynamics than by individual properties of a single brain region.


Subject(s)
Color Perception , Nerve Net/physiopathology , Perceptual Disorders/physiopathology , Acoustic Stimulation , Adult , Brain/physiopathology , Brain Mapping , Case-Control Studies , Female , Humans , Language , Magnetic Resonance Imaging , Male , Markov Chains , Photic Stimulation , Synesthesia
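
The abstract above uses Markov networks to represent direct relationships between regions. One common way to illustrate that notion, assuming Gaussian data, is through partial correlations read off the inverse covariance matrix: two variables are joined by an edge only if they remain correlated after conditioning on all the others. The sketch below is a simplification and not the study's estimation procedure. It uses plain matrix inversion and a hand-picked threshold, the function names are hypothetical, and the three "regions" are simulated as a chain.

```python
import numpy as np

def partial_correlations(data):
    """Partial correlation matrix from the inverse sample covariance.

    data: array of shape (n_samples, n_variables).
    """
    precision = np.linalg.inv(np.cov(data, rowvar=False))
    d = np.sqrt(np.diag(precision))
    pcor = -precision / np.outer(d, d)
    np.fill_diagonal(pcor, 1.0)
    return pcor

def markov_edges(data, threshold=0.1):
    """Edges (i, j), i < j, whose partial correlation exceeds the threshold."""
    pcor = partial_correlations(data)
    n = pcor.shape[0]
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if abs(pcor[i, j]) > threshold]

# Simulated chain: region 0 drives region 1, which drives region 2.
# Regions 0 and 2 are correlated, but only through region 1.
rng = np.random.default_rng(0)
x0 = rng.normal(size=5000)
x1 = x0 + rng.normal(size=5000)
x2 = x1 + rng.normal(size=5000)
edges = markov_edges(np.column_stack([x0, x1, x2]))
```

In this simulation the indirect 0-2 relationship is dropped while the two direct links are kept, which is the distinction between a Markov network and a plain correlation network.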
15.
BMC Genomics ; 14 Suppl 8: S7, 2013.
Article in English | MEDLINE | ID: mdl-24564637

ABSTRACT

BACKGROUND: Selecting genes and pathways indicative of disease is a central problem in computational biology. This problem is especially challenging when parsing multi-dimensional genomic data. A number of tools, such as L1-norm based regularization and its extensions elastic net and fused lasso, have been introduced to deal with this challenge. However, these approaches tend to ignore the vast amount of a priori biological network information curated in the literature. RESULTS: We propose the use of graph Laplacian regularized logistic regression to integrate biological networks into disease classification and pathway association problems. Simulation studies demonstrate that the performance of the proposed algorithm is superior to elastic net and lasso analyses. Utility of this algorithm is also validated by its ability to reliably differentiate breast cancer subtypes using a large breast cancer dataset recently generated by the Cancer Genome Atlas (TCGA) consortium. Many of the protein-protein interaction modules identified by our approach are further supported by evidence published in the literature. Source code of the proposed algorithm is freely available at http://www.github.com/zhandong/Logit-Lapnet. CONCLUSION: Logistic regression with graph Laplacian regularization is an effective algorithm for identifying key pathways and modules associated with disease subtypes. With the rapid expansion of our knowledge of biological regulatory networks, this approach will become more accurate and increasingly useful for mining transcriptomic, epi-genomic, and other types of genome wide association studies.


Subject(s)
Algorithms , Biomarkers, Tumor/metabolism , Breast Neoplasms/metabolism , Computational Biology/methods , Gene Regulatory Networks , Computer Simulation , Female , Humans , Logistic Models , Models, Biological , Reproducibility of Results
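
A minimal sketch of logistic regression with a graph Laplacian penalty, as named in the abstract above. The penalty β'Lβ sums squared differences between coefficients of connected features, so linked genes are encouraged to have similar effects. Several things here are assumptions of the sketch and not the published Logit-Lapnet code: plain gradient descent is used, no intercept or sparsity penalty is included, the function names are hypothetical, and the data and the two-edge network are simulated.

```python
import numpy as np

def graph_laplacian(n_nodes, edges):
    """Unnormalized Laplacian L = D - A of an undirected graph."""
    A = np.zeros((n_nodes, n_nodes))
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    return np.diag(A.sum(axis=1)) - A

def fit_laplacian_logit(X, y, L, lam, lr=0.1, n_iter=2000):
    """Minimize mean logistic loss + lam * beta' L beta by gradient descent."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (prob - y) / n + 2.0 * lam * (L @ beta)
        beta -= lr * grad
    return beta

# Simulated data: only feature 0 drives the outcome; the network links
# features 0-1 and 2-3.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-2.0 * X[:, 0]))).astype(float)
L = graph_laplacian(4, [(0, 1), (2, 3)])
beta_plain = fit_laplacian_logit(X, y, L, lam=0.0)
beta_graph = fit_laplacian_logit(X, y, L, lam=1.0)
```

Comparing `beta_graph @ L @ beta_graph` with `beta_plain @ L @ beta_plain` shows the effect of the penalty: the regularized fit pulls the coefficients of connected features toward each other.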
16.
bioRxiv ; 2023 Feb 22.
Article in English | MEDLINE | ID: mdl-36865102

ABSTRACT

Nuclear Magnetic Resonance (NMR) spectroscopy is widely used to analyze metabolites in biological samples, but the analysis can be cumbersome and inaccurate. Here, we present a powerful automated tool, SPA-STOCSY (Spatial Clustering Algorithm - Statistical Total Correlation Spectroscopy), which overcomes the challenges by identifying metabolites in each sample with high accuracy. As a data-driven method, SPA-STOCSY estimates all parameters from the input dataset, first investigating the covariance pattern and then calculating the optimal threshold with which to cluster data points belonging to the same structural unit, i.e. metabolite. The generated clusters are then automatically linked to a compound library to identify candidates. To assess SPA-STOCSY’s efficiency and accuracy, we applied it to synthesized and real NMR data obtained from Drosophila melanogaster brains and human embryonic stem cells. In the synthesized spectra, SPA outperforms Statistical Recoupling of Variables, an existing method for clustering spectral peaks, by capturing a higher percentage of the signal regions and the close-to-zero noise regions. In the real spectra, SPA-STOCSY performs comparably to operator-based Chenomx analysis but avoids operator bias and performs the analyses in less than seven minutes of total computation time. Overall, SPA-STOCSY is a fast, accurate, and unbiased tool for untargeted analysis of metabolites in the NMR spectra. As such, it might accelerate the utilization of NMR for scientific discoveries, medical diagnostics, and patient-specific decision making.

17.
Bioinformatics ; 27(21): 3029-35, 2011 Nov 01.
Article in English | MEDLINE | ID: mdl-21930672

ABSTRACT

MOTIVATION: Nuclear magnetic resonance (NMR) spectroscopy has been used to study mixtures of metabolites in biological samples. This technology produces a spectrum for each sample depicting the chemical shifts at which an unknown number of latent metabolites resonate. The interpretation of this data with common multivariate exploratory methods such as principal components analysis (PCA) is limited due to high-dimensionality, non-negativity of the underlying spectra and dependencies at adjacent chemical shifts. RESULTS: We develop a novel modification of PCA that is appropriate for analysis of NMR data, entitled Sparse Non-Negative Generalized PCA. This method yields interpretable principal components and loading vectors that select important features and directly account for both the non-negativity of the underlying spectra and dependencies at adjacent chemical shifts. Through the reanalysis of experimental NMR data on five purified neural cell types, we demonstrate the utility of our methods for dimension reduction, pattern recognition, sample exploration and feature selection. Our methods lead to the identification of novel metabolites that reflect the differences between these cell types. AVAILABILITY: www.stat.rice.edu/~gallen/software.html. CONTACT: gallen@rice.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Magnetic Resonance Spectroscopy , Metabolomics/methods , Principal Component Analysis , Algorithms
18.
J Comput Biol ; 29(5): 465-482, 2022 05.
Article in English | MEDLINE | ID: mdl-35325552

ABSTRACT

Recent advances in single-cell RNA sequencing (scRNA-seq) technologies have yielded a powerful tool to measure gene expression of individual cells. One major challenge of the scRNA-seq data is that it usually contains a large amount of zero expression values, which often impairs the effectiveness of downstream analyses. Numerous data imputation methods have been proposed to deal with these "dropout" events, but this is a difficult task for such high-dimensional and sparse data. Furthermore, there have been debates on the nature of the sparsity, about whether the zeros are due to technological limitations or represent actual biology. To address these challenges, we propose Single-cell RNA-seq Correlation completion by ENsemble learning and Auxiliary information (SCENA), a novel approach that imputes the correlation matrix of the data of interest instead of the data itself. SCENA obtains a gene-by-gene correlation estimate by ensembling various individual estimates, some of which are based on known auxiliary information about gene expression networks. Our approach is a reliable method that makes no assumptions on the nature of sparsity in scRNA-seq data or the data distribution. By extensive simulation studies and real data applications, we demonstrate that SCENA is not only superior in gene correlation estimation, but also improves the accuracy and reliability of downstream analyses, including cell clustering, dimension reduction, and graphical model estimation to learn the gene expression network.


Subject(s)
Gene Expression Profiling , Single-Cell Analysis , Cluster Analysis , Computer Simulation , RNA-Seq , Reproducibility of Results , Sequence Analysis, RNA/methods , Single-Cell Analysis/methods
19.
J Mach Learn Res ; 22, 2021 Jan.
Article in English | MEDLINE | ID: mdl-34744522

ABSTRACT

In mixed multi-view data, multiple sets of diverse features are measured on the same set of samples. By integrating all available data sources, we seek to discover common group structure among the samples that may be hidden in individualistic cluster analyses of a single data view. While several techniques for such integrative clustering have been explored, we propose and develop a convex formalization that enjoys strong empirical performance and inherits the mathematical properties of increasingly popular convex clustering methods. Specifically, our Integrative Generalized Convex Clustering Optimization (iGecco) method employs different convex distances, losses, or divergences for each of the different data views with a joint convex fusion penalty that leads to common groups. Additionally, integrating mixed multi-view data is often challenging when each data source is high-dimensional. To perform feature selection in such scenarios, we develop an adaptive shifted group-lasso penalty that selects features by shrinking them towards their loss-specific centers. Our so-called iGecco+ approach selects features from each data view that are best for determining the groups, often leading to improved integrative clustering. To solve our problem, we develop a new type of generalized multi-block ADMM algorithm using sub-problem approximations that more efficiently fits our model for big data sets. Through a series of numerical experiments and real data examples on text mining and genomics, we show that iGecco+ achieves superior empirical performance for high-dimensional mixed multi-view data.

20.
Article in English | MEDLINE | ID: mdl-34734115

ABSTRACT

Boosting methods are among the best general-purpose and off-the-shelf machine learning approaches, gaining widespread popularity. In this paper, we seek to develop a boosting method that yields comparable accuracy to popular AdaBoost and gradient boosting methods, yet is faster computationally and whose solution is more interpretable. We achieve this by developing MP-Boost, an algorithm loosely based on AdaBoost that learns by adaptively selecting small subsets of instances and features, or what we term minipatches (MP), at each iteration. By sequentially learning on tiny subsets of the data, our approach is computationally faster than other classic boosting algorithms. As it progresses, MP-Boost adaptively learns probability distributions on the features and instances that upweight the most important features and challenging instances, hence adaptively selecting the most relevant minipatches for learning. These learned probability distributions also aid in interpretation of our method. We empirically demonstrate the interpretability, comparative accuracy, and computational time of our approach on a variety of binary classification tasks.
