Results 1 - 20 of 31
1.
Biometrics ; 74(4): 1301-1310, 2018 12.
Article in English | MEDLINE | ID: mdl-29738627

ABSTRACT

In many applications, non-Gaussian data such as binary or count are observed over a continuous domain and there exists a smooth underlying structure for describing such data. We develop a new functional data method to deal with this kind of data when the data are regularly spaced on the continuous domain. Our method, referred to as Exponential Family Functional Principal Component Analysis (EFPCA), assumes the data are generated from an exponential family distribution, and the matrix of the canonical parameters has a low-rank structure. The proposed method flexibly accommodates not only the standard one-way functional data, but also two-way (or bivariate) functional data. In addition, we introduce a new cross validation method for estimating the latent rank of a generalized data matrix. We demonstrate the efficacy of the proposed methods using a comprehensive simulation study. The proposed method is also applied to a real application of the UK mortality study, where data are binomially distributed and two-way functional across age groups and calendar years. The results offer novel insights into the underlying mortality pattern.


Subject(s)
Biometry/methods, Computer Simulation/statistics & numerical data, Principal Component Analysis/methods, Age Factors, Calendars as Topic/statistics & numerical data, Humans, Mortality, United Kingdom
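
The low-rank assumption in the EFPCA abstract above can be written compactly. The following is a sketch under stated assumptions: the symbols $y_{ij}$, $\theta_{ij}$, $U$, $V$, and $r$ are introduced here for illustration, and the paper's exact parameterization (for example, mean terms or smoothness penalties) may differ.

```latex
% One-parameter exponential family in canonical form (standard definition)
f(y_{ij} \mid \theta_{ij}) = \exp\{\, y_{ij}\,\theta_{ij} - b(\theta_{ij}) + c(y_{ij}) \,\},
\qquad i = 1,\dots,n,\; j = 1,\dots,m.

% Assumed low-rank structure on the matrix of canonical parameters
\Theta = (\theta_{ij}) \in \mathbb{R}^{n \times m}, \qquad
\Theta = U V^{\top}, \quad U \in \mathbb{R}^{n \times r},\; V \in \mathbb{R}^{m \times r},\; r \ll \min(n, m).

% Example: Bernoulli data with success probability p_{ij}
\theta_{ij} = \log\frac{p_{ij}}{1 - p_{ij}}, \qquad b(\theta) = \log\bigl(1 + e^{\theta}\bigr).
```

In the Gaussian case with unit variance, $b(\theta) = \theta^2/2$ and fitting $\Theta$ by maximum likelihood reduces to an ordinary low-rank least squares approximation, which is why this setup is usually read as a generalization of PCA.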
2.
Biostatistics ; 16(4): 754-71, 2015 Oct.
Article in English | MEDLINE | ID: mdl-25987650

ABSTRACT

Motivated by data recording the effects of an exercise intervention on subjects' physical activity over time, we develop a model to assess the effects of a treatment when the data are functional with 3 levels (subjects, weeks and days in our application) and possibly incomplete. We develop a model with 3-level mean structure effects, all stratified by treatment and subject random effects, including a general subject effect and nested effects for the 3 levels. The mean and random structures are specified as smooth curves measured at various time points. The association structure of the 3-level data is induced through the random curves, which are summarized using a few important principal components. We use penalized splines to model the mean curves and the principal component curves, and cast the proposed model into a mixed effects model framework for model fitting, prediction and inference. We develop an algorithm to fit the model iteratively with the Expectation/Conditional Maximization Either (ECME) version of the EM algorithm and eigenvalue decompositions. Selection of the number of principal components and handling incomplete data issues are incorporated into the algorithm. The performance of the Wald-type hypothesis test is also discussed. The method is applied to the physical activity data and evaluated empirically by a simulation study.


Subject(s)
Algorithms, Clinical Trials as Topic/statistics & numerical data, Exercise Therapy/statistics & numerical data, Statistical Models, Health Care Outcome Assessment/statistics & numerical data, Research Design/statistics & numerical data, Humans
3.
Pattern Recognit ; 60: 681-691, 2016 Dec.
Article in English | MEDLINE | ID: mdl-28066030

ABSTRACT

We propose a Sparse exponential family Principal Component Analysis (SePCA) method suitable for any type of data following exponential family distributions, to achieve simultaneous dimension reduction and variable selection for better interpretation of the results. Because of the generality of exponential family distributions, the method can be applied to a wide range of applications, in particular when analyzing high dimensional next-generation sequencing data and genetic mutation data in genomics. The use of a sparsity-inducing penalty helps produce sparse principal component loading vectors such that the principal components can focus on informative variables. By using an equivalent dual form of the formulated optimization problem for SePCA, we derive optimal solutions with efficient iterative closed-form updating rules. The results from both simulation experiments and real-world applications have demonstrated the superiority of our SePCA in reconstruction accuracy and computational efficiency over traditional exponential family PCA (ePCA), the existing Sparse PCA (SPCA) and Sparse Logistic PCA (SLPCA) algorithms.

4.
Brief Bioinform ; 14(6): 724-36, 2013 Nov.
Article in English | MEDLINE | ID: mdl-22926831

ABSTRACT

Despite considerable progress in the past decades, protein structure prediction remains one of the major unsolved problems in computational biology. Angular-sampling-based methods have been extensively studied recently due to their ability to capture the continuous conformational space of protein structures. The literature has focused on using a variety of parametric models of the sequential dependencies between angle pairs along the protein chains. In this article, we present a thorough review of angular-sampling-based methods by assessing three main questions: What is the best distribution type to model the protein angles? What is a reasonable number of components in a mixture model that should be considered to accurately parameterize the joint distribution of the angles? and What is the order of the local sequence-structure dependency that should be considered by a prediction method? We assess the model fits for different methods using bivariate lag-distributions of the dihedral/planar angles. Moreover, the main information across the lags can be extracted using a technique called Lag singular value decomposition (LagSVD), which considers the joint distribution of the dihedral/planar angles over different lags using a nonparametric approach and monitors the behavior of the lag-distribution of the angles using singular value decomposition. As a result, we developed graphical tools and numerical measurements to compare and evaluate the performance of different model fits. Furthermore, we developed a web-tool (http://www.stat.tamu.edu/~madoliat/LagSVD) that can be used to produce informative animations.


Subject(s)
Proteins/chemistry, Markov Chains, Protein Conformation
5.
BMC Bioinformatics ; 15 Suppl 15: S4, 2014.
Article in English | MEDLINE | ID: mdl-25474163

ABSTRACT

BACKGROUND: Protein-ligand binding is important for some proteins to perform their functions. Protein-ligand binding sites are the residues of proteins that physically bind to ligands. Despite recent advances in the computational prediction of protein-ligand binding sites, the state-of-the-art methods search for similar, known structures of the query and predict the binding sites based on the solved structures. However, such structural information is not commonly available. RESULTS: In this paper, we propose a sequence-based approach to identify protein-ligand binding residues. We propose a combination technique to reduce the effects of different sliding residue windows in the process of encoding input feature vectors. Moreover, because the samples are highly imbalanced between ligand-binding sites and non-ligand-binding sites, we construct several balanced data sets, for each of which a random forest (RF)-based classifier is trained. The ensemble of these RF classifiers forms a sequence-based protein-ligand binding site predictor. CONCLUSIONS: Experimental results on CASP9 and CASP8 data sets demonstrate that our method compares favorably with the state-of-the-art protein-ligand binding site prediction methods.


Subject(s)
Artificial Intelligence, Proteins/chemistry, Protein Sequence Analysis/methods, Amino Acids/chemistry, Binding Sites, Ligands, Protein Conformation
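
The balancing strategy in the abstract above (several balanced data sets, one classifier per set, combined as an ensemble) can be sketched in Python as follows. This is only an illustration under stated assumptions: a nearest-centroid rule stands in for the random forest, majority voting is assumed as the combination rule, and all function names and the toy data are hypothetical. It also assumes the positive class is the minority.

```python
import random

def balanced_subsets(positives, negatives, n_subsets, seed=0):
    """Pair all minority (positive) samples with an equally sized random
    draw from the majority (negative) class, once per subset."""
    rng = random.Random(seed)
    k = len(positives)  # assumes len(positives) <= len(negatives)
    return [(list(positives), rng.sample(negatives, k)) for _ in range(n_subsets)]

def fit_centroid(pos, neg):
    """Stand-in base learner (NOT a random forest): one centroid per class."""
    def centroid(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]
    return centroid(pos), centroid(neg)

def predict_centroid(model, x):
    """Return 1 if x is closer to the positive centroid, else 0."""
    c_pos, c_neg = model
    d_pos = sum((a - b) ** 2 for a, b in zip(x, c_pos))
    d_neg = sum((a - b) ** 2 for a, b in zip(x, c_neg))
    return 1 if d_pos < d_neg else 0

def ensemble_predict(models, x):
    """Majority vote over the base learners (assumed combination rule)."""
    votes = sum(predict_centroid(m, x) for m in models)
    return 1 if 2 * votes > len(models) else 0

# Toy data: 2 binding-site-like samples versus 20 non-binding samples.
positives = [[5.0, 5.0], [6.0, 5.5]]
negatives = [[0.1 * i, 0.0] for i in range(20)]
models = [fit_centroid(p, n) for p, n in balanced_subsets(positives, negatives, 5)]
```

Each base learner sees every positive sample but only a small random slice of the negatives, so no single model is dominated by the majority class while the ensemble as a whole still uses much of the negative data.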
6.
BMC Genomics ; 15 Suppl 1: S10, 2014.
Article in English | MEDLINE | ID: mdl-24564304

ABSTRACT

To better understand the unexplained heritability of complex diseases in conventional Genome-Wide Association Studies (GWAS), aggregated association analyses based on predefined functional regions, such as genes and pathways, have recently become popular because they enable evaluating the joint effect of multiple Single-Nucleotide Polymorphisms (SNPs), which helps increase the detection power, especially when investigating genetic variants with weak individual effects. In this paper, we focus on aggregated analysis methods based on the idea of Principal Component Analysis (PCA). Past approaches using PCA mostly make some inherent assumptions about the genotype data and/or the risk effect model, which may hinder the accurate detection of potential disease SNPs that influence disease phenotypes. We derive a general Supervised Categorical Principal Component Analysis (SCPCA), which explicitly models categorical SNP data without imposing any risk effect model assumption. We have evaluated the efficacy of SCPCA by comparing it to a traditional Supervised PCA (SPCA) and a previously developed Supervised Logistic Principal Component Analysis (SLPCA), based on both the simulated genotype data by HAPGEN2 and the genotype data of Crohn's Disease (CD) from the Wellcome Trust Case Control Consortium (WTCCC). Our preliminary results have demonstrated the superiority of SCPCA over both SPCA and SLPCA due to its modeling explicitly designed for categorical SNP data as well as its flexibility on the risk effect model assumption.


Subject(s)
Crohn Disease/genetics, Single Nucleotide Polymorphism, Principal Component Analysis/methods, Algorithms, Genetic Variation, Genome-Wide Association Study, Genotype, Humans, Linkage Disequilibrium, Genetic Models
7.
Proteins ; 81(8): 1351-62, 2013 Aug.
Article in English | MEDLINE | ID: mdl-23504705

ABSTRACT

Hot spot residues of proteins are fundamental interface residues that help proteins perform their functions. Detecting hot spots by experimental methods is costly and time-consuming. Sequential and structural information has been widely used in the computational prediction of hot spots. However, structural information is not always available. In this article, we investigated the problem of identifying hot spots using only physicochemical characteristics extracted from amino acid sequences. We first extracted 132 relatively independent physicochemical features from a set of the 544 properties in AAindex1, an amino acid index database. Each feature was utilized to train a classification model with a novel encoding schema for hot spot prediction by the IBk algorithm, an extension of the K-nearest neighbor algorithm. The combinations of the individual classifiers were explored and the classifiers that appeared frequently in the top performing combinations were selected. The hot spot predictor was built as an ensemble of these classifiers that works in a voting manner. Experimental results demonstrated that our method effectively exploited the feature space and allowed flexible weights of features for different queries. On the commonly used hot spot benchmark sets, our method significantly outperformed other machine learning algorithms and state-of-the-art hot spot predictors. The program is available at http://sfb.kaust.edu.sa/pages/software.aspx.


Subject(s)
Proteins/chemistry, Proteins/metabolism, Algorithms, Amino Acid Sequence, Amino Acids/chemistry, Amino Acids/metabolism, Animals, Artificial Intelligence, Protein Databases, Drosophila/chemistry, Drosophila/metabolism, Drosophila Proteins/chemistry, Drosophila Proteins/metabolism, Humans, Juvenile Hormones/chemistry, Juvenile Hormones/metabolism, Molecular Models, Protein Interaction Maps, Erythropoietin Receptors/chemistry, Erythropoietin Receptors/metabolism
8.
Biometrics ; 68(3): 784-92, 2012 Sep.
Article in English | MEDLINE | ID: mdl-22834966

ABSTRACT

Gene expression index estimation is an essential step in analyzing multiple probe microarray data. Various modeling methods have been proposed in this area. Among these, a popular method proposed in Li and Wong (2001) is based on a multiplicative model, which is similar to the additive model discussed in Irizarry et al. (2003a) at the logarithm scale. Along this line, Hu et al. (2006) proposed data transformation to improve expression index estimation based on an ad hoc entropy criterion and a naive grid search approach. In this work, we re-examined this problem using a new profile likelihood-based transformation estimation approach that is more statistically elegant and computationally efficient. We demonstrate the applicability of the proposed method using a benchmark Affymetrix U95A spiked-in experiment. Moreover, we introduced a new multivariate expression index and used the empirical study to show its promise in terms of improving model fitting and the power of detecting differential expression over the commonly used univariate expression index. As the other important content of the work, we discussed two commonly encountered practical issues in the application of gene expression indexes: normalization and the summary statistic used for detecting differential expression. Our empirical study shows somewhat different findings from the MAQC project (MAQC, 2006).


Subject(s)
Gene Expression Profiling/statistics & numerical data, Statistical Models, Oligonucleotide Array Sequence Analysis/statistics & numerical data, Biometry, Humans, Likelihood Functions, Genetic Models, Multivariate Analysis
9.
Stat Probab Lett ; 82(10): 1807-1814, 2012 Oct.
Article in English | MEDLINE | ID: mdl-22904586

ABSTRACT

We consider the problem of estimation in semiparametric varying coefficient models where the covariate modifying the varying coefficients is functional and is modeled nonparametrically. We develop a kernel-based estimator of the nonparametric component and a profiling estimator of the parametric component of the model and derive their asymptotic properties. Specifically, we show the consistency of the nonparametric functional estimates and derive the asymptotic expansion of the estimates of the parametric component. We illustrate the performance of our methodology using a simulation study and a real data application.

10.
Patterns (N Y) ; 3(3): 100434, 2022 Mar 11.
Article in English | MEDLINE | ID: mdl-35510185

ABSTRACT

Gene knockout (KO) experiments are a proven, powerful approach for studying gene function. However, systematic KO experiments targeting a large number of genes are usually prohibitive due to the limit of experimental and animal resources. Here, we present scTenifoldKnk, an efficient virtual KO tool that enables systematic KO investigation of gene function using data from single-cell RNA sequencing (scRNA-seq). In scTenifoldKnk analysis, a gene regulatory network (GRN) is first constructed from scRNA-seq data of wild-type samples, and a target gene is then virtually deleted from the constructed GRN. Manifold alignment is used to align the resulting reduced GRN to the original GRN to identify differentially regulated genes, which are used to infer target gene functions in analyzed cells. We demonstrate that the scTenifoldKnk-based virtual KO analysis recapitulates the main findings of real-animal KO experiments and recovers the expected functions of genes in relevant cell types.

11.
Biometrics ; 66(4): 1087-95, 2010 Dec.
Article in English | MEDLINE | ID: mdl-20163403

ABSTRACT

Sparse singular value decomposition (SSVD) is proposed as a new exploratory analysis tool for biclustering or identifying interpretable row-column associations within high-dimensional data matrices. SSVD seeks a low-rank, checkerboard structured matrix approximation to data matrices. The desired checkerboard structure is achieved by forcing both the left- and right-singular vectors to be sparse, that is, having many zero entries. By interpreting singular vectors as regression coefficient vectors for certain linear regressions, sparsity-inducing regularization penalties are imposed on the least squares regression to produce sparse singular vectors. An efficient iterative algorithm is proposed for computing the sparse singular vectors, along with some discussion of penalty parameter selection. A lung cancer microarray dataset and a food nutrition dataset are used to illustrate SSVD as a biclustering method. SSVD is also compared with some existing biclustering methods using simulated datasets.


Subject(s)
Cluster Analysis, Algorithms, Factual Databases, Humans, Linear Models, Lung Neoplasms, Nutritional Physiological Phenomena
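
As a rough illustration of the idea in the SSVD abstract above (sparse singular vectors obtained from penalized regressions), the following Python sketch computes one sparse rank-one layer by alternating soft-thresholded updates. It is a simplified version under stated assumptions: a plain soft-threshold penalty with fixed, hypothetical tuning values `lam_u` and `lam_v`, a constant starting vector, and a fixed iteration count. The paper's actual penalties, initialization, and parameter selection may differ.

```python
import math

def soft_threshold(z, lam):
    """Shrink each entry toward zero by lam; entries within lam of zero become 0."""
    return [math.copysign(max(abs(x) - lam, 0.0), x) for x in z]

def normalize(z):
    """Scale z to unit Euclidean length (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in z))
    return [x / norm for x in z] if norm > 0 else list(z)

def matvec(X, v):
    """Compute X v for a matrix stored as a list of rows."""
    return [sum(xij * vj for xij, vj in zip(row, v)) for row in X]

def rmatvec(X, u):
    """Compute X^T u for a matrix stored as a list of rows."""
    return [sum(X[i][j] * u[i] for i in range(len(X))) for j in range(len(X[0]))]

def sparse_rank1(X, lam_u, lam_v, n_iter=100):
    """Alternate two penalized regressions: v on X^T u, then u on X v."""
    u = normalize([1.0] * len(X))  # assumed starting point, not from the paper
    v = normalize([1.0] * len(X[0]))
    for _ in range(n_iter):
        v = normalize(soft_threshold(rmatvec(X, u), lam_v))
        u = normalize(soft_threshold(matvec(X, v), lam_u))
    s = sum(ui * xi for ui, xi in zip(u, matvec(X, v)))  # scale of the layer
    return u, s, v

# Toy matrix with one strong 2x2 block in the top-left corner.
X = [[5.0, 5.0, 0.1, 0.1],
     [5.0, 5.0, 0.1, 0.1],
     [0.1, 0.1, 0.1, 0.1],
     [0.1, 0.1, 0.1, 0.1]]
u, s, v = sparse_rank1(X, lam_u=1.0, lam_v=1.0)
```

The nonzero entries of `u` and `v` jointly select a block of rows and columns, which is how a sparse rank-one layer can be read as a bicluster. A constant starting vector can fail when it is nearly orthogonal to the leading singular vector, so a real implementation would more likely start from an ordinary SVD.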
12.
J Phys Chem A ; 114(17): 5596-600, 2010 May 06.
Article in English | MEDLINE | ID: mdl-20392101

ABSTRACT

Precise morphological control of nanoparticles (NPs) has been impeded by the lack of in situ techniques enabling the observation of instantaneous growth steps. A fundamental understanding of NP nucleation and growth kinetics has yet to be achieved. In the present research, morphological characterization is demonstrated using a novel image detection statistical approach for gold NPs. This multivariate statistical technique enhances the recognition of NPs by successfully identifying their morphology in addition to their growth stages. Thermodynamic analysis of those stages is presented relating surface energies to the growth kinetics. Preferred growth of NPs was seen to take place on specific crystallographic surfaces in a correlated manner. Furthermore, the growth steps are dominated by the adsorption of surfactants and the local surface energies. The present approach enabled detailed observation of NP growth kinetics and can be applied to other metallic NPs.

13.
Patterns (N Y) ; 1(9): 100139, 2020 Dec 11.
Article in English | MEDLINE | ID: mdl-33336197

ABSTRACT

We present scTenifoldNet, a machine learning workflow built upon principal-component regression, low-rank tensor approximation, and manifold alignment, for constructing and comparing single-cell gene regulatory networks (scGRNs) using data from single-cell RNA sequencing. scTenifoldNet reveals regulatory changes in gene expression between samples by comparing the constructed scGRNs. With real data, scTenifoldNet identifies specific gene expression programs associated with different biological processes, providing critical insights into the underlying mechanism of regulatory networks governing cellular transcriptional activities.

14.
Econom Stat ; 9: 140-155, 2019 Jan.
Article in English | MEDLINE | ID: mdl-30740554

ABSTRACT

A semiparametric varying-coefficient mixed regressive spatial autoregressive model is used to study covariate effects on spatially dependent responses, where the effects of some covariates are allowed to vary with other variables. A semiparametric series-based least squares estimating procedure is proposed with the introduction of instrumental variables and series approximations of the conditional expectations. The estimators for both the nonparametric and parametric components of the model are shown to be consistent and their asymptotic distributions are derived. The proposed estimators perform well in simulations. The proposed method is applied to analyze a data set on teen pregnancy to investigate effects of neighborhood as well as other social and economic factors on the teen pregnancy rate.

15.
Cells ; 9(1), 2019 12 19.
Article in English | MEDLINE | ID: mdl-31861624

ABSTRACT

As single-cell RNA sequencing (scRNA-seq) data become widely available, cell-to-cell variability in gene expression, or single-cell expression variability (scEV), has been increasingly appreciated. However, it remains unclear whether this variability is functionally important and, if so, what its implications are for multi-cellular organisms. Here, we analyzed multiple scRNA-seq data sets from lymphoblastoid cell lines (LCLs), lung airway epithelial cells (LAECs), and dermal fibroblasts (DFs) and, for each cell type, selected a group of homogeneous cells with highly similar expression profiles. We estimated the scEV levels for genes after correcting the mean-variance dependency in that data and identified 465, 466, and 364 highly variable genes (HVGs) in LCLs, LAECs, and DFs, respectively. Functions of these HVGs were found to be enriched with those biological processes precisely relevant to the function of the corresponding cell type from which the scRNA-seq data used to identify HVGs were generated; e.g., cytokine signaling pathways were enriched in HVGs identified in LCLs, collagen formation in LAECs, and keratinization in DFs. We repeated the same analysis with scRNA-seq data from induced pluripotent stem cells (iPSCs) and identified only 79 HVGs with no statistically significant enriched functions; the overall scEV in iPSCs was of negligible magnitude. Our results support the "variation is function" hypothesis, arguing that scEV is required for cell type-specific, higher-level system function. Thus, quantifying and characterizing scEV are of importance for our understanding of normal and pathological cellular processes.


Subject(s)
Gene Expression Profiling/methods, Gene Regulatory Networks, Single-Cell Analysis/methods, Algorithms, Cell Line, Gene Expression Regulation, Humans, Organ Specificity, RNA Sequence Analysis/methods
16.
Comput Struct Biotechnol J ; 15: 243-254, 2017.
Article in English | MEDLINE | ID: mdl-28280526

ABSTRACT

Recently, the study of protein structures using angular representations has attracted much attention among structural biologists. The main challenge is how to efficiently model the continuous conformational space of the protein structures based on the differences and similarities between different Ramachandran plots. Despite the presence of statistical methods for modeling angular data of proteins, there is still a substantial need for more sophisticated and faster statistical tools to model the large-scale circular datasets. To address this need, we have developed a nonparametric method for collective estimation of multiple bivariate density functions for a collection of populations of protein backbone angles. The proposed method takes into account the circular nature of the angular data using trigonometric splines, which is more efficient than existing methods. This collective density estimation approach is widely applicable when there is a need to estimate multiple density functions from different populations with common features. Moreover, the coefficients of adaptive basis expansion for the fitted densities provide a low-dimensional representation that is useful for visualization, clustering, and classification of the densities. The proposed method provides a novel and unique perspective to two important and challenging problems in protein structure research: structure-based protein classification and angular-sampling-based protein loop structure prediction.

17.
IEEE Trans Image Process ; 25(12): 5713-5726, 2016 Dec.
Article in English | MEDLINE | ID: mdl-28114064

ABSTRACT

This paper studies the problem of detecting the presence of nanoparticles in noisy transmission electron microscopic (TEM) images and then fitting each nanoparticle with an elliptic shape model. In order to achieve robustness while handling low contrast and high noise in the TEM images, we propose an approach to fuse two kinds of complementary image information, namely, the pixel intensity and the gradient (the first derivative in intensity). Our approach entails two main steps: 1) the first step is to, after necessary pre-processing, employ both intensity-based information and gradient-based information to process the same TEM image and produce two independent sets of results and 2) the subsequent step is to formulate a binary integer programming (BIP) problem for conflict resolution among the two sets of results. Solving the BIP problem determines the final nanoparticle identification. We apply our method to a set of TEM images taken under different microscopic resolutions and noise levels. The empirical results show the merit of the proposed method. It can process a TEM image of 1024×1024 pixels in a few minutes, and the processed outcomes appear rather robust.

18.
J Comput Graph Stat ; 24(1): 84-103, 2015 Jan 01.
Article in English | MEDLINE | ID: mdl-25914514

ABSTRACT

Principal component analysis (PCA) is a popular dimension reduction method to reduce the complexity and obtain the informative aspects of high-dimensional datasets. When the data distribution is skewed, data transformation is commonly used prior to applying PCA. Such transformation is usually obtained from previous studies, prior knowledge, or trial-and-error. In this work, we develop a model-based method that integrates data transformation in PCA and finds an appropriate data transformation using the maximum profile likelihood. Extensions of the method to handle functional data and missing values are also developed. Several numerical algorithms are provided for efficient computation. The proposed method is illustrated using simulated and real-world data examples.

19.
J Am Stat Assoc ; 109(508): 1355-1367, 2014 Dec 01.
Article in English | MEDLINE | ID: mdl-25642005

ABSTRACT

In genome-wide association studies, the primary task is to detect biomarkers in the form of Single Nucleotide Polymorphisms (SNPs) that have nontrivial associations with a disease phenotype and some other important clinical/environmental factors. However, the extremely large number of SNPs compared to the sample size inhibits application of classical methods such as multiple logistic regression. Currently the most commonly used approach is still to analyze one SNP at a time. In this paper, we propose to consider the genotypes of the SNPs simultaneously via a logistic analysis of variance (ANOVA) model, which expresses the logit transformed mean of SNP genotypes as the summation of the SNP effects, effects of the disease phenotype and/or other clinical variables, and the interaction effects. We use a reduced-rank representation of the interaction-effect matrix for dimensionality reduction, and employ the L1-penalty in a penalized likelihood framework to filter out the SNPs that have no associations. We develop a Majorization-Minimization algorithm for computational implementation. In addition, we propose a modified BIC criterion to select the penalty parameters and determine the rank number. The proposed method is applied to a Multiple Sclerosis data set and simulated data sets and shows promise in biomarker detection.

20.
J Biol Rhythms ; 29(4): 231-42, 2014 Aug.
Article in English | MEDLINE | ID: mdl-25238853

ABSTRACT

Identification of circadian-regulated genes based on temporal transcriptome data is important for studying the regulation mechanism of the circadian system. However, various computational methods adopting different strategies for the identification of cycling transcripts usually yield inconsistent results even for the same dataset, making it challenging to choose the optimal method for a specific circadian study. To address this challenge, we evaluate 5 popular methods, including ARSER (ARS), COSOPT (COS), Fisher's G test (FIS), HAYSTACK (HAY), and JTK_CYCLE (JTK), based on both simulated and empirical datasets. Our results show that increasing the number of total samples (through improving sampling frequency or lengthening the sampling time window) is beneficial for computational methods to accurately identify circadian transcripts and measure circadian phase. For a given number of total samples, higher sampling frequency is more important for HAY and JTK, and the longer sampling time window is more crucial for ARS and COS, as verified on simulated and empirical datasets from which circadian signals are computationally identified. In addition, the preference for a higher sampling frequency or a longer sampling time window is also obvious for JTK, ARS, and COS in estimating circadian phases of simulated periodic profiles. Our results also indicate that attention should be paid to the significance threshold that is used for each method in selecting circadian genes, especially when analyzing the same empirical dataset with 2 or more methods. To summarize, for any study involving genome-wide identification of circadian genes from transcriptome data, our evaluation results provide suggestions for the selection of an optimal method based on specific goal and experimental design.


Subject(s)
Circadian Rhythm/genetics, Genome-Wide Association Study/methods, Genome/genetics, Transcriptome/genetics, Computational Biology/methods, Gene Expression Profiling/methods
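
Of the five methods named in the abstract above, Fisher's G test is the simplest to show in a few lines. The sketch below is a generic textbook version, not the implementation evaluated in the study: it computes the periodogram at the Fourier frequencies, takes G as the largest ordinate divided by the sum, and uses the classical exact p-value formula. The function names and the hourly sampling in the example are assumptions, and a constant series (zero total power) is not handled.

```python
import cmath
import math

def periodogram(x):
    """Periodogram ordinates at Fourier frequencies k = 1 .. floor((n - 1) / 2)."""
    n = len(x)
    mean = sum(x) / n
    centered = [xi - mean for xi in x]
    ordinates = []
    for k in range(1, (n - 1) // 2 + 1):
        coef = sum(c * cmath.exp(-2j * math.pi * k * t / n)
                   for t, c in enumerate(centered))
        ordinates.append(abs(coef) ** 2 / n)
    return ordinates

def fisher_g(x):
    """Return (G, p-value): G is the largest ordinate's share of total power."""
    ordinates = periodogram(x)
    m = len(ordinates)
    g = max(ordinates) / sum(ordinates)
    # Exact null distribution of G for Gaussian white noise (Fisher, 1929).
    p = sum((-1) ** (j - 1) * math.comb(m, j) * (1 - j * g) ** (m - 1)
            for j in range(1, int(1 / g) + 1))
    return g, min(max(p, 0.0), 1.0)

# Hypothetical example: a clean 24-hour rhythm sampled hourly for 48 hours.
profile = [math.cos(2 * math.pi * t / 24) for t in range(48)]
g, p_value = fisher_g(profile)
```

Because G depends only on how concentrated the power is at a single frequency, the test is sensitive to the number and spacing of samples, which is consistent with the abstract's point that sampling frequency and window length matter when comparing methods.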