Search | Secretaria de Estado da Saúde

1.

Learning High-dimensional Generalized Linear Autoregressive Models.

Hall, Eric C; Raskutti, Garvesh; Willett, Rebecca M.

IEEE Trans Inf Theory ; 65(4): 2401-2422, 2019 Apr.

Article in English | MEDLINE | ID: mdl-31839683

ABSTRACT

Vector autoregressive models characterize a variety of time series in which linear combinations of current and past observations can be used to accurately predict future observations. For instance, each element of an observation vector could correspond to a different node in a network, and the parameters of an autoregressive model would correspond to the impact of the network structure on the time series evolution. Often these models are used successfully in practice to learn the structure of social, epidemiological, financial, or biological neural networks. However, little is known about statistical guarantees on estimates of such models in non-Gaussian settings. This paper addresses the inference of the autoregressive parameters and associated network structure within a generalized linear model framework that includes Poisson and Bernoulli autoregressive processes. At the heart of this analysis is a sparsity-regularized maximum likelihood estimator. While sparsity-regularization is well-studied in the statistics and machine learning communities, those analysis methods cannot be applied to autoregressive generalized linear models because of the correlations and potential heteroscedasticity inherent in the observations. Sample complexity bounds are derived using a combination of martingale concentration inequalities and modern empirical process techniques for dependent random variables. These bounds, which are supported by several simulation studies, characterize the impact of various network parameters on estimator performance.
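As a rough illustration of the sparsity-regularized maximum likelihood estimator described in this abstract, the sketch below writes out an L1-penalized Poisson autoregressive objective and one proximal gradient update. The model form (no offset term, log-rate equal to a linear function of the previous observation), the function names, and the choice of proximal gradient descent are assumptions made for illustration; they are not taken from the paper.

```python
import math


def soft_threshold(x, t):
    """Proximal operator of t * |x|: shrink x toward zero by t."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def penalized_nll(A, counts, lam):
    """L1-penalized negative Poisson log-likelihood, up to constants.

    Assumed model: counts[t + 1][i] ~ Poisson(exp(sum_j A[i][j] * counts[t][j])).
    """
    n = len(A)
    T = len(counts) - 1
    nll = 0.0
    for t in range(T):
        for i in range(n):
            eta = sum(A[i][j] * counts[t][j] for j in range(n))
            nll += math.exp(eta) - counts[t + 1][i] * eta
    return nll / T + lam * sum(abs(a) for row in A for a in row)


def prox_grad_step(A, counts, lam, step):
    """One proximal gradient update: gradient step on the likelihood term,
    followed by soft-thresholding for the L1 penalty."""
    n = len(A)
    T = len(counts) - 1
    grad = [[0.0] * n for _ in range(n)]
    for t in range(T):
        for i in range(n):
            eta = sum(A[i][j] * counts[t][j] for j in range(n))
            resid = math.exp(eta) - counts[t + 1][i]
            for j in range(n):
                grad[i][j] += resid * counts[t][j] / T
    return [
        [soft_threshold(A[i][j] - step * grad[i][j], step * lam) for j in range(n)]
        for i in range(n)
    ]
```

Entries of `A` that the soft-thresholding step sets exactly to zero correspond to absent edges in the estimated network, which is how the penalty yields a sparse structure estimate.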

2.

Minimizing Negative Transfer of Knowledge in Multivariate Gaussian Processes: A Scalable and Regularized Approach.

Kontar, Raed; Raskutti, Garvesh; Zhou, Shiyu.

IEEE Trans Pattern Anal Mach Intell ; 43(10): 3508-3522, 2021 Oct.

Article in English | MEDLINE | ID: mdl-32305903

ABSTRACT

Recently there has been an increasing interest in the multivariate Gaussian process (MGP), which extends the Gaussian process (GP) to deal with multiple outputs. One approach to construct the MGP and account for non-trivial commonalities amongst outputs employs a convolution process (CP). The CP is based on the idea of sharing latent functions across several convolutions. Despite the elegance of the CP construction, it poses new challenges that have yet to be tackled. First, even with a moderate number of outputs, model building is prohibitive due to the huge increase in computational demands and number of parameters to be estimated. Second, the negative transfer of knowledge may occur when some outputs do not share commonalities. In this paper, we address these issues. We propose a regularized pairwise modeling approach for the MGP established using CP. The key feature of our approach is to distribute the estimation of the full multivariate model into a group of bivariate GPs which are individually built. Interestingly, pairwise modeling turns out to possess unique characteristics, which allow us to tackle the challenge of negative transfer through penalizing the latent function that facilitates information sharing in each bivariate model. Predictions are then made through combining predictions from the bivariate models within a Bayesian framework. The proposed method has excellent scalability when the number of outputs is large and minimizes the negative transfer of knowledge between uncorrelated outputs. Statistical guarantees for the proposed method are studied and its advantageous features are demonstrated through numerical studies.

3.

Inferring Protein Sequence-Function Relationships with Large-Scale Positive-Unlabeled Learning.

Song, Hyebin; Bremer, Bennett J; Hinds, Emily C; Raskutti, Garvesh; Romero, Philip A.

Cell Syst ; 12(1): 92-101.e8, 2021 Jan 20.

Article in English | MEDLINE | ID: mdl-33212013

ABSTRACT

Machine learning can infer how protein sequence maps to function without requiring a detailed understanding of the underlying physical or biological mechanisms. It is challenging to apply existing supervised learning frameworks to large-scale experimental data generated by deep mutational scanning (DMS) and related methods. DMS data often contain high-dimensional and correlated sequence variables, experimental sampling error and bias, and the presence of missing data. Notably, most DMS data do not contain examples of negative sequences, making it challenging to directly estimate how sequence affects function. Here, we develop a positive-unlabeled (PU) learning framework to infer sequence-function relationships from large-scale DMS data. Our PU learning method displays excellent predictive performance across ten large-scale sequence-function datasets, representing proteins of different folds, functions, and library types. The estimated parameters pinpoint key residues that dictate protein structure and function. Finally, we apply our statistical sequence-function model to design highly stabilized enzymes.

Subjects

Machine Learning, Proteins, Amino Acid Sequence

4.

The bias of isotonic regression.

Dai, Ran; Song, Hyebin; Barber, Rina Foygel; Raskutti, Garvesh.

Electron J Stat ; 14(1): 801-834, 2020.

Article in English | MEDLINE | ID: mdl-32489515

ABSTRACT

We study the bias of the isotonic regression estimator. While there is extensive work characterizing the mean squared error of the isotonic regression estimator, relatively little is known about the bias. In this paper, we provide a sharp characterization, proving that the bias scales as O(n^{-β/3}) up to log factors, where 1 ≤ β ≤ 2 is the exponent corresponding to Hölder smoothness of the underlying mean. Importantly, this result only requires a strictly monotone mean and that the noise distribution has subexponential tails, without relying on symmetric noise or other restrictive assumptions.
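For readers unfamiliar with the estimator whose bias is studied here, the following is a minimal sketch of least-squares isotonic regression computed with the standard pool-adjacent-violators algorithm. It is a textbook construction, not code from the paper, and the function name is hypothetical.

```python
def isotonic_regression(y):
    """Least-squares nondecreasing fit to y via pool-adjacent-violators."""
    blocks = []  # each block stores [sum of values, number of values]
    for value in y:
        blocks.append([float(value), 1])
        # Merge the last two blocks while their means violate monotonicity.
        while (
            len(blocks) > 1
            and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fit = []
    for total, count in blocks:
        fit.extend([total / count] * count)  # every point gets its block mean
    return fit


print(isotonic_regression([1, 3, 2, 4]))  # [1.0, 2.5, 2.5, 4.0]
```

The fitted value at each point is the average of the observations pooled into its block.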

5.

Graph-based regularization for regression problems with alignment and highly-correlated designs.

Li, Yuan; Mark, Benjamin; Raskutti, Garvesh; Willett, Rebecca; Song, Hyebin; Neiman, David.

SIAM J Math Data Sci ; 2(2): 480-504, 2020.

Article in English | MEDLINE | ID: mdl-32968717

ABSTRACT

Sparse models for high-dimensional linear regression and machine learning have received substantial attention over the past two decades. Model selection, or determining which features or covariates are the best explanatory variables, is critical to the interpretability of a learned model. Much of the current literature assumes that covariates are only mildly correlated. However, in many modern applications covariates are highly correlated and do not exhibit key properties (such as the restricted eigenvalue condition, restricted isometry property, or other related assumptions). This work considers a high-dimensional regression setting in which a graph governs both correlations among the covariates and the similarity among regression coefficients, meaning there is alignment between the covariates and regression coefficients. Using side information about the strength of correlations among features, we form a graph with edge weights corresponding to pairwise covariances. This graph is used to define a graph total variation regularizer that promotes similar weights for correlated features. This work shows how the proposed graph-based regularization yields mean-squared error guarantees for a broad range of covariance graph structures. These guarantees are optimal for many specific covariance graphs, including block and lattice graphs. Our proposed approach outperforms other methods for highly-correlated design in a variety of experiments on synthetic data and real biochemistry data.
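To make the idea of a graph total variation regularizer concrete, the sketch below evaluates one plausible form of such a penalty from a covariance matrix. The exact weighting, the sign handling for negatively correlated features, and the `threshold` parameter are assumptions for illustration and may differ from the definition used in the paper.

```python
def graph_tv_penalty(beta, cov, threshold=0.0):
    """Sum over feature pairs j < k of |cov[j][k]| * |beta[j] - s * beta[k]|,
    where s is the sign of cov[j][k]; pairs with |cov[j][k]| <= threshold
    are treated as having no edge."""
    p = len(beta)
    penalty = 0.0
    for j in range(p):
        for k in range(j + 1, p):
            c = cov[j][k]
            if abs(c) <= threshold:
                continue  # no edge between features j and k
            sign = 1.0 if c > 0 else -1.0
            penalty += abs(c) * abs(beta[j] - sign * beta[k])
    return penalty


cov = [
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
print(graph_tv_penalty([1.0, 1.0, 0.0], cov))  # 0.0: correlated features share a weight
print(graph_tv_penalty([1.0, 0.0, 0.0], cov))  # 0.8: differing weights are penalized
```

Adding a term of this kind to a least-squares loss pushes strongly correlated features toward equal coefficients, which is the alignment the abstract describes.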

6.

PUlasso: High-Dimensional Variable Selection With Presence-Only Data.

Song, Hyebin; Raskutti, Garvesh.

J Am Stat Assoc ; 115(529): 334-347, 2019.

Article in English | MEDLINE | ID: mdl-32255883

ABSTRACT

In various real-world problems, we are presented with classification problems with positive and unlabeled data, referred to as presence-only responses. In this article, we study variable selection in the context of presence-only responses where the number of features or covariates p is large. The combination of presence-only responses and high dimensionality presents both statistical and computational challenges. In this article, we develop the PUlasso algorithm for variable selection and classification with positive and unlabeled responses. Our algorithm involves using the majorization-minimization framework, which is a generalization of the well-known expectation-maximization (EM) algorithm. In particular, to make our algorithm scalable, we provide two computational speed-ups to the standard EM algorithm. We provide a theoretical guarantee where we first show that our algorithm converges to a stationary point, and then prove that any stationary point within a local neighborhood of the true parameter achieves the minimax optimal mean-squared error under both strict sparsity and group sparsity assumptions. We also demonstrate through simulations that our algorithm outperforms state-of-the-art algorithms in the moderate p settings in terms of classification performance. Finally, we demonstrate that our PUlasso algorithm performs well on a biochemistry example. Supplementary materials for this article are available online.

7.

Pretreatment gene expression profiles can be used to predict response to neoadjuvant chemoradiotherapy in esophageal cancer.

Duong, Cuong; Greenawalt, Danielle M; Kowalczyk, Adam; Ciavarella, Marianne L; Raskutti, Garvesh; Murray, William K; Phillips, Wayne A; Thomas, Robert J S.

Ann Surg Oncol ; 14(12): 3602-9, 2007 Dec.

Article in English | MEDLINE | ID: mdl-17896157

ABSTRACT

BACKGROUND: The use of neoadjuvant therapy, in particular chemoradiotherapy (CRT), in the treatment of esophageal cancer (EC) remains controversial. The ability to predict treatment response in an individual EC patient would greatly aid therapeutic planning. Gene expression profiles of EC were measured and relationship to therapeutic response assessed. METHODS: Tumor biopsy samples taken from 46 EC patients before neoadjuvant CRT were analyzed on 10.5K cDNA microarrays. Response to treatment was assessed and correlated to gene expression patterns by using a support vector machine learning algorithm. RESULTS: Complete clinical response at conclusion of CRT was achieved in 6 of 21 squamous cell carcinoma (SCC) and 11 of 25 adenocarcinoma (AC) patients. CRT response was an independent prognostic factor for survival (P < .001). A range of support vector machine models incorporating 10 to 1000 genes produced a predictive performance of tumor response to CRT peaking at 87% in SCC, but a distinct positive prediction profile was unobtainable for AC. A 32-gene classifier was produced, and by means of this classifier, 10 of 21 SCC patients could be accurately identified as having disease with an incomplete response to therapy, and thus unlikely to benefit from neoadjuvant CRT. CONCLUSIONS: Our study identifies a 32-gene classifier that can be used to predict response to neoadjuvant CRT in SCC. However, because of the molecular diversity between the two histological subtypes of EC, when considering the AC and SCC samples as a single cohort, a predictive profile could not be resolved, and a negative predictive profile was observed for AC.

Subjects

Antineoplastic Combined Chemotherapy Protocols/therapeutic use, Tumor Biomarkers/metabolism, Esophageal Neoplasms/metabolism, Esophageal Neoplasms/therapy, Neoadjuvant Therapy, Adenocarcinoma/metabolism, Adenocarcinoma/secondary, Adenocarcinoma/therapy, Aged, Tumor Biomarkers/genetics, Squamous Cell Carcinoma/metabolism, Squamous Cell Carcinoma/secondary, Squamous Cell Carcinoma/therapy, Combined Modality Therapy, Esophageal Neoplasms/drug therapy, Esophageal Neoplasms/radiotherapy, Esophagectomy, Female, Gene Expression Profiling, Humans, Immunoenzyme Techniques, Lymphatic Metastasis/pathology, Male, Middle Aged, Neoplasm Staging, Oligonucleotide Array Sequence Analysis, Prognosis, Survival Rate, Treatment Outcome
