Results 1 - 17 of 17
1.
Bioinformatics ; 36(5): 1570-1576, 2020 03 01.
Article in English | MEDLINE | ID: mdl-31621830

ABSTRACT

MOTIVATION: Matched case-control analysis is widely used in biomedical studies to identify exposure variables associated with health conditions. The matching is used to improve the efficiency. Existing variable selection methods for matched case-control studies are challenged in high-dimensional settings where interactions among variables are also important. We describe a quite different method for high-dimensional matched case-control data, based on the potential outcome model, which is not only flexible regarding the number of matching and exposure variables but also able to detect interaction effects. RESULTS: We present Matched Forest (MF), an algorithm for variable selection in matched case-control data. The method preserves the case and control values in each instance but transforms the matched case-control data with added counterfactuals. A modified variable importance score from a supervised learner is used to detect important variables. The method is conceptually simple and can be applied with widely available software tools. Simulation studies show the effectiveness of MF in identifying important variables. MF is also applied to data from the biomedical domain and its performance is compared with alternative approaches. AVAILABILITY AND IMPLEMENTATION: R code for implementing MF is available at https://github.com/NooshinSh/Matched_Forest. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Algorithms; Software; Case-Control Studies; Forests; Supervised Machine Learning
2.
J Public Health Manag Pract ; 27(5): E205-E209, 2021.
Article in English | MEDLINE | ID: mdl-33109933

ABSTRACT

CONTEXT: Public health collaboratives are effective platforms to develop interventions for improving population health. Most collaboratives are limited to the public health and health care delivery sectors; however, multisector collaboratives are becoming more recognized as a strategy for combining efforts from medical, public health, social services, and other sectors. PROGRAM: Based on a 4-year multisector collaborative project, we identify concepts for widening the lens to conduct multisector alignment research. The goal of the collaborative was to address the serious care fragmentation and conflicting financing systems for persons with behavioral health disorders. Our work with these 7 sectors provides insight for creating a framework to conduct multisector alignment research for investigating how alignment problems can be identified, investigated, and applied to achieve systems alignment. IMPLEMENTATION: The multisector collaborative was undertaken in Maricopa County, encompassing Phoenix, Arizona, and consisted of more than 50 organizations representing 7 sectors. EVALUATION: We develop a framework for systems alignment consisting of 4 dimensions (alignment problems, alignment mechanisms, alignment solutions, and goal attainment) and a vocabulary for implementing multisector alignment research. We then describe the interplay and reciprocity between the 4 dimensions. DISCUSSION: This framework can be used by multisector collaboratives to help identify strategies, implement programs, and develop metrics to assess impact on population health and equity.


Subject(s)
Population Health; Arizona; Humans; Public Health; Social Work
3.
BMC Bioinformatics ; 21(Suppl 2): 77, 2020 Mar 11.
Article in English | MEDLINE | ID: mdl-32164534

ABSTRACT

BACKGROUND: In biomarker discovery, applying domain knowledge is an effective approach to eliminating false positive features, prioritizing functionally impactful markers and facilitating the interpretation of predictive signatures. Several computational methods have been developed that formulate knowledge-based biomarker discovery as a feature selection problem guided by prior information. These methods often require that prior information is encoded as a single score, and the algorithms are optimized for biological knowledge of a specific type. In practice, however, domain knowledge from diverse resources can provide complementary information, and no current method can integrate heterogeneous prior information for biomarker discovery. To address this problem, we developed the Know-GRRF (know-guided regularized random forest) method that enables dynamic incorporation of domain knowledge from multiple disciplines to guide feature selection. RESULTS: Know-GRRF embeds domain knowledge in a regularized random forest framework. It combines prior information from multiple domains in a linear model to derive a composite score, which, together with other tuning parameters, controls the regularization of the random forests model. Know-GRRF concurrently optimizes the weight given to each type of domain knowledge and other tuning parameters to minimize the AIC of out-of-bag predictions. The objective is to select a compact feature subset that has a high discriminative power and strong functional relevance to the biological phenotype. Via rigorous simulations, we show that Know-GRRF guided by multiple-domain prior information outperforms feature selection methods guided by single-domain prior information or no prior information. We then applied Know-GRRF to a real-world study to identify prognostic biomarkers of prostate cancers. We evaluated the combination of cancer-related gene annotations, evolutionary conservation and pre-computed statistical scores as the prior knowledge to assemble a panel of biomarkers. We discovered a compact set of biomarkers with significant improvements in prediction accuracy. CONCLUSIONS: Know-GRRF is a powerful novel method to incorporate knowledge from multiple domains for feature selection. It has a broad range of applications in biomarker discovery. We implemented this method and released a KnowGRRF package in the R/CRAN archive.


Subject(s)
Algorithms; Biomarkers, Tumor/genetics; Area Under Curve; Biomarkers, Tumor/metabolism; Databases, Factual; Humans; Linear Models; Male; Prostatic Neoplasms/diagnosis; ROC Curve
4.
BMC Genomics ; 19(1): 841, 2018 Nov 27.
Article in English | MEDLINE | ID: mdl-30482155

ABSTRACT

BACKGROUND: Copy number alterations (CNAs) are defined as somatic gains or losses of DNA regions. The profiles of CNAs may provide a fingerprint specific to a tumor type or tumor grade. Low-coverage sequencing for reporting CNAs has recently gained interest since it was successfully translated into clinical applications. Ovarian serous carcinomas can be classified into two largely mutually exclusive grades, low grade and high grade, based on their histologic features. Grade classification based on genomics may provide valuable clues on how best to manage these patients in the clinic. Based on the study of ovarian serous carcinomas, we explore the methodology of combining CNA reporting from low-coverage sequencing with machine learning techniques to stratify tumor biospecimens of different grades. RESULTS: We have developed a data-driven methodology for tumor classification using the profiles of CNAs reported by low-coverage sequencing. The proposed method, called Bag-of-Segments, is used to summarize fixed-length CNA features predictive of tumor grades. These features are further processed by machine learning techniques to obtain classification models. High accuracy is obtained for classifying ovarian serous carcinoma into high and low grades based on leave-one-out cross-validation experiments. Models that are only weakly influenced by the sequence coverage and the purity of the sample can also be built, which would be of higher relevance for clinical applications. The patterns captured by Bag-of-Segments features correlate with current clinical knowledge: low grade ovarian tumors are related to aneuploidy events associated with mitotic errors, while high grade ovarian tumors are induced by DNA repair gene malfunction. CONCLUSIONS: The proposed data-driven method obtains high accuracy with various parametrizations for the ovarian serous carcinoma study, indicating that it has good generalization potential towards other CNA classification problems. This method could be applied to the more difficult task of classifying ovarian serous carcinomas with ambiguous histology or those in which a low grade tumor co-exists with a high grade tumor. The closer genomic relationship of these tumor samples to low or high grade may provide important clinical value.


Subject(s)
Cystadenocarcinoma, Serous/classification; DNA Copy Number Variations; Data Science/methods; Genome, Human; Ovarian Neoplasms/classification; Cystadenocarcinoma, Serous/genetics; Cystadenocarcinoma, Serous/pathology; Female; Humans; Neoplasm Grading; Ovarian Neoplasms/genetics; Ovarian Neoplasms/pathology; Whole Genome Sequencing
5.
Int J Cancer ; 142(11): 2355-2362, 2018 06 01.
Article in English | MEDLINE | ID: mdl-29313979

ABSTRACT

While long-term survival rates for early-stage lung cancer are high, most cases are diagnosed in later stages that can negatively impact survival rates. We aim to design a simple, single biomarker blood test for early-stage lung cancer that is robust to preclinical variables and can be readily implemented in the clinic. Whole blood was collected in PAXgene tubes from a training set of 29 patients, and a validation set of 260 patients, of which samples from 58 patients were prospectively collected in a clinical trial specifically for our study. After RNA was extracted, the expressions of FPR1 and a reference gene were quantified by an automated one-step Taqman RT-PCR assay. Elevated levels of FPR1 mRNA in whole blood predicted lung cancer status with a sensitivity of 55% and a specificity of 87% on all validation specimens. The prospectively collected specimens had a significantly higher 68% sensitivity and 89% specificity. Results from patients with benign nodules were similar to healthy volunteers. No meaningful correlation was present between our test results and any clinical characteristic other than lung cancer diagnosis. FPR1 mRNA levels in whole blood can predict the presence of lung cancer. Using this as a reflex test for positive lung cancer screening computed tomography scans has the potential to increase the positive predictive value. This marker can be easily measured in an automated process utilizing off-the-shelf equipment and reagents. Further work is justified to explain the source of this biomarker.
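The point about positive predictive value in the abstract above can be made concrete with Bayes' rule. The following is a minimal sketch, not the authors' analysis: the sensitivity and specificity are the values reported for all validation specimens, while the prevalence values and the function name `positive_predictive_value` are hypothetical and chosen only to show how PPV depends on how common cancer is in the tested group.

```python
def positive_predictive_value(sensitivity: float, specificity: float,
                              prevalence: float) -> float:
    """Return P(disease | positive test) using Bayes' rule."""
    true_positive_rate = sensitivity * prevalence
    false_positive_rate = (1.0 - specificity) * (1.0 - prevalence)
    total_positive = true_positive_rate + false_positive_rate
    if total_positive == 0.0:
        raise ValueError("the test never returns a positive result")
    return true_positive_rate / total_positive


# Sensitivity (55%) and specificity (87%) come from the abstract.
# The prevalence values below are ASSUMED for illustration only.
for prevalence in (0.01, 0.05, 0.20):
    ppv = positive_predictive_value(0.55, 0.87, prevalence)
    print(f"prevalence={prevalence:.2f} -> PPV={ppv:.3f}")
```

Because a reflex test is applied only to people who already have a positive screening scan, the relevant prevalence is the one in that pre-selected group, which is why the same sensitivity and specificity can yield a more useful PPV there than in the general population.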


Subject(s)
Biomarkers, Tumor; Carcinoma, Non-Small-Cell Lung/diagnosis; Carcinoma, Non-Small-Cell Lung/genetics; Lung Neoplasms/diagnosis; Lung Neoplasms/genetics; RNA, Messenger; Receptors, Formyl Peptide/genetics; Small Cell Lung Carcinoma/diagnosis; Small Cell Lung Carcinoma/genetics; Case-Control Studies; Comorbidity; Early Detection of Cancer; Female; Humans; Male; Neoplasm Staging; ROC Curve
6.
J Immigr Minor Health ; 25(4): 862-869, 2023 Aug.
Article in English | MEDLINE | ID: mdl-36757600

ABSTRACT

COVID-19 burdens are disproportionately high in underserved and vulnerable communities in Arizona. As the pandemic has progressed, it is unclear whether the initially observed health disparities have changed. This study aims to elicit the dynamic landscape of COVID-19 disparities at the community level and identify newly emerging vulnerable subpopulations. Findings from this study can inform interventions to increase health equity among minoritized communities in the Southwest, other regions of the US, and globally. We compiled biweekly COVID-19 case counts of 274 zip code tabulation areas (ZCTAs) in Arizona from October 21, 2020, to November 25, 2021, a time spanning multiple waves of COVID-19 case growth. Within each biweekly period, we tested the associations between the growth rate of COVID-19 cases and the population composition in a ZCTA, including race/ethnicity, income, employment, and age, using multiple regression analysis. We then compared the associations across time periods to discover temporal patterns of health disparities. The association between the percentage of Latinx population and the COVID-19 growth rate was positive before April 2021 but gradually became negative afterwards. The percentage of Black population was not associated with the COVID-19 growth rate at the beginning of the study, but the association became positive after January 2021 and persisted until the end of the study period. Young median age and high unemployment rate emerged as new risk factors around mid-August 2021. Based on these findings, we identified 37 ZCTAs that were highly vulnerable to future fast escalation of COVID-19 cases. As the pandemic progressed, vulnerabilities associated with Latinx ethnicity improved gradually, possibly bolstered by culturally responsive programs in Arizona to support Latinx communities. Still, communities with disadvantaged social determinants of health continued to struggle. Our findings inform the need to adjust current resource allocations to support the design and implementation of new interventions addressing the emerging vulnerabilities at the community level.


Subject(s)
COVID-19; Health Status Disparities; Humans; Arizona/epidemiology; Black People; Employment; Ethnicity; Hispanic or Latino; Social Determinants of Health
7.
Bioinformatics ; 25(15): 1905-14, 2009 Aug 01.
Article in English | MEDLINE | ID: mdl-19447782

ABSTRACT

MOTIVATION: Gene expression profiling technologies can generally produce mRNA abundance data for all genes in a genome. A dearth of proteomic data persists because the identification range and sensitivity of proteomic measurements lag behind those of transcriptomic measurements. With only partial proteomic data, integrative transcriptomic and proteomic analysis is likely to introduce significant bias. Developing methodologies to accurately estimate missing proteomic data will allow better integration of transcriptomic and proteomic datasets and provide deeper insight into metabolic mechanisms underlying complex biological systems. RESULTS: In this study, we present a non-linear data-driven model to predict abundance for undetected proteins using two independent datasets of cognate transcriptomic and proteomic data collected from Desulfovibrio vulgaris. We use stochastic gradient boosted trees (GBT) to uncover possible non-linear relationships between transcriptomic and proteomic data, and to predict protein abundance for the proteins not experimentally detected based on relevant predictors such as mRNA abundance, cellular role, molecular weight, sequence length, protein length, guanine-cytosine (GC) content and triple codon counts. Initially, we constructed a GBT model using all possible variables to assess their relative importance and characterize the behavior of the predictive model. A strong plateau effect in the regions of high mRNA values and sparse data occurred in this model. Hence, we removed genes in those areas based on thresholds estimated from the partial dependency plots where this behavior was captured. At this stage, only the strongest predictors of protein abundance were retained to reduce the complexity of the GBT model. After removing genes in the plateau region, mRNA abundance, main cellular functional categories and a few triple codon counts emerged as the top-ranked predictors of protein abundance. We then created a new tuned GBT model using the five most significant predictors. Our non-linear model consists of a series of regression tree models with implicit strength in variable selection. The model provides relative variable importance measures using mean square error as the criterion. The results showed that coefficients of determination for our non-linear models ranged from 0.393 to 0.582 in both datasets, providing better results than the linear regression used in the past. We evaluated the validity of this non-linear model using biological information on operons, regulons and pathways, and the results demonstrated that the coefficients of variation of estimated protein abundance values within operons, regulons or pathways are indeed smaller than those for random groups of proteins. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Bacterial Proteins/chemistry; Bacterial Proteins/genetics; Desulfovibrio vulgaris/genetics; Desulfovibrio vulgaris/metabolism; Gene Expression Profiling/methods; Nonlinear Dynamics; Proteomics/methods; Databases, Protein
8.
IEEE Trans Pattern Anal Mach Intell ; 31(7): 1338-44, 2009 Jul.
Article in English | MEDLINE | ID: mdl-19443930

ABSTRACT

This paper proposes a new feature selection methodology. The methodology is based on the stepwise variable selection procedure, but, instead of using the traditional discriminant metrics such as Wilks' Lambda, it uses an estimation of the misclassification error as the figure of merit to evaluate the introduction of new features. The expected misclassification error rate (MER) is obtained by using the densities of a constructed function of random variables, which is the stochastic representation of the conditional distribution of the quadratic discriminant function estimate. The application of the proposed methodology results in significant savings of computational time in the estimation of classification error over the traditional simulation and cross-validation methods. One of the main advantages of the proposed method is that it provides a direct estimation of the expected misclassification error at the time of feature selection, which provides an immediate assessment of the benefits of introducing an additional feature into an inspection/classification algorithm.


Subject(s)
Algorithms; Artificial Intelligence; Equipment Failure Analysis/methods; Models, Theoretical; Pattern Recognition, Automated/methods; Computer Simulation
9.
IEEE Trans Neural Netw Learn Syst ; 29(10): 4709-4718, 2018 10.
Article in English | MEDLINE | ID: mdl-29990242

ABSTRACT

In this paper, we propose a new end-to-end deep neural network model for time-series classification (TSC) with emphasis on both the accuracy and the interpretation. The proposed model contains a convolutional network component to extract high-level features and a recurrent network component to enhance the modeling of the temporal characteristics of TS data. In addition, a feedforward fully connected network with the sparse group lasso (SGL) regularization is used to generate the final classification. The proposed architecture not only achieves satisfying classification accuracy, but also obtains good interpretability through the SGL regularization. All these networks are connected and jointly trained in an end-to-end framework, and it can be generally applied to TSC tasks across different domains without the efforts of feature engineering. Our experiments in various TS data sets show that the proposed model outperforms the traditional convolutional neural network model for the classification accuracy, and also demonstrate how the SGL contributes to a better model interpretation.
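For readers unfamiliar with the sparse group lasso (SGL) regularization mentioned above, the sketch below computes the penalty in its commonly used form, a weighted sum of a group-wise L2 term and an element-wise L1 term. It is an illustration under stated assumptions rather than the paper's implementation: the abstract does not say how weights are grouped or how the two terms are balanced, so the grouping, the `lam` and `alpha` parameters, and the function name are all hypothetical.

```python
import math
from typing import Sequence


def sparse_group_lasso_penalty(groups: Sequence[Sequence[float]],
                               lam: float, alpha: float) -> float:
    """Penalty = lam * ((1 - alpha) * sum_g sqrt(p_g) * ||w_g||_2
                        + alpha * ||w||_1).

    `groups` holds the weights split into groups (ASSUMED here to be, for
    example, all outgoing weights of one input feature); p_g is the size
    of group g.
    """
    # Group term: drives whole groups of weights to zero together.
    group_term = sum(
        math.sqrt(len(g)) * math.sqrt(sum(w * w for w in g)) for g in groups
    )
    # L1 term: drives individual weights to zero inside surviving groups.
    l1_term = sum(abs(w) for g in groups for w in g)
    return lam * ((1.0 - alpha) * group_term + alpha * l1_term)


# Hypothetical weights: the second group is already entirely zero,
# so it contributes nothing to either term.
weights_by_group = [[3.0, 4.0], [0.0, 0.0]]
print(sparse_group_lasso_penalty(weights_by_group, lam=1.0, alpha=0.5))
```

The group term is what supports interpretation: when every weight attached to one feature is zero, that feature can be read as unused by the classifier.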

10.
IEEE Trans Neural Netw Learn Syst ; 29(1): 156-166, 2018 01.
Article in English | MEDLINE | ID: mdl-27810837

ABSTRACT

Autoassociative neural networks (ANNs) have been proposed as a nonlinear extension of principal component analysis (PCA), which is commonly used to identify linear variation patterns in high-dimensional data. While principal component scores represent uncorrelated features, standard backpropagation methods for training ANNs provide no guarantee of producing distinct features, which is important for interpretability and for discovering the nature of the variation patterns in the data. Here, we present an alternating nonlinear PCA method, which encourages learning of distinct features in ANNs. A new measure motivated by the condition of orthogonal loadings in PCA is proposed for measuring the extent to which the nonlinear principal components represent distinct variation patterns. We demonstrate the effectiveness of our method using a simulated point cloud data set as well as a subset of the MNIST handwritten digits data. The results show that standard ANNs consistently mix the true variation sources in the low-dimensional representation learned by the model, whereas our alternating method produces solutions where the patterns are better separated in the low-dimensional space.

11.
PLoS One ; 13(4): e0196556, 2018.
Article in English | MEDLINE | ID: mdl-29702695

ABSTRACT

BACKGROUND: Next generation sequencing (NGS) tests are usually performed on relatively small core biopsy or fine needle aspiration (FNA) samples. Data are limited on the amount of tumor by volume, or the minimum number of FNA passes, needed to yield sufficient material for running NGS. We sought to identify the amount of tumor needed for running the PCDx NGS platform. METHODS: 2,723 consecutive tumor tissues of all cancer types were queried and reviewed for inclusion. Information on tumor volume, success of performing NGS, and results of NGS was compiled. Assessments of sequence analysis, mutation calling and sensitivity, quality control, drug associations, and data aggregation and analysis were performed. RESULTS: 6.4% of samples were rejected from all testing due to insufficient tumor quantity. The number of genes with insufficient sensitivity to make definitive mutation calls increased as the percentage of tumor decreased, reaching statistical significance below 5% tumor content. The number of drug associations also decreased with a lower percentage of tumor, but this difference only became significant between 1% and 3%. The number of drug associations also decreased with smaller tissue size, as expected. Neither specimen size nor percentage of tumor affected the ability to pass mRNA quality control. A tumor area of 10 mm² provides a good margin of error for specimens to yield adequate drug association results. CONCLUSIONS: Specimen suitability remains a major obstacle to clinical NGS testing. We determined that PCR-based library creation methods allow smaller specimens, and those with a lower percentage of tumor cells, to be run on the PCDx NGS platform.


Subject(s)
High-Throughput Nucleotide Sequencing/methods; Neoplasms/diagnosis; Neoplasms/genetics; Biopsy, Fine-Needle/methods; DNA Mutational Analysis; DNA, Complementary/metabolism; Female; Gene Library; Humans; Male; Mutation; Polymerase Chain Reaction; RNA, Messenger/metabolism; Reproducibility of Results; Retrospective Studies; Sensitivity and Specificity
13.
Int Urol Nephrol ; 48(2): 249-56, 2016 Feb.
Article in English | MEDLINE | ID: mdl-26661258

ABSTRACT

PURPOSE: Predictive models allow clinicians to identify higher- and lower-risk patients and make targeted treatment decisions. Microalbuminuria (MA) is a condition whose presence is understood to be an early marker for cardiovascular disease. The aims of this study were to develop a patient data-driven predictive model and a risk-score assessment to improve the identification of MA. METHODS: The 2007-2008 National Health and Nutrition Examination Survey (NHANES) was utilized to create a predictive model. The dataset was split into thirds; one-third was used to develop the model, while the other two-thirds were utilized for internal validation. The 2012-2013 NHANES was used as an external validation database. Multivariate logistic regression was performed to create the model. Performance was evaluated using three criteria: (1) receiver operating characteristic curves; (2) pseudo-R² values; and (3) goodness of fit (Hosmer-Lemeshow). The model was then used to develop a risk-score chart. RESULTS: A model was developed using variables for which there was a significant relationship. Variables included were systolic blood pressure, fasting glucose, C-reactive protein, blood urea nitrogen, and alcohol consumption. The model performed well, and no significant differences were observed when it was applied to the validation datasets. A risk score was developed, and the probability of developing MA for each score was calculated. CONCLUSION: The predictive model provides new evidence about variables related to MA and may be used by clinicians to identify at-risk patients and to tailor treatment. The risk score developed may allow clinicians to measure a patient's MA risk.


Subject(s)
Albuminuria/diagnosis; Biomarkers/analysis; Models, Statistical; Nutrition Surveys/methods; Risk Assessment/methods; Adult; Albuminuria/blood; Albuminuria/epidemiology; Databases, Factual; Female; Humans; Male; Middle Aged; Predictive Value of Tests; Prospective Studies; ROC Curve; Risk Factors; United States/epidemiology
14.
IEEE Trans Pattern Anal Mach Intell ; 35(11): 2796-802, 2013 Nov.
Article in English | MEDLINE | ID: mdl-24051736

ABSTRACT

Time series classification is an important task with many challenging applications. A nearest neighbor (NN) classifier with dynamic time warping (DTW) distance is a strong solution in this context. On the other hand, feature-based approaches have been proposed as both classifiers and to provide insight into the series, but these approaches have problems handling translations and dilations in local patterns. Considering these shortcomings, we present a framework to classify time series based on a bag-of-features representation (TSBF). Multiple subsequences selected from random locations and of random lengths are partitioned into shorter intervals to capture the local information. Consequently, features computed from these subsequences measure properties at different locations and dilations when viewed from the original series. This provides a feature-based approach that can handle warping (although differently from DTW). Moreover, a supervised learner (that handles mixed data types, different units, etc.) integrates location information into a compact codebook through class probability estimates. Additionally, relevant global features can easily supplement the codebook. TSBF is compared to NN classifiers and other alternatives (bag-of-words strategies, sparse spatial sample kernels, shapelets). Our experimental results show that TSBF provides better results than competitive methods on benchmark datasets from the UCR time series database.
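The subsequence step described above (random start, random length, then a split into shorter intervals) can be sketched as follows. This is a rough illustration, not the TSBF implementation: the abstract does not list the per-interval statistics, so the choice of mean, variance, and slope is an assumption, as are the function names, the `min_interval_len` parameter, and the decision to prepend the start position and length to each row as location information.

```python
import random
from statistics import mean, pvariance
from typing import List, Optional, Sequence


def interval_features(values: Sequence[float]) -> List[float]:
    """Mean, variance, and least-squares slope of one interval (ASSUMED set)."""
    n = len(values)
    t_mean = (n - 1) / 2
    v_mean = mean(values)
    denom = sum((t - t_mean) ** 2 for t in range(n))
    slope = (sum((t - t_mean) * (v - v_mean) for t, v in enumerate(values)) / denom
             if denom else 0.0)
    return [v_mean, pvariance(values), slope]


def subsequence_features(series: Sequence[float], n_subsequences: int,
                         n_intervals: int, min_interval_len: int = 2,
                         rng: Optional[random.Random] = None) -> List[List[float]]:
    """Draw random subsequences and describe each by its interval features."""
    rng = rng or random.Random()
    min_len = n_intervals * min_interval_len
    if len(series) < min_len:
        raise ValueError("series too short for the requested intervals")
    rows = []
    for _ in range(n_subsequences):
        length = rng.randint(min_len, len(series))      # random length
        start = rng.randint(0, len(series) - length)    # random location
        sub = series[start:start + length]
        # Contiguous, near-equal intervals covering the whole subsequence.
        bounds = [i * length // n_intervals for i in range(n_intervals + 1)]
        row = [float(start), float(length)]             # location information
        for lo, hi in zip(bounds, bounds[1:]):
            row.extend(interval_features(sub[lo:hi]))
        rows.append(row)
    return rows


# Hypothetical toy series; each row would be one training instance for
# the supervised learner that builds the codebook.
toy_series = [float(i % 7) for i in range(40)]
for row in subsequence_features(toy_series, n_subsequences=3, n_intervals=4,
                                rng=random.Random(0)):
    print([round(x, 2) for x in row])
```

Because subsequence lengths vary while the number of intervals is fixed, the same local pattern stretched or shifted in time still maps to a comparable feature row, which is the sense in which such features can tolerate warping.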


Subject(s)
Algorithms; Artificial Intelligence; Models, Theoretical; Pattern Recognition, Automated/methods; Computer Simulation
15.
IEEE Trans Neural Netw Learn Syst ; 23(4): 644-56, 2012 Apr.
Article in English | MEDLINE | ID: mdl-24805047

ABSTRACT

Kernel principal component analysis (KPCA) is a method widely used for denoising multivariate data. Using geometric arguments, we investigate why a projection operation inherent to all existing KPCA denoising algorithms can sometimes cause very poor denoising. Based on this, we propose a modification to the projection operation that remedies this problem and can be incorporated into any of the existing KPCA algorithms. Using toy examples and real datasets, we show that the proposed algorithm can substantially improve denoising performance and is more robust to misspecification of an important tuning parameter.

16.
Mol Biosyst ; 8(3): 804-17, 2012 Mar.
Article in English | MEDLINE | ID: mdl-22222464

ABSTRACT

Phenotypic characterization of individual cells provides crucial insights into intercellular heterogeneity and enables access to information that is unavailable from ensemble averaged, bulk cell analyses. Single-cell studies have attracted significant interest in recent years and spurred the development of a variety of commercially available and research-grade technologies. To quantify cell-to-cell variability of cell populations, we have developed an experimental platform for real-time measurements of oxygen consumption (OC) kinetics at the single-cell level. Unique challenges inherent to these single-cell measurements arise, and no existing data analysis methodology is available to address them. Here we present a data processing and analysis method that addresses challenges encountered with this unique type of data in order to extract biologically relevant information. We applied the method to analyze OC profiles obtained with single cells of two different cell lines derived from metaplastic and dysplastic human Barrett's esophageal epithelium. In terms of method development, three main challenges were considered for this heterogeneous dynamic system: (i) high levels of noise, (ii) the lack of a priori knowledge of single-cell dynamics, and (iii) the role of intercellular variability within and across cell types. Several strategies and solutions to address each of these three challenges are presented. The features such as slopes, intercepts, breakpoint or change-point were extracted for every OC profile and compared across individual cells and cell types. The results demonstrated that the extracted features facilitated exposition of subtle differences between individual cells and their responses to cell-cell interactions. With minor modifications, this method can be used to process and analyze data from other acquisition and experimental modalities at the single-cell level, providing a valuable statistical framework for single-cell analysis.


Subject(s)
Oxygen/metabolism; Single-Cell Analysis/methods; Barrett Esophagus/metabolism; Cell Line; Esophagus/metabolism; Humans; Linear Models
17.
Mol Biosyst ; 7(4): 1093-104, 2011 Apr.
Article in English | MEDLINE | ID: mdl-21212895

ABSTRACT

Despite significant improvements in recent years, currently available proteomic datasets still suffer from a large number of missing values. Integrative analyses based upon incomplete proteomic and transcriptomic datasets could seriously bias the biological interpretation. In this study, we applied a non-linear data-driven stochastic gradient boosted trees (GBT) model to impute missing proteomic values using a temporal transcriptomic and proteomic dataset of Shewanella oneidensis. In this dataset, gene expression was measured after the cells were exposed to 1 mM potassium chromate for 5, 30, 60, and 90 min, while protein abundance was measured at 45 and 90 min. With the ultimate objective of imputing protein values for experimentally undetected samples at 45 and 90 min, we applied a serial set of algorithms to capture relationships between temporal gene and protein expression. This work follows four main steps: (1) a quality control step for gene expression reliability, (2) mRNA imputation, (3) protein prediction, and (4) validation. Initially, an S control chart approach is applied to gene expression replicates to remove unwanted variability. Then, we address the missing measurements of gene expression through non-linear smoothing spline curve fitting. This method identifies temporal relationships among transcriptomic data at different time points and enables imputation of mRNA abundance at 45 min. After mRNA imputation was validated by biological constraints (i.e. operons), we used a data-driven GBT model to impute protein abundance for the proteins experimentally undetected in the 45 and 90 min samples, based on relevant predictors such as temporal mRNA gene expression data and cellular functional roles. The imputed protein values were validated using biological constraints such as operon and pathway information through a permutation test to investigate whether dispersion measures are indeed smaller for known biological groups than for any set of random genes. Finally, we demonstrated that such missing value imputation improved characterization of the temporal response of S. oneidensis to chromate.
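The permutation test used for validation in the abstract above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the authors' procedure: the abstract does not name the dispersion measure, so population standard deviation is assumed here, and the function name, the gene identifiers, and the abundance values are hypothetical.

```python
import random
from statistics import pstdev
from typing import Dict, Optional, Sequence


def dispersion_permutation_test(abundance: Dict[str, float],
                                group: Sequence[str],
                                n_permutations: int = 10000,
                                rng: Optional[random.Random] = None) -> float:
    """Estimate how often a random gene set is at least as tight as `group`.

    Returns a one-sided p-value: small values suggest the known biological
    group (for example an operon) has smaller dispersion than random sets.
    """
    rng = rng or random.Random()
    observed = pstdev([abundance[g] for g in group])   # ASSUMED dispersion measure
    all_values = list(abundance.values())
    hits = 0
    for _ in range(n_permutations):
        random_set = rng.sample(all_values, len(group))
        if pstdev(random_set) <= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical imputed abundances: three "operon" genes with similar values
# among 47 other genes with widely spread values.
abundance = {f"gene{i}": float(i) for i in range(3, 50)}
abundance.update({"opA": 100.0, "opB": 100.1, "opC": 100.2})
p_value = dispersion_permutation_test(abundance, ["opA", "opB", "opC"],
                                      n_permutations=2000, rng=random.Random(1))
print(f"p-value: {p_value:.4f}")
```

Sampling random sets of the same size as the group keeps the comparison fair, since dispersion estimates from small sets are more variable than those from large ones.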


Subject(s)
Gene Expression Profiling; Proteomics; Shewanella/genetics; Shewanella/metabolism; Algorithms; Chromates/pharmacology; Computational Biology; Environmental Pollutants/pharmacology; Gene Expression Regulation, Bacterial/drug effects; Models, Statistical; Potassium Compounds/pharmacology; Quality Control; Shewanella/drug effects; Time Factors