Results 1 - 17 of 17
1.
J Immigr Minor Health ; 25(4): 862-869, 2023 Aug.
Article in English | MEDLINE | ID: mdl-36757600

ABSTRACT

COVID-19 burdens are disproportionately high in underserved and vulnerable communities in Arizona. As the pandemic progressed, it was unclear whether the initially associated health disparities had changed. This study aims to elicit the dynamic landscape of COVID-19 disparities at the community level and identify newly emerging vulnerable subpopulations. Findings from this study can inform interventions to increase health equity among minoritized communities in the Southwest, other regions of the US, and globally. We compiled biweekly COVID-19 case counts of 274 zip code tabulation areas (ZCTAs) in Arizona from October 21, 2020, to November 25, 2021, a time spanning multiple waves of COVID-19 case growth. Within each biweekly period, we tested the associations between the growth rate of COVID-19 cases and the population composition of a ZCTA, including race/ethnicity, income, employment, and age, using multiple regression analysis. We then compared the associations across time periods to discover temporal patterns of health disparities. The association between the percentage of Latinx population and the COVID-19 growth rate was positive before April 2021 but gradually reversed to negative afterwards. The percentage of Black population was not associated with the COVID-19 growth rate at the beginning of the study but became positively associated after January 2021, and this association persisted until the end of the study period. Young median age and high unemployment rate emerged as new risk factors around mid-August 2021. Based on these findings, we identified 37 ZCTAs that were highly vulnerable to future fast escalation of COVID-19 cases. As the pandemic progressed, vulnerabilities associated with Latinx ethnicity improved gradually, possibly bolstered by culturally responsive programs in Arizona that support Latinx communities. Still, communities with disadvantaged social determinants of health continued to struggle. Our findings indicate the need to adjust current resource allocations to support the design and implementation of new interventions addressing the emerging vulnerabilities at the community level.
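As a rough illustration of the per-period analysis described above, the sketch below (Python, with assumed column names) regresses the biweekly ZCTA growth rate on population-composition variables and returns the coefficients and p-values that would be compared across periods.

```python
# Sketch only: per-period multiple regression of ZCTA case growth on composition.
# Column names below are assumptions for illustration.
import statsmodels.api as sm

def period_associations(df, period):
    """df rows = ZCTA x biweekly period; columns (assumed): zcta, period,
    growth_rate, pct_latinx, pct_black, median_income, unemployment_rate,
    median_age."""
    d = df[df["period"] == period]
    X = sm.add_constant(d[["pct_latinx", "pct_black", "median_income",
                           "unemployment_rate", "median_age"]])
    fit = sm.OLS(d["growth_rate"], X).fit()
    return fit.params, fit.pvalues

# Comparing the returned coefficients across the biweekly periods traces how each
# association (e.g. pct_latinx) changes sign over time.
```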


Subjects
COVID-19, Health Status Disparities, Humans, Arizona/epidemiology, Black People, Employment, Ethnicity, Hispanic or Latino, Social Determinants of Health
2.
J Public Health Manag Pract ; 27(5): E205-E209, 2021.
Article in English | MEDLINE | ID: mdl-33109933

ABSTRACT

CONTEXT: Public health collaboratives are effective platforms for developing interventions to improve population health. Most collaboratives are limited to the public health and health care delivery sectors; however, multisector collaboratives are increasingly recognized as a strategy for combining efforts from medical, public health, social services, and other sectors. PROGRAM: Based on a 4-year multisector collaborative project, we identify concepts for widening the lens to conduct multisector alignment research. The goal of the collaborative was to address the serious care fragmentation and conflicting financing systems for persons with behavioral health disorders. Our work with these 7 sectors provides insight for creating a framework for multisector alignment research that investigates how alignment problems can be identified, analyzed, and addressed to achieve systems alignment. IMPLEMENTATION: The multisector collaborative was undertaken in Maricopa County, which encompasses Phoenix, Arizona, and consisted of more than 50 organizations representing 7 sectors. EVALUATION: We develop a framework for systems alignment consisting of 4 dimensions (alignment problems, alignment mechanisms, alignment solutions, and goal attainment) and a vocabulary for implementing multisector alignment research. We then describe the interplay and reciprocity between the 4 dimensions. DISCUSSION: This framework can be used by multisector collaboratives to help identify strategies, implement programs, and develop metrics to assess impact on population health and equity.


Subjects
Population Health, Arizona, Humans, Public Health, Social Work
3.
BMC Bioinformatics ; 21(Suppl 2): 77, 2020 Mar 11.
Article in English | MEDLINE | ID: mdl-32164534

ABSTRACT

BACKGROUND: In biomarker discovery, applying domain knowledge is an effective approach to eliminating false positive features, prioritizing functionally impactful markers, and facilitating the interpretation of predictive signatures. Several computational methods have been developed that formulate knowledge-based biomarker discovery as a feature selection problem guided by prior information. These methods often require that prior information be encoded as a single score, and the algorithms are optimized for biological knowledge of a specific type. In practice, however, domain knowledge from diverse resources can provide complementary information, yet no current method can integrate heterogeneous prior information for biomarker discovery. To address this problem, we developed Know-GRRF (knowledge-guided regularized random forest), a method that enables dynamic incorporation of domain knowledge from multiple disciplines to guide feature selection. RESULTS: Know-GRRF embeds domain knowledge in a regularized random forest framework. It combines prior information from multiple domains in a linear model to derive a composite score, which, together with other tuning parameters, controls the regularization of the random forest model. Know-GRRF concurrently optimizes the weight given to each type of domain knowledge and other tuning parameters to minimize the AIC of out-of-bag predictions. The objective is to select a compact feature subset that has high discriminative power and strong functional relevance to the biological phenotype. Via rigorous simulations, we show that Know-GRRF guided by multiple-domain prior information outperforms feature selection methods guided by single-domain prior information or no prior information. We then applied Know-GRRF to a real-world study to identify prognostic biomarkers of prostate cancer. We evaluated the combination of cancer-related gene annotations, evolutionary conservation, and pre-computed statistical scores as the prior knowledge to assemble a panel of biomarkers. We discovered a compact set of biomarkers with significant improvements in prediction accuracy. CONCLUSIONS: Know-GRRF is a powerful novel method for incorporating knowledge from multiple domains into feature selection. It has a broad range of applications in biomarker discovery. We implemented this method and released the KnowGRRF package in the R/CRAN archive.
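The composite-score idea described above can be sketched as follows; the normalization and the RRF-style penalty mapping are illustrative assumptions, not the published Know-GRRF implementation.

```python
# Sketch of the composite-prior idea: several domain-knowledge scores are
# combined linearly into one per-feature score that modulates the regularization
# penalty. The normalization and penalty mapping are illustrative assumptions.
import numpy as np

def composite_penalty(domain_scores, weights, gamma=0.5):
    """domain_scores: (n_features, n_domains) prior scores.
    weights: (n_domains,) mixing weights (tuned, e.g., to minimize the AIC of
    out-of-bag predictions, as the abstract describes).
    Returns a per-feature penalty in (0, 1]; a lower penalty means less
    shrinkage, so features with strong prior support are easier to select."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    score = domain_scores @ w                      # composite score per feature
    score = (score - score.min()) / (score.ptp() + 1e-12)
    return (1.0 - gamma) + gamma * (1.0 - score)   # RRF-style penalty coefficient
```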


Subjects
Algorithms, Tumor Biomarkers/genetics, Area Under Curve, Tumor Biomarkers/metabolism, Factual Databases, Humans, Linear Models, Male, Prostatic Neoplasms/diagnosis, ROC Curve
4.
Bioinformatics ; 36(5): 1570-1576, 2020 03 01.
Article in English | MEDLINE | ID: mdl-31621830

ABSTRACT

MOTIVATION: Matched case-control analysis is widely used in biomedical studies to identify exposure variables associated with health conditions. Matching is used to improve efficiency. Existing variable selection methods for matched case-control studies are challenged in high-dimensional settings where interactions among variables are also important. We describe a quite different method for high-dimensional matched case-control data, based on the potential outcome model, which is not only flexible regarding the number of matching and exposure variables but also able to detect interaction effects. RESULTS: We present Matched Forest (MF), an algorithm for variable selection in matched case-control data. The method preserves the case and control values in each instance but transforms the matched case-control data by adding counterfactuals. A modified variable importance score from a supervised learner is used to detect important variables. The method is conceptually simple and can be applied with widely available software tools. Simulation studies show the effectiveness of MF in identifying important variables. MF is also applied to data from the biomedical domain, and its performance is compared with alternative approaches. AVAILABILITY AND IMPLEMENTATION: R code for implementing MF is available at https://github.com/NooshinSh/Matched_Forest. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
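A minimal sketch of the kind of counterfactual transformation described above, using an off-the-shelf random forest; the exact encoding and importance modification used by Matched Forest may differ.

```python
# Sketch: each 1:1 matched set yields an observed (case, control) row labeled 1
# and its swapped counterfactual labeled 0; a random forest's importance scores
# are then inspected per variable.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def transform_pairs(case_X, ctrl_X):
    """case_X, ctrl_X: (n_pairs, n_features) exposure matrices."""
    observed = np.hstack([case_X, ctrl_X])        # label 1: true orientation
    counterfactual = np.hstack([ctrl_X, case_X])  # label 0: swapped orientation
    X = np.vstack([observed, counterfactual])
    y = np.r_[np.ones(len(case_X)), np.zeros(len(ctrl_X))]
    return X, y

def variable_importance(case_X, ctrl_X):
    n_features = case_X.shape[1]
    X, y = transform_pairs(case_X, ctrl_X)
    rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
    imp = rf.feature_importances_
    # Combine the importance of each variable's "case slot" and "control slot".
    return imp[:n_features] + imp[n_features:]
```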


Subjects
Algorithms, Software, Case-Control Studies, Forests, Supervised Machine Learning
5.
BMC Genomics ; 19(1): 841, 2018 Nov 27.
Article in English | MEDLINE | ID: mdl-30482155

ABSTRACT

BACKGROUND: Copy number alterations (CNAs) are defined as somatic gains or losses of DNA regions. The profiles of CNAs may provide a fingerprint specific to a tumor type or tumor grade. Low-coverage sequencing for reporting CNAs has recently gained interest since it has been successfully translated into clinical applications. Ovarian serous carcinomas can be classified into two largely mutually exclusive grades, low grade and high grade, based on their histologic features. Grade classification based on genomics may provide valuable clues on how to best manage these patients in the clinic. Based on a study of ovarian serous carcinomas, we explore a methodology that combines CNA reporting from low-coverage sequencing with machine learning techniques to stratify tumor biospecimens of different grades. RESULTS: We have developed a data-driven methodology for tumor classification using the profiles of CNAs reported by low-coverage sequencing. The proposed method, called Bag-of-Segments, is used to summarize fixed-length CNA features predictive of tumor grades. These features are further processed by machine learning techniques to obtain classification models. High accuracy is obtained for classifying ovarian serous carcinoma into high and low grades based on leave-one-out cross-validation experiments. Models that are only weakly influenced by the sequencing coverage and the purity of the sample can also be built, which would be of higher relevance for clinical applications. The patterns captured by Bag-of-Segments features correlate with current clinical knowledge: low grade ovarian tumors are related to aneuploidy events associated with mitotic errors, while high grade ovarian tumors are induced by DNA repair gene malfunction. CONCLUSIONS: The proposed data-driven method obtains high accuracy with various parametrizations for the ovarian serous carcinoma study, indicating that it has good generalization potential towards other CNA classification problems. This method could be applied to the more difficult task of classifying ovarian serous carcinomas with ambiguous histology or those with low grade tumors co-existing with high grade tumors. The closer genomic relationship of these tumor samples to low or high grade may provide important clinical value.
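A hedged sketch of a bag-of-segments style encoding: segment-level CNA descriptors are clustered into a codebook and each tumor becomes a codeword histogram; the descriptor choice and codebook size are illustrative assumptions.

```python
# Sketch of a bag-of-segments style representation: segment-level CNA descriptors
# from all samples are clustered into a codebook, and each tumor is encoded as a
# histogram of codeword counts before classification.
import numpy as np
from sklearn.cluster import KMeans

def bag_of_segments(samples, k=32, random_state=0):
    """samples: list of (n_segments_i, d) arrays, e.g. columns = (segment length,
    mean copy-number log2 ratio). Returns an (n_samples, k) histogram matrix."""
    codebook = KMeans(n_clusters=k, random_state=random_state).fit(np.vstack(samples))
    feats = np.zeros((len(samples), k))
    for i, segs in enumerate(samples):
        words = codebook.predict(segs)
        feats[i] = np.bincount(words, minlength=k) / len(words)
    return feats

# The histograms can then be fed to any classifier and evaluated with
# leave-one-out cross-validation (e.g. sklearn's cross_val_score with LeaveOneOut).
```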


Subjects
Serous Cystadenocarcinoma/classification, DNA Copy Number Variations, Data Science/methods, Human Genome, Ovarian Neoplasms/classification, Serous Cystadenocarcinoma/genetics, Serous Cystadenocarcinoma/pathology, Female, Humans, Neoplasm Grading, Ovarian Neoplasms/genetics, Ovarian Neoplasms/pathology, Whole Genome Sequencing
6.
IEEE Trans Neural Netw Learn Syst ; 29(10): 4709-4718, 2018 10.
Article in English | MEDLINE | ID: mdl-29990242

ABSTRACT

In this paper, we propose a new end-to-end deep neural network model for time-series classification (TSC) with emphasis on both accuracy and interpretability. The proposed model contains a convolutional network component to extract high-level features and a recurrent network component to enhance the modeling of the temporal characteristics of TS data. In addition, a feedforward fully connected network with sparse group lasso (SGL) regularization is used to generate the final classification. The proposed architecture not only achieves satisfactory classification accuracy but also obtains good interpretability through the SGL regularization. All these networks are connected and jointly trained in an end-to-end framework, and the model can be generally applied to TSC tasks across different domains without the effort of feature engineering. Our experiments on various TS datasets show that the proposed model outperforms the traditional convolutional neural network model in classification accuracy, and also demonstrate how the SGL contributes to a better model interpretation.
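A hedged PyTorch sketch of the described architecture (convolutional block, recurrent block, fully connected classifier with a sparse group lasso penalty); layer sizes and the grouping used for the SGL term are illustrative assumptions.

```python
# Sketch of a conv + recurrent + fully connected classifier with a sparse group
# lasso (SGL) penalty added to the training loss. Layer sizes and the grouping
# (columns of the penalized weight matrix) are illustrative assumptions.
import torch
import torch.nn as nn

class ConvRecurrentSGL(nn.Module):
    def __init__(self, n_classes, hidden=64):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv1d(1, 32, kernel_size=7, padding=3),
                                  nn.ReLU())
        self.rnn = nn.LSTM(input_size=32, hidden_size=hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, x):                  # x: (batch, series_length)
        h = self.conv(x.unsqueeze(1))      # (batch, 32, series_length)
        out, _ = self.rnn(h.transpose(1, 2))
        return self.fc(out[:, -1])         # logits from the last time step

def sparse_group_lasso(weight, alpha=0.5, lam=1e-3):
    """SGL penalty on a weight matrix; each column is treated as one group."""
    l1 = weight.abs().sum()
    group = torch.linalg.norm(weight, dim=0).sum()
    return lam * (alpha * l1 + (1 - alpha) * group)

# Training objective (sketch):
#   loss = nn.functional.cross_entropy(model(x), y) + sparse_group_lasso(model.fc.weight)
```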

8.
PLoS One ; 13(4): e0196556, 2018.
Article in English | MEDLINE | ID: mdl-29702695

ABSTRACT

BACKGROUND: Next generation sequencing (NGS) tests are usually performed on relatively small core biopsy or fine needle aspiration (FNA) samples. Data are limited on what amount of tumor by volume or minimum number of FNA passes is needed to yield sufficient material for running NGS. We sought to identify the amount of tumor required for running the PCDx NGS platform. METHODS: 2,723 consecutive tumor tissues of all cancer types were queried and reviewed for inclusion. Information on tumor volume, success of performing NGS, and results of NGS were compiled. Assessment of sequence analysis, mutation calling and sensitivity, quality control, drug associations, and data aggregation and analysis was performed. RESULTS: 6.4% of samples were rejected from all testing due to insufficient tumor quantity. The number of genes with insufficient sensitivity to make definitive mutation calls increased as the percentage of tumor decreased, reaching statistical significance below 5% tumor content. The number of drug associations also decreased with a lower percentage of tumor, but this difference only became significant between 1% and 3%. The number of drug associations did decrease with smaller tissue size, as expected. Neither specimen size nor percentage of tumor affected the ability to pass mRNA quality control. A tumor area of 10 mm² provides a good margin of error for specimens to yield adequate drug association results. CONCLUSIONS: Specimen suitability remains a major obstacle to clinical NGS testing. We determined that PCR-based library creation methods allow smaller specimens, and those with a lower percentage of tumor cells, to be run on the PCDx NGS platform.


Subjects
High-Throughput Nucleotide Sequencing/methods, Neoplasms/diagnosis, Neoplasms/genetics, Fine-Needle Biopsy/methods, DNA Mutational Analysis, Complementary DNA/metabolism, Female, Gene Library, Humans, Male, Mutation, Polymerase Chain Reaction, Messenger RNA/metabolism, Reproducibility of Results, Retrospective Studies, Sensitivity and Specificity
9.
Int J Cancer ; 142(11): 2355-2362, 2018 06 01.
Article in English | MEDLINE | ID: mdl-29313979

ABSTRACT

While long-term survival rates for early-stage lung cancer are high, most cases are diagnosed at later stages, which negatively impacts survival rates. We aim to design a simple, single-biomarker blood test for early-stage lung cancer that is robust to preclinical variables and can be readily implemented in the clinic. Whole blood was collected in PAXgene tubes from a training set of 29 patients and a validation set of 260 patients, of which samples from 58 patients were prospectively collected in a clinical trial specifically for this study. After RNA was extracted, the expression of FPR1 and a reference gene was quantified by an automated one-step TaqMan RT-PCR assay. Elevated levels of FPR1 mRNA in whole blood predicted lung cancer status with a sensitivity of 55% and a specificity of 87% on all validation specimens. The prospectively collected specimens had a significantly higher sensitivity of 68% and specificity of 89%. Results from patients with benign nodules were similar to those from healthy volunteers. No meaningful correlation was present between our test results and any clinical characteristic other than lung cancer diagnosis. FPR1 mRNA levels in whole blood can predict the presence of lung cancer. Using this as a reflex test for positive lung cancer screening computed tomography scans has the potential to increase the positive predictive value. This marker can be easily measured in an automated process using off-the-shelf equipment and reagents. Further work is justified to explain the source of this biomarker.
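To illustrate the reflex-test argument numerically (the CT-screening PPV below is an assumed figure, not from the study), applying a test with 68% sensitivity and 89% specificity to CT-positive screens would raise the positive predictive value roughly as follows.

```python
# Illustrative only: the 4% pretest probability (CT-screening PPV) is an assumed
# figure, not a value reported in the study.
def reflex_ppv(pretest_prob, sensitivity, specificity):
    p = pretest_prob
    return (sensitivity * p) / (sensitivity * p + (1 - specificity) * (1 - p))

print(reflex_ppv(0.04, 0.68, 0.89))  # ~0.20: PPV rises from an assumed 4% to ~20%
```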


Subjects
Tumor Biomarkers, Non-Small Cell Lung Carcinoma/diagnosis, Non-Small Cell Lung Carcinoma/genetics, Lung Neoplasms/diagnosis, Lung Neoplasms/genetics, Messenger RNA, Formyl Peptide Receptors/genetics, Small Cell Lung Carcinoma/diagnosis, Small Cell Lung Carcinoma/genetics, Case-Control Studies, Comorbidity, Early Detection of Cancer, Female, Humans, Male, Neoplasm Staging, ROC Curve
10.
IEEE Trans Neural Netw Learn Syst ; 29(1): 156-166, 2018 01.
Article in English | MEDLINE | ID: mdl-27810837

ABSTRACT

Autoassociative neural networks (ANNs) have been proposed as a nonlinear extension of principal component analysis (PCA), which is commonly used to identify linear variation patterns in high-dimensional data. While principal component scores represent uncorrelated features, standard backpropagation methods for training ANNs provide no guarantee of producing distinct features, which is important for interpretability and for discovering the nature of the variation patterns in the data. Here, we present an alternating nonlinear PCA method, which encourages learning of distinct features in ANNs. A new measure motivated by the condition of orthogonal loadings in PCA is proposed for measuring the extent to which the nonlinear principal components represent distinct variation patterns. We demonstrate the effectiveness of our method using a simulated point cloud data set as well as a subset of the MNIST handwritten digits data. The results show that standard ANNs consistently mix the true variation sources in the low-dimensional representation learned by the model, whereas our alternating method produces solutions where the patterns are better separated in the low-dimensional space.
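A simplified sketch of the goal described above, using a decorrelation penalty on an autoencoder bottleneck as a stand-in for the paper's alternating training scheme; this is not the published method, only an illustration of encouraging distinct nonlinear components.

```python
# Not the published method: a plain autoencoder whose bottleneck features are
# pushed toward distinctness with a decorrelation penalty, illustrating the goal
# of separated nonlinear components.
import torch
import torch.nn as nn

class Autoassociative(nn.Module):
    def __init__(self, d_in, d_code=2, hidden=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_in, hidden), nn.Tanh(),
                                 nn.Linear(hidden, d_code))
        self.dec = nn.Sequential(nn.Linear(d_code, hidden), nn.Tanh(),
                                 nn.Linear(hidden, d_in))

    def forward(self, x):
        z = self.enc(x)
        return self.dec(z), z

def decorrelation_penalty(z):
    """Sum of squared off-diagonal covariances of the bottleneck features."""
    zc = z - z.mean(dim=0)
    cov = zc.T @ zc / (len(z) - 1)
    off = cov - torch.diag(torch.diag(cov))
    return (off ** 2).sum()

# Training objective (sketch):
#   x_hat, z = model(x); loss = ((x_hat - x) ** 2).mean() + lam * decorrelation_penalty(z)
```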

11.
Int Urol Nephrol ; 48(2): 249-56, 2016 Feb.
Article in English | MEDLINE | ID: mdl-26661258

ABSTRACT

PURPOSE: Predictive models allow clinicians to identify higher- and lower-risk patients and make targeted treatment decisions. Microalbuminuria (MA) is a condition whose presence is understood to be an early marker for cardiovascular disease. The aims of this study were to develop a patient data-driven predictive model and a risk-score assessment to improve the identification of MA. METHODS: The 2007-2008 National Health and Nutrition Examination Survey (NHANES) was utilized to create a predictive model. The dataset was split into thirds; one-third was used to develop the model, while the other two-thirds were utilized for internal validation. The 2012-2013 NHANES was used as an external validation database. Multivariate logistic regression was performed to create the model. Performance was evaluated using three criteria: (1) receiver operating characteristic curves; (2) pseudo-R² values; and (3) goodness of fit (Hosmer-Lemeshow). The model was then used to develop a risk-score chart. RESULTS: A model was developed using variables for which there was a significant relationship. Variables included were systolic blood pressure, fasting glucose, C-reactive protein, blood urea nitrogen, and alcohol consumption. The model performed well, and no significant differences were observed when utilized in the validation datasets. A risk score was developed, and the probability of developing MA for each score was calculated. CONCLUSION: The predictive model provides new evidence about variables related to MA and may be used by clinicians to identify at-risk patients and to tailor treatment. The risk score developed may allow clinicians to measure a patient's MA risk.
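A hedged sketch of how such a model and points-based risk score could be assembled from the variables named above; the scaling of coefficients into points and the column names are illustrative assumptions.

```python
# Sketch only: logistic model on the five reported predictors, with coefficients
# scaled and rounded into integer points. Column names and the scaling rule are
# illustrative assumptions; 'microalbuminuria' is assumed to be a 0/1 outcome.
import numpy as np
import statsmodels.api as sm

PREDICTORS = ["systolic_bp", "fasting_glucose", "crp", "bun", "alcohol"]

def fit_model_and_points(df):
    X = sm.add_constant(df[PREDICTORS])
    fit = sm.Logit(df["microalbuminuria"], X).fit()
    coefs = fit.params[PREDICTORS]
    points = np.round(coefs / coefs.abs().min())   # integer points per unit change
    return fit, points

def predicted_risk(fit, new_df):
    X_new = sm.add_constant(new_df[PREDICTORS], has_constant="add")
    return fit.predict(X_new)                      # probability of MA
```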


Subjects
Albuminuria/diagnosis, Biomarkers/analysis, Statistical Models, Nutrition Surveys/methods, Risk Assessment/methods, Adult, Albuminuria/blood, Albuminuria/epidemiology, Factual Databases, Female, Humans, Male, Middle Aged, Predictive Value of Tests, Prospective Studies, ROC Curve, Risk Factors, United States/epidemiology
12.
IEEE Trans Pattern Anal Mach Intell ; 35(11): 2796-802, 2013 Nov.
Article in English | MEDLINE | ID: mdl-24051736

ABSTRACT

Time series classification is an important task with many challenging applications. A nearest neighbor (NN) classifier with dynamic time warping (DTW) distance is a strong solution in this context. On the other hand, feature-based approaches have been proposed both as classifiers and as a means of providing insight into the series, but these approaches have problems handling translations and dilations in local patterns. Considering these shortcomings, we present a framework to classify time series based on a bag-of-features representation (TSBF). Multiple subsequences selected from random locations and of random lengths are partitioned into shorter intervals to capture local information. Consequently, features computed from these subsequences measure properties at different locations and dilations when viewed from the original series. This provides a feature-based approach that can handle warping (although differently from DTW). Moreover, a supervised learner (that handles mixed data types, different units, etc.) integrates location information into a compact codebook through class probability estimates. Additionally, relevant global features can easily supplement the codebook. TSBF is compared to NN classifiers and other alternatives (bag-of-words strategies, sparse spatial sample kernels, shapelets). Our experimental results show that TSBF provides better results than competitive methods on benchmark datasets from the UCR time series database.
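A hedged sketch of the subsequence-and-interval feature extraction described above (means, standard deviations, slopes, plus location information); the codebook built from class-probability estimates is omitted for brevity.

```python
# Sketch of the subsequence/interval features (mean, std, slope, plus location
# and length); the class-probability codebook step is omitted.
import numpy as np

rng = np.random.default_rng(0)

def interval_features(series, n_subseq=20, n_intervals=5, min_len=16):
    """series: 1-D numpy array with len(series) > min_len."""
    feats = []
    for _ in range(n_subseq):
        length = int(rng.integers(min_len, len(series) + 1))
        start = int(rng.integers(0, len(series) - length + 1))
        sub = series[start:start + length]
        for chunk in np.array_split(sub, n_intervals):
            t = np.arange(len(chunk))
            slope = np.polyfit(t, chunk, 1)[0] if len(chunk) > 1 else 0.0
            feats.append([chunk.mean(), chunk.std(), slope,
                          start / len(series), length / len(series)])
    return np.asarray(feats)   # rows feed a supervised learner (e.g. a forest)
```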


Subjects
Algorithms, Artificial Intelligence, Theoretical Models, Automated Pattern Recognition/methods, Computer Simulation
13.
Mol Biosyst ; 8(3): 804-17, 2012 Mar.
Article in English | MEDLINE | ID: mdl-22222464

ABSTRACT

Phenotypic characterization of individual cells provides crucial insights into intercellular heterogeneity and enables access to information that is unavailable from ensemble averaged, bulk cell analyses. Single-cell studies have attracted significant interest in recent years and spurred the development of a variety of commercially available and research-grade technologies. To quantify cell-to-cell variability of cell populations, we have developed an experimental platform for real-time measurements of oxygen consumption (OC) kinetics at the single-cell level. Unique challenges inherent to these single-cell measurements arise, and no existing data analysis methodology is available to address them. Here we present a data processing and analysis method that addresses challenges encountered with this unique type of data in order to extract biologically relevant information. We applied the method to analyze OC profiles obtained with single cells of two different cell lines derived from metaplastic and dysplastic human Barrett's esophageal epithelium. In terms of method development, three main challenges were considered for this heterogeneous dynamic system: (i) high levels of noise, (ii) the lack of a priori knowledge of single-cell dynamics, and (iii) the role of intercellular variability within and across cell types. Several strategies and solutions to address each of these three challenges are presented. Features such as slopes, intercepts, and breakpoints (change-points) were extracted for every OC profile and compared across individual cells and cell types. The results demonstrated that the extracted features revealed subtle differences between individual cells and their responses to cell-cell interactions. With minor modifications, this method can be used to process and analyze data from other acquisition and experimental modalities at the single-cell level, providing a valuable statistical framework for single-cell analysis.
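A minimal sketch of extracting slope, intercept, and change-point features from a noisy single-cell OC profile by scanning candidate breakpoints for a two-segment linear fit; the paper's full processing pipeline is more elaborate.

```python
# Sketch: fit two line segments for every candidate breakpoint and keep the split
# with the smallest residual error; the returned features (slopes, intercepts,
# breakpoint) mirror those named in the abstract.
import numpy as np

def two_segment_features(t, y, min_pts=5):
    """t, y: 1-D arrays of equal length, with len(t) > 2 * min_pts."""
    best = None
    for k in range(min_pts, len(t) - min_pts):
        p1 = np.polyfit(t[:k], y[:k], 1)
        p2 = np.polyfit(t[k:], y[k:], 1)
        sse = (np.sum((np.polyval(p1, t[:k]) - y[:k]) ** 2) +
               np.sum((np.polyval(p2, t[k:]) - y[k:]) ** 2))
        if best is None or sse < best[0]:
            best = (sse, t[k], p1, p2)
    sse, breakpoint_t, (slope1, int1), (slope2, int2) = best
    return {"breakpoint": breakpoint_t, "slope1": slope1, "intercept1": int1,
            "slope2": slope2, "intercept2": int2, "sse": sse}
```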


Subjects
Oxygen/metabolism, Single-Cell Analysis/methods, Barrett Esophagus/metabolism, Cell Line, Esophagus/metabolism, Humans, Linear Models
14.
IEEE Trans Neural Netw Learn Syst ; 23(4): 644-56, 2012 Apr.
Article in English | MEDLINE | ID: mdl-24805047

ABSTRACT

Kernel principal component analysis (KPCA) is a method widely used for denoising multivariate data. Using geometric arguments, we investigate why a projection operation inherent to all existing KPCA denoising algorithms can sometimes cause very poor denoising. Based on this, we propose a modification to the projection operation that remedies this problem and can be incorporated into any of the existing KPCA algorithms. Using toy examples and real datasets, we show that the proposed algorithm can substantially improve denoising performance and is more robust to misspecification of an important tuning parameter.
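For context, a sketch of the standard KPCA denoising pipeline (projection onto leading kernel principal components followed by an approximate pre-image) that the proposed modification improves on; this is the baseline, not the modified projection itself.

```python
# Baseline sketch only: standard KPCA denoising via projection and approximate
# pre-image, as available in scikit-learn. The paper's contribution is a
# modification of the projection step, which is not reproduced here.
from sklearn.decomposition import KernelPCA

def kpca_denoise(X_train, X_noisy, n_components=4, gamma=1.0):
    kpca = KernelPCA(n_components=n_components, kernel="rbf", gamma=gamma,
                     fit_inverse_transform=True, alpha=0.1)
    kpca.fit(X_train)
    return kpca.inverse_transform(kpca.transform(X_noisy))
```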

15.
Mol Biosyst ; 7(4): 1093-104, 2011 Apr.
Article in English | MEDLINE | ID: mdl-21212895

ABSTRACT

Despite significant improvements in recent years, currently available proteomic datasets still suffer from a large number of missing values. Integrative analyses based upon incomplete proteomic and transcriptomic datasets could seriously bias the biological interpretation. In this study, we applied a non-linear, data-driven stochastic gradient boosted trees (GBT) model to impute missing proteomic values using a temporal transcriptomic and proteomic dataset of Shewanella oneidensis. In this dataset, gene expression was measured after the cells were exposed to 1 mM potassium chromate for 5, 30, 60, and 90 min, while protein abundance was measured at 45 and 90 min. With the ultimate objective of imputing protein values for experimentally undetected samples at 45 and 90 min, we applied a serial set of algorithms to capture relationships between temporal gene and protein expression. This work follows four main steps: (1) a quality control step for gene expression reliability, (2) mRNA imputation, (3) protein prediction, and (4) validation. Initially, an S control chart approach is applied to gene expression replicates to remove unwanted variability. Then, we focused on the missing measurements of gene expression through nonlinear smoothing-spline curve fitting. This method identifies temporal relationships among transcriptomic data at different time points and enables imputation of mRNA abundance at 45 min. After mRNA imputation was validated by biological constraints (i.e., operons), we used a data-driven GBT model to impute protein abundance for the proteins experimentally undetected in the 45 and 90 min samples, based on relevant predictors such as temporal mRNA gene expression data and cellular functional roles. The imputed protein values were validated using biological constraints such as operon and pathway information through a permutation test to investigate whether dispersion measures are indeed smaller for known biological groups than for any set of random genes. Finally, we demonstrated that such missing value imputation improved characterization of the temporal response of S. oneidensis to chromate.
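A minimal sketch of the per-gene smoothing-spline step used to impute mRNA abundance at the unmeasured 45 min time point; the spline order and smoothing parameter are illustrative assumptions.

```python
# Sketch: fit a smoothing spline per gene through the 5/30/60/90 min measurements
# and evaluate it at 45 min. Spline order k and smoothing factor s are
# illustrative, not the published settings.
import numpy as np
from scipy.interpolate import UnivariateSpline

TIMES = np.array([5.0, 30.0, 60.0, 90.0])

def impute_mrna_at(expr_matrix, t_new=45.0, k=2, s=0.5):
    """expr_matrix: (n_genes, 4) expression values at the measured time points."""
    imputed = np.empty(len(expr_matrix))
    for i, y in enumerate(expr_matrix):
        spline = UnivariateSpline(TIMES, y, k=k, s=s)
        imputed[i] = spline(t_new)
    return imputed
```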


Subjects
Gene Expression Profiling, Proteomics, Shewanella/genetics, Shewanella/metabolism, Algorithms, Chromates/pharmacology, Computational Biology, Environmental Pollutants/pharmacology, Bacterial Gene Expression Regulation/drug effects, Statistical Models, Potassium Compounds/pharmacology, Quality Control, Shewanella/drug effects, Time Factors
16.
Bioinformatics ; 25(15): 1905-14, 2009 Aug 01.
Article in English | MEDLINE | ID: mdl-19447782

ABSTRACT

MOTIVATION: Gene expression profiling technologies can generally produce mRNA abundance data for all genes in a genome. A dearth of proteomic data persists because the identification range and sensitivity of proteomic measurements lag behind those of transcriptomic measurements. When only partial proteomic data are available, integrative transcriptomic and proteomic analysis is likely to introduce significant bias. Developing methodologies to accurately estimate missing proteomic data will allow better integration of transcriptomic and proteomic datasets and provide deeper insight into the metabolic mechanisms underlying complex biological systems. RESULTS: In this study, we present a non-linear, data-driven model to predict abundance for undetected proteins using two independent datasets of cognate transcriptomic and proteomic data collected from Desulfovibrio vulgaris. We use stochastic gradient boosted trees (GBT) to uncover possible non-linear relationships between transcriptomic and proteomic data, and to predict protein abundance for proteins not experimentally detected, based on relevant predictors such as mRNA abundance, cellular role, molecular weight, sequence length, protein length, guanine-cytosine (GC) content, and triple codon counts. Initially, we constructed a GBT model using all possible variables to assess their relative importance and characterize the behavior of the predictive model. A strong plateau effect occurred in this model in the regions of high mRNA values and sparse data. Hence, we removed genes in those areas based on thresholds estimated from the partial dependency plots where this behavior was captured. At this stage, only the strongest predictors of protein abundance were retained to reduce the complexity of the GBT model. After removing genes in the plateau region, mRNA abundance, the main cellular functional categories, and a few triple codon counts emerged as the top-ranked predictors of protein abundance. We then created a new, tuned GBT model using the five most significant predictors. Our non-linear model consists of a set of serial regression tree models with implicit strength in variable selection. The model provides relative variable importance measures using mean squared error as the criterion. The results showed that the coefficients of determination for our nonlinear models ranged from 0.393 to 0.582 in both datasets, providing better results than the linear regression models used previously. We evaluated the validity of this non-linear model using biological information on operons, regulons, and pathways, and the results demonstrated that the coefficients of variation of estimated protein abundance values within operons, regulons, or pathways are indeed smaller than those for random groups of proteins. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
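A hedged sketch of the GBT regression and the partial-dependence check on mRNA abundance described above; hyperparameters and the feature layout are placeholders, not the published settings.

```python
# Sketch only: GBT regression of protein abundance on mRNA abundance plus
# sequence/annotation features, and a partial-dependence check on the mRNA
# feature to locate the plateau region trimmed before re-tuning.
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import partial_dependence

def fit_gbt(X, y):
    """X columns (assumed order): mRNA abundance, cellular role (encoded),
    molecular weight, sequence length, GC content, selected codon counts.
    y: measured protein abundance."""
    gbt = GradientBoostingRegressor(n_estimators=500, learning_rate=0.05,
                                    max_depth=3, subsample=0.8, random_state=0)
    return gbt.fit(X, y)

# Partial dependence on the mRNA feature (column 0); a flat high-mRNA region
# suggests the threshold above which genes are removed before re-fitting:
#   pdp = partial_dependence(model, X, features=[0])
```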


Subjects
Bacterial Proteins/chemistry, Bacterial Proteins/genetics, Desulfovibrio vulgaris/genetics, Desulfovibrio vulgaris/metabolism, Gene Expression Profiling/methods, Nonlinear Dynamics, Proteomics/methods, Protein Databases
17.
IEEE Trans Pattern Anal Mach Intell ; 31(7): 1338-44, 2009 Jul.
Article in English | MEDLINE | ID: mdl-19443930

ABSTRACT

This paper proposes a new feature selection methodology. The methodology is based on the stepwise variable selection procedure, but, instead of using traditional discriminant metrics such as Wilks' lambda, it uses an estimate of the misclassification error as the figure of merit for evaluating the introduction of new features. The expected misclassification error rate (MER) is obtained by using the densities of a constructed function of random variables, which is the stochastic representation of the conditional distribution of the quadratic discriminant function estimate. The application of the proposed methodology results in significant savings of computational time in the estimation of classification error over traditional simulation and cross-validation methods. One of the main advantages of the proposed method is that it provides a direct estimate of the expected misclassification error at the time of feature selection, giving an immediate assessment of the benefits of introducing an additional feature into an inspection/classification algorithm.
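A hedged sketch of error-driven stepwise selection in the spirit of the method above, using cross-validated QDA error as a stand-in for the paper's analytic estimate of the expected misclassification error rate.

```python
# Sketch: at each step, add the candidate feature whose inclusion gives the
# lowest estimated misclassification error. The error here is cross-validated
# QDA error, used only as a stand-in for the paper's analytic MER estimate.
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

def stepwise_by_mer(X, y, max_features=None):
    selected, best_err = [], np.inf
    remaining = list(range(X.shape[1]))
    while remaining and (max_features is None or len(selected) < max_features):
        errs = {j: 1 - cross_val_score(QuadraticDiscriminantAnalysis(),
                                       X[:, selected + [j]], y, cv=5).mean()
                for j in remaining}
        j_best = min(errs, key=errs.get)
        if errs[j_best] >= best_err:
            break                      # no candidate improves the error estimate
        best_err = errs[j_best]
        selected.append(j_best)
        remaining.remove(j_best)
    return selected, best_err
```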


Subjects
Algorithms, Artificial Intelligence, Equipment Failure Analysis/methods, Theoretical Models, Automated Pattern Recognition/methods, Computer Simulation