Results 1 - 4 of 4
1.
Bioinformatics ; 36(5): 1570-1576, 2020 03 01.
Article in English | MEDLINE | ID: mdl-31621830

ABSTRACT

MOTIVATION: Matched case-control analysis is widely used in biomedical studies to identify exposure variables associated with health conditions. Matching is used to improve efficiency. Existing variable selection methods for matched case-control studies struggle in high-dimensional settings where interactions among variables are also important. We describe a different method for high-dimensional matched case-control data, based on the potential outcome model, which is not only flexible regarding the number of matching and exposure variables but also able to detect interaction effects. RESULTS: We present Matched Forest (MF), an algorithm for variable selection in matched case-control data. The method preserves the case and control values in each instance but transforms the matched case-control data with added counterfactuals. A modified variable importance score from a supervised learner is used to detect important variables. The method is conceptually simple and can be applied with widely available software tools. Simulation studies show the effectiveness of MF in identifying important variables. MF is also applied to data from the biomedical domain, and its performance is compared with alternative approaches. AVAILABILITY AND IMPLEMENTATION: R code for implementing MF is available at https://github.com/NooshinSh/Matched_Forest. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Algorithms, Software, Case-Control Studies, Forests, Supervised Machine Learning
2.
BMC Genomics ; 19(1): 841, 2018 Nov 27.
Article in English | MEDLINE | ID: mdl-30482155

ABSTRACT

BACKGROUND: Copy number alterations (CNAs) are defined as somatic gains or losses of DNA regions. CNA profiles may provide a fingerprint specific to a tumor type or tumor grade. Low-coverage sequencing for reporting CNAs has recently gained interest since it was successfully translated into clinical applications. Ovarian serous carcinomas can be classified into two largely mutually exclusive grades, low grade and high grade, based on their histologic features. Grade classification based on genomics may provide valuable clues on how best to manage these patients in the clinic. Based on a study of ovarian serous carcinomas, we explore the methodology of combining CNA reporting from low-coverage sequencing with machine learning techniques to stratify tumor biospecimens of different grades. RESULTS: We have developed a data-driven methodology for tumor classification using the profiles of CNAs reported by low-coverage sequencing. The proposed method, called Bag-of-Segments, is used to summarize fixed-length CNA features predictive of tumor grades. These features are further processed by machine learning techniques to obtain classification models. High accuracy is obtained for classifying ovarian serous carcinoma into high and low grades based on leave-one-out cross-validation experiments. Models that are only weakly influenced by sequence coverage and sample purity can also be built, which would be more relevant for clinical applications. The patterns captured by Bag-of-Segments features correlate with current clinical knowledge: low-grade ovarian tumors are related to aneuploidy events associated with mitotic errors, while high-grade ovarian tumors are induced by DNA repair gene malfunction. CONCLUSIONS: The proposed data-driven method obtains high accuracy with various parametrizations for the ovarian serous carcinoma study, indicating that it has good generalization potential towards other CNA classification problems. This method could be applied to the more difficult task of classifying ovarian serous carcinomas with ambiguous histology or those in which low-grade tumor co-exists with high-grade tumor. The closer genomic relationship of these tumor samples to low or high grade may provide important clinical value.
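
The leave-one-out cross-validation protocol mentioned in the abstract can be sketched as follows. This is an illustration of the evaluation procedure only: the nearest-centroid classifier, the function names, and the toy feature vectors are assumptions made for the example, not the Bag-of-Segments features or the models used by the authors.

```python
from collections import defaultdict


def nearest_centroid_predict(train_x, train_y, query):
    """Predict the label whose class centroid is closest to `query` (hypothetical stand-in classifier)."""
    groups = defaultdict(list)
    for features, label in zip(train_x, train_y):
        groups[label].append(features)
    best_label, best_dist = None, float("inf")
    for label, rows in groups.items():
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(centroid, query))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def leave_one_out_accuracy(xs, ys):
    """Hold out each sample once, fit on the remaining samples, and score the held-out one."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if nearest_centroid_predict(train_x, train_y, xs[i]) == ys[i]:
            correct += 1
    return correct / len(xs)


# Made-up feature vectors standing in for fixed-length CNA features.
toy_x = [[0.1, 0.2], [0.2, 0.1], [0.0, 0.3], [0.9, 0.8], [1.0, 0.7], [0.8, 1.0]]
toy_y = ["low", "low", "low", "high", "high", "high"]
print(leave_one_out_accuracy(toy_x, toy_y))
```

Leave-one-out is a common choice when the number of samples is small, because every sample is used for testing exactly once while the rest are used for training.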


Subjects
Serous Cystadenocarcinoma/classification, DNA Copy Number Variations, Data Science/methods, Human Genome, Ovarian Neoplasms/classification, Serous Cystadenocarcinoma/genetics, Serous Cystadenocarcinoma/pathology, Female, Humans, Neoplasm Grading, Ovarian Neoplasms/genetics, Ovarian Neoplasms/pathology, Whole Genome Sequencing
3.
IEEE/ACM Trans Comput Biol Bioinform ; 18(4): 1620-1631, 2021.
Article in English | MEDLINE | ID: mdl-31675340

ABSTRACT

Topological data analysis (TDA) is a powerful method for reducing data dimensionality, mining underlying data relationships, and intuitively representing the data structure. The Mapper algorithm is one such tool that projects high-dimensional data to 1-dimensional space by using a filter function that is subsequently used to reconstruct the data topology relationships. However, domain context information and prior knowledge have not been considered in current TDA modeling frameworks. Here, we report the development and evaluation of a semi-supervised topological analysis (STA) framework that incorporates discrete or continuously labeled data points and selects the most relevant filter functions accordingly. We validate the proposed STA framework with simulation data and then apply it to samples from Genotype-Tissue Expression data and ovarian cancer transcriptome datasets. The graphs generated by STA for these 2 datasets, based on gene expression profiles, are consistent with prior knowledge, thereby supporting the effectiveness of the proposed framework.
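
The general Mapper construction referred to in the abstract (filter to one dimension, cover the filter range with overlapping intervals, cluster within each interval, link clusters that share points) can be sketched as follows. This is a minimal illustration of plain Mapper under stated assumptions: the function names, the single-linkage clustering rule, and the parameters `n_intervals`, `overlap`, and `eps` are choices made for the example. It does not implement the label-guided filter selection of the STA framework, which the abstract does not detail.

```python
from itertools import combinations
import math


def single_linkage_clusters(indices, points, eps):
    """Group indices into connected components of the 'distance <= eps' graph."""
    remaining = set(indices)
    clusters = []
    while remaining:
        seed = remaining.pop()
        component, frontier = {seed}, [seed]
        while frontier:
            current = frontier.pop()
            near = {j for j in remaining if math.dist(points[current], points[j]) <= eps}
            remaining -= near
            component |= near
            frontier.extend(near)
        clusters.append(frozenset(component))
    return clusters


def mapper_graph(points, filter_fn, n_intervals=4, overlap=0.3, eps=0.5):
    """Return (nodes, edges): nodes are clusters of point indices, edges join clusters sharing a point."""
    values = [filter_fn(p) for p in points]
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_intervals or 1.0  # guard against a constant filter
    nodes = []
    for k in range(n_intervals):
        # Widen each interval on both sides so that neighbouring intervals overlap.
        start = lo + k * width - overlap * width
        end = lo + (k + 1) * width + overlap * width
        members = [i for i, v in enumerate(values) if start <= v <= end]
        nodes.extend(single_linkage_clusters(members, points, eps))
    edges = {(a, b) for a, b in combinations(range(len(nodes)), 2) if nodes[a] & nodes[b]}
    return nodes, edges


# Ten points on a line, filtered by their x coordinate.
line = [(float(i), 0.0) for i in range(10)]
nodes, edges = mapper_graph(line, lambda p: p[0], n_intervals=3, overlap=0.3, eps=1.0)
print(len(nodes), sorted(edges))
```

The overlap between intervals is what produces edges: a point falling in two adjacent intervals belongs to a cluster in each, and those two clusters become connected nodes in the output graph.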


Subjects
Computational Biology/methods, Data Mining/methods, Gene Expression Profiling/methods, Supervised Machine Learning, Transcriptome/genetics, Algorithms, Computer Simulation, Genetic Databases, Female, Humans, Ovarian Neoplasms/genetics
4.
IEEE Trans Neural Netw Learn Syst ; 29(10): 4709-4718, 2018 10.
Article in English | MEDLINE | ID: mdl-29990242

ABSTRACT

In this paper, we propose a new end-to-end deep neural network model for time-series classification (TSC) with emphasis on both accuracy and interpretability. The proposed model contains a convolutional network component to extract high-level features and a recurrent network component to enhance the modeling of the temporal characteristics of time-series (TS) data. In addition, a feedforward fully connected network with sparse group lasso (SGL) regularization is used to generate the final classification. The proposed architecture not only achieves satisfactory classification accuracy but also obtains good interpretability through the SGL regularization. All these networks are connected and jointly trained in an end-to-end framework, which can be applied to TSC tasks across different domains without feature engineering. Our experiments on various TS data sets show that the proposed model outperforms the traditional convolutional neural network model in classification accuracy, and also demonstrate how the SGL contributes to better model interpretation.
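
For orientation, the standard sparse group lasso penalty combines a group-wise L2 term with an element-wise L1 term. The sketch below uses the common textbook form; the function name, the `lam` and `alpha` parameters, and the way weights are grouped are assumptions made for illustration, since the abstract does not state how the authors group or weight the terms.

```python
import math


def sparse_group_lasso_penalty(groups, lam=0.01, alpha=0.5):
    """Penalty = lam * ((1 - alpha) * sum_g sqrt(p_g) * ||w_g||_2 + alpha * ||w||_1).

    `groups` is a list of weight groups, each a list of floats; p_g is the group size.
    """
    group_term = sum(math.sqrt(len(g)) * math.sqrt(sum(w * w for w in g)) for g in groups)
    l1_term = sum(abs(w) for g in groups for w in g)
    return lam * ((1 - alpha) * group_term + alpha * l1_term)


# Two hypothetical weight groups: one active, one already zeroed out.
example_groups = [[3.0, 4.0], [0.0, 0.0]]
print(sparse_group_lasso_penalty(example_groups, lam=1.0, alpha=0.5))
```

The group term can drive whole groups of weights to zero together, while the L1 term encourages sparsity within the groups that remain, which is the usual reason this penalty is associated with interpretability.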
