Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 3 de 3
Filtrar
Mais filtros

Base de dados
Tipo de documento
País de afiliação
Intervalo de ano de publicação
1.
AMIA Annu Symp Proc ; 2021: 891-899, 2021.
Artigo em Inglês | MEDLINE | ID: mdl-35309001

RESUMO

The persistence and emergence of new multi-drug resistant Mycobacterium tuberculosis (M. tb) strains continues to advance the devastating tuberculosis (TB) epidemic. Robust systems are needed to accurately and rapidly perform drug-resistance profiling, and machine learning (ML) methods combined with genomic sequence data may provide novel insights into drug-resistance mechanisms. Using 372 M. tb isolates, the combined utility of ML and bioinformatics to perform drug-resistance profiling is demonstrated. SNPs, InDels, and dinucleotide frequencies are explored as input features for three ML models, namely Decision Trees, Random Forest, and the eXtreme Gradient Boosted model. Using SNPs and InDels, all three models performed equally well yielding a 99% accuracy, 97% recall, and 99% F1-score. Using dinucleotide frequencies, the XGBoost algorithm was superior with a 97% accuracy, 94% recall and 97% F1-score. This study validates the use of variants and presents dinucleotide features as another effective feature encoding method for ML-based phenotype classification.


Assuntos
Antituberculosos , Farmacorresistência Bacteriana Múltipla , Aprendizado de Máquina , Mycobacterium tuberculosis , Tuberculose , Antituberculosos/farmacologia , Antituberculosos/uso terapêutico , Farmacorresistência Bacteriana Múltipla/genética , Humanos , Mycobacterium tuberculosis/efeitos dos fármacos , Mycobacterium tuberculosis/genética , Tuberculose/tratamento farmacológico
2.
IEEE Access ; 8: 195263-195273, 2020.
Artigo em Inglês | MEDLINE | ID: mdl-34976561

RESUMO

The world is grappling with the COVID-19 pandemic caused by the 2019 novel SARS-CoV-2. To better understand this novel virus and its relationship with other pathogens, new methods for analyzing the genome are required. In this study, intrinsic dinucleotide genomic signatures were analyzed for whole genome sequence data of eight pathogenic species, including SARS-CoV-2. The genome sequences were transformed into dinucleotide relative frequencies and classified using the extreme gradient boosting (XGBoost) model. The classification models were trained to a) distinguish between the sequences of all eight species and b) distinguish between sequences of SARS-CoV-2 that originate from different geographic regions. Our method attained 100% in all performance metrics and for all tasks in the eight-species classification problem. Moreover, the models achieved 67% balanced accuracy for the task of classifying the SARS-CoV-2 sequences into the six continental regions and achieved 86% balanced accuracy for the task of classifying SARS-CoV-2 samples as either originating from Asia or not. Analysis of the dinucleotide genomic profiles of the eight species revealed a similarity between the SARS-CoV-2 and MERS-CoV viral sequences. Further analysis of SARS-CoV-2 viral sequences from the six continents revealed that samples from Oceania had the highest frequency of TT dinucleotides as well as the lowest CG frequency compared to the other continents. The dinucleotide signatures of AC, AG,CA, CT, GA, GT, TC, and TG were well conserved across most genomes, while the frequencies of other dinucleotide signatures varied considerably. Altogether, the results from this study demonstrate the utility of dinucleotide relative frequencies for discriminating and identifying similar species.

3.
J Bioinform Comput Biol ; 14(5): 1650022, 2016 10.
Artigo em Inglês | MEDLINE | ID: mdl-27411306

RESUMO

Microarray for transcriptomics experiments often suffer from limited statistical power due to small sample size. Quantile discretization (QD) maps expression values for a sample into a series of equivalently sized 'bins' that represent a discrete numerical range, e.g. [Formula: see text]4 to [Formula: see text]4, which enables normalized data from multiple experiments and/or expression platforms to be combined for re-analysis. We found, however, that informal selection of bin numbers often resulted in loss of the underlying correlation structure in the data through assigning of the same numerical value to genes that are in reality expressed at significantly different levels within a sample. Here we report a procedure for determining an optimal bin number for dataset. Applying this to integrated public breast cancer datasets enabled statistical identification of several differentially expressed tumorigenesis-related genes that were not found when analyzing the individual datasets, and also several cancer biomarkers not previously indicated as having utility in the disease. Notably, differential modulation of translational control and protein synthesis via multiple pathways were found to potentially have central roles in breast cancer development and progression. These findings suggest that our protocol has significant utility in making meaningful novel biomedical discoveries by leveraging the large public expression data repositories.


Assuntos
Algoritmos , Neoplasias da Mama/genética , Regulação Neoplásica da Expressão Gênica , Análise de Sequência com Séries de Oligonucleotídeos/métodos , Biomarcadores Tumorais/genética , Bases de Dados Genéticas , Feminino , Humanos , Masculino , Modelos Teóricos , Fenótipo , Neoplasias da Próstata/genética
SELEÇÃO DE REFERÊNCIAS
DETALHE DA PESQUISA