Results 1 - 12 of 12
1.
PLoS One ; 18(4): e0284619, 2023.
Article in English | MEDLINE | ID: mdl-37098036

ABSTRACT

Feature selection in high-dimensional gene expression datasets reduces not only the dimension of the data but also the execution time and computational cost of the underlying classifier. The current study introduces a novel feature selection method called weighted signal to noise ratio (WSNR), which exploits feature weights based on support vectors and the signal to noise ratio, with the objective of identifying the most informative genes in high-dimensional classification problems. The combination of two state-of-the-art procedures enables the extraction of the most informative genes. The corresponding weights of these procedures are then multiplied and arranged in decreasing order. A larger weight of a feature indicates greater discriminatory power in classifying the tissue samples to their true classes. The current method is validated on eight gene expression datasets. Moreover, results of the proposed method (WSNR) are compared with four well-known feature selection methods. We found that WSNR outperforms the other competing methods on 6 out of 8 datasets. Box plots and bar plots of the results of the proposed method and all the other methods are also constructed. The proposed method is further assessed on simulated data. Simulation analysis reveals that WSNR outperforms all the other methods included in the study.
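
The ranking step described in this abstract (multiply two per-feature weights, then sort in decreasing order) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the signal to noise ratio is assumed to take the common form |mean difference| / (sum of standard deviations), the support-vector weights are assumed to be precomputed and nonnegative, and all gene names and numbers are hypothetical.

```python
from statistics import mean, stdev

def snr(values_a, values_b):
    """Signal to noise ratio of one feature across two classes (assumed form)."""
    spread = stdev(values_a) + stdev(values_b)
    return abs(mean(values_a) - mean(values_b)) / spread if spread > 0 else 0.0

def rank_features(class_a, class_b, svm_weights):
    """Rank features by (support-vector weight x SNR), largest first.

    class_a, class_b: dict mapping feature name -> expression values in that class
    svm_weights: dict mapping feature name -> nonnegative weight (assumed precomputed)
    """
    scores = {g: svm_weights[g] * snr(class_a[g], class_b[g]) for g in class_a}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical toy data: three genes, three samples per class.
class_a = {"g1": [1.0, 1.2, 0.8], "g2": [5.0, 5.1, 4.9], "g3": [2.0, 3.0, 4.0]}
class_b = {"g1": [3.0, 3.2, 2.8], "g2": [5.0, 5.2, 4.8], "g3": [2.5, 3.5, 4.5]}
weights = {"g1": 0.9, "g2": 0.1, "g3": 0.4}

ranked = rank_features(class_a, class_b, weights)
for gene, score in ranked:
    print(gene, round(score, 3))
```

Genes at the top of `ranked` are the ones such a scheme would pass on to the classifier; how many to keep is a separate choice not fixed by this sketch.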


Subjects
Algorithms , Gene Expression Profiling , Gene Expression Profiling/methods , Signal-To-Noise Ratio , Microarray Analysis , Gene Expression
2.
J Healthc Eng ; 2021: 2567080, 2021.
Article in English | MEDLINE | ID: mdl-34512933

ABSTRACT

In this paper, we focus on machine learning (ML) feature selection (FS) algorithms for identifying and diagnosing multidrug-resistant (MDR) tuberculosis (TB). MDR-TB is a universal public health problem, and its early detection remains a pressing issue. The present study was conducted in the Malakand Division of Khyber Pakhtunkhwa, Pakistan, to add to the knowledge on the disease and to address the identification and early detection of MDR-TB with ML algorithms. These models also identify the most important factors causing MDR-TB infection, whose study gives additional insight into the matter. ML algorithms such as random forest (RF), k-nearest neighbors, support vector machine (SVM), logistic regression, least absolute shrinkage and selection operator (LASSO), artificial neural networks (ANNs), and decision trees are applied to analyse the case-control dataset. This study reveals that close contacts of MDR-TB patients, smoking, depression, previous TB history, improper treatment, and interruption in first-line TB treatment have a great impact on MDR status. Weight loss, chest pain, hemoptysis, and fatigue are important symptoms. Based on accuracy, sensitivity, and specificity, SVM and RF are the suggested models for patient classification.
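
The three criteria used here to compare models (accuracy, sensitivity, specificity) all derive from a 2x2 confusion matrix. A minimal sketch, with hypothetical counts that are not taken from the study:

```python
def classification_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity and specificity from confusion-matrix counts.

    tp/fn: cases correctly/incorrectly classified; tn/fp: the same for controls.
    """
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,     # share of all subjects classified correctly
        "sensitivity": tp / (tp + fn),     # share of true cases detected
        "specificity": tn / (tn + fp),     # share of true controls recognised
    }

# Hypothetical counts for one classifier on a case-control test set.
metrics = classification_metrics(tp=40, fn=10, tn=45, fp=5)
print(metrics)
```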


Subjects
Antitubercular Agents , Multidrug-Resistant Tuberculosis , Algorithms , Antitubercular Agents/therapeutic use , Humans , Machine Learning , Pakistan , Multidrug-Resistant Tuberculosis/diagnosis , Multidrug-Resistant Tuberculosis/drug therapy , Multidrug-Resistant Tuberculosis/epidemiology
3.
PeerJ Comput Sci ; 7: e562, 2021.
Article in English | MEDLINE | ID: mdl-34141889

ABSTRACT

In this paper, a novel feature selection method for microarray gene expression datasets, called Robust Proportional Overlapping Score (RPOS), is proposed. It uses a robust measure of dispersion, the Median Absolute Deviation (MAD). This method robustly identifies the most discriminative genes by considering the overlapping scores of the gene expression values for binary class problems. Genes with a high degree of overlap between classes are discarded and the ones that discriminate between the classes are selected. The results of the proposed method are compared with five state-of-the-art gene selection methods based on classification error, Brier score, and sensitivity, by considering eleven gene expression datasets. Classification of observations for different sets of selected genes by the proposed method is carried out by three different classifiers, i.e., random forest, k-nearest neighbors (k-NN), and support vector machine (SVM). Box plots and stability scores of the results are also shown in this paper. The results reveal that in most cases the proposed method outperforms the other methods.
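
To make the idea of a MAD-based overlap concrete, the sketch below builds a robust interval (median plus or minus k times the MAD) for each class and reports the fraction of samples that fall where the two intervals intersect. It is a simplified illustration and not the RPOS formula from the paper: the multiplier `k`, the function names and the data are all assumptions.

```python
from statistics import median

def mad(values):
    """Median absolute deviation around the median."""
    center = median(values)
    return median([abs(v - center) for v in values])

def core_interval(values, k=2.0):
    """Robust interval median +/- k * MAD (k is an assumed tuning constant)."""
    center, spread = median(values), mad(values)
    return center - k * spread, center + k * spread

def overlap_fraction(class_a, class_b, k=2.0):
    """Fraction of all samples lying inside the intersection of both core intervals."""
    lo_a, hi_a = core_interval(class_a, k)
    lo_b, hi_b = core_interval(class_b, k)
    lo, hi = max(lo_a, lo_b), min(hi_a, hi_b)
    if lo > hi:                      # intervals are disjoint
        return 0.0
    pooled = list(class_a) + list(class_b)
    return sum(lo <= v <= hi for v in pooled) / len(pooled)

# Hypothetical gene with one outlier (9.0): the median and MAD barely react to it.
separated = overlap_fraction([1.0, 1.1, 0.9, 1.2, 9.0], [3.0, 3.1, 2.9, 3.2, 3.0])
mixed = overlap_fraction([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
print(separated, mixed)
```

A gene with a low overlap fraction separates the classes well; one with a high fraction would be a candidate for discarding.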

4.
PeerJ Comput Sci ; 7: e746, 2021.
Article in English | MEDLINE | ID: mdl-35036527

ABSTRACT

BACKGROUND: Forecasting the time of a forthcoming pandemic reduces the impact of disease by allowing precautionary steps such as public health messaging and raising the awareness of doctors. With the continuous and rapid increase in the cumulative incidence of COVID-19, statistical and outbreak prediction models, including various machine learning (ML) models, are being used by the research community to track and predict the trend of the epidemic and to develop appropriate strategies to combat and manage its spread. METHODS: In this paper, we present a comparative analysis of various ML approaches, including Support Vector Machine, Random Forest, K-Nearest Neighbor and Artificial Neural Network, in predicting the COVID-19 outbreak in the epidemiological domain. We first apply the autoregressive distributed lag (ARDL) method to identify and model the short- and long-run relationships of the time-series COVID-19 datasets. That is, we determine the lags between a response variable and its respective explanatory time series variables as independent variables. Then, the resulting significant variables and their lags are used in the regression model selected by the ARDL for predicting and forecasting the trend of the epidemic. RESULTS: Statistical measures (Root Mean Square Error (RMSE), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE) and Symmetric Mean Absolute Percentage Error (SMAPE)) are used to assess model accuracy. The values of MAPE for the best-selected models for confirmed, recovered and death cases are 0.003, 0.006 and 0.115, respectively, which fall under the category of highly accurate forecasts. In addition, we computed 15-day-ahead forecasts for the daily deaths, recovered, and confirmed patients, and the cases fluctuated across time in all aspects. The results also reveal the advantages of ML algorithms for supporting decision-making on evolving short-term policies.
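
The four accuracy measures named in the results can be computed as below. This is a sketch under stated assumptions: MAPE and SMAPE are returned as fractions rather than percentages, SMAPE uses the variant with |actual| + |forecast| in the denominator (several definitions exist, and the abstract does not say which was used), and the series is hypothetical.

```python
import math

def forecast_errors(actual, forecast):
    """RMSE, MAE, MAPE and SMAPE for two equal-length series.

    Assumes no actual value is zero (MAPE is undefined otherwise).
    """
    n = len(actual)
    diffs = [f - a for a, f in zip(actual, forecast)]
    return {
        "rmse": math.sqrt(sum(d * d for d in diffs) / n),
        "mae": sum(abs(d) for d in diffs) / n,
        "mape": sum(abs(d) / abs(a) for a, d in zip(actual, diffs)) / n,
        "smape": sum(
            2 * abs(d) / (abs(a) + abs(f))
            for a, f, d in zip(actual, forecast, diffs)
        ) / n,
    }

# Hypothetical daily counts and their forecasts.
errors = forecast_errors([100, 200, 400], [110, 190, 400])
print({name: round(value, 4) for name, value in errors.items()})
```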

5.
PLoS One ; 15(11): e0242762, 2020.
Article in English | MEDLINE | ID: mdl-33253248

ABSTRACT

OBJECTIVES: Forecasting epidemics like COVID-19 is of crucial importance. It will help not only governments but also medical practitioners to know the future trajectory of the spread, which might help them with the best possible treatments, precautionary measures and protections. In this study, the popular autoregressive integrated moving average (ARIMA) model is used to forecast the cumulative number of confirmed cases, recovered cases, and deaths from COVID-19 in Pakistan from June 25, 2020 to July 04, 2020 (10-day-ahead forecast). METHODS: To meet the desired objectives, data for this study were taken from the website of the Ministry of National Health Service of Pakistan, covering February 27, 2020 to June 24, 2020. Two different ARIMA models are used to obtain the 10-day-ahead point and 95% interval forecasts of the cumulative confirmed cases, recovered cases, and deaths. The statistical software RStudio, with the "forecast", "ggplot2", "tseries", and "seasonal" packages, was used for data analysis. RESULTS: The forecasted cumulative confirmed cases, recovered cases, and deaths up to July 04, 2020 are 231239 with a 95% prediction interval of (219648, 242832), 111616 with a prediction interval of (101063, 122168), and 5043 with a 95% prediction interval of (4791, 5295), respectively. Statistical measures, i.e., root mean square error (RMSE) and mean absolute error (MAE), are used to assess model accuracy. It is evident from the analysis results that the ARIMA and seasonal ARIMA models are better than the other time series models in terms of forecasting accuracy and hence are recommended for forecasting epidemics like COVID-19. CONCLUSION: It is concluded from this study that the forecasting accuracy of ARIMA models in terms of RMSE and MAE is better than that of the other time series models, and they could therefore be considered a good tool for forecasting the spread, recoveries, and deaths from the current outbreak of COVID-19. In addition, this study can help decision-makers develop short-term strategies with regard to the current number of disease occurrences until an appropriate medication is developed.
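
For readers unfamiliar with the model family, the sketch below implements its simplest member, ARIMA(0,1,0) with drift (a random walk with drift): difference the cumulative series once, estimate the drift as the mean daily increase, and project it forward. This is not one of the two models fitted in the study, whose orders the abstract does not report, and the series is hypothetical.

```python
def drift_forecast(series, horizon):
    """Point forecasts from ARIMA(0,1,0) with drift.

    series: cumulative counts in time order; horizon: number of steps ahead.
    """
    diffs = [later - earlier for earlier, later in zip(series, series[1:])]  # d = 1
    drift = sum(diffs) / len(diffs)          # mean daily increase
    return [series[-1] + drift * step for step in range(1, horizon + 1)]

# Hypothetical cumulative case counts for five consecutive days.
cumulative = [100, 120, 150, 170, 200]
forecast = drift_forecast(cumulative, horizon=3)
print(forecast)
```

Richer ARIMA models add autoregressive and moving-average terms on the differenced series; in R these are typically fitted with the "forecast" package mentioned in the abstract.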


Subjects
COVID-19/epidemiology , Forecasting , Humans , Statistical Models , Pakistan/epidemiology , Seasons
6.
J Pak Med Assoc ; 70(7): 1169-1172, 2020 Jul.
Article in English | MEDLINE | ID: mdl-32799268

ABSTRACT

OBJECTIVE: To assess the risk factors associated with tonsillitis. METHODS: The cross-sectional study was conducted at Mardan Medical Complex and District Headquarter Hospital, Mardan, Pakistan, from January to June 2018, and comprised tonsillitis patients. Data were collected using a questionnaire which included different risk factors such as age 1-10 years, gender, residential area, dietary habit, etc. Data were analysed using SPSS 20. RESULTS: Of the 325 subjects, 200 (61.54%) were clinically diagnosed with tonsillitis, 138 (69%) of them males. Age, unhygienic living condition, balanced diet, stressful environment and the use of sore/spicy foods were identified as significantly associated factors (p<0.05). CONCLUSIONS: Age, unhygienic living condition, balanced diet, stressful environment and the use of sore/spicy food were found to have a strong association with tonsillitis.


Subjects
Tonsillitis , Child , Preschool Child , Cross-Sectional Studies , Feeding Behavior , Humans , Infant , Male , Pakistan/epidemiology , Risk Factors , Tonsillitis/epidemiology
7.
J Pak Med Assoc ; 70(12(B)): 2356-2362, 2020 Dec.
Article in English | MEDLINE | ID: mdl-33475543

ABSTRACT

OBJECTIVE: The aim of this study is to filter out the most informative genes that mainly regulate the target tissue class, increase classification accuracy, reduce the curse of dimensionality, and discard redundant and irrelevant genes. METHOD: This paper presents the idea of gene selection using bagging sub-forest (BSF). The proposed method provides gene importance grounded on the idea specified in the standard random forest algorithm. The new method is compared with three state-of-the-art methods, i.e., Wilcoxon, masked painter and proportional overlapped score (POS). These methods were applied to 5 data sets, i.e., Colon, Lymph node breast cancer, Leukaemia, Serrated colorectal carcinomas, and Breast Cancer. The comparison was done by selecting the top 20 genes with each gene selection method and applying random forest (RF) and support vector machine (SVM) classifiers to assess predictive performance on the datasets with the selected genes. Classification accuracy, Brier score, and sensitivity were used as performance measures. RESULTS: The proposed method gave better results than all the other feature selection methods with both RF and SVM classifiers on all the datasets. CONCLUSIONS: The proposed method showed improved performance in terms of classification accuracy, Brier score and sensitivity, and hence could be used as a novel method for gene selection to classify tissue samples into their correct classes.
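
The Brier score used as a performance measure here is the mean squared difference between predicted class probabilities and the observed 0/1 outcomes, so lower values are better. A minimal sketch for the binary case, with hypothetical predictions:

```python
def brier_score(probabilities, labels):
    """Mean squared error between predicted P(class 1) and the 0/1 labels."""
    return sum((p - y) ** 2 for p, y in zip(probabilities, labels)) / len(labels)

# Hypothetical predicted probabilities for three tissue samples and their true classes.
score = brier_score([0.9, 0.2, 0.6], [1, 0, 1])
print(round(score, 4))
```

Unlike accuracy, the score penalises a confident wrong prediction more than a hesitant one, which is why it is often reported alongside accuracy and sensitivity.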


Subjects
Machine Learning , Support Vector Machine , Algorithms , Regulator Genes , Genomics , Humans
8.
Adv Data Anal Classif ; 12(4): 827-840, 2018.
Article in English | MEDLINE | ID: mdl-30931011

ABSTRACT

Combining multiple classifiers, known as ensemble methods, can give substantial improvement in the prediction performance of learning algorithms, especially in the presence of non-informative features in the data sets. We propose an ensemble of subset of kNN classifiers, ESkNN, for classification tasks, built in two steps. First, we choose classifiers based upon their individual performance using the out-of-sample accuracy. The selected classifiers are then combined sequentially, starting from the best model, and assessed for collective performance on a validation data set. We use benchmark data sets with their original and some added non-informative features for the evaluation of our method. The results are compared with usual kNN, bagged kNN, random kNN, the multiple feature subset method, random forest and support vector machines. Our experimental comparisons on benchmark classification problems and simulated data sets reveal that the proposed ensemble gives better classification performance than the usual kNN and its ensembles, and performs comparably to random forest and support vector machines.
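
The two-step construction can be sketched as a greedy loop: rank the candidate models by their individual out-of-sample accuracy, then add them one at a time, keeping a model only if the combined validation performance improves. The details below are assumptions for illustration and are not taken from the paper: models are combined by averaging predicted probabilities, performance is measured by accuracy at a 0.5 threshold, and "improves" means a strict increase. Model names and numbers are hypothetical.

```python
def accuracy(probs, truth):
    """Accuracy after thresholding P(class 1) at 0.5."""
    preds = [1 if p >= 0.5 else 0 for p in probs]
    return sum(p == t for p, t in zip(preds, truth)) / len(truth)

def average(prob_lists):
    """Element-wise mean of several models' probability vectors."""
    return [sum(column) / len(column) for column in zip(*prob_lists)]

def grow_ensemble(individual_acc, val_probs, truth):
    """Step 1: rank by individual accuracy. Step 2: add models while validation improves."""
    ranked = sorted(individual_acc, key=individual_acc.get, reverse=True)
    chosen = [ranked[0]]
    best = accuracy(val_probs[ranked[0]], truth)
    for name in ranked[1:]:
        trial = chosen + [name]
        score = accuracy(average([val_probs[m] for m in trial]), truth)
        if score > best:             # keep the model only on strict improvement
            chosen, best = trial, score
    return chosen, best

# Hypothetical out-of-sample accuracies and validation-set probabilities.
individual_acc = {"m1": 0.80, "m2": 0.75, "m3": 0.30}
val_probs = {
    "m1": [0.9, 0.4, 0.4, 0.2],
    "m2": [0.6, 0.3, 0.8, 0.6],
    "m3": [0.2, 0.9, 0.1, 0.9],
}
truth = [1, 0, 1, 0]

chosen, best = grow_ensemble(individual_acc, val_probs, truth)
print(chosen, best)
```

In this toy run the two better models correct each other's mistakes, while the third would drag the averaged prediction down and is left out.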

9.
PLoS One ; 11(11): e0166990, 2016.
Article in English | MEDLINE | ID: mdl-27898702

ABSTRACT

Exponential Smooth Transition Autoregressive (ESTAR) models can capture non-linear adjustment of the deviations from equilibrium conditions, which may explain the economic behavior of many variables that appear non-stationary from a linear viewpoint. Many researchers employ the Kapetanios test, which has a unit root as the null and a stationary nonlinear model as the alternative. However, this test statistic is based on the assumption of normally distributed errors in the data generating process (DGP). Cook has analyzed the size of this nonlinear unit root test in the presence of a heavy-tailed innovation process and obtained the critical values for both finite variance and infinite variance cases. However, the test statistics of Cook are oversized. Researchers have found that using conventional tests is risky, although among these the best performance is obtained with a heteroscedasticity-consistent covariance matrix estimator (HCCME). The oversizing of LM tests can be reduced by employing fixed design wild bootstrap remedies, which provide a valuable alternative to the conventional tests. In this paper, the size of the Kapetanios test statistic employing heteroscedasticity-consistent covariance matrices has been derived, and the results are reported for various sample sizes in which size distortion is reduced. The properties of estimates of ESTAR models have been investigated when errors are assumed non-normal. We compare the results obtained by nonlinear least squares fitting with those of quantile regression fitting in the presence of outliers, with the error distribution taken to be a t-distribution, for various sample sizes.
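
The non-linear adjustment in an ESTAR model comes from its exponential transition function, G = 1 - exp(-gamma * (y - c)^2), which is 0 at the equilibrium value c and approaches 1 as the series moves away from it. The sketch below assumes a first-order model with conditional mean (phi + phi_star * G) * y_lag; the parameter names and values are illustrative and not taken from the paper.

```python
import math

def estar_transition(y_lag, gamma, c=0.0):
    """Exponential transition function: 0 at y_lag == c, tending to 1 far from c."""
    return 1.0 - math.exp(-gamma * (y_lag - c) ** 2)

def estar_mean(y_lag, phi, phi_star, gamma, c=0.0):
    """Conditional mean of y_t in a first-order ESTAR model (assumed form)."""
    weight = estar_transition(y_lag, gamma, c)
    return (phi + phi_star * weight) * y_lag

# With phi = 1 and phi_star = -0.5 the series behaves like a unit root near
# equilibrium and like a stationary AR(1) with coefficient 0.5 far from it.
near = estar_mean(0.1, phi=1.0, phi_star=-0.5, gamma=1.0)
far = estar_mean(5.0, phi=1.0, phi_star=-0.5, gamma=1.0)
print(round(near, 4), round(far, 4))
```

This is why such a series can look non-stationary to a linear unit root test even though large deviations are pulled back towards equilibrium.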


Subjects
Data Collection/statistics & numerical data , Statistical Models , Nonlinear Dynamics , Humans , Normal Distribution , Regression Analysis , Sample Size
10.
Methods Inf Med ; 55(6): 557-563, 2016 Dec 07.
Article in English | MEDLINE | ID: mdl-27868133

ABSTRACT

BACKGROUND: Random forests are successful classifier ensemble methods consisting of typically 100 to 1000 classification trees. Ensemble pruning techniques reduce the computational cost, especially the memory demand, of random forests by reducing the number of trees without relevant loss of performance or even with increased performance of the sub-ensemble. The application to the problem of an early detection of glaucoma, a severe eye disease with low prevalence, based on topographical measurements of the eye background faces specific challenges. OBJECTIVES: We examine the performance of ensemble pruning strategies for glaucoma detection in an unbalanced data situation. METHODS: The data set consists of 102 topographical features of the eye background of 254 healthy controls and 55 glaucoma patients. We compare the area under the receiver operating characteristic curve (AUC), and the Brier score on the total data set, in the majority class, and in the minority class of pruned random forest ensembles obtained with strategies based on the prediction accuracy of greedily grown sub-ensembles, the uncertainty weighted accuracy, and the similarity between single trees. To validate the findings and to examine the influence of the prevalence of glaucoma in the data set, we additionally perform a simulation study with lower prevalences of glaucoma. RESULTS: In glaucoma classification all three pruning strategies lead to improved AUC and smaller Brier scores on the total data set with sub-ensembles as small as 30 to 80 trees compared to the classification results obtained with the full ensemble consisting of 1000 trees. In the simulation study, we were able to show that the prevalence of glaucoma is a critical factor and lower prevalence decreases the performance of our pruning strategies. 
CONCLUSIONS: The memory demand for glaucoma classification in an unbalanced data situation based on random forests could effectively be reduced by the application of pruning strategies without loss of performance in a population with increased risk of glaucoma.
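
The AUC used to compare the pruned ensembles has a direct interpretation that is useful with unbalanced classes: it is the probability that a randomly chosen patient receives a higher predicted score than a randomly chosen control, with ties counted as one half. A minimal sketch based on that pairwise definition, with hypothetical scores:

```python
def auc(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0      # patient ranked above control
            elif p == n:
                wins += 0.5      # tie counts as half
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical ensemble scores: 3 glaucoma patients, 4 healthy controls.
value = auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2, 0.1])
print(round(value, 4))
```

Because every patient is compared with every control, the value does not depend on how many subjects each class contains, unlike plain accuracy.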


Subjects
Algorithms , Databases as Topic , Glaucoma/diagnosis , Area Under Curve , Computer Simulation , Humans
11.
J Ayub Med Coll Abbottabad ; 28(3): 514-517, 2016.
Article in English | MEDLINE | ID: mdl-28712225

ABSTRACT

BACKGROUND: Soft tissue tumours are tumours of mesenchymal origin, excluding epithelial and skeletal tissue, the reticuloendothelial system, brain coverings and solid viscera of the body. The objective of this study was to determine the histopathological pattern of soft tissue tumours in the Pathology Department of Lady Reading Hospital, Peshawar, Khyber Pakhtunkhwa, Pakistan. METHODS: This descriptive study was conducted on retrospective data from January 2009 to December 2013. All the soft tissue biopsy specimens were received in 10% formalin, labelled and grossed; sections were processed in alcohol, xylene and wax; blocks were prepared and frozen; and microtome sections were taken, processed for H&E staining, mounted and reported by a histopathologist. The inclusion criterion was any sufficient soft tissue tumour biopsy specimen, regardless of age, sex or location in the body, whereas autolysed biopsy specimens were excluded. A minimum of four and a maximum of eight sections, each 5 microns thick, were taken from each specimen. RESULTS: A total of 267 soft tissue tumour biopsy specimens were received in the pathology laboratory, with an age range of 1-75 years and a mean age of 30.68±17.71 years. The male to female ratio was 1.13:1. Of the total, 176 (65.91%) were benign tumours. Haemangioma, with 73 (27.3%) cases, was the commonest tumour, followed by lipoma with 41 (15.4%) cases. Of the 91 (34.08%) malignant tumours, rhabdomyosarcoma, with 35 (13.1%) cases, was the commonest, followed by angiosarcoma with 14 (5.2%) cases. CONCLUSIONS: Haemangioma is the commonest benign tumour and rhabdomyosarcoma is the commonest malignant tumour in this study.


Subjects
Soft Tissue Neoplasms/pathology , Adolescent , Adult , Aged , Biopsy , Child , Preschool Child , Female , Hemangioma/pathology , Hemangiosarcoma/pathology , Humans , Infant , Lipoma/pathology , Male , Middle Aged , Pakistan , Hospital Pathology Department , Retrospective Studies , Rhabdomyosarcoma/pathology , Young Adult
12.
BMC Bioinformatics ; 15: 274, 2014 Aug 11.
Article in English | MEDLINE | ID: mdl-25113817

ABSTRACT

BACKGROUND: Microarray technology, as well as other functional genomics experiments, allows simultaneous measurement of thousands of genes within each sample. Both the prediction accuracy and interpretability of a classifier could be enhanced by performing the classification based only on selected discriminative genes. We propose a statistical method for selecting genes based on overlapping analysis of expression data across classes. This method results in a novel measure, called proportional overlapping score (POS), of a feature's relevance to a classification task. RESULTS: We apply POS, along with four widely used gene selection methods, to several benchmark gene expression datasets. The experimental results of classification error rates computed using the Random Forest, k Nearest Neighbor and Support Vector Machine classifiers show that POS achieves a better performance. CONCLUSIONS: A novel gene selection method, POS, is proposed. POS analyzes the overlap of expression values across classes, taking into account the proportions of overlapping samples. It robustly defines a mask for each gene that allows it to minimize the effect of expression outliers. The constructed masks, along with a novel gene score, are exploited to produce the selected subset of genes.


Subjects
Gene Expression Profiling/methods , Genomics/methods , Cluster Analysis , Humans , Oligonucleotide Array Sequence Analysis , Support Vector Machine