ABSTRACT
In recent times, time-to-event data such as time to failure or death are routinely collected alongside high-throughput covariates. These high-dimensional bioinformatics data often challenge classical survival models, which are either infeasible to fit or produce low prediction accuracy due to overfitting. To address this issue, the focus has shifted towards novel approaches for feature selection and survival prediction. In this article, we propose a new hybrid feature selection approach that handles high-dimensional bioinformatics datasets for improved survival prediction. This study explores the efficacy of four distinct variable selection techniques, LASSO, RSF-vs, SCAD, and CoxBoost, in the context of non-parametric biomedical survival prediction. Leveraging these methods, we conducted comprehensive variable selection processes. Subsequently, survival analysis models (CoxPH, RSF, and the DeepHit neural network) were employed to construct predictive models based on the selected variables. Furthermore, we introduce a novel approach wherein only variables consistently selected by a majority of the aforementioned feature selection techniques are retained. This strategy, referred to as the proposed method, aims to enhance the reliability and robustness of variable selection and thereby improve the predictive performance of the survival analysis models. To evaluate its effectiveness, we compare the performance of the proposed approach with the existing LASSO, RSF-vs, SCAD, and CoxBoost techniques using several performance metrics, including the integrated Brier score (IBS), the concordance index (C-index), and the integrated absolute error (IAE), on numerous high-dimensional survival datasets. The real data applications reveal that the proposed method outperforms the competing methods in terms of survival prediction accuracy.
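The majority-vote step at the heart of the hybrid approach can be sketched in a few lines of Python. The per-method selections below are hypothetical placeholders, not results from the article, and the vote threshold (a strict majority of the four selectors) is an assumption about the article's exact rule.

```python
from collections import Counter

def majority_vote(selections, min_votes=None):
    """selections: dict mapping method name -> selected feature names.
    Keeps features chosen by at least min_votes methods (default: a
    strict majority of the selectors)."""
    if min_votes is None:
        min_votes = len(selections) // 2 + 1
    votes = Counter(f for feats in selections.values() for f in set(feats))
    return sorted(f for f, v in votes.items() if v >= min_votes)

# Hypothetical per-method selections, not results from the article:
selections = {
    "LASSO":    {"gene1", "gene2", "gene5"},
    "RSF-vs":   {"gene2", "gene3", "gene5"},
    "SCAD":     {"gene1", "gene2", "gene4"},
    "CoxBoost": {"gene2", "gene5", "gene6"},
}
print(majority_vote(selections))  # -> ['gene2', 'gene5']
```

Lowering `min_votes` trades robustness for a larger retained set; the retained features would then feed CoxPH, RSF, or DeepHit.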
Subjects
Neural Networks, Computer , Humans , Survival Analysis , Statistics, Nonparametric , Computational Biology/methods
ABSTRACT
In this article, we propose the exponentiated sine-generated family of distributions. Some important properties are demonstrated, such as the series representation of the probability density function, the quantile function, moments, stress-strength reliability, and the Rényi entropy. A particular member, called the exponentiated sine Weibull distribution, is highlighted; we analyze its skewness and kurtosis, moments, quantile function, mean residual life and reversed mean residual life functions, order statistics, and extreme value distributions. Maximum likelihood estimation and Bayes estimation under the squared error loss function are considered. Simulation studies are used to assess the techniques, and their performance gives satisfactory results, as shown by the mean squared error, confidence intervals, and coverage probabilities of the estimates. The stress-strength reliability parameter of the exponentiated sine Weibull model is derived and estimated by the maximum likelihood estimation method. Also, nonparametric bootstrap techniques are used to approximate the confidence interval of the reliability parameter. A simulation is conducted to examine the mean squared error, standard deviations, confidence intervals, and coverage probabilities of the reliability parameter. Finally, three real applications of the exponentiated sine Weibull model are provided, one of which considers stress-strength data.
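A common way to build a sine-generated family is F(x) = sin((π/2)G(x)); exponentiating with a power parameter a gives a plausible sketch of the exponentiated sine Weibull CDF. The exact parametrisation in the article may differ, so treat this as an illustration, not the paper's definition.

```python
import math

def weibull_cdf(x, k, lam):
    """Baseline Weibull CDF G(x) with shape k and scale lam."""
    return 1.0 - math.exp(-((x / lam) ** k)) if x > 0 else 0.0

def exp_sine_weibull_cdf(x, a, k, lam):
    """Assumed form F(x) = [sin((pi/2) * G(x))]**a with a Weibull baseline;
    the article's exact parametrisation may differ."""
    return math.sin(0.5 * math.pi * weibull_cdf(x, k, lam)) ** a

# Sanity checks: a valid CDF runs from 0 to 1 and is non-decreasing.
print(exp_sine_weibull_cdf(0.0, 2.0, 1.5, 1.0))   # 0.0
print(exp_sine_weibull_cdf(50.0, 2.0, 1.5, 1.0))  # ~1.0
```

Because sin((π/2)u) maps [0, 1] onto [0, 1] monotonically, the construction yields a valid CDF for any baseline G and a > 0.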
ABSTRACT
Probability distributions play a pivotal and significant role in modeling real-life data in every field. To this end, a series of probability distributions have been introduced and applied across many sectors. This paper contributes a new method for modeling continuous data sets. The proposed family is called the exponent power sine-G family of distributions. Based on the exponent power sine-G method, a new model, namely the exponent power sine-Weibull model, is studied. Several mathematical properties, such as the quantile function, the identifiability property, and the rth moment, are derived. For the exponent power sine-G method, the maximum likelihood estimators are obtained. Simulation studies are also presented. Finally, the optimality of the exponent power sine-Weibull model is shown using two applications from the healthcare sector. Based on seven evaluation criteria, it is demonstrated that the proposed model is the best competing distribution for analyzing healthcare phenomena.
ABSTRACT
RNA modifications are pivotal in the development of newly synthesized structures, showcasing a vast array of alterations across various RNA classes. Among these, 5-hydroxymethylcytosine (5HMC) stands out, playing a crucial role in gene regulation and epigenetic changes, yet its detection through conventional methods proves cumbersome and costly. To address this, we propose Deep5HMC, a robust learning model leveraging machine learning algorithms and discriminative feature extraction techniques for accurate 5HMC sample identification. Our approach integrates seven feature extraction methods and various machine learning algorithms, including Random Forest, Naive Bayes, Decision Tree, and Support Vector Machine. Through K-fold cross-validation, our model achieved a notable 84.07% accuracy rate, surpassing previous models by 7.59%, signifying its potential in early cancer and cardiovascular disease diagnosis. This study underscores the promise of Deep5HMC in offering insights for improved medical assessment and treatment protocols, marking a significant advancement in RNA modification analysis.
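The K-fold cross-validation protocol behind the reported 84.07% accuracy can be illustrated with a toy stand-in: synthetic feature vectors and a nearest-centroid classifier in place of the article's sequence encodings and RF/SVM/NB/DT learners.

```python
import numpy as np

def kfold_accuracy(X, y, k=5, seed=0):
    """K-fold cross-validated accuracy of a toy nearest-centroid
    classifier (a stand-in for the article's RF/SVM/NB/DT learners)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    accuracies = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        centroids = {c: X[train][y[train] == c].mean(axis=0)
                     for c in np.unique(y[train])}
        preds = [min(centroids, key=lambda c: np.linalg.norm(row - centroids[c]))
                 for row in X[test]]
        accuracies.append(float(np.mean(np.array(preds) == y[test])))
    return float(np.mean(accuracies))

# Synthetic two-class data standing in for encoded RNA samples:
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (50, 8)), rng.normal(1.5, 1.0, (50, 8))])
y = np.repeat([0, 1], 50)
print(round(kfold_accuracy(X, y), 3))
```

Shuffling before splitting keeps each fold representative of both classes, which is essential for an honest cross-validated accuracy estimate.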
Subjects
5-Methylcytosine/analogs & derivatives , Algorithms , Neural Networks, Computer , Bayes Theorem , Support Vector Machine , RNA
ABSTRACT
In the model-based approach, researchers assume that the underlying structure which generates the population of interest is correctly specified. However, when the working model differs from the underlying true population model, the estimation process becomes quite unreliable due to misspecification bias. Selecting a sample by applying balancing conditions on some functions of the covariates can reduce such bias. This study suggests an estimator of the population total by applying balancing conditions to the basis functions of the auxiliary character(s), for situations where the working model differs from the underlying true model, under a ranked set sampling without replacement scheme. Special cases of the misspecified basis function model, i.e. homogeneous, linear, and proportional, are considered, and balancing conditions are introduced in each case. Both simulation and bootstrap studies show that the total estimators under the proposed sampling mechanism remain superior to simple random sampling in terms of efficiency while maintaining robustness against model failure.
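A minimal ranked set sampling sketch, assuming perfect judgement ranking and independently drawn comparison sets (the article's without-replacement scheme and balancing conditions are not reproduced here); it also checks the classic efficiency gain of the RSS mean over an equal-size simple random sample mean.

```python
import numpy as np

def ranked_set_sample(population, m, cycles, rng):
    """One balanced RSS draw: per cycle, form m random comparison sets of
    size m, rank each set, and keep the i-th order statistic of the i-th
    set (judgement ranking is assumed perfect)."""
    sample = []
    for _ in range(cycles):
        for i in range(m):
            comparison_set = rng.choice(population, size=m, replace=False)
            sample.append(np.sort(comparison_set)[i])
    return np.array(sample)

# Efficiency check: variance of the RSS mean vs an equal-size SRS mean.
rng = np.random.default_rng(0)
pop = rng.lognormal(0.0, 0.5, 10_000)
rss_means = [ranked_set_sample(pop, m=4, cycles=5, rng=rng).mean()
             for _ in range(500)]
srs_means = [rng.choice(pop, size=20, replace=False).mean()
             for _ in range(500)]
print(np.var(rss_means) < np.var(srs_means))  # RSS is typically more efficient
```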
ABSTRACT
Background: Lung cancer is the top cause of cancer mortality in males and the second largest cause of cancer-related fatalities in women worldwide. Most non-small cell lung cancer (NSCLC) cases are discovered at an advanced stage, raising major challenges in disease management and survival outcomes. This study aimed to investigate the clinical findings and management of stage IIIB and IV NSCLC patients for better decision-making, disease management, and understanding of this fatal disease. Methods: In this cohort study of 340 patients, a total of 140 (41.2%) were diagnosed with advanced-stage NSCLC, at a mean age of 64 years. The electronic data of patients from 2015 to 2021 who met the inclusion criteria were retrieved from two tertiary hospitals in Riyadh, Saudi Arabia, and an Excel sheet was used to record the variables. Patients' data, including categorical variables such as gender, stage, metastasis, ALK, EGFR, and ROS, and continuous variables such as age and body mass index (BMI), were retrieved and analyzed. Results: The multivariate Cox regression model indicated that smoking was a significant risk factor of death for two-thirds of male smokers (37.9%), with a median survival time of 123 days. Disease progression was higher with pleural and brain metastasis, and localized metastasis was the most common, in 75% of patients. The intent of treatment was mainly palliative; however, a statistically significant association was found with the simultaneous use of chemotherapy and immunotherapy. Patients' response to first-line treatment revealed a significant improvement if chemotherapy was maintained at the same dose without interruption. Conclusions: The overall cure and survival rates for NSCLC remain low, particularly in metastatic disease. Therefore, continued research into new drugs and combination therapies is required for better decision-making, to expand the clinical benefit to a broader patient population, and to improve outcomes in NSCLC.
ABSTRACT
The objective of this study is to investigate the behavior of the Bayesian exponentially weighted moving average (EWMA) control chart in the presence of measurement error (ME). It explores the impact of different ranked set sampling (RSS) designs and loss functions on the performance of the control chart when ME is present. The analysis incorporates a covariate model, multiple measurement methods, and a conjugate prior to account for ME. The performance evaluation of the proposed Bayesian EWMA control chart with ME includes metrics such as the average run length and the standard deviation of run lengths. The findings, obtained through Monte Carlo simulation and a real data application, indicate that ME significantly affects the performance of the Bayesian EWMA control chart when RSS schemes are employed. Particularly noteworthy is the superior performance of the median RSS scheme compared with the other two schemes in the presence of ME.
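The run-length metrics can be illustrated with a classical (non-Bayesian) EWMA chart on standard normal data; the Bayesian updating, the measurement error model, and the RSS designs of the article are not reproduced in this sketch. The smoothing constant and limit width (λ = 0.2, L = 2.962) are a standard design whose in-control ARL is about 500.

```python
import numpy as np

def ewma_run_length(lam, L, shift=0.0, max_n=100_000, rng=None):
    """Run length of a classical EWMA chart on N(shift, 1) observations,
    using the exact time-varying control limits."""
    rng = rng if rng is not None else np.random.default_rng()
    z = 0.0
    for t in range(1, max_n + 1):
        z = lam * rng.normal(shift, 1.0) + (1.0 - lam) * z
        limit = L * np.sqrt(lam / (2.0 - lam) * (1.0 - (1.0 - lam) ** (2 * t)))
        if abs(z) > limit:
            return t
    return max_n

rng = np.random.default_rng(0)
arl0 = np.mean([ewma_run_length(0.2, 2.962, rng=rng) for _ in range(200)])
arl1 = np.mean([ewma_run_length(0.2, 2.962, shift=1.0, rng=rng) for _ in range(200)])
print(round(arl0), round(arl1))  # a one-sigma shift is detected far sooner
```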
ABSTRACT
This article suggests a new improved generalized class of estimators for the finite population distribution functions of the study and auxiliary variables, as well as the mean of the usual auxiliary variable, under simple random sampling. Analytical expressions for the bias and mean squared error (MSE) are derived up to the first degree of approximation. From our generalized class of estimators, we obtain two improved estimators; the gain of the second proposed estimator is greater than that of the first. Three real data sets and a simulation study are used to measure the performance of our generalized class of estimators. The MSE of our proposed estimators is the lowest, and consequently their percentage relative efficiency is higher than that of their existing counterparts. The numerical outcomes show that the proposed estimators perform well compared with all the estimators considered in this study.
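The two headline metrics, MSE and percentage relative efficiency (PRE), can be computed as follows; the shrinkage estimator below is purely illustrative and unrelated to the article's class of estimators.

```python
import numpy as np

def mse(estimates, true_value):
    """Monte Carlo mean squared error of a vector of estimates."""
    est = np.asarray(estimates, dtype=float)
    return float(np.mean((est - true_value) ** 2))

def pre(mse_reference, mse_proposed):
    """Percentage relative efficiency: values above 100 favour the
    proposed estimator."""
    return 100.0 * mse_reference / mse_proposed

# Illustrative: a slightly shrunken mean trades a little bias for variance.
rng = np.random.default_rng(6)
true_mu = 10.0
samples = rng.normal(true_mu, 4.0, (4000, 10))
plain = samples.mean(axis=1)
shrunk = 0.9 * plain + 0.1 * 9.5   # toy estimator shrinking toward a guess
print(round(pre(mse(plain, true_mu), mse(shrunk, true_mu)), 1))  # above 100
```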
ABSTRACT
The rising numbers of confirmed cases and deaths in Pakistan caused by the coronavirus have created problems in all areas of the country, not just healthcare. For accurate policy making, it is very important to have accurate and efficient predictions of confirmed cases and death counts. In this article, we use a coronavirus dataset that includes the numbers of deaths, confirmed cases, and recovered cases to test an artificial neural network model and compare it with different univariate time series models. Alongside the artificial neural network model, we consider five univariate time series models to predict confirmed cases, death counts, and recovered cases. The considered models are applied to Pakistan's daily records of confirmed cases, deaths, and recovered cases from 10 March 2020 to 3 July 2020. Two statistical measures are considered to assess the performances of the models. In addition, a statistical test, namely the Diebold and Mariano test, is implemented to check the accuracy of the mean errors. The results (mean error and statistical test) show that the artificial neural network model is better suited to predicting deaths and recovered coronavirus cases. In addition, for confirmed cases, the moving average model outperforms all other models, while the autoregressive moving average model is the second best.
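The moving average model singled out for confirmed cases amounts to one-step-ahead forecasting with a rolling mean; here is a sketch on synthetic daily counts (the Pakistan series itself is not reproduced), scored with the mean absolute error.

```python
import numpy as np

def moving_average_forecast(series, window=7):
    """One-step-ahead forecasts: each prediction is the mean of the
    previous `window` observations."""
    s = np.asarray(series, dtype=float)
    return np.array([s[t - window:t].mean() for t in range(window, len(s))])

def mae(actual, predicted):
    """Mean absolute error between observed and forecast values."""
    return float(np.mean(np.abs(np.asarray(actual) - np.asarray(predicted))))

# Synthetic upward-trending daily counts (illustrative only):
rng = np.random.default_rng(2)
days = np.arange(60)
cases = 100 + 2 * days + rng.normal(0, 5, 60)
pred = moving_average_forecast(cases, window=7)
print(round(mae(cases[7:], pred), 2))
```

On a trending series the rolling mean lags behind, which is why trend-aware models such as ARMA can close the gap; the Diebold-Mariano test then judges whether the error difference is statistically significant.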
ABSTRACT
The term extreme events refers to unnatural or undesirable events. Due to their generally destructive effects on society and their connection to scientific problems in various applied fields, the study of extreme events is an important subject for researchers. Many real-life phenomena exhibit clusters of extreme observations that cannot be adequately predicted and modeled by traditional distributions. Therefore, we need new flexible probability distributions that are useful for modeling extreme-value data in fields such as the financial sector, telecommunications, hydrology, engineering, and meteorology. In this work, a new flexible probability distribution is introduced, obtained by combining the flexible Weibull distribution with the weighted T-X strategy. The new model is named the new flexible Weibull extension distribution. The distributional properties of the new model are derived. Furthermore, some frequently implemented estimation approaches are considered to obtain the estimators of the new flexible Weibull extension model. Finally, we demonstrate the utility of the new flexible Weibull extension distribution by analyzing an extreme value data set.
ABSTRACT
We introduce a brand-new generated family of distributions, referred to as the New Power Topp-Leone Generated (NPTL-G) family. Given the major functions that define this new family, important mathematical aspects are discussed in detail. We derive some functions for the new family, including the Rényi entropy, the quantile function, series expansions, and probability weighted moments. Moreover, to estimate the unknown parameters of our model, we employ the maximum likelihood technique. In addition, two real-world datasets are investigated to highlight possible applications of this novel distribution. Based on the measures considered, this new model performs better than three key rivals.
Subjects
Probability , Entropy
ABSTRACT
The initial COVID-19 vaccines were created and distributed to the general population in 2020 thanks to emergency authorization and conditional approval. Consequently, numerous countries followed the process, which is currently a global campaign. Given that people are being vaccinated, there are concerns about the effectiveness of that medical solution. This study is the first one focusing on how the number of vaccinated people might influence the spread of the pandemic in the world. From the Global Change Data Lab "Our World in Data", we obtained data sets on the number of new cases and vaccinated people. This is a longitudinal study from 14/12/2020 to 21/03/2021. We fitted a generalized log-linear model for count time series (negative binomial distribution due to overdispersion in the data) and implemented validation tests to confirm the robustness of our results. The findings revealed that when the number of vaccinated people increases by one new vaccination on a given day, the number of new cases decreases significantly by one two days later. The influence is not notable on the same day of vaccination. Authorities should intensify the vaccination campaign to control the pandemic well. That solution has effectively started to reduce the spread of COVID-19 in the world.
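The negative binomial choice is motivated by overdispersion, i.e. variance exceeding the mean, which a Poisson model cannot accommodate. A quick check on synthetic NB2 counts (illustrative, not the Our World in Data series; under NB2 the variance is μ + μ²/θ):

```python
import numpy as np

# NB2 counts: variance = mu + mu**2 / theta, well above the Poisson's
# variance-equals-mean. Synthetic data, not the Our World in Data series.
rng = np.random.default_rng(5)
mu, theta = 50.0, 5.0
# numpy parameterisation: n = theta, p = theta / (theta + mu) gives mean mu.
counts = rng.negative_binomial(theta, theta / (theta + mu), size=1000)
print(round(counts.mean(), 1), round(counts.var(), 1))  # variance far exceeds mean
```

When the sample variance is several times the mean, as here, a negative binomial count regression is the safer choice for inference on lagged covariates such as daily vaccinations.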
Subjects
COVID-19 , Humans , COVID-19 Vaccines , Immunization Programs , Linear Models , Vaccination
ABSTRACT
Epstein-Barr virus (EBV) is implicated in the carcinogenesis of nasopharyngeal carcinoma (NPC) and is currently associated with at least 1% of global cancers. The differential prognosis of NPC across EBV genotypes remains to be elucidated. Medical, radiological, pathological, and laboratory reports of 146 NPC patients were collected retrospectively over a 6-year period between 2015 and 2020. From the pathology archives, DNA was extracted from tumor blocks and used for EBV nuclear antigen 3C (EBNA-3C) genotyping by nested polymerase chain reaction (PCR). We found a high prevalence (96%) of EBV infection in NPC patients, with a predominance of genotype I, detected in 73% of NPC samples. Histopathological examination showed that most of the NPC patients were in the advanced stages of cancer: stage III (38.4%) or stage IV-B (37.7%). Only keratinized squamous cell carcinoma was significantly more frequent in EBV-negative NPC patients compared with those who were EBV-positive (OR = 0.01, 95% CI = 0.004-0.32, p = 0.009), whereas the majority of patients (91.8%) had undifferentiated, non-keratinizing squamous cell carcinoma, followed by differentiated, non-keratinizing squamous cell carcinoma (7.5%). Although NPC had metastasized to other body sites in 16% of patients, metastasis was not associated with EBV infection, except for lung metastasis. A statistically significant inverse association was observed between EBV infection and lung metastasis (OR = 0.07, 95% CI = 0.01-0.51, p = 0.008). Although 13% of NPC patients died, the mean overall survival (OS) time was 5.59 years. Given the high prevalence of EBV-associated NPC in our population, Saudi Arabia could be considered an area with a high incidence of EBV-associated NPC, with a predominance of EBV genotype I. A future multi-center study with a larger sample size is needed to assess the true burden of EBV-associated NPC in Saudi Arabia.
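Odds ratios and confidence intervals like those reported above are typically computed from a 2x2 table on the log-odds scale; here is a sketch with purely illustrative counts (not the article's data).

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio for a 2x2 table [[a, b], [c, d]] with a Woolf-type
    (log-scale) 95% confidence interval."""
    or_ = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of log(OR)
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, lo, hi

# Purely illustrative exposure-by-outcome counts, not the study's table:
or_, lo, hi = odds_ratio_ci(2, 28, 20, 96)
print(round(or_, 3), round(lo, 3), round(hi, 3))
```

An OR below 1 with a confidence interval excluding 1, as in the lung-metastasis result above, indicates a statistically significant inverse association.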
ABSTRACT
For a variety of well-known approaches, optimal predictors and estimators are determined in relation to the asymmetric LINEX loss function. This research discusses applications of an iteratively feasible minimum mean squared error estimation of the regression disturbance variance under the LINEX loss function. This loss is an asymmetric generalisation of the quadratic loss function. When the LINEX loss function is applied, we also examine the risk performance of the feasible almost unbiased generalised Liu estimator and the feasible generalised Liu estimator. When the variance σ² is specified, we obtain all admissible linear estimators in the class of linear estimation techniques, and likewise when σ² is unknown. The proposed Liu estimators are stable under location transformations. The estimators' biases and risks are calculated and evaluated. We utilise the asymmetric LINEX loss function to calculate the actual risks of several error variance estimators. The employment of δP(σ), which is easy to use and maximin, is recommended in the conclusions.
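The LINEX loss and the form of the Bayes estimator it induces can be made concrete. For a normal posterior N(μ, σ²), the LINEX-optimal estimate has the closed form μ - aσ²/2, a standard result used here for illustration; the article's estimators target the regression disturbance variance instead.

```python
import math

def linex_loss(error, a=1.0, b=1.0):
    """LINEX loss b*(exp(a*error) - a*error - 1): asymmetric, penalising
    errors of one sign more heavily than the other when a != 0."""
    return b * (math.exp(a * error) - a * error - 1.0)

def linex_bayes_normal(mu, sigma2, a=1.0):
    """Bayes estimate under LINEX for a normal posterior N(mu, sigma2):
    -(1/a) * log E[exp(-a*theta)] = mu - a*sigma2/2."""
    return mu - a * sigma2 / 2.0

print(linex_bayes_normal(5.0, 2.0, a=1.0))  # 4.0, below the posterior mean
print(linex_loss(0.5) > linex_loss(-0.5))   # asymmetry for a > 0
```

With a > 0 overestimation is the costlier error, so the optimal estimate shrinks below the posterior mean; as a → 0 the LINEX loss approaches the quadratic loss and the estimate approaches μ.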
ABSTRACT
Healthcare systems have been under immense pressure since the beginning of the COVID-19 pandemic; hence, studies on using machine learning (ML) methods for classifying ICU admissions and allocating resources are urgently needed. We investigated whether ML can provide a useful classification model for predicting the ICU admissions of COVID-19 patients. In this retrospective study, the clinical characteristics and laboratory findings of 100 patients with laboratory-confirmed COVID-19 were retrieved between May 2020 and January 2021. Based on patients' demographic and clinical data, we analyzed the capability of the proposed weighted radial kernel support vector machine (SVM) coupled with recursive feature elimination (RFE). The proposed method is compared with reference methods such as linear discriminant analysis (LDA) and kernel-based SVM variants, including the linear, polynomial, and radial kernels coupled with RFE, for predicting ICU admissions of COVID-19 patients. An initial performance assessment indicated that the weighted radial kernel SVM coupled with RFE outperformed the other classification methods in discriminating between ICU and non-ICU admissions in COVID-19 patients. Furthermore, applying RFE with the weighted radial kernel SVM identified a significant set of variables that can predict and statistically distinguish ICU from non-ICU COVID-19 patients. The patients' weight, PCR Ct value, CCL19, IFN-β, BLC, INR, PT, PTT, CKMB, HB, platelets, RBC, urea, creatinine, and albumin results were found to be the significant predictive features. We believe that the weighted radial kernel SVM can be used as an assisting ML approach to guide hospital decision makers in resource allocation and mobilization between intensive care and isolation units. We modeled the data retrospectively on a selected subset of patient-derived variables chosen using prior knowledge of ICU admission; the model needs to be trained further in order to forecast prospectively.
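Recursive feature elimination can be sketched without any SVM machinery: greedily drop the feature whose removal hurts a class-separation score least. The scorer and the synthetic data below are toy stand-ins for the weighted radial kernel SVM and the clinical variables.

```python
import numpy as np

def separation(X, y):
    """Toy class-separation score: distance between the two class
    centroids, scaled by the average per-feature spread."""
    m0, m1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    return float(np.linalg.norm(m0 - m1) / (X.std(axis=0).mean() + 1e-12))

def rfe(X, y, score_fn, n_keep):
    """Naive recursive feature elimination: repeatedly drop the feature
    whose removal costs the least score."""
    feats = list(range(X.shape[1]))
    while len(feats) > n_keep:
        drop = max(feats,
                   key=lambda f: score_fn(X[:, [g for g in feats if g != f]], y))
        feats.remove(drop)
    return feats

# Six synthetic variables, only columns 0 and 3 carry class signal:
rng = np.random.default_rng(4)
X = rng.normal(0.0, 1.0, (200, 6))
y = np.repeat([0, 1], 100)
X[y == 1, 0] += 2.0
X[y == 1, 3] += 2.0
print(sorted(rfe(X, y, separation, n_keep=2)))  # -> [0, 3]
```

Real SVM-RFE ranks features by their contribution to the decision function instead of a centroid score, but the elimination loop is the same.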
ABSTRACT
Survival analysis is a collection of statistical techniques that examine the time it takes for an event to occur, and it is one of the most important fields in the biomedical sciences and a variety of other scientific disciplines. Furthermore, the rapid computational advancements of recent decades have encouraged the application of Bayesian techniques in this field, offering a powerful and flexible alternative to classical inference. The aim of this study is to consider Bayesian inference for the generalized log-logistic proportional hazard model, with applications to right-censored healthcare data sets. We assume an independent gamma prior for the baseline hazard parameters, and a normal prior is placed on the regression coefficients. We then obtain the exact form of the joint posterior distribution of the regression coefficients and distributional parameters. The Bayesian estimates of the parameters of the proposed model are obtained using the Markov chain Monte Carlo (MCMC) simulation technique. All computations are performed in Bayesian analysis using Gibbs sampling (BUGS) syntax that can be run with Just Another Gibbs Sampler (JAGS) from the R software. A detailed simulation study was used to assess the performance of the proposed parametric proportional hazard model. Two real survival data problems in healthcare are analyzed, for illustration of the proposed model and for model comparison. Furthermore, convergence diagnostic tests are presented and analyzed. Finally, our research found that the proposed parametric proportional hazard model performs well and could be beneficial in analyzing various types of survival data.
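The gamma prior on the baseline hazard has a simple conjugate special case that shows the mechanics: with a constant (exponential) hazard and right censoring, the posterior of the rate is again gamma. This is only an analytical corner of the model; the article samples the full generalized log-logistic model with JAGS, and the data below are invented.

```python
import numpy as np

def rate_posterior(times, events, a0=1.0, b0=1.0):
    """Gamma posterior for a constant hazard rate under right censoring:
    shape = a0 + number of events, rate = b0 + total follow-up time."""
    return a0 + float(np.sum(events)), b0 + float(np.sum(times))

times = np.array([2.0, 3.5, 1.2, 4.0, 0.7])   # follow-up times (years)
events = np.array([1, 1, 0, 1, 1])            # 0 marks a right-censored subject
a_post, b_post = rate_posterior(times, events)
print(a_post, b_post, round(a_post / b_post, 3))  # posterior mean of the hazard
```

Censored subjects contribute exposure time but no event, which is exactly how the likelihood enters the gamma update; for richer baselines the posterior loses this closed form and MCMC takes over.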
Subjects
Delivery of Health Care , Bayes Theorem , Computer Simulation , Humans , Markov Chains , Monte Carlo Method
ABSTRACT
This paper proposes a new generalization of the Gull Alpha Power family of distributions, namely the exponentiated generalized gull alpha power family of distributions, abbreviated EGGAPF, with two additional parameters. This proposed family of distributions has some well-known sub-models. Some of the basic properties of the distribution, such as the hazard function, survival function, order statistics, quantile function, and moment generating function, are investigated. In order to estimate the parameters of the model, the method of maximum likelihood estimation is used. To assess the performance of the MLE estimates, a simulation study was performed. It is observed that, with an increase in sample size, the average bias and the RMSE decrease. A distribution from this family is fitted to two real data sets and compared to its sub-models. It can be concluded that the proposed distribution outperforms its sub-models.
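The simulation finding that bias and RMSE shrink with sample size is generic for maximum likelihood; here is a compact Monte Carlo demonstration using the exponential-rate MLE as a simple stand-in for the EGGAPF fit.

```python
import numpy as np

def mle_bias_rmse(rate, n, reps=2000, seed=0):
    """Monte Carlo bias and RMSE of the exponential-rate MLE 1/xbar."""
    rng = np.random.default_rng(seed)
    xbar = rng.exponential(1.0 / rate, (reps, n)).mean(axis=1)
    est = 1.0 / xbar
    bias = float(np.mean(est) - rate)
    rmse = float(np.sqrt(np.mean((est - rate) ** 2)))
    return bias, rmse

for n in (20, 80, 320):
    bias, rmse = mle_bias_rmse(rate=2.0, n=n)
    print(n, round(bias, 4), round(rmse, 4))  # both shrink as n grows
```

The exponential MLE is biased upward by roughly rate/(n - 1), so both columns decay as n grows, mirroring the pattern the article reports for its own tables.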
ABSTRACT
Medical costs are among the most common recurring expenses in a person's life. According to various research studies, BMI, ageing, smoking, and other factors are all related to greater personal medical care costs. Estimates of the health care expenditures related to obesity are needed to help create cost-effective obesity prevention strategies. Obesity prevention at a young age is a top concern in global health, clinical practice, and public health. To avoid these restrictions, genetic variants are employed as instrumental variables in this research. Using statistics from large public datasets, the impact of body mass index (BMI) on overall healthcare expenses is predicted. A multiview learning architecture can be used to leverage BMI information in records, including diagnostic texts, diagnostic IDs, and patient traits. A hierarchy perception structure was suggested to choose significant words, health checks, and diagnoses for informative data representations in the training phase, because different words, diagnoses, and previous health care episodes have varying significance for expense calculation. In this system model, linear regression analysis, a naive Bayes classifier, and random forest algorithms were compared using a business analytics method that applied statistical and machine-learning approaches. According to the results of our forecasting method, linear regression has the maximum accuracy of 97.89 percent in forecasting overall healthcare costs. In terms of financial statistics, our methodology provides a predictive method.
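The linear-regression component of the comparison can be sketched with ordinary least squares on synthetic charge data; the variable names and coefficients below are invented for illustration and are not taken from the article's datasets.

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares with an intercept column."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

# Invented cost-generating process: charges rise with BMI, age and smoking.
rng = np.random.default_rng(3)
n = 500
bmi = rng.normal(28.0, 5.0, n)
age = rng.integers(18, 65, n).astype(float)
smoker = rng.integers(0, 2, n).astype(float)
cost = 2000 + 120 * bmi + 60 * age + 9000 * smoker + rng.normal(0, 500, n)
beta = fit_linear(np.column_stack([bmi, age, smoker]), cost)
print(np.round(beta, 1))  # roughly [2000, 120, 60, 9000]
```

When the true cost process is close to linear in the recorded traits, OLS recovers the generating coefficients well, which is consistent with linear regression topping the article's accuracy comparison.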
Subjects
Health Care Costs , Machine Learning , Bayes Theorem , Hospitalization , Humans , Obesity
ABSTRACT
In this work, we provide a new generated class of models, namely, the extended generalized inverted Kumaraswamy generated (EGIKw-G) family of distributions. Several structural properties (survival function (sf), hazard rate function (hrf), reverse hazard rate function (rhrf), quantile function (qf) and median, sth raw moment, generating function, mean deviation (md), etc.) are provided. The estimates for the parameters of the new G class are derived via the maximum likelihood estimation (MLE) method. The special models of the proposed class are discussed, and particular attention is given to one special model, the extended generalized inverted Kumaraswamy Burr XII (EGIKw-Burr XII) model. The estimators are evaluated via a Monte Carlo simulation (MCS). The superiority of the EGIKw-Burr XII model is demonstrated using lifetime data applications.
Subjects
Orientation, Spatial , Computer Simulation , Likelihood Functions , Monte Carlo Method
ABSTRACT
In this article, we propose an improved finite population variance estimator based on simple random sampling using dual auxiliary information. Mathematical expressions of the proposed and existing estimators are obtained up to the first order of approximation. Two real data sets are used to examine the performance of the new proposed estimator. A simulation study is also conducted to assess the robustness and generalizability of the proposed estimator. The results from the real data sets and the simulation study show that the proposed estimator attains the minimum mean squared error and a higher percentage relative efficiency than all existing counterparts, which demonstrates the importance of the new improved estimator. The theoretical and numerical results illustrate that the proposed variance estimator based on simple random sampling using dual auxiliary information performs best among all existing estimators.