1.
Biology (Basel) ; 11(10)2022 Oct 12.
Article in English | MEDLINE | ID: mdl-36290397

ABSTRACT

With the emergence of single-cell RNA sequencing (scRNA-seq) technology, scientists are able to examine gene expression at single-cell resolution. Analysis of scRNA-seq data has its own challenges, which stem from its high dimensionality. Machine learning offers the potential for gene (feature) selection from the high-dimensional scRNA-seq data. Even though there exist multiple machine learning methods that appear to be suitable for feature selection, such as penalized regression, there is no rigorous comparison of their performances across data sets, where each poses its own challenges. Therefore, in this paper, we analyzed and compared multiple penalized regression methods for scRNA-seq data. Given the scRNA-seq data sets we analyzed, the results show that sparse group lasso (SGL) outperforms the other six methods (ridge, lasso, elastic net, drop lasso, group lasso, and big lasso) using the metrics area under the receiver operating characteristic curve (AUC) and computation time. Building on these findings, we proposed a new algorithm for feature selection using penalized regression methods. The proposed algorithm works by selecting a small subset of genes and applying SGL to select the differentially expressed genes in scRNA-seq data. By using hierarchical clustering to group genes, the proposed method bypasses the need for domain-specific knowledge for gene grouping information. In addition, the proposed algorithm provided consistently better AUC for the data sets used.
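
The abstract does not state the penalty itself. As a minimal sketch, assuming the commonly used form of the sparse group lasso penalty (a group-wise L2 term scaled by the square root of the group size, mixed with an L1 term), the quantity being minimized alongside the loss could be written as below. The function name `sgl_penalty` and the parameters `lam` and `alpha` are illustrative, not taken from the paper.

```python
import math

def sgl_penalty(beta, groups, lam, alpha):
    """Sparse group lasso penalty for coefficients `beta` (assumed standard form).

    groups: list of index lists partitioning the coefficients
            (e.g. gene clusters obtained from hierarchical clustering).
    lam:    overall penalty strength (>= 0).
    alpha:  mixing weight in [0, 1]; 1 reduces to the lasso, 0 to the group lasso.
    """
    # L1 term: encourages sparsity of individual coefficients.
    l1_term = sum(abs(b) for b in beta)
    # Group term: encourages whole groups to be zeroed out together.
    group_term = 0.0
    for idx in groups:
        l2_norm = math.sqrt(sum(beta[j] ** 2 for j in idx))
        group_term += math.sqrt(len(idx)) * l2_norm
    return lam * ((1.0 - alpha) * group_term + alpha * l1_term)
```

Setting `alpha` between 0 and 1 trades off group-level against within-group sparsity, which is why a grouping of genes is needed before SGL can be applied.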

2.
J Stat Theory Appl ; 21(3): 79-105, 2022.
Article in English | MEDLINE | ID: mdl-35996625

ABSTRACT

The number of children ever born to women of reproductive age forms a core component of fertility and is vital to the population dynamics in any country. Using Bangladesh Multiple Indicator Cluster Survey 2019 data, we fitted a novel weighted Bayesian Poisson regression model to identify multi-level individual, household, regional and societal factors associated with the number of children ever born among married women of reproductive age in Bangladesh. We explored the robustness of our results using multiple prior distributions, and presented the Metropolis algorithm for posterior realizations. The method is compared with the regular Bayesian Poisson regression model using a Weighted Bayesian Information Criterion. Factors identified emphasize the need to revisit and strengthen the existing fertility-reduction programs and policies in Bangladesh. Supplementary Information: The online version contains supplementary material available at 10.1007/s44199-022-00044-2.
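
To illustrate the kind of sampler the abstract refers to, here is a minimal random-walk Metropolis sketch for a Poisson regression with a log link and one covariate. It rests on assumptions not confirmed by the abstract: the weights `w` simply multiply each observation's log-likelihood contribution, the priors are independent normals with standard deviation `prior_sd`, and the proposal step size is fixed. It is not the authors' implementation.

```python
import math
import random

def log_posterior(beta, x, y, w, prior_sd=10.0):
    """Weighted Poisson log-likelihood (log link) plus independent normal priors."""
    loglik = 0.0
    for xi, yi, wi in zip(x, y, w):
        eta = beta[0] + beta[1] * xi               # linear predictor
        loglik += wi * (yi * eta - math.exp(eta))  # log(y!) does not depend on beta
    logprior = -sum(b * b for b in beta) / (2.0 * prior_sd ** 2)
    return loglik + logprior

def metropolis(x, y, w, n_iter=5000, step=0.05, seed=0):
    """Random-walk Metropolis; returns one [intercept, slope] draw per iteration."""
    rng = random.Random(seed)
    beta = [0.0, 0.0]
    current = log_posterior(beta, x, y, w)
    draws = []
    for _ in range(n_iter):
        # Symmetric Gaussian proposal around the current state.
        proposal = [b + rng.gauss(0.0, step) for b in beta]
        candidate = log_posterior(proposal, x, y, w)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            beta, current = proposal, candidate
        draws.append(list(beta))
    return draws
```

Re-running the sampler with different values of `prior_sd` is one simple way to check how sensitive the posterior draws are to the prior, in the spirit of the robustness check the abstract mentions.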

3.
Metron ; 79(3): 361-381, 2021.
Article in English | MEDLINE | ID: mdl-34690366

ABSTRACT

Statistical thresholds occur when the changes in the relationships between a response and predictor variables are not linear but abrupt at some points of the predictor variable values. In this paper, we defined a piecewise-linear regression model which can detect two thresholds in the relationships via changes in slopes. We developed the corresponding Bayesian methodology for model estimation and inference by proposing prior distributions, deriving posterior distributions, and generating posterior values using Metropolis and Gibbs sampling algorithms. The parameters in our model are easy to understand, highly interpretable, and flexible to make inferences. The methodology has been applied to estimate threshold effects in housing market pricing data in two cities, Kamloops and Chilliwack, in British Columbia, Canada. Our findings revealed that the implementation of changes in the government property tax policies had threshold effects in the market price trend. The proposed model will be useful to detect threshold effects in other correlated time series data as well.
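
The abstract does not give the model equation. One common way to parameterize a continuous piecewise-linear mean with two thresholds is through hinge terms, where each threshold adds a change in slope. The sketch below assumes that parameterization; the names `b0`, `b1`, `d1`, `d2`, `t1` and `t2` are illustrative.

```python
def piecewise_mean(x, b0, b1, d1, d2, t1, t2):
    """Mean response of a continuous piecewise-linear model with two thresholds.

    b0, b1: intercept and slope below the first threshold t1.
    d1:     change in slope at t1 (slope is b1 + d1 between t1 and t2).
    d2:     change in slope at t2 (slope is b1 + d1 + d2 above t2).
    """
    if not t1 < t2:
        raise ValueError("thresholds must satisfy t1 < t2")
    # Each hinge term is zero before its threshold and linear after it.
    return b0 + b1 * x + d1 * max(0.0, x - t1) + d2 * max(0.0, x - t2)
```

In this form a threshold effect corresponds to a nonzero `d1` or `d2`, so inference on those two parameters answers directly whether the slope changed.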

4.
Comput Biol Med ; 136: 104760, 2021 09.
Article in English | MEDLINE | ID: mdl-34416572

ABSTRACT

BACKGROUND AND OBJECTIVE: In binary classification problems with a rare class of interest, there is relatively little information available for the rare class to build a model. On the other hand, the number of useful variables to develop a model for classification can be high-dimensional. For example, in drug discovery, there are usually very few bioactive compounds in a large chemical library, whereas thousands of potentially useful explanatory variables characterize a compound's chemical structure. The sparsity of information for the rare class of interest makes it difficult for the standard classification models to exploit the richness of the useful feature variables. Thus, the objective of this paper is to develop an R package which clusters the feature variables into diverse subsets to be aggregated into a powerful ensemble for the detection of a rare class object. METHODS: The ensemble of phalanxes (EPX) builds a classifier by exploiting the richness of feature variables using several diverse subsets of variables, called phalanxes, and outperforms many competitive state-of-the-art classification methods in terms of predictive ranking of the rare class of interest. RESULTS: We present an R package EPX which implements the algorithm to form the ensemble of phalanxes as well as its associated functions. We further show how the ensemble of phalanxes can be constructed using parallel computing to lower the computational burden given high-dimensional data. CONCLUSIONS: The R package EPX shows a flexible way of clustering feature variable space into smaller and diverse subsets of variables to develop an ensemble of phalanxes which better ranks a rare class object in a highly unbalanced two-class classification problem. The ensemble EPX will be useful to detect the rare drug-like active biomolecules for development in drug discovery (Tomal et al., Mar. 2016) [1] and homologous proteins using similarity scores of amino acid sequences in protein homology (Tomal et al., 2019) [2]. The package EPX is freely available to download from CRAN (https://CRAN.R-project.org/package=EPX).


Subject(s)
Algorithms, Amino Acid Sequence, Cluster Analysis
5.
Inform Health Soc Care ; 46(4): 425-442, 2021 Dec 02.
Article in English | MEDLINE | ID: mdl-33851897

ABSTRACT

Childhood stunting is a serious public health concern in Bangladesh. Earlier research used conventional statistical methods to identify the risk factors of stunting, and very little is known about the applications and usefulness of machine learning (ML) methods that can identify the risk factors of various health conditions based on complex data. This research evaluates the performance of ML methods in predicting stunting among children under 5 years of age using 2014 Bangladesh Demographic and Health Survey data. This paper also identifies variables which are important to predict stunting in Bangladesh. Among the selected ML methods, gradient boosting provides the smallest misclassification error in predicting stunting, followed by random forests, support vector machines, classification trees and logistic regression with forward-stepwise selection. The top 10 important variables (in order of importance) that better predict childhood stunting in Bangladesh are child age, wealth index, maternal education, preceding birth interval, paternal education, division, household size, maternal age at first birth, maternal nutritional status, and parental age. Our study shows that ML can support the building of prediction models and emphasizes the demographic, socioeconomic, nutritional and environmental factors to understand stunting in Bangladesh.


Subject(s)
Growth Disorders, Machine Learning, Aged, Bangladesh/epidemiology, Child, Growth Disorders/epidemiology, Humans, Infant, Logistic Models, Risk Factors
6.
Ecol Evol ; 10(23): 13500-13517, 2020 Dec.
Article in English | MEDLINE | ID: mdl-33304555

ABSTRACT

The relationships between an environmental variable and an ecological response are usually estimated by models fitted through the conditional mean of the response given environmental stress. For example, nonparametric loess and parametric piecewise linear regression model (PLRM) are often used to represent simple to complex nonlinear relationships. In contrast, piecewise linear quantile regression models (PQRM) fitted across various quantiles of the response can reveal nonlinearities in its range of variation across the explanatory variable. We assess the number and positions of candidate breakpoints using loess and compare the relative efficiencies of PLRM and PQRM to quantitatively determine the breakpoints' location and precision. We propose a nonparametric method to generate bootstrap confidence intervals for breakpoints using PQRM and prediction bands for loess and PQRM. We illustrated the applications using data from two aquatic studies suspected to exhibit multiple environmental breakpoints: relating a fish multimetric index of community health (MMI) to agricultural activity in wetlands' adjacent drainage basins; and relating cyanobacterial biomass to total phosphorus concentration in Canadian lakes. Two statistically significant breakpoints were detected in each dataset, demarcating boundaries of three linear segments, each with markedly different slopes. PQRM generated less biased, more accurate, and narrower confidence intervals for the breakpoints and narrower prediction bands than PLRM, especially for small samples and large error variability. In both applications, the relationship between the response and environmental variables was weak/nonsignificant below the lower threshold, strong through the midrange of the environmental gradient, and weak/nonsignificant beyond the upper threshold. We describe several advantages of PQRM over PLRM in characterizing environmental relationships where the scatter of points represents natural environmental variation rather than measurement error. The proposed methodology will be useful for detecting multiple breakpoints in ecological applications where the limits of variation are as important as the conditional mean of a function.
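
What separates quantile regression from mean regression is the loss being minimized. As a small sketch, the standard check (pinball) loss for quantile level `tau` is shown below; the function names are illustrative, and the `fitted` values are assumed to come from whatever piecewise-linear fit is being evaluated.

```python
def check_loss(residual, tau):
    """Check (pinball) loss: weights positive residuals by tau, negative by 1 - tau."""
    return residual * (tau - (1.0 if residual < 0 else 0.0))

def quantile_objective(y, fitted, tau):
    """Total check loss of fitted values against observations at quantile level tau."""
    return sum(check_loss(yi - fi, tau) for yi, fi in zip(y, fitted))
```

With `tau = 0.5` the objective is half the sum of absolute residuals (median regression); values of `tau` near 0 or 1 fit the lower or upper limits of the response, which is what lets a quantile model describe the range of variation rather than only the conditional mean.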

7.
J Chem Inf Model ; 56(3): 501-9, 2016 Mar 28.
Article in English | MEDLINE | ID: mdl-26906936

ABSTRACT

A quantitative structure-activity relationship (QSAR) is a model relating a specific biological response to the chemical structures of compounds. There are many descriptor sets available to characterize chemical structure, raising the question of how to choose among them or how to use all of them for training a QSAR model. Making efficient use of all sets of descriptors is particularly problematic when active compounds are rare among the assay response data. We consider various strategies to make use of the richness of multiple descriptor sets when assay data are poor in active compounds. Comparisons are made using data from four bioassays, each with five sets of molecular descriptors. The recommended method takes all available descriptors from all sets and uses an algorithm to partition them into groups called phalanxes. Distinct statistical models are trained, each based on only the descriptors in one phalanx, and the models are then averaged in an ensemble of models. By giving the descriptors a chance to contribute in different models, the recommended method uses more of the descriptors in model averaging. This results in better ranking of active compounds to identify a shortlist of drug candidates for development.
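
The model-averaging step described above can be sketched in a few lines, assuming each phalanx model has already produced a predicted activity score for every compound. How the phalanxes are formed and how each model is trained is not shown; the function names are hypothetical and do not correspond to any package API.

```python
def ensemble_scores(phalanx_scores):
    """Average per-compound scores across models, one model per phalanx.

    phalanx_scores: list of score lists, all of the same length
                    (one score per compound from each phalanx model).
    """
    n_models = len(phalanx_scores)
    n_compounds = len(phalanx_scores[0])
    return [
        sum(model[i] for model in phalanx_scores) / n_models
        for i in range(n_compounds)
    ]

def rank_by_score(scores):
    """Compound indices ordered from highest to lowest ensemble score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

For example, `rank_by_score(ensemble_scores([[0.1, 0.8, 0.3], [0.3, 0.6, 0.5]]))` puts the second compound first, since its averaged score is the highest of the three.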


Subject(s)
Quantitative Structure-Activity Relationship, Biological Assay, Tumor Cell Line, Humans, Molecular Models