Results 1 - 20 of 215
1.
Article in English | MEDLINE | ID: mdl-39389092

ABSTRACT

Diabetes mellitus (DM) can become life-threatening as a result of individuals' dietary habits and lifestyle choices. Diabetes is characterized by elevated levels of glucose and an excess of protein in the blood. Poor eating habits and lifestyles are largely responsible for the rise in overweight, obesity, and various related conditions. This study investigated several techniques and algorithms for forecasting diabetes-related risk. Eight machine learning (ML) algorithms were tested on a diabetes dataset: a support vector classifier, gradient boosting, multilayer perceptron, random forest, K-nearest neighbors, logistic regression, extreme gradient boosting, and decision tree. To enhance the model's ability to predict diabetes, we propose using feature engineering (FE) and feature scaling. We used the Mendeley diabetes dataset to assess the model's predictive capacity and built the models in Python with the eight classification techniques. Random Forest reached 99.21% and Gradient Boosting 99.61%, while Extreme Gradient Boosting and Decision Tree achieved the highest F1 score (99.81%), accuracy (99.80%), precision (99.81%), and recall (99.81%) of all classification approaches.
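
The abstract mentions feature scaling without saying which scaler was used. As a minimal sketch, assuming the two most common choices (min-max scaling and standardization), the idea looks like this in plain Python. The function names and the glucose values are illustrative only and do not come from the study.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale values linearly to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Hypothetical glucose readings, for illustration only.
glucose = [90.0, 110.0, 130.0, 150.0]
scaled = min_max_scale(glucose)
z_scores = standardize(glucose)
```

Scaling matters most for distance-based and gradient-based learners such as K-nearest neighbors, support vector classifiers, and multilayer perceptrons. Tree-based models are largely insensitive to it.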

2.
Diagnostics (Basel) ; 14(17), 2024 Sep 08.
Article in English | MEDLINE | ID: mdl-39272771

ABSTRACT

Electroencephalogram (EEG) signals contain information about the brain's state as they reflect the brain's functioning. However, the manual interpretation of EEG signals is tedious and time-consuming. Therefore, automatic EEG translation models need to be proposed using machine learning methods. In this study, we proposed an innovative method to achieve high classification performance with explainable results. We introduce channel-based transformation, a channel pattern (ChannelPat), the t algorithm, and Lobish (a symbolic language). By using channel-based transformation, EEG signals were encoded using the index of the channels. The proposed ChannelPat feature extractor encoded the transition between two channels and served as a histogram-based feature extractor. An iterative neighborhood component analysis (INCA) feature selector was employed to select the most informative features, and the selected features were fed into a new ensemble k-nearest neighbor (tkNN) classifier. To evaluate the classification capability of the proposed channel-based EEG language detection model, a new EEG language dataset comprising Arabic and Turkish was collected. Additionally, Lobish was introduced to obtain explainable outcomes from the proposed EEG language detection model. The proposed channel-based feature engineering model was applied to the collected EEG language dataset, achieving a classification accuracy of 98.59%. Lobish extracted meaningful information from the cortex of the brain for language detection.
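
The abstract describes encoding EEG signals by channel index and then building a histogram of transitions between channels, but it does not give the exact rule. The sketch below assumes that each time sample is replaced by the index of the channel with the largest amplitude. That rule, the function names, and the toy recording are assumptions for illustration, not the authors' ChannelPat implementation.

```python
def channel_index_sequence(signal):
    """signal: list of time samples, each a list of per-channel amplitudes.
    Returns, for each sample, the index of the channel with the largest value.
    ASSUMPTION: 'largest amplitude wins' is an illustrative encoding rule."""
    return [max(range(len(sample)), key=sample.__getitem__) for sample in signal]

def transition_histogram(indices, n_channels):
    """Count transitions between consecutive channel indices.
    Returns a flat histogram of length n_channels * n_channels."""
    hist = [0] * (n_channels * n_channels)
    for a, b in zip(indices, indices[1:]):
        hist[a * n_channels + b] += 1
    return hist

# Hypothetical 3-channel recording with 5 time samples.
eeg = [
    [0.1, 0.9, 0.2],
    [0.8, 0.3, 0.1],
    [0.2, 0.1, 0.7],
    [0.2, 0.6, 0.3],
    [0.5, 0.4, 0.1],
]
indices = channel_index_sequence(eeg)
features = transition_histogram(indices, n_channels=3)
```

The resulting fixed-length histogram can be passed to any feature selector and classifier, which is the role INCA and tkNN play in the abstract.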

3.
Mol Pharm ; 2024 Sep 30.
Article in English | MEDLINE | ID: mdl-39348223

ABSTRACT

Computational methods including machine learning and molecular dynamics simulations have strong potential to characterize, understand, and ultimately predict the properties of proteins relevant to their stability and function as therapeutics. Such methods would streamline the development pathway by minimizing the current experimental testing required for many protein variants and formulations. The molecular understanding of thermostability and aggregation propensity has advanced significantly along with predictive algorithms based on the sequence-level or structural-level information on a protein. However, these approaches focus largely on a comparison of protein sequence variations to correlate the properties of proteins to their stability, solubility, and aggregation propensity. For therapeutic protein development, it is of equal importance to take into account the impact of the formulation conditions to elucidate and predict the stability of the antibody drugs. At the macroscopic level, changing temperature, pH, ionic strength, and the addition of excipients can significantly alter the kinetics of protein aggregation. The mechanisms controlling aggregation kinetics have been traced back to a combination of molecular features, including conformational stability, partial unfolding to aggregation-prone states, and the colloidal stability governed by surface charges and hydrophobicity. However, very little has been done to evaluate these features in the context of protein dynamics in different formulations. In this work, we have combined a range of molecular features calculated from the Fab A33 protein sequence and molecular dynamics simulations. Using the power of advanced, yet interpretable, statistical tools, it has been possible to uncover greater insights into the mechanisms behind protein stability, validating previous findings, and also develop models that can predict the aggregation kinetics within a range of 49 different solution conditions.

4.
Sci Rep ; 14(1): 22215, 2024 09 27.
Article in English | MEDLINE | ID: mdl-39333731

ABSTRACT

Breast cancer (BC) is a prominent cause of female mortality on a global scale. Recently, there has been growing interest in using blood and tissue-based biomarkers to detect and diagnose BC, as this offers a non-invasive approach. Several machine-learning techniques have been proposed to improve the classification and prediction of BC using large biomarker datasets. In this paper, we present a multi-stage approach that computes new features and then arranges them into an input image for the ResNet50 neural network. The method transforms the original values into normalized values based on their membership in the Gaussian distributions of healthy and BC samples for each feature. To test the effectiveness of the proposed approach, we used the Coimbra and Wisconsin datasets. The results demonstrate a performance improvement, with an accuracy of 100% on both the Coimbra and Wisconsin datasets. Furthermore, comparison with the existing literature supports the reliability and effectiveness of our methodology: because of their generality, the normalized values can reduce the number of samples misclassified by ML techniques.
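
The normalization step can be illustrated with a small sketch. The abstract says only that values are normalized by their membership in per-feature Gaussian distributions of healthy and BC samples. The ratio used below (BC density divided by the sum of both densities) is an assumption chosen for illustration, and the sample values are invented.

```python
import math
from statistics import mean, stdev

def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def membership_normalize(x, healthy, cancer):
    """Map a raw value to [0, 1]: relative membership in the cancer-class Gaussian.
    ASSUMPTION: this ratio is an illustrative choice, not the paper's formula."""
    p_h = gaussian_pdf(x, mean(healthy), stdev(healthy))
    p_c = gaussian_pdf(x, mean(cancer), stdev(cancer))
    return p_c / (p_h + p_c)

# Hypothetical values of one biomarker in each group.
healthy_values = [80.0, 85.0, 90.0, 95.0, 100.0]
cancer_values = [100.0, 105.0, 110.0, 115.0, 120.0]
midpoint_score = membership_normalize(100.0, healthy_values, cancer_values)
```

A value halfway between two equally spread groups maps to 0.5, and values closer to one group's mean move toward 0 or 1.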


Subject(s)
Breast Neoplasms; Neural Networks, Computer; Humans; Breast Neoplasms/diagnosis; Female; Machine Learning; Biomarkers, Tumor; Reproducibility of Results; Algorithms
5.
Sci Rep ; 14(1): 22658, 2024 09 30.
Article in English | MEDLINE | ID: mdl-39349512

ABSTRACT

This study evaluates the diagnostic efficacy of automated machine learning (AutoGluon) with automated feature engineering and selection (autofeat) in adult patients with ambiguous computed tomography (CT) results for acute appendicitis (AA), using one model based on clinical manifestations alone and one integrating both clinical manifestations and CT findings. These models were compared with conventional single machine learning models such as logistic regression (LR) and established scoring systems such as the Adult Appendicitis Score (AAS) to address the gap in diagnostic approaches for uncertain AA cases. In this retrospective analysis of 303 adult patients with indeterminate CT findings, the cohort was divided into appendicitis (n = 115) and non-appendicitis (n = 188) groups. AutoGluon and autofeat were used for AA prediction. The AutoGluon-clinical model relied solely on clinical data, whereas the AutoGluon-clinical-CT model included both clinical and CT data. The area under the receiver operating characteristic curve (AUROC) and other metrics for the test dataset, namely accuracy, sensitivity, specificity, PPV, NPV, and F1 score, were used to compare AutoGluon models with single machine learning models and the AAS. The single ML models in this study were LR, LASSO regression, ridge regression, support vector machine, decision tree, random forest, and extreme gradient boosting. Feature importance values were extracted using the "feature_importance" attribute from AutoGluon. The AutoGluon-clinical model demonstrated an AUROC of 0.785 (95% CI 0.691-0.890), and the ridge regression model with only clinical data revealed an AUROC of 0.755 (95% CI 0.649-0.861). The AutoGluon-clinical-CT model (AUROC 0.886 with 95% CI 0.820-0.951) performed better than the ridge model using clinical and CT data (AUROC 0.852 with 95% CI 0.774-0.930, p = 0.029). A new feature, exp(-(duration from pain to CT)^3 + rebound tenderness), was identified (importance = 0.049, p = 0.001).
AutoML (AutoGluon) and autoFE (autofeat) enhanced the diagnosis of uncertain AA cases, particularly when combining CT and clinical findings. This study suggests the potential of integrating AutoML and autoFE in clinical settings to improve diagnostic strategies and patient outcomes and make more efficient use of healthcare resources. Moreover, this research supports further exploration of machine learning in diagnostic processes.
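
The engineered feature reported in the abstract can be written out as a small function. This sketch reads the "3" after the duration term as an exponent, which is an assumption. The abstract gives neither the units nor any scaling of the duration, so the inputs below are placeholders rather than clinical values.

```python
import math

def engineered_feature(duration_pain_to_ct, rebound_tenderness):
    """Compute exp(-(duration)^3 + rebound_tenderness).
    ASSUMPTION: the '3' is an exponent; the units and any scaling of the
    duration are not stated in the abstract."""
    return math.exp(-(duration_pain_to_ct ** 3) + rebound_tenderness)

# Placeholder inputs: duration on an unspecified scale, tenderness coded 0/1.
no_delay_no_tenderness = engineered_feature(0.0, 0)
unit_delay_with_tenderness = engineered_feature(1.0, 1)
```

With this form, the feature decays quickly as the duration grows and is multiplied by e when rebound tenderness is present.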


Subject(s)
Appendicitis; Machine Learning; Tomography, X-Ray Computed; Humans; Appendicitis/diagnostic imaging; Appendicitis/diagnosis; Male; Tomography, X-Ray Computed/methods; Female; Adult; Retrospective Studies; Middle Aged; ROC Curve
6.
Sensors (Basel) ; 24(17), 2024 Aug 30.
Article in English | MEDLINE | ID: mdl-39275547

ABSTRACT

Prevalence estimates of Parkinson's disease (PD)-the fastest-growing neurodegenerative disease-are generally underestimated due to issues surrounding diagnostic accuracy, symptomatic undiagnosed cases, suboptimal prodromal monitoring, and limited screening access. Remotely monitored wearable devices and sensors provide precise, objective, and frequent measures of motor and non-motor symptoms. Here, we used consumer-grade wearable device and sensor data from the WATCH-PD study to develop a PD screening tool aimed at eliminating the gap between patient symptoms and diagnosis. Early-stage PD patients (n = 82) and age-matched comparison participants (n = 50) completed a multidomain assessment battery during a one-year longitudinal multicenter study. Using disease- and behavior-relevant feature engineering and multivariate machine learning modeling of early-stage PD status, we developed a highly accurate (92.3%), sensitive (90.0%), and specific (100%) random forest classification model (AUC = 0.92) that performed well across environmental and platform contexts. These findings provide robust support for further exploration of consumer-grade wearable devices and sensors for global population-wide PD screening and surveillance.
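
Accuracy, sensitivity, and specificity, as reported above, all derive from the four cells of a confusion matrix. A minimal sketch follows. The counts are hypothetical and are not taken from the WATCH-PD study.

```python
def screening_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # true-positive rate
        "specificity": tn / (tn + fp),  # true-negative rate
    }

# Hypothetical counts, for illustration only.
metrics = screening_metrics(tp=18, fn=2, tn=9, fp=1)
```

For a screening tool, sensitivity and specificity are usually read together, because accuracy alone can look high when one class dominates the sample.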


Subject(s)
Parkinson Disease; Wearable Electronic Devices; Humans; Parkinson Disease/diagnosis; Male; Female; Middle Aged; Aged; Machine Learning; Longitudinal Studies; Biosensing Techniques/instrumentation; Biosensing Techniques/methods
7.
Front Pediatr ; 12: 1400110, 2024.
Article in English | MEDLINE | ID: mdl-39318617

ABSTRACT

Introduction: Autism spectrum disorder (ASD) is a neurodevelopmental condition that significantly impacts the mental, emotional, and social development of children. Early screening for ASD typically involves a series of questionnaires. From the answers, healthcare professionals can identify whether a child is at risk for developing ASD and refer them for further evaluation and diagnosis. CHAT-23, which contains 23 questions, is an effective and widely used test in China for the early screening of ASD. Methods: We collected clinical data from Wuxi, China. Each CHAT-23 question is treated as a feature for building machine learning models. We introduce machine learning methods into ASD screening, using the Max-Relevance and Min-Redundancy (mRMR) feature selection method to identify the most important of the 23 questions in the collected CHAT-23 questionnaires. Seven mainstream supervised machine learning models were built and evaluated. Results: Among the seven supervised machine learning models evaluated, the best-performing model achieved a sensitivity of 0.909 and a specificity of 0.922 when the number of features was reduced to 9. This demonstrates the model's ability to accurately identify children at risk for ASD even with a more concise set of features. Discussion: Our study focuses on the health of Chinese children, introducing machine learning methods to provide more accurate and effective early screening tests for autism. This approach not only enhances the early detection of ASD but also helps refine the CHAT-23 questionnaire by identifying the questions most relevant to the diagnosis process.
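
The mRMR idea named in the Methods can be sketched as a greedy loop: at each step, pick the question whose mutual information with the label is high and whose average mutual information with already selected questions is low. The code below uses the "difference" form of the criterion, which is one common variant. The abstract does not say which variant the authors used, and the answers and labels are invented.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

def mrmr(features, target, k):
    """Greedy mRMR: pick k names maximizing relevance minus mean redundancy.
    features: dict mapping feature name -> list of discrete values."""
    selected, remaining = [], list(features)
    while remaining and len(selected) < k:
        def score(name):
            relevance = mutual_information(features[name], target)
            if not selected:
                return relevance
            redundancy = sum(
                mutual_information(features[name], features[s]) for s in selected
            ) / len(selected)
            return relevance - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical yes/no answers for three questions and an at-risk label.
answers = {
    "q1": [1, 1, 1, 0, 0, 0, 0, 0],
    "q2": [1, 1, 1, 0, 0, 0, 0, 0],  # exact duplicate of q1 (fully redundant)
    "q3": [1, 0, 1, 1, 0, 1, 0, 0],
}
at_risk = [1, 1, 1, 1, 0, 0, 0, 0]
chosen = mrmr(answers, at_risk, k=2)
```

In this toy case the duplicate question is skipped in favor of a less relevant but non-redundant one, which is the behavior that lets a questionnaire be shortened without losing much information.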

8.
Micromachines (Basel) ; 15(8), 2024 Aug 21.
Article in English | MEDLINE | ID: mdl-39203704

ABSTRACT

This work describes a mathematical model for handwriting devices without a specific reference surface (SRS). The research tested two hypotheses for reconstructing the trace: the first uses circular segments that could be drawn during writing, and the second uses a combination of lines and circles. The proposed system has no flat reference surface, since the sensor is inside the pencil that describes the trace, not on the surface as in tablets or cell phones. An inertial sensor was used for the measurements, in this case a commercial micro-electromechanical (MEMS) linear-acceleration sensor. The tracking device consists of an IMU sensor and a processing board that allow inertial measurements of the pen during on-the-fly tracing. It is essential to highlight that the system has a non-inertial reference frame. Comparing the two proposed models shows that it is possible to construct shapes from curved lines and that the patterns obtained are similar to what is recognized; this method provides an alternative to quaternion calculus for poorly specified orientation problems.

9.
Sci Total Environ ; 950: 175231, 2024 Nov 10.
Article in English | MEDLINE | ID: mdl-39098417

ABSTRACT

Accurate prediction of instantaneous high lake water levels and flood flows (flood stages), from micro-catchments to big river basins, is critical for flood forecasting. Lake Carl Blackwell, a small-watershed reservoir in the south-central USA, served as the primary case study because of its rich historical dataset. Because both current and previous rainfall contribute to the reservoir's water body, a series of hourly rainfall features was created to maximize predictive power: total rainfall amounts in the current hour, the past 2 h, 3 h, …, 600 h, in addition to previous-day lake levels. Notably, the rainfall features are the accumulated rainfall amounts from the present back to previous hours rather than the rainfall amount in any specific hour. Random Forest Regression (RFR) was used to score the features' importance and predict the flood stages, along with Neural Network - Multi-layer Perceptron Regression (NN-MLP), Support Vector Regression (SVR), Extreme Gradient Boosting (XGBoost), and ordinary multivariate linear regression (MLR), together with the dimension-reduced linear models Principal Component Regression (PCR) and Partial Least Squares Regression (PLSR). For the testing dataset, the prediction accuracy for the lake flood stages reached as high as 0.95 in R², 0.11 ft. in mean absolute error (MAE), and 0.21 ft. in root mean square error (RMSE) with RFR (NN-MLP performed equally well), with small decreases in accuracy for the other two non-linear algorithms, XGBoost and SVR. The linear regressions with dimension reduction had the lowest accuracy. Furthermore, our approach demonstrated high accuracy and broad applicability for surface runoff and streamflow predictions across three different-sized watersheds in the region, from micro-catchment to big river basin, with earlier rainfall contributing more predictive power for larger watersheds and less for smaller ones.
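
The accumulated-rainfall features described above are trailing sums over windows of increasing length. A minimal sketch follows. The function name, the window list, and the rainfall values are illustrative, and the truncation of windows that reach before the start of the record is a choice made here, not something the abstract specifies.

```python
def cumulative_rainfall_features(hourly_rain, t, windows):
    """Total rainfall over the most recent w hours ending at hour index t.
    Windows that reach before the start of the record are truncated."""
    return {
        f"rain_{w}h": sum(hourly_rain[max(0, t - w + 1): t + 1])
        for w in windows
    }

# Hypothetical hourly rainfall amounts (arbitrary units).
rain = [0.0, 0.2, 0.5, 0.0, 0.1, 0.3]
feats = cumulative_rainfall_features(rain, t=5, windows=[1, 2, 3, 6])
```

Because each window contains the previous one, the features are strongly correlated, which may be one reason tree ensembles cope with them better than plain linear models.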

10.
Heliyon ; 10(15): e35167, 2024 Aug 15.
Article in English | MEDLINE | ID: mdl-39166039

ABSTRACT

In developing countries, smart grids are nonexistent, and electricity theft significantly hampers power supply. This research introduces a lightweight deep-learning model using monthly customer readings as input data. By employing careful direct and indirect feature engineering techniques, including Principal Component Analysis (PCA), t-distributed Stochastic Neighbor Embedding (t-SNE), and Uniform Manifold Approximation and Projection (UMAP), and resampling methods such as Random-Under-Sampler (RUS), Synthetic Minority Over-sampling Technique (SMOTE), and Random-Over-Sampler (ROS), an effective solution is proposed. Previous studies indicate that models achieve high precision, recall, and F1 score for the non-theft (0) class, but perform poorly, even achieving 0%, for the theft (1) class. Through parameter tuning and employing Random-Over-Sampler (ROS), significant improvements in accuracy, precision (89%), recall (94%), and F1 score (91%) for the theft (1) class are achieved. The results demonstrate that the proposed model outperforms existing methods, showcasing its efficacy in detecting electricity theft in non-smart grid environments.
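
Random oversampling, the resampling method the abstract credits for the improvement, duplicates minority-class rows until the classes are balanced. The sketch below is a plain-Python illustration of that idea, not the library implementation the authors may have used. The readings and labels are invented.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until every class
    has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels

# Hypothetical monthly readings: label 1 = theft, 0 = non-theft.
readings = [[310, 295], [280, 300], [305, 290], [120, 40], [300, 310]]
theft = [0, 0, 0, 1, 0]
X_bal, y_bal = random_oversample(readings, theft)
```

Oversampling is normally applied to the training split only, so that duplicated rows cannot leak into the evaluation data.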

11.
J Am Stat Assoc ; 119(545): 81-94, 2024.
Article in English | MEDLINE | ID: mdl-39185398

ABSTRACT

In the emerging field of materials informatics, a fundamental task is to identify physicochemically meaningful descriptors, or materials genes, which are engineered from primary features and a set of elementary algebraic operators through compositions. Standard practice directly analyzes the high-dimensional candidate predictor space in a linear model; statistical analyses are then substantially hampered by the daunting challenge posed by the astronomically large number of correlated predictors with limited sample size. We formulate this problem as variable selection with operator-induced structure (OIS) and propose a new method to achieve unconventional dimension reduction by utilizing the geometry embedded in OIS. Although the model remains linear, we iterate nonparametric variable selection for effective dimension reduction. This enables variable selection based on ab initio primary features, leading to a method that is orders of magnitude faster than existing methods, with improved accuracy. To select the nonparametric module, we discuss a desired performance criterion that is uniquely induced by variable selection with OIS; in particular, we propose to employ a Bayesian Additive Regression Trees (BART)-based variable selection method. Numerical studies show superiority of the proposed method, which continues to exhibit robust performance when the input dimension is out of reach of existing methods. Our analysis of single-atom catalysis identifies physical descriptors that explain the binding energy of metal-support pairs with high explanatory power, leading to interpretable insights to guide the prevention of a notorious problem called sintering and aid catalysis design.
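
To see why the candidate predictor space grows so quickly, consider one round of composing primary features with a few algebraic operators. The operator set, naming scheme, and feature values below are hypothetical. The sketch illustrates only the candidate-generation step, not the proposed selection method.

```python
import math
from itertools import combinations

# Hypothetical operator set; sqrt assumes non-negative inputs.
UNARY = {"sq": lambda x: x * x, "sqrt": math.sqrt}
BINARY = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}

def expand(features):
    """One round of operator composition over named columns (name -> list)."""
    out = dict(features)
    for name, col in features.items():
        for op, fn in UNARY.items():
            out[f"{op}({name})"] = [fn(v) for v in col]
    for (n1, c1), (n2, c2) in combinations(features.items(), 2):
        for op, fn in BINARY.items():
            out[f"{op}({n1},{n2})"] = [fn(a, b) for a, b in zip(c1, c2)]
    return out

# Two hypothetical primary features measured on two samples.
primary = {"x1": [1.0, 4.0], "x2": [2.0, 3.0]}
round_one = expand(primary)
round_two = expand(round_one)
```

Two primary features become eight candidates after one round and several dozen after two, which is the combinatorial growth that makes direct linear-model analysis of the full space impractical.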

12.
Genome Biol ; 25(1): 225, 2024 Aug 16.
Article in English | MEDLINE | ID: mdl-39152456

ABSTRACT

BACKGROUND: Single-cell chromatin accessibility assays, such as scATAC-seq, are increasingly employed in individual and joint multi-omic profiling of single cells. As scATAC-seq and multi-omics datasets continue to accumulate, challenges in analyzing such sparse, noisy, and high-dimensional data become pressing. Specifically, one challenge relates to optimizing the processing of chromatin-level measurements and efficiently extracting information to discern cellular heterogeneity. This is of critical importance, since the identification of cell types is a fundamental step in current single-cell data analysis practices. RESULTS: We benchmark 8 feature engineering pipelines derived from 5 recent methods to assess their ability to discover and discriminate cell types. By using 10 metrics calculated at the cell embedding, shared nearest neighbor graph, or partition levels, we evaluate the performance of each method at different data processing stages. This comprehensive approach allows us to thoroughly understand the strengths and weaknesses of each method and the influence of parameter selection. CONCLUSIONS: Our analysis provides guidelines for choosing analysis methods for different datasets. Overall, feature aggregation, SnapATAC, and SnapATAC2 outperform latent semantic indexing-based methods. For datasets with complex cell-type structures, SnapATAC and SnapATAC2 are preferred. With large datasets, SnapATAC2 and ArchR are most scalable.


Subject(s)
Benchmarking; Chromatin; Single-Cell Analysis; Single-Cell Analysis/methods; Chromatin/genetics; Chromatin/metabolism; Humans; Computational Biology/methods
13.
Sensors (Basel) ; 24(15), 2024 Jul 24.
Article in English | MEDLINE | ID: mdl-39123855

ABSTRACT

The detection performance of radar is significantly impaired by active jamming and mutual interference from other radars. This paper proposes a radio signal modulation recognition method to accurately recognize these signals, which supports jamming-cancellation decisions. Based on the ensemble learning stacking algorithm improved by meta-feature enhancement, the proposed method adopts random forests, K-nearest neighbors, and Gaussian naive Bayes as the base-learners, with logistic regression serving as the meta-learner. It takes multi-domain features of the signals as input: time-domain features (fuzzy entropy, slope entropy, and Hjorth parameters), frequency-domain features (spectral entropy), and fractal-domain features (fractal dimension). A simulation experiment covering seven common types of radar and active jamming signals was performed to validate effectiveness and evaluate performance. The results show that the proposed method outperforms other classification methods and meets the requirements of low signal-to-noise ratio and few-shot learning.
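
Of the time-domain features listed, the Hjorth parameters have a compact standard definition: activity is the signal variance, mobility is the square root of the ratio of the variance of the first difference to the variance of the signal, and complexity is the mobility of the first difference divided by the mobility of the signal. A minimal sketch on a synthetic sinusoid (not radar data):

```python
import math
from statistics import pvariance

def diff(x):
    """First difference of a sequence."""
    return [b - a for a, b in zip(x, x[1:])]

def hjorth(x):
    """Return (activity, mobility, complexity) of a 1-D signal."""
    dx = diff(x)
    ddx = diff(dx)
    activity = pvariance(x)
    mobility = math.sqrt(pvariance(dx) / activity)
    complexity = math.sqrt(pvariance(ddx) / pvariance(dx)) / mobility
    return activity, mobility, complexity

# Synthetic test signal: 5 full periods of a unit sinusoid in 200 samples.
signal = [math.sin(2 * math.pi * 5 * n / 200) for n in range(200)]
activity, mobility, complexity = hjorth(signal)
```

A pure sinusoid has complexity close to 1, so values well above 1 indicate a waveform that departs from a single sine shape.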

14.
Adv Sci (Weinh) ; 11(36): e2405124, 2024 Sep.
Article in English | MEDLINE | ID: mdl-39041889

ABSTRACT

Amid growing interest in the precise detection of volatile organic compounds (VOCs) in industrial settings, the demand for highly effective gas sensors is at an all-time high. However, traditional sensors, with their single-output signal and bulky, complex integrated structure when formed into arrays, often involve complicated technology and high cost, limiting their widespread adoption. This study introduces a novel approach, employing an integrated YSZ-based (YSZ: yttria-stabilized zirconia) mixed potential sensor equipped with a triple-sensing electrode array, to efficiently detect and differentiate six VOC gases. This sensor integrates NiSb2O6, CuSb2O6, and MgSb2O6 sensing electrodes (SEs), which are sensitive to pentane, isoprene, n-propanol, acetone, acetic acid, and formaldehyde. Feature engineering based on intuitive spike-based response values accentuates the distinct characteristics of each gas. Ultimately, an average classification accuracy of 98.8% and an overall coefficient of determination (R²) of 99.3% for concentration regression across the six target gases can be achieved, showcasing the potential to quantitatively distinguish between hazardous industrial VOC gases.

15.
Article in English | MEDLINE | ID: mdl-39082872

ABSTRACT

Explorative data analysis (EDA) is a critical step in scientific projects, aiming to uncover valuable insights and patterns within data. Traditionally, EDA involves manual inspection, visualization, and various statistical methods. The advent of artificial intelligence (AI) and machine learning (ML) has the potential to improve EDA, offering more sophisticated approaches that enhance its efficacy. This review explores how AI and ML algorithms can improve feature engineering and selection during EDA, leading to more robust predictive models and data-driven decisions. Tree-based models, regularized regression, and clustering algorithms were identified as key techniques. These methods automate feature importance ranking, handle complex interactions, perform feature selection, reveal hidden groupings, and detect anomalies. Real-world applications include risk prediction in total hip arthroplasty and subgroup identification in scoliosis patients. Recent advances in explainable AI and EDA automation show potential for further improvement. The integration of AI and ML into EDA accelerates tasks and uncovers sophisticated insights. However, effective utilization requires a deep understanding of the algorithms, their assumptions, and limitations, along with domain knowledge for proper interpretation. As data continues to grow, AI will play an increasingly pivotal role in EDA when combined with human expertise, driving more informed, data-driven decision-making across various scientific domains. Level of Evidence: Level V - Expert opinion.

16.
JMIR Public Health Surveill ; 10: e52353, 2024 Jul 18.
Article in English | MEDLINE | ID: mdl-39024001

ABSTRACT

BACKGROUND: Multimorbidity is a significant public health concern, characterized by the coexistence and interaction of multiple preexisting medical conditions. This complex condition has been associated with an increased risk of COVID-19. Individuals with multimorbidity who contract COVID-19 often face a significant reduction in life expectancy. The postpandemic period has also highlighted an increase in frailty, emphasizing the importance of integrating existing multimorbidity details into epidemiological risk assessments. Managing clinical data that include medical histories presents significant challenges, particularly due to the sparsity of data arising from the rarity of multimorbidity conditions. Also, the complex enumeration of combinatorial multimorbidity features introduces challenges associated with combinatorial explosions. OBJECTIVE: This study aims to assess the severity of COVID-19 in individuals with multiple medical conditions, considering their demographic characteristics such as age and sex. We propose an evolutionary machine learning model designed to handle sparsity, analyzing preexisting multimorbidity profiles of patients hospitalized with COVID-19 based on their medical history. Our objective is to identify the optimal set of multimorbidity feature combinations strongly associated with COVID-19 severity. We also apply the Apriori algorithm to these evolutionarily derived predictive feature combinations to identify those with high support. METHODS: We used data from 3 administrative sources in Piedmont, Italy, involving 12,793 individuals aged 45-74 years who tested positive for COVID-19 between February and May 2020. From their 5-year pre-COVID-19 medical histories, we extracted multimorbidity features, including drug prescriptions, disease diagnoses, sex, and age. Focusing on COVID-19 hospitalization, we segmented the data into 4 cohorts based on age and sex. 
Addressing data imbalance through random resampling, we compared various machine learning algorithms to identify the optimal classification model for our evolutionary approach. Using 5-fold cross-validation, we evaluated each model's performance. Our evolutionary algorithm, utilizing a deep learning classifier, generated prediction-based fitness scores to pinpoint multimorbidity combinations associated with COVID-19 hospitalization risk. Eventually, the Apriori algorithm was applied to identify frequent combinations with high support. RESULTS: We identified multimorbidity predictors associated with COVID-19 hospitalization, indicating more severe COVID-19 outcomes. Frequently occurring morbidity features in the final evolved combinations were age>53, R03BA (glucocorticoid inhalants), and N03AX (other antiepileptics) in cohort 1; A10BA (biguanide or metformin) and N02BE (anilides) in cohort 2; N02AX (other opioids) and M04AA (preparations inhibiting uric acid production) in cohort 3; and G04CA (Alpha-adrenoreceptor antagonists) in cohort 4. CONCLUSIONS: When combined with other multimorbidity features, even less prevalent medical conditions show associations with the outcome. This study provides insights beyond COVID-19, demonstrating how repurposed administrative data can be adapted and contribute to enhanced risk assessment for vulnerable populations.
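
The final Apriori step counts how often feature combinations occur and keeps those above a support threshold. The sketch below is a simplified version that builds candidates only from individually frequent items rather than running the full level-wise algorithm. The ATC codes are borrowed from the abstract, but the combinations themselves are invented and are not study data.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: support} for itemsets with support >= min_support.
    Simplified Apriori: candidates are built only from frequent single items."""
    n = len(transactions)
    sets = [frozenset(t) for t in transactions]
    result, frequent_items = {}, set()
    for item in sorted({i for t in sets for i in t}):
        support = sum(item in t for t in sets) / n
        if support >= min_support:
            result[frozenset([item])] = support
            frequent_items.add(item)
    for size in range(2, max_size + 1):
        for combo in combinations(sorted(frequent_items), size):
            candidate = frozenset(combo)
            support = sum(candidate <= t for t in sets) / n
            if support >= min_support:
                result[candidate] = support
    return result

# Invented feature combinations (one set per evolved solution).
evolved = [
    {"A10BA", "N02BE"},
    {"A10BA", "N02BE", "R03BA"},
    {"A10BA"},
    {"N02BE", "R03BA"},
]
supports = frequent_itemsets(evolved, min_support=0.5)
```

The pruning relies on the Apriori property: a combination cannot be frequent unless every item in it is frequent on its own.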


Subject(s)
COVID-19; Hospitalization; Machine Learning; Multimorbidity; Humans; COVID-19/epidemiology; Italy/epidemiology; Male; Female; Aged; Hospitalization/statistics & numerical data; Middle Aged; Longitudinal Studies; Aged, 80 and over
17.
Stud Health Technol Inform ; 315: 368-372, 2024 Jul 24.
Article in English | MEDLINE | ID: mdl-39049285

ABSTRACT

This paper explores the balance between fairness and performance in machine learning classification, predicting the likelihood of a patient receiving anti-microbial treatment using structured data in community nursing wound care electronic health records. The data includes two important predictors (gender and language) of the social determinants of health, which we used to evaluate the fairness of the classifiers. At the same time, the impact of various groupings of language codes on classifiers' performance and fairness is analyzed. Most common statistical learning-based classifiers are evaluated. The findings indicate that while K-Nearest Neighbors offers the best fairness metrics among different grouping settings, the performance of all classifiers is generally consistent across different language code groupings. Also, grouping more variables tends to improve the fairness metrics over all classifiers while maintaining their performance.


Subject(s)
Electronic Health Records; Health Equity; Machine Learning; Electronic Health Records/classification; Humans; Social Determinants of Health
18.
Biotechnol Bioeng ; 2024 Jul 23.
Article in English | MEDLINE | ID: mdl-39044472

ABSTRACT

In the burgeoning field of proteins, the effective analysis of intricate protein data remains a formidable challenge, necessitating advanced computational tools for data processing, feature extraction, and interpretation. This study introduces ProteinFlow, an innovative framework designed to revolutionize feature engineering in protein data analysis. ProteinFlow stands out by offering enhanced efficiency in data collection and preprocessing, along with advanced capabilities in feature extraction, directly addressing the complexities inherent in multidimensional protein data sets. Through a comparative analysis, ProteinFlow demonstrated a significant improvement over traditional methods, notably reducing data preprocessing time and expanding the scope of biologically significant features identified. The framework's parallel data processing strategy and advanced algorithms ensure not only rapid data handling but also the extraction of comprehensive, meaningful insights from protein sequences, structures, and interactions. Furthermore, ProteinFlow exhibits remarkable scalability, adeptly managing large-scale data sets without compromising performance, a crucial attribute in the era of big data.

19.
Carbohydr Res ; 542: 109189, 2024 Aug.
Article in English | MEDLINE | ID: mdl-38971003

ABSTRACT

There has been a long-standing bottleneck in the quantitative analysis of the frequencies of homoblock polyads beyond triads using 1H and 13C NMR for linear polysaccharides, primarily because monosaccharides within a long homoblock share similar chemical environments due to identical neighboring units, resulting in indistinct NMR peaks. In this study, through rigorous mathematical induction, inequality relations were established that enabled the calculation of frequency ranges of homoblock polyads from historically reported NMR-derived frequency values of diads and/or triads of alginates, chitosans, homogalacturonans, and galactomannans. The calculated homoblock frequency ranges were then applied to evaluate three chain growth statistical models, including the Bernoulli chain, first-order Markov chain, and second-order Markov chain, for predicting homoblock frequencies in these polysaccharides. Furthermore, based on the mathematically derived inequality relations, a novel 2D array was constructed, enabling the graphical visualization of homoblock features in polysaccharides. It was demonstrated, as a proof of concept, that the novel 2D array, along with a 1D code generated from it, could serve as an effective feature engineering tool for polymer classification using machine learning algorithms.
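
Two of the chain-growth models named above give closed-form predictions for the frequency of a run of n identical units. Under a Bernoulli chain the frequency is F_A^n. Under a first-order Markov chain it is F_A * P(A|A)^(n-1), with P(A|A) = F_AA / F_A. The sketch below uses these textbook formulas. The input frequencies are hypothetical, and the paper's own derivations, including the inequality bounds, are not reproduced here.

```python
def bernoulli_homoblock(f_a, n):
    """Frequency of n consecutive A units under a Bernoulli chain."""
    return f_a ** n

def markov1_homoblock(f_a, f_aa, n):
    """Frequency of n consecutive A units under a first-order Markov chain,
    given the monad frequency f_a and the diad frequency f_aa."""
    p_a_given_a = f_aa / f_a  # transition probability P(A | A)
    return f_a * p_a_given_a ** (n - 1)

# Hypothetical monad and diad frequencies for one monomer type.
F_A, F_AA = 0.6, 0.45
triad_bernoulli = bernoulli_homoblock(F_A, 3)
triad_markov = markov1_homoblock(F_A, F_AA, 3)
```

When the measured diad frequency exceeds F_A squared, as in this example, the Markov model predicts more long homoblocks than the Bernoulli model, which is the kind of difference the calculated frequency ranges can test.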


Subject(s)
Alginates; Magnetic Resonance Spectroscopy; Mannans; Mannans/chemistry; Alginates/chemistry; Galactose/chemistry; Galactose/analogs & derivatives; Pectins
20.
Methods Mol Biol ; 2844: 33-44, 2024.
Article in English | MEDLINE | ID: mdl-39068330

ABSTRACT

Promoters are the genomic regions upstream of genes that RNA polymerase binds in order to initiate gene transcription. Understanding the regulation of gene expression depends on being able to identify promoters, because they are the most important component of gene expression. The goal of this study was to create a machine learning-based model to predict promoters in Agrobacterium tumefaciens (A. tumefaciens) strain C58. Nucleotide density (ND), k-mer, and one-hot encodings were used to represent the promoter sequences. A support vector machine (SVM) with fivefold cross-validation and incremental feature selection (IFS) was used to optimize the generated features. The optimized features were then fed into a random forest (RF) classifier to distinguish promoter sequences. Tenfold cross-validation (CV) analysis showed that the proposed model achieves an accuracy of 84.22%.
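
Two of the sequence encodings mentioned above, one-hot and k-mer frequency, can be sketched in a few lines. The example sequence is made up, the base order A, C, G, T is a convention chosen here, and the code assumes the sequence contains only those four letters.

```python
from itertools import product

BASES = "ACGT"

def one_hot(seq):
    """Encode each nucleotide as a 4-element indicator vector (order A, C, G, T)."""
    return [[1 if base == b else 0 for b in BASES] for base in seq]

def kmer_frequencies(seq, k):
    """Relative frequency of every possible k-mer, in lexicographic order.
    Assumes seq contains only A, C, G and T."""
    kmers = ["".join(p) for p in product(BASES, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = len(seq) - k + 1
    for i in range(total):
        counts[seq[i:i + k]] += 1
    return [counts[m] / total for m in kmers]

# Made-up 8-nucleotide sequence, for illustration only.
sequence = "TTGACAAT"
encoded = one_hot(sequence)
dimer_freqs = kmer_frequencies(sequence, k=2)
```

One-hot encoding preserves position, while k-mer frequencies give a fixed-length vector regardless of sequence length, so the two are often combined before feature selection.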


Subject(s)
Agrobacterium tumefaciens; Artificial Intelligence; Promoter Regions, Genetic; Support Vector Machine; Agrobacterium tumefaciens/genetics; Computational Biology/methods; Algorithms