1.
Brief Bioinform ; 25(5), 2024 Jul 25.
Article in English | MEDLINE | ID: mdl-39234953

ABSTRACT

The internal ribosome entry site (IRES) is a cis-regulatory element that can initiate translation in a cap-independent manner. It is often related to cellular processes and many diseases. Thus, identifying IRES elements is important for understanding their mechanism and finding potential therapeutic strategies for relevant diseases. However, identifying IRES elements by experimental methods is time-consuming and laborious. Many bioinformatics tools have been developed to predict IRES elements, but all these tools are based on structure similarity or machine learning algorithms. Here, we introduce a deep learning model named DeepIRES for precisely identifying IRES elements in messenger RNA (mRNA) sequences. DeepIRES is a hybrid model incorporating dilated 1D convolutional neural network blocks, bidirectional gated recurrent units, and a self-attention module. Tenfold cross-validation results suggest that DeepIRES can capture deeper relationships between sequence features and prediction results than other baseline models. Further comparison on independent test sets illustrates that DeepIRES has a superior and more robust prediction capability than other existing methods. Moreover, DeepIRES achieves high accuracy in predicting experimentally validated IRESs collected in recent studies. By applying an interpretable deep learning analysis, we discover some potential consensus motifs that are related to IRES activities. In summary, DeepIRES is a reliable tool for IRES prediction and gives insights into the mechanism of IRES elements.


Subjects
Deep Learning; Internal Ribosome Entry Sites; RNA, Messenger; RNA, Messenger/genetics; RNA, Messenger/metabolism; Computational Biology/methods; RNA, Viral/genetics; RNA, Viral/metabolism; Humans; Neural Networks, Computer; Algorithms
2.
Brief Bioinform ; 25(2), 2024 Jan 22.
Article in English | MEDLINE | ID: mdl-38279650

ABSTRACT

As the application of large language models (LLMs) has broadened into the realm of biological predictions, leveraging their capacity for self-supervised learning to create feature representations of amino acid sequences, these models have set a new benchmark in tackling downstream challenges, such as subcellular localization. However, previous studies have primarily focused on either the structural design of models or differing strategies for fine-tuning, largely overlooking investigations into the nature of the features derived from LLMs. In this research, we propose different ESM2 representation extraction strategies, considering both the character type and position within the ESM2 input sequence. Using model dimensionality reduction, predictive analysis and interpretability techniques, we have illuminated potential associations between diverse feature types and specific subcellular localizations. In particular, prediction of the mitochondrion and Golgi apparatus favors segment features closer to the N-terminus, and phosphorylation site-based features could mirror phosphorylation properties. We also evaluate the prediction performance and interpretability robustness of Random Forest and Deep Neural Networks with varied feature inputs. This work offers novel insights into maximizing LLMs' utility, understanding their mechanisms, and extracting biological domain knowledge. Furthermore, we have made the code, feature extraction API, and all relevant materials available at https://github.com/yujuan-zhang/feature-representation-for-LLMs.


Subjects
Computational Biology; Neural Networks, Computer; Computational Biology/methods; Amino Acid Sequence; Protein Transport
3.
Proc Natl Acad Sci U S A ; 120(15): e2216698120, 2023 04 11.
Article in English | MEDLINE | ID: mdl-37023129

ABSTRACT

Discovering DNA regulatory sequence motifs and their relative positions is vital to understanding the mechanisms of gene expression regulation. Although deep convolutional neural networks (CNNs) have achieved great success in predicting cis-regulatory elements, the discovery of motifs and their combinatorial patterns from these CNN models has remained difficult. We show that the main difficulty is due to the problem of multifaceted neurons which respond to multiple types of sequence patterns. Since existing interpretation methods were mainly designed to visualize the class of sequences that can activate the neuron, the resulting visualization will correspond to a mixture of patterns. Such a mixture is usually difficult to interpret without resolving the mixed patterns. We propose the NeuronMotif algorithm to interpret such neurons. Given any convolutional neuron (CN) in the network, NeuronMotif first generates a large sample of sequences capable of activating the CN, which typically consists of a mixture of patterns. Then, the sequences are "demixed" in a layer-wise manner by backward clustering of the feature maps of the involved convolutional layers. NeuronMotif can output the sequence motifs, and the syntax rules governing their combinations are depicted by position weight matrices organized in tree structures. Compared to existing methods, the motifs found by NeuronMotif have more matches to known motifs in the JASPAR database. The higher-order patterns uncovered for deep CNs are supported by the literature and ATAC-seq footprinting. Overall, NeuronMotif enables the deciphering of cis-regulatory codes from deep CNs and enhances the utility of CNN in genome interpretation.
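
The abstract says that motifs are reported as position weight matrices (PWMs). As a reminder of what that representation is, the following minimal Python sketch builds a PWM from one already-demixed cluster of equal-length sequences. It does not reproduce NeuronMotif's sampling or backward clustering; the function names, the pseudocount, the uniform background, and the toy sequences are all assumptions made for illustration.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def position_frequency_matrix(seqs, pseudocount=1.0):
    """Per-position base frequencies for equal-length DNA sequences."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    total = len(seqs) + pseudocount * len(ALPHABET)
    matrix = []
    for pos in range(length):
        counts = Counter(s[pos] for s in seqs)
        # The pseudocount keeps unseen bases from getting zero probability.
        matrix.append({b: (counts.get(b, 0) + pseudocount) / total for b in ALPHABET})
    return matrix

def position_weight_matrix(freqs, background=0.25):
    """Log2-odds scores against a uniform background (an assumption)."""
    return [{b: math.log2(p / background) for b, p in col.items()} for col in freqs]

def consensus(freqs):
    """Most frequent base at each position."""
    return "".join(max(col, key=col.get) for col in freqs)

# Hypothetical cluster of activating sequences (toy data, not from the paper).
toy_cluster = ["TGACTCA", "TGAGTCA", "TGACTCA", "TGACTAA"]
pfm = position_frequency_matrix(toy_cluster)
pwm = position_weight_matrix(pfm)
print(consensus(pfm))
```

Positive PWM entries mark bases enriched relative to the background, and negative entries mark depleted ones.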


Subjects
Algorithms; Neural Networks, Computer; Nucleotide Motifs/genetics; Regulatory Sequences, Nucleic Acid/genetics; Databases, Factual
4.
BMC Bioinformatics ; 25(1): 76, 2024 Feb 20.
Article in English | MEDLINE | ID: mdl-38378494

ABSTRACT

BACKGROUND: Genetic ancestry, inferred from genomic data, is a quantifiable biological parameter. While much of the human genome is identical across populations, it is estimated that as much as 0.4% of the genome can differ due to ancestry. This variation is primarily characterized by single nucleotide variants (SNVs), which are often unique to specific genetic populations. Knowledge of a patient's genetic ancestry can inform clinical decisions, from genetic testing and health screenings to medication dosages, based on ancestral disease predispositions. Nevertheless, the current reliance on self-reported ancestry can introduce subjectivity and exacerbate health disparities. While genomic sequencing data enables objective determination of a patient's genetic ancestry, existing approaches are limited to ancestry inference at the continental level. RESULTS: To address this challenge and create an objective, measurable metric of genetic ancestry, we present SNVstory, a method built upon three independent machine learning models for accurately inferring the sub-continental ancestry of individuals. We also introduce a novel method for simulating individual samples from aggregate allele frequencies from known populations. SNVstory includes a feature-importance scheme, unique among open-source ancestral tools, which allows the user to track the ancestral signal broadcast by a given gene or locus. We successfully evaluated SNVstory using a clinical exome sequencing dataset, comparing self-reported ethnicity and race to our inferred genetic ancestry, and demonstrate the capability of the algorithm to estimate ancestry from 36 different populations with high accuracy. CONCLUSIONS: SNVstory represents a significant advance in methods to assign genetic ancestry, opening the door to ancestry-informed care. SNVstory, an open-source model, is packaged as a Docker container for enhanced reliability and interoperability. It can be accessed from https://github.com/nch-igm/snvstory.


Subjects
Ethnicity; Genetics, Population; Humans; Reproducibility of Results; Gene Frequency; Ethnicity/genetics; Genetic Testing; Genome, Human; Polymorphism, Single Nucleotide
5.
J Biomed Inform ; 154: 104652, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38718897

ABSTRACT

OBJECTIVES: Ischemic heart disease (IHD) is a significant contributor to global mortality and disability, imposing a substantial social and economic burden on individuals and healthcare systems. To enhance the efficient allocation of medical resources and ultimately benefit a larger population, accurate prediction of healthcare costs is crucial. METHODS: We developed an interpretable IHD hospitalization cost prediction model that integrates network analysis with machine learning. Specifically, our network-enhanced model extracts explainable features by leveraging a diagnosis-procedure concurrence network and advanced graph kernel techniques, facilitating the capture of intricate relationships between medical codes. RESULTS: The proposed model achieved an R² of 0.804 ± 0.008 and a root mean square error (RMSE) of 17,076 ± 420 CNY on the temporal validation dataset, demonstrating comparable performance to the model employing less interpretable code embedding features (R²: 0.800 ± 0.008; RMSE: 17,279 ± 437 CNY) and the hybrid graph isomorphism network (R²: 0.802 ± 0.007; RMSE: 17,249 ± 387 CNY). The interpretation of the network-enhanced model assisted in pinpointing specific diagnoses and procedures associated with higher hospitalization costs, including acute kidney injury, permanent atrial fibrillation, intra-aortic balloon pump, and temporary pacemaker placement, among others. CONCLUSION: Our analysis results demonstrate that the proposed model strikes a balance between predictive accuracy and interpretability. It aids in identifying specific diagnoses and procedures associated with higher hospitalization costs, underscoring its potential to support intelligent management of IHD.


Subjects
Hospitalization; Myocardial Ischemia; Humans; Myocardial Ischemia/diagnosis; Hospitalization/economics; Machine Learning; Algorithms; Health Care Costs/statistics & numerical data; Neural Networks, Computer
6.
J Environ Manage ; 368: 122107, 2024 Sep.
Article in English | MEDLINE | ID: mdl-39126840

ABSTRACT

In China, population growth and aging have partially negated the public health benefits of air pollution control measures, underscoring the ongoing need for precise PM2.5 monitoring and mapping. Despite its prevalence, the satellite-derived Aerosol Optical Depth (AOD) method for estimating PM2.5 concentrations often encounters significant spatial data gaps. Additionally, current research still needs better representation of PM2.5 spatiotemporal heterogeneity. Addressing these challenges, we developed a two-stage model employing the Extreme Gradient Boosting (XGBoost) algorithm. By incorporating improved spatiotemporal factors, we achieved high-precision and full-coverage daily 1-km PM2.5 mappings across China for the year 2020 without utilizing AOD products. Specifically, Model 1 develops improved temporal encodings and a terrain classification factor (DC), while Model 2 constructs an enhanced spatial autocorrelation term (Ps) by integrating observed and estimated values. Notably, Model 2 excelled in 10-fold sample-based cross-validation, achieving a coefficient of determination of 0.948, a mean absolute error of 3.792 µg/m³, a root mean square error of 7.144 µg/m³, and a mean relative error of 14.171%. Feature importance and Shapley Additive exPlanations (SHAP) analyses determined the relative importance of predictors in model training and outcome prediction, while correlation analysis identified strong links between improved temporal encodings, PM2.5 concentrations, and significant meteorological factors. Two-way Partial Dependence Plots (PDPs) further explored the interactions among these factors and their impact on PM2.5 levels. Compared to traditional methods, improved temporal encodings align more closely with seasonal variations and synergize more effectively with meteorological factors. In addition, the structured nature of DC aids in model training, while the improved Ps more effectively captures PM2.5's spatial autocorrelation, outperforming traditional Ps. Overall, this study effectively represents spatiotemporal information, thereby boosting model accuracy and enabling seamless large-scale PM2.5 estimations. It provides deep insights into variables and models, with significant implications for future air pollution research.
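
The four validation measures quoted in this abstract can be computed as in the minimal Python sketch below. It assumes the conventional definitions (coefficient of determination as 1 - SS_res/SS_tot, mean relative error as the mean of |error|/observed); the abstract does not state which formulas the authors used. The function name and the toy values are hypothetical.

```python
import math

def regression_metrics(observed, predicted):
    """Return R^2, MAE, RMSE and mean relative error (percent)."""
    n = len(observed)
    if n == 0 or n != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_obs = sum(observed) / n
    residuals = [p - o for o, p in zip(observed, predicted)]
    ss_res = sum(r * r for r in residuals)               # residual sum of squares
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)  # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": sum(abs(r) for r in residuals) / n,
        "rmse": math.sqrt(ss_res / n),
        # Assumes strictly positive observations, as for concentrations.
        "mre_percent": 100.0 * sum(abs(r) / o for r, o in zip(residuals, observed)) / n,
    }

# Toy daily PM2.5 values in µg/m³ (made up for illustration).
toy_observed = [10.0, 20.0, 30.0, 40.0]
toy_predicted = [12.0, 18.0, 33.0, 39.0]
metrics = regression_metrics(toy_observed, toy_predicted)
print(metrics)
```

RMSE penalizes large errors more heavily than MAE, which is consistent with the RMSE reported above being roughly twice the MAE.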


Subjects
Air Pollutants; Air Pollution; Environmental Monitoring; Particulate Matter; China; Particulate Matter/analysis; Air Pollution/analysis; Environmental Monitoring/methods; Air Pollutants/analysis; Aerosols/analysis; Algorithms; Spatio-Temporal Analysis
7.
Brief Bioinform ; 22(6), 2021 11 05.
Article in English | MEDLINE | ID: mdl-34368837

ABSTRACT

The identification of protein-ligand interactions plays a key role in biochemical research and drug discovery. Although deep learning has recently shown great promise in discovering new drugs, there remains a gap between deep learning-based and experimental approaches. Here, we propose a novel framework, named AIMEE, integrating an AI model and enzymological experiments, to identify inhibitors against the 3CL protease of SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2), which has taken a significant toll on people across the globe. From a bioactive chemical library, we have conducted two rounds of experiments and identified six novel inhibitors with a hit rate of 29.41%, and four of them showed an IC50 value <3 µM. Moreover, we explored the interpretability of the central model in AIMEE, mapping the deep learning extracted features to the domain knowledge of chemical properties. Based on this knowledge, a commercially available compound was selected and proven to be an activity-based probe of 3CLpro. This work highlights the great potential of combining deep learning models and biochemical experiments for intelligent iteration and for expanding the boundaries of drug discovery. The code and data are available at https://github.com/SIAT-code/AIMEE.


Subjects
COVID-19 Drug Treatment; Protease Inhibitors/chemistry; SARS-CoV-2/chemistry; Small Molecule Libraries/chemistry; Antiviral Agents/chemistry; Antiviral Agents/therapeutic use; Artificial Intelligence; COVID-19/genetics; COVID-19/virology; Drug Discovery; Humans; Ligands; Protease Inhibitors/therapeutic use; SARS-CoV-2/drug effects; SARS-CoV-2/pathogenicity; Small Molecule Libraries/therapeutic use
8.
Brief Bioinform ; 22(3), 2021 05 20.
Article in English | MEDLINE | ID: mdl-34020542

ABSTRACT

Machine learning methods have been widely applied to big data analysis in genomics and epigenomics research. Although accuracy and efficiency are common goals in many modeling tasks, model interpretability is especially important in these studies for understanding the underlying molecular and cellular mechanisms. Deep neural networks (DNNs) have recently gained popularity in various types of genomic and epigenomic studies due to their capabilities in utilizing large-scale high-throughput bioinformatics data and achieving high accuracy in predictions and classifications. However, the ability of DNNs to explain their predictions is often questioned because of their black-box nature. In this review, we present current developments in the model interpretation of DNNs, focusing on their applications in genomics and epigenomics. We first describe state-of-the-art DNN interpretation methods in representative machine learning fields. We then summarize the DNN interpretation methods in recent studies on genomics and epigenomics, focusing on current data- and computing-intensive topics such as sequence motif identification, genetic variations, gene expression, chromatin interactions and non-coding RNAs. We also present the biological discoveries that resulted from these interpretation methods. We finally discuss the advantages and limitations of current interpretation approaches in the context of genomic and epigenomic studies. Contact: xiaoman@mail.ucf.edu, haihu@cs.ucf.edu.


Subjects
Deep Learning; Epigenesis, Genetic; Genomics; Neural Networks, Computer; Chromatin/metabolism; Computational Biology/methods; DNA/genetics; Gene Expression; Protein Binding; RNA/genetics
9.
BMC Gastroenterol ; 23(1): 111, 2023 Apr 06.
Article in English | MEDLINE | ID: mdl-37024814

ABSTRACT

BACKGROUND: Hepatic encephalopathy (HE) is associated with marked increases in morbidity and mortality for cirrhosis patients. This study aimed to develop and validate machine learning (ML) models to predict 28-day mortality for patients with HE. METHODS: A retrospective cohort study was conducted in the Medical Information Mart for Intensive Care (MIMIC)-IV database. Patients from MIMIC-IV were randomized into training and validation cohorts in a ratio of 7:3. The training cohort was used to establish the model, while the validation cohort was used for validation. The outcome was defined as 28-day mortality. Predictors were identified by recursive feature elimination (RFE) within 24 h of intensive care unit (ICU) admission. The area under the curve (AUC) and calibration curve were used to determine the predictive performance of different ML models. RESULTS: In the MIMIC-IV database, 601 patients were eventually diagnosed with HE. Of these, 112 (18.64%) experienced death within 28 days. Acute physiology score III (APSIII), sepsis-related organ failure assessment (SOFA), international normalized ratio (INR), total bilirubin (TBIL), albumin, blood urea nitrogen (BUN), acute kidney injury (AKI) and mechanical ventilation were identified as independent risk factors. The validation set indicated that the artificial neural network (NNET) model had the highest AUC of 0.837 (95% CI: 0.774-0.901). Furthermore, in the calibration curve, the NNET model was also well-calibrated (P = 0.323), which means that it can better predict the 28-day mortality in HE patients. Additionally, the performance of the NNET is superior to existing scores, including Model for End-Stage Liver Disease (MELD) and Model for End-Stage Liver Disease-Sodium (MELD-Na). CONCLUSIONS: In this study, the NNET model demonstrated better discrimination in predicting 28-day mortality as compared to other models. This developed model could potentially improve the early detection of HE with high mortality, subsequently improving clinical outcomes in these patients with HE, but further external prospective validation is still required.
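
Discrimination here is summarized by the AUC. A minimal Python sketch of that quantity follows, using the standard pairwise (Mann-Whitney) formulation: the probability that a randomly chosen patient who died receives a higher predicted risk than a randomly chosen survivor, with ties counted as one half. The function name and the toy labels and scores are hypothetical and are not taken from the study.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    # Each positive/negative pair scores 1 if ranked correctly, 0.5 if tied.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

# Toy data: 1 = died within 28 days, 0 = survived (made up for illustration).
toy_labels = [1, 1, 0, 0, 0, 1]
toy_scores = [0.9, 0.6, 0.4, 0.6, 0.2, 0.8]
auc = roc_auc(toy_labels, toy_scores)
print(round(auc, 3))
```

This quadratic-time version is meant for clarity on small samples; library implementations typically sort the scores instead.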


Subjects
End Stage Liver Disease; Hepatic Encephalopathy; Humans; Hepatic Encephalopathy/etiology; Retrospective Studies; Prognosis; Severity of Illness Index; Intensive Care Units
10.
Environ Sci Technol ; 57(34): 12760-12770, 2023 08 29.
Article in English | MEDLINE | ID: mdl-37594125

ABSTRACT

Understanding plant uptake and translocation of nanomaterials is crucial for ensuring the successful and sustainable applications of seed nanotreatment. Here, we collect a dataset with 280 instances from experiments for predicting the relative metal/metalloid concentration (RMC) in maize seedlings after seed priming by various metal and metalloid oxide nanoparticles. To obtain unbiased predictions and explanations on small datasets, we present an averaging strategy and add a dimension for interpretable machine learning. The findings in post-hoc interpretations of sophisticated LightGBM models demonstrate that solubility is highly correlated with model performance. Surface area, concentration, zeta potential, and hydrodynamic diameter of nanoparticles and seedling part and relative weight of plants are dominant factors affecting RMC, and their effects and interactions are explained. Furthermore, self-interpretable models using the RuleFit algorithm are established to successfully predict RMC only based on six important features identified by post-hoc explanations. We then develop a visualization tool called RuleGrid to depict feature effects and interactions in numerous generated rules. Consistent parameter-RMC relationships are obtained by different methods. This study offers a promising interpretable data-driven approach to expand the knowledge of nanoparticle fate in plants and may profoundly contribute to the safety-by-design of nanomaterials in agricultural and environmental applications.


Subjects
Metalloids; Seeds; Biological Transport; Agriculture; Machine Learning; Seedlings
11.
Environ Sci Technol ; 57(48): 19860-19870, 2023 Dec 05.
Article in English | MEDLINE | ID: mdl-37976424

ABSTRACT

Electricity consumption and sludge yield (SY) are important indirect greenhouse gas (GHG) emission sources in wastewater treatment plants (WWTPs). Predicting these byproducts is crucial for tailoring technology-related policy decisions. However, it is challenging to balance mass balance models and mechanistic models, which respectively have limited intervariable nexus representation and excessive requirements on operational parameters. Herein, we propose integrating two machine learning models, namely, gradient boosting tree (GBT) and deep learning (DL), to model electricity consumption intensity (ECI) and SY precisely and pointwise for WWTPs in China. Results indicate that GBT and DL are capable of mining massive data to compensate for the lack of available parameters, providing comprehensive modeling focused on operation conditions and designed parameters, respectively. The proposed model reveals that lower ECI and SY were associated with higher treated wastewater volumes, more lenient effluent standards, and newer equipment. Moreover, ECI and SY showed different patterns when influent biochemical oxygen demand is above or below 100 mg/L in the anaerobic-anoxic-oxic process. Therefore, managing ECI and SY requires quantifying the coupling relationships between biochemical reactions instead of isolating each variable. Furthermore, the proposed models demonstrate potential economic-related inequalities resulting from synergizing water pollution and GHG emissions management.


Subjects
Greenhouse Gases; Water Purification; Waste Disposal, Fluid; Wastewater; Sewage; Water Purification/methods; Greenhouse Effect
12.
J Biomed Inform ; 144: 104438, 2023 08.
Article in English | MEDLINE | ID: mdl-37414368

ABSTRACT

Unpacking and comprehending how black-box machine learning algorithms (such as deep learning models) make decisions has been a persistent challenge for researchers and end-users. Explaining time-series predictive models is useful for clinical applications with high stakes to understand the behavior of prediction models, e.g., to determine how different variables and time points influence the clinical outcome. However, existing approaches to explain such models are frequently unique to architectures and data where the features do not have a time-varying component. In this paper, we introduce WindowSHAP, a model-agnostic framework for explaining time-series classifiers using Shapley values. We intend for WindowSHAP to mitigate the computational complexity of calculating Shapley values for long time-series data as well as improve the quality of explanations. WindowSHAP is based on partitioning a sequence into time windows. Under this framework, we present three distinct algorithms of Stationary, Sliding and Dynamic WindowSHAP, each evaluated against baseline approaches, KernelSHAP and TimeSHAP, using perturbation and sequence analysis metrics. We applied our framework to clinical time-series data from both a specialized clinical domain (Traumatic Brain Injury - TBI) as well as a broad clinical domain (critical care medicine). The experimental results demonstrate that, based on the two quantitative metrics, our framework is superior at explaining clinical time-series classifiers, while also reducing the complexity of computations. We show that for time-series data with 120 time steps (hours), merging 10 adjacent time points can reduce the CPU time of WindowSHAP by 80% compared to KernelSHAP. We also show that our Dynamic WindowSHAP algorithm focuses more on the most important time steps and provides more understandable explanations. As a result, WindowSHAP not only accelerates the calculation of Shapley values for time-series data, but also delivers more understandable explanations with higher quality.
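
The windowing idea described in this abstract can be illustrated with a minimal Python sketch. It only shows how fixed-length windows reduce the number of units that a Shapley estimator has to consider (120 hourly steps become 12 windows of 10), and how a coalition of windows could be turned into a perturbed input by filling the remaining steps from a baseline. This is not the authors' implementation; the function names, the baseline-replacement scheme, and the toy series are assumptions.

```python
def stationary_windows(n_steps, window_len):
    """Split time indices 0..n_steps-1 into consecutive fixed-length windows.

    The last window may be shorter if window_len does not divide n_steps.
    """
    return [
        range(start, min(start + window_len, n_steps))
        for start in range(0, n_steps, window_len)
    ]

def mask_windows(series, windows, keep, baseline):
    """Keep the original values inside the chosen windows, baseline elsewhere."""
    perturbed = list(baseline)
    for idx in keep:
        for t in windows[idx]:
            perturbed[t] = series[t]
    return perturbed

windows = stationary_windows(120, 10)
print(len(windows))  # 12

# Toy univariate series and a constant baseline (both made up).
toy_series = list(range(120))
toy_baseline = [-1] * 120
perturbed = mask_windows(toy_series, windows, keep=[0, 11], baseline=toy_baseline)
```

Treating each window as one unit means the explainer perturbs 12 units per variable instead of 120, which is where the computational saving comes from.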


Subjects
Algorithms; Brain Injuries, Traumatic; Humans; Time Factors; Benchmarking; Brain Injuries, Traumatic/diagnosis; Machine Learning
13.
Proc Natl Acad Sci U S A ; 117(35): 21373-21380, 2020 09 01.
Article in English | MEDLINE | ID: mdl-32801215

ABSTRACT

Cytometry technologies are essential tools for immunology research, providing high-throughput measurements of the immune cells at the single-cell level. Existing approaches to interpreting and using cytometry measurements include manual or automated gating to identify cell subsets from the cytometry data; these provide highly intuitive results but may lead to significant information loss, in that additional details in measured or correlated cell signals might be missed. In this study, we propose and test a deep convolutional neural network for analyzing cytometry data in an end-to-end fashion, allowing a direct association between raw cytometry data and the clinical outcome of interest. Using nine large cytometry by time-of-flight mass spectrometry or mass cytometry (CyTOF) studies from the open-access ImmPort database, we demonstrated that the deep convolutional neural network model can accurately diagnose latent cytomegalovirus (CMV) in healthy individuals, even when using highly heterogeneous data from different studies. In addition, we developed a permutation-based method for interpreting the deep convolutional neural network model. We were able to identify a CD27- CD94+ CD8+ T cell population significantly associated with latent CMV infection, confirming the findings in previous studies. Finally, we provide a tutorial for creating, training, and interpreting the tailored deep learning model for cytometry data using Keras and TensorFlow (https://github.com/hzc363/DeepLearningCyTOF).


Subjects
Deep Learning; Flow Cytometry; Cytomegalovirus Infections/diagnosis; Humans; T-Lymphocytes/cytology
14.
Neurocrit Care ; 38(2): 335-344, 2023 04.
Article in English | MEDLINE | ID: mdl-36195818

ABSTRACT

BACKGROUND: Acute kidney injury (AKI), a prevalent non-neurological complication following traumatic brain injury (TBI), is a major clinical issue with an unfavorable prognosis. This study aimed to develop and validate machine learning models to predict severe AKI (stage 3 or greater) incidence in patients with TBI. METHODS: A retrospective cohort study was conducted by using two public databases: the Medical Information Mart for Intensive Care IV (MIMIC-IV) and the eICU Collaborative Research Database (eICU-CRD). Recursive feature elimination was used to select candidate predictors obtained within 24 h of intensive care unit admission. The area under the curve and decision curve analysis curves were used to determine the discriminatory ability. In addition, the calibration curve was employed to evaluate the calibrated performance of the newly developed machine learning models. RESULTS: In the MIMIC-IV database, there were 808 patients diagnosed with moderate and severe TBI (msTBI) (msTBI is defined as Glasgow Coma Score < 12). Of these, 60 (7.43%) patients experienced severe AKI. External validation in the eICU-CRD indicated that the random forest (RF) model had the highest area under the curve of 0.819 (95% confidence interval 0.783-0.851). Furthermore, in the calibration curve, the RF model was well calibrated (P = 0.795). CONCLUSIONS: In this study, the RF model demonstrated better discrimination in predicting severe AKI than other models. An online calculator could facilitate its application, potentially improving the early detection of severe AKI and subsequently improving the clinical outcomes among patients with msTBI.


Subjects
Acute Kidney Injury; Brain Injuries, Traumatic; Humans; Retrospective Studies; Hospitalization; Brain Injuries, Traumatic/complications; Acute Kidney Injury/epidemiology; Machine Learning
15.
BMC Med Inform Decis Mak ; 23(1): 173, 2023 08 31.
Article in English | MEDLINE | ID: mdl-37653403

ABSTRACT

BACKGROUND: Chronic kidney disease (CKD) is a global public health concern. Therefore, to provide timely intervention for non-hospitalized high-risk patients and rationally allocate limited clinical resources, it is important to mine the key factors when designing a CKD prediction model. METHODS: This study included data from 1,358 patients with CKD pathologically confirmed during the period from December 2017 to September 2020 at Zhongshan Hospital. A CKD prediction interpretation framework based on machine learning was proposed. From among 100 variables, 17 were selected for the model construction through a recursive feature elimination with logistic regression feature screening. Several machine learning classifiers, including extreme gradient boosting, Gaussian naive Bayes, a neural network, ridge regression, and linear model logistic regression (LR), were trained, and an ensemble model was developed to predict 24-hour urine protein. The detailed relationship between the risk of CKD progression and these predictors was determined using a global interpretation. A patient-specific analysis was conducted using a local interpretation. RESULTS: The results showed that LR achieved the best performance among the single machine learning models, with an area under the curve (AUC) of 0.850. The ensemble model constructed using the voting integration method further improved the AUC to 0.856. The major predictors of moderate-to-severe severity included lower levels of 25-OH-vitamin, albumin, transferrin in males, and higher levels of cystatin C. CONCLUSIONS: Compared with the clinical single kidney function evaluation indicators (eGFR, Scr), the machine learning model proposed in this study improved the prediction accuracy of CKD progression by 17.6% and 24.6%, respectively, and the AUC was improved by 0.250 and 0.236, respectively. Our framework can achieve a good predictive interpretation and provide effective clinical decision support.


Subjects
Hospitals; Urinalysis; Male; Humans; Bayes Theorem; Area Under Curve; Machine Learning
16.
Empir Softw Eng ; 28(2): 39, 2023.
Article in English | MEDLINE | ID: mdl-36776918

ABSTRACT

The Ethereum platform allows developers to implement and deploy applications called ÐApps onto the blockchain for public use through the use of smart contracts. To execute code within a smart contract, a paid transaction must be issued towards one of the functions that are exposed in the interface of a contract. However, such a transaction is only processed once one of the miners in the peer-to-peer network selects it, adds it to a block, and appends that block to the blockchain. This creates a delay between transaction submission and code execution. It is crucial for ÐApp developers to be able to precisely estimate when transactions will be processed, since this allows them to define and provide a certain Quality of Service (QoS) level (e.g., 95% of the transactions processed within 1 minute). However, the impact that different factors have on these times has not yet been studied. Processing time estimation services are used by ÐApp developers to achieve predefined QoS. Yet, these services offer minimal insights into what factors impact processing times. Considering the vast amount of data that surrounds the Ethereum blockchain, changes in processing times are hard for ÐApp developers to predict, making it difficult to maintain said QoS. In our study, we build random forest models to understand the factors that are associated with transaction processing times. We engineer several features that capture blockchain internal factors, as well as gas pricing behaviors of transaction issuers. By interpreting our models, we conclude that features surrounding gas pricing behaviors are very strongly associated with transaction processing times. Based on our empirical results, we provide ÐApp developers with concrete insights that can help them provide and maintain high levels of QoS.
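
The QoS example in this abstract (95% of transactions processed within 1 minute) can be checked against observed data with a minimal Python sketch like the one below. It assumes processing time is measured in seconds from submission to inclusion in a block; the function names, default thresholds, and toy timings are hypothetical and do not come from the study.

```python
def fraction_within(processing_times_s, threshold_s):
    """Share of transactions whose processing time is at most threshold_s."""
    if not processing_times_s:
        raise ValueError("no transactions observed")
    on_time = sum(1 for t in processing_times_s if t <= threshold_s)
    return on_time / len(processing_times_s)

def meets_qos(processing_times_s, threshold_s=60.0, target=0.95):
    """True if at least `target` of the transactions finish within threshold_s."""
    return fraction_within(processing_times_s, threshold_s) >= target

# Toy observations in seconds (made up): 19 fast transactions and 1 slow one.
toy_times = [15.0] * 19 + [75.0]
print(fraction_within(toy_times, 60.0), meets_qos(toy_times))
```

A check like this only describes past transactions; predicting whether the target will hold for future ones is the harder problem the abstract addresses.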

17.
BMC Bioinformatics ; 23(1): 91, 2022 Mar 15.
Article in English | MEDLINE | ID: mdl-35291940

ABSTRACT

BACKGROUND: Upland cotton is the world's largest source of natural fiber. During fiber development, the quality and yield of fiber are influenced by gene transcription. Revealing sequence features related to transcription has a profound impact on cotton molecular breeding. We applied convolutional neural networks to predict gene expression status based on the sequences of gene transcription start regions. After that, a gradient-based interpretation and an N-adjusted kernel transformation were implemented to extract sequence features contributing to transcription. RESULTS: Our models had accuracies of approximately 80%, and the area under the receiver operating characteristic curve reached over 0.85. Gradient-based interpretation revealed that the 5' untranslated region contributed to gene transcription. Furthermore, 6 DOF binding motifs and 4 transcription activator binding motifs were obtained by N-adjusted kernel-motif transformation from models in three developmental stages. Apart from 10 general motifs, 3 DOF5.1 genes were also detected. In silico analysis of these motifs' binding proteins implied their potential functions in fiber formation. We also found some novel motifs in plants that are important sequence features for transcription. CONCLUSIONS: The N-adjusted kernel transformation method could interpret convolutional neural networks and reveal important sequence features related to transcription during fiber development. Potential functions of motifs interpreted from convolutional neural networks could be validated by further wet-lab experiments and applied in cotton molecular breeding.


Subjects
Neural Networks, Computer
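The abstract of record 17 describes feeding transcription start region sequences to convolutional neural networks but does not state how the sequences were encoded. A common choice is one-hot encoding, sketched below under that assumption. The function name `one_hot_encode` and the handling of the ambiguous base `N` as a uniform 0.25 vector are illustrative choices, and they are not a description of the paper's N-adjusted kernel transformation.

```python
from typing import List

BASES = "ACGT"  # column order of the one-hot vectors


def one_hot_encode(sequence: str) -> List[List[float]]:
    """Encode a DNA sequence as a list of 4-element vectors, one per base."""
    encoded = []
    for base in sequence.upper():
        if base in BASES:
            encoded.append([1.0 if base == b else 0.0 for b in BASES])
        elif base == "N":
            # Assumption: spread an unknown base evenly over the four channels.
            encoded.append([0.25] * 4)
        else:
            raise ValueError(f"unexpected character in sequence: {base!r}")
    return encoded


# Hypothetical fragment, not a sequence from the study.
for row in one_hot_encode("ACNT"):
    print(row)
```

The resulting matrix has one row per position and four columns, which is the shape a 1D convolution over sequence positions typically expects.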
18.
Brief Bioinform ; 21(6): 1999-2010, 2020 Dec 01.
Article in English | MEDLINE | ID: mdl-31792536

ABSTRACT

MOTIVATION: Since the initial discovery of microRNAs as key post-transcriptional regulators in the 1990s, a total of 2656 mature microRNAs have been publicly described for Homo sapiens. As the discovery of new miRNAs is still ongoing, target identification remains an essential and challenging step preceding functional annotation analysis. One key challenge for researchers seems to be the selection of the most appropriate tool out of the large multiverse of published solutions for a given research study set-up. RESULTS: In this review we collectively describe the field of in silico target prediction over time and point out long-standing principles as well as recent developments. By compiling a catalog of characteristics of the 98 prediction methods and identifying common and exclusive traits, we signpost a simplified mechanism to address the problem of application selection. Going further, we devised interpretation strategies for common types of output generated by frequently used computational methods. To this end, our work specifically aims to make prospective users aware of common mistakes and practical questions that arise during the application of target prediction tools. AVAILABILITY: An interactive implementation of our recommendations including materials shown in the manuscript is freely available at https://www.ccb.uni-saarland.de/mtguide.


Subjects
Computational Biology , Computer Simulation , Gene Expression Regulation , MicroRNAs , Computational Biology/methods , Prospective Studies , Software
19.
BMC Med Res Methodol ; 22(1): 183, 2022 Jul 04.
Article in English | MEDLINE | ID: mdl-35787248

ABSTRACT

OBJECTIVE: Our study aimed to identify predictors and to develop machine learning (ML) models to predict the risk of 30-day mortality in patients with sepsis-associated encephalopathy (SAE). MATERIALS AND METHODS: ML models were developed and validated based on a public database named Medical Information Mart for Intensive Care (MIMIC)-IV. Models were compared by the area under the curve (AUC), accuracy, sensitivity, specificity, positive and negative predictive values, and the Hosmer-Lemeshow goodness-of-fit test. RESULTS: Of 6994 patients in MIMIC-IV included in the final cohort, a total of 1232 (17.62%) patients died following SAE. Recursive feature elimination (RFE) selected 15 variables, including acute physiology score III (APSIII), Glasgow coma score (GCS), sepsis-related organ failure assessment (SOFA), Charlson comorbidity index (CCI), red blood cell volume distribution width (RDW), blood urea nitrogen (BUN), age, respiratory rate, PaO2, temperature, lactate, creatinine (CRE), malignant cancer, metastatic solid tumor, and platelet (PLT). The validation cohort demonstrated that all ML approaches had higher discriminative ability compared with the bagged trees (BT) model, although the difference was not statistically significant. Furthermore, in terms of calibration performance, the artificial neural network (NNET), logistic regression (LR), and adaptive boosting (Ada) models had good calibration, namely a high accuracy of prediction, with P-values of 0.831, 0.119, and 0.129, respectively. CONCLUSIONS: The ML models, as demonstrated by our study, can be used to evaluate the prognosis of SAE patients in the intensive care unit (ICU). An online calculator could facilitate the sharing of predictive models.


Subjects
Sepsis-Associated Encephalopathy , Sepsis , Death , Humans , Machine Learning , Neural Networks, Computer , Sepsis/complications , Sepsis/diagnosis
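Several of the comparison metrics named in the abstract of record 19 (accuracy, sensitivity, specificity, positive and negative predictive values) follow directly from a binary confusion matrix. The sketch below shows the standard definitions. The function name `classification_metrics` and the example counts are hypothetical and are not results from the study; only the cohort figures 1232 and 6994 come from the abstract.

```python
from typing import Dict


def _ratio(numerator: int, denominator: int) -> float:
    """Divide safely, returning NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute standard binary classification metrics from confusion-matrix counts."""
    return {
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": _ratio(tp, tp + fn),  # true positive rate
        "specificity": _ratio(tn, tn + fp),  # true negative rate
        "ppv": _ratio(tp, tp + fp),          # positive predictive value
        "npv": _ratio(tn, tn + fn),          # negative predictive value
    }


# Hypothetical counts for illustration only.
print(classification_metrics(tp=40, fp=10, tn=45, fn=5))

# Mortality proportion reported in the abstract: 1232 deaths among 6994 patients.
print(round(100 * 1232 / 6994, 2))  # 17.62
```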
20.
Environ Sci Technol ; 56(3): 2054-2064, 2022 Feb 01.
Article in English | MEDLINE | ID: mdl-34995441

ABSTRACT

Solute descriptors have been widely used to model chemical transfer processes through poly-parameter linear free energy relationships (pp-LFERs); however, there are still substantial difficulties in obtaining these descriptors accurately and quickly for new organic chemicals. In this research, models (PaDEL-DNN) that require only SMILES of chemicals were built to satisfactorily estimate pp-LFER descriptors using deep neural networks (DNN) and the PaDEL chemical representation. The PaDEL-DNN-estimated pp-LFER descriptors demonstrated good performance in modeling storage-lipid/water partitioning coefficient (log Kstorage-lipid/water), bioconcentration factor (BCF), aqueous solubility (ESOL), and hydration free energy (freesolve). Then, assuming that the accuracy in the estimated values of widely available properties, e.g., logP (octanol-water partition coefficient), can calibrate estimates for less available but related properties, we proposed logP as a surrogate metric for evaluating the overall accuracy of the estimated pp-LFER descriptors. When using the pp-LFER descriptors to model log Kstorage-lipid/water, BCF, ESOL, and freesolve, we achieved around 0.1 log unit lower errors for chemicals whose estimated pp-LFER descriptors were deemed "accurate" by the surrogate metric. The interpretation of the PaDEL-DNN models revealed that, for a given test chemical, having several (around 5) "similar" chemicals in the training data set was crucial for accurate estimation while the remaining less similar training chemicals provided reasonable baseline estimates. Lastly, pp-LFER descriptors for over 2800 persistent, bioaccumulative, and toxic chemicals were reasonably estimated by combining PaDEL-DNN with the surrogate metric. Overall, the PaDEL-DNN/surrogate metric and newly estimated descriptors will greatly benefit chemical transfer modeling.


Subjects
Organic Chemicals , Water , Chemical Phenomena , Neural Networks, Computer , Octanols , Organic Chemicals/chemistry , Water/chemistry
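The abstract of record 20 does not spell out the pp-LFER equation it uses. A widely used Abraham-type form is log K = c + eE + sS + aA + bB + vV, where E, S, A, B, and V are solute descriptors and c, e, s, a, b, and v are coefficients fitted for a given partitioning system. The sketch below evaluates that form under this assumption. The class names, the function name `log_k`, and all numeric values are placeholders chosen for illustration, not descriptors or coefficients from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SoluteDescriptors:
    """Abraham-type solute descriptors for one chemical."""
    E: float  # excess molar refraction
    S: float  # dipolarity/polarizability
    A: float  # hydrogen-bond acidity
    B: float  # hydrogen-bond basicity
    V: float  # McGowan characteristic volume


@dataclass(frozen=True)
class SystemCoefficients:
    """Fitted coefficients for one partitioning system."""
    c: float
    e: float
    s: float
    a: float
    b: float
    v: float


def log_k(d: SoluteDescriptors, k: SystemCoefficients) -> float:
    """Evaluate log K = c + eE + sS + aA + bB + vV."""
    return k.c + k.e * d.E + k.s * d.S + k.a * d.A + k.b * d.B + k.v * d.V


# Placeholder values, not taken from the paper.
solute = SoluteDescriptors(E=1.0, S=2.0, A=0.0, B=0.5, V=1.0)
system = SystemCoefficients(c=0.5, e=1.0, s=-1.0, a=0.0, b=-2.0, v=3.0)
print(log_k(solute, system))  # 1.5
```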