ABSTRACT
Efficient and accurate recognition of protein-DNA interactions is vital for understanding the molecular mechanisms of related biological processes and further guiding drug discovery. Although the current experimental protocols are the most precise way to determine protein-DNA binding sites, they tend to be labor-intensive and time-consuming. There is an immediate need to design efficient computational approaches for predicting DNA-binding sites. Here, we proposed ULDNA, a new deep-learning model, to deduce DNA-binding sites from protein sequences. This model leverages an LSTM-attention architecture, embedded with three unsupervised language models that are pre-trained on large-scale sequences from multiple database sources. To prove its effectiveness, ULDNA was tested on 229 protein chains with experimental annotation of DNA-binding sites. Results from computational experiments revealed that ULDNA significantly improves the accuracy of DNA-binding site prediction in comparison with 17 state-of-the-art methods. In-depth data analyses showed that the major strength of ULDNA stems from employing three transformer language models. Specifically, these language models capture complementary feature embeddings with evolution diversity, in which the complex DNA-binding patterns are buried. Meanwhile, the specially crafted LSTM-attention network effectively decodes evolution diversity-based embeddings as DNA-binding results at the residue level. Our findings demonstrated a new pipeline for predicting DNA-binding sites on a large scale with high accuracy from protein sequence alone.
Subjects
Data Analysis; Language; Binding Sites; Amino Acid Sequence; Databases, Factual
ABSTRACT
CRISPR Cas-9 is a groundbreaking genome-editing tool that harnesses bacterial defense systems to alter DNA sequences accurately. This innovative technology holds vast promise in multiple domains such as biotechnology, agriculture and medicine. However, such power does not come without peril: one issue is the potential for unintended modifications (Off-Target), which highlights the need for accurate prediction and mitigation strategies. Though previous studies have demonstrated improvement in Off-Target prediction capability with the application of deep learning, they often struggle with the precision-recall trade-off, which limits their effectiveness, and they do not provide a proper interpretation of the complex decision-making process of their models. To address these limitations, we have thoroughly explored deep learning networks, particularly recurrent neural network based models, leveraging their established success in handling sequence data. Furthermore, we have employed a genetic algorithm for hyperparameter tuning to optimize these models' performance. The results from our experiments demonstrate significant performance improvement compared with the current state-of-the-art in Off-Target prediction, highlighting the efficacy of our approach. In addition, using the integrated gradient method, we interpret our models, which yields a detailed analysis of the underlying factors that contribute to Off-Target predictions, in particular the presence of two sub-regions in the seed region of single guide RNA, which extends the established biological hypothesis of Off-Target effects. To the best of our knowledge, our model is the first to combine high efficacy, interpretability and a desirable balance between precision and recall.
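The integrated gradient attribution mentioned in this abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the toy logistic scorer, its weights, the all-zero baseline and the function names are hypothetical and are not the authors' recurrent model or code. The sketch only shows the general technique, namely averaging the gradient along a straight path from a baseline to the input and scaling by the input difference.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def model(x, w, b):
    """Toy logistic scorer standing in for a trained network (hypothetical)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def model_grad(x, w, b):
    """Analytic gradient of the logistic scorer with respect to each input."""
    p = model(x, w, b)
    return [p * (1.0 - p) * wi for wi in w]

def integrated_gradients(x, baseline, w, b, steps=200):
    """Midpoint Riemann-sum approximation of integrated gradients."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        # Point on the straight path from the baseline to the input.
        point = [bi + alpha * (xi - bi) for xi, bi in zip(x, baseline)]
        grad = model_grad(point, w, b)
        total = [t + g for t, g in zip(total, grad)]
    # Scale the path-averaged gradient by the input difference.
    return [(xi - bi) * t / steps for xi, bi, t in zip(x, baseline, total)]

# Made-up weights and a made-up four-feature input, for illustration only.
w = [1.5, -2.0, 0.5, 0.0]
b = -0.25
x = [1.0, 1.0, 0.0, 1.0]
baseline = [0.0, 0.0, 0.0, 0.0]
attributions = integrated_gradients(x, baseline, w, b)
```

A useful sanity check for any such implementation is the completeness property: the attributions should sum, up to numerical error, to the difference between the model output at the input and at the baseline.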
Subjects
CRISPR-Cas Systems; Deep Learning; Gene Editing/methods; RNA, Guide, CRISPR-Cas Systems; Neural Networks, Computer
ABSTRACT
Accurate prediction of transcription factor binding sites (TFBSs) is essential for understanding gene regulation mechanisms and the etiology of diseases. Despite numerous advances in deep learning for predicting TFBSs, their performance can still be enhanced. In this study, we propose MLSNet, a novel deep learning architecture designed specifically to predict TFBSs. MLSNet innovatively integrates multisize convolutional fusion with long short-term memory (LSTM) networks to effectively capture DNA-sparse higher-order sequence features. Further, MLSNet incorporates super token attention and Bi-LSTM to systematically extract and integrate higher-order DNA shape features. Experimental results on 165 ChIP-seq (chromatin immunoprecipitation followed by sequencing) datasets indicate that MLSNet consistently outperforms several state-of-the-art algorithms in the prediction of TFBSs. Specifically, MLSNet reports average metrics: 0.8306 for ACC, 0.8992 for AUROC, and 0.9035 for AUPRC, surpassing the second-best methods by 1.82%, 1.68%, and 1.54%, respectively. This research delineates the effectiveness of combining multi-size convolutional layers with LSTM and DNA shape-based features in enhancing predictive accuracy. Moreover, this study comprehensively assesses the variability in model performance across different cell lines and transcription factors. The source code of MLSNet is available at https://github.com/minghaidea/MLSNet.
Subjects
Deep Learning; Transcription Factors; Transcription Factors/metabolism; Binding Sites; Algorithms; Computational Biology/methods; Humans; Chromatin Immunoprecipitation Sequencing/methods; DNA/metabolism; DNA/chemistry
ABSTRACT
Nucleic acid-binding proteins (NABPs), including DNA-binding proteins (DBPs) and RNA-binding proteins (RBPs), play important roles in essential biological processes. To facilitate functional annotation and accurate prediction of different types of NABPs, many machine learning-based computational approaches have been developed. However, the datasets used for training and testing as well as the prediction scopes in these studies have limited their applications. In this paper, we developed new strategies to overcome these limitations by generating more accurate and robust datasets and developing deep learning-based methods including both hierarchical and multi-class approaches to predict the types of NABPs for any given protein. The deep learning models employ two layers of convolutional neural network and one layer of long short-term memory. Our approaches outperform existing DBP and RBP predictors with a balanced prediction between DBPs and RBPs, and are more practically useful in identifying novel NABPs. The multi-class approach greatly improves the prediction accuracy of DBPs and RBPs, especially for the DBPs with ~12% improvement. Moreover, we explored the prediction accuracy of single-stranded DNA binding proteins and their effect on the overall prediction accuracy of NABP predictions.
Subjects
Computational Biology; DNA-Binding Proteins; Deep Learning; RNA-Binding Proteins; RNA-Binding Proteins/metabolism; DNA-Binding Proteins/metabolism; Computational Biology/methods; Neural Networks, Computer; Humans
ABSTRACT
Interactions among biological molecules are considered primary factors in the lifecycle of an organism. Various important biological functions depend on such interactions, and among them protein-DNA interactions are particularly important for transcription, regulation of gene expression, and DNA repair and packaging. Knowledge of these interactions and of the sites where they occur is therefore necessary to study the mechanisms of various biological processes. As experimental identification through biological assays is resource-demanding, costly and error-prone, scientists opt for computational methods for efficient and accurate identification of DNA-protein interaction sites. Herein, we propose a novel and accurate method, DeepDBS, for the identification of DNA-binding sites in proteins using the primary amino acid sequences of the proteins under study. From protein sequences, deep representations were computed through a one-dimensional convolutional neural network (1D-CNN), a recurrent neural network (RNN) and a long short-term memory (LSTM) network, and were further used to train a Random Forest classifier. Random Forest with LSTM-based features outperformed the other models, as well as the existing state-of-the-art methods, with an accuracy score of 0.99 for the self-consistency test, 10-fold cross-validation, 5-fold cross-validation, and jackknife validation, and 0.92 for independent dataset testing. Based on these results, we conclude that DeepDBS can support accurate and efficient identification of DNA-binding sites (DBS) in proteins.
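The validation schemes named in this abstract (k-fold cross-validation and jackknife, i.e. leave-one-out, validation) differ only in how sample indices are partitioned. The following is a minimal sketch of that partitioning, with hypothetical function names and without shuffling or stratification, which a real evaluation would normally add; it is not taken from the DeepDBS code.

```python
def k_fold_splits(n_samples, k):
    """Return k (train_indices, test_indices) pairs over contiguous folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    splits = []
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        splits.append((train_idx, test_idx))
        start += size
    return splits

def jackknife_splits(n_samples):
    """Jackknife validation: each sample is held out exactly once."""
    return k_fold_splits(n_samples, n_samples)

ten_fold = k_fold_splits(100, 10)
five_fold = k_fold_splits(100, 5)
leave_one_out = jackknife_splits(100)
```

In the self-consistency test, by contrast, the model is evaluated on the same samples it was trained on, so no split is needed; this is why self-consistency scores are an optimistic upper bound rather than an estimate of generalization.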
Subjects
DNA-Binding Proteins; DNA; Neural Networks, Computer; Binding Sites; DNA/genetics; DNA/metabolism; DNA/chemistry; DNA-Binding Proteins/genetics; DNA-Binding Proteins/metabolism; DNA-Binding Proteins/chemistry; Computational Biology/methods; Amino Acid Sequence/genetics; Algorithms; Protein Binding; Databases, Protein; Sequence Analysis, Protein/methods; Deep Learning; Random Forest
ABSTRACT
Short-length antimicrobial peptides (AMPs) have been demonstrated to have intensified antimicrobial activities against a wide spectrum of microbes. Therefore, exploration of novel and promising short AMPs is highly essential in developing various types of antimicrobial drugs or treatments. In addition to experimental approaches, computational methods have been developed to improve screening efficiency. Although existing computational methods have achieved satisfactory performance, there is still much room for model improvement. In this study, we proposed iAMP-DL, an efficient hybrid deep learning architecture, for predicting short AMPs. The model was constructed using two well-known deep learning architectures: the long short-term memory architecture and convolutional neural networks. To fairly assess the performance of the model, we compared our model with existing state-of-the-art methods using the same independent test set. Our comparative analysis shows that iAMP-DL outperformed other methods. Furthermore, to assess the robustness and stability of our model, the experiments were repeated 10 times to observe the variation in prediction efficiency. The results demonstrate that iAMP-DL is an effective, robust, and stable framework for detecting promising short AMPs. Another comparative study of different negative data sampling methods also confirms the effectiveness of our method and demonstrates that it can also be used to develop a robust model for predicting AMPs in general. The proposed framework was also deployed as an online web server with a user-friendly interface to support the research community in identifying short AMPs.
Subjects
Antimicrobial Peptides; Deep Learning; Antimicrobial Peptides/chemistry; Antimicrobial Peptides/pharmacology; Neural Networks, Computer; Computational Biology/methods; Antimicrobial Cationic Peptides/chemistry; Antimicrobial Cationic Peptides/pharmacology
ABSTRACT
BACKGROUND: Natural proteins occupy a small portion of the protein sequence space, whereas artificial proteins can explore a wider range of possibilities within the sequence space. However, specific requirements may not be met when generating sequences blindly. Research indicates that small proteins have notable advantages, including high stability, accurate resolution prediction, and facile specificity modification. RESULTS: This study involves the construction of a neural network model named TopoProGenerator (TPGen) using a transformer decoder. The model is trained with sequences consisting of a maximum of 65 amino acids. The training process of TopoProGenerator incorporates reinforcement learning and adversarial learning for fine-tuning. Additionally, it encompasses a stability predictive model trained with a dataset comprising over 200,000 sequences. The results demonstrate that TopoProGenerator is capable of designing stable small protein sequences with specified topology structures. CONCLUSION: TPGen has the ability to generate protein sequences that fold into the specified topology, and the pretraining and fine-tuning methods proposed in this study can serve as a framework for designing various types of proteins.
Subjects
Amino Acids; Electric Power Supplies; Amino Acid Sequence; Language; Learning
ABSTRACT
Replication of DNA is an important process for the cell division cycle, gene expression regulation and other biological evolution processes. It also has a crucial role in a living organism's physical growth and structure. Replication of DNA comprises three stages, known as initiation, elongation and termination, and the origin of replication site (ORI) is the location where the DNA replication process is initiated. Various methodologies exist to identify ORIs in genomic sequences; however, these methods either require extensive computation or are poorly optimized for large datasets. Herein, a model called ORI-Deep is proposed to identify ORIs from multiple cell type genomic sequence benchmark data. An efficient method is proposed using a deep neural network to identify ORIs for four different eukaryotic species. For better representation of the data, a feature vector is constructed using statistical moments for training and testing and is then fed to a long short-term memory (LSTM) network. To prove the effectiveness of the proposed model, we applied several validation techniques at different levels to obtain seven accuracy metrics; the accuracy scores for the self-consistency, 10-fold cross-validation, jackknife and independent set tests were 0.977, 0.948, 0.976 and 0.977, respectively. Based on the results, it can be concluded that ORI-Deep can efficiently predict origin of replication sites in DNA sequences with high accuracy. The webserver for ORI-Deep is available at https://share.streamlit.io/waqarhusain/orideep/main/app.py, and the source code is available at https://github.com/WaqarHusain/OriDeep.
Subjects
Memory, Short-Term; Replication Origin; Eukaryota; Neural Networks, Computer; Software
ABSTRACT
Protein S-sulfinylation is an important posttranslational modification that regulates a variety of cell and protein functions. Studies have linked this modification to signal transduction, redox homeostasis and neuronal transmission. Therefore, identification of S-sulfinylation sites is crucial to understanding its structure and function, which is critical in cell biology and human diseases. In this study, we propose a multi-module deep learning framework named DLF-Sul for identification of S-sulfinylation sites in proteins. First, three types of features are extracted: binary encoding, BLOSUM62 and amino acid index. Then, sequential features are further extracted from these three feature types using a bidirectional long short-term memory network. Next, a multi-head self-attention mechanism is used to filter the effective attribute information, and a residual connection helps to reduce information loss. Furthermore, a convolutional neural network is employed to extract local deep feature information. Finally, fully connected layers act as a classifier that maps samples to their corresponding labels. Performance metrics on the independent test set, including sensitivity, specificity, accuracy, Matthews correlation coefficient and area under curve, reach 91.80%, 92.36%, 92.08%, 0.8416 and 96.40%, respectively. The results show that DLF-Sul is an effective tool for predicting S-sulfinylation sites. The source code is available on the website https://github.com/ningq669/DLF-Sul.
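Of the three feature types listed in this abstract, binary encoding is the simplest to show concretely. The sketch below is an assumption-laden illustration, not the DLF-Sul implementation: the alphabet order, the flank length, the padding character "X" and both function names are hypothetical. It extracts a fixed-length window centred on a candidate residue and maps each position to a 20-dimensional indicator vector.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues; this order is an assumption

def extract_window(sequence, center, flank):
    """Return the residues within `flank` positions of `center`, padded with 'X'."""
    padded = "X" * flank + sequence + "X" * flank
    # After padding, the original index `center` sits at `center + flank`.
    return padded[center:center + 2 * flank + 1]

def binary_encode(window):
    """One-hot encode each residue; padding or unknown residues become all zeros."""
    encoded = []
    for residue in window:
        row = [0] * len(AMINO_ACIDS)
        idx = AMINO_ACIDS.find(residue)
        if idx >= 0:
            row[idx] = 1
        encoded.append(row)
    return encoded

# Hypothetical five-residue peptide with a candidate cysteine at index 2.
window = extract_window("MKCLA", center=2, flank=3)
features = binary_encode(window)
```

The resulting matrix has one row per window position, which is the shape a recurrent or convolutional layer would consume; BLOSUM62 and amino acid index features would replace each indicator row with a row of substitution scores or physicochemical values.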
Subjects
Deep Learning; Amino Acids; Humans; Neural Networks, Computer; Proteins/chemistry; Software
ABSTRACT
Due to the rapid emergence of multi-drug resistant (MDR) bacteria, existing antibiotics are becoming ineffective. So, researchers are looking for alternatives in the form of antibacterial peptides (ABPs) based medicines. The discovery of novel ABPs using wet-lab experiments is time-consuming and expensive. Many machine learning models have been proposed to search for new ABPs, but there is still scope to develop a robust model that has high accuracy and precision. In this work, we present StaBle-ABPpred, a stacked ensemble technique-based deep learning classifier that uses bidirectional long-short term memory (biLSTM) and attention mechanism at base-level and an ensemble of random forest, gradient boosting and logistic regression at meta-level to classify peptides as antibacterial or otherwise. The performance of our model has been compared with several state-of-the-art classifiers, and results were subjected to analysis of variance (ANOVA) test and its post hoc analysis, which proves that our model performs better than existing classifiers. Furthermore, a web app has been developed and deployed at https://stable-abppred.anvil.app to identify novel ABPs in protein sequences. Using this app, we identified novel ABPs in all the proteins of the Streptococcus phage T12 genome. These ABPs have shown amino acid similarities with experimentally tested antimicrobial peptides (AMPs) of other organisms. Hence, they could be chemically synthesized and experimentally validated for their activity against different bacteria. The model and app developed in this work can be further utilized to explore the protein diversity for identifying novel ABPs with broad-spectrum activity, especially against MDR bacterial pathogens.
Subjects
Anti-Bacterial Agents; Peptides; Amino Acid Sequence; Anti-Bacterial Agents/pharmacology; Machine Learning; Peptides/chemistry; Proteins
ABSTRACT
N6-methyladenine (6mA) is associated with important roles in DNA replication, DNA repair, transcription, and regulation of gene expression. Several experimental methods have been used to identify DNA modifications. However, these experimental methods are costly and time-consuming. To detect 6mA and compensate for these shortcomings of experimental methods, we proposed a novel deep learning approach called BERT6mA. To compare BERT6mA with other deep learning approaches, we used benchmark datasets covering 11 species. BERT6mA presented the highest AUCs in eight species in independent tests. Furthermore, BERT6mA showed higher or comparable performance relative to the state-of-the-art models, although it performed poorly in a few species with a small sample size. To overcome this issue, pretraining and fine-tuning between two species were applied to BERT6mA. The pretrained and fine-tuned models on specific species presented higher performances than other models even for the species with a small sample size. In addition to the prediction, we analyzed the attention weights generated by BERT6mA to reveal how the model extracts critical features responsible for the 6mA prediction. To facilitate biological sciences, the BERT6mA online web server and its source codes are freely accessible at https://github.com/kuratahiroyuki/BERT6mA.git, respectively.
Subjects
Deep Learning; DNA/genetics; DNA Methylation; Software
ABSTRACT
Protein lysine crotonylation (Kcr) is an important type of posttranslational modification that is associated with a wide range of biological processes. The identification of Kcr sites is critical to better understanding their functional mechanisms. However, the existing experimental techniques for detecting Kcr sites are cost-ineffective, leading to a great need for new computational methods to address this problem. We here describe Adapt-Kcr, an advanced deep learning model that utilizes adaptive embedding and is based on a convolutional neural network together with a bidirectional long short-term memory network and attention architecture. On the independent testing set, Adapt-Kcr outperformed the current state-of-the-art Kcr prediction model, with an improvement of 3.2% in accuracy and 1.9% in the area under the receiver operating characteristic curve. Compared to other Kcr models, Adapt-Kcr additionally had a more robust ability to distinguish between crotonylation and other lysine modifications. Another model (Adapt-ST) was trained to predict phosphorylation sites in SARS-CoV-2, and outperformed the equivalent state-of-the-art phosphorylation site prediction model. These results indicate that self-adaptive embedding features perform better than handcrafted features in capturing discriminative information; when used in attention architecture, this could be an effective way of identifying protein Kcr sites. Together, our Adapt framework (including learning embedding features and attention architecture) has a strong potential for prediction of other protein posttranslational modification sites.
Subjects
Computational Biology; Deep Learning; Lysine/metabolism; Protein Processing, Post-Translational; Software; Algorithms; Benchmarking; Computational Biology/methods; Computational Biology/standards; Databases, Factual; Neural Networks, Computer; Phosphorylation; ROC Curve; Reproducibility of Results; User-Computer Interface
ABSTRACT
The purpose of the current study was to explore the feasibility of training a deep neural network to accelerate the process of generating T1, T2, and T1ρ maps for a recently proposed free-breathing cardiac multiparametric mapping technique, where a recurrent neural network (RNN) was utilized to exploit the temporal correlation among the multicontrast images. The RNN-based model was developed for rapid and accurate T1, T2, and T1ρ estimation. Bloch simulation was performed to simulate a dataset of more than 10 million signals and time correspondences with different noise levels for network training. The proposed RNN-based method was compared with a dictionary-matching method and a conventional mapping method to evaluate the model's effectiveness in phantom and in vivo studies at 3 T, respectively. In phantom studies, the RNN-based method and the dictionary-matching method achieved similar accuracy and precision in T1, T2, and T1ρ estimations. In in vivo studies, the estimated T1, T2, and T1ρ values obtained by the two methods achieved similar accuracy and precision for 10 healthy volunteers (T1: 1228.70 ± 53.80 vs. 1228.34 ± 52.91 ms, p > 0.1; T2: 40.70 ± 2.89 vs. 41.19 ± 2.91 ms, p > 0.1; T1ρ: 45.09 ± 4.47 vs. 45.23 ± 4.65 ms, p > 0.1). The RNN-based method can generate cardiac multiparameter quantitative maps simultaneously in just 2 s, achieving 60-fold acceleration compared with the dictionary-matching method. The RNN-accelerated method offers an almost instantaneous approach for reconstructing accurate T1, T2, and T1ρ maps, being much more efficient than the dictionary-matching method for the free-breathing multiparametric cardiac mapping technique, which may pave the way for inline mapping in clinical applications.
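The dictionary-matching baseline that the RNN is compared against can be sketched in a few lines. This is a deliberately simplified illustration under stated assumptions: a mono-exponential T2 decay stands in for the Bloch-simulated multiparametric signal, and the echo times, the 1 ms grid and all function names are hypothetical rather than taken from the study. The idea is to precompute normalized signals for a grid of candidate parameter values and pick the entry with the largest inner product with the measured signal.

```python
import math

def simulate_decay(t2_ms, echo_times_ms):
    """Mono-exponential decay exp(-TE/T2); a simplifying assumption."""
    return [math.exp(-te / t2_ms) for te in echo_times_ms]

def normalize(signal):
    norm = math.sqrt(sum(s * s for s in signal))
    return [s / norm for s in signal]

def build_dictionary(t2_grid_ms, echo_times_ms):
    """Precompute a unit-norm signal for every candidate T2 value."""
    return [(t2, normalize(simulate_decay(t2, echo_times_ms))) for t2 in t2_grid_ms]

def match(signal, dictionary):
    """Return the T2 whose dictionary entry has the largest inner product."""
    unit = normalize(signal)
    best_t2, best_score = None, -1.0
    for t2, atom in dictionary:
        score = sum(a * b for a, b in zip(unit, atom))
        if score > best_score:
            best_t2, best_score = t2, score
    return best_t2

echo_times = [0.0, 10.0, 20.0, 40.0, 60.0]          # hypothetical sampling times in ms
dictionary = build_dictionary(range(20, 81), echo_times)  # 1 ms grid from 20 to 80 ms
measured = [2.5 * s for s in simulate_decay(41.0, echo_times)]  # noiseless, arbitrary scale
estimate = match(measured, dictionary)
```

Because every voxel requires a search over the whole dictionary, and a three-parameter dictionary grows with the product of the grid sizes, matching cost rises quickly; a trained network replaces that search with a single forward pass, which is consistent with the acceleration the abstract reports.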
Subjects
Heart; Neural Networks, Computer; Phantoms, Imaging; Humans; Heart/diagnostic imaging; Male; Adult; Magnetic Resonance Imaging/methods; Female; Image Processing, Computer-Assisted/methods; Algorithms
ABSTRACT
BACKGROUND: Smoking is a critical risk factor responsible for over eight million annual deaths worldwide. It is essential to obtain information on smoking habits to advance research and implement preventive measures such as screening of high-risk individuals. In most countries, including Denmark, smoking habits are not systematically recorded and are at best documented within unstructured free-text segments of electronic health records (EHRs). This requires researchers and clinicians to manually navigate through extensive amounts of unstructured data, which is one of the main reasons that smoking habits are rarely integrated into larger studies. Our aim is to develop machine learning models to classify patients' smoking status from their EHRs. METHODS: This study proposes an efficient natural language processing (NLP) pipeline capable of classifying patients' smoking status and providing explanations for the decisions. The proposed NLP pipeline comprises four distinct components: (1) preprocessing techniques to address abbreviations, punctuation, and other textual irregularities, (2) four cutting-edge feature extraction techniques, i.e. Embedding, BERT, Word2Vec, and Count Vectorizer, employed to extract the optimal features, (3) a Stacking-based Ensemble (SE) model and a Convolutional Long Short-Term Memory Neural Network (CNN-LSTM) for the identification of smoking status, and (4) a local interpretable model-agnostic explanation to explain the decisions rendered by the detection models. The EHRs of 23,132 patients with suspected lung cancer were collected from the Region of Southern Denmark during the period 1/1/2009-31/12/2018. A medical professional annotated the data into 'Smoker' and 'Non-Smoker' with further classifications as 'Active-Smoker', 'Former-Smoker', and 'Never-Smoker'. Subsequently, the annotated dataset was used for the development of binary and multiclass classification models. 
An extensive comparison was conducted of the detection performance across various model architectures. RESULTS: The results of experimental validation confirm the consistency among the models. However, for binary classification, BERT method with CNN-LSTM architecture outperformed other models by achieving precision, recall, and F1-scores between 97% and 99% for both Never-Smokers and Active-Smokers. In multiclass classification, the Embedding technique with CNN-LSTM architecture yielded the most favorable results in class-specific evaluations, with equal performance measures of 97% for Never-Smoker and measures in the range of 86 to 89% for Active-Smoker and 91-92% for Never-Smoker. CONCLUSION: Our proposed NLP pipeline achieved a high level of classification performance. In addition, we presented the explanation of the decision made by the best performing detection model. Future work will expand the model's capabilities to analyze longer notes and a broader range of categories to maximize its utility in further research and screening applications.
Subjects
Electronic Health Records; Natural Language Processing; Smoking; Humans; Denmark/epidemiology; Electronic Health Records/statistics & numerical data; Smoking/epidemiology; Machine Learning; Female; Male; Middle Aged; Neural Networks, Computer
ABSTRACT
BACKGROUND: Gonorrhea has long been a serious public health problem in mainland China that requires attention; modeling to describe and predict its prevalence patterns can help the government to develop more scientific interventions. METHODS: Time series (TS) data of the gonorrhea incidence in China from January 2004 to August 2022 were collected, with the incidence data from September 2021 to August 2022 used for validation. The seasonal autoregressive integrated moving average (SARIMA) model, long short-term memory network (LSTM) model, and hybrid SARIMA-LSTM model were each used to simulate the data, and model performance was evaluated by calculating the mean absolute percentage error (MAPE), root mean square error (RMSE), and mean absolute error (MAE) on the training and validation sets. RESULTS: The seasonal components after data decomposition showed an approximate bimodal distribution with a period of 12 months. The three models identified were SARIMA(1,1,1)(2,1,2)12, LSTM with 150 hidden units, and SARIMA-LSTM with 150 hidden units; the SARIMA-LSTM model fitted best in the training and validation sets, with the smallest MAPE, RMSE, and MAE. CONCLUSIONS: The overall incidence trend of gonorrhea in mainland China has been on the decline since 2004, with some periods exhibiting an upward trend. The incidence of gonorrhea displays a seasonal distribution, typically peaking in July and December each year. The SARIMA model, LSTM model, and SARIMA-LSTM model can all fit the monthly incidence time series data of gonorrhea in mainland China. However, in terms of predictive performance, the SARIMA-LSTM model outperforms the SARIMA and LSTM models, with the LSTM model surpassing the SARIMA model. This suggests that the SARIMA-LSTM model can serve as a preferred tool for time series analysis, providing evidence for the government to predict trends in gonorrhea incidence. The model's predictions indicate that the incidence of gonorrhea in mainland China will remain at a high level in 2024, necessitating that policymakers implement public health measures in advance to prevent the spread of the disease.
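The three error measures used to rank the models in this abstract have standard definitions, sketched below. The function names and the sample numbers are made up for illustration and are not values from the study.

```python
import math

def mape(actual, predicted):
    """Mean absolute percentage error, in percent; actual values must be non-zero."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean square error, in the units of the series."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error, in the units of the series."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Made-up monthly case counts and forecasts, for illustration only.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]
scores = (mape(actual, predicted), rmse(actual, predicted), mae(actual, predicted))
```

RMSE penalizes large individual errors more heavily than MAE, while MAPE is scale-free, so reporting all three guards against a model that looks good on only one of them.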
Subjects
Gonorrhea; Humans; Time Factors; Gonorrhea/epidemiology; China/epidemiology; Government; Public Health; Seizures
ABSTRACT
Syphilis remains a serious public health problem in mainland China that requires attention; modelling to describe and predict its prevalence patterns can help the government to develop more scientific interventions. The seasonal autoregressive integrated moving average (SARIMA) model, long short-term memory network (LSTM) model, hybrid SARIMA-LSTM model, and hybrid SARIMA-nonlinear auto-regressive models with exogenous inputs (SARIMA-NARX) model were each used to simulate the time series data of the syphilis incidence from January 2004 to November 2023. Compared to the SARIMA, LSTM, and SARIMA-LSTM models, the median absolute deviation (MAD) value of the SARIMA-NARX model decreases by 352.69%, 4.98%, and 3.73%, respectively. The mean absolute percentage error (MAPE) value decreases by 73.7%, 23.46%, and 13.06%, respectively. The root mean square error (RMSE) value decreases by 68.02%, 26.68%, and 23.78%, respectively. The mean absolute error (MAE) value decreases by 70.90%, 23.00%, and 21.80%, respectively. The hybrid SARIMA-NARX and SARIMA-LSTM methods predict syphilis cases more accurately than the basic SARIMA and LSTM methods, so they can be used by governments to develop long-term syphilis prevention and control programs. In addition, the predicted cases still maintain a fairly high level of incidence, so there is an urgent need to develop more comprehensive prevention strategies.
Subjects
Forecasting; Syphilis; Syphilis/epidemiology; China/epidemiology; Humans; Incidence; Models, Statistical; Prevalence
ABSTRACT
OBJECTIVE: Public health faces different challenges at different times, and the intensity of intervention measures varies accordingly. Research on the impact of meteorological factors on influenza, and on influenza prediction based on them, is steadily increasing; however, there is currently no evidence on whether the findings depend on the period studied. This study aims to provide limited evidence on this issue. METHODS: Daily data on influencing factors and influenza in Xiamen were divided into three parts: the overall period (phase AB), the non-COVID-19 epidemic period (phase A), and the COVID-19 epidemic period (phase B). The association between influencing factors and influenza was analysed using generalized additive models (GAMs). The excess risk (ER) was used to represent the percentage change in influenza per interquartile range (IQR) increase in a meteorological factor. The 7-day average of daily influenza cases was predicted using a combination of bidirectional long short-term memory (Bi-LSTM) and random forest (RF), through multi-step rolling input of the daily multifactor values of the previous 7 days. RESULTS: In phases A and AB, air temperature below 22 °C was a risk factor for influenza, whereas in phase B temperature showed a U-shaped effect. Relative humidity had a stronger cumulative effect on influenza in phase AB than in phase A (peak at 14 days of accumulation; AB: ER = 281.54, 95% CI = 245.47 ~ 321.37; A: ER = 120.48, 95% CI = 100.37 ~ 142.60). Compared with other age groups, children aged 4-12 were more affected by pressure, precipitation, sunshine, and daylight, while those aged ≥ 13 were more affected by the accumulation of humidity over multiple days. Prediction accuracy for influenza was highest in phase A and lowest in phase B. CONCLUSIONS: The varying degrees of intervention adopted during different phases led to significant differences both in the impact of meteorological factors on influenza and in influenza prediction. In association studies of respiratory infectious diseases (especially influenza) and environmental factors, it is advisable either to exclude periods with heavy external intervention, so as to reduce interference with the estimated environmental effects, or to refine the model to accommodate the changes brought about by intervention measures. In addition, the RF-Bi-LSTM model showed good predictive performance for influenza.
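The excess risk described above can be illustrated with a small sketch. The abstract does not give the formula, so this assumes the convention common in GAM studies with a log link, ER = (exp(β × IQR) − 1) × 100, where β is the fitted coefficient per unit of the meteorological factor. The function names and the sample humidity values are hypothetical.

```python
import math
import statistics


def interquartile_range(values):
    """IQR (Q3 - Q1) of a sample, using the inclusive quantile method."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


def excess_risk(beta, iqr):
    """Percentage change in expected cases for an IQR increase in a factor.

    Assumes a log-link model, so an increase of `iqr` units multiplies the
    expected count by exp(beta * iqr). This formula is an assumption; the
    abstract does not state how ER was computed.
    """
    return (math.exp(beta * iqr) - 1.0) * 100.0


# Hypothetical daily relative humidity readings (%), for illustration only.
humidity = [62.0, 68.0, 71.0, 75.0, 80.0]
iqr = interquartile_range(humidity)
print(f"IQR = {iqr:.1f}, ER = {excess_risk(0.05, iqr):.2f}%")
```

Under this convention, a coefficient of zero gives an ER of 0%, and an ER above 100% means the expected count more than doubles per IQR increase.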
Subjects
Algorithms , COVID-19 , Influenza, Human , Meteorological Concepts , Humans , COVID-19/epidemiology , Influenza, Human/epidemiology , SARS-CoV-2 , Artificial Intelligence , China/epidemiology , Temperature , Risk Factors , Weather , Child
ABSTRACT
BACKGROUND: Influenza outbreaks have occurred frequently in recent years, especially in the summer of 2022, when the number of influenza cases in the southern provinces of China increased abnormally. However, evidence on the driving factors at work during the prodrome period is lacking, which makes early and accurate prediction difficult in practice. METHODS: To avoid the serious interference of strict prevention and control measures during the COVID-19 epidemic period with the analysis of factors influencing influenza, the impact of meteorological and air quality factors on influenza A (H3N2) in Xiamen was analyzed only for the non-coronavirus disease 2019 (COVID-19) period (2013/01/01-202/01/24), using a distributed lag non-linear model. Phylogenetic analysis of influenza A (H3N2) during 2013-2022 was also performed. Influenza A (H3N2) was predicted with a random forest and long short-term memory (RF-LSTM) model using actual and forecasted meteorological and influenza A (H3N2) values. RESULTS: A total of 29,435 influenza cases were reported in 2022, accounting for 58.54% of the total cases during 2013-2022. A (H3N2) dominated the 2022 summer epidemic season, accounting for 95.60%. The influenza cases in the summer of 2022 accounted for 83.72% of the year's cases and 49.02% of all influenza cases reported from 2013 to 2022. Among them, the A (H3N2) cases in the summer of 2022 accounted for 83.90% of all A (H3N2) cases reported from 2013 to 2022. Daily precipitation (20-50 mm), relative humidity (70-78%), low (≤ 3 h) and high (≥ 7 h) sunshine duration, air temperature (≤ 21 °C) and O3 concentration (≤ 30 µg/m3, > 85 µg/m3) had significant cumulative effects on influenza A (H3N2) during the non-COVID-19 period. The daily values of precipitation (PRE), relative humidity (RHU), sunshine duration (SSD), and air temperature (TEM) in the prodrome period of the abnormal influenza A (H3N2) epidemic (weeks 19-22) in the summer of 2022 were significantly different from the average values of the same period from 2013 to 2019 (P < 0.05). The minimum RHU value was 70.5%, the lowest TEM value was 16.0 °C, and there was no sunlight exposure for 9 consecutive days. The highest O3 concentration reached 164 µg/m3. The ranges of these factors were consistent with the risk ranges identified for A (H3N2). The common influenza A (H3N2) variant genotype in 2022 was 3C.2a1b.2a.1a. Predicting influenza A (H3N2) with meteorological forecast values was more accurate than predicting it with actual values only. CONCLUSION: The extreme weather conditions of sustained low temperature and rain may have been important driving factors for the abnormal influenza A (H3N2) epidemic. A low vaccination rate, newly mutated strains, and an insufficient immune barrier from natural infection may have exacerbated this epidemic. Meteorological forecast values can aid in the early prediction of influenza outbreaks. This study can help relevant departments prepare for influenza outbreaks during extreme weather, provide a scientific basis for prevention strategies and risk warnings, better adapt to climate change, and improve public health.
Subjects
COVID-19 , Influenza A Virus, H3N2 Subtype , Influenza, Human , Humans , Influenza A Virus, H3N2 Subtype/genetics , Influenza A Virus, H3N2 Subtype/isolation & purification , Influenza, Human/epidemiology , Influenza, Human/virology , China/epidemiology , COVID-19/epidemiology , COVID-19/virology , Seasons , Phylogeny , Epidemics , SARS-CoV-2/genetics , SARS-CoV-2/isolation & purification
ABSTRACT
BACKGROUND: Zoonotic infections, characterized by great pathogen diversity, wide geographic reach and substantial harm to society, have become a major global public health problem. Early and accurate prediction of their outbreaks is crucial for disease control. The aim of this study was to develop risk prediction models for zoonotic diseases based on time-series incidence data, using three zoonotic diseases in mainland China as cases. METHODS: The incidence data for schistosomiasis, echinococcosis, and leptospirosis were downloaded from the Scientific Data Centre of the National Ministry of Health of China and were processed by interpolation, dynamic curve reconstruction and time series decomposition. Data were decomposed into three distinct components: the trend component, the seasonal component, and the residual component. The trend component was used as input to construct the Long Short-Term Memory (LSTM) prediction model, while the seasonal component was used to compare periods and amplitudes. Finally, the accuracy of the hybrid LSTM prediction model was comprehensively evaluated. RESULTS: This study employed the trend series of incidence numbers and incidence rates of three zoonotic diseases for modeling. The predicted incidence numbers and incidence rates were very close to the real incidence data. Model evaluation revealed that the prediction error of the hybrid LSTM model was smaller than that of the single LSTM. These results indicate that using the trend series as model input yields better-fitting predictive models. CONCLUSIONS: Our study developed hybrid LSTM models for disease outbreak risk prediction using three zoonotic diseases as case studies. We demonstrate that the LSTM, when combined with time series decomposition, delivers more accurate results than conventional LSTM models trained on the raw data series. Disease outbreak trends can be predicted more accurately using hybrid models.
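The decomposition step described above (trend, seasonal and residual components, with the trend passed on to the predictor) can be sketched as follows. The abstract does not say which decomposition algorithm was used, so this assumes a classical additive decomposition with a centered moving average; the function names and the synthetic series are hypothetical, and the LSTM itself is not shown.

```python
def moving_average_trend(series, period):
    """Centered moving average over one period; None where the window does not fit."""
    n = len(series)
    half = period // 2
    trend = [None] * n
    for t in range(half, n - half):
        window = series[t - half:t + half + 1]
        if period % 2 == 0:
            # Even period: half-weight the two end points (a 2 x period average).
            total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        else:
            total = sum(window)
        trend[t] = total / period
    return trend


def decompose_additive(series, period):
    """Split a series into trend, seasonal and residual parts (y = T + S + R)."""
    if len(series) < 2 * period:
        raise ValueError("need at least two full periods of data")
    trend = moving_average_trend(series, period)
    # Average the detrended values at each position within the period.
    buckets = [[] for _ in range(period)]
    for t, (y, tr) in enumerate(zip(series, trend)):
        if tr is not None:
            buckets[t % period].append(y - tr)
    means = [sum(b) / len(b) for b in buckets]
    center = sum(means) / period  # force the seasonal effects to sum to zero
    seasonal = [means[t % period] - center for t in range(len(series))]
    residual = [
        None if tr is None else y - tr - s
        for y, tr, s in zip(series, trend, seasonal)
    ]
    return trend, seasonal, residual


# Synthetic quarterly-style series: linear trend plus a repeating pattern.
pattern = [1.0, -1.0, 2.0, -2.0]
series = [10.0 + 0.5 * t + pattern[t % 4] for t in range(16)]
trend, seasonal, residual = decompose_additive(series, period=4)
# Only the defined trend values would be fed to the downstream model.
model_input = [value for value in trend if value is not None]
```

Feeding the smoothed trend rather than the raw series removes the periodic swings the model would otherwise have to learn, which is one plausible reading of why the hybrid model fits better.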
Subjects
Disease Outbreaks , Echinococcosis , Leptospirosis , Schistosomiasis , Zoonoses , Leptospirosis/epidemiology , Humans , Animals , Echinococcosis/epidemiology , China/epidemiology , Zoonoses/epidemiology , Incidence , Schistosomiasis/epidemiology , Risk Assessment
ABSTRACT
In this study, we propose a novel long short-term memory (LSTM) neural network model that leverages color features (HSV: hue, saturation, value) extracted from street images to estimate air quality, in terms of particulate matter (PM), in four typical European environments: urban, suburban, village, and harbor. To evaluate its performance, we use concentration data for eight parameters of ambient PM (PM1.0, PM2.5, and PM10, particle number concentration, lung-deposited surface area, and equivalent mass concentrations of ultraviolet PM, black carbon, and brown carbon) collected from a mobile monitoring platform during the nonheating season in downtown Augsburg, Germany, along with synchronized street view images. Experimental comparisons were conducted between the LSTM model and other deep learning models (recurrent neural network and gated recurrent unit). The results demonstrate a better performance of the LSTM model compared with other statistically based models. The LSTM-HSV model achieved interpretability rates above 80% for the eight PM metrics mentioned above, indicating the expected performance of the proposed model. Moreover, the successful application of the LSTM-HSV model in other seasons in Augsburg and in other environments (suburbs, villages, and harbor cities) demonstrates its satisfactory generalization capabilities in both temporal and spatial dimensions. These results underscore the potential of the LSTM-HSV model as a versatile tool for estimating air pollution after presampling of the studied area, with broad implications for urban planning and public health initiatives.
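As an illustration of the color-feature step, the sketch below reduces one image to a single (hue, saturation, value) triple. The abstract does not say how the HSV features were summarized, so the per-image mean used here is an assumption, as are the function name and the tiny hand-made "frames". Hue is averaged on the circle because it wraps around at 0 and 1.

```python
import colorsys
import math


def mean_hsv(pixels):
    """Average HSV of RGB pixels given as (r, g, b) tuples in the range 0-255.

    Returns (hue, saturation, value), each in [0, 1]. Hue uses a circular
    mean; saturation and value use arithmetic means. Gray pixels have hue 0
    by colorsys convention and therefore pull the hue mean toward red.
    """
    if not pixels:
        raise ValueError("pixels must not be empty")
    sin_sum = cos_sum = sat_sum = val_sum = 0.0
    for r, g, b in pixels:
        h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        angle = 2.0 * math.pi * h
        sin_sum += math.sin(angle)
        cos_sum += math.cos(angle)
        sat_sum += s
        val_sum += v
    n = len(pixels)
    hue = (math.atan2(sin_sum, cos_sum) / (2.0 * math.pi)) % 1.0
    return hue, sat_sum / n, val_sum / n


# Hypothetical sequence of street-view frames, each a flat list of pixels.
frames = [
    [(200, 210, 220), (180, 190, 205)],  # pale, low-saturation sky
    [(120, 120, 125), (110, 112, 118)],  # darker, grayer sky
]
# One HSV triple per frame; a sequence like this could feed a recurrent model.
feature_sequence = [mean_hsv(frame) for frame in frames]
```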