ABSTRACT
Peptides are promising drug development frameworks that have been hindered by intrinsic undesired properties, including hemolytic activity. We aim to gain better insight into the chemical space of hemolytic peptides using a novel approach based on network science and data mining. Metadata networks (METNs) were useful to characterize and find general patterns associated with hemolytic peptides, whereas Half-Space Proximal Networks (HSPNs) represented the hemolytic peptide space. The best candidate HSPNs were used to extract various subsets of hemolytic peptides (scaffolds) considering network centrality and peptide similarity. These scaffolds proved useful in developing robust similarity-based model classifiers. Finally, using an alignment-free approach, we reported 47 putative hemolytic motifs, which can be used as toxic signatures when developing novel peptide-based drugs. We provided evidence that the number of hemolytic motifs in a sequence might be related to the likelihood of being hemolytic.
Subject(s)
Data Mining , Hemolysis , Peptides , Data Mining/methods , Hemolysis/drug effects , Humans , Computational Biology/methods
ABSTRACT
Molecular evolution analysis typically involves identifying selection pressure and reconstructing evolutionary trends. This process usually requires access to specific data related to a target gene or gene family within a particular group of organisms. While recent advancements in high-throughput sequencing techniques have resulted in the rapid accumulation of extensive genomics and transcriptomics data and the creation of new databases in public repositories, extracting valuable insights from such vast data sets remains a significant challenge for researchers. Here, we elucidated the evolutionary history of THI1, a gene responsible for encoding thiamine thiazole synthase. The thiazole ring is a precursor for vitamin B1 and a crucial cofactor in primary metabolic pathways. A thorough search of complete genomes available within public repositories reveals 702 THI1 homologs of Archaea and Eukarya. Throughout its diversification, the plant lineage has preserved the THI1 gene by incorporating the N-terminus and targeting the chloroplasts. Likewise, evolutionary pressures and lifestyle appear to be associated with retention of TPP riboswitch sites and consequent dual posttranscriptional regulation of the de novo biosynthesis pathway in basal groups. Multicopy retention of THI1 is not a typical plant pattern, even after successive genome duplications. Examining cis-regulatory sites in plants uncovers two shared motifs across all plant lineages. Data mining of 484 transcriptome data sets supports THI1 homolog expression under a light/dark cycle response and a tissue-specific pattern. Finally, the work presented takes a new look at public repositories as an opportunity to explore evolutionary trends for THI1.
Subject(s)
Data Mining , Evolution, Molecular , Plants/genetics , Phylogeny
ABSTRACT
Long COVID is characterized by persistent symptoms that extend beyond established timeframes. Its varied presentation across different populations and healthcare systems poses significant challenges in understanding its clinical manifestations and implications. In this study, we present a novel application of a text mining technique to automatically extract unstructured data from a long COVID survey conducted at a prominent university hospital in São Paulo, Brazil. Our phonetic text clustering (PTC) method enables the exploration of unstructured Electronic Healthcare Records (EHR) data to unify different written forms of similar terms into a single phonemic representation. We used n-gram text analysis to detect compound words and negated terms in Portuguese-BR, focusing on medical conditions and symptoms related to long COVID. By leveraging text mining, we aim to contribute to a deeper understanding of this chronic condition and its implications for healthcare systems globally. The model developed in this study has the potential for scalability and applicability in other healthcare settings, thereby supporting broader research efforts and informing clinical decision-making for long COVID patients.
Subject(s)
COVID-19 , Data Mining , Humans , Data Mining/methods , COVID-19/epidemiology , COVID-19/virology , Electronic Health Records , Hospitalization , SARS-CoV-2/isolation & purification , Brazil/epidemiology , Post-Acute COVID-19 Syndrome
ABSTRACT
OBJECTIVE: This study aims to identify safety signals of ophthalmic prostaglandin analogues through data mining of the Food and Drug Administration Adverse Event Reporting System (FAERS) database. METHODS: A data mining search by proportional reporting ratio, reporting odds ratio, Bayesian confidence propagation neural network, information component 0.25 and χ2 for safety signal detection was conducted on the FAERS database for the following ophthalmic medications: latanoprost, travoprost, tafluprost and bimatoprost. RESULTS: 12 preferred terms were statistically associated: diabetes mellitus, n=2; hypoacusis, n=2; malignant mediastinal neoplasm, n=1; blood immunoglobulin E increased, n=1; cataract, n=1; blepharospasm, n=1; full blood count abnormal, n=1; skin exfoliation, n=1; chest discomfort, n=1; and dry mouth, n=1. LIMITATION OF THE STUDY: The FAERS database's limitations, such as the undetermined causality of cases, under-reporting and the lack of restriction to only health professionals reporting this type of event, could modify the statistical outcomes. These limitations are particularly relevant in the context of ophthalmic drug analysis, as they can affect the accuracy and reliability of the data, potentially leading to biased or incomplete results. CONCLUSIONS: Our findings have revealed a potential relationship, supported by biological plausibility, among malignant mediastinal neoplasm, full blood count abnormal, blood immunoglobulin E increased, diabetes mellitus, blepharospasm, cataracts, chest discomfort and dry mouth; therefore, it is relevant to continue investigating the possible drug-event association, whether to refute the safety signal or identify a new risk.
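Two of the disproportionality measures named in the METHODS above, the proportional reporting ratio and the reporting odds ratio, are computed from a 2x2 table of report counts. The following is a minimal sketch under stated assumptions: the function name, the cell labels `a` to `d`, and the example counts are hypothetical and are not taken from the study, and the 95% interval uses the usual log-scale normal approximation rather than anything the abstract specifies.

```python
import math


def disproportionality(a, b, c, d):
    """Return (PRR, ROR, 95% CI of ROR) from a 2x2 table of report counts.

    a: reports with the drug of interest and the event of interest
    b: reports with the drug of interest and any other event
    c: reports with all other drugs and the event of interest
    d: reports with all other drugs and any other event
    """
    # Proportional reporting ratio: event share among the drug's reports
    # divided by event share among all other reports.
    prr = (a / (a + b)) / (c / (c + d))
    # Reporting odds ratio: cross-product ratio of the same table.
    ror = (a * d) / (b * c)
    # Log-scale standard error and approximate 95% confidence interval.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    ci = (math.exp(math.log(ror) - 1.96 * se),
          math.exp(math.log(ror) + 1.96 * se))
    return prr, ror, ci


# Hypothetical counts, for illustration only.
prr, ror, ci = disproportionality(a=2, b=98, c=10, d=9890)
print(round(prr, 2), round(ror, 2), ci)
```

With very small cell counts such as n=1 or n=2, the interval is wide, which is one reason signals of this kind are treated as hypotheses to investigate rather than confirmed associations.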
Subject(s)
Adverse Drug Reaction Reporting Systems , Data Mining , Databases, Factual , United States Food and Drug Administration , Humans , Adverse Drug Reaction Reporting Systems/statistics & numerical data , United States/epidemiology , Prostaglandins, Synthetic/adverse effects , Antihypertensive Agents/adverse effects , Ophthalmic Solutions/adverse effects
ABSTRACT
BACKGROUND: Tandem repeats (TRs) are specific sequences in genomic DNA repeated in tandem that are present in all organisms. Among the subcategories of TRs are satellite repeats, which are divided into macrosatellites, minisatellites, and microsatellites; the last two are of specific interest because their instability allows them to identify polymorphisms between organisms. Currently, most mining tools focus on Simple Sequence Repeat (SSR) mining, and only a few can identify SSRs in coding regions. RESULTS: We developed a microsatellite mining software called SATIN (Micro and Mini SATellite IdentificatioN tool) based on a new sliding window algorithm written in C and Python. It represents a new approach to SSR mining by addressing the limitations of existing tools, particularly in coding region SSR mining. SATIN is available at https://github.com/labgm/SATIN.git . It was shown to be the second fastest for perfect and compound SSR mining. It can identify SSRs from coding regions as well as SSRs with motif sizes larger than 6. Besides SSR mining, SATIN can also analyze SSR polymorphism in coding regions from pre-determined groups and identify SSRs that are differentially abundant among them on a per-gene basis. To validate, we analyzed SSRs from two groups of Escherichia coli (K12 and O157) and compared the results with 5 known SSRs from coding regions. SATIN identified all 5 SSRs among 237 genes with at least one SSR. CONCLUSIONS: SATIN is a novel microsatellite search software that uses an innovative sliding window technique based on a numerical list for repeat region search to identify perfect and composite SSRs while generating comprehensible and analyzable outputs. It accepts files in FASTA or GenBank format as input for microsatellite mining and can identify SSRs present in coding regions for GenBank files. In conclusion, we expect SATIN to help identify potential SSRs to be used as genetic markers.
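To make the notion of a perfect SSR concrete, the sketch below scans a DNA string for motifs repeated in tandem. It is not SATIN's algorithm: the abstract describes a sliding window over a numerical list, whereas this illustration swaps in a regular-expression scan. The function name, parameters, default thresholds, and the example sequence are all hypothetical.

```python
import re


def find_perfect_ssrs(seq, min_motif=1, max_motif=6, min_repeats=4):
    """Return sorted (start, motif, repeat_count) tuples for perfect tandem repeats."""
    seq = seq.upper()
    hits = []
    for k in range(min_motif, max_motif + 1):
        # A motif of length k followed by at least (min_repeats - 1) exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_repeats - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are themselves built from a shorter unit
            # (for example "ATAT"), since the shorter motif reports them.
            if any(motif == motif[:j] * (k // j)
                   for j in range(1, k) if k % j == 0):
                continue
            hits.append((m.start(), motif, len(m.group(0)) // k))
    return sorted(hits)


# Hypothetical sequence containing (AT)x4 and (CAG)x4.
print(find_perfect_ssrs("GGATATATATCCCAGCAGCAGCAGTT"))
```

Compound and imperfect SSRs, and restriction to coding regions from GenBank annotations, would need additional logic that this sketch does not attempt.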
Subject(s)
Data Mining , Microsatellite Repeats , Polymorphism, Genetic , Software , Microsatellite Repeats/genetics , Data Mining/methods , Algorithms , Open Reading Frames/genetics , DNA, Satellite/genetics
Subject(s)
Copyright , Data Mining , Humans , Copyright/legislation & jurisprudence , Brazil , Biomedical Research
ABSTRACT
AIMS: Machine learning models can use image and text data to predict the number of years since diabetes diagnosis; such a model can be applied to new patients to predict, approximately, how long the new patient may have lived with diabetes unknowingly. We aimed to develop a model to predict self-reported diabetes duration. METHODS: We used the Brazilian Multilabel Ophthalmological Dataset. The unit of analysis was the fundus image and its meta-data, regardless of the patient. We included people aged 40+ years and fundus images without diabetic retinopathy. Fundus images and meta-data (sex, age, comorbidities and taking insulin) were passed to the MedCLIP model to extract the embedding representation. The embedding representation was passed to an Extra Tree Classifier to predict: 0-4, 5-9, 10-14 and 15+ years with self-reported diabetes. RESULTS: There were 988 images from 563 people (mean age = 67 years; 64% were women). Overall, the F1 score was 57%. The group with 15+ years of self-reported diabetes had the highest precision (64%) and F1 score (63%), while the highest recall (69%) was observed in the group with 0-4 years. The proportion of correctly classified observations was 55% for the group with 0-4 years, 51% for 5-9 years, 58% for 10-14 years, and 64% for 15+ years with self-reported diabetes. CONCLUSIONS: The machine learning model had acceptable accuracy and F1 score, and correctly classified more than half of the patients according to diabetes duration. Using large foundational models to extract image and text embeddings seems a feasible and efficient approach to predict years living with self-reported diabetes.
Subject(s)
Diabetes Mellitus , Fundus Oculi , Machine Learning , Predictive Value of Tests , Self Report , Humans , Female , Male , Aged , Middle Aged , Time Factors , Diabetes Mellitus/diagnosis , Diabetes Mellitus/epidemiology , Brazil/epidemiology , Adult , Databases, Factual , Diabetic Retinopathy/diagnosis , Diabetic Retinopathy/epidemiology , Data Mining/methods , Reproducibility of Results , Image Interpretation, Computer-Assisted
ABSTRACT
OBJECTIVE: To determine reference intervals (RI) for fasting blood insulin (FBI) in Brazilian adolescents, 12 to 17 years old, by direct and indirect approaches, and to validate the indirectly determined RI. METHODS: Two databases were used for RI determination. Database 1 (DB1), used to obtain RI through an a posteriori direct method, consisted of prospectively selected healthy individuals. Database 2 (DB2) was retrospectively mined from an outpatient laboratory information system (LIS) and used for the indirect method (Bhattacharya method). RESULTS: From DB1, 29345 individuals were enrolled (57.65 % female) and seven age range and sex partitions were statistically determined according to mean FBI values: females: 12 and 13 years old, 14 years old, 15 years old, 16 and 17 years old; and males: 12, 13 and 14 years old, 15 years old, 16 and 17 years old. From DB2, 5465 adolescents (67.5 % female) were selected and grouped according to DB1 partitions. The mean FBI level was significantly higher in DB2 in all groups. The RI upper limit (URL) determined by the Bhattacharya method was slightly lower than the 90 % CI URL directly obtained on DB1, except for the group of females 12 and 13 years old. High agreement rates for diagnosing elevated FBI in all groups on DB1 validated the indirect RI presented. CONCLUSION: The present study demonstrates that the Bhattacharya indirect method to determine FBI RI in adolescents can overcome some of the difficulties and challenges of the direct approach.
Subject(s)
Data Mining , Fasting , Insulin , Humans , Adolescent , Female , Male , Reference Values , Brazil , Child , Insulin/blood , Fasting/blood , Data Mining/methods , Retrospective Studies , Databases, Factual
ABSTRACT
BACKGROUND: The COVID-19 pandemic has caused significant disruptions to everyday life and has had social, political, and financial consequences that will persist for years. Several initiatives with intensive use of technology were quickly developed in this scenario. However, technologies that enhance epidemiological surveillance in contexts with low testing capacity and healthcare resources are scarce. Therefore, this study aims to address this gap by developing a data science model that uses routinely generated healthcare encounter records to detect possible new outbreaks early in real-time. METHODS: We defined an epidemiological indicator that is a proxy for suspected cases of COVID-19 using the health records of Emergency Care Unit (ECU) patients and text mining techniques. The open-field dataset comprises 2,760,862 medical records from nine ECUs, where each record has information about the patient's age, reported symptoms, and the time and date of admission. We also used a dataset where 1,026,804 cases of COVID-19 were officially confirmed. The records range from January 2020 to May 2022. Sample cross-correlation between two finite stochastic time series was used to evaluate the models. RESULTS: For patients with age 18 years, we find a time lag of 72 days and a cross-correlation of approximately 0.82, a time lag of 25 days and a cross-correlation of approximately 0.93, and a time lag of 17 days and a cross-correlation of approximately 0.88 for the first, second, and third waves, respectively. CONCLUSIONS: In conclusion, the developed model can aid in the early detection of signs of possible new COVID-19 outbreaks, weeks before traditional surveillance systems, thereby allowing preventive and control actions in public health to be initiated earlier and with a higher likelihood of success.
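The evaluation step above, sample cross-correlation between two time series at a given lag, can be illustrated with a short sketch. It assumes the textbook estimator (lagged cross-covariance divided by the product of the two full-series standard deviations); the abstract does not state which estimator the authors used. The function names and the toy series are hypothetical.

```python
def sample_ccf(x, y, lag):
    """Sample cross-correlation between x[t] and y[t + lag], for lag >= 0."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # Lagged cross-covariance, using the full-series means and dividing by n.
    cxy = sum((x[t] - mx) * (y[t + lag] - my) for t in range(n - lag)) / n
    # Lag-zero variances of each series.
    cxx = sum((v - mx) ** 2 for v in x) / n
    cyy = sum((v - my) ** 2 for v in y) / n
    return cxy / (cxx * cyy) ** 0.5


def best_lag(x, y, max_lag):
    """Lag in 0..max_lag at which x best anticipates y."""
    return max(range(max_lag + 1), key=lambda k: sample_ccf(x, y, k))


# Toy series: `confirmed` repeats `indicator` three time steps later.
indicator = [0, 1, 3, 7, 3, 1, 0, 0, 0, 0, 0, 0]
confirmed = [0, 0, 0, 0, 1, 3, 7, 3, 1, 0, 0, 0]
print(best_lag(indicator, confirmed, max_lag=6))
```

In this framing, a large positive best lag means the proxy indicator rises well before officially confirmed cases do, which is the property the study relies on for early warning.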
Subject(s)
COVID-19 , Humans , Adolescent , COVID-19/epidemiology , Electronic Health Records , Pandemics , Disease Outbreaks , Data Mining
ABSTRACT
Alzheimer's disease (AD) is the most common type of dementia and accounts for 60%-70% of reported cases. MicroRNAs (miRNAs) are small non-coding RNAs that play a crucial role in gene expression regulation. Although the diagnosis of AD is primarily clinical, several miRNAs have been associated with AD and considered as potential markers for diagnosis and progression of AD. We sought to match AD-related miRNAs in cerebrospinal fluid (CSF) found in the GeoDataSets, evaluated by machine learning, with miRNAs listed in a systematic review and a pathway analysis. Using machine learning approaches, we identified the most differentially expressed miRNAs in Gene Expression Omnibus (GEO), which were validated by the systematic review, using the acronym PECO: Population (P): patients with AD; Exposure (E): expression of miRNAs; Comparison (C): healthy individuals; and Objective (O): miRNAs differentially expressed in CSF. Additionally, pathway enrichment analysis was performed to identify the main pathways involving at least four of the selected miRNAs. Four miRNAs were identified for differentiating between patients with and without AD by machine learning combined with the systematic review, followed by the pathway analysis: miRNA-30a-3p, miRNA-193a-5p, miRNA-143-3p, miRNA-145-5p. The epidermal growth factor, MAPK, TGF-beta and ATM-dependent DNA damage response pathways were regulated by these miRNAs, but only the MAPK pathway presented higher relevance after a randomized pathway analysis. These findings have the potential to assist in the development of diagnostic tests for AD using miRNAs as biomarkers, as well as to provide understanding of the relationship between different pathophysiological mechanisms of AD.
Subject(s)
Alzheimer Disease , Data Mining , Machine Learning , MicroRNAs , Alzheimer Disease/cerebrospinal fluid , Alzheimer Disease/genetics , Alzheimer Disease/diagnosis , Humans , MicroRNAs/cerebrospinal fluid , MicroRNAs/genetics , Biomarkers/cerebrospinal fluid
ABSTRACT
BACKGROUND: Occupational accidents in the plumbing activity in the construction sector in developing countries lead to high rates of work absenteeism, which heavily influences the productivity of enterprises. OBJECTIVE: To propose a model based on the Plan, Do, Check, and Act cycle and data mining for the prevention of occupational accidents in the plumbing activity in the construction sector. METHODS: This cross-sectional study was conducted with a total of 200 male technical workers in plumbing. It considers biological, biomechanical, chemical, and physical risk factors. Three data mining algorithms were compared for classifying the occurrence of occupational accidents: Logistic Regression, Naive Bayes, and Decision Trees. The model was validated on 20% of the data collected, maintaining the same proportion between accidents and non-accidents. The model was applied to data collected from the last 17 years of occupational accidents in the plumbing activity in a Colombian construction company. RESULTS: The results showed that, in 90.5% of the cases, the decision tree classifier (J48) correctly identified the possible cases of occupational accidents with the biological, chemical, and biomechanical risk factor training variables applied in the model. CONCLUSION: The results of this study are promising in that the model is efficient in predicting the occurrence of an occupational accident in the plumbing activity in the construction sector. For the accidents identified and the associated causes, a plan of measures to mitigate the risk of occupational accidents is proposed.
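The validation described above holds out 20% of the records while keeping the same proportion of accidents and non-accidents, which is a stratified split. A minimal sketch follows; the function name is hypothetical, and the class counts in the example (60 accidents, 140 non-accidents) are invented for illustration, since the abstract reports only the total of 200 workers.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split record indices into train/test, preserving class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    train, test = [], []
    for indices in by_class.values():
        # Shuffle within each class, then hold out the same fraction of each.
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Hypothetical class balance, for illustration only.
labels = ["accident"] * 60 + ["no_accident"] * 140
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Splitting within each class avoids a validation set that, by chance, contains too few accident cases to estimate classifier performance on them.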
Subject(s)
Accidents, Occupational , Construction Industry , Data Mining , Humans , Data Mining/methods , Cross-Sectional Studies , Accidents, Occupational/prevention & control , Accidents, Occupational/statistics & numerical data , Male , Adult , Colombia/epidemiology , Risk Factors , Bayes Theorem , Decision Trees , Logistic Models , Algorithms
ABSTRACT
Objective: To summarize data from clinical trials that used silver diamine fluoride (SDF) to prevent and treat caries lesions and dentinal hypersensitivity. Material and Methods: Six electronic databases were searched in May 2022. The concentration of SDF, type of usage (alone/combined), dentition, anterior/posterior teeth, tooth region, dental tissue, number of treated surfaces, the intervention environment, participants' age, frequency and duration of SDF application, purpose, and outcome were the extracted variables. The type of study, year of publication, authors, journals, and country were also investigated. Results: From 8860 articles, 53 were selected. Most were randomized (n=38) and applied 38% SDF (n=43), alone (n=44), on multiple surfaces (n=44), only in dentin (n=36), of the crown (n=46) of anterior and posterior (n=36) primary teeth (n=39). The studies were preferably carried out outside the clinic (n=31), only in children (n=33), with reapplication of SDF (n=30), but did not report the duration of application (n=19). SDF was most used to treat (n=46) only caries lesions (n=50). They were published between 2001 and 2022, mainly in the Journal of Dentistry (n=10). China (n=19) and Lo E.GM (n=19) were the country and author that published the most, respectively. Conclusion: Silver diamine fluoride 38% alone was most used to treat caries lesions in the dentin of the crown of all primary teeth, preferably applied on multiple surfaces, requiring reapplication, and outside the clinic.
Subject(s)
Cariostatic Agents/chemistry , Dentin Sensitivity/etiology , Dental Caries/etiology , Data Mining
ABSTRACT
Text mining enables search, extraction, categorisation and information visualisation. This study aimed to identify oral manifestations in patients with COVID-19 using text mining to facilitate extracting relevant clinical information from a large set of publications. A list of publications from the open-access COVID-19 Open Research Dataset was downloaded using keywords related to oral health and dentistry. A total of 694,366 documents were retrieved. Filtering the articles using text mining yielded 1,554 oral health/dentistry papers. The list of articles was classified into five topics after applying a Latent Dirichlet Allocation (LDA) model. This classification was compared to the author's classification which yielded 17 categories. After a full-text review of articles in the category "Oral manifestations in patients with COVID-19", eight papers were selected to extract data. The most frequent oral manifestations were xerostomia (n = 405, 17.8%) and mouth pain or swelling (n = 289, 12.7%). These oral manifestations in patients with COVID-19 must be considered with other symptoms to diminish the risk of dentist-patient infection.
Subject(s)
COVID-19 , Text Messaging , Humans , Data Mining
ABSTRACT
PURPOSE: There are currently more than 480 primary immune deficiency (PID) diseases and about 7000 rare diseases that together afflict around 1 in every 17 humans. Computational aids based on data mining and machine learning might facilitate the diagnostic task by extracting rules from large datasets and making predictions when faced with new problem cases. In a proof-of-concept data mining study, we aimed to predict PID diagnoses using a supervised machine learning algorithm based on classification tree boosting. METHODS: Through a data query at the USIDNET registry we obtained a database of 2396 patients with common diagnoses of PID, including their clinical and laboratory features. We kept 286 features and all 12 diagnoses to include in the model. We used the XGBoost package with parallel tree boosting for the supervised classification model, and SHAP for variable importance interpretation, on Python v3.7. The patient database was split into training and testing subsets, and after boosting through gradient descent, the predictive model provides measures of diagnostic prediction accuracy and individual feature importance. After a baseline performance test, we used the Class Weighting Hyperparameter, or scale_pos_weight to correct for imbalanced classification. RESULTS: The twelve PID diagnoses were CVID (1098 patients), DiGeorge syndrome, Chronic granulomatous disease, Congenital agammaglobulinemia, PID not otherwise classified, Specific antibody deficiency, Complement deficiency, Hyper-IgM, Leukocyte adhesion deficiency, ectodermal dysplasia with immune deficiency, Severe combined immune deficiency, and Wiskott-Aldrich syndrome. For CVID, the model found an accuracy on the train sample of 0.80, with an area under the ROC curve (AUC) of 0.80, and a Gini coefficient of 0.60. In the test subset, accuracy was 0.76, AUC 0.75, and Gini 0.51. 
The positive feature value to predict CVID was highest for upper respiratory infections, asthma, autoimmunity and hypogammaglobulinemia. Features with the highest negative predictive value were high IgE, growth delay, abscess, lymphopenia, and congenital heart disease. For the rest of the diagnoses, accuracy stayed between 0.75 and 0.99, AUC 0.46-0.87, Gini 0.07-0.75, and LogLoss 0.09-8.55. DISCUSSION: Clinicians should remember to consider the negative predictive features together with the positives. We are calling this a proof-of-concept study to continue with our explorations. A good performance is encouraging, and feature importance might aid feature selection for future endeavors. In the meantime, we can learn from the rules derived by the model and build a user-friendly decision tree to generate differential diagnoses.
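Two of the quantities reported above can be reproduced with simple arithmetic. The sketch below assumes the usual relation between the Gini coefficient and the area under the ROC curve (Gini = 2 x AUC - 1), which matches the reported training values (AUC 0.80, Gini 0.60), and it assumes the common heuristic of setting `scale_pos_weight` to the ratio of negative to positive cases. The abstract does not state how the authors computed either value, and the function names are hypothetical.

```python
def gini_from_auc(auc):
    """Gini coefficient implied by an area under the ROC curve."""
    return 2 * auc - 1


def scale_pos_weight(n_positive, n_total):
    """Negative-to-positive ratio, a common choice for imbalanced binary classes."""
    return (n_total - n_positive) / n_positive


# Reported training AUC for CVID; the result is close to the reported Gini of 0.60.
print(round(gini_from_auc(0.80), 2))
# 1098 CVID patients among 2396 in the registry extract (before the train/test split).
print(round(scale_pos_weight(1098, 2396), 3))
```

For the test subset, the rounded AUC of 0.75 implies a Gini of 0.50 rather than the reported 0.51, a difference consistent with rounding of the underlying AUC. In practice the weight would be computed on the training subset only.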
Subject(s)
Primary Immunodeficiency Diseases , Wiskott-Aldrich Syndrome , Humans , Diagnosis, Differential , Machine Learning , Data Mining
ABSTRACT
Determination of live weight, one of the most important traits determining meat production, is a very important issue for herd management and sustainable livestock production. Alternative methods are needed, especially in rural conditions, because weighing equipment is often difficult to obtain. For conditions without weighing equipment, attempts have been made to establish relationships between body measurements and live weight. Since these relationships differ from species to species and breed to breed, the need for new studies is high. The aim of the present study was to evaluate body measurement data using several statistical approaches. To this end, several data mining and machine learning algorithms, namely multivariate adaptive regression splines (MARS), classification and regression tree (CART), and support vector machine regression (SVR), were applied to training (70%) and test (30%) sets. To predict final body weight, data from 280 hair sheep (162 female and 118 male) ranging from 2 months to 3 years of age were used. Various goodness-of-fit criteria were used to evaluate the performance of these algorithms. Although the MARS and SVR algorithms gave the same and highest R2 and r values for both the training and test sets, the SVR algorithm is one of the methods recommended by this study, especially when the other goodness-of-fit criteria are considered. In conclusion, SVR may be a useful machine learning tool for establishing hair sheep breed standards and may contribute to improving sheep meat quality in Mexico.
Subject(s)
Biometry , Sheep, Domestic , Sheep , Animals , Algorithms , Data Mining , Machine Learning , Body Weight
ABSTRACT
Introduction: In Cuba and in the rest of the world, cardiovascular diseases are recognized as a major and growing public health problem that causes high mortality. Objective: To design a predictive model to estimate the risk of cardiovascular disease based on artificial intelligence techniques. Methods: The data source was a prospective cohort including 1633 patients, followed for 10 years. The data mining tool Weka was used, and attribute selection techniques were employed to obtain a smaller subset of significant variables. To generate the models, the rule algorithm JRip and the meta-algorithm Attribute Selected Classifier were applied, using J48 and Multilayer Perceptron as classifiers. The models obtained were compared, and the metrics most commonly used for imbalanced classes were applied. Results: The most significant attribute was a history of arterial hypertension, followed by high-density and low-density lipoprotein cholesterol, high-sensitivity C-reactive protein and systolic blood pressure; all the prediction rules were derived from these attributes. The algorithms were effective in generating the model. The best performance was obtained using the Multilayer Perceptron, with a true positive rate of 95.2 percent and an area under the ROC curve of 0.987 in cross-validation. Conclusions: A predictive model was designed using artificial intelligence techniques; it is a valuable resource oriented to the prevention of cardiovascular diseases in primary health care.
Subject(s)
Humans , Male , Female , Primary Health Care , Artificial Intelligence , Prospective Studies , Data Mining/methods , Forecasting/methods , Heart Disease Risk Factors , Cuba
ABSTRACT
BACKGROUND: Although social media has the potential to spread misinformation, it can also be a valuable tool for elucidating the social factors that contribute to the onset of negative beliefs. As a result, data mining has become a widely used technique in infodemiology and infoveillance research to combat misinformation effects. On the other hand, there is a lack of studies that specifically aim to investigate misinformation about fluoride on Twitter. Web-based individual concerns on the side effects of fluoridated oral care products and tap water stimulate the emergence and propagation of convictions that boost antifluoridation activism. In this sense, a previous content analysis-driven study demonstrated that the term fluoride-free was frequently associated with antifluoridation interests. OBJECTIVE: This study aimed to analyze "fluoride-free" tweets regarding their topics and frequency of publication over time. METHODS: A total of 21,169 tweets published in English between May 2016 and May 2022 that included the keyword "fluoride-free" were retrieved by the Twitter application programming interface. Latent Dirichlet allocation (LDA) topic modeling was applied to identify the salient terms and topics. The similarity between topics was calculated through an intertopic distance map. Moreover, an investigator manually assessed a sample of tweets depicting each of the most representative word groups that determined specific issues. Lastly, additional data visualization was performed regarding the total count of each topic of fluoride-free record and its relevance over time, using Elastic Stack software. RESULTS: We identified 3 issues by applying the LDA topic modeling: "healthy lifestyle" (topic 1), "consumption of natural/organic oral care products" (topic 2), and "recommendations for using fluoride-free products/measures" (topic 3). 
Topic 1 was related to users' concerns about leading a healthier lifestyle and the potential impacts of fluoride consumption, including its hypothetical toxicity. Complementarily, topic 2 was associated with users' personal interests and perceptions of consuming natural and organic fluoride-free oral care products, whereas topic 3 was linked to users' recommendations for using fluoride-free products (eg, switching from fluoridated toothpaste to fluoride-free alternatives) and measures (eg, consuming unfluoridated bottled water instead of fluoridated tap water), comprising the propaganda of dental products. Additionally, the count of tweets on fluoride-free content decreased between 2016 and 2019 but increased again from 2020 onward. CONCLUSIONS: Public concerns toward a healthy lifestyle, including the adoption of natural and organic cosmetics, seem to be the main motivation of the recent increase of "fluoride-free" tweets, which can be boosted by the propagation of fluoride falsehoods on the web. Therefore, public health authorities, health professionals, and legislators should be aware of the spread of fluoride-free content on social media to create and implement strategies against their potential health damage for the population.
Subject(s)
COVID-19 , Social Media , Humans , Communication , Data Mining , Fluorides , Consumer Health Information , Healthy Lifestyle , Infodemic , Infodemiology
ABSTRACT
A dark kitchen is a delivery-only restaurant that operates without direct contact with the consumer, has no premises for local consumption and sells exclusively through online platforms. The main objective of this work is to identify and characterise dark kitchens in three urban centres featured in the most used food delivery app in Brazil. To this end, data collection was conducted in two phases. In the first phase, through data mining, we collected information on restaurants in three cities (Limeira, Campinas and São Paulo - Brazil) that were listed in the food delivery app. A total of 22,520 establishments were retrieved by searching from the central point of each of the cities. In the second phase, the first 1,000 restaurants in each city were classified as dark kitchens, standard, or undefined restaurants. A thematic content analysis was conducted to further distinguish the dark kitchen models. Of the restaurants evaluated, 1,749 (65.2%) were classified as standard restaurants, 727 (27.1%) as dark kitchens, and 206 (7.7%) as undefined. Dark kitchens were more dispersed and located further away from the central points than standard restaurants. Meals in dark kitchens were cheaper than in standard restaurants, and dark kitchens had fewer user reviews. Most of the dark kitchens in São Paulo served Brazilian dishes, while in the smaller cities, Limeira and Campinas, they mainly served snacks and desserts. Six different models of dark kitchen were identified: independent dark kitchen; shell-type (hub); franchise; virtual kitchen in a standard restaurant (different menu); virtual kitchen in a standard restaurant (similar menu but different name); and home-based dark kitchen. The modelling approach and methodology used to classify and identify dark kitchens are considered a contribution to science as they allow a better understanding of this fast-growing sector of the food industry. 
This in turn can help to develop management strategies and policies for the sector. Our study is also of value to regulators, who can use it to determine the proliferation of dark kitchens through urban planning and to promote appropriate guidelines for these establishments, which differ from standard restaurants.
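The reported shares can be reproduced from the counts given in the abstract. The sketch below is only an arithmetic check, and the variable names are illustrative.

```python
# Counts as reported in the abstract above.
counts = {"standard": 1749, "dark kitchen": 727, "undefined": 206}

# Number of restaurants covered by the reported classification.
total = sum(counts.values())

# Percentage share of each class, rounded to one decimal place.
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

print(total)
print(shares)
```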
Subject(s)
Meals , Restaurants , Brazil , Data Collection , Data Mining
ABSTRACT
The power of social media in spreading the idea of wellbeing has already been addressed by several psychologists and scholars through vocabulary analysis; however, the use of the human flourishing (HF) concept on such platforms has not yet been analyzed. This study addresses this topic by analyzing more than 600,000 Twitter messages posted by a community of users who associate themselves with HF and comparing them to more than 400,000 messages in other Twitter lists. The study aims to identify the HF users' interests, the richness of their vocabulary, the feelings and emotions that they share, and the grammar used in their constructions. The analysis was conducted through computational text mining methods, including sentiment analysis, natural language processing (NLP), and topic modeling. The results revealed that although HF users show average vocabulary diversity, they share more positive emotions and a greater variety of emojis. They also tend to discuss different topics, ranging from more spiritual and health-related subjects to more practical matters related to work and success. Finally, they generally write from an empathetic state of mind, caring about people's day-to-day feelings and about the world.
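The abstract does not state how vocabulary diversity was measured. As one common option, assumed here purely for illustration, the type-token ratio divides the number of distinct words by the total number of words in a text. The tokenizer and the example message below are hypothetical simplifications.

```python
import re

def type_token_ratio(text):
    """Return distinct words / total words for a text (0.0 if it has no words)."""
    # Simplified tokenizer: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# Hypothetical message: 5 tokens, 4 distinct ("the" appears twice).
message = "the cat and the dog"
print(type_token_ratio(message))
```

Because this ratio tends to fall as texts get longer, comparisons across user groups are usually made on samples of similar size.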
Subject(s)
Social Media , Humans , Data Mining , Emotions , Healthy Volunteers , Linguistics
ABSTRACT
Accident investigation reports provide useful knowledge that helps companies propose preventive and mitigative measures. However, the information in accident report databases is typically voluminous and complex, contains errors, and has missing and/or redundant data. In this article, we propose text mining and natural language processing techniques to investigate low-quality accident reports. We adopted machine learning (ML) to detect and investigate inconsistencies in accident reports. The methodology was applied to 626 documents collected from an actual hydroelectric power company. The initial ML performance indicated data divergences and concerns related to the report structure. The accident database was then restructured into a more suitable form, confirming the supposition about the quality of the reports investigated. The proposed approach can be used as a diagnostic tool to improve the design of accident investigation reports so that they provide a more useful source of knowledge to support decisions in the safety context.