ABSTRACT
INTRODUCTION: At the beginning of the COVID-19 pandemic, the UK's Scientific Committee issued extreme social distancing measures, termed 'shielding', aimed at a subpopulation deemed extremely clinically vulnerable to infection. National guidance for risk stratification was based on patients' age, comorbidities and immunosuppressive therapies, including biologics that are not captured in primary care records. This process required considerable clinician time to manually review outpatient letters. Our aim was to develop and evaluate an automated shielding algorithm by text-mining outpatient letter diagnoses and medications, reducing the need for future manual review. METHODS: Rheumatology outpatient letters from a large UK foundation trust were retrieved. Free-text diagnoses were processed using Intelligent Medical Objects software (Concept Tagger), which used interface terminology for each condition mapped to Systematized Nomenclature of Medicine Clinical Terms (SNOMED-CT) codes. We developed the Medication Concept Recognition tool (Named Entity Recognition) to retrieve medications' type, dose, duration and status (active/past) at the time of the letter. Age, diagnosis and medication variables were then combined to calculate a shielding score based on the most recent letter. The algorithm's performance was evaluated using clinical review as the gold standard. The time taken to deploy the developed algorithm on a larger patient subset was measured. RESULTS: In total, 5942 free-text diagnoses were extracted and mapped to SNOMED-CT, with 13 665 free-text medications (n=803 patients). The automated algorithm demonstrated a sensitivity of 80% (95% CI: 75%, 85%) and specificity of 92% (95% CI: 90%, 94%). Positive likelihood ratio was 10 (95% CI: 8, 14), negative likelihood ratio was 0.21 (95% CI: 0.16, 0.28) and F1 score was 0.81. Evaluation of mismatches revealed that the algorithm performed correctly against the gold standard in most cases.
The developed algorithm was then deployed on records from an additional 15 865 patients, which took 18 hours for data extraction and 1 hour to deploy. DISCUSSION: An automated algorithm for risk stratification has several advantages including reducing clinician time for manual review to allow more time for direct care, improving efficiency and increasing transparency in individual patient communication. It has the potential to be adapted for future public health initiatives that require prompt automated review of hospital outpatient letters.
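As a rough check on how the likelihood ratios reported above relate to sensitivity and specificity, the sketch below applies the standard definitions LR+ = sensitivity / (1 - specificity) and LR- = (1 - sensitivity) / specificity. The function name is hypothetical, and the inputs are the rounded percentages from the abstract rather than the study's underlying counts, so the output can differ slightly from the published values.

```python
from typing import Tuple


def likelihood_ratios(sensitivity: float, specificity: float) -> Tuple[float, float]:
    """Return (LR+, LR-) for sensitivity and specificity given as proportions."""
    if not (0.0 <= sensitivity <= 1.0 and 0.0 < specificity < 1.0):
        raise ValueError("sensitivity must be in [0, 1] and specificity in (0, 1)")
    lr_positive = sensitivity / (1.0 - specificity)
    lr_negative = (1.0 - sensitivity) / specificity
    return lr_positive, lr_negative


# Rounded values taken from the abstract (80% sensitivity, 92% specificity).
lr_pos, lr_neg = likelihood_ratios(0.80, 0.92)
print(f"LR+ = {lr_pos:.1f}, LR- = {lr_neg:.2f}")
```

With these rounded inputs LR+ comes out at about 10 and LR- at about 0.22, close to the reported 10 and 0.21; the small gap is what one would expect from rounding the inputs.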
Subject(s)
Algorithms , COVID-19 , Data Mining , Humans , COVID-19/prevention & control , United Kingdom , Data Mining/methods , SARS-CoV-2 , Rheumatic Diseases/drug therapy , Middle Aged , Male , Rheumatology/methods , Female , Aged , Risk Assessment/methods , Pandemics , Adult
ABSTRACT
BACKGROUND: The challenging nature of studies with incarcerated populations and other offender groups can impede the conduct of research, particularly that involving complex study designs such as randomised controlled trials and clinical interventions. Providing an overview of study designs employed in this area can offer insights into this issue and how research quality may impact on health and justice outcomes. METHODS: We used a rule-based approach to extract study designs from a sample of 34,481 PubMed abstracts related to epidemiological criminology published between 1963 and 2023. The results were compared against an accepted hierarchy of scientific evidence. RESULTS: We evaluated our method in a random sample of 100 PubMed abstracts. An F1-Score of 92.2% was returned. Of 34,481 study abstracts, almost 40.0% (13,671) had an extracted study design. The most common study design was observational (37.3%; 5101) while experimental research in the form of trials (randomised, non-randomised) was present in 16.9% (2319). Mapped against the current hierarchy of scientific evidence, 13.7% (1874) of extracted study designs could not be categorised. Among the remaining studies, most were observational (17.2%; 2343) followed by systematic reviews (10.5%; 1432) with randomised controlled trials accounting for 8.7% (1196) of studies and meta-analyses for 1.4% (190) of studies. CONCLUSIONS: It is possible to extract epidemiological study designs from a large-scale PubMed sample computationally. However, the number of trials, systematic reviews, and meta-analyses is relatively small - just 1 in 5 articles. Despite an increase over time in the total number of articles, study design details in the abstracts were missing. Epidemiological criminology still lacks the experimental evidence needed to address the health needs of prisoners and offenders, a marginalised and isolated population.
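To make the idea of rule-based study design extraction concrete, here is a minimal sketch using regular expressions. The pattern set, labels and function name are hypothetical and chosen for illustration only; the abstract does not describe the study's actual rules, which are likely more extensive.

```python
import re
from typing import List

# Hypothetical patterns for illustration; not the rules used in the study.
DESIGN_PATTERNS = {
    "randomised controlled trial": re.compile(
        r"\brandomi[sz]ed\s+(?:controlled\s+)?trial", re.IGNORECASE
    ),
    "systematic review": re.compile(r"\bsystematic\s+review", re.IGNORECASE),
    "meta-analysis": re.compile(r"\bmeta-?\s?analys[ie]s", re.IGNORECASE),
    "cohort study": re.compile(r"\bcohort\s+stud(?:y|ies)", re.IGNORECASE),
    "cross-sectional study": re.compile(r"\bcross-?\s?sectional", re.IGNORECASE),
    "case-control study": re.compile(r"\bcase-?\s?control", re.IGNORECASE),
}


def extract_designs(abstract: str) -> List[str]:
    """Return the labels of every design whose pattern occurs in the text."""
    return [
        label for label, pattern in DESIGN_PATTERNS.items() if pattern.search(abstract)
    ]


example = "We conducted a randomized controlled trial and a meta-analysis of cohort studies."
print(extract_designs(example))
```

An abstract that matches no pattern returns an empty list, which corresponds to the "no extracted study design" case counted in the results.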
Subject(s)
Criminals , Prisoners , Humans , Data Mining , Research Design
ABSTRACT
PURPOSE: Routinely collected prescription data provide drug exposure information for pharmacoepidemiology, informing start/stop dates and dosage. Prescribing information includes structured data and unstructured free-text instructions, which can include inherent variability, such as "one to two tablets up to four times a day". The preparation of drug exposure data from raw prescriptions into a research-ready dataset is rarely fully reported, yet the assumptions made have considerable implications for pharmacoepidemiology. This may have bigger consequences for "pro re nata" (PRN) drugs. Our aim was, using a worked example of opioids and fracture risk, to examine the impact of incorporating narrative prescribing instructions and subsequent drug preparation assumptions on adverse event rates. METHODS: R packages for extracting free-text medication prescription instructions in a structured form (doseminer) and an algorithm for transparently processing drug exposure information (drugprepr) were developed. Clinical Practice Research Datalink GOLD was used to define a cohort of adult new opioid users without prior cancer. A retrospective cohort study was performed using data between January 1, 2017 and July 31, 2018. We tested the impact of varying drug preparation assumptions by estimating the effect of opioids on fracture risk using Cox proportional hazards models. RESULTS: During the study window, 60 394 patients were identified with 190 754 opioid prescriptions. Free-text prescribing instruction variability, where there was flexibility in the number of tablets to be administered, was present in 42% of prescriptions. Variations in the decisions made when preparing raw data for analysis led to marked differences in the event number (n = 303-415) and person-years of drug exposure (5619-9832). The distribution of hazard ratios as a function of the decisions ranged from 2.71 (95% CI: 2.31, 3.18) to 3.24 (2.76, 3.82).
CONCLUSIONS: Assumptions made during the drug preparation process, especially for those with variability in prescription instructions, can impact results of subsequent risk estimates. The developed R packages can improve transparency related to drug preparation assumptions, in line with best practice advocated by international pharmacoepidemiology guidelines.
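The sketch below illustrates why variable instructions such as "one to two tablets up to four times a day" force an analyst to choose an assumption. It is written in Python for illustration and is not the doseminer or drugprepr implementation: the regular expression, the function names, and the choice to treat "up to N times" as a frequency range of 1 to N are all assumptions made here.

```python
import re
from typing import Dict, Optional

WORD_TO_NUM = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6}
NUM = r"(\d+|" + "|".join(WORD_TO_NUM) + r")"

# Hypothetical pattern covering only instructions shaped like the example above.
PATTERN = re.compile(
    rf"\b{NUM}(?:\s+to\s+{NUM})?\s+tablets?\s+(up\s+to\s+)?{NUM}\s+times\s+a\s+day",
    re.IGNORECASE,
)


def to_number(token: str) -> float:
    token = token.lower()
    return float(WORD_TO_NUM[token]) if token in WORD_TO_NUM else float(token)


def parse_instruction(text: str) -> Optional[Dict[str, float]]:
    """Extract dose and frequency ranges, or None if the pattern does not match."""
    match = PATTERN.search(text)
    if match is None:
        return None
    dose_min = to_number(match.group(1))
    dose_max = to_number(match.group(2)) if match.group(2) else dose_min
    freq_max = to_number(match.group(4))
    # Assumption: "up to N times" means between 1 and N times a day.
    freq_min = 1.0 if match.group(3) else freq_max
    return {"dose_min": dose_min, "dose_max": dose_max,
            "freq_min": freq_min, "freq_max": freq_max}


def daily_dose(parsed: Dict[str, float], rule: str) -> float:
    """Tablets per day under a 'min', 'max' or 'mean' preparation assumption."""
    if rule == "min":
        return parsed["dose_min"] * parsed["freq_min"]
    if rule == "max":
        return parsed["dose_max"] * parsed["freq_max"]
    if rule == "mean":
        mean_dose = (parsed["dose_min"] + parsed["dose_max"]) / 2
        mean_freq = (parsed["freq_min"] + parsed["freq_max"]) / 2
        return mean_dose * mean_freq
    raise ValueError(f"unknown rule: {rule}")


parsed = parse_instruction("one to two tablets up to four times a day")
for rule in ("min", "max", "mean"):
    print(rule, daily_dose(parsed, rule))
```

Under these assumptions the same instruction yields anywhere from 1 to 8 tablets a day, which changes how long a prescription is assumed to last and therefore the person-time counted as exposed.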
Subject(s)
Analgesics, Opioid , Pharmacoepidemiology , Adult , Humans , Analgesics, Opioid/therapeutic use , Retrospective Studies , Drug Prescriptions , Algorithms
ABSTRACT
BACKGROUND: Stroke has an acute onset and a high mortality rate, making it one of the most fatal diseases worldwide. Its underlying biology and treatments have been widely studied in both "Western" biomedicine and Traditional Chinese Medicine (TCM). However, these two approaches are often studied and reported in isolation, both in the literature and in associated databases. RESULTS: To aid research in finding effective prevention methods and treatments, we integrated knowledge from the literature and a number of databases (e.g. CID, TCMID, ETCM). We employed a suite of biomedical text mining (i.e. named-entity recognition) approaches to identify mentions of genes, diseases, drugs, chemicals, symptoms, Chinese herbs and patent medicines, etc. in a large set of stroke papers from both biomedical and TCM domains. Then, using a combination of a rule-based approach with a pre-trained BioBERT model, we extracted and classified links and relationships among stroke-related entities as expressed in the literature. We constructed StrokeKG, a knowledge graph that includes almost 46 k nodes of nine types and 157 k links of 30 types, connecting diseases, genes, symptoms, drugs, pathways, herbs, chemicals, ingredients and patent medicines. CONCLUSIONS: StrokeKG can provide practical and reliable stroke-related knowledge to support stroke research, such as exploring new research directions and ideas for drug repurposing and discovery. We make StrokeKG freely available at http://114.115.208.144:7474/browser/ (Please click "Connect" directly) and the source structured data for stroke at https://github.com/yangxi1016/Stroke.
Subject(s)
Drugs, Chinese Herbal , Stroke , Data Mining , Drugs, Chinese Herbal/therapeutic use , Humans , Medicine, Chinese Traditional , Pattern Recognition, Automated , Publications , Stroke/drug therapy , Stroke/genetics
ABSTRACT
BACKGROUND: The efficacy of acetylcholinesterase inhibitors and memantine in the symptomatic treatment of Alzheimer's disease is well-established. Randomised trials have shown them to be associated with a reduction in the rate of cognitive decline. AIMS: To investigate the real-world effectiveness of acetylcholinesterase inhibitors and memantine for dementia-causing diseases in the largest UK observational secondary care service data-set to date. METHOD: We extracted mentions of relevant medications and cognitive testing (Mini-Mental State Examination (MMSE) and Montreal Cognitive Assessment (MoCA) scores) from de-identified patient records from two National Health Service (NHS) trusts. The 10-year changes in cognitive performance were modelled using a combination of generalised additive and linear mixed-effects modelling. RESULTS: The initial decline in MMSE and MoCA scores occurs approximately 2 years before medication is initiated. Medication prescription stabilises cognitive performance for the ensuing 2-5 months. The effect is boosted in more cognitively impaired cases at the point of medication prescription and attenuated in those taking antipsychotics. Importantly, patients who are switched between agents at least once do not experience any beneficial cognitive effect from pharmacological treatment. CONCLUSIONS: This study presents one of the largest real-world examinations of the efficacy of acetylcholinesterase inhibitors and memantine for symptomatic treatment of dementia. We found evidence that 68% of individuals respond to treatment with a period of cognitive stabilisation before continuing their decline at the pre-treatment rate.
Subject(s)
Alzheimer Disease , Cholinesterase Inhibitors , Acetylcholinesterase/therapeutic use , Alzheimer Disease/drug therapy , Alzheimer Disease/psychology , Cholinesterase Inhibitors/pharmacology , Cholinesterase Inhibitors/therapeutic use , Humans , Memantine/therapeutic use , Retrospective Studies , State Medicine
ABSTRACT
Temporal relation extraction between health-related events is a widely studied task in clinical Natural Language Processing (NLP). The current state-of-the-art methods mostly rely on engineered features (i.e., rule-based modelling) and sequence modelling, which often encodes a source sentence into a single fixed-length context. An obvious disadvantage of this fixed-length context design is its incapability to model longer sentences, as important temporal information in the clinical text may appear at different positions. To address this issue, we propose an Attention-based Bidirectional Long Short-Term Memory (Att-BiLSTM) model to enable learning the important semantic information in long source text segments and to better determine which parts of the text are most important. We experimented with two embeddings and compared the performances to traditional state-of-the-art methods that require elaborate linguistic pre-processing and hand-engineered features. The experimental results on the i2b2 2012 temporal relation test corpus show that the proposed method achieves a significant improvement with an F-score of 0.811, which is at least 10% better than state-of-the-art in the field. We show that the model can be remarkably effective at classifying temporal relations when provided with word embeddings trained on corpora in a general domain. Finally, we perform an error analysis to gain insight into the common errors made by the model.
Subject(s)
Memory, Short-Term , Patient Discharge , Humans , Language , Natural Language Processing , Semantics
ABSTRACT
BACKGROUND: Social media provides the potential to engage a wide audience about scientific research, including the public. However, little empirical research exists to guide health scientists regarding what works and how to optimize impact. We examined the social media campaign #datasaveslives established in 2014 to highlight positive examples of the use and reuse of health data in research. OBJECTIVE: This study aims to examine how the #datasaveslives hashtag was used on social media, how often, and by whom; thus, we aim to provide insights into the impact of a major social media campaign in the UK health informatics research community and further afield. METHODS: We analyzed all publicly available posts (tweets) that included the hashtag #datasaveslives (N=13,895) on the microblogging platform Twitter between September 1, 2016, and August 31, 2017. Using a combination of qualitative and quantitative analyses, we determined the frequency and purpose of tweets. Social network analysis was used to analyze and visualize tweet sharing (retweet) networks among hashtag users. RESULTS: Overall, we found 4175 original posts and 9720 retweets featuring #datasaveslives by 3649 unique Twitter users. In total, 66.01% (2756/4175) of the original posts were retweeted at least once. Higher frequencies of tweets were observed during the weeks of prominent policy publications, popular conferences, and public engagement events. Cluster analysis based on retweet relationships revealed an interconnected series of groups of #datasaveslives users in academia, health services and policy, and charities and patient networks. Thematic analysis of tweets showed that #datasaveslives was used for a broader range of purposes than indexing information, including event reporting, encouraging participation and action, and showing personal support for data sharing. 
CONCLUSIONS: This study shows that a hashtag-based social media campaign was effective in encouraging a wide audience of stakeholders to disseminate positive examples of health research. Furthermore, the findings suggest that the campaign supported community building and bridging practices within and between the interdisciplinary sectors related to the field of health data science and encouraged individuals to demonstrate personal support for sharing health data.
Subject(s)
Biomedical Research/methods , Information Dissemination/methods , Social Media/standards , Social Network Analysis , Humans
ABSTRACT
Chronic pain is highly prevalent and poorly controlled, and its underlying mechanisms need to be further elucidated. Herbal drugs have been widely used to control various pain disorders. Systematic integration of herbal data resources on pain may help in investigating the molecular mechanisms of pain phenotypes. Here, we integrated large-scale bibliographic literature and well-established data sources to obtain high-quality pain-relevant herbal data (i.e. 426 pain-related herbs with their targets). We used a machine learning method to identify three distinct herb categories with their specific symptom indications, targets and enriched pathways, which were characterized by efficacy in treating chronic cough-related neuropathic pain, reproduction- and autoimmune-related pain, and cancer pain, respectively. We further detected novel pathophysiological mechanisms of the pain subtypes using a network medicine approach to evaluate the interactions between herb targets and the pain disease modules. This work increases the understanding of the underlying molecular mechanisms of the pain subtypes in which herbal drugs participate, with the ultimate aim of developing novel personalized drugs for pain disorders.
Subject(s)
Analgesics/therapeutic use , Chronic Pain/drug therapy , Machine Learning , Pain Threshold/drug effects , Plant Preparations/therapeutic use , Systems Biology , Systems Integration , Analgesics/chemistry , Analgesics/classification , Animals , Chronic Pain/metabolism , Chronic Pain/physiopathology , Databases, Factual , Humans , Molecular Structure , Molecular Targeted Therapy , Pharmacopoeias as Topic , Plant Preparations/chemistry , Plant Preparations/classification , Protein Interaction Maps , Signal Transduction , Structure-Activity Relationship
ABSTRACT
BACKGROUND: Temporal relations between clinical events play an important role in clinical assessment and decision making. Extracting such relations from free-text data is a challenging task because it lies at the intersection of medical natural language processing, temporal representation and temporal reasoning. OBJECTIVES: To survey existing methods for extracting temporal relations (TLINKs) between events from clinical free text in English; to establish the state-of-the-art in this field; and to identify outstanding methodological challenges. METHODS: A systematic search in PubMed and the DBLP computer science bibliography was conducted for studies published between January 2006 and December 2018. The relevant studies were identified by examining the titles and abstracts. Then, the full text of selected studies was analyzed in depth and information was collected on TLINK tasks, TLINK types, data sources, feature selection, methods used, and reported performance. RESULTS: A total of 2834 publications were identified for title and abstract screening. Of these publications, 51 studies were selected. Thirty-two studies used machine learning approaches, 15 studies used a hybrid approach, and only four studies used a rule-based approach. The majority of studies use publicly available corpora: THYME (28 studies) and the i2b2 corpus (17 studies). CONCLUSION: The performance of TLINK extraction methods ranges widely depending on relation types and events (e.g. from 32% to 87% F-score for identifying relations between clinical events and document creation time). A small set of TLINKs (before, after, overlap and contains) has been widely studied with relatively good performance, whereas other types of TLINK (e.g., started by, finished by, precedes) are rarely studied and remain challenging.
Machine learning classifiers (such as Support Vector Machine and Conditional Random Fields) and Deep Neural Networks were among the best performing methods for extracting TLINKs, but nearly all the work has been carried out and tested on two publicly available corpora only. The field would benefit from the availability of more publicly available, high-quality, annotated clinical text corpora.
Subject(s)
Electronic Health Records , Natural Language Processing , Data Mining , Information Storage and Retrieval , Machine Learning , Time
ABSTRACT
BACKGROUND: Use of routinely collected patient data for research and service planning is an explicit policy of the UK National Health Service and UK government. Much clinical information is recorded in free-text letters, reports and notes. These text data are generally lost to research, due to the increased privacy risk compared with structured data. We conducted a citizens' jury which asked members of the public whether their medical free-text data should be shared for research for public benefit, to inform an ethical policy. METHODS: Eighteen citizens took part over 3 days. Jurors heard a range of expert presentations as well as arguments for and against sharing free text, and then questioned presenters and deliberated together. They answered a questionnaire on whether and how free text should be shared for research, gave reasons for and against sharing and suggestions for alleviating their concerns. RESULTS: Jurors were in favour of sharing medical data and agreed this would benefit health research, but were more cautious about sharing free-text than structured data. They preferred processing of free text where a computer extracted information at scale. Their concerns were lack of transparency in uses of data, and privacy risks. They suggested keeping patients informed about uses of their data, and giving clear pathways to opt out of data sharing. CONCLUSIONS: Informed citizens suggested a transparent culture of research for the public benefit, and continuous improvement of technology to protect patient privacy, to mitigate their concerns regarding privacy risks of using patient text data.
Subject(s)
Electronic Health Records , State Medicine , Humans , Information Dissemination , Privacy , United Kingdom
ABSTRACT
BACKGROUND: The New South Wales Police Force (NSWPF) records details of significant numbers of domestic violence (DV) events they attend each year as both structured quantitative data and unstructured free text. Accessing information contained in the free text, such as the mental health status of victims and persons of interest (POIs), could be useful in the better management of DV events attended by the police and thus improve health, justice, and social outcomes. OBJECTIVE: The aim of this study is to present the prevalence of extracted mental illness mentions for POIs and victims in police-recorded DV events. METHODS: We applied a knowledge-driven text mining method to recognize mental illness mentions for victims and POIs from police-recorded DV events. RESULTS: In 416,441 police-recorded DV events with single POIs and single victims, we identified 64,587 events (15.51%) with at least one mental illness mention versus 4295 (1.03%) recorded in the structured fixed fields. More than three-quarters (67,582/85,880, 78.69%) of mental illnesses were associated with POIs versus 21.30% (18,298/85,880) with victims; depression was the most common condition in both victims (2822/12,589, 22.42%) and POIs (7496/39,269, 19.01%). Mental illnesses were most common among POIs aged 0-14 years (623/1612, 38.65%) and in victims aged over 65 years (1227/22,873, 5.36%). CONCLUSIONS: A wealth of mental illness information exists within police-recorded DV events that can be extracted using text mining. The results showed mood-related illnesses were the most common in both victims and POIs. Further investigation is required to determine the reliability of the mental illness mentions against sources of diagnostic information.
Subject(s)
Data Mining/methods , Domestic Violence/psychology , Mental Disorders/epidemiology , Police/ethics , Adolescent , Adult , Female , Humans , Male , Prevalence , Reproducibility of Results , Young Adult
ABSTRACT
BACKGROUND: Clinical free-text data (eg, outpatient letters or nursing notes) represent a vast, untapped source of rich information that, if more accessible for research, would clarify and supplement information coded in structured data fields. Data usually need to be deidentified or anonymized before they can be reused for research, but there is a lack of established guidelines to govern effective deidentification and use of free-text information and avoid damaging data utility as a by-product. OBJECTIVE: This study aimed to develop recommendations for the creation of data governance standards to integrate with existing frameworks for personal data use, to enable free-text data to be used safely for research for patient and public benefit. METHODS: We outlined data protection legislation and regulations relating to the United Kingdom for context and conducted a rapid literature review and UK-based case studies to explore data governance models used in working with free-text data. We also engaged with stakeholders, including text-mining researchers and the general public, to explore perceived barriers and solutions in working with clinical free-text. RESULTS: We proposed a set of recommendations, including the need for authoritative guidance on data governance for the reuse of free-text data, to ensure public transparency in data flows and uses, to treat deidentified free-text data as potentially identifiable with use limited to accredited data safe havens, and to commit to a culture of continuous improvement to understand the relationships between the efficacy of deidentification and reidentification risks, so this can be communicated to all stakeholders. CONCLUSIONS: By drawing together the findings of a combination of activities, we present a position paper to contribute to the development of data governance standards for the reuse of clinical free-text data for secondary purposes. 
While working in accordance with existing data governance frameworks, there is a need for further work to take forward the recommendations we have proposed, with commitment and investment, to assure and expand the safe reuse of clinical free-text data for public benefit.
Subject(s)
Data Analysis , Humans , Reference Standards , Text Messaging
ABSTRACT
BACKGROUND: The police attend numerous domestic violence events each year, recording details of these events as both structured (coded) data and unstructured free-text narratives. Abuse types (including physical, psychological, emotional, and financial) conducted by persons of interest (POIs) along with any injuries sustained by victims are typically recorded in long descriptive narratives. OBJECTIVE: We aimed to determine if an automated text mining method could identify abuse types and any injuries sustained by domestic violence victims in narratives contained in a large police dataset from the New South Wales Police Force. METHODS: We used a training set of 200 recorded domestic violence events to design a knowledge-driven approach based on syntactical patterns in the text and then applied this approach to a large set of police reports. RESULTS: Testing our approach on an evaluation set of 100 domestic violence events provided precision values of 90.2% and 85.0% for abuse type and victim injuries, respectively. In a set of 492,393 domestic violence reports, we found 71.32% (351,178) of events with mentions of the abuse type(s) and more than one-third (177,117 events; 35.97%) contained victim injuries. "Emotional/verbal abuse" (33.46%; 117,488) was the most common abuse type, followed by "punching" (86,322 events; 24.58%) and "property damage" (22.27%; 78,203 events). "Bruising" was the most common form of injury sustained (51,455 events; 29.03%), with "cut/abrasion" (28.93%; 51,284 events) and "red marks/signs" (23.71%; 42,038 events) ranking second and third, respectively. CONCLUSIONS: The results suggest that text mining can automatically extract information from police-recorded domestic violence events that can support further public health research into domestic violence, such as examining the relationship of abuse types with victim injuries and of gender and abuse types with risk escalation for victims of domestic violence. 
Potential also exists for this extracted information to be linked to information on the mental health status.
Subject(s)
Data Mining/methods , Domestic Violence/statistics & numerical data , Police/statistics & numerical data , Adult , Female , Humans , Male
ABSTRACT
[This corrects the article DOI: 10.2196/11548.].
ABSTRACT
BACKGROUND: The consolidation of pathway databases, such as KEGG, Reactome and ConsensusPathDB, has generated widespread biological interest; however, the issue of pathway redundancy impedes the use of these consolidated datasets. Attempts to reduce this redundancy have focused on visualizing pathway overlap or merging pathways, but the resulting pathways may be of heterogeneous sizes and cover multiple biological functions. Efforts have also been made to deal with redundancy in pathway data by consolidating enriched pathways into a number of clusters or concepts. We present an alternative approach, which generates pathway subsets capable of covering all of the genes present within either pathway databases or enrichment results, generating substantial reductions in redundancy. RESULTS: We propose a method that uses set cover to reduce pathway redundancy, without merging pathways. The proposed approach considers three objectives: removal of pathway redundancy, controlling pathway size and coverage of the gene set. By applying set cover to the ConsensusPathDB dataset we were able to produce a reduced set of pathways, representing 100% of the genes in the original data set with 74% less redundancy, or 95% of the genes with 88% less redundancy. We also developed an algorithm to simplify enrichment data and applied it to a set of enriched osteoarthritis pathways, revealing that within the top ten pathways, five were redundant subsets of more enriched pathways. Applying set cover to the enrichment results removed these redundant pathways, allowing more informative pathways to take their place. CONCLUSION: Our method provides an alternative approach for handling pathway redundancy, while ensuring that the pathways are of homogeneous size and gene coverage is maximised. Pathways are not altered from their original form, allowing biological knowledge regarding the data set to be directly applicable.
We demonstrate the ability of the algorithms to prioritise redundancy reduction, pathway size control or gene set coverage. The application of set cover to pathway enrichment results produces an optimised summary of the pathways that best represent the differentially regulated gene set.
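The set cover idea can be sketched with the classic greedy approximation: repeatedly pick the pathway that covers the most still-uncovered genes until every gene is covered. This is a minimal illustration under stated assumptions, not the authors' algorithm, which also controls pathway size and can trade coverage against redundancy; the pathway names and gene identifiers are made up.

```python
from typing import Dict, List, Set


def greedy_set_cover(pathways: Dict[str, Set[str]]) -> List[str]:
    """Select pathways greedily so that together they cover every gene."""
    uncovered: Set[str] = set().union(*pathways.values())
    chosen: List[str] = []
    while uncovered:
        # Iterate names in sorted order so ties are broken deterministically.
        best = max(sorted(pathways), key=lambda name: len(pathways[name] & uncovered))
        chosen.append(best)
        uncovered -= pathways[best]
    return chosen


# Hypothetical toy data: pathway B is a redundant subset of pathway A.
toy_pathways = {
    "A": {"g1", "g2", "g3"},
    "B": {"g2", "g3"},
    "C": {"g3", "g4"},
    "D": {"g4"},
}
selected = greedy_set_cover(toy_pathways)
print(selected)
```

In the toy data the selection keeps A and C, drops the redundant subset B, and still covers all four genes without altering any pathway's membership.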
Subject(s)
Algorithms , Signal Transduction/genetics , Databases, Genetic , Gene Expression Profiling , Humans
ABSTRACT
BACKGROUND: Vast numbers of domestic violence (DV) incidents are attended by the New South Wales Police Force each year in New South Wales and recorded as both structured quantitative data and unstructured free text in the WebCOPS (Web-based interface for the Computerised Operational Policing System) database regarding the details of the incident, the victim, and person of interest (POI). Although the structured data are used for reporting purposes, the free text remains untapped for DV reporting and surveillance purposes. OBJECTIVE: In this paper, we explore whether text mining can automatically identify mental health disorders from this unstructured text. METHODS: We used a training set of 200 DV recorded events to design a knowledge-driven approach based on lexical patterns in text suggesting mental health disorders for POIs and victims. RESULTS: The precision returned from an evaluation set of 100 DV events was 97.5% and 87.1% for mental health disorders related to POIs and victims, respectively. After applying our approach to a large-scale corpus of almost a half million DV events, we identified 77,995 events (15.83%) that mentioned mental health disorders, with 76.96% (60,032/77,995) of those linked to POIs versus 16.47% (12,852/77,995) for the victims and 6.55% (5111/77,995) for both. Depression was the most common mental health disorder mentioned in both victims (22.25%, 3269) and POIs (18.70%, 8944), followed by alcohol abuse for POIs (12.19%, 5829) and various anxiety disorders (eg, panic disorder, generalized anxiety disorder) for victims (11.66%, 1714). CONCLUSIONS: The results suggest that text mining can automatically extract targeted information from police-recorded DV events to support further public health research into the nexus between mental health disorders and DV.
Subject(s)
Data Mining/methods , Domestic Violence/psychology , Mental Health/standards , Adult , Female , Humans , Narration , Police
ABSTRACT
De-identification of clinical narratives is one of the main obstacles to making healthcare free text available for research. In this paper we describe our experience in expanding and tailoring two existing tools as part of the 2016 CEGS N-GRID Shared Tasks Track 1, which evaluated de-identification methods on a set of psychiatric evaluation notes for up to 25 different types of Protected Health Information (PHI). The methods we used rely on machine learning on either a large or small feature space, with additional strategies, including two-pass tagging and multi-class models, which both proved to be beneficial. The results show that the integration of the proposed methods can identify Health Insurance Portability and Accountability Act (HIPAA)-defined PHI with overall F1-scores of ~90% and above. Yet, some classes (Profession, Organization) again proved to be challenging given the variability of expressions used to reference given information.
Subject(s)
Algorithms , Confidentiality , Mental Disorders/psychology , Health Insurance Portability and Accountability Act , Humans , Machine Learning , United States
ABSTRACT
BACKGROUND: Use of the social media website Twitter is highly prevalent and has led to a plethora of Web-based social and health-related data available for use by researchers. As such, researchers are increasingly using data from social media to retrieve and analyze mental health-related content. However, there is limited evidence regarding why people use this emerging platform to discuss mental health problems in the first place. OBJECTIVES: The aim of this study was to explore the reasons why individuals discuss mental health on the social media website Twitter. The study was the first of its kind to implement a study-specific hashtag for research; therefore, we also examined how feasible it was to circulate and analyze a study-specific hashtag for mental health research. METHODS: Text mining methods using the Twitter Streaming Application Programming Interface (API) and Twitter Search API were used to collect and organize tweets from the hashtag #WhyWeTweetMH, circulated between September 2015 and November 2015. Tweets were analyzed thematically to understand the key reasons for discussing mental health using the Twitter platform. RESULTS: Four overarching themes were derived from the 132 tweets collected: (1) sense of community; (2) raising awareness and combatting stigma; (3) safe space for expression; and (4) coping and empowerment. In addition, 11 associated subthemes were also identified. CONCLUSIONS: The themes derived from the content of the tweets highlight the perceived therapeutic benefits of Twitter through the provision of support and information and the potential for self-management strategies. The ability to use Twitter to combat stigma and raise awareness of mental health problems indicates the societal benefits that can be facilitated via the platform. 
The number of tweets and themes identified demonstrates the feasibility of implementing study-specific hashtags to explore research questions in the field of mental health and can be used as a basis for other health-related research.