ABSTRACT
BACKGROUND: The challenging nature of studies with incarcerated populations and other offender groups can impede the conduct of research, particularly that involving complex study designs such as randomised controlled trials and clinical interventions. Providing an overview of study designs employed in this area can offer insights into this issue and how research quality may impact on health and justice outcomes. METHODS: We used a rule-based approach to extract study designs from a sample of 34,481 PubMed abstracts related to epidemiological criminology published between 1963 and 2023. The results were compared against an accepted hierarchy of scientific evidence. RESULTS: We evaluated our method in a random sample of 100 PubMed abstracts. An F1-Score of 92.2% was returned. Of 34,481 study abstracts, almost 40.0% (13,671) had an extracted study design. The most common study design was observational (37.3%; 5101) while experimental research in the form of trials (randomised, non-randomised) was present in 16.9% (2319). Mapped against the current hierarchy of scientific evidence, 13.7% (1874) of extracted study designs could not be categorised. Among the remaining studies, most were observational (17.2%; 2343) followed by systematic reviews (10.5%; 1432) with randomised controlled trials accounting for 8.7% (1196) of studies and meta-analyses for 1.4% (190) of studies. CONCLUSIONS: It is possible to extract epidemiological study designs from a large-scale PubMed sample computationally. However, the number of trials, systematic reviews, and meta-analyses is relatively small: just 1 in 5 articles. Despite an increase over time in the total number of articles, study design details in the abstracts were missing. Epidemiological criminology still lacks the experimental evidence needed to address the health needs of the marginalized and isolated population that is prisoners and offenders.
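The rule-based extraction and its F1 evaluation described in this abstract can be illustrated with a minimal sketch. The patterns in `DESIGN_PATTERNS` and the function names are hypothetical and chosen for illustration only; the abstract does not disclose the actual rules, so this shows the general technique rather than the authors' implementation.

```python
import re

# Hypothetical patterns for illustration; the study's real rules are not given.
DESIGN_PATTERNS = {
    "randomised controlled trial": re.compile(
        r"\brandomi[sz]ed\s+(?:controlled\s+|clinical\s+)?trial\b", re.I),
    "systematic review": re.compile(r"\bsystematic\s+review\b", re.I),
    "meta-analysis": re.compile(r"\bmeta-?analys[ie]s\b", re.I),
    "cohort study": re.compile(r"\bcohort\s+stud(?:y|ies)\b", re.I),
    "cross-sectional study": re.compile(r"\bcross-sectional\b", re.I),
}


def extract_designs(abstract):
    """Return the set of study design labels whose pattern occurs in the text."""
    return {name for name, pattern in DESIGN_PATTERNS.items()
            if pattern.search(abstract)}


def f1_score(predicted, gold):
    """Micro-averaged F1 over documents; both arguments are lists of label sets."""
    tp = sum(len(p & g) for p, g in zip(predicted, gold))
    fp = sum(len(p - g) for p, g in zip(predicted, gold))
    fn = sum(len(g - p) for p, g in zip(predicted, gold))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

In use, `extract_designs` would be run over each abstract and its output compared with manually assigned labels via `f1_score`, mirroring the evaluation on 100 abstracts reported above.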
Subject(s)
Criminals , Prisoners , Humans , Data Mining , Research Design
ABSTRACT
Despite the potential benefits of sequential designs, studies evaluating treatments or experimental manipulations in preclinical experimental biomedicine almost exclusively use classical block designs. Our aim with this article is to bring the existing methodology of group sequential designs to the attention of researchers in the preclinical field and to clearly illustrate its potential utility. Group sequential designs can offer higher efficiency than traditional methods and are increasingly used in clinical trials. Using simulation of data, we demonstrate that group sequential designs have the potential to improve the efficiency of experimental studies, even when sample sizes are very small, as is currently prevalent in preclinical experimental biomedicine. When simulating data with a large effect size of d = 1 and a sample size of n = 18 per group, sequential frequentist analysis consumes in the long run only around 80% of the planned number of experimental units. In larger trials (n = 36 per group), additional stopping rules for futility lead to the saving of resources of up to 30% compared to block designs. We argue that these savings should be invested to increase sample sizes and hence power, since the currently underpowered experiments in preclinical biomedicine are a major threat to the value and predictiveness in this research domain.
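The efficiency argument can be explored with a small simulation. The sketch below is not the article's procedure: it assumes three equally spaced looks, a two-sided z-test with known standard deviation of 1, and a Pocock-type boundary of 2.289 (the standard constant for three looks at an overall two-sided alpha of 0.05). All function and parameter names are hypothetical, and no futility rule is included.

```python
import math
import random


def simulate_group_sequential(effect_size=1.0, n_per_group=18, looks=3,
                              z_boundary=2.289, n_sims=2000, seed=1):
    """Return (mean fraction of planned units used, proportion rejecting H0)."""
    rng = random.Random(seed)
    step = n_per_group // looks      # units added to each group per look
    planned = 2 * n_per_group        # total planned experimental units
    used_total = 0
    rejections = 0
    for _ in range(n_sims):
        control, treated = [], []
        for _look in range(looks):
            control += [rng.gauss(0.0, 1.0) for _ in range(step)]
            treated += [rng.gauss(effect_size, 1.0) for _ in range(step)]
            n = len(control)
            diff = sum(treated) / n - sum(control) / n
            z = diff / math.sqrt(2.0 / n)   # known-variance z statistic
            if abs(z) > z_boundary:         # stop early for efficacy
                rejections += 1
                break
        used_total += 2 * len(control)      # units consumed in this trial
    return used_total / (n_sims * planned), rejections / n_sims
```

Calling `simulate_group_sequential()` gives the long-run share of planned units consumed under d = 1, and `simulate_group_sequential(effect_size=0.0)` gives an empirical check of the type I error rate. The exact figures depend on the assumptions above and need not match those reported in the abstract.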
Subject(s)
Biomedical Research , Research Design
ABSTRACT
BACKGROUND: The New South Wales Police Force (NSWPF) records details of significant numbers of domestic violence (DV) events they attend each year as both structured quantitative data and unstructured free text. Accessing information contained in the free text, such as the mental health status of victims and persons of interest (POIs), could be useful in the better management of DV events attended by the police and thus improve health, justice, and social outcomes. OBJECTIVE: The aim of this study is to present the prevalence of extracted mental illness mentions for POIs and victims in police-recorded DV events. METHODS: We applied a knowledge-driven text mining method to recognize mental illness mentions for victims and POIs from police-recorded DV events. RESULTS: In 416,441 police-recorded DV events with single POIs and single victims, we identified 64,587 events (15.51%) with at least one mental illness mention versus 4295 (1.03%) recorded in the structured fixed fields. More than three-quarters (67,582/85,880, 78.69%) of mental illnesses were associated with POIs versus 21.30% (18,298/85,880) with victims; depression was the most common condition in both victims (2822/12,589, 22.42%) and POIs (7496/39,269, 19.01%). Mental illnesses were most common among POIs aged 0-14 years (623/1612, 38.65%) and in victims aged over 65 years (1227/22,873, 5.36%). CONCLUSIONS: A wealth of mental illness information exists within police-recorded DV events that can be extracted using text mining. The results showed mood-related illnesses were the most common in both victims and POIs. Further investigation is required to determine the reliability of the mental illness mentions against sources of diagnostic information.
Subject(s)
Data Mining/methods , Domestic Violence/psychology , Mental Disorders/epidemiology , Police/ethics , Adolescent , Adult , Female , Humans , Male , Prevalence , Reproducibility of Results , Young Adult
ABSTRACT
BACKGROUND: The police attend numerous domestic violence events each year, recording details of these events as both structured (coded) data and unstructured free-text narratives. Abuse types (including physical, psychological, emotional, and financial) conducted by persons of interest (POIs) along with any injuries sustained by victims are typically recorded in long descriptive narratives. OBJECTIVE: We aimed to determine if an automated text mining method could identify abuse types and any injuries sustained by domestic violence victims in narratives contained in a large police dataset from the New South Wales Police Force. METHODS: We used a training set of 200 recorded domestic violence events to design a knowledge-driven approach based on syntactical patterns in the text and then applied this approach to a large set of police reports. RESULTS: Testing our approach on an evaluation set of 100 domestic violence events provided precision values of 90.2% and 85.0% for abuse type and victim injuries, respectively. In a set of 492,393 domestic violence reports, we found 71.32% (351,178) of events with mentions of the abuse type(s) and more than one-third (177,117 events; 35.97%) contained victim injuries. "Emotional/verbal abuse" (33.46%; 117,488) was the most common abuse type, followed by "punching" (86,322 events; 24.58%) and "property damage" (22.27%; 78,203 events). "Bruising" was the most common form of injury sustained (51,455 events; 29.03%), with "cut/abrasion" (28.93%; 51,284 events) and "red marks/signs" (23.71%; 42,038 events) ranking second and third, respectively. CONCLUSIONS: The results suggest that text mining can automatically extract information from police-recorded domestic violence events that can support further public health research into domestic violence, such as examining the relationship of abuse types with victim injuries and of gender and abuse types with risk escalation for victims of domestic violence. 
Potential also exists for this extracted information to be linked to information on the mental health status.
Subject(s)
Data Mining/methods , Domestic Violence/statistics & numerical data , Police/statistics & numerical data , Adult , Female , Humans , Male
ABSTRACT
[This corrects the article DOI: 10.2196/11548.].
ABSTRACT
BACKGROUND: Vast numbers of domestic violence (DV) incidents are attended by the New South Wales Police Force each year in New South Wales and recorded as both structured quantitative data and unstructured free text in the WebCOPS (Web-based interface for the Computerised Operational Policing System) database regarding the details of the incident, the victim, and person of interest (POI). Although the structured data are used for reporting purposes, the free text remains untapped for DV reporting and surveillance purposes. OBJECTIVE: In this paper, we explore whether text mining can automatically identify mental health disorders from this unstructured text. METHODS: We used a training set of 200 DV recorded events to design a knowledge-driven approach based on lexical patterns in text suggesting mental health disorders for POIs and victims. RESULTS: The precision returned from an evaluation set of 100 DV events was 97.5% and 87.1% for mental health disorders related to POIs and victims, respectively. After applying our approach to a large-scale corpus of almost a half million DV events, we identified 77,995 events (15.83%) that mentioned mental health disorders, with 76.96% (60,032/77,995) of those linked to POIs versus 16.47% (12,852/77,995) for the victims and 6.55% (5111/77,995) for both. Depression was the most common mental health disorder mentioned in both victims (22.25%, 3269) and POIs (18.70%, 8944), followed by alcohol abuse for POIs (12.19%, 5829) and various anxiety disorders (eg, panic disorder, generalized anxiety disorder) for victims (11.66%, 1714). CONCLUSIONS: The results suggest that text mining can automatically extract targeted information from police-recorded DV events to support further public health research into the nexus between mental health disorders and DV.
Subject(s)
Data Mining/methods , Domestic Violence/psychology , Mental Health/standards , Adult , Female , Humans , Narration , Police
ABSTRACT
De-identification of clinical narratives is one of the main obstacles to making healthcare free text available for research. In this paper we describe our experience in expanding and tailoring two existing tools as part of the 2016 CEGS N-GRID Shared Tasks Track 1, which evaluated de-identification methods on a set of psychiatric evaluation notes for up to 25 different types of Protected Health Information (PHI). The methods we used rely on machine learning on either a large or small feature space, with additional strategies, including two-pass tagging and multi-class models, which both proved to be beneficial. The results show that the integration of the proposed methods can identify Health Insurance Portability and Accountability Act (HIPAA)-defined PHI with overall F1-scores of ~90% and above. Yet, some classes (Profession, Organization) proved again to be challenging given the variability of expressions used to reference given information.
Subject(s)
Algorithms , Confidentiality , Mental Disorders/psychology , Health Insurance Portability and Accountability Act , Humans , Machine Learning , United States
ABSTRACT
INTRODUCTION: Most data extraction efforts in epidemiology are focused on obtaining targeted information from clinical trials. In contrast, limited research has been conducted on the identification of information from observational studies, a major source for human evidence in many fields, including environmental health. The recognition of key epidemiological information (e.g., exposures) through text mining techniques can assist in the automation of systematic reviews and other evidence summaries. METHOD: We designed and applied a knowledge-driven, rule-based approach to identify targeted information (study design, participant population, exposure, outcome, confounding factors, and the country where the study was conducted) from abstracts of epidemiological studies included in several systematic reviews of environmental health exposures. The rules were based on common syntactical patterns observed in text and are thus not specific to any systematic review. To validate the general applicability of our approach, we compared the data extracted using our approach versus hand curation for 35 epidemiological study abstracts manually selected for inclusion in two systematic reviews. RESULTS: The returned F-score, precision, and recall ranged from 70% to 98%, 81% to 100%, and 54% to 97%, respectively. The highest precision was observed for exposure, outcome and population (100%) while recall was best for exposure and study design with 97% and 89%, respectively. The lowest recall was observed for the population (54%), which also had the lowest F-score (70%). CONCLUSION: The generated performance of our text-mining approach demonstrated encouraging results for the identification of targeted information from observational epidemiological study abstracts related to environmental exposures. 
We have demonstrated that rules based on generic syntactic patterns in one corpus can be applied to other observational study designs by simply interchanging the dictionaries used to identify certain characteristics (i.e., outcomes, exposures). At the document level, the recognised information can assist in the selection and categorization of studies included in a systematic review.
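As a check on how the reported measures relate, the F-score is assumed here to be the usual balanced harmonic mean of precision P and recall R (the abstract does not state the weighting). Under that assumption, the population figures (precision 100%, recall 54%) reproduce the reported F-score of 70%:

```latex
F = \frac{2PR}{P + R}
  = \frac{2 \times 1.00 \times 0.54}{1.00 + 0.54}
  = \frac{1.08}{1.54}
  \approx 0.70
```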
Subject(s)
Automation , Data Mining , Review Literature as Topic
ABSTRACT
BACKGROUND: Free-text medication prescriptions contain detailed instruction information that is key when preparing drug data for analysis. The objective of this study was to develop a novel model and automated text-mining method to extract detailed structured medication information from free-text prescriptions and explore their variability (e.g. optional dosages) in primary care research databases. METHODS: We introduce a prescription model that provides minimum and maximum values for dose number, frequency and interval, allowing modelling variability and flexibility within a drug prescription. We developed a text mining system that relies on rules to extract such structured information from prescription free-text dosage instructions. The system was applied to medication prescriptions from an anonymised primary care electronic record database (Clinical Practice Research Datalink, CPRD). RESULTS: We have evaluated our approach on a test set of 220 CPRD prescription free-text directions. The system achieved an overall accuracy of 91 % at the prescription level, with 97 % accuracy across the attribute levels. We then further analysed over 56,000 most common free text prescriptions from CPRD records and found that 1 in 4 has inherent variability, i.e. a choice in taking medication specified by different minimum and maximum doses, duration or frequency. CONCLUSIONS: Our approach provides an accurate, automated way of coding prescription free text information, including information about flexibility and variability within a prescription. The method allows the researcher to decide how best to prepare the prescription data for drug efficacy and safety analyses in any given setting, and test various scenarios and their impact.
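A prescription model with minimum and maximum values can be sketched as a small data structure plus a rule-based parser. Everything below is hypothetical: the class, field names, and regular expressions are illustrative, cover only dose number and daily frequency (not interval or duration), and handle only a narrow range of phrasings compared with real CPRD dosage text.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Prescription:
    dose_min: float   # fewest units per administration
    dose_max: float   # most units per administration
    freq_min: float   # fewest administrations per day
    freq_max: float   # most administrations per day

    @property
    def is_variable(self):
        """True when the instruction leaves a choice of dose or frequency."""
        return self.dose_min != self.dose_max or self.freq_min != self.freq_max

    def daily_dose_range(self):
        """Lowest and highest number of units that may be taken per day."""
        return (self.dose_min * self.freq_min, self.dose_max * self.freq_max)


# A number, optionally followed by "-", "to" or "or" and a second number.
_RANGE = r"(\d+(?:\.\d+)?)(?:\s*(?:-|to|or)\s*(\d+(?:\.\d+)?))?"
_DOSE = re.compile(_RANGE + r"\s*(?:tablet|capsule|puff)s?", re.I)
_FREQ = re.compile(_RANGE + r"\s*times\s+(?:a|per)\s+day", re.I)


def parse_prescription(text) -> Optional[Prescription]:
    """Parse a dosage instruction; return None if dose or frequency is absent."""
    dose = _DOSE.search(text)
    freq = _FREQ.search(text)
    if not dose or not freq:
        return None
    return Prescription(
        dose_min=float(dose.group(1)),
        dose_max=float(dose.group(2) or dose.group(1)),
        freq_min=float(freq.group(1)),
        freq_max=float(freq.group(2) or freq.group(1)),
    )
```

Keeping both bounds, rather than collapsing to a single value at parse time, is what lets an analyst choose later between, say, the minimum, maximum, or midpoint daily dose when preparing data for analysis.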
Subject(s)
Biomedical Research/methods , Databases, Factual , Electronic Health Records , Electronic Prescribing , Medical Informatics Applications , Primary Health Care , Data Anonymization , Humans
ABSTRACT
Heart disease is the leading cause of death globally and a significant part of the human population lives with it. A number of risk factors have been recognized as contributing to the disease, including obesity, coronary artery disease (CAD), hypertension, hyperlipidemia, diabetes, smoking, and family history of premature CAD. This paper describes and evaluates a methodology to extract mentions of such risk factors from diabetic clinical notes, which was a task of the i2b2/UTHealth 2014 Challenge in Natural Language Processing for Clinical Data. The methodology is knowledge-driven and the system implements local lexicalized rules (based on syntactical patterns observed in notes) combined with manually constructed dictionaries that characterize the domain. A part of the task was also to detect the time interval in which the risk factors were present in a patient. The system was applied to an evaluation set of 514 unseen notes and achieved a micro-average F-score of 88% (with 86% precision and 90% recall). While the identification of CAD family history, medication and some of the related disease factors (e.g. hypertension, diabetes, hyperlipidemia) showed quite good results, the identification of CAD-specific indicators proved to be more challenging (F-score of 74%). Overall, the results are encouraging and suggest that automated text mining methods can be used to process clinical notes to identify risk factors and monitor progression of heart disease on a large scale, providing necessary data for clinical and epidemiological studies.
Subject(s)
Cardiovascular Diseases/epidemiology , Data Mining/methods , Diabetes Complications/epidemiology , Electronic Health Records/organization & administration , Narration , Natural Language Processing , Aged , Cardiovascular Diseases/diagnosis , Cohort Studies , Comorbidity , Computer Security , Confidentiality , Diabetes Complications/diagnosis , Female , Humans , Incidence , Longitudinal Studies , Male , Middle Aged , Pattern Recognition, Automated/methods , Risk Assessment/methods , Semantics , United Kingdom/epidemiology , Vocabulary, Controlled
ABSTRACT
A recent promise to access unstructured clinical data from electronic health records on a large scale has revitalized the interest in automated de-identification of clinical notes, which includes the identification of mentions of Protected Health Information (PHI). We describe the methods developed and evaluated as part of the i2b2/UTHealth 2014 challenge to identify PHI defined by 25 entity types in longitudinal clinical narratives. Our approach combines knowledge-driven (dictionaries and rules) and data-driven (machine learning) methods with a large range of features to address de-identification of specific named entities. In addition, we have devised a two-pass recognition approach that creates a patient-specific run-time dictionary from the PHI entities identified in the first step with high confidence, which is then used in the second pass to identify mentions that lack specific clues. The proposed method achieved overall micro F1-measures of 91% on strict and 95% on token-level evaluation on the test dataset (514 narratives). Whilst most PHI entities can be reliably identified, particularly challenging were mentions of Organizations and Professions. Still, the overall results suggest that automated text mining methods can be used to reliably process clinical notes to identify personal information and thus provide a crucial step in large-scale de-identification of unstructured data for further clinical and epidemiological studies.
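The two-pass idea can be illustrated with a minimal sketch. It substitutes a single hand-written rule (a title followed by a capitalised surname) for the machine-learned, confidence-scored first pass described in the abstract, so the pattern `TITLED_NAME` and the function `two_pass_names` are hypothetical stand-ins rather than the system's components.

```python
import re

# Pass 1 rule (hypothetical): a title is treated as a high-confidence clue.
TITLED_NAME = re.compile(r"\b(?:Dr|Mr|Mrs|Ms)\.?\s+([A-Z][a-z]+)\b")


def two_pass_names(notes):
    """Find surname mentions across all notes belonging to ONE patient.

    Returns a list of (note_index, start, end, text) tuples.
    """
    # Pass 1: build the patient-specific run-time dictionary from clued mentions.
    runtime_dictionary = set()
    for note in notes:
        runtime_dictionary.update(m.group(1) for m in TITLED_NAME.finditer(note))
    if not runtime_dictionary:
        return []

    # Pass 2: tag every occurrence of a dictionary entry, with or without a clue.
    alternatives = "|".join(re.escape(name) for name in sorted(runtime_dictionary))
    lookup = re.compile(r"\b(?:" + alternatives + r")\b")
    spans = []
    for index, note in enumerate(notes):
        for match in lookup.finditer(note):
            spans.append((index, match.start(), match.end(), match.group(0)))
    return spans
```

For the notes `["Seen by Dr. Smith today.", "Smith advised follow-up in two weeks."]`, the second mention of "Smith" carries no title and is only recovered because the first pass added it to the run-time dictionary.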
Subject(s)
Computer Security , Confidentiality , Electronic Health Records/organization & administration , Narration , Natural Language Processing , Pattern Recognition, Automated/methods , Cohort Studies , Computer Simulation , Data Mining/methods , Machine Learning , Models, Statistical , United Kingdom , Vocabulary, Controlled
ABSTRACT
Aim: Few studies have examined the characteristics of domestic violence (DV) committed by people with dementia. We provide an overview of DV perpetrated by people with dementia in the community based on police reports of attendances at DV events. Method: A text mining method was used on 416,441 New South Wales (NSW) police narratives of DV events from January 2005 to December 2016 to extract information for Persons of Interest (POIs) with mentions of dementia. Results: Events involving those with dementia accounted for a relatively low proportion of total DV events (<1%). Of the 260 DV events with a dementia mention for the POI, the most common abuse types were assault (49.7%) and verbal abuse (31.6%). Spouses were the largest group of victims (50.8%) followed by children (8.8%). Physical abuse was common, occurring in 82.4% of events, but injuries were relatively mild. Although weapons were infrequently used, they were involved in 5% of events, mostly by POIs aged 75 years and older. Similarly, the POIs were mainly aged 75+ years (60%); however, the proportion of those aged <65 was relatively high (20.8%) compared to the reported prevalence of dementia in that age group. Conclusions: This study demonstrates that some cases of DV perpetrated by people with reported dementia are significant enough to warrant police involvement. This highlights the need to proactively discuss the potential for violence as part of the holistic management and support of family members, particularly those caring for people with young-onset dementias.
ABSTRACT
BACKGROUND: The emerging field of epidemiological criminology studies the intersection between public health and justice systems. To increase the value of and reduce waste in research activities in this area, it is important to perform transparent research priority setting considering the needs of research beneficiaries and end users along with a systematic assessment of the existing research activities to address gaps and harness opportunities. OBJECTIVE: In this study, we aimed to examine published research outputs in epidemiological criminology to assess gaps between published outputs and current research priorities identified by prison stakeholders. METHODS: A rule-based method was applied to 23,904 PubMed epidemiological criminology abstracts to extract the study determinants and outcomes (ie, "themes"). These were mapped against the research priorities identified by Australian prison stakeholders to assess the differences from research outputs. The income level of the affiliation country of the first authors was also identified to compare the ranking of research priorities in countries categorized by income levels. RESULTS: On an evaluation set of 100 abstracts, the identification of themes returned an F1-score of 90%, indicating reliable performance. More than 53.3% (11,927/22,361) of the articles had at least 1 extracted theme; the most common was substance use (1533/11,814, 12.97%), followed by HIV (1493/11,814, 12.64%). The infectious disease category (2949/11,814, 24.96%) was the most common research priority category, followed by mental health (2840/11,814, 24.04%) and alcohol and other drug use (2433/11,814, 20.59%). A comparison between the extracted themes and the stakeholder priorities showed an alignment for mental health, infectious diseases, and alcohol and other drug use. Although behavior- and juvenile-related themes were common, they did not feature as prison priorities. 
Most studies were conducted in high-income countries (10,083/11,814, 85.35%), while countries with the lowest income status focused half of their research on infectious diseases (47/91, 52%). CONCLUSIONS: The identification of research themes from PubMed epidemiological criminology research abstracts is possible through the application of a rule-based text mining method. The frequency of the investigated themes may reflect historical developments concerning disease prevalence, treatment advances, and the social understanding of illness and incarcerated populations. The differences between income status groups are likely to be explained by local health priorities and immediate health risks. Notable gaps between stakeholder research priorities and research outputs concerned themes that were more focused on social factors and systems and may reflect publication bias or self-publication selection, highlighting the need for further research on prison health services and the social determinants of health. Different jurisdictions, countries, and regions should undertake similar systematic and transparent research priority-setting processes.
ABSTRACT
Nonfatal strangulation (NFS) is a common form of domestic violence (DV) that frequently leaves no visible signs of injury and can be a portent for future fatality. A validated text mining approach was used to analyze a police dataset of 182,949 DV events for the presence of NFS. Results confirmed NFS within intimate partner relationships is a gendered form of violence. The presence of injury and/or other (non-NFS) forms of physical abuse, emotional/verbal/social abuse, and the perpetrator threatening to kill the victim, were associated with significantly higher odds of NFS perpetration. Police data contain rich information that can be accessed using automated methodologies such as text mining to add to our understanding of this pressing public health issue.
Subject(s)
Domestic Violence , Intimate Partner Violence , Data Mining/methods , Humans , New South Wales , Police , Prevalence
ABSTRACT
BACKGROUND AND OBJECTIVES: The police are often the first to attend domestic violence events in New South Wales (NSW), Australia, recording related details as structured information (e.g., date of the event, type of incident, premises type) and text narratives which contain important information (e.g., mental health status, abuse types) for victims and perpetrators. This study examined the characteristics of victims and persons of interest (POIs) suspected and/or charged with perpetrating a domestic violence-related crime in residential care facilities. RESEARCH DESIGN AND METHODS: The study employed a text mining method that extracted key information from 700 police-recorded domestic violence events in NSW residential care facilities. RESULTS: Victims were mostly female (65.4%) and older adults (median age 80.3). POIs were predominantly male (67.0%) and were younger than the victims (median age 57.0). While low rates of mental illnesses were recorded (29.1% in victims; 17.4% in POIs), "dementia" was the most common condition among POIs (55.7%) and victims (73.0%). "Physical abuse" was the most common abuse type (80.2%) with "bruising" the most common injury (36.8%). The most common relationship between perpetrator and victim was "carer" (76.6%). DISCUSSION AND IMPLICATIONS: These findings highlight the opportunity provided by police text-based data to offer insights into elder abuse within residential care facilities.
Subject(s)
Crime Victims , Domestic Violence , Aged , Aged, 80 and over , Australia , Data Mining/methods , Female , Humans , Male , New South Wales/epidemiology , Police
ABSTRACT
BACKGROUND: To better understand domestic violence, data sources from multiple sectors such as police, justice, health, and welfare are needed. Linking police data to data collections from other agencies could provide unique insights and promote an all-of-government response to domestic violence. The New South Wales Police Force attends domestic violence events and records information in the form of both structured data and a free-text narrative, with the latter shown to be a rich source of information on the mental health status of persons of interest (POIs) and victims, abuse types, and sustained injuries. OBJECTIVE: This study aims to examine the concordance (ie, matching) between mental illness mentions extracted from the police's event narratives and mental health diagnoses from hospital and emergency department records. METHODS: We applied a rule-based text mining method on 416,441 domestic violence police event narratives between December 2005 and January 2016 to identify mental illness mentions for POIs and victims. Using different window periods (1, 3, 6, and 12 months) before and after a domestic violence event, we linked the extracted mental illness mentions of victims and POIs to clinical records from the Emergency Department Data Collection and the Admitted Patient Data Collection in New South Wales, Australia using a unique identifier for each individual in the same cohort. RESULTS: Using a 2-year window period (ie, 12 months before and after the domestic violence event), less than 1% (3020/416,441, 0.73%) of events had a mental illness mention and also a corresponding hospital record. About 16% of domestic violence events for both POIs (382/2395, 15.95%) and victims (101/631, 16.01%) had an agreement between hospital records and police narrative mentions of mental illness. A total of 51,025/416,441 (12.25%) events for POIs and 14,802/416,441 (3.55%) events for victims had mental illness mentions in their narratives but no hospital record. 
Only 841 events for POIs and 919 events for victims had a documented hospital record within 48 hours of the domestic violence event. CONCLUSIONS: Our findings suggest that current surveillance systems used to report on domestic violence may be enhanced by accessing rich information (ie, mental illness) contained in police text narratives, made available for both POIs and victims through the application of text mining. Additional insights can be gained by linkage to other health and welfare data collections.
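The window-based linkage can be sketched as follows. This is an illustrative simplification, not the study's linkage procedure: it approximates a month as 30 days, assumes records are already keyed by a shared person identifier, and uses hypothetical function names and data shapes.

```python
from datetime import date, timedelta

DAYS_PER_MONTH = 30  # assumption: calendar months approximated as 30 days


def has_record_in_window(event_date, record_dates, months):
    """True if any record falls within +/- `months` of the event date."""
    window = timedelta(days=months * DAYS_PER_MONTH)
    return any(abs(record - event_date) <= window for record in record_dates)


def concordance_by_window(events, hospital_records, windows=(1, 3, 6, 12)):
    """Share of events with a matching hospital record, per window length.

    events: list of (person_id, event_date) pairs with a mental illness mention.
    hospital_records: dict mapping person_id to a list of diagnosis record dates.
    """
    result = {}
    for months in windows:
        matched = sum(
            has_record_in_window(event_date, hospital_records.get(pid, []), months)
            for pid, event_date in events
        )
        result[months] = matched / len(events) if events else 0.0
    return result
```

Widening the window can only keep or raise the matched share, which is why the 12-month window before and after the event gives the upper bound on concordance among the windows considered.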
ABSTRACT
BACKGROUND: Epidemiological criminology refers to health issues affecting incarcerated and nonincarcerated offender populations, a group recognized as being challenging to conduct research with. Notwithstanding this, an urgent need exists for new knowledge and interventions to improve health, justice, and social outcomes for this marginalized population. OBJECTIVE: To better understand research outputs in the field of epidemiological criminology, we examined the lead author's affiliation by analyzing peer-reviewed published outputs to determine countries and organizations (eg, universities, governmental and nongovernmental organizations) responsible for peer-reviewed publications. METHODS: We used a semiautomated approach to examine the first-author affiliations of 23,904 PubMed epidemiological studies related to incarcerated and offender populations published in English between 1946 and 2021. We also mapped research outputs to the World Justice Project Rule of Law Index to better understand whether there was a relationship between research outputs and the overall standard of a country's justice system. RESULTS: Nordic countries (Sweden, Norway, Finland, and Denmark) had the highest research outputs proportional to their incarcerated population, followed by Australia. University-affiliated first authors comprised 73.3% of published articles, with the Karolinska Institute (Sweden) being the most published, followed by the University of New South Wales (Australia). Government-affiliated first authors were on 8.9% of published outputs, and prison-affiliated groups were on 1%. Countries with the lowest research outputs also had the lowest scores on the Rule of Law Index. CONCLUSIONS: This study provides important information on who is publishing research in the epidemiological criminology field. 
This has implications for promoting research diversity, independence, funding equity, and partnerships between universities and government departments that control access to incarcerated and offending populations.
ABSTRACT
Family and domestic violence (FDV) is a global problem with significant social, economic, and health consequences for victims including increased health care costs, mental trauma, and social stigmatization. In Australia, the estimated annual cost of FDV is $22 billion, with one woman being murdered by a current or former partner every week. Despite this, tools that can predict future FDV based on the features of the person of interest (POI) and victim are lacking. The New South Wales Police Force attends thousands of FDV events each year and records details as fixed fields (e.g., demographic information for individuals involved in the event) and as text narratives which describe abuse types, victim injuries, threats, and the mental health status of POIs and victims. This information within the narratives is mostly untapped for research and reporting purposes. After applying a text mining methodology to extract information from 492,393 FDV event narratives (abuse types, victim injuries, mental illness mentions), we linked these characteristics with the respective fixed fields and with actual mental health diagnoses obtained from the NSW Ministry of Health for the same cohort to form a comprehensive FDV dataset. These data were input into five deep learning models (MLP, LSTM, Bi-LSTM, Bi-GRU, BERT) to predict three FDV offense types ("hands-on," "hands-off," "Apprehended Domestic Violence Order (ADVO) breach"). The transformer model with BERT embeddings returned the best performance (69.00% accuracy; 66.76% ROC) for "ADVO breach" in a multilabel classification setup while the binary classification setup generated similar results. "Hands-off" offenses proved the hardest offense type to predict (60.72% accuracy; 57.86% ROC using BERT) but showed potential to improve with fine-tuning of binary classification setups. 
"Hands-on" offenses benefitted least from the contextual information gained through BERT embeddings in which MLP with categorical embeddings outperformed it in three out of four metrics (65.95% accuracy; 78.03% F1-score; 70.00% precision). The encouraging results indicate that future FDV offenses can be predicted using deep learning on a large corpus of police and health data. Incorporating additional data sources will likely increase the performance which can assist those working on FDV and law enforcement to improve outcomes and better manage FDV events.
ABSTRACT
In Australia, domestic violence reports are mostly based on data from the police, courts, hospitals, and ad hoc surveys. However, gaps exist in reporting information such as victim injuries, mental health status and abuse types. The police record details of domestic violence events as structured information (e.g., gender, postcode, ethnicity), but also in text narratives describing other details such as injuries, substance use, and mental health status. However, the voluminous nature of the narratives has prevented their use for surveillance purposes. We used a validated text mining methodology on 492,393 police-attended domestic violence event narratives from 2005 to 2016 to extract mental health mentions for persons of interest (POIs) (individuals suspected of or charged with a domestic violence offense) and victims, abuse types, and victim injuries. A significant increase was observed in events that recorded an injury type (28.3% in 2005 to 35.6% in 2016). The pattern of injury and abuse types differed between male and female victims, with male victims more likely to be punched and to experience cuts and bleeding and female victims more likely to be grabbed and pushed and have bruises. The four most common mental illnesses (alcohol abuse, bipolar disorder, depression, schizophrenia) were the same in male and female POIs. An increase from 5.0% in 2005 to 24.3% in 2016 was observed in the proportion of events with a reported mental illness, with an increase between 2005 and 2016 in depression among female victims. These findings demonstrate that extracting information from police narratives can provide novel insights into domestic violence patterns including confounding factors (e.g., mental illness) and thus enable policy responses to address this significant public health problem.