Analyzing Domestic Abuse
using Natural Language Processing on Social Media Data
Introduction
Social media websites, such as Twitter, have frequently been used as a source of information for predicting and characterizing various societal and health issues. Social media is an effective tool for gathering high volumes of data quickly, as its use in previous research indicates. However, analyzing the dynamics of abusive relationships using social media data is largely unexplored. In this thesis, new datasets discussing abuse are collected and developed. Computational methods are applied to these data to integrate quantitative results with findings from clinical literature for a qualitative understanding of the characteristics of domestic abuse.
Motivation
Globally, 30% of women aged 15 and older have experienced physical and/or sexual intimate partner violence at some point in their lives [20]. While domestic abuse tends to have greater prevalence in low-income and non-western [...] The data used to calculate such statistics are often derived from population-based surveys that are costly, time-consuming, and potentially dangerous to participate in, and that primarily seek to obtain insight into the prevalence, consequences, and risk factors of domestic abuse. Because victims of abuse may still be in the relationship in question when they answer survey questions, these surveys follow strict safety guidelines set by the World Health Organization. Researchers must take great care to ensure the safety of the participants, and therefore the number of participants is often quite small. One way to avoid the cost of large-scale surveys while maintaining appropriate research conditions is to leverage the abundance of data publicly available on the web. Such data give researchers an opportunity to better understand domestic abuse in order to provide resources for victims and efficiently implement preventative measures. While the age groups 0-17 and 55+ will be significantly underrepresented based on the user demographics of these websites, intimate partner violence is most prevalent between the ages of 18 and 24, which aligns with the age group most active on social media.
Hypotheses
1. Using unstructured social media input from relevant sources of language data, natural language processing (NLP) methods and machine learning classifiers can detect language related to domestic abuse.
2. Analysis of these classifiers, along with data inspection, can reveal meaningful structural, semantic, linguistic, and textual characteristics, including the actions, stakeholders, and situations involved in abusive relationships.
Studies in Computational Social Science
Social media sites are an emerging source of data for public health research. These sites provide less intimidating and more accessible channels for reporting, collectively processing, and making sense of traumatic and stigmatizing experiences. Several previous works have studied public health issues intersecting with domestic abuse, including depression and post-traumatic stress disorder. Many researchers have focused on Twitter data because of its prominence, its accessibility, and the characteristics of tweets (short, timestamped texts with trend-associated properties such as retweets, hashtags, and user mentions, and potentially geotags).
For example, De Choudhury et al. examined a set of tweets to predict the onset of depression. Gold-standard labels of depression and non-depression were applied to Twitter users via Amazon Mechanical Turk. The depressed users’ tweets were collected for a year before the onset of their self-reported depression. Using various statistical and machine learning models, the authors determined the significant features for predicting the onset of depression and contributed a radial basis function (RBF) support vector machine (SVM) classifier, with principal component analysis (PCA) dimensionality reduction, that achieved 70% classification accuracy with a precision of 0.74. Features included the presence of known depression terms in tweets, social network features, the prevalence of medication terms, tweet volume over time, the frequency of first-, second-, and third-person pronouns, Linguistic Inquiry and Word Count (LIWC) scores, and the prevalence of swear words. Applying the model to a corpus of millions of tweets from within the United States to find depression-indicative tweets, the authors then created a Social Media Depression Index (SMDI) for calculating levels of depression within regions of the United States. They found a high correlation with depression statistics reported by the Centers for Disease Control and Prevention (CDC).
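To make two of the ideas above concrete, the following is a minimal sketch rather than a reproduction of the cited study. It shows how pronoun-person frequencies might be computed as features for a single text, and how accuracy and precision differ as evaluation measures. The pronoun lists, the function names `pronoun_rates` and `accuracy_and_precision`, and the toy labels are all assumptions made for illustration; the lexicon and evaluation data used by De Choudhury et al. are not given here.

```python
import re
from collections import Counter

# Illustrative pronoun lists (an assumption, NOT the lexicon of the cited study).
PRONOUNS = {
    "first": {"i", "me", "my", "mine", "myself",
              "we", "us", "our", "ours", "ourselves"},
    "second": {"you", "your", "yours", "yourself", "yourselves"},
    "third": {"he", "him", "his", "himself", "she", "her", "hers", "herself",
              "it", "its", "itself", "they", "them", "their", "theirs",
              "themselves"},
}


def pronoun_rates(text):
    """Return the fraction of tokens that are 1st-, 2nd-, and 3rd-person pronouns.

    Tokenization is deliberately crude: lowercase runs of letters and
    apostrophes, so a contraction such as "I'm" is not counted as "I".
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return {person: 0.0 for person in PRONOUNS}
    counts = Counter()
    for token in tokens:
        for person, words in PRONOUNS.items():
            if token in words:
                counts[person] += 1
    return {person: counts[person] / len(tokens) for person in PRONOUNS}


def accuracy_and_precision(y_true, y_pred, positive=1):
    """Compute accuracy and precision for binary labels.

    Accuracy is the share of all predictions that are correct; precision is
    the share of predicted positives that are truly positive.
    """
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    predicted_pos = sum(p == positive for p in y_pred)
    true_pos = sum(t == positive and p == positive
                   for t, p in zip(y_true, y_pred))
    accuracy = correct / len(y_true)
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    return accuracy, precision


if __name__ == "__main__":
    # Toy inputs, invented for illustration only.
    print(pronoun_rates("I told him that you were late"))
    print(accuracy_and_precision([1, 0, 1, 1, 0], [1, 1, 1, 0, 0]))
```

In a full pipeline, rates such as these would be one group of columns in a larger feature matrix passed to a classifier. The two measures are reported separately because a classifier can label most examples correctly while still producing many false positives, or the reverse.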