Studies in Computational Social Science
Social media sites are an emerging source of data for public health research. These sites provide less intimidating and more accessible channels for reporting, collectively processing, and making sense of traumatic and stigmatizing experiences. Several previous works have studied public health issues intersecting with domestic abuse, including depression and post-traumatic stress disorder. Many researchers have focused on Twitter data due to its prominence, accessibility, and the characteristics of tweets: short, timestamped texts with trend-associated properties such as retweets, hashtags, and user mentions, and potentially geotags. For example, De Choudhury et al. examined a set of tweets to predict the onset of depression. Using Amazon Mechanical Turk, gold-standard labels of depression and non-depression were applied to Twitter users, and the depressed users' tweets were collected for the year preceding the onset of their self-reported depression. Using various statistical and machine learning models, the authors determined the features most predictive of depression onset, contributing a radial basis function (RBF) support vector machine (SVM) classifier, with principal component analysis (PCA) dimensionality reduction, that achieved 70% classification accuracy and a precision of 0.74.
Features included the presence of known depression terms in tweets, social network features, the prevalence of medication terms, tweet volume over time, the frequency of first-, second-, and third-person pronouns, Linguistic Inquiry and Word Count (LIWC) scores, and the prevalence of swear words. Applying the model to a corpus of millions of tweets from within the United States, the authors then created a Social Media Depression Index (SMDI) to estimate levels of depression across regions of the country. The index correlated strongly with depression statistics reported by the Centers for Disease Control and Prevention (CDC).
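A minimal sketch of such a classifier follows, assuming scikit-learn (which the original study does not specify) and a synthetic per-user feature matrix standing in for the real behavioral and linguistic features:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # One row per user; columns stand in for features such as tweet volume,
    # pronoun frequencies, and LIWC scores. Random placeholder data only.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 40))
    y = rng.integers(0, 2, size=500)  # 1 = depression onset, 0 = control

    # PCA for dimensionality reduction feeding an RBF-kernel SVM,
    # mirroring the classifier design described above.
    clf = make_pipeline(StandardScaler(), PCA(n_components=10), SVC(kernel="rbf"))
    print(cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean())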
Related to the above study is an analysis of high- and low-distress tweets in the New York City area. Distress was examined because it has been shown to be a key risk factor for suicide and is observable in the writing of microblog users. Trained and evaluated on expert-annotated tweets, an SVM using the uni-, bi-, and trigrams appearing in the corpus achieved a precision of 0.59 and a recall of 0.71 in predicting distressed versus non-distressed tweets. While a precision of 0.59 in binary prediction is low, erring on the side of caution with high recall is beneficial given the goal of discovering at-risk individuals. The task was challenging, considering the difficulty of recognizing conceptually subjective distress from a few informal tweets.
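A comparable distress classifier can be sketched as follows, again assuming scikit-learn; LinearSVC stands in for the unspecified SVM variant, and the example tweets and labels are invented:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    # Expert-annotated tweets: 1 = distressed, 0 = non-distressed (invented).
    tweets = ["i can't take this anymore", "great day at the park"]
    labels = [1, 0]

    # Uni-, bi-, and trigram counts feeding a linear SVM. class_weight can be
    # adjusted to favor recall, matching the goal of finding at-risk users.
    clf = make_pipeline(
        CountVectorizer(ngram_range=(1, 3)),
        LinearSVC(class_weight="balanced"),
    )
    clf.fit(tweets, labels)
    print(clf.predict(["feeling hopeless tonight"]))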
Other researchers have focused on different health issues, including post-traumatic stress disorder, early detection of epidemics, and bullying in tweets. These studies use n-gram bag-of-words models as features and attempt to improve upon them with additional feature engineering or further lexical or semantic features. Adding part-of-speech tags to n-grams is often attempted (a sketch of this combination appears at the end of this section), as is creating word classes via data inspection, using morphosyntactic features, and exploiting the sentiment of text instances. In Xu et al., linear models with n-grams are recommended for their simplicity and high accuracy, though in Lamb et al., word classes, Twitter-specific stylometry (retweet counts, hashtags, user mentions, and emoticons), and an indicator for phrases beginning with a verb were found to improve over n-grams on two different tasks.
Reddit has been studied less in this area, with work mainly focusing on mental health. Pavalanathan and De Choudhury identified a large number of subreddits on the topic of mental health and used them to determine the differences in discourse between throwaway and regular accounts. They observed almost six times more throwaway submissions in mental health subreddits than in control subreddits, and found that throwaway accounts exhibit considerable disinhibition in discussing sensitive aspects of the self. This motivates the present work in analyzing Reddit submissions on domestic abuse, which can be assumed to attract similar levels of throwaway accounts and discussion. Additionally, Balani and De Choudhury used standard n-gram features, along with submission and author attributes, to classify submissions as high or low self-disclosure with a perceptron classifier, achieving 78% accuracy, 0.74 precision, and 0.86 recall.
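The POS-augmented n-gram design referenced above can be sketched as a feature union of word n-grams and part-of-speech n-grams; the tagger choice (nltk), the data downloads, the linear classifier, and the toy examples are all assumptions rather than details from the cited studies:

    import nltk
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.pipeline import FeatureUnion, make_pipeline
    from sklearn.svm import LinearSVC

    # Tokenizer and tagger models; names vary across nltk versions, so try both.
    for pkg in ("punkt", "punkt_tab", "averaged_perceptron_tagger",
                "averaged_perceptron_tagger_eng"):
        nltk.download(pkg, quiet=True)

    def to_pos(text):
        # Map each token to its part-of-speech tag so that n-grams over the
        # result capture syntactic rather than lexical patterns.
        return " ".join(tag for _, tag in nltk.pos_tag(nltk.word_tokenize(text)))

    # Word n-grams side by side with POS-tag n-grams, feeding a linear model.
    clf = make_pipeline(
        FeatureUnion([
            ("words", CountVectorizer(ngram_range=(1, 2))),
            ("pos", CountVectorizer(ngram_range=(1, 2), preprocessor=to_pos)),
        ]),
        LinearSVC(),
    )
    clf.fit(["he hit me again last night", "my dog ate my homework"], [1, 0])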
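In the same spirit, a self-disclosure classifier combining n-grams with submission and author attributes might look like the following; the column names, example posts, and the use of scikit-learn and pandas are hypothetical, as the cited work does not specify its implementation:

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import Perceptron
    from sklearn.pipeline import make_pipeline

    # Hypothetical submissions with invented metadata columns.
    posts = pd.DataFrame({
        "text": ["throwaway account, but i need to talk about what happened",
                 "looking for general advice about local support groups"],
        "comment_count": [12, 3],
        "account_age_days": [1, 700],
    })
    labels = [1, 0]  # 1 = high self-disclosure, 0 = low

    # Word uni- and bigrams from the text, with metadata passed through
    # unchanged, feeding a perceptron classifier.
    features = ColumnTransformer([
        ("ngrams", CountVectorizer(ngram_range=(1, 2)), "text"),
        ("meta", "passthrough", ["comment_count", "account_age_days"]),
    ])
    clf = make_pipeline(features, Perceptron())
    clf.fit(posts, labels)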