Twitter is a relatively new service, made public in 2006, which allows users to post 140-character updates. Users can follow other users, such as friends, celebrities, or companies, to receive a live digest of those users’ updates. With these short messages, users can engage in public or private conversations, forward messages they think their followers will find interesting by retweeting, or simply post whatever they are doing, thinking, or want to write. Twitter has reported having over 140 million active users and 340 million tweets per day, meaning an incredible amount of information and text is exchanged on Twitter.

The Twitter service provides two special keyword annotations of note. The first is the @-username construct, often used in a conversation between two users, noting that a tweet is a reply to something another user has said, as a method of bringing a message to the attention of another user, or, in a retweet, noting the original author of a quoted phrase or message. An at sign precedes a series of up to 15 alphanumeric characters and underscores corresponding to the username of a Twitter account. The entire token, including the at sign, is hyperlinked, pointing to the home page of that Twitter account.

The second construct is the hashtag, denoted with a ‘#’ symbol, which provides a tagging interface for use in tweets. The hashtag symbol is followed by a keyword or phrase (no spaces) that is relevant to the tweet. The hyperlink created from the hashtag points to a page listing all other tweets with the same tag. Because hashtags can occur anywhere within a tweet, they complicate the process of cleaning a tweet into a standard-language sentence: there is no strict rule for whether a hashtag is part of the sentence or auxiliary to it. The most-used hashtags are generally short; they are frequently abbreviations or short word phrases with the spaces removed.
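The two constructs described above can be recognized mechanically. The following sketch (the regular expressions and function name are my own illustration, not Twitter’s implementation) extracts both token types from a tweet:

```python
import re

# @-usernames: an at sign followed by up to 15 alphanumeric/underscore
# characters. Hashtags: a '#' followed by a keyword with no spaces.
# A real tokenizer would also check the surrounding token boundaries.
MENTION_RE = re.compile(r"@([A-Za-z0-9_]{1,15})")
HASHTAG_RE = re.compile(r"#(\w+)")

def extract_entities(tweet):
    """Return the usernames and hashtag keywords appearing in a tweet."""
    return MENTION_RE.findall(tweet), HASHTAG_RE.findall(tweet)

mentions, tags = extract_entities("RT @jack: just setting up my twttr #firsttweet")
# mentions == ['jack'], tags == ['firsttweet']
```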
Commonly, hashtags at the end of a tweet are dropped during sentence cleaning, and those within sentences are treated as relevant words and have the ‘#’ symbol stripped for analysis purposes. Gimpel et al. found 35% of hashtags were treated as words rather than tags.

In July 2011, Twitter crossed the one million mark for developer applications registered to use the Twitter API. Twitter provides developers and researchers a robust API with which to interact with accounts and access user information and tweets. Every user defines a username and, optionally, a real name, description, and location. Also available are the account’s associated time zone and the account creation timestamp. Each tweet is associated with several pieces of information in addition to the message, such as its timestamp, the Twitter client it was posted from, whether it was part of a conversation or a retweet, and the count of people who retweeted it. However, Twitter does not elicit other data about the author that might be useful for latent attribute analysis, such as age or other demographic information.

Several corporate entities have published studies of the demographics on Twitter. Most use data mining techniques to extract a set of demographic features from the defined user attributes, relying on instances where users have published their age, gender, or location as part of their profile or somewhere in their tweets. Others, such as the Pew Research Center, utilize other forms of data collection: in their internet and social media use survey, they used phone interviews to gather data on internet and social media use (Twitter included) and demographics. Consumers of this information tend to be in marketing, as companies are always seeking the best way to advertise to their target audiences.
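This cleaning convention can be sketched in a few lines; the function name and regular expressions below are my own illustration, not a published implementation:

```python
import re

def clean_hashtags(tweet):
    """Apply the common convention: drop hashtags trailing the sentence,
    and keep inline hashtags as words with the '#' symbol stripped."""
    # Repeatedly remove hashtags (and preceding whitespace) at the end.
    tweet = re.sub(r"(\s*#\w+)+\s*$", "", tweet)
    # Strip the '#' from hashtags embedded within the sentence.
    return re.sub(r"#(\w+)", r"\1", tweet)

clean_hashtags("I love the #weather today #blessed #nofilter")
# -> 'I love the weather today'
```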
A variety of work has been published that focuses on linguistic analysis for author age, much of which focuses on lexical and contextual clues, such as analyzing topic and genre or n-gram patterns. N-gram patterns can refer to several elements of linguistic analysis. On a lexical level, n-grams are groupings of length n of word tokens, found adjacently in text. They are also referred to as unigrams, bigrams, trigrams, etc. for n of 1, 2, and 3, respectively. On a character level, n-grams can refer to groupings of adjacent characters within a word, in much the same way as groupings of words. Depending on the approach, special characters may be used as marks at word boundaries. As an example, Cavnar presents trigrams for the word “text”, such as “_TE”, “TEX”, and “XT_”. This work, like many others, focuses on token analysis. Tokens, as defined for this work, consist of sets of characters, generally (but not always) separated by spaces in the original text. Punctuation tokens (those consisting of only punctuation characters) are separated from adjoining word tokens. Additionally, words recognized as contractions are separated into two word tokens, e.g. “shouldn’t” → “should” and “n’t”.

Garera and Yarowsky used linguistic features (amount of speech in conversation, length of utterances, usage of passive tense, etc.) to characterize types of speech in telephone conversations between partners in their research. They found that such sociolinguistic features improved
the accuracy of binary attribute classification for speaker age, gender, and native language. Many features that are available in an audio corpus, such as prosody and vocal inflections, are not available in a purely textual corpus, making related classification problems more challenging.
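The word- and character-level n-gram definitions above can be sketched directly; this is a toy illustration in which the underscore padding character and function names are my own choices:

```python
def word_ngrams(tokens, n):
    """Adjacent groupings of n word tokens (unigrams, bigrams, ...)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def char_ngrams(word, n, pad="_"):
    """Character n-grams, with a padding mark at the word boundaries."""
    padded = pad + word + pad
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

word_ngrams(["should", "n't", "have"], 2)
# -> [('should', "n't"), ("n't", 'have')]
char_ngrams("text", 3)
# -> ['_te', 'tex', 'ext', 'xt_']
```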
Garera and Yarowsky were able to achieve about 20% improvement over guessing the most common class when classifying phone conversations for age with a binary classifier. Nguyen et al. went a step beyond many other studies and classified age as a continuous variable in online texts and transcribed telephone conversations. They found that stylistic, unigram, and part-of-speech characteristics were all indicative of author age, with mean absolute errors between 4.1 and 6.8 years.
Rosenthal and McKeown analyzed online behavior associated with blogs (which typically have greater depth than tweets) and found that behavioral features (number of friends, posts, time of posts, etc.) could effectively be used in binary age classifiers, in addition to linguistic analysis techniques similar to those mentioned above. Similarly, many works investigating linguistic gender and age indicators focus on non-contextual and deeper analysis, such as through statistical language models. A statistical language model is a probability distribution over words, sentences, phrases, or characters in a language. A language model might hold probabilities representing n-grams; those probabilities can be used in various types of linguistic analysis.
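As a concrete illustration of a statistical language model, the sketch below estimates bigram probabilities by relative frequency. It is an unsmoothed toy model on a toy corpus; real systems add smoothing and train on far more data:

```python
from collections import Counter

def train_bigram_lm(sentences):
    """Estimate P(word | previous word) by relative frequency."""
    bigrams, unigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(tokens[:-1])          # contexts
        bigrams.update(zip(tokens, tokens[1:]))
    # Return a conditional probability function over the learned counts.
    return lambda prev, word: (
        bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
    )

prob = train_bigram_lm(["i like tea", "i like coffee"])
prob("i", "like")    # -> 1.0
prob("like", "tea")  # -> 0.5
```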
With respect to examining another demographic feature, Sarawgi et al. explored non-contextual syntactic and morphological patterns to determine whether gender differences extend further than topic analysis and word usage can indicate. They used probabilistic context-free grammars, token-based statistical language models, and character-level language models that learn morphological patterns on short text spans. With these, they found that gender is evident in patterns at the
character level, even in modern scientific papers.

Much of the linguistic analysis completed to date focuses on formal writing or conversation transcripts, which generally conform to standard English dialects, syntax, and orthography. Recently, more works have begun to look at new written and online texts which do not tend toward prescriptive standards, including SMS messages and social networking blurbs, such as Facebook and Twitter messages. There are various challenges in analyzing these typically noisy texts. Misspellings, unusual syntax, and word and phrase abbreviations are common in these texts, and many linguistic analysis tools do not deal with them. Rao et al. found that n-gram and sociolinguistic cues in unaltered Twitter messages, such as a series of exclamation marks, ellipses, character repetition, and use of possessives, could be used to determine age (binary: over or under 30), gender, region, and political orientation, similar to works that have focused on more formal writing. These textual sociolinguistic features yielded improvements over relative baselines. Those improvements are similar to the ones found in this work: in the best cases, classifiers examined in this work using only numeric abbreviation features performed almost 5% better, and abbreviation features combined with n-gram features showed improvements of as much as 66.8%. Gimpel et al. developed a part-of-speech tagger designed to handle the unique Twitter lexicon by extending the traditionally labeled parts of speech to include new types of text such as emoticons
and special abbreviations. Part-of-speech analysis can be used as part of normalizing noisy text, or the part-of-speech patterns can themselves be used as features for classification.
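Surface cues of this kind can be captured with simple pattern counts. The feature set below is my own illustration of the idea, not Rao et al.’s exact feature set:

```python
import re

def sociolinguistic_features(tweet):
    """Count a few surface-level sociolinguistic cues in a raw tweet."""
    return {
        "exclamation_runs": len(re.findall(r"!{2,}", tweet)),   # e.g. "!!!"
        "ellipses":         len(re.findall(r"\.{2,}", tweet)),  # e.g. "..."
        "char_repetition":  len(re.findall(r"(\w)\1{2,}", tweet)),  # "sooooo"
        "possessives":      len(re.findall(r"\w+'s\b", tweet)),
    }

sociolinguistic_features("omg that was sooooo good!!! my friend's dog...")
# -> {'exclamation_runs': 1, 'ellipses': 1, 'char_repetition': 1, 'possessives': 1}
```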
Some research takes a different approach to noisy text, such as that found on Twitter. Before traditional text analysis is performed, noisy texts are often first cleaned or normalized. There are various ways to approach the text normalization problem, such as treating it as a spell-checking problem, as a machine translation problem, in which messages are translated from a noisy origin language to a target language, or as an automatic speech recognition (ASR) problem. ASR is often useful for analysis of texts such as SMS, since many of the out-of-vocabulary (OOV) words are phoneme abbreviations using numbers. Kaufmann and Kalita presented a system for normalizing Twitter messages into standard English. They observed that pre-processing tweets for orthographic modifications and Twitter-specific elements (@-usernames and hashtags) and then applying a machine
translation approach worked well. Gouws et al. built on the techniques of Contractor et al., using the pre-processing techniques of Kaufmann and Kalita, to determine the types of lexical transformations used to create OOV tokens in Twitter messages. Such transformations include phonemic character substitutions (“see” → “c”; “late” → “l8”), dropping trailing characters or vowels (“saying” → “sayin”), and phrase abbreviations (“laughing out loud” → “lol”). These transformations are discussed further in subsection 5.4. Gouws et al. analyzed patterns in the usage of these transformations against user time zone and Twitter client to see if there was a correlation, and found that variation in the usage of these transformations was correlated with user region and Twitter client.

In sum, prior work suggests that text-based age prediction is tenable and leaves room for additional study. This thesis seeks to extend prior work and analyze these transformations with respect
to user age.
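These transformation types can be illustrated with a toy heuristic that labels which transformation maps a standard form onto its noisy Twitter form. This is my own sketch of the categories, not the method of Gouws et al. or Contractor et al.:

```python
VOWELS = set("aeiou")

def transformation_type(standard, noisy):
    """Heuristically label the lexical transformation mapping a standard
    word or phrase onto its noisy form. A toy approximation only."""
    words = standard.split()
    # Phrase abbreviation: first letter of each word ("laughing out loud" -> "lol")
    if len(words) > 1 and noisy == "".join(w[0] for w in words):
        return "phrase abbreviation"
    # Dropped trailing characters ("saying" -> "sayin")
    if standard.startswith(noisy):
        return "dropped trailing characters"
    # Dropped vowels ("tomorrow" -> "tmrrw")
    if noisy == "".join(c for c in standard if c not in VOWELS):
        return "dropped vowels"
    # Phonemic substitution using numbers ("late" -> "l8")
    if any(c.isdigit() for c in noisy):
        return "phonemic substitution"
    return "other"

transformation_type("laughing out loud", "lol")  # -> 'phrase abbreviation'
transformation_type("saying", "sayin")           # -> 'dropped trailing characters'
transformation_type("late", "l8")                # -> 'phonemic substitution'
```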
There are presently some techniques to determine latent user attributes from general texts, but few that specifically target Twitter messages and their unique corpus characteristics. Of the works that have focused on Twitter messages, most have one of two main shortcomings: either they focus on a small set of data gathered from hand-picked users, where latent attributes are determined by human annotators from limited descriptions on user profiles or key tweeted phrases, rather than from information the Twitter users provide themselves; or they use the full set of Twitter users and messages, but tend to be limited to the latent attributes that are provided through the Twitter API.
Based on these observations, first, I present my solution to these issues through the collection of a new, more robust data set in which the tweeters themselves label their Twitter feeds with demographic information. Second, based on the work of Gouws et al., I hypothesize that the word and phrase abbreviation patterns used to write tweets are indicative of user age, just as they are indicative of a user’s region and Twitter client. Third and last, I hypothesize that usage of these abbreviations changes as a user ages or spends more time using the Twitter service, similar to the ways in which language changes as a person ages and community language use evolves. I present my experimental analysis of collected data seeking to examine these hypotheses.