As online texts increasingly become objects of linguistic study, researchers have come to recognize the situated nature of such texts and the uniqueness of the social relations they represent. Rather than unproblematically extending traditional objects of analysis to internet data, researchers have turned their attention to the unique possibilities and practices of interaction from which these data emerge. The following review covers sociolinguistic perspectives on gender in internet communication and alternatives to the study of identity that have emerged in internet research. It then surveys writing specific to Twitter and to the interactive features of the text that will
later be analyzed in this original study.
Internet researchers have clearly established that communities exist online, but also that they take unexpected or novel forms. For example, as Kozinets (2010) reports from the field of Internet ethnography, internet researchers identify indistinct boundaries to community membership in online participation; yet, users create cohesive social orders and develop norms
specific to their online environments. Baym and boyd (2011) recognize online communities as complex, multi-layered sites of interaction, where interaction may be more or less public based on audience and participation that emerges; as Marwick and boyd wrote of Twitter in 2011, this continuum of contexts is even collapsed into individual Tweets, which are shown
indiscriminately to followers who are more or less familiar with Twitter users across the many domains they Tweet about. In a 2011 paper, Gruzd et al wrote about the experience of Twitter through the eyes and connections of one user, concluding that the platform offered an “imagined community” with shared norms, one that is both collective (in that Tweets can be accessed by the whole platform at any time) and personal (because users are situated and findable within their own unique identities). Wu et al (2011) introduce the term “masspersonal communication” in the context of Twitter, where communication is neither clearly mass (one to many) nor personal (one to one) but must serve both communicative functions.
Research on online communication increasingly asks how its saliencies, shaped by the unique context of the platform, differ from those of offline communication. For example, Androutsopoulos’ discourse analytic work approaches MySpace users’ identity through code variation with attention to the platform’s specific, interactive functions, rather than by researching users’
offline macro-demographics; as he urges, analysis of language in social media must be grounded in an understanding of the functions and salient features of the platforms where the data is produced (2010). In the case of Twitter, research looks to the performance of multiple interpersonal involvements in a shared environment. Gillen and Merchant’s 2011 autoethnography concludes that participation in the platform implies both the possibility of participation in social connections (e.g., being retweeted or replied to) and a limiting of potential actions to those supported by Twitter; this openness to interaction and potential polyvocality within the whole group leads to a “partial intersubjectivity.” Java et al (2009) provide an account of Twitter by asking what users’ intentions are; they posit that users participate in multiple communities split between different interests, and that they are led primarily by three intentions: information sharing, information receiving, and
friend-wise communication. Marwick and boyd (2011) write about this as “context collapse”: users negotiate person-to-person and topic-specific connections within an environment that does not allow separation of audiences and contexts. Twitter users must imagine their audiences, as they cannot know which followers read which Tweets, and they often negotiate audience design based on Tweet content.
A study that contrasts THEY with gendered pronouns must ask if gender is performed differently or more easily left unspecified online; according to previous research, online communication has not made gender irrelevant but rather reinforces it in unique ways. As Herring (2003) writes, online communication creates the potential for genderless self-portrayal. Since it is text-based, online communication lacks the visual and auditory gender cues of other forms of communication. Theoretically, gender could remain anonymous or irrelevant. Still, Herring’s work finds that actual practices continue to be gendered, as users make their gender overt or “give themselves away” with styles typically associated with gender, and that the gendered discourse produced exhibits the same asymmetries as offline communication. Other research suggests that the possibilities of genderless representation online actually promote greater foregrounding of gender. In her 2002 research on text-based online Multi-User Dungeons, Sundén writes that “textual talk” constitutes gender and online bodies for users. Despite a plurality of gender options, most characters conform to a male/female gender binary. Players who select other options (such as “neutral” or “plural”) are characterized by repeated attention to gender and sex in their texts, and one player who prefers the pronoun “it” is referred to with feminine pronouns by a familiar user.
Sundén writes that the environment of uncertainty about physical bodies leads users “to textually re-inscribe familiar categories on the level of sex and gender, to insist on a system of recognizable differences” (301). In 2012, Bamman et al locate gender in Twitter users’ patterns of lexical items. Though some features do not follow sex-based patterns precisely as found in offline communication, most users’ gender can be accurately predicted from an analysis of their style. Interestingly, those whose gender cannot be predicted tend to have Twitter connections mostly to members of the opposite sex, whose style they mirror. This suggests that Twitter, which does not formally solicit gender information from
users, is not a genderless environment, but rather emergent norms in the community lead to self-revelation.
Much more research on Twitter communication has focused on the creation of coherence through the textually based tagging system that connects Tweets and users. Possible connections are both direct and diffuse, and the hashtag and @-mention have been of special interest to linguists. According to Zappavigna (2011, 2012), relations are created when users
include platform-specific tags (hashtags, @-tags) to create “searchable talk” in their Tweets.
Although the platform is not unique among computer-mediated discourse for being easily searchable by keyword, the use of these tags makes searchability social in Twitter. In the case of the hashtag, users create “ambient affiliation” between disparate texts including the tag, forming a cloud of affiliation based on content rather than one-to-one, user-based connections.
Page (2012) writes that hashtags project potential interaction and enable visibility through the search connections they create. Page’s article also informs my view of the Tweet texts from a trending topic as a co-constructed discourse: she writes that hashtags create an asymmetrical rather than dialogic connection by broadcasting talk “about” rather than creating talk “with” others. In a 2009 study of the @-tag, Honeycutt and Herring found coherent, dyadic conversations centered around the @-tag in about a third of its uses. Tweets appear across Twitter in the order in which they are submitted and without regard to topic, creating a lack of turn adjacency that could make coherence difficult. However, the @-tag in Tweets often sparked coherent and
collaborative threads, whose content is significantly less self-focused and more addressive than that of Tweets that do not include an @-tag. Sousa et al (2010) explore these ties quantitatively, tying them to social networks: they observe that users with smaller networks @-mention to create social ties with others across topics, creating a dense network, while users with larger networks
show somewhat more disjointed mentioning practices, mentioning users based on topic of Tweets (though the authors maintain that these topic-motivated ties are still significantly social).
Retweeting, especially, has inspired study that ranges from qualitative inquiry to computational modeling of message dissemination or user influence. boyd, Golder, and Lotan (2010) identify retweeting as a social and conversational practice. Though the practice of retweeting has changed significantly since the time of their writing (especially since the platform now supports a native Retweet functionality; when they wrote, users manually copied and prefaced messages), the authors show that audience design is a crucial consideration in retweeting, as users attempt to spread messages to broader audiences or, specifically, to their own followers. Retweets also served to open a line of commentary or conversation on the content of a Tweet or to validate or publicly agree with another user. By examining which types of third person portrayals are retweeted, I observe the shareability of messages. However, a retweeted text suggests that it is not specific to the original author, but rather has broader applicability and resonance, suggesting less personalization. Using quantitative models, several authors have attempted to uncover why certain messages spread or to define influence within the platform. Wu et al (2011) expose a great imbalance in the platform: in asking “who Tweets what to whom?” they uncover not a comprehensive view of Tweets, but the finding that 0.05% of users attract 50% of attention, mostly after their Tweets have been passed through intermediary users. The authors prefer the term “information sharing network” to “social network” because of the pervasiveness of non-reciprocal ties. Cha et al (2010), noting that tags, as directed links, can display multiple stances, nevertheless develop a taxonomy of tag-based influence: influence measured by frequency of Retweets shows that the content is influential, while mentions of a user reflect that user’s name value.
The authors give agency to Twitter users, stating that influence is won only through Twitter users’ personal effort and involvement, focusing on a single topic to concentrate influence. They also note that influence comes only from the susceptibility of society. Leavitt et al (2009) also distinguish between conversational (@-mention-based) and content (Retweet-based) influence, and add a conception of this movement of Tweets/user links as a social action. Suh et al (2010), looking to create a program that could engineer Retweets, look for correlates such as the number of past Tweets by the author. These authors often look at both social network context and content features of Tweets to determine retweetability. Hashtags and URLs are found by several to significantly increase a Tweet’s likelihood of propagation (Suh et al 2010, Naveed et al 2011). The concept of “interestingness” is advanced by a few authors, who calculate this based on similarity of words
in retweets to words in a user’s original Tweets (Yang et al, 2010, showing a correlation), and a cluster of features like interpersonal and popular topic, question marks, and negativity (Naveed et al, 2011). Naveed et al are, in fact, able to model the probability of Retweeting based solely on the content of Tweets without using any social network data. They determine that “a tweet is likely to be retweeted when it is about a general, public topic instead of a narrow, personal topic,” or goes so far as to be addressive.
In this tradition, my study eventually turns to the propagation of messages within Twitter. However, this paper approaches Twitter through textual data with only limited information disclosed about social networks. That is, only connections (@-mention user-to-user connections; shared hashtags connecting texts) explicitly performed in the text are available for
analysis, in contrast to the collection techniques of many other researchers, who crawl user networks. My collection technique, though lacking the depth of data of those studies, offers a perspective on the “surface level” connections that are easily viewable on the platform. For most users, the connections of the other, geographically disparate participants in the hashtag are unknowable and obscure, while the connections made in their texts are the primary experience of the stream. Here, I approach the question from linguistic and social premises, specifically asking what insight the propagation I find offers into linguistic form, a topic left largely unexplored in these studies.


Approaches to correlating pronouns with extralinguistic features

The following study emerges from the basic premise that the patterns of alternation of linguistic items offer insight into extra-semantic information about them. In other words, it will attempt to locate a pattern around its pronoun variables in order to analyze it for contributions to the meaning of the variable. This tenet is adapted from several linguistic subfields to fit the
scope and type of data analyzed here. The following review first introduces principles about the denotational value of words (including pronouns) and then moves on to two larger-scale, quantitative approaches, one from corpus pragmatics and one from sociolinguistic variation, that guide my approach to the volume of data in my Twitter dataset. This search for extra information in terms of reference, which fuels small-scale, qualitative studies of identity and positioning, has also been applied to larger, corpus-based studies. Brown and Gilman’s seminal 1960 crosslinguistic study explored the tendencies of
second-person pronouns. Based on questionnaire data, the authors examined variation in second-person “T” and “V” pronouns (so named after the Latin “tu” and “vos”). This variation correlated with “objective relationships” between speaker and hearer, and the authors found that larger patterns in the use of one pronoun or the other could be used to create speakers’
“expressive styles.” The authors characterized the pronouns by the relationships they projected, as the “pronouns of power and solidarity”: V is associated with power and T with solidarity.
Later, further research emerged in the style of sociolinguistic variation that tied words’ meaning to the tendencies in surrounding, extralinguistic information. Variationist study is rooted in the study of strictly semantically equivalent variants like allophones. However, its objects of study have been rethought, and the principles underlying the field have been found to be fruitful in
the study of not only semantically equivalent variants, but of differences in meaning, and of not only macro-demographic categories but also of the creation of social distinction. Since at least Lavandera’s 1978 essay, researchers have considered the implications of expanding the scope of the sociolinguistic variable, observing that the “tendencies and frequencies” of many levels of linguistic variation may carry social meaning. Lavandera suggests a standard of “functional comparability” for variants rather than semantic equivalence. Increasingly, variationist methods have been used to analyze variants’ meanings in practice rather than to correlate different realizations of the variable with distinct demographic groups. This approach to variation in practice especially applies to online variation, where macro-demographics do not have the same salience as offline and are certainly not always knowable. It seems especially applicable to a study of third-person singular pronouns, which formally have the same truth conditions (and different pragmatic felicity conditions), but are prima facie used differently in discourse.
Several previous variationist studies have successfully uncovered associations of their variables by studying their contexts. For example, in their 1995 study of constructed dialogue, Ferrara and Bell analyzed patterns of “BE + like” in corpora of spoken narratives as compared to other variants like “say.” The differing correlations with person and advancement of the narrative show that “BE + like” is linked to internal states. In Kiesling’s 2004 study of the word “dude” in fraternity men’s speech and in a corpus collected by his students, the relationships between speaker and hearer showed the word to be associated with a stance of “cool solidarity.”
Finally, in a 2010 paper reviewing the study of variation in larger units of language, Pichler puts it succinctly: “in some cases, function may even exert a more important constraint on discourse
variability than social factors” (597).
Indexical, social meaning is also increasingly sought in variationist studies. Third-wave variationist work now seeks to situate variants’ meaning in their social context; meaning and sociality are seen as inseparable. Arguing that meaning – including meaning of variables – is emergent in interaction, Eckert (2008) identifies the study of local social meaning and
indexicalities of variants as the ultimate goal of variationist study. In a 2012 essay, Eckert further defines this ‘third wave’ of variationist research. This wave focuses on how combinations of linguistic features create meaning in interaction, “foreground(ing) the relation between language use and the kinds of social moves that lead to the inscription of new
categories and social meanings” (95). This study, once led through a simple examination of patterns in tokens, looks to the corpus level for the inscription of meaning by the textual practices that constitute the corpus. In this way, this paper draws from the premise that corpora offer insight into the emergence of meaning in discourse.
I also draw upon other corpus-based approaches, especially those used to examine pragmatic content. This approach to meaning in use has been studied increasingly in computational studies, which approach semantic and pragmatic questions in the tendencies of large corpora. Word meanings are taken as coming partially from the words’ collocation with
other types of words and phrases; as Stubbs (2001) argues, “our knowledge of a language is not only a knowledge of individual words, but of their predictable combinations, and of the cultural knowledge which these combinations often encapsulate” (3). This has, in at least one case, meant correlating extralinguistic features with target words to discern those words’ force: in a 2009 paper, Constant et al explore the use of expressives based on their use in Amazon reviews. The term “expressives” describes words with, among other properties, pragmatic content about emotional states independent of truth values (cf Potts 2007), like “damn” and “bastard.” The authors of the 2009 paper correlate the frequency of several such
words with the extralinguistic feature of level and valence of emotion in a corpus of reviews, as indicated by the number of stars in each review where an expressive is used. These tendencies allow for a more nuanced understanding of conditions of use and of the meaning expressives contribute to the texts in which they’re found. Though pronouns are not
addressed in the 2009 corpus-based paper, one co-author, Potts, writes about their expressive properties in a 2007 paper. Reminiscent of the assertions made in the sociolinguistic work cited above, Potts writes that second person T and V pronouns create expressive meanings in addition to propositional content: “the expressive setting — the indicated
relationship between speaker and addressee — is different.” This thesis, in examining the content about relationships created by the Retweet and @-tag, draws upon the work of this corpus-based approach to expressive meaning and expands it to the area of pronouns. In this study of pronouns, I attempt to apply this principle to the tendencies of THEY, with a similar
attention to how features of the corpus above the level of individual texts offer insight into elements of those texts.


The following section describes the collection of a Twitter corpus based on a common, referent-introducing hashtag; the corpus is analyzed around the third-person singular pronouns linked to that hashtag. It proceeds by detailing the collection techniques of the initial corpus of 9031 Tweets, then providing an introduction to the sort of referent constituted in the hashtag
all Tweets have in common. Then, the annotation of Tweets by third-person pronoun is discussed and the methods of further, token-level annotation used in previous studies are shown to be poorly applicable to this data. Finally, I turn to a view of the texts as interrelated, clustering the corpus based on “core” texts within Tweets, thereby finding a clear distinction in the scale of dissemination of several Tweets that include THEY.

Collecting the #oomf-corpus

The corpus for this study was collected on February 5, 2013, between 6:54 and 7:21 PM EST, after I saw #oomf in the US Trends box while using the platform. Using ten calls to the Twitter Search API with the search term #oomf, which returns Tweets including that precise string in the Tweet text, I collected over 10,000 public Tweets. The corpus was filtered for
duplicates using Tweet-specific id numbers, leaving 9031 unique Tweets to comprise the corpus, which the timestamps indicate were all tweeted within 32 minutes of each other.
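The duplicate filtering described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the script used for this study: it assumes each collected Tweet is held as a dictionary with hypothetical `id` and `text` fields, the function name `deduplicate_by_id` is my own, and the toy batches are invented stand-ins for overlapping Search API results.

```python
def deduplicate_by_id(tweets):
    """Keep the first occurrence of each Tweet id, preserving collection order."""
    seen_ids = set()
    unique = []
    for tweet in tweets:
        if tweet["id"] not in seen_ids:
            seen_ids.add(tweet["id"])
            unique.append(tweet)
    return unique


# Invented toy batches: Tweet 2 is returned by both calls.
batch_one = [
    {"id": 1, "text": "#oomf is such a flirt, and he knows it."},
    {"id": 2, "text": "i really like #oomf"},
]
batch_two = [
    {"id": 2, "text": "i really like #oomf"},
    {"id": 3, "text": "i really like #oomf"},
]

corpus = deduplicate_by_id(batch_one + batch_two)
print(len(corpus))
```

Note that Tweets 2 and 3 survive as separate items even though their texts are identical: filtering on the id removes repeated retrievals of the same Tweet, not repeated strings.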
The primary unit of analysis for this research is the Tweet text (that is, the brief textual strings submitted by users when they tweet) of those 9031 Tweets. Each of those tokens constitutes a unique action and a unique object within the platform and is open equally for propagation and interaction. This social uniqueness is not reflected on a textual level, however, and many of the strings making up Tweet texts are non-unique: the 9031 tokens are found to represent 6001 “types” as defined by complete string identity (that is, the Tweet texts being precisely the same combination of characters; see section V-iii. Sub-corpus composition for a continued discussion of how similarity between Tweet texts is analyzed in this research). The 9031 Tweets originate from 8031 unique accounts, meaning that users averaged 1.1 Tweets each in the corpus.
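The token, type, and account counts reported above can be illustrated with a short sketch. It is offered only as an illustration under assumptions: the field names `user` and `text`, the function name `corpus_summary`, and the toy data are hypothetical, and a “type” is counted, as in the text, by complete string identity.

```python
def corpus_summary(tweets):
    """Count tokens (Tweets), types (distinct text strings), and accounts."""
    types = {tweet["text"] for tweet in tweets}
    accounts = {tweet["user"] for tweet in tweets}
    return {
        "tokens": len(tweets),
        "types": len(types),
        "accounts": len(accounts),
        "tweets_per_account": len(tweets) / len(accounts),
    }


# Invented toy data: three Tweets, two distinct strings, two accounts.
toy_corpus = [
    {"user": "a", "text": "i really like #oomf"},
    {"user": "b", "text": "i really like #oomf"},
    {"user": "a", "text": "#oomf is such a flirt, and he knows it."},
]
summary = corpus_summary(toy_corpus)

# The per-account average reported for the full corpus, from its own totals.
reported_average = round(9031 / 8031, 1)
```

Exact string identity is a deliberately strict criterion: two Tweet texts that differ by a single character count as two types here.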
Querying the language codes attached to Tweets indicates that 8546 of them, or nearly 95%, are in English. However, as the language code is attached to Tweets based on the users’ interface language and not individual Tweet texts, my informal survey of the corpus suggests that the share of Tweets in English is actually quite close to 100% (perhaps because #oomf is an English-language abbreviation, see next section). At any rate, as this is a study of English-language pronouns in Tweet texts, only Tweet texts determined to include English pronouns in English texts were identified for further analysis in the next phases.
This corpus offers a sample of #oomf-including, public Tweets produced during a short time period. It serves as a sample of naturally occurring data, but it is not a complete set of Tweets using the hashtag, due to several factors in the sampling. Both the timing of the sampling and Twitter’s Search API itself make this an incomplete set. First, Tweets from short
periods during the collection may have been missed, depending on when the calls were sent to the Search API (that is, if there was a gap between the last Tweet in one batch of results and the first Tweet in the next). Second, Twitter’s Search API does not use any authentication and therefore indexes only public Tweets. Some users’ Tweets are viewable only to followers whom
they have approved, but none of those are included in the present dataset. Third, the results that the Search API returns are “focused in relevance and not completeness” (Twitter n.d., Twitter 2012a). The site is vague about what parameters it uses to include (or exclude) certain results in its search, though it refers explicitly to spam filters. Still, there is no reason to believe
that any of these limitations on the corpus affects the pattern of pronouns found in the Tweets collected, and much less that they would affect different pronouns unequally. I conceptualize the corpus as a collection of naturally occurring language in which I can explore possibilities and tendencies emerging from a comparable context.

Introducing #oomf, a Twitter-specific antecedent

The corpus was collected to bring together texts about referents introduced with the hashtag #oomf. Such a referent, in this environment, is specific and is not gender-marked: #oomf stands for “one of my followers” but conventions of use restrict it to definite followers who are not named. The hashtag has a meaning parallel to the English meaning of the tag, which develops
out of users’ affiliation with the trend. As explored in the literature review, hashtags are affiliative; this paper’s corpus captures Tweets that overwhelmingly follow implicit conventions in meaning and tag position. This concerted use of #oomf allows me to treat the sort of referents introduced by #oomf as of a certain type unless otherwise indicated, where individual Tweets are often too short to include disambiguating context in and of themselves. By then studying pronouns linked to #oomf as an antecedent, it is possible to hold that aspect of the referents studied constant.¹
#oomf is a Twitter acronym that has been conventionalized to introduce a specific referent. It abbreviates “one of my followers,” but the hashtag has a more specialized use than the English phrase. As discussed in the introduction to this thesis, Twitter users asymmetrically create connections to each other by subscribing to other users’ Tweets or, in Twitter terms, “following” other users (becoming other users’ “followers”). Each Tweet by a user who is being followed is included in the followers’ streams, where the follower may see and interact with them. The tag #oomf is used in order to talk about one of these followers who will see the Tweet. Since followers are, by definition, Twitter users, tweeting about #oomf represents a choice to refer to the follower in a way that becomes part of a larger, shallow network of Tweets including the hashtag; unlike using an @-mention to refer to a follower, it does not
create direct connections between users.
#oomf is not completely equivalent to “one of my followers.” The English phrase “one of my followers” has non-specific uses, meaning roughly “any of my followers” as in “One of my followers has my phone, so let me know if it’s you”, or “an indeterminate one of my followers” as in “One of my followers is lying to me, and I am going to be so angry when I find out who.” The Twitter hashtag, in comparison, is conventionalized for use in only the specific sense of the phrase (meaning roughly, “a specific one of my followers”, as in the Tweet, “#oomf is such a flirt, and he knows it.”). On, a website where users offer and vote on the accuracy of definitions for Twitter hashtags, one user-submitted entry from March 2011 defines #oomf as “talking about someone without using their real name. so you use #oomf so nobody knows .
the persons actual name. EX. “i really like #oomf””. Buzzfeed, a site that aggregates “best of” lists of internet artifacts like pictures and YouTube videos, devotes a November 2012 article to #oomf after it spiked in popularity over the previous days. Tweeting with #oomf is defined as “hashtagged, institutionalized subtweeting” (that is, the practice of ‘subliminal tweeting,’ or
Tweeting about another person without using that person’s name or username) (Herman 2012). These sources also hint that #oomf is usually used within a sentence rather than outside of it to tag the Tweet. To this, I add five of my own observations based on the corpus analyzed in this study.