Millions of Twitter users take advantage of Twitter’s ability to broadcast information pertaining to whatever topics they choose; likewise, millions of Twitter users receive an unending flood of undercategorized information. Computer-mediated social networking services such as Twitter are increasingly popular and, due to the vast amount of textual data contained in them, are also increasingly relevant to linguists and other members of the Natural Language Processing (NLP) community. The specific issues that such social networks present are diverse and challenging; this work is an attempt to name, quantify, and resolve one such challenge, i.e., the task of implementing a means of extracting more data from Twitter communications.
Twitter as a Medium of Communication
Java and colleagues were among the first to rigorously examine features of Twitter. Using network-theoretic methods to study topological properties of Twitter’s social network, they derived a typology of Twitter users with three main categories: information sources, information seekers, and generalized friends. Additionally, they argued for a classification of user intent over individual tweets and described four main categories of such intent: daily chatter composed of one-off tweets; conversations that arise from such chatter and use the @ symbol to name participants; information sharing through the dissemination of URLs; and reporting of news or updates on current events. Naaman and colleagues extended this user and tweet analysis and found a bifurcated structure of user behavior based on tweet content, concluding that while some users are primarily information sources, a majority of users post tweets of a more self-centered nature. More recently, Bandari and colleagues examined a different aspect of social media: its role in propagating online news items. They collected a data set of articles via a news feed and then counted the number of times each article was linked to on Twitter.
From a training subset of the articles, they also identified a set of features that, considered together, were the most relevant for predicting an article’s Twitter popularity, as measured by the number of times its link was included in a tweet. In testing, they achieved an overall accuracy of 84% in classifying each article into one of three popularity tiers.
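The popularity measure described above (the number of tweets that include an article's link, grouped into three tiers) can be illustrated with a minimal sketch. This is not the procedure used by Bandari and colleagues: the function names, the URL pattern, and the tier thresholds `low_max` and `mid_max` are hypothetical choices made only for illustration, and the sketch assumes that tweet text contains full, unshortened article URLs.

```python
import re

# Assumption: links appear in tweet text as plain http(s) URLs.
URL_PATTERN = re.compile(r"https?://\S+")


def count_article_links(tweets, article_urls):
    """Return a dict mapping each article URL to the number of tweets linking to it."""
    counts = {url: 0 for url in article_urls}
    for text in tweets:
        # Strip trailing punctuation and deduplicate, so that a link
        # repeated within one tweet is counted once for that tweet.
        urls_in_tweet = {u.rstrip(".,;:!?)") for u in URL_PATTERN.findall(text)}
        for url in urls_in_tweet:
            if url in counts:
                counts[url] += 1
    return counts


def popularity_tier(link_count, low_max=1, mid_max=2):
    """Map a link count to one of three tiers (thresholds are hypothetical)."""
    if link_count <= low_max:
        return "low"
    if link_count <= mid_max:
        return "medium"
    return "high"


if __name__ == "__main__":
    tweets = [
        "Read this http://example.com/a now",
        "http://example.com/a and http://example.com/b.",
        "no links here",
    ]
    articles = ["http://example.com/a", "http://example.com/b"]
    for url, n in count_article_links(tweets, articles).items():
        print(url, n, popularity_tier(n))
```

In practice the thresholds would be chosen from the observed distribution of link counts, and shortened links would need to be resolved to their target URLs before counting.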
The Parameters of Twitter
Twitter belongs to the class of computer-mediated social networking services that have evolved from the BBS ecosystem of the internet’s early days to the current environment dominated by services predicated on a network architecture. As Naaman and colleagues noted, these services, exemplified by Twitter, Facebook, and others, constitute a new type of communication technology that is distinguishable by three factors: (a) the semi-public characteristics of the discourse, (b) the brevity of the discourse content, and (c) the network-driven nature of the discourse. Hereinafter these services will be referred to generically as Social Network Services (SNSs). The reader is encouraged to refer to previous academic research on the unique aspects of these emerging communication systems. A brief examination of some specific aspects of Twitter is given here; this overview highlights some issues inherent in the research discussed in this work.