To explore public concerns regarding rapidly evolving H1N1 activity, we
collected and stored a large sample of public tweets, beginning April 29, 2009,
that matched a set of pre-specified search terms: flu, swine, influenza,
vaccine, tamiflu, oseltamivir, zanamivir, relenza, amantadine, rimantadine,
pneumonia, h1n1, symptom, syndrome, and illness.
Additional keywords were used to examine other aspects of public concern,
including disease transmission in particular social contexts (i.e., keywords
travel, trip, flight, fly, cruise and
ship), disease countermeasures (i.e., keywords wash,
hand, hygiene
and mask), and consumer concerns
about pork consumption (i.e., keywords pork and
bacon). Each tweet is time-stamped and geolocated using the
author's self-declared home location. A client-side JavaScript application
was created to display a continuously updated Google map with the 500 most
recently matched tweets, yielding a real-time view of flu-related public
sentiment in geographic context. Anyone visiting the web site could read any
tweet by placing the cursor over its corresponding dot on the map, color-coded
by search term (Figure 1).
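The keyword matching described above can be sketched as a single case-insensitive pattern applied to each incoming tweet. This is a hypothetical sketch: the term list is taken from the text, but the whole-word matching semantics and all names are illustrative, since the article does not specify the exact matching rules.

```python
import re

# Search terms from the text: flu-related, social-context, countermeasure,
# and pork-consumption keywords.
SEARCH_TERMS = [
    "flu", "swine", "influenza", "vaccine", "tamiflu", "oseltamivir",
    "zanamivir", "relenza", "amantadine", "rimantadine", "pneumonia",
    "h1n1", "symptom", "syndrome", "illness",
    "travel", "trip", "flight", "fly", "cruise", "ship",
    "wash", "hand", "hygiene", "mask",
    "pork", "bacon",
]

# Pre-compile one alternation of whole-word patterns for speed.
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(t) for t in SEARCH_TERMS) + r")\b",
    re.IGNORECASE,
)

def matched_terms(tweet_text):
    """Return the set of search terms (lowercased) occurring in a tweet."""
    return {m.lower() for m in PATTERN.findall(tweet_text)}
```

Note that whole-word matching (`\b`) is one possible design choice: under it, "fly" would not match "flying"; substring matching would behave differently.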
Beginning on October 1, 2009, we collected an expanded sample of tweets using
Twitter's new streaming application programming interface (API) [10] with the
intent of estimating influenza activity. In addition, following discussions with
public health officials, new search terms were added to investigate concerns
about vaccine side effects and/or vaccine shortages: guillain, barré, barre,
shortage, hospital, and infection.
Note that the Twitter stream is filtered in accordance with Twitter's API
documentation; hence the tweets analyzed here constitute a representative
subset of the full stream rather than the stream in its entirety.
Moreover, because our main interest was to monitor influenza-related traffic
within the United States, we also excluded all tweets tagged as originating
outside the U.S., tweets from users with a non-U.S. timezone, and any tweets not
written in English. We also excluded all tweets with fewer than 5 characters,
tweets containing non-ASCII characters, and tweets sent by a client identifying
itself as “API” (the latter are usually computer-generated and
therefore tend to be “spam”). The remaining tweets were used to
produce a dictionary of English words, from which all tokens belonging to
Twitter's informal messaging conventions (e.g., #hashtag, @user,
RT, and links) were removed. Porter's Stemming Algorithm [11] was
used to reduce inflected words to their root forms (e.g., “knowing”
becomes “know”) in order to compress the size of the dictionary. We
then compiled daily and weekly usage statistics for each dictionary term (i.e.,
number of tweets in which each term occurred), both nationally (by aggregating
data for all valid locations) and at the CDC's influenza reporting region
level [12].
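The exclusion rules, convention stripping, stemming, and per-term counting above can be sketched as the following pipeline. This is a hypothetical sketch: the tweet field names are illustrative, and the `stem` function is a simplified suffix-stripping stand-in for Porter's full rule cascade [11], not the authors' implementation.

```python
import re
from collections import Counter

# Twitter messaging conventions to strip: #hashtags, @users, RT, and links.
TWITTER_CONVENTIONS = re.compile(r"(#\w+|@\w+|\bRT\b|https?://\S+)")

def keep(tweet):
    """Apply the exclusion rules described in the text (field names assumed)."""
    return (
        tweet.get("country") == "US"          # drop non-U.S. geotags
        and tweet.get("timezone_us", False)   # drop non-U.S. timezones
        and tweet.get("lang") == "en"         # drop non-English tweets
        and len(tweet["text"]) >= 5           # drop very short tweets
        and tweet["text"].isascii()           # drop non-ASCII tweets
        and tweet.get("client") != "API"      # drop likely-spam API clients
    )

def stem(word):
    """Simplified stand-in for Porter's stemming algorithm [11]; the real
    algorithm applies a cascade of suffix rules (e.g., 'knowing' -> 'know')."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def daily_term_counts(tweets):
    """Per day, count the number of tweets in which each stemmed term occurs."""
    counts = {}
    for tweet in filter(keep, tweets):
        text = TWITTER_CONVENTIONS.sub(" ", tweet["text"])
        terms = {stem(w) for w in re.findall(r"[a-z]+", text.lower())}
        bucket = counts.setdefault(tweet["date"], Counter())
        bucket.update(terms)  # set, so each term counts once per tweet
    return counts
```

Using a set of stems per tweet matches the text's definition of usage statistics as the number of tweets in which a term occurred, rather than the total number of occurrences.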
Finally, because the volume of posts on Twitter varies over time as well as
across geographic regions, usage statistics were expressed in terms of the
fraction of the total tweets emitted within the corresponding time interval and
geographic region.
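The normalization step amounts to dividing each term's tweet count by the total number of tweets emitted in the same time interval and geographic region (function name hypothetical):

```python
def normalized_usage(term_counts, total_tweets):
    """Express each term's tweet count as a fraction of all tweets emitted in
    the same time interval and region, so that periods and regions with
    different overall Twitter volume become comparable."""
    return {term: count / total_tweets for term, count in term_counts.items()}
```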