The raw data for Hurricane Sandy comprise two distinct sets of messages, both obtained through the analytics company Topsy Labs. The first set consists of messages with the hashtag “#sandy” posted between 15 October and 12 November 2012. The data include the text of the messages and a range of additional information, such as message identifiers, user identifiers, follower counts, retweet statuses, self-reported or automatically detected location, time stamps, and sentiment scores. The second data set has a similar structure and was collected within the same time frame; however, instead of being selected by hashtag, it comprises all messages containing one or more of a set of keywords considered relevant to the event and its consequences (“sandy,” “hurricane,” “storm,” “superstorm,” “flooding,” “blackout,” “gas,” “power,” “weather,” “climate,” etc.; see table S1 for the full list). In total, for Hurricane Sandy, we have 52.55 million messages from 13.75 million unique users.
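For concreteness, a per-message record of this kind and the keyword filter can be sketched as follows. The field names, types, and the partial keyword set are illustrative assumptions, not the schema delivered by Topsy (the full keyword list is in table S1):

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative subset of the keyword list; see table S1 for the full list.
KEYWORDS = {"sandy", "hurricane", "storm", "superstorm", "flooding",
            "blackout", "gas", "power", "weather", "climate"}

@dataclass
class Message:
    """One message record; field names here are assumed, not Topsy's schema."""
    message_id: str
    user_id: str
    text: str
    follower_count: int
    is_retweet: bool
    location: Optional[str]     # self-reported or automatically detected
    timestamp: int              # Unix epoch seconds (assumed representation)
    sentiment: Optional[float]  # sentiment score supplied with the data

def is_relevant(msg: Message) -> bool:
    """True if the message text contains at least one keyword instance."""
    text = msg.text.lower()
    return any(keyword in text for keyword in KEYWORDS)
```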
Data for the additional disasters were obtained in two ways. For the disasters that occurred during 2013, the data were purchased from Gnip, a data reseller and Twitter subsidiary. For each disaster, we used the geographic boundary of the affected region and collected all messages that contained a preselected set of keywords (“storm,” “rain,” “flood,” “wind,” “tornado,” “mudslide,” “landslide,” “quake,” “fema”). Data for the events from 2014 were extracted from continuously collected geo-tagged tweets from the United States via Twitter’s Streaming Application Programming Interface (API).
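A minimal sketch of such a streaming collector, assuming the v1.1 statuses/filter endpoint that was in service at the time (it has since been retired by Twitter) and OAuth 1.0a credentials; the bounding box is an approximation of the contiguous United States, not necessarily the boundary used in the study. The `locations` parameter selects tweets by their geo-coordinates, independently of any keywords:

```python
import json
from requests_oauthlib import OAuth1Session

# Bounding box (SW lon, SW lat, NE lon, NE lat) roughly covering the
# contiguous United States; the exact boundary used is our assumption.
US_BBOX = "-125.0,24.0,-66.0,50.0"

# The v1.1 filter endpoint that was in service in 2014 (since retired).
STREAM_URL = "https://stream.twitter.com/1.1/statuses/filter.json"

def stream_us_geotagged(consumer_key, consumer_secret, token, token_secret):
    """Yield geo-tagged tweets whose coordinates fall inside US_BBOX."""
    session = OAuth1Session(consumer_key, consumer_secret, token, token_secret)
    response = session.post(STREAM_URL, data={"locations": US_BBOX}, stream=True)
    response.raise_for_status()
    for line in response.iter_lines():
        if line:  # skip keep-alive newlines sent by the stream
            yield json.loads(line)
```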
Data sets obtained from the data providers (Topsy and Gnip) are subsets of the full historical archive (“high-fidelity” data). The Streaming API, although rate-limited to roughly 1% of the full tweet volume, also offers almost complete coverage in our case: only about 1 to 1.5% of all messages are geo-enabled, so a request restricted to a geographic boundary captures more than 90% of natively geo-coded messages (69).
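The arithmetic behind this coverage claim can be made explicit; the daily firehose volume below is an assumed round figure used only for illustration:

```python
# Back-of-envelope check; the daily firehose volume is an assumed round
# figure, while the ~1% cap and the 1-1.5% geo-enabled share follow (69).
FIREHOSE_PER_DAY = 500_000_000           # assumed total tweets/day, circa 2014
stream_cap = 0.01 * FIREHOSE_PER_DAY     # ~1% public-stream rate limit
geo_enabled = 0.0125 * FIREHOSE_PER_DAY  # midpoint of the 1-1.5% share

# Worldwide geo-tagged volume is already close to the cap, and a single
# bounding box selects only a fraction of it, so the filtered stream
# stays under the limit and captures >90% of geo-coded messages.
print(f"stream cap: {stream_cap:,.0f}/day; geo-enabled: {geo_enabled:,.0f}/day")
```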