Published on December 5th, 2012 | by EJC0
Predicting the Credibility of Disaster Tweets Automatically
“Predicting Information Credibility in Time-Sensitive Social Media” is one of this year’s most interesting and important studies on “information forensics”. The analysis, co-authored by my QCRI colleague ChaTo Castillo, will be published in Internet Research and should be required reading for anyone interested in the role of social media in emergency management and humanitarian response. The authors study disaster tweets and find that there are measurable differences in the way they propagate. They show that “these differences are related to the news-worthiness and credibility of the information conveyed,” a finding that enabled them to develop an automatic and remarkably accurate way to identify credible information on Twitter.
The new study builds on previous research that analyzed the veracity of tweets during a major disaster. That research found “a correlation between how information propagates and the credibility that is given by the social network to it. Indeed, the reflection of real-time events on social media reveals propagation patterns that surprisingly has less variability the greater a news value is.” The graphs below depict this information propagation behavior during the 2010 Chile Earthquake.
The graphs depict the retweet activity during the first hours following the earthquake. Grey edges depict past retweets. Some of the retweet graphs reveal interesting patterns even within 30 minutes of the quake. “In some cases tweet propagation takes the form of a tree. This is the case of direct quoting of information. In other cases the propagation graph presents cycles, which indicates that the information is being commented and replied, as well as passed on.” When studying false rumor propagation, the analysis reveals that “false rumors tend to be questioned much more than confirmed truths [...].”
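The distinction between tree-shaped and cyclic propagation can be sketched in a few lines of Python. This is only an illustration, not the authors’ method: it assumes a propagation graph is given as a list of directed (source, target) user pairs, and the function name `has_cycle` and the sample edge lists are made up for the example.

```python
from collections import defaultdict, deque


def has_cycle(edges):
    """Return True if the directed graph given as (source, target) pairs has a cycle."""
    successors = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1
        nodes.update((src, dst))

    # Kahn's algorithm: repeatedly remove nodes that have no incoming edges.
    queue = deque(n for n in nodes if indegree[n] == 0)
    removed = 0
    while queue:
        node = queue.popleft()
        removed += 1
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)

    # Nodes that could never be removed sit on, or downstream of, a cycle.
    return removed < len(nodes)


# Hypothetical data: direct quoting forms a tree, replies can close a loop.
tree_like = [("a", "b"), ("a", "c"), ("b", "d")]
with_replies = [("a", "b"), ("b", "c"), ("c", "a")]

print(has_cycle(tree_like))
print(has_cycle(with_replies))
```

Under these assumptions, a graph with no cycle corresponds to information being passed on by direct quoting, while a cycle suggests users are commenting on and replying to each other.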
Building on these insights, the authors studied over 200,000 disaster tweets and identified 16 features that best separate credible and non-credible tweets. For example, users who spread credible tweets tend to have more followers. In addition, “credible tweets tend to include references to URLs which are included on the top-10,000 most visited domains on the Web. In general, credible tweets tend to include more URLs, and are longer than non credible tweets.” Furthermore, credible tweets tend to express negative feelings, whilst non-credible tweets concentrate more on positive sentiments. Finally, question marks and exclamation marks tend to be associated with non-credible tweets, as are first- and third-person pronouns. All 16 features are listed below.
• Average number of tweets posted by authors of the tweets on the topic in the past.
• Average number of followees of authors posting these tweets.
• Fraction of tweets having a positive sentiment.
• Fraction of tweets having a negative sentiment.
• Fraction of tweets containing a URL that contain the most frequent URL.
• Fraction of tweets containing a URL.
• Fraction of URLs pointing to a domain among top 10,000 most visited ones.
• Fraction of tweets containing a user mention.
• Average length of the tweets.
• Fraction of tweets containing a question mark.
• Fraction of tweets containing an exclamation mark.
• Fraction of tweets containing a question or an exclamation mark.
• Fraction of tweets containing a “smiling” emoticon.
• Fraction of tweets containing a first-person pronoun.
• Fraction of tweets containing a third-person pronoun.
• Maximum depth of the propagation trees.
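Several of the text-based features in this list are simple fractions over all tweets on a topic. The Python sketch below shows how a handful of them might be computed. It is an illustration under stated assumptions, not the authors’ implementation: the regular expressions, the pronoun lists, the feature names and the sample tweets are all invented for the example.

```python
import re

# Assumed patterns and word lists, chosen only for illustration.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
FIRST_PERSON = {"i", "me", "my", "mine", "we", "us", "our", "ours"}
THIRD_PERSON = {"he", "she", "it", "they", "him", "her", "them",
                "his", "hers", "its", "their", "theirs"}


def words(text):
    """Lower-case word set of a tweet, with URLs removed first."""
    return set(re.findall(r"[a-z']+", URL_RE.sub(" ", text.lower())))


def fraction(tweets, predicate):
    """Share of tweets for which predicate(tweet) is true."""
    return sum(1 for t in tweets if predicate(t)) / len(tweets) if tweets else 0.0


def topic_features(tweets):
    """Compute a subset of the topic-level features for a list of tweet texts."""
    return {
        "frac_url": fraction(tweets, lambda t: URL_RE.search(t) is not None),
        "frac_mention": fraction(tweets, lambda t: MENTION_RE.search(t) is not None),
        "frac_question": fraction(tweets, lambda t: "?" in t),
        "frac_exclamation": fraction(tweets, lambda t: "!" in t),
        "frac_question_or_exclamation": fraction(tweets, lambda t: "?" in t or "!" in t),
        "frac_first_person": fraction(tweets, lambda t: bool(words(t) & FIRST_PERSON)),
        "frac_third_person": fraction(tweets, lambda t: bool(words(t) & THIRD_PERSON)),
        "avg_length": sum(len(t) for t in tweets) / len(tweets) if tweets else 0.0,
    }


# Hypothetical tweets about one topic.
tweets = [
    "Tsunami warning lifted, details at http://example.org/alert",
    "Is it true the airport is closed?!",
    "They say the bridge collapsed! @someuser can you confirm?",
    "We felt it in Santiago",
]
features = topic_features(tweets)
for name, value in features.items():
    print(name, round(value, 2))
```

Features such as sentiment, author history or propagation depth need extra data (a sentiment lexicon, user profiles, the retweet graph) and are left out of this sketch.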
Using natural language processing (NLP) and machine learning (ML), the authors used the insights above to develop an automatic classifier for finding credible English-language tweets. This classifier achieved an AUC of 0.86. This measure, which ranges from 0 to 1, captures the classifier’s predictive quality. When applied to Spanish-language tweets, the classifier’s AUC was still relatively high at 0.82, which demonstrates the robustness of the approach.
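To make the AUC figure more concrete, here is a minimal Python sketch. It assumes AUC refers to the area under the ROC curve, which can be read as the probability that a randomly chosen credible item receives a higher score than a randomly chosen non-credible one. The function name `auc` and the labels and scores are invented for the example and are not data from the study.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # credible item ranked above non-credible item
            elif p == n:
                wins += 0.5   # tie: count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = credible, 0 = not credible) and classifier scores.
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.3, 0.4, 0.1]
print(round(auc(labels, scores), 3))
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why a score in the mid-0.8 range indicates a classifier that separates the two classes well without being flawless.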