Wikipedia is one of the world’s biggest information sources, disseminating information on virtually every topic on earth. Yet, Wikipedia is compromised by a relatively small number of vandals. Vandalism is not limited to Wikipedia itself, but is widespread in most social networks. In our paper, we present a system called VEWS for early identification of vandals before any human or known vandalism detection system reports vandalism.
VEWS exploits differences in the editing behavior of different types of editors to tell them apart. It combines two kinds of features: distinguishing behavioral features, identified by statistically analyzing the dataset, and an automatically built transition probability matrix, which captures each editor's footprint as a sequence of edits.
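As a rough illustration of the transition probability matrix idea, the Python sketch below counts how often one kind of edit follows another in a single editor's history and normalizes each row into probabilities. The edit categories (`new_page`, `re_edit`, `meta_page`) and the function names are hypothetical, chosen only for this example; the paper defines the states VEWS actually uses.

```python
def transition_matrix(sequence, states):
    """Row-normalized first-order transition probabilities between states."""
    counts = {s: {t: 0 for t in states} for s in states}
    # Count each consecutive pair (previous edit, current edit).
    for prev, curr in zip(sequence, sequence[1:]):
        counts[prev][curr] += 1
    matrix = {}
    for s in states:
        total = sum(counts[s].values())
        # A state that is never left gets an all-zero row.
        matrix[s] = {t: (counts[s][t] / total if total else 0.0) for t in states}
    return matrix


def flatten(matrix, states):
    """Flatten the matrix row by row into a fixed-length feature vector."""
    return [matrix[s][t] for s in states for t in states]


# Hypothetical edit categories, for illustration only.
STATES = ["new_page", "re_edit", "meta_page"]
edits = ["new_page", "re_edit", "re_edit", "meta_page", "re_edit", "new_page"]

tpm = transition_matrix(edits, STATES)
features = flatten(tpm, STATES)
print(features)
```

Every editor yields a vector of the same length regardless of how many edits she made, which makes the flattened matrix convenient as classifier input.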
Here we only present a few interesting results. For details, please have a look at the paper.
Insights into vandal editing behavior:
We found very interesting insights into the behavior of vandals and benign users through our analysis:
VEWS is very effective in early identification of vandals:
VEWS achieves an accuracy of 87.82% in classifying vandals and benign users, using 10-fold cross validation. Even when classifying based only on the first edit, the accuracy is 77.4%. As the number of edits used for classification increases, the accuracy increases as well. When information about reverts of the editor's edits is used, the performance improves further. So, overall, VEWS can effectively determine whether an editor is a vandal from her first few edits!
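The evaluation protocol described above (10-fold cross validation, repeated while using only the first few edits of each editor) can be sketched in plain Python. This is a toy under stated assumptions, not the VEWS classifier: the single feature `meta_fraction`, the threshold classifier, and the synthetic editors are all hypothetical, and the toy data is perfectly separable, so its accuracy says nothing about the numbers reported in the paper.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_val_accuracy(features, labels, train_fn, k=10, seed=0):
    """Train on k-1 folds, test on the held-out fold, pool the results."""
    correct = 0
    for held_out in k_fold_indices(len(features), k, seed):
        held = set(held_out)
        train_x = [f for i, f in enumerate(features) if i not in held]
        train_y = [y for i, y in enumerate(labels) if i not in held]
        predict = train_fn(train_x, train_y)
        correct += sum(predict(features[i]) == labels[i] for i in held_out)
    return correct / len(features)


def train_threshold(features, labels):
    """Toy classifier: threshold halfway between the two class means.
    Assumes both classes appear in the training folds."""
    vandal = [f for f, y in zip(features, labels) if y]
    benign = [f for f, y in zip(features, labels) if not y]
    threshold = (sum(vandal) / len(vandal) + sum(benign) / len(benign)) / 2
    return lambda f: f < threshold  # True means "vandal"


def meta_fraction(edits, n_edits):
    """Hypothetical feature: share of the first n_edits made on meta pages."""
    first = edits[:n_edits]
    return sum(e == "meta" for e in first) / len(first)


# Synthetic editors (illustration only): label True marks a vandal.
editors = [["meta", "article", "meta"]] * 10 + [["article"] * 3] * 10
labels = [False] * 10 + [True] * 10

accuracy_by_edits = {}
for n_edits in (1, 2, 3):
    feats = [meta_fraction(e, n_edits) for e in editors]
    accuracy_by_edits[n_edits] = cross_val_accuracy(feats, labels, train_threshold)
print(accuracy_by_edits)
```

Truncating each history to its first `n_edits` entries before extracting features is what makes this an early-detection measurement: the classifier never sees what the editor did later.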
The performance improves by using ClueBot NG and STiki:
VEWS handily beats today's leaders in vandalism detection - ClueBot NG and STiki - at the task of detecting vandals. On average, VEWS flags a vandal after the vandal has made 2.13 edits, while ClueBot NG needs 3.78 edits.
However, VEWS benefits from adding features from ClueBot NG. This gives a fully automated system. The performance further increases when adding two features from STiki!
Download the PDF.
See the presentation slides.
Abstract
We study the problem of detecting vandals on Wikipedia before any human or known vandalism detection system reports flagging potential vandals so that such users can be presented early to Wikipedia administrators. We leverage multiple classical ML approaches, but develop 3 novel sets of features. Our Wikipedia Vandal Behavior (WVB) approach uses a novel set of user editing patterns as features to classify some users as vandals. Our Wikipedia Transition Probability Matrix (WTPM) approach uses a set of features derived from a transition probability matrix and then reduces it via a neural net auto-encoder to classify some users as vandals. The VEWS approach merges the previous two approaches. Without using any information (e.g. reverts) provided by other users, these algorithms each have over 85% classification accuracy. Moreover, when temporal recency is considered, accuracy goes to almost 90%. We carry out detailed experiments on a new data set we have created consisting of about 33,000 Wikipedia users (including both a black list and a white list of authors) and containing 770,000 edits. We describe specific behaviors that distinguish between vandals and non-vandals. We show that VEWS beats ClueBot NG and STiki, the best known algorithms today for vandalism detection. Moreover, VEWS detects far more vandals than ClueBot NG and on average, detects them 2.39 edits before ClueBot NG when both detect the vandal. However, we show that the combination of VEWS and ClueBot NG can give a fully automated vandal early warning system with even higher accuracy.
Bibtex
@inproceedings{kumar2015vews,
  author = {Kumar, Srijan and Spezzano, Francesca and Subrahmanian, V.S.},
  title = {VEWS: A Wikipedia Vandal Early Warning System},
  booktitle = {Proceedings of the 21th ACM SIGKDD international conference on Knowledge discovery and data mining},
  year = {2015},
  organization = {ACM},
}
Download the generalized UMDWikipedia dataset.
Download the exact UMDWikipedia dataset used in the paper, along with the code.
Please feel free to contact us for any additional data and code that would be useful for you!