26 May 2013
Of all the interesting events taking place today, one is the closing ceremony of the 2013 Cannes Film Festival.
If you have already seen our recent paper on Early Prediction of Movie Box Office Success based on Wikipedia Activity Big Data, you already know that I’m a big fan of movies.
In that paper we investigated the possibility of predicting the future success of movies based on the activity level of Wikipedia editors in combination with page view statistics. We applied a very simple linear model to a very rich set of Wikipedia transactional data and, well, in the end could make rather good “post-dictions” about a sample of US movies released in 2010.
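To give a feeling for what a "very simple linear model" means, here is a minimal sketch of an ordinary least-squares fit with a single predictor. It is not the model from the paper, which combines several activity measures; the function name `fit_line` and all the numbers are invented purely for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one predictor)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long series with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical, slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Toy numbers, NOT real data: a made-up activity measure per movie
# and a made-up revenue figure on an arbitrary scale.
activity = [1.0, 2.0, 3.0, 4.0]
revenue = [3.0, 5.0, 7.0, 9.0]

intercept, slope = fit_line(activity, revenue)
predicted = intercept + slope * 5.0  # "prediction" for a new movie
```

With real data one would of course use several predictors and check the fit on movies that were not used for fitting.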
We all know that “Prediction is very difficult, especially about the future!” So the question is whether we could use the method from that paper to predict anything about movie success in the future.
This is not what I want to talk about now! But on an adventurous Saturday evening, I did some data collection to see whether Wikipedia could give me a hint about the award winners of tonight’s Cannes closing ceremony.
There are 20 movies in the Competition section. All of them have an article in the English Wikipedia, though some are very short. First I collected some activity measures: the length of the article for each movie, how many times the page has been edited and by how many distinct editors, how many times the page has been viewed since the beginning of the Festival (by editors and random readers), and finally how many Wikipedia language editions have an article about the movie.
An interactive visualisation of the data is here (click on it!)
All pages together have been viewed more than 600,000 times. That’s a big number. However, I was surprised by the small number of edits and the even smaller number of editors: 15 articles have been edited fewer than 50 times, and by only around 5 editors! The average length of the 20 pages is 3700 bytes, just slightly more than a page. Most of the movies have an article in only 3 or 4 languages (including English).
Well, most of the movies are not released yet, which might explain why they are so under-represented in Wikipedia at the moment. Nevertheless, there are already interesting patterns.
The top-4 movies by page views are also among the top-4 in number of edits, editors, and language versions, and their articles are also relatively long. There is an exception though: The Past (the new drama by Oscar winner Asghar Farhadi), which is 8th in the page view ranking but has activity parameters comparable to those of the top-4.
Play around with the visualization; you may see other patterns.
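If you prefer numbers to pictures, the top-4 comparison can be checked by counting how many titles two rankings share. Below is a small sketch; the helper names `top_k` and `top_k_overlap` are my own, and the titles and counts are placeholders, not the collected data.

```python
def top_k(measure, k):
    """Return the set of the k titles with the largest values in `measure`."""
    ranked = sorted(measure, key=measure.get, reverse=True)
    return set(ranked[:k])


def top_k_overlap(measure_a, measure_b, k):
    """Count how many titles appear in the top k of both measures."""
    return len(top_k(measure_a, k) & top_k(measure_b, k))


# Placeholder titles and invented counts, NOT the real Cannes data.
views = {"Film A": 900, "Film B": 800, "Film C": 700, "Film D": 100, "Film E": 50}
edits = {"Film A": 40, "Film B": 90, "Film C": 10, "Film D": 60, "Film E": 5}

overlap = top_k_overlap(views, edits, 3)  # titles shared by both top-3 lists
```

An overlap equal to k means both measures agree on who is at the top, even if the order within the top differs.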
Now let’s focus on the top-3 most viewed articles, which are well separated from the rest of the movies: Only God Forgives, a thriller by Nicolas Winding Refn; Inside Llewyn Davis, the Coen Brothers’ drama; and Behind the Candelabra by Steven Soderbergh.
The first of these 3 movies was released on 22 May in France, which might explain why it is so popular. See the diagram below, which shows the daily page views from a week before the Festival opening until yesterday (click to enlarge).
The first peak is clearly due to the nomination announcement on 18 April, and the second peak of Only God Forgives is due to its release. So what I’m saying is that maybe the Coens have done a better job and we only need to wait until their movie reaches the market. We will see what the jury thinks about it!
Now you may think I’m a Coens fan, but no! My favourite directors among these 20 (actually 21, counting the Coen Brothers as 2!) are Roman Polanski and Asghar Farhadi, with Venus in Fur and The Past this year. Talking about directors, let’s have a look at the Wikipedia page view statistics of directors and compare them to those of their movies. The following figures show the daily views for those two directors and the movies they brought to Cannes this year. Yellow lines are for the movies and red ones for the corresponding directors (click to enlarge).
That’s interesting, isn’t it? The Wikipedia articles of Asghar Farhadi and his movie (right panel) are not only at the same level of “popularity”, but their fluctuations are also heavily correlated (the second peak comes from the movie release in France), whereas Roman Polanski (left panel) seems to be much more famous than his movie, with weird ups and downs in his data!
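"Heavily correlated" is judged by eye here, but it could be quantified with a Pearson correlation coefficient between the two daily view series. A minimal sketch follows; the function name `pearson` is my own and the two series are invented, not the real page view counts.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long series with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / math.sqrt(var_x * var_y)


# Invented daily page views, NOT real data: both series peak on the same day.
director_views = [100, 120, 300, 150, 110]
movie_views = [90, 115, 280, 160, 100]

r = pearson(director_views, movie_views)
```

Values close to 1 mean the two curves rise and fall together; values near 0 mean the fluctuations are unrelated.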
The last piece is on the main Wikipedia article about the event, 2013 Cannes Film Festival, with more than 123,000 visitors within the last 2 months. If someone wants a baseline for a detailed fluctuation analysis of individual movies, I would recommend the following diagram, which clearly shows the main events and the overall public interest in them.
And finally, don’t forget to take a look at our paper:
Mestyán, M., Yasseri, T., and Kertész, J. (2012) Early Prediction of Movie Box Office Success based on Wikipedia Activity Big Data. Forthcoming.