Quantifying and analyzing controversiality on Wikipedia articles.
Recently I have been messing around with the Wikidata API (https://www.wikidata.org/wiki/Wikidata:REST_API), which gives access to essentially all information from Wikipedia articles. After poking around with sentiment analysis and general trends, I stumbled across a 2012 paper by Sumi & Yasseri that analyzed edit wars on Wikipedia and defined a controversiality metric. That made me want to quantify the controversiality of some articles and visualize the results.
I decided to look at 3 different categories:
- 222 different countries and dependencies as listed in this CSV
- The current 20 teams in the English Premier League soccer (football?) league
- Theorems as listed on Wikipedia's List of Theorems. I filtered to only include articles with at least 500 edits, resulting in 43 unique pages.
All the data comes from the English Wikipedia and is extracted using their REST API.
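As a rough illustration of what pulling an article's edit history can look like, here is a minimal sketch. It assumes the MediaWiki Action API (`action=query` with `prop=revisions`) rather than the REST API mentioned above, and it is not necessarily what retrieve_data.py does. The helper names `revision_query_params`, `parse_revisions` and `fetch_revisions`, as well as the User-Agent string, are made up for this example.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def revision_query_params(title, rvcontinue=None):
    """Build the query parameters for one page of an article's revision history."""
    params = {
        "action": "query",
        "format": "json",
        "formatversion": "2",
        "prop": "revisions",
        "titles": title,
        "rvprop": "ids|timestamp|user|sha1",  # sha1 lets us spot identical texts later
        "rvlimit": "max",
        "rvdir": "newer",  # oldest revision first
    }
    if rvcontinue is not None:
        params["rvcontinue"] = rvcontinue
    return params


def parse_revisions(payload):
    """Extract (user, sha1) tuples from one API response (formatversion=2 layout)."""
    revisions = []
    for page in payload.get("query", {}).get("pages", []):
        for rev in page.get("revisions", []):
            # Hidden or suppressed fields are simply missing, so use .get().
            revisions.append((rev.get("user"), rev.get("sha1")))
    return revisions


def fetch_revisions(title):
    """Follow the API's continuation tokens until the full history is collected."""
    revisions, rvcontinue = [], None
    while True:
        query = urllib.parse.urlencode(revision_query_params(title, rvcontinue))
        request = urllib.request.Request(
            f"{API_URL}?{query}",
            headers={"User-Agent": "controversiality-sketch/0.1 (example)"},
        )
        with urllib.request.urlopen(request) as response:
            payload = json.load(response)
        revisions.extend(parse_revisions(payload))
        rvcontinue = payload.get("continue", {}).get("rvcontinue")
        if rvcontinue is None:
            return revisions
```

Calling `fetch_revisions("Gibraltar")` would then return the article's history as a list of (user, sha1) pairs, oldest first.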
Controversiality Metric
The 2012 paper presents a method for quantifying the likelihood of an undesirable edit war within a Wikipedia article by creating a Controversiality Metric. The metric was designed to identify articles in which large factions of reputable editors seem to genuinely disagree about facts within the article. To do this, the algorithm gives the most weight to reverts between editors who have both contributed a lot to the article. The intended use case is to let Wikipedia quickly identify pages with ongoing conflict and see if anything can be done to resolve it.
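It is defined as
$$M = E \sum_{(N_i^d,\, N_j^r) < \max} \min\left(N_i^d, N_j^r\right)$$
Where
- M: The final measure of controversiality for a Wikipedia article.
- E: Number of editor pairs who have ever reverted each other’s edits in the article
- N_i^d: The total number of edits in the given article by the user who edited revision i
- N_j^r: The total number of edits in the given article by the user who edited revision j
The sum runs over the pairs of editors who have reverted each other, leaving out the pair with the largest value.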
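To make the definitions above more concrete, here is a minimal sketch of how such a measure could be computed from a revision history. It rests on several assumptions that are not spelled out in this article: a revert is detected when a revision's text hash matches that of an earlier revision, the reverted editor is taken to be the author of the revision right after that earlier one, each mutually reverting pair is weighted by the smaller of the two editors' edit counts, and the single highest-weight pair is left out of the sum. The function name `controversiality` and the input format are made up for illustration.

```python
from collections import Counter


def controversiality(revisions):
    """Compute a controversiality score from a revision history.

    `revisions` is a list of (user, text_hash) tuples, oldest first.
    """
    edit_counts = Counter(user for user, _ in revisions)

    last_seen = {}   # text hash -> index of the most recent revision with that text
    reverts = set()  # directed (reverter, reverted) pairs
    for j, (user, text_hash) in enumerate(revisions):
        k = last_seen.get(text_hash)
        # j restores the text of revision k, undoing revision k + 1 (and later ones).
        if k is not None and j - k >= 2:
            reverted = revisions[k + 1][0]
            if reverted != user:  # ignore self-reverts
                reverts.add((user, reverted))
        last_seen[text_hash] = j

    # Keep only pairs where each editor has reverted the other at least once.
    mutual = {frozenset(pair) for pair in reverts if (pair[1], pair[0]) in reverts}
    weights = sorted(min(edit_counts[a], edit_counts[b]) for a, b in map(tuple, mutual))
    if not weights:
        return 0
    # E (number of mutual pairs) times the summed weights, minus the top pair.
    return len(mutual) * sum(weights[:-1])
```

With this formulation, an article where only one pair of editors keeps reverting each other scores 0, so a purely personal feud between two people does not register as a controversy.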
However, as a page grows in popularity and more people contribute to it, the number of potential editor pairs, E, increases quadratically, since each new editor can potentially revert changes made by any of the previous editors. This is a problem because it makes it difficult to compare controversiality between articles of varying popularity. To mitigate this, I found that I could effectively normalize the data by dividing M by the square of the total number of edits, resulting in the normalized controversiality metric:
$$M_{normed} = \frac{M}{N_{total}^2}$$
where N_total is the total number of edits to the article.
The difference between the unnormalized and normalized controversiality metrics can be seen below in Fig. 1. With the original M calculation, values range from 0 to 1 billion, with a noticeable quadratic trend when plotting M against the total number of edits. In contrast, the normalized M_normed shows a much more uniform spread of controversiality.
For the remainder of this analysis, all controversiality values use the normalized equation.
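The quadratic growth and the normalization described above can be sketched in a few lines. The function names `potential_pairs` and `normalized_controversiality` are made up for this example.

```python
def potential_pairs(n_editors):
    """Number of distinct editor pairs that could revert each other."""
    return n_editors * (n_editors - 1) // 2


def normalized_controversiality(m, total_edits):
    """Divide the raw score M by the square of the article's total edit count."""
    if total_edits <= 0:
        raise ValueError("total_edits must be positive")
    return m / total_edits ** 2
```

Going from 10 to 100 editors raises the number of potential pairs from 45 to 4950, which is why the raw score is hard to compare across articles of different sizes.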
All Data
First things first, let's take a look at all of the data, sorted by the number of edits and coloured by category.
There are a few things we can note from the chart. First, theorems are clearly the least controversial and least edited category of the ones selected. Also, 9 of the 10 most controversial pages are countries or dependencies. The full top 10 is:
- Poland
- Gibraltar
- Afghanistan
- Falkland Islands
- Macedonia
- United States
- Chile
- Canada
- Turkey
- Manchester United F.C.
Comparing Categories
To see whether any categories truly have a different spread of data, I created a boxplot of the data in Fig. 3.
From this we can see that Premier League teams have the highest median controversiality, followed by countries, with theorems far behind. However, countries have a fairly large number of notable outliers above the upper fence.
Now that we’ve looked at the data combined, I am going to take a deeper look within some of the categories.
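For readers unfamiliar with boxplot terminology, the median and upper fence mentioned above can be computed as in the sketch below. It assumes the common convention of placing the upper fence at Q3 + 1.5 × IQR and linear-interpolation quartiles; the plotting library used for the figure may differ slightly. The function name `box_stats` is made up, and the numbers in the usage line are placeholders, not values from this analysis.

```python
import statistics


def box_stats(values):
    """Return the median, upper fence and upper outliers used in a boxplot."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    upper_fence = q3 + 1.5 * (q3 - q1)
    outliers = [v for v in values if v > upper_fence]
    return {"median": median, "upper_fence": upper_fence, "outliers": outliers}


# Placeholder scores: one value sits far above the rest.
stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 100])
```

Here the median is 5, the upper fence is 13, and 100 is the only value flagged as an outlier.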
Countries
For countries, I thought it would be fun to plot a choropleth map of each country's controversiality to see if there are any clear geographic trends.
There are a few things of note in Fig. 7. First, it is clear that Africa is the least controversial continent. This may be because the data comes from the English Wikipedia, where many editors simply do not have strong opinions on individual African nations. Another geographical area of note is North America, with all 3 of the major countries, Canada, the USA and Mexico, being relatively controversial articles.
Here are the most and least controversial country articles:
Top 10 most controversial Countries (and regions):
Top 10 least controversial Countries (and regions):
Most of the least controversial regions are small islands.
Premier League Teams
Finally, let's take a look at the Premier League teams from the 2023–24 season. To make things interesting, I decided to plot the controversiality alongside the total payroll of each team according to https://www.spotrac.com/epl/payroll/.
There does appear to be a noticeable correlation between payroll and controversiality. This is likely because the teams with the highest salaries are also the ones with the highest public profile and the most fans (and haters).
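One way to put a number on a relationship like this is the Pearson correlation coefficient. The sketch below is a plain-Python implementation for illustration only; it is not taken from the notebooks, the function name `pearson` is made up, and no coefficient is reported here for the actual payroll data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

Passing the list of team payrolls and the matching list of controversiality scores gives a value between -1 and 1, where values near 1 indicate that higher payrolls go along with higher controversiality.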
As for the rankings in full, here is how they stack up:
Hairy Ball Theorem
During this analysis I briefly looked at the actual edit histories of some of the articles to see what these “controversies” really look like. Most relate to disagreements over specific wordings. For example, Gibraltar has had a long-standing edit war between people stating it is a “Spanish Autonomous City” and those stating it is a “British Overseas Territory”. Another example is editors swapping “Brazil” to “Brasil”. These debates can get quite heated, often devolving into petty insults and people tossing their credentials back and forth. Nevertheless, both sides tend to be editing in good faith and the conclusions tend to work themselves out.
However, when looking at theorems I noticed that the 2nd most controversial article was named “Hairy Ball Theorem”. This, naturally, piqued my interest.
The actual theorem is a mathematical result that colloquially states “you can’t comb a hairy ball flat without creating a cowlick”. However, when going into the edit history I saw a humorous and predictable edit war raging.
One faction, the vandals, repeatedly changed the opening line of the page to “the hairy ballsack theorem”, while the other faction, the purists, worked to keep the sanctity of the page intact by reverting “ballsack” back to “ball”. (Un)Fortunately, the purists have been winning. Although this edit war was, for the most part, won almost a decade ago, there have still been some lone attempts: one in 2017, where the opening line was changed to “Colin’s hairy balls theorem”, and another in January of this year to change it to the more subtle “hairy balls theorem”, although that too was quickly reverted.
Additional Analysis
To avoid veering into politics, which is against the guidelines of the Better Programming publication, the editors and I decided it would be best to remove an additional analysis section on the controversiality of Presidents' Wikipedia pages. If you are interested in this analysis, you can find it self-published here: https://medium.com/@vertadam/what-makes-a-controversial-wikipedia-article-presidents-4c5d4be2d7ab
Code
All the code used to generate this analysis can be found here: https://github.com/VertAdam/wikipediaControvesialityAnalysis.
The code used to extract the data can be found in retrieve_data.py, whereas the code used to create the analysis can be found within the Jupyter notebooks: exploratory_findings.ipynb for the analysis of all data, Premier League teams.ipynb for the analysis of Premier League teams, and countries.ipynb for the analysis of countries.
All the data used can be found in the various .CSV files stored within the repo.
Connect with me!
You can find me on LinkedIn at https://www.linkedin.com/in/adam-vert/. I’m always looking to meet more people interested in data science, so feel free to reach out!
What Makes a Controversial Wikipedia Article? was originally published in Better Programming on Medium, where people are continuing the conversation by highlighting and responding to this story.