SoatDev IT Consulting
  • October 7, 2023
  • Rss Fetcher

Quantifying and analyzing controversiality on Wikipedia articles.

Photo by Oberon Copeland @veryinformed.com on Unsplash

Recently I have been messing around with the Wikidata API (https://www.wikidata.org/wiki/Wikidata:REST_API), which gives access to essentially all information from Wikipedia articles. After poking around with sentiment analysis and general trends, I stumbled across a 2012 paper by Sumi & Yasseri that analyzes edit wars on Wikipedia and defines a controversiality metric. That made me want to quantify the controversiality of some articles and visualize it.

I decided to look at 3 different categories:

  1. 222 different countries and dependencies as listed in this CSV
  2. The current 20 teams in the English Premier League (soccer, or football, depending on where you live)
  3. Theorems as listed on Wikipedia's List of Theorems. I filtered these to only include articles with at least 500 edits, resulting in 43 unique pages.

All the data comes from the English Wikipedia and is extracted using their REST API.
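As a rough illustration of what pulling an edit history can look like, here is a minimal sketch. It assumes the MediaWiki Action API (`action=query` with `prop=revisions`) rather than whatever endpoint retrieve_data.py actually calls, and the function names and User-Agent string are made up for this example.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the MediaWiki Action API endpoint of the English Wikipedia.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_params(title, rvcontinue=None):
    """Query parameters for one batch of an article's revision history."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "ids|timestamp|user|sha1",  # sha1 lets us spot identical texts
        "rvlimit": "max",
        "rvdir": "newer",  # oldest revision first
        "format": "json",
        "formatversion": "2",
    }
    if rvcontinue:
        params["rvcontinue"] = rvcontinue
    return params


def parse_batch(payload):
    """Return (revisions, continuation token) from one decoded response."""
    pages = payload.get("query", {}).get("pages", [])
    revisions = pages[0].get("revisions", []) if pages else []
    return revisions, payload.get("continue", {}).get("rvcontinue")


def fetch_revisions(title):
    """Download every revision of one article (requires network access)."""
    revisions, token = [], None
    while True:
        url = API_URL + "?" + urllib.parse.urlencode(build_params(title, token))
        request = urllib.request.Request(
            url, headers={"User-Agent": "controversiality-sketch/0.1"}
        )
        with urllib.request.urlopen(request) as response:
            batch, token = parse_batch(json.load(response))
        revisions.extend(batch)
        if token is None:  # no continuation token means the history is complete
            return revisions
```

Calling `fetch_revisions("Gibraltar")` would return a list of dictionaries with the revision id, timestamp, user and content hash of each edit. Heavily edited pages need many requests, so caching the results to disk is worth considering.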

Controversiality Metric

The 2012 paper presents a method for quantifying the likelihood of an undesirable edit war within a Wikipedia article by creating a Controversiality Metric. The metric was designed to identify articles that have large factions of reputable editors who seem to have genuine disagreements about facts within the article. To do this, the algorithm heavily weights edits that were reverted by editors who had contributed a lot to the article. The use case is to give Wikipedia the ability to quickly identify pages with conflict in them and see whether anything can be done to resolve it.

It is defined as

Original Controversiality Metric

Where

  • M: The final measure of controversiality for a Wikipedia article
  • E: The number of editor pairs who have ever reverted each other’s edits in the article
  • N_i^d: The total number of edits in the given article by the user who edited revision i
  • N_j^r: The total number of edits in the given article by the user who edited revision j
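To make the symbols above concrete, here is a minimal sketch of how M could be computed from an edit history. It rests on several assumptions that are mine for this example and not necessarily what retrieve_data.py or the paper does in every detail: a revert is detected when a revision's content hash matches an earlier revision, the reverted editor is the author of the first revision that was undone, and M is taken as E times the sum of min(N_i^d, N_j^r) over all mutually reverting pairs.

```python
from collections import Counter


def find_reverts(revisions):
    """Return a set of (reverting_user, reverted_user) pairs.

    revisions: list of (user, sha1) tuples, oldest first.
    Assumption: revision j reverts revision i when the text of j is
    identical to the text of revision i - 1.
    """
    last_seen = {}  # sha1 -> index of the latest revision with that text
    reverts = set()
    for j, (user_j, sha1) in enumerate(revisions):
        k = last_seen.get(sha1)
        if k is not None and k + 1 < j:  # skip back-to-back identical texts
            user_i = revisions[k + 1][0]  # author of the first undone revision
            if user_i != user_j:  # ignore editors reverting themselves
                reverts.add((user_j, user_i))
        last_seen[sha1] = j
    return reverts


def controversiality(revisions):
    """Unnormalized M: E times the summed weights of mutually reverting pairs."""
    edits_per_user = Counter(user for user, _ in revisions)
    reverts = find_reverts(revisions)
    # A pair is "mutual" when each editor has reverted the other at least once.
    mutual = {frozenset(pair) for pair in reverts if (pair[1], pair[0]) in reverts}
    e = len(mutual)
    weight_sum = sum(min(edits_per_user[u] for u in pair) for pair in mutual)
    return e * weight_sum


# Hypothetical history: A and B keep restoring their own versions.
history = [("A", "h0"), ("B", "h1"), ("A", "h0"), ("B", "h1"), ("C", "h2")]
print(controversiality(history))  # 2
```

In the toy history there is one mutually reverting pair (A and B), and each of them has 2 edits, so M = 1 × min(2, 2) = 2. Taking the minimum means a pair only counts heavily when both editors are established contributors to the article.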

However, as a page grows in popularity and more people contribute to it, the number of potential editor pairs, E, increases quadratically, since each new editor can potentially revert changes made by any of the previous editors. This makes it difficult to compare controversiality between articles of varying popularity. To mitigate this, I found that I could effectively normalize the metric by dividing it by the square of the total number of edits, resulting in the normalized controversiality metric defined as:

M_normed = M / (total number of edits)^2

Normalized Controversiality Metric
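The normalization itself is a one-liner. The sketch below uses a hypothetical function name and made-up numbers purely to show the effect of dividing by the squared edit count.

```python
def normalize_controversiality(m, total_edits):
    """Divide the raw metric M by the square of the article's edit count."""
    if total_edits <= 0:
        raise ValueError("total_edits must be positive")
    return m / total_edits ** 2


# Made-up values: the second article has 10x the edits and 100x the raw M.
small_article = normalize_controversiality(2, 5)
large_article = normalize_controversiality(200, 50)
print(small_article, large_article)  # 0.08 0.08
```

Two articles whose raw M differs only because of the quadratic growth in edits end up with the same normalized score, which is the property needed to compare pages of different popularity.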

The difference between the unnormalized and normalized controversiality metric can be seen below in Fig. 1. With the original M calculation, we see values ranging from 0 to 1 billion, with a noticeable quadratic trend when plotting M against the total number of edits. In contrast, with our normalized M_normed we see a much more uniform spread of controversiality.

For the remainder of this analysis, all controversiality values use the normalized equation.

All Data

First things first, let's take a look at all of the data, sorted by the number of edits and coloured by category.

There are a few things we can note from the chart. First, theorems are clearly the least controversial and least edited category of the ones selected. Also, 9 of the 10 most controversial pages are countries or dependencies. The top 10 are:

  1. Poland
  2. Gibraltar
  3. Afghanistan
  4. Falkland Islands
  5. Macedonia
  6. United States
  7. Chile
  8. Canada
  9. Turkey
  10. Manchester United F.C.

Comparing Categories

In order to see if any categories truly have a different spread of data, I have created a boxplot of the data in Fig. 3.

From this we can see that Premier League teams have the highest median controversiality, followed by countries, with theorems far behind. However, countries have a fairly large spread of notable outliers above the upper fence.
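For readers unfamiliar with the term, the "upper fence" of a boxplot is conventionally the third quartile plus 1.5 times the interquartile range, and points beyond the fences are drawn as outliers. The sketch below computes these numbers with Python's standard library. The input values are placeholders rather than data from this analysis, and plotting libraries may use a slightly different quartile method than the one assumed here.

```python
import statistics


def box_stats(values):
    """Median, quartiles, fences and outliers as a boxplot would draw them."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "outliers": outliers,
    }


# Placeholder scores for one category, with one extreme page.
scores = [1, 2, 3, 4, 5, 6, 7, 8, 100]
stats = box_stats(scores)
print(stats["median"], stats["upper_fence"], stats["outliers"])  # 5.0 13.0 [100]
```

Running `box_stats` once per category gives the numbers behind each box, which makes it easy to list the outlier pages by name instead of reading them off the chart.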

Now that we’ve looked at the data combined, I am going to take a deeper look within some of the categories.

Countries

For countries, I thought it would be fun to plot a choropleth map of each country's controversiality to see if there are any clear geographic trends.

From Fig. 7, there are a few things of note. First, it is clear that Africa is the least controversial continent. This may be because the data comes from the English Wikipedia, and many of its editors simply do not have strong opinions on individual African nations. Another geographical area of note is North America, where all 3 of the major countries (Canada, the USA and Mexico) have relatively controversial articles.

Here are the most and least controversial Country articles:

Top 10 most controversial Countries (and regions):

Top 10 most Controversial Countries (and regions)

Top 10 least controversial Countries (and regions):

Top 10 least controversial countries and regions

Most of the least controversial regions are small islands.

Premier League Teams

Finally, let's take a look at the Premier League teams from the 2023–24 season. To make things interesting, I have decided to plot the controversiality alongside the total payroll of each team according to https://www.spotrac.com/epl/payroll/.

There does appear to be a noticeable correlation between payroll and controversiality. This is likely because the teams with the highest salaries are also the ones with the highest public profile and the most fans and haters.
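One way to put a number on a relationship like this is the Pearson correlation coefficient. The sketch below is a plain-Python version with a hypothetical function name. It is not taken from the notebooks, and the payroll and controversiality columns would need to be loaded from the repo's CSV files before it says anything about the actual teams.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(var_x * var_y)
```

A value near 1 would indicate that payroll and controversiality rise together, a value near 0 would indicate no linear relationship, and with only 20 teams any single value should be read cautiously.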

As for the rankings in full, here is how they stack up:

Controversiality for all Premier league teams

Hairy Ball Theorem

During this analysis I briefly looked at the actual edit histories of some of the articles to see what these “controversies” really look like. Most relate to disagreements over specific wordings. For example, Gibraltar has had a long-standing and very heated edit war between people stating it is a “Spanish Autonomous City” and those stating it is a “British Overseas Territory”. Another example is editors swapping “Brazil” to “Brasil”. These debates can get quite heated, often devolving into petty insults and people tossing their credentials back and forth. Nevertheless, both sides tend to be editing in good faith and the conclusions tend to work themselves out.

However, when looking at theorems I noticed that the 2nd most controversial article was named “Hairy Ball Theorem”. This, naturally, piqued my interest.

The actual theorem is a mathematical result that colloquially states “you can’t comb a hairy ball flat without creating a cowlick”. However, when going into the edit history I saw a humorous and predictable edit war raging.

One faction, the vandals, repeatedly changed the opening line of the page to “the hairy ballsack theorem”. The other faction, the purists, worked to keep the sanctity of the page intact by reverting “ballsack” back to “ball”. (Un)Fortunately, the purists have been winning. Although this edit war was, for the most part, won almost a decade ago, there have still been some lone attempts: one in 2017, where the opening line was changed to “Colin’s hairy balls theorem”, and another in January of this year to change it to the more subtle “hairy balls theorem”, although that too was quickly reverted.

Additional Analysis

To avoid veering into politics, which is against the guidelines of the Better Programming publication, the editors and I decided it would be best to remove an additional analysis section on the controversiality of Presidents' Wikipedia pages. However, if you are interested in this analysis, you can find it self-published here: https://medium.com/@vertadam/what-makes-a-controversial-wikipedia-article-presidents-4c5d4be2d7ab

Code

All the code used to generate this analysis can be found here: https://github.com/VertAdam/wikipediaControvesialityAnalysis.

The code used to extract the data can be found in retrieve_data.py, whereas the code used to create the analysis can be found within the Jupyter notebooks. The most notable are exploratory_findings.ipynb for the analysis of all data, Premier League teams.ipynb for the analysis of Premier League teams, and countries.ipynb for the analysis of countries.

All the data used can be found in the various .CSV files stored within the repo.

Connect with me!

You can find me on LinkedIn at https://www.linkedin.com/in/adam-vert/. I’m always looking to meet more people interested in data science, so feel free to reach out!


What Makes a Controversial Wikipedia Article? was originally published in Better Programming on Medium, where people are continuing the conversation by highlighting and responding to this story.
