SoatDev IT Consulting
  • August 6, 2025
  • Rss Fetcher

TF-IDF (Term Frequency-Inverse Document Frequency) is commonly used in natural language processing to extract important words. The idea behind the statistic is that a word is important to a document if it occurs frequently in that document but appears in few of the other documents in the corpus.

The term-frequency (TF) of a word in a document is the probability of selecting that word at random from the document, i.e. the number of times the word appears in the document divided by the total number of words in the document.
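As a rough illustration of the definition above, here is a minimal Python sketch of term frequency. The function name `term_frequencies` and the tokenization rule (lowercase runs of letters) are assumptions made for the example; the post does not say how the text was tokenized.

```python
import re
from collections import Counter

def term_frequencies(text):
    """Return {word: count / total_words} for a single document."""
    # Assumption: a "word" is a run of letters, compared case-insensitively.
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    total = len(words)
    return {word: n / total for word, n in counts.items()}

tf = term_frequencies("In the beginning God created the heaven and the earth")
print(tf["the"])  # "the" accounts for 3 of the 10 words
```

Because each value is a count divided by the document length, the values for one document sum to 1, which matches the "probability of selecting that word at random" reading.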

Inverse document frequency (IDF) is not quite what the name implies. You might reasonably assume that inverse document frequency is the inverse (i.e. reciprocal) of document frequency, where document frequency is the proportion of documents containing the word. In other words, it would be the reciprocal of the probability that a randomly selected document contains the word. That’s almost right, except that you take the logarithm of that reciprocal.
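The description above amounts to IDF = log(N / df), where N is the number of documents and df is the number of documents containing the word. A small sketch follows, assuming the natural logarithm (the post does not state the base) and representing each document as a set of words; the name `inverse_document_frequency` and the toy data are hypothetical.

```python
import math

def inverse_document_frequency(word, documents):
    """Return log(N / df) for `word` over a list of word sets."""
    # Assumption: natural log. A different base only rescales every score.
    n_docs = len(documents)
    doc_freq = sum(1 for doc in documents if word in doc)
    if doc_freq == 0:
        raise ValueError(f"{word!r} does not occur in any document")
    return math.log(n_docs / doc_freq)

# Toy corpus of four documents, each reduced to its set of distinct words.
docs = [{"the", "ark"}, {"the", "lamb"}, {"the", "ark", "throne"}, {"the"}]
print(inverse_document_frequency("ark", docs))     # log(4/2)
print(inverse_document_frequency("throne", docs))  # log(4/1)
```

Some libraries use smoothed variants of this formula, so scores from other tools may not match this sketch exactly.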

TF-IDF for a word and a document is the product of TF and IDF for that word and document. You could say

TF-IDF = TF * IDF

where the “-” on the left side is a hyphen, not a minus sign.
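Putting the two pieces together, the product can be computed for every word in every document of a corpus. The following is only a sketch under stated assumptions, not the code behind the results in this post: the helper names (`tokenize`, `tf_idf`, `top_words`), the letters-only tokenizer, and the natural logarithm are all choices made for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Assumption: words are lowercase runs of letters.
    return re.findall(r"[a-z]+", text.lower())

def tf_idf(corpus):
    """corpus: {doc_name: text}. Returns {doc_name: {word: TF * IDF}}."""
    tokens = {name: tokenize(text) for name, text in corpus.items()}
    n_docs = len(tokens)
    # Document frequency: number of documents in which each word occurs.
    doc_freq = Counter()
    for words in tokens.values():
        doc_freq.update(set(words))
    scores = {}
    for name, words in tokens.items():
        counts = Counter(words)
        total = len(words)
        scores[name] = {
            w: (c / total) * math.log(n_docs / doc_freq[w])
            for w, c in counts.items()
        }
    return scores

def top_words(scores, name, k=10):
    """The k highest-scoring words for one document."""
    ranked = sorted(scores[name].items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k]

# Toy corpus; the document names and texts are made up for the example.
corpus = {
    "a": "the lamb and the throne",
    "b": "the ark and the flood",
    "c": "the ark and the dove",
}
scores = tf_idf(corpus)
print(top_words(scores, "a", k=2))
```

To apply this to a real corpus, the dictionary would map each document name to that document's full text.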

To try this out, let’s look at the King James Bible. The text is readily available, for example from Project Gutenberg, and it divides into 66 documents (books).

Note that if a word appears in every document, in our case every book of the Bible, then IDF = log(1) = 0. This means that common words like “the” and “and” that appear in every book get a zero score.

Here are the most important words in Genesis, as measured by TF-IDF.

laban: 0.0044
abram: 0.0040
joseph: 0.0037
jacob: 0.0034
esau: 0.0032
rachel: 0.0031
said: 0.0031
pharaoh: 0.0030
rebekah: 0.0029
duke: 0.0028

It’s surprising that Laban comes out on top. Surely Joseph is more important than Laban, for example. Joseph appears more often in Genesis than does Laban, and so has a higher TF score. But Laban only appears in two books, whereas Joseph appears in 23 books, and so Laban has a higher IDF score.
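The trade-off can be checked with the numbers quoted above: 66 books, Laban in 2 of them, Joseph in 23. The sketch below assumes the natural logarithm, although both ratios it prints are the same in any base. The TF values are only approximations, backed out of the rounded scores in the Genesis list.

```python
import math

N_BOOKS = 66  # number of books, as stated in the post

def idf(books_containing_word, n_books=N_BOOKS):
    # Assumption: natural log; the ratios below do not depend on the base.
    return math.log(n_books / books_containing_word)

idf_laban = idf(2)    # Laban appears in 2 books
idf_joseph = idf(23)  # Joseph appears in 23 books
idf_ratio = idf_laban / idf_joseph

# Dividing each rounded Genesis score by its IDF gives an approximate TF.
tf_laban = 0.0044 / idf_laban
tf_joseph = 0.0037 / idf_joseph
tf_ratio = tf_joseph / tf_laban

print(f"IDF ratio, laban/joseph: {idf_ratio:.2f}")
print(f"TF ratio, joseph/laban:  {tf_ratio:.2f}")
```

With these figures Laban's IDF is roughly 3.3 times Joseph's, while Joseph's TF is a bit under 3 times Laban's, so the product ends up slightly in Laban's favor.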

Note that TF-IDF only looks at sequences of letters. It cannot distinguish, for example, the person named Laban in Genesis from the location named Laban in Deuteronomy.

Another oddity above is the frequency of “duke.” In the language of the KJV, a duke was the head of a clan. It wasn’t a title of nobility as it is in contemporary English.

The most important words in Revelation are what you might expect.

angel: 0.0043
lamb: 0.0034
beast: 0.0033
throne: 0.0028
seven: 0.0028
dragon: 0.0025
angels: 0.0025
bottomless: 0.0024
overcometh: 0.0023
churches: 0.0022

You can find the top 10 words in each book here.

Related posts

  • Angles between words
  • Named entity recognition
  • Thou, thee, you, and ye

The post Using TF-IDF to pick out important words first appeared on John D. Cook.
