TF-IDF (Term Frequency-Inverse Document Frequency) is commonly used in natural language processing to extract important words. The idea behind the statistic is that a word is important to a document if it occurs frequently in that document but infrequently in the corpus the document came from.
The term-frequency (TF) of a word in a document is the probability of selecting that word at random from the document, i.e. the number of times the word appears in the document divided by the total number of words in the document.
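As a minimal sketch of the definition above, here is term frequency computed for a toy document. The function name `term_frequency` and the sample sentence are made up for illustration, and splitting on whitespace is an assumption, not necessarily how the scores in this post were produced.

```python
from collections import Counter

def term_frequency(words):
    """Map each word to its count divided by the total number of words."""
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}

# Toy document, made up for illustration.
doc = "in the beginning god created the heaven and the earth".split()
tf = term_frequency(doc)
print(tf["the"])  # 3 occurrences out of 10 words
```

Because each value is a count divided by the total, the TF values for a document sum to 1, which is what lets you read them as probabilities.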
Inverse document frequency (IDF) is not quite what the name implies. You might reasonably assume that inverse document frequency is the inverse (i.e. reciprocal) of document frequency, where document frequency is the proportion of documents containing the word. Or in other words, the reciprocal of the probability that a document selected at random contains the word. That’s almost right, except you take the logarithm: IDF is the log of that reciprocal, i.e. log(N/n), where N is the number of documents in the corpus and n is the number of documents containing the word.
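Here is a small sketch of IDF as just described: the logarithm of the reciprocal of document frequency. The function name, the toy documents, and the choice of natural logarithm are all assumptions for illustration. A different base only rescales every score by the same constant.

```python
import math

def inverse_document_frequency(word, documents):
    """log(N / n): N documents in total, n of them containing the word.

    Each document is a set of words. The word is assumed to occur in at
    least one document; otherwise n would be zero.
    """
    n_docs = len(documents)
    n_containing = sum(1 for doc in documents if word in doc)
    return math.log(n_docs / n_containing)

# Toy corpus of four documents, made up for illustration.
documents = [
    {"the", "lamb"},
    {"the", "beast"},
    {"the", "lamb", "throne"},
    {"the"},
]
print(inverse_document_frequency("throne", documents))  # log(4/1)
print(inverse_document_frequency("lamb", documents))    # log(4/2)
```

The rarer a word is across documents, the larger its IDF.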
TF-IDF for a word and a document is the product of the word’s TF in that document and the word’s IDF in the corpus. You could say
TF-IDF = TF * IDF
where the “-” on the left side is a hyphen, not a minus sign.
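Putting the two pieces together, the following is one way to sketch the product for every word in every document of a corpus. The names `tf_idf` and `top_words`, the toy corpus, and the use of the natural logarithm are assumptions for illustration, not a description of the code behind the numbers in this post.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Return {doc_name: {word: score}} for a corpus {doc_name: list_of_words}."""
    n_docs = len(corpus)
    # Document frequency: the number of documents containing each word.
    doc_freq = Counter()
    for words in corpus.values():
        doc_freq.update(set(words))
    scores = {}
    for name, words in corpus.items():
        counts = Counter(words)
        total = len(words)
        scores[name] = {
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
    return scores

def top_words(word_scores, k=10):
    """The k highest-scoring words, best first."""
    return sorted(word_scores, key=word_scores.get, reverse=True)[:k]

# Toy corpus, made up for illustration.
corpus = {
    "a": "the lamb and the throne".split(),
    "b": "the beast and the dragon".split(),
    "c": "the lamb and the angel".split(),
}
scores = tf_idf(corpus)
print(top_words(scores["b"], k=2))
```

In document "a" the words "lamb" and "throne" have the same TF, but "throne" scores higher because it appears in fewer documents.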
To try this out, let’s look at the King James Bible. The text is readily available, for example from Project Gutenberg, and it divides into 66 documents (books).
Note that if a word appears in every document, in our case every book of the Bible, then IDF = log(1) = 0. This means that common words like “the” and “and” that appear in every book get a zero score.
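To see the range of possible IDF values for a 66-document corpus, here is a quick check. It assumes the natural logarithm. The variable names are made up for illustration.

```python
import math

N_BOOKS = 66  # number of documents in the corpus

# IDF for a word appearing in every book, half the books, and one book.
idf_by_count = {n: math.log(N_BOOKS / n) for n in (66, 33, 1)}
for n, idf in idf_by_count.items():
    print(f"appears in {n:2d} books: IDF = {idf:.2f}")
```

A word in all 66 books has IDF exactly 0, so its TF-IDF is 0 no matter how large its TF is. A word confined to a single book gets the maximum possible IDF, log(66), which is about 4.19.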
Here are the most important words in Genesis, as measured by TF-IDF.
laban: 0.0044
abram: 0.0040
joseph: 0.0037
jacob: 0.0034
esau: 0.0032
rachel: 0.0031
said: 0.0031
pharaoh: 0.0030
rebekah: 0.0029
duke: 0.0028
It’s surprising that Laban comes out on top. Surely Joseph is more important than Laban, for example. Joseph appears more often in Genesis than does Laban, and so has a higher TF score. But Laban appears in only two books, whereas Joseph appears in 23 books, and so Laban has a higher IDF score.
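The arithmetic behind that comparison can be sketched as follows, using the book counts quoted above. The natural logarithm is an assumption, though the ratio of the two IDF values is the same in any base.

```python
import math

N_BOOKS = 66
books_containing = {"laban": 2, "joseph": 23}  # counts quoted in the text

idf = {word: math.log(N_BOOKS / n) for word, n in books_containing.items()}
for word, value in idf.items():
    print(f"{word}: IDF = {value:.2f}")

# Joseph's TF would have to exceed Laban's by more than this factor
# for Joseph to get the higher TF-IDF score.
ratio = idf["laban"] / idf["joseph"]
print(f"ratio = {ratio:.2f}")
```

Laban’s IDF is about 3.50 and Joseph’s about 1.05, so Joseph would need a TF more than about 3.3 times Laban’s to come out ahead.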
Note that TF-IDF only looks at sequences of letters. It cannot distinguish, for example, the person named Laban in Genesis from the location named Laban in Deuteronomy.
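The point can be illustrated with a hypothetical tokenizer. The regular expression, the lowercasing, and the two toy phrases below are assumptions for illustration, not necessarily the preprocessing used for this post.

```python
import re

def tokenize(text):
    """Lowercase the text and return its runs of letters."""
    return re.findall(r"[a-z]+", text.lower())

person = tokenize("And Laban said unto Jacob")  # toy phrase: Laban the person
place = tokenize("between Paran and Laban")     # toy phrase: Laban the place
print(person[1] == place[-1])  # True
```

Once the text is reduced to tokens, both uses are the same string "laban", and they are counted together.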
Another oddity above is the appearance of “duke” in the list. In the language of the KJV, a duke was the head of a clan. It wasn’t a title of nobility as it is in contemporary English.
The most important words in Revelation are what you might expect.
angel: 0.0043
lamb: 0.0034
beast: 0.0033
throne: 0.0028
seven: 0.0028
dragon: 0.0025
angels: 0.0025
bottomless: 0.0024
overcometh: 0.0023
churches: 0.0022
You can find the top 10 words in each book here.
The post Using TF-IDF to pick out important words first appeared on John D. Cook.