Natural language processing represents words as high-dimensional vectors, on the order of 100 dimensions. For example, the glove-wiki-gigaword-50 set of word vectors contains 50-dimensional vectors, and the glove-wiki-gigaword-200 set of word vectors contains 200-dimensional vectors.
The intent is to represent words in such a way that the angle between vectors is related to similarity between words. Closely related words would be represented by vectors that are close to parallel. On the other hand, words that are unrelated should have large angles between them. The metaphor of two independent things being orthogonal holds almost literally as we’ll illustrate below.
Cosine similarity
For vectors x and y in two dimensions,
\[ x \cdot y = \|x\| \, \|y\| \cos\theta \]
where θ is the angle between the vectors. In higher dimensions, this relation defines the angle θ in terms of the dot product and norms:
\[ \cos\theta = \frac{x \cdot y}{\|x\| \, \|y\|} \]
The right-hand side of this equation is the cosine similarity of x and y. NLP usually speaks of cosine similarity rather than θ, but you could always take the inverse cosine of cosine similarity to compute θ. Note that cos(0) = 1, so small angles correspond to large cosines.
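To make the relation between dot product, norms, and angle concrete, here is a minimal sketch using NumPy. The function names `cosine_similarity` and `angle_degrees` and the toy vectors are illustrative assumptions, not part of the original post.

```python
import numpy as np

def cosine_similarity(x, y):
    """Dot product of x and y divided by the product of their norms."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def angle_degrees(x, y):
    """Angle between x and y in degrees, via the inverse cosine."""
    # Clip to [-1, 1] so rounding error cannot push the value
    # outside the domain of arccos.
    c = np.clip(cosine_similarity(x, y), -1.0, 1.0)
    return float(np.degrees(np.arccos(c)))

# Toy vectors chosen for illustration; they are not word vectors.
parallel = angle_degrees([1, 2], [2, 4])    # same direction
diagonal = angle_degrees([1, 0], [1, 1])    # halfway between the axes
orthogonal = angle_degrees([1, 0], [0, 1])  # perpendicular
print(round(parallel, 3), round(diagonal, 3), round(orthogonal, 3))
```

The three angles should come out at roughly 0°, 45°, and 90°. Nothing in the code depends on the dimension, so the same functions apply to 200-dimensional vectors.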
Examples
For our examples we’ll use gensim with word vectors from the glove-twitter-200 model. As the name implies, this data set maps words to 200-dimensional vectors.
First some setup code.
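    import gensim.downloader as api
    import numpy as np

    # Download (on first use) and load the pretrained vectors
    word_vectors = api.load("glove-twitter-200")

    def norm(word):
        # Euclidean length of the vector for a word
        v = word_vectors[word]
        return np.dot(v, v)**0.5

    def cosinesim(word0, word1):
        # Dot product divided by the product of the norms
        v = word_vectors[word0]
        w = word_vectors[word1]
        return np.dot(v, w)/(norm(word0)*norm(word1))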
Using this model, the cosine similarity between “dog” and “cat” is 0.832, which corresponds to about a 34° angle. The cosine similarity between “dog” and “wrench” is 0.145, which corresponds to an angle of 82°. A dog is more like a cat than like a wrench.
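The angles quoted here can be checked by taking the inverse cosine of each similarity. This is a small sketch using only the standard library. The helper name `angle_from_similarity` is an assumption made for illustration, and the input values are the similarities reported above rather than freshly computed ones.

```python
import math

def angle_from_similarity(cos_sim):
    """Angle in degrees corresponding to a cosine similarity."""
    return math.degrees(math.acos(cos_sim))

# Similarities as reported in the text
dog_cat = angle_from_similarity(0.832)     # about 34 degrees
dog_wrench = angle_from_similarity(0.145)  # about 82 degrees
print(round(dog_cat), round(dog_wrench))
```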
The similarity between “dog” and “leash” is 0.487, not because a dog is like a leash, but because the word “leash” is often used in the same context as the word “dog.” The similarity between “cat” and “leash” is only 0.328 because people speaking of leashes are more likely to also be speaking about a dog than a cat.
The cosine similarity between “uranium” and “walnut” is only 0.0054, corresponding to an angle of 89.7°. The vectors associated with the two words are very nearly orthogonal because the words are orthogonal in the metaphorical sense.
Note that opposites are somewhat similar. Uranium is not the opposite of walnut because things have to have something in common to be opposites. The cosine similarity of “expensive” and “cheap” is 0.706. Both words are adjectives describing prices and so in some sense they’re similar, though they have opposite valence. “Expensive” has more in common with “cheap” than with “pumpkin” (similarity 0.192).
The similarity between “admiral” and “general” is 0.305, maybe less than you’d expect. But the word “general” is kinda general: it can be used in more contexts than military office. If you add the vectors for “army” and “general”, you get a vector that has cosine similarity 0.410 with “admiral.”
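One way to carry out the vector addition described above is sketched below. The helper `similarity_to_sum` is hypothetical, and the three-dimensional toy vectors are made up so the example runs without downloading a model. It assumes only that `vectors[word]` returns a numeric vector, as `word_vectors[word]` does in the setup code.

```python
import numpy as np

def cosine_similarity(v, w):
    """Dot product of v and w divided by the product of their norms."""
    v = np.asarray(v, dtype=float)
    w = np.asarray(w, dtype=float)
    return float(np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w)))

def similarity_to_sum(vectors, words, target):
    """Cosine similarity between the sum of the vectors for `words`
    and the vector for `target`."""
    total = np.sum([np.asarray(vectors[w], dtype=float) for w in words], axis=0)
    return cosine_similarity(total, vectors[target])

# Made-up toy vectors, NOT taken from glove-twitter-200.
toy = {
    "army": [1.0, 0.0, 0.0],
    "general": [0.0, 1.0, 0.0],
    "admiral": [1.0, 1.0, 1.0],
}

alone = cosine_similarity(toy["general"], toy["admiral"])
combined = similarity_to_sum(toy, ["army", "general"], "admiral")
print(alone < combined)
```

With the model loaded as in the setup code, the corresponding call would presumably be `similarity_to_sum(word_vectors, ["army", "general"], "admiral")`. Because cosine similarity ignores vector length, summing the vectors and averaging them give the same result.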
The post Angles between words first appeared on John D. Cook.