Cosine similarity is a measure of how similar two vectors are, and it's widely used in textual analysis. The basic idea is that
- when two normalized vectors are exactly the same, then their dot product is \(1\),
- when two normalized vectors are exactly opposite, then their dot product is \(-1\),
- when one normalized vector is modified just slightly and then renormalized to obtain a second vector, the dot product of the two vectors should be close to \(1\), and
- when two normalized vectors have little relationship, then their dot product should be close to zero.
Thus, we can use the dot product to measure how closely related two vectors are. The natural next question is: where would these vectors come from in the context of textual analysis?
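The properties above can be sketched in a few lines of Python. The function below is a minimal implementation of cosine similarity (normalize, then take the dot product); the example vectors are just illustrations:

```python
import numpy as np

def cosine_similarity(u, v):
    """Dot product of the two vectors after normalizing each to unit length."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine_similarity([1, 2, 3], [1, 2, 3]))    # identical vectors → 1.0
print(cosine_similarity([1, 2, 3], [-1, -2, -3])) # opposite vectors → -1.0
print(cosine_similarity([1, 2, 3], [1.1, 2, 3]))  # slight perturbation → close to 1
print(cosine_similarity([1, 0], [0, 1]))          # unrelated (orthogonal) → 0.0
```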
Well, let’s illustrate the idea by computing similarities between the Wikipedia articles for Mathematics, Physics, Chemistry, Food science, Nutritional science, and Baking.
Word counts
One very simple way to translate text into a vector is to create a list of word counts for a preset collection of common words.
| Article |  |  |  |  |  |  |  |  |  |  | ... |  |  |  |  |  |  |  |  |  |  |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Mathematics | 4725 | 1800 | 86 | 58 | 0 | 46 | 0 | 0 | 0 | 0 | ... | 3 | 0 | 3 | 1 | 1 | 2 | 0 | 0 | 1 | 3 |
| Physics | 3206 | 1358 | 13 | 27 | 0 | 0 | 29 | 14 | 4 | 0 | ... | 2 | 0 | 2 | 4 | 4 | 3 | 1 | 1 | 4 | 2 |
| Chemistry | 3924 | 1576 | 2 | 16 | 0 | 1 | 18 | 39 | 40 | 36 | ... | 0 | 4 | 0 | 0 | 0 | 0 | 4 | 4 | 0 | 0 |
| Food science | 681 | 378 | 0 | 0 | 68 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Nutritional science | 357 | 220 | 0 | 0 | 12 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Baking | 1731 | 998 | 0 | 0 | 20 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

6 rows × 302 columns
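Building one of these count rows is straightforward. The sketch below uses a tiny hypothetical word list standing in for the 302 preset common words (the list and the sample sentence are made up for illustration):

```python
from collections import Counter
import re

# Hypothetical stand-in for the preset collection of 302 common words.
COMMON_WORDS = ["the", "of", "science", "energy", "food", "theorem"]

def word_count_vector(text, words=COMMON_WORDS):
    """Count how many times each preset word occurs in the text."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[w] for w in words]

print(word_count_vector("The science of food is a science."))
# → [1, 1, 2, 0, 1, 0]
```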
I’m sure you’ve seen exactly this kind of thing used to build word clouds like the following:
It’s not too hard to find words common to both word clouds.
Resulting cosine similarity
Now, it’s easy to transform the rows of word counts into normalized vectors. You can then compute the pairwise dot products to generate the following matrix of values:
Not surprisingly, the three sciences are closely related, as are the three culinary topics. The relationships between the sciences and the culinary arts seem to be much weaker.
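The normalize-and-dot-product step can be sketched with NumPy. For brevity this uses only the first few count columns for three of the articles from the table above, so the resulting numbers are a toy version of the full 302-column computation:

```python
import numpy as np

# First few count columns from the table above (a small subset of the data).
articles = ["Mathematics", "Physics", "Baking"]
counts = np.array([
    [4725, 1800, 86, 58,  0],
    [3206, 1358, 13, 27,  0],
    [1731,  998,  0,  0, 20],
], dtype=float)

# Normalize each row to unit length; the pairwise dot products then
# form the cosine-similarity matrix.
normalized = counts / np.linalg.norm(counts, axis=1, keepdims=True)
similarity = normalized @ normalized.T

print(np.round(similarity, 3))
```

The diagonal is all ones (each article is perfectly similar to itself), and the matrix is symmetric.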
Surely, there must be more that can be said about the relationship between cooking and chemistry?