Cosine similarity

Published November 3, 2025

Cosine similarity is a measure of how similar two vectors are that’s widely used in textual analysis. The basic idea is that the cosine of the angle between two vectors tells us how closely they point in the same direction; for unit vectors, that cosine is exactly the dot product.
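
In symbols, for nonzero vectors $u$ and $v$, the cosine of the angle $\theta$ between them is the dot product scaled by the lengths of the vectors:

$$
\cos\theta = \frac{u \cdot v}{\|u\|\,\|v\|}.
$$

This ranges from $-1$ for vectors pointing in opposite directions to $1$ for vectors pointing the same way, and for vectors with no negative entries (like word counts) it always lies between $0$ and $1$.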

Thus, we can use the dot product to measure how closely related two vectors are. I suppose the next question might be: where would these vectors come from in the context of textual analysis?

Well, let’s illustrate the idea by computing similarities between the Wikipedia articles for Mathematics, Physics, Chemistry, Food science, Nutritional science, and Baking.

Word counts

One very simple way to translate text into a vector is to count how many times each word in a preset collection of common words appears in the text.
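
Here’s a minimal sketch of how such count vectors might be built; the articles dictionary, the tokenize helper, and the word_counts name are my own scaffolding, and the real vocabulary contains a few hundred common words rather than the handful shown:

```python
import re
from collections import Counter

import pandas as pd

# Hypothetical input: the full text of each Wikipedia article, keyed by title.
articles = {
    "Mathematics": "...",
    "Physics": "...",
    "Chemistry": "...",
    "Food science": "...",
    "Nutritional science": "...",
    "Baking": "...",
}

def tokenize(text):
    """Lowercase the text and split it into purely alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

# Count every word in every article.
counters = {title: Counter(tokenize(text)) for title, text in articles.items()}

# A preset collection of common words serves as the coordinates of the vector.
vocabulary = ["mathematical", "theory", "food", "geometry", "matter",
              "energy", "atoms", "reaction"]  # ...continuing with more common words

# One row per article, one column per word; a word that doesn't appear counts as zero.
word_counts = pd.DataFrame(
    {title: [counter[word] for word in vocabulary]
     for title, counter in counters.items()},
    index=vocabulary,
).T
```

Counting this way for the six articles, and keeping a couple of bookkeeping columns for the total and unique token counts, gives the following table: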

| Article | _total_tokens | _unique_tokens | mathematical | theory | food | geometry | matter | energy | atoms | reaction | ... | subfields | gases | greeks | dark | theorists | measure | conditions | smallest | universal | role |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Mathematics | 4725 | 1800 | 86 | 58 | 0 | 46 | 0 | 0 | 0 | 0 | ... | 3 | 0 | 3 | 1 | 1 | 2 | 0 | 0 | 1 | 3 |
| Physics | 3206 | 1358 | 13 | 27 | 0 | 0 | 29 | 14 | 4 | 0 | ... | 2 | 0 | 2 | 4 | 4 | 3 | 1 | 1 | 4 | 2 |
| Chemistry | 3924 | 1576 | 2 | 16 | 0 | 1 | 18 | 39 | 40 | 36 | ... | 0 | 4 | 0 | 0 | 0 | 0 | 4 | 4 | 0 | 0 |
| Food science | 681 | 378 | 0 | 0 | 68 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Nutritional science | 357 | 220 | 0 | 0 | 12 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Baking | 1731 | 998 | 0 | 0 | 20 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

6 rows × 302 columns

I’m sure you’ve seen exactly this kind of thing used to build word clouds like the following:

Word cloud for Mathematics

Word cloud for Physics

It’s not too hard to find words common to both word clouds.
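
Clouds like these are easy to generate from the word counts. Here’s a rough sketch using the wordcloud package, with a tiny frequency dictionary (a few of the Mathematics counts from the table) standing in for a full set of counts:

```python
from wordcloud import WordCloud

# A few of the Mathematics counts stand in for the full word frequencies.
frequencies = {"mathematical": 86, "theory": 58, "geometry": 46}

# Scale each word by its frequency and write the image to disk.
cloud = WordCloud(width=800, height=400, background_color="white")
cloud.generate_from_frequencies(frequencies)
cloud.to_file("mathematics_cloud.png")
```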

Resulting cosine similarity

Now, it’s easy to transform the rows of word counts to normalized vectors. You can then compute the pairwise dot products to generate the following matrix of values:
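
Here’s a sketch of that computation with NumPy, assuming the counts live in the word_counts DataFrame from the earlier sketch:

```python
import numpy as np
import pandas as pd

# Divide each row of counts by its Euclidean length to get unit vectors.
vectors = word_counts.to_numpy(dtype=float)
unit_vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

# The pairwise dot products of unit vectors are exactly the cosine similarities.
similarities = pd.DataFrame(unit_vectors @ unit_vectors.T,
                            index=word_counts.index,
                            columns=word_counts.index)
```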

Not surprisingly, the three sciences are closely related, as are the three culinary topics. The relationships between the sciences and the culinary arts seem to be much weaker.

Surely, there must be more to say about the relationship between cooking and chemistry?

Sentence transformers

While Natural Language Processing (NLP) has been developing for decades, the current rage of Large Language Models and their associated chatbots has been largely driven by the innovation of the Transformer.

A first step in NLP is to break text into tokens so that the text can be represented mathematically, usually as vectors; a model can then use that representation to extract meaning from the text. When we use word counting, the tokens are simply words. Words are reasonable tokens for simple models, but transformers typically use smaller pieces, called subwords, as their basic tokens. Parts of words (like pre in premed, precalc, or prehistoric) can be tokens, and there are many more potential tokens as well.
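
To get a feel for subword tokens, here’s a small sketch using the Hugging Face transformers library; the particular model, bert-base-uncased, is just a convenient example of a subword tokenizer:

```python
from transformers import AutoTokenizer

# Load the subword tokenizer that ships with a pretrained model.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Common words tend to survive as single tokens, while rarer words
# get split into smaller subword pieces.
for word in ["chemistry", "premed", "precalc", "prehistoric"]:
    print(word, "->", tokenizer.tokenize(word))
```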

The true innovation of the Transformer lies in how text is mapped into a vector space. The coordinates of that vector space don’t correspond to tokens, as in the word count example. Rather, the coordinates correspond to the nodes in the final layer of a neural network that’s trained to understand the relationships between the tokens. As a result, a Transformer can tell when different words mean the same or similar things, which allows it to recognize similarities that simple word counts miss.

I’m no expert in this topic, but there’s a Python transformer library that makes it easy to map text into a 384-dimensional vector space using this process. Here’s the resulting cosine similarity between these articles:
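
A natural guess for that library is the sentence-transformers package with its all-MiniLM-L6-v2 model, which produces 384-dimensional vectors; that pairing is my assumption, but the sketch below shows the general shape of the computation, reusing the hypothetical articles dictionary from the word-count sketch:

```python
from sentence_transformers import SentenceTransformer

# all-MiniLM-L6-v2 maps each text to a 384-dimensional vector.
model = SentenceTransformer("all-MiniLM-L6-v2")

titles = ["Mathematics", "Physics", "Chemistry",
          "Food science", "Nutritional science", "Baking"]
texts = [articles[title] for title in titles]

# Normalized embeddings, so pairwise dot products are cosine similarities.
embeddings = model.encode(texts, normalize_embeddings=True)
similarities = embeddings @ embeddings.T
```

One wrinkle: the model truncates long inputs, so a full Wikipedia article would likely need to be split into chunks whose embeddings are then averaged.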