Cosine similarity

Published November 3, 2025

Cosine similarity is a measure of how similar two vectors are that’s widely used in textual analysis. The basic idea is that the cosine of the angle between two vectors tells us how closely they point in the same direction; for unit vectors, that cosine is exactly the dot product.
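
In symbols, for nonzero vectors $u$ and $v$, the cosine of the angle $\theta$ between them is the dot product scaled by the lengths of the vectors:

$$
\cos\theta = \frac{u \cdot v}{\|u\|\,\|v\|}.
$$

This ranges from $-1$ for vectors pointing in opposite directions to $1$ for vectors pointing the same way, and for vectors with no negative entries (like word counts) it always lies between $0$ and $1$.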

Thus, we can use the dot product to measure how closely related two vectors are. I suppose the next question might be: where would these vectors come from in the context of textual analysis?

Well, let’s illustrate the idea by computing similarities between the Wikipedia articles for Mathematics, Physics, Chemistry, Food science, Nutritional science, and Baking.

Word counts

One very simple way to translate text into a vector is to count how many times each word in a preset collection of common words appears in the text.
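
Here’s a minimal sketch of how such count vectors might be built; the articles dictionary, the tokenize helper, and the word_counts name are my own scaffolding, and the real vocabulary contains a few hundred common words rather than the handful shown:

```python
import re
from collections import Counter

import pandas as pd

# Hypothetical input: the full text of each Wikipedia article, keyed by title.
articles = {
    "Mathematics": "...",
    "Physics": "...",
    "Chemistry": "...",
    "Food science": "...",
    "Nutritional science": "...",
    "Baking": "...",
}

def tokenize(text):
    """Lowercase the text and split it into purely alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

# Count every word in every article.
counters = {title: Counter(tokenize(text)) for title, text in articles.items()}

# A preset collection of common words serves as the coordinates of the vector.
vocabulary = ["mathematical", "theory", "food", "geometry", "matter",
              "energy", "atoms", "reaction"]  # ...continuing with more common words

# One row per article, one column per word; a word that doesn't appear counts as zero.
word_counts = pd.DataFrame(
    {title: [counter[word] for word in vocabulary]
     for title, counter in counters.items()},
    index=vocabulary,
).T
```

Counting this way for the six articles, and keeping a couple of bookkeeping columns for the total and unique token counts, gives the following table: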

| Article | _total_tokens | _unique_tokens | mathematical | theory | food | geometry | matter | energy | atoms | reaction | ... | subfields | gases | greeks | dark | theorists | measure | conditions | smallest | universal | role |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Mathematics | 4725 | 1800 | 86 | 58 | 0 | 46 | 0 | 0 | 0 | 0 | ... | 3 | 0 | 3 | 1 | 1 | 2 | 0 | 0 | 1 | 3 |
| Physics | 3206 | 1358 | 13 | 27 | 0 | 0 | 29 | 14 | 4 | 0 | ... | 2 | 0 | 2 | 4 | 4 | 3 | 1 | 1 | 4 | 2 |
| Chemistry | 3924 | 1576 | 2 | 16 | 0 | 1 | 18 | 39 | 40 | 36 | ... | 0 | 4 | 0 | 0 | 0 | 0 | 4 | 4 | 0 | 0 |
| Food science | 681 | 378 | 0 | 0 | 68 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Nutritional science | 357 | 220 | 0 | 0 | 12 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Baking | 1731 | 998 | 0 | 0 | 20 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

6 rows × 302 columns

I’m sure you’ve seen exactly this kind of thing used to build word clouds like the following:

Word cloud for Mathematics

Word cloud for Physics

It’s not too hard to find words common to both word clouds.
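
Clouds like these are easy to generate from the word counts. Here’s a rough sketch using the wordcloud package, with a tiny frequency dictionary (a few of the Mathematics counts from the table) standing in for a full set of counts:

```python
from wordcloud import WordCloud

# A few of the Mathematics counts stand in for the full word frequencies.
frequencies = {"mathematical": 86, "theory": 58, "geometry": 46}

# Scale each word by its frequency and write the image to disk.
cloud = WordCloud(width=800, height=400, background_color="white")
cloud.generate_from_frequencies(frequencies)
cloud.to_file("mathematics_cloud.png")
```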

Resulting cosine similarity

Now, it’s easy to transform the rows of word counts to normalized vectors. You can then compute the pairwise dot products to generate the following matrix of values:
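
Here’s a sketch of that computation with NumPy, assuming the counts live in the word_counts DataFrame from the earlier sketch:

```python
import numpy as np
import pandas as pd

# Divide each row of counts by its Euclidean length to get unit vectors.
vectors = word_counts.to_numpy(dtype=float)
unit_vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

# The pairwise dot products of unit vectors are exactly the cosine similarities.
similarities = pd.DataFrame(unit_vectors @ unit_vectors.T,
                            index=word_counts.index,
                            columns=word_counts.index)
```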

Not surprisingly, the three sciences are closely related, as are the three culinary topics. The relationships between the sciences and the culinary arts seem to be much weaker.

Surely, there must be more to say about the relationship between cooking and chemistry?

Sentence transformers

While Natural Language Processing (NLP) has been developing for decades, the current rage of Large Language Models and their associated chatbots has been largely driven by the innovation of the Transformer.

A first step in NLP is to break text into tokens so that the text can be represented mathematically, usually as vectors; a model can then use that representation to extract meaning from the text. When we use word counting, the tokens are simply words. Words are reasonable tokens for simple models, but transformers typically use smaller pieces, called subwords, as their basic tokens. Parts of words (like pre in premed, precalc, or prehistoric) can be tokens, and there are many more potential tokens as well.
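
To get a feel for subword tokens, here’s a small sketch using the Hugging Face transformers library; the particular model, bert-base-uncased, is just a convenient example of a subword tokenizer:

```python
from transformers import AutoTokenizer

# Load the subword tokenizer that ships with a pretrained model.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Common words tend to survive as single tokens, while rarer words
# get split into smaller subword pieces.
for word in ["chemistry", "premed", "precalc", "prehistoric"]:
    print(word, "->", tokenizer.tokenize(word))
```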

The true innovation of the Transformer lies in how text is mapped into a vector space. The coordinates of that vector space don’t correspond to tokens, as in the word count example. Rather, the coordinates correspond to the nodes in the final layer of a neural network that’s trained to understand the relationships between the tokens. As a result, a Transformer can tell when different words mean the same or similar things, which allows it to recognize similarities that simple word counts miss.

I’m no expert in this topic, but there’s a Python transformer library that makes it easy to map text into a 384-dimensional vector space using this process. Here’s the resulting cosine similarity between these articles:
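
A natural guess for that library is the sentence-transformers package with its all-MiniLM-L6-v2 model, which produces 384-dimensional vectors; that pairing is my assumption, but the sketch below shows the general shape of the computation, reusing the hypothetical articles dictionary from the word-count sketch:

```python
from sentence_transformers import SentenceTransformer

# all-MiniLM-L6-v2 maps each text to a 384-dimensional vector.
model = SentenceTransformer("all-MiniLM-L6-v2")

titles = ["Mathematics", "Physics", "Chemistry",
          "Food science", "Nutritional science", "Baking"]
texts = [articles[title] for title in titles]

# Normalized embeddings, so pairwise dot products are cosine similarities.
embeddings = model.encode(texts, normalize_embeddings=True)
similarities = embeddings @ embeddings.T
```

One wrinkle: the model truncates long inputs, so a full Wikipedia article would likely need to be split into chunks whose embeddings are then averaged.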