MML Discourse archived in May, 2026

One Hot Encoding

mark

Suppose I've got a categorical color variable that can take any of the three values
red, yellow, or blue.

  1. Describe conceptually how one hot encoding would be set up for that variable.
  2. Given your description, what would be the encoding of the vector
    [\text{red}, 1, \text{yellow}]^{\mathsf{T}},
    where the middle value is the value of some separate numeric variable.

User 003

Since color is categorical and its values have no natural ordering, each value is encoded as an (n-1)-dimensional vector, where n is the number of possible values. At most one position holds the value 1 and all other positions hold 0. One value is designated the default, and its encoding has a 0 in every slot.

In this case, we have three possibilities, so the encoding could be:
\text{red}=[1, 0]^{\mathsf{T}}
\text{yellow}=[0, 1]^{\mathsf{T}}
\text{blue}=[0, 0]^{\mathsf{T}}
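
The same scheme can be sketched in a few lines of Python. This is only an illustration, not how any particular library does it: the helper name `drop_last_one_hot` is made up here, and treating the last listed category as the all-zero default is an assumption chosen to match the encoding above.

```python
def drop_last_one_hot(categories):
    """Map each category to a vector of length len(categories) - 1.

    Assumption: the last category in the list is the default and is
    encoded as the all-zero vector.
    """
    width = len(categories) - 1
    codes = {}
    for index, category in enumerate(categories):
        vector = [0] * width
        if index < width:
            vector[index] = 1  # every non-default category gets exactly one 1
        codes[category] = vector
    return codes


color_codes = drop_last_one_hot(["red", "yellow", "blue"])
print(color_codes)
```

Listing the categories in a different order would pick a different default; the choice is arbitrary as long as it is applied consistently.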

Therefore the encoding of the given vector could be:

\left[\begin{bmatrix}1 \\ 0 \end{bmatrix},\;1,\;\begin{bmatrix}0 \\ 1 \end{bmatrix}\right]^{\mathsf{T}}

mark

This is one of those things that's not super hard once you see and do it, but it's not immediately intuitive either. So it's really nice to get a response here - thanks!

Our presentation describes one-hot encoding on this slide and there's an example on the next. In the example, there are four colors but those yield only vectors of length three. So, I would expect to be able to encode this information just a bit more efficiently using vectors of length only two.

User 003

My mistake was imposing an extra condition, namely that the encoding vector for each color value had to have the same magnitude, and ignoring the "default value."

mark

Much better, now - yes!

I also like the way you've written it as a list with vectors and other data. It's worth mentioning that in the software implementation the data would be internally encoded as a single vector. Maybe something like

\begin{bmatrix} 1 & 0 & 1 & 0 & 1 \end{bmatrix}^{\mathsf{T}}.

There would also be a separate data structure to keep track of what's what for interpretation after the model is fit.
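
As a rough sketch of that idea in Python: the flat vector and the bookkeeping structure are built side by side. The names `flatten_record`, `decode_record`, and `layout` are hypothetical, and the rule that any string entry is a color while anything else is numeric is an assumption made only to keep the example short.

```python
COLOR_CODES = {"red": [1, 0], "yellow": [0, 1], "blue": [0, 0]}


def flatten_record(record, codes):
    """Return the flat numeric vector plus a layout describing each slice."""
    flat = []
    layout = []  # one (start, stop, kind) entry per original value
    for value in record:
        start = len(flat)
        if isinstance(value, str):  # assumption: strings are colors
            flat.extend(codes[value])
            layout.append((start, len(flat), "color"))
        else:
            flat.append(value)
            layout.append((start, len(flat), "numeric"))
    return flat, layout


def decode_record(flat, layout, codes):
    """Use the layout to turn the flat vector back into readable values."""
    inverse = {tuple(vector): name for name, vector in codes.items()}
    record = []
    for start, stop, kind in layout:
        chunk = flat[start:stop]
        record.append(inverse[tuple(chunk)] if kind == "color" else chunk[0])
    return record


flat, layout = flatten_record(["red", 1, "yellow"], COLOR_CODES)
print(flat)
print(decode_record(flat, layout, COLOR_CODES))
```

The model only ever sees `flat`; `layout` is what lets a person map positions in the vector back to the original variables afterwards.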

But, again, the way you've written it is preferred for other humans to read.