And K-Nearest Neighbors
Mon, Jan 12, 2026
Machine learning consists of a class of algorithms that can consume data to produce a function that performs some desired task.
Colloquially, we say that the algorithm “learns” from data.
Sometimes, the data is labeled by class names and the objective is to classify similar data into those same classes. This process is called classification.
Sometimes, the data is labeled by continuous numeric values and the objective is to compute the corresponding value for similar data. That process is called regression.
In this intro presentation, we’re going to take a look at a couple of classification problems and a surprisingly simple yet effective machine learning algorithm for solving such problems. This is a real technique in machine learning because:
To begin, let’s clearly state what we mean by data and how we need it formatted in the context of machine learning.
Generally, we are interested in observational level data. Furthermore, we expect each row to correspond to a single observation and each column to a single variable or attribute measured for that observation.
Data formatted in this way is often called tidy data. I’ll typically refer to this type of data as a data table or data frame.
As a concrete example, we’ll start with the now widely used Palmer penguin data set.
This data was collected in a 2014 study at Palmer Station in Antarctica, as described in this paper.
It was collected for use as an R data package, as described on Allison Horst’s GitHub page.
We’ll use it a few times throughout this semester.
The penguin data set looks like so:
penguin_data = fetch("https://marksmath.org/data/penguins.csv").then(async function (r) {
  // Download the CSV as text and parse it into an array of row objects.
  const t = await r.text();
  return d3.csvParse(t);
})
Inputs.table(penguin_data)
This data has 342 observations corresponding to actual penguins; these appear as rows in the table. There are seven attributes associated with each penguin that appear as columns in the table.
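If you’d like to confirm those numbers, a quick check along these lines should work, since d3.csvParse attaches the column names to the parsed array as a columns property:
// Shape of the parsed table: number of rows and the list of column names.
({ rows: penguin_data.length, columns: penguin_data.columns })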
The table is interactive so that you can scroll through the complete dataset, if you like.
To be clear about what we mean by observational level, tidy data, let’s draw a distinction between it and summary level data.
To generate summary level data for the penguin data set, we might group the penguins by species, compute the average mass per species, and present that as a summary table. The result might look like so:
{
  // Group the penguins by species and compute the average body mass per group.
  const rollups = d3.rollups(
    penguin_data,
    a => d3.mean(a, o => o.body_mass_g),
    o => o.species
  );

  // Build a small HTML table to display the summary.
  const table = d3.create('table');
  const header = table.append('tr');
  header.append('th').text('Species');
  header.append('th').text('Avg Mass');
  rollups.forEach(function ([s, m]) {
    const tr = table.append('tr');
    tr.append('td').text(s);
    tr.append('td').text(`${d3.format('0.2f')(m)} g`);
  });
  return table.node();
}
This might very well be quite useful. It’s not tidy data, though, and it’s not the type of data that we build machine learning algorithms with.
Now that we know what data is and what type we’re dealing with, we can describe the K-Nearest Neighbors algorithm.
Just to avoid potential confusion down the road, it might be worth distinguishing K-Nearest Neighbors from K-Means, which is a clustering algorithm. The two techniques are really quite different but are sometimes confused due to their similar names.
The fundamental idea behind KNN is to examine other data near the observation point under consideration and to guess the label based on those other values. We might examine 5 nearest neighbors, 8 nearest neighbors, or 15 nearest neighbors. In general, we examine \(k\) nearest neighbors, which is where the name comes from.
How we determine our guess for the label depends on the nature of the data.
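To make the idea concrete, here’s a minimal sketch of a KNN classifier in JavaScript. It’s not the exact code behind the figures on this page; the name knnClassify, its arguments, and the choice of plain Euclidean distance are just illustrative, and it uses only standard d3-array helpers.
// Sketch of k-nearest neighbors classification.
// `data` is an array of observations, `features` lists the numeric columns to
// compare, `labelKey` names the label column, and `point` is the new observation.
function knnClassify(data, features, labelKey, point, k = 5) {
  // Euclidean distance between two observations, using only the listed features.
  const dist = (a, b) => Math.sqrt(d3.sum(features, f => (a[f] - b[f]) ** 2));

  // Sort the labeled data by distance to the new point and keep the k closest.
  const neighbors = d3.sort(data, (a, b) => dist(a, point) - dist(b, point)).slice(0, k);

  // Classification: a simple majority vote among the neighbors' labels.
  const votes = d3.rollups(neighbors, v => v.length, o => o[labelKey]);
  return d3.greatest(votes, ([, count]) => count)[0];
}
For regression, we would instead return an average of the neighbors’ values, something like d3.mean(neighbors, o => o[labelKey]).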
Into which species would you classify the first entry (marked ???) in the table below?
species | island | sex | culmen_length_mm | body_mass_g | culmen_depth_mm | flipper_length_mm |
---|---|---|---|---|---|---|
??? | Dream | MALE | 48 | 4850 | 15 | 222 |
... | ... | ... | ... | ... | ... | ... |
Gentoo | Biscoe | FEMALE | 46.1 | 4500 | 13.2 | 211 |
Adelie | Biscoe | MALE | 37.7 | 3600 | 18.7 | 180 |
Gentoo | Biscoe | MALE | 50.0 | 5700 | 16.3 | 230 |
Adelie | Biscoe | FEMALE | 37.8 | 3400 | 18.3 | 174 |
Chinstrap | Dream | FEMALE | 46.5 | 3500 | 17.9 | 192 |
Adelie | Dream | FEMALE | 39.5 | 3250 | 16.7 | 178 |
Adelie | Dream | MALE | 37.2 | 3900 | 18.1 | 178 |
Gentoo | Biscoe | FEMALE | 48.7 | 4450 | 14.1 | 210 |
Chinstrap | Dream | MALE | 51.3 | 3650 | 19.2 | 193 |
Chinstrap | Dream | MALE | 50.0 | 3900 | 19.5 | 196 |
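Here’s how a sketch like knnClassify above might be applied to that mystery row, using the same two measurements that are plotted next. The call is purely illustrative; in practice the columns are usually rescaled first, since body mass is numerically much larger than culmen depth and would otherwise dominate the distance.
// Guess the species of the mystery penguin from its 5 nearest neighbors
// in the body_mass_g / culmen_depth_mm plane.
knnClassify(
  penguin_data,
  ["body_mass_g", "culmen_depth_mm"],
  "species",
  { body_mass_g: 4850, culmen_depth_mm: 15 },
  5
)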
Here’s a plot of that same data in the body_mass/culmen_depth plane. The classification is even more obvious now.
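A scatter plot along those lines can be produced with Observable Plot, assuming that library is available as Plot; the accessors coerce the parsed strings to numbers:
// Body mass vs. culmen depth, colored by species.
Plot.plot({
  color: { legend: true },
  marks: [
    Plot.dot(penguin_data, {
      x: d => +d.body_mass_g,
      y: d => +d.culmen_depth_mm,
      stroke: "species"
    })
  ]
})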
Here’s a so-called classification plot of the full dataset:
In that classification plot, each point in the plane is colored according to the species that the algorithm would predict for a penguin with that body mass and culmen depth.
As we can see, the algorithm can easily distinguish between Gentoo and the other two species. It doesn’t distinguish between Adelie and Chinstrap very well. We can use more variables to improve it, though!
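One way to build a classification plot like that, reusing the knnClassify sketch from above together with Observable Plot, is to predict the species on a grid of (body mass, culmen depth) values and color each grid cell by the prediction. The grid spacing and the lack of rescaling here are simplifications, so the regions won’t match the actual figure exactly.
{
  // Evaluate the classifier on a grid covering the plotted region.
  const grid = d3.range(2500, 6500, 100).flatMap(mass =>
    d3.range(13, 22, 0.25).map(depth => ({
      body_mass_g: mass,
      culmen_depth_mm: depth,
      predicted: knnClassify(
        penguin_data,
        ["body_mass_g", "culmen_depth_mm"],
        "species",
        { body_mass_g: mass, culmen_depth_mm: depth },
        5
      )
    }))
  );
  // Color each grid cell by the predicted species.
  return Plot.plot({
    color: { legend: true },
    marks: [
      Plot.cell(grid, { x: "body_mass_g", y: "culmen_depth_mm", fill: "predicted" })
    ]
  });
}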
We can now describe how KNN works for digit recognition. The classifier is built from the now classic MNIST Digit Dataset. This data consists of two key components:
- a large collection of scanned, handwritten digits, each stored as a 28×28 grid of grayscale pixel values, i.e. a vector of 784 numbers, and
- a label for each image indicating which digit, 0 through 9, it represents.
Again, the data consists of a list of 784-dimensional vectors of grayscale values, each with a label. It’s easy to render those vectors as grayscale images and display them alongside their labels. Here’s a short subset of the results:
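Here’s one way that rendering might be done, as a sketch only. It assumes each digit has been decoded into an object like { label: 5, pixels: [...] } with 784 grayscale values running from 0 (blank) to 255 (full ink); neither that object shape nor the name renderDigit comes from the dataset itself.
// Render a single 784-entry pixel vector as a (scaled-up) grayscale image.
function renderDigit(digit, scale = 4) {
  const size = 28; // MNIST images are 28 x 28 pixels
  // Draw at native resolution first.
  const small = document.createElement("canvas");
  small.width = small.height = size;
  const ctx = small.getContext("2d");
  const image = ctx.createImageData(size, size);
  digit.pixels.forEach((v, i) => {
    const shade = 255 - v;            // invert so ink shows up dark on white
    image.data[4 * i + 0] = shade;    // red
    image.data[4 * i + 1] = shade;    // green
    image.data[4 * i + 2] = shade;    // blue
    image.data[4 * i + 3] = 255;      // fully opaque
  });
  ctx.putImageData(image, 0, 0);
  // Scale up with smoothing disabled so the pixels stay crisp.
  const canvas = document.createElement("canvas");
  canvas.width = canvas.height = size * scale;
  const big = canvas.getContext("2d");
  big.imageSmoothingEnabled = false;
  big.drawImage(small, 0, 0, canvas.width, canvas.height);
  canvas.title = `label: ${digit.label}`;
  return canvas;
}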
Once we understand both KNN and the MNIST dataset, it’s pretty easy to describe digit classification by KNN. There are two essential steps:
- Treat each image as a point in 784-dimensional space and compute the distance from the image we wish to classify to every labeled image in the training data.
- Find the k closest training images and let them vote; the digit label that appears most often among those neighbors is our guess.
That’s really all there is to it.
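Here is a sketch of those two steps under the same assumptions as before: a training array of { label, pixels } objects, an unknown image given as a bare 784-entry pixel array, and plain Euclidean distance.
// Sketch of digit classification by KNN.
function classifyDigit(training, unknown, k = 5) {
  // Step 1: distance from the unknown image to a training image, treating
  // both as points in 784-dimensional space.
  const dist = pixels => Math.sqrt(d3.sum(pixels, (v, i) => (v - unknown[i]) ** 2));

  // Step 2: find the k nearest training images and take a majority vote
  // among their labels.
  const neighbors = d3.sort(training, (a, b) => dist(a.pixels) - dist(b.pixels)).slice(0, k);
  const votes = d3.rollups(neighbors, v => v.length, o => o.label);
  return d3.greatest(votes, ([, count]) => count)[0];
}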
There might be some further questions, though.
We’ve got plenty to address this semester when it comes to machine learning algorithms.
Of course, this is a math class and, in fact, the answers to most of those questions really lie in that domain.
Here are a few of the algorithms that we’ll take a look at this semester: