Mon, Mar 31, 2025
Today, we’ll take our first real look at neural networks. We’ll focus today on the foundational feed-forward neural network - how we represent it, how we compute with it, how we code it, and what we can do with it.
Here’s the kind of image of a neural network that you might often see. It’s a bit… mysterious!
Neural networks and AI more generally are often described with somewhat mysterious language. Even the term “learning” in Machine Learning anthropomorphizes what’s really going on.
These descriptions are growing ever more metaphorical in the media these days, as we now talk about algorithms that dream or hallucinate. This type of language is now embedded in many reliable descriptions of AI, like Wikipedia’s description of DeepDream, for example.
This type of language might serve a reasonable purpose when speaking to laypeople or the general public. If we really want to understand how neural networks work and create some ourselves, though, it’s important to demystify them.
Despite their complexity, neural networks are simply another example of a supervised learning algorithm. Thus, the basic questions are much like those we had for linear and logistic regression.
We now jump into the structure of the basic feed-forward neural network. Like all neural networks, that structure consists of several layers of nodes that are connected from left to right. The real purpose of the diagram is to help us visualize and understand how the network builds its model function and performs a computation.
Here’s a basic illustration of the structure of a feed-forward neural network:
Let’s label the layers, nodes, and edges:
The neural network consists of four layers:
In addition, there are edges between consecutive layers that are labeled with weights. (There are only a few edge labels shown to prevent cluttering of the diagram.)
The inputs and weights in the input layer determine the values in the first hidden layer. Those values determine the values in the next layer, etc., so that the values propagate from left to right and ultimately determine the output.
We have four columns of nodes. The nodes in the two hidden layers are indexed by $i$ and $j$ to obtain $h_{i,j}$ where
The input and output layers need only one index, though we’ll modify the notation soon so that they also use two indices.
The nodes labeled 1 allow for a constant term, as we’ll see soon.
There are three columns of edges between the columns of nodes. The edges are indexed by $i$, $j$, and $k$ to obtain $w_{i,j,k}$ where
Each edge in the diagram has such a weight but, again, only a few weights are shown to prevent cluttering of the diagram.
The formula to determine the values $h_{1,k}$ for $k=1,\dots,5$ in the first hidden layer is
$$h_{1,k}=g_1\left(\sum_{j=1}^{3}w_{0,j,k}\,x_j+w_{0,0,k}\times 1\right).$$
Note that we have here a linear function of the inputs with coefficients determined by the weights together with a constant term due to the node with value 1 in the input layer. The value of that affine function is passed to a so-called activation function g1. There are a number of possibilities for g1, which we’ll discuss soon.
This is illustrated for $k=2$ on the next slide.
$$h_{1,2}=g_1\left(\sum_{j=1}^{3}w_{0,j,2}\,x_j+w_{0,0,2}\times 1\right).$$
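To make the indexing concrete, here is a short Python sketch of this summation for a single hidden node. The inputs and weight matrix below are made-up values purely for illustration:

```python
import numpy as np

def hidden_value(x, W, k, g):
    # h_{1,k} = g( sum_{j=1}^{3} w_{0,j,k} * x_j  +  w_{0,0,k} * 1 )
    total = W[0, k] * 1                  # constant term w_{0,0,k}
    for j in range(1, len(x) + 1):
        total += W[j, k] * x[j - 1]      # w_{0,j,k} * x_j
    return g(total)

relu = lambda t: max(t, 0.0)             # one possible activation g_1

x = [1.0, 2.0, 3.0]                      # hypothetical inputs x_1, x_2, x_3
W = np.arange(24.0).reshape(4, 6)        # hypothetical weights; row j, column k
print(hidden_value(x, W, 2, relu))       # 98.0
```

The nested loop mirrors the summation exactly; later we’ll see that a matrix product does the same job for all nodes at once.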
The formula to determine the values $h_{2,k}$ for $k=1,\dots,4$ in the second hidden layer is
$$h_{2,k}=g_2\left(\sum_{j=1}^{5}w_{1,j,k}\,h_{1,j}+w_{1,0,k}\times 1\right).$$
Note that there’s a second activation function g2. Generally, activation functions are common to all nodes in a particular layer but can change between layers.
This is illustrated for $k=3$ on the next slide.
$$h_{2,3}=g_2\left(\sum_{j=1}^{5}w_{1,j,3}\,h_{1,j}+w_{1,0,3}\times 1\right).$$
The formula to determine the values $y_k$ for $k=1$ or $k=2$ in the output layer is
$$y_k=g_3\left(\sum_{j=1}^{4}w_{2,j,k}\,h_{2,j}+w_{2,0,k}\times 1\right).$$
There’s a third activation function g3. Often, the final activation function is used to massage the final output into the desired form.
This is illustrated for $k=1$ on the next slide.
$$y_1=g_3\left(\sum_{j=1}^{4}w_{2,j,1}\,h_{2,j}+w_{2,0,1}\times 1\right).$$
At each layer, we apply an activation function. This function need not be linear, which means that neural networks are, indeed, more general than linear models. Common choices for the activation function are

ReLU: $g(x)=\begin{cases}x & x\ge 0\\ 0 & x<0\end{cases}$

Sigmoid: $g(x)=\dfrac{1}{1+e^{-x}}$
Sometimes these functions might include parameters that can be optimized via cross-validation.
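Both activation functions are easy to define with NumPy. Here’s a quick sketch, vectorized so that each applies termwise to arrays:

```python
import numpy as np

def relu(x):
    # ReLU: x when x >= 0, and 0 when x < 0
    return np.maximum(x, 0)

def sigmoid(x):
    # Sigmoid: 1 / (1 + e^{-x}), maps the reals into (0, 1)
    return 1 / (1 + np.exp(-x))

print(relu(np.array([-2.0, 0.0, 3.0])))  # [0. 0. 3.]
print(sigmoid(0.0))                      # 0.5
```

Note that the sigmoid sends 0 to exactly 1/2, negative inputs below 1/2, and positive inputs above it - a fact we’ll use when we round sigmoid outputs to 0 or 1.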
Let’s take a look at a simple example of a neural network with just two inputs, one hidden layer with three nodes, and a single output.
The network of interest is shown on the next page with all edge weights specified and two input values set. In addition, we’ll apply a ReLU activation after the hidden layer and a sigmoid activation to the output.
Our mission is to find the values in the hidden layer and the output after propagation.
The values are then entered into the next step as shown on the next slide.
The next step is very similar. We now have just one output, though, and a nonzero constant weight. The linear part gives $y=2\times 12+(-22)\times 1=24-22=2$. We then apply the sigmoid activation function to get the final output: $y=\frac{1}{1+e^{-2}}\approx 0.880797$. The final state of the network is shown on the next slide.
Note the large value of the constant weight at the final step. This has the effect of shifting the value computed by the linear part.
That constant term is often called the bias.
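Following the arithmetic in the example, we can check that final step numerically with a couple of lines of NumPy:

```python
import numpy as np

# Linear part of the output node: weight 2 on the hidden value 12,
# plus the large negative bias weight -22 on the constant node.
y_linear = 2 * 12 + (-22) * 1       # = 24 - 22 = 2

# Apply the sigmoid activation to get the final output.
y = 1 / (1 + np.exp(-y_linear))
print(y_linear, round(y, 6))        # 2 0.880797
```

Without the bias shifting the linear part down by 22, the sigmoid input would be 24 and the output would saturate at essentially 1.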
In the process of forward propagation, the transformation of each layer into the next can be described as multiplication of a matrix of weights by the vector of inputs followed by application of the activation function. It’s worth seeing how this is described mathematically and can be implemented in code.
In the process, we’ll generalize the notation for a neural network to account for more hidden layers and to simplify it so that the input and output layers are no longer special cases.
Recall the labeled neural network we met before:
Let’s relabel the nodes so that all layers use a consistent notation:
Note that, in general, we’ll have some number $N$ of hidden layers. Thus, we have $N+2$ layers in total - including the input and output. We’ll number those layers $i=0,1,2,\dots,N,N+1$. In the figure, these appear as columns of nodes.
Each of the $N+2$ columns will have some number of nodes. Let’s say that the $i$th column has $n_i$ nodes plus a constant node with value 1. We’ll number the nodes in the $i$th column $0,1,2,\dots,n_i$. We don’t really need a constant node in the output column but it doesn’t hurt if it’s there.
Note that the value of the $k$th node in the $i$th column involves the sum, over the edges pointing into that node, of each weight times the value of the node that edge comes from, plus the constant weight. Finally, we apply the activation function $g_i$. In symbols:
$$x_{i,k}=g_i\left(\sum_{j=1}^{n_{i-1}}w_{i-1,j,k}\,x_{i-1,j}+w_{i-1,0,k}\times 1\right).$$
We can simplify that a bit by defining $x_{i-1,0}=1$ for all $i$, essentially fixing the constant values. The formula becomes
$$x_{i,k}=g_i\left(\sum_{j=0}^{n_{i-1}}w_{i-1,j,k}\,x_{i-1,j}\right).$$
Now for a given $i$, we can build an $(n_{i-1}+1)\times n_i$ matrix indexed by the edges pointing from the nodes in column $i-1$ to the non-constant nodes in column $i$. The entry in row $j$ and column $k$ of that matrix should be the weight $w_{i-1,j,k}$ associated with the edge from node $j$ in layer $i-1$ to node $k$ in layer $i$.
If we then place the node values in the $(i-1)$st layer in a row vector and form the dot product of that vector with the $k$th column of the matrix, we get
$$\sum_{j=0}^{n_{i-1}}w_{i-1,j,k}\,x_{i-1,j}.$$
That’s exactly the linear portion in the formula on the previous slide; thus, we should be able to express that portion in a conveniently compact notation.
We’ve got a feed-forward neural network with 1 input layer, $N$ hidden layers, and 1 output layer. We suppose that layer $i$ has $n_i+1$ nodes numbered $0,1,\dots,n_i$ and we want to describe how the values propagate from layer $i-1$ to layer $i$.
Form the matrix $W_i=[w_{i-1,j,k}]_{j,k}$ of weights from layer $i-1$ to layer $i$.
Suppose also that $X_{i-1}=[1,x_{i-1,1},\dots,x_{i-1,n_{i-1}}]$ is the row vector of values in the $(i-1)$st column. Then,
$$X_i=g_i\left(X_{i-1}W_i\right).$$
This is just the activation function applied termwise to the result of a matrix product. Better yet, X0 could be a matrix of rows of inputs that we want to evaluate; we can then use the same formulae to evaluate a whole slew of inputs en masse.
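Here’s a minimal NumPy sketch of that matrix formulation. The network shape and weights below are invented purely to show the mechanics: at each step we prepend the constant node and compute $X_i = g_i(X_{i-1}W_i)$ for a whole matrix of input rows at once.

```python
import numpy as np

def forward(X0, weights, activations):
    """Propagate inputs through the network via X_i = g_i(X_{i-1} @ W_i).

    X0          : array of shape (num_samples, n_0), one input per row
    weights     : list of weight matrices; each W has shape
                  (n_{i-1} + 1, n_i), with row 0 holding the constant
                  (bias) weights
    activations : list of elementwise activation functions g_i
    """
    X = X0
    for W, g in zip(weights, activations):
        X = np.column_stack([np.ones(len(X)), X])  # prepend constant node
        X = g(X @ W)                               # termwise activation
    return X

relu = lambda x: np.maximum(x, 0)
sigmoid = lambda x: 1 / (1 + np.exp(-x))

# Hypothetical weights for a 2-3-1 network, just to show the mechanics
W0 = np.array([[0.1, -0.2, 0.3], [1.0, 0.5, -1.0], [-0.5, 1.0, 0.5]])
W1 = np.array([[0.2], [1.0], [-1.0], [0.5]])
X0 = np.array([[0.0, 1.0], [1.0, 1.0]])
print(forward(X0, [W0, W1], [relu, sigmoid]))
```

Because the inputs sit in the rows of `X0`, evaluating a thousand inputs is the same two matrix products as evaluating one.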
Let’s take a look at another small and real example that yields an actual and important computation - the implementation of the elementary logic gates.
To be clear, these are very simple functions with much simpler implementations than the development of a neural network. Nonetheless, their implementation as neural networks is interesting and illustrates some critical concepts, such as
The logic gates are binary, Boolean functions. Thus, they take a pair of variables whose values can be 0 or 1 and they return a single result whose value can also be 0 or 1. Since there are only four possible inputs, it’s easy to directly define these by simply specifying the outputs. Here are the standard definitions of And, Or, and Implies:
And
(1,1) → 1
(1,0) → 0
(0,1) → 0
(0,0) → 0
Or
(1,1) → 1
(1,0) → 1
(0,1) → 1
(0,0) → 0
Implies
(1,1) → 1
(1,0) → 0
(0,1) → 1
(0,0) → 1
We can model those first three logic gates with just about the simplest neural network you could imagine. We obviously need the input and output layers but we’ll have no hidden layers. In addition, we’ll use an activation function that applies the sigmoid and then rounds the result to obtain a 0 or 1.
This type of setup is called a perceptron and, schematically, looks like so:
Applying the neural network interpretation we’ve developed, we find that the perceptron produces values via the formula $y=w_1x_1+w_2x_2+w_0$. This produces a real number. The sigmoid maps that to the open interval $(0,1)$, with zero mapping to the midpoint $1/2$, negative numbers mapping to the bottom half, and positive numbers to the top. Finally, rounding yields a 0 or 1 result as desired.
Perhaps a simpler way to describe the result is that the model produces
0 when $y=w_1x_1+w_2x_2+w_0<0$ and 1 when $y=w_1x_1+w_2x_2+w_0>0$.

One way (of many) to pick weights to yield the And gate is $w_1=w_2=2$ and $w_0=-3$. We can then compute $y=2x_1+2x_2-3$ and classify according to whether $y<0$ or $y>0$. To see this in action is a simple matter of plugging in:
$$y(1,1)=1>0 \implies \text{output}=1$$
$$y(1,0)=-1<0 \implies \text{output}=0$$
$$y(0,1)=-1<0 \implies \text{output}=0$$
$$y(0,0)=-3<0 \implies \text{output}=0.$$
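We can verify this in a few lines of Python, using the weights $w_1=w_2=2$ and $w_0=-3$ from above:

```python
import numpy as np

# Perceptron weights for the And gate: w1 = w2 = 2, w0 = -3
w1, w2, w0 = 2, 2, -3

def and_gate(x1, x2):
    y = w1 * x1 + w2 * x2 + w0               # linear part
    return int(round(1 / (1 + np.exp(-y))))  # sigmoid, then round to 0 or 1

print([and_gate(*x) for x in [(1, 1), (1, 0), (0, 1), (0, 0)]])  # [1, 0, 0, 0]
```

Only the input $(1,1)$ pushes the linear part above zero, so only it rounds up to 1.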
The reader is invited to explore possibilities to produce the Or and Implies gates.
Here’s one more logic gate, namely Exclusive or, often denoted by Xor:
(1,1) → 0
(1,0) → 1
(0,1) → 1
(0,0) → 0

Perhaps surprisingly at first glance, Xor throws a bit of a wrench into all this happiness. Fiddle as you might with the weights, you won’t be able to model Xor with the perceptron.
To understand how this classification works (or not), we can draw the points in the input space and color them blue for 1 or red for 0. Here’s what the picture looks like for the And gate:
Now, let’s plot those same points together with the line $2x_1+2x_2=3$. We can see clearly how that line breaks the set of inputs into the two classifications.
Now, here are the inputs for Xor colored by classification. Do you see the issue?
The perceptron effectively generates classifiers with linear boundaries. We need something a bit more complicated. We can achieve that by simply adding a hidden layer so that the perceptron becomes a more general neural network. Here’s how that looks:
I’m going to implement the neural network in the preceding picture with low-level NumPy code. Let’s begin by importing NumPy and defining the sigmoid activation function.

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# The matrix of inputs
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
# The matrix of weights from the
# input layer 0 to hidden layer 1
W0 = np.array([
[-3,-6], # From node 0 - the constant
[6,4], # From node 1
[6,4] # From node 2
])
# The matrix of weights from hidden
# layer 1 to output layer 2
W1 = np.array([[-3],[7],[-8]])
Now we perform the computation. At each step, we augment the values with a column of ones on the left, perform the matrix multiplication, and then apply the sigmoid. In addition, we round the results for the final output.
X0 = np.column_stack([np.ones(len(X)),X])
X1 = sigmoid(X0 @ W0)
X1 = np.column_stack([np.ones(len(X)), X1])
X2 = sigmoid(X1 @ W1)
np.round(X2).flatten()
array([0., 1., 1., 0.])
The output is exactly what we want for Xor!
The contours of the classification function look somewhat different now:
A reasonable question, of course, is - how did you find those weights?
By fitting the model to the desired output, of course!!
The plots in the last column of slides illustrate an important point: In low dimensions, we can illuminate the possibilities and limitations of a classifier by examining the contours produced by various data sets.
In this last column of slides we’re going to do exactly that for a couple of sample data sets that illustrate why it is that classification by neural networks might work better than logistic regression in some cases.
You might remember the penguins data set I’ve got on my webpage:
| | species | island | culmen_length_mm | culmen_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 103 | Adelie | Biscoe | 37.8 | 20.0 | 190.0 | 4250.0 | MALE |
| 189 | Chinstrap | Dream | 52.0 | 20.7 | 210.0 | 4800.0 | MALE |
| 41 | Adelie | Dream | 40.8 | 18.4 | 195.0 | 3900.0 | MALE |
| 209 | Chinstrap | Dream | 49.3 | 19.9 | 203.0 | 4050.0 | MALE |
| 318 | Gentoo | Biscoe | 48.4 | 14.4 | 203.0 | 4625.0 | FEMALE |
Here’s a scatter plot of the first two principal components of the numerical data:
If we fit a classifier to that data, we can apply that classifier to the x,y-points in the rectangle and color them according to the classification to get an idea of how well the classifier works.
The next pair of slides shows this process applied to the Palmer penguins using a logistic classifier and a neural network classifier.
Here’s another data set that’s artificially constructed to come in intertwined portions: