Mon, Apr 06, 2026
Last time, we learned a bit about networks and graph theory. We ended with abstract syntax trees, which provide a graph theoretic way to view algebraic expressions. Since trees can be manipulated as data structures, this yields a framework for us to perform algebra on the computer.
Today, we’re going to take a look at an alternative way to represent algebraic expressions as graphs - the so-called expression graph, also called a computation graph. This alternative is better suited for efficient numerical computation of values and derivatives.
An expression graph is a representation of a mathematical expression in the form of a directed acyclic graph, or DAG, where the nodes are the inputs, constants, and operations of the expression and the directed edges indicate how values flow from operands into operations.
Let’s just jump in and see what these things look like! Here’s the expression graph for \[ (x+y)\times(x+1): \]
We can plug values in for \(x\) and \(y\). \[ (2+1)\times(2+1): \]
Any values that we plug in propagate forward to produce a final value. \[ (2+1)\times(2+1) = 9: \]
Note how the values and operators combine to form subsequent values and how those values propagate through the directed graph to form a final result of the computation.
This process is called forward propagation.
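As a quick sketch of how forward propagation might look in code (the `Node` class and the function names here are my own, not a reference implementation), we can store the nodes in topological order and evaluate each node after its inputs:

```python
class Node:
    """A node in an expression graph: an input, a constant, or an operation."""
    def __init__(self, op, inputs=(), value=None):
        self.op = op              # "input", "const", "add", "mul", ...
        self.inputs = list(inputs)
        self.value = value        # preset for inputs/consts; computed otherwise

def forward(nodes):
    """Evaluate nodes given in topological order (inputs before outputs).
    Only "add" and "mul" are handled, which suffices for this example."""
    for node in nodes:
        if node.op in {"input", "const"}:
            continue
        a, b = (n.value for n in node.inputs)
        if node.op == "add":
            node.value = a + b
        elif node.op == "mul":
            node.value = a * b
    return nodes[-1].value

# The graph for (x + y) * (x + 1) at x = 2, y = 1:
x = Node("input", value=2)
y = Node("input", value=1)
one = Node("const", value=1)
s1 = Node("add", [x, y])     # x + y
s2 = Node("add", [x, one])   # x + 1
prod = Node("mul", [s1, s2])
print(forward([x, y, one, s1, s2, prod]))  # -> 9
```

Note that each intermediate node caches its computed value, which is exactly what lets shared subexpressions be reused rather than recomputed.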
Expression graphs are concretely distinct from syntax trees, which start with the top-level operator, end at the inputs, and take the form of a tree.
Let’s look at a few more examples - starting with my favorite function \(e^{-x^2}\):
If we multiply that last example by \(x^2\), we see how that value is reused. Thus, this is the expression graph for \(x^2 e^{-x^2}\):
This reuse of subexpressions is one of the features that makes expression graphs more efficient.
Here’s the expression graph for the polynomial
\[ x^4 - 2x^3 - 3x^2 - 4x - 5: \]
Here’s the expression graph of the normal distribution, which has three symbolic inputs with values \(x=0\), \(\mu=0\), and \(\sigma=1\): \[ \text{exp}(-(x-\mu)^2/(2\sigma^2))/\sqrt{2\pi\sigma^2}. \]
With just a little more information, we can use the expression graph to compute the values of the partial derivatives of the function. This process is called backpropagation.
Back in my day, long before ML was so cool, this was called automatic differentiation.
Consider the function \(f(x)=e^{-x^2}\), which can be computed in three steps: \[ x \xrightarrow{\text{pow}(2)} x^2 \xrightarrow{\text{neg}} -x^2 \xrightarrow{\text{exp}} e^{-x^2}. \]
Thus, application of the chain rule results in a product of three terms:
\[ \frac{d}{dx} e^{-x^2} = (2x)\times(-1)\times \left(e^{-x^2}\right). \]
If we compute this at a value like \(x=1/4\), then we get a numeric result like \[ \begin{aligned} \frac{d}{dx} e^{-x^2}\Big|_{x=1/4} &= \left(2\times\frac{1}{4}\right)\times(-1)\times \left(e^{-(1/4)^2}\right) \\ &= (\frac{1}{2}) \times (-1) \times (0.939413) = -0.469707 \end{aligned} \]
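We can check this product numerically with a few lines of plain Python, comparing against a finite-difference approximation of the derivative (a sanity check, not part of the lecture):

```python
import math

x = 1 / 4
# The three chain-rule factors, one per step of the computation:
factors = [2 * x, -1.0, math.exp(-x**2)]
derivative = factors[0] * factors[1] * factors[2]
print(round(derivative, 6))  # -> -0.469707

# Compare against a central finite-difference approximation:
f = lambda t: math.exp(-t**2)
h = 1e-6
approx = (f(x + h) - f(x - h)) / (2 * h)
```

The two results agree to many decimal places, as they should.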
We can use the graph to keep track of these steps:
Read from right to left:
The fact that we multiply these all together is exactly the chain rule.
When an expression is reused, it will produce multiple arrows all of which sum together to contribute to the accumulated derivative. That’s the sum rule!
We could illustrate this with the same function we just used by writing \[ e^{-x^2} = e^{-(x\times x)}: \]
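As a sanity check (a sketch, not from the lecture): writing \(x^2\) as \(x\times x\), the multiplication node sends a contribution of its incoming gradient times the *other* factor down each of the two arrows into \(x\), and the two contributions sum:

```python
import math

x = 0.25
fx = math.exp(-(x * x))

# Gradient arriving at the "mul" node after passing back through
# exp (multiply by e^{-x^2}) and neg (flip the sign):
grad_mul = -fx

# The mul rule contributes grad * other_factor along each arrow into x:
contrib_left = grad_mul * x    # x appearing as the left factor
contrib_right = grad_mul * x   # x appearing as the right factor
x_grad = contrib_left + contrib_right  # the sum rule

# Agrees with the closed form -2x e^{-x^2}:
print(x_grad, -2 * x * fx)
```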
In practice, all this takes place during a graph traversal. At any step, there is a node under consideration whose function value and grad value are known. There are one or two nodes pointing in whose function values are known but whose grad values are yet to be determined. We then update those grad values according to these rules:
# Visit the nodes in reverse topological order, so each node's grad is
# complete before it is pushed along to the nodes that point into it.
for node in reversed(nodes):  # nodes listed in topological order
    if node.op in {"input", "const"}:
        continue  # inputs and constants have no incoming nodes to update
    elif node.op == "add":
        a, b = node.inputs
        a.grad += node.grad
        b.grad += node.grad
    elif node.op == "sub":
        a, b = node.inputs
        a.grad += node.grad
        b.grad -= node.grad
    elif node.op == "mul":
        a, b = node.inputs
        a.grad += node.grad * b.value
        b.grad += node.grad * a.value
    elif node.op == "div":
        a, b = node.inputs
        a.grad += node.grad * (1 / b.value)
        b.grad += node.grad * (-a.value / (b.value ** 2))
    elif node.op == "pow_const":
        (a,) = node.inputs
        p = node.param  # the constant exponent
        a.grad += node.grad * p * (a.value ** (p - 1))
    elif node.op == "neg":
        (a,) = node.inputs
        a.grad -= node.grad
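Putting the pieces together, here's one minimal, self-contained sketch of forward propagation followed by backpropagation (the `Node` class and names are my own, not a reference implementation), run on our initial example \((x+y)\times(x+1)\) at \(x=2\) and \(y=1\):

```python
class Node:
    def __init__(self, op, inputs=(), value=None):
        self.op = op              # "input", "const", "add", "mul"
        self.inputs = list(inputs)
        self.value = value        # preset for inputs/consts
        self.grad = 0.0

def backward(nodes):
    """nodes must be in topological order; runs forward then backward."""
    # Forward pass: compute values.
    for node in nodes:
        if node.op == "add":
            node.value = node.inputs[0].value + node.inputs[1].value
        elif node.op == "mul":
            node.value = node.inputs[0].value * node.inputs[1].value
    # Backward pass: accumulate gradients from the output to the inputs.
    nodes[-1].grad = 1.0
    for node in reversed(nodes):
        if node.op == "add":
            a, b = node.inputs
            a.grad += node.grad
            b.grad += node.grad
        elif node.op == "mul":
            a, b = node.inputs
            a.grad += node.grad * b.value
            b.grad += node.grad * a.value

# (x + y) * (x + 1) at x = 2, y = 1:
x, y, one = Node("input", value=2), Node("input", value=1), Node("const", value=1)
s1, s2 = Node("add", [x, y]), Node("add", [x, one])
f = Node("mul", [s1, s2])
backward([x, y, one, s1, s2, f])
print(f.value, x.grad, y.grad)  # -> 9 6.0 3.0
```

Note how \(x\) receives contributions from both addition nodes, which sum to \(6\); that's the sum rule at work again.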
Rule for addition:
if node.op == "add":
a, b = node.inputs
a.grad += node.grad
b.grad += node.grad
Result in practice:
Rule for subtraction:
if node.op == "sub":
a, b = node.inputs
a.grad += node.grad
b.grad -= node.grad
Result in practice:
Rule for multiplication:
if node.op == "mul":
a, b = node.inputs
a.grad += node.grad * b.value
b.grad += node.grad * a.value
Result in practice:
Rule for powers:
if node.op == "pow_const":
(a,) = node.inputs
p = node.param
a.grad += node.grad * p * (a.value ** (p - 1))
Result in practice:
Here are a few more examples of expression graphs with backpropagation, starting with a revisit to our initial example \((x+y)\times(x+1)\) at the values \(x=2\) and \(y=1\):
In this example, \(f(x,y) = x^2y - xy\). Thus, \[ \begin{aligned} f_x(x,y) &= 2xy-y & \text{ so } \quad f_x(2,1) &= 2\times2\times1 - 1 = 3 \\ f_y(x,y) &= x^2 - x & \text{ so } \quad f_y(2,1) &= 2^2 - 2 = 2. \end{aligned} \]
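These hand computations are easy to double-check numerically with finite differences (a quick sketch, not part of the lecture):

```python
# Central finite-difference check of the partials of f(x,y) = x^2 y - x y
f = lambda x, y: x**2 * y - x * y
h = 1e-6
x, y = 2, 1
fx = (f(x + h, y) - f(x - h, y)) / (2 * h)  # should be close to 3
fy = (f(x, y + h) - f(x, y - h)) / (2 * h)  # should be close to 2
print(fx, fy)
```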
The sigmoid function \[ \sigma(x) = \frac{1}{1+e^{-x}} \] plays a special role in machine learning. We’ve already seen how it appears as a cumulative distribution function in logistic regression.
The sigmoid is also used as a common “activation function” in neural networks for binary classification. In that context, we represent \[ \sigma(aw+bx+cy+dz) \] like so:
The bottom line is that we need to be able to incorporate the sigmoid into our expression graphs. This turns out to be easy, since the sigmoid satisfies \[ \sigma'(x) = \sigma(x)(1-\sigma(x)). \]
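In the style of the rules above, the sigmoid's backprop step needs nothing beyond the node's own forward output, since \(\sigma'(x) = \sigma(x)(1-\sigma(x))\). Here's a small sketch (the function names are my own):

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# Backprop rule for a sigmoid node: `value` is the node's forward output
# sigma(x), and `grad` is the gradient flowing into the node from above.
# The contribution pushed to the input is grad * sigma(x) * (1 - sigma(x)).
def sigmoid_backward(value, grad):
    return grad * value * (1 - value)

v = sigmoid(1.5)               # 0.817574...
print(sigmoid_backward(v, 1))  # 0.149146...
```

Notice that no extra exponentials need to be evaluated during the backward pass; the cached forward value is enough.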
Proving this is a matter of direct computation:
\[ \begin{aligned} \sigma'(x) &= \frac{d}{dx} \frac{1}{1+e^{-x}} = \frac{e^{-x}}{(1+e^{-x})^2} \\ &= \frac{1}{1+e^{-x}}\frac{e^{-x}}{1+e^{-x}} = \sigma(x)(1-\sigma(x)). \end{aligned} \]
Here’s what the use of this function in an expression graph looks like:
Note that the value of the sigmoid after computation is \(0.817574\) in this example. When we propagate back, the value of the derivative is \[0.817574(1-0.817574) = 0.149146.\]