Mon, Apr 06, 2026
Last time, we learned a bit about networks and graph theory. We ended with abstract syntax trees, which provide a graph theoretic way to view algebraic expressions. Since trees can be manipulated as data structures, this yields a framework for us to perform algebra on the computer.
Today, we’re going to take a look at an alternative way to represent algebraic expressions as graphs - the so-called expression graph, also called a computation graph. This alternative is better suited for efficient numerical computation of values and derivatives.
An expression graph is a representation of a mathematical expression in the form of a directed acyclic graph, or DAG, where the nodes are the inputs, constants, and operations of the expression and the directed edges indicate how values flow from operands into operations.
Let’s just jump in and see what these things look like! Here’s the expression graph for \[ (x+y)\times(x+1): \]
We can plug values in for \(x\) and \(y\). \[ (2+1)\times(2+1): \]
Any values that we plug in propagate forward to produce a final value. \[ (2+1)\times(2+1) = 9: \]
Note how the values and operators combine to form subsequent values and how those values propagate through the directed graph to form a final result of the computation.
This process is called forward propagation.
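As a quick sketch of how forward propagation might look in code (the `Node` class and the function names here are my own, not a reference implementation), we can store the nodes in topological order and evaluate each node after its inputs:

```python
class Node:
    """A node in an expression graph: an input, a constant, or an operation."""
    def __init__(self, op, inputs=(), value=None):
        self.op = op              # "input", "const", "add", "mul", ...
        self.inputs = list(inputs)
        self.value = value        # preset for inputs/consts; computed otherwise

def forward(nodes):
    """Evaluate nodes given in topological order (inputs before outputs).
    Only "add" and "mul" are handled, which suffices for this example."""
    for node in nodes:
        if node.op in {"input", "const"}:
            continue
        a, b = (n.value for n in node.inputs)
        if node.op == "add":
            node.value = a + b
        elif node.op == "mul":
            node.value = a * b
    return nodes[-1].value

# The graph for (x + y) * (x + 1) at x = 2, y = 1:
x = Node("input", value=2)
y = Node("input", value=1)
one = Node("const", value=1)
s1 = Node("add", [x, y])     # x + y
s2 = Node("add", [x, one])   # x + 1
prod = Node("mul", [s1, s2])
print(forward([x, y, one, s1, s2, prod]))  # -> 9
```

Note that each intermediate node caches its computed value, which is exactly what lets shared subexpressions be reused rather than recomputed.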
Expression graphs are concretely distinct from syntax trees, which start with the top-level operator, end at the inputs, and take the form of a tree.
Let’s look at a few more examples - starting with my favorite function \(e^{-x^2}\):
If we multiply that last example by \(x^2\), we see how that value is reused. Thus, this is the expression graph for \(x^2 e^{-x^2}\):
This reuse of subexpressions is one of the features that makes expression graphs more efficient.
Here’s the expression graph for the polynomial
\[ x^4 - 2x^3 - 3x^2 - 4x - 5: \]
Here’s the expression graph of the normal distribution, which has three symbolic inputs with values \(x=0\), \(\mu=0\), and \(\sigma=1\): \[ \text{exp}(-(x-\mu)^2/(2\sigma^2))/\sqrt{2\pi\sigma^2}. \]
With just a little more information, we can use the expression graph to compute the values of the partial derivatives of the function. This process is called backpropagation.
Back in my day, long before ML was so cool, this was called automatic differentiation.
Consider the function \(f(x)=e^{-x^2}\), which can be computed in three steps: \[ x \xrightarrow{\text{pow}(2)} x^2 \xrightarrow{\text{neg}} -x^2 \xrightarrow{\text{exp}} e^{-x^2}. \]
Thus, application of the chain rule results in a product of three terms:
\[ \frac{d}{dx} e^{-x^2} = (2x)\times(-1)\times \left(e^{-x^2}\right). \]
If we compute this at a value like \(x=1/4\), then we get a numeric result like \[ \begin{aligned} \frac{d}{dx} e^{-x^2}\Big|_{x=1/4} &= \left(2\times\frac{1}{4}\right)\times(-1)\times \left(e^{-(1/4)^2}\right) \\ &= (\frac{1}{2}) \times (-1) \times (0.939413) = -0.469707 \end{aligned} \]
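We can check this product numerically with a few lines of plain Python, comparing against a finite-difference approximation of the derivative (a sanity check, not part of the lecture):

```python
import math

x = 1 / 4
# The three chain-rule factors, one per step of the computation:
factors = [2 * x, -1.0, math.exp(-x**2)]
derivative = factors[0] * factors[1] * factors[2]
print(round(derivative, 6))  # -> -0.469707

# Compare against a central finite-difference approximation:
f = lambda t: math.exp(-t**2)
h = 1e-6
approx = (f(x + h) - f(x - h)) / (2 * h)
```

The two results agree to many decimal places, as they should.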
We can use the graph to keep track of these steps:
Read from right to left:
The fact that we multiply these all together is exactly the chain rule.
When an expression is reused, it will produce multiple arrows all of which sum together to contribute to the accumulated derivative. That’s the sum rule!
We could illustrate this with the same function we just used by writing \[ e^{-x^2} = e^{-(x\times x)}: \]
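As a sanity check (a sketch, not from the lecture): writing \(x^2\) as \(x\times x\), the multiplication node sends a contribution of its incoming gradient times the *other* factor down each of the two arrows into \(x\), and the two contributions sum:

```python
import math

x = 0.25
fx = math.exp(-(x * x))

# Gradient arriving at the "mul" node after passing back through
# exp (multiply by e^{-x^2}) and neg (flip the sign):
grad_mul = -fx

# The mul rule contributes grad * other_factor along each arrow into x:
contrib_left = grad_mul * x    # x appearing as the left factor
contrib_right = grad_mul * x   # x appearing as the right factor
x_grad = contrib_left + contrib_right  # the sum rule

# Agrees with the closed form -2x e^{-x^2}:
print(x_grad, -2 * x * fx)
```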
In practice, all this takes place during a graph traversal. At any step, there is a node under consideration whose function value and grad value are known. There are one or two nodes pointing in whose function values are known but whose grad values are yet to be determined. We then update those grad values according to these rules:
# Visit the nodes in reverse topological order, so each node's grad is
# complete before it is pushed along to the nodes that point into it.
for node in reversed(nodes):  # nodes listed in topological order
    if node.op in {"input", "const"}:
        continue  # inputs and constants have no incoming nodes to update
    elif node.op == "add":
        a, b = node.inputs
        a.grad += node.grad
        b.grad += node.grad
    elif node.op == "sub":
        a, b = node.inputs
        a.grad += node.grad
        b.grad -= node.grad
    elif node.op == "mul":
        a, b = node.inputs
        a.grad += node.grad * b.value
        b.grad += node.grad * a.value
    elif node.op == "div":
        a, b = node.inputs
        a.grad += node.grad * (1 / b.value)
        b.grad += node.grad * (-a.value / (b.value ** 2))
    elif node.op == "pow_const":
        (a,) = node.inputs
        p = node.param  # the constant exponent
        a.grad += node.grad * p * (a.value ** (p - 1))
    elif node.op == "neg":
        (a,) = node.inputs
        a.grad -= node.grad
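Putting the pieces together, here's one minimal, self-contained sketch of forward propagation followed by backpropagation (the `Node` class and names are my own, not a reference implementation), run on our initial example \((x+y)\times(x+1)\) at \(x=2\) and \(y=1\):

```python
class Node:
    def __init__(self, op, inputs=(), value=None):
        self.op = op              # "input", "const", "add", "mul"
        self.inputs = list(inputs)
        self.value = value        # preset for inputs/consts
        self.grad = 0.0

def backward(nodes):
    """nodes must be in topological order; runs forward then backward."""
    # Forward pass: compute values.
    for node in nodes:
        if node.op == "add":
            node.value = node.inputs[0].value + node.inputs[1].value
        elif node.op == "mul":
            node.value = node.inputs[0].value * node.inputs[1].value
    # Backward pass: accumulate gradients from the output to the inputs.
    nodes[-1].grad = 1.0
    for node in reversed(nodes):
        if node.op == "add":
            a, b = node.inputs
            a.grad += node.grad
            b.grad += node.grad
        elif node.op == "mul":
            a, b = node.inputs
            a.grad += node.grad * b.value
            b.grad += node.grad * a.value

# (x + y) * (x + 1) at x = 2, y = 1:
x, y, one = Node("input", value=2), Node("input", value=1), Node("const", value=1)
s1, s2 = Node("add", [x, y]), Node("add", [x, one])
f = Node("mul", [s1, s2])
backward([x, y, one, s1, s2, f])
print(f.value, x.grad, y.grad)  # -> 9 6.0 3.0
```

Note how \(x\) receives contributions from both addition nodes, which sum to \(6\); that's the sum rule at work again.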
Rule for addition:
if node.op == "add":
a, b = node.inputs
a.grad += node.grad
b.grad += node.grad
Result in practice:
Rule for subtraction:
if node.op == "sub":
a, b = node.inputs
a.grad += node.grad
b.grad -= node.grad
Result in practice:
Rule for multiplication:
if node.op == "mul":
a, b = node.inputs
a.grad += node.grad * b.value
b.grad += node.grad * a.value
Result in practice:
Rule for powers:
if node.op == "pow_const":
(a,) = node.inputs
p = node.param
a.grad += node.grad * p * (a.value ** (p - 1))
Result in practice:
Here are a few more examples of expression graphs with backpropagation, starting with a revisit to our initial example \((x+y)\times(x+1)\) at the values \(x=2\) and \(y=1\):
In this example, \(f(x,y) = x^2y - xy\). Thus, \[ \begin{aligned} f_x(x,y) &= 2xy-y & \text{ so } \quad f_x(2,1) &= 2\times2\times1 - 1 = 3 \\ f_y(x,y) &= x^2 - x & \text{ so } \quad f_y(2,1) &= 2^2 - 2 = 2. \end{aligned} \]
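These hand computations are easy to double-check numerically with finite differences (a quick sketch, not part of the lecture):

```python
# Central finite-difference check of the partials of f(x,y) = x^2 y - x y
f = lambda x, y: x**2 * y - x * y
h = 1e-6
x, y = 2, 1
fx = (f(x + h, y) - f(x - h, y)) / (2 * h)  # should be close to 3
fy = (f(x, y + h) - f(x, y - h)) / (2 * h)  # should be close to 2
print(fx, fy)
```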
The sigmoid function \[ \sigma(x) = \frac{1}{1+e^{-x}} \] plays a special role in machine learning. We’ve already seen how it appears as a cumulative distribution function in logistic regression.
The sigmoid is also used as a common “activation function” in neural networks for binary classification. In that context, we represent \[ \sigma(aw+bx+cy+dz) \] like so:
The bottom line is that we need to be able to incorporate the sigmoid into our expression graphs. This turns out to be easy, since the sigmoid satisfies \[ \sigma'(x) = \sigma(x)(1-\sigma(x)). \]
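In the style of the rules above, the sigmoid's backprop step needs nothing beyond the node's own forward output, since \(\sigma'(x) = \sigma(x)(1-\sigma(x))\). Here's a small sketch (the function names are my own):

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# Backprop rule for a sigmoid node: `value` is the node's forward output
# sigma(x), and `grad` is the gradient flowing into the node from above.
# The contribution pushed to the input is grad * sigma(x) * (1 - sigma(x)).
def sigmoid_backward(value, grad):
    return grad * value * (1 - value)

v = sigmoid(1.5)               # 0.817574...
print(sigmoid_backward(v, 1))  # 0.149146...
```

Notice that no extra exponentials need to be evaluated during the backward pass; the cached forward value is enough.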
Proving this is a matter of direct computation:
\[ \begin{aligned} \sigma'(x) &= \frac{d}{dx} \frac{1}{1+e^{-x}} = \frac{e^{-x}}{(1+e^{-x})^2} \\ &= \frac{1}{1+e^{-x}}\frac{e^{-x}}{1+e^{-x}} = \sigma(x)(1-\sigma(x)). \end{aligned} \]
Here’s what the use of this function in an expression graph looks like:
Note that the value of the sigmoid after computation is \(0.817574\) in this example. When we propagate back, the value of the derivative is \[0.817574(1-0.817574) = 0.149146.\]