Multivariable Calculus

Wed, Jan 22, 2025

Why multivariable calculus?

Last week, we discussed the single variable calculus that we all know from Calc I, emphasizing the specific elements that are of particular importance in machine learning. In part, that includes optimizing functions using numerical techniques.

In the context of machine learning, though, the functions that we wish to optimize often have many inputs and optimizing those kinds of functions lies in the domain of multivariable calculus.

In this presentation, we’ll survey just enough of Calc III to discuss optimization of functions in more than one variable. For now, that boils down to solving systems of equations produced by partial derivatives. There’s plenty more multivariable calculus to come, though!

3D coordinates

The 3D coordinate system is a lot like the 2D system with an extra axis. There are three axes meeting at the origin at right angles.

Note that the axes must obey the right hand rule.

Plotting points

Plotting points is very similar to the way we plot points in 2D - just follow the specified distances along the axes.

Constant planes

If we set one of the variables equal to a constant, we determine a plane perpendicular to the axis for that variable.

More general planes

We can use more general equations to determine other planes. The plane below has the equation \[x+y+3z=3.\]

The distance formula

The distance between two points \((x_1,y_1,z_1)\) and \((x_2,y_2,z_2)\) in three-dimensional space is \[d = \sqrt{(x_1-x_2)^2 + (y_1-y_2)^2 + (z_1-z_2)^2}.\] Thus, the equation of a sphere with center \((x_0,y_0,z_0)\) and radius \(r\) is \[(x-x_0)^2 + (y-y_0)^2 + (z-z_0)^2 = r^2. \]
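
As a quick numerical check of the distance formula (the two points below are made up purely for illustration), we might compute a distance in Python:

Code
import numpy as np

# Two sample points in 3D space (chosen only for illustration)
p1 = np.array([1, 2, 3])
p2 = np.array([4, 6, 3])

# Square root of the sum of squared coordinate differences
d = np.sqrt(np.sum((p1 - p2)**2))
print(d)  # 5.0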

A sphere

The sphere below has center \((3,2,2)\) and radius \(r=2\). Thus, an equation for the sphere is \[(x-3)^2 + (y-2)^2 + (z-2)^2 = 4.\]

Multivariable functions

We are often interested in real-valued functions of several or even many variables. In general, our functions may have \(n\) variables and we write \[f:\mathbb R^n \to \mathbb R.\]

For example, \[f(w,x,y,z) = w + x^2 + y^3 + z^4 - 5 \sin(w + x^2 + y^3 + z^4)\] has four variables and maps \(\mathbb R^4 \to \mathbb R.\)
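
As a small sketch, this function is easy to code directly in Python; the evaluation point below is arbitrary, chosen only for illustration:

Code
import numpy as np

# f maps R^4 to R
def f(w, x, y, z):
    s = w + x**2 + y**3 + z**4
    return s - 5*np.sin(s)

# Evaluate at an arbitrary point of R^4
print(f(1.0, 0.5, -1.0, 2.0))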

For the time being, we’ll stick with \(n=2\); such functions are often called bivariate.

Graphs of bivariate functions

The graph of a function \(f:\mathbb R^2 \to \mathbb R\) is the set \[ \{(x,y,z)\in\mathbb R^3: z = f(x,y)\}. \] In general, this looks like a surface:

Paraboloids

A good family of functions to know is \(f(x,y) = ax^2 + by^2\). In the figure below, for example, \(a=b=1\) so we see the graph of \(f(x,y) = x^2 + y^2\).
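
A figure like this can be produced with Matplotlib’s plot_surface; here’s a rough sketch of the kind of code behind such a plot, with the plotting window chosen arbitrarily:

Code
import numpy as np
import matplotlib.pyplot as plt

# Grid of (x, y) values
x = np.linspace(-2, 2, 100)
y = np.linspace(-2, 2, 100)
X, Y = np.meshgrid(x, y)
Z = X**2 + Y**2  # a = b = 1

# Plot the graph z = f(x, y) as a surface
ax = plt.figure().add_subplot(projection='3d')
ax.plot_surface(X, Y, Z, cmap='viridis')
plt.show()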

Another paraboloid

If \(a=1\) and \(b = -1\), we get a somewhat different paraboloid.

Cross-sections

These are called “paraboloids” because their cross-sections are parabolas. A cross-section of a 3D object is a slice by a vertical plane. We can create one by setting either \(x\) or \(y\) constant.

The other paraboloid

The graph of \(f(x,y) = y^2 - x^2\) is also a paraboloid because its cross-sections are also parabolas. When \(y\) is held constant, we get parabolas opening down. The origin is often called a saddle point for this example.

Contours

Another way to slice a 3D graph is with a horizontal plane. These slices are called contours.

Contour diagrams

We can draw a collection of contours in the plane to generate the contour diagram of the function. Here’s the contour diagram of \(f(x,y) = x^2 + 4y^2\). Note that lighter colors imply higher values.
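
A contour diagram like this might be produced with Matplotlib’s contourf; here’s a sketch, with the color map and plotting window chosen to match the description above:

Code
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-2, 2, 200)
y = np.linspace(-2, 2, 200)
X, Y = np.meshgrid(x, y)
Z = X**2 + 4*Y**2

# Filled contours; with this colormap, lighter colors are higher values
plt.contourf(X, Y, Z, levels=10, cmap='viridis')
plt.colorbar()
plt.gca().set_aspect('equal')
plt.show()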

Hyperbolic contours

The previous contour diagram displays an elliptic paraboloid because its contours are ellipses. The function \(f(x,y) = x^2 - 4y^2\) yields a hyperbolic paraboloid.

Peaks contours

MATLAB’s Peaks function has three peaks, three valleys and three saddle points.

\[f(x,y) = 3 \, (1-x)^2 e^{-x^2-(y+1)^2}-10 \, e^{-x^2-y^2} \left(-x^3+\frac{x}{5}-y^5\right)-\frac{1}{3} \, e^{-(x+1)^2-y^2}.\]
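
Here’s a direct translation of that formula into Python (my own sketch, not code taken from MATLAB):

Code
import numpy as np

def peaks(x, y):
    return (3*(1 - x)**2 * np.exp(-x**2 - (y + 1)**2)
            - 10*np.exp(-x**2 - y**2) * (-x**3 + x/5 - y**5)
            - np.exp(-(x + 1)**2 - y**2)/3)

print(peaks(0.0, 0.0))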

Peaks 3D

Here’s the Peaks function in 3D:

Maxima, minima, and saddle points

Note that it’s easy to see the locations of maxima and minima in a 3D plot. It might be even easier to see them in a contour plot.

It’s pretty easy to see saddle points, as well! Can you see the saddle points in the peaks contour diagram?

Partial derivatives

The simplest analog of the single variable derivative for bivariate functions is the partial derivative.

To compute a partial derivative of a function \(f(x,y)\) with respect to one variable, we simply hold the other variable constant. For example, if \(f(x,y) = y^2 - x^2\), then \[ \frac{\partial f}{\partial x} = f_x(x,y) = -2x. \]
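
If you’d like to check a computation like this symbolically, SymPy’s diff can do it; here’s a minimal sketch:

Code
import sympy as sp

x, y = sp.symbols('x y')
f = y**2 - x**2

# Differentiate with respect to x, treating y as a constant, and vice versa
print(sp.diff(f, x))  # -2*x
print(sp.diff(f, y))  # 2*y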

Partial pic

Geometrically, the partial derivative represents the rate of change of \(f\) in the positive direction of the variable.

Critical points

Given a function of two variables, a critical point is a point \((x_0,y_0)\) such that \[f_x(x_0,y_0) = 0 \text{ and } f_y(x_0,y_0) = 0.\]

If, for example, \(f(x,y) = a x^2 + b xy + c y^2\), then

\[\begin{aligned} f_x(x,y) &= 2ax + by \text{ and } \\ f_y(x,y) &= bx + 2cy. \end{aligned}\]

Thus \(f_x(0,0) = f_y(0,0) = 0\) so \((0,0)\) is a critical point of \(f\).

Critical points can be maxima, minima, or saddles and, as we’ve seen, they’re pretty easy to spot in contour diagrams.

Finding critical points

Let’s consider \(f(x,y) = x^2 - xy + y^2 - 3 y\). It’s pretty easy to see from a contour plot that there’s a minimum:

Setting up and solving a system

If \(f(x,y) = x^2 - xy + y^2 - 3 y\), then

\[\begin{aligned} \frac{\partial f}{\partial x} &= 2x - y \stackrel{?}{=} 0 \text{ and} \\ \frac{\partial f}{\partial y} &= -x + 2y - 3 \stackrel{?}{=} 0. \\ \end{aligned}\]

If I multiply the second equation by \(2\) and add the result to the first, I get \(3y-6=0\) so that \(y=2\). The first equation then tells me that \(x=1\).

That appears to agree with what we see in the figure, and our knowledge from that figure indicates that we have found a minimum.
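
We could also let SymPy set up and solve the system for us; here’s a quick sketch along those lines:

Code
import sympy as sp

x, y = sp.symbols('x y')
f = x**2 - x*y + y**2 - 3*y

# Solve f_x = 0 and f_y = 0 simultaneously
print(sp.solve([sp.diff(f, x), sp.diff(f, y)], [x, y]))  # {x: 1, y: 2}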

Gradients and optimization

Even if you do a little casual reading about machine learning and how it works, you’re likely to find the term gradient descent.

Gradient descent is a minimization technique that’s easy to implement and works well in high dimensional spaces, which is perfect for machine learning.

Let’s check out the basic theory explaining gradient descent and why it works.

The gradient vector

Given a function \(f\) of two variables, the gradient is a new function \(\nabla f\) that returns a two-dimensional vector. The components are exactly the two partial derivatives of the function.

Thus, \[\nabla f(x,y) = \left\langle \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \right\rangle.\]

For example, if \(f(x,y) = x^2 + y^2\), then \[\nabla f = \langle 2x, 2y \rangle = 2\langle x,y \rangle.\]

The gradient field

If we plot a collection of gradient vectors of a function emanating from a grid of points in the plane, we obtain the gradient field of the function. Here’s the gradient field of \(f(x,y) = x^2 + y^2\):
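
A gradient field like this can be drawn with Matplotlib’s quiver; here’s a rough sketch, with the grid spacing and plotting window chosen arbitrarily:

Code
import numpy as np
import matplotlib.pyplot as plt

# A coarse grid of base points for the arrows
x = np.linspace(-2, 2, 15)
y = np.linspace(-2, 2, 15)
X, Y = np.meshgrid(x, y)

# The gradient of f(x, y) = x^2 + y^2 is <2x, 2y>
U, V = 2*X, 2*Y

plt.quiver(X, Y, U, V)
plt.gca().set_aspect('equal')
plt.show()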

The geometry of the gradient

Geometrically, vectors are typically represented as arrows, and there are two main features of the gradient that you need to understand:

  • Direction: The gradient points in the direction of greatest increase of the function.
  • Magnitude: The magnitude of the gradient is the rate of change in that direction.

Gradients and contours

If you’ve ever gone hiking, you know that if you want to move as steeply as possible uphill, then your path should be perpendicular to the contours. You can see this by plotting a gradient field over the corresponding contour diagram.

Contours and gradients for peaks

It’s pretty easy to see that the previous figure is correct, since \(f(x,y) = x^2 + y^2\) is pretty simple. The contour/gradient relationship holds for more complicated functions, too - like the peaks function.

Gradient ascent

Note that gradient vectors generally point away from minima and towards maxima. Thus, if you follow gradient vectors, then you will either

  • diverge to infinity or
  • eventually find yourself at a local maximum.

That leads to a maximization technique called gradient ascent.

Illustration

Hover over or touch the figure below to see gradient ascent in action!

Gradient descent

Gradient descent is just the opposite:

Local minimizer

It’s worth repeating that gradient descent finds local minima. It comes with no guarantee concerning the global behavior.
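
Here’s a bare-bones sketch of the idea in code, using \(f(x,y) = x^2 + y^2\), whose gradient we computed earlier; the starting point and step size are arbitrary choices:

Code
import numpy as np

def grad_f(p):
    # Gradient of f(x, y) = x^2 + y^2 at the point p
    return 2*p

p = np.array([1.5, -1.0])  # arbitrary starting point
eta = 0.1                  # step size

# Repeatedly step against the gradient
for _ in range(100):
    p = p - eta*grad_f(p)

print(p)  # very close to the minimizer (0, 0)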

Linear regression

With an understanding of optimization in two variables, we are now in a position to discuss our first truly serious application, namely linear regression.

The basic question:

Suppose we’ve got data with two numeric variables that we suspect have a nearly linear relationship. How can we best model that relationship?

Comment

This column of slides is hugely important. It’s our first fully explained example of minimization of error in a model, which is a fundamental technique of machine learning.

Example

The figure below plots some very simple data:
[[0, 0], [1, 0], [1, 1], [2, 1], [2, 2], [3, 2]]

The parameters defining the function \(f(x)=ax+b\) are controlled by the sliders. The question is - “how small can we make the total squared error?”

Total squared error

In general, we’ve got data represented as a list of points: \[ \{(x_i,y_i)\}_{i=1}^N = \{(x_1,y_1),(x_2,y_2),(x_3,y_3),\ldots,(x_N,y_N)\}. \]

If we model that data with a function \(y=f(x)\), then the total squared error is defined by \[ E = \sum_{i=1}^N \left(y_i - f(x_i)\right)^2. \] The objective is to choose the parameters defining \(f\) to minimize \(E\).

More specifically

In the current example, we model the data with a first order polynomial \(f(x) = ax+b\). Thus, our formula takes on the more specific form \[ E(a,b) = \sum_{i=1}^N \left(y_i - (a\,x_i + b)\right)^2. \] Note that the data is fixed but we have control over the parameters. Thus, we can treat \(E\) as a function of the two variables \(a\) and \(b\) and use the techniques of multivariable calculus to perform the minimization.
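
Here’s a small sketch showing how we might compute \(E(a,b)\) numerically for this data in order to experiment with different parameter values:

Code
import numpy as np

# The data from the example
data = np.array([[0, 0], [1, 0], [1, 1], [2, 1], [2, 2], [3, 2]])
xs, ys = data[:, 0], data[:, 1]

def E(a, b):
    # Total squared error for the model f(x) = a*x + b
    return np.sum((ys - (a*xs + b))**2)

print(E(1, 0))  # the error for the line y = x, for example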

Even more specifically

Recall that the data consists of these points:

[[0, 0], [1, 0], [1, 1], [2, 1], [2, 2], [3, 2]]

and the error \(E\) as a function of \(a\) and \(b\) is

\[ E(a,b) = \sum_{i=1}^6 \left(y_i - f(x_i)\right)^2. \]

Writing that out in full, we get:

\(\displaystyle \begin{aligned} E(a,b) &=(0 - (a\times0 + b))^2+(0 - (a\times1 + b))^2+(1 - (a\times1 + b))^2\\ &+(1 - (a\times2 + b))^2+(2 - (a\times2 + b))^2+(2 - (a\times3 + b))^2\end{aligned}\)

Setting up the system

We can differentiate with respect to \(a\) to get

\(\displaystyle \begin{aligned} \frac{\partial E}{\partial a} &=(0)+(2 a + 2 b)+(2 a + 2 b - 2)\\ &+(8 a + 4 b - 4)+(8 a + 4 b - 8)+(18 a + 6 b - 12)\\ &=38 a + 18 b - 26\end{aligned}\)

And with respect to \(b\) to get

\(\displaystyle \begin{aligned} \frac{\partial E}{\partial b} &=(2 b)+(2 a + 2 b)+(2 a + 2 b - 2)\\ &+(4 a + 2 b - 2)+(4 a + 2 b - 4)+(6 a + 2 b - 4)\\ &=18 a + 12 b - 12\end{aligned}\)

Solving the system

Setting \(\partial E/\partial a = \partial E/\partial b = 0\), we get the system

\[\begin{aligned} 38a + 18b &= 26 \\ 18a + 12b &= 12. \end{aligned}\]

Multiply the second by 3/2 to get the new system \[\begin{aligned} 38a + 18b &= 26 \\ 27a + 18b &= 18. \end{aligned}\]

Subtract the second from the first to get \[11a = 8 \text{ or } a = 8/11.\]

Plug that back into either of the original equations to get \(b=-1/11\).

Solution wrap up

The formula for the line with the least total squared error is thus \[f(x) = \frac{8}{11} x - \frac{1}{11} = 0.\overline{72}x - 0.\overline{09}.\]

That appears to be in line with our dynamic illustration.

Note that the geometric nature of the problem guarantees that we’ve found a minimum, rather than a local max or saddle point.
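
As a sanity check (not part of the derivation above), NumPy’s polyfit computes the same least squares line directly:

Code
import numpy as np

xs = np.array([0, 1, 1, 2, 2, 3])
ys = np.array([0, 0, 1, 1, 2, 2])

# A degree 1 least squares fit returns the coefficients a and b of a*x + b
a, b = np.polyfit(xs, ys, 1)
print(a, b)  # approximately 0.7273 and -0.0909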

Variations on regression

Regression is a huge topic for us and we’ll spend plenty more time on it throughout the semester. Just about all of the linear algebra and higher dimensional calculus we learn can be applied to linear regression in some way as well.

In this column of slides, we’re going to take a look at a few variations on the basic idea we’ve just seen.

Extending the basis functions

Linear regression is so-called because we fit data with a linear combination of basis functions - i.e. a sum of constants times the functions. Often, the basis is chosen to be \(\{1,x\}\), which leads to approximations using functions of the form \[f(x) = a\times x + b\times 1.\] In that case the graph is a line.

There are other possibilities for the basis, though. For example, we might choose the basis to be \(\{1,x,x^2,x^3\}\). In this case, we’ll approximate the data with a cubic function - i.e. one of the form \[f(x) = ax^3 + bx^2 + cx + d.\]

Example

Suppose we wish to fit the points \(\{(n, \sin(n))\}_{n=1}^6\), as shown below:

Code
import numpy as np
import matplotlib.pyplot as plt

# The points (n, sin(n)) for n = 1, 2, ..., 6
xs = np.arange(1, 7)
ys = np.sin(xs)

# Plot the points as dots
plt.plot(xs, ys, 'o')
plt.show()

It looks like a cubic might fit fairly well.

Using SciPy’s least_squares

Here’s a cubic fit produced by SciPy’s least_squares function.

Code
import numpy as np
from scipy.optimize import least_squares
from IPython.display import Math, display
import matplotlib.pyplot as plt

# The data: the points (n, sin(n)) for n = 1, 2, ..., 6
xs = np.arange(1, 7)
ys = np.sin(xs)

# The cubic model and the residual function that least_squares minimizes
f = lambda x, a, b, c, d: a*x**3 + b*x**2 + c*x + d
e = lambda coeffs, x, y: f(x, *coeffs) - y
res = least_squares(e, [0, 0, 0, 0], args=(xs, ys))

# Evaluate the fitted cubic on a fine grid and plot it with the data
a, b, c, d = res.x
xx = np.linspace(0.5, 6.2, 500)
yy = f(xx, a, b, c, d)

plt.plot(xx, yy)
plt.plot(xs, ys, 'o')
plt.show()

\(\displaystyle \text{Fit: } \: f(x) = 0.1025x^3-0.9807x^2+2.2365x-0.5012\)

Data

Typically we apply these ideas to actual data. Here, for example, is a small slice of a data table stored on my website. Each row corresponds to a college football team taken from the 2024 season:

The scatter plot with regression line

Here’s a scatter plot of that data, where the horizontal variable represents total points scored and the vertical variable is win/loss percentage. The dark line, of course, is the regression line.

Logistic regression

In the plot below, each dot corresponds to a game from an NCAA basketball tournament played between 2010 and 2023. You can hover over the dots to get specific information on the individual games.

The horizontal axis effectively corresponds to the number of points that the first team is predicted to beat the second team by. That prediction was obtained by a linear regression technique. The vertical axis is a Boolean flag indicating whether team 1 actually beat team 2 or not.

Fit with the logistic curve

The objective is to make probabilistic predictions for future NCAA tournaments. We do so by fitting a curve like the one shown below to the data. That curve takes on values between zero and one and generally increases as the projected score difference increases. Thus, it makes sense to interpret these values as probabilities.

Formulae

The curve on the previous slide is called the logistic curve and also the sigmoid. A formula for the sigmoid is \[ \hat{p} = \frac{1}{1+e^{ax+b}}. \] That formula is definitely not a linear combination of basis functions. We can solve for the \(ax+b\) inside the exponential, though, to get \[ -\log_e\left(\frac{\hat{p}}{1-\hat{p}}\right) = ax+b. \] In this way, the logistic regression that we want can be translated to a linear regression that we know how to solve.
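
Here’s a small numerical sketch confirming that relationship; the parameters \(a\) and \(b\) are made up purely for illustration:

Code
import numpy as np

def sigmoid(x, a, b):
    # The logistic curve as written above: p_hat = 1/(1 + e^(a*x + b))
    return 1/(1 + np.exp(a*x + b))

a, b = -0.5, 1.0  # made-up parameters, purely for illustration
x = np.linspace(-5, 5, 11)
p = sigmoid(x, a, b)

# The log-odds transformation recovers a*x + b
print(np.allclose(-np.log(p/(1 - p)), a*x + b))  # True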