Mon, Feb 10, 2025
We’ve worked plenty in \(\mathbb R^n\) so far but, while we’ve talked a bit about geometry, we’ve focused mostly on the algebraic properties of linear transformations. Today, we’re going to focus on geometry, investigating questions of length, distance, and angle.
Our main tools for doing this are norms and inner products.
\[ \newcommand{\dotproduct}[2]{#1 \cdot #2} \newcommand{\vectorentry}[2]{\left\lbrack#1\right\rbrack_{#2}} \newcommand{\transpose}[1]{#1^{T}} \]
We’re all familiar with the concept of magnitude or length of a vector in \(\mathbb R^n\). If \(\vec{x} = \langle x_1,x_2,\ldots,x_n \rangle\), then its length is often denoted/computed by \[\|\vec{x}\|_2 = \sqrt{x_1^2 + x_2^2 +\cdots+x_n^2}\] Furthermore, this literally represents the physical length of the vector, at least in \(\mathbb R^2\) or \(\mathbb R^3\).
Furthermore, if you’ve got two points \(p_1\) and \(p_2\) in space, then you might refer to the distance between them as the length of the vector from one to the other.
We use the concept of distance frequently in numerical analysis and machine learning. For example, we quantify the error \(E\) of a model \(\vec{F}\) given input \(\vec{x}\) with expected output \(\vec{y}\) as \[E = \|\vec{F}(\vec{x}) - \vec{y}\|.\] It turns out that there are other ways to measure magnitude, that the choice among them can affect the results and efficiency of a model, and that there are really just a few properties that any notion of “magnitude” should have. Once we figure those out, we are led to a generalization of the notion of magnitude called a “norm”.
The Manhattan norm of \(\vec{x}\in\mathbb R^n\) is defined by \[ \|\vec{x}\|_1 = |x_1| + |x_2|+\cdots+|x_n|. \] We might think of the Manhattan norm as the distance from the tip of a vector back to its origin as measured along a grid of streets in space. I’ve also seen it called the “taxicab norm”.
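For a quick sanity check, here’s a small NumPy sketch (assuming NumPy is available; the vector is just an arbitrary example) that computes both norms:

```python
import numpy as np

x = np.array([3.0, -4.0, 12.0])

l2 = np.linalg.norm(x)         # Euclidean norm: sqrt(9 + 16 + 144) = 13
l1 = np.linalg.norm(x, ord=1)  # Manhattan norm: 3 + 4 + 12 = 19
print(l2, l1)
```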
You might notice that we denoted the Euclidean norm with a subscript 2 and the Manhattan norm with a subscript 1. In fact, they both lie in a whole family of norms called the \(L_p\) norms. The parameter \(p\) can be any real number in \([1,\infty]\), both “endpoints” included!
Before defining these, we should precisely define “norm”.
As you work with proofs and error estimates for distance on \(\mathbb R^n\), you find yourself repeatedly using just a few properties. This is exactly the situation where a mathematician asks how they might generalize the concept to make it more broadly applicable and useful.
In order to do this, let \(V\) be any vector space and let \(\|\cdot\|\) be a non-negative, real function on \(V\). We say that \(\|\cdot\|\) is a norm provided that it is positive definite (\(\|\vec{x}\| = 0\) only when \(\vec{x} = \vec{0}\)), homogeneous (\(\|\lambda\vec{x}\| = |\lambda|\,\|\vec{x}\|\)), and satisfies the triangle inequality (\(\|\vec{x}+\vec{y}\| \leq \|\vec{x}\| + \|\vec{y}\|\)), for all \(\lambda \in \mathbb R\) and all \(\vec{x},\vec{y}\in V\).
It’s not hard to show that \(L_1\) is a norm. In fact, the norm properties follow directly from the corresponding properties of the absolute value on which the norm is based. Here are the proofs of homogeneity and the triangle inequality. The fact that \(L_1\) is positive definite seems pretty clear.
\[\begin{aligned} \|\lambda \vec{x}\| &= \|\lambda \langle x_1, x_2, \cdots, x_n \rangle\| \\ &= \|\langle \lambda x_1, \lambda x_2, \cdots, \lambda x_n \rangle\| \\ &= |\lambda x_1| + |\lambda x_2| + \cdots + |\lambda x_n| \\ &= |\lambda| (|x_1|+|x_2|+\cdots+|x_n|) = |\lambda|\|\vec{x}\|. \end{aligned}\] \[\begin{aligned} \|\vec{x}+\vec{y}\| &= \|\langle x_1, x_2, \cdots, x_n \rangle + \langle y_1, y_2, \cdots, y_n \rangle\| \\ &= \|\langle x_1+y_1, x_2+y_2, \cdots, x_n+y_n \rangle \| \\ &= |x_1+y_1| + |x_2+y_2| +\cdots+ |x_n+y_n| \\ &\leq |x_1| + |x_2| +\cdots+ |x_n| + |y_1| + |y_2| +\cdots+ |y_n| \\ &= \|\langle x_1, x_2, \cdots, x_n \rangle\| + \|\langle y_1, y_2, \cdots, y_n \rangle\| = \|\vec{x}\|+\|\vec{y}\|. \end{aligned}\]

It’s actually a bit harder to prove that \(L^2\) is a norm. The only really difficult part, though, is the triangle inequality and there’s a very simple geometric intuition behind it.
Given a vector space \(V\) with norm \(\|\cdot\|\), the corresponding unit ball is \[ \{\vec{x}\in V: \|\vec{x}\| \leq 1\}. \] The shape of the ball depends on the norm, as well as on the space itself. Here are pictures of the unit ball in \(\mathbb R^2\) with respect to three different norms.
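If you’d like to generate pictures like these yourself, here’s one way (a sketch assuming NumPy and Matplotlib are available): rescale each direction on the circle so that its norm is exactly one, which traces out the boundary of the unit ball.

```python
import numpy as np
import matplotlib.pyplot as plt

# directions around the circle; each gets rescaled to have norm 1 in the given norm
theta = np.linspace(0, 2*np.pi, 400)
directions = np.column_stack([np.cos(theta), np.sin(theta)])

norms = {
    "$L_1$": lambda v: np.abs(v).sum(axis=1),
    "$L_2$": lambda v: np.sqrt((v**2).sum(axis=1)),
    r"$L_\infty$": lambda v: np.abs(v).max(axis=1),
}

fig, axes = plt.subplots(1, 3, figsize=(9, 3))
for ax, (name, norm) in zip(axes, norms.items()):
    boundary = directions / norm(directions)[:, None]
    ax.plot(boundary[:, 0], boundary[:, 1])
    ax.set_title(name)
    ax.set_aspect("equal")
plt.show()
```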
The \(L^{\infty}\) norm shown in the previous slide is yet another norm, defined by \[ \|\langle x_1,x_2,\ldots,x_n\rangle\|_{\infty} = \max\{|x_1|,|x_2|,\ldots,|x_n|\}. \]
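One way to see why the max deserves the “\(\infty\)” label: the \(L_p\) norm of \(\vec{x}\) is \(\left(|x_1|^p+|x_2|^p+\cdots+|x_n|^p\right)^{1/p}\), and as \(p\) grows this approaches \(\max_i |x_i|\). Here’s a quick numerical illustration (again assuming NumPy; the vector is arbitrary):

```python
import numpy as np

x = np.array([1.0, -7.0, 3.0])
print(np.linalg.norm(x, ord=np.inf))    # max norm: 7.0

for p in [1, 2, 4, 10, 100]:
    print(p, np.linalg.norm(x, ord=p))  # approaches 7 as p grows
```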
Perhaps the proof of one of the norm properties would be a reasonable exam question??
Finally, it’s worth mentioning that a norm implies a so-called metric, which is a generalization of the notion of distance.
To define the notion of distance between two points (written as vectors) \(\vec{x}\) and \(\vec{y}\), simply compute \(\|\vec{x}-\vec{y}\|\).
It’s straightforward to show that the \(L^2\) norm induces the standard Euclidean distance on \(\mathbb R^n\).
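In code, the induced distance is just the norm of the difference; for example (NumPy assumed; the points are made up):

```python
import numpy as np

p1 = np.array([1.0, 2.0])
p2 = np.array([4.0, 6.0])

print(np.linalg.norm(p1 - p2))         # Euclidean distance: 5.0
print(np.linalg.norm(p1 - p2, ord=1))  # Manhattan distance: 7.0
```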
In Calculus III, we discuss the dot product of two vectors, which we can express with our matrix entry notation as
\[\begin{equation*} \dotproduct{\vec{u}}{\vec{v}}= \vectorentry{\vec{u}}{1}\vectorentry{\vec{v}}{1}+ \vectorentry{\vec{u}}{2}\vectorentry{\vec{v}}{2}+ \vectorentry{\vec{u}}{3}\vectorentry{\vec{v}}{3}+ \cdots+ \vectorentry{\vec{u}}{m}\vectorentry{\vec{v}}{m} = \sum_{i=1}^{m}\vectorentry{\vec{u}}{i}\vectorentry{\vec{v}}{i}\text{.} \end{equation*}\]
This literally just tells us to multiply termwise and add the results. Note that this can also be expressed as matrix multiplication. For example,
\[ \begin{bmatrix}1&2&3\end{bmatrix} \begin{bmatrix}1\\2\\3\end{bmatrix} = [1+4+9] = [14] \]
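Here’s that same computation in NumPy, both as a dot product and as a row-times-column matrix product (a small illustrative sketch):

```python
import numpy as np

u = np.array([1, 2, 3])
v = np.array([1, 2, 3])

print(np.dot(u, v))                       # 14: multiply termwise and add
print(u.reshape(1, 3) @ v.reshape(3, 1))  # [[14]]: row vector times column vector
```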
The dot product obeys the familiar properties of commutativity, distributivity over vector addition, and compatibility with scalar multiplication.
These all follow from the corresponding properties for matrix multiplication, which we’ve already proved! In addition, the dot product of a vector with itself is always non-negative: \[ \vec{u}\cdot\vec{u} \geq 0, \] and it equals zero only when \(\vec{u}=\vec{0}\).
Again, it’s worth considering these things, since they aren’t automatic. Can you see why the dot product is not associative?
If you’re keeping score, the dot product keeps commutativity and distributivity but loses associativity.
The dot product has a fabulous geometric interpretation arising from the formula \[ \vec{u}\cdot\vec{v} = \|\vec{u}\|\|\vec{v}\|\cos(\theta), \] where \(\theta\) is the angle between the two vectors.
In particular, two nonzero vectors are perpendicular precisely when their dot product is zero. The dot product is, in fact, very often used as a test for perpendicularity.
We often say that two vectors are orthogonal when their dot product is zero.
If we place two non-parallel vectors with their tails at the same point, then that point together with the tips of the vectors determines a plane. Thus, the idea of perpendicularity extends to any dimension.
For example, \[ \langle 1,2,3,4,5 \rangle \cdot \langle 1,1,1,1,-2 \rangle = 0, \] so the vectors are perpendicular in \(\mathbb R^5\).
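A quick check of that claim, along with the angle formula from above (NumPy assumed; the 2D vectors are an extra example, not from the notes):

```python
import numpy as np

u = np.array([1, 2, 3, 4, 5])
v = np.array([1, 1, 1, 1, -2])
print(np.dot(u, v))   # 0, so u and v are perpendicular in R^5

# the angle between two vectors via u.v = |u||v|cos(theta)
a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])
cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(np.degrees(np.arccos(cos_theta)))   # 45.0
```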
Recall that a basis for a vector space is a linearly independent set of vectors that span the whole space.
Even better, the vectors could be orthogonal to one another.
Ideally, those orthogonal vectors are normalized so that each has length one.
A basis of normalized, orthogonal vectors is called orthonormal.
The great thing about an orthonormal basis is that it’s easy to express arbitrary vectors in terms of the basis vectors.
To see this, suppose that \(\{\vec{u}_1, \vec{u}_2,\ldots,\vec{u}_n\}\) form an orthonormal basis for \(\mathbb R^n\) and \(\vec{v} \in \mathbb R^n\) is a vector satisfying \[ \vec{v} = \alpha_1\vec{u}_1 + \alpha_2 \vec{u}_2 + \cdots + \alpha_n\vec{u}_n. \] Then, for any \(j\), \[ \vec{u}_j\cdot\vec{v} = \vec{u}_j\cdot(\alpha_1\vec{u}_1 + \alpha_2 \vec{u}_2 + \cdots + \alpha_n\vec{u}_n) = \alpha_j, \] since \(\vec{u}_j\cdot\vec{u}_i = 0\) for \(i\neq j\) while \(\vec{u}_j\cdot\vec{u}_j = 1\).
Note that the collection \[ \left\{ \left\langle\frac{2}{3},-\frac{1}{3},\frac{2}{3}\right\rangle, \left\langle\frac{2}{3},\frac{2}{3},-\frac{1}{3}\right\rangle, \left\langle-\frac{1}{3},\frac{2}{3},\frac{2}{3}\right\rangle \right\} \] forms an orthonormal basis for \(\mathbb R^3\). Suppose we’d like to express the vector \(\vec{v}=\langle 11, -1, 2 \rangle\) in terms of that basis. That is, we’d like \[ \langle 11, -1, 2 \rangle = \alpha_1\left\langle\frac{2}{3},-\frac{1}{3},\frac{2}{3}\right\rangle + \alpha_2\left\langle\frac{2}{3},\frac{2}{3},-\frac{1}{3}\right\rangle + \alpha_3 \left\langle -\frac{1}{3},\frac{2}{3},\frac{2}{3}\right\rangle. \] I suppose we could set up the three-by-three system representing this situation, or...
We can simply compute the dot product between the target vector \(\langle 11, -1, 2 \rangle\) and the three basis vectors to get the coefficients.
\[ \langle 11, -1, 2 \rangle \cdot \left\langle\frac{2}{3},-\frac{1}{3},\frac{2}{3}\right\rangle = 9 \\ \langle 11, -1, 2 \rangle \cdot \left\langle\frac{2}{3},\frac{2}{3},-\frac{1}{3}\right\rangle = 6 \\ \langle 11, -1, 2 \rangle \cdot \left\langle -\frac{1}{3},\frac{2}{3},\frac{2}{3}\right\rangle = -3 \\ \] Thus, we should have \[ \langle 11, -1, 2 \rangle = 9\left\langle\frac{2}{3},-\frac{1}{3},\frac{2}{3}\right\rangle + 6\left\langle\frac{2}{3},\frac{2}{3},-\frac{1}{3}\right\rangle -3 \left\langle -\frac{1}{3},\frac{2}{3},\frac{2}{3}\right\rangle. \]
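We can check this with exact arithmetic in SymPy (a sketch; the basis vectors are entered as exact fractions and the computation mirrors the dot products above):

```python
import sympy as sp

v = sp.Matrix([11, -1, 2])
u1 = sp.Matrix([2, -1, 2]) / 3
u2 = sp.Matrix([2, 2, -1]) / 3
u3 = sp.Matrix([-1, 2, 2]) / 3

alphas = [u.dot(v) for u in (u1, u2, u3)]
print(alphas)                                      # [9, 6, -3]

# reconstruct v from the coefficients
print(alphas[0]*u1 + alphas[1]*u2 + alphas[2]*u3)  # Matrix([[11], [-1], [2]])
```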
The dot product generalizes to a more widely applicable concept known as an inner product, which is typically denoted \(\langle\cdot,\cdot\rangle\). The inner product should be a real-valued binary function on a vector space \(V\) and should satisfy the same properties that the dot product does. That is, it should be symmetric (\(\langle \vec{u},\vec{v} \rangle = \langle \vec{v},\vec{u} \rangle\)), linear in each argument, and positive definite (\(\langle \vec{u},\vec{u} \rangle \geq 0\), with equality only when \(\vec{u}=\vec{0}\)).
Let \(V\) denote the vector space of all continuous functions over a closed interval \([a,b]\). Given \(f,g\in V\), define \(\langle f,g \rangle\) by \[ \langle f,g \rangle = \int_a^b f(x)g(x)\,dx. \] It’s not hard to show that this definition satisfies all the assumptions of an inner product. In fact, if you think of integration as a generalized sum, then it’s really closely analogous to the basic dot product.
This definition lies at the heart of Fourier series and is tremendously useful in applied mathematics.
Let \(V\) denote the set of all continuous functions on \([0,1]\).
The functions \[\{1,x,x^2,x^3,x^4,x^5\}\] are linearly independent in \(V\) but they are not orthogonal, as you can easily compute.
The functions \[\{1,\sin(2\pi x), \cos(2\pi x), \sin(4\pi x), \cos(4\pi x)\}\] are orthogonal, however, as the SymPy code below verifies.
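Here is one way such a check might look (a sketch; the exact code from class may differ):

```python
import sympy as sp

x = sp.symbols('x')
funcs = [sp.Integer(1), sp.sin(2*sp.pi*x), sp.cos(2*sp.pi*x),
         sp.sin(4*sp.pi*x), sp.cos(4*sp.pi*x)]

# every pairwise inner product <f, g> = integral of f*g over [0, 1] is zero
for i, f in enumerate(funcs):
    for g in funcs[i+1:]:
        print(f, g, sp.integrate(f*g, (x, 0, 1)))

# by contrast, x and x^2 are not orthogonal
print(sp.integrate(x * x**2, (x, 0, 1)))   # 1/4
```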
The dot product is, of course, an example of an inner product; there are many others, though.
One easy way to create an inner product is to include positive coefficients in the sum that we might think of as weights. For example, \[\langle \vec{u},\vec{v} \rangle = 3u_1v_1 + 2u_2v_2 + u_3v_3\] defines an inner product on \(\mathbb R^3\).
In the context of machine learning, we might use this kind of thing to weight some features of a model more heavily than others.
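As a tiny sketch of what that looks like in code (NumPy assumed; the weights and vectors are just an example):

```python
import numpy as np

# weighted inner product <u, v> = 3*u1*v1 + 2*u2*v2 + u3*v3 on R^3
weights = np.array([3.0, 2.0, 1.0])

def weighted_inner(u, v):
    return np.sum(weights * u * v)

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])
print(weighted_inner(u, v))                          # 3*4 + 2*10 + 1*18 = 50.0
print(weighted_inner(u, v) == weighted_inner(v, u))  # symmetric: True
```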
The weighted inner product is actually a special case of a more general approach to defining an inner product on \(\mathbb R^n\), namely one that uses a particular type of matrix.
Def: An \(n\times n\) matrix \(A\) is called positive definite if it’s symmetric and \[ \transpose{\vec{x}}A\vec{x} \geq 0, \] for all \(\vec{x}\in\mathbb R^n\). In addition, we require that equality occurs only when \(\vec{x}=\vec{0}\).
This somewhat crazy definition is built precisely so that the binary function \[ \langle \vec{x},\vec{y} \rangle \to \transpose{\vec{x}}A\vec{y} \] forms an inner product.
Any diagonal matrix with only positive entries on the diagonal is positive definite. If all diagonal entries equal 1, then the function \[ \langle \vec{x},\vec{y} \rangle \to \transpose{\vec{x}}A\vec{y} \] is exactly the standard dot product. If some differ from one, then we recapture the weighted dot product. You can see this by simply expanding
\[ \begin{bmatrix}x_1&x_2&\cdots&x_n\end{bmatrix} \begin{bmatrix} a_1 & 0 & \cdots & 0 \\ 0 & a_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & a_n \end{bmatrix} \begin{bmatrix}y_1\\y_2\\\vdots\\y_n\end{bmatrix} = [a_1x_1y_1+a_2x_2y_2+\cdots+a_nx_ny_n]. \]
The matrix \[ \begin{bmatrix}4&2\\2&3\end{bmatrix}\] is positive definite, though it’s a little trickier to prove. \[\begin{align*} \begin{bmatrix}x_1&x_2\end{bmatrix} \begin{bmatrix}4&2\\2&3\end{bmatrix} \begin{bmatrix}x_1\\x_2\end{bmatrix} &= \begin{bmatrix}4x_1+2x_2 & 2x_1+3x_2\end{bmatrix} \begin{bmatrix}x_1\\x_2\end{bmatrix} \\ &= (4x_1+2x_2)x_1 + (2x_1+3x_2)x_2 \\ &= 4x_1^2 + 4x_1x_2+3x_2^2 = (2x_1+x_2)^2 + 2x_2^2. \end{align*} \] The last expression is certainly never negative, since it’s a sum of two squares. If the second square is zero, then \(x_2=0\) and, then, the first square can only be zero if \(x_1\) is also zero.
The last example is, clearly, a bit ad hoc, and it’s unfortunate that it’s not always easy to tell if a matrix is positive definite. There are a couple of other characterizations that are useful, though: a symmetric matrix is positive definite exactly when all of its eigenvalues are positive or, equivalently, when it can be factored as \(U^TU\) for some invertible matrix \(U\).
The product \(U^TU\) is an example of a matrix factorization. We’ll create a matrix of that form in class next time when we discuss a projection approach to the least squares problem.
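For the \(2\times2\) example above, SymPy can confirm both characterizations (a sketch; SymPy’s cholesky returns a lower-triangular factor, whose transpose plays the role of \(U\)):

```python
import sympy as sp

A = sp.Matrix([[4, 2], [2, 3]])

print(A.eigenvals())   # eigenvalues are (7 +/- sqrt(17))/2, both positive
L = A.cholesky()       # lower-triangular L with L*L.T == A
U = L.T
print(U.T * U == A)    # True, so A = U^T U
```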
Our two main topics of norms and inner products are closely related. In fact, we can use an inner product to define a norm.
Thm: If \(\langle \cdot,\cdot \rangle\) is an inner product on a vector space \(V\), then the function \[\vec{x} \to \sqrt{\langle \vec{x},\vec{x} \rangle}\] defines a norm on \(V\).
The standard dot product defines the Euclidean norm on \(\mathbb R^n\): \[\|\vec{x}\| = \sqrt{\vec{x}\cdot\vec{x}} = \sqrt{x_1^2+x_2^2+\cdots+x_n^2}.\]
A weighted inner product defines a weighted norm. If \[\langle \vec{x},\vec{y} \rangle = 2x_1y_1 + x_2y_2,\] on \(\mathbb R^2\), for example, then \[\|\vec{x}\| = \sqrt{\langle \vec{x},\vec{x} \rangle} = \sqrt{2x_1^2+x_2^2}.\]
Let \(V\) denote the set of all continuous, real-valued functions on \([0,1]\). Then, \[ \langle f,g \rangle = \int_0^1 f(x)g(x)\,dx \] defines an inner product on \(V\). Thus, we now have a “norm” on these functions, defined by \[ \|f\| = \sqrt{\int_0^1 f(x)^2 \, dx}. \]
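For instance (a SymPy sketch with a sample function, just to see the definition in action):

```python
import sympy as sp

x = sp.symbols('x')
f = x   # a sample function on [0, 1]

norm_f = sp.sqrt(sp.integrate(f**2, (x, 0, 1)))
print(norm_f)   # sqrt(3)/3, i.e. 1/sqrt(3)
```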
Suppose that \(\langle \cdot,\cdot \rangle\) is an inner product on \(V\) and define \(\|\cdot\|\) on \(V\) by \[\vec{x} \to \sqrt{\langle \vec{x},\vec{x} \rangle}.\] Claim: \(\|\cdot\|\) is, in fact, a norm. Two properties are easy.
\[\begin{aligned} \|\lambda \vec{x}\| &= \sqrt{\langle \lambda\vec{x},\lambda\vec{x} \rangle} = \sqrt{\lambda^2 \langle \vec{x},\vec{x} \rangle} \\ &= \sqrt{\lambda^2} \sqrt{\langle \vec{x},\vec{x} \rangle} = |\lambda| \sqrt{\langle \vec{x},\vec{x} \rangle}. \end{aligned}\]
\[\|\vec{x}\| = \sqrt{\langle \vec{x},\vec{x} \rangle} \geq 0.\]
We’d still like to prove the triangle inequality, namely \[\|\vec{x} + \vec{y}\| \leq \|\vec{x}\| + \|\vec{y}\|.\] Let’s assume for a moment that \[ \langle \vec{x},\vec{y} \rangle \leq \|\vec{x}\|\|\vec{y}\|. \] Then \[\begin{aligned} \|\vec{x} + \vec{y}\|^2 &= \langle \vec{x}+\vec{y},\vec{x}+\vec{y} \rangle = \langle \vec{x},\vec{x} \rangle + 2\langle \vec{x},\vec{y} \rangle + \langle \vec{y},\vec{y} \rangle \\ &\leq \|\vec{x}\|^2 + 2\|\vec{x}\|\|\vec{y}\| + \|\vec{y}\|^2 = \left(\|\vec{x}\| + \|\vec{y}\|\right)^2 \end{aligned}\] Thus, we’d be done, if we could prove that inequality.
The inequality \[ \langle \vec{x},\vec{y} \rangle \leq \|\vec{x}\|\|\vec{y}\| \] is quite a famous and important inequality in its own right, known as the Cauchy-Schwarz inequality. It certainly holds for the standard dot product because \[ \vec{x}\cdot\vec{y} = \|\vec{x}\|\|\vec{y}\|\cos(\theta) \] and the cosine always lies between \(-1\) and \(1\).
It’s true for all inner products and to prove that more general fact, we need to do so using only the properties of the inner product and not the specific properties of the dot product.
Given \(\vec{x},\vec{y}\) in a vector space with an inner product, and assuming \(\vec{x}\neq\vec{0}\) (the inequality is obvious when \(\vec{x}=\vec{0}\)), define a function \(f\) by \[ f(t) = \langle t\vec{x}+\vec{y}, t\vec{x}+\vec{y} \rangle = \|\vec{x}\|^2 t^2 + 2 \langle \vec{x},\vec{y} \rangle t + \|\vec{y}\|^2. \] Note that this function is quadratic in \(t\) and, since it’s the inner product of a vector with itself, it’s never negative. Its minimum occurs at \[ t = -\frac{\langle \vec{x},\vec{y} \rangle}{\|\vec{x}\|^2}, \] and the minimum value is \[ \|\vec{x}\|^2 \left(-\frac{\langle \vec{x},\vec{y} \rangle}{\|\vec{x}\|^2}\right)^2 - 2\langle \vec{x},\vec{y} \rangle \frac{\langle \vec{x},\vec{y} \rangle}{\|\vec{x}\|^2} + \|\vec{y}\|^2 = -\frac{\langle \vec{x},\vec{y} \rangle^2}{\|\vec{x}\|^2} + \|\vec{y}\|^2 \geq 0. \] Solving that last inequality for \(\langle \vec{x},\vec{y} \rangle\), we get exactly the Cauchy-Schwarz inequality.
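As a sanity check, here’s Cauchy-Schwarz for the integral inner product with a specific pair of functions (a SymPy sketch; the functions are just an example):

```python
import sympy as sp

x = sp.symbols('x')
f, g = x, x**2   # sample continuous functions on [0, 1]

inner = sp.integrate(f*g, (x, 0, 1))              # <f, g> = 1/4
norm_f = sp.sqrt(sp.integrate(f**2, (x, 0, 1)))   # 1/sqrt(3)
norm_g = sp.sqrt(sp.integrate(g**2, (x, 0, 1)))   # 1/sqrt(5)

print(inner, norm_f*norm_g)    # 1/4 and sqrt(15)/15, which is about 0.258
print(inner <= norm_f*norm_g)  # True, as Cauchy-Schwarz promises
```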