Geometry and Linear Algebraic Operations⚓︎
:label:sec_geometry-linear-algebraic-ops
In :numref:sec_linear-algebra
, we encountered the basics of linear algebra
and saw how it could be used to express common operations for transforming our data.
Linear algebra is one of the key mathematical pillars
underlying much of the work that we do in deep learning
and in machine learning more broadly.
While :numref:sec_linear-algebra
contained enough machinery
to communicate the mechanics of modern deep learning models,
there is a lot more to the subject.
In this section, we will go deeper,
highlighting some geometric interpretations of linear algebra operations,
and introducing a few fundamental concepts, including of eigenvalues and eigenvectors.
Geometry of Vectors⚓︎
First, we need to discuss the two common geometric interpretations of vectors, as either points or directions in space. Fundamentally, a vector is a list of numbers such as the Python list below.
#@tab all
v = [1, 7, 0, 1]
Mathematicians most often write this as either a column or row vector, which is to say either as
or
These often have different interpretations,
where data examples are column vectors
and weights used to form weighted sums are row vectors.
However, it can be beneficial to be flexible.
As we have described in :numref:sec_linear-algebra
,
though a single vector's default orientation is a column vector,
for any matrix representing a tabular dataset,
treating each data example as a row vector
in the matrix
is more conventional.
Given a vector, the first interpretation
that we should give it is as a point in space.
In two or three dimensions, we can visualize these points
by using the components of the vectors to define
the location of the points in space compared
to a fixed reference called the origin. This can be seen in :numref:fig_grid
.
:label:fig_grid
This geometric point of view allows us to consider the problem on a more abstract level. No longer faced with some insurmountable seeming problem like classifying pictures as either cats or dogs, we can start considering tasks abstractly as collections of points in space and picturing the task as discovering how to separate two distinct clusters of points.
In parallel, there is a second point of view
that people often take of vectors: as directions in space.
Not only can we think of the vector \(\mathbf{v} = [3,2]^\top\)
as the location \(3\) units to the right and \(2\) units up from the origin,
we can also think of it as the direction itself
to take \(3\) steps to the right and \(2\) steps up.
In this way, we consider all the vectors in figure :numref:fig_arrow
the same.
:label:fig_arrow
One of the benefits of this shift is that
we can make visual sense of the act of vector addition.
In particular, we follow the directions given by one vector,
and then follow the directions given by the other, as is seen in :numref:fig_add-vec
.
:label:fig_add-vec
Vector subtraction has a similar interpretation. By considering the identity that \(\mathbf{u} = \mathbf{v} + (\mathbf{u}-\mathbf{v})\), we see that the vector \(\mathbf{u}-\mathbf{v}\) is the direction that takes us from the point \(\mathbf{v}\) to the point \(\mathbf{u}\).
Dot Products and Angles⚓︎
As we saw in :numref:sec_linear-algebra
,
if we take two column vectors \(\mathbf{u}\) and \(\mathbf{v}\),
we can form their dot product by computing:
\(\(\mathbf{u}^\top\mathbf{v} = \sum_i u_i\cdot v_i.\)\)
:eqlabel:eq_dot_def
Because :eqref:eq_dot_def
is symmetric, we will mirror the notation
of classical multiplication and write
to highlight the fact that exchanging the order of the vectors will yield the same answer.
The dot product :eqref:eq_dot_def
also admits a geometric interpretation: it is closely related to the angle between two vectors. Consider the angle shown in :numref:fig_angle
.
:label:fig_angle
To start, let's consider two specific vectors:
The vector \(\mathbf{v}\) is length \(r\) and runs parallel to the \(x\)-axis, and the vector \(\mathbf{w}\) is of length \(s\) and at angle \(\theta\) with the \(x\)-axis. If we compute the dot product of these two vectors, we see that
With some simple algebraic manipulation, we can rearrange terms to obtain
In short, for these two specific vectors, the dot product combined with the norms tell us the angle between the two vectors. This same fact is true in general. We will not derive the expression here, however, if we consider writing \(\|\mathbf{v} - \mathbf{w}\|^2\) in two ways: one with the dot product, and the other geometrically using the law of cosines, we can obtain the full relationship. Indeed, for any two vectors \(\mathbf{v}\) and \(\mathbf{w}\), the angle between the two vectors is
\(\(\theta = \arccos\left(\frac{\mathbf{v}\cdot\mathbf{w}}{\|\mathbf{v}\|\|\mathbf{w}\|}\right).\)\)
:eqlabel:eq_angle_forumla
This is a nice result since nothing in the computation references two-dimensions. Indeed, we can use this in three or three million dimensions without issue.
As a simple example, let's see how to compute the angle between a pair of vectors:
#@tab mxnet
%matplotlib inline
from d2l import mxnet as d2l
from IPython import display
from mxnet import gluon, np, npx
npx.set_np()
def angle(v, w):
return np.arccos(v.dot(w) / (np.linalg.norm(v) * np.linalg.norm(w)))
angle(np.array([0, 1, 2]), np.array([2, 3, 4]))
#@tab pytorch
%matplotlib inline
from d2l import torch as d2l
from IPython import display
import torch
from torchvision import transforms
import torchvision
def angle(v, w):
return torch.acos(v.dot(w) / (torch.norm(v) * torch.norm(w)))
angle(torch.tensor([0, 1, 2], dtype=torch.float32), torch.tensor([2.0, 3, 4]))
#@tab tensorflow
%matplotlib inline
from d2l import tensorflow as d2l
from IPython import display
import tensorflow as tf
def angle(v, w):
return tf.acos(tf.tensordot(v, w, axes=1) / (tf.norm(v) * tf.norm(w)))
angle(tf.constant([0, 1, 2], dtype=tf.float32), tf.constant([2.0, 3, 4]))
We will not use it right now, but it is useful to know that we will refer to vectors for which the angle is \(\pi/2\) (or equivalently \(90^{\circ}\)) as being orthogonal. By examining the equation above, we see that this happens when \(\theta = \pi/2\), which is the same thing as \(\cos(\theta) = 0\). The only way this can happen is if the dot product itself is zero, and two vectors are orthogonal if and only if \(\mathbf{v}\cdot\mathbf{w} = 0\). This will prove to be a helpful formula when understanding objects geometrically.
It is reasonable to ask: why is computing the angle useful? The answer comes in the kind of invariance we expect data to have. Consider an image, and a duplicate image, where every pixel value is the same but \(10\%\) the brightness. The values of the individual pixels are in general far from the original values. Thus, if one computed the distance between the original image and the darker one, the distance can be large. However, for most ML applications, the content is the same---it is still an image of a cat as far as a cat/dog classifier is concerned. However, if we consider the angle, it is not hard to see that for any vector \(\mathbf{v}\), the angle between \(\mathbf{v}\) and \(0.1\cdot\mathbf{v}\) is zero. This corresponds to the fact that scaling vectors keeps the same direction and just changes the length. The angle considers the darker image identical.
Examples like this are everywhere. In text, we might want the topic being discussed to not change if we write twice as long of document that says the same thing. For some encoding (such as counting the number of occurrences of words in some vocabulary), this corresponds to a doubling of the vector encoding the document, so again we can use the angle.
Cosine Similarity⚓︎
In ML contexts where the angle is employed to measure the closeness of two vectors, practitioners adopt the term cosine similarity to refer to the portion $$ \cos(\theta) = \frac{\mathbf{v}\cdot\mathbf{w}}{|\mathbf{v}||\mathbf{w}|}. $$
The cosine takes a maximum value of \(1\) when the two vectors point in the same direction, a minimum value of \(-1\) when they point in opposite directions, and a value of \(0\) when the two vectors are orthogonal. Note that if the components of high-dimensional vectors are sampled randomly with mean \(0\), their cosine will nearly always be close to \(0\).
Hyperplanes⚓︎
In addition to working with vectors, another key object that you must understand to go far in linear algebra is the hyperplane, a generalization to higher dimensions of a line (two dimensions) or of a plane (three dimensions). In an \(d\)-dimensional vector space, a hyperplane has \(d-1\) dimensions and divides the space into two half-spaces.
Let's start with an example.
Suppose that we have a column vector \(\mathbf{w}=[2,1]^\top\). We want to know, "what are the points \(\mathbf{v}\) with \(\mathbf{w}\cdot\mathbf{v} = 1\)?"
By recalling the connection between dot products and angles above :eqref:eq_angle_forumla
,
we can see that this is equivalent to
$$
|\mathbf{v}||\mathbf{w}|\cos(\theta) = 1 \; \iff \; |\mathbf{v}|\cos(\theta) = \frac{1}{|\mathbf{w}|} = \frac{1}{\sqrt{5}}.
$$
:label:fig_vector-project
If we consider the geometric meaning of this expression,
we see that this is equivalent to saying
that the length of the projection of \(\mathbf{v}\)
onto the direction of \(\mathbf{w}\) is exactly \(1/\|\mathbf{w}\|\), as is shown in :numref:fig_vector-project
.
The set of all points where this is true is a line
at right angles to the vector \(\mathbf{w}\).
If we wanted, we could find the equation for this line
and see that it is \(2x + y = 1\) or equivalently \(y = 1 - 2x\).
If we now look at what happens when we ask about the set of points with
\(\mathbf{w}\cdot\mathbf{v} > 1\) or \(\mathbf{w}\cdot\mathbf{v} < 1\),
we can see that these are cases where the projections
are longer or shorter than \(1/\|\mathbf{w}\|\), respectively.
Thus, those two inequalities define either side of the line.
In this way, we have found a way to cut our space into two halves,
where all the points on one side have dot product below a threshold,
and the other side above as we see in :numref:fig_space-division
.
:label:fig_space-division
The story in higher dimension is much the same.
If we now take \(\mathbf{w} = [1,2,3]^\top\)
and ask about the points in three dimensions with \(\mathbf{w}\cdot\mathbf{v} = 1\),
we obtain a plane at right angles to the given vector \(\mathbf{w}\).
The two inequalities again define the two sides of the plane as is shown in :numref:fig_higher-division
.
:label:fig_higher-division
While our ability to visualize runs out at this point,
nothing stops us from doing this in tens, hundreds, or billions of dimensions.
This occurs often when thinking about machine learned models.
For instance, we can understand linear classification models
like those from :numref:sec_softmax
,
as methods to find hyperplanes that separate the different target classes.
In this context, such hyperplanes are often referred to as decision planes.
The majority of deep learned classification models end
with a linear layer fed into a softmax,
so one can interpret the role of the deep neural network
to be to find a non-linear embedding such that the target classes
can be separated cleanly by hyperplanes.
To give a hand-built example, notice that we can produce a reasonable model
to classify tiny images of t-shirts and trousers from the Fashion-MNIST dataset
(seen in :numref:sec_fashion_mnist
)
by just taking the vector between their means to define the decision plane
and eyeball a crude threshold. First we will load the data and compute the averages.
#@tab mxnet
# Load in the dataset
train = gluon.data.vision.FashionMNIST(train=True)
test = gluon.data.vision.FashionMNIST(train=False)
X_train_0 = np.stack([x[0] for x in train if x[1] == 0]).astype(float)
X_train_1 = np.stack([x[0] for x in train if x[1] == 1]).astype(float)
X_test = np.stack(
[x[0] for x in test if x[1] == 0 or x[1] == 1]).astype(float)
y_test = np.stack(
[x[1] for x in test if x[1] == 0 or x[1] == 1]).astype(float)
# Compute averages
ave_0 = np.mean(X_train_0, axis=0)
ave_1 = np.mean(X_train_1, axis=0)
#@tab pytorch
# Load in the dataset
trans = []
trans.append(transforms.ToTensor())
trans = transforms.Compose(trans)
train = torchvision.datasets.FashionMNIST(root="../data", transform=trans,
train=True, download=True)
test = torchvision.datasets.FashionMNIST(root="../data", transform=trans,
train=False, download=True)
X_train_0 = torch.stack(
[x[0] * 256 for x in train if x[1] == 0]).type(torch.float32)
X_train_1 = torch.stack(
[x[0] * 256 for x in train if x[1] == 1]).type(torch.float32)
X_test = torch.stack(
[x[0] * 256 for x in test if x[1] == 0 or x[1] == 1]).type(torch.float32)
y_test = torch.stack([torch.tensor(x[1]) for x in test
if x[1] == 0 or x[1] == 1]).type(torch.float32)
# Compute averages
ave_0 = torch.mean(X_train_0, axis=0)
ave_1 = torch.mean(X_train_1, axis=0)
#@tab tensorflow
# Load in the dataset
((train_images, train_labels), (
test_images, test_labels)) = tf.keras.datasets.fashion_mnist.load_data()
X_train_0 = tf.cast(tf.stack(train_images[[i for i, label in enumerate(
train_labels) if label == 0]] * 256), dtype=tf.float32)
X_train_1 = tf.cast(tf.stack(train_images[[i for i, label in enumerate(
train_labels) if label == 1]] * 256), dtype=tf.float32)
X_test = tf.cast(tf.stack(test_images[[i for i, label in enumerate(
test_labels) if label == 0]] * 256), dtype=tf.float32)
y_test = tf.cast(tf.stack(test_images[[i for i, label in enumerate(
test_labels) if label == 1]] * 256), dtype=tf.float32)
# Compute averages
ave_0 = tf.reduce_mean(X_train_0, axis=0)
ave_1 = tf.reduce_mean(X_train_1, axis=0)
It can be informative to examine these averages in detail, so let's plot what they look like. In this case, we see that the average indeed resembles a blurry image of a t-shirt.
#@tab mxnet, pytorch
# Plot average t-shirt
d2l.set_figsize()
d2l.plt.imshow(ave_0.reshape(28, 28).tolist(), cmap='Greys')
d2l.plt.show()
#@tab tensorflow
# Plot average t-shirt
d2l.set_figsize()
d2l.plt.imshow(tf.reshape(ave_0, (28, 28)), cmap='Greys')
d2l.plt.show()
In the second case, we again see that the average resembles a blurry image of trousers.
#@tab mxnet, pytorch
# Plot average trousers
d2l.plt.imshow(ave_1.reshape(28, 28).tolist(), cmap='Greys')
d2l.plt.show()
#@tab tensorflow
# Plot average trousers
d2l.plt.imshow(tf.reshape(ave_1, (28, 28)), cmap='Greys')
d2l.plt.show()
In a fully machine learned solution, we would learn the threshold from the dataset. In this case, I simply eyeballed a threshold that looked good on the training data by hand.
#@tab mxnet
# Print test set accuracy with eyeballed threshold
w = (ave_1 - ave_0).T
predictions = X_test.reshape(2000, -1).dot(w.flatten()) > -1500000
# Accuracy
np.mean(predictions.astype(y_test.dtype) == y_test, dtype=np.float64)
#@tab pytorch
# Print test set accuracy with eyeballed threshold
w = (ave_1 - ave_0).T
# '@' is Matrix Multiplication operator in pytorch.
predictions = X_test.reshape(2000, -1) @ (w.flatten()) > -1500000
# Accuracy
torch.mean((predictions.type(y_test.dtype) == y_test).float(), dtype=torch.float64)
#@tab tensorflow
# Print test set accuracy with eyeballed threshold
w = tf.transpose(ave_1 - ave_0)
predictions = tf.reduce_sum(X_test * tf.nest.flatten(w), axis=0) > -1500000
# Accuracy
tf.reduce_mean(
tf.cast(tf.cast(predictions, y_test.dtype) == y_test, tf.float32))
Geometry of Linear Transformations⚓︎
Through :numref:sec_linear-algebra
and the above discussions,
we have a solid understanding of the geometry of vectors, lengths, and angles.
However, there is one important object we have omitted discussing,
and that is a geometric understanding of linear transformations represented by matrices. Fully internalizing what matrices can do to transform data
between two potentially different high dimensional spaces takes significant practice,
and is beyond the scope of this appendix.
However, we can start building up intuition in two dimensions.
Suppose that we have some matrix:
If we want to apply this to an arbitrary vector \(\mathbf{v} = [x, y]^\top\), we multiply and see that
This may seem like an odd computation, where something clear became somewhat impenetrable. However, it tells us that we can write the way that a matrix transforms any vector in terms of how it transforms two specific vectors: \([1,0]^\top\) and \([0,1]^\top\). This is worth considering for a moment. We have essentially reduced an infinite problem (what happens to any pair of real numbers) to a finite one (what happens to these specific vectors). These vectors are an example a basis, where we can write any vector in our space as a weighted sum of these basis vectors.
Let's draw what happens when we use the specific matrix
If we look at the specific vector \(\mathbf{v} = [2, -1]^\top\),
we see this is \(2\cdot[1,0]^\top + -1\cdot[0,1]^\top\),
and thus we know that the matrix \(A\) will send this to
\(2(\mathbf{A}[1,0]^\top) + -1(\mathbf{A}[0,1])^\top = 2[1, -1]^\top - [2,3]^\top = [0, -5]^\top\).
If we follow this logic through carefully,
say by considering the grid of all integer pairs of points,
we see that what happens is that the matrix multiplication
can skew, rotate, and scale the grid,
but the grid structure must remain as you see in :numref:fig_grid-transform
.
:label:fig_grid-transform
This is the most important intuitive point to internalize about linear transformations represented by matrices. Matrices are incapable of distorting some parts of space differently than others. All they can do is take the original coordinates on our space and skew, rotate, and scale them.
Some distortions can be severe. For instance the matrix
compresses the entire two-dimensional plane down to a single line. Identifying and working with such transformations are the topic of a later section, but geometrically we can see that this is fundamentally different from the types of transformations we saw above. For instance, the result from matrix \(\mathbf{A}\) can be "bent back" to the original grid. The results from matrix \(\mathbf{B}\) cannot because we will never know where the vector \([1,2]^\top\) came from---was it \([1,1]^\top\) or \([0, -1]^\top\)?
While this picture was for a \(2\times2\) matrix, nothing prevents us from taking the lessons learned into higher dimensions. If we take similar basis vectors like \([1,0, \ldots,0]\) and see where our matrix sends them, we can start to get a feeling for how the matrix multiplication distorts the entire space in whatever dimension space we are dealing with.
Linear Dependence⚓︎
Consider again the matrix
This compresses the entire plane down to live on the single line \(y = 2x\). The question now arises: is there some way we can detect this just looking at the matrix itself? The answer is that indeed we can. Let's take \(\mathbf{b}_1 = [2,4]^\top\) and \(\mathbf{b}_2 = [-1, -2]^\top\) be the two columns of \(\mathbf{B}\). Remember that we can write everything transformed by the matrix \(\mathbf{B}\) as a weighted sum of the columns of the matrix: like \(a_1\mathbf{b}_1 + a_2\mathbf{b}_2\). We call this a linear combination. The fact that \(\mathbf{b}_1 = -2\cdot\mathbf{b}_2\) means that we can write any linear combination of those two columns entirely in terms of say \(\mathbf{b}_2\) since
This means that one of the columns is, in a sense, redundant because it does not define a unique direction in space. This should not surprise us too much since we already saw that this matrix collapses the entire plane down into a single line. Moreover, we see that the linear dependence \(\mathbf{b}_1 = -2\cdot\mathbf{b}_2\) captures this. To make this more symmetrical between the two vectors, we will write this as
In general, we will say that a collection of vectors \(\mathbf{v}_1, \ldots, \mathbf{v}_k\) are linearly dependent if there exist coefficients \(a_1, \ldots, a_k\) not all equal to zero so that
In this case, we can solve for one of the vectors in terms of some combination of the others, and effectively render it redundant. Thus, a linear dependence in the columns of a matrix is a witness to the fact that our matrix is compressing the space down to some lower dimension. If there is no linear dependence we say the vectors are linearly independent. If the columns of a matrix are linearly independent, no compression occurs and the operation can be undone.
Rank⚓︎
If we have a general \(n\times m\) matrix, it is reasonable to ask what dimension space the matrix maps into. A concept known as the rank will be our answer. In the previous section, we noted that a linear dependence bears witness to compression of space into a lower dimension and so we will be able to use this to define the notion of rank. In particular, the rank of a matrix \(\mathbf{A}\) is the largest number of linearly independent columns amongst all subsets of columns. For example, the matrix
has \(\textrm{rank}(B)=1\), since the two columns are linearly dependent, but either column by itself is not linearly dependent. For a more challenging example, we can consider
and show that \(\mathbf{C}\) has rank two since, for instance, the first two columns are linearly independent, however any of the four collections of three columns are dependent.
This procedure, as described, is very inefficient. It requires looking at every subset of the columns of our given matrix, and thus is potentially exponential in the number of columns. Later we will see a more computationally efficient way to compute the rank of a matrix, but for now, this is sufficient to see that the concept is well defined and understand the meaning.
Invertibility⚓︎
We have seen above that multiplication by a matrix with linearly dependent columns cannot be undone, i.e., there is no inverse operation that can always recover the input. However, multiplication by a full-rank matrix (i.e., some \(\mathbf{A}\) that is \(n \times n\) matrix with rank \(n\)), we should always be able to undo it. Consider the matrix
which is the matrix with ones along the diagonal, and zeros elsewhere. We call this the identity matrix. It is the matrix which leaves our data unchanged when applied. To find a matrix which undoes what our matrix \(\mathbf{A}\) has done, we want to find a matrix \(\mathbf{A}^{-1}\) such that
If we look at this as a system, we have \(n \times n\) unknowns (the entries of \(\mathbf{A}^{-1}\)) and \(n \times n\) equations (the equality that needs to hold between every entry of the product \(\mathbf{A}^{-1}\mathbf{A}\) and every entry of \(\mathbf{I}\)) so we should generically expect a solution to exist. Indeed, in the next section we will see a quantity called the determinant, which has the property that as long as the determinant is not zero, we can find a solution. We call such a matrix \(\mathbf{A}^{-1}\) the inverse matrix. As an example, if \(\mathbf{A}\) is the general \(2 \times 2\) matrix
then we can see that the inverse is
We can test to see this by seeing that multiplying by the inverse given by the formula above works in practice.
#@tab mxnet
M = np.array([[1, 2], [1, 4]])
M_inv = np.array([[2, -1], [-0.5, 0.5]])
M_inv.dot(M)
#@tab pytorch
M = torch.tensor([[1, 2], [1, 4]], dtype=torch.float32)
M_inv = torch.tensor([[2, -1], [-0.5, 0.5]])
M_inv @ M
#@tab tensorflow
M = tf.constant([[1, 2], [1, 4]], dtype=tf.float32)
M_inv = tf.constant([[2, -1], [-0.5, 0.5]])
tf.matmul(M_inv, M)
Numerical Issues⚓︎
While the inverse of a matrix is useful in theory, we must say that most of the time we do not wish to use the matrix inverse to solve a problem in practice. In general, there are far more numerically stable algorithms for solving linear equations like
than computing the inverse and multiplying to get
Just as division by a small number can lead to numerical instability, so can inversion of a matrix which is close to having low rank.
Moreover, it is common that the matrix \(\mathbf{A}\) is sparse, which is to say that it contains only a small number of non-zero values. If we were to explore examples, we would see that this does not mean the inverse is sparse. Even if \(\mathbf{A}\) was a \(1\) million by \(1\) million matrix with only \(5\) million non-zero entries (and thus we need only store those \(5\) million), the inverse will typically have almost every entry non-negative, requiring us to store all \(1\textrm{M}^2\) entries---that is \(1\) trillion entries!
While we do not have time to dive all the way into the thorny numerical issues frequently encountered when working with linear algebra, we want to provide you with some intuition about when to proceed with caution, and generally avoiding inversion in practice is a good rule of thumb.
Determinant⚓︎
The geometric view of linear algebra gives an intuitive way
to interpret a fundamental quantity known as the determinant.
Consider the grid image from before, but now with a highlighted region (:numref:fig_grid-filled
).
:label:fig_grid-filled
Look at the highlighted square. This is a square with edges given by \((0, 1)\) and \((1, 0)\) and thus it has area one. After \(\mathbf{A}\) transforms this square, we see that it becomes a parallelogram. There is no reason this parallelogram should have the same area that we started with, and indeed in the specific case shown here of
it is an exercise in coordinate geometry to compute the area of this parallelogram and obtain that the area is \(5\).
In general, if we have a matrix
we can see with some computation that the area of the resulting parallelogram is \(ad-bc\). This area is referred to as the determinant.
Let's check this quickly with some example code.
#@tab mxnet
import numpy as np
np.linalg.det(np.array([[1, -1], [2, 3]]))
#@tab pytorch
torch.det(torch.tensor([[1, -1], [2, 3]], dtype=torch.float32))
#@tab tensorflow
tf.linalg.det(tf.constant([[1, -1], [2, 3]], dtype=tf.float32))
The eagle-eyed amongst us will notice that this expression can be zero or even negative. For the negative term, this is a matter of convention taken generally in mathematics: if the matrix flips the figure, we say the area is negated. Let's see now that when the determinant is zero, we learn more.
Let's consider
If we compute the determinant of this matrix, we get \(2\cdot(-2 ) - 4\cdot(-1) = 0\). Given our understanding above, this makes sense. \(\mathbf{B}\) compresses the square from the original image down to a line segment, which has zero area. And indeed, being compressed into a lower dimensional space is the only way to have zero area after the transformation. Thus we see the following result is true: a matrix \(A\) is invertible if and only if the determinant is not equal to zero.
As a final comment, imagine that we have any figure drawn on the plane. Thinking like computer scientists, we can decompose that figure into a collection of little squares so that the area of the figure is in essence just the number of squares in the decomposition. If we now transform that figure by a matrix, we send each of these squares to parallelograms, each one of which has area given by the determinant. We see that for any figure, the determinant gives the (signed) number that a matrix scales the area of any figure.
Computing determinants for larger matrices can be laborious, but the intuition is the same. The determinant remains the factor that \(n\times n\) matrices scale \(n\)-dimensional volumes.
Tensors and Common Linear Algebra Operations⚓︎
In :numref:sec_linear-algebra
the concept of tensors was introduced.
In this section, we will dive more deeply into tensor contractions
(the tensor equivalent of matrix multiplication),
and see how it can provide a unified view
on a number of matrix and vector operations.
With matrices and vectors we knew how to multiply them to transform data. We need to have a similar definition for tensors if they are to be useful to us. Think about matrix multiplication:
or equivalently
This pattern is one we can repeat for tensors. For tensors, there is no one case of what to sum over that can be universally chosen, so we need specify exactly which indices we want to sum over. For instance we could consider
Such a transformation is called a tensor contraction. It can represent a far more flexible family of transformations that matrix multiplication alone.
As a often-used notational simplification, we can notice that the sum is over exactly those indices that occur more than once in the expression, thus people often work with Einstein notation, where the summation is implicitly taken over all repeated indices. This gives the compact expression:
Common Examples from Linear Algebra⚓︎
Let's see how many of the linear algebraic definitions we have seen before can be expressed in this compressed tensor notation:
- \(\mathbf{v} \cdot \mathbf{w} = \sum_i v_iw_i\)
- \(\|\mathbf{v}\|_2^{2} = \sum_i v_iv_i\)
- \((\mathbf{A}\mathbf{v})_i = \sum_j a_{ij}v_j\)
- \((\mathbf{A}\mathbf{B})_{ik} = \sum_j a_{ij}b_{jk}\)
- \(\textrm{tr}(\mathbf{A}) = \sum_i a_{ii}\)
In this way, we can replace a myriad of specialized notations with short tensor expressions.
Expressing in Code⚓︎
Tensors may flexibly be operated on in code as well.
As seen in :numref:sec_linear-algebra
,
we can create tensors as is shown below.
#@tab mxnet
# Define tensors
B = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
A = np.array([[1, 2], [3, 4]])
v = np.array([1, 2])
# Print out the shapes
A.shape, B.shape, v.shape
#@tab pytorch
# Define tensors
B = torch.tensor([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
A = torch.tensor([[1, 2], [3, 4]])
v = torch.tensor([1, 2])
# Print out the shapes
A.shape, B.shape, v.shape
#@tab tensorflow
# Define tensors
B = tf.constant([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
A = tf.constant([[1, 2], [3, 4]])
v = tf.constant([1, 2])
# Print out the shapes
A.shape, B.shape, v.shape
Einstein summation has been implemented directly. The indices that occurs in the Einstein summation can be passed as a string, followed by the tensors that are being acted upon. For instance, to implement matrix multiplication, we can consider the Einstein summation seen above (\(\mathbf{A}\mathbf{v} = a_{ij}v_j\)) and strip out the indices themselves to get the implementation:
#@tab mxnet
# Reimplement matrix multiplication
np.einsum("ij, j -> i", A, v), A.dot(v)
#@tab pytorch
# Reimplement matrix multiplication
torch.einsum("ij, j -> i", A, v), A@v
#@tab tensorflow
# Reimplement matrix multiplication
tf.einsum("ij, j -> i", A, v), tf.matmul(A, tf.reshape(v, (2, 1)))
This is a highly flexible notation. For instance if we want to compute what would be traditionally written as
it can be implemented via Einstein summation as:
#@tab mxnet
np.einsum("ijk, il, j -> kl", B, A, v)
#@tab pytorch
torch.einsum("ijk, il, j -> kl", B, A, v)
#@tab tensorflow
tf.einsum("ijk, il, j -> kl", B, A, v)
This notation is readable and efficient for humans,
however bulky if for whatever reason
we need to generate a tensor contraction programmatically.
For this reason, einsum
provides an alternative notation
by providing integer indices for each tensor.
For example, the same tensor contraction can also be written as:
#@tab mxnet
np.einsum(B, [0, 1, 2], A, [0, 3], v, [1], [2, 3])
#@tab pytorch
# PyTorch does not support this type of notation.
#@tab tensorflow
# TensorFlow does not support this type of notation.
Either notation allows for concise and efficient representation of tensor contractions in code.
Summary⚓︎
- Vectors can be interpreted geometrically as either points or directions in space.
- Dot products define the notion of angle to arbitrarily high-dimensional spaces.
- Hyperplanes are high-dimensional generalizations of lines and planes. They can be used to define decision planes that are often used as the last step in a classification task.
- Matrix multiplication can be geometrically interpreted as uniform distortions of the underlying coordinates. They represent a very restricted, but mathematically clean, way to transform vectors.
- Linear dependence is a way to tell when a collection of vectors are in a lower dimensional space than we would expect (say you have \(3\) vectors living in a \(2\)-dimensional space). The rank of a matrix is the size of the largest subset of its columns that are linearly independent.
- When a matrix's inverse is defined, matrix inversion allows us to find another matrix that undoes the action of the first. Matrix inversion is useful in theory, but requires care in practice owing to numerical instability.
- Determinants allow us to measure how much a matrix expands or contracts a space. A nonzero determinant implies an invertible (non-singular) matrix and a zero-valued determinant means that the matrix is non-invertible (singular).
- Tensor contractions and Einstein summation provide for a neat and clean notation for expressing many of the computations that are seen in machine learning.
Exercises⚓︎
- What is the angle between $$ \vec v_1 = \begin{bmatrix} 1 \ 0 \ -1 \ 2 \end{bmatrix}, \qquad \vec v_2 = \begin{bmatrix} 3 \ 1 \ 0 \ 1 \end{bmatrix}? $$
- True or false: \(\begin{bmatrix}1 & 2\\0&1\end{bmatrix}\) and \(\begin{bmatrix}1 & -2\\0&1\end{bmatrix}\) are inverses of one another?
- Suppose that we draw a shape in the plane with area \(100\textrm{m}^2\). What is the area after transforming the figure by the matrix $$ \begin{bmatrix} 2 & 3\ 1 & 2 \end{bmatrix}. $$
- Which of the following sets of vectors are linearly independent?
- \(\left\{\begin{pmatrix}1\\0\\-1\end{pmatrix}, \begin{pmatrix}2\\1\\-1\end{pmatrix}, \begin{pmatrix}3\\1\\1\end{pmatrix}\right\}\)
- \(\left\{\begin{pmatrix}3\\1\\1\end{pmatrix}, \begin{pmatrix}1\\1\\1\end{pmatrix}, \begin{pmatrix}0\\0\\0\end{pmatrix}\right\}\)
- \(\left\{\begin{pmatrix}1\\1\\0\end{pmatrix}, \begin{pmatrix}0\\1\\-1\end{pmatrix}, \begin{pmatrix}1\\0\\1\end{pmatrix}\right\}\)
- Suppose that you have a matrix written as \(A = \begin{bmatrix}c\\d\end{bmatrix}\cdot\begin{bmatrix}a & b\end{bmatrix}\) for some choice of values \(a, b, c\), and \(d\). True or false: the determinant of such a matrix is always \(0\)?
- The vectors \(e_1 = \begin{bmatrix}1\\0\end{bmatrix}\) and \(e_2 = \begin{bmatrix}0\\1\end{bmatrix}\) are orthogonal. What is the condition on a matrix \(A\) so that \(Ae_1\) and \(Ae_2\) are orthogonal?
- How can you write \(\textrm{tr}(\mathbf{A}^4)\) in Einstein notation for an arbitrary matrix \(A\)?
:begin_tab:mxnet
Discussions
:end_tab:
:begin_tab:pytorch
Discussions
:end_tab:
:begin_tab:tensorflow
Discussions
:end_tab:
创建日期: November 25, 2023