In this section we present important classes of spaces in which our data will live and our operations will take place: vector spaces, metric spaces, normed spaces, and inner product spaces.
Generally speaking, each of these is defined so as to capture one or more important properties of Euclidean space in a more general setting.
\subsection{Vector spaces}
\term{Vector spaces} are the basic setting in which linear algebra happens.
A vector space $V$ is a set (the elements of which are called \term{vectors}) on which two operations are defined: vectors can be added together, and vectors can be multiplied by real numbers\footnote{
More generally, vector spaces can be defined over any \term{field} $\F$.
We take $\F = \R$ in this document to avoid an unnecessary diversion into abstract algebra.
} called \term{scalars}.
$V$ must satisfy
\begin{enumerate}[(i)]
\item There exists an additive identity (written $\vec{0}$) in $V$ such that $\x+\vec{0} = \x$ for all $\x \in V$
\item For each $\x \in V$, there exists an additive inverse (written $-\x$) such that $\x+(-\x) = \vec{0}$
\item There exists a multiplicative identity (written $1$) in $\R$ such that $1\x = \x$ for all $\x \in V$
\item Commutativity: $\x+\y = \y+\x$ for all $\x, \y \in V$
\item Associativity: $(\x+\y)+\vec{z} = \x+(\y+\vec{z})$ and $\alpha(\beta\x) = (\alpha\beta)\x$ for all $\x, \y, \vec{z} \in V$ and $\alpha, \beta \in \R$
\item Distributivity: $\alpha(\x+\y) = \alpha\x + \alpha\y$ and $(\alpha+\beta)\x = \alpha\x + \beta\x$ for all $\x, \y \in V$ and $\alpha, \beta \in \R$
\end{enumerate}
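For example, $\R^n$ (introduced next) satisfies all of these axioms, as does the set of polynomials with real coefficients of degree at most $k$, with the usual addition of polynomials and multiplication of polynomials by real numbers.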
\subsubsection{Euclidean space}
The quintessential vector space is \term{Euclidean space}, which we denote $\R^n$.
The vectors in this space consist of $n$-tuples of real numbers:
\[\x = (x_1, x_2, \dots, x_n)\]
For our purposes, it will be useful to think of them as $n \times 1$ matrices, or \term{column vectors}:
\[\x = \matlit{x_1 \\ x_2 \\ \vdots \\ x_n}\]
Addition and scalar multiplication are defined component-wise on vectors in $\R^n$:
\[\x + \y = \matlit{x_1 + y_1 \\ \vdots \\ x_n + y_n}, \tab \alpha\x = \matlit{\alpha x_1 \\ \vdots \\ \alpha x_n}\]
Euclidean space is used to mathematically represent physical space, with notions such as distance, length, and angles.
Although it becomes hard to visualize for $n > 3$, these concepts generalize mathematically in obvious ways.
Even when you're working in more general settings than $\R^n$, it is often useful to visualize vector addition and scalar multiplication in terms of 2D vectors in the plane or 3D vectors in space.
\subsubsection{Subspaces}
Vector spaces can contain other vector spaces.
If $V$ is a vector space, then $S \subseteq V$ is said to be a \term{subspace} of $V$ if
\begin{enumerate}[(i)]
\item $\vec{0} \in S$
\item $S$ is closed under addition: $\x, \y \in S$ implies $\x+\y \in S$
\item $S$ is closed under scalar multiplication: $\x \in S, \alpha \in \R$ implies $\alpha\x \in S$
\end{enumerate}
Note that $V$ is always a subspace of $V$, as is the trivial vector space which contains only $\vec{0}$.
As a concrete example, a line passing through the origin is a subspace of Euclidean space.
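To verify this for a particular line, write it as $S = \{(t, 2t) : t \in \R\} \subseteq \R^2$: taking $t = 0$ shows $\vec{0} \in S$, and $(t, 2t) + (s, 2s) = (t+s, 2(t+s)) \in S$ and $\alpha(t, 2t) = (\alpha t, 2\alpha t) \in S$ establish closure. By contrast, the shifted line $\{(t, 2t+1) : t \in \R\}$ is not a subspace because it does not contain $\vec{0}$.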
Some of the most important subspaces are those induced by linear maps.
If $T : V \to W$ is a linear map, we define the \term{nullspace}\footnote{
It is sometimes called the \term{kernel} by algebraists, but we eschew this terminology because the word ``kernel'' has another meaning in machine learning.
} of $T$ as
\[\Null(T) = \{\x \in V \mid T\x = \vec{0}\}\]
and the \term{range} (or the \term{columnspace} if we are considering the matrix form) of $T$ as
\[\range(T) = \{\y \in W \mid \text{$\exists \x \in V$ such that $T\x = \y$}\}\]
It is a good exercise to verify that the nullspace and range of a linear map are always subspaces of its domain and codomain, respectively.
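As a small example, the linear map on $\R^2$ given by the matrix $\matlit{1 & 0 \\ 0 & 0}$ sends $(x_1, x_2)$ to $(x_1, 0)$, so its nullspace is the $x_2$-axis $\{(0, t) : t \in \R\}$ and its range is the $x_1$-axis $\{(t, 0) : t \in \R\}$, both of which are indeed subspaces of $\R^2$.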
\subsection{Metric spaces}
Metrics generalize the notion of distance from Euclidean space (although metric spaces need not be vector spaces).
A \term{metric} on a set $S$ is a function $d : S \times S \to \R$ that satisfies
\begin{enumerate}[(i)]
\item $d(x,y) \geq 0$, with equality if and only if $x = y$
\item $d(x,y) = d(y,x)$
\item $d(x,z) \leq d(x,y) + d(y,z)$ (the so-called \term{triangle inequality})
\end{enumerate}
for all $x, y, z \in S$.
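A simple example on an arbitrary set $S$ is the discrete metric, which sets $d(x, y) = 0$ if $x = y$ and $d(x, y) = 1$ otherwise; checking the three axioms is a quick exercise, and the example shows that a metric space need not carry any vector space structure.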
A key motivation for metrics is that they allow limits to be defined for mathematical objects other than real numbers.
We say that a sequence $\{x_n\} \subseteq S$ converges to the limit $x$ if for any $\epsilon > 0$, there exists $N \in \N$ such that $d(x_n, x) < \epsilon$ for all $n \geq N$.
Note that the definition for limits of sequences of real numbers, which you have likely seen in a calculus class, is a special case of this definition when using the metric $d(x, y) = |x-y|$.
\subsection{Normed spaces}
Norms generalize the notion of length from Euclidean space.
A \term{norm} on a real vector space $V$ is a function $\|\cdot\| : V \to \R$ that satisfies
\begin{enumerate}[(i)]
\item $\|\x\| \geq 0$, with equality if and only if $\x = \vec{0}$
\item $\|\alpha\x\| = |\alpha|\|\x\|$
\item $\|\x+\y\| \leq \|\x\| + \|\y\|$ (the \term{triangle inequality} again)
\end{enumerate}
for all $\x, \y \in V$ and all $\alpha \in \R$.
A vector space endowed with a norm is called a \term{normed vector space}, or simply a \term{normed space}.
Note that any norm on $V$ induces a distance metric on $V$:
\[d(\x, \y) = \|\x-\y\|\]
One can verify that the axioms for metrics are satisfied under this definition and follow directly from the axioms for norms.
Therefore any normed space is also a metric space.\footnote{
If a normed space is complete with respect to the distance metric induced by its norm, we say that it is a \term{Banach space}.
}
We will typically only be concerned with a few specific norms on $\R^n$:
\begin{align*}
\|\x\|_1 &= \sum_{i=1}^n |x_i| \\
\|\x\|_2 &= \sqrt{\sum_{i=1}^n x_i^2} \\
\|\x\|_p &= \left(\sum_{i=1}^n |x_i|^p\right)^\frac{1}{p} \tab\tab (p \geq 1) \\
\|\x\|_\infty &= \max_{1 \leq i \leq n} |x_i|
\end{align*}
Note that the 1- and 2-norms are special cases of the $p$-norm, and the $\infty$-norm is the limit of the $p$-norm as $p$ tends to infinity.
We require $p \geq 1$ for the general definition of the $p$-norm because the triangle inequality fails to hold if $p < 1$.
(Try to find a counterexample!)
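To get a feel for these norms, take $\x = (3, -4) \in \R^2$: then $\|\x\|_1 = 7$, $\|\x\|_2 = 5$, and $\|\x\|_\infty = 4$. In general, $\|\x\|_\infty \leq \|\x\|_2 \leq \|\x\|_1$ for every $\x \in \R^n$.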
Here's a fun fact: for any given finite-dimensional vector space $V$, all norms on $V$ are equivalent in the sense that for two norms $\|\cdot\|_A, \|\cdot\|_B$, there exist constants $\alpha, \beta > 0$ such that
\[\alpha\|\x\|_A \leq \|\x\|_B \leq \beta\|\x\|_A\]
for all $\x \in V$. Therefore convergence in one norm implies convergence in any other norm.
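As a concrete instance of such equivalence, the $\infty$-norm and $2$-norm on $\R^n$ satisfy
\[\|\x\|_\infty \leq \|\x\|_2 \leq \sqrt{n}\,\|\x\|_\infty\]
for all $\x$, so one may take $\alpha = 1$ and $\beta = \sqrt{n}$ with $\|\cdot\|_A = \|\cdot\|_\infty$ and $\|\cdot\|_B = \|\cdot\|_2$.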
Note, however, that this equivalence can fail in infinite-dimensional vector spaces such as function spaces.
\subsection{Inner product spaces}
An \term{inner product} on a real vector space $V$ is a function $\inner{\cdot}{\cdot} : V \times V \to \R$ satisfying
\begin{enumerate}[(i)]
\item $\inner{\x}{\x} \geq 0$, with equality if and only if $\x = \vec{0}$
\item $\inner{\alpha\x + \beta\y}{\vec{z}} = \alpha\inner{\x}{\vec{z}} + \beta\inner{\y}{\vec{z}}$
\item $\inner{\x}{\y} = \inner{\y}{\x}$
\end{enumerate}
for all $\x, \y, \vec{z} \in V$ and all $\alpha,\beta \in \R$.
A vector space endowed with an inner product is called an \term{inner product space}.
Note that any inner product on $V$ induces a norm on $V$:
\[\|\x\| = \sqrt{\inner{\x}{\x}}\]
One can verify that the axioms for norms are satisfied under this definition and follow (almost) directly from the axioms for inner products.
Therefore any inner product space is also a normed space (and hence also a metric space).\footnote{
If an inner product space is complete with respect to the distance metric induced by its inner product, we say that it is a \term{Hilbert space}.
}
Two vectors $\x$ and $\y$ are said to be \term{orthogonal} if $\inner{\x}{\y} = 0$; we write $\x \perp \y$ for shorthand.
Orthogonality generalizes the notion of perpendicularity from Euclidean space.
If two orthogonal vectors $\x$ and $\y$ additionally have unit length (i.e. $\|\x\| = \|\y\| = 1$), then they are described as \term{orthonormal}.
The standard inner product on $\R^n$ is given by
\[\inner{\x}{\y} = \sum_{i=1}^n x_iy_i = \x\tran\y\]
The matrix notation on the right-hand side (see the Transposition section if it's unfamiliar) arises because this inner product is a special case of matrix multiplication, where we regard the resulting $1 \times 1$ matrix as a scalar.
The inner product on $\R^n$ is also often written $\x\cdot\y$ (hence the alternate name \term{dot product}).
The reader can verify that the two-norm $\|\cdot\|_2$ on $\R^n$ is induced by this inner product.
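For example, $\x = (1, 2)$ and $\y = (2, -1)$ satisfy $\inner{\x}{\y} = 1 \cdot 2 + 2 \cdot (-1) = 0$, so $\x \perp \y$; since $\|\x\|_2 = \|\y\|_2 = \sqrt{5} \neq 1$, these vectors are orthogonal but not orthonormal.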
\subsubsection{Pythagorean Theorem}
The well-known Pythagorean theorem generalizes naturally to arbitrary inner product spaces.
\begin{theorem}
If $\x \perp \y$, then
\[\|\x+\y\|^2 = \|\x\|^2 + \|\y\|^2\]
\end{theorem}
\begin{proof}
Suppose $\x \perp \y$, i.e. $\inner{\x}{\y} = 0$. Then
\[\|\x+\y\|^2 = \inner{\x+\y}{\x+\y} = \inner{\x}{\x} + \inner{\y}{\x} + \inner{\x}{\y} + \inner{\y}{\y} = \|\x\|^2 + \|\y\|^2\]
as claimed.
\end{proof}
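As a numerical check, the orthogonal pair $\x = (1, 2)$, $\y = (2, -1)$ from above gives $\x + \y = (3, 1)$, and indeed $\|\x+\y\|^2 = 10 = 5 + 5 = \|\x\|^2 + \|\y\|^2$.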
\subsubsection{Cauchy-Schwarz inequality}
This inequality is sometimes useful in proving bounds:
\[|\inner{\x}{\y}| \leq \|\x\| \cdot \|\y\|\]
for all $\x, \y \in V$. Equality holds exactly when $\x$ and $\y$ are scalar multiples of each other (or equivalently, when they are linearly dependent).
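For example, with $\x = (1, 2)$ and $\y = 2\x = (2, 4)$ we get $|\inner{\x}{\y}| = 10 = \sqrt{5} \cdot 2\sqrt{5} = \|\x\| \cdot \|\y\|$, so equality holds; for the orthogonal pair $(1, 2)$ and $(2, -1)$ from the previous subsection, the left-hand side is $0$ and the inequality is strict.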
\subsection{Transposition}
If $\A \in \R^{m \times n}$, its \term{transpose} $\A\tran \in \R^{n \times m}$ is given by $(\A\tran)_{ij} = A_{ji}$ for each $(i, j)$.
In other words, the columns of $\A$ become the rows of $\A\tran$, and the rows of $\A$ become the columns of $\A\tran$.
The transpose has several nice algebraic properties that can be easily verified from the definition:
\begin{enumerate}[(i)]
\item $(\A\tran)\tran = \A$
\item $(\A+\mat{B})\tran = \A\tran + \mat{B}\tran$
\item $(\alpha \A)\tran = \alpha \A\tran$
\item $(\A\mat{B})\tran = \mat{B}\tran \A\tran$
\end{enumerate}
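As a quick check of property (iv), let $\A = \matlit{1 & 2 \\ 3 & 4}$ and $\mat{B} = \matlit{0 & 1 \\ 1 & 0}$: then $\A\mat{B} = \matlit{2 & 1 \\ 4 & 3}$, so $(\A\mat{B})\tran = \matlit{2 & 4 \\ 1 & 3}$, which is exactly $\mat{B}\tran\A\tran = \matlit{0 & 1 \\ 1 & 0}\matlit{1 & 3 \\ 2 & 4}$.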
\subsection{Eigenthings}
For a square matrix $\A \in \R^{n \times n}$, there may be vectors which, when $\A$ is applied to them, are simply scaled by some constant.
We say that a nonzero vector $\x \in \R^n$ is an \term{eigenvector} of $\A$ corresponding to \term{eigenvalue} $\lambda$ if
\[\A\x = \lambda\x\]
The zero vector is excluded from this definition because $\A\vec{0} = \vec{0} = \lambda\vec{0}$ for every $\lambda$.
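For example, let
\[\A = \matlit{2 & 1 \\ 1 & 2}\]
Then $\A\matlit{1 \\ 1} = \matlit{3 \\ 3}$ and $\A\matlit{1 \\ -1} = \matlit{1 \\ -1}$, so $(1, 1)$ and $(1, -1)$ are eigenvectors of $\A$ with eigenvalues $3$ and $1$, respectively. This matrix will reappear in later examples.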
We now give some useful results about how eigenvalues change after various manipulations.
\begin{proposition}
Let $\x$ be an eigenvector of $\A$ with corresponding eigenvalue $\lambda$.
Then
\begin{enumerate}[(i)]
\item For any $\gamma \in \R$, $\x$ is an eigenvector of $\A + \gamma\I$ with eigenvalue $\lambda + \gamma$.
\item If $\A$ is invertible, then $\x$ is an eigenvector of $\A\inv$ with eigenvalue $\lambda\inv$.
\item $\A^k\x = \lambda^k\x$ for any $k \in \Z$ (where $\A^0 = \I$ by definition).
\end{enumerate}
\end{proposition}
\begin{proof}
(i) follows readily:
\[(\A + \gamma\I)\x = \A\x + \gamma\I\x = \lambda\x + \gamma\x = (\lambda + \gamma)\x\]
(ii) Suppose $\A$ is invertible. Then
\[\x = \A\inv\A\x = \A\inv(\lambda\x) = \lambda\A\inv\x\]
Dividing by $\lambda$, which is valid because the invertibility of $\A$ implies $\lambda \neq 0$, gives $\lambda\inv\x = \A\inv\x$.
(iii) The case $k \geq 0$ follows immediately by induction on $k$.
Then the general case $k \in \Z$ follows by combining the $k \geq 0$ case with (ii).
\end{proof}
\subsection{Trace}
The \term{trace} of a square matrix is the sum of its diagonal entries:
\[\tr(\A) = \sum_{i=1}^n A_{ii}\]
The trace has several nice algebraic properties:
\begin{enumerate}[(i)]
\item $\tr(\A+\mat{B}) = \tr(\A) + \tr(\mat{B})$
\item $\tr(\alpha\A) = \alpha\tr(\A)$
\item $\tr(\A\tran) = \tr(\A)$
\item $\tr(\A\mat{B}\mat{C}\mat{D}) = \tr(\mat{B}\mat{C}\mat{D}\A) = \tr(\mat{C}\mat{D}\A\mat{B}) = \tr(\mat{D}\A\mat{B}\mat{C})$
\end{enumerate}
The first three properties follow readily from the definition.
The last is known as \term{invariance under cyclic permutations}.
Note that the matrices cannot be reordered arbitrarily, for example $\tr(\A\mat{B}\mat{C}\mat{D}) \neq \tr(\mat{B}\A\mat{C}\mat{D})$ in general.
Also, there is nothing special about the product of four matrices -- analogous rules hold for more or fewer matrices.
Interestingly, the trace of a matrix is equal to the sum of its eigenvalues (repeated according to multiplicity):
\[\tr(\A) = \sum_i \lambda_i(\A)\]
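As a quick check, the matrix $\matlit{2 & 1 \\ 1 & 2}$ from the eigenvector example above has eigenvalues $3$ and $1$, and indeed its trace is $2 + 2 = 4 = 3 + 1$.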
\subsection{Determinant}
The \term{determinant} of a square matrix can be defined in several different confusing ways, none of which are particularly important for our purposes; go look at an introductory linear algebra text (or Wikipedia) if you need a definition.
But it's good to know the properties:
\begin{enumerate}[(i)]
\item $\det(\I) = 1$
\item $\det(\A\tran) = \det(\A)$
\item $\det(\A\mat{B}) = \det(\A)\det(\mat{B})$
\item $\det(\A\inv) = \det(\A)\inv$
\item $\det(\alpha\A) = \alpha^n \det(\A)$
\end{enumerate}
Interestingly, the determinant of a matrix is equal to the product of its eigenvalues (repeated according to multiplicity):
\[\det(\A) = \prod_i \lambda_i(\A)\]
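Checking this against the running example $\matlit{2 & 1 \\ 1 & 2}$ with eigenvalues $3$ and $1$: its determinant is $2 \cdot 2 - 1 \cdot 1 = 3 = 3 \cdot 1$, as expected.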
\subsection{Orthogonal matrices}
A matrix $\mat{Q} \in \R^{n \times n}$ is said to be \term{orthogonal} if its columns are pairwise orthonormal.
This definition implies that
\[\mat{Q}\tran \mat{Q} = \mat{Q}\mat{Q}\tran = \I\]
or equivalently, $\mat{Q}\tran = \mat{Q}\inv$. A nice thing about orthogonal matrices is that they preserve inner products:
\[(\mat{Q}\x)\tran(\mat{Q}\y) = \x\tran \mat{Q}\tran \mat{Q}\y = \x\tran \I\y = \x\tran\y\]
A direct result of this fact is that they also preserve 2-norms:
\[\|\mat{Q}\x\|_2 = \sqrt{(\mat{Q}\x)\tran(\mat{Q}\x)} = \sqrt{\x\tran\x} = \|\x\|_2\]
Therefore multiplication by an orthogonal matrix can be considered as a transformation that preserves length, but may rotate or reflect the vector about the origin.
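The canonical example is a rotation: for any angle $\theta$, the matrix
\[\mat{Q} = \matlit{\cos\theta & -\sin\theta \\ \sin\theta & \cos\theta}\]
has orthonormal columns (each column has squared length $\cos^2\theta + \sin^2\theta = 1$, and the columns' inner product is $-\cos\theta\sin\theta + \sin\theta\cos\theta = 0$), so $\mat{Q}\tran\mat{Q} = \I$; multiplication by $\mat{Q}$ rotates a vector by the angle $\theta$ without changing its length.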
\subsection{Symmetric matrices}
A matrix $\A \in \R^{n \times n}$ is said to be \term{symmetric} if it is equal to its own transpose ($\A = \A\tran$), meaning that $A_{ij} = A_{ji}$ for all $(i,j)$.
This definition seems harmless enough but turns out to have some strong implications.
We summarize the most important of these as
\begin{theorem}
(Spectral Theorem)
If $\A \in \R^{n \times n}$ is symmetric, then there exists an orthonormal basis for $\R^n$ consisting of eigenvectors of $\A$.
\end{theorem}
The practical application of this theorem is a particular factorization of symmetric matrices, referred to as the \term{eigendecomposition} or \term{spectral decomposition}.
Denote the orthonormal basis of eigenvectors $\q_1, \dots, \q_n$ and their eigenvalues $\lambda_1, \dots, \lambda_n$.
Let $\mat{Q}$ be an orthogonal matrix with $\q_1, \dots, \q_n$ as its columns, and $\mat{\Lambda} = \diag(\lambda_1, \dots, \lambda_n)$.
Since by definition $\A\q_i = \lambda_i\q_i$ for every $i$, the following relationship holds:
\[\A\mat{Q} = \mat{Q}\mat{\Lambda}\]
Right-multiplying by $\mat{Q}\tran$, we arrive at the decomposition
\[\A = \mat{Q}\mat{\Lambda}\mat{Q}\tran\]
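For the running example $\A = \matlit{2 & 1 \\ 1 & 2}$, the orthonormal eigenvectors are $\q_1 = \frac{1}{\sqrt{2}}(1, 1)$ with $\lambda_1 = 3$ and $\q_2 = \frac{1}{\sqrt{2}}(1, -1)$ with $\lambda_2 = 1$, so
\[\A = \mat{Q}\mat{\Lambda}\mat{Q}\tran = \left(\frac{1}{\sqrt{2}}\matlit{1 & 1 \\ 1 & -1}\right)\matlit{3 & 0 \\ 0 & 1}\left(\frac{1}{\sqrt{2}}\matlit{1 & 1 \\ 1 & -1}\right)\]
(here $\mat{Q}\tran = \mat{Q}$ because this particular $\mat{Q}$ happens to be symmetric), which the reader can multiply out to recover $\A$.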
\subsubsection{Rayleigh quotients}
Let $\A \in \R^{n \times n}$ be a symmetric matrix.
The expression $\x\tran\A\x$ is called a \term{quadratic form}.
There turns out to be an interesting connection between the quadratic form of a symmetric matrix and its eigenvalues.
This connection is provided by the \term{Rayleigh quotient}
\[R_\A(\x) = \frac{\x\tran\A\x}{\x\tran\x}\]
The Rayleigh quotient has a couple of important properties which the reader can (and should!) easily verify from the definition:
\begin{enumerate}[(i)]
\item \term{Scale invariance}: for any vector $\x \neq \vec{0}$ and any scalar $\alpha \neq 0$, $R_\A(\x) = R_\A(\alpha\x)$.
\item If $\x$ is an eigenvector of $\A$ with eigenvalue $\lambda$, then $R_\A(\x) = \lambda$.
\end{enumerate}
We can further show that the Rayleigh quotient is bounded by the largest and smallest eigenvalues of $\A$.
But first we will show a useful special case of the final result.
\begin{proposition}
For any $\x$ such that $\|\x\|_2 = 1$,
\[\lambda_{\min}(\A) \leq \x\tran\A\x \leq \lambda_{\max}(\A)\]
with equality if and only if $\x$ is a corresponding eigenvector.
\end{proposition}
\begin{proof}
We show only the $\max$ case because the argument for the $\min$ case is entirely analogous.
Since $\A$ is symmetric, we can decompose it as $\A = \mat{Q}\mat{\Lambda}\mat{Q}\tran$.
Then use the change of variable $\y = \mat{Q}\tran\x$, noting that the relationship between $\x$ and $\y$ is one-to-one and that $\|\y\|_2 = 1$ since $\mat{Q}$ is orthogonal.
Hence
\[\max_{\|\x\|_2 = 1} \x\tran\A\x = \max_{\|\y\|_2 = 1} \y\tran\mat{\Lambda}\y = \max_{y_1^2+\dots+y_n^2=1} \sum_{i=1}^n \lambda_i y_i^2\]
Written this way, it is clear that $\y$ maximizes this expression if and only if it satisfies $\sum_{i \in I} y_i^2 = 1$, where $I = \{i : \lambda_i = \max_{j=1,\dots,n} \lambda_j = \lambda_{\max}(\A)\}$, and $y_j = 0$ for $j \not\in I$.
That is, $I$ contains the index or indices of the largest eigenvalue.
In this case, the maximal value of the expression is
\[\sum_{i=1}^n \lambda_i y_i^2 = \sum_{i \in I} \lambda_i y_i^2 = \lambda_{\max}(\A) \sum_{i \in I} y_i^2 = \lambda_{\max}(\A)\]
Then writing $\q_1, \dots, \q_n$ for the columns of $\mat{Q}$, we have
\[\x = \mat{Q}\mat{Q}\tran\x = \mat{Q}\y = \sum_{i=1}^n y_i\q_i = \sum_{i \in I} y_i\q_i\]
where we have used the matrix-vector product identity.
Recall that $\q_1, \dots, \q_n$ are eigenvectors of $\A$ and form an orthonormal basis for $\R^n$.
Therefore by construction, the set $\{\q_i : i \in I\}$ forms an orthonormal basis for the eigenspace of $\lambda_{\max}(\A)$.
Hence $\x$, which is a linear combination of these, lies in that eigenspace and thus is an eigenvector of $\A$ corresponding to $\lambda_{\max}(\A)$.
We have shown that $\max_{\|\x\|_2 = 1} \x\tran\A\x = \lambda_{\max}(\A)$, from which we have the general inequality $\x\tran\A\x \leq \lambda_{\max}(\A)$ for all unit-length $\x$.
\end{proof}
By the scale invariance of the Rayleigh quotient, we immediately have as a corollary (since $\x\tran\A\x = R_{\A}(\x)$ for unit $\x$)
\begin{theorem}
(Min-max theorem)
For all $\x \neq \vec{0}$,
\[\lambda_{\min}(\A) \leq R_\A(\x) \leq \lambda_{\max}(\A)\]
with equality if and only if $\x$ is a corresponding eigenvector.
\end{theorem}
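As a quick sanity check, take the running example $\A = \matlit{2 & 1 \\ 1 & 2}$ with eigenvalues $1$ and $3$, and $\x = (1, 0)$: then $R_\A(\x) = \x\tran\A\x / \x\tran\x = 2$, which indeed lies between $\lambda_{\min}(\A) = 1$ and $\lambda_{\max}(\A) = 3$.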
\subsection{Positive (semi-)definite matrices}
A symmetric matrix $\A$ is \term{positive semi-definite} if for all $\x \in \R^n$, $\x\tran\A\x \geq 0$.
Sometimes people write $\A \succeq 0$ to indicate that $\A$ is positive semi-definite.
A symmetric matrix $\A$ is \term{positive definite} if for all nonzero $\x \in \R^n$, $\x\tran\A\x > 0$.
Sometimes people write $\A \succ 0$ to indicate that $\A$ is positive definite.
Note that positive definiteness is a strictly stronger property than positive semi-definiteness, in the sense that every positive definite matrix is positive semi-definite but not vice-versa.
These properties are related to eigenvalues in the following way.
\begin{proposition}
A symmetric matrix is positive semi-definite if and only if all of its eigenvalues are nonnegative, and positive definite if and only if all of its eigenvalues are positive.
\end{proposition}
\begin{proof}
Suppose $\A$ is positive semi-definite, and let $\x$ be an eigenvector of $\A$ with eigenvalue $\lambda$.
Then
\[0 \leq \x\tran\A\x = \x\tran(\lambda\x) = \lambda\x\tran\x = \lambda\|\x\|_2^2\]
Since $\x \neq \vec{0}$ (by the assumption that it is an eigenvector), we have $\|\x\|_2^2 > 0$, so we can divide both sides by $\|\x\|_2^2$ to arrive at $\lambda \geq 0$.
If $\A$ is positive definite, the inequality above holds strictly, so $\lambda > 0$.
This proves one direction.
To simplify the proof of the other direction, we will use the machinery of Rayleigh quotients.
Suppose that $\A$ is symmetric and all its eigenvalues are nonnegative.
Then for all $\x \neq \vec{0}$,
\[0 \leq \lambda_{\min}(\A) \leq R_\A(\x)\]
Since $\x\tran\A\x$ matches $R_\A(\x)$ in sign, we conclude that $\A$ is positive semi-definite.
If the eigenvalues of $\A$ are all strictly positive, then $0 < \lambda_{\min}(\A)$, whence it follows that $\A$ is positive definite.
\end{proof}
As an example of how these matrices arise, consider
\begin{proposition}
Suppose $\A \in \R^{m \times n}$.
Then $\A\tran\A$ is positive semi-definite.
If $\Null(\A) = \{\vec{0}\}$, then $\A\tran\A$ is positive definite.
\end{proposition}
\begin{proof}
For any $\x \in \R^n$,
\[\x\tran (\A\tran\A)\x = (\A\x)\tran(\A\x) = \|\A\x\|_2^2 \geq 0\]
so $\A\tran\A$ is positive semi-definite.
Note that $\|\A\x\|_2^2 = 0$ implies $\|\A\x\|_2 = 0$, which in turn implies $\A\x = \vec{0}$ (recall that this is a property of norms).
If $\Null(\A) = \{\vec{0}\}$, $\A\x = \vec{0}$ implies $\x = \vec{0}$, so $\x\tran (\A\tran\A)\x = 0$ if and only if $\x = \vec{0}$, and thus $\A\tran\A$ is positive definite.
\end{proof}
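For a concrete instance, take $\A = \matlit{1 & 2} \in \R^{1 \times 2}$. Then
\[\A\tran\A = \matlit{1 & 2 \\ 2 & 4}, \tab \x\tran(\A\tran\A)\x = (x_1 + 2x_2)^2 \geq 0\]
so $\A\tran\A$ is positive semi-definite, but it is not positive definite: the vector $\x = (2, -1)$ lies in $\Null(\A)$ and gives $\x\tran(\A\tran\A)\x = 0$.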
Positive definite matrices are invertible (since their eigenvalues are nonzero), whereas positive semi-definite matrices might not be.
However, if you already have a positive semi-definite matrix, it is possible to perturb its diagonal slightly to produce a positive definite matrix.
\begin{proposition}
If $\A$ is positive semi-definite and $\epsilon > 0$, then $\A + \epsilon\I$ is positive definite.
\end{proposition}
\begin{proof}
Assuming $\A$ is positive semi-definite and $\epsilon > 0$, we have for any $\x \neq \vec{0}$ that
\[\x\tran(\A+\epsilon\I)\x = \x\tran\A\x + \epsilon\x\tran\I\x = \underbrace{\x\tran\A\x}_{\geq 0} + \underbrace{\epsilon\|\x\|_2^2}_{> 0} > 0\]
as claimed.
\end{proof}
An obvious but frequently useful consequence of the two propositions we have just shown is that $\A\tran\A + \epsilon\I$ is positive definite (and in particular, invertible) for \textit{any} matrix $\A$ and any $\epsilon > 0$.
\subsubsection{The geometry of positive definite quadratic forms}
A useful way to understand quadratic forms is by the geometry of their level sets.
A \term{level set} or \term{isocontour} of a function is the set of all inputs such that the function applied to those inputs yields a given output.
Mathematically, the $c$-isocontour of $f$ is $\{\x \in \dom f : f(\x) = c\}$.
Let us consider the special case $f(\x) = \x\tran\mat{A}\x$ where $\mat{A}$ is a positive definite matrix.
Since $\mat{A}$ is positive definite, it has a unique positive definite square root $\A\halfpow = \mat{Q}\mat{\Lambda}\halfpow\mat{Q}\tran$, where $\mat{Q}\mat{\Lambda}\mat{Q}\tran$ is the eigendecomposition of $\A$ and $\mat{\Lambda}\halfpow = \diag(\sqrt{\lambda_1}, \dots, \sqrt{\lambda_n})$.
It is easy to see that this matrix $\A\halfpow$ is positive definite (consider its eigenvalues) and satisfies $\A\halfpow\A\halfpow = \A$.
Fixing a value $c \geq 0$, the $c$-isocontour of $f$ is the set of $\x \in \R^n$ such that
\[c = \x\tran\A\x = \x\tran\A\halfpow\A\halfpow\x = \|\A\halfpow\x\|_2^2\]
where we have used the symmetry of $\A\halfpow$.
Making the change of variable $\z = \A\halfpow\x$, we have the condition $\|\z\|_2 = \sqrt{c}$.
That is, the values $\z$ lie on a sphere of radius $\sqrt{c}$.
These can be parameterized as $\z = \sqrt{c}\hat{\z}$ where $\hat{\z}$ has $\|\hat{\z}\|_2 = 1$.
Then since $\A\neghalfpow = \mat{Q}\mat{\Lambda}\neghalfpow\mat{Q}\tran$, we have
\[\x = \A\neghalfpow\z = \mat{Q}\mat{\Lambda}\neghalfpow\mat{Q}\tran\sqrt{c}\hat{\z} = \sqrt{c}\mat{Q}\mat{\Lambda}\neghalfpow\tilde{\z}\]
where $\tilde{\z} = \mat{Q}\tran\hat{\z}$ also satisfies $\|\tilde{\z}\|_2 = 1$ since $\mat{Q}$ is orthogonal.
Using this parameterization, we see that the solution set $\{\x \in \R^n : f(\x) = c\}$ is the image of the unit sphere $\{\tilde{\z} \in \R^n : \|\tilde{\z}\|_2 = 1\}$ under the invertible linear map $\x = \sqrt{c}\mat{Q}\mat{\Lambda}\neghalfpow\tilde{\z}$.
What we have gained with all these manipulations is a clear algebraic understanding of the $c$-isocontour of $f$ in terms of a sequence of linear transformations applied to a well-understood set.
We begin with the unit sphere, then scale every axis $i$ by $\lambda_i\neghalfpow$, resulting in an axis-aligned ellipsoid.
Observe that the axis lengths of the ellipsoid are proportional to the inverse square roots of the eigenvalues of $\A$.
Hence larger eigenvalues correspond to shorter axis lengths, and vice-versa.
Then this axis-aligned ellipsoid undergoes a rigid transformation (i.e. one that preserves length and angles, such as a rotation/reflection) given by $\mat{Q}$.
The result of this transformation is that the axes of the ellipsoid are no longer along the coordinate axes in general, but rather along the directions given by the corresponding eigenvectors.
To see this, consider the unit vector $\vec{e}_i \in \R^n$ that has $[\vec{e}_i]_j = \delta_{ij}$.
In the pre-transformed space, this vector points along the axis with length proportional to $\lambda_i\neghalfpow$.
But after applying the rigid transformation $\mat{Q}$, the resulting vector points in the direction of the corresponding eigenvector $\q_i$, since
\[\mat{Q}\vec{e}_i = \sum_{j=1}^n [\vec{e}_i]_j\q_j = \q_i\]
where we have again used the matrix-vector product identity (see the matrix identities below).
In summary: the isocontours of $f(\x) = \x\tran\A\x$ are ellipsoids such that the axes point in the directions of the eigenvectors of $\A$, and the radii of these axes are proportional to the inverse square roots of the corresponding eigenvalues.
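For instance, if $\A = \matlit{4 & 0 \\ 0 & 1}$ (so $\mat{Q} = \I$, $\lambda_1 = 4$, $\lambda_2 = 1$), the $1$-isocontour of $f$ is the ellipse $4x_1^2 + x_2^2 = 1$, whose semi-axes have lengths $\lambda_1\neghalfpow = \frac{1}{2}$ along $\vec{e}_1$ and $\lambda_2\neghalfpow = 1$ along $\vec{e}_2$; the larger eigenvalue corresponds to the shorter axis.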
\subsection{Singular value decomposition}
Singular value decomposition (SVD) is a widely applicable tool in linear algebra.
Its strength stems partially from the fact that \textit{every matrix} $\A \in \R^{m \times n}$ has an SVD (even non-square matrices)!
The decomposition goes as follows:
\[\A = \mat{U}\mat{\Sigma}\mat{V}\tran\]
where $\mat{U} \in \R^{m \times m}$ and $\mat{V} \in \R^{n \times n}$ are orthogonal matrices and $\mat{\Sigma} \in \R^{m \times n}$ is a diagonal matrix with the \term{singular values} of $\A$ (denoted $\sigma_i$) on its diagonal.
By convention, the singular values are given in non-increasing order, i.e.
\[\sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_{\min(m,n)} \geq 0\]
Only the first $r$ singular values are nonzero, where $r$ is the rank of $\A$.
Observe that the SVD factors provide eigendecompositions for $\A\tran\A$ and $\A\A\tran$:
\begin{align*}
\A\tran\A &= (\mat{U}\mat{\Sigma}\mat{V}\tran)\tran\mat{U}\mat{\Sigma}\mat{V}\tran = \mat{V}\mat{\Sigma}\tran\mat{U}\tran\mat{U}\mat{\Sigma}\mat{V}\tran = \mat{V}\mat{\Sigma}\tran\mat{\Sigma}\mat{V}\tran \\
\A\A\tran &= \mat{U}\mat{\Sigma}\mat{V}\tran(\mat{U}\mat{\Sigma}\mat{V}\tran)\tran = \mat{U}\mat{\Sigma}\mat{V}\tran\mat{V}\mat{\Sigma}\tran\mat{U}\tran = \mat{U}\mat{\Sigma}\mat{\Sigma}\tran\mat{U}\tran
\end{align*}
It follows immediately that the columns of $\mat{V}$ (the \term{right-singular vectors} of $\A$) are eigenvectors of $\A\tran\A$, and the columns of $\mat{U}$ (the \term{left-singular vectors} of $\A$) are eigenvectors of $\A\A\tran$.
The matrices $\mat{\Sigma}\tran\mat{\Sigma}$ and $\mat{\Sigma}\mat{\Sigma}\tran$ are not necessarily the same size, but both are diagonal with the squared singular values $\sigma_i^2$ on the diagonal (plus possibly some zeros).
Thus the singular values of $\A$ are the square roots of the eigenvalues of $\A\tran\A$ (or equivalently, of $\A\A\tran$)\footnote{
Recall that $\A\tran\A$ and $\A\A\tran$ are positive semi-definite, so their eigenvalues are nonnegative, and thus taking square roots is always well-defined.
}.
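As a simple example, if $\A$ is symmetric positive semi-definite, then its eigendecomposition $\A = \mat{Q}\mat{\Lambda}\mat{Q}\tran$ (with the eigenvalues sorted in non-increasing order) is already a singular value decomposition, with $\mat{U} = \mat{V} = \mat{Q}$ and $\mat{\Sigma} = \mat{\Lambda}$. For the running example $\matlit{2 & 1 \\ 1 & 2}$, the singular values are therefore $3$ and $1$; indeed, the eigenvalues of $\A\tran\A = \A^2 = \matlit{5 & 4 \\ 4 & 5}$ are $9$ and $1$, whose square roots are $3$ and $1$.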
\subsection{Some useful matrix identities}
\subsubsection{Matrix-vector product as linear combination of matrix columns}
\begin{proposition}
Let $\x \in \R^n$ be a vector and $\A \in \R^{m \times n}$ a matrix with columns $\a_1, \dots, \a_n$.
Then
\[\A\x = \sum_{i=1}^n x_i\a_i\]
\end{proposition}
This identity is extremely useful in understanding linear operators in terms of their matrices' columns.
The proof is very simple (consider each element of $\A\x$ individually and expand by definitions) but it is a good exercise to convince yourself.
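For example, with $m = n = 2$,
\[\matlit{1 & 2 \\ 3 & 4}\matlit{x_1 \\ x_2} = x_1\matlit{1 \\ 3} + x_2\matlit{2 \\ 4}\]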
\subsubsection{Sum of outer products as matrix-matrix product}
An \term{outer product} is an expression of the form $\a\b\tran$, where $\a \in \R^m$ and $\b \in \R^n$.
By inspection it is not hard to see that such an expression yields an $m \times n$ matrix such that
\[[\a\b\tran]_{ij} = a_ib_j\]
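For example, with $\a = (1, 2) \in \R^2$ and $\b = (3, 4, 5) \in \R^3$,
\[\a\b\tran = \matlit{3 & 4 & 5 \\ 6 & 8 & 10}\]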
It is not immediately obvious, but the sum of outer products is actually equivalent to an appropriate matrix-matrix product!
We formalize this statement as
\begin{proposition}
Let $\a_1, \dots, \a_k \in \R^m$ and $\b_1, \dots, \b_k \in \R^n$. Then
\[\sum_{\ell=1}^k \a_\ell\b_\ell\tran = \mat{A}\mat{B}\tran\]
where
\[\mat{A} = \matlit{\a_1 & \cdots & \a_k}, \tab \mat{B} = \matlit{\b_1 & \cdots & \b_k}\]
\end{proposition}
\begin{proof}
For each $(i,j)$, we have
\[\left[\sum_{\ell=1}^k \a_\ell\b_\ell\tran\right]_{ij} = \sum_{\ell=1}^k [\a_\ell\b_\ell\tran]_{ij} = \sum_{\ell=1}^k [\a_\ell]_i[\b_\ell]_j = \sum_{\ell=1}^k A_{i\ell}B_{j\ell}\]
This last expression should be recognized as an inner product between the $i$th row of $\A$ and the $j$th row of $\mat{B}$, or equivalently the $j$th column of $\mat{B}\tran$.
Hence by the definition of matrix multiplication, it is equal to $[\mat{A}\mat{B}\tran]_{ij}$.
\end{proof}
\subsubsection{Quadratic forms}
Let $\A \in \R^{n \times n}$ be a symmetric matrix, and recall that the expression $\x\tran\A\x$ is called a quadratic form of $\A$.
It is in some cases helpful to rewrite the quadratic form in terms of the individual elements that make up $\A$ and $\x$:
\[\x\tran\A\x = \sum_{i=1}^n\sum_{j=1}^n A_{ij}x_ix_j\]
This identity is valid for any square matrix (it need not be symmetric), although quadratic forms are usually only discussed in the context of symmetric matrices.
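For example, in the $2 \times 2$ symmetric case $\A = \matlit{a & b \\ b & c}$, the expansion reads
\[\x\tran\A\x = a x_1^2 + 2b x_1 x_2 + c x_2^2\]
where the cross term appears twice because $A_{12} = A_{21} = b$.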