Backpropagation is the process that lets neural networks learn from their own mistakes, but in a more organized and mathematical fashion.
Neural Networks
Linear score function: $f = Wx$, where $W \in \mathbb R^{C \times D},\ x \in \mathbb R^{D}$; $D$ is the data dimension and $C$ is the number of classes.
2-layer Neural Network: $f = W_2 \max(0, W_1x)$, where $W_1 \in \mathbb R^{H \times D},\ x \in \mathbb R^{D},\ W_2 \in \mathbb R^{C \times H}$; $D$ is the data dimension, $C$ the number of classes, and $H$ the hidden-layer size.
- the $\max$ function creates a non-linearity between the two linear transformations; we call it the activation function.
- In practice, we usually add a bias vector at each layer.
- 3-layer: stack the layers similarly $f = W_3\max(0,W_2 \max(0, W_1x))$.
The hidden layer can create templates for parts of the object.
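The forward pass of the 2-layer network above can be sketched in a few lines of NumPy (sizes $D$, $H$, $C$ are arbitrary example values):

```python
import numpy as np

# Example sizes: D input features, H hidden units, C classes.
D, H, C = 4, 5, 3
rng = np.random.default_rng(0)
W1 = rng.standard_normal((H, D))
W2 = rng.standard_normal((C, H))
x = rng.standard_normal(D)

h = np.maximum(0.0, W1 @ x)   # hidden layer with ReLU non-linearity
f = W2 @ h                    # class scores, shape (C,)
print(f.shape)  # (3,)
```

In practice each layer would also add a bias vector, omitted here for brevity.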
Activation function: the function in the network that creates non-linearity.
- if we exclude it from the network, the network degenerates into a simple linear classifier.
- ReLU: $f(x) = \max(0, x)$.
	- the most popular activation function in neural networks.
	- problem: creates dead neurons when the input is negative, since the gradient there is zero.
- has a lot of variations.
- Leaky ReLU: $f(x) = \max(0.1x, x)$
- ELU: $f(x) = \begin{cases} x, & x \ge 0 \\ \alpha(e^x - 1), & x < 0 \end{cases}$
- GELU: $f(x) = x \cdot \Phi(x)$, where $\Phi$ is the standard normal CDF.
- SiLU: $f(x) = x \cdot \sigma(x)$
- Sigmoid: $\displaystyle \sigma(x) = \frac{1}{1 + e^{-x}}$
- Tanh: $\displaystyle \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$
- The choice of the activation function is mostly empirical.
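The activation functions listed above can be written directly from their formulas (a minimal NumPy sketch; the function names are just illustrative):

```python
import numpy as np
from math import erf

def relu(x):        return np.maximum(0.0, x)
def leaky_relu(x):  return np.maximum(0.1 * x, x)
def elu(x, alpha=1.0):
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))
def sigmoid(x):     return 1.0 / (1.0 + np.exp(-x))
def silu(x):        return x * sigmoid(x)
def gelu(x):
    # Phi(x), the standard normal CDF, expressed via erf
    return x * 0.5 * (1.0 + np.vectorize(erf)(x / np.sqrt(2.0)))
```

For example, `sigmoid(0.0)` gives `0.5` and `leaky_relu(-10.0)` gives `-1.0` rather than ReLU's `0.0`.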
Features of Neural Networks:
- more hidden neurons mean more capacity for complex functions.
- we should use explicit regularization (add a $\lambda R(W)$ term to the loss) to avoid overfitting the training dataset, rather than shrinking the size of the neural network.
- The size of the neural network determines its capacity to represent complex functions. The choice of size is empirical and depends on the given problem; experiments are necessary.
- The regularization strength controls how strongly we push the network toward generalizing rather than fitting the training data.
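The $\lambda R(W)$ term above can be sketched for L2 regularization, where $R(W)$ sums the squared weights of every layer (the function name is illustrative):

```python
import numpy as np

def l2_regularized_loss(data_loss, weights, lam):
    # Total loss = data loss + lambda * R(W), with R(W) = sum of squared weights.
    return data_loss + lam * sum(np.sum(W * W) for W in weights)

W1 = np.ones((2, 3))   # R contribution: 6
W2 = np.ones((4, 2))   # R contribution: 8
total = l2_regularized_loss(1.0, [W1, W2], lam=0.1)  # 1.0 + 0.1 * 14
```

Raising `lam` penalizes large weights more, trading training accuracy for smoother decision boundaries.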
Biological inspirations:
- multiple inputs: the cell body aggregates impulses from the dendrites.
- output: impulses are carried away from the cell body along the axon.
Backpropagation
Computational Graph: a graph that puts together all the calculations in the neural network.
- the derivative of each node (a single calculation step) is simple and obvious.
- Based on the chain rule: the gradient of each parameter is the product of its local gradient and the upstream gradient.
- This way of calculating avoids complicated derivations by hand.
- computational graph representation may not be unique.
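The chain-rule bookkeeping can be traced by hand on a tiny graph, e.g. $f = (x + y)\,z$ (a standard toy example; the variable names are just for illustration):

```python
# Forward pass through f = (x + y) * z.
x, y, z = -2.0, 5.0, -4.0
q = x + y          # q = 3.0
f = q * z          # f = -12.0

# Backward pass: each node multiplies its local gradient by the upstream gradient.
df_df = 1.0
df_dq = z * df_df      # mul node: local gradient w.r.t. q is z
df_dz = q * df_df      # local gradient w.r.t. z is q
df_dx = 1.0 * df_dq    # add node passes the gradient through unchanged
df_dy = 1.0 * df_dq
print(df_dx, df_dy, df_dz)  # -4.0 -4.0 3.0
```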
Typical patterns in gradient flow:
- add gate: gradient distributor, passes the upstream gradient unchanged to all inputs.
- mul gate: "swap multiplier", each input's gradient is the upstream gradient times the other input.
- copy gate: gradient adder, sums the upstream gradients from all branches.
- max gate: gradient router, passes the upstream gradient only to the larger input.
The forward / backward API can be implemented as class methods:

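A minimal sketch of such a node class, here for a multiply gate (the class and method names are illustrative, not from any particular framework):

```python
class Multiply:
    """A graph node exposing forward() and backward()."""

    def forward(self, x, y):
        # Cache the inputs; the backward pass needs them for the local gradients.
        self.x, self.y = x, y
        return x * y

    def backward(self, upstream):
        # Chain rule: local gradient times upstream gradient (inputs swapped).
        return self.y * upstream, self.x * upstream

node = Multiply()
out = node.forward(3.0, -2.0)   # -6.0
dx, dy = node.backward(1.0)     # dx = -2.0, dy = 3.0
```

A full network then chains such nodes: call `forward` on each node in order, then `backward` in reverse order, feeding each node's output gradient into the previous node.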
Backprop with vectors/matrices:
- vector to scalar derivative: $\displaystyle \left(\frac{\partial y}{\partial x}\right)_n = \frac{\partial y}{\partial x_n}$. The derivative is a vector.
- vector to vector derivative: $\displaystyle \left(\frac{\partial y}{\partial x}\right)_{nm} = \frac{\partial y_m}{\partial x_n}$. The derivative is a Jacobian matrix.
- when doing backprop, the derivative at each node is a Jacobian matrix; the multiplications become matrix-matrix or matrix-vector products.
- For an elementwise function such as $\max$, the Jacobian matrix is diagonal and very sparse. In this case we don't actually store the matrix; we compute the product directly in the backward pass function.
- matrix to scalar derivative: $\displaystyle \left(\frac{\partial y}{\partial x}\right)_{nm} = \frac{\partial y}{\partial x_{nm}}$. The size is the same as the matrix.
- matrix to matrix derivative: $\displaystyle \left(\frac{\partial y}{\partial x}\right)_{nmab} = \frac{\partial y_{ab}}{\partial x_{nm}}$. The derivative is a 4-D tensor whose number of elements is the product of the sizes of the two matrices.
- In general, we don't store the Jacobian matrix because it would take a huge amount of memory. We usually write a backward pass function that computes the needed products directly.
- For matrix multiply $Y = XW$ node in the loss function, the gradient is a swap: $\displaystyle \frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y} W^{T},\ \frac{\partial L}{\partial W} = X^T\frac{\partial L}{\partial Y}$.
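The "swap" formulas for the matrix-multiply node can be verified numerically with a finite-difference check (shapes here are arbitrary example values):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 3))
W = rng.standard_normal((3, 4))
Y = X @ W
dY = rng.standard_normal(Y.shape)   # pretend upstream gradient dL/dY

dX = dY @ W.T    # dL/dX, same shape as X
dW = X.T @ dY    # dL/dW, same shape as W

# Finite-difference check on one element of X, for the loss L = sum(Y * dY)
eps = 1e-6
Xp = X.copy()
Xp[0, 1] += eps
numeric = np.sum(((Xp @ W) - Y) * dY) / eps
assert abs(numeric - dX[0, 1]) < 1e-4
```

Note the shapes work out automatically: $\partial L/\partial X$ matches $X$ and $\partial L/\partial W$ matches $W$, which is a quick sanity check when writing backward passes.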
Glossary
- reiterate v. to state again
- for the sake of
- dimensionality
- terminology
- pivotal
- lump together
- rectified
- binarize v. to convert to binary values
- in a nutshell: stated briefly
- prescription
- metaparameter
- aggregate v. to gather, to collect
- dendrite n. a neuron's branched input fiber
- axon n. a neuron's output fiber
- the ground truth
- intractable
- infeasible
- modularize
- with respect to