Neural networks

Lecture Notes on 26-9-2023
Published: October 4, 2023

The only difference between the standard perceptron and logistic regression is how we compute the activation function \(f(x)\).
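As a minimal sketch (the weight and bias names are just illustrative), both models share the linear part \(w^\top x + b\) and differ only in the activation applied to it:

```python
import numpy as np

def perceptron_output(x, w, b):
    # Hard threshold: outputs +1 or -1
    return np.sign(w @ x + b)

def logistic_output(x, w, b):
    # Smooth sigmoid: outputs a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))
```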

The perceptron cannot solve the XOR problem, and neither can logistic regression. To get around this, we stack perceptrons in a layered fashion. This is called a Multilayer Perceptron.

But because people started to use the logistic function instead of the hard perceptron output (so \(+1\) or \(-1\)), it is now sometimes called a Multilayer Logistic Regression.
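For instance, a hand-wired two-layer perceptron can represent XOR, which no single perceptron can. This is only an illustrative sketch with hand-picked weights, not a trained model:

```python
def step(z):
    # Hard perceptron threshold
    return 1.0 if z > 0 else 0.0

def xor_mlp(x1, x2):
    # Hidden layer: an OR unit and an AND unit (hand-picked weights)
    h_or  = step(x1 + x2 - 0.5)
    h_and = step(x1 + x2 - 1.5)
    # Output layer: "OR but not AND" is exactly XOR
    return step(h_or - h_and - 0.5)

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, xor_mlp(*x))   # -> 0.0, 1.0, 1.0, 0.0
```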

So instead of that, we are going to use different activation functions.

We also talked about QDA, which is a non-linear classifier.

We call the computation going from the input to the output forward propagation.

The number of parameters for Logistic Regression is \(D+1\), whereas the Gaussian classifier needs on the order of \(D^2\) parameters, which makes it much more expensive.
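As a rough, hypothetical count (assuming a full \(D \times D\) covariance matrix per class; the exact number depends on whether the covariance is shared):

```python
D = 100  # hypothetical input dimensionality

logistic_params = D + 1        # one weight per input dimension plus a bias
gaussian_params = D * D + D    # full covariance (~D^2 entries) plus a mean vector

print(logistic_params, gaussian_params)  # 101 vs 10100
```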

So when you get scared by the number of layers, just remember that the earlier layers are simply a kind of basis function, and the last layer is a linear model, like so:

The difference from linear models with adaptive basis functions is that Neural Networks learn the basis functions.

With Neural Networks we do not need heuristics for these basis functions; they are learned automatically.
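A minimal sketch of this view (tanh is just an example activation; the names are illustrative): the hidden layer computes the basis functions, and the output layer is an ordinary linear model on top of them.

```python
import numpy as np

def forward(x, W1, b1, w2, b2):
    # Hidden layer: plays the role of the (learned) basis functions phi(x)
    phi = np.tanh(W1 @ x + b1)
    # Output layer: a plain linear model on top of the learned basis
    return w2 @ phi + b2
```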

We see that to separate the data from above we can use one hidden layer with three units:

Recall from the previous lecture that with Logistic Regression we needed to hand-pick some radial basis functions; the difference here is that the Neural Network has learned these basis functions for us.

A NN starts with low-level features and keeps building on them until it has computed the features it finds relevant for making a decision.

It is spatially invariant because you are moving your kernel along all dimensions.

1 Universal Approximators

We can find a number \(M\) and weights \(\textbf{w}^{(2)}\), i.e. the weights in the second layer. Here \(M\) is the number of units, i.e. the circles in a layer. See the green highlighted circles below.


If we want the error to go to zero, then the value of \(M\) has to become larger and larger. This is because we fixed our NN to have only 2 layers, so the only thing that can grow is the width, i.e. \(M\), the number of neurons (units). With a very large \(M\) this becomes hard to compute, because then you need a lot of GPUs.

The summary of the Universal Approximator theorem is that width \(M\) is what matters. It is just one hidden layer, and if you can make it big enough, i.e. if you can increase \(M\) enough, then you can approximate almost anything.


Here \(\textbf{w}^{(2)} \in \mathbb{R}^{M \times 1}\), because if we want to compute only a single value at the end of the NN, the dimensions of the vectors need to align; see the figure above.
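A small sketch of the shapes involved, assuming a sigmoid hidden layer (the function and variable names are illustrative):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def two_layer_net(x, W1, b1, w2, b2):
    # W1: (M, D), b1: (M,)  -> M hidden units
    # w2: (M,),  b2: scalar -> w^(2) in R^{M x 1}, so the output is a single value
    h = sigmoid(W1 @ x + b1)   # hidden activations, shape (M,)
    return w2 @ h + b2         # scalar output
```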

2 Deep Neural Nets and Shallow Neural Nets

Here we are saying that if we want a Shallow Neural Net to keep the error as low as a Deep Neural Net, then we need to increase \(M\), i.e. the number of weights. But doing so scales exponentially.

To grow the complexity you can increase the width, but you can also increase the depth instead. As we will see in the slide below, depth gives you more complexity, so it is preferable to increase depth instead of width:

  • Width: \(M\), the number of hidden units (weights).
    If you want the error to be small, increase this number. However, if you go this route, the required \(M(\varepsilon)\) grows exponentially.
  • Depth: \(l\), the number of hidden layers.
    Depth gives you more complexity, so it is recommended; the end result is deeper NNs.

The point of the image above is to show that the same model (3 hidden units and 1 linear output unit), which is the Neural Network, can approximate these 4 functions.

3 Expressive Power of ReLU Networks

4 Network Training: Regression

Remember that maximizing the likelihood is the same as minimizing the negative log likelihood; that is why there is a minus in front of \(\ln p(\textbf{t}|\textbf{W},\textbf{w})\).
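Concretely, assuming the usual Gaussian noise model \(t_n = y(\textbf{x}_n,\textbf{W}) + \varepsilon\) with noise precision \(\beta\) (not stated explicitly in the notes above), the negative log likelihood reduces to a sum-of-squares error plus constants:

\[ \begin{align} -\ln p(\textbf{t}|\textbf{W},\textbf{w}) = \frac{\beta}{2}\sum_{n=1}^{N}\left(y(\textbf{x}_n,\textbf{W}) - t_n\right)^2 - \frac{N}{2}\ln\beta + \frac{N}{2}\ln 2\pi \end{align} \]

so maximizing the likelihood over the weights is the same as minimizing the squared error.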

5 Network Training

5.1 Regression: Network Training

5.2 Binary Classification: Network Training

5.3 Classification with K classes: Network Training

For further explanation of the class distributions, take a look here.

6 Losses overview
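A small PyTorch sketch of the three losses that correspond to the three likelihood models above (tensor names and shapes are illustrative):

```python
import torch
import torch.nn.functional as F

# Regression: squared error (Gaussian likelihood)
y_hat, t = torch.randn(8, 1), torch.randn(8, 1)
mse = F.mse_loss(y_hat, t)

# Binary classification: binary cross-entropy on the logits (Bernoulli likelihood)
logits, labels = torch.randn(8), torch.randint(0, 2, (8,)).float()
bce = F.binary_cross_entropy_with_logits(logits, labels)

# K-class classification: softmax + negative log likelihood (categorical likelihood)
logits_k, classes = torch.randn(8, 5), torch.randint(0, 5, (8,))
ce = F.cross_entropy(logits_k, classes)
```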

7 Training Neural Networks

Because we cannot easily find solutions mathematically (in closed form) for training the NN, we use Gradient Descent.


The loss is the result of forward propagation and depends on all the NN weights → lots of local minima! It results in lots of local minima because the loss depends on all the weights, which are all updated simultaneously.

That means that the loss function depends on all the parameters (all the weights in the NN), so we have a lot of parameters to update.
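A minimal gradient-descent training loop, sketched in PyTorch (the model, data, and learning rate are made up for illustration):

```python
import torch

# Hypothetical tiny network and data, just to show plain gradient descent
model = torch.nn.Sequential(torch.nn.Linear(2, 16), torch.nn.Tanh(), torch.nn.Linear(16, 1))
x, t  = torch.randn(32, 2), torch.randn(32, 1)
lr    = 0.1

for step in range(100):
    loss = torch.nn.functional.mse_loss(model(x), t)  # depends on *all* the weights
    model.zero_grad()
    loss.backward()                                   # gradients via backpropagation
    with torch.no_grad():
        for p in model.parameters():
            p -= lr * p.grad                          # every weight updated simultaneously
```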


Here, for instance, we have 2 local minima and one unique global minimum.

8 Neural Network Optimization Surface

  • Despite these crazy optimization landscapes, gradient descent works amazingly well!
  • For large networks, many of the critical points are saddle-points, not local minima.

As the dimensionality increases, the number of saddle points grows relative to the number of local minima, but the number of minima also grows.

Because of all this variation, we always want to report uncertainties on performance. These uncertainties come from randomness in the initialization and in Stochastic Gradient Descent (SGD).

The idea is to report error by averaging over all these models in the same column.
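For example (the accuracy values below are made up purely for illustration), reporting the mean and spread over several random seeds might look like this:

```python
import numpy as np

# Hypothetical test accuracies from re-training the same architecture with
# different random seeds (initialization + SGD shuffling)
accuracies = np.array([0.912, 0.905, 0.921, 0.908, 0.915])

print(f"accuracy: {accuracies.mean():.3f} +/- {accuracies.std(ddof=1):.3f}")
```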

Now, we talked about making the learning rate higher at the start and then slowly decreasing it during training. This is what we discuss in the next topic:

9 Cyclic Learning Rates

It may seem crazy to make the learning rate cyclic, i.e. to increase it, then lower it, then increase it again. But this results in exploration, so that you don't end up stuck in a local minimum.

With a cyclic learning rate we could explore three local minima.
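A sketch of how this might be set up with PyTorch's built-in CyclicLR scheduler (the model, data, and hyperparameters are illustrative):

```python
import torch

model     = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# The learning rate ramps from base_lr up to max_lr and back down, repeatedly,
# which produces the "explore several minima" behaviour described above.
scheduler = torch.optim.lr_scheduler.CyclicLR(
    optimizer, base_lr=1e-4, max_lr=1e-1, step_size_up=200
)

x, t = torch.randn(64, 10), torch.randn(64, 1)
for step in range(1000):
    loss = torch.nn.functional.mse_loss(model(x), t)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()   # update the learning rate once per batch
```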

Note: We used to train Logistic Regression with Gradient Descent; here it is the same with NNs. More precisely, we cannot compute a closed form for the weights, hence we use GD.

10 Chain Rule through Layers

  • Forward Propagation: evaluate the full network from inputs to outputs
  • Backward Propagation: send error signals back through the network

Remember \(h_l = \sigma(a)\), meaning the output of the sigmoid is the new \(h_l\). With this in mind, because we want to tune the parameters (the weights), we send the error back towards the inputs; as we said before, this is called backpropagation.

Now, because we want to send the error back so that we can reduce it, we compute derivatives. Since the NN is a chain of compositions, the chain rule turns the derivative into a chain of multiplications, and here we can make use of optimization techniques. That is:

  • Backpropagation is simply the chain rule implemented with cached values of intermediate computations.
  • We trade computational complexity for memory. In other words, we save computation (we avoid re-evaluating many intermediate quantities) because we have cached the \(h_{l,j}\), which are exactly the values needed to compute the derivatives and thus carry out backpropagation.

In summary, when we do backpropagation, the layered structure lets us avoid recomputing the derivatives every time; instead we trade that computation for the memory needed to cache the \(h_{l,j}\).

This is what distinguishes a plain application of the chain rule from backpropagation: backpropagation reuses information.
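A scalar sketch of this idea, assuming a one-hidden-unit network with a sigmoid hidden layer and squared-error loss: the forward pass caches the intermediate activations, and the backward pass reuses them instead of recomputing anything.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, w1, w2):
    # Forward propagation: cache the intermediate activations
    a1 = w1 * x
    h1 = sigmoid(a1)
    y  = w2 * h1                           # identity output (regression)
    cache = (x, h1)
    return y, cache

def backward(y, t, w2, cache):
    # Backward propagation: reuse the cached values instead of recomputing them
    x, h1 = cache
    dL_dy  = y - t                         # derivative of the squared-error loss
    dL_dw2 = dL_dy * h1                    # cached h1 reused here
    dL_dh1 = dL_dy * w2
    dL_dw1 = dL_dh1 * h1 * (1 - h1) * x    # sigmoid derivative from cached h1
    return dL_dw1, dL_dw2
```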

11 Downside: Numerical Issues

Two things can go wrong with this approach:

  1. Exploding gradients:
    • Problem: an intermediate derivative can be large, so when you keep multiplying it with the derivatives of the remaining layers, the product becomes a huge number.
    • Solution: this can usually be handled by ‘clipping’ the gradient values or regularizing the weights to be small (see the sketch after this list).
  2. Vanishing gradients:
    • Problem: when the output of the sigmoid is close to \(1\) or \(0\), the derivative \(h_{l,j}(1-h_{l,j})\), where recall that \(\textbf{h}_l=\sigma(\textbf{a})\), becomes (close to) zero. In other words, if \(h_{l,j}\) is \(1\) or \(0\), then all derivatives propagated to earlier layers will be zero. See figure below:
    The figure above then means that if one derivative becomes zero, learning cannot happen.
    • Possible Solution 1: there is no clean fix, because if you want to clip this derivative so that it never reaches \(1\) or \(0\), the next question is how close to these extremes you should allow it to get. (In practice the sigmoid will not evaluate to exactly \(1\) or \(0\), but it will be small enough that the product of derivatives in the chain rule becomes too small for training to proceed.)
    • Possible Solution 2: add a skip connection that links the current layer output \(h_l\) to the previous layer output \(h_{l-1}\) (see the sketch after this list).
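A PyTorch sketch of the two remedies mentioned above: gradient clipping for exploding gradients, and a skip (residual) connection for vanishing gradients. The model, data, and clipping threshold are purely illustrative.

```python
import torch

model     = torch.nn.Sequential(torch.nn.Linear(4, 8), torch.nn.Sigmoid(), torch.nn.Linear(8, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, t      = torch.randn(16, 4), torch.randn(16, 1)

loss = torch.nn.functional.mse_loss(model(x), t)
loss.backward()

# 1. Exploding gradients: clip the overall gradient norm before the update
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()

# 2. Vanishing gradients: a skip ("residual") connection adds h_{l-1} back onto
# the layer output, so gradients always have a direct path to earlier layers
class ResidualBlock(torch.nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.linear = torch.nn.Linear(dim, dim)

    def forward(self, h):
        return h + torch.sigmoid(self.linear(h))   # h_l = h_{l-1} + sigma(W h_{l-1})
```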

12 Scalar Neural Network


Recall that when we do linear regression we do not need an activation function \(f\) like the softmax or logistic function; we just use the identity, which means \(f(w_2 \cdot h)\) becomes just \(w_2 \cdot h\).


12.1 Computing the Error Loss with Respect to \(w_1\)

Now suppose we compute the error again, but with ReLU as the activation function:

This can be a bad idea, because if \(w_1 \cdot x\) is lower than \(0\), then the ReLU output, and hence its derivative, is zero, and the network cannot train because you keep getting derivatives of zero. This is called the Dead ReLU problem. In practice, people initialize so that the inputs to the ReLU are positive. People also use the Leaky ReLU, depicted below.
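A small sketch comparing ReLU and Leaky ReLU (the 0.01 slope is just a common illustrative choice):

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)

def leaky_relu(a, slope=0.01):
    # Negative inputs keep a small slope, so the gradient never becomes exactly zero
    return np.where(a > 0, a, slope * a)

a = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(a))        # [0.  0.  0.  1.5]        -> zero gradient for a < 0 ("dead" ReLU)
print(leaky_relu(a))  # [-0.02 -0.005 0.  1.5]   -> small but non-zero gradient for a < 0
```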

13 Vector Neural Network

We will have a scalar input and a scalar output, but with the hidden layers as vectors.

Note
  • Derivative is a row vector: \[ \begin{align} \frac{\partial f}{\partial \textbf{x}} = \left[\frac{\partial f}{\partial x_1},...,\frac{\partial f}{\partial x_D} \right] \end{align} \]
  • Gradient is the transpose of the Derivative vector: \[ \begin{align} \nabla_{\textbf{w}_1}f = \left[\frac{\partial f}{\partial \textbf{w}_1} \right]^T \end{align} \] We take the transpose because we want the gradient to have the same shape as the parameter \(\textbf{w}_1\). So in this case, if \(\textbf{w}_1\) had been a two-row vector, the transpose ensures the gradient is too.

14 Computations in 2D

A rule of thumb is:

  • The output of the linear operator goes along the rows
  • The model parameters go along the columns

15 Two Modes: Forward and Reverse

In PyTorch, you can compute the forward or reverse mode of differentiation we saw before. The only difference is which direction you cache from.

  1. Forward mode: you start with the partial derivatives at the beginning of the NN. For instance, in the image below, from \(\partial h_1/\partial w_1\) (the first layer) to \(\partial f/\partial h_2\) (the last layer).
    • Better when output dim >> input dim
  2. Reverse mode: you start from the output and work back to the input.
    • Better when output dim << input dim

This tells us, for example, that to be more efficient in an image classifier you would prefer Reverse mode, because the input is a huge array of numbers whereas the output is only a few classes. So here it applies: output dim << input dim.

The reason forward mode can be efficient is that when the input is small you only compute small derivatives at the start, and you delay the much larger multiplications until the end.

Another example of why one mode is more efficient: say your outputs >> inputs. If you start by computing the output derivatives, you accumulate and carry all those dimensions due to the large output derivatives; you then carry these computations to the input layer, which needs fewer, but you already have a huge baggage and it is not efficient. Summary: you want to postpone the large multiplications until the end so that you do not carry unnecessary derivatives along the way.
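A small sketch of the two modes using PyTorch's torch.autograd.functional helpers, for a function with many inputs and a single output (the setup is illustrative): one reverse-mode pass (vjp) recovers the whole gradient, while one forward-mode pass (jvp) only yields a single directional derivative.

```python
import torch
from torch.autograd.functional import jvp, vjp

def f(x):
    # Many inputs, one scalar output (like a classifier's loss)
    return (x ** 2).sum()

x = torch.randn(1000)

# Reverse mode: a single vector-Jacobian product gives df/dx for all 1000 inputs
_, grad = vjp(f, x, torch.tensor(1.0))

# Forward mode: a single Jacobian-vector product only gives one directional
# derivative, so covering all 1000 input directions would need 1000 passes
v = torch.zeros(1000)
v[0] = 1.0
_, dfdx0 = jvp(f, x, v)
```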