Deep Learning Optimizations II

Description of this Post
Author
Published

November 20, 2023

1 Lecture overview

Slide 4

2 Start with

Slide 5

3 Quiz

Slide 6

Why would the left one be preferred?

  • Simpler model
  • Generalizes better to unseen data

Why is the right one better?

  • We are actually fitting all the data points
  • The right-hand side is the ground truth
  • You don't care about extrapolating

4 Digression: Gravitation

Slide 7

5 Bias-variance tradeoff

Slide 8

The single best prediction of the parameters.

A good estimator is a function whose output is close to the true underlying theta that generated the data.

6 Bias-variance tradeoff

Slide 9

Bias

The difference between the estimator's expected value (the expected output of our theta estimate) and the true value of that parameter.

Bias comes from not being able to represent the true model correctly.

Variance

If you use a different training set or a different split, how different are the learned NN parameters? If they differ a lot, the variance is high.

7 Bias-variance tradeoff

Slide 10

High variance & low bias: on average your model gets it right, but it has high variance because it tries to model all the noise, so it overfits and its predictions are spread all over the place.

High bias & low variance: on average it does not even get it right, so it underfits.
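
As a rough numerical illustration of these two notions (a minimal sketch, not from the slides), we can fit the same kind of model on many resampled training sets and measure how its predictions behave on average and how much they vary:

import numpy as np

rng = np.random.default_rng(0)

def sample_dataset(n=30):
    # true function: y = sin(x), observed with noise
    x = rng.uniform(0, 2 * np.pi, n)
    y = np.sin(x) + 0.3 * rng.normal(size=n)
    return x, y

x_test = np.linspace(0, 2 * np.pi, 50)
true_y = np.sin(x_test)

for degree in [1, 9]:  # low-capacity vs high-capacity polynomial
    preds = []
    for _ in range(200):  # many different training sets
        x, y = sample_dataset()
        preds.append(np.polyval(np.polyfit(x, y, degree), x_test))
    preds = np.array(preds)
    bias_sq = np.mean((preds.mean(axis=0) - true_y) ** 2)  # squared bias
    variance = preds.var(axis=0).mean()                    # variance across training sets
    print(f"degree {degree}: bias^2 = {bias_sq:.3f}, variance = {variance:.3f}")

The low-degree fit tends to show higher bias and lower variance; the high-degree fit the opposite.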

8 Overfitting

Slide 11

9 Overfitting

Slide 12

It will learn to recognize that the image contains horses not from the horses themselves but from the watermark. This is visible in the heat-map, where the point of attention is on the watermark. This is overfitting.

10 Overfitting

Slide 13

To avoid overfitting we use regularization

11 Overfitting: how “powerful” are neural networks?

Slide 14

12 This isn’t the full story..

Slide 15

13 Update 2019: Double Descent

Slide 16

The dashed line is where the number of data samples equals the number of parameters.

In the top-right plot, even after we increase the number of hidden layers so far that the training error is zero, the test error is still decreasing, which is surprising.

So here we presume that a bigger model will have a lower error.

Before, the curve used to go back up, as in the typical bias/variance curve on the previous slide; the weird thing is that past the dashed line the test error starts to decrease again.

Two answers: smoothness and regularization

14 Double-descent: Smoothness from bigger models

Slide 17

On the x-axis you have the number of SGD steps taken, i.e., the amount of text (tokens, roughly "words") processed.

We can see that larger models require fewer samples to reach a given, lower test loss.

Also, once they reach stability, the test loss they converge to is the best they can achieve, and this depends on their size: larger models give more accurate results.

It also shows that the larger model learns more quickly than the smaller one: for a given number of tokens, the large model has learned more.

This is about language models; it is still not shown to apply to vision.

We can also say that, for language models, in terms of FLOPs it is more efficient to train large models for fewer steps than to train small models for a larger number of steps.

15 Double-descent in practice?

Slide 18

In practice, if you increase the number of neurons you may get closer to overfitting. For that we need regularization.

16 Regularization

Slide 19

Here we reduce the complexity of a NN and avoid overfitting.

17 1) L2-regularization

Slide 20

Referred to as weight decay, or ridge regression in the linear case.

Omega is proportional to how large the weights are.

Minimizing this is also the same as assuming a Gaussian prior on your weights, i.e., you assume the weights are Gaussian distributed.

18 1) L2-regularization

18.1 L2 & L1 Formula
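
The slide's exact formulas are not shown here; in the usual notation, with the penalty written as \(\Omega\) and the regularization strength as \(\lambda\), they are:

\[\tilde{J}(\mathbf{w}) = J(\mathbf{w}) + \lambda\, \Omega(\mathbf{w})\]

\[\Omega_{L2}(\mathbf{w}) = \frac{1}{2}\|\mathbf{w}\|_2^2 = \frac{1}{2}\sum_i w_i^2 \qquad \Omega_{L1}(\mathbf{w}) = \|\mathbf{w}\|_1 = \sum_i |w_i|\]

where \(J(\mathbf{w})\) is the unregularized training loss.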


L2 example Python

The L2 loss, also known as the Euclidean loss or Mean Squared Error (MSE), is a common loss function used in regression problems. It measures the average squared difference between the predicted values and the actual values.

The formula for L2 loss is given by:

\(L2\_loss = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2\)

Where:

  • \(N\) is the number of data points.
  • \(y_i\) is the true (ground truth) value for the i-th data point.
  • \(\hat{y}_i\) is the predicted value for the i-th data point.

Here’s a simple example in Python using NumPy:

Code
import numpy as np

# Generate some example data
true_values = np.array([2, 4, 5, 4, 5])
predicted_values = np.array([1.5, 3.5, 4.8, 4.2, 5.2])

# Calculate L2 loss
l2_loss = np.mean((true_values - predicted_values)**2)

print("L2 Loss:", l2_loss)
L2 Loss: 0.12400000000000003

The L2 loss is calculated by taking the mean of the squared differences between the true and predicted values. The smaller the L2 loss, the better the model’s predictions align with the true values.

Slide 21

Lambda tells you how important the regularization is:

  • If lambda is high, it pushes all the weights towards zero
  • If lambda is zero, it only optimizes the usual loss and does not care about regularization

Lambda is fixed during training; if you want to find the best one you need to run lots of experiments.
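
In PyTorch, the usual way to get L2 regularization is the optimizer's weight_decay argument; the following is a minimal sketch where the model, data, and lambda value are placeholders:

import torch
import torch.nn as nn

model = nn.Linear(10, 1)                     # placeholder model
criterion = nn.MSELoss()

# weight_decay plays the role of lambda and stays fixed during training
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)

x, y = torch.randn(32, 10), torch.randn(32, 1)
for step in range(100):
    optimizer.zero_grad()
    loss = criterion(model(x), y)            # the usual data loss
    loss.backward()
    optimizer.step()                         # the L2 penalty is applied inside the update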

19 2) L1-regularization

Slide 22

Isotropic means equal in all directions.

Now if you take the gradient of this with respect to w (because you want to optimize), the derivative of the absolute value is a constant (its sign). So you end up with a constant term in the update that keeps getting subtracted: in effect, a constant amount is repeatedly subtracted from the weights.

So it subtracts a bit from the positive weights and adds a bit to the negative weights, pushing them towards zero.

20 2) L1-regularization

Slide 23

L1 leads to sparse weights, which means more weights will be exactly zero or very close to zero.

If alpha increases, more weights become zero.

21 Why do L1 and L2 Regularizations work?

Slide 24

L2 regularization basically has a circular constraint region, because \(w_1^2 + w_2^2\) needs to be constant, so all these combinations form a circle.

The contours of the loss function (in red) intersect the L1 constraint region at an axis. That means that if you look for the optimal loss under the constraint, it touches the constraint region at a point where one of the weights is exactly zero, while with L2 regularization there is no particular point where one of the weights is pushed to zero.

This is because L2 only requires the sum of squared weights to be small, but there is no particular motivation for any individual weight dimension to be exactly zero, so there is no reason to get sparse weights with L2.

22 Effect: linear regression example

Slide 25

Here alpha goes from strong to weak.

With L1, some weights are pushed to zero and then stay at zero; it is not the case that all of them shrink at the same rate, and some of them stay quite high for a long time. Increasing alpha makes individual weights go to zero.

L2 Regularization (Weight Decay): Encourages smaller weights but does not force them to be exactly zero. It smoothens the weights but doesn’t induce sparsity.

L1 Regularization: Promotes sparsity by adding a penalty term based on the absolute values of the weights. This can lead to some weights being exactly zero.
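
PyTorch optimizers do not add an L1 penalty for you, so one common approach is to add it to the loss by hand; a small sketch, with the strength alpha chosen arbitrarily:

import torch
import torch.nn as nn

model = nn.Linear(10, 1)                     # placeholder model
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
alpha = 1e-3                                 # L1 regularization strength

x, y = torch.randn(32, 10), torch.randn(32, 1)
for step in range(200):
    optimizer.zero_grad()
    data_loss = criterion(model(x), y)
    l1_penalty = sum(p.abs().sum() for p in model.parameters())
    loss = data_loss + alpha * l1_penalty    # pushes some weights towards exactly zero
    loss.backward()
    optimizer.step()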

23 3) Early stopping

Slide 26

24 3) Early stopping

Slide 27

*Typo on the slide: it should read "with better test set error".

The models at this stage have low variance because they are not overfitting.

25 3) Early stopping

Slide 28

26 Why does early-stopping work as regularization?

Slide 29

27 Why does early-stopping work as regularization?

Slide 30

Here, by weight decay they mean L2 regularization.

Weight decay (L2-loss) vs Early Stopping

Early Stopping:

Early stopping is a regularization technique used during the training of a machine learning model, typically in the context of iterative optimization algorithms like gradient descent. The idea behind early stopping is to monitor the model’s performance on a validation set during training and stop the training process when the performance on the validation set starts to degrade, even if the performance on the training set continues to improve.

  • Mechanism: Monitor a performance metric (e.g., validation loss) on a separate validation set during training.
  • Decision Criteria: Stop training when the performance on the validation set starts to worsen or fails to improve for a certain number of consecutive epochs.
  • Purpose: Prevent overfitting by terminating training before the model starts to memorize noise in the training data.

Weight Decay (L2 Regularization):

Weight decay, also known as L2 regularization, is a technique to prevent overfitting by adding a penalty term to the loss function that is proportional to the squared magnitudes of the weights. This regularization term discourages the model from learning very large weights and encourages a smoother and more generalized solution.

  • Mechanism: Add a term to the loss function that penalizes large weights by adding the sum of squared weights multiplied by a regularization strength.
  • Decision Criteria: No specific stopping criterion; regularization is applied throughout the training process.
  • Purpose: Encourage the model to have smaller and more evenly distributed weights, preventing overfitting.

Key Differences:

  1. Focus:
    • Early stopping focuses on monitoring the model’s performance during training and stopping when the validation performance indicates potential overfitting.
    • Weight decay focuses on adjusting the optimization objective by penalizing large weights, aiming to prevent overfitting from the beginning of training.
  2. Decision Criteria:
    • Early stopping makes decisions based on the validation performance, and the training stops when the validation performance degrades.
    • Weight decay does not have a specific stopping criterion; it is a continuous regularization technique applied throughout training.
  3. Implementation:
    • Early stopping involves monitoring and interrupting the training loop.
    • Weight decay involves adding a regularization term to the loss function during each iteration of the optimization algorithm.

In practice, these techniques can be used together to enhance the regularization effect and improve the generalization performance of a machine learning model.
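
A minimal early-stopping loop might look like the sketch below; the model, the toy data, and the patience value are placeholders:

import copy
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                                   # placeholder model
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

x_tr, y_tr = torch.randn(200, 10), torch.randn(200, 1)     # toy training split
x_val, y_val = torch.randn(50, 10), torch.randn(50, 1)     # toy validation split

best_val, best_state, patience, bad_epochs = float("inf"), None, 5, 0
for epoch in range(1000):
    optimizer.zero_grad()
    criterion(model(x_tr), y_tr).backward()
    optimizer.step()

    with torch.no_grad():
        val_loss = criterion(model(x_val), y_val).item()

    if val_loss < best_val:        # validation improved: remember these weights
        best_val = val_loss
        best_state = copy.deepcopy(model.state_dict())
        bad_epochs = 0
    else:                          # validation got worse: count towards the patience budget
        bad_epochs += 1
        if bad_epochs >= patience:
            break

model.load_state_dict(best_state)  # roll back to the best validation checkpoint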

28 4) Dropout: the problem it addresses

Slide 31

29 4) Dropout: why does it work?

Slide 32

30 4) Dropout: why does it work?

Slide 33

31 4) Dropout: how is it implemented?

Slide 34

You switch activations to 0; with a Bernoulli mask you end up with, say, 50% of the neurons working.

During testing you are not learning, so you use all the neurons.

  • With dropout you cannot have neurons that stay inactive and rely on others: because the other neurons may be dropped, they now all need to work (see the sketch below)
  • Decreases overfitting
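
A bare-bones sketch of this mechanism (inverted dropout, which is what nn.Dropout implements: rescale at training time so nothing changes at test time):

import torch

def dropout(x, p=0.5, training=True):
    if not training:
        return x                                        # test time: all neurons are used
    keep = torch.bernoulli(torch.full_like(x, 1 - p))   # keep each activation with prob 1 - p
    return x * keep / (1 - p)                           # rescale so the expected value is unchanged

h = torch.randn(4, 8)
print(dropout(h, training=True))    # roughly half the activations are zeroed
print(dropout(h, training=False))   # identical to h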

32 Dropout

Slide 35

33 Dropout

Slide 36

34 Dropout

Slide 37

35 Dropout

Slide 38

36 Dropout

Slide 39

37 Dropout

Slide 40

38 Dropout vs. Bagging

Slide 41

Bagging, also known as bootstrap aggregation, is an ensemble learning method commonly used to reduce variance within a noisy dataset. The models are trained in parallel.


Bagging:

  • Each model has its own training set, which is different from the whole training set
  • Uses all neurons

Dropout:

  • It does not employ all neurons
  • The subnetworks are not trained individually; each one only gets a single SGD step, because there are (nearly) infinitely many subnetworks, so with dropout it is very unlikely that you train the same subnetwork many times

39 Dropout beyond Bagging

Slide 42

40 5) Data augmentation

Slide 43

41 Data augmentation

Slide 44

42 One note about backtranslation though:

Slide 45

43 Common computer vision augmentations visualised

Slide 46

44 Data augmentation

Slide 47

45 Data augmentation

Slide 48

46 Other regularizations

Slide 49

47 Various ways to regularise

Slide 50

48 Normalization

Slide 51

Putting data onto a common scale without distorting its shape.

49 Data preprocessing

Slide 52

Basically, if the inputs are on the same scale, the loss surface is not elongated like the ellipse, and the weights can take the same step size in the correct direction.

50 Normalizing Input Data

Slide 53

This is applied at the input stage.

Normalization is a linear operation, so you can fold it back into the NN after you have trained, if you want to.
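
For example, standardizing the inputs with statistics computed on the training set only (a small NumPy sketch with made-up feature scales):

import numpy as np

rng = np.random.default_rng(0)
scales, shifts = np.array([1.0, 50.0, 0.01]), np.array([0.0, 10.0, -2.0])
X_train = rng.normal(size=(1000, 3)) * scales + shifts   # features on very different scales
X_test = rng.normal(size=(100, 3)) * scales + shifts

mean = X_train.mean(axis=0)          # statistics come from the training set only
std = X_train.std(axis=0) + 1e-8

X_train_norm = (X_train - mean) / std
X_test_norm = (X_test - mean) / std  # the test data reuses the training statistics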

51 Normalizing intermediate layers

Slide 54

Here we talk about normalization within the NN.

52 Batch normalization

Slide 55

53 Batch normalization

Slide 56

How does batch normalization work?

Batch Normalization (BatchNorm) is a technique used in neural networks to improve the training stability and speed by normalizing the inputs of each layer. It was introduced by Sergey Ioffe and Christian Szegedy in their paper “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.”

Here’s a high-level overview of how Batch Normalization works:

53.1 Steps of Batch Normalization:

  1. Normalization:

    • For each mini-batch during training, normalize the input by subtracting the mean and dividing by the standard deviation. The normalization is applied independently to each feature (dimension) in the input.

    \(\hat{x}^{(k)} = \frac{x^{(k)} - \mu}{\sqrt{\sigma^2 + \epsilon}}\)

    Where:

    • \(\hat{x}^{(k)}\) is the normalized output for the k-th feature.
    • \(x^{(k)}\) is the input for the k-th feature.
    • \(\mu\) is the mean of the mini-batch.
    • \(\sigma^2\) is the variance of the mini-batch.
    • \(\epsilon\) is a small constant added for numerical stability.
  2. Scale and Shift:

    • Introduce learnable parameters (scale and shift) for each feature to allow the model to adapt during training.

    \(y^{(k)} = \gamma \hat{x}^{(k)} + \beta\)

    Where:

    • \(y^{(k)}\) is the final output for the k-th feature.
    • \(\gamma\) is a learnable scale parameter.
    • \(\beta\) is a learnable shift parameter.
  3. Training and Inference:

    • During training, the mean and variance are computed for each mini-batch and used for normalization.
    • During inference, running averages of mean and variance from the training phase are typically used for normalization to ensure consistency.

53.2 Benefits of Batch Normalization:

  1. Improved Training Stability:
    • Helps mitigate the internal covariate shift problem, leading to more stable and faster convergence during training.
  2. Reduced Sensitivity to Initialization:
    • Reduces the sensitivity of the model to the choice of initial weights.
  3. Allows Higher Learning Rates:
    • Enables the use of higher learning rates, which can accelerate training.
  4. Acts as Regularization:
    • Introduces a slight regularization effect, reducing the need for other regularization techniques.
  5. Applicability to Various Architectures:
    • Can be applied to various types of neural network architectures, including fully connected layers, convolutional layers, and recurrent layers.

Batch Normalization has become a standard component in many deep learning architectures due to its effectiveness in improving training stability and convergence speed.
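
A minimal NumPy sketch of the training-time forward pass described above:

import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x has shape (batch_size, num_features); statistics are taken over the batch dimension
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize each feature
    return gamma * x_hat + beta             # learnable scale and shift

x = np.random.randn(32, 4) * 5 + 3          # features far from zero-mean, unit-variance
gamma, beta = np.ones(4), np.zeros(4)
y = batchnorm_forward(x, gamma, beta)
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))   # roughly 0 and 1 per feature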

54 Batch normalization — The algorithm

Slide 57

55 How does batch normalization help optimization?

Slide 58

Some layers push their outputs in one direction, and the following layers then have to keep adapting to that. With batch norm we center the data so that all layers train around these centered inputs.

56 How does batch normalization help optimization?

Slide 59

57 Benefits of Batch normalization

Slide 60

  1. Trains faster, because all layers train similarly quickly
  2. Allows you to use high learning rates, because you won't get vanishing or exploding gradients when everything is roughly zero-mean with unit variance
  3. Makes weights easier to initialize, because you know everything will be roughly zero-mean with unit variance
  4. Makes activation functions behave sensibly, because all the activation functions have something special happening around zero
  5. Adds noise that comes from estimating the batch statistics; any noise that helps generalization acts as regularization. Here the noise reduces overfitting, so it acts as regularization and your model does not overfit as easily.
    Put simply: noise disrupts patterns

58 Quiz

Slide 61

There is no answer here

59 Batch normalization at test time

Slide 62

The important thing is that at test time you don't have batches (you don't want anything that depends on how you construct the batch), because otherwise taking a different batch would give a different result, which would not be reproducible.

So what we usually do is keep a moving average of the mean and variance during training, and then at test time you plug those in.

Basically, you extract the mean and the variance from the training data and use them on the test data.
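
In PyTorch this is exactly what the running statistics of a BatchNorm layer do; a short sketch:

import torch
import torch.nn as nn

bn = nn.BatchNorm1d(4)                  # keeps running_mean / running_var as buffers

bn.train()
for _ in range(100):                    # training: batch statistics are used, and the running
    x = torch.randn(32, 4) * 3 + 1      # averages are updated on every forward pass
    _ = bn(x)

print(bn.running_mean, bn.running_var)  # close to the true mean (1) and variance (9)

bn.eval()                               # test time: the stored running statistics are used,
y = bn(torch.randn(8, 4) * 3 + 1)       # so the output no longer depends on the batch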

60 Disadvantages of batch normalization

Slide 63

  • It requires you to use large batch sizes, because otherwise the estimates of the mean and variance are too noisy
  • Problematic if there is a discrepancy between the training and test data
  • The loss you get for training sample A now depends on which other training samples are present in the batch, via this normalization with the batch mean and variance
  • One disadvantage is that it is a common source of bugs: for example, if you keep estimating the mean and variance on the test data, the results are not reproducible because they depend on the batch itself

61 Disadvantages of batch normalization with distributed training

Slide 64

  • Different values across GPUs
  • If your batch size is small on a single GPU but you have, say, 10 GPUs running, the effective batch size across the 10 GPUs is bigger; it would be a bad idea to compute 10 very noisy estimates of the mean and variance, and instead you should compute them across the GPUs

62 Layer normalization

Slide 65

Better explained here

Here the mean and variance are not computed across the batch but across all channels and spatial dimensions.

So now the statistics are independent of the batch size, because they depend only on the feature dimensions; see the example below:

Example Layer normalization

Layer Normalization is a normalization technique similar to Batch Normalization but operates on a per-sample basis rather than per-minibatch. It normalizes the inputs of a layer across the features (dimensions) for each individual sample. Here’s an example of how Layer Normalization is typically applied:

import torch
import torch.nn as nn

# Assuming input has shape (batch_size, num_features)
input_data = torch.randn(32, 64)

# Layer Normalization
layer_norm = nn.LayerNorm(normalized_shape=64)
output = layer_norm(input_data)

# Display input and output shapes
print("Input shape:", input_data.shape)
print("Output shape:", output.shape)
Input shape: torch.Size([32, 64])
Output shape: torch.Size([32, 64])

In this example:

  • input_data is a random tensor with shape (32, 64), representing a batch of 32 samples, each with 64 features.

  • nn.LayerNorm is the Layer Normalization layer provided by PyTorch. The normalized_shape parameter specifies the number of features in the input tensor.

  • The output tensor is the result of applying Layer Normalization to the input data.

Layer Normalization normalizes the values along the feature dimension independently for each sample. This means that each feature in a sample is normalized based on its mean and standard deviation across the entire sample, rather than across a batch as in Batch Normalization.

Layer Normalization is useful when the batch size is small or when working with sequences of varying lengths, as it normalizes each sample independently. It has been widely used in natural language processing tasks and recurrent neural networks.

Basically, the mean and std are computed across the feature dimension (over the columns), not across the inputs in the batch. Each sample (each row) is then normalized with its own mean and std, so that it becomes zero-mean with unit variance.

This is great for RNNs, or other settings that require small batch sizes.

Here the same operation happens at training and test time.

So instead of normalizing across data samples, we normalize across the channel (and spatial) dimensions; e.g., if the input is an image, each color should end up with roughly the same spread across the whole image.

63 Layer normalization (LN)

Slide 66

64 Instance normalization (IN)

Slide 67

With instance normalization you do layer normalization, but per channel and per training example.

So the network should be agnostic to the contrast of the original image and to the contrast within each channel.

Not used that often.

Here we compute the mean and variance per sample and per channel, not across the batch.

How Instance Normalization works

Instance Normalization is a normalization technique similar to Batch Normalization and Layer Normalization but operates on a per-instance basis. It normalizes the activations of each individual sample independently. Here’s an explanation of how Instance Normalization works:

64.1 Instance Normalization Steps:

  1. Input Tensor:

    • Assume you have an input tensor \(X\) with shape \((N, C, H, W)\), where:
      • \(N\) is the batch size.
      • \(C\) is the number of channels.
      • \(H\) is the height of the feature map.
      • \(W\) is the width of the feature map.
  2. Calculate Mean and Variance:

    • For each instance (sample) in the batch, calculate the mean \(\mu\) and variance \(\sigma^2\) along each channel independently. This is done for each channel and each instance separately.

    \[\mu_c = \frac{1}{H \cdot W} \sum_{i=1}^{H} \sum_{j=1}^{W} X_{n,c,i,j}\]

    \[\sigma^2_c = \frac{1}{H \cdot W} \sum_{i=1}^{H} \sum_{j=1}^{W} (X_{n,c,i,j} - \mu_c)^2\]

  3. Normalize:

    • Normalize each channel of each instance independently using the calculated mean and standard deviation:

    \[\hat{X}_{n,c,i,j} = \frac{X_{n,c,i,j} - \mu_c}{\sqrt{\sigma^2_c + \epsilon}}\]

    Where:

    • \(\hat{X}_{n,c,i,j}\) is the normalized value.
    • \(X_{n,c,i,j}\) is the original input value.
    • \(\mu_c\) is the mean of the channel \(c\) for the instance \(n\).
    • \(\sigma^2_c\) is the variance of the channel \(c\) for the instance \(n\).
    • \(\epsilon\) is a small constant added for numerical stability.
  4. Scale and Shift:

    • Introduce learnable scale (\(\gamma\)) and shift (\(\beta\)) parameters for each channel:

    \[Y_{n,c,i,j} = \gamma_c \hat{X}_{n,c,i,j} + \beta_c\]

    Where:

    • \(Y_{n,c,i,j}\) is the final normalized output.
    • \(\gamma_c\) is a learnable scale parameter for channel \(c\).
    • \(\beta_c\) is a learnable shift parameter for channel \(c\).

64.2 Benefits of Instance Normalization:

  • Normalization Across Samples:
    • Instance Normalization normalizes each instance independently, making it suitable for scenarios where batch sizes may vary or are small.
  • Reduces Covariate Shift:
    • Similar to Batch Normalization, Instance Normalization helps reduce internal covariate shift, leading to more stable training.
  • Applicability to Style Transfer:
    • Instance Normalization has found applications in style transfer tasks in computer vision.

Instance Normalization is often used in computer vision tasks, especially in scenarios where the batch size may be small or when normalization across instances is desired.
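
A short PyTorch sketch of instance normalization on an image-shaped tensor:

import torch
import torch.nn as nn

x = torch.randn(8, 3, 32, 32)                   # (batch, channels, height, width)

inst_norm = nn.InstanceNorm2d(num_features=3, affine=True)   # learnable gamma/beta per channel
y = inst_norm(x)

# Each (sample, channel) slice is normalized with its own mean/std over H and W:
print(y[0, 0].mean().item(), y[0, 0].std().item())    # roughly 0 and 1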

65 Group normalization (GN)

Slide 68

We are going to group certain channels together, so we are not normalizing per channel but per group of channels, for instance groups of a few channels each (see the sketch after this list).

  1. If you have only one group, you recover layer normalization, because layer norm normalizes across all channels rather than per channel

  2. If every channel is its own group (the number of groups equals the number of channels), you recover instance normalization, meaning you compute the mean per sample and per channel

  • In grouped convolutions, you basically separate the hidden channels into groups, which makes the computation cheaper

  • This works better than batch normalization for small batch sizes (<32)
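
A short PyTorch sketch, including the two extreme cases mentioned above:

import torch
import torch.nn as nn

x = torch.randn(4, 6, 16, 16)                      # (batch, channels, height, width)

gn = nn.GroupNorm(num_groups=3, num_channels=6)    # groups of 2 channels each
y = gn(x)

ln_like = nn.GroupNorm(num_groups=1, num_channels=6)   # one group          -> layer-norm behaviour
in_like = nn.GroupNorm(num_groups=6, num_channels=6)   # one channel/group  -> instance-norm behaviour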

66 A comparison of different normalizations

Slide 69

67 Weight normalization

Slide 70

You can think of the weights as a vector; it has a magnitude and a direction. With weight normalization, g is a learnable parameter that tells you how far you want to go in that direction.
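
A tiny sketch of the reparameterization \(\mathbf{w} = g\,\frac{\mathbf{v}}{\|\mathbf{v}\|}\):

import torch

v = torch.randn(10, requires_grad=True)    # direction parameters
g = torch.tensor(2.0, requires_grad=True)  # learnable magnitude

w = g * v / v.norm()                       # weight vector actually used by the layer
print(w.norm())                            # equals g: length and direction are decoupled

PyTorch also offers torch.nn.utils.weight_norm to apply this reparameterization to an existing layer.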

Slide 72

Slide 73

68 Learning rate

Slide 74

69 Convergence

Slide 75

To achieve convergence you need these two conditions (written out below):

  1. The sum of the learning rates over time should be infinite, to allow for exploration
  2. The sum of the squared learning rates should be finite, so that the updates settle down and converge
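
In symbols, with learning rates \(\eta_t\), these are the usual Robbins–Monro conditions:

\[\sum_{t=1}^{\infty} \eta_t = \infty \qquad \text{and} \qquad \sum_{t=1}^{\infty} \eta_t^2 < \infty\]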

70 Learning rate schedules

Slide 76

You can also use a learning-rate warmup, where you start by increasing the learning rate linearly.
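
A simple sketch of such a schedule (linear warmup followed by a linear decay; the warmup length, peak learning rate, and total steps are arbitrary):

def lr_at_step(step, warmup_steps=1000, peak_lr=1e-3, total_steps=100_000):
    if step < warmup_steps:
        return peak_lr * step / warmup_steps                 # linear ramp-up from 0 to peak_lr
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * max(0.0, 1.0 - progress)                # then decay back towards 0

print(lr_at_step(500), lr_at_step(1000), lr_at_step(50_000))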

71 In practice

Slide 77

72 Quiz

Slide 78

We do all of the above:

  • If the loss explodes, it is good to check the individual values of the gradients
  • Or check whether the batch size was so small that a few odd samples could have affected it
  • Or check whether the learning rate was so high that, even with a reasonable batch, the update was too large and the network ended up in a completely broken spot

73 Dropout rate

Slide 79

74 Batch size

Slide 80

75 Designing CNNs to become even better. (Don’t try this at home)

Slide 81

Architecture, which model do you use?

76 Number of layers and neurons

Slide 82

Progress is not only on the architecture side. For example, if you want to develop better algorithms for a neural network that can distinguish cats from dogs, there are multiple ways:

  1. Come up with a new architecture
  2. Come up with better gradients and better weights for the NN

77 Babysitting Deep Nets

Slide 83

For classifying 8 classes, the initial loss should be about \(-\log(1/8)\), and you check whether that is true; if the model performs worse than random guessing, something is wrong.
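
A quick sanity check of that number (using the natural log):

import math

num_classes = 8
expected_initial_loss = -math.log(1 / num_classes)   # = log(8)
print(expected_initial_loss)                          # ≈ 2.079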

78 Logging tools

Slide 84

79 Babysitting Deep Nets

Slide 85

Link1

80 Reading material

Slide 86