Convolutional Neural Networks

Description of this Post
Author
Published

November 20, 2023

1 Optimizing neural networks

Slide 2

2 Multi-layer perceptrons (Recap)

Slide 3

3 Multi-layer perceptrons (Recap)

Slide 4

Prior knowledge is something we know about the data; we want to bring this into the design of the NNs.

4 Consider an image

Slide 5

5 Hubel and Wiesel: Nobel Prize for Physiology or Medicine in 1981

Slide 6

Here, when edges are shown, there is electrical activity: the neurons fire.

6 Filters, yes. How about learnable filters

Slide 7

Canny and Gabor filters all try to find edges, which can then be used for recognition purposes.

7 Filters, yes. How about learnable filters

Slide 8

8 The convolution operation

Slide 9

9 The convolution operation

Slide 10



Here f*g is the convolution (the red line), shown for the 1D case.

Here g is the kernel.
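
As a small illustration (my own sketch, not from the slides), the 1D discrete convolution can be computed with NumPy; the signal and kernel values below are arbitrary examples:

import numpy as np

# A 1D signal f and a small kernel g (arbitrary example values)
f = np.array([0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0])
g = np.array([1.0, 0.0, -1.0])  # a simple difference kernel

# np.convolve flips the kernel and slides it over the signal,
# which is exactly the discrete convolution (f * g)
conv = np.convolve(f, g, mode="valid")
print(conv)  # one value per position where the kernel fully fits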

10 Convolution for 2D images

Slide 11

Now our kernel is 2D

11 Convolution for 2D images

Slide 12

12 Examples

Slide 13

13 Examples

Slide 14

Sobel fires for vertical edges
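A minimal NumPy sketch of this (the toy image and the loop are my own illustration): applying the vertical-edge Sobel kernel to a tiny image with a vertical step edge gives large responses only where the edge is.

import numpy as np

# Sobel kernel that responds to vertical edges (horizontal intensity change)
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

# Tiny 6x6 image with a vertical step edge in the middle (left dark, right bright)
img = np.zeros((6, 6))
img[:, 3:] = 1.0

# Valid cross-correlation (what deep-learning "convolution" layers actually compute)
H, W = img.shape
k = 3
out = np.zeros((H - k + 1, W - k + 1))
for i in range(H - k + 1):
    for j in range(W - k + 1):
        out[i, j] = np.sum(img[i:i+k, j:j+k] * sobel_x)

print(out)  # large values only in the columns where the edge sits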

14 Quiz

Slide 15

  1. It will emphasize edges

If you take a CNN and set the weights to be uniform, you would not get these edge detectors and the NN would not train well.

This is like adding prior knowledge, because we know edges are super important for detecting whether the image shows a cat or a dog.

15 The motivation of convolutions

Slide 16

Local connectivity: for instance, if you want to detect edges, you don't need to look at the whole image. And because you share the parameters, the weights are tied and you are more efficient.

16 The motivation of convolutions

Slide 17

This saves quite a few connections, and therefore parameters. It is like analysing a 16x16 image, but instead we use filters to detect edges, and those edges become building blocks that are far cheaper than processing the whole image at once.

Here the kernel would be of width 3.
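
A rough back-of-the-envelope count (my own illustrative numbers: a 16x16 input, one hidden unit per pixel for the fully connected case, and a single shared 3x3 kernel for the convolutional case):

# Parameter-count comparison (illustrative numbers, not from the slides)
H = W = 16            # input image size
hidden = H * W        # a fully connected hidden layer with one unit per pixel
k = 3                 # convolution kernel width/height

fc_params = (H * W) * hidden   # every pixel connects to every hidden unit
conv_params = k * k            # one shared 3x3 kernel, reused at every position

print(fc_params)   # 65536
print(conv_params) # 9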

17 The motivation of convolutions

Slide 18

So here, on the left, the NN has a receptive field of size 3, because that is how far a neuron can look: it equals the kernel size. But the receptive field gradually grows with every layer, which gives you a hierarchical structure (see the sketch below).

  • For instance, neurons at layer 50 can see the whole image and can put it into context. This is how you go from local to global.
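
A small sketch of how the receptive field grows (assuming 3x3 kernels with stride 1, where each extra layer adds kernel_size - 1 pixels):

# Receptive field of stacked 3x3, stride-1 convolutions
def receptive_field(num_layers, kernel_size=3):
    return 1 + num_layers * (kernel_size - 1)

for layers in [1, 2, 5, 50]:
    print(layers, receptive_field(layers))
# 1 layer sees 3 pixels, 2 layers see 5, ..., 50 layers see 101:
# deep in the network a single neuron effectively "sees" the whole image.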

18 The motivation of convolutions

Slide 19

19 The motivation of convolutions

Slide 20

If the input shifts, the output shifts in the same way; this is not the case for a fully connected NN.
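
A minimal NumPy sketch of this equivariance (my own example): shifting the input shifts the convolution output by exactly the same amount.

import numpy as np

kernel = np.array([1.0, -1.0])           # simple 1D difference kernel
x = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 0.0])

shift = 2
x_shifted = np.roll(x, shift)            # shift the input to the right

y = np.convolve(x, kernel, mode="full")
y_shifted = np.convolve(x_shifted, kernel, mode="full")

# Shifting the output of the original input reproduces the output of the shifted input
# (true here because nothing wraps around past the nonzero part of the signal).
print(np.allclose(np.roll(y, shift), y_shifted))  # True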

20 A simple convolution: saves space!

Slide 21

The bigger the filter the more zeros we will have

21 Convolution vs Pooling in 2D


22 The pooling operations

Slide 22

Pooling functions are another way to incorporate prior knowledge. They aggregate the activations, either locally or globally.

You can max-pool or average-pool the activations in some rectangular neighborhood. This reduces the spatial size, improves efficiency, and also increases robustness.

It also incorporates invariance to translations, because it no longer matters whether the 6 sits in that corner or elsewhere.

At the last step you could do global average pooling and just have one vector out; this vector is trained to represent the whole image. On top of it you could apply a fully connected layer if you care about classification.

Min and max are (sub)differentiable. If you instead used an argmax, it would not be differentiable.

Pooling operations, especially global ones, also make you independent of the size of the input image you feed into your NN.
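
A minimal NumPy sketch of local (2x2, stride 2) max pooling, using an arbitrary 4x4 activation map of my own:

import numpy as np

# 4x4 activation map (values are arbitrary)
a = np.array([[1, 3, 2, 0],
              [4, 6, 1, 1],
              [0, 2, 5, 7],
              [1, 1, 3, 2]], dtype=float)

# 2x2 max pooling with stride 2: split into 2x2 blocks and take the max of each block
pooled = a.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)
# [[6. 2.]
#  [2. 7.]]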

23 LeNet-5

Slide 23

Here you do not have global pooling, so an image of 29x29 would not have worked.

The hidden dimensionalities are called channels, and pooling is applied to every channel. So pooling operations do not change the number of channels, only the spatial extent.

So each layer has channels (all the squares drawn in a layer); pooling is applied to every channel, and per channel it reduces the feature map to a smaller width and height.

23.1 More

LeNet-5, a convolutional neural network architecture proposed by Yann LeCun and his collaborators in 1998, does not use global average pooling in its original design. LeNet-5 primarily relies on subsampling layers (pooling layers) and fully connected layers.

The typical structure of LeNet-5 consists of alternating convolutional layers and subsampling (pooling) layers, followed by fully connected layers. The subsampling layers in LeNet-5 perform down-sampling; in the original design this is an average-based subsampling rather than max pooling. Global average pooling was not a commonly used technique at the time LeNet-5 was introduced.

Global average pooling became more prominent in later CNN architectures, such as Google’s Inception models and the popular ResNet architectures.

23.2 Global Pooling

Global pooling (or global average pooling) is a technique used in convolutional neural networks (CNNs) to reduce the spatial dimensions of a feature map to a single value or a vector. It involves taking the average (or maximum) value across all spatial locations of each feature map, resulting in a global representation.

Here’s an example of global average pooling with Python using NumPy:

import numpy as np

# Assume a channels-first feature map: 2 channels, each of spatial size 3x3
feature_map = np.array([
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[10, 11, 12], [13, 14, 15], [16, 17, 18]]
])

# Apply global average pooling over the spatial dimensions (axes 1 and 2)
global_avg_pooled = np.mean(feature_map, axis=(1, 2))

# Print the original feature map and the result after global average pooling
print("Original Feature Map:")
print(feature_map)
print("\nGlobal Average Pooled Result:")
print(global_avg_pooled)

In this example, feature_map holds 2 channels, each a 3x3 spatial map (shape (2, 3, 3), channels first). The np.mean function computes the average along the spatial dimensions (axes 1 and 2). The resulting global_avg_pooled is a vector with one global average-pooled value per channel.

The output should look like this:

Original Feature Map:
[[[ 1  2  3]
  [ 4  5  6]
  [ 7  8  9]]

 [[10 11 12]
  [13 14 15]
  [16 17 18]]]

Global Average Pooled Result:
[ 5. 14.]

In this case, the global average pooling operation has computed the average value for each channel across all spatial locations, resulting in a global representation for each channel. This global representation is often used as a compact and informative input to subsequent layers or for making predictions in the network.

24 AlexNet: similar principles, but some extra engineering.

Slide 24


Slide 25

Weight sharing in convolutional neural networks (CNNs) refers to the practice of using the same set of learnable parameters (weights and biases) for multiple units or neurons in a layer. In other words, the weights are shared across different spatial locations in the input.

The key idea behind weight sharing is to enforce translation invariance in the features learned by the convolutional layers. In an image, certain features (e.g., edges, textures) are meaningful regardless of their specific location. By using shared weights, the network can learn to detect these features at different spatial positions, leading to a more robust and generalizable representation.

Here’s a brief explanation of weight sharing in CNNs:

  1. Convolutional Operation:
    • In a convolutional layer, a set of filters (also known as kernels) is applied to the input image or feature map.
    • Each filter is characterized by a set of learnable weights and biases.
  2. Spatial Weight Sharing:
    • Instead of having unique weights for each spatial location in the input, weight sharing involves using the same set of weights across different spatial locations.
    • For example, if a filter detects a certain feature (e.g., an edge) at one location, the same filter with the same weights can be used to detect the same feature at a different location.
  3. Benefits:
    • Reduces the number of learnable parameters in the network, making it more computationally efficient.
    • Encourages the learning of spatially invariant features, enhancing the network’s ability to recognize patterns across different locations.
  4. Translation Invariance:
    • Weight sharing helps the network achieve translation invariance, meaning that it can recognize features regardless of their position in the input.

24.1 CNN and Weight Sharing

CNNs are primarily used for image classification and segmentation, and they work by finding similar patterns throughout the input. These patterns can be found by sliding a filter with shared weights across the input. The shared-weights concept allows the network to learn the same pattern regardless of its position in the input.

25 What shape should the

Slide 28


26 3D Activations

Slide 30

Instead of calling them RGB channels, we just call them channels.

27 3D Activations

Slide 32

Now the activations have width and height and also depth.

The depth is governed by the hidden dimensionality of the NN.

28 3D Activations

Slide 34

29 3D Activations

Slide 35

Here this is a convolution kernel with kernel size 5x5.

Now our neuron has 5x5 weights, times 3 because the input has 3 channels.

So each neuron has a 3D filter.

30 3D Activations

Slide 38


30.1 Example: Not-moving Filter


Here we have:

  • Input layer: 3x32x32
  • Kernel (size 5): 3x5x5

If we do not slide the filter, we end up with:

  • Pre-output: 3x1x1 (one scalar value per channel, three in total)

Now we sum over the three channels and thus get:

  • Output: 1x1x1

30.2 Example: Sliding Filter

Source: link

Imagine now that we slide this 3D filter over the whole input (sketched in code below); then we end up with:

  • Pre-output: 3x28x28, where 28 = (width - kernel_size + 1) = (32 - 5 + 1)

Now we sum over all three channels element-wise and end up with:

  • Output: 1x28x28
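
A small NumPy sketch of this sliding 3D filter (random values, shapes taken from the example above):

import numpy as np

# Shapes from the example above: input 3x32x32, one 3x5x5 filter
x = np.random.randn(3, 32, 32)    # channels x height x width
w = np.random.randn(3, 5, 5)      # one filter, matching the 3 input channels

C, H, W = x.shape
_, k, _ = w.shape
out = np.zeros((H - k + 1, W - k + 1))

# Slide the 3D filter over all spatial positions; at each position,
# multiply element-wise over all channels and sum everything up.
for i in range(H - k + 1):
    for j in range(W - k + 1):
        out[i, j] = np.sum(x[:, i:i+k, j:j+k] * w)

print(out.shape)  # (28, 28), i.e. (32 - 5 + 1) in each spatial dimension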

31 3D Activations

Slide 42


32 3D Activations

Slide 44

33 3D Activations

Slide 45

If you now slide the filter, with as many neurons (filters) as the output depth, we get:

  • Output: depth x \(l\) x \(l\)

where \(l\) = width - kernel_size + 1

34 3D Activations

Slide 47



35 3D Activations

Slide 50


36 3D Activations

Slide 52

37 3D Activations

Slide 53

38 3D Activations

Slide 54

39 3D Activations

Slide 55

40 3D Activations

Slide 56

41 Putting it together

Slide 57

42 Putting it together

Slide 58

43 Putting it together

Slide 59

44 Putting it together

Slide 60

45 Putting it together

Slide 61

46 Putting it together

Slide 62

47 Putting it together

Slide 63

48 Putting it together

Slide 64

Here all these coloured layers in the cube are filters, which are neurons. All these neurons act within a single hidden layer.

49 Convolution: Stride

Slide 65

50 Convolution: Stride

Slide 66

51 Convolution: Stride

Slide 67

52 Convolution: Stride

Slide 68

53 Convolution: Stride

Slide 69

54 Convolution: Stride

Slide 70

55 Convolution: Stride

Slide 71

Each time, we sum across channels, because if you want to detect something you want to use all the colors, i.e., all incoming channels.

So each convolution sums over all channels (like RGB), because you don't want one filter that only looks at blue, another that only looks at red, and so on.

56 Convolution: Stride

Slide 72

Here, in the next slide, we see that the stride is the number of squares the kernel moves each step; e.g., with stride = 2 the filter slides two squares at a time.
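
A tiny sketch of which positions the kernel visits for different strides (a length-8 1D input and a kernel of size 3 are my own example):

# Positions a 1D kernel of size 3 visits on a length-8 input, for different strides
length, k = 8, 3
for stride in [1, 2, 3]:
    starts = list(range(0, length - k + 1, stride))
    print(stride, starts, "->", len(starts), "output values")
# stride 1 -> starts [0, 1, 2, 3, 4, 5], 6 outputs
# stride 2 -> starts [0, 2, 4], 3 outputs
# stride 3 -> starts [0, 3], 2 outputs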

57 Convolution: Stride

Slide 73

58 Convolution: Stride

Slide 74

59 Convolution: Stride

Slide 75

60 Convolution: Padding

Slide 76

61 Convolution: Padding

Slide 77

62 Convolution: Padding

Slide 78

63 Convolution: Padding

Slide 79


64 Convolution:

Slide 81

W_out is what you get in the output layer of the slide (so for one filter).
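
Assuming the slide uses the standard output-size formula W_out = (W - F + 2P)/S + 1 (W = input width, F = kernel size, P = padding, S = stride), a small helper makes the effect of padding and stride concrete:

# Standard output-width formula for a convolution (a sketch; assumed to match the slide)
def conv_output_width(W, F, P=0, S=1):
    return (W - F + 2 * P) // S + 1

print(conv_output_width(32, 5))            # 28  (the 32x32 example, no padding)
print(conv_output_width(32, 5, P=2))       # 32  ("same" padding for a 5x5 kernel)
print(conv_output_width(32, 5, P=0, S=2))  # 14  (a larger stride shrinks the output faster)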

65 1x1 Convolution

Slide 82

It looks at all the values along the depth: the RGB channels, or, more importantly in deep NNs, the different hidden channels, and just mixes that information together.

66 1x1 Convolution: a computationally cheap method

Slide 83

Here, in the 5x5x32 case, the convolution is done over the 192 channels, 32 times (because we have 32 filters as the output depth), so the filter is slid a lot.

In the bottom case we first reduce the number of channels, so we reduce the computations.
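
A rough multiply count for this bottleneck idea (the 192 input channels and 32 filters come from the slide; the 28x28 spatial size and the 16-channel bottleneck are my own assumptions):

# Rough multiply counts for the bottleneck idea (28x28 and the 16-channel bottleneck
# are assumptions; 192 input channels and 32 output filters come from the slide).
H = W = 28
C_in, C_out, k = 192, 32, 5

direct = H * W * C_out * (k * k * C_in)       # 5x5 conv straight on 192 channels
print(f"direct 5x5:   {direct:,}")            # ~120 million multiplies

C_mid = 16                                    # assumed 1x1 bottleneck width
reduce_ = H * W * C_mid * (1 * 1 * C_in)      # 1x1 conv down to 16 channels
conv5 = H * W * C_out * (k * k * C_mid)       # 5x5 conv on the reduced tensor
print(f"1x1 then 5x5: {reduce_ + conv5:,}")   # ~12 million multiplies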

67 Quiz:

Slide 84

  • A fully connected layer connects everything in the layer. The 1x1 convolution also takes the full input depth, meaning it still looks at all 192 channels, but it reduces the number of connections compared to a fully connected layer and only mixes local information.

More differences:

A 1x1 convolutional layer and a fully-connected layer (dense layer) are similar in that they both perform a linear transformation on the input data, but there are key differences between the two.

67.1 1x1 Convolutional Layer:

  1. Spatial Information:
    • A 1x1 convolutional layer operates on spatial information in the input tensor.
    • It applies convolutional filters with a size of 1x1, which means it processes information at individual spatial locations.
    • Useful for capturing relationships between channels but does not capture spatial patterns.
  2. Parameter Sharing:
    • Utilizes parameter sharing, similar to larger convolutional layers.
    • Each element in the output is the result of a weighted sum of its input elements, considering all channels.
  3. Output Dimensions:
    • The output dimensions depend on the number of 1x1 filters used.

67.2 Fully-Connected Layer:

  1. Flattening:
    • A fully-connected layer operates on the flattened version of the input.
    • It considers all elements in the input tensor as individual input features.
  2. Parameter Sharing:
    • Each neuron in a fully-connected layer has its set of weights for every input feature.
    • No parameter sharing between different neurons.
  3. Output Dimensions:
    • The output dimensions are determined by the number of neurons in the layer.

67.3 Differences:

  1. Spatial vs. Global Information:
    • 1x1 convolutional layers capture spatial information within each channel.
    • Fully-connected layers operate on global information, considering all elements as individual features.
  2. Parameter Sharing:
    • 1x1 convolutions use parameter sharing, making them more efficient for processing spatially correlated features.
    • Fully-connected layers lack parameter sharing, resulting in a larger number of parameters.
  3. Computational Efficiency:
    • 1x1 convolutions are computationally more efficient than fully-connected layers, especially in scenarios with spatially structured data.
  4. Usage in Convolutional Networks:
    • 1x1 convolutions are commonly used in convolutional neural networks (CNNs) to adjust the number of channels and perform feature transformations.
    • Fully-connected layers are typically used in the final layers of a neural network for classification.
  • You don't necessarily lose information. We do not want to do it at the beginning, because mixing red, blue, and green per pixel does not achieve much; it makes more sense later, when you have edges on top of edges and you mix that information.
  • It is also not a good idea to apply 1x1 convolutions when you do not want translation invariance.
  • Every 1x1 convolution is strictly local; every neuron has a receptive field, so there is actually spatial information there (see the sketch below).
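
A minimal PyTorch sketch of the relationship (the shapes are my own example): a 1x1 convolution is the same linear map applied independently at every spatial position, i.e. a dense layer over the channel dimension with weights shared across locations.

import torch
import torch.nn as nn

x = torch.randn(1, 192, 8, 8)                 # batch x channels x height x width

conv1x1 = nn.Conv2d(192, 64, kernel_size=1)   # one weight matrix, shared at every pixel
linear = nn.Linear(192, 64)                   # a dense layer over the channel dimension

# Give the linear layer the same weights as the 1x1 conv, then apply it per pixel:
# the results match, so a 1x1 conv is a shared per-pixel linear layer.
linear.weight.data = conv1x1.weight.data.view(64, 192)
linear.bias.data = conv1x1.bias.data

out_conv = conv1x1(x)                                          # (1, 64, 8, 8)
out_lin = linear(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)    # same shape
print(torch.allclose(out_conv, out_lin, atol=1e-5))            # True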

68 Dilated Convolutions

Slide 85

This is very useful if you need to deal with a huge image but don't want huge hidden activations.

If you do this, you can quickly downscale the image without ignoring too many things.

Also note that dilation is less expensive: a 5x5 kernel is more expensive than a 3x3 one, so you can learn a 3x3 kernel with holes in between, which is a more efficient way to cover a large spatial footprint.
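
A quick sketch of the footprint argument (using the relation effective kernel size = dilation * (k - 1) + 1):

# A 3x3 kernel with dilation 2 reaches positions spaced 2 apart, so it covers the same
# 5x5 footprint as a dense 5x5 kernel while using only 9 weights instead of 25.
k, dilation = 3, 2
effective = dilation * (k - 1) + 1
print(effective)                                   # 5: the "effective" kernel size
print(k * k, "weights vs", effective * effective)  # 9 weights vs 25 for a dense kernel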

69 Pooling

Slide 86

70 Pooling

Slide 87

71 Max Pooling

Slide 88

72 Getting rid of pooling

Slide 89

Instead of using pooling, you can use a larger stride (i.e., how many squares we slide), which we talked about earlier.

In Transformers, pooling is not used anymore either.

In CNNs it is still used.

73 Example ConvNet

Slide 93

Every filter is one row here

74 Quiz

Slide 94

If you choose the kernel size to be the same as the input image, then it is effectively a fully connected layer (see the sketch below).

Mathematically: 1.

Implementation-wise: 2.
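
A small PyTorch sketch of this (the input size and number of classes are my own example): a convolution whose kernel covers the entire input produces one value per filter, just like a fully connected layer on the flattened input.

import torch
import torch.nn as nn

x = torch.randn(1, 3, 16, 16)                 # a 16x16 RGB input (example sizes)

# A convolution whose kernel covers the whole image produces a single value per filter,
# which is exactly what a fully connected layer on the flattened input would do.
conv_full = nn.Conv2d(3, 10, kernel_size=16)  # 10 filters, each of shape 3x16x16
print(conv_full(x).shape)                     # torch.Size([1, 10, 1, 1])

fc = nn.Linear(3 * 16 * 16, 10)
print(fc(x.flatten(1)).shape)                 # torch.Size([1, 10])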

75 How research gets done part 4

Slide 95

Slide 96

76 AlexNet

Slide 97

77 AlexNet

Slide 98

78 Activation function

Slide 99

Faster to train because of the simple ReLU, and the gradients do not vanish because the gradient is 1 everywhere on the positive side.

Why does the gradient not vanish with ReLU?

The vanishing gradient problem refers to the issue where the gradients of the loss function with respect to the weights become extremely small during backpropagation, making it challenging for the model to learn and update its parameters effectively. This problem is particularly associated with activation functions that squash their input into a small range, such as the sigmoid or hyperbolic tangent (tanh) functions.

ReLU (Rectified Linear Unit), on the other hand, has a non-saturating activation behavior, which means that it does not squash its input into a small range.

ReLU does not saturate in the positive region of its input. For positive input values, the gradient remains constant (1), leading to consistent and non-vanishing gradients during backpropagation.
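
A quick NumPy sketch comparing the two gradients (the input values are arbitrary):

import numpy as np

x = np.array([-10.0, -2.0, 0.5, 2.0, 10.0])

sigmoid = 1.0 / (1.0 + np.exp(-x))
sigmoid_grad = sigmoid * (1.0 - sigmoid)   # at most 0.25, and ~0 for large |x|

relu_grad = (x > 0).astype(float)          # exactly 1 for every positive input

print(sigmoid_grad)   # tiny values at the extremes -> gradients shrink when chained
print(relu_grad)      # [0. 0. 1. 1. 1.] -> no shrinking in the positive region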

79 Activation function

Slide 100

80 Training with multiple GPUs

Slide 101

81 Training with multiple GPUs

Slide 102

82 On that note: Communicating between GPUs: PyTorch

Slide 103

83 Local Response Normalization

Slide 104

84 Overlapping Pooling

Slide 105

85 Overlapping Pooling

Slide 106

86 Overall architecture

Slide 107

The max pooling produces one vector per image.

87 The Overfitting Problem

Slide 108

If I have a CNN with many more parameters than data points, will I overfit?

If your CNN has a large number of parameters (i.e., it’s a complex model) and you have a small dataset, there is an increased risk of overfitting. A complex model may have the capacity to memorize the training data, capturing noise and outliers instead of learning generalizable patterns.

Although all of these measures increase training time, they give higher performance.

88 The learned filters

Slide 109

89 Removing layer 7

Slide 110

90 Removing layer 6, 7

Slide 111

91 Removing layer 3, 4

Slide 112

We don't save that many parameters, because convolutional layers are already efficient (they are not fully connected and do not have many parameters).

92 Removing layer 3, 4, 6, 7

Slide 113

93 Translation invariance

Slide 114

Despite saying that CNNs tend to be equivariant (if you shift the input, the output should also shift), you can see that if you do that with these images, simply shifting them, the outputs vary quite a lot.

So CNNs do not learn something that is explicitly symmetric or explicitly equivariant. Equivariance may be a good prior that we build in, but that does not mean it really holds in practice.

94 Scale invariance

Slide 115

Same with scale: we said that pooling operations give us some invariance to scaling, but NNs still tend not to be very scale invariant.

95 Rotation invariance

Slide 116

96 Further reading

Slide 117

Slide 118

97 Transfer learning: carry benefits from large dataset to the small one!

Slide 119

98 UPDATE: Transfer learning

Slide 120

99 Why use Transfer Learning?

Slide 121

The answer is yes, even if you have saved the weights from an extremely good model and you have a small dataset.

100 Convnets are good in transfer learning

Slide 122

Fine-tune the whole NN,

or use the CNN as a feature extractor.

101 Solution I: Fine-tune hT using hS as initialization

Slide 123

102 Initializing hT with hS

Slide 124

ImageNet models output 1000 categories. If you want classification over 30 categories, you need to throw that classifier away and train a new one for your needs (see the sketch below).

With AlexNet, you can start removing some layers depending on how much data you have.
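
A rough PyTorch sketch of this head replacement (my own example, assuming a recent torchvision; the 30 classes follow the example above):

import torch.nn as nn
from torchvision import models

# Take an ImageNet-pretrained AlexNet, replace its 1000-way classifier with a fresh
# 30-way one, and optionally freeze the convolutional layers so only the head trains.
model = models.alexnet(weights="IMAGENET1K_V1")

for p in model.features.parameters():   # freeze the convolutional feature extractor
    p.requires_grad = False

num_features = model.classifier[-1].in_features      # 4096 in AlexNet
model.classifier[-1] = nn.Linear(num_features, 30)   # new head for 30 categories
# Fine-tuning would instead keep requires_grad = True for (some of) the lower layers.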

103 Initializing hT with hS

Slide 125

If you pretrained your NN on ImageNet and now you want to do satellite-image classification, it may be useful to fine-tune even the bottom layers.

104 How to fine-tune?

Slide 126

105 Solution II: Use hS, as a feature extractor for hT

Slide 127

106 Transfer learning benchmarks & techniques

Slide 128

107 Title

Slide 129