Attention & Transformers

Description of this Post
Author
Published

November 27, 2023

1 General Overview: Transformer Architecture


2 Background knowledge





Slide 3

3 Seq2seq models

Slide 4

4 Neural machine translation with a seq2seq model

Slide 5

5 Defining seq2seq for NMT: Encoder

Slide 6

\(h_{t-1}\) is the previous hidden state; feeding it back into \(f(\cdot)\) at every step is what makes the network recurrent.

6 Defining seq2seq for NMT: Decoder

Slide 7

7 Issue of seq2seq models

Slide 8

The problem is that the model has to compress all of the input information into a single fixed-size context vector \(c\).

8 Attention

Slide 9

9 Here, I found what you were

Slide 10

10 Attention

Slide 11

11 Formal definition of Attention

Slide 12

12 Formal definition of Attention

Slide 13

If the value \(e\) of the alignment model is high, it tells you that the input around position \(j\) and the output at position \(i\) are well aligned (they match).
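A minimal sketch of how such an alignment model and the resulting attention weights could be computed, assuming an additive (Bahdanau-style) score; all module names and dimensions below are illustrative, not the exact formulation from the slides:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    """Bahdanau-style alignment: e_ij = v^T tanh(W_s s_i + W_h h_j)."""
    def __init__(self, dec_dim, enc_dim, attn_dim):
        super().__init__()
        self.W_s = nn.Linear(dec_dim, attn_dim, bias=False)
        self.W_h = nn.Linear(enc_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, s_i, h):
        # s_i: (batch, dec_dim)       current decoder state
        # h:   (batch, src_len, enc_dim)  encoder hidden states
        e = self.v(torch.tanh(self.W_s(s_i).unsqueeze(1) + self.W_h(h)))  # (batch, src_len, 1)
        alpha = F.softmax(e.squeeze(-1), dim=-1)         # attention weights over source positions
        c = torch.bmm(alpha.unsqueeze(1), h).squeeze(1)  # context vector: weighted sum of the h_j
        return c, alpha

attn = AdditiveAttention(dec_dim=128, enc_dim=256, attn_dim=64)
c, alpha = attn(torch.randn(2, 128), torch.randn(2, 7, 256))
print(alpha.shape)  # torch.Size([2, 7]): one weight per source position
```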

13 Formal definition of Attention

Slide 14

14 Why attention?

Slide 15

15 Self-attention

Slide 16

16 Paying attention in vision

Slide 17

17 Attention is all you need

Slide 18

18 Queries, keys and values

Slide 19

19 Scaled Dot-Product Attention

Slide 20

20 Multi-head attention

Slide 21

21 Multi-head self-attention

Slide 22

22 Multi-head self-attention

Slide 23

23 Transformer encoder

Slide 24

24 Transformer decoder

Slide 25

Why do we use a mask in the decoder's self-attention?

24.1 Autoregressive Property:

Autoregressive models generate outputs one step at a time in a sequential manner. In the context of natural language processing, this means predicting the next word in a sequence given the preceding words. Autoregressive models are trained to predict the next token in the sequence based on the tokens that have already been generated.

24.2 Decoder in a Transformer:

The decoder in a transformer is responsible for generating the output sequence. It consists of multiple layers, each containing a self-attention mechanism and feedforward neural networks. The self-attention mechanism allows the model to weigh different parts of the input sequence differently when generating each token.

24.3 Cheating and the Mask:

“Cheating” refers to the undesirable situation where the model uses information from future positions in the sequence during training. During training, the model is fed the true output sequence up to the current position to calculate loss and update its parameters. If the model were allowed to attend to future positions, it might artificially inflate its performance by relying on information that it wouldn’t have during actual generation.

The mask applied in the decoder's self-attention mechanism prevents the model from accessing future information. The mask sets the attention scores for future positions to negative infinity (in practice, a very large negative number) before the softmax, so those positions receive essentially zero attention weight and the model is blocked from attending to tokens that haven't been generated yet. This ensures that the model learns to generate each token based only on the information available up to that point, in line with the autoregressive nature of the decoding process.
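As a minimal sketch (not the exact implementation from the slides), this is how such a causal mask could be applied in scaled dot-product self-attention; shapes and names are illustrative:

```python
import torch
import torch.nn.functional as F

def masked_self_attention(x):
    """Scaled dot-product self-attention with a causal (look-ahead) mask.
    x: (batch, seq_len, d_model); for simplicity Q = K = V = x."""
    d_k = x.size(-1)
    scores = x @ x.transpose(-2, -1) / d_k ** 0.5  # (batch, seq_len, seq_len)
    # Upper-triangular entries correspond to "future" positions.
    causal_mask = torch.triu(torch.ones(x.size(1), x.size(1), dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal_mask, float("-inf"))  # block attention to the future
    weights = F.softmax(scores, dim=-1)                      # future positions get ~0 weight
    return weights @ x

out = masked_self_attention(torch.randn(1, 5, 16))
print(out.shape)  # torch.Size([1, 5, 16])
```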

24.4 Example:

Consider the task of language translation. When translating a sentence from English to French, the decoder generates the translation one word at a time. If it were allowed to attend to words in the future, it might incorrectly use information from the French translation that hasn’t been generated yet. This could lead to overfitting during training and poor generalization to unseen data.

In summary, preventing “cheating” by using a mask ensures that the decoder learns to generate outputs based on the information available up to the current step, improving the model’s ability to generalize to unseen data and maintain the autoregressive property essential for sequence generation tasks.

25 The full Transformer

Slide 26

26 Title

Slide 27

27 Coding a Transformer (PyTorch): init

Slide 28

28 Coding a Transformer (PyTorch): forward pass

Slide 29

tgt: in order to predict the next word, the model can only look at the preceding words. So the target is the same as the source, except that it is shifted one position to the left; this shift is what lets the model learn to predict each next word, including the last one.
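A minimal sketch of this shift, assuming tgt is a batch of token-id sequences (the token values are made up):

```python
import torch

tgt = torch.tensor([[2, 14, 27, 81, 3]])  # e.g. <bos>, w1, w2, w3, <eos>
decoder_input = tgt[:, :-1]               # <bos>, w1, w2, w3   (what the decoder sees)
labels        = tgt[:, 1:]                # w1, w2, w3, <eos>   (shifted one to the left)
# At position t the decoder sees tokens up to t and is trained to predict labels[:, t].
```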

29 Transformer: Positional encodings

Slide 30

Attention is a permutation-invariant operation, but this is not ideal, because word order often matters; think of where 'not' appears in a sentence.

Positional encodings locate where you are in the sequence, e.g. at the beginning or at the end.

30 Transformer: Positional encodings

Slide 31

If two tokens are far apart from each other, their positional encodings should also be clearly different.

31 Coding the Positional Encodings (PyTorch)

Slide 32
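Since the slide only shows a screenshot, here is a minimal sketch of the classic sinusoidal positional encoding (following the formulas from "Attention Is All You Need"; max_len and the buffer name are illustrative):

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Adds fixed sinusoidal position information to the token embeddings."""
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)  # (max_len, 1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
        self.register_buffer("pe", pe)

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        return x + self.pe[: x.size(1)]

pe = PositionalEncoding(d_model=512)
print(pe(torch.zeros(2, 10, 512)).shape)  # torch.Size([2, 10, 512])
```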

32 Pros & Cons

Slide 33

It scales quadratically with the number of inputs: the attention matrix is \(N \times N\). Let's see an example:

The quadratic scaling of transformers with respect to the number of inputs primarily arises from the self-attention mechanism used in transformers. In self-attention, each element in the input sequence attends to all other elements, and the attention scores are computed pairwise. This leads to a quadratic dependency on the number of inputs.

Let’s consider a simple example with a sequence of length \(N\). For simplicity, let’s assume each input element has a dimension of 1 for illustration purposes.

What about other dimensions? Well, that is entirely possible, because remember that the inputs to the network are our embeddings, not the words themselves.


As in this picture, the embedding dimensions are clearly larger than 1.

  1. Original Sequence (1D): \[ x_1, x_2, x_3, \ldots, x_N \]

  2. Self-Attention Weights: For each element \(x_i\), self-attention computes a weight for all other elements \(x_j\) based on their relationships. This results in a square matrix of attention weights:

    \[ \begin{bmatrix} w_{1,1} & w_{1,2} & \ldots & w_{1,N} \\ w_{2,1} & w_{2,2} & \ldots & w_{2,N} \\ \vdots & \vdots & \ddots & \vdots \\ w_{N,1} & w_{N,2} & \ldots & w_{N,N} \\ \end{bmatrix} \]

    Each entry \(w_{i,j}\) represents the attention weight between \(x_i\) and \(x_j\).

  3. Output for Each Element: The output for each element \(x_i\) is computed as a weighted sum of all elements based on the attention weights:

    \[ \text{output}_{i} = w_{i,1} \cdot x_1 + w_{i,2} \cdot x_2 + \ldots + w_{i,N} \cdot x_N \]

    This involves \(N\) multiplications for each element.

  4. Total Complexity: For \(N\) elements, we need to compute \(N\) attention weights for each element, resulting in a total of \(N^2\) attention weights. Therefore, the overall complexity is quadratic, \(O(N^2)\), due to the pairwise comparisons.

This quadratic scaling becomes computationally expensive as the sequence length increases, leading to challenges in handling long sequences efficiently. To address this, techniques like sparse attention patterns and approximations have been proposed in research to reduce the computational cost while maintaining the benefits of self-attention.
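As a small illustration (not from the slides), the following sketch makes the \(N \times N\) attention matrix explicit; the dimensions are arbitrary:

```python
import torch
import torch.nn.functional as F

def attention_matrix(x):
    """x: (N, d), one embedding per token. Returns the (N, N) attention weight matrix."""
    scores = x @ x.T / x.size(-1) ** 0.5  # N*N pairwise dot products
    return F.softmax(scores, dim=-1)

for N in (128, 256, 512):
    w = attention_matrix(torch.randn(N, 64))
    print(N, w.shape, w.numel())  # the number of attention weights grows as N^2
```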

Slide 34

34 BERT

Slide 38

35 BERT input representation

Slide 39

36 BERT pre-training

Slide 40

37 BERT fine-tuning

Slide 41

38 BERT for feature extraction

Slide 42

With BERT we gained contextualized word embeddings

39 BERTology

Slide 43

40 GPT-{1, 2, 3, 4}

Slide 44

BERT is not a generative model; GPT is, because it relies only on the past to predict the next tokens. BERT masks a word in the middle and looks at both the left and the right context to predict it.

You don't need labels, because predicting the next word just means looking up in the corpus which word actually comes next.

41 GPT-{1, 2, 3}

Slide 45

42 GPT: In-context learning

Figure: the three settings explored for in-context learning, compared with traditional fine-tuning (not used for GPT-3).

Slide 46

The ability to perform new tasks without any gradient updates is a remarkable property of these huge models.

What is in-context learning?

In natural language processing or conversation, understanding a word or phrase in context means considering the words or sentences that precede and follow it to grasp its intended meaning. This is important because the same word can have different meanings in different situations.

In the context of machine learning, especially with language models like GPT-3, providing information “in context” often involves supplying relevant details or context so that the model can generate responses or perform tasks that take into account the broader context of the input. This is particularly important for tasks that require understanding and generating coherent and contextually appropriate language.
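For illustration, in-context learning amounts to putting the task description and a few demonstrations directly into the prompt, with no gradient updates; a hypothetical few-shot prompt in the style of the GPT-3 paper could look like this:

```python
# A hypothetical few-shot prompt: the task is specified and demonstrated purely in text;
# no parameters are updated, the model simply continues the pattern.
prompt = (
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"
    "peppermint => menthe poivrée\n"
    "cheese =>"  # the model is expected to complete with "fromage"
)
```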

43 Discuss

For models like Stable Diffusion, DALL·E, EMU Video, etc., and T5

Slide 47

Why might encoder models be preferable to decoder models?

  • Decoders are also trained with masks, but if you want to predict the next word, the representation at the last position already attends to everything that comes before it. So if that is what you care about, there is effectively no mask (you are already looking at everything that came before).

  • Nobody knows the answer for this question

  • One hypothesis is that encoders compress the information into a bottleneck, whereas large language models essentially do the jobs of encoding and decoding at the same time: the closer you get to the output, the more the representation has to return to low-level features such as correct grammar, and somewhere in the middle of these decoder models sits the summary semantics that you could reuse for vision models, but you don't know exactly where those features are. With encoders you know exactly where the summary is, because it is the bottleneck; with decoders you don't know where to take the features from.

  • A CNN is an encoder

  • A U-Net also has this encoder part and then a decoder part.

44 GPT vs BERT

Slide 48

45 Multimodal Transformer architecture: CLIP

Slide 49

46 Multimodal Transformer architecture: CLIP

Slide 50

Here we want pairs that are close but still different. That is hard, and from these hard examples the model learns new features and learns more.

Differentiating a dog from a sheep, by contrast, would be easier, and eventually you would not learn anything new from it.


47 Multimodal Transformer architecture: CLIP

Slide 51

Because CLIP uses a text encoder, like BERT, it can do zero-shot image classification.
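A conceptual sketch of this zero-shot setup (not CLIP's actual API): text_encoder and the precomputed image_features stand in for CLIP's pretrained encoders, and the prompt template is illustrative:

```python
import torch
import torch.nn.functional as F

def clip_zero_shot(image_features, class_names, text_encoder):
    """Classify an image by comparing it to text prompts; no image labels are used.
    image_features: (d,) output of the image encoder for one image.
    text_encoder: callable mapping a list of strings to (num_classes, d) embeddings."""
    prompts = [f"a photo of a {name}" for name in class_names]
    text_features = text_encoder(prompts)   # (num_classes, d)
    img = F.normalize(image_features, dim=-1)
    txt = F.normalize(text_features, dim=-1)
    logits = img @ txt.T                    # cosine similarities image vs. each prompt
    return class_names[int(logits.argmax())]

# Dummy run with a random "text encoder", just to show the shapes involved.
fake_text_encoder = lambda prompts: torch.randn(len(prompts), 512)
print(clip_zero_shot(torch.randn(512), ["dog", "cat", "sheep"], fake_text_encoder))
```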

48 CLIP: Zero-Shot Examples

Slide 52

49 CLIP: Robustness

Slide 53

  • Better because of the internet:

CLIP is pre-trained on a large dataset with diverse images and associated text from the internet. This diverse pre-training data helps the model learn features that are more transferable across different tasks and domains. In contrast, a supervised ImageNet model might be optimized for the specific categories present in ImageNet, and its features may not generalize as well to new, unseen classes.

  • Better because we can guide it using engineered prompts:

In zero-shot learning with CLIP, you can provide textual prompts to guide the model’s behavior. This allows you to specify the task or class you’re interested in, enabling the model to adapt its predictions based on the provided textual information

  • Less prone to overfitting due to having been trained on a larger dataset:

Supervised models trained on specific datasets, such as ImageNet, may be prone to overfitting to the idiosyncrasies of that dataset. CLIP, having been trained on a broader range of data, may be less prone to overfitting to specific dataset biases

  • More data more understanding of semantics:

CLIP’s strength lies in its ability to understand the semantic relationships between images and text. A larger dataset provides more examples of diverse language-image pairs, allowing the model to learn richer semantic embeddings

50 CLIP: Usage in other models

Slide 54

51 CLIP: Shortcomings

Slide 55

CLIP does not have a decoder, so it cannot generate text

52 Visual Language Model: Flamingo

Slide 56

Basically, with CLIP you give it images and it outputs labels in the form of a text prompt.

With Flamingo you give it images and text prompts, and it can now generate an output caption for a specific image.

With GPT, you give it some text and it is able to predict what comes next, because it uses a decoder.

53 Visual Language Model: Flamingo

Slide 57

The language model is frozen, but you add these gated cross-attention layers. The cross-attention is somewhat similar to the encoder-decoder structure, in that the language model can attend to the visual inputs. The pink blocks are what is being learned. The visual encoder is also kept frozen.

The Perceiver resampler lets you change (resample) the representation coming out of the visual encoder.

54 Visual Language Model: Flamingo

Slide 58

Here there is an encoder and a decoder

55 Vision Transformer

Slide 59

56 Understanding a “Figure 1”

Slide 60

This is similar to BERT; in BERT we also have positional embeddings.

Here we split the picture into patches, but we still preserve the order by keeping the index values that define each patch's position.

BERT processes information in parallel, as in the paper (see the image below), and the reason we need positional embeddings is the following:

BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based model designed for natural language processing tasks. Unlike traditional sequential models, transformers process input data in parallel, which makes them highly efficient but also means they don’t inherently understand the order of the input sequence. To address this limitation and enable transformers to capture sequential information, positional embeddings are used.

Positional embeddings are added to the input embeddings to provide information about the position of each token in a sequence. In BERT, the model processes the input tokens in parallel, and without positional embeddings, it would have no inherent understanding of the order of the tokens. Positional embeddings help the model distinguish between tokens based on their position in the sequence, allowing it to capture the sequential structure of the input.

Coming back to the ViT model, we can see that it also processes information in parallel; that is why we need positional embeddings, so the model can learn the order in which the picture was constructed from its patches.
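A minimal sketch of how a ViT-style model could turn an image into an ordered sequence of patch embeddings with learned positional embeddings (sizes follow ViT-Base, the class token is omitted for brevity; names are illustrative):

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into patches, project each patch, and add learned position embeddings."""
    def __init__(self, img_size=224, patch_size=16, in_ch=3, d_model=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # A conv with stride = kernel size extracts and projects non-overlapping patches.
        self.proj = nn.Conv2d(in_ch, d_model, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, d_model))  # learned positions

    def forward(self, x):
        # x: (batch, 3, H, W) -> (batch, num_patches, d_model)
        patches = self.proj(x).flatten(2).transpose(1, 2)
        return patches + self.pos_embed  # the position embedding restores the patch order

tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```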


57 Quiz: From what you now know about attention, what

Slide 61

Both attention mechanisms and convolutions are essential components in neural network architectures, and they each have their advantages and disadvantages. Here’s a comparison of the two:

57.1 Attention Mechanisms:

57.1.1 Advantages:

  1. Global Context Handling: Attention mechanisms allow the model to focus on different parts of the input sequence when making predictions, enabling the model to consider global context and dependencies.

  2. Variable Receptive Field: Attention doesn’t enforce a fixed receptive field, meaning the model can attend to different parts of the input sequence with varying degrees of focus. This flexibility can be beneficial for tasks where capturing long-range dependencies is crucial.

  3. Sequence-to-Sequence Tasks: Attention mechanisms have been particularly successful in sequence-to-sequence tasks, such as machine translation, where the input and output sequences can have varying lengths and alignments.

57.1.2 Disadvantages:

  1. Computational Complexity: Attention mechanisms can be computationally expensive, especially with large sequences, as they require pairwise comparisons between all elements in the sequence.

  2. Memory Requirements: The model needs to store attention weights for each element in the sequence, leading to increased memory requirements.

57.2 Convolutional Operations:

57.2.1 Advantages:

  1. Parameter Sharing: Convolutional layers use shared weights, which reduces the number of parameters in the model. This can make convolutional networks more computationally efficient and easier to train, especially on tasks with limited data.

  2. Local Receptive Field: Convolutional layers have a fixed-size receptive field, allowing them to capture local patterns and spatial hierarchies efficiently.

  3. Translation Invariance: Convolutional layers can provide some degree of translation invariance, meaning they can recognize patterns regardless of their exact position in the input.

57.2.2 Disadvantages:

  1. Limited Global Context: Convolutional layers have a fixed receptive field, which may limit their ability to capture long-range dependencies in the data.

  2. Not Well-Suited for Sequence Tasks: While convolutional layers are effective for image-related tasks, they may not be as naturally suited for sequence-to-sequence tasks where the input and output lengths can vary.

In practice, a combination of both attention mechanisms and convolutional layers is often used in hybrid models to leverage the strengths of each. For example, the Transformer architecture combines self-attention mechanisms with feedforward layers, providing an effective approach for a variety of natural language processing tasks.

58 Vision Transformer

Slide 62

59 Vision Transformer

Slide 63

In ViT the positional embeddings are actually learned; compared to BERT, we can actually visualize them and see what they encode.

60 Attention as a superset of convolutions

Slide 64

61 Training a ViT is more difficult

Slide 65

62 ViT features

Slide 66

63 Also here: ImageNet can (more or less) be solved with textures

Slide 67

64 Swin Transformer: add hierarchy back in?

Slide 68

65 Hybrid Architectures get best performances (atm)

Slide 69

66 The Perceiver

Slide 70

67 The Perceiver: main idea

Slide 71

68 The Perceiver: Taming quadratic complexity

Slide 72

69 Title

Slide 73

70 Notes on weight sharing for CNN

A convolutional layer generally comprises many “filters”, which are usually small, e.g. 2x2 or 3x3. These filters are applied in a “sliding window” across the entire layer’s input. “Weight sharing” means using the same fixed weights for a given filter across the entire input. It does not mean that all of the filters are equivalent.

To be concrete, let’s imagine a 2x2 filter \(\mathbf{F}\) striding over a 3x3 input \(\mathbf{X}\) without padding, so the filter gets applied 4 times. Let’s denote the unrolled filter \(\boldsymbol{\beta}\):

\[ \mathbf{X} = \begin{bmatrix} x_{11} & x_{21} & x_{31} \\ x_{12} & x_{22} & x_{32} \\ x_{13} & x_{23} & x_{33} \\ \end{bmatrix} \]

\[ \mathbf{F} = \begin{bmatrix} w_{11} & w_{21} \\ w_{12} & w_{22} \\ \end{bmatrix} \]

\[ \boldsymbol{\beta} = \begin{bmatrix} w_{11} & w_{12} & w_{21} & w_{22} \\ \end{bmatrix} \]

\[ \mathbf{X} * \mathbf{F} = \begin{bmatrix} \boldsymbol{\beta} \cdot [x_{11}, x_{12}, x_{21}, x_{22}] & \boldsymbol{\beta} \cdot [x_{12}, x_{13}, x_{22}, x_{23}] \\ \boldsymbol{\beta} \cdot [x_{21}, x_{22}, x_{31}, x_{32}] & \boldsymbol{\beta} \cdot [x_{22}, x_{23}, x_{32}, x_{33}] \end{bmatrix} \]

“Weight sharing” means that when we apply this 2x2 filter to our 3x3 input, we reuse the same four weights given by the filter across the entire input. The alternative would be each filter application having its own set of weights (which would really be a separate filter for each region of the image), giving a total of 16 weights, or a dense layer with 4 nodes, giving 36 weights.

Sharing weights in this way significantly reduces the number of weights we have to learn, making it easier to learn very deep architectures, and additionally allows us to learn features that are agnostic to what region of the input is being considered.
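A small sketch of this parameter-count contrast for the same 2x2 filter on a 3x3 input (PyTorch; purely illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 3, 3)                        # one 3x3 single-channel input

conv = nn.Conv2d(1, 1, kernel_size=2, bias=False)  # one shared 2x2 filter
print(sum(p.numel() for p in conv.parameters()))   # 4 weights, reused at all 4 positions
print(conv(x).shape)                               # torch.Size([1, 1, 2, 2])

dense = nn.Linear(9, 4, bias=False)                # dense layer: 4 outputs from the 9 inputs
print(sum(p.numel() for p in dense.parameters()))  # 36 weights, no sharing
```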