Self-supervised Learning I

Description of this Post
Author
Published

December 14, 2023

1 Self-supervised learning for computer vision

Slide 1

2 Organisation

Slide 2

3 Self-supervised learning came up in multiple previous lectures.

Slide 3

4 Today:

Slide 4

5 Title

Slide 5

6 The field of AI has made rapid progress; the crucial fuel is data

Slide 6

7 Manual annotations for the data are limiting.

Slide 7

  • Weakly supervised learning uses labels such as Instagram hashtags; these can be noisy, because a person can post a picture of a dog with #cute, which is not a very representative label.

  • Weakly supervised learning is a type of machine learning that falls between supervised and unsupervised learning. The training data is labelled, but the labels are noisy, incomplete, or imprecise. This approach is often used when it is challenging or expensive to obtain a fully labelled dataset.

8 Solving the problem of expensive annotations: self-supervision.

Slide 8

9 General procedure of self-supervised learning.

Slide 9

Here the transformation could, for instance, be augmentations.

The proxy task provides you with gradients, and those gradients train the DNN. Proxy tasks can be geometry based, clustering based, and so on; a rough sketch of this loop follows below.
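For intuition, here is a minimal PyTorch-style sketch of one generic self-supervised training step. The names (`augment`, `proxy_head`, `proxy_loss`) are illustrative placeholders, not the lecture's actual code.

```python
def ssl_training_step(model, proxy_head, proxy_loss, augment, images, optimizer):
    """One generic self-supervised step: transform data, solve a proxy task, backprop."""
    view = augment(images)                            # 1. transform the raw data (e.g. augmentations)
    features = model(view)                            # 2. run the DNN
    loss = proxy_loss(proxy_head(features), images)   # 3. pose a proxy task on the output
    optimizer.zero_grad()
    loss.backward()                                   # 4. the proxy task provides the gradients
    optimizer.step()                                  # 5. those gradients train the DNN
    return loss.item()
```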

10 General procedure of self-supervised learning.

Slide 10

In representation learning you put an image in and get a vector out.

Useful self-supervised learning: you pose a task that was previously done in a supervised manner as a self-supervised task. This could be object detection & segmentation.

11 Title

Slide 11

12 Reason 1: Scalability

Slide 12

13 Reason 1: Scalability

Slide 13

14 Reason 2: Constantly changing domains

Slide 14

15 Reason 2: Accessibility & generalisability

Slide 15

Why do we want to do self-supervised learning?

Once you have a pre-trained model you can, for example, use it to classify samples.

So you can do pre-training on a lot of data and afterwards fine-tune the model on your specific data, e.g. hospital data.

Pre-trained representations have also been used in archaeology to figure out whether a particular sample belongs to a particular period.

16 Reason 3: Ambiguity of labels

Slide 16

In weakly supervised learning you don't use labels that humans specifically provided, but labels that you found, e.g. hashtags that people place. The same holds for the CLIP model, where you have images and captions that were simply drawn from the internet. Some of these captions are unreliable: an image of a laptop may not be labelled "laptop" but read "product #15", or a picture of a dog may be captioned "my fav partner to go on a walk". That is ambiguous and confusing for the model. Self-supervised learning instead means that we only use the raw data and none of these annotations.

So one reason to do self-supervised learning is that these labels from the internet are already inaccurate, so you don't want to use them. Instead you can do self-supervised learning, where you train a DNN on unlabelled data and then use it for your task at hand.

17 Reason 4: Investigating the fundamentals of visual understanding

Slide 17

Can we really understand what happens without labels? That is, the fundamentals of computer vision.

18 Quiz:

Slide 18

Self-supervised learning refers more to learning something that you can reuse on other datasets. Typically in representation learning the distinction is clear, because a contrastive model by itself is not useful; in that case you can say self-supervised learning is part of supervised learning methodologies.

  • Normal autoencoders (AEs) are unsupervised learning methods.

  • There are other autoencoders that are self-supervised learners.

19 Title

Slide 19

20 Here, we will only cover the most important works.

Further details and recent developments can be found here:

Slide 20

21 How does one learn without labels?

Slide 21

  • We say that we need to generate gradients. Some types of signals that we can leverage include:

  • Reconstruction: we can remove some part of the image and ask the model to reconstruct what has been hidden.

  • Geometry

22 Early methods: Context prediction

Slide 22

23 Note: similar to how BERT has been trained

Slide 23

24 Early methods

Slide 24

  • Context Encoders: you mask out a part of an image, i.e. you put a white mask on top of it, and then you train a model to output a dense feature map that fills in the pixels at that location. You only apply the loss at those locations, but because you use a CNN you still train all the weights (see the sketch below).
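A minimal sketch of that idea, assuming a `model` that maps a masked image to a dense per-pixel output of the same size (PyTorch-style; the masking scheme and names are illustrative):

```python
import torch
import torch.nn.functional as F

def context_encoder_loss(model, images):
    """Mask a central region, reconstruct it, and apply the loss only there."""
    b, c, h, w = images.shape
    mask = torch.zeros(b, 1, h, w, device=images.device)
    mask[:, :, h // 4 : 3 * h // 4, w // 4 : 3 * w // 4] = 1.0   # square hole
    masked_input = images * (1 - mask)          # hide the region from the network
    reconstruction = model(masked_input)        # dense output with predicted pixels
    # Loss is computed only at the masked locations, but since the model is a CNN
    # the gradients still update all of its weights.
    return F.mse_loss(reconstruction * mask, images * mask)
```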

25 Geometry: RotNet: learn features by predicting “which way is up”.

Slide 25

26 Image-uniqueness: Exemplar CNN, precursor to contrastive learning

Slide 26

Exemplar CNN came before contrastive learning, and this line of work later fed into contrastive models such as CLIP. Here you augment each image multiple times, and your model needs to output which image identity each augmentation came from.

The idea of image uniqueness is that near-duplicate copies of the same image make for a strong signal. For instance, a dog jumping vs. a dog sitting is very difficult to differentiate, so the model needs to learn quite good features in order to tell these two "classes" apart.

This also enforces augmentation invariance, because all the different views, i.e. all the different augmentations of one image, should map to the same identity (see the sketch below).
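A hedged sketch of the exemplar idea, where for simplicity each image in the batch is treated as its own surrogate class (in the original formulation the surrogate classes span the whole dataset):

```python
import torch
import torch.nn.functional as F

def exemplar_step(model, images, augment, n_views=4):
    """Each image is its own class; the model must recover the identity from augmentations."""
    b = images.shape[0]
    labels = torch.arange(b).repeat(n_views)        # view k of image i gets label i
    views = torch.cat([augment(images) for _ in range(n_views)], dim=0)
    logits = model(views)                           # (n_views * b, num_surrogate_classes)
    return F.cross_entropy(logits, labels)
```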

27 Modern Noise-contrastive self-supervised learning

Slide 27

After that, people developed contrastive models.

The basic idea of SimCLR is: you take two views of each of two images, e.g. two augmentations of the dog and two of the chair.

Here the softmax is calculated over the similarity of z_i and z_j, which are the final representations. The sim() function is a dot product that tells you how similar they are, and you apply the softmax across all these dot products (see the sketch below).
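A minimal sketch of this SimCLR-style (NT-Xent) loss, assuming `z_i` and `z_j` hold the projected embeddings of two augmented views of the same batch. This is an illustrative re-implementation, not the lecture's code.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z_i, z_j, temperature=0.5):
    """Softmax over all pairwise similarities; the other view of each image is the positive."""
    z = F.normalize(torch.cat([z_i, z_j], dim=0), dim=1)   # (2B, dim)
    sim = z @ z.t() / temperature                           # all pairwise dot products
    sim.fill_diagonal_(float('-inf'))                       # exclude self-similarity
    n = z.shape[0]
    # Positive for sample k is its other augmented view: k+B for the first half, k-B for the second.
    targets = torch.cat([torch.arange(n // 2) + n // 2, torch.arange(n // 2)])
    return F.cross_entropy(sim, targets)
```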

28 CLIP from Lect 9 and assignment 2 simply applies SimCLR across modalities

Slide 28

29 Modern Noise-contrastive self-supervised learning

Slide 29

30 Masked Image Modelling (recent development)

Slide 30

Transformers work on sequences, so you can simply leave some patches out; with a CNN this approach would not work, because the outputs are always spatial. You then get a representation, and your task is to predict all the missing patches given the patches that you have seen (see the sketch below).
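A rough MAE-style sketch of that procedure, where `encoder` and `decoder` are placeholder transformer modules and the decoder is assumed to take the latent of the visible patches plus the indices of the masked ones:

```python
import torch
import torch.nn.functional as F

def masked_patch_loss(encoder, decoder, patches, mask_ratio=0.75):
    """Drop a random subset of patches, encode the rest, and regress the missing ones."""
    b, n, d = patches.shape                       # image flattened into a patch sequence
    n_keep = int(n * (1 - mask_ratio))
    perm = torch.randperm(n)
    keep_idx, mask_idx = perm[:n_keep], perm[n_keep:]
    visible = patches[:, keep_idx]                # transformers can simply drop tokens
    latent = encoder(visible)                     # representation of the visible patches
    predicted = decoder(latent, mask_idx)         # predict the missing patches
    return F.mse_loss(predicted, patches[:, mask_idx])
```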

31 Clustering

Slide 31

32 Title

Slide 32

33 Datasets for images: Pretraining and downstream

Slide 33

34 Recent surge in research on problematic images in ImageNet

Slide 34

35 Title

Slide 35

36 The dataset: diverse, containing nature and buildings.

Slide 36

37 Datasets for images: Pretraining and downstream

Slide 37

38 Downstream semi-supervised tasks: Self-supervised Learning helps

Slide 38

  • Supervised: red
  • Self-supervised: Blue

39 Title

Slide 39

40 Title

Slide 40

41 Goal: Discover visual concepts without annotations.

Slide 41

42 How can we solve this chicken and egg problem?

Slide 42

43 The key to image understanding is separating meaning from appearance.

Slide 43

The meaning stays the same even though the pixel values are different.

44 Quiz:

Slide 44

  1. If you stay in the same domain, or a close one, you know that the learning rate \(lr\) will be quite similar.
  2. You can also set the network architecture, because if you know it is a vision recognition task then you know a convolutional architecture is meaningful.
  3. If your dataset is small then you reduce the number of epochs, because the pre-training was done on tons of images, so the number of epochs there was also large (see the sketch after this list).
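As a rough illustration of these points, a hedged PyTorch-style fine-tuning sketch: reuse the pre-trained backbone and architecture, keep a similar (or smaller) learning rate, and train for far fewer epochs on the small target dataset. All names are placeholders.

```python
import torch
import torch.nn.functional as F

def finetune(pretrained_backbone, head, small_loader, lr=1e-3, epochs=10):
    """Fine-tune a pre-trained backbone plus a new task head on a small labelled dataset."""
    model = torch.nn.Sequential(pretrained_backbone, head)   # keep the architecture
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)   # similar / slightly smaller lr
    for _ in range(epochs):                                   # far fewer epochs than pre-training
        for images, labels in small_loader:
            loss = F.cross_entropy(model(images), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```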

45 Our work applies the idea of augmentation invariance to assign concepts.

Slide 45

46 Our work applies the idea of transformation invariance to assign concepts.

Slide 46

47 How can we optimize the labels and make assignments consistent?

Slide 47

Here we want to make y differentiable, because we want to learn those labels (see the sketch below).
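For intuition (the exam does not require this), a minimal Sinkhorn-Knopp style sketch of how soft assignments Q can be obtained from the network's scores while keeping clusters balanced. This follows the SeLa/SwAV-style iteration in spirit; details may differ from the exact slides.

```python
import torch

def sinkhorn_assignments(scores, n_iters=3, eps=0.05):
    """Turn raw cluster scores into balanced soft pseudo-label assignments Q."""
    q = torch.exp(scores / eps).t()          # (num_clusters, num_samples)
    q /= q.sum()
    k, n = q.shape
    for _ in range(n_iters):
        q /= q.sum(dim=1, keepdim=True)      # rows: give each cluster equal total mass
        q /= k
        q /= q.sum(dim=0, keepdim=True)      # columns: make each sample a distribution
        q /= n
    return (q * n).t()                       # (num_samples, num_clusters), rows sum to 1
```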

48 SK optimisation (not needed for exam)

Slide 48

49 SK optimisation of assignments Q (not needed for exam)

Slide 49

50 Algorithm

Slide 50

51 Our method applied on 1.2 million images:

Examples

Slide 51

52 Automatically discovered concepts match manual annotation.

Slide 52

53 AlexNet, ImageNet linear probes (remember Lecture 5)

Slide 53

54 Self-supervised labelling from three core ideas

Slide 54

55 More recently…

Slide 55

56 DINO has remarkable properties

Slide 56

57 Title

Slide 57

58 However: The world is not object-centric.

Slide 58

59 Self-Supervised Learning of Object Parts for Semantic Segmentation

Slide 59

60 Self-Supervised Learning of Object Parts for Semantic Segmentation

Slide 60

61 Self-Supervised Learning has to move from image-level to spatially-dense learning

Slide 61

62 We propose a dense clustering pretext task to learn object parts

Slide 62

63 Quiz

Slide 63

  1. False. ROI-Align cannot take care of non-rectangular selections.
  2. False; just like any pooling, it can provide gradients.

64 Title

Slide 64

65 Additional Innovation 2: Overclustering with Community Detection (CD)

Slide 65

66 Overclustering with Community Detection ran.

Slide 66

67 Title

Slide 67

68 Leopart improves fully unsupervised SOTA by >6%

Slide 68

69 Leopart achieves transfer SOTA on three datasets simultaneously

Slide 69

70 Augmentations were key for both SeLa and Leopart.

Slide 70

71 How can we isolate the effect of augmentations?

By learning from a single image

Slide 71

72 How do we go about this?

Slide 72

73 What do we learn?

Slide 73

74 Tested images

Slide 74

75 Self-supervised learning from one image:

First convolutional layer

Slide 75

76 Self-supervised learning from one image:

Quality (ImageNet linear probes)

Slide 76

77 Self-supervised learning from one image:

Quality (ImageNet linear probes)

Slide 77

78 Style transfer with a 1-image trained CNN

Slide 78

79 [Update Feb 2021] Using a ResNet-50 and MoCo loss, we get even closer for fine-tuning tasks.

Slide 79

80 Update 2:

Slide 80

81 Conclusion

Slide 81

82 unsup. pre-train

Slide 82

83 Title

Slide 83