Compositional semantics and sentence representations

Description of this Post
Author
Published

November 22, 2023

Slide 1


Slide 2

1 Compositional semantics

Slide 3

"Deep" here means a deeper understanding of language, not a (deep) neural network.

2 Compositional semantics alongside syntax

Slide 4

We want to model semantics alongside syntax: first word meaning, then phrase meaning, then sentence meaning.

3 Non-trivial issues with semantic composition

Slide 5

In the first sentence, "it" may refer to something concrete, e.g. a dog. In the second, "it" does not refer to anything. So even though the two sentences have the same syntactic structure, they have different meanings.

The problem with the last example is that idiomatic phrases carry a meaning of their own: "kick the bucket" can mean that a person has passed away, but sometimes we also use it literally to say that someone actually kicked a bucket.

4 Non-trivial issues with semantic composition

Slide 6

5 Issues with semantic composition

Slide 7

This represents recursion.

6 Modelling compositional semantics

Slide 8

These are two modelling frameworks:

  1. Composition done directly in vector space. These are unsupervised, general-purpose methods; they capture the meaning of a word based on its similarity to other words.

  2. Representations trained in a supervised way, which means you need a task to provide the learning signal, e.g. sentiment classification. If you train your representations for sentiment analysis and then want to use them for translation, this will not work well, because they were trained for a different task.

Slide 9

7 Compositional distributional semantics

These are all general-purpose, unsupervised approaches.

Slide 10

The idea came up naturally: we were successful at creating word representations, so why not phrase representations, and then why not sentence-level representations?

Even with a finite vocabulary you can still create an infinite number of sentences.

It is infeasible to enumerate every possible sentence and learn a representation for each one directly. But we can do something similar: instead of learning sentence representations directly, we take the word representations of the words in the sentence and compose them into a sentence representation.

In principle you then only need representations for all words, which is far more feasible than learning representations for all sentences.

8 Vector mixture models

Slide 11

9 Additive and multiplicative models

Slide 12

Because summation is commutative, different word orders give the same representation, so this model has a flaw.

Prepositions are used flexibly in many positions, which means they do not have a strong behavioural profile; content words do: they appear in similar positions and co-occur in similar contexts.
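
To make the flaw concrete, here is a minimal NumPy sketch (word vectors made up for illustration) of the additive and multiplicative mixture models; because addition and element-wise multiplication are commutative, "dog bites man" and "man bites dog" receive identical representations.

```python
import numpy as np

# Toy word vectors (made up for illustration).
dog   = np.array([0.8, 0.1, 0.3])
bites = np.array([0.2, 0.9, 0.4])
man   = np.array([0.7, 0.2, 0.5])

def additive(*vecs):
    """Additive mixture model: sum of the word vectors."""
    return np.sum(vecs, axis=0)

def multiplicative(*vecs):
    """Multiplicative mixture model: element-wise product of the word vectors."""
    return np.prod(vecs, axis=0)

s1 = additive(dog, bites, man)        # "dog bites man"
s2 = additive(man, bites, dog)        # "man bites dog"
print(np.allclose(s1, s2))            # True: word order is lost

m1 = multiplicative(dog, bites, man)
m2 = multiplicative(man, bites, dog)
print(np.allclose(m1, m2))            # True: same problem
```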

10 Lexical function models

Slide 13

11 Lexical function models

Slide 14

12 Learning adjective matrices

Slide 15

13 Learning adjective matrices

Slide 16


14 Title

Slide 17

15 Title

Slide 18

16 Title

Slide 19

17 Task: Sentiment classification of movie reviews

Slide 20

18 Words (and sentences) into vectors

Slide 21

19 Sentence representation: A (very) simplified picture

Slide 22

20 Title

Slide 23

21 Dataset: Stanford Sentiment Treebank (SST)

Slide 24

22 Binary parse tree: One example

Slide 25

23 Title

Slide 26

24 Models

Slide 27

25 First approach: Sentence + Sentiment

Slide 28

26 Title

Slide 29

27 Title

Slide 30

Here we do not model order or syntax

28 Bag of Words

Slide 31

29 Bag of Words

Slide 32

Because word order is not considered, the two example sentences get the same representation, so this model has a flaw.

30 Turning words into numbers

Slide 33

31 One-hot vectors select word embeddings

Slide 34

32 Title

Slide 35

33 Title

Slide 36

Because a higher-dimensional vector can squeeze more detail about the word into the embedding, this increase in dimensionality helps.

34 Continuous Bag of Words (CBOW)

Slide 37

35 Recall: Matrix Multiplication

Slide 38

36 What about this?

Slide 39

The problem with simply concatenating the word embeddings is that you do not know what size the matrix W should be, because sentences vary in length and so does the concatenated vector.
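
One way around this, which is what CBOW does on the following slides, is to sum the word embeddings instead of concatenating them, so the input to W always has the same size regardless of sentence length. A minimal PyTorch sketch, with hypothetical sizes (vocabulary of 1000, 300-dimensional embeddings, 5 sentiment classes):

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, num_classes = 1000, 300, 5  # hypothetical sizes

embed = nn.Embedding(vocab_size, emb_dim)
W = nn.Linear(emb_dim, num_classes)  # fixed input size, regardless of sentence length

def cbow_logits(token_ids: torch.Tensor) -> torch.Tensor:
    """Sum the word embeddings into one fixed-size vector, then project to class scores."""
    summed = embed(token_ids).sum(dim=0)   # (emb_dim,)
    return W(summed)                        # (num_classes,)

short = torch.tensor([5, 42, 7])             # 3-word sentence
long_ = torch.tensor([5, 42, 7, 99, 3, 12])  # 6-word sentence
print(cbow_logits(short).shape, cbow_logits(long_).shape)  # both torch.Size([5])
```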

37 What about this?

Slide 40

38 Title

Slide 41

39 Title

Slide 42

Here we simply add more layers, giving the model more capacity.

Slide 43

40 What about this?

Slide 44

Deeper is not always better because we might start to overfit, and at test time it will not generalize well

41 Question

Slide 45

42 Title

Slide 46

43 Deep CBOW with pretrained embeddings

Slide 47

It is easier if the model already knows the word meanings; it can then focus on getting the sentiment of the sentence. The pretrained word representations act as a prior.

44 Title

Slide 48

There are two paradigms for using pretrained embeddings (see the sketch below):

  1. You can keep the representations frozen, i.e. they are not updated during training.
  2. Or you can fine-tune the word representations together with your task, which means the word representations become more specialized for that task.
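
In PyTorch the two paradigms differ in a single flag. A minimal sketch, assuming `pretrained` is a tensor of pretrained vectors (e.g. word2vec or GloVe) already aligned with your vocabulary; the random tensor here is just a placeholder:

```python
import torch
import torch.nn as nn

# Placeholder for pretrained vectors, one row per word in the vocabulary.
pretrained = torch.randn(1000, 300)

# Paradigm 1: keep the embeddings frozen (not updated during training).
frozen_embed = nn.Embedding.from_pretrained(pretrained, freeze=True)

# Paradigm 2: fine-tune the embeddings together with the task.
tuned_embed = nn.Embedding.from_pretrained(pretrained, freeze=False)

print(frozen_embed.weight.requires_grad)  # False
print(tuned_embed.weight.requires_grad)   # True
```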

45 Recap: Training a neural network

Slide 49

46 Cross Entropy Loss

Slide 50

47 Softmax

Slide 51

48 Title

Slide 52

Feed-forward NNs were not able to capture word-order information; an RNN can.

Here each word is conditioned on the previous words.

49 Introduction: Recurrent Neural Network (RNN)

Slide 53

50 Introduction: Recurrent Neural Network (RNN)

Slide 54

Six words, six time steps.

51 Introduction: Recurrent Neural Network (RNN)

Slide 55

52 Introduction: Unfolding the RNN

Slide 56

The W and R matrices are the same at every time step; they are shared.
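
A minimal NumPy sketch of the unrolled recurrence, emphasizing that the same W and R are reused at every time step (sizes made up for illustration):

```python
import numpy as np

emb_dim, hid_dim = 4, 3                  # made-up sizes
rng = np.random.default_rng(0)

W = rng.normal(size=(hid_dim, emb_dim))  # input-to-hidden, shared across time steps
R = rng.normal(size=(hid_dim, hid_dim))  # hidden-to-hidden, shared across time steps
b = np.zeros(hid_dim)

def rnn_unroll(xs):
    """Apply h_t = tanh(W x_t + R h_{t-1} + b) for each word vector x_t."""
    h = np.zeros(hid_dim)
    for x in xs:                         # one iteration per time step (per word)
        h = np.tanh(W @ x + R @ h + b)
    return h                             # final hidden state

sentence = [rng.normal(size=emb_dim) for _ in range(6)]  # 6 words -> 6 time steps
print(rnn_unroll(sentence).shape)        # (3,)
```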

53 Introduction: Making a prediction

Slide 57

When you reach the end of the sentence, you use the hidden vector at that last time step as the sentence representation. We can do this because the last time step has been influenced by the entire history. We then project it to a 5-dimensional vector (one dimension per sentiment class) and take the softmax/argmax over this representation to make the prediction.
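
A minimal PyTorch sketch of this pipeline, with hypothetical sizes; the final hidden state is projected to 5 sentiment classes:

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hid_dim, num_classes = 1000, 300, 128, 5  # hypothetical sizes

embed = nn.Embedding(vocab_size, emb_dim)
rnn = nn.RNN(emb_dim, hid_dim, batch_first=True)
out_proj = nn.Linear(hid_dim, num_classes)

token_ids = torch.tensor([[5, 42, 7, 99, 3, 12]])  # one sentence, 6 time steps
_, h_last = rnn(embed(token_ids))                  # h_last: (1, batch, hid_dim)
logits = out_proj(h_last.squeeze(0))               # (batch, num_classes)
prediction = logits.argmax(dim=-1)                 # predicted sentiment class
print(logits.shape, prediction)
```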

54 Introduction: The vanishing gradient problem

Slide 58

RNNs tend to suffer from the vanishing gradient problem.

55 Introduction: The vanishing gradient problem

Slide 59

Here the number of unrolls N refers to the number of words in the sentence.

56 What about this?

Slide 60

57 RNN vs ANN

Slide 61

In a feed-forward ANN you have different parameter matrices in each layer, so the effects can partly cancel out. However, even in an ANN you can run into vanishing/exploding gradients.

Slide 62

58 Long Short-Term Memory (LSTM)

CV: ResNet skip connections to alleviate exploding/vanishing gradients.

NLP: LSTMs were introduced.

Slide 63

LSTMs are good at dealing with long-term dependencies because they are able to cope with exploding/vanishing gradients.

59 LSTM: Core idea

Slide 64

The cell state is supposed to capture the long-term information in the sentence.

This is good because backpropagation through time then has a largely uninterrupted gradient flow.

60 LSTMs

Slide 65

Now the recurrent unit is the LSTM, so each copy contains an LSTM cell. Where before we had a single layer, the cell now has four different layers interacting with each other.

61 LSTM cell

Slide 66

3 gates:

  • forget gate
  • input gate
  • output gate

c_t is the memory cell. No weight matrices are applied to it; we only do element-wise multiplication and addition. This is why people call it a conveyor belt: we forget information and we add information, but the information flows through uninterrupted.
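
In symbols (standard LSTM notation, with \(\odot\) denoting element-wise multiplication), the conveyor-belt update of the memory cell and the resulting hidden state are:

\[
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \qquad h_t = o_t \odot \tanh(c_t)
\]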

62 LSTM: Cell state

Slide 67

63 LSTM: Forget gate

Slide 68

64 LSTM: Candidate cell

Slide 69

65 LSTM: Input gate

Slide 70

66 LSTM

Slide 71

67 LSTM: Output gate

Slide 72

Here we decide what we want to expose from the long-term memory. The result of this is the new hidden (output) vector.

67.1 Recap


  1. Forget gate: take the input word x_t and the previous hidden state h_{t-1}, apply a sigmoid, and multiply the result element-wise with the cell state (the memory).

Multiplying by 1 means you want to keep those items in memory; multiplying by 0 means you want to discard them.

  2. Candidate and input gate: multiply the output of the tanh, which gives candidate values between -1 and 1, with the sigmoid values from the input gate. With this we selectively add new information to the memory cell.

  3. Output gate: take the cell memory squashed to values between -1 and 1 (via tanh) and multiply it by a sigmoid computed from the input word x_t and the previous state h_{t-1}.

  • Steps 1 & 2 update the long-term memory (cell state).
  • Step 3 produces the short-term memory (hidden state), a filtered version of the long-term memory (see the sketch below).
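
Putting the recap together, here is a minimal NumPy sketch of a single LSTM cell step (note the gates use a sigmoid; the sizes are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step: gates from [x_t, h_prev], then conveyor-belt cell update."""
    z = np.concatenate([x_t, h_prev])
    f = sigmoid(params["Wf"] @ z + params["bf"])       # forget gate: what to erase
    i = sigmoid(params["Wi"] @ z + params["bi"])       # input gate: how much candidate to add
    c_hat = np.tanh(params["Wc"] @ z + params["bc"])   # candidate values in (-1, 1)
    o = sigmoid(params["Wo"] @ z + params["bo"])       # output gate: what to expose
    c_t = f * c_prev + i * c_hat                       # long-term memory (cell state)
    h_t = o * np.tanh(c_t)                             # short-term memory (hidden state)
    return h_t, c_t

emb_dim, hid_dim = 4, 3                                # made-up sizes
rng = np.random.default_rng(0)
params = {k: rng.normal(size=(hid_dim, emb_dim + hid_dim)) for k in ("Wf", "Wi", "Wc", "Wo")}
params.update({k: np.zeros(hid_dim) for k in ("bf", "bi", "bc", "bo")})

h, c = np.zeros(hid_dim), np.zeros(hid_dim)
for x in [rng.normal(size=emb_dim) for _ in range(6)]:  # 6 words -> 6 steps
    h, c = lstm_step(x, h, c, params)
print(h.shape, c.shape)
```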

68 Long Short-Term Memory (LSTM)

Slide 73

  • Cell state is the long term memory
  • Hidden state is your short term memory

69 LSTMs: Applications & Success in NLP

Slide 74


Slide 75

70 Summary of models seen so far

Slide 76

Sequence models:

  • RNN
  • LSTM

Tree-structured models are the ones that are also sensitive to the syntactic structure.

71 Second approach: Sentence + Sentiment + Syntax

Slide 77

72 Exploiting tree structure

Slide 78

Compositionality is the idea that the meaning of a sentence is derived from the meanings of the individual words together with the rules that combine them; the word meanings alone are not enough.

In the models seen so far, we derived the sentence representation from the individual words but did not take the syntactic structure into account. Tree LSTMs allow us to do both: we use the meanings of the words and also the rules that combine them.

73 Why would it be useful?

Slide 79

74 Constituency Parse

Slide 80

75 Recurrent vs Tree Recursive NN

Slide 81

Recurrent NNs include LSTMs, but there are also tree RNNs, which are recursive.

If you input "I loved this movie" into an RNN, you cannot model a phrase independently of the words that precede it in the sentence. For instance, the representation of "this movie" depends on having seen "I loved", which means you cannot extract separate phrase representations from your sentence representation.

This is different in the tree-recursive NN: you explicitly compose "this movie" into a phrase representation first, and only then combine it with the preceding words as you go up the tree.
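
A minimal sketch of this difference using naive recursive composition, \(p = \tanh(W[\text{left}; \text{right}])\), as on the "naive recursive NN" slide; each internal node gets its own phrase vector, which the sequential RNN does not give us (vectors and sizes made up):

```python
import numpy as np

dim = 3
rng = np.random.default_rng(0)
W = rng.normal(size=(dim, 2 * dim))   # shared composition matrix
b = np.zeros(dim)

def compose(left, right):
    """Naive recursive NN: combine two child vectors into a parent/phrase vector."""
    return np.tanh(W @ np.concatenate([left, right]) + b)

# Toy word vectors for "I loved this movie".
I, loved, this, movie = (rng.normal(size=dim) for _ in range(4))

this_movie = compose(this, movie)          # phrase representation on its own
loved_this_movie = compose(loved, this_movie)
sentence = compose(I, loved_this_movie)    # root = sentence representation
print(this_movie.shape, sentence.shape)
```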

76 Tree Recursive NN

Slide 82

77 Practical II data set: Stanford Sentiment Treebank (SST)

Slide 83

78 Tree LSTMs: Generalize LSTM to tree structure

Slide 84

We can input multiple children at each composition step.

79 Tree LSTMs

Slide 85

  1. Child-Sum Tree LSTM: you can use any number of children, but you lose child-order information.

  2. N-ary Tree LSTM: a fixed number of children per node; in the practical we use a binary parse tree.

80 Child-Sum Tree LSTM

Slide 86

81 Child-Sum Tree LSTM

Slide 87

82 N-ary Tree LSTM

Slide 88

That means each child has to be fed to the model separately, because each child position has its own parameter matrices; the children are not simply summed up.

83 N-ary Tree LSTM

Slide 66

Slide 89

84 N-ary Tree LSTM

Slide 90

\(u_j\) is the candidate gate.

For each child \(h_j\) we have a separate parameter matrix, and the contributions are summed.
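
For reference, a sketch of the binary (N = 2) Tree-LSTM update along the lines of Tai et al. (2015), with the word-input term dropped under the assumption that words enter only at the leaves; at an internal node \(j\) with children \((h_{j1}, c_{j1})\) and \((h_{j2}, c_{j2})\):

\[
\begin{aligned}
i_j &= \sigma\Big(\sum_{l=1}^{2} U^{(i)}_l h_{jl} + b^{(i)}\Big), &
o_j &= \sigma\Big(\sum_{l=1}^{2} U^{(o)}_l h_{jl} + b^{(o)}\Big), \\
f_{jk} &= \sigma\Big(\sum_{l=1}^{2} U^{(f)}_{kl} h_{jl} + b^{(f)}\Big), &
u_j &= \tanh\Big(\sum_{l=1}^{2} U^{(u)}_l h_{jl} + b^{(u)}\Big), \\
c_j &= i_j \odot u_j + \sum_{l=1}^{2} f_{jl} \odot c_{jl}, &
h_j &= o_j \odot \tanh(c_j).
\end{aligned}
\]

Each gate has a separate matrix \(U_l\) per child position \(l\), which is what distinguishes this from the Child-Sum variant.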

Slide 73

85 LSTMs vs Tree-LSTMs

Slide 91

The Tree-LSTM is the more general model: a standard LSTM is just a Tree-LSTM in which every node has exactly one child. So with one child per node you recover the ordinary sequential LSTM.

86 Title

Slide 92

87 Title

Slide 93

88 Building a tree with a transition sequence

Slide 94

89 Transition sequence example

Slide 95

90 Transition sequence example

Slide 96

91 Transition sequence example

Slide 97

92 Transition sequence example

Slide 98

93 Transition sequence example

Slide 99

94 Transition sequence example

Slide 100

95 Transition sequence example

Slide 101

96 Transition sequence example

Slide 102

97 Title

Slide 103

Because we process the transitions in sequence, pushing things onto the stack and then reducing them into the tree, we cannot do this in parallel, so it is slow. Thus we want to mini-batch, i.e. process multiple sentences at the same time.
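
A minimal sketch of how a SHIFT/REDUCE transition sequence builds a tree with a stack and a buffer (the sentence, the transition sequence, and the nested-tuple tree representation are made up for illustration):

```python
# Minimal SHIFT/REDUCE interpreter: SHIFT moves the next word onto the stack,
# REDUCE pops the top two items and combines them into one subtree.
SHIFT, REDUCE = "SHIFT", "REDUCE"

def build_tree(words, transitions):
    buffer = list(words)   # words still to be read, left to right
    stack = []             # partially built subtrees
    for t in transitions:
        if t == SHIFT:
            stack.append(buffer.pop(0))
        else:  # REDUCE
            right = stack.pop()
            left = stack.pop()
            stack.append((left, right))
    assert len(stack) == 1 and not buffer
    return stack[0]

words = ["I", "loved", "this", "movie"]
transitions = [SHIFT, SHIFT, SHIFT, SHIFT, REDUCE, REDUCE, REDUCE]
print(build_tree(words, transitions))
# ('I', ('loved', ('this', 'movie')))
```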

98 Transition sequence example (mini-batched)

Slide 104

99 Transition sequence example (mini-batched)

Slide 105

100 Transition sequence example (mini-batched)

Slide 106

101 Transition sequence example (mini-batched)

Slide 107

102 Transition sequence example (mini-batched)

Slide 108

103 Transition sequence example (mini-batched)

Slide 109

104 Optional approach: Sentence + Sentiment + Syntax + Node-level sentiment

Slide 110

105 Title

Slide 111

106 Recap

Slide 112

107 Title

Slide 113

108 Input

Slide 114

109 Recap: Activation functions

Slide 115

110 Introduction: Intuition to solving the vanishing gradient

Slide 116

111 Introduction: A small improvement

Slide 117

112 Child-Sum Tree LSTM

Slide 118

113 A naive recursive NN

Slide 119

114 SGD vs GD

Slide 120