Compositional semantics and sentence representations

Description of this Post
Author
Published

November 22, 2023

Slide 1


Slide 2

1 Compositional semantics

Slide 3

"Deep" here means a deeper understanding of language, not a (deep) neural network.

2 Compositional semantics alongside syntax

Slide 4

We want to model semantics alongside syntax: first word meaning, then phrase meaning, then sentence meaning.

3 Non-trivial issues with semantic composition

Slide 5

In the first sentence, "it" may refer to something concrete, e.g. a dog. In the second, "it" does not refer to anything. So even though the two sentences have the same syntactic structure, they have different meanings.

The problem with the last example is that idiomatic phrases carry a meaning of their own: "kick the bucket" can mean that a person has passed away, but sometimes we also use it literally to say that someone actually kicked a bucket.

4 Non-trivial issues with semantic composition

Slide 6

5 Issues with semantic composition

Slide 7

This represents recursion.

6 Modelling compositional semantics

Slide 8

These are two modelling frameworks:

  1. Composition done directly in vector space. These are unsupervised, general-purpose methods; they capture the meaning of a word based on its similarity to other words.

  2. Representations trained in a supervised way, which means you need a task to provide the learning signal, e.g. sentiment classification. If you train your representations for sentiment analysis and then want to use them for translation, this will not work well, because they were trained for a different task.

Slide 9

7 Compositional distributional semantics

These are all general-purpose, unsupervised approaches.

Slide 10

The idea came up naturally: we were successful at creating word representations, so why not phrase representations, and then why not sentence-level representations?

Even with a finite vocabulary you can still create an infinite number of sentences.

It is infeasible to enumerate every possible sentence and learn a representation for each one directly. But we can do something similar: instead of learning sentence representations directly, we take the word representations of the words in the sentence and compose them into a sentence representation.

In principle you then only need representations for all words, which is far more feasible than learning representations for all sentences.

8 Vector mixture models

Slide 11

9 Additive and multiplicative models

Slide 12

Because summation is commutative, different word orders give the same representation, so this model has a flaw.

Prepositions are used flexibly in many positions, which means they do not have a strong behavioural profile; content words do: they appear in similar positions and co-occur in similar contexts.
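
To make the flaw concrete, here is a minimal NumPy sketch (word vectors made up for illustration) of the additive and multiplicative mixture models; because addition and element-wise multiplication are commutative, "dog bites man" and "man bites dog" receive identical representations.

```python
import numpy as np

# Toy word vectors (made up for illustration).
dog   = np.array([0.8, 0.1, 0.3])
bites = np.array([0.2, 0.9, 0.4])
man   = np.array([0.7, 0.2, 0.5])

def additive(*vecs):
    """Additive mixture model: sum of the word vectors."""
    return np.sum(vecs, axis=0)

def multiplicative(*vecs):
    """Multiplicative mixture model: element-wise product of the word vectors."""
    return np.prod(vecs, axis=0)

s1 = additive(dog, bites, man)        # "dog bites man"
s2 = additive(man, bites, dog)        # "man bites dog"
print(np.allclose(s1, s2))            # True: word order is lost

m1 = multiplicative(dog, bites, man)
m2 = multiplicative(man, bites, dog)
print(np.allclose(m1, m2))            # True: same problem
```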

10 Lexical function models

Slide 13

11 Lexical function models

Slide 14

12 Learning adjective matrices

Slide 15

13 Learning adjective matrices

Slide 16


14 Title

Slide 17

15 Title

Slide 18

16 Title

Slide 19

17 Task: Sentiment classification of movie reviews

Slide 20

18 Words (and sentences) into vectors

Slide 21

19 Sentence representation: A (very) simplified picture

Slide 22

20 Title

Slide 23

21 Dataset: Stanford Sentiment Treebank (SST)

Slide 24

22 Binary parse tree: One example

Slide 25

23 Title

Slide 26

24 Models

Slide 27

25 First approach: Sentence + Sentiment

Slide 28

26 Title

Slide 29

27 Title

Slide 30

Here we do not model order or syntax

28 Bag of Words

Slide 31

29 Bag of Words

Slide 32

Because word order is not considered, the two example sentences get the same representation, so this model has a flaw.

30 Turning words into numbers

Slide 33

31 One-hot vectors select word embeddings

Slide 34

32 Title

Slide 35

33 Title

Slide 36

Because a higher-dimensional vector can squeeze more detail about the word into the embedding, this increase in dimensionality helps.

34 Continuous Bag of Words (CBOW)

Slide 37

35 Recall: Matrix Multiplication

Slide 38

36 What about this?

Slide 39

The problem with simply concatenating the word embeddings is that you do not know what size the matrix W should be, because sentences vary in length and so does the concatenated vector.
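
One way around this, which is what CBOW does on the following slides, is to sum the word embeddings instead of concatenating them, so the input to W always has the same size regardless of sentence length. A minimal PyTorch sketch, with hypothetical sizes (vocabulary of 1000, 300-dimensional embeddings, 5 sentiment classes):

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, num_classes = 1000, 300, 5  # hypothetical sizes

embed = nn.Embedding(vocab_size, emb_dim)
W = nn.Linear(emb_dim, num_classes)  # fixed input size, regardless of sentence length

def cbow_logits(token_ids: torch.Tensor) -> torch.Tensor:
    """Sum the word embeddings into one fixed-size vector, then project to class scores."""
    summed = embed(token_ids).sum(dim=0)   # (emb_dim,)
    return W(summed)                        # (num_classes,)

short = torch.tensor([5, 42, 7])             # 3-word sentence
long_ = torch.tensor([5, 42, 7, 99, 3, 12])  # 6-word sentence
print(cbow_logits(short).shape, cbow_logits(long_).shape)  # both torch.Size([5])
```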

37 What about this?

Slide 40

38 Title

Slide 41

39 Title

Slide 42

Here we simply add more layers, giving the model more capacity.

Slide 43

40 What about this?

Slide 44

Deeper is not always better because we might start to overfit, and at test time it will not generalize well

41 Question

Slide 45

42 Title

Slide 46

43 Deep CBOW with pretrained embeddings

Slide 47

It is easier if the model already knows the word meanings; it can then focus on getting the sentiment of the sentence. The pretrained word representations act as a prior.

44 Title

Slide 48

There are two paradigms for using pretrained embeddings (see the sketch below):

  1. You can keep the representations frozen, i.e. they are not updated during training.
  2. Or you can fine-tune the word representations together with your task, which means the word representations become more specialized for that task.
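
In PyTorch the two paradigms differ in a single flag. A minimal sketch, assuming `pretrained` is a tensor of pretrained vectors (e.g. word2vec or GloVe) already aligned with your vocabulary; the random tensor here is just a placeholder:

```python
import torch
import torch.nn as nn

# Placeholder for pretrained vectors, one row per word in the vocabulary.
pretrained = torch.randn(1000, 300)

# Paradigm 1: keep the embeddings frozen (not updated during training).
frozen_embed = nn.Embedding.from_pretrained(pretrained, freeze=True)

# Paradigm 2: fine-tune the embeddings together with the task.
tuned_embed = nn.Embedding.from_pretrained(pretrained, freeze=False)

print(frozen_embed.weight.requires_grad)  # False
print(tuned_embed.weight.requires_grad)   # True
```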

45 Recap: Training a neural network

Slide 49

46 Cross Entropy Loss

Slide 50

47 Softmax

Slide 51

48 Title

Slide 52

Feed-forward NNs were not able to capture word-order information; an RNN can.

Here each word is conditioned on the previous words.

49 Introduction: Recurrent Neural Network (RNN)

Slide 53

50 Introduction: Recurrent Neural Network (RNN)

Slide 54

Six words, six time steps.

51 Introduction: Recurrent Neural Network (RNN)

Slide 55

52 Introduction: Unfolding the RNN

Slide 56

The W and R matrices are the same at every time step; they are shared.
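
A minimal NumPy sketch of the unrolled recurrence, emphasizing that the same W and R are reused at every time step (sizes made up for illustration):

```python
import numpy as np

emb_dim, hid_dim = 4, 3                  # made-up sizes
rng = np.random.default_rng(0)

W = rng.normal(size=(hid_dim, emb_dim))  # input-to-hidden, shared across time steps
R = rng.normal(size=(hid_dim, hid_dim))  # hidden-to-hidden, shared across time steps
b = np.zeros(hid_dim)

def rnn_unroll(xs):
    """Apply h_t = tanh(W x_t + R h_{t-1} + b) for each word vector x_t."""
    h = np.zeros(hid_dim)
    for x in xs:                         # one iteration per time step (per word)
        h = np.tanh(W @ x + R @ h + b)
    return h                             # final hidden state

sentence = [rng.normal(size=emb_dim) for _ in range(6)]  # 6 words -> 6 time steps
print(rnn_unroll(sentence).shape)        # (3,)
```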

53 Introduction: Making a prediction

Slide 57

When you reach the end of the sentence, you use the hidden vector at that last time step as the sentence representation. We can do this because the last time step has been influenced by the entire history. We then project it to a 5-dimensional vector (one dimension per sentiment class) and take the softmax/argmax over this representation to make the prediction.
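
A minimal PyTorch sketch of this pipeline, with hypothetical sizes; the final hidden state is projected to 5 sentiment classes:

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hid_dim, num_classes = 1000, 300, 128, 5  # hypothetical sizes

embed = nn.Embedding(vocab_size, emb_dim)
rnn = nn.RNN(emb_dim, hid_dim, batch_first=True)
out_proj = nn.Linear(hid_dim, num_classes)

token_ids = torch.tensor([[5, 42, 7, 99, 3, 12]])  # one sentence, 6 time steps
_, h_last = rnn(embed(token_ids))                  # h_last: (1, batch, hid_dim)
logits = out_proj(h_last.squeeze(0))               # (batch, num_classes)
prediction = logits.argmax(dim=-1)                 # predicted sentiment class
print(logits.shape, prediction)
```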

54 Introduction: The vanishing gradient problem

Slide 58

RNNs tend to suffer from the vanishing gradient problem.

55 Introduction: The vanishing gradient problem

Slide 59

Here the number of unrolls N refers to the number of words in the sentence.

56 What about this?

Slide 60

57 RNN vs ANN

Slide 61

In a feed-forward ANN you have different parameter matrices in each layer, so the effects can partly cancel out. However, even in an ANN you can run into vanishing/exploding gradients.

Slide 62

58 Long Short-Term Memory (LSTM)

CV: ResNet skip connections to alleviate exploding/vanishing gradients.

NLP: LSTMs were introduced.

Slide 63

LSTMs are good at dealing with long-term dependencies because they are able to cope with exploding/vanishing gradients.

59 LSTM: Core idea

Slide 64

The cell state is supposed to capture the long-term information in the sentence.

This is good because backpropagation through time then has a largely uninterrupted gradient flow.

60 LSTMs

Slide 65

Now the recurrent unit is the LSTM, so each copy contains an LSTM cell. Where before we had a single layer, the cell now has four different layers interacting with each other.

61 LSTM cell

Slide 66

3 gates:

  • forget gate
  • input gate
  • output gate

c_t is the memory cell. No weight matrices are applied to it; we only do element-wise multiplication and addition. This is why people call it a conveyor belt: we forget information and we add information, but the information flows through uninterrupted.
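
In symbols (standard LSTM notation, with \(\odot\) denoting element-wise multiplication), the conveyor-belt update of the memory cell and the resulting hidden state are:

\[
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \qquad h_t = o_t \odot \tanh(c_t)
\]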

62 LSTM: Cell state

Slide 67

63 LSTM: Forget gate

Slide 68

64 LSTM: Candidate cell

Slide 69

65 LSTM: Input gate

Slide 70

66 LSTM

Slide 71

67 LSTM: Output gate

Slide 72

Here we decide what we want to expose from the long-term memory. The result of this is the new hidden (output) vector.

67.1 Recap


  1. Forget gate: take the input word x_t and the previous hidden state h_{t-1}, apply a sigmoid, and multiply the result element-wise with the cell state (the memory).

Multiplying by 1 means you want to keep those items in memory; multiplying by 0 means you want to discard them.

  2. Candidate and input gate: multiply the output of the tanh, which gives candidate values between -1 and 1, with the sigmoid values from the input gate. With this we selectively add new information to the memory cell.

  3. Output gate: take the cell memory squashed to values between -1 and 1 (via tanh) and multiply it by a sigmoid computed from the input word x_t and the previous state h_{t-1}.

  • Steps 1 & 2 update the long-term memory (cell state).
  • Step 3 produces the short-term memory (hidden state), a filtered version of the long-term memory (see the sketch below).
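
Putting the recap together, here is a minimal NumPy sketch of a single LSTM cell step (note the gates use a sigmoid; the sizes are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step: gates from [x_t, h_prev], then conveyor-belt cell update."""
    z = np.concatenate([x_t, h_prev])
    f = sigmoid(params["Wf"] @ z + params["bf"])       # forget gate: what to erase
    i = sigmoid(params["Wi"] @ z + params["bi"])       # input gate: how much candidate to add
    c_hat = np.tanh(params["Wc"] @ z + params["bc"])   # candidate values in (-1, 1)
    o = sigmoid(params["Wo"] @ z + params["bo"])       # output gate: what to expose
    c_t = f * c_prev + i * c_hat                       # long-term memory (cell state)
    h_t = o * np.tanh(c_t)                             # short-term memory (hidden state)
    return h_t, c_t

emb_dim, hid_dim = 4, 3                                # made-up sizes
rng = np.random.default_rng(0)
params = {k: rng.normal(size=(hid_dim, emb_dim + hid_dim)) for k in ("Wf", "Wi", "Wc", "Wo")}
params.update({k: np.zeros(hid_dim) for k in ("bf", "bi", "bc", "bo")})

h, c = np.zeros(hid_dim), np.zeros(hid_dim)
for x in [rng.normal(size=emb_dim) for _ in range(6)]:  # 6 words -> 6 steps
    h, c = lstm_step(x, h, c, params)
print(h.shape, c.shape)
```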

68 Long Short-Term Memory (LSTM)

Slide 73

  • Cell state is the long term memory
  • Hidden state is your short term memory

69 LSTMs: Applications & Success in NLP

Slide 74


Slide 75

70 Summary of models seen so far

Slide 76

Sequence models:

  • RNN
  • LSTM

Tree-structured models are the ones that are also sensitive to the syntactic structure.

71 Second approach: Sentence + Sentiment + Syntax

Slide 77

72 Exploiting tree structure

Slide 78

Compositionality is the idea that the meaning of a sentence is derived from the meanings of the individual words together with the rules that combine them; the word meanings alone are not enough.

In the models seen so far, we derived the sentence representation from the individual words but did not take the syntactic structure into account. Tree LSTMs allow us to do both: we use the meanings of the words and also the rules that combine them.

73 Why would it be useful?

Slide 79

74 Constituency Parse

Slide 80

75 Recurrent vs Tree Recursive NN

Slide 81

Recurrent NNs include LSTMs, but there are also tree RNNs, which are recursive.

If you input "I loved this movie" into an RNN, you cannot model a phrase independently of the words that precede it in the sentence. For instance, the representation of "this movie" depends on having seen "I loved", which means you cannot extract separate phrase representations from your sentence representation.

This is different in the tree-recursive NN: you explicitly compose "this movie" into a phrase representation first, and only then combine it with the preceding words as you go up the tree.
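
A minimal sketch of this difference using naive recursive composition, \(p = \tanh(W[\text{left}; \text{right}])\), as on the "naive recursive NN" slide; each internal node gets its own phrase vector, which the sequential RNN does not give us (vectors and sizes made up):

```python
import numpy as np

dim = 3
rng = np.random.default_rng(0)
W = rng.normal(size=(dim, 2 * dim))   # shared composition matrix
b = np.zeros(dim)

def compose(left, right):
    """Naive recursive NN: combine two child vectors into a parent/phrase vector."""
    return np.tanh(W @ np.concatenate([left, right]) + b)

# Toy word vectors for "I loved this movie".
I, loved, this, movie = (rng.normal(size=dim) for _ in range(4))

this_movie = compose(this, movie)          # phrase representation on its own
loved_this_movie = compose(loved, this_movie)
sentence = compose(I, loved_this_movie)    # root = sentence representation
print(this_movie.shape, sentence.shape)
```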

76 Tree Recursive NN

Slide 82

77 Practical II data set: Stanford Sentiment Treebank (SST)

Slide 83

78 Tree LSTMs: Generalize LSTM to tree structure

Slide 84

We can input multiple children at each composition step.

79 Tree LSTMs

Slide 85

  1. Child-Sum Tree LSTM: you can use any number of children, but you lose child-order information.

  2. N-ary Tree LSTM: a fixed number of children per node; in the practical we use a binary parse tree.

80 Child-Sum Tree LSTM

Slide 86

81 Child-Sum Tree LSTM

Slide 87

82 N-ary Tree LSTM

Slide 88

That means each child has to be fed to the model separately, because each child position has its own parameter matrices; the children are not simply summed up.

83 N-ary Tree LSTM

Slide 66

Slide 89

84 N-ary Tree LSTM

Slide 90

\(u_j\) is the candidate gate.

For each child \(h_j\) we have a separate parameter matrix, and the contributions are summed.
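
For reference, a sketch of the binary (N = 2) Tree-LSTM update along the lines of Tai et al. (2015), with the word-input term dropped under the assumption that words enter only at the leaves; at an internal node \(j\) with children \((h_{j1}, c_{j1})\) and \((h_{j2}, c_{j2})\):

\[
\begin{aligned}
i_j &= \sigma\Big(\sum_{l=1}^{2} U^{(i)}_l h_{jl} + b^{(i)}\Big), &
o_j &= \sigma\Big(\sum_{l=1}^{2} U^{(o)}_l h_{jl} + b^{(o)}\Big), \\
f_{jk} &= \sigma\Big(\sum_{l=1}^{2} U^{(f)}_{kl} h_{jl} + b^{(f)}\Big), &
u_j &= \tanh\Big(\sum_{l=1}^{2} U^{(u)}_l h_{jl} + b^{(u)}\Big), \\
c_j &= i_j \odot u_j + \sum_{l=1}^{2} f_{jl} \odot c_{jl}, &
h_j &= o_j \odot \tanh(c_j).
\end{aligned}
\]

Each gate has a separate matrix \(U_l\) per child position \(l\), which is what distinguishes this from the Child-Sum variant.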

Slide 73

85 LSTMs vs Tree-LSTMs

Slide 91

The Tree-LSTM is the more general model: a standard LSTM is just a Tree-LSTM in which every node has exactly one child. So with one child per node you recover the ordinary sequential LSTM.

86 Title

Slide 92

87 Title

Slide 93

88 Building a tree with a transition sequence

Slide 94

89 Transition sequence example

Slide 95

90 Transition sequence example

Slide 96

91 Transition sequence example

Slide 97

92 Transition sequence example

Slide 98

93 Transition sequence example

Slide 99

94 Transition sequence example

Slide 100

95 Transition sequence example

Slide 101

96 Transition sequence example

Slide 102

97 Title

Slide 103

Because we process the transitions in sequence, pushing things onto the stack and then reducing them into the tree, we cannot do this in parallel, so it is slow. Thus we want to mini-batch, i.e. process multiple sentences at the same time.
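
A minimal sketch of how a SHIFT/REDUCE transition sequence builds a tree with a stack and a buffer (the sentence, the transition sequence, and the nested-tuple tree representation are made up for illustration):

```python
# Minimal SHIFT/REDUCE interpreter: SHIFT moves the next word onto the stack,
# REDUCE pops the top two items and combines them into one subtree.
SHIFT, REDUCE = "SHIFT", "REDUCE"

def build_tree(words, transitions):
    buffer = list(words)   # words still to be read, left to right
    stack = []             # partially built subtrees
    for t in transitions:
        if t == SHIFT:
            stack.append(buffer.pop(0))
        else:  # REDUCE
            right = stack.pop()
            left = stack.pop()
            stack.append((left, right))
    assert len(stack) == 1 and not buffer
    return stack[0]

words = ["I", "loved", "this", "movie"]
transitions = [SHIFT, SHIFT, SHIFT, SHIFT, REDUCE, REDUCE, REDUCE]
print(build_tree(words, transitions))
# ('I', ('loved', ('this', 'movie')))
```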

98 Transition sequence example (mini-batched)

Slide 104

99 Transition sequence example (mini-batched)

Slide 105

100 Transition sequence example (mini-batched)

Slide 106

101 Transition sequence example (mini-batched)

Slide 107

102 Transition sequence example (mini-batched)

Slide 108

103 Transition sequence example (mini-batched)

Slide 109

104 Optional approach: Sentence + Sentiment + Syntax + Node-level sentiment

Slide 110

105 Title

Slide 111

106 Recap

Slide 112

107 Title

Slide 113

108 Input

Slide 114

109 Recap: Activation functions

Slide 115

110 Introduction: Intuition to solving the vanishing gradient

Slide 116

111 Introduction: A small improvement

Slide 117

112 Child-Sum Tree LSTM

Slide 118

113 A naive recursive NN

Slide 119

114 SGD vs GD

Slide 120