Term-based Retrieval

Author
Published

February 13, 2024


1 Title

Slide 1

2 Document representation and matching

Slide 2

3 Outline

Slide 3

4 Outline

Slide 4

5 Documents as vectors

Slide 5

The documents are the columns of the matrix, and each column is a sparse vector. In this example the documents are books, and the rows are the terms (here, the characters appearing in them).

6 Match using cosine similarity

Slide 6

There is a difference between cosine similarity and cosine distance.

With similarity, a higher number means closer. For length-normalized vectors, the cosine similarity is just the dot product of the two vectors.
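As a minimal sketch (with made-up toy count vectors, not the slide's example), cosine similarity normalizes the dot product by the vector lengths:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

doc1 = [3, 0, 1]
doc2 = [3, 0, 1]
doc3 = [0, 2, 0]

print(cosine_similarity(doc1, doc2))  # same direction, approximately 1.0
print(cosine_similarity(doc1, doc3))  # no shared terms, 0.0
```

If the vectors are pre-normalized to unit length, the denominator is 1 and only the dot product remains, as the note says.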


7 Term frequency

Slide 7

This is more helpful because it tells us how salient a token is in a document.

8 Term frequency

Slide 8

\(tf_{t,d}\) is the frequency of token \(t\) in document \(d\)

9 Retrieval Axioms

9.1 TFC2

This captures diminishing returns: seeing a term at all is a strong signal, and seeing it 20 times is a bigger signal, but not 20 times bigger. Taking the log of the term frequency achieves this: a document with 21 occurrences scores about the same as one with 20, because after the log both are nearly equal.

We also penalize stop words: we do not want to retrieve a document just because it contains many stop words.
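One common damping is \(\log(1 + tf)\); the exact variant on the slide may differ, but the diminishing-returns effect is the same:

```python
import math

def log_tf(tf):
    # Log-damped term frequency: the first occurrence of a term
    # matters far more than the twenty-first.
    return math.log(1 + tf)

print(log_tf(1) - log_tf(0))    # ~0.69: seeing the term at all is a big jump
print(log_tf(21) - log_tf(20))  # ~0.05: one extra occurrence barely matters
```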

10 Slide 10

LNC (length normalization constraints): adding more pages should not increase the score; between documents that match the same terms, we prefer the shorter one.

Term proximity: for the query "University of Amsterdam", we want the query terms to appear close to each other in the document.

11 Inverse document frequency

Slide 9

In the BM25 assignment there is a more complicated way to calculate \(IDF\).

  • Term frequency \(TF\), is a proxy for term’s importance in a document

  • Inverse document frequency \(IDF\), is a proxy for a term’s descriptiveness. So how happy should your ranking function be when you see this term.

    • For instance, for the term ‘the’ it is not really happy; in contrast, for ‘ipad 12X34’ it is very happy, because that term does not occur very often.
    • log(200/100) = 0.30, in contrast to log(200/1) = 2.30.
    • For instance, if a term occurs in every document, you get log(200/200) = 0.

\(df_t\) is the document frequency of token \(t\). For instance, for the token ‘run’, df(run) = 8 means the word ‘run’ appears in 8 documents.

12 Inverse document frequency

Slide 10

Calpurnia is a very rare term that occurs in only a few documents, so its \(IDF\) is higher.

13 TF-IDF

Slide 11

Seeing a term once versus ten times in a document tells us how salient the term is there; combined with how rare the term is across the collection, this tells us how useful the term is for finding that document.
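As a minimal sketch (toy documents of my own, and the plain \(tf \cdot \log_{10}(N/df)\) weighting consistent with the log(200/200) = 0 example above; the slide's exact variant may differ):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Per-document tf-idf weights: tf * log10(N / df)."""
    N = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log10(N / df[t]) for t in tf})
    return weights

docs = [["the", "ipad", "runs"], ["the", "cat", "runs"], ["the", "dog"]]
w = tf_idf(docs)
print(w[0])  # 'the' occurs in every document, so its weight is 0
```

Note how ‘the’ gets weight 0 because it appears in all documents, while ‘ipad’ keeps a positive weight.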

14 TF-IDF summary

Slide 12

15 Outline

Slide 13

16 Outline

Slide 14

17 Language model

Slide 15

18 Unigram language model example

Slide 16

19 Documents as distributions

Slide 17

What if a query term does not appear in the document, even though the document is clearly about, say, Holland? With a pure maximum-likelihood model we would then not retrieve this document, which is clearly wrong.
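A minimal sketch of that failure mode, with a made-up toy document: one unseen query term zeroes the whole query likelihood.

```python
from collections import Counter

def mle_unigram(doc):
    """Maximum-likelihood unigram model: P(t | d) = tf / |d|."""
    counts = Counter(doc)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def query_likelihood(query, model):
    # Product of P(t | d); a single unseen term makes the score 0.
    p = 1.0
    for t in query:
        p *= model.get(t, 0.0)
    return p

doc = ["amsterdam", "is", "a", "city", "in", "holland"]
model = mle_unigram(doc)
print(query_likelihood(["city", "holland"], model))      # positive
print(query_likelihood(["city", "netherlands"], model))  # 0.0: unseen term kills the match
```

Smoothing (later in the outline) exists precisely to avoid these zeros.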

20 Match using query likelihood model (QLM)

Slide 18

21 Match using KL-divergence

Slide 19

The lower the divergence, the higher the similarity.

22 Outline

Slide 20

23 Jelinek-Mercer smoothing

Slide 21

24 Dirichlet smoothing

Slide 22
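A sketch of the two smoothing schemes side by side, with a toy collection and illustrative \(\lambda\) and \(\mu\) values of my own (not from the slides): Jelinek-Mercer interpolates the document and collection models with a fixed \(\lambda\); Dirichlet adds \(\mu\) pseudo-counts from the collection model, so longer documents lean less on it.

```python
from collections import Counter

# Toy collection; lambda_ and mu are illustrative choices.
docs = [["amsterdam", "city"], ["amsterdam", "university"], ["city", "park"]]
collection = [t for d in docs for t in d]
p_c = {t: c / len(collection) for t, c in Counter(collection).items()}

def jelinek_mercer(t, doc, lambda_=0.5):
    # (1 - lambda) * P_mle(t|d) + lambda * P(t|C)
    tf = doc.count(t)
    return (1 - lambda_) * tf / len(doc) + lambda_ * p_c.get(t, 0.0)

def dirichlet(t, doc, mu=2.0):
    # (tf + mu * P(t|C)) / (|d| + mu)
    tf = doc.count(t)
    return (tf + mu * p_c.get(t, 0.0)) / (len(doc) + mu)

doc = docs[0]
print(jelinek_mercer("university", doc))  # nonzero despite tf = 0
print(dirichlet("university", doc))       # likewise nonzero
```

In both cases an unseen term gets a small but nonzero probability, fixing the zero-probability problem noted earlier.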

25 Language modeling for IR summary

Slide 23

26 Outline

Slide 24

27 BM25

Slide 25

28 BM25

Slide 26
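Since the slide itself is not transcribed here, a sketch of the standard Okapi BM25 scoring function (toy documents and the common k1 = 1.5, b = 0.75 defaults are my assumptions; the lecture's exact variant may differ):

```python
import math
from collections import Counter

def bm25_score(query, doc, docs, k1=1.5, b=0.75):
    """Okapi BM25 with the +1-inside-the-log IDF variant."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    tf = Counter(doc)
    score = 0.0
    for t in query:
        df = sum(1 for d in docs if t in d)
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        # Saturating tf, normalized by document length relative to the average.
        denom = tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * tf[t] * (k1 + 1) / denom
    return score

docs = [["amsterdam", "university", "library"],
        ["amsterdam", "canal", "tour"],
        ["library", "opening", "hours"]]
print(bm25_score(["university", "amsterdam"], docs[0], docs))
```

Note the two ingredients from earlier sections: a saturating term-frequency component (TFC2) with length normalization (LNC), and the more elaborate IDF mentioned in the assignment.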

29 BM25 for long queries

Slide 27

30 Experimental comparison

Slide 28

31 Experimental comparison

Slide 29

32 Content-based retrieval summary

Slide 30

33 Materials

Slide 31