1 Title
2 Document representation and matching
3 Outline
4 Outline
5 Documents as vectors
The documents are the columns of the term-document matrix, and these columns are sparse vectors. In the example the documents are books and the rows are the terms (character names such as Calpurnia).
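A minimal sketch of such a term-document count matrix, with a made-up toy collection:

```python
from collections import Counter

# Toy collection: each book is a document, i.e. one column of the matrix.
docs = {
    "Julius Caesar": "caesar brutus caesar calpurnia",
    "Hamlet":        "brutus caesar mercy",
    "The Tempest":   "mercy worser worser",
}

# Rows are terms, columns are documents; entries are raw counts.
vocab = sorted({t for text in docs.values() for t in text.split()})
counts = {name: Counter(text.split()) for name, text in docs.items()}

for term in vocab:                          # print one row per term
    row = [counts[name][term] for name in docs]
    print(f"{term:>10}: {row}")
```

Each column of this matrix, read top to bottom, is the sparse vector for one document.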
6 Match using cosine similarity
There is a difference between cosine similarity and distance: with a similarity, a higher number means a closer match (with a distance, a lower number does). For length-normalized vectors, the cosine similarity is just the dot product of the two vectors.
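A minimal sketch of cosine similarity over sparse vectors (pure Python; the toy vectors are made up):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors (dicts term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v)

d1 = {"caesar": 2, "brutus": 1}
d2 = {"caesar": 1, "mercy": 1}
print(cosine_similarity(d1, d2))  # higher = more similar
```

When both vectors are normalized to unit length, the denominator is 1 and the score reduces to the plain dot product, as noted above.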
7 Term frequency
This is more helpful because it tells us how salient a token is in a document.
8 Term frequency
\(tf_{t,d}\) is the frequency of token \(t\) in document \(d\)
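A common sub-linear weighting of the raw count (assuming the standard log-tf scheme; the slides may use a different variant), which also matters for the TFC2 axiom below:

\[
w_{t,d} =
\begin{cases}
1 + \log_{10} tf_{t,d} & \text{if } tf_{t,d} > 0 \\
0 & \text{otherwise}
\end{cases}
\]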
9 Retrieval Axioms
9.1 TFC2
The signal from seeing a term once versus seeing it 20 times is large. But if we also have a document where the term occurs 21 times, the log makes it score about the same as the document with 20 occurrences, because both have roughly 20 occurrences (diminishing returns).
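Worked numbers (base-10 logs, matching the IDF examples later in these notes):

\[
\log_{10} 2 - \log_{10} 1 \approx 0.301,
\qquad
\log_{10} 21 - \log_{10} 20 \approx 0.021
\]

So the first occurrence of a term moves the score far more than the 21st.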
We penalize stop words: we do not want to retrieve a document just because it contains many stop words.
10 Slide 10
LNCs (length normalization constraints): adding more pages should not increase the score; between two documents that match the query equally well, we prefer the shorter one.
For the query 'University of Amsterdam', we want the terms of the query to be close to each other in the document (term proximity).
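For reference, a formal statement of the first length normalization constraint from the axiomatic retrieval literature (LNC1; the slides may phrase it differently): if \(d_2\) is \(d_1\) plus one extra occurrence of a word that is not in the query \(q\), then

\[
S(q, d_1) \ge S(q, d_2)
\]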
11 Inverse document frequency
In the BM25 assignment there is a more complicated way to calculate \(IDF\).
Term frequency \(TF\) is a proxy for a term's importance in a document.
Inverse document frequency \(IDF\) is a proxy for a term's descriptiveness: how happy should your ranking function be when it sees this term?
- For instance, the term 'the' makes it not very happy; in contrast, 'ipad 12X34' makes it very happy, because that term occurs in very few documents. With \(N = 200\) documents: \(\log(200/100) = 0.30\), in contrast to \(\log(200/1) = 2.30\).
- For instance, if you have a term that occurs in every document, you would have \(\log(200/200) = 0\).
\(df_t\) is the document frequency of token \(t\). For instance, for token = 'run', \(df(\text{run}) = 8\) means the word 'run' appears in 8 documents.
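A minimal sketch of the basic \(idf_t = \log_{10}(N/df_t)\) computation used in the examples above (toy documents made up; the BM25 variant is more involved):

```python
import math

def idf(term, docs):
    """Basic inverse document frequency: log10(N / df_t)."""
    n = len(docs)
    df = sum(1 for d in docs if term in d)   # document frequency of the term
    return math.log10(n / df)

docs = [{"the", "run"}, {"the"}, {"the", "run"}, {"the"}]  # 4 documents as term sets
print(idf("the", docs))   # 0.0   -> occurs in every document, not descriptive
print(idf("run", docs))   # ~0.30 -> rarer, more descriptive
```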
12 Inverse document frequency
Calpurnia is a very rare term that occurs in only a few documents, so its \(IDF\) is higher.
13 TF-IDF
One time versus ten times in the document tells us how strongly the document features the term; \(IDF\) tells us how useful it is to find that term at all. TF-IDF combines the two signals.
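A minimal sketch of TF-IDF scoring for a query, combining the log-tf and idf pieces above (the exact weighting variant on the slides may differ; the toy data is made up):

```python
import math
from collections import Counter

def tf_idf_score(query, doc, docs):
    """Sum of (1 + log10 tf) * log10(N / df) over the query terms."""
    counts, n = Counter(doc), len(docs)
    score = 0.0
    for t in query:
        tf = counts[t]
        if tf == 0:
            continue                         # absent term contributes nothing
        df = sum(1 for d in docs if t in d)  # document frequency of t
        score += (1 + math.log10(tf)) * math.log10(n / df)
    return score

docs = [["caesar", "brutus"], ["caesar", "caesar", "calpurnia"], ["mercy"]]
print(tf_idf_score(["caesar", "calpurnia"], docs[1], docs))
```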
14 TF-IDF summary
15 Outline
16 Outline
17 Language model
18 Unigram language model example
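A minimal sketch of the kind of example this slide likely walks through: a maximum-likelihood unigram model estimated from a single document (the toy text is made up):

```python
from collections import Counter

def unigram_lm(doc):
    """Maximum-likelihood unigram model: p(t | d) = tf_{t,d} / |d|."""
    counts = Counter(doc)
    total = len(doc)
    return {t: c / total for t, c in counts.items()}

doc = "amsterdam is a city amsterdam lies in holland".split()
lm = unigram_lm(doc)
print(lm["amsterdam"])   # 2/8 = 0.25
print(sum(lm.values()))  # probabilities sum to 1.0
```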
19 Documents as distributions
What if a term does not appear in the document, even though the document is clearly about, say, Holland? Then we would not retrieve the document, which is clearly not what we want.
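The standard fix is smoothing: mix the document model with a collection (background) model so that unseen terms keep a small nonzero probability. One common choice is Jelinek-Mercer smoothing (the slides may use a different scheme, e.g. Dirichlet):

\[
p(t \mid d) = (1 - \lambda)\,\frac{tf_{t,d}}{|d|} + \lambda\, p(t \mid C)
\]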
20 Match using query likelihood model (QLM)
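A minimal sketch of query likelihood scoring with the Jelinek-Mercer smoothing above, summing log-probabilities to avoid underflow (\(\lambda = 0.1\) and the toy texts are arbitrary choices here):

```python
import math
from collections import Counter

def qlm_score(query, doc, collection, lam=0.1):
    """log p(q | d) under a Jelinek-Mercer smoothed unigram model."""
    doc_counts, doc_len = Counter(doc), len(doc)
    col_counts, col_len = Counter(collection), len(collection)
    score = 0.0
    for t in query:
        p_doc = doc_counts[t] / doc_len  # p(t | d), zero if t unseen in doc
        p_col = col_counts[t] / col_len  # p(t | C), the background model
        # assumes each query term occurs somewhere in the collection
        score += math.log((1 - lam) * p_doc + lam * p_col)
    return score

collection = "amsterdam holland city canal city tulip".split()
doc = "amsterdam city canal".split()
print(qlm_score(["amsterdam", "holland"], doc, collection))
```

Note that 'holland' does not occur in the document, yet the document still gets a finite score thanks to the collection probability.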
21 Match using KL-divergence
The lower the divergence, the higher the similarity: KL-divergence measures how different two distributions are, so we rank documents by low (or negated) divergence.
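For reference, the KL-divergence from the query model \(\theta_q\) to the document model \(\theta_d\):

\[
D_{KL}(\theta_q \,\|\, \theta_d) = \sum_t p(t \mid \theta_q)\, \log \frac{p(t \mid \theta_q)}{p(t \mid \theta_d)}
\]

With a maximum-likelihood query model, ranking by \(-D_{KL}\) gives the same ranking as the query likelihood model.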