03 - Content-based Filtering Flashcards
1
Q
How can textual data be represented?
A
- Tabular data
- Vectors
- Points
2
Q
What does tabular data for recommender systems look like?
A
- Every document/instance is represented as one row of a table/matrix or as a vector
- Every column (feature) corresponds to a term
- All documents are vectors in a vector space
- Every term corresponds to one dimension in the vector space
- Every instance represents one feature vector or point in an n-dimensional vector space
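The row/column layout described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the toy documents, the whitespace tokenisation, and the names `docs`, `vocabulary`, `to_vector` and `matrix` are all assumptions made for the example.

```python
# Toy corpus; the documents are made up for illustration.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

# Vocabulary: one column (dimension) per distinct term, in sorted order.
vocabulary = sorted({term for doc in docs for term in doc.split()})


def to_vector(doc, vocabulary):
    """Return the raw term counts of `doc`, one entry per vocabulary term."""
    tokens = doc.split()
    return [tokens.count(term) for term in vocabulary]


# One row per document, one column per term.
matrix = [to_vector(doc, vocabulary) for doc in docs]

for doc, row in zip(docs, matrix):
    print(row, "<-", doc)
```

Each row is a point in a vector space with `len(vocabulary)` dimensions; raw counts are used here only because they are the simplest possible cell value.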
3
Q
How can the relevance of a document be measured?
A
- Euclidean Distance (L2 Norm)
- Manhattan Distance (L1 Norm)
- Cosine Similarity (Cosine Distance = 1 - Cosine Similarity)
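As a rough sketch, the three measures can be written directly from their definitions. The function names are chosen for this example, and returning 0.0 for a zero vector in the cosine case is a convention assumed here, not something the card specifies.

```python
import math


def euclidean_distance(a, b):
    """L2 norm of the difference vector."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a, b):
    """L1 norm of the difference vector."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b (assumed 0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

For the two distances, smaller means more similar; for cosine similarity, larger means more similar.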
4
Q
What are possible problems with the Euclidean Distance (L2 Norm)?
A
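- It is sensitive to document length: a long and a short document on the same topic can be far apart
- Totally unrelated documents can be close to each other, e.g. two short documents with few terms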
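Both problems can be reproduced with three hand-made count vectors. The vectors and names below are assumptions for illustration: `short_doc` and `long_doc` use the same two terms (the long one simply repeats them ten times), while `unrelated` uses two different terms.

```python
import math


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


short_doc = [1, 1, 0, 0]    # short document about topic X
long_doc = [10, 10, 0, 0]   # long document about the same topic X
unrelated = [0, 0, 1, 1]    # short document about a different topic

# Euclidean distance ranks the unrelated document as closer ...
print(euclidean_distance(short_doc, long_doc))
print(euclidean_distance(short_doc, unrelated))

# ... while cosine similarity ignores length and looks only at direction.
print(cosine_similarity(short_doc, long_doc))
print(cosine_similarity(short_doc, unrelated))
```

Because cosine similarity depends only on the angle between the vectors, it is unaffected by how long the documents are.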
5
Q
What is the Inverse Document Frequency (IDF)?
A
IDF(t) = log(N / df(t)), where N is the number of documents in the corpus D and df(t) is the number of documents in D that contain the searched term t
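A direct translation of the formula above might look like the following sketch. The corpus, the function name `idf`, and the choice of the natural logarithm are assumptions; the card does not fix a log base.

```python
import math

# Toy corpus: each document is a set of its terms (made up for illustration).
corpus = [
    {"the", "cat"},
    {"the", "dog"},
    {"the", "bird"},
]


def idf(term, corpus):
    """log(number of documents / number of documents containing `term`)."""
    n_docs = len(corpus)
    doc_freq = sum(1 for doc in corpus if term in doc)
    # Raises ZeroDivisionError if no document contains the term.
    return math.log(n_docs / doc_freq)


print(idf("the", corpus))  # term in every document -> 0.0
print(idf("cat", corpus))  # rare term -> larger value
```

A term that occurs in every document gets an IDF of 0, so it contributes nothing to relevance, while rare terms are weighted up.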
6
Q
What is TF-IDF?
A
- The weight of a term is the product of two factors: term frequency (TF) and inverse document frequency (IDF)
- IDF is specific to the corpus D and does not change for new documents
- TF is specific to each document
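The split between a corpus-level IDF and a per-document TF can be sketched as follows. Several details are assumptions for this example: TF is taken as the raw term count (other variants exist), the natural logarithm is used, and terms of a new document that never occurred in the corpus are simply skipped.

```python
import math
from collections import Counter

# Toy corpus of tokenised documents (made up for illustration).
corpus = [
    ["apple", "banana", "apple"],
    ["banana", "cherry"],
    ["apple", "cherry", "cherry"],
]


def compute_idf(corpus):
    """Compute IDF once for the whole corpus."""
    n_docs = len(corpus)
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tf_idf(doc, idf):
    """Weight each term of one document by TF (raw count) times corpus IDF."""
    tf = Counter(doc)
    # Assumption: terms unknown to the corpus are skipped.
    return {term: count * idf[term] for term, count in tf.items() if term in idf}


idf = compute_idf(corpus)

# A new document reuses the existing IDF table; only its TF is computed.
new_doc = ["apple", "apple", "cherry", "durian"]
weights = tf_idf(new_doc, idf)
print(weights)
```

Note that `compute_idf` runs once, whereas `tf_idf` runs for every document, which mirrors the last two points of the card.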
7
Q
Why is IDF not ideal?
A
If no document in the corpus contains the term, the denominator is 0 and the IDF is undefined (division by zero)
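One common workaround, which is not part of the card itself, is add-one smoothing: the corpus is treated as if it held one extra document containing every term, so the denominator can never be 0. The sketch below assumes this particular variant, the natural logarithm, and the hypothetical name `smoothed_idf`.

```python
import math

# Toy corpus: each document is a set of its terms (made up for illustration).
corpus = [
    {"the", "cat"},
    {"the", "dog"},
    {"the", "bird"},
]


def smoothed_idf(term, corpus):
    """Add-one smoothed IDF: log((N + 1) / (df + 1)); never divides by zero."""
    n_docs = len(corpus)
    doc_freq = sum(1 for doc in corpus if term in doc)
    return math.log((n_docs + 1) / (doc_freq + 1))


print(smoothed_idf("fish", corpus))  # unseen term: finite value, no exception
print(smoothed_idf("the", corpus))   # term in every document -> 0.0
```

Unseen terms receive the largest weight under this scheme, which matches the intuition that they are the rarest.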
8
Q
What is a possible extension for IDF?
A
If two documents are equally relevant, additional criteria can be used to rank them, e.g. the age of the document
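Such a tie-breaker can be sketched as a sort on two keys. The result records, the field names `score` and `age_days`, and the choice to prefer newer documents are assumptions for this example; the card only says that age may serve as an extra criterion.

```python
# Hypothetical search results with a relevance score and a document age.
results = [
    {"title": "A", "score": 0.8, "age_days": 30},
    {"title": "B", "score": 0.8, "age_days": 5},
    {"title": "C", "score": 0.9, "age_days": 400},
]

# Sort by relevance (descending); among equal scores, newer documents first.
ranked = sorted(results, key=lambda d: (-d["score"], d["age_days"]))

print([d["title"] for d in ranked])
```

Relevance still decides the order first; age only matters when the scores of two documents are equal.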