Exam Flashcards
Bloom filter
checking in a database whether a credit card number is valid takes too long => hash every known legal card number into a bit array: for each legal number x, set bit h(x) to 1 (just 1 bit per number)
with one hash function the false positive rate is high (11.75% in the example where n/N = 1/8), but using 2 hash functions already drops the false positive rate to about 5%
Bloom filter false positive rate
n = nr of objects
N = nr of bits
k = nr of hash functions
(1 - ((1 - 1/N)^N)^(kn/N))^k ≈ (1 - e^(-kn/N))^k
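A minimal Python sketch (my own, using the n/N = 1/8 example above) that evaluates this formula and reproduces the quoted rates:

```python
from math import exp

def bloom_false_positive_rate(n, N, k):
    """Approximate false positive rate: (1 - e^(-kn/N))^k."""
    return (1 - exp(-k * n / N)) ** k

# example ratio n/N = 1/8 (e.g. 1e9 keys hashed into 8e9 bits)
print(bloom_false_positive_rate(1, 8, 1))  # ~0.1175 -> 11.75%
print(bloom_false_positive_rate(1, 8, 2))  # ~0.0489 -> ~5%
```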
TF-IDF
w_t,d = tf_t,d * idf_t
tf_t,d = f_t,d / (maximum nr of occurrences of any term in that document)
=> tf = 1 if t is the most frequent word in d
idf = inverse document frequency = log_2(N/df_t) with N the total number of documents and df_t the number of documents t occurs in
=> NB: base-2 logarithm
if you do not know the frequency of the most frequent word in d, you cannot compute tf, so you cannot answer the question
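A minimal Python sketch of these two formulas (toy documents of my own invention):

```python
from math import log2

def tf_idf(term, doc_terms, all_docs):
    """tf normalized by the most frequent term in the document; idf = log2(N / df_t)."""
    counts = {t: doc_terms.count(t) for t in set(doc_terms)}
    tf = counts.get(term, 0) / max(counts.values())
    df = sum(1 for d in all_docs if term in d)
    idf = log2(len(all_docs) / df) if df else 0.0
    return tf * idf

docs = [["cat", "sat", "cat"], ["dog", "sat"], ["cat", "dog", "dog"]]
print(tf_idf("cat", docs[0], docs))  # tf = 1 (most frequent word), idf = log2(3/2)
```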
N = nr of bits (bins)
n = nr of objects to be hashed
nr of distinct hash values (non-empty bins) after hashing all objects?
- what is the probability that 1 specific hash value (bin) is used in one experiment? => 1/N
- what is the probability that it is not used in 1 experiment? => (1 - 1/N)
- what is the probability that this specific bin is not used in any of the n experiments? => (1 - 1/N)^n
- expected total nr of empty bins is N(1 - 1/N)^n
- expected total nr of non-empty bins is N - N(1 - 1/N)^n = N(1 - ((1 - 1/N)^N)^(n/N)) ≈ N(1 - (1/e)^(n/N))
- 1/e ≈ 0.37
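A quick simulation (plain Python, arbitrary example values for N and n) comparing the formula with an empirical count:

```python
import random

N, n = 1_000, 1_000                                  # bins and objects (example values)
used_bins = {random.randrange(N) for _ in range(n)}  # n random "hash values"
expected = N * (1 - (1 - 1/N) ** n)                  # formula above
print(N - len(used_bins), round(N - expected, 1))    # empty bins: simulated vs expected
print(len(used_bins), round(expected, 1))            # non-empty bins: simulated vs expected
```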
matrix factorization
- Assume that each user and each movie can be represented by fixed-length vectors of “features” that characterize them: “user vector” U and “item vector” V
- Assume that each rating can be approximated by the dot product of the corresponding vectors: rating ≈ U · V = sum_f(U_f * V_f)
- Find optimal values of all feature vectors that minimize the SSE (sum of squared errors over the known ratings)
matrix factorization: gradient descent
- Initialize all variables at random
- Iterate over all records (user, item, rating):
- calculate the direction of steepest descent of the error function (the vector of partial derivatives)
- make a step of size l_rate in this direction
till convergence or a stop criterion is met
iterate:
err := rating - u_user * v_item
u_user := u_user + l_rate * err * v_item
v_item := v_item + l_rate * err * u_user (using the value of u_user from before its update)
until convergence
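A minimal numpy sketch of this loop (variable and function names are my own; ratings are given as (user, item, rating) triples):

```python
import numpy as np

def sgd_mf(ratings, n_users, n_items, n_factors=8, l_rate=0.01, epochs=50):
    """Plain SGD matrix factorization: U[u] . V[i] approximates the rating."""
    rng = np.random.default_rng(0)
    U = rng.normal(scale=0.1, size=(n_users, n_factors))
    V = rng.normal(scale=0.1, size=(n_items, n_factors))
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - U[u] @ V[i]
            u_old = U[u].copy()            # keep the old user vector for V's update
            U[u] += l_rate * err * V[i]
            V[i] += l_rate * err * u_old
    return U, V

U, V = sgd_mf([(0, 0, 4.0), (0, 1, 2.0), (1, 0, 5.0)], n_users=2, n_items=2)
print(U @ V.T)   # predicted rating matrix
```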
Matrix Factorization: Gradient Descent with Regularization
- Initialize all variables at random
- Iterate over all records (user, item, rating):
- calculate the error:
err = rating - u_user * v_item
- update u_user and v_item:
u_user = u_user + l_rate * (err * v_item - lambda * u_user)
v_item = v_item + l_rate * (err * u_user - lambda * v_item)
until convergence
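Relative to the plain SGD sketch above, only the two update lines change; a minimal variant (lam is an example value for lambda, my own):

```python
import numpy as np

def sgd_mf_reg(ratings, n_users, n_items, n_factors=8,
               l_rate=0.01, lam=0.05, epochs=50):
    """SGD matrix factorization with L2 regularization (weight lam)."""
    rng = np.random.default_rng(0)
    U = rng.normal(scale=0.1, size=(n_users, n_factors))
    V = rng.normal(scale=0.1, size=(n_items, n_factors))
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - U[u] @ V[i]
            u_old = U[u].copy()
            U[u] += l_rate * (err * V[i] - lam * U[u])
            V[i] += l_rate * (err * u_old - lam * V[i])
    return U, V
```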
Matrix Factorization: alternating least squares
- Initialize all variables at random
- Repeat:
- Freeze the user features and treat all item features as variables;
- Solve the resulting Least Squares Problem.
- Freeze the item features and unfreeze the user features;
- Solve the resulting Least Squares Problem
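A minimal numpy sketch of ALS on a fully observed toy rating matrix (a simplifying assumption; with missing ratings each row/column gets its own least-squares problem over the observed entries only):

```python
import numpy as np

R = np.array([[4.0, 2.0], [5.0, 1.0], [4.0, 2.0]])   # toy ratings (users x items)
k = 2
rng = np.random.default_rng(0)
U = rng.normal(size=(R.shape[0], k))
V = rng.normal(size=(R.shape[1], k))

for _ in range(20):
    # freeze V, solve the least squares problem for U  (R ~ U V^T  <=>  R^T ~ V U^T)
    U = np.linalg.lstsq(V, R.T, rcond=None)[0].T
    # freeze U, solve the least squares problem for V
    V = np.linalg.lstsq(U, R, rcond=None)[0].T

print(U @ V.T)   # close to R
```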
2^30
approximately 1G (10^9)
collaborative filtering with advantages and limitations
The process of identifying similar users and recommending what similar users like
- represent data in a utility matrix
- define neighbourhood: cosine, Jaccard, …
- make predictions using a weighting scheme (see the sketch after this card)
advantages:
- can recommend items that are not linked to the user’s earlier choices (useful for promotions)
- considers the opinions of a wide spectrum of users
limitations:
- sparsity of data
- individual characteristics are not taken into account
- tends to recommend popular items (convergence around few items)
- computationally very expensive (time & memory)
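A minimal numpy sketch of user-based collaborative filtering on a small utility matrix (cosine similarity as the neighbourhood measure; zeros stand for missing ratings, toy data of my own):

```python
import numpy as np

# utility matrix: rows = users, columns = items, 0 = not rated
R = np.array([[5.0, 3.0, 0.0],
              [4.0, 0.0, 4.0],
              [1.0, 1.0, 5.0]])

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def predict(user, item):
    """Similarity-weighted average of other users' ratings for this item."""
    sims = np.array([cosine(R[user], R[v]) if v != user and R[v, item] > 0 else 0.0
                     for v in range(len(R))])
    if sims.sum() == 0:
        return 0.0
    return sims @ R[:, item] / sims.sum()

print(predict(0, 2))   # predicted rating of user 0 for item 2
```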
how to avoid overfitting in UV matrix decomposition? 3 options
- Avoid favoring the components that are optimized first by moving each component's value only a fraction of the way (for example halfway) from its current value toward its optimized value.
- Stop revisiting elements of U and V well before the process has converged.
- Take several different UV decompositions, and when predicting a new entry in the matrix M, take the average of the results of using each decomposition.
2 classes of recommendation systems
- content-based: measure similarity by looking for common features of the items
- collaborative filtering: measure similarity of users by their item preferences and/or measure similarity of items by the users who like them
content/memory based recommending approach
- construct for every item its profile: a vector of features that characterize the item
- construct for every user their profile: average profile of the items they like
- recommend items that are closest to user profile
- closeness measures: Jaccard, cosine, …
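A minimal numpy sketch of this approach (made-up binary item profiles; cosine as the closeness measure):

```python
import numpy as np

# item profiles: rows = items, columns = features (e.g. genres)
items = np.array([[1, 0, 1],
                  [1, 1, 0],
                  [0, 1, 1]], dtype=float)
liked = [0, 1]                                   # items the user liked

user_profile = items[liked].mean(axis=0)         # average profile of the liked items

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

scores = [cosine(user_profile, item) for item in items]
print(np.argsort(scores)[::-1])   # items ranked by closeness to the user profile
```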
pros and cons content/memory based approach
advantages:
- item-profiles constructed up front (without historical data)
- natural way of item clustering
- intuitively simple
disadvantages:
- memory & cpu-time expensive (O(N^2))
- low accuracy
model-based recommending approach
Once we have item profiles we can train, for every user, a classification (or regression) model which predicts rating from the item profile:
- model: a decision tree, neural network, SVM,
- input: item profile
- output: user ratings
But such models are expensive to build and maintain, and their accuracy tends to be low
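A minimal numpy sketch of the idea (made-up data; plain linear least squares standing in for the decision tree / neural network / SVM mentioned above):

```python
import numpy as np

# item profiles (features) and one user's ratings for those items
X = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 1.0]])
y = np.array([4.0, 2.0, 1.0, 3.0])

w = np.linalg.lstsq(X, y, rcond=None)[0]   # fit this user's linear model
new_item = np.array([0.0, 0.0, 1.0])       # profile of an unseen item
print(new_item @ w)                        # predicted rating for that item
```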
PCA
Principal Component Analysis
Given n variables x_1, x_2, …, x_n
PCs are linear combinations of x_1, x_2, …, x_n such that:
1. they are orthogonal to each other
2. they maximize the variance of projections
3. the first PC explains most of the variance, second PC less etc.
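A minimal numpy sketch of PCA via the SVD of the centered data matrix (random toy data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                 # 100 observations of 5 variables
Xc = X - X.mean(axis=0)                       # center each variable

# rows of Vt are the principal components: orthogonal, ordered by explained variance
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained_variance = s**2 / (len(X) - 1)

projections = Xc @ Vt[:2].T                   # project the data onto the first 2 PCs
print(explained_variance)
```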
Metric multi-dimensional scaling
Given n points p_1, …, p_n in a high-dimensional space, find a mapping p_i -> q_i (q_i in a low-dimensional space) that preserves the original distances as much as possible
optimization problem
General multi-dimensional scaling
Given n objects p_1, …, p_n and a similarity (distance) measure between them.
Find a mapping p_i -> q_i (q_i in a low-dimensional space) that preserves the original distances as much as possible
optimization problem
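A minimal numpy sketch of classical (metric) MDS via double centering, one standard way to solve this optimization problem (toy points of my own):

```python
import numpy as np

def classical_mds(D, dims=2):
    """Embed points so their pairwise distances approximate the distance matrix D."""
    n = len(D)
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered squared distances
    vals, vecs = np.linalg.eigh(B)               # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:dims]          # keep the largest ones
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0))

P = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 1]], dtype=float)  # 4 points in 3D
D = np.linalg.norm(P[:, None] - P[None, :], axis=-1)                     # distance matrix
print(classical_mds(D, dims=2))               # 2D coordinates approximately preserving D
```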
Locally linear embeddings (LLE)
- for each object X_i, find a few neighbouring objects
- measure distances between X_i and these neighbours
- find Y_i in low dimensional space that preserve all mutual distances => a very simple optimization problem!
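A minimal sketch using scikit-learn's LLE implementation (assuming scikit-learn is available; random toy data):

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                    # toy high-dimensional data

# for each point: use a few neighbours, then find low-dimensional points Y
# that preserve the local relations between each point and its neighbours
lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2)
Y = lle.fit_transform(X)
print(Y.shape)                                    # (200, 2)
```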
Euclidean distances
d([x_1, x_2, …, x_n], [y_1, y_2, …, y_n]) =
(sum_i |x_i - y_i|^r)^(1/r)
r can be 1 (Manhattan), 2 (Euclidean), up to infinity (L_inf = max_i |x_i - y_i|)
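A minimal Python sketch of this distance family (example points of my own):

```python
def lr_distance(x, y, r=2):
    """L_r (Minkowski) distance; r=2 is the usual Euclidean distance."""
    if r == float("inf"):
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1 / r)

print(lr_distance([0, 0], [3, 4], r=2))              # 5.0 (Euclidean)
print(lr_distance([0, 0], [3, 4], r=1))              # 7.0 (Manhattan)
print(lr_distance([0, 0], [3, 4], r=float("inf")))   # 4 (L_inf)
```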