Exam Flashcards

1
Q

Bloom filter

A

checking in a database whether a credit card number is valid takes too much time => use a hash function to hash every known legal card number into a bit array: if number x is legal, set bit h(x) to 1 (just 1 bit per position)

using one hash function still gives a high false positive rate of 11.75% (this figure corresponds to n/N = 1/8, e.g. 10^9 card numbers hashed into 8 * 10^9 bits), but using 2 hash functions already drops the false positive rate to about 5%

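A minimal Bloom filter sketch in Python (my own illustration, not course code; the bit-array size, the number of hash functions and the md5-salted hashes are arbitrary choices):

import hashlib

N = 1000   # number of bits in the filter
K = 2      # number of hash functions
bits = [0] * N

def positions(x):
    # derive K bit positions from salted md5 hashes of the key
    return [int(hashlib.md5(f"{i}:{x}".encode()).hexdigest(), 16) % N
            for i in range(K)]

def add(x):
    for p in positions(x):
        bits[p] = 1

def might_contain(x):
    # True -> possibly legal (could be a false positive); False -> definitely not legal
    return all(bits[p] for p in positions(x))

for card in ["4111-1111", "5500-0001", "3400-0002"]:
    add(card)
print(might_contain("4111-1111"))   # True
print(might_contain("9999-9999"))   # False (or, rarely, a false positive)
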
2
Q

Bloom filter false positive rate

A

n = nr of objects
N = nr of bits
k = nr of hash functions

false positive rate = (1 - (1 - 1/N)^(kn))^k = (1 - ((1 - 1/N)^N)^(kn/N))^k ≈ (1 - e^(-kn/N))^k

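A quick check of this formula in Python; the ratio n/N = 1/8 is my assumption, chosen to reproduce the 11.75% and ~5% figures from the previous card:

import math

def false_positive_rate(n, N, k):
    # (1 - e^(-k*n/N))^k
    return (1 - math.exp(-k * n / N)) ** k

n, N = 1_000_000_000, 8_000_000_000   # assumed example: 10^9 objects, 8*10^9 bits
print(false_positive_rate(n, N, 1))   # ~0.1175
print(false_positive_rate(n, N, 2))   # ~0.0489
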
3
Q

TF-IDF

A

w_t,d = tf_t,d * idf_t

tf_t,d = f_t,d / (maximum number of occurrences of any term in document d)
=> tf = 1 if it is the most frequent word in d

idf = inverse document frequency = log_2(N/df_t) with N the total number of documents and df_t the number of documents t occurs in
=> NB: logarithm with base 2

if you do not know the frequency of the most frequent word in the document, you cannot answer the question

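A small sketch of these formulas in Python (the toy documents are my own example):

import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "dogs and cats are pets".split(),
]
N = len(docs)

def tf(term, doc):
    counts = Counter(doc)
    return counts[term] / max(counts.values())   # 1 for the most frequent word in doc

def idf(term):
    df = sum(1 for d in docs if term in d)       # number of documents containing term
    return math.log2(N / df)                     # NB: base-2 log

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

print(tf_idf("cat", docs[0]))
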
4
Q

N = nr of bits (bins)
n = nr of objects to be hashed
nr of unique items after hashing all documents?

A
  • what is the probability that 1 specific hash value (bin) is used in one experiment? => 1/N
  • what is the probability that it is not used in 1 experiment? => (1 - 1/N)
  • what is the probability that this specific bin is not used in any of the n experiments? => (1 - 1/N)^n
  • the total number of empty bins is N(1 - 1/N)^n
  • the total number of non-empty bins is N - N(1 - 1/N)^n = N(1 - ((1 - 1/N)^N)^(n/N)) ≈ N(1 - (1/e)^(n/N)) = N(1 - e^(-n/N))
  • 1/e ≈ 0.37 (see the numeric check below)
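
A quick numeric check of these formulas (the values of N and n are arbitrary):

import math

N = 1_000_000      # number of bins (bits)
n = 300_000        # number of hashed objects

exact  = N - N * (1 - 1 / N) ** n        # expected number of non-empty bins
approx = N * (1 - math.exp(-n / N))      # N(1 - e^(-n/N)) approximation
print(exact, approx)                     # the two values are nearly identical
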
5
Q

matrix factorization

A
  • Assume that each user and each movie can be represented by fixed-length vectors of “features” that characterize them: “user vector” U and “item vector” V
  • Assume that each rating can be approximated by the dot product of the corresponding vectors: rating ≈ u_user · v_item (sum over the features of the products)
  • Find optimal values of all feature vectors that minimize SSE
6
Q

matrix factorization: gradient descent

A
  1. Initialize all variables at random
  2. Iterate over all records (user, item, rating):
    - calculate the direction of steepest descent of function f() (find a vector of partial derivatives)
    - make a step of size l_rate in this direction

till convergence or a stop criterion is met

iterate:

err := rating - u_user * v_item
u_user := u_user + l_rate * err * v_item
v_item := v_item + l_rate * err * u_user

until convergence

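A minimal NumPy sketch of this loop on toy data (the learning rate, number of epochs and random initialization are my own choices; the old u_user is kept for the v_item update):

import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, n_features = 5, 4, 2
l_rate = 0.05

# toy ratings: (user, item, rating)
records = [(0, 1, 4.0), (1, 1, 3.0), (2, 3, 5.0), (4, 0, 2.0)]

U = rng.normal(scale=0.5, size=(n_users, n_features))
V = rng.normal(scale=0.5, size=(n_items, n_features))

for epoch in range(500):                   # or: until convergence / stop criterion
    for user, item, rating in records:
        err = rating - U[user] @ V[item]   # prediction error
        u_old = U[user].copy()             # keep old value for the V update
        U[user] += l_rate * err * V[item]
        V[item] += l_rate * err * u_old

print(U[0] @ V[1])   # should be close to 4.0
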
7
Q

Matrix Factorization: Gradient Descent with Regularization

A
  1. Initialize all variables at random
  2. Iterate over all records (user, item, rating):
  • calculate the error:
    err = rating – u_user*v_item
  • update u_user and v_item:
    u_user = u_user + l_rate * (err * v_item - lambda * u_user)
    v_item = v_item + l_rate * (err * u_user - lambda * v_item)

until convergence

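A sketch of the regularized update packed into one step function (the lambda and l_rate values are arbitrary); it is meant to replace the update lines in the previous sketch:

import numpy as np

def sgd_step(U, V, user, item, rating, l_rate=0.05, lam=0.1):
    # one regularized stochastic gradient step for a single (user, item, rating) record
    err = rating - U[user] @ V[item]
    u_old = U[user].copy()
    U[user] += l_rate * (err * V[item] - lam * u_old)
    V[item] += l_rate * (err * u_old - lam * V[item])
    return err

# usage: call sgd_step(U, V, user, item, rating) inside the same loop as before
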
8
Q

Matrix Factorization: alternating least squares

A
  1. Initialize all variables at random
  2. Repeat:
    - Freeze the user features and treat all item features as variables;
    - Solve the resulting Least Squares Problem.
    - Freeze the item features and unfreeze the user features;
    - Solve the resulting Least Squares Problem
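
A minimal NumPy sketch of ALS on a small, fully observed toy rating matrix (handling of missing ratings is omitted to keep it short):

import numpy as np

rng = np.random.default_rng(0)
R = np.array([[5., 4., 1.],        # toy ratings: users x items, fully observed
              [4., 5., 2.],
              [1., 2., 5.]])
f = 2                              # number of latent features
U = rng.normal(size=(R.shape[0], f))
V = rng.normal(size=(R.shape[1], f))

for _ in range(20):
    # freeze U, solve the least squares problem for the item features
    V = np.linalg.lstsq(U, R, rcond=None)[0].T
    # freeze V, solve the least squares problem for the user features
    U = np.linalg.lstsq(V, R.T, rcond=None)[0].T

print(np.round(U @ V.T, 1))        # rank-2 approximation of R
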
9
Q

2^30

A

approximately 1G (2^30 = 1,073,741,824 ≈ 10^9)

10
Q

collaborative filtering with advantages and limitations

A

The process of identifying similar users and recommending what similar users like

  1. represent data in a utility matrix
  2. define neighbourhood: cosine, Jaccard, …
  3. make predictions using a weighting scheme (see the sketch below)

advantages:
- can recommend items that are not linked to the user’s earlier choices (useful for promotions)
- considers the opinions of a wide spectrum of users

limitations:
- sparsity of data
- individual characteristics are not taken into account
- tends to recommend popular items (convergence around few items)
- computationally very expensive (time & memory)

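A minimal sketch of steps 1-3 with cosine similarity on a toy utility matrix (the data and the simple similarity-weighted average are my own illustration):

import numpy as np

# utility matrix: rows = users, columns = items, 0 = not rated
R = np.array([[4., 0., 5., 1.],
              [5., 5., 4., 0.],
              [0., 3., 0., 5.],
              [4., 4., 5., 0.]])

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def predict(user, item):
    # weight the other users' ratings of this item by their similarity to `user`
    sims, ratings = [], []
    for other in range(R.shape[0]):
        if other != user and R[other, item] > 0:
            sims.append(cosine(R[user], R[other]))
            ratings.append(R[other, item])
    sims, ratings = np.array(sims), np.array(ratings)
    return sims @ ratings / sims.sum()

print(predict(0, 1))   # predicted rating of user 0 for item 1
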
11
Q

how to avoid overfitting in UV matrix decomposition? 3 options

A
  1. Avoid favoring the first components to be optimized by moving the value of a component only a fraction of the way (for example halfway) from its current value toward its optimized value.
  2. Stop revisiting elements of U and V well before the process has converged.
  3. Take several different UV decompositions, and when predicting a new entry in the matrix M, take the average of the results of using each decomposition.
12
Q

2 classes of recommendation systems

A
  • content-based: measure similarity by looking for common features of the items
  • collaborative filtering: measure similarity of users by their item preferences and/or measure similarity of items by the users who like them
13
Q

content/memory based recommending approach

A
  • construct for every item its profile: a vector of features that characterize the item
  • construct for every user their profile: average profile of the items they like
  • recommend items that are closest to user profile
  • closeness measures: Jaccard, cosine, …
14
Q

pros and cons content/memory based approach

A

advantages:
- item-profiles constructed up front (without historical data)
- natural way of item clustering
- intuitively simple

disadvantages:
- memory & cpu-time expensive (O(N^2))
- low accuracy

15
Q

model-based recommending approach

A

Once we have item profiles we can train, for every user, a classification (or regression) model which predicts rating from the item profile:
- model: a decision tree, neural network, SVM, …
- input: item profile
- output: user ratings

But such models are expensive to build and maintain, and their accuracy is low

16
Q

PCA

A

Principal Component Analysis

Given n variables x_1, x_2, …, x_n
PCs are linear combinations of x_1, x_2, …, x_n such that:
1. they are orthogonal to each other
2. they maximize the variance of projections
3. the first PC explains the most variance, the second PC less, etc.

17
Q

Metric multi-dimensional scaling

A

Given n points p_1, …, p_n in a high-dimensional space, find a mapping p_i -> q_i (q_i in a low-dimensional space) that preserves the original distances as much as possible

optimization problem

18
Q

General multi-dimensional scaling

A

Given n objects p_1, …, p_n and a similarity (distance) measure between them.
Find a mapping p_i -> q_i (q_i in a low-dimensional space) that preserves the original distances as much as possible

optimization problem

19
Q

Locally linear embeddings (LLE)

A
  1. for each object X_i, find a few neighbouring objects
  2. measure distances between X_i and these neighbours
  3. find Y_i in low dimensional space that preserve all mutual distances => a very simple optimization problem!
20
Q

Euclidean distances

A

d([x_1, x_2, …, x_n], [y_1, y_2, …, y_n]) = (sum_i |x_i - y_i|^r)^(1/r)

r can be 1, 2, up to infinity; r = 2 gives the ordinary Euclidean distance, r = 1 the Manhattan distance

21
Q

Jaccard distance

A

1 - |X ∩ Y| / |X ∪ Y|

22
Q

Cosine distance

A

the angle between the vectors to the two points; it lies between 0 and 180 degrees

given vectors x and y, the cosine of the angle is their dot product divided by the product of the L_2 norms of x and y (their Euclidean distances from the origin)

cos(theta) = (p1 · p2) / (|p1| * |p2|)

23
Q

Hamming distance

A

the number of components in which two vectors differ

for Boolean vectors

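Small Python versions of the distance measures from the last few cards (L_r/Euclidean, Jaccard, cosine, Hamming), with toy inputs:

import math

def l_r_distance(x, y, r=2):
    # r = 2 gives the Euclidean distance, r = 1 the Manhattan distance
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1 / r)

def jaccard_distance(X, Y):
    return 1 - len(X & Y) / len(X | Y)

def cosine_angle(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return math.degrees(math.acos(dot / norms))   # angle in [0, 180]

def hamming_distance(x, y):
    return sum(a != b for a, b in zip(x, y))

print(l_r_distance([0, 0], [3, 4]))            # 5.0
print(jaccard_distance({1, 2, 3}, {2, 3, 4}))  # 0.5
print(cosine_angle([1, 0], [0, 1]))            # 90.0
print(hamming_distance([1, 0, 1], [1, 1, 0]))  # 2
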
24
Q

Data stream model

A
  • data enter the system at a rapid rate, at one or more input ports
  • the system cannot store the entire stream accessibly
    => How to make critical calculations about the stream using a limited amount of fast (RAM) memory?

applications: detecting breaking news, unusual amount of hits on pages, detecting traffic jams

25
Q

sampling a data stream in limited RAM for the problem of a bank wanting a representative sample of 3% of all their data

A

Instead of pre-specifying the percentage of clients, specify the amount of available RAM

  • use a hash function with L buckets and put all records from these buckets into your sample
  • whenever the size of the sample reaches the size of the available RAM, remove all records from bucket L and set L to L-1
26
Q

problem: your RAM can store s records
=> how could you sample an infinite stream of records, in such a way that at any moment your RAM keeps a random, uniformly distributed, sample of s records? i.e. every record from the stream has the same chance of being kept in RAM

A

solution: reservoir sampling

initialization: Store the first s records of the stream in your RAM. At this moment n = s and the probability of an element entering RAM is s/n (which happens to be 1)

inductive step:
- When the (n+1)th element arrives, decide with probability s/(n+1) to keep the record in RAM
- If you choose to keep it, throw one of the previously stored records out, selected with equal probability, and use the freed space for the new record

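A minimal reservoir-sampling sketch in Python (the stream here is just a range of integers):

import random

def reservoir_sample(stream, s):
    reservoir = []
    for n, record in enumerate(stream, start=1):
        if n <= s:
            reservoir.append(record)                 # store the first s records
        elif random.random() < s / n:                # keep the n-th record with probability s/n
            reservoir[random.randrange(s)] = record  # evict a uniformly chosen old record
    return reservoir

print(reservoir_sample(range(1_000_000), s=10))
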
27
Q

Flajolet-Martin approach for counting distinct elements in a stream

A

hash the passing data stream elements into short bitstrings and store only the length of the longest tail of 0's seen so far. The more distinct elements in the stream, the longer the longest tail of 0's

The probability that a given hash value h(a) ends in at least r 0's is 2^(-r)

The number of distinct elements is estimated as 2^R, where R is the length of the longest observed tail of 0's

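A minimal Flajolet-Martin sketch in Python with a single hash function (real implementations combine estimates from many hash functions):

import hashlib

def trailing_zeros(x):
    # length of the tail of 0's in the binary representation of x
    if x == 0:
        return 32
    count = 0
    while x & 1 == 0:
        x >>= 1
        count += 1
    return count

def fm_estimate(stream):
    R = 0
    for element in stream:
        h = int(hashlib.md5(str(element).encode()).hexdigest(), 16) & 0xFFFFFFFF
        R = max(R, trailing_zeros(h))
    return 2 ** R          # rough power-of-two estimate of the number of distinct elements

# 10_000 distinct values, each seen several times
stream = [i % 10_000 for i in range(50_000)]
print(fm_estimate(stream))   # right order of magnitude, e.g. 8192 or 16384
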
28
Q

3 steps for similarity testing (documents)

A
  1. shingling: convert documents to sets of short strings (shingles) that appear in the doc
  2. minhashing: convert large sets to short signatures, while preserving similarity
  3. LSH: focus on pairs of signatures likely to be similar
29
Q

signatures

A

hash each column in the shingle/document matrix to a small signature sig(C) such that
1. Sig(C) is small enough that we can fit a signature in main memory for each column
2. Sim(C1, C2) is almost the same as the similarity of Sig(C1) and Sig(C2)

30
Q

Locality Sensitive Hashing (LSH)

A

comparing signatures of all pairs of columns is quadratic in the number of columns

  • split the signature matrix into several bands of the same size (each column is cut into b segments of r rows)
  • if two columns are very similar, then it is very likely that they will be identical in at least one band. Instead of comparing bands directly, send them to buckets with the help of hash functions!
  • Candidate pairs are those that hash to the same bucket for at least one band. Such pairs are often similar to each other, but occasionally they are not ("false positives"). Therefore, check whether candidate pairs are really similar!

Tune the amount of bands b and rows r to catch most similar pairs, but few non-similar pairs.

The probability that the signatures agree in all rows of at least one band is 1 - (1 - s^r)^b, where s is the similarity of the two columns
example on slide 38 week 5.1

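A quick check of the 1 - (1 - s^r)^b formula (the choice b = 20, r = 5 is arbitrary); it prints the characteristic S-curve:

def prob_candidate(s, r, b):
    # probability that two columns with similarity s become a candidate pair
    return 1 - (1 - s ** r) ** b

for s in (0.2, 0.4, 0.6, 0.8):
    print(s, round(prob_candidate(s, r=5, b=20), 3))
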
31
Q

LSH for cosine similarity

A

LSH is based on the idea of hash functions: the chance that two objects will hash to the same value is the same as the similarity of these objects

Instead of minhashing we now use random projections: hash functions which return +1 or -1 depending on whether x is below or above a randomly chosen hyperplane (one hyperplane per hash function)
=> Pick a random vector which determines a hash function with two buckets => Claim: Prob[h(x)=h(y)] = cosine_sim(x,y)

Prob[A and B on different sides of the line] = θ/180
Prob[A and B on the same side of the line] = 1-θ/180

Pick some number of random vectors, and hash your data with each vector => the result is a signature: a vector of +1's and -1's that can be used for LSH just like the minhash signatures for the Jaccard similarity

for a numeric example, see Example 3.22 in the book

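A minimal random-projection sketch in NumPy (the data, dimensionality and number of hyperplanes are arbitrary); the angle is estimated from the fraction of disagreeing signs via the θ/180 formula above:

import numpy as np

rng = np.random.default_rng(0)
d, n_hyperplanes = 10, 1000
hyperplanes = rng.normal(size=(n_hyperplanes, d))   # one random normal vector per hash function

def signature(x):
    return np.sign(hyperplanes @ x)                 # vector of +1's and -1's

x = rng.normal(size=d)
y = x + 0.3 * rng.normal(size=d)                    # a vector similar to x

disagree = np.mean(signature(x) != signature(y))    # fraction of differing signs
est_angle = disagree * 180                          # estimated angle in degrees
true_angle = np.degrees(np.arccos(x @ y / (np.linalg.norm(x) * np.linalg.norm(y))))
print(est_angle, true_angle)                        # should be close
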
32
Q

dealing with dead ends

A
  1. drop dead ends from the graph recursively including their incoming arcs (example 5.4)
  2. taxation (modify the random surfing process)
33
Q

taxation

A

allowing each random surfer a small probability of teleporting to a random page, rather than following an out-link from their current page

v’ = βMv + (1 − β)e/n
where β is the probability that the random surfer follows an out-link from their present page and n is the number of pages in the graph (e is a vector of all ones)

example 5.6

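A minimal power-iteration sketch of v' = βMv + (1 − β)e/n in NumPy (the 3-page transition matrix is a toy example):

import numpy as np

beta = 0.85
# column-stochastic transition matrix M for a toy 3-page web:
# page 0 links to 1 and 2, page 1 links to 0, page 2 links to 0 and 1
M = np.array([[0.0, 1.0, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.0, 0.0]])
n = M.shape[1]
v = np.ones(n) / n          # start from the uniform distribution
e = np.ones(n)

for _ in range(50):
    v = beta * M @ v + (1 - beta) * e / n

print(v)                    # PageRank vector, sums to 1
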
34
Q

random surfer model as Markov Process

A

page importance = frequency with which the surfer visits the page

transition matrix used for iterative calculation of the page probability distribution

Markov Process converges if the graph is strongly connected and there are no dead ends

35
Q

dead end

A

nodes with no outgoing links => over time, the probability of ending up at any page goes to 0

36
Q

spider traps

A

a node where the only outgoing link goes to itself, so over time the probability of ending up there goes to 1, and to 0 for any other node

37
Q

idea strategy behind MapReduce

A

Computations must be divided into tasks, such that if any one task fails to execute to completion, it can be restarted without affecting other tasks.

38
Q

DFS components

A

distributed file system

  • chunk servers: files are divided into chunks, which are replicated on different compute nodes in different racks so we don’t lose all copies due to a rack failure
  • master node stores metadata about where files are stored (might be replicated)
  • client library: talks to master to find chunk servers and access the data
39
Q

MapReduce

A
  1. Map task turns chunk into sequence of key-value pairs
  2. key-value pairs from each Map task are collected by a master controller and sorted by key and all key-value pairs with the same key wind up at the same Reduce task
  3. Reduce task works on one key at a time and combines all the values associated with that key in some way
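
The classic word-count illustration of these three steps, simulated in plain Python (not actual Hadoop code):

from collections import defaultdict

chunks = ["deer bear river", "car car river", "deer car bear"]

# 1. Map: turn each chunk into (key, value) pairs
mapped = [(word, 1) for chunk in chunks for word in chunk.split()]

# 2. Shuffle/sort: group all values by key
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# 3. Reduce: combine all values associated with each key
counts = {key: sum(values) for key, values in groups.items()}
print(counts)   # {'deer': 2, 'bear': 2, 'river': 2, 'car': 3}
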
40
Q

DFS: coordination of the Master

A
  • keep task status
  • When a map task completes, it sends the master the location (on local file system) and sizes of its R intermediate files, one for each reducer
  • detect failures (periodic checks)
41
Q

matrix-vector multiplication MapReduce

A

Each Map task will operate on a chunk of matrix M. From each matrix element m_ij it produces the key-value pair (i, m_ij * v_j)

The Reduce task sums all values associated with a given key i, resulting in pair (i, x_i)

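A plain-Python simulation of this Map and Reduce on a tiny M and v (my own illustration):

from collections import defaultdict

M = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 4.0}   # m_ij indexed by (i, j)
v = [10.0, 20.0]

# Map: every element m_ij becomes the pair (i, m_ij * v_j)
mapped = [(i, m_ij * v[j]) for (i, j), m_ij in M.items()]

# Reduce: sum all values with the same key i
x = defaultdict(float)
for i, value in mapped:
    x[i] += value

print(dict(x))   # {0: 50.0, 1: 110.0}
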
42
Q

matrix-matrix multiplication MapReduce

A

Map function: for each matrix element m_ij we produce all key-value pairs ((i, k), (M, j, m_ij)) for k = 1, 2, . . . up to the number of columns of N and for each matrix element n_jk produce key-value pairs ((i, k), (N, j, n_jk))
for i = 1, 2, . . . up to the number of rows in M.

=> each key (i, k) will have an associated list with all the values (M, j, m_ij ) and (N, j, n_jk), for all possible values of j.

Reduce function: for each key (i, k), connect the values on the list that have the same value of j: multiply their third components m_ij and n_jk, sum these products over all j, and pair the result with (i, k) in the output

43
Q

how does PCA work?

A
  1. find the mean value of the data points on every axis
  2. shift the center to the origin
  3. find a line through the origin that best fits the data points (PC1)
  4. PC2 is the best-fitting line through the origin that is orthogonal to PC1
  5. project the points onto PC1 and PC2 to get the PCA plot
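
A minimal NumPy sketch of these steps, using the eigenvectors of the covariance matrix instead of literally fitting lines (toy 2-D data):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[3.0, 1.0], [1.0, 0.5]])   # correlated toy data

X_centered = X - X.mean(axis=0)                  # 1-2. shift the center to the origin
cov = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]            # 3-4. PC1 = direction of max variance,
pcs = eigenvectors[:, order]                     #      PC2 orthogonal to it

projected = X_centered @ pcs                     # 5. coordinates for the PCA plot
print(eigenvalues[order] / eigenvalues.sum())    # fraction of variance explained per PC
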
44
Q

t-SNE

A

find clusters: each point moves closer towards points that are near and repels points that are farther away

t-SNE knows how to move points because it first calculates the actual pairwise similarity scores between the points (stored as a matrix). The points are then randomly projected onto a lower-dimensional space, and the similarity matrix calculated there (based on the t-distribution) is transformed step by step to look more like the actual similarity matrix

45
Q

bagging

A

given training set T and a learning method:
1. create a number of versions of T (e.g. 100) by resampling with replacement: T1, …, T100
2. for each Ti build a classifier
3. return an ensemble of classifiers and take the majority vote (classification) or the average (regression)

advantages:
- no thinking, no tuning
- better accuracy

disadvantages:
- more computations (but easy to run in parallel)
- loss of interpretability

reduces the variance (limitation of the training data)

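A minimal bagging sketch with decision trees, assuming scikit-learn is available (a hand-rolled majority vote rather than a library ensemble class, to mirror the three steps above):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
n = len(X)

# 1-2. resample the training set with replacement and fit one tree per sample
trees = []
for _ in range(100):
    idx = rng.integers(0, n, size=n)
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# 3. majority vote of the ensemble
votes = np.array([tree.predict(X) for tree in trees])
majority = np.array([np.bincount(col).argmax() for col in votes.T])
print((majority == y).mean())   # training accuracy (optimistic; use OOB or a test set in practice)
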
46
Q

boosting

A

build a sequence of classifiers C1 to Ck as follows:
C1 is trained on the original data
C2 pays more attention to cases misclassified by C1
C3 pays more attention to cases misclassified by C1 and C2

  • assign to every case in T a weight that is proportional to the error made on this case by the current ensemble
  • try to correct more errors => attach bigger weights to misclassified cases, meaning the misclassified cases are included in the training set with a higher probability

boosting develops a number of ‘experts’ that specialize in different regions of the data

properties:
- reduces the error on the training set exponentially fast
- does not overfit the training set
- models expensive to build and difficult to interpret

47
Q

Random Forest algorithm

A
  1. for b = 1 to B:
    - draw a bootstrap sample of size N from training data
    - grow a random-forest tree Tb to the bootstrapped data, by recursively repeating the following steps for each terminal node of the tree, until the minimum node size is reached:

i. Select m variables at random from the p variables.
ii. Pick the best variable/split-point among the m.
iii. Split the node into two daughter nodes.

  2. output the ensemble of trees
48
Q

Random Forest advantages

A
  • superior accuracy
  • no cross-validation needed (OOB error)
  • few parameters to tune, so highly robust
  • trivial to parallelize
49
Q

out-of-bag samples (OOB)

A

for each example, construct its random forest predictor by averaging only those trees corresponding to bootstrap samples in which this example did not appear

OOB error estimate is almost identical to N-fold cross validation

50
Q

Random Forest hyperparameters

A
  • m: number of selected input variables
  • tree depth
  • minimum node size
51
Q

XGBoost

A
  • compose tree by calculating the Gain from using a certain variable split
  • Gain = similarity score_left + similarity score_right - similarity score_root
  • apply pruning: prune the lowest branch if Gain - gamma is negative
  • regularization parameter lambda

prediction = initial prediction + learning rate * tree output

repeatedly builds new trees based on new (smaller) residuals
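
A small numeric sketch of the Gain computation, assuming the standard XGBoost similarity score for squared-error loss, sum(residuals)^2 / (count + lambda); the residuals and the split are arbitrary:

def similarity(residuals, lam=1.0):
    # XGBoost similarity score for squared-error loss
    return sum(residuals) ** 2 / (len(residuals) + lam)

root  = [-10.5, 6.5, 7.5, -7.5]
left  = [-10.5]              # example split of the root's residuals
right = [6.5, 7.5, -7.5]

gain = similarity(left) + similarity(right) - similarity(root)
print(gain)                  # prune this branch if gain - gamma is negative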