Exam Flashcards

1
Q

Bloom filter

A

checking in a database whether a credit card number is valid takes too much time => use a hash function to hash every known legal card number into a bit array: if number x is legal, set bit h(x) to 1 (just 1 bit per position)

using one hash function still gives a high false positive rate of 11.75% (this figure corresponds to n/N = 1/8, e.g. 10^9 card numbers hashed into 8 * 10^9 bits), but using 2 hash functions already drops the false positive rate to about 5%

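A minimal Bloom filter sketch in Python (my own illustration, not course code; the bit-array size, the number of hash functions and the md5-salted hashes are arbitrary choices):

import hashlib

N = 1000   # number of bits in the filter
K = 2      # number of hash functions
bits = [0] * N

def positions(x):
    # derive K bit positions from salted md5 hashes of the key
    return [int(hashlib.md5(f"{i}:{x}".encode()).hexdigest(), 16) % N
            for i in range(K)]

def add(x):
    for p in positions(x):
        bits[p] = 1

def might_contain(x):
    # True -> possibly legal (could be a false positive); False -> definitely not legal
    return all(bits[p] for p in positions(x))

for card in ["4111-1111", "5500-0001", "3400-0002"]:
    add(card)
print(might_contain("4111-1111"))   # True
print(might_contain("9999-9999"))   # False (or, rarely, a false positive)
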
2
Q

Bloom filter false positive rate

A

n = nr of objects
N = nr of bits
k = nr of hash functions

false positive rate = (1 - (1 - 1/N)^(kn))^k = (1 - ((1 - 1/N)^N)^(kn/N))^k ≈ (1 - e^(-kn/N))^k

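A quick check of this formula in Python; the ratio n/N = 1/8 is my assumption, chosen to reproduce the 11.75% and ~5% figures from the previous card:

import math

def false_positive_rate(n, N, k):
    # (1 - e^(-k*n/N))^k
    return (1 - math.exp(-k * n / N)) ** k

n, N = 1_000_000_000, 8_000_000_000   # assumed example: 10^9 objects, 8*10^9 bits
print(false_positive_rate(n, N, 1))   # ~0.1175
print(false_positive_rate(n, N, 2))   # ~0.0489
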
3
Q

TF-IDF

A

w_t,d = tf_t,d * idf_t

tf_t,d = f_t,d / (maximum number of occurrences of any term in document d)
=> tf = 1 if it is the most frequent word in d

idf = inverse document frequency = log_2(N/df_t) with N the total number of documents and df_t the number of documents t occurs in
=> NB: logarithm with base 2

if you do not know the frequency of the most frequent word in the document, you cannot answer the question

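A small sketch of these formulas in Python (the toy documents are my own example):

import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "dogs and cats are pets".split(),
]
N = len(docs)

def tf(term, doc):
    counts = Counter(doc)
    return counts[term] / max(counts.values())   # 1 for the most frequent word in doc

def idf(term):
    df = sum(1 for d in docs if term in d)       # number of documents containing term
    return math.log2(N / df)                     # NB: base-2 log

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

print(tf_idf("cat", docs[0]))
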
4
Q

N = nr of bits (bins)
n = nr of objects to be hashed
nr of unique items after hashing all documents?

A
  • what is the probability that 1 specific hash value (bin) is used in one experiment? => 1/N
  • what is the probability that it is not used in 1 experiment? => (1 - 1/N)
  • what is the probability that this specific bin is not used in any of the n experiments? => (1 - 1/N)^n
  • the total number of empty bins is N(1 - 1/N)^n
  • the total number of non-empty bins is N - N(1 - 1/N)^n = N(1 - ((1 - 1/N)^N)^(n/N)) ≈ N(1 - (1/e)^(n/N)) = N(1 - e^(-n/N))
  • 1/e ≈ 0.37 (see the numeric check below)
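
A quick numeric check of these formulas (the values of N and n are arbitrary):

import math

N = 1_000_000      # number of bins (bits)
n = 300_000        # number of hashed objects

exact  = N - N * (1 - 1 / N) ** n        # expected number of non-empty bins
approx = N * (1 - math.exp(-n / N))      # N(1 - e^(-n/N)) approximation
print(exact, approx)                     # the two values are nearly identical
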
5
Q

matrix factorization

A
  • Assume that each user and each movie can be represented by fixed-length vectors of “features” that characterize them: “user vector” U and “item vector” V
  • Assume that each rating can be approximated by the dot product of the corresponding vectors: rating ≈ u_user · v_item (sum over the features of the products)
  • Find optimal values of all feature vectors that minimize SSE
6
Q

matrix factorization: gradient descent

A
  1. Initialize all variables at random
  2. Iterate over all records (user, item, rating):
    - calculate the direction of steepest descent of function f() (find a vector of partial derivatives)
    - make a step of size l_rate in this direction

till convergence or a stop criterion is met

iterate:

err := rating - u_user * v_item
u_user := u_user + l_rate * err * v_item
v_item := v_item + l_rate * err * u_user

until convergence

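A minimal NumPy sketch of this loop on toy data (the learning rate, number of epochs and random initialization are my own choices; the old u_user is kept for the v_item update):

import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, n_features = 5, 4, 2
l_rate = 0.05

# toy ratings: (user, item, rating)
records = [(0, 1, 4.0), (1, 1, 3.0), (2, 3, 5.0), (4, 0, 2.0)]

U = rng.normal(scale=0.5, size=(n_users, n_features))
V = rng.normal(scale=0.5, size=(n_items, n_features))

for epoch in range(500):                   # or: until convergence / stop criterion
    for user, item, rating in records:
        err = rating - U[user] @ V[item]   # prediction error
        u_old = U[user].copy()             # keep old value for the V update
        U[user] += l_rate * err * V[item]
        V[item] += l_rate * err * u_old

print(U[0] @ V[1])   # should be close to 4.0
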
7
Q

Matrix Factorization: Gradient Descent with Regularization

A
  1. Initialize all variables at random
  2. Iterate over all records (user, item, rating):
  • calculate the error:
    err = rating – u_user*v_item
  • update u_user and v_item:
    u_user = u_user + l_rate * (err * v_item - lambda * u_user)
    v_item = v_item + l_rate * (err * u_user - lambda * v_item)

until convergence

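A sketch of the regularized update packed into one step function (the lambda and l_rate values are arbitrary); it is meant to replace the update lines in the previous sketch:

import numpy as np

def sgd_step(U, V, user, item, rating, l_rate=0.05, lam=0.1):
    # one regularized stochastic gradient step for a single (user, item, rating) record
    err = rating - U[user] @ V[item]
    u_old = U[user].copy()
    U[user] += l_rate * (err * V[item] - lam * u_old)
    V[item] += l_rate * (err * u_old - lam * V[item])
    return err

# usage: call sgd_step(U, V, user, item, rating) inside the same loop as before
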
8
Q

Matrix Factorization: alternating least squares

A
  1. Initialize all variables at random
  2. Repeat:
    - Freeze the user features and treat all item features as variables;
    - Solve the resulting Least Squares Problem.
    - Freeze the item features and unfreeze the user features;
    - Solve the resulting Least Squares Problem
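
A minimal NumPy sketch of ALS on a small, fully observed toy rating matrix (handling of missing ratings is omitted to keep it short):

import numpy as np

rng = np.random.default_rng(0)
R = np.array([[5., 4., 1.],        # toy ratings: users x items, fully observed
              [4., 5., 2.],
              [1., 2., 5.]])
f = 2                              # number of latent features
U = rng.normal(size=(R.shape[0], f))
V = rng.normal(size=(R.shape[1], f))

for _ in range(20):
    # freeze U, solve the least squares problem for the item features
    V = np.linalg.lstsq(U, R, rcond=None)[0].T
    # freeze V, solve the least squares problem for the user features
    U = np.linalg.lstsq(V, R.T, rcond=None)[0].T

print(np.round(U @ V.T, 1))        # rank-2 approximation of R
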
9
Q

2^30

A

approximately 1G (2^30 = 1,073,741,824 ≈ 10^9)

10
Q

collaborative filtering with advantages and limitations

A

The process of identifying similar users and recommending what similar users like

  1. represent data in a utility matrix
  2. define neighbourhood: cosine, Jaccard, …
  3. make predictions using a weighting scheme (see the sketch below)

advantages:
- can recommend items that are not linked to the user’s earlier choices (useful for promotions)
- considers the opinions of a wide spectrum of users

limitations:
- sparsity of data
- individual characteristics are not taken into account
- tends to recommend popular items (convergence around few items)
- computationally very expensive (time & memory)

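A minimal sketch of steps 1-3 with cosine similarity on a toy utility matrix (the data and the simple similarity-weighted average are my own illustration):

import numpy as np

# utility matrix: rows = users, columns = items, 0 = not rated
R = np.array([[4., 0., 5., 1.],
              [5., 5., 4., 0.],
              [0., 3., 0., 5.],
              [4., 4., 5., 0.]])

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def predict(user, item):
    # weight the other users' ratings of this item by their similarity to `user`
    sims, ratings = [], []
    for other in range(R.shape[0]):
        if other != user and R[other, item] > 0:
            sims.append(cosine(R[user], R[other]))
            ratings.append(R[other, item])
    sims, ratings = np.array(sims), np.array(ratings)
    return sims @ ratings / sims.sum()

print(predict(0, 1))   # predicted rating of user 0 for item 1
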
11
Q

how to avoid overfitting in UV matrix decomposition? 3 options

A
  1. Avoid favoring the first components to be optimized by moving the value of a component only a fraction of the way (for example halfway) from its current value toward its optimized value.
  2. Stop revisiting elements of U and V well before the process has converged.
  3. Take several different UV decompositions, and when predicting a new entry in the matrix M, take the average of the results of using each decomposition.
12
Q

2 classes of recommendation systems

A
  • content-based: measure similarity by looking for common features of the items
  • collaborative filtering: measure similarity of users by their item preferences and/or measure similarity of items by the users who like them
13
Q

content/memory based recommending approach

A
  • construct for every item its profile: a vector of features that characterize the item
  • construct for every user their profile: average profile of the items they like
  • recommend items that are closest to user profile
  • closeness measures: Jaccard, cosine, …
14
Q

pros and cons content/memory based approach

A

advantages:
- item-profiles constructed up front (without historical data)
- natural way of item clustering
- intuitively simple

disadvantages:
- memory & cpu-time expensive (O(N^2))
- low accuracy

15
Q

model-based recommending approach

A

Once we have item profiles we can train, for every user, a classification (or regression) model which predicts rating from the item profile:
- model: a decision tree, neural network, SVM, …
- input: item profile
- output: user ratings

But such models are expensive to build and maintain, and their accuracy is low

16
Q

PCA

A

Principal Component Analysis

Given n variables x_1, x_2, …, x_n
PCs are linear combinations of x_1, x_2, …, x_n such that:
1. they are orthogonal to each other
2. they maximize the variance of projections
3. the first PC explains the most variance, the second PC less, etc.

17
Q

Metric multi-dimensional scaling

A

Given n points p_1, …, p_n in a high-dimensional space, find a mapping p_i -> q_i (q_i in a low-dimensional space) that preserves the original distances as much as possible

optimization problem

18
Q

General multi-dimensional scaling

A

Given n objects p_1, …, p_n and a similarity (distance) measure between them.
Find a mapping p_i -> q_i (q_i in a low-dimensional space) that preserves the original distances as much as possible

optimization problem

19
Q

Locally linear embeddings (LLE)

A
  1. for each object X_i, find a few neighbouring objects
  2. measure distances between X_i and these neighbours
  3. find Y_i in low dimensional space that preserve all mutual distances => a very simple optimization problem!
20
Q

Euclidean distances

A

d([x_1, x_2, …, x_n], [y_1, y_2, …, y_n]) = (sum_i |x_i - y_i|^r)^(1/r)

r can be 1, 2, up to infinity; r = 2 gives the ordinary Euclidean distance, r = 1 the Manhattan distance

21
Q

Jaccard distance

A

1 - |X ∩ Y| / |X ∪ Y|

22
Q

Cosine distance

A

the angle between the vectors to the two points; it lies between 0 and 180 degrees

given vectors x and y, the cosine of the angle is their dot product divided by the product of the L_2 norms of x and y (their Euclidean distances from the origin)

cos(theta) = (p1 · p2) / (|p1| * |p2|)

23
Q

Hamming distance

A

the number of components in which two vectors differ

for Boolean vectors

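Small Python versions of the distance measures from the last few cards (L_r/Euclidean, Jaccard, cosine, Hamming), with toy inputs:

import math

def l_r_distance(x, y, r=2):
    # r = 2 gives the Euclidean distance, r = 1 the Manhattan distance
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1 / r)

def jaccard_distance(X, Y):
    return 1 - len(X & Y) / len(X | Y)

def cosine_angle(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return math.degrees(math.acos(dot / norms))   # angle in [0, 180]

def hamming_distance(x, y):
    return sum(a != b for a, b in zip(x, y))

print(l_r_distance([0, 0], [3, 4]))            # 5.0
print(jaccard_distance({1, 2, 3}, {2, 3, 4}))  # 0.5
print(cosine_angle([1, 0], [0, 1]))            # 90.0
print(hamming_distance([1, 0, 1], [1, 1, 0]))  # 2
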
24
Q

Data stream model

A
  • data enter the system at a rapid rate, at one or more input ports
  • the system cannot store the entire stream accessibly
    => How to make critical calculations about the stream using a limited amount of fast (RAM) memory?

applications: detecting breaking news, unusual amount of hits on pages, detecting traffic jams

25
Q

sampling a data stream in limited RAM for the problem of a bank wanting a representative sample of 3% of all their data

A

Instead of pre-specifying the percentage of clients, specify the amount of available RAM

  • use a hash function with L buckets and put all records from these buckets into your sample
  • whenever the size of the sample reaches the size of the available RAM, remove all records from bucket L and set L to L-1
26
Q

problem: your RAM can store s records
=> how could you sample an infinite stream of records, in such a way that at any moment your RAM keeps a random, uniformly distributed, sample of s records? i.e. every record from the stream has the same chance of being kept in RAM

A

solution: reservoir sampling

initialization: Store the first s records of the stream in your RAM. At this moment n = s and the probability of an element entering RAM is s/n (which happens to be 1)

inductive step:
- When the (n+1)th element arrives, decide with probability s/(n+1) to keep the record in RAM
- If you choose to keep it, throw one of the previously stored records out, selected with equal probability, and use the freed space for the new record

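A minimal reservoir-sampling sketch in Python (the stream here is just a range of integers):

import random

def reservoir_sample(stream, s):
    reservoir = []
    for n, record in enumerate(stream, start=1):
        if n <= s:
            reservoir.append(record)                 # store the first s records
        elif random.random() < s / n:                # keep the n-th record with probability s/n
            reservoir[random.randrange(s)] = record  # evict a uniformly chosen old record
    return reservoir

print(reservoir_sample(range(1_000_000), s=10))
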
27
Q

Flajolet-Martin approach for counting distinct elements in a stream

A

hash the passing data stream elements into short bitstrings and store only the length of the longest tail of 0's seen so far. The more distinct elements in the stream, the longer the longest tail of 0's

The probability that a given hash value h(a) ends in at least r 0's is 2^(-r)

The number of distinct elements is estimated as 2^R, where R is the length of the longest observed tail of 0's

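A minimal Flajolet-Martin sketch in Python with a single hash function (real implementations combine estimates from many hash functions):

import hashlib

def trailing_zeros(x):
    # length of the tail of 0's in the binary representation of x
    if x == 0:
        return 32
    count = 0
    while x & 1 == 0:
        x >>= 1
        count += 1
    return count

def fm_estimate(stream):
    R = 0
    for element in stream:
        h = int(hashlib.md5(str(element).encode()).hexdigest(), 16) & 0xFFFFFFFF
        R = max(R, trailing_zeros(h))
    return 2 ** R          # rough power-of-two estimate of the number of distinct elements

# 10_000 distinct values, each seen several times
stream = [i % 10_000 for i in range(50_000)]
print(fm_estimate(stream))   # right order of magnitude, e.g. 8192 or 16384
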
28
Q

3 steps for similarity testing (documents)

A
  1. shingling: convert documents to sets of short strings (shingles) that appear in the doc
  2. minhashing: convert large sets to short signatures, while preserving similarity
  3. LSH: focus on pairs of signatures likely to be similar
29
Q

signatures

A

hash each column in the shingle/document matrix to a small signature sig(C) such that
1. Sig(C) is small enough that we can fit a signature in main memory for each column
2. Sim(C1, C2) is almost the same as the similarity of Sig(C1) and Sig(C2)

30
Q

Locality Sensitive Hashing (LSH)

A

comparing signatures of all pairs of columns is quadratic in the number of columns

  • split the signature matrix into several bands of the same size (each column is cut into b segments of r rows)
  • if two columns are very similar, then it is very likely that they will be identical in at least one band. Instead of comparing bands directly, send them to buckets with the help of hash functions!
  • Candidate pairs are those that hash to the same bucket for at least one band. Such pairs are often similar to each other, but occasionally they are not ("false positives"). Therefore, check whether candidate pairs are really similar!

Tune the amount of bands b and rows r to catch most similar pairs, but few non-similar pairs.

The probability that the signatures agree in all rows of at least one band is 1 - (1 - s^r)^b, where s is the similarity of the two columns
example on slide 38 week 5.1

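A quick check of the 1 - (1 - s^r)^b formula (the choice b = 20, r = 5 is arbitrary); it prints the characteristic S-curve:

def prob_candidate(s, r, b):
    # probability that two columns with similarity s become a candidate pair
    return 1 - (1 - s ** r) ** b

for s in (0.2, 0.4, 0.6, 0.8):
    print(s, round(prob_candidate(s, r=5, b=20), 3))
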
31
Q

LSH for cosine similarity

A

LSH is based on the idea of hash functions: the chance that two objects will hash to the same value is the same as the similarity of these objects

Instead of minhashing we now use random projections: hash functions which return +1 or -1 depending on whether x is below or above a randomly chosen hyperplane (one hyperplane per hash function)
=> Pick a random vector which determines a hash function with two buckets => Claim: Prob[h(x)=h(y)] = cosine_sim(x,y)

Prob[A and B on different sides of the line] = θ/180
Prob[A and B on the same side of the line] = 1-θ/180

Pick some number of random vectors, and hash your data with each vector => the result is a signature: a vector of +1's and -1's that can be used for LSH just like the minhash signatures for the Jaccard similarity

for a numeric example, see Example 3.22 in the book

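A minimal random-projection sketch in NumPy (the data, dimensionality and number of hyperplanes are arbitrary); the angle is estimated from the fraction of disagreeing signs via the θ/180 formula above:

import numpy as np

rng = np.random.default_rng(0)
d, n_hyperplanes = 10, 1000
hyperplanes = rng.normal(size=(n_hyperplanes, d))   # one random normal vector per hash function

def signature(x):
    return np.sign(hyperplanes @ x)                 # vector of +1's and -1's

x = rng.normal(size=d)
y = x + 0.3 * rng.normal(size=d)                    # a vector similar to x

disagree = np.mean(signature(x) != signature(y))    # fraction of differing signs
est_angle = disagree * 180                          # estimated angle in degrees
true_angle = np.degrees(np.arccos(x @ y / (np.linalg.norm(x) * np.linalg.norm(y))))
print(est_angle, true_angle)                        # should be close
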
32
Q

dealing with dead ends

A
  1. drop dead ends from the graph recursively including their incoming arcs (example 5.4)
  2. taxation (modify the random surfing process)
33
Q

taxation

A

allowing each random surfer a small probability of teleporting to a random page, rather than following an out-link from their current page

v’ = βMv + (1 − β)e/n
where β is the probability that the random surfer follows an out-link from their present page and n is the number of pages in the graph (e is a vector of all ones)

example 5.6

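A minimal power-iteration sketch of v' = βMv + (1 − β)e/n in NumPy (the 3-page transition matrix is a toy example):

import numpy as np

beta = 0.85
# column-stochastic transition matrix M for a toy 3-page web:
# page 0 links to 1 and 2, page 1 links to 0, page 2 links to 0 and 1
M = np.array([[0.0, 1.0, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.0, 0.0]])
n = M.shape[1]
v = np.ones(n) / n          # start from the uniform distribution
e = np.ones(n)

for _ in range(50):
    v = beta * M @ v + (1 - beta) * e / n

print(v)                    # PageRank vector, sums to 1
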
34
Q

random surfer model as Markov Process

A

page importance = frequency with which the surfer visits the page

transition matrix used for iterative calculation of the page probability distribution

Markov Process converges if the graph is strongly connected and there are no dead ends

35
Q

dead end

A

nodes with no outgoing links => over time, the probability of ending up at any page goes to 0

36
Q

spider traps

A

a node where the only outgoing link goes to itself, so over time the probability of ending up there goes to 1, and to 0 for any other node

37
Q

idea strategy behind MapReduce

A

Computations must be divided into tasks, such that if any one task fails to execute to completion, it can be restarted without affecting other tasks.

38
Q

DFS components

A

distributed file system

  • chunk servers: files are divided into chunks, which are replicated on different compute nodes in different racks so we don’t lose all copies due to a rack failure
  • master node stores metadata about where files are stored (might be replicated)
  • client library: talks to master to find chunk servers and access the data
39
Q

MapReduce

A
  1. Map task turns chunk into sequence of key-value pairs
  2. key-value pairs from each Map task are collected by a master controller and sorted by key and all key-value pairs with the same key wind up at the same Reduce task
  3. Reduce task works on one key at a time and combines all the values associated with that key in some way
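
The classic word-count illustration of these three steps, simulated in plain Python (not actual Hadoop code):

from collections import defaultdict

chunks = ["deer bear river", "car car river", "deer car bear"]

# 1. Map: turn each chunk into (key, value) pairs
mapped = [(word, 1) for chunk in chunks for word in chunk.split()]

# 2. Shuffle/sort: group all values by key
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# 3. Reduce: combine all values associated with each key
counts = {key: sum(values) for key, values in groups.items()}
print(counts)   # {'deer': 2, 'bear': 2, 'river': 2, 'car': 3}
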
40
Q

DFS: coordination of the Master

A
  • keep task status
  • When a map task completes, it sends the master the location (on local file system) and sizes of its R intermediate files, one for each reducer
  • detect failures (periodic checks)
41
Q

matrix-vector multiplication MapReduce

A

Each Map task will operate on a chunk of matrix M. From each matrix element m_ij it produces the key-value pair (i, m_ij * v_j)

The Reduce task sums all values associated with a given key i, resulting in pair (i, x_i)

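A plain-Python simulation of this Map and Reduce on a tiny M and v (my own illustration):

from collections import defaultdict

M = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 4.0}   # m_ij indexed by (i, j)
v = [10.0, 20.0]

# Map: every element m_ij becomes the pair (i, m_ij * v_j)
mapped = [(i, m_ij * v[j]) for (i, j), m_ij in M.items()]

# Reduce: sum all values with the same key i
x = defaultdict(float)
for i, value in mapped:
    x[i] += value

print(dict(x))   # {0: 50.0, 1: 110.0}
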
42
Q

matrix-matrix multiplication MapReduce

A

Map function: for each matrix element m_ij we produce all key-value pairs ((i, k), (M, j, m_ij)) for k = 1, 2, . . . up to the number of columns of N and for each matrix element n_jk produce key-value pairs ((i, k), (N, j, n_jk))
for i = 1, 2, . . . up to the number of rows in M.

=> each key (i, k) will have an associated list with all the values (M, j, m_ij ) and (N, j, n_jk), for all possible values of j.

Reduce function: for each key (i, k), connect the values on the list that have the same value of j: multiply their third components m_ij and n_jk, sum these products over all j, and pair the result with (i, k) in the output

43
Q

how does PCA work?

A
  1. find the mean value of the data points on every axis
  2. shift the center to the origin
  3. find a line through the origin that best fits the data points (PC1)
  4. PC2 is the best-fitting line through the origin that is orthogonal to PC1
  5. project the points onto PC1 and PC2 to get the PCA plot
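
A minimal NumPy sketch of these steps, using the eigenvectors of the covariance matrix instead of literally fitting lines (toy 2-D data):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[3.0, 1.0], [1.0, 0.5]])   # correlated toy data

X_centered = X - X.mean(axis=0)                  # 1-2. shift the center to the origin
cov = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]            # 3-4. PC1 = direction of max variance,
pcs = eigenvectors[:, order]                     #      PC2 orthogonal to it

projected = X_centered @ pcs                     # 5. coordinates for the PCA plot
print(eigenvalues[order] / eigenvalues.sum())    # fraction of variance explained per PC
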
44
Q

t-SNE

A

find clusters: each point moves closer towards points that are near and repels points that are farther away

t-SNE knows how to move points because it first calculates the actual pairwise similarity scores between the points (stored as a matrix). The points are then randomly projected onto a lower-dimensional space, and the similarity matrix calculated there (based on the t-distribution) is transformed step by step to look more like the actual similarity matrix

45
Q

bagging

A

given training set T and a learning method:
1. create a number of versions of T (e.g. 100) by resampling with replacement: T1, …, T100
2. for each Ti build a classifier
3. return an ensemble of classifiers and take the majority vote (classification) or the average (regression)

advantages:
- no thinking, no tuning
- better accuracy

disadvantages:
- more computations (but easy to run in parallel)
- loss of interpretability

reduces the variance (limitation of the training data)

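A minimal bagging sketch with decision trees, assuming scikit-learn is available (a hand-rolled majority vote rather than a library ensemble class, to mirror the three steps above):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
n = len(X)

# 1-2. resample the training set with replacement and fit one tree per sample
trees = []
for _ in range(100):
    idx = rng.integers(0, n, size=n)
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# 3. majority vote of the ensemble
votes = np.array([tree.predict(X) for tree in trees])
majority = np.array([np.bincount(col).argmax() for col in votes.T])
print((majority == y).mean())   # training accuracy (optimistic; use OOB or a test set in practice)
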
46
Q

boosting

A

build a sequence of classifiers C1 to Ck as follows:
C1 is trained on the original data
C2 pays more attention to cases misclassified by C1
C3 pays more attention to cases misclassified by C1 and C2

  • assign to every case in T a weight that is proportional to the error made on this case by the current ensemble
  • try to correct more errors => attach bigger weights to misclassified cases, meaning the misclassified cases are included in the training set with a higher probability

boosting develops a number of ‘experts’ that specialize in different regions of the data

properties:
- reduces the error on the training set exponentially fast
- does not overfit the training set
- models expensive to build and difficult to interpret

47
Q

Random Forest algorithm

A
  1. for b = 1 to B:
    - draw a bootstrap sample of size N from training data
    - grow a random-forest tree Tb to the bootstrapped data, by recursively repeating the following steps for each terminal node of the tree, until the minimum node size is reached:

i. Select m variables at random from the p variables.
ii. Pick the best variable/split-point among the m.
iii. Split the node into two daughter nodes.

  2. output the ensemble of trees
48
Q

Random Forest advantages

A
  • superior accuracy
  • no cross-validation needed (OOB error)
  • few parameters to tune, so highly robust
  • trivial to parallelize
49
Q

out-of-bag samples (OOB)

A

for each example, construct its random forest predictor by averaging only those trees corresponding to bootstrap samples in which this example did not appear

OOB error estimate is almost identical to N-fold cross validation

50
Q

Random Forest hyperparameters

A
  • m: number of selected input variables
  • tree depth
  • minimum node size
51
Q

XGBoost

A
  • compose tree by calculating the Gain from using a certain variable split
  • Gain = similarity score_left + similarity score_right - similarity score_root
  • apply pruning: prune the lowest branch if Gain - gamma is negative
  • regularization parameter lambda

prediction = initial prediction + learning rate * tree output

repeatedly builds new trees based on new (smaller) residuals
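
A small numeric sketch of the Gain computation, assuming the standard XGBoost similarity score for squared-error loss, sum(residuals)^2 / (count + lambda); the residuals and the split are arbitrary:

def similarity(residuals, lam=1.0):
    # XGBoost similarity score for squared-error loss
    return sum(residuals) ** 2 / (len(residuals) + lam)

root  = [-10.5, 6.5, 7.5, -7.5]
left  = [-10.5]              # example split of the root's residuals
right = [6.5, 7.5, -7.5]

gain = similarity(left) + similarity(right) - similarity(root)
print(gain)                  # prune this branch if gain - gamma is negative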