C3 Flashcards

1
Q

vector space model

A

documents are vectors in a space whose dimensions are the words: each document's vector shows which words it contains
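
A minimal sketch of that idea on a made-up two-document corpus (documents, words, and variable names are purely illustrative): each document becomes a vector of word counts over the shared vocabulary.

```python
from collections import Counter

# Made-up toy corpus (illustration only)
docs = {
    "d1": "the cat sat on the mat",
    "d2": "the dog sat on the log",
}

# The dimensions of the space are the words of the vocabulary
vocab = sorted({w for text in docs.values() for w in text.split()})

# Each document becomes a vector of word counts over that vocabulary
vectors = {
    name: [Counter(text.split())[w] for w in vocab]
    for name, text in docs.items()
}

print(vocab)          # ['cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']
print(vectors["d1"])  # [1, 0, 0, 1, 1, 1, 2]: which words d1 contains (and how often)
```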

2
Q

2 problems of the vector space model

A

synonymy: many ways to refer to the same object (bike and bicycle)

polysemy: many words have more than one distinct meaning

3
Q

word embeddings model

A

represent words in a continuous vector space

  • relatively low dimensional vector space
  • semantically and syntactically similar words are mapped to nearby points (distributional hypothesis)
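
One way to make "nearby points" concrete is cosine similarity between word vectors; a small sketch with made-up 3-dimensional vectors (real embedding spaces have many more dimensions, and these numbers are invented for illustration):

```python
import math

def cosine(u, v):
    """Cosine similarity: close to 1 for similar directions, lower for dissimilar ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Invented low-dimensional "embeddings", for illustration only
bike    = [0.8, 0.1, 0.3]
bicycle = [0.7, 0.2, 0.3]
banana  = [0.1, 0.9, 0.0]

print(cosine(bike, bicycle))  # high (~0.99): similar words sit at nearby points
print(cosine(bike, banana))   # much lower (~0.22): dissimilar words sit further apart
```
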
4
Q

feedforward network

A

multilayer network in which the units are connected with no cycles

  • three kinds of nodes: input, hidden, output
  • each layer is fully connected (in standard architecture)
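
A minimal forward pass through such a network, assuming a tanh activation in the hidden layer and random weights (the layer sizes and activation are illustrative, not from the card):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy feedforward network: 4 input, 5 hidden, 3 output nodes, no cycles
x = rng.normal(size=4)           # input nodes
W1 = rng.normal(size=(5, 4))     # fully connected weights: input -> hidden
b1 = np.zeros(5)
W2 = rng.normal(size=(3, 5))     # fully connected weights: hidden -> output
b2 = np.zeros(3)

h = np.tanh(W1 @ x + b1)         # hidden layer (nonlinear activation)
y = W2 @ h + b2                  # output layer (raw scores)
print(y)
```
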
5
Q

classification with feedforward network

A

binary: single output node

multi-class: one output node per category; the output layer gives a probability distribution over the categories
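
A sketch of the two output layers, assuming the usual choices of a sigmoid for the single binary output node and a softmax for the multi-class output layer:

```python
import numpy as np

def sigmoid(z):
    """Single output node for binary classification: gives P(class = 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Multi-class output layer: turns one score per category into a probability distribution."""
    e = np.exp(z - np.max(z))    # subtract the max for numerical stability
    return e / e.sum()

print(sigmoid(0.7))                            # binary: one probability

scores = np.array([2.0, 0.5, -1.0])            # one score per category
print(softmax(scores), softmax(scores).sum())  # multi-class: probabilities summing to 1
```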

6
Q

word2vec

A

train a neural classifier on a binary prediction task: is word w likely to show up near the word bicycle? => Take the learned classifier weights on the hidden layer as the word embeddings

computationally efficient predictive model for learning word embeddings from raw text
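
In practice word2vec is usually run through an existing implementation rather than written from scratch; gensim, for instance, ships one. A minimal usage sketch, assuming gensim 4.x parameter names and a made-up toy corpus:

```python
from gensim.models import Word2Vec

# Tiny made-up corpus: a list of tokenized sentences (illustration only)
sentences = [
    ["i", "ride", "my", "bicycle", "to", "work"],
    ["she", "rides", "her", "bike", "every", "day"],
    ["he", "drives", "a", "car", "to", "work"],
]

# Train skip-gram embeddings (sg=1); vector_size is the embedding dimension
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["bicycle"][:5])                 # first few dimensions of one embedding
print(model.wv.similarity("bike", "bicycle"))  # cosine similarity of two learned vectors
```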

7
Q

training word2vec

A

supervised learning problem on unlabeled data: self-supervision

  1. treat target word and a neighbouring context word as positive examples
  2. randomly sample other words in the lexicon to get negative samples
  3. train a classifier to distinguish those two cases
  4. learned weights are the embeddings
  • maximize similarity of (target word, context word) pairs drawn from positive examples
  • minimize similarity of (w,c_neg) pairs from negative examples
  • each (target word, context word) pair is one binary classification problem that decides whether the current vectors are adjusted or not

=> weights get updated while the model processes the collection (minimize and maximize dot-products)
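
A simplified sketch of one such update (skip-gram with negative sampling), using a made-up vocabulary and a plain logistic loss on the dot products; a real implementation adds many training pairs, subsampling, and negatives drawn from a unigram distribution:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

vocab = ["bicycle", "ride", "wheel", "banana", "cloud"]
dim, lr = 8, 0.1                                    # embedding size, learning rate
W = rng.normal(scale=0.1, size=(len(vocab), dim))   # target-word embeddings
C = rng.normal(scale=0.1, size=(len(vocab), dim))   # context-word embeddings

def sgns_step(target, context, negatives):
    """One update: raise the dot product for the positive pair, lower it for the negatives."""
    t = vocab.index(target)
    for c_word, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        c = vocab.index(c_word)
        p = sigmoid(W[t] @ C[c])   # predicted probability that this is a real context pair
        grad = p - label           # gradient of the logistic loss w.r.t. the dot product
        w_old = W[t].copy()
        W[t] -= lr * grad * C[c]   # adjust the target-word vector
        C[c] -= lr * grad * w_old  # adjust the context-word vector

# a positive (target, context) pair from the corpus plus randomly sampled negatives
sgns_step("bicycle", "ride", negatives=["banana", "cloud"])
print(W[vocab.index("bicycle")])   # the rows of W are the learned word embeddings
```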

8
Q

3 advantages of word2vec

A
  • it scales: can be trained on billion-word corpora in limited time, and training can be parallelized
  • pre-trained embeddings: word embeddings trained by one party can be reused by others
  • incremental training: train on one piece of the data, save the results, continue training later

9
Q

4 text mining tasks that benefit from using word2vec

A
  • synonym detection
  • richer word and context representation for named entity recognition
  • document similarity
  • finding word associations/clusters
10
Q

tf-idf

A

w_t,d = tf_t,d * idf_t

tf_t,d = term frequency of word t in document d = 1 + log_10(count(t,d)) if count(t,d) > 0, else 0

idf = inverse document frequency = log_10(N/df_t) with N the total number of documents and df_t the number of documents t occurs in
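
A worked computation of these formulas on a made-up corpus of N = 3 documents:

```python
import math

# Made-up corpus for illustration
docs = {
    "d1": "the cat sat on the mat".split(),
    "d2": "the dog chased the cat".split(),
    "d3": "bananas are yellow".split(),
}
N = len(docs)

def tf(t, d):
    count = docs[d].count(t)
    return 1 + math.log10(count) if count > 0 else 0.0

def idf(t):
    df = sum(1 for d in docs if t in docs[d])  # number of documents t occurs in
    return math.log10(N / df)

def tfidf(t, d):
    return tf(t, d) * idf(t)

print(tfidf("cat", "d1"))  # tf = 1 (one occurrence), idf = log10(3/2) since "cat" is in 2 of 3 docs
print(tfidf("the", "d1"))  # tf = 1 + log10(2) (two occurrences), idf = log10(3/2)
```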

11
Q

PMI

A

Pointwise Mutual Information

PMI(w,c) = log_2( P(w,c) / (P(w) P(c)) )

estimate of how much more the two words co-occur than we expect by chance
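
A worked sketch with made-up co-occurrence counts, estimating P(w,c) from a count table and P(w), P(c) from its row and column totals:

```python
import math

# Made-up co-occurrence counts for illustration: counts[w][c]
counts = {
    "wrote": {"book": 20, "banana": 1},
    "ate":   {"book": 1,  "banana": 18},
}
total = sum(v for row in counts.values() for v in row.values())

def pmi(w, c):
    p_wc = counts[w][c] / total                      # joint probability estimate
    p_w = sum(counts[w].values()) / total            # row (word) marginal
    p_c = sum(counts[x][c] for x in counts) / total  # column (context) marginal
    return math.log2(p_wc / (p_w * p_c))

print(pmi("wrote", "book"))    # positive: the pair co-occurs more often than chance predicts
print(pmi("wrote", "banana"))  # negative: the pair co-occurs less often than chance predicts
```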

12
Q

first-order co-occurrence

A

when two words are typically nearby each other (“wrote” and “book”)

13
Q

second-order co-occurrence

A

when two words have similar neighbours (“wrote” and “said”)

14
Q

distributional hypothesis

A

words that occur in similar contexts tend to be similar (the context of a word defines its meaning)

15
Q

proposed neural architectures for computing word vectors

A
  • word2vec
  • GloVe
  • FastText
  • ELMo
  • BERT