00 Overview Flashcards

1
Q

Sequence and grammar

A

Sequences:

  • Probabilities over strings: n-gram models; linear and surface-oriented
  • Sequence classification: HMMs add one layer of abstraction, with class labels as hidden variables, but are still only linear

Grammar: adds hierarchical structure

  • Shift focus from “sequences” to “sentences”
  • Identifying underlying structure using formal rules
    • Declarative aspect: formal grammar
    • Procedural aspect: parsing strategy
  • Learn a probability distribution over the rules for scoring trees
2
Q

We have been doing data-driven learning by counting observations in what contexts?

A
  • feature vectors in semantic spaces; bag-of-words, etc.
  • previous n-1 words in n-gram models
  • previous n-1 states in HMMs
  • local sub-trees in PCFGs
3
Q

Vector space models

A
  • Data representation based on a spatial metaphor
  • Objects modeled as feature vectors positioned in a coordinate system
  • Semantic spaces = vector spaces for distributional lexical semantics
  • Some issues:
    • Usage = meaning? (The distributional hypothesis)
    • How do we define context / features? (BoW, n-grams, etc.)
    • Text normalization (lemmatization, stemming, etc.)
    • How do we measure similarity? Distance / proximity metrics (Euclidean distance, cosine, dot product, etc.); see the sketch below
    • Length-normalization (ways to deal with frequency effects / length-bias)
    • High-dimensional sparse vectors (i.e. few active features; consequences for low-level choice of data structure, etc.)
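
A minimal Python sketch, assuming dict-based sparse feature vectors and made-up toy counts, of cosine similarity with length-normalization:

  import math

  def cosine(u, v):
      # Dot product over the features active in both sparse vectors.
      dot = sum(u[f] * v[f] for f in u if f in v)
      # Length-normalization counters frequency effects / length bias.
      norm_u = math.sqrt(sum(x * x for x in u.values()))
      norm_v = math.sqrt(sum(x * x for x in v.values()))
      if norm_u == 0 or norm_v == 0:
          return 0.0
      return dot / (norm_u * norm_v)

  # Bag-of-words co-occurrence counts as sparse feature vectors (toy data).
  dog = {"bark": 4, "tail": 2, "pet": 5}
  cat = {"meow": 3, "tail": 2, "pet": 6}
  print(cosine(dog, cat))  # high similarity: the vectors share "tail" and "pet"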
4
Q

Classification

A

Supervised learning from labeled training data.

Given data annotated with predefined class labels, learn to predict membership for new/unseen objects.

5
Q

Cluster analysis

A

Unsupervised learning from unlabeled data.

  • Automatically forming groups of similar objects.
  • No predefined classes; we only specify the similarity measure.

6
Q

Issues with categorization tasks

A
  • Measuring similarity
  • Representing classes (e.g. exemplar-based vs. centroid-based)
  • Representing class membership (hard vs. soft)
7
Q

Examples of vector space classifiers

A

Rocchio vs. kNN

8
Q

Differences between Rocchio and kNN

A
  • Centroid- vs exemplar-based class representation
  • Linear vs non-linear decision boundaries
  • Assumptions about the distribution within the class
  • Complexity in training vs complexity in prediction
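
A minimal Python sketch contrasting the two, assuming dense list vectors, Euclidean distance and made-up toy data: Rocchio reduces each class to a centroid at training time (linear decision boundaries), while kNN stores all exemplars and defers the work to prediction time:

  import math
  from collections import Counter

  def euclidean(u, v):
      return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

  def rocchio_train(data):
      # Centroid-based class representation: one mean vector per class.
      centroids = {}
      for label, vectors in data.items():
          dim = len(vectors[0])
          centroids[label] = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
      return centroids

  def rocchio_predict(centroids, x):
      # Assign to the class whose centroid is closest.
      return min(centroids, key=lambda label: euclidean(centroids[label], x))

  def knn_predict(data, x, k=3):
      # Exemplar-based: rank all training points, vote among the k nearest.
      neighbours = sorted(
          ((euclidean(v, x), label) for label, vectors in data.items() for v in vectors)
      )[:k]
      return Counter(label for _, label in neighbours).most_common(1)[0][0]

  data = {"sports": [[5.0, 1.0], [4.0, 0.0]], "politics": [[0.0, 4.0], [1.0, 5.0]]}
  x = [3.0, 1.0]
  print(rocchio_predict(rocchio_train(data), x), knn_predict(data, x, k=3))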
9
Q

Evaluation of classifiers

A
  • Accuracy, precision, recall and F-score
  • Multi-class evaluation: Micro-/macro-averaging
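
A minimal Python sketch of per-class precision, recall and F-score over toy gold/predicted labels, plus the two ways of averaging across classes:

  from collections import Counter

  gold = ["A", "A", "B", "B", "B", "C"]
  pred = ["A", "B", "B", "B", "C", "C"]

  labels = sorted(set(gold) | set(pred))
  tp = Counter()
  fp = Counter()
  fn = Counter()
  for g, p in zip(gold, pred):
      if g == p:
          tp[g] += 1
      else:
          fp[p] += 1
          fn[g] += 1

  def prf(tp_, fp_, fn_):
      p = tp_ / (tp_ + fp_) if tp_ + fp_ else 0.0
      r = tp_ / (tp_ + fn_) if tp_ + fn_ else 0.0
      f = 2 * p * r / (p + r) if p + r else 0.0
      return p, r, f

  # Macro-averaging: compute P/R/F per class, then take the unweighted mean.
  macro_f = sum(prf(tp[l], fp[l], fn[l])[2] for l in labels) / len(labels)
  # Micro-averaging: pool the counts over all classes first, then compute P/R/F.
  micro_f = prf(sum(tp.values()), sum(fp.values()), sum(fn.values()))[2]
  print(macro_f, micro_f)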
10
Q

Types of clustering

A
  • Flat / partitional clustering
  • Hierarchical clustering
    • Agglomerative clustering (bottom-up)
    • Divisive clustering (top-down)
11
Q

Flat clustering algorithm

A
  • Example: k-Means.
  • Partitioning viewed as an optimization problem:
    • Minimize the within-cluster sum of squares.
  • Approximated by iteratively improving on some initial partition.
  • Issues: initialization / seeding, non-determinism, sensitivity to outliers, termination criterion, specifying k, specifying the similarity function.
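
A minimal Python sketch of the k-means loop, assuming dense list vectors, Euclidean distance, first-k seeding, and a fixed number of iterations as the termination criterion (toy data; a real implementation would handle random restarts, convergence checks, empty clusters, etc.):

  import math

  def euclidean(u, v):
      return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

  def mean(vectors):
      return [sum(v[i] for v in vectors) / len(vectors) for i in range(len(vectors[0]))]

  def kmeans(points, k, iterations=10):
      # Seeding: take the first k points as initial centroids (in practice seeds are
      # usually chosen at random, which is one source of non-determinism).
      centroids = [list(p) for p in points[:k]]
      for _ in range(iterations):  # termination criterion: fixed number of iterations
          # Assignment step: put each point in the cluster of its nearest centroid.
          clusters = [[] for _ in range(k)]
          for p in points:
              nearest = min(range(k), key=lambda j: euclidean(p, centroids[j]))
              clusters[nearest].append(p)
          # Update step: move each centroid to the mean of its cluster, which
          # iteratively reduces the within-cluster sum of squares.
          centroids = [mean(c) if c else centroids[j] for j, c in enumerate(clusters)]
      return centroids, clusters

  points = [[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5], [1.0, 0.5]]
  centroids, clusters = kmeans(points, k=2)
  print(centroids)
  print(clusters)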
12
Q

Agglomerative clustering

A
  • Bottom-up hierarchical clustering
  • Resulting in a set of nested partitions, often visualized as a dendrogram.
  • Issues:
    • Linkage criteria (how to measure inter-cluster similarity):
      • Single, Complete, Centroid, or Average Linkage
    • Cutting the tree
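
A minimal Python sketch of agglomerative clustering over a toy precomputed distance matrix, with single vs. complete linkage as the inter-cluster criterion (in practice one would use a library such as scipy.cluster.hierarchy):

  def agglomerative(dist, linkage="single"):
      # Start with every object in its own cluster (bottom-up).
      clusters = [[i] for i in range(len(dist))]
      merges = []
      while len(clusters) > 1:
          best = None
          for a in range(len(clusters)):
              for b in range(a + 1, len(clusters)):
                  pair_dists = [dist[i][j] for i in clusters[a] for j in clusters[b]]
                  # Linkage criterion: how inter-cluster distance is defined.
                  d = min(pair_dists) if linkage == "single" else max(pair_dists)
                  if best is None or d < best[0]:
                      best = (d, a, b)
          d, a, b = best
          merges.append((clusters[a], clusters[b], d))  # one level of the dendrogram
          clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [clusters[a] + clusters[b]]
      return merges

  # Toy distance matrix over four objects.
  dist = [
      [0.0, 1.0, 4.0, 5.0],
      [1.0, 0.0, 3.0, 4.5],
      [4.0, 3.0, 0.0, 1.5],
      [5.0, 4.5, 1.5, 0.0],
  ]
  for left, right, d in agglomerative(dist, linkage="complete"):
      print(left, right, d)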
13
Q

Divisive clustering

A
  • Top-down hierarchical clustering
  • Generates the tree by iteratively applying a flat clustering method.
14
Q

Structured Probabilistic Models

A
  • Switching from a geometric view to a probability distribution view.
  • Model the probability that elements (words, labels) are in a particular configuration.
  • We looked at many of the same concepts over structures that are linear or hierarchical.
15
Q

What have we been modelling with Structured Probabilistic Models?

A

Linear

  • Which string is more likely?
  • Which tag sequence is most likely?

Hierarchical

  • Which tree structure is most likely?
16
Q

Structured Probabilistic Models: Types

A

Linear

  • n-gram language models
    • Chain rule combines conditional probabilities to model context
    • Markov assumption allows us to limit the length of context
  • Hidden Markov Model
    • added a hidden layer of abstraction: PoS tags
    • also uses the chain rule with the Markov assumption

Hierarchical

  • (Probabilistic) Context-Free Grammars (PCFGs)
    • Hidden layer of abstraction: trees
    • Chain rule over (P)CFG rules
17
Q

n-gram language models

A

n-gram language models

  • Linear Structured Probabilistic Model
  • Chain rule combines conditional probabilities to model context
  • Markov assumption allows us to limit the length of context
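
A minimal Python sketch of a bigram language model with MLE counts and no smoothing, over a made-up toy corpus:

  from collections import Counter

  corpus = [["<s>", "the", "dog", "barks", "</s>"],
            ["<s>", "the", "cat", "meows", "</s>"],
            ["<s>", "the", "dog", "sleeps", "</s>"]]

  # MLE: P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})
  bigrams = Counter((s[i], s[i + 1]) for s in corpus for i in range(len(s) - 1))
  unigrams = Counter(w for s in corpus for w in s[:-1])

  def prob(sentence):
      # Chain rule + Markov assumption: condition each word on the previous word only.
      p = 1.0
      for prev, word in zip(sentence, sentence[1:]):
          p *= bigrams[(prev, word)] / unigrams[prev]
      return p

  print(prob(["<s>", "the", "dog", "barks", "</s>"]))  # 3/3 * 2/3 * 1/2 * 1/1 = 1/3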
18
Q

Hidden Markov Model

A
  • Linear Structured Probabilistic Model
  • A hidden layer of abstraction: PoS tags
  • Uses the chain rule with the Markov assumption
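
A minimal Python sketch, with made-up toy transition and emission probabilities, of how an HMM scores a tag sequence for a sentence via the chain rule with the Markov assumption:

  # Toy transition probabilities P(tag_i | tag_{i-1}), with <s> as the start state,
  # and emission probabilities P(word_i | tag_i).  In practice both are MLE counts.
  trans = {("<s>", "DET"): 0.8, ("DET", "NOUN"): 0.9, ("NOUN", "VERB"): 0.6}
  emit = {("DET", "the"): 0.7, ("NOUN", "dog"): 0.1, ("VERB", "barks"): 0.05}

  def joint_prob(words, tags):
      # P(words, tags) = prod_i P(tag_i | tag_{i-1}) * P(word_i | tag_i)
      p = 1.0
      prev = "<s>"
      for word, tag in zip(words, tags):
          p *= trans.get((prev, tag), 0.0) * emit.get((tag, word), 0.0)
          prev = tag
      return p

  print(joint_prob(["the", "dog", "barks"], ["DET", "NOUN", "VERB"]))
  # 0.8*0.7 * 0.9*0.1 * 0.6*0.05 = 0.001512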
19
Q

PCFGs

A

(Probabilistic) Context-Free Grammars

  • Hierarchical Structured Probabilistic Model
  • Hidden layer of abstraction: trees
  • Chain rule over (P)CFG rules: the probability of a tree is the product of the probabilities of the rules used to build it (see the sketch below)
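
A minimal Python sketch of that product, assuming trees as nested tuples and made-up toy rule probabilities:

  # Toy PCFG rule probabilities P(rhs | lhs); in practice these are MLE counts from a treebank.
  rules = {
      ("S", ("NP", "VP")): 1.0,
      ("NP", ("DET", "N")): 0.6,
      ("VP", ("V",)): 0.3,
      ("DET", ("the",)): 0.8,
      ("N", ("dog",)): 0.2,
      ("V", ("barks",)): 0.1,
  }

  def tree_prob(tree):
      # A tree is (label, child, child, ...); leaves are plain strings at the fringe.
      label, *children = tree
      rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
      p = rules[(label, rhs)]
      for child in children:
          if not isinstance(child, str):
              p *= tree_prob(child)
      return p

  tree = ("S", ("NP", ("DET", "the"), ("N", "dog")), ("VP", ("V", "barks")))
  print(tree_prob(tree))  # 1.0 * 0.6 * 0.8 * 0.2 * 0.3 * 0.1 = 0.00288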
20
Q

Maximum Likelihood Estimation

A

Estimate probabilities as relative frequencies of observed counts, e.g. P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1}). This is how we estimate n-gram probabilities, HMM transition and emission probabilities, and PCFG rule probabilities.
21
Q

HMMs can be used to:

A
  • calculate the probability of a string
  • find the most likely state sequence for a particular observation sequence
22
Q

CFGs can be used to:

A
  • A CFG can recognise strings that are a valid part of the defined language.
  • A PCFG can calculate the probability of a tree (where the sentence is encoded by the leaves).
23
Q

n-gram models can be used to

A

n-gram models can be used to calculate the probability of a string

24
Q

When to use the Viterbi algorithm?

A
  • HMM: Efficiently find the most likely state sequence
  • PCFG: Efficiently find the most likely parse tree
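
A minimal Python sketch of Viterbi for the HMM case, assuming toy transition/emission dictionaries, a <s> start state, and no log-probabilities or smoothing:

  def viterbi(words, tags, trans, emit):
      # best[i][t] = probability of the best tag sequence for words[:i+1] ending in tag t
      best = [{t: trans.get(("<s>", t), 0.0) * emit.get((t, words[0]), 0.0) for t in tags}]
      back = [{}]
      for i in range(1, len(words)):
          best.append({})
          back.append({})
          for t in tags:
              # Dynamic programming: reuse the best score of each prefix ending in a previous tag.
              prev = max(tags, key=lambda s: best[i - 1][s] * trans.get((s, t), 0.0))
              best[i][t] = best[i - 1][prev] * trans.get((prev, t), 0.0) * emit.get((t, words[i]), 0.0)
              back[i][t] = prev
      # Follow the backpointers from the best final tag.
      last = max(tags, key=lambda t: best[-1][t])
      path = [last]
      for i in range(len(words) - 1, 0, -1):
          path.append(back[i][path[-1]])
      return list(reversed(path))

  tags = ["DET", "NOUN", "VERB"]
  trans = {("<s>", "DET"): 0.8, ("DET", "NOUN"): 0.9, ("NOUN", "VERB"): 0.6}
  emit = {("DET", "the"): 0.7, ("NOUN", "dog"): 0.1, ("VERB", "barks"): 0.05,
          ("NOUN", "barks"): 0.01}
  print(viterbi(["the", "dog", "barks"], tags, trans, emit))  # ['DET', 'NOUN', 'VERB']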
25
Q

When to use the Forward algorithm?

A

HMM: To efficiently calculate the probability of the observation sequence
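
A minimal Python sketch of the Forward algorithm, with the same kind of toy transition/emission dictionaries: the recursion has the same shape as Viterbi, but sums over previous states instead of maximizing:

  def forward(words, tags, trans, emit):
      # alpha[t] = total probability of all state sequences for the prefix so far ending in t
      alpha = {t: trans.get(("<s>", t), 0.0) * emit.get((t, words[0]), 0.0) for t in tags}
      for word in words[1:]:
          # Sum (rather than maximize, as in Viterbi) over the previous state.
          alpha = {t: sum(alpha[s] * trans.get((s, t), 0.0) for s in tags)
                      * emit.get((t, word), 0.0) for t in tags}
      return sum(alpha.values())  # P(observation sequence) = sum over final states

  tags = ["DET", "NOUN", "VERB"]
  trans = {("<s>", "DET"): 0.8, ("DET", "NOUN"): 0.9, ("NOUN", "VERB"): 0.6}
  emit = {("DET", "the"): 0.7, ("NOUN", "dog"): 0.1, ("VERB", "barks"): 0.05}
  print(forward(["the", "dog", "barks"], tags, trans, emit))  # 0.001512 with these toy numbers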

26
Q

When to use chart parsing?

A

To efficiently explore the parse tree search space

27
Q

Dynamic programming: What are our sub-problems?

A
  • HMM: Our sub-problems are prefixes to our full sequence
  • (P)CFG parsing: Our sub-problems are sub-trees which cover sub-spans of our input
28
Q

Evaluation

A

Linear

  • Tag accuracy is the most common evaluation metric for POS tagging, since usually the number of words being tagged is fixed.

Hierarchical

  • Coverage is a measure of how well a CFG models the full range of the language it is designed for.
  • The ParsEval metric evaluates parser accuracy by calculating the precision, recall and F1 score over labelled constituents.
29
Q

Tag accuracy

A

Tag accuracy is the most common evaluation metric for POS tagging, since usually the number of words being tagged is fixed.

30
Q

Coverage

A

Coverage is a measure of how well a CFG models the full range of the language it is designed for.

31
Q

ParsEval

A

The ParsEval metric evaluates parser accuracy by calculating the precision, recall and F1 score over labelled constituents.
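
A minimal Python sketch of the ParsEval idea, assuming constituents have already been extracted as (label, start, end) spans (toy spans):

  # Labelled constituents as (label, start, end) spans over the input sentence.
  gold = {("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)}
  pred = {("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("PP", 3, 5)}

  correct = len(gold & pred)             # constituents with matching label and span
  precision = correct / len(pred)        # correct / constituents proposed by the parser
  recall = correct / len(gold)           # correct / constituents in the gold tree
  f1 = 2 * precision * recall / (precision + recall)
  print(precision, recall, f1)           # 0.75 0.75 0.75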