00 Overview Flashcards
Sequence and grammar
Sequences:
- Probabilities over strings: n-gram models are linear and surface-oriented
- Sequence classification: HMMs add one layer of abstraction (class labels as hidden variables), but are still only linear
Grammar: adds hierarchical structure
- Shift focus from “sequences” to “sentences”
- Identifying underlying structure using formal rules
- Declarative aspect: formal grammar
- Procedural aspect: parsing strategy
- Learn probability distribution over the rules for scoring trees
We have been doing data-driven learning by counting observations in context. Which contexts have we used?
- feature vectors in semantic spaces; bag-of-words, etc.
- previous n-1 words in n-gram models
- previous n-1 states in HMMs
- local sub-trees in PCFGs
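All of these boil down to maximum likelihood estimation by counting and normalizing. A minimal sketch for the bigram case (the toy corpus and the `<s>`/`</s>` boundary markers are assumptions for illustration):

```python
from collections import Counter

# Toy two-sentence corpus; <s> and </s> are assumed boundary markers.
corpus = [["<s>", "the", "dog", "barks", "</s>"],
          ["<s>", "the", "cat", "sleeps", "</s>"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter(pair for sent in corpus for pair in zip(sent, sent[1:]))

def p_bigram(w2, w1):
    """MLE estimate: P(w2 | w1) = count(w1, w2) / count(w1)."""
    return bigrams[(w1, w2)] / unigrams[w1]

p_bigram("dog", "the")  # count(the, dog) = 1, count(the) = 2, so 0.5
```

The same count-and-normalize recipe gives emission/transition probabilities for HMMs and rule probabilities for PCFGs; only the context being conditioned on changes.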
Vector space models
- Data representation based on a spatial metaphor
- Objects modeled as feature vectors positioned in a coordinate system
- Semantic spaces = vector spaces for distributional lexical semantics
- Some issues:
- Usage = meaning? (The distributional hypothesis)
- How do we define context / features? (BoW, n-grams, etc.)
- Text normalization (lemmatization, stemming, etc.)
- How do we measure similarity? Distance / proximity metrics. (Euclidean distance, cosine, dot-product, etc.)
- Length-normalization (ways to deal with frequency effects / length-bias)
- High-dimensional sparse vectors (i.e. few active features; consequences for low-level choice of data structure, etc.)
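A sketch of how sparse representation and a similarity metric fit together, with vectors stored as `{feature: weight}` dicts (a common low-level choice for sparse data; the toy documents are assumptions):

```python
import math

def cosine(u, v):
    """Cosine similarity for sparse vectors stored as {feature: weight} dicts.
    Iterating over the smaller dict exploits sparsity (few active features)."""
    if len(u) > len(v):
        u, v = v, u
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v)

doc1 = {"cat": 2.0, "purr": 1.0}
doc2 = {"cat": 1.0, "dog": 1.0}
cosine(doc1, doc2)  # 2 / (sqrt(5) * sqrt(2)) ≈ 0.632
```

Note that cosine is the dot product of length-normalized vectors, which is one way of dealing with the frequency/length-bias issue above.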
Classification
Supervised learning from labeled training data.
Given data annotated with predefined class labels, learn to predict membership for new/unseen objects.
Cluster analysis
Unsupervised learning from unlabeled data.
Automatically forming groups of similar objects.
No predefined classes; we only specify the similarity measure.
Issues with categorization tasks
- Measuring similarity
- Representing classes (e.g. exemplar-based vs. centroid-based)
- Representing class membership (hard vs. soft)
Examples of vector space classifiers
Rocchio vs. kNN
Differences between Rocchio and kNN
- Centroid- vs exemplar-based class representation
- Linear vs non-linear decision boundaries
- Assumptions about the distribution within the class
- Complexity in training vs complexity in prediction
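The contrast can be sketched in a few lines; the toy training data and the Euclidean distance choice are assumptions for illustration:

```python
import math
from collections import Counter

def centroid(vectors):
    """Mean vector of a class (Rocchio's class representation)."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def rocchio_predict(x, class_vectors):
    """Assign x to the class with the nearest centroid (linear boundaries)."""
    return min(class_vectors,
               key=lambda c: math.dist(x, centroid(class_vectors[c])))

def knn_predict(x, labeled, k=3):
    """Majority vote among the k nearest training examples
    (exemplar-based, non-linear boundaries)."""
    nearest = sorted(labeled, key=lambda pair: math.dist(x, pair[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = {"A": [[0.0, 0.0], [1.0, 0.0]], "B": [[5.0, 5.0], [6.0, 5.0]]}
labeled = [(v, c) for c, vs in train.items() for v in vs]
rocchio_predict([0.5, 0.2], train)       # "A"
knn_predict([0.5, 0.2], labeled, k=3)    # "A"
```

The complexity trade-off shows up directly: Rocchio can precompute one centroid per class (cheap prediction), while kNN does no training at all but must scan the stored exemplars at prediction time.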
Evaluation of classifiers
- Accuracy, precision, recall and F-score
- Multi-class evaluation: Micro-/macro-averaging
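A sketch of per-class precision/recall/F1 and macro-averaging (the gold/predicted label lists are toy assumptions):

```python
def prf(gold, pred, cls):
    """Precision, recall and F1 for one class label."""
    tp = sum(1 for g, p in zip(gold, pred) if g == p == cls)
    fp = sum(1 for g, p in zip(gold, pred) if p == cls and g != cls)
    fn = sum(1 for g, p in zip(gold, pred) if g == cls and p != cls)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

def macro_f1(gold, pred):
    """Macro-averaging: mean of per-class F1, all classes weighted equally.
    Micro-averaging would instead pool the tp/fp/fn counts across classes."""
    classes = set(gold) | set(pred)
    return sum(prf(gold, pred, c)[2] for c in classes) / len(classes)

gold = ["a", "a", "b", "b"]
pred = ["a", "b", "b", "b"]
macro_f1(gold, pred)  # mean of F1_a = 2/3 and F1_b = 0.8
```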
Types of clustering
- Flat / partitional clustering
- Hierarchical clustering
- Agglomerative clustering (bottom-up)
- Divisive clustering (top-down)
Flat clustering algorithm
- Example: k-Means.
- Partitioning viewed as an optimization problem:
- Minimize the within-cluster sum of squares.
- Approximated by iteratively improving on some initial partition.
- Issues: initialization / seeding, non-determinism, sensitivity to outliers, termination criterion, specifying k, specifying the similarity function.
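The optimization-by-iteration view can be sketched as follows; random seeding, Euclidean distance, and the fixed seed are assumptions, and several of the issues above (seeding, termination, empty clusters) are visible in the code:

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain k-means: seed with random points, then alternate assignment
    and centroid-update steps until the partition stops changing."""
    rng = random.Random(seed)            # fixed seed: result is seeding-dependent
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(iters):
        new_assignment = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                          for p in points]
        if new_assignment == assignment:  # termination: partition is stable
            break
        assignment = new_assignment
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:                   # an empty cluster keeps its centroid
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return assignment, centroids

labels, cents = kmeans([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]], k=2)
```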
Agglomerative clustering
- Bottom-up hierarchical clustering
- Resulting in a set of nested partitions, often visualized as a dendrogram.
- Issues:
- Linkage criteria: how to measure inter-cluster similarity
- Single, Complete, Centroid, or Average Linkage
- Cutting the tree
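A sketch of the bottom-up loop with single linkage (the toy points are assumptions; the returned merge history is what a dendrogram would visualize):

```python
import math

def single_linkage(points):
    """Agglomerative clustering, single linkage: repeatedly merge the two
    clusters whose closest members are nearest. Returns the merge history,
    i.e. the nested partitions a dendrogram would show."""
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Single linkage: inter-cluster distance = min over member pairs.
        a, b = min(((a, b) for a in range(len(clusters))
                           for b in range(a + 1, len(clusters))),
                   key=lambda ab: min(math.dist(points[i], points[j])
                                      for i in clusters[ab[0]]
                                      for j in clusters[ab[1]]))
        merged = clusters[a] + clusters[b]
        merges.append(merged)
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges

merges = single_linkage([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
```

Complete, centroid, or average linkage would only change the `key` function; "cutting the tree" amounts to stopping the loop early or slicing the merge history.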
Divisive clustering
- Top-down hierarchical clustering
- Generates the tree by iteratively applying a flat clustering method.
Structured Probabilistic Models
- Switching from a geometric view to a probability distribution view.
- Model the probability that elements (words, labels) are in a particular configuration.
- We revisited many of the same concepts over linear and hierarchical structures
What have we been modelling with Structured Probabilistic Models?
Linear
- Which string is more likely?
- Which tag sequence is most likely?
Hierarchical
- Which tree structure is most likely?
Structured Probabilistic Models: Types
Linear
- n-gram language models
- Chain rule combines conditional probabilities to model context
- Markov assumption allows us to limit the length of context
- Hidden Markov Model
- added a hidden layer of abstraction: PoS tags
- also uses the chain rule with the Markov assumption
Hierarchical
- (Probabilistic) Context-Free Grammars (PCFGs)
- Hidden layer of abstraction: trees
- Chain rule over (P)CFG rules
n-gram language models
- Linear Structured Probabilistic Model
- Chain rule combines conditional probabilities to model context
- Markov assumption allows us to limit the length of context
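The two bullets above combine into a product of bigram probabilities; a sketch in log space (the hand-set toy probabilities are assumptions for illustration):

```python
import math

# Hand-set toy bigram probabilities P(w2 | w1); purely illustrative.
P = {("<s>", "the"): 0.8, ("the", "dog"): 0.3, ("dog", "</s>"): 0.5}

def sentence_logprob(words):
    """Chain rule + first-order Markov assumption:
    log P(w1..wn) = sum_i log P(w_i | w_{i-1}).
    Summing logs avoids floating-point underflow on long sentences."""
    padded = ["<s>"] + words + ["</s>"]
    return sum(math.log(P[(w1, w2)]) for w1, w2 in zip(padded, padded[1:]))

math.exp(sentence_logprob(["the", "dog"]))  # 0.8 * 0.3 * 0.5 = 0.12
```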

Hidden Markov Model
- Linear Structured Probabilistic Model
- A hidden layer of abstraction: PoS tags
- Uses the chain rule with the Markov assumption

PCFGs
(Probabilistic) Context-Free Grammars
- Hierarchical Structured Probabilistic Model
- Hidden layer of abstraction: trees
- Chain rule over (P)CFG rules

Maximum Likelihood Estimation
- Estimate probabilities as relative frequencies: count occurrences in the training data and divide by the count of the conditioning context.

HMMs can be used to:
- calculate the probability of a string
- find the most likely state sequence for a particular observation sequence
CFGs can be used to:
- A CFG can recognise whether a string is part of the language it defines.
- A PCFG can calculate the probability of a tree (where the sentence is encoded by the leaves).
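The tree probability is the product of the probabilities of the rules (local sub-trees) it uses. A sketch with trees as nested tuples (the toy grammar is an assumption):

```python
# Toy rule probabilities P(rhs | lhs); hand-set for illustration.
RULES = {("S", ("NP", "VP")): 1.0,
         ("NP", ("dogs",)): 0.4,
         ("VP", ("bark",)): 0.7}

def tree_prob(tree):
    """P(tree) = product over its local sub-trees (rule applications).
    A tree is (label, child, ...); leaves are plain word strings."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = RULES[(label, rhs)]
    for c in children:
        if not isinstance(c, str):
            p *= tree_prob(c)
    return p

tree_prob(("S", ("NP", "dogs"), ("VP", "bark")))  # 1.0 * 0.4 * 0.7 = 0.28
```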
n-gram models can be used to
n-gram models can be used to calculate the probability of a string
When to use the Viterbi algorithm?
- HMM: Efficiently find the most likely state sequence
- PCFG: Efficiently find the most likely parse tree
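A sketch of Viterbi for the HMM case (the toy tagger with hand-set probabilities is an assumption for illustration):

```python
# Toy HMM over PoS tags; all probabilities are hand-set assumptions.
STATES = ["N", "V"]
START = {"N": 0.6, "V": 0.4}
TRANS = {"N": {"N": 0.2, "V": 0.8}, "V": {"N": 0.6, "V": 0.4}}
EMIT = {"N": {"dogs": 0.7, "bark": 0.3}, "V": {"dogs": 0.1, "bark": 0.9}}

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely state sequence: dynamic programming over prefixes,
    keeping only the best-scoring path into each state at each step."""
    # best[t][s] = (prob of best path ending in state s at time t, backpointer)
    best = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for t in range(1, len(obs)):
        best.append({})
        for s in states:
            p, prev = max(((best[t - 1][r][0] * trans_p[r][s] * emit_p[s][obs[t]], r)
                           for r in states), key=lambda x: x[0])
            best[t][s] = (p, prev)
    last = max(states, key=lambda s: best[-1][s][0])  # best final state
    path = [last]
    for t in range(len(obs) - 1, 0, -1):              # follow backpointers
        path.append(best[t][path[-1]][1])
    return list(reversed(path))

viterbi(["dogs", "bark"], STATES, START, TRANS, EMIT)  # ["N", "V"]
```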
When to use the Forward algorithm?
HMM: To efficiently calculate the probability of the observation sequence
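The forward algorithm is Viterbi with max replaced by sum; a sketch using the same hand-set toy HMM (an assumption for illustration):

```python
# Toy HMM over PoS tags; all probabilities are hand-set assumptions.
STATES = ["N", "V"]
START = {"N": 0.6, "V": 0.4}
TRANS = {"N": {"N": 0.2, "V": 0.8}, "V": {"N": 0.6, "V": 0.4}}
EMIT = {"N": {"dogs": 0.7, "bark": 0.3}, "V": {"dogs": 0.1, "bark": 0.9}}

def forward(obs, states, start_p, trans_p, emit_p):
    """P(observation sequence): like Viterbi, but *sum* over all state
    paths instead of keeping only the best one."""
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: sum(alpha[r] * trans_p[r][s] for r in states) * emit_p[s][o]
                 for s in states}
    return sum(alpha.values())

forward(["dogs", "bark"], STATES, START, TRANS, EMIT)  # 0.3492
```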
When to use chart parsing?
To efficiently explore the parse tree search space
Dynamic programming: What are our sub-problems?
- HMM: Our sub-problems are prefixes of the full sequence
- (P)CFG parsing: Our subproblems are sub-trees which cover sub-spans of our input
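For the (P)CFG case, the sub-span idea is what CKY chart parsing implements; a minimal recognizer sketch (the toy grammar, already in Chomsky normal form, is an assumption):

```python
from itertools import product

# Toy CNF grammar: binary and lexical rules, assumed for illustration.
BINARY = {("NP", "VP"): "S"}
LEXICAL = {"dogs": {"NP"}, "bark": {"VP"}}

def cky_recognize(words):
    """CKY: chart[i][j] holds the labels derivable over the span words[i:j].
    Each cell is filled from smaller sub-spans (the DP sub-problems)."""
    n = len(words)
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1] = set(LEXICAL.get(w, ()))
    for length in range(2, n + 1):            # widen spans bottom-up
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):         # all split points of the span
                for l, r in product(chart[i][k], chart[k][j]):
                    if (l, r) in BINARY:
                        chart[i][j].add(BINARY[(l, r)])
    return "S" in chart[0][n]

cky_recognize(["dogs", "bark"])  # True
```

A probabilistic version would store the best probability per label in each cell (the PCFG Viterbi variant mentioned above).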
Evaluation
Linear
- Tag accuracy is the most common evaluation metric for POS tagging, since usually the number of words being tagged is fixed.
Hierarchical
- Coverage is a measure of how well a CFG models the full range of the language it is designed for.
- The ParsEval metric evaluates parser accuracy by calculating the precision, recall and F1 score over labelled constituents.
Tag accuracy
Tag accuracy is the most common evaluation metric for POS tagging, since usually the number of words being tagged is fixed.
Coverage
Coverage is a measure of how well a CFG models the full range of the language it is designed for.
ParsEval
The ParsEval metric evaluates parser accuracy by calculating the precision, recall and F1 score over labelled constituents.
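A sketch of the labelled-constituent comparison behind ParsEval, with trees as nested tuples (the toy gold/predicted trees are assumptions):

```python
def constituents(tree, i=0):
    """Labelled spans (label, start, end) of a tree given as nested tuples;
    leaves are plain word strings."""
    label, *children = tree
    spans, j = [], i
    for c in children:
        if isinstance(c, str):
            j += 1
        else:
            child_spans, j = constituents(c, j)
            spans.extend(child_spans)
    spans.append((label, i, j))
    return spans, j

def parseval(gold_tree, pred_tree):
    """Precision, recall and F1 over labelled constituents."""
    gold = set(constituents(gold_tree)[0])
    pred = set(constituents(pred_tree)[0])
    correct = len(gold & pred)
    p = correct / len(pred)
    r = correct / len(gold)
    return p, r, (2 * p * r / (p + r) if p + r else 0.0)

gold = ("S", ("NP", "dogs"), ("VP", "bark"))
pred = ("S", ("NP", "dogs"), ("V", "bark"))   # one mislabelled constituent
parseval(gold, pred)  # 2 of 3 constituents match: p = r = f = 2/3
```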