00 Overview Flashcards
Sequence and grammar
Sequences:
- Probabilities over strings: n-gram models; linear and surface-oriented
- Sequence classification: HMMs add one layer of abstraction, with class labels as hidden variables, but the structure is still only linear
Grammar: adds hierarchical structure
- Shift focus from “sequences” to “sentences”
- Identifying underlying structure using formal rules
- Declarative aspect: formal grammar
- Procedural aspect: parsing strategy
- Learn probability distribution over the rules for scoring trees
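As a concrete illustration of the last point, here is a minimal sketch (the toy treebank and all names are illustrative, not from any course materials) of estimating rule probabilities by relative frequency and scoring a tree as the product of its rule probabilities:

```python
from collections import Counter

# Toy "treebank": each tree is represented simply as the list of rules it uses,
# written as (LHS, RHS) pairs. Purely illustrative data.
treebank = [
    [("S", ("NP", "VP")), ("NP", ("she",)), ("VP", ("V", "NP")),
     ("V", ("saw",)), ("NP", ("stars",))],
    [("S", ("NP", "VP")), ("NP", ("stars",)), ("VP", ("V",)), ("V", ("fell",))],
]

# Relative-frequency estimate: P(LHS -> RHS) = count(LHS -> RHS) / count(LHS)
rule_counts = Counter(rule for tree in treebank for rule in tree)
lhs_counts = Counter(lhs for tree in treebank for lhs, _ in tree)
rule_prob = {rule: c / lhs_counts[rule[0]] for rule, c in rule_counts.items()}

def score_tree(rules):
    """Score a tree as the product of the probabilities of the rules it uses."""
    p = 1.0
    for rule in rules:
        p *= rule_prob.get(rule, 0.0)  # unseen rules get probability 0 in this sketch
    return p

print(score_tree(treebank[0]))
```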
We have been doing data-driven learning by counting observations in what contexts?
- feature vectors in semantic spaces; bag-of-words, etc.
- previous n-1 words in n-gram models
- previous n-1 states in HMMs
- local sub-trees in PCFGs
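All of these are instances of the same recipe: count an event together with its context, then normalize. A minimal sketch for the bigram case (toy corpus, illustrative names):

```python
from collections import Counter

tokens = "<s> the cat sat on the mat </s>".split()  # toy corpus

# Count each word together with its context (here: the single previous word).
bigram_counts = Counter(zip(tokens, tokens[1:]))
context_counts = Counter(tokens[:-1])

def p(word, prev):
    """MLE bigram probability: P(word | prev) = count(prev, word) / count(prev)."""
    return bigram_counts[(prev, word)] / context_counts[prev] if context_counts[prev] else 0.0

print(p("cat", "the"))  # 0.5: "the" occurs twice, once followed by "cat"
```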
Vector space models
- Data representation based on a spatial metaphor
- Objects modeled as feature vectors positioned in a coordinate system
- Semantic spaces = vector spaces for distributional lexical semantics
- Some issues:
- Usage = meaning? (The distributional hypothesis)
- How do we define context / features? (BoW, n-grams, etc.)
- Text normalization (lemmatization, stemming, etc.)
- How do we measure similarity? Distance / proximity metrics. (Euclidean distance, cosine, dot-product, etc.)
- Length-normalization (ways to deal with frequency effects / length-bias)
- High-dimensional sparse vectors (i.e. few active features; consequences for low-level choice of data structure, etc.)
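Tying the last few points together, a minimal sketch (toy data, my own helper names) of cosine similarity over sparse vectors stored as dicts of active features only:

```python
import math

# Sparse feature vectors: only non-zero dimensions are stored.
doc1 = {"cat": 3, "purr": 1, "sat": 2}
doc2 = {"cat": 1, "dog": 2, "sat": 1}

def dot(u, v):
    # Iterate over the shorter vector; only shared features contribute.
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v[f] for f, w in u.items() if f in v)

def cosine(u, v):
    """Length-normalized dot product, reducing the effect of raw frequency / length."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0

print(cosine(doc1, doc2))
```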
Classification
Supervised learning from labeled training data.
Given data annotated with predefined class labels, learn to predict membership for new/unseen objects.
Cluster analysis
Unsupervised learning from unlabeled data.
Automatically forming groups of similar objects.
No predefined classes; we only specify the similarity measure.
Issues with categorization tasks
- Measuring similarity
- Representing classes (e.g. exemplar-based vs. centroid-based)
- Representing class membership (hard vs. soft)
Examples of vector space classifiers
Rocchio vs. kNN
Differences between Rocchio and kNN
- Centroid- vs exemplar-based class representation
- Linear vs non-linear decision boundaries
- Assumptions about the distribution within the class
- Complexity in training vs complexity in prediction
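A minimal sketch of the contrast (toy data; all function names are mine, not a course API): Rocchio does its work at training time by compressing each class into a centroid, while kNN stores the training set and defers all work to prediction time.

```python
from collections import Counter
import math

train = [([1.0, 1.0], "A"), ([2.0, 1.0], "A"), ([8.0, 9.0], "B"), ([9.0, 8.0], "B")]

# Rocchio: training = one centroid per class; prediction = nearest centroid.
def rocchio_train(data):
    sums, counts = {}, Counter()
    for x, y in data:
        counts[y] += 1
        sums[y] = [a + b for a, b in zip(sums.get(y, [0.0] * len(x)), x)]
    return {y: [a / counts[y] for a in s] for y, s in sums.items()}

def rocchio_predict(centroids, x):
    return min(centroids, key=lambda y: math.dist(x, centroids[y]))

# kNN: no training beyond storing the data; prediction = majority vote among the k nearest examples.
def knn_predict(data, x, k=3):
    nearest = sorted(data, key=lambda pair: math.dist(x, pair[0]))[:k]
    return Counter(y for _, y in nearest).most_common(1)[0][0]

centroids = rocchio_train(train)
print(rocchio_predict(centroids, [1.5, 1.5]), knn_predict(train, [1.5, 1.5]))
```

The nearest-centroid rule yields linear decision boundaries, while kNN's boundary follows the local data and can be non-linear.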
Evaluation of classifiers
- Accuracy, precision, recall and F-score
- Multi-class evaluation: Micro-/macro-averaging
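A small worked sketch (toy labels, illustrative code) of per-class precision/recall/F-score and the difference between macro-averaging (every class weighted equally) and micro-averaging (every decision weighted equally):

```python
from collections import Counter

gold = ["A", "A", "A", "A", "B", "C"]
pred = ["A", "A", "A", "A", "A", "A"]  # a majority-class baseline

labels = sorted(set(gold) | set(pred))
tp = Counter(g for g, p in zip(gold, pred) if g == p)   # true positives per class
fp = Counter(p for g, p in zip(gold, pred) if g != p)   # false positives per predicted class
fn = Counter(g for g, p in zip(gold, pred) if g != p)   # false negatives per gold class

def prf(t, f_pos, f_neg):
    p = t / (t + f_pos) if t + f_pos else 0.0
    r = t / (t + f_neg) if t + f_neg else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Macro: average the per-class F-scores, so the missed rare classes B and C drag the score down.
macro_f = sum(prf(tp[l], fp[l], fn[l])[2] for l in labels) / len(labels)

# Micro: pool the counts over all classes first (for single-label tasks this equals accuracy).
micro_p, micro_r, micro_f = prf(sum(tp.values()), sum(fp.values()), sum(fn.values()))

print(round(macro_f, 3), round(micro_f, 3))  # 0.267 vs 0.667
```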
Types of clustering
- Flat / partitional clustering
- Hierarchical clustering
- Agglomerative clustering (bottom-up)
- Divisive clustering (top-down)
Flat clustering algorithm
- Example: k-Means.
- Partitioning viewed as an optimization problem:
- Minimize the within-cluster sum of squares.
- Approximated by iteratively improving on some initial partition.
- Issues: initialization / seeding, non-determinism, sensitivity to outliers, termination criterion, specifying k, specifying the similarity function.
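A minimal sketch of the standard iterative procedure (Lloyd's algorithm) behind k-means; toy data, and the fixed iteration cap and random seeding are deliberate simplifications of the issues listed above:

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Alternate between assigning points to the nearest centroid and
    re-estimating centroids as cluster means, improving the within-cluster
    sum of squares from an initial random partition."""
    rng = random.Random(seed)             # seeding: the result depends on this initial choice
    centroids = rng.sample(points, k)
    for _ in range(iters):                # fixed iteration cap as a crude termination criterion
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: math.dist(p, centroids[i]))].append(p)
        # Update step: move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [[sum(dim) / len(c) for dim in zip(*c)] if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return clusters

points = [[1.0, 1.0], [1.5, 2.0], [0.5, 1.5], [8.0, 8.0], [9.0, 9.5]]
print(kmeans(points, k=2))
```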
Agglomerative clustering
- Bottom-up hierarchical clustering
- Resulting in a set of nested partitions, often visualized as a dendrogram.
- Issues:
- Linkage criteria (how to measure inter-cluster similarity):
- Single, Complete, Centroid, or Average Linkage
- Cutting the tree
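A minimal sketch of the bottom-up procedure with a pluggable linkage criterion (toy data, my own function names; a real implementation would cache inter-cluster distances instead of recomputing them):

```python
import math
from itertools import combinations

def linkage_dist(a, b, linkage="single"):
    """Inter-cluster distance under different linkage criteria."""
    dists = [math.dist(p, q) for p in a for q in b]
    if linkage == "single":      # closest pair
        return min(dists)
    if linkage == "complete":    # furthest pair
        return max(dists)
    return sum(dists) / len(dists)  # average linkage as the fallback here

def agglomerative(points, linkage="single"):
    """Start with singleton clusters and repeatedly merge the closest pair;
    the recorded merge history is what a dendrogram visualizes."""
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage_dist(clusters[ij[0]], clusters[ij[1]], linkage))
        merges.append((clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

points = [[1.0, 1.0], [1.2, 1.1], [5.0, 5.0], [5.1, 5.2], [9.0, 1.0]]
for a, b in agglomerative(points, linkage="complete"):
    print(a, "+", b)
```

Cutting the resulting tree at some point in the merge history turns the nested partitions into one flat partition.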
Divisive clustering
- Top-down hierarchical clustering
- Generates the tree by iteratively applying a flat clustering method.
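A minimal sketch of the top-down idea (toy data; the 2-way split below is a crude stand-in for a proper flat method such as k-means): keep splitting the largest cluster until the requested number of clusters is reached.

```python
import math

def split_once(cluster):
    """Crude flat 2-way split (stand-in for k-means): take the two mutually
    furthest points as seeds and assign every point to the nearer seed."""
    a, b = max(((p, q) for p in cluster for q in cluster), key=lambda pq: math.dist(*pq))
    left = [p for p in cluster if math.dist(p, a) <= math.dist(p, b)]
    right = [p for p in cluster if math.dist(p, a) > math.dist(p, b)]
    return left, right

def divisive(points, k):
    """Start with one cluster holding everything and repeatedly apply the
    flat split to the largest cluster until there are k clusters."""
    clusters = [list(points)]
    while len(clusters) < k:
        biggest = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        clusters += split_once(clusters.pop(biggest))
    return clusters

points = [[1.0, 1.0], [1.2, 1.1], [8.0, 8.0], [8.2, 8.1], [4.0, 4.2], [4.1, 4.0]]
print(divisive(points, k=3))
```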
Structured Probabilistic Models
- Switching from a geometric view to a probability distribution view.
- Model the probability that elements (words, labels) are in a particular configuration.
- We applied many of the same concepts to structures that are either linear or hierarchical
What have we been modelling with Structured Probabilistic Models?
Linear
- Which string is more likely?
- Which tag sequence is most likely?
Hierarchical
- Which tree structure is most likely?