Vector Spaces Flashcards
Distributional hypothesis
If two words have similar contexts, we can assume that they have similar meanings.
A distributional approach to lexical semantics
- Record contexts of words across a large collection of texts (corpus)
- Each word is represented by a set of features
- Each feature records some property of the observed context
- Words that are found to have similar contexts are expected to also have similar meaning.
Context windows
"I bake bread for breakfast"
- Context = neighborhood of +-n words left/right of the focus word.
- Features for +-1 with focus word "bread": {left: bake, right: for}
- Some variants: distance weighting, n-grams.
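The context-window extraction above can be sketched in a few lines (a minimal sketch; the function name and feature labels are my own):

```python
def window_features(tokens, i, n=1):
    """Collect the words within n positions left/right of the focus word at index i."""
    feats = {}
    for d in range(1, n + 1):
        if i - d >= 0:
            feats[f"left_{d}"] = tokens[i - d]
        if i + d < len(tokens):
            feats[f"right_{d}"] = tokens[i + d]
    return feats

tokens = ["I", "bake", "bread", "for", "breakfast"]
print(window_features(tokens, tokens.index("bread"), n=1))
# {'left_1': 'bake', 'right_1': 'for'}
```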
Bag of Words (BoW)
- Context: all co-occurring words, ignoring the linear ordering
- Features: {I, bake, for, breakfast}
- Some variants: Sentence-level, document-level
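A sentence-level BoW context is just the set of co-occurring words (a sketch; the function name is my own):

```python
def bow_features(tokens, i):
    """Bag-of-words context: every other word in the sentence, order ignored."""
    return set(tokens[:i] + tokens[i + 1:])

tokens = ["I", "bake", "bread", "for", "breakfast"]
print(bow_features(tokens, tokens.index("bread")))
# the set {'I', 'bake', 'for', 'breakfast'}
```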
Grammatical context
- Context: The grammatical relations to other words
- Intuition: When words combine in a construction they often impose semantic constraints on each other.
- Requires deeper linguistic analysis than simple BoW-approaches
- Features: {dir_obj(bake), prep_for(breakfast)}
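Grammatical features can be read off dependency triples. In practice the triples come from a dependency parser; here they are hand-annotated for the example sentence, and the function name and relation labels follow the card above:

```python
# Hand-annotated dependency triples for "I bake bread for breakfast"
# (in a real system these would be produced by a dependency parser).
triples = [
    ("bake", "subj", "I"),
    ("bake", "dir_obj", "bread"),
    ("bread", "prep_for", "breakfast"),
]

def grammatical_features(word, triples):
    """Features from the dependency relations the word participates in."""
    feats = set()
    for head, rel, dep in triples:
        if dep == word:
            feats.add(f"{rel}({head})")  # word is the dependent of head
        if head == word:
            feats.add(f"{rel}({dep})")   # word is the head of dep
    return feats

print(grammatical_features("bread", triples))
# {'dir_obj(bake)', 'prep_for(breakfast)'}
```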
Tokenization
Splitting a text into sentences and words or other units.
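A naive word tokenizer can be sketched with a regular expression (an illustration only; real tokenizers also handle clitics like "'s", abbreviations, and hyphens):

```python
import re

def tokenize(text):
    """Naive tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("I bake bread for breakfast."))
# ['I', 'bake', 'bread', 'for', 'breakfast', '.']
```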
Stop-list
Filter out closed-class words or function words. The idea is that only content words provide relevant context.
Lemmatized string from the raw string “The programmer’s program had been programmed”.
“The programmer 's program have be program”
“Relatedness” vs. “Sameness”
Similarity in domain (relatedness):
{car, gas, road, service, traffic, driver}
Similarity in content (sameness):
{car, train, bicycle, truck, vehicle, airplane}
Vector space model
- A general model for representing data based on a spatial metaphor
- Each object is represented as a vector (or point) positioned in a coordinate system
- Each coordinate (or dimension) of the space corresponds to some descriptive and measurable property (feature) of the objects.
- To measure similarity of two objects, we can measure their geometrical distance / closeness in the model.
- Vector representations are foundational to a wide range of ML methods.
Semantic spaces
Semantic spaces are also known as distributional semantic models or word space models.
A semantic space is a vector space model where:
- Points represent words
- Dimensions represent contexts of use
- Distance in the space represents semantic similarity.
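A toy semantic space can be built by counting sentence-level co-occurrences (a sketch over a two-sentence corpus; real models use corpora of millions of sentences, and the variable names are my own):

```python
from collections import Counter

corpus = [
    ["I", "bake", "bread", "for", "breakfast"],
    ["we", "bake", "cake", "for", "dessert"],
]

# Count sentence-level co-occurrences for each word (BoW contexts).
counts = {}
for sent in corpus:
    for i, word in enumerate(sent):
        counts.setdefault(word, Counter()).update(sent[:i] + sent[i + 1:])

# Each word is a point; each context word in the vocabulary is a dimension.
vocab = sorted(counts)
def vector(w):
    return [counts[w][c] for c in vocab]

print("bread:", vector("bread"))
print("cake: ", vector("cake"))
# "bread" and "cake" share the "bake" and "for" dimensions,
# reflecting their similar contexts of use.
```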
One standard metric for spatial proximity
Euclidean distance: d(x, y) = sqrt(sum_i (x_i - y_i)^2)
Norm of a vector
The (Euclidean) norm is the length of the vector: ||x|| = sqrt(sum_i x_i^2)
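Both quantities are one-liners in Python (a minimal sketch; the function names are my own):

```python
import math

def norm(x):
    """Euclidean (L2) norm: the length of the vector."""
    return math.sqrt(sum(xi * xi for xi in x))

def euclidean(x, y):
    """Euclidean distance between two vectors of equal dimensionality."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

print(norm([3.0, 4.0]))           # 5.0
print(euclidean([0, 0], [3, 4]))  # 5.0
```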
Potential problem with Euclidian distance
It is very sensitive to extreme values and to the lengths of the vectors: frequent words accumulate higher co-occurrence counts and thus longer vectors, even when their context distributions are similar.
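The length sensitivity is easy to demonstrate: two vectors with identical direction (the same relative context profile) but different magnitudes end up far apart. Cosine similarity, which normalizes by vector length, is the standard alternative (not from the card above, but the usual fix):

```python
import math

x = [1.0, 2.0]    # e.g. a rare word
y = [10.0, 20.0]  # a frequent word with the same context profile

# Euclidean distance: large, despite identical direction.
euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Cosine similarity: dot product normalized by both norms, length-invariant.
cosine = sum(a * b for a, b in zip(x, y)) / (
    math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
)
print(euclidean)  # ~20.12
print(cosine)     # 1.0
```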