CS 7641 Final Flashcards
Supervised Learning
Use labeled training data to generalize labels to new instances
(function approximation)
unsupervised learning
making sense out of unlabeled data
(data description)
Single Linkage Clustering
- Consider each object a cluster (n objects)
- Define intercluster distance as the distance between the two closest points in the two clusters
- Merge two closest clusters
- Repeat n-k times to make k clusters
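A minimal sketch of these steps in Python (illustrative only, not course code; this is the naive O(n^3) version, and it assumes `points` is an n x d numpy array):

```python
import numpy as np

def single_link_clusters(points, k):
    """Naive single-linkage agglomerative clustering sketch."""
    n = len(points)
    # all pairwise distances up front
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    clusters = [{i} for i in range(n)]          # start: each point is its own cluster
    while len(clusters) > k:                    # merge n - k times
        best = (0, 1, np.inf)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single link: distance between the two closest points in the two clusters
                d = min(dist[a, b] for a in clusters[i] for b in clusters[j])
                if d < best[2]:
                    best = (i, j, d)
        i, j, _ = best
        clusters[i] |= clusters[j]              # merge the two closest clusters
        del clusters[j]
    return clusters
```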
Intercluster Distance in Single Linkage Clustering
the distance between the 2 closest points in the 2 clusters
Intercluster distance represents
domain knowledge
T/F Single Link Clustering is Deterministic
True. Unless there are ties, it will always give the same answer.
In SLC, if distance can be represented by edge link of a graph, what algorithm is it the same as?
Minimum Spanning Tree
What is the running time of Single Link Clustering (SLC) in terms of n points and k clusters
O(n^3): O(n^2) to look at all pairwise distances to find the closest pair, and we have to do this roughly n times
Issues with SLC
Could end up with weird clusters that surround other clusters
k-means clustering algorithm
- pick k centers (at random)
- each center “claims” its closest points
- recompute the centers by averaging the clustered points
- repeat until convergence
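A minimal numpy sketch of this loop (illustrative only; initializing the centers by sampling k data points and the convergence test are assumptions, and empty clusters are not handled):

```python
import numpy as np

def kmeans(X, k, rng=np.random.default_rng(0)):
    """Basic k-means sketch. X is an (n, d) array."""
    centers = X[rng.choice(len(X), size=k, replace=False)]   # pick k centers at random
    while True:
        # each center "claims" its closest points
        assign = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=-1), axis=1)
        # recompute the centers by averaging the clustered points
        # (note: an empty cluster would produce NaNs; not handled in this sketch)
        new_centers = np.array([X[assign == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):                  # converged: centers stop moving
            return centers, assign
        centers = new_centers
```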
When does k-means converge
When recomputing the centers does not move them
T/F the centers have to be a point in the collection of objects
False
k-means in Euclidean space process
cycle between partition assignment and recalculating the centers
Which optimization algorithm is most like k-means? HC, GA, or SA
Hill Climbing. You take a step towards a configuration that is better than what you had before
K-means runs out of configurations and converges in finite time when
finite configurations, break ties consistently, and never go into a configuration with a higher error
Big O for Each iteration of k-means clustering is
polynomial (pretty fast). O(kn): look at the distance between each of the k centers and the n points, then run through all the points to redefine the clusters. Could also have an extra factor of d with d dimensions
Big O for Finite iterations of k-means
exponential: O(k^n). The first of the n objects can be in any of k clusters, the second can be in any of k clusters, etc. In practice, it is much faster
If ties are broken consistently in k-means then error
decreases [with one exception]
T/F k-means cannot get stuck
False (local optima can still happen)
How to avoid k-means getting stuck
random restart OR do initial analysis and pick centers in good spots
Soft Clustering Overview
Points can belong probabilistically to one cluster or another
Soft Clustering Idea
Assume the data was generated by
1. select one of k gaussians uniformly [fixed known variance]
2. sample x from Gaussian
3. repeat n times
Task: find a hypothesis that maximizes the prob of the data (ML)
What is the maximum likelihood gaussian (soft clustering)
The ML mean of the Gaussian is the mean of the data
Expectation Maximization
Has two phases: expectation (the soft clustering step) and maximization
How is expectation maximization like k-means
It becomes k-means if cluster assignments use argmax (probabilities of 0 or 1). It improves a probabilistic metric the way k-means improves a squared-error metric
Properties of EM
- Monotonically non-decreasing likelihood (it’s not getting worse)
- Does not fully converge because there are infinitely many configurations (an infinite number of possible probabilities), though in practice it effectively does
- Will not Diverge (numbers don’t blow up and become infinitely large)
- Can get stuck (local optima)
- Works with any distribution (If EM solvable)
Clustering Properties
- Richness
- Scale Invariance
- Consistency
Richness
For any assignment of objects to clusters, there is some distance matrix D such that your clustering algorithm returns that clustering.
Scale Invariance
scaling distances by a positive value does not change the clustering
Consistency
Shrinking intracluster distances and expanding intercluster distances does not change the clustering
Impossibility Theorem
No clustering algorithm can achieve all three of richness, scale invariance, and consistency
Curse of dimensionality
Data needed grows exponentially with features (2^n)
Big O of feature selection
NP-hard: 2^n possible subsets, so exponential
Two approaches to feature selection
filtering and wrapping
filtering overview
takes the set of features as input and passes it to some search algorithm that outputs fewer features, which are then handed to the learner
wrapping overview
set of features, search over some subset of features, hand to the learning algorithm, which reports how well those features performed
Pros of filtering
speed: faster than wrapping because you don’t have to worry about what the learner does (it ignores the learning problem)
cons of filtering
looks at features in isolation
Pros of wrapping
takes into account model bias and learning itself
con of wrapping
extremely slow compared to filtering
ways to do feature filtering
- information gain
- variance, entropy
- “useful features”
- independent/non redundant
- ex: DT, NN (trim features with low weights)
ways to do feature wrapping
- hill climbing
- randomized Optimization
- forward search
- backward search
forward search
Look at all features in isolation and keep whichever is best. Then try adding each of the remaining features and keep whichever is best in combination with the first feature, and so on. This is similar to hill climbing
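A hedged sketch of forward search, assuming a caller-supplied `score` function (for example, cross-validated accuracy of the wrapped learner; the function name and the stopping rule are assumptions, not course specifics):

```python
def forward_search(all_features, score, max_features=None):
    """Greedy forward selection: add the single best feature at each step."""
    selected, best_overall = [], float("-inf")
    remaining = list(all_features)
    while remaining and (max_features is None or len(selected) < max_features):
        # try adding each remaining feature in combination with what we already have
        cand, cand_score = max(
            ((f, score(selected + [f])) for f in remaining), key=lambda t: t[1])
        if cand_score <= best_overall:      # no improvement: stop (hill-climbing style)
            break
        selected.append(cand)
        remaining.remove(cand)
        best_overall = cand_score
    return selected
```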
backward search
start with all the features and figure out which one you can eliminate, one at a time, until you get to a subset that performs well
Relevance: strongly relevant
xi is strongly relevant if removing it degrades the Bayes optimal classifier
Relevance: weakly relevant
Xi is weakly relevant if
- not strongly relevant
- and there exists some subset of features such that adding xi to the subset improves the Bayes optimal classifier
Relevance measures effect on
Bayes optimal classifier. INFORMATION
Usefulness measures effect on
a particular predictor. ERROR given model/learner (c is useful in neural nets)
feature transformation
the problem of pre-processing a set of features to create a new (smaller? more compact?) feature set, while retaining as much (relevant? useful?) information as possible
T/F Feature selection is a subset of feature transformation
True
feature transformation vs feature selection
transformation: a linear combination of the original features, like 2x1 + x2 as one feature. selection: have x1, x2, x3, x4 and take only x1 and x2
ad hoc information retrieval problem (google problem)
a whole bunch of databases and you want to retrieve the subset of documents that are relevant to the query (features)
ad hoc - you don’t know what the queries are
polysemy
words can have multiple meanings
ex: car - a vehicle and also a stupid thing in LISP
results in false positives
synonymy
many words mean the same thing. ex: car and automobile are same thing
results in false negatives
principal components analysis overview
finds directions that maximize variance (principal components)
and finds directions that are mutually orthogonal
principal component analysis properties
- maximize variance (principal component)
- mutually orthogonal - global algorithm
- you can prove best reconstruction (minimize l2 error by projecting onto a new line)
- you can throw away dimensions with the smallest eigenvalues
best reconstruction in principal component analysis (PCA)
it is simply a re-labeling of the dimensions: the directions just tell you how far along each axis a point is. It is a linear rotation of the original dimensions, so the new dimensions PCA returns lose no information, and you can reconstruct your original data from them
mutually orthogonal in PCA
This makes it a global algorithm: all of the directions and new features it finds share one big global constraint, they have to be mutually orthogonal.
What does PCA fundamentally ask
Is there another basis, which is a
linear combination of the original basis, that best expresses our data set?
eigenproblem
a computational problem that can be solved by finding the eigenvalues and/or eigenvectors of a matrix. In PCA, we are analyzing the covariance matrix (see the paper for details)
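A minimal sketch of PCA as exactly this eigenproblem on the covariance matrix (illustrative only, not the paper's or course's code; X is assumed to be an (n, d) data matrix):

```python
import numpy as np

def pca(X, n_components):
    """PCA via eigendecomposition of the covariance matrix."""
    Xc = X - X.mean(axis=0)                    # center the data
    cov = np.cov(Xc, rowvar=False)             # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # symmetric matrix -> real eigenpairs
    order = np.argsort(eigvals)[::-1]          # sort directions by decreasing variance
    components = eigvecs[:, order[:n_components]]
    return Xc @ components, eigvals[order]     # projected data, sorted eigenvalues
```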
PCA vs ICA overview
PCA: correlation, maximizes variance => reconstruction
ICA: independence; a linear transformation into a new feature space such that the new features are mutually independent and the mutual information between the original and new features is as high as possible
Blind Source Separation Problem (cocktail)
It is difficult to listen to one conversation when there are many going on at the same time. The people are the hidden variables here and the microphones are the observables
Independent Components Analysis (ICA) goal
figure out the hidden variables from observables
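One way to try this on the cocktail-party setup is scikit-learn's FastICA (a common ICA implementation, not the lecture's own code; the toy sources and mixing matrix below are made up for illustration):

```python
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
sources = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]   # two hidden "speakers"
mixing = np.array([[1.0, 0.5], [0.4, 1.2]])               # unknown in a real problem
X = sources @ mixing.T                                     # microphone observations

ica = FastICA(n_components=2, random_state=0)
recovered = ica.fit_transform(X)   # estimated independent sources (up to scale/order)
```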
PCA vs ICA
- mutually orthogonal
- mutually independent
- maximal variance
- maximal mutual information
- ordered features
- bag of features
- mutually orthogonal: PCA
- mutually independent: ICA
- maximal variance: PCA
- maximal mutual information: ICA
- ordered features: PCA
- bag of features: ICA (and somewhat PCA)
Which is global and which is local: PCA vs ICA
PCA is global (on faces it finds brightness, then the average face)
ICA finds local structure (on faces it finds noses and eyes; in the natural world, edges; in documents, topics)
RCA: Random Components Analysis
Generates random directions
projects data into those directions
works very well if the next thing you do is classification
What is the big advantage of RCA
Fast (much faster than PCA or ICA)
Other advantages: cheap, simple, easy,
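A minimal random-projection (RCA) sketch; the choice of Gaussian random directions and the target dimension m are assumptions for illustration, and a classifier would then be trained on the projected data:

```python
import numpy as np

def random_projection(X, m, rng=np.random.default_rng(0)):
    """Project (n, d) data onto m random directions."""
    d = X.shape[1]
    P = rng.normal(size=(d, m))   # generate m random directions
    return X @ P                  # project the data into those directions
```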
LDA: Linear Discriminant Analysis
Find a projection that discriminates based on the label. Pays attention to how the resulting components are used
“In contrast to PCA, LDA is “supervised” and computes the directions (“linear discriminants”) that will represent the axes that maximize the separation between multiple classes.”
ICA is important for finding
structure and edges
PCA vs ICA: which uses primarily Probability vs primarily Linear algebra
ICA: probability
PCA: linear algebra
Markov Decision Process: Model
Transition model. T(s,a,s’) is the probability of ending up in state s’ given that you are in state s and take action a. This function produces P(s’|s,a): the probability of a new state given your current state and the action you take
Markov Decision Process: Actions (A)
The things that you can do in a state (up, down, left, right)
Markov Decision Process: Reward
Scalar Reward for being in a state, taking action, etc. (Red and green states in board example). Represents our Domain Knowledge
Markov Decision Process: Policy
A policy is a solution to a Markov decision process. It tells you what action to take from a particular state
Markov: Infinite Horizons
We assumed there is no clock ticking. If there were, the policy would depend on both state and time
Markov: Utility of Sequences
if I prefer one sequence of events today, then I prefer the same sequence tomorrow
Markov: discounted
discounted rewards allow us to go an infinite distance in finite time because the sum is geometric; it is bounded by Rmax/(1-gamma)
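Written out (the same statement as above, with gamma in [0, 1)):

```latex
\sum_{t=0}^{\infty} \gamma^{t} R(s_t) \;\le\; \sum_{t=0}^{\infty} \gamma^{t} R_{\max} \;=\; \frac{R_{\max}}{1-\gamma}, \qquad 0 \le \gamma < 1
```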
Markov: Utility of a policy at a state
the expected (discounted) rewards from that point on if we follow that policy
Markov: utility vs reward
Utility: reward for that state and all the reward from that point on (Long term)
Reward: Immediate: reward (Short term)
Bellman Equation
Key equation for RL. A recursive equation that defines the true value of being in some particular state including policy, rewards, transition matrix, gammas, etc. It is the utility of the state.
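For reference, the Bellman equation in the course's usual form (utility of a state in terms of its reward, the discount gamma, the transition model T, and the utilities of neighboring states):

```latex
U(s) \;=\; R(s) \;+\; \gamma \max_{a} \sum_{s'} T(s,a,s')\, U(s')
```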
Markov Decision: Value Iteration
start w/ arbitrary utilities, update utilities based on neighbors, repeat until convergence
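A tabular value-iteration sketch (illustrative only; the dictionary representation of T and R below is an assumption, with T[s][a] a list of (probability, next_state) pairs and R[s] the state reward):

```python
def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
    """Iterate the Bellman update until utilities stop changing."""
    U = {s: 0.0 for s in states}                 # start with arbitrary utilities
    while True:
        # update each state's utility based on its neighbors' utilities
        U_new = {
            s: R[s] + gamma * max(
                sum(p * U[s2] for p, s2 in T[s][a]) for a in actions)
            for s in states
        }
        if max(abs(U_new[s] - U[s]) for s in states) < tol:   # converged
            return U_new
        U = U_new
```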
Markov Decision: Policy Iteration
start with some policy (a guess), evaluate its utility, then improve the policy by updating it to take, in each state, the action that maximizes expected utility based on those calculated utilities. Repeat
What is Q-Learning
Evaluating the Bellman equations from data
Markov Property (2 things)
- only the present matters
- things are stationary (the rules don’t change)
Advantage of policies in Markov decisions vs “plans”
They are robust to the stochasticity of the world
Markov: Why use gamma^t in utility of sequences
Rewards count substantially at first and their weight quickly trails off, which keeps the infinite sum finite
RL: What is Q(s,a) function in plain english
the value for arriving in s, leaving via a, and proceeding optimally thereafter
RL: What happens if you set alpha (learning rate) of 1
Full learning: forget everything you knew and just jump to the new value v
RL: What happens if you set alpha (learning rate) of 0
Won’t learn at all; the old value is kept (v is just assigned to itself)
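The update behind these two cards, written out (the standard Q-learning rule with learning rate alpha): with alpha = 1 the old estimate is discarded entirely, and with alpha = 0 it never changes.

```latex
\hat{Q}(s,a) \;\leftarrow\; (1-\alpha)\,\hat{Q}(s,a) \;+\; \alpha\left(r + \gamma \max_{a'} \hat{Q}(s',a')\right)
```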
Q-learning family of algorithms varies on what 3 things (generally)
- how initialize Q hat
- how to decay alpha sub t
- how to choose actions
How to choose actions in Q-learning
Can choose actions greedily with respect to Q-hat (always take the best Q-hat action), but this can get stuck in local minima. Solve this with a simulated-annealing-like approach (take random actions sometimes).
Could choose actions randomly, but then you are not using what you have learned about Q-hat so far: you learn the optimal policy but don’t follow it.
Problem with random restarts in Q-learning
Very slow! It already takes a long time to visit everything infinitely often; even longer with restarts
Fundamental Result Theorem
in a 2-player, zero-sum deterministic game of perfect information, minimax = maximin, and there always exists an optimal pure strategy for each player (with rational agents)
nash equilibrium
a set of strategies, one per player, such that no player wants to change their strategy given all the other players’ strategies
T/F Any N.E. will survive elimination of strictly dominated strategies
True
T/F There is always a N.E. (maybe mixed) for finite strategies and finite player games
True
T/F In the n player pure strategy game, if elimination of strictly dominated strategies eliminates all but one combination, that combination is the unique N.E.
True
Tit-for-tat
First round = cooperate. All future rounds = copy the opponent’s previous move. A strategy for infinite games.
Folk Theorem in Math
General idea: in repeated games, the possibility of retaliation opens the door for cooperation
Folk Theorem: In math: results known, at least to experts in the field, and considered to have established status, but not published in complete form
game theory: feasible region
the average payoffs achievable by some joint strategy
Game Theory: Folk Theorem
Any feasible payoff profile that strictly dominates the minimax/security profile can be realized as a Nash equilibrium payoff profile, with sufficiently large discount factor. Proof: if it strictly dominates the minimax profile, can use it as a threat. Better off doing what you are told
Grim Trigger
As long as you cooperate, you get mutual benefit. If you ever defect “deal out vengeance” forever
Implausible Threat
The vengeance could cost more than trying to cooperate so it’s not realistic to always punish opponent
Game Theory: Subgame Perfect
Each player is always taking a best response independent of history
Game Theory: Pavlov
cooperate if agree, defect if disagree
T/F is Pavlov sub game perfect?
Yes. No matter what state they are in, the average reward is mutual cooperation
Computational Folk Theorem
Can build a Pavlov-like machine for any game: construct a subgame perfect Nash equilibrium for any game in polynomial time. This works because either it is Pavlov-like (we can get to mutual cooperation quickly), or it is zero-sum-like (solve an LP), or at most one player improves
bimatrix game
2 players, each with its own reward matrix; an average-reward repeated game
what makes stochastic games more interesting
actions that the players take impact not just the rewards but also future states
How can you use bellman equation in zero sum stochastic games
use minimax over the Q values instead of max
Properties of Q learning in zero sum stochastic games
- value iteration works
- minimax-Q converges under the same conditions that Q-learning does
- unique solution to Q*
- policies can be computed independently
- the update is efficient
- Q functions are sufficient to specify the policy
General Sum Stochastic Games
Can’t do minimax anymore. Instead use Nash of Q.
What properties change for general sum stochastic games
- value iteration doesn’t work
- Nash-Q doesn’t converge
- no unique solution to Q*
- policies cannot be computed independently
- the update is not efficient unless P = PPAD
- Q functions are not sufficient to specify the policy
Expectation Maximization - soft clustering / expectation step
We know the means (assume the means) and then calculate how likely it is that the data came from the means. Knowing means and variances allows you to calculate which point came from which cluster. For each point, which cluster is it from?
Expectation Maximization - Maximization steps
We know the clusterings, so calculate the means and variances for these clusters.
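A minimal EM sketch for this setting (1-D data, k Gaussians with fixed known variance, matching the soft-clustering setup above; illustrative only, with means initialized to random data points):

```python
import numpy as np

def em_soft_clustering(x, k, n_iters=50, var=1.0, rng=np.random.default_rng(0)):
    """EM for a 1-D mixture of k Gaussians with fixed variance."""
    means = rng.choice(x, size=k, replace=False)            # initial guesses for the means
    for _ in range(n_iters):
        # E-step: how likely is each point to have come from each Gaussian?
        dens = np.exp(-(x[:, None] - means[None, :]) ** 2 / (2 * var))
        resp = dens / dens.sum(axis=1, keepdims=True)        # soft cluster assignments
        # M-step: re-estimate each mean as the responsibility-weighted average
        means = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
    return means, resp
```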
monotonically non-decreasing
finding higher and higher likelihoods
Why care about feature selection
- knowledge discovery - it is useful to understand which features are meaningful (interpretability and insight)
- curse of dimensionality
What algorithm looks at feature selection
Decision Trees (a type of filtering) and in particular, information gain
What algorithm did we learn in the first half of the course that does feature transformation
neural nets
principal component analysis. Why can you throw away dimensions with small eigenvalues
With n features you project onto n dimensions. As you move from the 1st principal component to the nth, the eigenvalues are monotonically non-increasing. Throwing away the ones with the smallest eigenvalues = throwing away the ones with the least variance
Con of PCA
If one of your original dimensions is directly relevant to the data but has low variance relative to Gaussian noise, PCA can end up throwing that original dimension away
PCA is like filtering or wrapping
filtering. It transforms into a new space where features can be filtered
T/F PCA's goal is to find independent projections
False. It may happen to find independent projections, but what it is finding is uncorrelated dimensions
How can ICA both maximize mutual information and require mutually independent features
Maximize mutual information between the new features and the old features
The new features are mutually independent of each other
What feature transformation problem does much better in blind source separation problem and is directional
ICA. Directional meaning it gives very different answers if you rotate the matrix that it is given.
RL: 3 different types of RL
Policy search: map states to actions through policies
Value function based: map state to value through utilities
Model-Based: map states and actions to new states and rewards through transitions and rewards
RL: Q-Learning what is U(s) and pi(s)
U(s) = max over all actions of Q(s,a) in that state
pi(s) = argmax over all action of Q(s,a)
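In symbols:

```latex
U(s) = \max_{a} Q(s,a), \qquad \pi(s) = \operatorname*{arg\,max}_{a} Q(s,a)
```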
RL: What is Q-learning
evaluating the Bellman equation from data. We see transitions (experience tuples <s, a, r, s’>), not the model itself
Q-learning converges if
- we visit s,a infinitely often so it needs to run a very long time
- s’ needs to be drawn from the actual transition probabilities T(s,a,s’)
- rewards need to be drawn from reward function R(s)
minimax
Players consider the worst-case counter-strategy. A is trying to maximize and B is trying to minimize, so A finds the maximum minimum and B finds the minimum maximum. There always exists an optimal pure strategy for each player
Relaxing Which constraint makes minimax fail?
change from perfect information to hidden information. Minimax works in 2-player zero-sum deterministic OR non-deterministic games of perfect information with pure strategies
mixed strategy vs pure strategy
with a mixed strategy, you choose among strategies with some probability; with a pure strategy, you always play a single strategy
Game Theory: finite state strategy. How to solve?
You can use MDPs! Only the last state matters, etc.
minmax profile
pair of payoffs, one for each player, that represents the payoff achieved by a player defending itself against a malicious adversary
security level
the minmax profile when mixed strategies are allowed
Q-learning does not work with general-sum stochastic games; what could work?
- repeated stochastic games (folk theorem)
- cheap talk -> correlated equilibrium: players can talk a little bit
- cognitive hierarchy -> best response: rather than solving for equilibria, best-respond to what you believe the other player will do
- side payments (coco values): players can pay the opponent to help them