Final Flashcards
Quiz: PCA vs ICA
- Mutually Orthogonal
- PCA
- Mutually Independent
- ICA
- uncorrelated implies independent only for certain distributions (e.g. Gaussians), so PCA may happen to find independent features, but this is not guaranteed
- Maximal Variance
- PCA
- NO on ICA
- maximizing variance tends to mix independent sources together (sums of independent variables look more Gaussian, per the central limit theorem)
- Maximal Mutual Information
- ICA
- Maximize joint mutual info between original and transformed features
- Ordered Features
- PCA
- eigenvalues ordered by variance
- Not for ICA
- kurtosis can be used to order but generally order is not considered
- Bag of Features
- ICA
Definition of Utility (U)
U(s) = R(s) + gamma * max_a sum_s' [ T(s,a,s') U(s') ]
Long term value of being in a state
What to do when we are facing Tit For Tat?
- Always defect
- best for low gamma
- always cooperate
- best for high gamma
Ways to implement wrapper-based feature selection
- Hill Climbing
- Randomized Optimization
- Forward/Backward Selection
How do ICA and PCA differ?
- ICA
- solves blind source separation problem
- directional
- images
- finds noses, mouth, hair
- local “parts of”
- natural scenes
- finds edges
- independent components of the world are edges
- documents
- gives topics
- can understand fundamental features of data
- about probability, information theory
- more expensive, doesn’t always produce answer
- PCA
- not directional
- images
- finds direction of maximal variance (brightness)
- finds the average face next
- global
- natural scenes
- brightness
- average scene
- about linear algebra
What is KL Divergence
Measures the difference between two probability distributions (a distance-like measure, though not symmetric, so not a true metric)
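A minimal sketch of computing KL divergence for two discrete distributions (the probability vectors here are made-up examples), showing it is not symmetric:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions given as probability vectors."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                      # convention: 0 * log(0/q) = 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p = [0.5, 0.5]   # fair coin
q = [0.9, 0.1]   # biased coin
print(kl_divergence(p, q))   # ~0.74 bits
print(kl_divergence(q, p))   # ~0.53 bits -- not symmetric, so not a true metric
```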
Impossibility Theorem
- No clustering scheme can achieve all three of:
- richness
- scale invariance
- consistency
- Mutually contradictory properties
- Can re-define the properties and get *nearly* all 3
Relevance
Feature xi
- Xi is strongly relevant if removing it degrades BOC (Bayes Optimal Classifier, best on avg, weighted avg of hypotheses)
- Xi is weakly relevant if
- not strongly relevant
- there exists a subset of features such that adding Xi to it improves the BOC
- Otherwise Xi is irrelevant
Measures effect on BOC
Relevance ~ Information
Minimax
- 2 agents A, B
- A considers worst case counter while maximizing own value
- B considers worst case counter while minimizing A value
- Finding the maximum minimum or minimum maximum
prisoner's dilemma
The prisoner’s dilemma is a standard example of a game analyzed in game theory that shows why two completely rational individuals might not cooperate, even if it appears that it is in their best interests to do so
Issues with SLC
- Using only the closest distances can result in weird cluster shapes (stringy clusters, because two points end up in the same cluster if they are close to the cluster anywhere)
Repeated Games and the Folk Theorem
- In repeated games, the possibility of retaliation opens the door for cooperation
- Folk Theorem Definition (math):
- Results known, at least to experts in the field, and considered to have established status, but not published in complete form
- Everybody knows, no one gets credit, in the cloud of understanding
- Folk Theorem Definition (game theory):
- Describes the set of payoffs that can result from Nash strategies in repeated games
- Any feasible payoff profile that strictly dominates the minmax/security profile can be realized as a Nash equilibrium payoff profile, with sufficiently large discount factor.
KMeans as Optimization - Definition
- Configurations
- Center, P
- Scores
- Err(P, center) = sum_x || center_P(x) - x ||^2 (squared Euclidean distance from each point to its assigned center)
- Neighborhood
- P, center = set of pairs
- (P’, center) U (P, center’ )
- change one or the other
- A lot like Hill Climbing
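A minimal sketch of the k-means loop this card describes, assuming 2-D numpy data: the assignment step updates the partition P, the update step moves the centers, and it stops at a local optimum much like hill climbing:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Plain k-means: alternately update the partition P and the centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest center (defines P).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        P = dists.argmin(axis=1)
        # Update step: each center becomes the mean of its assigned points.
        new_centers = np.array([X[P == j].mean(axis=0) if np.any(P == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break  # error stopped improving -- a local optimum, like hill climbing
        centers = new_centers
    return P, centers

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
P, centers = kmeans(X, k=2)
```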
Desirable Clustering Properties
- Richness
- For any assignment of objects to clusters, there is some distance matrix such that partition returns that clustering
- all inputs (distance) are valid and any clustering can be valid
- Not limited in clusters that can be expressed
- Scale-invariance
- change of units does not change clustering
- inches instead of feet
- Consistency
- shrinking intra-cluster distances and expanding inter-cluster distances does not change clustering
- make similar more similar and less similar even less similar, it should not change the groupings
Entropy
Tells us if a feature provides any information
Min # of yes/no questions needed to convey information
Discounted Rewards
- Sum from t = 0 to inf: gamma^t * R(S_t)
- Geometric series
- Rmax / (1 - gamma)
- gamma = 0
- bound becomes Rmax (only the immediate reward matters)
- gamma = 1
- prior case (no discount); the sum can be infinite
- 0 <= gamma < 1
- like covering infinite distance in finite time
- add infinitely many numbers -> get a finite number
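A quick numeric check of the geometric-series bound, using made-up values for gamma and Rmax and assuming the maximum reward Rmax is received at every step:

```python
gamma, Rmax = 0.9, 10.0
discounted = sum(gamma**t * Rmax for t in range(10_000))  # long enough to approximate infinity
print(discounted)             # ~100.0
print(Rmax / (1 - gamma))     # 100.0 -- the closed-form bound
```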
Markov Decision Process Framework
- States
- S
- coordinates, descriptors, represent where we are
- Model
- T(s,a,s’) ~ Pr(s’ | s, a)
- Action
- A(s), A
- Reward
- R(s), R(s,a), R(s,a,s')
- Policy
- pi(s) -> a
- optimal policy, pi*
Sequences of Rewards: Assumptions
- Infinite Horizons
- you can live forever
- if finite
- may take risks if horizon is short
- policy can change even if state is unchanged
- lose stationarity in model
- Utility of Sequences
- if two sequences start with the same state, the preference between them is unchanged when that start state is removed
- stationary preferences
- if I prefer one set of sequences today, I will tomorrow
If X and Y are independent, what are their joint and conditional entropy?
Joint Entropy:
H(X, Y) = H(X) + H(Y)
Conditional Entropy:
H(Y | X) = H(Y)
What is the utility of a sequence of states?
- Sum of the reward over the states
- Utility of an infinite sequence is infinite
- given two infinite sequences, one with larger rewards sprinkled in at random, neither is better than the other
- it doesn't matter what you do if rewards are always positive
Implausible Threats
- Idea that a grim trigger threat is implausible.
- Why would opposing agent threaten when the result is less reward for themselves too?
- Potential issue with Grim Trigger
How can achievable average payoffs be found in Repeated Games and the Folk Theorem
- Plot strategy pairs onto a 2 player plot
- Form a convex hull between strategy pairs
- “feasible region”
- Points within the convex hull are valid average payoffs for joint strategies
3 fundamental theorems of Nash equilibria
- In the n-player pure strategy game, if elimination of strictly dominated strategies eliminates all but one combination, that combination is the unique N.E.
- ex: prisoners dilemma
- any N.E will survive elimination of strictly dominated strategies
- if n is finite and each player's set of strategies is finite, there exists a (mixed) N.E.
tf-idf
- term frequency - inverse document frequency
- nearest-neighbor in textual data
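A minimal sketch of one common tf-idf variant (tf = raw count, idf = log(N / df)) over a toy corpus; real implementations differ in smoothing and normalization:

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "dogs and cats are pets".split(),
]

N = len(docs)
df = Counter(word for doc in docs for word in set(doc))   # document frequency per term

def tfidf(doc):
    tf = Counter(doc)                                     # term frequency in this document
    return {w: tf[w] * math.log(N / df[w]) for w in tf}

# Terms appearing in every document get idf = log(1) = 0; rarer terms get higher weight.
print(tfidf(docs[0]))
```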
What is the Q Function?
Q(s, a) = R(s) + gamma * sum_s' [ T(s,a,s') max_a' Q(s',a') ]
Value for arriving in a state s, leaving via a, then proceeding optimally thereafter.
MDP: Reward
- Reward
- R(s), R(s,a), R(s,a,s')
- mathematically equivalent
- encompasses domain knowledge
Principal Components Analysis (PCA)
- Eigenproblem
- Find direction of maximal variance
- Finds orthogonal directions
- PC2 is orthogonal to PC1
- Properties
- Global algorithm
- Best reconstruction
- no information lost (with all PCs kept)
- Minimize L2 (squared) error moving from n to m dimensions
- Each dimension has eigenvalue
- non-negative
- monotonically non-increasing from PC1 to PC2 to PCn
- Can throw away the components with the smallest eigenvalues (least variance)
- eigenvalue of 0 is ignorable (0 entropy, meaningless)
- well studied (fast algorithms)
- Filter method
- Can hurt classification if a feature associated with the label has low variance (that feature may be inadvertently thrown out); see the sketch below
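A minimal sketch of PCA as an eigenproblem on the covariance matrix (numpy only, synthetic data); the eigenvalues are non-negative and are sorted so variance is non-increasing across components:

```python
import numpy as np

def pca(X, m):
    """Project X (n_samples x n_features) onto the top-m principal components."""
    Xc = X - X.mean(axis=0)                      # center the data
    cov = np.cov(Xc, rowvar=False)               # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)       # symmetric eigenproblem, ascending order
    order = np.argsort(eigvals)[::-1]            # sort by variance, largest first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    return Xc @ eigvecs[:, :m], eigvals          # projection + per-component variance

X = np.random.randn(200, 5) @ np.diag([3, 2, 1, 0.1, 0.01])   # synthetic data
Z, variances = pca(X, m=2)   # keep the 2 highest-variance directions
```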
Random Components Analysis
- Generates random directions and then projects the data onto those directions
- works well
- At what?
- Before classification
- Maintains signal in lower dimension
- Picks up some correlations
- projects onto more dimensions than other algorithms like PCA (needs more random directions to capture the same signal)
- Advantage over PCA/ICA
- fast
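A minimal sketch of the random-projection idea (numpy only, synthetic data): generate random directions and project onto them with a single matrix multiply, which is why it is fast:

```python
import numpy as np

def random_projection(X, m, seed=0):
    """Project X (n_samples x n_features) onto m random directions."""
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(X.shape[1], m))   # random directions as columns
    return X @ R                           # one matrix multiply, no eigenproblem

X = np.random.randn(500, 100)
Z = random_projection(X, m=20)             # cheap dimensionality reduction before a classifier
```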
If 2 coins are independent, what is the entropy and mutual information between them?
H(A) = H(B) = -0.5 log_2(0.5) - 0.5 log_2(0.5) = 1 (maximum entropy)
H(A, B) = H(A) + H(B) = 1 + 1 = 2
I(A, B) = H(A) - H(A|B) = 1 - 1 = 0 (no mutual info between them)
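A small check of these numbers, assuming two independent fair coins:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

H_A = entropy([0.5, 0.5])                    # 1 bit
H_B = entropy([0.5, 0.5])                    # 1 bit
H_AB = entropy([0.25, 0.25, 0.25, 0.25])     # joint of independent coins: 2 bits
I_AB = H_A + H_B - H_AB                      # mutual information: 0 bits
print(H_A, H_B, H_AB, I_AB)
```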
mixed strategy vs pure strategy
- mixed strategy
- some distribution over strategies
- pure strategy
- mixed strategy with all probability mass on a single strategy
- probability 1 of chosen strategy
Properties of EM
- Monotonically non-decreasing likelihood
- not guaranteed to converge (in practice it does)
- in strange examples
- will not diverge
- can get stuck
- random restart
- works with any distribution (if E, M solvable)
- expectation is expensive (generally)
Forward & Backward Feature Selection
- Forward (see the sketch after this card)
- Start with no features
- Repeat until error doesn't reduce
- Try adding each remaining feature, passing the candidate set to the learner
- Keep the feature that gives the best score
- Backward
- Start with all features
- Repeat until error doesn't reduce
- Try removing each feature, passing the candidate set to the learner
- Remove the feature whose removal gives the best score
- Like hill-climbing
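A minimal sketch of forward selection as a wrapper, assuming scikit-learn is available; the learner and the cross-validation scoring here are illustrative choices, not prescribed by the notes:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def forward_selection(X, y, estimator, tol=1e-4):
    """Greedy wrapper: add the feature that most improves CV score; stop when it doesn't."""
    remaining = list(range(X.shape[1]))
    selected, best_score = [], -np.inf
    while remaining:
        scores = {f: cross_val_score(estimator, X[:, selected + [f]], y, cv=5).mean()
                  for f in remaining}
        f_best = max(scores, key=scores.get)
        if scores[f_best] <= best_score + tol:
            break                        # error stopped reducing
        selected.append(f_best)
        remaining.remove(f_best)
        best_score = scores[f_best]
    return selected

# Usage sketch (hypothetical X, y): selected = forward_selection(X, y, KNeighborsClassifier())
```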
MDP: Model
- Model
- T(s,a,s’) ~ Pr(s’ | s, a)
- probability of transition into s’ given that you were in state s and took action a
- rules of the game
- physics of the world
Feature Selection: Why?
- Knowledge Discovery
- interpretability & insight
- which features matter
- allow lower dimensions for understanding
- Curse of dimensionality
- need exponentially more data for more features
Value Iteration
- Repeat until convergence
- Start w/arbitrary utilities
- Update utilities based on neighbors
- U'_{t+1}(s) = R(s) + gamma * max_a sum_s' [ T(s,a,s') U'_t(s') ]
- update utility of s by looking at utilities of all other states
- the R(s) is truth
- Why do we converge?
- Each update adds the true reward R(s) to an estimate that starts out wrong; with every iteration more truth accumulates while the wrong part is discounted away, so the estimates converge to the true utilities
- Works because we propagate value across the states
- Order matters, not absolute values
- We don’t care about the actual value, just need utility good enough
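A minimal sketch of value iteration implementing the update above, assuming the model is given as a numpy array T indexed [s, a, s'] and rewards as R[s] (a hypothetical toy MDP, not from the course):

```python
import numpy as np

def value_iteration(T, R, gamma=0.9, tol=1e-8):
    """U_{t+1}(s) = R(s) + gamma * max_a sum_s' T(s,a,s') U_t(s')."""
    U = np.zeros(len(R))                      # start with arbitrary utilities
    while True:
        Q = R[:, None] + gamma * (T @ U)      # shape (states, actions)
        U_new = Q.max(axis=1)                 # greedy over actions
        if np.max(np.abs(U_new - U)) < tol:   # utilities have settled
            return U_new, Q.argmax(axis=1)    # utilities and greedy policy
        U = U_new

# Toy 2-state, 2-action MDP (made-up numbers)
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[1.0, 0.0], [0.5, 0.5]]])
R = np.array([0.0, 1.0])
U, policy = value_iteration(T, R)
```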
Curse of Dimensionality
Amount of data needed grows as roughly 2^N for N features
Adding more features means exponentially more data is needed
Linear Discriminant Analysis (LDA)
- finds a projection that discriminates based on the label
- like supervised learning
- Not filter method like others, cares about label explicitly
Important differences between MDPs and supervised learning
- delayed rewards
- take many actions prior to reward
- ex: chess
- temporal credit assignment
- minor changes matter
- reward changes affect policy
Iterated Prisoner’s Dilemma with Uncertain Ending
- With probability gamma, game continues
- else, game over
- Expected # of rounds
- Finite with gamma < 1
- 1 / (1 - gamma)
- reminiscent of discount factor
Tit-For-Tat
- iterated prisoner’s dilemma strategy
- cooperate on first round
- copy opponent's previous move thereafter
- what TFT does against various opponent strategies
- opponent always defects
- TFT plays: cooperate, defect, defect, defect, …
- opponent always cooperates
- TFT plays: cooperate forever
- opponent alternates: defect, cooperate, defect, cooperate, …
- TFT plays: cooperate, defect, cooperate, defect, …
- TFT (against another TFT agent)
- both cooperate forever
What is the best response to each strategy (gamma > 1/6)
- Mutual Best Response
- Pair of strategies where each is a best response to the other: a Nash Equilibrium
- D - D and TFT - TFT are both Nash Equilibria
- TFT - TFT Cooperative Nash
MDP: States
- States
- S
- coordinates, descriptors, represent where we are
Policy Iteration
- Algorithm
- Start with initial policy (arbitrary)
- Evaluate policy
- calculate U_t (utility) for the policy: U_t(s) = R(s) + gamma * sum_s' T(s, pi_t(s), s') U_t(s')
- n linear equations in n unknowns (unlike VI, there is no max, so they can be solved directly)
- Improve policy
- pi_{t+1}(s) = argmax_a sum_s' T(s,a,s') U_t(s')
- Fewer iterations than Value Iteration
- guaranteed to converge
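A minimal sketch of policy iteration matching the two steps above, using the same hypothetical T[s, a, s'] / R[s] layout as the value-iteration sketch; the evaluation step solves the n linear equations directly:

```python
import numpy as np

def policy_iteration(T, R, gamma=0.9):
    n_states, n_actions, _ = T.shape
    pi = np.zeros(n_states, dtype=int)              # start with an arbitrary policy
    while True:
        # Evaluate: solve U = R + gamma * T_pi U  (n linear equations in n unknowns)
        T_pi = T[np.arange(n_states), pi]           # transition matrix under pi
        U = np.linalg.solve(np.eye(n_states) - gamma * T_pi, R)
        # Improve: greedy one-step lookahead on U
        pi_new = (R[:, None] + gamma * (T @ U)).argmax(axis=1)
        if np.array_equal(pi_new, pi):
            return pi, U                            # policy stopped changing
        pi = pi_new
```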
mechanism design
- set up incentives to get particular behavior
- economics/government
- ex: tax breaks for mortgage
Bellman Equation
- True utility of a state
- reward for the state plus discounted rewards
- U(s) = R(s) + gamma * max_a sum_s' [ T(s,a,s') U(s') ]
- non-linear