Exam 2 Flashcards
Unsupervised Learning
- Data Description
Single Linkage Clustering (SLC)
- Consider each object a cluster (n objects)
- Define inter-cluster distance as the distance between the closest points in the two clusters
- Merge the two closest clusters
- Repeat n-k times to make k clusters
- How you define the inter-cluster distance is a matter of domain knowledge (see the sketch below)
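A minimal sketch of the procedure above in plain NumPy; the function name and the Euclidean distance are illustrative choices, not from the course materials:

```python
import numpy as np

def single_linkage(points, k):
    """Naive single-linkage clustering: repeatedly merge the two closest
    clusters (closest pair of points across clusters) until k clusters remain."""
    clusters = [[i] for i in range(len(points))]
    # Pairwise distances between all points (Euclidean here; domain-dependent).
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    while len(clusters) > k:
        best = (None, None, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Inter-cluster distance = distance between the closest points.
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters

pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [9.0, 0.0]])
print(single_linkage(pts, k=2))
```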
Single Linkage Cluster Run Time
O(n^3)
Issues with SLC
May end up with unintuitive clusters (e.g., long chained clusters), since merging depends only on the single closest pair of points and on how the distance is defined
k-means clustering
- Pick k random center points
- Each center claims its closest points
- Recompute the centers by averaging the clustered points
- Repeat until convergence (see the sketch below)
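A minimal NumPy sketch of these steps; the initialization, empty-cluster handling, and stopping test are simplified assumptions:

```python
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    """Plain k-means (Lloyd's algorithm): assign each point to the nearest
    center, then recompute each center as the mean of its claimed points."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Each center "claims" its closest points.
        labels = np.argmin(np.linalg.norm(points[:, None] - centers[None, :], axis=-1), axis=1)
        # Recompute the centers by averaging the clustered points.
        new_centers = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):   # converged
            break
        centers = new_centers
    return centers, labels

pts = np.random.default_rng(1).normal(size=(100, 2))
centers, labels = kmeans(pts, k=3)
```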
Configurations (k-means as an optimization problem)
The cluster centers and the partition P (the assignment of each point to a center)
Scores (k-means as an optimization problem)
How much error you introduce by representing the points by their cluster centers (e.g., the sum of squared distances from each point to its assigned center)
Neighborhood (k-means as an optimization problem)
The neighborhood of a configuration is the set of configurations reached either by keeping the centers fixed and changing the partition, or by keeping the partition fixed and moving the centers
Properties of k-means clustering
- Each iteration is polynomial O(kn)
- Finite (exponential) iterations O(k^n)
- Error decreases (if ties are broken consistently)
- Can get stuck if it starts with bad centers; this is similar to converging to a local optimum. One solution is to use random restarts.
Soft Clustering
Attaches to each point a probability of belonging to each cluster, instead of a single hard cluster assignment
Task is to find a hypothesis that maximizes the probability of the data (Maximum likelihood)
Maximum Likelihood Gaussian
The maximum likelihood estimate of the Gaussian's mean μ is the mean of the data
Expectation Maximization
- Assigns a cluster probability to each point
- Expectation step: E[Z_ij] is the probability that element i was produced by cluster j
- Maximization step: μ_j is recomputed as the mean of cluster j, weighted by E[Z_ij]
- Repeat: use the new μ_j to recompute E[Z_ij], and so on (see the sketch below)
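A sketch of EM under simplifying assumptions (1-D data, unit-variance Gaussians, equal priors); the variable names mu and z are illustrative:

```python
import numpy as np

def em_gaussian_means(x, k, iters=50, seed=0):
    """EM for a mixture of k unit-variance 1-D Gaussians with equal priors.
    E-step: E[Z_ij] = P(point i came from cluster j).
    M-step: mu_j = weighted mean of the data under E[Z_ij]."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=k, replace=False)
    for _ in range(iters):
        # E-step: responsibilities proportional to the Gaussian density.
        dens = np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2)
        z = dens / dens.sum(axis=1, keepdims=True)
        # M-step: means that maximize the expected log-likelihood.
        mu = (z * x[:, None]).sum(axis=0) / z.sum(axis=0)
    return mu, z

x = np.concatenate([np.random.default_rng(1).normal(0, 1, 200),
                    np.random.default_rng(2).normal(6, 1, 200)])
mu, z = em_gaussian_means(x, k=2)
print(mu)  # roughly [0, 6] (cluster order may vary)
```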
EM properties
- Monotonically non-decreasing likelihood: It’s not getting worse
- Not guaranteed to converge (it can keep taking ever-smaller steps toward higher likelihood)
- Will not diverge (the probabilities involved stay bounded)
- Will most likely get stuck in a local optimum; use random restarts
- Works with any distribution (as long as the E and M steps are solvable)
Richness (Clustering Properties)
For any assignment C of objects to clusters, there’s some distance matrix D such that P_d returns that clustering C
Scale-invariance (Clustering Properties)
Scaling distances by a positive value doesn’t change the clustering
Consistency (Clustering Properties)
Shrinking intra-cluster distances (moving points towards each other) and expanding inter-cluster (moving clusters away from each other) distances doesn’t change the clustering
Impossibility Theorem (Clustering Properties)
There’s no clustering algorithm that can achieve all three properties (richness, scale-invariance, and consistency) at the same time
Time to select most relevant subset of features M, where M <= N
2^N (exponential in the number of features N; finding the best subset is NP-hard)
Approaches to Feature Selection
Filtering and Wrapping
Filtering (Feature Selection)
- Apply a search algorithm to produce fewer features to be passed to the learning algorithm
- Faster than wrapping
- Doesn’t account for the relationships between features
- Ignores the learning process (no feedback)
- We can use an algorithm (ex: Decision Tree) that can select the best features, then use another algorithm, with another inductive bias, for learning
- We can use different criteria to evaluate the usefulness of a subset of features:
information gain, variance, entropy, or eliminating dependent features (see the sketch below)
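A possible filtering sketch, assuming scikit-learn is available; the mutual-information criterion, k=5, and the decision tree learner are illustrative choices:

```python
# Filtering: rank features by a criterion (here, mutual information with the
# label), keep the top-k, then hand the reduced data to any learner.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)
selector = SelectKBest(score_func=mutual_info_classif, k=5)
X_reduced = selector.fit_transform(X, y)            # filtering: no feedback from the learner
model = DecisionTreeClassifier().fit(X_reduced, y)  # learner only sees the surviving features
```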
Wrapping
- Given a set of features, apply a search algorithm to produce fewer features, pass them to the learning algorithm, then use the output to update the selected set of features
- Takes into account the learning model’s bias
- Very slow
We can use different techniques to search:
1. Randomized Optimization algorithms
2. Forward selection: select a feature and evaluate the effect of creating different combinations of this feature with other features. Stop once you stagnate. This is similar to hill climbing
3. Backward Elimination: start with all the features and evaluate the effect of eliminating each feature
Forward selection
Select a feature and evaluate the effect of creating different combinations of this feature with the other features. Stop once you stagnate; this is similar to hill climbing
Backward Elimination
Start with all the features and evaluate the effect of eliminating each feature (a wrapping sketch using forward selection follows below)
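A possible wrapping sketch using greedy forward selection; the cross-validation scoring, tolerance, and learner are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def forward_selection(X, y, model, tol=1e-3):
    """Greedy forward selection: repeatedly add the single feature that most
    improves the cross-validated score; stop once improvement stagnates."""
    selected, best_score = [], -np.inf
    remaining = list(range(X.shape[1]))
    while remaining:
        scores = {f: cross_val_score(model, X[:, selected + [f]], y, cv=3).mean()
                  for f in remaining}
        f, score = max(scores.items(), key=lambda kv: kv[1])
        if score <= best_score + tol:      # stagnated -> stop (like hill climbing)
            break
        selected.append(f)
        remaining.remove(f)
        best_score = score
    return selected

X, y = make_classification(n_samples=200, n_features=10, n_informative=3, random_state=0)
print(forward_selection(X, y, DecisionTreeClassifier(random_state=0)))
```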
x_i is <> relevant if removing it degrades the Bayes Optimal Classifier
strongly
x_i is <> relevant if it's not strongly relevant and there's a subset of features S such that adding x_i to S improves the Bayes Optimal Classifier
weakly
Relevance (Feature Selection)
Measures the effect on the Bayes Optimal Classifier, i.e., how much information a particular feature provides
Usefulness (Feature Selection)
Measures the effect on error: whether a feature helps minimize error given a particular model
PCA
An example of an eigenproblem; it transforms the feature set by:
- Finding the direction (vector) that maximizes variance, this is called the Principal Component
- Finding directions that are orthogonal to the Principal component
- Each principal component has an associated eigenvalue (a measure of the variance it captures)
- PCA gives the ability to do reconstruction, because it is a linear rotation of the original space that minimizes L2 (reconstruction) error when going from N to M dimensions, so we lose as little information as possible
- We can center the problem around the origin by subtracting the mean of the data
- In effect, PCA produces a set of orthogonal Gaussians (uncorrelated dimensions)
- PCA is a global algorithm and is very fast (see the sketch below)
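A minimal PCA sketch via the eigendecomposition of the covariance matrix; the function and variable names are illustrative:

```python
import numpy as np

def pca(X, m):
    """PCA: center the data, find orthogonal directions of maximal variance
    (eigenvectors of the covariance matrix), keep the top-m components."""
    X_centered = X - X.mean(axis=0)                 # center around the origin
    cov = np.cov(X_centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigh: symmetric matrix
    order = np.argsort(eigvals)[::-1][:m]           # largest variance first
    components = eigvecs[:, order]
    projected = X_centered @ components             # N -> m dimensions
    reconstructed = projected @ components.T + X.mean(axis=0)  # linear reconstruction
    return projected, reconstructed, eigvals[order]

X = np.random.default_rng(0).normal(size=(100, 5))
Z, X_hat, variances = pca(X, m=2)
```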
ICA
- Attempts to maximize independence
- Tries to find a linear transformation of the feature space, such that each of the individual new features are mutually statistically independent
- The mutual information between any two of the new features is 0
- The mutual information between the new feature set and the original feature set is as high as possible (see the sketch below)
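A small ICA sketch, assuming scikit-learn's FastICA is available; the two synthetic source signals and the mixing matrix are made up for illustration:

```python
# ICA: recover statistically independent source signals from linear mixtures
# ("blind source separation").
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
sources = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]   # two independent sources
mixing = np.array([[1.0, 0.5], [0.5, 2.0]])
observed = sources @ mixing.T                             # what we actually measure

ica = FastICA(n_components=2, random_state=0)
recovered = ica.fit_transform(observed)   # estimates of the independent sources
```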
Random Components Analysis
- Similar to PCA but instead of generating directions that maximize variance, it generates random directions
- It still captures some of the correlations in the data, which works well in classification settings
- Faster than PCA and ICA (see the sketch below)
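A minimal sketch of projecting onto random directions (often called random projection); the Gaussian random matrix is one common choice, assumed here:

```python
import numpy as np

def random_projection(X, m, seed=0):
    """Random Components Analysis sketch: project onto m random directions
    instead of the directions of maximal variance."""
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(X.shape[1], m))   # random (not variance-maximizing) directions
    return X @ R

X = np.random.default_rng(1).normal(size=(100, 50))
X_small = random_projection(X, m=10)       # 50 -> 10 dimensions, very cheap
```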
Linear Discriminant Analysis
- Finds a projection that discriminates based on the label, i.e., projections of the features that best align with the desired output (more like wrapping than filtering); see the sketch below
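A short sketch, assuming scikit-learn; the synthetic dataset and n_components=2 are illustrative:

```python
# LDA: unlike PCA/ICA, the projection uses the labels y, so the new axes
# are chosen to separate the classes.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_classification(n_samples=200, n_features=10, n_classes=3,
                           n_informative=5, random_state=0)
X_proj = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)
```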
5 Parts to a Markov Decision Process (MDP)
The first four define the problem; the fifth, the policy, is the solution (next card):
- State - representation
- Model - Transition model, probability of transitioning to a new state given a state and an action
- Actions - Things the agent is allowed to do in a particular state s
- Reward - The reward the agent gets when transitioning to a state s, which tells it the "usefulness" of transitioning to that state
Markov Decision Problem Solution
- Policy - Takes a state and returns the proper action. A problem might have different policies
Temporal Credit Assignment
We don't get immediate rewards for each action, so we have the problem of figuring out which specific action(s) led us to a positive or negative outcome
A small negative reward will encourage the agent to take the <> path to the positive outcome
Shortest
A big negative reward will encourage the agent to take the <> path no matter what the result is, so it might end up with a negative outcome
shortest
You can take <> paths to avoid possible risks if you have an infinite time horizon
longer
If discount rate is 0 then:
we get the first reward and everything else will fall off to nothing
If discount rate is 1 then:
all future rewards count equally (no discounting), so over an infinite horizon the sum of rewards can grow without bound
Utility of Sequences
Summation over time of gamma^t times the reward of the state at time t
Bounded above by R_max divided by (1 - gamma) for 0 <= gamma < 1, since the discount factors form a geometric series (see the equation below)
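Written out, with the geometric-series bound the card refers to (R_max is the largest possible per-step reward):

```latex
U(s_0, s_1, s_2, \dots) \;=\; \sum_{t=0}^{\infty} \gamma^{t} R(s_t)
\;\le\; \sum_{t=0}^{\infty} \gamma^{t} R_{\max}
\;=\; \frac{R_{\max}}{1 - \gamma}, \qquad 0 \le \gamma < 1
```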
Value Iteration (Bellman Equation)
- Start with arbitrary utilities
- Update utilities based on neighbors
- Repeat until convergence
Guaranteed to converge because each update adds in R(s), which is true information, while the contribution of the (possibly wrong) old utilities is shrunk by the discount factor gamma. A sketch follows below.
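A minimal value-iteration sketch; the tiny two-state transition model and rewards are made-up numbers, just to show the update:

```python
import numpy as np

def value_iteration(T, R, gamma=0.9, tol=1e-6):
    """Value iteration: start with arbitrary utilities and repeatedly apply
    the Bellman update U(s) = R(s) + gamma * max_a sum_s' T(s,a,s') U(s')."""
    U = np.zeros(T.shape[0])                      # arbitrary starting utilities
    while True:
        U_new = R + gamma * np.max(T @ U, axis=1) # Bellman update over all states
        if np.max(np.abs(U_new - U)) < tol:       # converged
            return U_new
        U = U_new

# Tiny 2-state, 2-action example (hypothetical numbers):
T = np.array([[[0.9, 0.1], [0.2, 0.8]],          # T[s, a, s'] transition probabilities
              [[1.0, 0.0], [0.1, 0.9]]])
R = np.array([0.0, 1.0])                          # R[s]
print(value_iteration(T, R))
```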
Policy Iteration (Bellman Equation)
- Start with an arbitrary policy pi_0
- Evaluate how good that policy is by calculating the utility of using the policy
- Improve the policy by picking, in each state, the action that maximizes expected utility under the current utilities (see the sketch below)
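A minimal policy-iteration sketch on the same made-up two-state model; solving a linear system is one way to do the policy-evaluation step:

```python
import numpy as np

def policy_iteration(T, R, gamma=0.9):
    """Policy iteration: evaluate the current policy exactly (solve a linear
    system), then greedily improve it; stop when the policy is stable."""
    n_states = T.shape[0]
    policy = np.zeros(n_states, dtype=int)             # arbitrary starting policy
    while True:
        # Policy evaluation: U = R + gamma * T_pi U  ->  (I - gamma*T_pi) U = R
        T_pi = T[np.arange(n_states), policy]           # transitions under the policy
        U = np.linalg.solve(np.eye(n_states) - gamma * T_pi, R)
        # Policy improvement: act greedily with respect to U
        new_policy = np.argmax(T @ U, axis=1)
        if np.array_equal(new_policy, policy):
            return policy, U
        policy = new_policy

T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[1.0, 0.0], [0.1, 0.9]]])
R = np.array([0.0, 1.0])
print(policy_iteration(T, R))
```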
In MDP, our input is
a model consisting of a transition function T and a reward function R, and the intended output is to compute the policy pi (Planning)
In Reinforcement Learning, the inputs are
transitions (initial state, action, reward, result state, …), and the intended output is to “learn” the policy pi.
Policy Search Algorithm (RL)
Mapping states to actions.
The problem with this approach is that the data doesn't reveal which actions need to be taken (the temporal credit assignment problem)
Value Function based Algorithms (RL)
Mapping states to values.
Use argmax to turn into policy
Model-based Algorithm (RL)
Mapping (states & actions) to (next state & reward)
Turn this into a utility function using the Bellman equations, then use argmax to get the policy. This is computationally indirect
Q-function
The utility of leaving state s via action a: the reward of state s plus gamma times the expected value (over next states s_prime, weighted by T(s, a, s_prime)) of taking the optimal action in s_prime (see the equation below)
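The same definition as an equation, using the R(s) and T(s, a, s') notation from the MDP cards:

```latex
Q(s, a) \;=\; R(s) \;+\; \gamma \sum_{s'} T(s, a, s') \, \max_{a'} Q(s', a')
```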
Q-Learning
Estimating the value of Q(s,a) based on transitions and rewards
The issue is that we don't know R(s) and T(s, a, s_prime), since we only observe sampled transitions rather than the model itself
Q_hat(s,a)
An estimate of the Q-function that we update with learning rate alpha_t in the direction of the target: the immediate reward r plus gamma times the estimated value of the best action in the next state (see the sketch below)
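A sketch of a single Q_hat update; the table shape, alpha, and gamma are illustrative assumptions:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_prime, alpha=0.1, gamma=0.9):
    """One Q-learning step: move Q_hat(s,a) a fraction alpha toward the target
    r + gamma * max_a' Q_hat(s', a'), using only the observed transition."""
    target = r + gamma * np.max(Q[s_prime])            # no T(s,a,s') or R(s) needed
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

Q = np.zeros((5, 2))                                   # 5 states, 2 actions (illustrative)
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_prime=3)
```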
Exploration
Getting the data that you need (learning)
Exploitation
Stop learning and actually use what you've already learned
Exploration-Exploitation dilemma
A single agent interacts with the world and must balance these two conflicting goals: it cannot gather new data and fully exploit what it already knows at the same time
Game Theory
Mathematics of conflicts of interests when making decisions.
Related to AI when there are multiple agents acting in the environment
In MDP we have the notion of a “policy”, which is mapping states to actions.
In Game Theory, we have the notion of a <>
Strategy
Strategy (Game Theory)
Mapping of “all possible” states to actions
Fundamental Result (Game Theory)
Everyone is trying to maximize their own reward while knowing that everyone else is trying to do the same
Nash Equilibrium (Game Theory)
We're in a situation where no player has any reason to unilaterally change their strategy (a brute-force check is sketched below)
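A brute-force sketch of this definition for a two-player game with pure strategies; the prisoner's-dilemma payoffs are a standard example, not from the cards:

```python
import itertools
import numpy as np

def pure_nash_equilibria(payoff_a, payoff_b):
    """A strategy pair is a (pure) Nash equilibrium if neither player can
    improve their own payoff by deviating unilaterally."""
    equilibria = []
    for i, j in itertools.product(range(payoff_a.shape[0]), range(payoff_a.shape[1])):
        best_for_a = payoff_a[i, j] >= payoff_a[:, j].max()   # A has no better row
        best_for_b = payoff_b[i, j] >= payoff_b[i, :].max()   # B has no better column
        if best_for_a and best_for_b:
            equilibria.append((i, j))
    return equilibria

# Prisoner's dilemma payoffs (row player A, column player B); 0 = cooperate, 1 = defect.
A = np.array([[-1, -9], [0, -6]])
B = np.array([[-1, 0], [-9, -6]])
print(pure_nash_equilibria(A, B))   # [(1, 1)] -> both defect
```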
Mixed Strategies (Game Theory)
Instead of always making the same decision given the same circumstances (a pure strategy), we choose among strategies with certain probabilities (a distribution over strategies)
Pure Strategies (Game Theory)
Make the same decision given the same circumstances
3 Theorems resulting from Nash Equilibrium
- In an n-player pure-strategy game, if iterated elimination of strictly dominated strategies eliminates all but one combination of strategies, that combination is the unique Nash equilibrium
- Any Nash equilibrium will survive iterated elimination of strictly dominated strategies
- If the number of players is finite and each player has a finite number of strategies, there exists at least one (possibly mixed) Nash equilibrium