final exam review Flashcards

1
Q

How is entropy used in clustering?

A

entropy-based information gain (IG) ranks attributes by how much information they contribute: 0 = no information, 1 = maximum information
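For reference, the standard entropy and information-gain formulas behind that ranking (not spelled out on the card, but these are the usual definitions):

```latex
H(S) = -\sum_{v} p(v)\,\log_2 p(v),
\qquad
IG(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v)
```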

2
Q

Describe the SLC (single linkage clustering) algo

A
  1. each object starts as its own cluster
  2. merge the 2 closest clusters
  3. repeat n − k times, until k clusters remain (see the sketch below)
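A minimal sketch of the procedure in Python (a naive implementation for illustration; in practice a library routine such as scipy's single-linkage would be used):

```python
import numpy as np

def single_linkage(points, k):
    """Naive single-linkage clustering: repeatedly merge the two clusters
    whose closest pair of points is nearest, until k clusters remain."""
    points = np.asarray(points, dtype=float)
    clusters = [[i] for i in range(len(points))]                  # each object starts as its own cluster
    dist = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    while len(clusters) > k:                                      # i.e. merge n - k times
        best = (0, 1, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = dist[np.ix_(clusters[a], clusters[b])].min()  # single-link (closest-pair) distance
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        clusters[a].extend(clusters[b])                           # merge the 2 closest clusters
        clusters.pop(b)
    return clusters

# toy usage: 5 points in 2-D, ask for 2 clusters
X = [[0, 0], [0, 1], [5, 5], [5, 6], [10, 0]]
print(single_linkage(X, k=2))
```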
3
Q

what is an important consideration for SLC?

A

how you define distance, i.e. what counts as “close” for clusters/data points. This choice can be considered your domain knowledge

4
Q

Describe K-means algo

A
  1. pick k centers at random.
  2. for each center
    a. claim the closest points
    b. recompute the centroid (per-dimension mean)
  3. repeat until convergence

define a consistent tie-breaking metric (see the sketch below).
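A minimal K-means sketch in Python (assumes Euclidean distance; argmin's first-index behavior acts as the consistent tie-break mentioned above):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Basic K-means: pick k centers at random, let each center claim its closest
    points, recompute centroids, repeat until the assignments stop changing."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]    # 1. pick k centers at random
    labels = None
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None] - centers[None, :], axis=-1)
        new_labels = dists.argmin(axis=1)                     # 2a. each center claims its closest points
        if labels is not None and np.array_equal(new_labels, labels):
            break                                             # 3. converged: assignments no longer change
        labels = new_labels
        for j in range(k):
            if np.any(labels == j):                           # keep the old center if a cluster goes empty
                centers[j] = X[labels == j].mean(axis=0)      # 2b. recompute the centroid (per-dimension mean)
    return centers, labels
```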

5
Q

what randomized optimization algo is K-means similar to?

A

hill climbing - take steps to move to a better configuration

6
Q

What is a pitfall of K-means? How can you avoid it?

A

can get stuck in local optima. Avoid this by doing random restarts, or by defining your space to be points of the convex hull.

7
Q

K-means time complexity

A

each iteration is O(kn)

the total number of iterations is bounded by O(k^n), since there are only finitely many possible assignments of points to clusters

8
Q

K-means properties

A

error is monotonically non-increasing across iterations.

the error metric is the squared distance of points from their centers/means

9
Q

EM algo overview

A

soft clustering technique that assigns points to clusters based on probability distributions.

  1. select k Gaussian distributions (e.g., with randomly chosen means)
  2. E-step: find the probability that each point (x_i) belongs to each Gaussian distribution
  3. M-step: recompute each cluster mean as the weighted average of the x_i’s (weighted by those probabilities)
  4. repeat until the means don’t move much (see the sketch below)
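A minimal EM sketch for a 1-D mixture of k Gaussians, simplified by fixing unit variances and equal mixing weights (full EM would also re-estimate those):

```python
import numpy as np

def em_gmm_1d(x, k, iters=100, seed=0):
    """Simplified EM: soft-assign points to k fixed-variance Gaussians (E-step),
    then recompute each mean as a weighted average (M-step)."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    means = rng.choice(x, size=k, replace=False)                  # 1. pick k Gaussians (random means)
    for _ in range(iters):
        # E-step: probability that each point x_i belongs to each Gaussian (soft assignment)
        resp = np.exp(-0.5 * (x[:, None] - means[None, :]) ** 2)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: recompute each mean as the responsibility-weighted average of the points
        new_means = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
        if np.allclose(new_means, means, atol=1e-6):              # means barely move -> stop
            break
        means = new_means
    return means, resp

# toy usage: two well-separated groups of points
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(6, 1, 100)])
print(em_gmm_1d(x, k=2)[0])
```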
10
Q

EM Properties

A
  • does not diverge
  • doesn’t fully converge, but practically does
  • can get stuck in local optima, need restarts
  • monotonically non-decreasing likelihood.
  • If E, M solvable, works with any distribution.
11
Q

Can EM or K-means ever act in the same way?

A

if the cluster probabilities are pushed to 0 or 1 (i.e. an argmax assignment), EM behaves the same as K-means

12
Q

What are some properties of clustering?

A

richness
scale invariance
consistency

13
Q

what is cluster richness

A

all possible partitions are achievable: for any clustering you want, there is some distance matrix D for which the clustering scheme P_D produces that clustering

14
Q

scale invariance

A

multiplying all distances by a positive scalar does not change the clustering (i.e., changing units)

15
Q

cluster consistency

A

making similar things more similar and different things more different will not change the groupings.

16
Q

what is the impossibility thm

A

no clustering algorithm can have all three of consistency, richness, and scale invariance

17
Q

what is game theory?

A

the mathematics of conflict. games can have single or multiple agents.

18
Q

what is a strategy

A

a mapping from every possible state to an action.
each player has their own strategy
it only maps over states attainable by that player

19
Q

what is the difference between a pure and a mixed strategy?

A

pure strategies are single, deterministic moves by a player, while mixed strategies are probability distributions over the player’s pure strategies.

In a pure strategy a player chooses an action for sure, whereas in a mixed strategy they choose a probability distribution over the set of actions available to them.

20
Q

what is minimax

A

for a game between A and B, A considers the best-case counter while B considers the worst-case counter.

this is all about point of view.

A picks the strategy that maximizes its own value, knowing the opponent is trying to minimize that value.

also known as covering your ass.

21
Q

what is the fundamental results for a 2 player, zero sum, deterministic game of perfect info?

A
  • minimax = maximin
  • there always exists an optimal pure strategy for each player.
  • the value of the game is the entry of the payoff matrix reached when A and B both play their optimal strategies, acting rationally.
22
Q

what is the effect of hidden information?

A
  • example: mini-poker
  • minimax no longer equals maximin, because one player’s best strategy depends on what the opponent is going to do.

23
Q

what is a Nash Equilibrium

A

for n players with strategies S1, …, Sn:
these strategies are in a Nash equilibrium iff, for each player, their strategy maximizes their utility given the other players’ strategies.

This is saying that no player in the game has a reason to unilaterally change their strategy.

a unique NE exists if the elimination of strictly dominated strategies eliminates all but one combination.

any NE will survive the elimination of strictly dominated strategies

24
Q

T/F : for a finite # of players and finite # of strategies, there exists a NE

A

True (possibly in mixed strategies)

25
Q

T/F: in a repeated game, the last state is the only one that matters

A

True, because eventually the NE will dominate. Even though we’ve built up trust, we will eventually defect because it is in our best interest, and this reasoning unravels back to the start.

n repeated games -> n repeated NE

26
Q

what is the expected number of rounds for an IPD?

A

1/(1 − gamma), where gamma is the probability the game continues for another round
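A short derivation, treating gamma as the probability that the game continues for another round (a standard geometric-series argument):

```latex
\mathbb{E}[\text{rounds}]
= \sum_{t=0}^{\infty} \Pr(\text{at least } t+1 \text{ rounds})
= \sum_{t=0}^{\infty} \gamma^{t}
= \frac{1}{1-\gamma}
```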

27
Q

describe tit for tat

A
  • cooperate on the 1st round
  • each round thereafter, copy the opponent’s previous move (see the sketch below)
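The strategy is simple enough to write out directly (a minimal sketch; the move labels 'C'/'D' for cooperate/defect are just an assumed encoding):

```python
def tit_for_tat(opponent_history):
    """Tit-for-tat: cooperate on the first round, then copy the opponent's previous move."""
    return "C" if not opponent_history else opponent_history[-1]

print(tit_for_tat([]))            # 'C' -> cooperate on the 1st round
print(tit_for_tat(["C", "D"]))    # 'D' -> copies the opponent's last move
```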

28
Q

T/F: when facing TFT, you should always cooperate for low gamma

A

False. You should always defect for low gamma because there will be a small number of games being played.

For high gamma, you should always cooperate when facing TFT.

29
Q

What is a finite state strategy?

A

a strategy represented by a deterministic finite state machine, where one player’s choices affect their own payoff and the opponent’s future decisions.

30
Q

what is the folk theorem?

A

in repeated games, the possibility of retaliation opens the door for cooperation.

Any feasible payoff profile that strictly dominates the minimax/security level profile can be realized as a NE payoff profile, with sufficiently large gamma.

Proof idea: because the profile strictly dominates the minimax profile, it can be enforced as a threat, and each player is best off doing what it is told.

31
Q

what is the minimax profile for an IPD?

A

it is the pair of payoffs, one for each player, representing what each player can attain while defending itself from a malicious opponent.

  • it can be found by solving the game as if it were zero-sum.
  • it bounds the acceptable region inside the feasible region of a 2-player game.
32
Q

what is the security level profile?

A

the pair of payoffs each player can guarantee themselves in an adversarial situation (essentially the minimax profile); payoffs that are better than this for both players form the acceptable region.

33
Q

what is subgame perfect?

A
  • each player takes the best response independent of the history in the game.
34
Q

what makes a game not subgame perfect?

A
  • a game is not SGP if there exists some history of actions given to the players after which one of the players is not taking a best response
  • e.g., a player makes a threat, but when it comes time to act on the threat, carrying it out is worse for itself than what it would do otherwise.
35
Q

Describe Grim Trigger

A
  • cooperate, but if the opponent ever defects, defect forever after.
  • a strategy that can be used to prove the folk theorem
  • relies on an implausible threat -> always taking vengeance and forgoing future rewards is irrational and unreasonable
36
Q

T/F: Grim Trigger vs. TFT is subgame perfect

A

False. The pairing is not subgame perfect: there are histories where Grim Trigger keeps defecting (carrying out its threat) even though cooperating would be better for it.

37
Q

what is an example of a subgame perfect game?

A

Pavlov v. Pavlov

average reward for this game is mutual cooperation

38
Q

what is the difference between a plausible and implausible threat?

A

implausible threat - taking vengeance and forgoing rewards without a way to improve it for yourself.

plausible threat - defect knowing that you will be ok/get a positive state afterwards.

39
Q

how can you update the q-learning equation to be applicable for a stochastic game?

A
  • utilize the joint actions of the players (learn Q-values over state and joint-action pairs)
  • change the discounted expected value of the next step to be the minimax of the Q-values over the joint actions of the next state (minimax-Q):
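Written out, this is the standard minimax-Q update (Littman); alpha is the learning rate, a is the learner's action, and o is the opponent's action:

```latex
Q(s,(a,o)) \;\leftarrow\; (1-\alpha)\,Q(s,(a,o))
\;+\; \alpha\Big(r + \gamma \max_{\pi \in \Delta(A)} \min_{o'} \sum_{a'} \pi(a')\, Q\big(s',(a',o')\big)\Big)
```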

40
Q

what are the properties of a zero-sum stochastic game?

A
  • VI works
  • minimax-Q converges
  • there exists a unique solution to Q*
  • policies can be computed independently
  • the update is efficient
  • Q functions are sufficient to specify the policy
41
Q

what is the issue with general sum stochastic games?

A

when the discounted expected value of the next step is changed to use the Nash equilibrium of a joint action, all of the nice zero-sum properties are lost:

  • VI does not work because the operation does not converge
  • Nash-Q does not converge
  • there is no unique solution to Q*
  • policies cannot be computed independently because Nash equilibria describe joint behavior
  • the update is not efficient
  • Q functions are not sufficient to specify the policy
42
Q

What is the No Free Lunch Theorem?

A
  • an impossibility theorem telling us that a general-purpose universal optimization strategy is impossible.
  • the only way one strategy can outperform another is if it is specialized to the structure of the specific problem under consideration
  • ‘if anything is possible, then nothing can be expected’
  • any two optimization algorithms are equivalent when their performance is averaged across all possible problems
43
Q

What does the NFLT tell us?

A

if we cannot make any prior assumptions about the optimization problem we are trying to solve, no strategy can be expected to perform better than any other.

44
Q

For the matrix interp of NFLT, what are the rows and columns?

A
  • the rows of the problem matrix (P) are the strategies
  • the columns are the universe of all possible problems.
  • the entries are the performances of those strategies on the problems.
45
Q

T/F: for the NFLT, on average all search algorithms perform no better than random search.

A

True. If we don’t have any prior assumptions about the function we are searching over, then on average no algorithm can be expected to do better than random search.

46
Q

what is an MDP and what is needed to define it?

A

Markov decision process.

States, actions, a model (transition probabilities), and rewards; the policy is the solution we solve for.

47
Q

T/F: MDP’s care about data from past, present, and future

A

False. For MDPs, only the present matters (the Markov property). The rules are stationary.

48
Q

how can you add history to your MDP?

A

fold history into the state: add more information to your state so that the current state remembers everything it needs from the past.

49
Q

T/F: Reward and utility are the same thing.

A

False!
Reward is immediate
Utility includes the immediate reward, but also the discounted expected value of the future states stemming from the current one. Utility is all about delayed reward (temporal credit assignment).

50
Q

what is the utility of an infinite horizon problem?

A

bounded by R_max / (1 − gamma)
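The bound follows from the geometric series over discounted rewards:

```latex
U \;\le\; \sum_{t=0}^{\infty} \gamma^{t} R_{\max} \;=\; \frac{R_{\max}}{1-\gamma}
```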

51
Q

describe process of VI

A
  1. starting from the goal state, calculate the value of the neighboring states with reference to the discount factor.
  2. on subsequent iterations, VI considers states further and further from the goal.
  3. VI converges when it settles on fixed values for each state.
  4. the policy is then found by taking the argmax over the possible actions for each state (see the sketch below).
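A minimal value-iteration sketch in Python. The MDP encoding here (P[s][a] as a list of (probability, next_state) pairs, R[s] as the reward for state s) is an assumption for illustration, not a standard API:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """Repeatedly apply the Bellman update until the values stop changing,
    then read the policy off with an argmax over actions."""
    n_states = len(P)
    V = np.zeros(n_states)                                 # start from arbitrary (zero) values
    while True:
        V_new = np.array([
            R[s] + gamma * max(                            # Bellman update: best action's expected value
                sum(p * V[s2] for p, s2 in P[s][a])
                for a in range(len(P[s]))
            )
            for s in range(n_states)
        ])
        if np.max(np.abs(V_new - V)) < tol:                # converged: values have effectively stopped changing
            V = V_new
            break
        V = V_new
    policy = [                                             # argmax over the possible actions for each state
        max(range(len(P[s])), key=lambda a: sum(p * V[s2] for p, s2 in P[s][a]))
        for s in range(n_states)
    ]
    return V, policy

# toy usage: 2 states, 2 actions; state 1 is an absorbing, rewarding state
P = [
    [[(1.0, 0)], [(0.8, 1), (0.2, 0)]],    # state 0: action 0 stays put, action 1 usually reaches state 1
    [[(1.0, 1)], [(1.0, 1)]],              # state 1: absorbing
]
R = [0.0, 1.0]
print(value_iteration(P, R))
```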
52
Q

T/F: Value iteration is a linear calculation

A

False. VI is non-linear because of the max over actions inside the Bellman update of the discounted expected utility.

53
Q

describe policy iteration

A
  1. start with an initial guess at a policy (e.g., random)
  2. evaluate the given policy for all states it covers
  3. after evaluation, PI explores different actions that could improve the value assigned to each state
    - if the value improves with a new action, update the policy
  4. repeat 2 and 3 until the policy does not change

the evaluation step is a linear calculation with n equations in n unknowns (see the sketch below).
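A matching policy-iteration sketch, using the same assumed MDP encoding as the value-iteration sketch above; the evaluation step is the linear solve:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Alternate policy evaluation (a linear system) with greedy policy improvement
    until the policy stops changing."""
    n_states = len(P)
    policy = [0] * n_states                                 # 1. initial guess at a policy
    while True:
        # 2. policy evaluation: solve (I - gamma * P_pi) V = R, i.e. n equations in n unknowns
        A = np.eye(n_states)
        for s in range(n_states):
            for p, s2 in P[s][policy[s]]:
                A[s, s2] -= gamma * p
        V = np.linalg.solve(A, np.asarray(R, dtype=float))
        # 3. policy improvement: switch to any action that improves a state's value
        new_policy = [
            max(range(len(P[s])), key=lambda a: sum(p * V[s2] for p, s2 in P[s][a]))
            for s in range(n_states)
        ]
        if new_policy == policy:                            # 4. stop once the policy no longer changes
            return V, policy
        policy = new_policy
```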

54
Q

what are the differences between PI and VI?

A
  • VI is non-linear while PI is linear.
  • PI is linear because rather than finding the max over all actions, we already know the action as determined by the current policy.
  • PI can take fewer iterations to converge, but each iteration is more computationally expensive when compared to VI.
55
Q

what makes Q-learning model free?

A

Q-learning is not provided any domain knowledge in terms of the transition probabilities and rewards.

56
Q

What is the Q-value

A

the value for arriving in state s, leaving via action a, and adding the discounted expected value of the next state s'.

once in s', take the action with the highest Q-value and proceed optimally thereafter.

57
Q

what are the constraints on the learning rate for Q-learning?

A

the sum of the learning rates must diverge (go to infinity) in the limit: sum of alpha_t = infinity.

the sum of the squared learning rates must be finite in the limit: sum of alpha_t^2 < infinity.

58
Q

Describe the Q-learning algorithm

A
  • initialize the Q-values for all state–action pairs (randomly or to zero).
  • Q-update: take the current state–action pair and move its value a small amount (alpha) toward the immediate reward plus the discounted estimated value of the next state.
  • choose actions with a simulated-annealing-like approach: take a random action or the best action depending on the parameter epsilon.
  • repeat until convergence (see the sketch below).
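A minimal tabular Q-learning sketch in Python. The environment interface used here (env.reset(), env.step(a) returning (next_state, reward, done), env.n_actions) is an assumption for illustration, not any particular library's API:

```python
import random

def q_learning(env, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration; Q-values default to zero."""
    Q = {}
    def q(s, a):
        return Q.get((s, a), 0.0)

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy: random action with prob epsilon, otherwise the best known action
            if random.random() < epsilon:
                a = random.randrange(env.n_actions)
            else:
                a = max(range(env.n_actions), key=lambda a_: q(s, a_))
            s2, r, done = env.step(a)
            # Q-update: move a small amount (alpha) toward the immediate reward plus
            # the discounted estimate of the best value of the next state
            target = r + gamma * max(q(s2, a_) for a_ in range(env.n_actions))
            Q[(s, a)] = (1 - alpha) * q(s, a) + alpha * target
            s = s2
    return Q
```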
59
Q

what is needed for Q-learning convergence.

A

assume the system is a deterministic MDP
rewards are bounded
the agent selects actions such that it visits each state–action pair infinitely often

60
Q

what is the exploitation exploration dilemma?

A

describes a fundamental tradeoff in reinforcement learning.

  • how can you balance learning about your environment and using (getting rewards from) your environment
  • these are conflicting objectives that are limited by computation constraints, time, data acquisition, etc.
61
Q

what randomized optimization algorithm is utilized for q-learning? how?

A

a simulated-annealing-like approach for action selection: a mixture of choosing randomly and exploiting Q-hat.
- choose the (estimated) optimal action with probability 1 − epsilon
- choose a random action with probability epsilon

as long as epsilon > 0 (and small) and the MDP is fully connected, each state–action pair is visited infinitely often.

62
Q

what is a fully connected MDP

A

every state can be reached from every other state by some sequence of actions; no state is unreachable.

63
Q

what is the filtering algorithm for feature selection?

A

takes the set of features and uses a search/criterion up front to reduce the number of features, then provides the reduced set to the learner (see the sketch below).
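A small filtering sketch, assuming scikit-learn's SelectKBest as the filter with mutual information as the criterion (any score computed without learner feedback works the same way):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)
# score each feature in isolation, keep the best 2, and only then hand
# the reduced feature set to whatever learner comes next
X_reduced = SelectKBest(mutual_info_classif, k=2).fit_transform(X, y)
print(X_reduced.shape)   # (150, 2)
```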

64
Q

what are pros and cons of filtering

A

pro:
- FAST
con:
- no learner feedback
- reviews features in isolation, ignoring feature pairings/interactions

65
Q

describe wrapping

A

takes the features and feeds subsets of them to a learner
the learner reports how well each subset performed
the subset of features is updated accordingly
repeat until added features no longer improve performance (see the sketch below)
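A small wrapping sketch, assuming scikit-learn's SequentialFeatureSelector as the search and a k-NN classifier as the learner providing the feedback:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
learner = KNeighborsClassifier(n_neighbors=3)
# repeatedly feed candidate subsets to the learner, score them by cross-validation,
# and keep growing the subset while the score improves
selector = SequentialFeatureSelector(learner, n_features_to_select=2, direction="forward")
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)   # (150, 2)
```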

66
Q

what are pro and cons of wrapping?

A

pros:

  • learner gets good features through communicating with the search algo.
  • this technique takes advantage of the learner’s bias.

cons:
- SLOW

67
Q

what are examples of domain knowledge for filtering?

A
info gain,
variance,
entropy,
“useful” features (useful as deemed by the learner),
independence/non-redundancy
68
Q

what are the types of relevance and how are they defined?

A

strongly relevant
- x_i is strongly relevant if removing it degrades the bayes optimal classifier

weakly relevant

  • x_i is not strongly relevant
  • x_i can pair with other elements to improve the BOC.
  • i.e., there exists some subset of features S such that adding x_i to S improves the performance of the BOC

irrelevant
- x_i provides no info to the BOC

69
Q

what is the bayes optimal classifier?

A

The BOC is the best you can do on average. It does not correspond to any single hypothesis; it is simply the best possible classifier for the problem.
BOC takes the weighted average of all hypotheses, based on the probability of the correct hypothesis given the data.

70
Q

T/F: usefulness and relevance can be used to describe the same thing.

A

False
Relevance describes the information gained by a feature when used in a learner.
Usefulness describes the error decreased by a feature when used in a learner.

71
Q

what is usefulness in the case of feature selection?

A

Usefulness measures the effect on a particular predictor/classifier
It can be measured as the error given a particular model/learner.
As usefulness increases, this decreases the error for your learner

72
Q

what is feature transformation

A

the problem of pre-processing a set of features to create a new feature set, while retaining as much information as possible.

73
Q

T/F: Feature selection and feature transformation both output a new subset of the original features to be used by a learner

A

False
Feature selection does return a subset of the original features.
Feature transformation creates new features, each of which is a (typically linear) combination of the original features.

74
Q

what is polysemy? what is synonymy?

A

In information retrieval:
polysemy: one word with multiple meanings -> gives false positives
synonymy: many words that mean the same thing -> give false negatives

75
Q

What does PCA do?

A

finds the direction of maximal variance of the data, and then further directions that are mutually orthogonal to it. This can be thought of as maximizing how distinct the component vectors are, so that each vector tells a different, unique story about the data.

PCA transforms the data into a new space where feature selection can then work (see the sketch below).
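A tiny PCA sketch, assuming scikit-learn's PCA (NumPy's SVD would work just as well):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] = 3 * X[:, 0] + 0.1 * X[:, 1]         # make two features strongly correlated

pca = PCA(n_components=2)
X_new = pca.fit_transform(X)                  # data re-expressed along the top-2 variance directions
print(pca.explained_variance_ratio_)          # eigenvalue-based ordering of the components
```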

76
Q

What are some properties of PCA?

A
  • Provides the best reconstruction error -> minimizes L2 error when moving to a smaller dimensional basis.
  • maximizes the variance
  • orthogonal: global algorithm that finds the perpendicular (in 2D case) components
  • eigenproblem: each principal component has an eigenvalue, and the eigenvalues are monotonically non-increasing as the component number goes up
  • well studied so fast implementations exist.
  • produces ordered features: based on eigenvalues
  • finds global features
77
Q

what does an eigenvalue of 0 mean for a PCA component

A

an eigenvalue of 0 means that direction carries no variance (no information/entropy), so the component is irrelevant and can be dropped

78
Q

what feature selection algorithm is PCA similar to?

A

filtering because you remove components with small eigenvalues

79
Q

what does ICA attempt to do?

A

ICA tries to maximize independence through a linear transformation. Given hidden variables (X) and observables (Y), ICA wants to find new features that are mutually independent of one another while maximizing the mutual information between the new features and the original data.

In other words, ICA unmixes the observed (mixed) data into independent components without losing information about the observables.

Think about the cocktail party problem (see the sketch below).
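A small cocktail-party sketch, assuming scikit-learn's FastICA as the ICA implementation: two independent sources are mixed into two "microphones", and ICA recovers statistically independent components from the mixtures:

```python
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
sources = np.c_[np.sin(3 * t), np.sign(np.sin(5 * t))]      # two independent "speakers"
mixing = np.array([[1.0, 0.5],
                   [0.4, 1.0]])
observed = sources @ mixing.T                               # each "microphone" hears a linear mix

ica = FastICA(n_components=2, random_state=0)
recovered = ica.fit_transform(observed)                     # mutually independent components
print(recovered.shape)                                      # (2000, 2)
```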

80
Q

T/F: ICA is a linear transformation

A

True

81
Q

What is the cocktail party problem?

A

talkers (hidden variables) and microphones (observables)
microphones pick up a little bit of every talker (linear combination)
how can we extract individual voices from the combined data.

we desire information between the mic’s and the voices to be as high as possible (max mutual info) w/each feature being independent

82
Q

what is the difference between hidden variables and observables in the the BSS?

A

hidden variables: causing events to happen
observables: recording the events. Each observable contains a linear combination of all the hidden variables. Given the observables, can reconstruct the hidden variables.

83
Q

Describe some properties of ICA

A
  • mutual independence between the newly transformed features
  • maximizes mutual information between the newly transformed features and the original data/features
  • finds local features
  • directional
84
Q

T/F: there are cases where PCA and ICA act the same way for a set of features

A

True
If the underlying distributions are all Gaussian, then uncorrelated features are also independent, so the uncorrelated features PCA finds can be the same as the independent features ICA finds.

85
Q

what is the central limit theorem

A

the sum of many mutually independent random variables tends toward a Gaussian distribution in the limit.

86
Q

when run over a face, what would PCA find? ICA?

A

PCA:
brightness (usually thrown out bc it is the avg), average face -> global features

ICA:
noses, eye selectors, hair selectors -> Local features

87
Q

what do PCA, ICA, RCA, and LDA essentially do?

A

they all find and give insight to the fundamental features of your data

88
Q

what is the main benefit to RCA?

A

Speed: it is very fast (random projections are cheap to generate and apply).

89
Q

what is LDA?

A

finds projections/features that discriminate based on the label.
in the binary case, this amounts to finding a separating direction (similar in spirit to an SVM).
LDA uses the values of the projections to re-represent the data (see the sketch below).
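A short LDA sketch, assuming scikit-learn's LinearDiscriminantAnalysis; unlike PCA/ICA, it uses the labels to choose the projections:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis(n_components=2)
X_new = lda.fit_transform(X, y)       # data re-represented by its class-discriminating projections
print(X_new.shape)                    # (150, 2)
```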

90
Q

T/F: PCA performs well on non-gaussian distributions

A

False:
PCA desires gaussian distributions
ICA works well on non-gaussian data.

91
Q

What are the assumptions of PCA?

A
  1. assumes all basis vectors are orthonormal.
  2. the directions with the largest variances are the most “important” or most principal.

92
Q

What properties does SLC have when n/2 clusters are reached?

A

Scale invariance, consistency.

Not rich b/c there is a fixed number of clusters

93
Q

What properties does SLC have when clusters formed are theta units apart?

A

richness, consistency

no scale invariance due to the fixed distance separation

94
Q

what properties does SLC have when clusters formed are theta/w units apart, where w = max inter-cluster distance?

A

Richness, scale invariance

Not consistent bc as w increases there can be a point where theta/w is so small nothing gets clustered.

95
Q

how can the E and M steps of EM be interpreted?

A

E-step can be interpreted as constructing a lower bound to the posterior distribution. This is a “soft” assignment that, once computed, assigns a posterior probability to each possible association of each individual sample

M-step optimizes this bound, thereby improving the estimate for the unknowns. The lower bound is maximized, and the corresponding new estimate is guaranteed to lie closer to the location of the nearest local maximum of the likelihood.

96
Q

SLC time complexity

A

O(n^3)