Midterm Flashcards

1
Q

What fields are a part of data mining?

A

Machine learning, databases, visualization, application domain, statistics

2
Q

Explain the data mining process

A

1. Collect data 2. Prepare data 3. Build a model 4. Evaluate the model 5. Deploy the model

3
Q

What are the two types of data mining techniques?

A

Predictive (supervised): predict a (discrete or continuous) class attribute based on other attribute values; this is like learning from a teacher. Descriptive (unsupervised): discover the structure of the data without prior knowledge of class labels.

4
Q

What is big data?

A

Data that is too large to be analyzed with today’s resources

5
Q

Describe the algorithm for k-means (clustering)

A
  1. Randomly pick k cluster centers 2. Assign every object to its nearest cluster center 3. Move each cluster center to the mean of its assigned objects 4. Repeat steps 2 and 3 until the stopping criterion is satisfied
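The four steps above can be sketched in plain Python (a minimal sketch; the convergence check, "stop when the centers no longer move", is one possible stopping criterion, and `max_iters` is an illustrative safeguard):

```python
import random

def dist2(a, b):
    # squared Euclidean distance between two points given as tuples
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, max_iters=100):
    # 1. Randomly pick k cluster centers
    centers = random.sample(points, k)
    for _ in range(max_iters):
        # 2. Assign every object to its nearest cluster center
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centers[i]))
            clusters[nearest].append(p)
        # 3. Move each cluster center to the mean of its assigned objects
        new_centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        # 4. Repeat until the centers stop moving (stopping criterion)
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters
```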
6
Q

Describe the different types of attributes and the properties associated with each.

A

Nominal - distinctness. Ordinal - distinctness, order. Interval - distinctness, order, +, -. Ratio - distinctness, order, +, -, *, /.

7
Q

What are the 3 types of data sets?

A

Record: data matrix, document data, transaction data. Graph: objects with relationships to other objects, objects that have sub-objects. Ordered: spatial data, temporal data, sequence, sequential.

8
Q

Describe the different sampling techniques

A

Simple random sampling - each object is selected with equal probability (with replacement or without replacement). Stratified sampling - split the data into several partitions, and take random samples from each partition.

9
Q

What is under sampling?

A
  1. Include all samples from the minority class 2. Sample randomly from the majority class, with or without replacement
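Those two steps can be sketched as (a sketch; the class labels and the choice to sample exactly as many majority records as there are minority records are illustrative assumptions):

```python
import random

def undersample(records, labels, minority, majority, replace=False):
    mino = [r for r, y in zip(records, labels) if y == minority]
    maj = [r for r, y in zip(records, labels) if y == majority]
    # 1. include all samples from the minority class
    balanced = [(r, minority) for r in mino]
    # 2. sample randomly from the majority class,
    #    with or without replacement, until the classes are balanced
    if replace:
        picked = [random.choice(maj) for _ in range(len(mino))]
    else:
        picked = random.sample(maj, len(mino))
    return balanced + [(r, majority) for r in picked]
```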
10
Q

What is over sampling?

A

It’s the opposite of undersampling: 1. Include all samples from the majority class 2. Sample the minority class with replacement
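The same idea in code (a sketch; the class labels and the choice to draw exactly as many minority records as there are majority records are illustrative assumptions):

```python
import random

def oversample(records, labels, minority, majority):
    maj = [r for r, y in zip(records, labels) if y == majority]
    mino = [r for r, y in zip(records, labels) if y == minority]
    # 1. include all samples from the majority class
    balanced = [(r, majority) for r in maj]
    # 2. sample the minority class with replacement until balanced
    balanced += [(random.choice(mino), minority) for _ in range(len(maj))]
    return balanced
```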

11
Q

What are the two ways of mapping categorical values to numerical data (transformation for preprocessing)?

A

1 of n - create 1 new attribute for each possible value of the categorical attribute. m of n - create new attributes such that each categorical value is mapped to a unique combination of the new attributes.
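The 1-of-n scheme (often called one-hot encoding) can be sketched as (a sketch; sorting the attribute values to fix a column order is an assumption for reproducibility):

```python
def one_of_n(values):
    # one new binary attribute per possible categorical value
    domain = sorted(set(values))                 # fix an attribute order
    index = {v: i for i, v in enumerate(domain)}
    return domain, [
        [1 if i == index[v] else 0 for i in range(len(domain))]
        for v in values
    ]
```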

12
Q

What are the 3 ways values can be missing?

A
  1. Missing Completely at Random (MCAR), e.g., randomly crossing out things in your data. This is the only time it is acceptable to substitute the average of the known values, but doing so reduces the variance of the set. 2. Missing at Random - the reason the value is missing is related to the value of another attribute (e.g., a biased Facebook page). 3. Non-Ignorable - the reason the value is missing is related to the value itself (e.g., a limitation of a sensor instrument below a certain threshold).
13
Q

What is normalization and what is it used for?

A

- Equalizes the weights associated with attribute values - Gives each attribute the same importance

14
Q

What are the 3 types of normalization?

A
  1. Min-max normalization 2. Z-score normalization 3. Decimal scaling
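The three schemes can be sketched as (a sketch; the formulas follow the usual textbook definitions, with the population standard deviation for z-score):

```python
import math

def min_max(xs, lo=0.0, hi=1.0):
    # rescale values linearly into [lo, hi]
    mn, mx = min(xs), max(xs)
    return [lo + (x - mn) * (hi - lo) / (mx - mn) for x in xs]

def z_score(xs):
    # subtract the mean, divide by the standard deviation
    mean = sum(xs) / len(xs)
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
    return [(x - mean) / std for x in xs]

def decimal_scaling(xs):
    # divide by the smallest power of 10 that brings every |value| below 1
    j = len(str(int(max(abs(x) for x in xs))))
    return [x / 10 ** j for x in xs]
```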
15
Q

What is the significance of sample size?

A

With sample size, there is a trade-off between accuracy and speed.

16
Q

Describe feature subset selection.

A
  1. Remove redundant features - duplicate info contained in other attributes 2. Remove irrelevant features - not relevant to the data mining task at hand
17
Q

Describe feature creation

A

mapping data to a new space

18
Q

What are the different types of attribute transformations?

A

- Categorical to numeric (1 of n and m of n) - Numeric to categorical (binning and discretization) - PCA - Normalization and standardization

19
Q

Describe Hunt’s algorithm

A

If the current node contains records that all belong to the same class, then the current node is a leaf node with that class. Otherwise, split the data into smaller subsets and repeat recursively on each subset.
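That recursion can be sketched as (a sketch; splitting on attributes in a fixed order is a simplifying assumption, since real implementations choose the best split by an impurity measure, and the majority-class fallback handles running out of attributes):

```python
def hunts(records, labels, attrs):
    # records: list of dicts; labels: parallel list of class labels
    # Base case: all records share one class -> leaf node with that class
    if len(set(labels)) == 1:
        return labels[0]
    if not attrs:  # no attributes left: fall back to the majority class
        return max(set(labels), key=labels.count)
    # Otherwise split the data into smaller subsets and recurse
    attr, rest = attrs[0], attrs[1:]
    tree = {}
    for value in set(r[attr] for r in records):
        sub = [(r, y) for r, y in zip(records, labels) if r[attr] == value]
        tree[value] = hunts([r for r, _ in sub], [y for _, y in sub], rest)
    return (attr, tree)
```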

20
Q

What is the difference between over fitting and under fitting?

A

Overfitting: the model memorizes the data so closely that it can’t generalize. Underfitting: the model is not built out enough to fit anything.

21
Q

What is bias?

A

A systematic shift in the “ground truth” (the information that we are trying to discover), e.g., only surveying students in COIS 4400 to find out whether students at Trent like their classes (only 4th-year students can take the course, etc.).

22
Q

How can bias be avoided with training data?

A
  1. Always randomize the data (shuffle the order of the rows) 2. Split the data into 2 chunks (test & training)
23
Q

What are the 2 ways to split data into a test set and training set for decision trees?

A
  1. Out of bag estimation
    a) sample repeatedly from the data set with replacement and add selected samples to the training set
    b) repeat until the training set has as many samples as the original data set (some of them will be repeats)
    c) use the values that weren’t picked for the test set
  2. K-fold Cross-Validation
    a) split the data into k chunks
    b) for each chunk, use it as the test set and the others as the training set
    c) at the end, sum up the errors
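K-fold cross-validation can be sketched as (a sketch; `train_and_error` is a hypothetical stand-in for whatever model-building and error-counting routine is used, and the striped fold split is one simple way to make k chunks):

```python
def k_fold_cv(data, k, train_and_error):
    # a) split the data into k chunks
    folds = [data[i::k] for i in range(k)]
    total_error = 0
    # b) each chunk is the test set once; the rest form the training set
    for i in range(k):
        test = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        total_error += train_and_error(train, test)
    # c) at the end, sum up the errors
    return total_error
```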
24
Q

When building decision trees, the training and test set have to be i.i.d. Explain what this is.

A

i.i.d.: independent and identically distributed. The training and test sets must have the same distribution as the original data set, and we must also ensure that the probability of choosing one sample does not affect another.

25
Q

What is the difference between eager learners and lazy learners?

A

Lazy: deployment is slow and often not a good abstraction of the data (instance-based learning); it basically just remembers the data, with little processing. k-nearest neighbour is an example. Eager: builds a model early, before getting test data, so deployment is fast.

26
Q

Describe the Curse of Dimensionality

A
  1. Runtime: if the algorithm does not behave at most linearly in the number of attributes, the runtime increases too quickly. 2. Amount of data: the number of samples needed to cover the space with equal density grows exponentially with the number of dimensions. 3. Distances: distances between data points become meaningless, since the maximum distance between data points does not grow linearly with the number of dimensions.
27
Q

What is a rote learner?

A

An instance based classifier

Memorizes the entire training data and performs classification only if attributes of a record match one of the training examples exactly

There’s no real learning or model building

28
Q

Describe the k-nearest neighbour classifier.

A

Uses k “closest” points (nearest neighbours) for performing classification.

29
Q

What is a voronoi diagram?

A

It splits up the solution space for nearest neighbour (it’s basically your model).

If the new data point lies on a line, it is equidistant to 2 points; if it lies on an intersection, it is equidistant to 3 points.

30
Q

Describe the steps in the nearest neighbour algorithm.

A

Compute the distance between two points using Euclidean distance.

Determine the class from the nearest-neighbour list:

  • take the majority vote of class labels among the k-nearest neighbours
  • Weigh the vote according to distance

weight factor: w = 1/d²

31
Q

What 3 things does a nearest neighbour classifier require?

A
  1. The set of stored records
  2. Distance metric to compute distances between records
  3. The value of k, the number of nearest neighbours to retrieve
32
Q

How does nearest neighbour classify an unknown record?

A
  1. Compute the distance to the other training records
  2. Identify the k nearest neighbours
  3. Use the class labels of the nearest neighbours to determine the class label of the unknown record (e.g., by taking a majority vote)
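Those three steps can be sketched as (a sketch; Euclidean distance and a simple unweighted majority vote are assumed here):

```python
import math
from collections import Counter

def knn_classify(train, record, k):
    # train: list of (point, label) pairs; record: tuple of numbers
    # 1. compute the distance to the training records
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    # 2. identify the k nearest neighbours
    neighbours = sorted(train, key=lambda pair: dist(pair[0], record))[:k]
    # 3. take a majority vote of the neighbours' class labels
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]
```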
33
Q

Describe some of the issues with choosing the value of k for the k-nearest neighbour algorithm.

A

If k is too small, the classifier is sensitive to noise points

If k is too large, the neighbourhood may include points from other classes

34
Q

What is one of the problems with using Euclidean measure with K-nearest neighbour?

A

The Euclidean measure is susceptible to high-dimensional data (the curse of dimensionality), which can produce counterintuitive results

The solution is to normalize the vectors to unit length
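Normalizing to unit length can be sketched as (a sketch; the Euclidean norm is assumed):

```python
import math

def to_unit_length(vec):
    # scale a vector so its Euclidean norm is 1
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]
```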

35
Q

Describe the scaling issues that can be encountered while performing the k-nearest neighbour algorithm

A

Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes

e.g., the height of a person may vary from 1.5 m to 1.8 m, while the weight of a person may vary from 90 lb to 300 lb.

36
Q

What type of learning is K Nearest Neighbour?

A

It’s a lazy learner

It does not build models explicitly

Classifying unknown records is relatively expensive

deployment is slow

37
Q

What is PEBLS?

A

Parallel Exemplar-Based Learning System

an algorithm that sums up the distances between all of the classes in a data set

38
Q

What are genetic algorithms?

A

Two variations of genetic algorithms:

  1. Represent DNA with random mutations
  2. Parents that create offspring
39
Q

What are the two ways that offspring are created in genetic algorithms?

A
  1. Mutation (single parent): randomly change one or more parts of the sequence
  2. Crossover (2 parents): randomly pick a location in the genes and combine parts from both parents
40
Q

How could you solve the travelling salesperson problem using genetic algorithms?

A

Create a matrix with the distances between cities

Represent travel as a string of cities

Mutation would involve swapping two cities in the string

* Need to come up with a good encoding to represent the solution
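One possible encoding in code (a sketch; the city names and distance matrix are made up for illustration):

```python
import random

def tour_length(tour, dist):
    # length of a round trip through the cities in `tour`
    return sum(dist[tour[i]][tour[(i + 1) % len(tour)]]
               for i in range(len(tour)))

def mutate(tour):
    # swap two cities: a valid tour stays a valid tour
    i, j = random.sample(range(len(tour)), 2)
    child = list(tour)
    child[i], child[j] = child[j], child[i]
    return child
```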

41
Q

What do the sequences represent in genetic algorithms?

A

The sequences represent the solutions, not the data

42
Q

How is the quality of a solution evaluated in genetic algorithms?

A

A fitness function is applied to evaluate the goodness of each solution

43
Q

What is the algorithm involved in genetic algorithms?

A

Choose a representation

Choose a fitness function

While (termination criteria is not satisfied)

determine potential parents from the population by applying a fitness function (want high scores)

create offspring

add offspring to the population

if necessary, delete lower scoring solutions
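The loop above can be sketched as (a sketch; the bit-string representation, population size, mutation rate, and fixed-generation termination criterion are illustrative assumptions):

```python
import random

def genetic_algorithm(fitness, length, pop_size=20, generations=50):
    # Choose a representation: a bit-string of the given length
    pop = [[random.randint(0, 1) for _ in range(length)]
           for _ in range(pop_size)]
    for _ in range(generations):  # termination: fixed generation count
        # determine potential parents by fitness (keep the high scores)
        pop.sort(key=fitness, reverse=True)
        parents = pop[:pop_size // 2]
        # create offspring by crossover, with occasional mutation
        offspring = []
        while len(offspring) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, length)        # crossover point
            child = a[:cut] + b[cut:]
            if random.random() < 0.1:                # mutation
                i = random.randrange(length)
                child[i] ^= 1
            offspring.append(child)
        # lower-scoring solutions are deleted: only parents + offspring survive
        pop = parents + offspring
    return max(pop, key=fitness)
```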

44
Q

Describe the component of randomness in genetic algorithms

A

At startup, multiple possible solutions are created randomly. In general, this is a problem; the solution is to re-run the algorithm and pick the best result.

There is also randomness as the model is being built. This is usually a good thing because it tends to more often find a global minimum.

45
Q

What is a local minimum?

A

The best value given neighbouring values

46
Q

What is a global minimum?

A

the overall best value / solution