Midterm Flashcards

1
Q

What fields are a part of data mining?

A

Machine learning, databases, visualization, application domain, statistics

2
Q

Explain the data mining process

A

1. Collect data 2. Prepare data 3. Build a model 4. Evaluate the model 5. Deploy the model

3
Q

What are the two types of data mining techniques?

A

Predictive (supervised): predict a (discrete or continuous) class attribute based on other attribute values; this is like learning from a teacher. Descriptive (unsupervised): discover the structure of the data without prior knowledge of class labels.

4
Q

What is big data?

A

Data that is too large to be analyzed with today’s resources

5
Q

Describe the algorithm for k-means (clustering)

A
  1. Randomly pick k cluster centers 2. Assign every object to its nearest cluster center 3. Move each cluster center to the mean of its assigned objects 4. Repeat steps 2 and 3 until the stopping criterion is satisfied
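The four steps above can be sketched in plain Python (a minimal sketch; the convergence check, "stop when the centers no longer move", is one possible stopping criterion, and `max_iters` is an illustrative safeguard):

```python
import random

def dist2(a, b):
    # squared Euclidean distance between two points given as tuples
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, max_iters=100):
    # 1. Randomly pick k cluster centers
    centers = random.sample(points, k)
    for _ in range(max_iters):
        # 2. Assign every object to its nearest cluster center
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centers[i]))
            clusters[nearest].append(p)
        # 3. Move each cluster center to the mean of its assigned objects
        new_centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        # 4. Repeat until the centers stop moving (stopping criterion)
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters
```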
6
Q

Describe the different types of attributes and the properties associated with each.

A

Nominal - distinctness. Ordinal - distinctness, order. Interval - distinctness, order, +, -. Ratio - distinctness, order, +, -, *, /.

7
Q

What are the 3 types of data sets?

A

Record: data matrix, document data, transaction data. Graph: objects with relationships to other objects, objects that have sub-objects. Ordered: spatial data, temporal data, sequence, sequential.

8
Q

Describe the different sampling techniques

A

Simple random sampling - each object is selected with equal probability (with replacement or without replacement). Stratified sampling - split the data into several partitions, and take random samples from each partition.

9
Q

What is under sampling?

A
  1. Include all samples from the minority class 2. Sample randomly from the majority class, with or without replacement
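Those two steps can be sketched as (a sketch; the class labels and the choice to sample exactly as many majority records as there are minority records are illustrative assumptions):

```python
import random

def undersample(records, labels, minority, majority, replace=False):
    mino = [r for r, y in zip(records, labels) if y == minority]
    maj = [r for r, y in zip(records, labels) if y == majority]
    # 1. include all samples from the minority class
    balanced = [(r, minority) for r in mino]
    # 2. sample randomly from the majority class,
    #    with or without replacement, until the classes are balanced
    if replace:
        picked = [random.choice(maj) for _ in range(len(mino))]
    else:
        picked = random.sample(maj, len(mino))
    return balanced + [(r, majority) for r in picked]
```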
10
Q

What is over sampling?

A

It’s the opposite of undersampling: 1. Include all samples from the majority class 2. Sample the minority class with replacement
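The same idea in code (a sketch; the class labels and the choice to draw exactly as many minority records as there are majority records are illustrative assumptions):

```python
import random

def oversample(records, labels, minority, majority):
    maj = [r for r, y in zip(records, labels) if y == majority]
    mino = [r for r, y in zip(records, labels) if y == minority]
    # 1. include all samples from the majority class
    balanced = [(r, majority) for r in maj]
    # 2. sample the minority class with replacement until balanced
    balanced += [(random.choice(mino), minority) for _ in range(len(maj))]
    return balanced
```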

11
Q

What are the two ways of mapping categorical values to numerical data (transformation for preprocessing)?

A

1 of n - create 1 new attribute for each possible value of the categorical attribute. m of n - create new attributes such that each categorical value is mapped to a unique combination of the new attributes.
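The 1-of-n scheme (often called one-hot encoding) can be sketched as (a sketch; sorting the attribute values to fix a column order is an assumption for reproducibility):

```python
def one_of_n(values):
    # one new binary attribute per possible categorical value
    domain = sorted(set(values))                 # fix an attribute order
    index = {v: i for i, v in enumerate(domain)}
    return domain, [
        [1 if i == index[v] else 0 for i in range(len(domain))]
        for v in values
    ]
```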

12
Q

What are the 3 ways values can be missing?

A
  1. Missing Completely at Random (MCAR), e.g., randomly crossing out things in your data. This is the only time it is acceptable to substitute the average of the known values, but doing so reduces the variance of the set. 2. Missing at Random - the reason the value is missing is related to the value of another attribute (e.g., a biased Facebook page). 3. Non-Ignorable - the reason the value is missing is related to the value itself (e.g., a limitation of a sensor instrument below a certain threshold).
13
Q

What is normalization and what is it used for?

A

- Equalizes the weights associated with attribute values - Gives each attribute the same importance

14
Q

What are the 3 types of normalization?

A
  1. Min-max normalization 2. Z-score normalization 3. Decimal scaling
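The three schemes can be sketched as (a sketch; the formulas follow the usual textbook definitions, with the population standard deviation for z-score):

```python
import math

def min_max(xs, lo=0.0, hi=1.0):
    # rescale values linearly into [lo, hi]
    mn, mx = min(xs), max(xs)
    return [lo + (x - mn) * (hi - lo) / (mx - mn) for x in xs]

def z_score(xs):
    # subtract the mean, divide by the standard deviation
    mean = sum(xs) / len(xs)
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
    return [(x - mean) / std for x in xs]

def decimal_scaling(xs):
    # divide by the smallest power of 10 that brings every |value| below 1
    j = len(str(int(max(abs(x) for x in xs))))
    return [x / 10 ** j for x in xs]
```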
15
Q

What is the significance of sample size?

A

With sample size, there is a trade-off between accuracy and speed.

16
Q

Describe feature subset selection.

A
  1. Remove redundant features - duplicate info contained in other attributes 2. Remove irrelevant features - not relevant to the data mining task at hand
17
Q

Describe feature creation

A

mapping data to a new space

18
Q

What are the different types of attribute transformations?

A

- Categorical to numeric (1 of n and m of n) - Numeric to categorical (binning and discretization) - PCA - Normalization and standardization

19
Q

Describe Hunt’s algorithm

A

If the current node contains records that all belong to the same class, then the current node is a leaf node with that class. Otherwise, split the data into smaller subsets and repeat recursively on each subset.
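That recursion can be sketched as (a sketch; splitting on attributes in a fixed order is a simplifying assumption, since real implementations choose the best split by an impurity measure, and the majority-class fallback handles running out of attributes):

```python
def hunts(records, labels, attrs):
    # records: list of dicts; labels: parallel list of class labels
    # Base case: all records share one class -> leaf node with that class
    if len(set(labels)) == 1:
        return labels[0]
    if not attrs:  # no attributes left: fall back to the majority class
        return max(set(labels), key=labels.count)
    # Otherwise split the data into smaller subsets and recurse
    attr, rest = attrs[0], attrs[1:]
    tree = {}
    for value in set(r[attr] for r in records):
        sub = [(r, y) for r, y in zip(records, labels) if r[attr] == value]
        tree[value] = hunts([r for r, _ in sub], [y for _, y in sub], rest)
    return (attr, tree)
```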

20
Q

What is the difference between over fitting and under fitting?

A

Overfitting: the model memorizes the data so closely that it can’t generalize. Underfitting: the model is not built out enough to fit anything.

21
Q

What is bias?

A

A systematic shift in the “ground truth” (the information that we are trying to discover), e.g., only surveying students in COIS 4400 to find out whether students at Trent like their classes (only 4th-year students can take the course, etc.).

22
Q

How can bias be avoided with training data?

A
  1. Always randomize the data (shuffle the order of the rows) 2. Split the data into 2 chunks (test & training)
23
Q

What are the 2 ways to split data into a test set and training set for decision trees?

A
  1. Out of bag estimation
    a) sample repeatedly from the data set with replacement and add selected samples to the training set
    b) repeat until the training set has as many samples as the original data set (some of them will be repeats)
    c) use the values that weren’t picked for the test set
  2. K-fold Cross-Validation
    a) split the data into k chunks
    b) for each chunk, use it as the test set and the others as the training set
    c) at the end, sum up the errors
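K-fold cross-validation can be sketched as (a sketch; `train_and_error` is a hypothetical stand-in for whatever model-building and error-counting routine is used, and the striped fold split is one simple way to make k chunks):

```python
def k_fold_cv(data, k, train_and_error):
    # a) split the data into k chunks
    folds = [data[i::k] for i in range(k)]
    total_error = 0
    # b) each chunk is the test set once; the rest form the training set
    for i in range(k):
        test = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        total_error += train_and_error(train, test)
    # c) at the end, sum up the errors
    return total_error
```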
24
Q

When building decision trees, the training and test set have to be i.i.d. Explain what this is.

A

i.i.d.: independent and identically distributed. The training and test sets must have the same distribution as the original data set, and we must also ensure that the probability of choosing one sample does not affect another.

25
Q

What is the difference between eager learners and lazy learners?

A

Lazy: deployment is slow and often not a good abstraction of the data (instance-based learning); it basically just remembers the data, with little processing. k-nearest neighbour is an example. Eager: builds a model early, before getting test data, so deployment is fast.

26
Q

Describe the Curse of Dimensionality

A
  1. Runtime: if the algorithm does not behave at most linearly in the number of attributes, the runtime increases too quickly. 2. Amount of data: the number of samples needed to cover the space with equal density grows exponentially with the number of dimensions. 3. Distances: distances between data points become meaningless, since the maximum distance between data points does not grow linearly with the number of dimensions.
27
Q

What is a rote learner?

A

An instance based classifier

Memorizes the entire training data and performs classification only if attributes of a record match one of the training examples exactly

There’s no real learning or model building

28
Q

Describe the k-nearest neighbour classifier.

A

Uses k “closest” points (nearest neighbours) for performing classification.

29
Q

What is a voronoi diagram?

A

It splits up the solution space for nearest neighbour (it’s basically your model).

If the new data point lies on a line, it is equidistant to 2 points; if it lies on an intersection, it is equidistant to 3 points.

30
Q

Describe the steps in the nearest neighbour algorithm.

A

Compute the distance between two points using Euclidean distance.

Determine the class from the nearest-neighbour list:

  • take the majority vote of class labels among the k-nearest neighbours
  • Weigh the vote according to distance

weight factor: w = 1/d²

31
Q

What 3 things does a nearest neighbour classifier require?

A
  1. The set of stored records
  2. Distance metric to compute distances between records
  3. The value of k, the number of nearest neighbours to retrieve
32
Q

How does nearest neighbour classify an unknown record?

A
  1. Compute the distance to the other training records
  2. Identify the k nearest neighbours
  3. Use the class labels of the nearest neighbours to determine the class label of the unknown record (e.g., by taking a majority vote)
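Those three steps can be sketched as (a sketch; Euclidean distance and a simple unweighted majority vote are assumed here):

```python
import math
from collections import Counter

def knn_classify(train, record, k):
    # train: list of (point, label) pairs; record: tuple of numbers
    # 1. compute the distance to the training records
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    # 2. identify the k nearest neighbours
    neighbours = sorted(train, key=lambda pair: dist(pair[0], record))[:k]
    # 3. take a majority vote of the neighbours' class labels
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]
```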
33
Q

Describe some of the issues with choosing the value of k for the k-nearest neighbour algorithm.

A

If k is too small, the classifier is sensitive to noise points

If k is too large, the neighbourhood may include points from other classes

34
Q

What is one of the problems with using Euclidean measure with K-nearest neighbour?

A

The Euclidean measure is susceptible to high-dimensional data (the curse of dimensionality), which can produce counterintuitive results

The solution is to normalize the vectors to unit length
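Normalizing to unit length can be sketched as (a sketch; the Euclidean norm is assumed):

```python
import math

def to_unit_length(vec):
    # scale a vector so its Euclidean norm is 1
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]
```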

35
Q

Describe the scaling issues that can be encountered while performing the k-nearest neighbour algorithm

A

Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes

e.g., the height of a person may vary from 1.5 m to 1.8 m, while the weight of a person may vary from 90 lb to 300 lb.

36
Q

What type of learning is K Nearest Neighbour?

A

It’s a lazy learner

It does not build models explicitly

Classifying unknown records is relatively expensive

deployment is slow

37
Q

What is PEBLS?

A

Parallel Exemplar-Based Learning System

an algorithm that sums up the distances between all of the classes in a data set

38
Q

What are genetic algorithms?

A

Two variations of genetic algorithms:

  1. Represent DNA with random mutations
  2. Parents that create offspring
39
Q

What are the two ways that offspring are created in genetic algorithms?

A
  1. Mutation (single parent): randomly change one or more parts of the sequence
  2. Crossover (2 parents): randomly pick a location in the genes and combine parts from both parents
40
Q

How could you solve the travelling salesperson problem using genetic algorithms?

A

Create a matrix with the distances between cities

Represent travel as a string of cities

Mutation would involve swapping two cities in the string

* Need to come up with a good encoding to represent the solution
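One possible encoding in code (a sketch; the city names and distance matrix are made up for illustration):

```python
import random

def tour_length(tour, dist):
    # length of a round trip through the cities in `tour`
    return sum(dist[tour[i]][tour[(i + 1) % len(tour)]]
               for i in range(len(tour)))

def mutate(tour):
    # swap two cities: a valid tour stays a valid tour
    i, j = random.sample(range(len(tour)), 2)
    child = list(tour)
    child[i], child[j] = child[j], child[i]
    return child
```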

41
Q

What do the sequences represent in genetic algorithms?

A

The sequences represent the solutions, not the data

42
Q

How is the quality of a solution evaluated in genetic algorithms?

A

A fitness function is applied to evaluate the goodness of each solution

43
Q

What is the algorithm involved in genetic algorithms?

A

Choose a representation

Choose a fitness function

While (termination criteria is not satisfied)

determine potential parents from the population by applying a fitness function (want high scores)

create offspring

add offspring to the population

if necessary, delete lower scoring solutions
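The loop above can be sketched as (a sketch; the bit-string representation, population size, mutation rate, and fixed-generation termination criterion are illustrative assumptions):

```python
import random

def genetic_algorithm(fitness, length, pop_size=20, generations=50):
    # Choose a representation: a bit-string of the given length
    pop = [[random.randint(0, 1) for _ in range(length)]
           for _ in range(pop_size)]
    for _ in range(generations):  # termination: fixed generation count
        # determine potential parents by fitness (keep the high scores)
        pop.sort(key=fitness, reverse=True)
        parents = pop[:pop_size // 2]
        # create offspring by crossover, with occasional mutation
        offspring = []
        while len(offspring) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, length)        # crossover point
            child = a[:cut] + b[cut:]
            if random.random() < 0.1:                # mutation
                i = random.randrange(length)
                child[i] ^= 1
            offspring.append(child)
        # lower-scoring solutions are deleted: only parents + offspring survive
        pop = parents + offspring
    return max(pop, key=fitness)
```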

44
Q

Describe the component of randomness in genetic algorithms

A

At startup, multiple possible solutions are created randomly. In general, this is a problem; the solution is to re-run the algorithm and pick the best result.

There is also randomness as the model is being built. This is usually a good thing because it tends to more often find a global minimum.

45
Q

What is a local minimum?

A

The best value given neighbouring values

46
Q

What is a global minimum?

A

the overall best value / solution