Exam 1 Flashcards

1
Q

Where is the most effort put in data mining?

A

Data preparation and cleaning

2
Q

What are the data-related steps in the CRISP-DM guide?

A

Select/find data, Clean the Data, Prepare the data, Integrate the data, and Format the data

3
Q

How is data represented?

A

Numeric - continuous attributes: measurements; int and float data types.

Nominal - values are symbolic labels: sunny, old, yellow. Only equality checks can be performed. Categorical coding may use a number such as "1", but it has no arithmetic meaning.

Ordinal - rank order: "cold, cool, warm, hot" or "good, better, best". No distance is defined between values; equality and ordering checks can be performed.

Interval - ordered and measured in fixed units; differences are meaningful (a temperature differential, but not the temperature itself).

Ratio - the measurement scheme defines a true zero point (e.g., distance); math operations are valid.

4
Q

When is numeric data easy to interpret?

A

When defined ranges exist.

5
Q

How do you measure something good, bad, healthy?

A

Need domain expert.

6
Q

What are some cautions on data cleaning?

A

Document what you do, work carefully, don’t make assumptions, be aware of bias.

7
Q

What are some ways to introduce bias?

A

Language - different terms or grammars used to describe the domain, data attributes, or the problem.

Search - the search strategy may overlook other options.

Overfitting - results provide a solution based on bad assumptions/patterns, or the search stops too soon.

Actions already performed on the data.

How the data was gathered (how questions were asked, how responses were interpreted, who asked the questions, how samples were selected).

Sometimes used loosely as a synonym for "error".

8
Q

What are some examples of data cleaning?

A

Handling invalid values, duplicates, missing data, data entry errors, converting data to specific values in order to perform correct measurements.

9
Q

What is meant by dirty data?

A

Data that is incorrect, inaccurate, irrelevant or incomplete.

Data needing to be converted (nominal to numeric)

Data with different formats or coding schemes (such as dates)

Data from >1 file with different field delimiters

Data that is coded

Data that must be summarized “rolled up”

10
Q

How does data get “dirty”?

A

Inconsistent definitions, meanings (especially when combining different sources)

Data entry mistakes

Collection errors

Corrupted data transmissions

Conversion errors.

11
Q

What are some data issues?

A

Out of range entries

Unknown, unrecorded or irrelevant data

Missing values

Language translation issues

Unavailable readings

Inapplicable data (asking a male if pregnant)

Customer provided incorrect data

Duplicate data

Stale data

Unavailable data

Data may be available but not in electronic form.

Data associated with the wrong person

User provided wrong data.

12
Q

Consider representing dates as YYYYMM or YYYYMMDD. What’s good about this formatting? What is the limitation?

A

Good: You can sort the data.
Limitation: Does not preserve intervals (e.g., 20040201 - 20040131 = 70, not 1 day).
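A minimal sketch of the trade-off (dates are illustrative):

```python
# Dates coded as YYYYMMDD integers sort correctly...
dates = [20040201, 20031225, 20040131]
dates.sort()  # chronological order

# ...but arithmetic on them does not give day counts:
# these two dates are one day apart, yet subtraction says 70.
gap = 20040201 - 20040131
```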

13
Q

What are some legacy issues when it comes to dates?

A

Y2K - 2-digit years. Is year 02 1902 or 2002? It depends on context (a child's birthday vs. the year a house was built). The typical approach is to set a cutoff year: if YY < cutoff, then 20YY, else 19YY.
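The cutoff rule can be sketched as follows (the cutoff of 30 is an assumed value for illustration; pick one appropriate to your domain):

```python
def expand_year(yy, cutoff=30):
    """Expand a 2-digit year using a cutoff (cutoff=30 is an assumed value)."""
    return 2000 + yy if yy < cutoff else 1900 + yy
```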

14
Q

What are some reasons values may be missing?

A
  • They are unknown, unrecorded, or irrelevant data.
  • Malfunctioning equipment.
  • Changes in the design.
  • Collation/merge of different datasets.
  • Unavailable data.
  • Removals because of security or privacy issues.
  • Translation issues (especially across languages).
  • Data being used for a different purpose than originally planned (ethical/legal issues).
  • Self-reporting: people may omit values if the input mechanism does not require an input.
15
Q

How should one deal with missing values?

A
  • Ignore the attribute or entire instances. (May throw out the needle in the haystack!)
  • Try to estimate or predict: use mean, mode, or median values. Relatively easy and not bad on average.
  • Treat missing as a separate value.
  • Look for placeholder values such as 0, ".", 999, or N/A; decide on a standard and create a new value.
  • Does missing imply a default value?
  • Compute the value based on previous values.
  • If inserting zeros for missing values, think about what that does to the mean and standard deviation.
  • Be careful when using tools (some have default operations to handle missing data).
  • Randomly select values from the current distribution (pro: won't change the overall shape of the curve - little impact on the mean).
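Mean imputation, one of the easy estimates above, can be sketched as (values are illustrative):

```python
values = [4.0, None, 6.0, None, 5.0]

# Compute the mean of the known values only.
known = [v for v in values if v is not None]
mean = sum(known) / len(known)

# Fill each missing slot with the mean.
imputed = [mean if v is None else v for v in values]
```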
16
Q

Again, what are some sources of inaccurate data? :)

A
  • Data entry mistakes
  • Measurement errors
  • Outliers previously removed
  • Duplicates
  • Stale data
  • Different representations of the same value: New York, NY, N.Y.
17
Q

How can you find inaccurate data?

A

Look for the obvious (run statistical tools), Look for nonsensical data (negative grade or age).

18
Q

What is discretization?

A
  • Binning
  • Useful for generating summary data
  • produces discrete values
19
Q

What is one issue that can come from binning with equal-width?

A

It could result in clumping. For example, if 99% of employees earn 0-200,000 and the owner makes 2,000,000, then with a width of 200,000 only one person falls in the upper bin.
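A sketch of the clumping effect; `equal_width_bins` is a hypothetical helper and the salaries are illustrative:

```python
def equal_width_bins(values, k):
    """Assign each value a bin index 0..k-1 using k equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # clamp the maximum value into the last bin
    return [min(int((v - lo) / width), k - 1) for v in values]

salaries = [30_000, 50_000, 80_000, 120_000, 2_000_000]
bins = equal_width_bins(salaries, 10)  # the owner sits alone in the top bin
```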

20
Q

How can we even out the distribution?

A

By binning with equal-height. Instead of defining bin sizes of range N, assign N values to each bin.
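A minimal equal-height sketch (`equal_height_bins` is a hypothetical helper; it keeps repeated values in one bin, but ignores other tie edge cases):

```python
def equal_height_bins(values, k):
    """Assign roughly len(values)/k sorted values to each bin."""
    s = sorted(values)
    n = len(s)
    bin_of = {}
    for i, v in enumerate(s):
        # setdefault keeps duplicates in the bin of their first occurrence
        bin_of.setdefault(v, min(i * k // n, k - 1))
    return [bin_of[v] for v in values]
```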

21
Q

When binning…

A
  • Do not split repeated values across bins

- Create a separate bin for special values.

22
Q

Talk about the considerations with equal-width and equal-height binning.

A

Equal-width is the simplest and works well in many situations. However, equal-height usually gives better results and is generally preferred because it avoids clumping.

23
Q

How do you know if your bins are okay?

A

After you create bins, create a histogram of the values and look at its general shape. A jagged shape may indicate a weakness in how the bins were formed, so try a different number of bins and different boundaries (shift the ranges).

24
Q

Why use rollup?

A

Can help reduce the complexity of your model.

25
Q

What is an outlier in data mining?

A

Any value that doesn’t really look like most of the others in the data set.

It may just be a unique data point, or it could really be an outlier (i.e., noise or an error).

26
Q

How do you handle outliers?

A
  • Do nothing
  • Enforce upper and lower bounds
  • Let binning handle the problem.
27
Q

Where can outlier exist?

A

In a normal distribution, outliers may lie more than 3 standard deviations from the mean. With a bimodal distribution, they could be in the middle or at the ends.

Some are easy to find, such as a negative age or an age > 120, a negative number of children, or a gender that’s not M/F.
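The 3-standard-deviation rule can be sketched as follows (`zscore_outliers` is a hypothetical helper using the population standard deviation):

```python
def zscore_outliers(values, threshold=3.0):
    """Return values more than `threshold` standard deviations from the mean."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [v for v in values if abs(v - mean) > threshold * std]
```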

28
Q

How can you tell an outlier from an error?

A

Usually you can’t. Don’t discard outliers unless you are sure they are really outliers.

29
Q

What is the best approach to dealing with outliers?

A
  • Work with your domain expert.
  • Try to identify why the data values are extreme.
  • Remove the outliers if you think they will negatively impact your analysis.
  • Check the source and quality of the raw data.
30
Q

What is the scare with data warehouses?

A

If they clean the data and remove the outliers, the needle in the haystack you were looking for may have been removed.

31
Q

Why use transformations?

A

For example, house prices with a tail skewed by extremely high prices. You may need to transform the data to make neater bins or to scale it for visualization.

32
Q

What is one approach to transform data?

A
Apply the log10 function to numeric values:
log10(1) = 0
log10(10) = 1
log10(100) = 2
log10(1000) = 3

Because log10(0) is undefined, bump all values by 1, and take the absolute value to handle negative inputs. This scales the data, making it easier to visualize and handle.

In general form: log10(|X| + 1). Multiply by sign(X) to regain the sign of negative values: sign(X) * log10(|X| + 1).
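The signed log transform can be sketched as:

```python
import math

def signed_log10(x):
    # sign(x) * log10(|x| + 1): defined at 0, preserves the sign of negatives
    return math.copysign(math.log10(abs(x) + 1), x)
```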

33
Q

What is regression modeling?

A

Try to fit the data to a line and calculate the error; extreme values are sometimes easy to identify this way.

34
Q

What is the Curse of Dimensionality?

A

As the dimensionality increases (the number of attributes), the data becomes increasingly sparse in the space it occupies. Some clustering algorithms become less accurate.

35
Q

Why prune data?

A
  • Lots of attributes are hard to work with
  • Many attributes may not even matter
  • Many algorithms perform better if we reduce number of attributes being considered.
36
Q

How do we prune data?

A

Remove fields with little or no variability and select the N most important.

37
Q

What are some pruning techniques?

A

Look for redundant data.

Look for irrelevant features.

38
Q

What are some data cleaning problems?

A
  • Error correction and filling in missing information are challenging problems.
  • Incorrect or sloppy data cleaning can remove large amounts of data or produce incorrect data.
  • There is often not enough information or application knowledge to know what to do: consult a domain expert.
  • Data cleaning is an iterative, time-consuming process.
  • Arguably the most time-consuming, painful, and expensive part of data mining.
39
Q

Why do we need a data management policy?

A
  • After cleaning and removing errors, you don’t want to repeat it if the collection changes or when new values are added.
  • When you clean the data, you need to document what you did to the data.
  • Need a way to audit and measure the resulting data.

Track backups, deletes, changes, access, and last-updated times.

40
Q

How can we check for data integrity?

A
  • Data quality: “noise” - modification of original values, outliers, missing values, duplicates.
  • Cost - continuing pressure for larger and faster systems, driven by the quantity of data and the need for results in a limited amount of time.

41
Q

What is a quality issue?

A

Some classes have a very unequal frequency. For example, 99.99% of Americans are not terrorists and saying that you are correct 97% of the time doesn’t say much.
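A sketch of why raw accuracy is misleading with very unequal class frequencies (counts are illustrative):

```python
labels = [0] * 999 + [1]   # one rare positive among 1,000 instances
preds = [0] * 1000          # a classifier that always predicts "negative"

# 99.9% accurate, yet it never finds the rare class.
accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
```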

42
Q

What is metadata?

A

It encodes background knowledge. Can be used to restrict search space, may need to expand metadata attributes into multiple attributes.

43
Q

What are some questions to ask?

A

What data is available? Where will it come from? Is the data current/relevant? Is other data available? Is historic data available? Who is the domain expert? Which attributes are really important?

44
Q

How do you know if attributes are irrelevant?

A
  • They have no influence on the outcome (e.g., customer ID).
  • May slow things down.
  • May be illegal to use.
  • Not always easy to tell (for example, increased sales during phases of the Moon).
45
Q

Why do we not use the whole data set?

A
  • Run the risk of overfitting the data.
  • When a model is built to fit the training set, it loses the ability to generalize to new data.

46
Q

How is data split between the training set and the test set?

A

Theory: 70% of data for training, 30% for test set.

In practice: 50/50

47
Q

How do you build testing, training, and validation data sets?

A
  • Random without replacement.
  • Systematic: Every nth instance.
    Don’t select the first N - order may influence the data.
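Both selection schemes can be sketched as (data and sample sizes are illustrative):

```python
import random

data = list(range(100))

# Random without replacement
sample = random.Random(0).sample(data, 10)

# Systematic: every nth instance (here n = 10)
systematic = data[::10]
```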
48
Q

What is the stratified random sample?

A

Select an equal number of instances for each attribute value. Why? If you are looking for a small number of instances of a rare class, an unstratified sample means you are training on biased data.
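A minimal sketch of stratified sampling (`stratified_sample` and `label_of` are hypothetical names):

```python
import random
from collections import defaultdict

def stratified_sample(instances, label_of, per_class, seed=0):
    """Draw an equal number of instances from each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for inst in instances:
        by_class[label_of(inst)].append(inst)
    out = []
    for group in by_class.values():
        out.extend(rng.sample(group, per_class))
    return out
</```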

49
Q

Important considerations for sampling?

A

Think counts, not percentages. You need enough data to safely generalize; otherwise, no general rules can be drawn.

50
Q

What is the CRISP cycle?

A

Business Understanding -> Data Understanding -> Data Preparation -> Modeling -> Evaluation -> Business Understanding (iterate) OR Deployment

51
Q

What are some reasons to cluster data?

A

Exploratory data analysis: in marketing, discover customer groupings; in astronomy, discover groups of similar objects; in earthquake research, discover that epicenters cluster along faults; in gene research, group genes.

Identify patterns, document classification, targeted marketing, insurance, taxing.

52
Q

What is clustering?

A

A form of unsupervised learning, as there are no predefined classes. The goal is to find ‘natural’ groupings of instances. Objects in a cluster are similar to other objects in the same cluster but dissimilar to objects in different clusters.

53
Q

What are some methods to determining similarity?

A
  • Visual inspection
  • Mathematical measurements such as Euclidean distance and Manhattan distance.
    May need to weight more important attributes.
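The two distance measures can be sketched as:

```python
def euclidean(a, b):
    """Straight-line distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    """City-block distance: sum of absolute per-attribute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))
```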
54
Q

What is a centroid?

A

Using Euclidean distance or something similar, it is the point whose value for each attribute is the average of that attribute’s values over all points in the cluster.
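A minimal centroid computation:

```python
def centroid(points):
    """Attribute-wise average of all points in the cluster."""
    n = len(points)
    return tuple(sum(vals) / n for vals in zip(*points))
```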

55
Q

What is K-Means clustering?

A

An exclusive clustering algorithm: each object is assigned to precisely one cluster.

56
Q

How can we measure the quality of a set of clusters?

A

By using an objective function and making it as small as possible.

57
Q

What is an objective function

A

An objective function takes the sum of the squares of the distances of each point from the centroid of the cluster to which it is assigned.
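A sketch of the sum-of-squared-distances objective; representing clusters as (centroid, points) pairs is an assumed choice:

```python
def sse(clusters):
    """Sum of squared distances of each point to its cluster's centroid.

    clusters: list of (centroid, points) pairs.
    """
    total = 0.0
    for c, pts in clusters:
        for p in pts:
            total += sum((x - y) ** 2 for x, y in zip(p, c))
    return total
```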

58
Q

What is the major assumption of K-Means?

A

That distance variance is an appropriate way to cluster.
Other concerns: clusters may be disjoint or overlap; as an exclusive clustering method, all instances must be clustered.

59
Q

What is the requirement to perform K-Means clustering?

A

Need a distance measurement.

60
Q

What is the K-Means algorithm?

A
  1. Choose a value K (the number of clusters).
  2. Select K objects in an arbitrary fashion and use these as the initial K centroids.
  3. Assign each object to the nearest centroid.
  4. Recalculate the centroids of the K clusters.
  5. Repeat steps 3 and 4 until the centroids no longer move (convergence).
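The five steps above can be sketched as follows (random initial centroids; the tiny, well-separated dataset in the test is illustrative):

```python
import random

def kmeans(points, k, seed=0):
    centroids = random.Random(seed).sample(points, k)   # step 2: initial centroids
    while True:
        # step 3: assign each point to the nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            clusters[i].append(p)
        # step 4: recalculate centroids (keep the old one if a cluster empties)
        new = [tuple(sum(v) / len(pts) for v in zip(*pts)) if pts else centroids[i]
               for i, pts in enumerate(clusters)]
        # step 5: stop when the centroids no longer move
        if new == centroids:
            return centroids, clusters
        centroids = new
```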
61
Q

What is convergence?

A
  • Data points no longer move into different clusters or the change in cluster assignments is less than some threshold.
62
Q

What are some of the k-means questions to ask?

A
  • Does it make sense?
  • Initial assignment has to be numeric and sensible
  • Does it end?
63
Q

What are some ways to determine the initial centroids?

A
  • Randomly select the initial K centroids from the data instances
  • Pick centroids that are the furthest distance
  • Make a quick pass of the data and use some basic stats to select the initial centroids.
64
Q

Instead of assigning initial centroids, what method can be used?

A

Initially assign the values then calculate the centroids.

65
Q

What are some ways to initially assign the values then calculate the centroids?

A
  • Randomly assign N/K instances to each cluster.
  • Sequentially assign the first N/K instances to cluster 1, the next N/K to cluster 2, and so on.
  • Assign instances to clusters 1..K round-robin.

Then calculate the centroids.
66
Q

What is the pillar algorithm for K-Means?

A

It addresses initial centroid selection. The method essentially visualizes pillars, which are placed as far apart from each other as possible.

67
Q

How do you find the pillars?

A

Calculate the mean of all points
Calculate the distance from each point to the mean.
The point with the maximum distance is the 1st centroid.
Calculate the distance from each point to the 1st centroid.
Find the point the furthest distance from 1st centroid and this becomes the 2nd centroid.
Find the distance from all points to the 2nd centroid. Add this distance to the accumulated distance metric. 3rd centroid is the point with the largest distance metric.
Continue until all K centroids have been designated.
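The furthest-point selection above can be sketched as follows (outlier identification, which the full algorithm requires, is omitted):

```python
def pillar_centroids(points, k):
    """Pick k initial centroids that are far apart (outlier handling omitted)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    # 1st centroid: the point furthest from the grand mean
    mean = tuple(sum(v) / len(points) for v in zip(*points))
    centroids = [max(points, key=lambda p: dist(p, mean))]

    # each subsequent centroid: the point with the largest accumulated distance
    acc = {p: 0.0 for p in points}
    while len(centroids) < k:
        for p in points:
            acc[p] += dist(p, centroids[-1])
        candidates = [p for p in points if p not in centroids]
        centroids.append(max(candidates, key=lambda p: acc[p]))
    return centroids
```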

68
Q

What are some drawbacks of the pillars algorithm?

A
  • Must be able to identify outliers.
  • Time complexity is on the order of O(N+K) but increases significantly when outlier identification is added.

69
Q

Why should you consider using median instead of mean for k-means?

A

K-means is sensitive to outliers, and the median is not as affected by extreme values.

70
Q

What are some issues and problems with k-means?

A
  • Algorithm has a difficult time handling noisy data and outliers.
  • Since it uses the means, think about what happens to a mean when the data contains an outlier.
71
Q

Summarize K-Means

A
  • It always terminates but may not find the best clustering
  • Much depends on the initial selection of centroids
  • Difficult to visualize high dimension data
  • Outliers are a problem
  • Can be sensitive to data order.
72
Q

What is incremental clustering and why do we need it?

A

Needed when we must handle very large amounts of data. Options:

  • Partition the data into clustered subsets
  • Incremental clustering
  • Parallel implementation of an algorithm.
73
Q

What is an incremental clustering method?

A

Instance based clustering

74
Q

How does instance-based clustering work?

A

With instance-based incremental clustering, we build the clusters as instances are read in. After adding an instance, check whether it is in the best place; if not, restructure the clusters to make a better fit.

75
Q

What are some limitations of instance-based clustering?

A

May not build optimal clusters.
The ordering of the data will impact the clusters.
The restructuring algorithm is often not enough to reverse the impact of a bad initial cluster assignment.

76
Q

What is the faster nearest neighbor calculation?

A

Represent the data in a tree such as a kD-tree. Easy to add new instances, but more difficult to understand, and a skewed dataset can result in an imbalanced tree.

77
Q

What is hierarchical clustering?

A

Agglomerative (bottom-up) hierarchical clustering: start with each instance in its own cluster, then repeatedly merge the closest pair of clusters. Stop merging when all instances are in a single cluster. Produces a single hierarchical clustering. For N instances, iterate N-1 times.
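A minimal agglomerative sketch; it stops at a target cluster count rather than merging all the way to one cluster, and single-linkage (closest pair of points) is an assumed distance choice:

```python
def agglomerative(points, target):
    """Merge the closest pair of clusters until `target` clusters remain."""
    def linkage(c1, c2):
        # single linkage: distance between the closest pair of points
        return min(sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
                   for p in c1 for q in c2)

    clusters = [[p] for p in points]
    while len(clusters) > target:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i].extend(clusters.pop(j))
    return clusters
```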

78
Q

What is a dendrogram?

A
  • The binary tree produced by hierarchical clustering, showing the order in which clusters were merged.
79
Q

Why use density-based clustering (DB scan)?

A

Clusters are not always clean and neat and may not have specific shapes.

80
Q

What is density based clustering?

A

Points are labeled as core, border (edge), or noise. Noise points are initially ignored. Edges are drawn between core points within a certain range (density-reachable). Connected core points are assigned to a cluster. Clusters are assigned based on the density of the instances.

81
Q

What are fuzzy clusters?

A

Sometimes instances could belong to more than one cluster. Assign each object a probability that it belongs to each cluster.

82
Q

What is EM?

A

Expectation Maximization. It is a probability-based clustering algorithm. It classifies using the mean and standard deviation for continuous attributes and category probabilities for nominal attributes.

83
Q

What is cobweb?

A

Uses probabilistic clustering, representing the probability of a value being present. Handles both categorical and continuous attributes. Incremental clustering, so it is sensitive to data order. Builds a tree representation of the clusters (hierarchical).

84
Q

When to terminate the clustering process?

A

End the merging process when we have converted the original N objects into a small enough set of clusters.

  • Merge clusters only until some predefined number remain.
  • Stop when a newly created cluster fails to meet some criterion for compactness, such as the average distance between objects being too high.