1. Data Flashcards

1
Q

What is data mining?

A

The extraction of useful knowledge from noisy data

2
Q

What are the 4 steps of successful data mining?

A

Acquisition, Marshaling, Analysis, Action

3
Q

What does data mining specifically focus on?

A

Turning sets of data into useful knowledge

4
Q

What is a relational database?

A

A database whose data is already structured, organized into tables (relations).

Ex: rows and columns

5
Q

What is the purpose of data pre-processing?

A

Preprocessing cleans up the data and makes it easier to analyze.

Think: cleaning, normalization, transformation, feature extraction, and selection.

It handles issues like missing values, outliers, and conflicts (like incorrect information).
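
A minimal sketch of two of these steps, assuming a hypothetical list of ages in which `None` marks a missing value. Filling gaps with the mean and min-max scaling are illustrative choices here, not the only way to clean or normalize data.

```python
from statistics import mean

# Hypothetical raw attribute; None marks a missing value.
ages = [20, None, 30, 40, None, 60]

# Cleaning: replace each missing value with the mean of the known values.
known = [a for a in ages if a is not None]
fill = mean(known)
cleaned = [fill if a is None else a for a in ages]

# Normalization: rescale every value to the range [0, 1] (min-max scaling).
lo, hi = min(cleaned), max(cleaned)
normalized = [(a - lo) / (hi - lo) for a in cleaned]

print(cleaned)
print(normalized)
```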

6
Q

What are association rules?

A

A set of rules that characterize associations between items.

Think: interesting relationships between variables in large databases.
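
One common way to quantify such a rule is with support and confidence. The sketch below assumes a hypothetical set of shopping baskets and a hypothetical rule "bread -> butter"; the names `baskets` and `support` are made up for illustration.

```python
# Hypothetical transactions: each basket is the set of items bought together.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
]

def support(itemset):
    """Fraction of baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

# Rule "bread -> butter":
# support    = how often bread and butter appear together
# confidence = of the baskets with bread, the share that also have butter
rule_support = support({"bread", "butter"})
rule_confidence = rule_support / support({"bread"})

print(rule_support, rule_confidence)
```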

7
Q

What kinds of things can we do with customer purchasing data?

A

We can analyze their spending habits to determine when and where consumers are most likely to purchase particular products. Knowing that, we can market those products more effectively.

8
Q

What kind of data are we interested in (for this course)?

A

Primarily relational data. Most of our data will come in preprocessed lists that we will have to mine for relationships/associations.

9
Q

What is clustering?

A

The process of partitioning a set of data into meaningful groupings so that each cluster differs from the next in some respect.

Clustering helps the user understand the natural structure of the data and gives insight into the data distribution. It can also be used as a preprocessing step for other algorithms.

10
Q

What does it mean for data to be discrete?

A

Finite or countably infinite values.

Examples: zip codes, age in years, eye color, whole-number counts

11
Q

What does it mean for data to be continuous?

A

Data that are not restricted to defined, separate values. The values can fall anywhere in a continuous range (infinitely specific).

Ex: temperature, real numbers, weight

12
Q

What are the four types of data attributes?

A

Nominal data, ordinal data, interval data, and ratio data

13
Q

What are nominal data?

A

Data that consist of labels with no quantitative value and no inherent order.

Ex: male/female; hair color; north/south/east/west

14
Q

What are ordinal data?

A

Data where the order is important (but the difference between the values is not necessarily known).

Ex: 1st place beat 2nd place, but we don’t know by how much.

Ex: “how do you feel today?”
- very unhappy, unhappy, ok, happy, very happy

15
Q

What are interval data?

A

Data where we know the order AND the difference between the values.

Note: interval data does NOT have a “true zero” (required to calculate ratios).

In other words, 10 deg + 10 deg is 20 deg, but 20 deg is not twice as hot as 10 deg, because there is no true zero on the Celsius scale.

Only + and - operations can be done on interval data.

Interval = space in between
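
A small sketch of the "no true zero" point, assuming the two temperatures from the example above. Kelvin is brought in here only as a comparison scale that does have a true zero; the helper name `celsius_to_kelvin` is made up for illustration.

```python
def celsius_to_kelvin(c):
    """Convert Celsius (interval scale) to Kelvin (a scale with a true zero)."""
    return c + 273.15

a_c, b_c = 10.0, 20.0

# Differences are meaningful on an interval scale.
diff_c = b_c - a_c

# Ratios are not: dividing the Celsius values suggests "twice as hot" ...
naive_ratio = b_c / a_c

# ... but on a scale with a true zero the ratio is only about 1.035.
true_ratio = celsius_to_kelvin(b_c) / celsius_to_kelvin(a_c)

print(diff_c, naive_ratio, round(true_ratio, 3))
```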

16
Q

What are ratio data?

A

Data that is ordered (ordinal), tells us the difference between values (interval), AND has an absolute zero point, which allows for multiplication and division operations (and all the statistical power that comes with them).

Ex: height; weight

17
Q

What operations can we perform on nominal data?

A

Counting and mode

18
Q

What operations can we perform on ordinal data?

A

Order, counting, mode, median

19
Q

What operations can we perform on interval data?

A

Order, counting, mode, median, mean, quantify difference between each value, add/subtract values

20
Q

What operations can we perform on ratio data?

A

Order, counting, mode, median, mean, quantify difference between each value, add/subtract values, multiply/divide values, find absolute zero
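
A rough sketch of how the allowed operations grow from nominal to ratio data. All four sample lists are hypothetical, and the ordinal median is found by ranking the labels, which is one simple way to do it.

```python
from statistics import mean, mode

# Nominal: labels only, so we can count and take the mode.
eye_colors = ["brown", "blue", "brown", "green"]
nominal_mode = mode(eye_colors)

# Ordinal: labels with an order, so a median also makes sense.
levels = ["very unhappy", "unhappy", "ok", "happy", "very happy"]
answers = ["ok", "happy", "unhappy", "happy", "very happy"]
ranks = sorted(levels.index(a) for a in answers)
ordinal_median = levels[ranks[len(ranks) // 2]]  # odd number of answers

# Interval: differences are meaningful, so mean and +/- are allowed.
temps_c = [10.0, 20.0, 15.0]
interval_mean = mean(temps_c)
interval_diff = temps_c[1] - temps_c[0]

# Ratio: a true zero exists, so multiplication and division are allowed too.
weights_kg = [50.0, 100.0]
weight_ratio = weights_kg[1] / weights_kg[0]

print(nominal_mode, ordinal_median, interval_mean, interval_diff, weight_ratio)
```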

21
Q

What is the data quality issue and how do we address it?

A

Data is “dirty” in the real world. It is often incomplete, inconsistent, or full of errors, outliers, and useless information.

To address these issues, we do data preprocessing.

22
Q

What does it mean to sample data and why do we do it?

A

Sampling means mining a subset of the data instead of the whole set. As long as the sample is large enough and diverse enough to be representative of the original data set, we can learn the same knowledge without having to consider every single data point in the set.

23
Q

What are the four types of sampling?

A

Simple random sampling (equal probability of selecting a particular item)

Sampling with replacement

Sampling without replacement

Stratified sampling
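
A minimal sketch of these sampling types using Python's standard `random` module, assuming a hypothetical population of the integers 0 to 99 and a made-up split into "even" and "odd" strata.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable
population = list(range(100))

# Simple random sampling without replacement:
# every item is equally likely and can be picked at most once.
without_replacement = random.sample(population, k=10)

# Sampling with replacement: the same item may be picked more than once.
with_replacement = random.choices(population, k=10)

# Stratified sampling: split the data into groups (strata) and sample
# from each group separately so that every group is represented.
strata = {
    "even": [x for x in population if x % 2 == 0],
    "odd": [x for x in population if x % 2 == 1],
}
stratified = [x for group in strata.values() for x in random.sample(group, k=5)]

print(without_replacement)
print(with_replacement)
print(stratified)
```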

24
Q

What is the process of attribute selection?

A

The process of discerning relevant attributes from irrelevant attributes.

We remove/ignore redundant attributes

25
Q

What is the process of dimensionality reduction?

A

Automatically detecting a relationship between multiple attributes and compressing them into fewer attributes

26
Q

How is attribute selection different from dimensionality reduction?

A

Selection involves making choices about which attributes to keep and which to discard. Reduction involves combining attributes so that we deal with fewer of them, while the information in each original attribute is still present (in a compressed form).

27
Q

What is the process of discretization?

A

The conversion of numerical (continuous) data into categorical (discrete) data.

Continuous data is binned to become discrete data
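
A small sketch of binning, assuming hypothetical age cut-offs of 18 and 65 and made-up bin names. Real bin edges depend on the data and the task.

```python
def bin_age(age):
    """Map a continuous age to a categorical bin (cut-offs are illustrative)."""
    if age < 18:
        return "child"
    if age < 65:
        return "adult"
    return "senior"

ages = [3.5, 17.9, 18.0, 42.2, 64.9, 70.1]  # continuous values
binned = [bin_age(a) for a in ages]         # discrete categories

print(binned)
```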

28
Q

Is ordinal data a subset of interval data?

A

Yes, in the sense that interval data has every property of ordinal data plus one more. The only thing ordinal data lacks in order to be interval data is information on the difference between the units the data is expressed in.

With that info comes the ability to calculate mean and do operations like + and -

29
Q

What is an attribute (dimension)?

A

A characteristic or descriptor of data

30
Q

What are some examples of dimensions (attributes) of a “human” object?

A

Height, weight, age, eye color

31
Q

What is binning?

A

The same thing as discretization. The process of turning continuous data into discrete categories (bins)

32
Q

What is frequency in data mining?

A

How frequently a discrete datum (or bin) appears in a data set

33
Q

What is the difference between data frequency and mode?

A

Frequency measures how many times a particular datum appears in a data set, whereas the mode simply identifies the datum with the greatest frequency in that set.
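
A quick sketch of the distinction using `collections.Counter`, assuming a hypothetical list of eye colors.

```python
from collections import Counter

colors = ["brown", "blue", "brown", "green", "blue", "brown"]

# Frequency: how many times each distinct value appears.
frequency = Counter(colors)

# Mode: the single value with the greatest frequency.
mode_value, mode_count = frequency.most_common(1)[0]

print(dict(frequency))
print(mode_value, mode_count)
```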

34
Q

What is variance in data mining?

A

The average of the squares of the deviations of the data values from the mean.

35
Q

What is the formula for variance?

A

s^2 = [(x1 - mean)^2 + … + (xn - mean)^2] / (n - 1)

36
Q

How do you calculate the standard deviation of a data set?

A

Take the square root of the variance
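
A worked sketch of variance and standard deviation on a hypothetical data set, dividing by n - 1 as in the variance formula on the previous card. The results are checked against `statistics.variance` and `statistics.stdev`, which also use n - 1.

```python
import math
from statistics import stdev, variance

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical values

n = len(data)
mean = sum(data) / n

# Variance: sum of squared deviations from the mean, divided by n - 1.
s2 = sum((x - mean) ** 2 for x in data) / (n - 1)

# Standard deviation: square root of the variance.
s = math.sqrt(s2)

print(s2, s)
print(variance(data), stdev(data))  # library values for comparison
```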

37
Q

What is covariance in data mining?

A

A measure of how changes to one dimension are associated with changes in a second dimension.

Covariance measures the degree to which two variables are linearly associated.
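
A minimal sketch of sample covariance between two dimensions. The deck gives no covariance formula, so dividing by n - 1 (to match the variance formula) is an assumption, and the height and weight values are hypothetical.

```python
def covariance(xs, ys):
    """Sample covariance of two equal-length lists (divides by n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

heights = [150.0, 160.0, 170.0, 180.0]
weights = [50.0, 60.0, 70.0, 80.0]

# Positive: the two dimensions rise together.
cov_same = covariance(heights, weights)

# Negative: one dimension falls as the other rises.
cov_opposite = covariance(heights, list(reversed(weights)))

print(cov_same, cov_opposite)
```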

38
Q

What is visualization in data mining?

A

Converting data into visual format (because human brains are pretty good at pattern recognition)

Examples:

  • Histogram
  • Two-dimensional histogram
  • Box plots
  • Scatter plots
  • Correlation matrix
39
Q

Why do we do dimensionality reduction?

A

It aids in visualization, reduces data noise, makes it easier to do analysis, and still represents the data well with only minimal loss of information

40
Q

What are factors in data mining?

A

They are combinations of observed dimensions (post-reduction)

Observed data are then described in terms of these factors instead of the original dimensions.

41
Q

What does PCA stand for?

A

Principal Component Analysis

42
Q

What is the purpose of PCA?

A

It reduces the high dimensionality of big data sets to fewer dimensions that are easier for humans to comprehend and visualize.

The variation (signal) in a data set can be seen as representing the information that we would like to keep.

PCA reduces the dimensionality of data by creating new, artificial variables called principal components (linear combinations of the original variables) while still keeping as much variation as possible.

43
Q

What does it mean for PCA to be an unsupervised method?

A

It means that no information about groups or class labels is used in the dimension reduction.

PCA shows a visual representation of the dominant patterns in a data set

44
Q

What must be true to compress dimensions into principal components?

A

The two dimensions must be highly correlated or dependent so that they essentially tell us about the same underlying variance in the data. This way when they are compressed, little info on the original data is lost.

45
Q

What is the difference between how we handle related dimensions and unrelated dimensions?

A

Combine related dimensions and focus on uncorrelated/independent dimensions (especially those along which the data have high variance)

46
Q

What are the ideal conditions for principal component analysis?

A

We want a smaller set of dimensions that explain most of the variance in the original data, in more compact and insightful form

47
Q

What does it mean for two dimensions to be orthogonal?

A

They are at 90 degrees to each other, i.e. uncorrelated. Changes in one are not linearly associated with changes in the other.

48
Q

What are the four steps of PCA?

A
  1. Let Xbar be the mean vector of the data
  2. Adjust the original data by the mean using x’ = x - Xbar
  3. Compute the covariance matrix S of the adjusted data: S = (1/n) X X^T, where the columns of X are the adjusted points x’
  4. Find the eigenvectors a and eigenvalues lambda of S: S a = lambda a
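
The four steps can be sketched in plain Python for the two-dimensional case, where the eigenvalues of the symmetric 2x2 covariance matrix have a closed form. The data points are hypothetical, and the eigenvector shortcut assumes the off-diagonal covariance is non-zero. This is an illustration of the steps, not a general PCA implementation.

```python
import math

# Hypothetical 2-D data: each pair is (x, y), and the two dimensions
# are perfectly correlated.
points = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]
n = len(points)

# Step 1: mean vector.
mean_x = sum(p[0] for p in points) / n
mean_y = sum(p[1] for p in points) / n

# Step 2: adjust the data by the mean.
centered = [(x - mean_x, y - mean_y) for x, y in points]

# Step 3: covariance matrix S = (1/n) X X^T = [[sxx, sxy], [sxy, syy]].
sxx = sum(x * x for x, _ in centered) / n
sxy = sum(x * y for x, y in centered) / n
syy = sum(y * y for _, y in centered) / n

# Step 4: eigenvalues of a symmetric 2x2 matrix (closed form).
half_trace = (sxx + syy) / 2
radius = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
lambda1 = half_trace + radius  # variance along the first component
lambda2 = half_trace - radius  # variance along the second component

# Eigenvector for lambda1 from (S - lambda1 * I) a = 0; assumes sxy != 0.
vx, vy = sxy, lambda1 - sxx
norm = math.hypot(vx, vy)
pc1 = (vx / norm, vy / norm)

print(lambda1, lambda2, pc1)
```

Because the points lie exactly on a line, all of the variance falls on the first component and the second eigenvalue is zero, so the second dimension could be dropped without losing information.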
49
Q

What do eigenvalues (lambda) correspond to?

A

The variance of the data along the corresponding principal component (eigenvector)