Chapter 2 - Data Flashcards

1
Q

Why is the type of data important to data mining?

A

The type of data determines which tools and techniques can be used to analyze it.

2
Q

Why is data quality important?

A

Improving data quality typically improves the quality of the resulting analysis.

3
Q

What is a data set?

A

A collection of data objects

4
Q

What is a data object?

A

An individual item in a data set, described by its attributes. Also called a record, point, vector, pattern, event, case, sample, observation, or entity.

5
Q

What are attributes?

A

A property or characteristic of an object that may vary from one object to another

6
Q

What is a **measurement scale** when referring to data mining?

A

A rule (function) that associates a numerical or symbolic value with an attribute of an object.

7
Q

Describe the process of measurement when referring to data mining.

A

Using a measurement scale to associate a value with a particular attribute of a specific object.

8
Q

What are the 4 properties (operations) of numbers that are typically used to describe attributes?

A
  1. Distinctness: = and !=
  2. Order: <, <=, >, and >=
  3. Addition: + and -
  4. Multiplication: * and /
9
Q

What are the 4 types of attributes?

A
  1. Nominal 2. Ordinal 3. Interval 4. Ratio
10
Q

What is a nominal type of attribute?

A

The values of a nominal attribute are just different names that provide only enough information to distinguish one object from another.

11
Q

What is an ordinal type of attribute?

A

The values of an ordinal attribute provide enough information to order objects.

12
Q

What is an interval type of attribute?

A

The differences between values of an interval attribute are meaningful (a unit of measurement exists).

13
Q

What is a ratio type of attribute?

A

The ratio and differences are both meaningful in a ratio attribute.

14
Q

What type of attribute is the following: zip codes

A

Nominal

15
Q

What type of attribute is the following: employee ID numbers

A

Nominal

16
Q

What type of attribute is the following: eye colour

A

Nominal

17
Q

What type of attribute is the following: gender

A

Nominal

18
Q

What type of attribute is the following: hardness of minerals, or a rating such as {good, better, best}

A

Ordinal

19
Q

What type of attribute is the following: grades

A

Ordinal

20
Q

What type of attribute is the following: street numbers

A

Ordinal

21
Q

What type of attribute is the following: calendar dates

A

Interval

22
Q

What type of attribute is the following: temperature in Celsius or Fahrenheit

A

Interval

23
Q

What type of attribute is the following: monetary quantities

A

Ratio

24
Q

What type of attribute is the following: counts

A

Ratio

25
Q

What type of attribute is the following: age

A

Ratio

26
Q

What type of attribute is the following: mass

A

Ratio

27
Q

What type of attribute is the following: length

A

Ratio

28
Q

What type of attribute is the following: electric current

A

Ratio

29
Q

What two types of attributes are categorical / qualitative?

A

Nominal and ordinal

30
Q

What are qualitative attributes?

A

They lack most of the properties of numbers (even if they are represented as numbers) and should be treated more like symbols. Also known as categorical attributes.

31
Q

What are categorical attributes?

A

They lack most of the properties of numbers (even if they are represented as numbers) and should be treated more like symbols. Also known as qualitative attributes.

32
Q

What are the two types of quantitative attributes?

A

Interval and ratio

33
Q

What are quantitative attributes?

A

They are represented by numbers and have most of the properties of numbers. Also known as numeric attributes.

34
Q

What are numeric attributes?

A

They are represented by numbers and have most of the properties of numbers. Also known as quantitative attributes.

35
Q

What are permissible transformations?

A

Transformations that do not change the meaning of an attribute. For example, the meaning of a length attribute is unchanged whether it is measured in metric or imperial units.

36
Q

Given the following transformation, what type of attribute could be used? If all employee ID numbers are reassigned, it would not make a difference

A

Nominal

37
Q

Given the following transformation, what type of attribute could be used? An attribute encompassing the notion of good, better, best can be represented equally well by the values 1, 2, 3.

A

Ordinal

38
Q

Given the following transformation, what type of attribute could be used? The Fahrenheit and Celsius temperature scales differ in their zero value and the size of a degree (unit).

A

Interval

39
Q

Given the following transformation, what type of attribute could be used? Length can be measured in meters or feet

A

Ratio

40
Q

What is a coefficient?

A

A number or symbol multiplied with a variable or an unknown quantity in an algebraic term. For example, 4 is the coefficient in the term 4x, and x is the coefficient in x(a + b).

41
Q

A way of distinguishing between attributes is by the number of values they can take. What are the two types of attributes in this case?

A

Discrete - has a finite or countably infinite set of values. Continuous - values are real numbers.

42
Q

What is a real number?

A

In mathematics, a real number is a value that represents a quantity along a continuous line

43
Q

What is a discrete attribute?

A
  • has a finite or countably infinite set of values
  • often represented using integer variables
  • binary attributes are a special case of discrete attributes
44
Q

What is a continuous attribute?

A
  • values are real numbers
  • typically represented as floating-point variables
  • in practice, can only be measured and represented with limited precision
45
Q

Typically, nominal and ordinal attributes are: a) continuous b) binary or discrete

A

b) binary or discrete

46
Q

Typically, interval and ratio attributes are: a) continuous b) binary or discrete

A

a) continuous

47
Q

What is an asymmetric attribute?

A
  • only presence (a non-zero attribute value) is important
  • eg. whether or not a student took a particular course
48
Q

What are asymmetric binary attributes?

A
  • binary attributes where only non-zero values are important
49
Q

What are the general characteristics of data sets?

A
  1. Dimensionality 2. Sparsity 3. Resolution
50
Q

When describing the general characteristics of data sets, what is dimensionality?

A
  • the number of attributes that the objects in the data set possess
51
Q

When describing the general characteristics of data sets, what is sparsity?

A
  • the amount of zero values within the data
52
Q

When describing the general characteristics of data sets, what is resolution?

A
  • it is frequently possible to obtain data at different levels of resolution
  • the properties of the data are often different at different resolutions
  • eg. the surface of the earth: flat at a coarse resolution vs bumpy at a fine resolution
53
Q

What are 3 examples of record data sets?

A
  1. Transaction or Market Basket Data 2. The Data Matrix 3. Sparse Data Matrix
54
Q

What are 2 examples of graph-based data sets?

A
  1. Data with Relationships among Objects 2. Data with Objects that are Graphs
55
Q

What are 4 examples of ordered data sets?

A
  1. Sequential Data
  2. Sequence Data
  3. Time Series Data
  4. Spatial Data
56
Q

What is transaction or market basket data?

A
  • each record (transaction) involves a set of items eg. items purchased at a grocery store
  • fields are typically asymmetric attributes (most often binary)
57
Q

What is a data matrix?

A
  • if the data objects in a collection all have the same fixed set of numeric attributes, they can be thought of as points (vectors) in a multidimensional space where each dimension represents a distinct attribute describing the object
  • can be interpreted as an m by n matrix where there are m rows, one for each object, and n columns, one for each attribute
  • standard matrix operations can be applied to transform and manipulate the data
58
Q

What is a sparse data matrix?

A
  • a special case of a data matrix in which the attributes are of the same type and are asymmetric (only non-zero values are important) eg. document data
59
Q

What is Data with Relationships among Objects?

A
  • data objects are mapped to nodes of the graph while the relationships among objects are captured by the links between objects and link properties such as direction and weight. eg. web pages on the world wide web
60
Q

What is Data with Objects that are Graphs?

A
  • the objects contain sub-objects that have relationships eg. the structure of chemical compounds
61
Q

What is sequential data / temporal data?

A
  • an extension of record data where each record has a time associated with it eg. retail transactions
62
Q

What is time series data?

A
  • a special type of sequential data in which each record is a time series (series of measurements taken over time) eg. average monthly temperature of a city between range of dates
63
Q

What is spatial data?

A
  • some objects have spatial attributes such as positions eg. weather data (precipitation, temperature, pressure) collected from a variety of geographical locations
64
Q

What is temporal autocorrelation?

A
  • especially applies to time series data - if two measurements are close in time, then the values of those measurements are often very similar.
65
Q

What is spatial autocorrelation?

A
  • especially applies to spatial data - objects that are physically close tend to be similar
66
Q

How can non-record data be handled?

A

Record-oriented techniques can be applied to non-record data by: 1. extracting features from the data objects 2. using these features to create a record corresponding to each object.

67
Q

What is the issue with applying record-oriented techniques to non record data?

A

It does not capture all of the information in the data.

68
Q

What two principles does data mining focus on when referring to data quality?

A
  1. the detection and correction of data quality problems 2. the use of algorithms that can tolerate poor data quality
69
Q

What is data cleaning?

A

The detection and correction of data quality problems.

70
Q

Name 3 errors that can occur with data collection

A
  1. Human error
  2. Limitations of measuring devices
  3. Flaws in the data collection process
71
Q

Define measurement error

A
  • refers to any problem resulting from the measurement process
  • eg. the recorded value differs from the true value to some extent
72
Q

Define error (for continuous attributes)

A

the numerical difference between the measured value and the true value

73
Q

Define data collection error

A

errors such as: omitting data objects or attribute values or inappropriately including a data object

74
Q

Define noise with respect to data mining

A

The random component of a measurement error. It may involve the distortion of a value or the addition of spurious objects.

75
Q

This term is often used in connection with data that has a spatial or temporal component. Techniques from signal or image processing can be used to reduce it.

A

noise

76
Q

Define robust algorithms

A

algorithms that produce acceptable results even when noise is present

77
Q

Define artifacts with respect to data mining

A

deterministic distortions of the data eg. a streak in a photograph

78
Q

Define precision

A

the closeness of repeated measurements (of the same quantity) to one another

eg. measure the same object 5 times using a scale - the precision of the scale is the standard deviation of the 5 measurements

79
Q

Define bias

A

a systematic variation of measurements from the quantity being measured

eg. measure the same object, of known mass, 5 times using a scale - the bias is the difference between the mean of the 5 measurements and the known mass
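
The scale example can be sketched in a few lines. The readings and the 1.000 g reference mass below are hypothetical values chosen for illustration, and the sample standard deviation is assumed as the precision estimate.

```python
import statistics

# Hypothetical data: five readings of a reference mass known to weigh 1.000 g.
true_value = 1.000
readings = [1.015, 0.990, 1.013, 1.001, 0.986]

mean_reading = statistics.mean(readings)

# Bias: systematic offset of the readings from the known true value.
bias = mean_reading - true_value

# Precision: spread of the repeated readings (sample standard deviation).
precision = statistics.stdev(readings)

print(f"bias = {bias:.3f} g, precision = {precision:.3f} g")
```

Note that the bias can only be computed because the true mass is known independently of the scale.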

80
Q

This is often measured by the standard deviation of a set of values

A

precision

81
Q

Define standard deviation

A

The standard deviation is a measure of how spread out values are.

It is represented by the Greek letter sigma (σ).

It is the square root of the variance.

82
Q

This is often measured by taking the difference between the mean of the set of values and the known value of the quantity being measured

A

bias

83
Q

This can only be determined for objects whose measured quantity is known by means external to the current situation

A

Bias

84
Q

Define Accuracy

A

The closeness of measurements to the true value of the quantity being measured

85
Q

Define significant digits with respect to accuracy and data mining

A

The goal is to use only as many digits to represent the result of a measurement or calculation as are justified by the precision of the data.

86
Q

What are the two definitions of outliers?

A
  1. data objects that have characteristics that are different from most of the other data objects in the data set 2. values of an attribute that are unusual with respect to the typical values for that attribute
87
Q

What is the difference between outliers and noise?

A

Outliers can be legitimate data objects or values and, unlike noise, may be of interest.

88
Q

What 3 things can be done when an object or one or more attribute values are missing from a data set?

A
  1. Eliminate data objects or attributes 2. Estimate Missing Values 3. Ignore the missing values during analysis
89
Q

When an object or one or more attribute values are missing from a data set, what is the disadvantage of eliminating the object altogether?

A

Even a partially specified data object contains some information, and if many objects have missing values, eliminating them can make a reliable analysis difficult or impossible.

90
Q

When an object or one or more attribute values are missing from a data set explain estimating missing values.

A
  • missing data can sometimes be reliably estimated - can use interpolation
91
Q

When an object or one or more attribute values are missing from a data set explain ignoring the missing values during analysis.

A
  • many data mining approaches can be modified to ignore missing values
92
Q

Define interpolation

A

An estimate of a value that lies between two known values in a sequence of values
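
As a minimal sketch, linear interpolation is one simple way to do this; it assumes the quantity changes along a straight line between the two known points. The function name and the temperature readings are hypothetical.

```python
def linear_interpolate(x0, y0, x1, y1, x):
    """Estimate y at x, assuming a straight line between (x0, y0) and (x1, y1)."""
    if not (min(x0, x1) <= x <= max(x0, x1)):
        raise ValueError("x must lie between the two known points")
    if x0 == x1:
        return y0
    fraction = (x - x0) / (x1 - x0)
    return y0 + fraction * (y1 - y0)

# Hypothetical hourly temperatures with the 11:00 reading missing.
estimate = linear_interpolate(10, 18.0, 12, 22.0, 11)
print(estimate)  # 20.0
```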

93
Q

Explain inconsistent values

A

Values that can be recognized as incorrect, eg. a negative height, or a zip code that does not belong to the stated city

94
Q

Is it possible to correct inconsistent values?

A

Sometimes. For example, credit card numbers or product codes might have “check” digits. Correcting an inconsistency requires additional or redundant information.

95
Q

What two main issues need to be addressed with duplicate data?

A
  1. If there are two objects that actually represent a single object, any differing attribute values need to be resolved
  2. Care needs to be taken to avoid accidentally combining data objects that are similar but not duplicates, such as two distinct people with identical names
96
Q

Define deduplication

A

The process of dealing with duplicates

97
Q

What are three issues related to the application of data quality?

A
  1. Timeliness - data can start to age as soon as it is collected
  2. Relevance - the data must contain the information necessary for its application
  3. Knowledge about the data - data is normally accompanied by documentation, and that documentation can affect the analysis eg. a missing value is represented by -9999
98
Q

What are 7 topics related to data preprocessing?

A

Feature subset selection

Dimensionality reduction

Feature creation

Discretization and binarization

Aggregation

Sampling

Variable transformation

99
Q

Define aggregation with respect to data mining (data preprocessing)

A
  • a data mining preprocessing step
  • combining two or more objects into a single object eg. combining all of the transactions recorded at one store into a single storewide transaction
100
Q

What are the advantages of aggregation

A
  • smaller data sets require less memory and processing time and may permit the use of more expensive data mining algorithms
  • can provide a high-level view of the data
  • the behaviour of groups of objects or attributes is often more stable than that of individual objects or attributes
101
Q

What is a disadvantage of using aggregation?

A

The potential loss of interesting details

102
Q

Define sampling with respect to data mining (data preprocessing)

A
  • a data mining preprocessing step - selecting a subset of the data objects to be analyzed
103
Q

When is a sample representative with respect to data mining, preprocessing, sampling

A
  • when the sample has approximately the same property (of interest) as the original set of data. For example, if the mean (average) is the property of interest, the sample is representative if its mean is close to that of the original data set.
104
Q

What are 2 common sampling techniques used in data mining

A
  1. simple random sampling 2. stratified sampling
105
Q

Explain the simple random sampling, sampling technique used in data mining, preprocessing, and the two variations of simple random sampling.

A

There is an equal probability of selecting any particular item.

There are 2 variations:

1) sampling without replacement (as each item is selected, it is removed from the population)
2) sampling with replacement (objects are not removed from the population and can be selected more than once)
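
The two variations can be sketched with Python's standard `random` module. The population of ten object IDs and the fixed seed are assumptions made only so the sketch is repeatable.

```python
import random

population = list(range(1, 11))  # hypothetical population of 10 object IDs
rng = random.Random(42)          # fixed seed so the sketch is repeatable

# Without replacement: each object can appear at most once in the sample.
without_replacement = rng.sample(population, k=5)

# With replacement: the same object may be drawn more than once.
with_replacement = rng.choices(population, k=5)

print(without_replacement)
print(with_replacement)
```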

106
Q

Explain the stratified sampling technique used in data mining, preprocessing, and the two variations of stratified sampling.

A

Starts with prespecified groups of objects

There are two variations:

1) equal numbers of objects are drawn from each group even though the groups are different sizes
2) number of objects drawn from each group is proportional to the size of that group
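
A minimal sketch of both variations follows. The function name `stratified_sample`, its parameters, and the two groups are hypothetical; rounding the proportional share to the nearest integer is one possible choice, not the only one.

```python
import random

def stratified_sample(groups, total, proportional, seed=0):
    """Draw about `total` objects from prespecified groups.

    proportional=False: the same number from every group.
    proportional=True:  each group's share matches its share of the population.
    """
    rng = random.Random(seed)
    population_size = sum(len(members) for members in groups.values())
    sample = {}
    for name, members in groups.items():
        if proportional:
            k = round(total * len(members) / population_size)
        else:
            k = total // len(groups)
        k = min(k, len(members))  # a group cannot give more than it has
        sample[name] = rng.sample(members, k)
    return sample

# Hypothetical population: a large group A (80 objects) and a small group B (20).
groups = {"A": list(range(80)), "B": list(range(80, 100))}
equal = stratified_sample(groups, total=10, proportional=False)
prop = stratified_sample(groups, total=10, proportional=True)
print({name: len(items) for name, items in equal.items()})  # {'A': 5, 'B': 5}
print({name: len(items) for name, items in prop.items()})   # {'A': 8, 'B': 2}
```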

107
Q

Explain adaptive / progressive sampling

A

Starts with a small sample and then increases the sample size until a sample of sufficient size has been obtained

108
Q

What is required for adaptive / progressive sampling?

A

A way to evaluate the sample to judge if it is large enough

109
Q

Explain dimensionality reduction (data preprocessing)

A

Techniques that reduce the dimensionality (number of attributes) in a data set

110
Q

What are some of the advantages of dimensionality reduction?

A
  • many data mining algorithms work better if the number of attributes in the data is lower
  • it can eliminate irrelevant features and reduce noise
  • can make data more easily visualized
111
Q

Define Singular Value Decomposition (SVD)

A

is a linear algebra technique for dimensionality reduction that is related to the Principal Components Analysis (PCA) technique.

112
Q

Define Feature Subset Selection (data preprocessing)

A

A way of reducing the dimensionality of data by using only a subset of the features.

113
Q

When using feature subset selection for dimensionality reduction, is information lost?

A

Not necessarily. Information is not lost when the features that are removed are redundant or irrelevant.

114
Q

Define redundant features (dimensionality reduction)

A

features that duplicate much or all of the information contained in one or more other attributes eg. the purchase price of a product, and the amount of sales tax paid

115
Q

Define irrelevant features (dimensionality reduction, data preprocessing)

A

contain almost no useful information for the data mining task at hand

116
Q

What are the 3 standard approaches for feature subset selection (data preprocessing)?

A
  1. Embedded approaches - the data mining algorithm itself decides which attributes to use and which to ignore
  2. Filter approaches - features are selected before the data mining algorithm is run
  3. Wrapper approaches - use the data mining algorithm as a black box to find the best subset of attributes, typically without enumerating all possible subsets
117
Q

Define feature weighting

A
  • an alternative to keeping or eliminating features
  • more important features are assigned a higher weight
118
Q

Define feature creation (data preprocessing)

A
  • create, from the original attributes, a new set of attributes that captures the important information in a data set much more effectively
119
Q

What are 3 methodologies for creating new attributes (feature creation, data preprocessing)

A
  1. Feature extraction - the creation of a new set of features from the original raw data
  2. Mapping the data to a new space - provides a new view of the data, using techniques such as a Fourier transform or wavelet transform
  3. Feature construction - one or more new features are constructed out of the original features and are more useful than the original features
120
Q

Define discretization (data preprocessing)

A

transforming a continuous attribute into a categorical attribute

121
Q

Define binarization (data preprocessing)

A

transforming both continuous and discrete attributes into one or more binary attributes

122
Q

Explain how to binarize a categorical attribute

A
  1. uniquely assign each value to an integer 2. convert each of these integers into a binary number
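
The two steps can be sketched as follows. The function name `binarize` and the five rating values are hypothetical; the sketch assumes the caller supplies the values in the order they should be numbered, and it uses the smallest number of bits that can represent every integer.

```python
def binarize(values_in_order):
    """Map each category to an integer (its position), then to a fixed-width bit list."""
    m = len(values_in_order)
    # Number of binary attributes needed to represent the integers 0 .. m-1.
    n_bits = max(1, (m - 1).bit_length())
    encoding = {}
    for index, value in enumerate(values_in_order):
        # Step 1: the integer is `index`. Step 2: write it as n_bits binary digits.
        encoding[value] = [int(bit) for bit in format(index, f"0{n_bits}b")]
    return encoding

codes = binarize(["awful", "poor", "OK", "good", "great"])
for value, bits in codes.items():
    print(value, bits)
# awful [0, 0, 0]
# poor [0, 0, 1]
# OK [0, 1, 0]
# good [0, 1, 1]
# great [1, 0, 0]
```

With five categories, three binary attributes are enough because 2^3 = 8 covers the integers 0 to 4.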
123
Q

What two subtasks are involved in transforming a continuous attribute to a categorical one (discretization of continuous attributes)

A
  1. decide how many categories to have 2. map the values of the continuous attribute to these categories
124
Q

What is the difference between unsupervised and supervised discretization

A

If class information is used, the discretization is supervised; when no class information is used, it is unsupervised.

125
Q

Describe the equal width discretization approach

A
  • divides the range of the attribute into a user-specified number of intervals, each having the same width
  • can be badly affected by outliers
126
Q

Describe the equal frequency (equal depth) discretization approach

A
  • preferred over equal width approach - tries to put the same number of objects into each interval
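
A rough sketch of equal width and equal frequency binning side by side, using a hypothetical data set with one outlier. The function names are illustrative, and the equal frequency version assigns bins purely by rank, so tied values may end up in different bins.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of identical width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value would fall just past the last interval, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def equal_frequency_bins(values, n_bins):
    """Assign bins by rank so each bin holds about the same number of objects."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

data = [1, 2, 3, 4, 5, 100]  # hypothetical values with one outlier
print(equal_width_bins(data, 3))      # [0, 0, 0, 0, 0, 2]
print(equal_frequency_bins(data, 3))  # [0, 0, 1, 1, 2, 2]
```

The single outlier stretches the equal width intervals so far that five of the six values share one bin, while the equal frequency bins stay balanced.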
127
Q

What are 3 common approaches to unsupervised discretization?

A
  1. equal width 2. equal frequency 3. K-means
128
Q

Define entropy

A

Entropy is best understood as a measure of uncertainty rather than certainty, since entropy is larger for more random sources. A source is characterized by the probability distribution of the samples drawn from it. The idea is that the less likely an event is, the more information it provides when it occurs.

129
Q

This type of approach to discretization is one of the most promising

A

entropy-based approaches

130
Q

What is a simple approach for partitioning a continuous attribute?

A
  1. Start by bisecting the initial values so that the resulting two intervals give minimum entropy.
  2. The splitting process is then repeated with another interval, typically the one with the worst (highest) entropy, until a user-specified number of intervals is reached or a stopping criterion is satisfied.
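
The two steps can be sketched as below. This is an illustration under stated assumptions, not a reference implementation: entropy is computed over class labels attached to each value (so the discretization is supervised), cut points are placed midway between neighbouring intervals, and all function names and data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def best_split(pairs):
    """pairs: (value, label) tuples sorted by value.
    Returns (weighted_entropy, index) of the best bisection, or None."""
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between two equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if best is None or weighted < best[0]:
            best = (weighted, i)
    return best

def entropy_discretize(values, labels, n_intervals):
    """Repeatedly bisect the highest-entropy interval; return the cut points."""
    intervals = [sorted(zip(values, labels))]
    while len(intervals) < n_intervals:
        candidates = [iv for iv in intervals if best_split(iv) is not None]
        if not candidates:
            break  # stopping criterion: nothing left that can be split
        worst = max(candidates, key=lambda iv: entropy([label for _, label in iv]))
        _, i = best_split(worst)
        pos = intervals.index(worst)
        intervals[pos:pos + 1] = [worst[:i], worst[i:]]
    # Place each cut point midway between neighbouring intervals.
    return [(a[-1][0] + b[0][0]) / 2 for a, b in zip(intervals, intervals[1:])]

values = [1, 2, 3, 10, 11, 12, 20, 21]
labels = ["a", "a", "a", "b", "b", "b", "a", "a"]
print(entropy_discretize(values, labels, 3))  # [6.5, 16.0]
```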
131
Q

What is a variable transformation?

A

a mathematical transformation that is applied to all values of a variable.

132
Q

What is standardization or normalization of a variable?

A

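A type of variable transformation whose goal is to make an entire set of values have a particular property, eg. subtracting the mean and dividing by the standard deviation gives a variable with mean 0 and standard deviation 1.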
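
One common instance, often called z-score standardization, subtracts the mean and divides by the standard deviation so the result has mean 0 and standard deviation 1. The sketch below assumes the population standard deviation and uses hypothetical data.

```python
import statistics

def standardize(values):
    """Return z-scores: (value - mean) / population standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("all values are identical; the standard deviation is zero")
    return [(v - mean) / std for v in values]

# Hypothetical data with mean 5 and population standard deviation 2.
z = standardize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(z)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```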

133
Q

Why are similarity and dissimilarity important to data mining?

A

they are used by a number of data mining techniques such as clustering, nearest neighbor classification, and anomaly detection.

134
Q

Define proximity

A
  • used to either refer to similarity or dissimilarity
  • the proximity between two objects is a function of the proximity between the corresponding attributes of the two objects
135
Q

Define similarity

A
  • The similarity between two objects is a numerical measure of the degree to which the two objects are alike.
  • Similarities are higher for pairs of objects that are more alike
  • usually not negative and between 0 (no similarity) and 1 (complete similarity)
136
Q

Define dissimilarity

A
  • The dissimilarity between two objects is a numerical measure of the degree to which two objects are different
  • dissimilarities are lower for more similar pairs of objects
  • the term distance is used as a synonym for dissimilarity
  • sometimes fall between 0 and 1, but also common to range from 0 to infinity
137
Q

What is a monotonic decreasing function used for in data mining?

A
  • to convert dissimilarities to similarities and vice versa
  • or to transform the values of a proximity measure to a new scale
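
A few commonly used monotonic decreasing conversions from a dissimilarity d to a similarity s are sketched below. The function names are illustrative; which conversion is appropriate depends on the application.

```python
import math

def similarity_reciprocal(d):
    """s = 1 / (1 + d): maps d in [0, infinity) to s in (0, 1]."""
    return 1.0 / (1.0 + d)

def similarity_exponential(d):
    """s = e^(-d): also maps [0, infinity) to (0, 1], but decays faster."""
    return math.exp(-d)

def similarity_min_max(d, d_min, d_max):
    """s = 1 - (d - d_min) / (d_max - d_min): linear rescaling to [0, 1]."""
    return 1.0 - (d - d_min) / (d_max - d_min)

distances = [0, 1, 10, 100]
print([similarity_reciprocal(d) for d in distances])
print([similarity_min_max(d, 0, 100) for d in distances])  # [1.0, 0.99, 0.9, 0.0]
```

In every case a larger dissimilarity yields a smaller similarity, which is what makes the function monotonic decreasing.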
138
Q

What is the Euclidean distance?

A

The straight-line distance between two points on a plane. Euclidean distance, or distance “as the crow flies,” can be calculated using the Pythagorean theorem.

139
Q

What is a distance matrix?

A

a matrix (two-dimensional array) containing the distances, taken pairwise, of a set of points. This matrix will have a size of N×N where N is the number of points, nodes or vertices (often in a graph).
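
As a small sketch, a Euclidean distance matrix can be built with `math.dist` from the Python standard library (available from Python 3.8). The four 2-D points are hypothetical.

```python
import math

def distance_matrix(points):
    """Return the N x N matrix of pairwise Euclidean distances."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]

points = [(0, 2), (2, 0), (3, 1), (5, 1)]  # hypothetical 2-D points
D = distance_matrix(points)
for row in D:
    print([round(d, 3) for d in row])
```

The diagonal is all zeros and the matrix is symmetric, so in practice only one triangle needs to be stored.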

140
Q

What are some well known properties for distances such as the Euclidean distance?

A
  1. Positivity
    a) d(x,y) >= 0 for all x and y
    b) d(x,y) = 0 only if x = y
  2. Symmetry
    d(x,y) = d(y,x) for all x and y
  3. Triangle Inequality
    d(x,z) <= d(x,y) + d(y,z) for all points x, y and z

141
Q

Define similarity coefficients

A
  • Similarity measures between objects that contain only binary attributes
142
Q

What are the valid range of values for similarity coefficients?

A
  • They typically have values between 0 and 1.
  • A value of 1 indicates that the two objects are completely similar, while a value of 0 indicates that the objects are not at all similar.
143
Q

Define the Simple Matching Coefficient (SMC)

A
  • a type of similarity coefficient

SMC = number of matching attribute values / number of attributes

$$\mathrm{SMC} = \frac{f_{11} + f_{00}}{f_{01} + f_{10} + f_{11} + f_{00}}$$

144
Q

This similarity measure counts both presences and absences.

A

The Simple Matching Coefficient (SMC)

145
Q

This similarity measure is useful when both positive and negative values carry equal information (symmetry). For example, gender (male and female).

A

Simple Matching Coefficient

146
Q

This similarity test could be used to find students who answered questions similarly on a test that consisted only of true and false questions.

A

Simple Matching Coefficient, similarity coefficient

147
Q

This similarity measure is frequently used to handle objects consisting of asymmetric binary attributes.

A

Jaccard coefficient

148
Q

Define the Jaccard Coefficient

A
  • a similarity measure that is frequently used to handle objects consisting of asymmetric binary attributes.
  • Often symbolized by J
149
Q

What is the formula for the Jaccard Coefficient?

A

J = number of matching presences / number of attributes not involved in 00 matches

$$J = \frac{f_{11}}{f_{01} + f_{10} + f_{11}}$$
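
A minimal sketch that computes both the Simple Matching Coefficient and the Jaccard coefficient for two binary vectors. Here f11 counts attributes where both objects are 1, f00 where both are 0, and f10 and f01 where exactly one of them is 1. The vectors are illustrative, and returning 0.0 when there are no presences at all is an assumed convention.

```python
def match_counts(x, y):
    """Return (f11, f10, f01, f00) for two equal-length binary vectors."""
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    f10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    f01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    f00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    return f11, f10, f01, f00

def smc(x, y):
    """Simple Matching Coefficient: counts 1-1 and 0-0 matches."""
    f11, f10, f01, f00 = match_counts(x, y)
    return (f11 + f00) / (f11 + f10 + f01 + f00)

def jaccard(x, y):
    """Jaccard coefficient: ignores 0-0 matches."""
    f11, f10, f01, _ = match_counts(x, y)
    denominator = f11 + f10 + f01
    return f11 / denominator if denominator else 0.0  # assumed convention

x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(smc(x, y))      # 0.7
print(jaccard(x, y))  # 0.0
```

The two objects share no presences, yet the SMC is high because seven attributes are 0 in both. This is why the Jaccard coefficient is preferred for asymmetric binary attributes.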

150
Q

Define cosine similarity

A

ignores 0-0 matches like the Jaccard measure, but also handles non-binary vectors
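
Cosine similarity is the cosine of the angle between two vectors: their dot product divided by the product of their lengths. The sketch below uses two hypothetical term-count vectors standing in for documents.

```python
import math

def cosine_similarity(x, y):
    """cos(x, y) = (x . y) / (||x|| * ||y||)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)

doc_a = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]  # hypothetical term counts
doc_b = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(round(cosine_similarity(doc_a, doc_b), 3))  # 0.315
```

Positions where both vectors are 0 add nothing to the dot product or to either length, which is how 0-0 matches are ignored. Scaling either vector by a positive constant leaves the result unchanged.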

151
Q

This type of similarity measure is one of the most common measures of document similarity

A

cosine similarity

152
Q

This type of similarity measure does not take the magnitude of the two data objects into account when computing similarity

A

cosine similarity

153
Q

This coefficient can be used for document data and reduces to the Jaccard coefficient in the case of binary attributes

A

Tanimoto coefficient

154
Q

Define correlation

A

The correlation between two data objects that have binary or continuous variables is a measure of the linear relationship between the attributes of the objects.

155
Q

What is the range of correlation?

A

Correlation is always in the range -1 to 1.

A correlation of 1 ( -1 ) means that x and y have a perfect positive (negative) linear relationship; that is, xk = ayk + b, where a and b are constants.

If the correlation is 0, then there is no linear relationship between the attributes of the two data objects.
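
A minimal sketch of Pearson's correlation, assuming that is the correlation meant here. The two vectors are hypothetical and satisfy x_k = -3 y_k, so they have a perfect negative linear relationship.

```python
import math

def correlation(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # The 1/(n-1) factors cancel between numerator and denominator, so they are omitted.
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when one vector is constant")
    return covariance / (spread_x * spread_y)

x = [-3, 6, 0, 3, -6]
y = [1, -2, 0, -1, 2]
print(round(correlation(x, y), 6))  # -1.0
```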

156
Q

Define Bregman Divergences

A
  • a family of proximity functions that share some common properties
  • are loss or distortion functions
  • can be used as dissimilarity functions
157
Q

What are 3 issues with proximity calculation?

A
  1. how to handle the case in which attributes have different scales and/or are correlated
  2. how to calculate proximity between objects that are composed of different types of attributes eg. quantitative and qualitative
  3. how to handle proximity calculation when attributes have different weights eg. when not all attributes contribute equally to the proximity of objects
158
Q

How are distance measures handled when attributes do not have the same range of values?

A

The Mahalanobis distance is useful when attributes are correlated, have different ranges of values (different variances), and the distribution of the data is approximately Gaussian (normal).
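
To illustrate the effect of correlation, here is a minimal two-dimensional sketch. It assumes the common definition sqrt((x − y) Σ⁻¹ (x − y)ᵀ), where Σ is the covariance matrix of the data; some texts omit the square root. The function name, the covariance matrix, and the points are all made up for the example.

```python
import math


def mahalanobis_2d(x, y, cov):
    """Mahalanobis distance between 2-D points x and y, given a 2x2
    covariance matrix cov (inverted here in closed form)."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    inv = [[d / det, -b / det], [-c / det, a / det]]
    diff = [x[0] - y[0], x[1] - y[1]]
    q = sum(diff[i] * inv[i][j] * diff[j] for i in range(2) for j in range(2))
    return math.sqrt(q)


# Hypothetical covariance: the two attributes are positively correlated
cov = [[0.3, 0.2], [0.2, 0.3]]
A, B, C = (0.5, 0.5), (0.0, 1.0), (1.5, 1.5)

# Euclidean distance says B is closer to A than C is
print(math.dist(A, B) < math.dist(A, C))  # True
# Mahalanobis distance says the opposite, because C lies along the
# direction in which the data varies most
print(mahalanobis_2d(A, C, cov) < mahalanobis_2d(A, B, cov))  # True
```

With the identity matrix as the covariance, the function reduces to ordinary Euclidean distance.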

159
Q

What is Gaussian?

A

The graph of a Gaussian is a characteristic symmetric “bell curve” shape.

160
Q

How are similarity measures handled when attributes are of different types?

A

Compute the similarity for each attribute separately, and then combine these similarities using a method that results in a similarity between 0 and 1.

The overall similarity is defined as the average of all the individual attribute similarities.
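
One way to picture this is the sketch below, which scores a nominal, an ordinal, and a numeric attribute separately and then averages the results. The per-attribute formulas are just one reasonable choice among several, and the attribute names, levels, and value range are invented for the example.

```python
def nominal_sim(a, b):
    """1 if the values match, otherwise 0."""
    return 1.0 if a == b else 0.0


def ordinal_sim(a, b, levels):
    """1 minus the rank difference, scaled by the number of steps."""
    rank_a, rank_b = levels.index(a), levels.index(b)
    return 1.0 - abs(rank_a - rank_b) / (len(levels) - 1)


def numeric_sim(a, b, lo, hi):
    """1 minus the absolute difference, scaled by the attribute's range.
    Assumes both values lie in [lo, hi]."""
    return 1.0 - abs(a - b) / (hi - lo)


def overall_similarity(sims):
    """Average of the per-attribute similarities (each in [0, 1])."""
    return sum(sims) / len(sims)


# Hypothetical records: (color, size, temperature in a 0..100 range)
p = ("red", "small", 20.0)
q = ("red", "large", 70.0)
sizes = ["small", "medium", "large"]

sims = [
    nominal_sim(p[0], q[0]),              # 1.0: same color
    ordinal_sim(p[1], q[1], sizes),       # 0.0: opposite ends of the scale
    numeric_sim(p[2], q[2], 0.0, 100.0),  # 0.5: half the range apart
]
print(overall_similarity(sims))  # 0.5
```

Because every per-attribute score stays in [0, 1], the average does too.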

161
Q

When computing proximity, how do you handle the case in which some attributes are more important to the definition of proximity than others?

A

The formulas for proximity can be modified by weighting the contribution of each attribute.
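
As an illustration, a weighted Minkowski distance multiplies each attribute's contribution by a weight before summing. The sketch below assumes non-negative weights that sum to 1; the function name and sample values are chosen for the example.

```python
def weighted_minkowski(x, y, weights, r=2):
    """Weighted Minkowski distance: (sum_k w_k * |x_k - y_k| ** r) ** (1 / r).
    With r=2 this is a weighted Euclidean distance."""
    if not (len(x) == len(y) == len(weights)):
        raise ValueError("x, y and weights must have the same length")
    total = sum(w * abs(a - b) ** r for a, b, w in zip(x, y, weights))
    return total ** (1 / r)


x, y = [0.0, 0.0], [3.0, 4.0]
print(weighted_minkowski(x, y, [0.5, 0.5]))  # both attributes count equally
print(weighted_minkowski(x, y, [1.0, 0.0]))  # only the first attribute counts
print(weighted_minkowski(x, y, [0.0, 1.0]))  # only the second attribute counts
```

Setting a weight to 0 removes that attribute from the distance entirely, and equal weights recover the unweighted distance up to a constant factor.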

162
Q

This type of proximity measure is often used for many types of dense, continuous data.

A

metric distance measures such as Euclidean distance

163
Q

Classify the following attribute as binary, discrete, or continuous; also, classify it as qualitative (nominal or ordinal) or quantitative (interval or ratio):

Time in terms of AM or PM

A

Binary, qualitative, ordinal

164
Q

Classify the following attribute as binary, discrete, or continuous; also, classify it as qualitative (nominal or ordinal) or quantitative (interval or ratio):

Brightness as measured by a light meter.

A

Continuous, quantitative, ratio

165
Q

Classify the following attribute as binary, discrete, or continuous; also, classify it as qualitative (nominal or ordinal) or quantitative (interval or ratio):

Brightness as measured by people’s judgments.

A

Discrete, qualitative, ordinal

166
Q

Classify the following attribute as binary, discrete, or continuous; also, classify it as qualitative (nominal or ordinal) or quantitative (interval or ratio):

Angles as measured in degrees between 0° and 360°.

A

Continuous, quantitative, ratio

167
Q

Classify the following attribute as binary, discrete, or continuous; also, classify it as qualitative (nominal or ordinal) or quantitative (interval or ratio):

Bronze, Silver, and Gold medals as awarded at the Olympics.

A

Discrete, qualitative, ordinal

168
Q

Classify the following attribute as binary, discrete, or continuous; also, classify it as qualitative (nominal or ordinal) or quantitative (interval or ratio):

Height above sea level.

A

Continuous, quantitative, interval/ratio (depends on whether sea level is regarded as an arbitrary origin)

169
Q

Classify the following attribute as binary, discrete, or continuous; also, classify it as qualitative (nominal or ordinal) or quantitative (interval or ratio):

Number of patients in a hospital.

A

Discrete, quantitative, ratio

170
Q

Classify the following attribute as binary, discrete, or continuous; also, classify it as qualitative (nominal or ordinal) or quantitative (interval or ratio):

ISBN numbers for books. (Look up the format on the Web.)

A

Discrete, qualitative, nominal (ISBN numbers do have order information, though).

171
Q

Classify the following attribute as binary, discrete, or continuous; also, classify it as qualitative (nominal or ordinal) or quantitative (interval or ratio):

Ability to pass light in terms of the following values: opaque, translucent, transparent.

A

Discrete, qualitative, ordinal

172
Q

Classify the following attribute as binary, discrete, or continuous; also, classify it as qualitative (nominal or ordinal) or quantitative (interval or ratio):

Military rank.

A

Discrete, qualitative, ordinal

173
Q

Classify the following attribute as binary, discrete, or continuous; also, classify it as qualitative (nominal or ordinal) or quantitative (interval or ratio):

Distance from the center of campus.

A

Continuous, quantitative, interval/ratio (depends)

174
Q

Classify the following attribute as binary, discrete, or continuous; also, classify it as qualitative (nominal or ordinal) or quantitative (interval or ratio): Coat check number.

A

Discrete, qualitative, nominal

175
Q

Can you think of a situation in which identification numbers would be useful for prediction?

A

One example: Student IDs are a good predictor of graduation date.

176
Q

Which of the following quantities is likely to show more temporal autocorrelation: daily rainfall or daily temperature? Why?

A

A feature shows spatial autocorrelation if locations that are closer to each other are more similar with respect to the values of that feature than locations that are farther away. It is more common for physically close locations to have similar temperatures than similar amounts of rainfall, since rainfall can be very localized; i.e., the amount of rainfall can change abruptly from one location to another. Therefore, daily temperature shows more spatial autocorrelation than daily rainfall.

177
Q

Give at least two advantages to working with data stored in text files instead of in a binary format.

A

(1) Text files can be easily inspected by typing the file or viewing it with a text editor.
(2) Text files are more portable than binary files, both across systems and programs.
(3) Text files can be more easily modified, for example, using a text editor or Perl.

178
Q

Distinguish between noise and outliers. Be sure to consider the following questions.

Is noise ever interesting or desirable? Outliers?

A

No, by definition. Yes. (See Chapter 10.)

179
Q

Distinguish between noise and outliers. Be sure to consider the following questions.

Can noise objects be outliers?

A

Yes. Random distortion of the data is often responsible for outliers.

180
Q

Are noise objects always outliers?

A

No. Random distortion can result in an object or value much like a normal one.

181
Q

Are outliers always noise objects?

A

No. Often outliers merely represent a class of objects that are different from normal objects.

182
Q

Can noise make a typical value into an unusual one, or vice versa?

A

Yes.

183
Q

What is the range of values that are possible for the cosine measure?

A

In general, the range is [-1, 1]. Many times the data has only non-negative entries, and in that case the range is [0, 1].

184
Q

If two objects have a cosine measure of 1, are they identical? Explain.

A

Not necessarily. All we know is that the values of their attributes differ by a constant factor.

185
Q

Proximity is typically defined between a pair of objects.

How might you define the distance between two sets of points in Euclidean space?

A

One approach is to compute the distance between the centroids of the two sets of points.
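
The centroid approach can be sketched as follows. The helper names and the two sample point sets are made up for illustration, and `math.dist` requires Python 3.8 or later.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length points."""
    n = len(points)
    dim = len(points[0])
    return [sum(p[k] for p in points) / n for k in range(dim)]


def centroid_distance(set_a, set_b):
    """Euclidean distance between the centroids of two sets of points."""
    return math.dist(centroid(set_a), centroid(set_b))


# Two hypothetical squares of points, centered at (1, 1) and (4, 5)
set_a = [(0, 0), (2, 0), (0, 2), (2, 2)]
set_b = [(3, 4), (5, 4), (3, 6), (5, 6)]
print(centroid_distance(set_a, set_b))  # 5.0
```

This summarizes each set by a single point, so it ignores the spread and shape of the sets; other definitions, such as the minimum or average pairwise distance, make different trade-offs.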

186
Q

Explain why computing the proximity between two attributes is often simpler than computing the similarity between two objects.

A

In general, an object can be a record whose fields (attributes) are of different types. To compute the overall similarity of two objects in this case, we need to decide how to compute the similarity for each attribute and then combine these similarities. In contrast, the values of an attribute are all of the same type, and thus, if another attribute is of the same type, then the computation of similarity is conceptually and computationally straightforward.

187
Q

What is the curse of dimensionality?

A

The phenomenon that many types of data analysis become significantly harder as the dimensionality of the data increases.

As the dimensionality of the data increases, the data becomes increasingly sparse in the space that it occupies.
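
A back-of-the-envelope sketch shows how quickly the sparsity sets in. Suppose each dimension is split into 10 equal bins, so the space has 10^d cells; a fixed number of points can occupy at most one cell each. The function name, the bin count, and the sample size of 1000 are arbitrary choices for the example.

```python
def max_occupied_fraction(n_points, dims, bins=10):
    """Upper bound on the fraction of grid cells that can contain at
    least one point, with `bins` equal-width bins per dimension."""
    n_cells = bins ** dims
    return min(1.0, n_points / n_cells)


n_points = 1000
for dims in (1, 2, 3, 4, 6):
    print(dims, max_occupied_fraction(n_points, dims))
```

With 1000 points, every cell can still be occupied in up to 3 dimensions, but in 6 dimensions at most one cell in a thousand contains any data at all.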