Chapter 2 - Data Flashcards

1
Q

Why is the type of data important to data mining?

A

The type of data determines which tools and techniques can be used to analyze it.

2
Q

Why is data quality important?

A

Improving data quality typically improves the quality of the resulting analysis.

3
Q

What is a data set?

A

A collection of data objects

4
Q

What is a data object?

A

record, point, vector, pattern, event, case, sample, observation, or entity

5
Q

What are attributes?

A

A property or characteristic of an object that may vary from one object to another

6
Q

What is a **measurement scale** when referring to data mining?

A

A rule (function) that associates a numerical or symbolic value with an attribute of an object.

7
Q

Describe the process of measurement when referring to data mining.

A

Using a measurement scale to associate a value with a particular attribute of a specific object.

8
Q

What are the 4 properties (operations) of numbers that are typically used to describe attributes?

A
  1. Distinctness: = and ≠ 2. Order: <, ≤, >, and ≥ 3. Addition: + and − 4. Multiplication: × and /
9
Q

What are the 4 types of attributes?

A
  1. Nominal 2. Ordinal 3. Interval 4. Ratio
10
Q

What is a nominal type of attribute?

A

The values of a nominal attribute are just different names that provide only enough information to distinguish one object from another.

11
Q

What is an ordinal type of attribute?

A

The values of an ordinal attribute provide enough information to order objects.

12
Q

What is an interval type of attribute?

A

The differences between values of an interval attribute are meaningful (a unit of measurement exists).

13
Q

What is a ratio type of attribute?

A

The ratio and differences are both meaningful in a ratio attribute.

14
Q

What type of attribute is the following: zip codes

A

Nominal

15
Q

What type of attribute is the following: employee ID numbers

A

Nominal

16
Q

What type of attribute is the following: eye colour

A

Nominal

17
Q

What type of attribute is the following: gender

A

Nominal

18
Q

What type of attribute is the following: hardness of minerals, or a rating such as {good, better, best}

A

Ordinal

19
Q

What type of attribute is the following: grades

A

Ordinal

20
Q

What type of attribute is the following: street numbers

A

Ordinal

21
Q

What type of attribute is the following: calendar dates

A

Interval

22
Q

What type of attribute is the following: temperature in Celsius or Fahrenheit

A

Interval

23
Q

What type of attribute is the following: monetary quantities

A

Ratio

24
Q

What type of attribute is the following: counts

A

Ratio

25
What type of attribute is the following: age
Ratio
26
What type of attribute is the following: mass
Ratio
27
What type of attribute is the following: length
Ratio
28
What type of attribute is the following: electric current
Ratio
29
What two types of attributes are categorical / qualitative?
Nominal and ordinal
30
What are qualitative attributes?
They lack most of the properties of numbers (even if they are represented as numbers) and should be treated more like symbols. Also known as categorical attributes.
31
What are categorical attributes?
They lack most of the properties of numbers (even if they are represented as numbers) and should be treated more like symbols. Also known as qualitative attributes.
32
What are the two types of quantitative attributes?
Interval and ratio
33
What are quantitative attributes?
They are represented by numbers and have most of the properties of numbers. Also known as numeric attributes.
34
What are numeric attributes?
They are represented by numbers and have most of the properties of numbers. Also known as quantitative attributes.
35
What are permissible transformations?
Transformations that do not change the meaning of an attribute. For example, the meaning of a length attribute is unchanged if a different measurement scale is used: measuring in the metric or the imperial system does not change the length.
36
Given the following transformation, what type of attribute could be used? If all employee ID numbers are reassigned, it would not make a difference
Nominal
37
Given the following transformation, what type of attribute could be used? An attribute encompassing the notion of good, better, best can be represented equally well by the values 1, 2, 3
Ordinal
38
Given the following transformation, what type of attribute could be used? The Fahrenheit and Celsius temperature scales differ in their zero value and the size of a degree (unit).
Interval
39
Given the following transformation, what type of attribute could be used? Length can be measured in meters or feet
Ratio
40
What is a coefficient?
A number or symbol multiplied with a variable or an unknown quantity in an algebraic term. For example, 4 is the coefficient in the term 4x, and x is the coefficient in x(a + b).
41
A way of distinguishing between attributes is by the number of values they can take. What are the two types of attributes in this case?
Discrete - has a finite or countably infinite set of values. Continuous - values are real numbers.
42
What is a real number?
In mathematics, a real number is a value that represents a quantity along a continuous line
43
What is a discrete attribute?
- has a finite or countably infinite set of values. - often represented using integer variables. - binary attributes are a special case of discrete attributes
44
What is a continuous attribute?
- values are real numbers - are typically represented as floating-point variables - practically, can only be measured and represented with limited precision
45
Typically, nominal and ordinal attributes are: a) continuous b) binary or discrete
b) binary or discrete
46
Typically, interval and ratio attributes are: a) continuous b) binary or discrete
a) continuous
47
What is an asymmetric attribute?
- only presence (a non-zero attribute value) is important - eg. whether or not a student took a particular course
48
What are asymmetric binary attributes?
- binary attributes where only non-zero values are important
49
What are the general characteristics of data sets?
1. Dimensionality 2. Sparsity 3. Resolution
50
When describing the general characteristics of data sets, what is dimensionality?
- the number of attributes that the objects in the data set possess
51
When describing the general characteristics of data sets, what is sparsity?
- the amount of zero values within the data
52
When describing the general characteristics of data sets, what is resolution?
- frequently possible to obtain data at different levels of resolution - properties of the data are different at different resolutions - eg. the surface of the earth: flat vs bumpy
53
What are 3 examples of record data, data sets?
1. Transaction or Market Basket Data 2. The Data Matrix 3. Sparse Data Matrix
54
What are 2 examples of graph based, data sets?
1. Data with Relationships among Objects 2. Data with Objects that are Graphs
55
What are 4 examples of **ordered data**, data sets?
1. Sequential Data 2. Sequence Data 3. Time Series Data 4. Spatial Data
56
What is transaction or market basket data?
- each record (transaction) involves a set of items eg. items purchased at a grocery store - fields are typically asymmetric attributes (most often binary)
57
What is a data matrix?
- if the data objects in a collection all have the same fixed set of numeric attributes, they can be thought of as points (vectors) in a multidimensional space where each dimension represents a distinct attribute describing the object. - can be interpreted as an m by n matrix where there are m rows, one for each object, and n columns, one for each attribute. - standard matrix operations can be applied to transform and manipulate the data
58
What is a sparse data matrix?
- a special case of a data matrix in which the attributes are of the same type and are asymmetric (only non-zero values are important). eg. document data
59
What is Data with Relationships among Objects?
- data objects are mapped to nodes of the graph while the relationships among objects are captured by the links between objects and link properties such as direction and weight. eg. web pages on the world wide web
60
What is Data with Objects that are Graphs?
- the objects contain sub-objects that have relationships eg. the structure of chemical compounds
61
What is sequential data / temporal data?
- an extension of record data where each record has a time associated to it eg. retail transaction
62
What is time series data?
- a special type of sequential data in which each record is a time series (series of measurements taken over time) eg. average monthly temperature of a city between range of dates
63
What is spatial data?
- some objects have spatial attributes such as positions eg. weather data (precipitation, temperature, pressure) collected from a variety of geographical locations
64
What is temporal autocorrelation?
- especially applies to time series data - if two measurements are close in time, then the values of those measurements are often very similar.
65
What is spatial autocorrelation?
- especially applies to spatial data - objects that are physically close tend to be similar
66
How can non-record data be handled?
- Record-oriented techniques can be applied to non-record data by: 1. extracting features from the data objects 2. using these features to create a record corresponding to each object.
67
What is the issue with applying record-oriented techniques to non record data?
It does not capture all of the information in the data.
68
What two principles does data mining focus on when referring to data quality?
1. the detection and correction of data quality problems 2. the use of algorithms that can tolerate poor data quality
69
What is **data cleaning**?
The detection and correction of data quality problems.
70
Name 3 errors that can occur with data collection
1. Human error 2. Limitations of measuring devices 3. Flaws in the data collection process
71
Define measurement error
- refers to any problem resulting from the measurement process - eg. the record differs from the true value to some extent
72
Define error (for continuous attributes)
the numerical difference of the measured and true value
73
Define data collection error
errors such as: omitting data objects or attribute values or inappropriately including a data object
74
Define noise with respect to data mining
The distortion of a value or the addition of spurious objects
75
This term is often used in connection with data that has a spatial or temporal component. Techniques from signal or image processing can be used to reduce it.
noise
76
Define robust algorithms
algorithms that produce acceptable results even when noise is present
77
Define artifacts with respect to data mining
deterministic distortions of the data eg. a streak in a photograph
78
Define **precision**
the closeness of repeated measurements (of the same quantity) to one another eg. scale - measure the same object 5 times using the scale - the precision of the scale is the standard deviation of the 5 measurements
79
Define bias
a systematic variation of measurements from the quantity being measured eg. scale - measure the same object 5 times using the scale - the bias is the difference between the mean of the 5 measurements and the known true value
80
This is often measured by the standard deviation of a set of values
precision
81
Define standard deviation
The standard deviation is a measure of how spread out values are. It is represented by the Greek letter sigma (σ) and is the square root of the variance.
82
This is often measured by taking the difference between the mean of the set of values and the known value of the quantity being measured
bias
83
This can only be determined for objects whose measured quantity is known by means external to the current situation
Bias
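
As a rough illustration of the precision and bias cards above, the sketch below measures precision as the sample standard deviation of repeated measurements and bias as their mean minus the known true value. The readings and the 1.000 g reference mass are made-up illustrative values, and `precision_and_bias` is a hypothetical helper name.

```python
from statistics import mean, stdev


def precision_and_bias(measurements, true_value):
    """Return (precision, bias) for repeated measurements of one quantity."""
    # Precision: spread of the repeated measurements (sample standard deviation).
    precision = stdev(measurements)
    # Bias: systematic offset of the measurements from the known true value.
    bias = mean(measurements) - true_value
    return precision, bias


# Five hypothetical readings of a reference mass known to weigh 1.000 g.
readings = [1.015, 0.990, 1.013, 1.001, 0.986]
precision, bias = precision_and_bias(readings, true_value=1.000)
print(f"precision (std dev): {precision:.3f}")
print(f"bias (mean - true):  {bias:+.3f}")
```

Bias can only be computed here because the true value is supplied from outside the data, which matches the card above.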
84
Define Accuracy
The closeness of measurements to the true value of the quantity being measured
85
Define significant digits with respect to accuracy and data mining
the goal is to use only as many digits to represent the result of a measurement or calculation as are justified by the precision of the data
86
What are the two definitions of outliers?
1. data objects that have characteristics that are different from most of the other data objects in the data set 2. values of an attribute that are unusual with respect to the typical values for that attribute
87
What is the difference between outliers and noise?
Outliers can be legitimate data objects or values and, unlike noise, may be of interest.
88
What 3 things can be done when an object or one or more attribute values are missing from a data set?
1. Eliminate data objects or attributes 2. Estimate Missing Values 3. Ignore the missing values during analysis
89
When an object or one or more attribute values are missing from a data set, what is the disadvantage of eliminating the object altogether?
Even a partial data object contains some information, and if many objects have missing values, the analysis of what remains may be unreliable
90
When an object or one or more attribute values are missing from a data set explain estimating missing values.
- missing data can sometimes be reliably estimated - can use interpolation
91
When an object or one or more attribute values are missing from a data set explain ignoring the missing values during analysis.
- many data mining approaches can be modified to ignore missing values
92
Define interpolation
An estimate of a value that lies between two known values in a sequence of values
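
A minimal sketch of one common form, linear interpolation, which assumes the unknown value lies on a straight line between its two known neighbours. The function name and the sample readings are hypothetical.

```python
def linear_interpolate(x0, y0, x1, y1, x):
    """Estimate y at position x from the known points (x0, y0) and (x1, y1)."""
    if x0 == x1:
        raise ValueError("the two known points must have different x values")
    # Fraction of the way from x0 to x1 at which x lies.
    t = (x - x0) / (x1 - x0)
    return y0 + t * (y1 - y0)


# A reading is missing at hour 1; hours 0 and 2 were recorded as 10 and 20.
estimate = linear_interpolate(0, 10.0, 2, 20.0, 1)
print(estimate)
```

Whether a straight line is a reasonable model depends on the data; it tends to suit attributes that change smoothly between neighbouring measurements.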
93
Explain inconsistent values
When you can tell the data is not right - eg. a negative height - a zip code that doesn't belong to a city
94
Is it possible to correct inconsistent values?
Sometimes. For example, credit cards or product codes might have "check" digits. The correction of an inconsistency requires additional or redundant information.
95
What two main issues need to be addressed with duplicate data?
1. If there are two objects that actually represent a single object, any differing attribute values need to be resolved 2. Care needs to be taken to avoid accidentally combining data objects that are similar but not duplicates, such as two distinct people with identical names.
96
Define deduplication
The process of dealing with duplicates
97
What are three issues related to the application of data quality?
1. Timeliness - data can age as soon as it's collected 2. Relevance - the data must contain the information necessary for its application 3. Knowledge about the data - data is normally accompanied by documentation that can affect analysis eg. a missing value is represented by -9999
98
What are 7 topics related to data preprocessing?
1. Feature subset selection 2. Dimensionality reduction 3. Feature creation 4. Discretization and binarization 5. Aggregation 6. Sampling 7. Variable transformation
99
Define aggregation with respect to data mining (data preprocessing)
- a data mining preprocessing step - combining two or more objects into a single object eg. combining all of a store's individual transactions into a single storewide transaction
100
What are the advantages of aggregation
- smaller data sets require less memory and processing time and may permit the use of more expensive data mining algorithms - can provide a high-level view of the data - the behaviour of groups of objects or attributes is often more stable than that of individual objects or attributes
101
What is a disadvantage of using aggregation?
The potential loss of interesting details
102
Define sampling with respect to data mining (data preprocessing)
- a data mining preprocessing step - selecting a subset of the data objects to be analyzed
103
When is a sample representative with respect to data mining, preprocessing, sampling
When the sample has approximately the same property (of interest) as the original set of data. For example, if the mean (average) is the property of interest, a sample is representative if its mean is close to the mean of the original data set.
104
What are 2 common sampling techniques used in data mining
1. simple random sampling 2. stratified sampling
105
Explain the simple random sampling, sampling technique used in data mining, preprocessing, and the two variations of simple random sampling.
There is an equal probability of selecting any particular item. There are 2 variations: 1) sampling without replacement (as each item is selected, it is removed from the population) 2) sampling with replacement (objects are not removed from the population and can be selected more than once)
106
Explain the stratified sampling technique used in data mining, preprocessing, and the two variations of stratified sampling.
Starts with prespecified groups of objects There are two variations: 1) equal numbers of objects are drawn from each group even though the groups are different sizes 2) number of objects drawn from each group is proportional to the size of that group
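
The two sampling cards above can be sketched with Python's standard `random` module. The helper names, the toy population, and the `group_of` callback are assumptions made for illustration; only the equal-count variation of stratified sampling is shown.

```python
import random
from collections import defaultdict


def sample_without_replacement(items, k, rng):
    """Each selected item is removed from the pool, so no item repeats."""
    return rng.sample(items, k)


def sample_with_replacement(items, k, rng):
    """Items stay in the pool and may be selected more than once."""
    return rng.choices(items, k=k)


def stratified_sample(items, group_of, per_group, rng):
    """Draw the same number of objects from every prespecified group."""
    groups = defaultdict(list)
    for item in items:
        groups[group_of(item)].append(item)
    sample = []
    for members in groups.values():
        # Small groups contribute everything they have.
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample


rng = random.Random(42)  # fixed seed so the sketch is repeatable
population = [("common", i) for i in range(95)] + [("rare", i) for i in range(5)]
print(sample_without_replacement(population, 5, rng))
print(stratified_sample(population, lambda item: item[0], 3, rng))
```

The proportional variation would replace `per_group` with a count scaled by each group's share of the population.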
107
Explain adaptive / progressive sampling
Starts with a small sample and then increases the sample size until a sample of sufficient size has been obtained
108
What is required for adaptive / progressive sampling?
A way to evaluate the sample to judge if it is large enough
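
One possible shape for progressive sampling is sketched below. The doubling schedule and the `is_large_enough` callback are assumptions; the cards only say that the sample grows until some evaluation judges it sufficient.

```python
import random


def progressive_sample(data, start_size, is_large_enough, rng):
    """Grow a random sample until the caller-supplied test accepts it."""
    shuffled = list(data)
    rng.shuffle(shuffled)
    size = start_size
    # Double the sample size until it passes the test or the data runs out.
    while size < len(shuffled) and not is_large_enough(shuffled[:size]):
        size *= 2
    return shuffled[:min(size, len(shuffled))]


# Hypothetical stopping rule: accept once the sample has at least 100 objects.
sample = progressive_sample(range(1000), 10, lambda s: len(s) >= 100, random.Random(1))
print(len(sample))
```

In practice the stopping rule would measure something about the analysis itself, such as whether a model's accuracy has stopped improving as the sample grows.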
109
Explain dimensionality reduction (data preprocessing)
Techniques that reduce the dimensionality (number of attributes) in a data set
110
What are some of the advantages of dimensionality reduction?
- many data mining algorithms work better if the number of attributes in the data is lower - it can eliminate irrelevant features and reduce noise - can make data more easily visualized
111
Define Singular Value Decomposition (SVD)
is a linear algebra technique for dimensionality reduction that is related to the Principal Components Analysis (PCA) technique.
112
Define Feature Subset Selection (data preprocessing)
A way of reducing the dimensionality of data by using only a subset of the features.
113
When using feature subset selection for dimensionality reduction, is information lost?
Not necessarily. Little or no information is lost if the features that are dropped are redundant or irrelevant.
114
Define redundant features (dimensionality reduction)
features that duplicate much or all of the information contained in one or more other attributes eg. the purchase price of a product, and the amount of sales tax paid
115
Define irrelevant features (dimensionality reduction, data preprocessing)
contain almost no useful information for the data mining task at hand
116
What are the 3 standard approaches for feature subset selection (data preprocessing)?
1. Embedded approaches - the data mining algorithm itself decides which attributes to use and which to ignore 2. Filter approaches - features are selected before the data mining algorithm is run 3. Wrapper approaches - use the data mining algorithm as a black box to find the best subset of attributes without enumerating all possible subsets
117
Define feature weighting
- an alternative to keeping or eliminating features - more important features are assigned a higher weight
118
Define feature creation (data preprocessing)
- create, from the original attributes, a new set of attributes that captures the important information in a data set much more effectively
119
What are 3 methodologies for creating new attributes (feature creation, data preprocessing)
1. Feature extraction - the creation of a new set of features from the original raw data 2. Mapping the data to a new space - provides a new view of the data - techniques such as a Fourier transform or wavelet transform 3. Feature construction - one or more new features are constructed out of the original features that are more useful than the original features
120
Define discretization (data preprocessing)
transforming a continuous attribute into a categorical attribute
121
Define binarization (data preprocessing)
transforming both continuous and discrete attributes into one or more binary attributes
122
Explain how to binarize a categorical attribute
1. uniquely assign each value to an integer 2. convert each of these integers into a binary number
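
The two steps above can be sketched as follows. The function name and the five example categories are illustrative assumptions; each category gets an integer by position, and that integer is written out as fixed-width binary digits.

```python
def binarize(categories):
    """Map each categorical value to a list of binary attribute values."""
    # Number of bits needed to give every category a distinct code.
    n_bits = max(1, (len(categories) - 1).bit_length())
    codes = {}
    for index, value in enumerate(categories):  # step 1: value -> integer
        bits = format(index, f"0{n_bits}b")     # step 2: integer -> binary
        codes[value] = [int(bit) for bit in bits]
    return codes


for value, bits in binarize(["awful", "poor", "OK", "good", "great"]).items():
    print(value, bits)
```

This encoding is compact, but it can introduce relationships between the binary attributes that are not in the data, so one binary attribute per category is a common alternative.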
123
What two subtasks are involved in transforming a continuous attribute to a categorical one (discretization of continuous attributes)
1. decide how many categories to have 2. map the values of the continuous attribute to these categories
124
What is the difference between unsupervised and supervised discretization
If class information is used, the discretization is supervised; when no class information is used, it is unsupervised.
125
Describe the equal width discretization approach
- divides the range of the attributes into a user-specified number of intervals, each having the same width - can be badly affected by outliers
126
Describe the equal frequency (equal depth) discretization approach
- preferred over equal width approach - tries to put the same number of objects into each interval
127
What are 3 common approaches to unsupervised discretization?
1. equal width 2. equal frequency 3. K-means
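
A minimal sketch of the first two unsupervised approaches, using made-up values with one outlier. The function names are hypothetical, and ties in equal-frequency binning are simply split by sort order, which a careful implementation might handle differently.

```python
def equal_width_bins(values, k):
    """Assign each value to one of k intervals of identical width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0] * len(values)
    # The maximum value would land in bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_frequency_bins(values, k):
    """Assign values to k intervals holding roughly the same number of objects."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins


values = [1, 2, 3, 4, 5, 100]  # 100 is an outlier
print(equal_width_bins(values, 3))
print(equal_frequency_bins(values, 3))
```

With these values the outlier stretches the equal-width range so far that every other value falls into the first bin, which is the weakness noted on the equal width card.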
128
Define entropy
Entropy is best understood as a measure of uncertainty rather than certainty, as entropy is larger for more random sources. The source is characterized by the probability distribution of the samples drawn from it. The idea is that the less likely an event is, the more information it provides when it occurs.
129
These types of approaches to discretization are among the most promising
entropy-based approaches
130
What is a simple approach for partitioning a continuous attribute?
1. Start by bisecting the initial values so that the resulting two intervals give minimum entropy. 2. The splitting process is then repeated on another interval, typically the one with the worst (highest) entropy, until a user-specified number of intervals is reached or a stopping criterion is satisfied.
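
Step 1 above can be sketched as a search for the single split point that minimizes the size-weighted entropy of the two resulting intervals. The class labels, values, and function names below are illustrative assumptions, and entropy is computed in bits (log base 2).

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy in bits of the class labels inside one interval."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted_entropy) for the best single bisection."""
    pairs = sorted(zip(values, labels), key=lambda pair: pair[0])
    n = len(pairs)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        # Weight each interval's entropy by the fraction of objects it holds.
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        if best is None or score < best[1]:
            best = (threshold, score)
    return best


values = [1, 2, 3, 10, 11, 12]
labels = ["x", "x", "x", "y", "y", "y"]
print(best_split(values, labels))
```

Step 2 would call `best_split` again on whichever interval currently has the highest entropy, stopping once the desired number of intervals is reached.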
131
What is a variable transformation?
a mathematical transformation that is applied to all values of a variable.
132
What is standardization or normalization of a variable?
another type of variable transformation to make an entire set of values have a particular property
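
One common choice of "particular property" is a mean of 0 and a standard deviation of 1. The sketch below assumes that choice and uses the population standard deviation; the function name and values are illustrative.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to have mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("cannot standardize a constant variable")
    return [(v - mu) / sigma for v in values]


scores = [2, 4, 4, 4, 5, 5, 7, 9]
print(standardize(scores))
```

Because the mean and standard deviation are sensitive to outliers, the median and absolute deviation are sometimes substituted for them.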
133
Why are similarity and dissimilarity important to data mining?
They are used by a number of data mining techniques such as clustering, nearest neighbor classification, and anomaly detection.
134
Define proximity
- used to either refer to similarity or dissimilarity - the proximity between two objects is a function of the proximity between the corresponding attributes of the two objects
135
Define similarity
- The similarity between two objects is a numerical measure of the degree to which the two objects are alike. - Similarities are higher for pairs of objects that are more alike - usually not negative and between 0 (no similarity) and 1 (complete similarity)
136
Define dissimilarity
- The dissimilarity between two objects is a numerical measure of the degree to which two objects are different - dissimilarities are lower for more similar pairs of objects - the term distance is used as a synonym for dissimilarity - sometimes fall between 0 and 1, but also common to range from 0 to infinity
137
What is a monotonic decreasing function used for in data mining?
- to convert dissimilarities to similarities and vice versa - or to transform the values of a proximity measure to a new scale
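
Three commonly used monotonic decreasing conversions from a dissimilarity `d` to a similarity are sketched below. The function names are hypothetical, and which conversion suits a given problem depends on how the similarity will be used.

```python
from math import exp


def similarity_reciprocal(d):
    """Maps d in [0, inf) to (0, 1]; equal objects (d = 0) get 1."""
    return 1.0 / (1.0 + d)


def similarity_exponential(d):
    """Maps d in [0, inf) to (0, 1], decaying faster for large d."""
    return exp(-d)


def similarity_min_max(d, d_min, d_max):
    """Linear rescaling: d_min maps to 1 and d_max maps to 0."""
    return 1.0 - (d - d_min) / (d_max - d_min)


for d in (0, 1, 10, 100):
    print(d, similarity_reciprocal(d), similarity_min_max(d, 0, 100))
```

All three preserve the ordering of pairs in reverse, so the most dissimilar pair always becomes the least similar one.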
138
What is the **Euclidean distance**?
The straight-line distance between two points on a plane. Euclidean distance, or distance "as the crow flies," can be calculated using the Pythagorean theorem.
139
What is a **distance matrix**?
a matrix (two-dimensional array) containing the distances, taken pairwise, of a set of points. This matrix will have a size of N×N where N is the number of points, nodes or vertices (often in a graph).
140
What are some well known properties for distances such as the Euclidean distance?
1. Positivity a) d(x,y) >= 0 for all x and y b) d(x,y) = 0 only if x = y 2. Symmetry d(x,y) = d(y,x) for all x and y 3. Triangle Inequality d(x,z) <= d(x,y) + d(y,z) for all points x, y and z
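
The sketch below builds a Euclidean distance matrix with the standard library's `math.dist` and checks the three properties on it. The three sample points are made up, and the small tolerance in the triangle inequality check is an assumption to absorb floating-point rounding.

```python
from itertools import product
from math import dist


def distance_matrix(points):
    """Return the N x N matrix of pairwise Euclidean distances."""
    return [[dist(p, q) for q in points] for p in points]


points = [(0, 0), (3, 4), (6, 8)]
D = distance_matrix(points)
n = len(points)

# Positivity: no negative distances, and zero on the diagonal.
positivity = all(D[i][j] >= 0 for i, j in product(range(n), repeat=2))
# Symmetry: d(x, y) equals d(y, x).
symmetry = all(D[i][j] == D[j][i] for i, j in product(range(n), repeat=2))
# Triangle inequality: going through a third point is never shorter.
triangle = all(
    D[i][k] <= D[i][j] + D[j][k] + 1e-12
    for i, j, k in product(range(n), repeat=3)
)

for row in D:
    print(row)
print(positivity, symmetry, triangle)
```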
141
Define similarity coefficients
- Similarity measures between objects that contain only binary attributes
142
What are the valid range of values for similarity coefficients?
- They typically have values between 0 and 1. - A value of 1 indicates that the two objects are completely similar, while a value of 0 indicates that the objects are not at all similar.
143
Define the Simple Matching Coefficient (SMC)
- a type of similarity coefficient - SMC = (number of matching attribute values) / (number of attributes) = (f11 + f00) / (f01 + f10 + f11 + f00)
144
This similarity measure counts both presences and absences.
The Simple Matching Coefficient (SMC)
145
This similarity measure is useful when both positive and negative values carry equal information (symmetry). For example, gender (male and female).
Simple Matching Coefficient
146
This similarity test could be used to find students who answered questions similarly on a test that consisted only of true and false questions.
Simple Matching Coefficient, similarity coefficient
147
This similarity measure is frequently used to handle objects consisting of asymmetric binary attributes.
Jaccard coefficient
148
Define the Jaccard Coefficient
- a similarity measure that is frequently used to handle objects consisting of asymmetric binary attributes. - Often symbolized by J
149
What is the formula for the Jaccard Coefficient?
J = (number of matching presences) / (number of attributes not involved in 00 matches) = f11 / (f01 + f10 + f11)
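
A small sketch comparing the Simple Matching Coefficient and the Jaccard coefficient on two binary vectors. Here f11 counts attributes where both objects are 1, f00 where both are 0, and f10 and f01 where they differ. The vectors and function names are illustrative, and returning 0.0 when there are no presences at all is an assumed convention.

```python
def match_counts(x, y):
    """Return (f11, f10, f01, f00) for two equal-length binary vectors."""
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    f10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    f01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    f00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    return f11, f10, f01, f00


def smc(x, y):
    """Simple Matching Coefficient: counts matching presences and absences."""
    f11, f10, f01, f00 = match_counts(x, y)
    return (f11 + f00) / (f01 + f10 + f11 + f00)


def jaccard(x, y):
    """Jaccard coefficient: ignores 0-0 matches."""
    f11, f10, f01, _ = match_counts(x, y)
    denominator = f01 + f10 + f11
    return f11 / denominator if denominator else 0.0  # assumed convention


x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print("SMC:", smc(x, y))
print("Jaccard:", jaccard(x, y))
```

The two objects share no presences, yet SMC is high because of the many shared absences, which is why Jaccard is preferred for asymmetric binary attributes.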
150
Define cosine similarity
ignores 0-0 matches like the Jaccard measure, but also handles non-binary vectors
151
This type of similarity measure is one of the most common measures of document similarity
cosine similarity
152
This type of similarity measure does not take the magnitude of the two data objects into account when computing similarity
cosine similarity
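
Cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below uses made-up term-count vectors and a hypothetical function name, and also shows that scaling a vector leaves the result unchanged.

```python
from math import sqrt


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sqrt(sum(a * a for a in x))
    norm_y = sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)


doc_a = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]  # hypothetical term counts
doc_b = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(round(cosine_similarity(doc_a, doc_b), 3))

# A document with the same proportions but twice the counts points the same way.
doubled = [2 * count for count in doc_a]
print(round(cosine_similarity(doc_a, doubled), 3))
```

Positions where both vectors are 0 add nothing to the dot product, so 0-0 matches are ignored as the definition card says.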
153
This coefficient can be used for document data and reduces to the Jaccard coefficient in the case of binary attributes
Tanimoto coefficient
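
The Tanimoto (extended Jaccard) coefficient is commonly written as x·y / (‖x‖² + ‖y‖² − x·y). A minimal sketch with a hypothetical function name and made-up vectors:

```python
def tanimoto(x, y):
    """Tanimoto (extended Jaccard) coefficient of two numeric vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x_sq = sum(a * a for a in x)
    norm_y_sq = sum(b * b for b in y)
    denominator = norm_x_sq + norm_y_sq - dot
    if denominator == 0:
        raise ValueError("undefined when both vectors are all zeros")
    return dot / denominator


# For binary vectors: dot = f11, and the denominator becomes f11 + f10 + f01.
print(tanimoto([1, 1, 0, 1], [1, 0, 0, 1]))
```

For the binary pair shown, f11 = 2, f10 = 1 and f01 = 0, so the Jaccard formula gives the same 2/3.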
154
Define correlation
- the correlation between two data objects that have binary or continuous variables is a measure of the linear relationship between the attributes of the objects.
155
What is the range of correlation?
Correlation is always in the range -1 to 1. A correlation of 1 ( -1 ) means that x and y have a perfect positive (negative) linear relationship; that is, xk = ayk + b, where a and b are constants. If the correlation is 0, then there is no linear relationship between the attributes of the two data objects.
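
A sketch of Pearson's correlation between two objects, with made-up values chosen to show a perfect negative linear relationship and a nonlinear relationship whose correlation is 0. The function name is hypothetical.

```python
from math import sqrt
from statistics import mean


def correlation(x, y):
    """Pearson correlation between two equal-length numeric vectors."""
    mean_x, mean_y = mean(x), mean(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return covariance / (spread_x * spread_y)


# x = -3 * y, a perfect negative linear relationship.
print(correlation([-3, 6, 0, 3, -6], [1, -2, 0, -1, 2]))

# y = x squared: a strong relationship, but not a linear one.
print(correlation([-3, -2, -1, 0, 1, 2, 3], [9, 4, 1, 0, 1, 4, 9]))
```

The second pair illustrates that a correlation of 0 rules out only a linear relationship, not every relationship.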
156
Define Bregman Divergences
- a family of proximity functions that share some common properties - are loss or distortion functions - can be used as dissimilarity functions
157
What are 3 issues with proximity calculation?
1. how to handle the case in which attributes have different scales and/or are correlated 2. how to calculate proximity between objects that are composed of different types of attributes e.g. quantitative / qualitative 3. how to handle proximity calculation when attributes have different weights eg. when not all attributes contribute equally to the proximity of objects
158
How are distance measures handled when attributes do not have the same range of values?
The Mahalanobis distance is useful when attributes are correlated, have different ranges of values (different variances), and the distribution of the data is approximately Gaussian (normal).
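
A minimal two-attribute sketch of the Mahalanobis distance, sqrt((x − y) Σ⁻¹ (x − y)ᵀ), where Σ is the covariance matrix of the data. The covariance matrices here are supplied by hand as assumptions rather than estimated from data, and some texts omit the square root.

```python
from math import sqrt


def mahalanobis_2d(x, y, cov):
    """Mahalanobis distance between 2-D points given a 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Inverse of a 2x2 matrix.
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx, dy = x[0] - y[0], x[1] - y[1]
    quad = dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)
    return sqrt(quad)


# With an identity covariance the result equals the Euclidean distance.
print(mahalanobis_2d((0, 0), (3, 4), ((1, 0), (0, 1))))

# A variance of 4 on the first attribute halves the effect of differences in it.
print(mahalanobis_2d((0, 0), (2, 0), ((4, 0), (0, 1))))
```

For more than two attributes, a linear algebra library would normally be used to invert the covariance matrix.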
159
What is Gaussian?
The graph of a Gaussian is a characteristic symmetric "bell curve" shape.
160
How are similarity measures handled when attributes are of different types?
Compute the similarity for each attribute separately, and then combine these similarities using a method that results in a similarity between 0 and 1. The overall similarity is typically defined as the average of all the individual attribute similarities.
161
When computing proximity, how do you handle the case where some attributes are more important to the definition of proximity than others?
The formulas for proximity can be modified by weighting the contribution of each attribute.
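
The two cards above can be sketched together: compute one similarity per attribute according to its type, then take a plain or weighted average. The per-type formulas, attribute examples, and weights are illustrative assumptions, and this simplified version does not treat asymmetric attributes specially.

```python
def nominal_similarity(a, b):
    """1 if the values match, otherwise 0."""
    return 1.0 if a == b else 0.0


def ordinal_similarity(rank_a, rank_b, n_levels):
    """Ranks run from 0 to n_levels - 1; adjacent ranks are more similar."""
    return 1.0 - abs(rank_a - rank_b) / (n_levels - 1)


def numeric_similarity(a, b):
    """One possible conversion of a numeric difference to a similarity."""
    return 1.0 / (1.0 + abs(a - b))


def combined_similarity(similarities, weights=None):
    """Weighted average of per-attribute similarities (equal weights by default)."""
    if weights is None:
        weights = [1.0] * len(similarities)
    return sum(w * s for w, s in zip(weights, similarities)) / sum(weights)


similarities = [
    nominal_similarity("blue", "blue"),     # eye colour
    ordinal_similarity(0, 2, n_levels=3),   # good vs best
    numeric_similarity(1.70, 1.70),         # height in metres
]
print(combined_similarity(similarities))
print(combined_similarity(similarities, weights=[1, 2, 1]))
```

Dividing by the sum of the weights keeps the combined value between 0 and 1 whenever each per-attribute similarity is.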
162
This type of proximity measure is often used for many types of dense, continuous data.
metric distance measures such as the Euclidean distance
163
Classify the following attribute as binary, discrete, or continuous; also, classify it as qualitative (nominal or ordinal) or quantitative (interval or ratio): Time in terms of AM or PM
Binary, qualitative, ordinal
164
Classify the following attribute as binary, discrete, or continuous; also, classify it as qualitative (nominal or ordinal) or quantitative (interval or ratio): Brightness as measured by a light meter.
Continuous, quantitative, ratio
165
Classify the following attribute as binary, discrete, or continuous; also, classify it as qualitative (nominal or ordinal) or quantitative (interval or ratio): Brightness as measured by people’s judgments.
Discrete, qualitative, ordinal
166
Classify the following attribute as binary, discrete, or continuous; also, classify it as qualitative (nominal or ordinal) or quantitative (interval or ratio): Angles as measured in degrees between 0° and 360°.
Continuous, quantitative, ratio
167
Classify the following attribute as binary, discrete, or continuous; also, classify it as qualitative (nominal or ordinal) or quantitative (interval or ratio): Bronze, Silver, and Gold medals as awarded at the Olympics.
Discrete, qualitative, ordinal
168
Classify the following attribute as binary, discrete, or continuous; also, classify it as qualitative (nominal or ordinal) or quantitative (interval or ratio): Height above sea level.
Continuous, quantitative, interval/ratio (depends on whether sea level is regarded as an arbitrary origin)
169
Classify the following attribute as binary, discrete, or continuous; also, classify it as qualitative (nominal or ordinal) or quantitative (interval or ratio): Number of patients in a hospital.
Discrete, quantitative, ratio
170
Classify the following attribute as binary, discrete, or continuous; also, classify it as qualitative (nominal or ordinal) or quantitative (interval or ratio): ISBN numbers for books. (Look up the format on the Web.)
Discrete, qualitative, nominal (ISBN numbers do have order information, though).
171
Classify the following attribute as binary, discrete, or continuous; also, classify it as qualitative (nominal or ordinal) or quantitative (interval or ratio): Ability to pass light in terms of the following values: opaque, translucent, transparent.
Discrete, qualitative, ordinal
172
Classify the following attribute as binary, discrete, or continuous; also, classify it as qualitative (nominal or ordinal) or quantitative (interval or ratio): Military rank.
Discrete, qualitative, ordinal
173
Classify the following attribute as binary, discrete, or continuous; also, classify it as qualitative (nominal or ordinal) or quantitative (interval or ratio): Distance from the center of campus.
Continuous, quantitative, interval/ratio (depends)
174
Classify the following attribute as binary, discrete, or continuous; also, classify it as qualitative (nominal or ordinal) or quantitative (interval or ratio): Coat check number.
Discrete, qualitative, nominal
175
Can you think of a situation in which identification numbers would be useful for prediction?
One example: Student IDs are a good predictor of graduation date.
176
Which of the following quantities is likely to show more temporal autocorrelation: daily rainfall or daily temperature? Why?
Daily temperature. A feature shows temporal autocorrelation if measurements that are closer together in time are more similar than measurements that are farther apart. Temperature usually changes gradually from one day to the next, whereas the amount of rainfall can change abruptly, so daily temperature shows more temporal autocorrelation than daily rainfall. The same reasoning applies to spatial autocorrelation: physically close locations are more likely to have similar temperatures than similar amounts of rainfall, since rainfall can be very localized.
177
Give at least two advantages to working with data stored in text files instead of in a binary format.
(1) Text files can be easily inspected by typing the file or viewing it with a text editor. (2) Text files are more portable than binary files, both across systems and programs. (3) Text files can be more easily modified, for example, using a text editor or Perl.
178
Distinguish between noise and outliers. Be sure to consider the following questions. Is noise ever interesting or desirable? Outliers?
Noise: no, by definition. Outliers: yes. (See Chapter 10.)
179
Distinguish between noise and outliers. Be sure to consider the following questions. Can noise objects be outliers?
Yes. Random distortion of the data is often responsible for outliers.
180
Are noise objects always outliers?
No. Random distortion can result in an object or value much like a normal one.
181
Are outliers always noise objects?
No. Often outliers merely represent a class of objects that are different from normal objects.
182
Can noise make a typical value into an unusual one, or vice versa?
Yes.
183
What is the range of values that are possible for the cosine measure?
In general, the range is [-1, 1]. Many times the data has only non-negative entries, and in that case the range is [0, 1].
184
If two objects have a cosine measure of 1, are they identical? Explain.
Not necessarily. All we know is that the values of their attributes differ by a constant factor.
185
Proximity is typically defined between a pair of objects. How might you define the distance between two sets of points in Euclidean space?
One approach is to compute the distance between the centroids of the two sets of points.
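
A short sketch of the centroid approach, using the standard library's `math.dist` and two made-up sets of 2-D points. The function names are hypothetical.

```python
from math import dist


def centroid(points):
    """Coordinate-wise mean of a non-empty collection of points."""
    n = len(points)
    return tuple(sum(coordinates) / n for coordinates in zip(*points))


def centroid_distance(set_a, set_b):
    """Euclidean distance between the centroids of two sets of points."""
    return dist(centroid(set_a), centroid(set_b))


set_a = [(0, 0), (2, 0)]
set_b = [(1, 3), (1, 5)]
print(centroid_distance(set_a, set_b))
```

Other definitions are possible, such as the minimum or the average distance over all pairs of points drawn from the two sets.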
186
Explain why computing the proximity between two attributes is often simpler than computing the similarity between two objects.
In general, an object can be a record whose fields (attributes) are of different types. To compute the overall similarity of two objects in this case, we need to decide how to compute the similarity for each attribute and then combine these similarities. In contrast, the values of an attribute are all of the same type, and thus, if another attribute is of the same type, then the computation of similarity is conceptually and computationally straightforward.
187
What is the curse of dimensionality?
the phenomenon that many types of data analysis become significantly harder as the dimensionality of the data increases. As the dimensionality of the data increases, the data becomes increasingly sparse in the space that it occupies.