Chapter 2 - Data Flashcards

Question

What type of attribute is the following: age

Answer 1

Nominal and ordinal

Answer 2

They lack most of the properties of number (even if they are represented as numbers) and should be treated more like symbols. Also known as categorical attributes.

Answer 3

They lack most of the properties of number (even if they are represented as numbers) and should be treated more like symbols. Also known as qualitative attributes.

Answer 4

Interval and ratio

Answer 5

They are represented by numbers and have most of the properties of numbers. Also known as numeric attributes.

Answer 6

They are represented by numbers and have most of the properties of numbers. Also known as quantitative attributes.

Answer 7

The meaning of a length attribute is unchanged if a different measurement scale is used. For example, switching between the metric and the imperial system does not change the length itself.

Answer 8

A number or symbol multiplied with a variable or an unknown quantity in an algebraic term. For example, 4 is the coefficient in the term 4x, and x is the coefficient in x(a + b).

Answer 9

Discrete - has a finite or countably infinite set of values. Continuous - values are real numbers.

Answer 10

In mathematics, a real number is a value that represents a quantity along a continuous line

Answer 11

- has a finite or countably infinite set of values. - often represented using integer variables. - binary attributes are a special case of discrete attributes

Answer 12

- values are real numbers - are typically represented as floating-point variables - practically, can only be measured and represented with limited precision

Answer 13

b) binary or discrete

Answer 14

a) continuous

Answer 15

- only presence (a non-zero attribute value) is important - eg. whether or not a student took a particular course

Answer 16

- binary attributes where only non-zero values are important

Answer 17

1. Dimensionality 2. Sparsity 3. Resolution

Answer 18

- the number of attributes that the objects in the data set possess

Answer 19

- the amount of zero values within the data

Answer 20

- frequently possible to obtain data at different levels of resolution - properties of the data are different at different resolutions - eg. the surface of the earth: flat vs bumpy

Answer 21

1. Transaction or Market Basket Data 2. The Data Matrix 3. Sparse Data Matrix

Answer 22

1. Data with Relationships among Objects 2. Data with Objects that are Graphs

Answer 23

1. Sequential Data 2. Sequence Data 3. Time Series Data 4. Spatial Data

Answer 24

- each record (transaction) involves a set of items eg. items purchased at a grocery store - fields are typically asymmetric attributes (most often binary)

Answer 25

- if the data objects in a collection all have the same fixed set of numeric attributes, the data objects can be thought of as points (vectors) in a multidimensional space where each dimension represents a distinct attribute describing the object. - can be interpreted as an m by n matrix where there are m rows, one for each object, and n columns, one for each attribute. - standard matrix operations can be applied to transform and manipulate the data

Answer 26

- a special case of a data matrix in which the attributes are of the same type and are asymmetric (only non-zero values are important). eg. document data

Answer 27

- data objects are mapped to nodes of the graph while the relationships among objects are captured by the links between objects and link properties such as direction and weight. eg. web pages on the world wide web

Answer 28

- the objects contain sub-objects that have relationships eg. the structure of chemical compounds

Answer 29

- an extension of record data where each record has a time associated with it eg. retail transactions

Answer 30

- a special type of sequential data in which each record is a time series (series of measurements taken over time) eg. the average monthly temperature of a city over a range of dates

Answer 31

- some objects have spatial attributes such as positions eg. weather data (precipitation, temperature, pressure) collected from a variety of geographical locations

Answer 32

- especially applies to time series data - if two measurements are close in time, then the values of those measurements are often very similar.

Answer 33

- especially applies to spatial data - objects that are physically close tend to be similar

Answer 34

- Record-oriented techniques can be applied to non-record data by 1. extracting features from data objects 2. using these features to create a record corresponding to each object.

Answer 35

It does not capture all of the information in the data.

Answer 36

1. the detection and correction of data quality problems 2. the use of algorithms that can tolerate poor data quality

Answer 37

The detection and correction of data quality problems.

Answer 38

1. Human error 2. Limitations of measuring devices 3. Flaws in the data collection process

Answer 39

- refers to any problem resulting from the measurement process - eg. the record differs from the true value to some extent

Answer 40

the numerical difference of the measured and true value

Answer 41

errors such as: omitting data objects or attribute values or inappropriately including a data object

Answer 42

The distortion of a value or the addition of spurious objects

Answer 43

algorithms that produce acceptable results even when noise is present

Answer 44

deterministic distortions of the data eg. a streak in a photograph

Answer 45

the closeness of repeated measurements (of the same quantity) to one another eg. weighing the same object 5 times on a scale - the precision of the scale is the standard deviation of the 5 measurements

Answer 46

a systematic variation of measurements from the quantity being measured eg. weighing the same object of known mass 5 times on a scale - the bias is the difference between the mean of the 5 measurements and the known true value
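
A minimal sketch of how precision and bias could be computed for repeated weighings. The five readings and the 1 g reference mass are illustrative assumptions, and the sample standard deviation is assumed as the measure of spread.

```python
import statistics

# Hypothetical repeated weighings (in grams) of a reference mass assumed to be exactly 1 g.
true_value = 1.0
measurements = [1.015, 0.990, 1.013, 1.001, 0.986]

mean_measurement = statistics.mean(measurements)

# Bias: systematic offset of the mean measurement from the known true value.
bias = mean_measurement - true_value

# Precision: spread of the repeated measurements (sample standard deviation).
precision = statistics.stdev(measurements)

print(f"bias = {bias:.3f}, precision = {precision:.3f}")
```

Bias can only be computed when the true value is known from some external source, which is why a reference mass is assumed here.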

Answer 47

the Standard Deviation is a measure of how spread out values are. It is represented by the Greek letter sigma (σ) and is the square root of the Variance.

Answer 48

The closeness of measurements to the true value of the quantity being measured

Answer 49

the goal is to use only as many digits to represent the result of a measurement or calculation as are justified by the precision of the data

Answer 50

1. data objects that have characteristics that are different from most of the other data objects in the data set 2. values of an attribute that are unusual with respect to the typical values for that attribute

Answer 51

Outliers can be legitimate data objects or values and, unlike noise, may be of interest.

Answer 52

1. Eliminate data objects or attributes 2. Estimate Missing Values 3. Ignore the missing values during analysis

Answer 53

Even a partial data object contains some information, and the analysis may be unreliable if many values are missing

Answer 54

- missing data can sometimes be reliably estimated - can use interpolation

Answer 55

- many data mining approaches can be modified to ignore missing values

Answer 56

An estimation of a value within two known values in a sequence of values
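
A minimal sketch of linear interpolation, one simple way to estimate a missing value between two known neighbours. The function name and the temperature readings are hypothetical.

```python
def linear_interpolate(x0, y0, x1, y1, x):
    """Estimate y at x on the straight line through (x0, y0) and (x1, y1)."""
    if x0 == x1:
        raise ValueError("x0 and x1 must differ")
    fraction = (x - x0) / (x1 - x0)
    return y0 + fraction * (y1 - y0)


# Hypothetical hourly temperature readings with the 11:00 value missing.
estimate = linear_interpolate(10, 18.0, 12, 22.0, 11)
print(estimate)  # 20.0
```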

Answer 57

When you can tell the data is not right - eg. a negative height - a zip code that doesn't belong to a city

Answer 58

Sometimes. For example, credit cards or product codes might have "check" digits. The correction of an inconsistency requires consistent or redundant information.

Answer 59

1. If there are two objects that actually represent a single object, any differing attribute values need to be resolved 2. Care needs to be taken to avoid accidentally combining data objects that are similar but not duplicates, such as two distinct people with identical names.

Answer 60

The process of dealing with duplicates

Answer 61

1. Timeliness - data can age as soon as it is collected 2. Relevance - the data must contain the information necessary for its application 3. Knowledge about the data - data is normally accompanied by documentation that can affect analysis eg. a missing value is represented by -9999

Answer 62

1. Feature subset selection 2. Dimensionality reduction 3. Feature creation 4. Discretization and binarization 5. Aggregation 6. Sampling 7. Variable transformation

Answer 63

- a data mining preprocessing step - combining two or more objects into a single object eg. combining all of the individual transactions of each store in a chain into a single aggregated transaction for that store

Answer 64

- smaller data sets require less memory and processing time and may permit the use of more expensive data mining algorithms - can provide a high-level view of the data - the behaviour of groups of objects or attributes is often more stable than that of individual objects or attributes

Answer 65

The potential loss of interesting details

Answer 66

- a data mining preprocessing step - selecting a subset of the data objects to be analyzed

Answer 67

When the sample has approximately the same property (of interest) as the original set of data. For example, if the mean (average) is the property of interest, the sample is representative if it has approximately the same mean as the original data set.

Answer 68

1. simple random sampling 2. stratified sampling

Answer 69

There is an equal probability of selecting any particular item. There are 2 variations: 1) sampling without replacement (as each item is selected, it is removed from the population) 2) sampling with replacement (objects are not removed from the population and can be selected more than once)
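
A minimal sketch of the two variations using Python's standard `random` module. The population of ten labelled items and the seed are arbitrary assumptions for illustration.

```python
import random

population = list(range(1, 11))  # hypothetical items labelled 1..10
rng = random.Random(42)          # fixed seed so the run is repeatable

# Without replacement: each item can appear at most once in the sample.
without_replacement = rng.sample(population, k=5)

# With replacement: the same item may be drawn more than once.
with_replacement = rng.choices(population, k=5)

print(without_replacement)
print(with_replacement)
```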

Answer 70

Starts with prespecified groups of objects There are two variations: 1) equal numbers of objects are drawn from each group even though the groups are different sizes 2) number of objects drawn from each group is proportional to the size of that group
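
A minimal sketch of both variations of stratified sampling. The function name `stratified_sample`, its parameters, and the 80/20 data set are hypothetical; with proportional allocation, rounding means the total drawn may differ slightly from the requested size.

```python
import random
from collections import defaultdict


def stratified_sample(records, group_of, total, rng, proportional=True):
    """Draw roughly `total` records, sampling separately within each group."""
    groups = defaultdict(list)
    for record in records:
        groups[group_of(record)].append(record)

    sample = []
    for members in groups.values():
        if proportional:
            # Group share of the sample matches its share of the data.
            k = round(total * len(members) / len(records))
        else:
            # Same number from every group, regardless of group size.
            k = total // len(groups)
        k = min(k, len(members))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical data: 80 records in group "A" and 20 in group "B".
records = [("A", i) for i in range(80)] + [("B", i) for i in range(20)]
rng = random.Random(0)

proportional_sample = stratified_sample(records, lambda r: r[0], 10, rng)
equal_sample = stratified_sample(records, lambda r: r[0], 10, rng, proportional=False)
```

With these inputs the proportional variant draws 8 records from "A" and 2 from "B", while the equal variant draws 5 from each.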

Answer 71

Starts with a small sample and then increases the sample size until a sample of sufficient size has been obtained

Answer 72

A way to evaluate the sample to judge if it is large enough

Answer 73

Techniques that reduce the dimensionality (number of attributes) in a data set

Answer 74

- many data mining algorithms work better if the number of attributes in the data is lower - it can eliminate irrelevant features and reduce noise - can make data more easily visualized

Answer 75

is a linear algebra technique for dimensionality reduction that is related to the Principal Components Analysis (PCA) technique.

Answer 76

A way of reducing the dimensionality of data by using only a subset of the features.

Answer 77

Information is not lost if redundant and irrelevant features are present.

Answer 78

features that duplicate much or all of the information contained in one or more other attributes eg. the purchase price of a product, and the amount of sales tax paid

Answer 79

contain almost no useful information for the data mining task at hand

Answer 80

1. Embedded approaches - the data mining algorithm itself decides which attributes to use and which to ignore 2. Filter approaches - features are selected before the data mining algorithm is run 3. Wrapper approaches - use the data mining algorithm as a black box to find the best subset of attributes, typically without enumerating all possible subsets

Answer 81

- an alternative to keeping or eliminating features - more important features are assigned a higher weight

Answer 82

- create, from the original attributes, a new set of attributes that captures the important information in a data set much more effectively

Answer 83

1. Feature extraction - the creation of a new set of features from the original raw data 2. Mapping the data to a new space - provides a new view of the data - techniques such as a Fourier transform or wavelet transform 3. Feature construction - one or more new features are constructed out of the original features that are more useful than the original features

Answer 84

transforming a continuous attribute into a categorical attribute

Answer 85

transforming both continuous and discrete attributes into one or more binary attributes

Answer 86

1. uniquely assign each value to an integer 2. convert each of these integers into a binary number
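
The two steps above can be sketched as follows. The function name `binarize` and the five category labels are hypothetical, and integers are assigned in order of first appearance, which is only one possible choice.

```python
def binarize(values):
    """Map each distinct value to an integer, then to a list of binary digits."""
    # Step 1: assign each distinct value a unique integer (order of first appearance).
    categories = list(dict.fromkeys(values))
    to_int = {category: i for i, category in enumerate(categories)}

    # Number of bits needed to represent the largest integer (at least one bit).
    n_bits = max(1, (len(categories) - 1).bit_length())

    # Step 2: convert each integer to a fixed-width binary number.
    return {
        category: [int(bit) for bit in format(i, f"0{n_bits}b")]
        for category, i in to_int.items()
    }


codes = binarize(["awful", "poor", "OK", "good", "great"])
print(codes["OK"])     # [0, 1, 0]
print(codes["great"])  # [1, 0, 0]
```

Five categories need three binary attributes here. A design caveat: this compact encoding can introduce spurious relationships between the resulting binary attributes, so one binary attribute per category is sometimes used instead.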

Answer 87

1. decide how many categories to have 2. map the values of the continuous attribute to these categories

Answer 88

If class information is used, the discretization is supervised; when no class information is used, it is unsupervised.

Answer 89

- divides the range of the attributes into a user-specified number of intervals, each having the same width - can be badly affected by outliers

Answer 90

- preferred over equal width approach - tries to put the same number of objects into each interval
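
A minimal sketch contrasting equal width and equal frequency discretization on a small data set with one outlier. The function names and data are hypothetical, and the equal width version assumes the values are not all identical.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of identical width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins  # assumes hi > lo
    # The maximum value would fall just past the last interval, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def equal_frequency_bins(values, n_bins):
    """Assign values to bins so each bin holds about the same number of objects."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, index in enumerate(order):
        bins[index] = rank * n_bins // len(values)
    return bins


data = [1, 2, 3, 4, 5, 6, 7, 100]  # 100 is an outlier

print(equal_width_bins(data, 4))      # [0, 0, 0, 0, 0, 0, 0, 3]
print(equal_frequency_bins(data, 4))  # [0, 0, 1, 1, 2, 2, 3, 3]
```

The single outlier stretches the range so that equal width puts almost everything in the first interval, while equal frequency still spreads the objects evenly.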

Answer 91

1. equal width 2. equal frequency 3. K-means

Answer 92

Entropy is best understood as a measure of uncertainty rather than certainty, since entropy is larger for more random sources. The source is also characterized by the probability distribution of the samples drawn from it. The idea here is that the less likely an event is, the more information it provides when it occurs.

Answer 93

1. Start by bisecting the initial values so that the resulting two intervals give minimum entropy. 2. The splitting process is then repeated with another interval, typically the one with the worst (highest) entropy, until a user-specified number of intervals is reached or a stopping criterion is satisfied.
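
A minimal sketch of step 1, assuming each value comes with a class label and that the entropy of a split is the size-weighted average of the two intervals' entropies. The function names, the midpoint thresholds, and the data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of the class labels in one interval."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted_entropy) for the lowest-entropy bisection."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        if best is None or weighted < best[1]:
            best = (threshold, weighted)
    return best


values = [1, 2, 3, 10, 11, 12]
labels = ["a", "a", "a", "b", "b", "b"]
print(best_split(values, labels))  # (6.5, 0.0)
```

Step 2 would apply `best_split` again to whichever resulting interval has the highest entropy, stopping once the desired number of intervals is reached.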

Answer 94

a mathematical transformation that is applied to all values of a variable.

Answer 95

another type of variable transformation to make an entire set of values have a particular property
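
One common instance is standardization, which gives a set of values a mean of 0 and a standard deviation of 1. A minimal sketch, assuming the population standard deviation is used and that the values are not all identical; the function name and data are hypothetical.

```python
import statistics


def standardize(values):
    """Rescale values to have mean 0 and (population) standard deviation 1."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # assumes std > 0
    return [(v - mean) / std for v in values]


z_scores = standardize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(z_scores)
```

Because the mean and standard deviation are sensitive to outliers, the median and the absolute deviation are sometimes substituted.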

Answer 96

they are used by a number of data mining techniques such as clustering, nearest neighbor classification, and anomaly detection.

Answer 97

- used to either refer to similarity or dissimilarity - the proximity between two objects is a function of the proximity between the corresponding attributes of the two objects

Answer 98

- The similarity between two objects is a numerical measure of the degree to which the two objects are alike. - Similarities are higher for pairs of objects that are more alike - usually not negative and between 0 (no similarity) and 1 (complete similarity)

Answer 99

- The dissimilarity between two objects is a numerical measure of the degree to which two objects are different - dissimilarities are lower for more similar pairs of objects - the term distance is used as a synonym for dissimilarity - sometimes fall between 0 and 1, but also common to range from 0 to infinity

Answer 100

- to convert dissimilarities to similarities and vice versa - or to transform the values of a proximity measure to a new scale
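
A minimal sketch of two such transformations. Both formulas are common choices rather than the only options, and the function names and distances are hypothetical: `1 / (1 + d)` maps a distance in [0, infinity) to a similarity in (0, 1], and min-max rescaling maps values onto [0, 1] so that `1 - d` can serve as a similarity.

```python
def similarity_from_distance(d):
    """Map a distance in [0, inf) to a similarity in (0, 1]."""
    return 1.0 / (1.0 + d)


def rescale_to_unit_interval(values):
    """Min-max rescale values onto [0, 1] (assumes they are not all equal)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


distances = [0, 1, 10, 100]

similarities = [similarity_from_distance(d) for d in distances]
linear_similarities = [1 - d for d in rescale_to_unit_interval(distances)]

print(similarities)
print(linear_similarities)
```

The first transformation is nonlinear and compresses large distances toward 0, so the choice depends on whether the relative spacing of the original values matters.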

Answer 101

The straight-line distance between two points on a plane. Euclidean distance, or distance "as the crow flies," can be calculated using the Pythagorean theorem.

Answer 102

a matrix (two-dimensional array) containing the distances, taken pairwise, between a set of points. This matrix will have a size of N×N where N is the number of points, nodes or vertices (often in a graph).
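
A minimal sketch that builds an N×N matrix of pairwise Euclidean distances. The four 2-D points and the function names are hypothetical.

```python
import math


def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def distance_matrix(points):
    """N x N matrix whose entry [i][j] is the distance between points i and j."""
    return [[euclidean(p, q) for q in points] for p in points]


points = [(0, 2), (2, 0), (3, 1), (5, 1)]
matrix = distance_matrix(points)

for row in matrix:
    print([round(d, 3) for d in row])
```

The diagonal is all zeros and the matrix is symmetric, so in practice only the upper or lower triangle needs to be stored.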

Answer 103

1. Positivity a) d(x,y) >= 0 for all x and y b) d(x,y) = 0 only if x = y 2. Symmetry d(x,y) = d(y,x) for all x and y 3. Triangle Inequality d(x,z) <= d(x,y) + d(y,z) for all points x, y and z

Answer 104

- Similarity measures between objects that contain only binary attributes

Answer 105

- They typically have values between 0 and 1. - A value of 1 indicates that the two objects are completely similar, while a value of 0 indicates that the objects are not at all similar.

Answer 106

- a type of similarity coefficient SMC = (number of matching attribute values) / (number of attributes) = (f11 + f00) / (f01 + f10 + f11 + f00)
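
A minimal sketch of the SMC for two binary vectors of equal length. The function name and the example vectors are hypothetical.

```python
def simple_matching_coefficient(x, y):
    """Fraction of attributes on which two binary vectors agree (1-1 or 0-0)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    matches = sum(1 for a, b in zip(x, y) if a == b)  # f11 + f00
    return matches / len(x)


x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(simple_matching_coefficient(x, y))  # 0.7
```

Here f00 = 7 and f11 = 0, so the score of 0.7 comes entirely from shared absences, which is why the SMC is a poor fit for asymmetric binary attributes.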

Answer 107

The Simple Matching Coefficient (SMC)

Answer 108

Simple Matching Coefficient

Answer 109

Simple Matching Coefficient, similarity coefficient

Answer 110

Jaccard coefficient

Answer 111

- a similarity measure that is frequently used to handle objects consisting of asymmetric binary attributes. - Often symbolized by J

Answer 112

J = (number of matching presences) / (number of attributes not involved in 00 matches) = f11 / (f01 + f10 + f11)
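
A minimal sketch of the Jaccard coefficient for two binary vectors. The function name and vectors are hypothetical, and returning 0.0 when both vectors are all zeros is an assumption, since the formula is undefined in that case.

```python
def jaccard(x, y):
    """f11 / (f01 + f10 + f11) for two binary vectors of equal length."""
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    not_00 = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    # Assumption: define J as 0.0 when there are no presences at all.
    return f11 / not_00 if not_00 else 0.0


x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(jaccard(x, y))  # 0.0
```

The two vectors share no presences, so J is 0 even though they agree on seven 0-0 positions.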

Answer 113

ignores 0-0 matches like the Jaccard measure, but also handles non-binary vectors
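
The measure described on this card is cosine similarity, the dot product of two vectors divided by the product of their lengths. A minimal sketch, assuming neither vector is all zeros; the function name and the two term-count vectors are hypothetical.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


# Hypothetical term-count vectors for two documents.
doc1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
doc2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(round(cosine_similarity(doc1, doc2), 3))  # 0.315
```

Positions where both vectors are 0 contribute nothing to the dot product or to either length, which is how 0-0 matches are ignored.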

Answer 114

cosine similarity

Answer 115

cosine similarity

Answer 116

Tanimoto coefficient

Answer 117

- the correlation between two data objects that have binary or continuous variables is a measure of the linear relationship between the attributes of the objects.

Answer 118

Correlation is always in the range -1 to 1. A correlation of 1 ( -1 ) means that x and y have a perfect positive (negative) linear relationship; that is, xk = ayk + b, where a and b are constants. If the correlation is 0, then there is no linear relationship between the attributes of the two data objects.
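
A minimal sketch of Pearson's correlation showing the three cases: perfect negative, perfect positive, and no linear relationship. The function name and vectors are hypothetical, and the sketch assumes neither vector is constant.

```python
import math


def pearson_correlation(x, y):
    """Pearson correlation of two equal-length, non-constant vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


negative = pearson_correlation([-3, 6, 0, 3, -6], [1, -2, 0, -1, 2])
positive = pearson_correlation([3, 6, 0, 3, 6], [1, 2, 0, 1, 2])
# y = x squared: a perfect nonlinear relationship, yet the correlation is 0.
nonlinear = pearson_correlation([-3, -2, -1, 0, 1, 2, 3], [9, 4, 1, 0, 1, 4, 9])

print(round(negative, 6), round(positive, 6), round(nonlinear, 6))
```

The last case is a reminder that a correlation of 0 rules out only a linear relationship, not a relationship of any kind.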

Answer 119

- a family of proximity functions that share some common properties - are loss or distortion functions - can be used as dissimilarity functions

Answer 120

1. how to handle the case in which attributes have different scales and/or are correlated 2. how to calculate proximity between objects that are composed of different types of attributes eg. quantitative / qualitative 3. how to handle proximity calculation when attributes have different weights eg. when not all attributes contribute equally to the proximity of objects

Answer 121

The Mahalanobis distance is useful when attributes are correlated, have different ranges of values (different variances), and the distribution of the data is approximately Gaussian (normal).

Answer 122

The graph of a Gaussian is a characteristic symmetric "bell curve" shape.

Answer 123

Compute the similarity between each attribute separately, and then combine these similarities using a method that results in a similarity between 0 and 1. The overall similarity is defined as the average of all the individual attribute similarities.
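
A minimal sketch of this averaging approach for a record with one nominal and one numeric attribute. The per-attribute formulas (exact match for nominal values, one minus the range-scaled difference for numeric values), the function names, and the example records are all assumptions for illustration.

```python
def nominal_similarity(a, b):
    """1 if the two nominal values match, otherwise 0."""
    return 1.0 if a == b else 0.0


def numeric_similarity(a, b, value_range):
    """1 minus the absolute difference scaled by the attribute's range."""
    return 1.0 - abs(a - b) / value_range


def overall_similarity(attribute_similarities):
    """Average of the per-attribute similarities, each assumed to lie in [0, 1]."""
    return sum(attribute_similarities) / len(attribute_similarities)


# Hypothetical records: (eye colour, age in years), with ages assumed to span 100 years.
person_a = ("brown", 30)
person_b = ("brown", 40)

combined = overall_similarity([
    nominal_similarity(person_a[0], person_b[0]),
    numeric_similarity(person_a[1], person_b[1], value_range=100),
])
print(round(combined, 2))  # 0.95
```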

Answer 124

The formulas for proximity can be modified by weighting the contribution of each attribute.

Answer 125

metric distance measures such as the Euclidean distance

Answer 126

Binary, qualitative, ordinal

Answer 127

Continuous, quantitative, ratio

Answer 128

Discrete, qualitative, ordinal

Answer 129

Continuous, quantitative, ratio

Answer 130

Discrete, qualitative, ordinal

Answer 131

Discrete, quantitative, ratio

Answer 132

Discrete, qualitative, nominal (ISBN numbers do have order information, though).

Answer 133

Discrete, qualitative, ordinal

Answer 134

Discrete, qualitative, ordinal

Answer 135

Continuous, quantitative, interval/ratio (depends)

Answer 136

Discrete, qualitative, nominal

Answer 137

One example: Student IDs are a good predictor of graduation date.

Answer 138

A feature shows spatial autocorrelation if locations that are closer to each other are more similar with respect to the values of that feature than locations that are farther away. It is more common for physically close locations to have similar temperatures than similar amounts of rainfall, since rainfall can be very localized, i.e., the amount of rainfall can change abruptly from one location to another. Therefore, daily temperature shows more spatial autocorrelation than daily rainfall.

Answer 139

(1) Text files can be easily inspected by typing the file or viewing it with a text editor. (2) Text files are more portable than binary files, both across systems and programs. (3) Text files can be more easily modified, for example, using a text editor or Perl.

Answer 140

No, by definition. Yes. (See Chapter 10.)

Answer 141

Yes. Random distortion of the data is often responsible for outliers.

Answer 142

No. Random distortion can result in an object or value much like a normal one.

Answer 143

No. Often outliers merely represent a class of objects that are different from normal objects.

Answer 144

Many times the data has only positive entries and in that case the range is [0, 1].

Answer 145

Not necessarily. All we know is that the values of their attributes differ by a constant factor.

Answer 146

One approach is to compute the distance between the centroids of the two sets of points.
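
A minimal sketch of the centroid approach, assuming Euclidean distance between the centroids. The function names and the two point sets are hypothetical.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of equal-dimension points."""
    n = len(points)
    return tuple(sum(coordinates) / n for coordinates in zip(*points))


def centroid_distance(set_a, set_b):
    """Euclidean distance between the centroids of two sets of points."""
    centre_a, centre_b = centroid(set_a), centroid(set_b)
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(centre_a, centre_b)))


set_a = [(0, 0), (2, 0), (2, 2), (0, 2)]  # centroid (1, 1)
set_b = [(3, 4), (5, 4), (5, 6), (3, 6)]  # centroid (4, 5)
print(centroid_distance(set_a, set_b))  # 5.0
```

This summarizes each set by a single point, so it ignores how spread out the sets are or whether they overlap.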

Answer 147

In general, an object can be a record whose fields (attributes) are of different types. To compute the overall similarity of two objects in this case, we need to decide how to compute the similarity for each attribute and then combine these similarities. In contrast, the values of an attribute are all of the same type, and thus, if another attribute is of the same type, then the computation of similarity is conceptually and computationally straightforward.

Answer 148

the phenomenon that many types of data analysis become significantly harder as the dimensionality of the data increases. As the dimensionality of the data increases, the data becomes increasingly sparse in the space that it occupies.

(187 cards)