Chapter 2 - Data Flashcards
Why is the type of data important to data mining?
The type of data determines which tools and techniques can be used to analyze it.
Why is data quality important?
Improving data quality typically improves the quality of the resulting analysis.
What is a data set?
A collection of data objects
What is a data object?
record, point, vector, pattern, event, case, sample, observation, or entity
What are attributes?
A property or characteristic of an object that may vary from one object to another
What is a **measurement scale** when referring to data mining?
A rule (function) that associates a numerical or symbolic value with an attribute of an object.
Describe the process of measurement when referring to data mining.
Using a measurement scale to associate a value with a particular attribute of a specific object.
What are the 4 properties (operations) of numbers that are typically used to describe attributes?
1. Distinctness: = and !=
2. Order: <, <=, >, and >=
3. Addition: + and -
4. Multiplication: × and /
What are the 4 types of attributes?
1. Nominal
2. Ordinal
3. Interval
4. Ratio
What is a nominal type of attribute?
The values of a nominal attribute are just different names that provide only enough information to distinguish one object from another.
What is an ordinal type of attribute?
The values of an ordinal attribute provide enough information to order objects.
What is an interval type of attribute?
The differences between values of an interval attribute are meaningful (a unit of measurement exists).
What is a ratio type of attribute?
The ratio and differences are both meaningful in a ratio attribute.
What type of attribute is the following: zip codes
Nominal
What type of attribute is the following: employee ID numbers
Nominal
What type of attribute is the following: eye colour
Nominal
What type of attribute is the following: gender
Nominal
What type of attribute is the following: hardness of minerals {good, better, best}
Ordinal
What type of attribute is the following: grades
Ordinal
What type of attribute is the following: street numbers
Ordinal
What type of attribute is the following: calendar dates
Interval
What type of attribute is the following: temperature in Celsius or Fahrenheit
Interval
What type of attribute is the following: monetary quantities
Ratio
What type of attribute is the following: counts
Ratio
What type of attribute is the following: age
Ratio
What type of attribute is the following: mass
Ratio
What type of attribute is the following: length
Ratio
What type of attribute is the following: electric current
Ratio
What two types of attributes are categorical / qualitative?
Nominal and ordinal
What are qualitative attributes?
They lack most of the properties of number (even if they are represented as numbers) and should be treated more like symbols. Also known as categorical attributes.
What are categorical attributes?
They lack most of the properties of number (even if they are represented as numbers) and should be treated more like symbols. Also known as qualitative attributes.
What are the two types of quantitative attributes?
Interval and ratio
What are quantitative attributes?
They are represented by numbers and have most of the properties of numbers. Also known as numeric attributes.
What are numeric attributes?
They are represented by numbers and have most of the properties of numbers. Also known as quantitative attributes.
What are permissible transformations?
A transformation of attribute values that preserves the meaning of the attribute. For example, the meaning of a length attribute is unchanged if a different measurement scale is used: measuring in the metric or the imperial system does not change the length itself.
Given the following transformation, what type of attribute could be used? If all employee ID numbers are reassigned, it would not make a difference
Nominal
Given the following transformation, what type of attribute could be used? An attribute encompassing the notion of good, better, best can be represented equally well by the values 1, 2, 3
Ordinal
Given the following transformation, what type of attribute could be used? The Fahrenheit and Celsius temperature scales differ in their zero value and the size of a degree (unit).
Interval
Given the following transformation, what type of attribute could be used? Length can be measured in meters or feet
Ratio
What is a coefficient?
A number or symbol multiplied with a variable or an unknown quantity in an algebraic term. For example, 4 is the coefficient in the term 4x, and x is the coefficient in x(a + b).
A way of distinguishing between attributes is by the number of values they can take. What are the two types of attributes in this case?
Discrete - has a finite or countably infinite set of values. Continuous - values are real numbers.
What is a real number?
In mathematics, a real number is a value that represents a quantity along a continuous line
What is a discrete attribute?
- has a finite or countably infinite set of values. - often represented using integer variables. - binary attributes are a special case of discrete attributes
What is a continuous attribute?
- values are real numbers - are typically represented as floating-point variables - practically, can only be measured and represented with limited precision
Typically, nominal and ordinal attributes are: a) continuous b) binary or discrete
b) binary or discrete
Typically, interval and ratio attributes are: a) continuous b) binary or discrete
a) continuous
What is an asymmetric attribute?
- only presence (a non-zero attribute value) is important
- eg. whether or not a student took a particular course
What are asymmetric binary attributes?
- binary attributes where only non-zero values are important
What are the general characteristics of data sets?
1. Dimensionality
2. Sparsity
3. Resolution
When describing the general characteristics of data sets, what is dimensionality?
- the number of attributes that the objects in the data set possess
When describing the general characteristics of data sets, what is sparsity?
- the amount of zero values within the data
When describing the general characteristics of data sets, what is resolution?
- frequently possible to obtain data at different levels of resolution - properties of the data are different at different resolutions - eg. the surface of the earth: flat vs bumpy
What are 3 examples of record data, data sets?
1. Transaction or Market Basket Data
2. The Data Matrix
3. The Sparse Data Matrix
What are 2 examples of graph based, data sets?
1. Data with Relationships among Objects
2. Data with Objects that are Graphs
What are 4 examples of ordered data, data sets?
- Sequential Data
- Sequence Data
- Time Series Data
- Spatial Data
What is transaction or market basket data?
- each record (transaction) involves a set of items eg. items purchased at a grocery store - fields are typically asymmetric attributes (most often binary)
What is a data matrix?
- if the data objects in a collection all have the same fixed set of numeric attributes, then the data objects can be thought of as points (vectors) in a multidimensional space where each dimension represents a distinct attribute describing the object. - can be interpreted as an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute. - standard matrix operations can be applied to transform and manipulate the data
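As an illustrative sketch (the data values here are made up), a data matrix can be represented and manipulated with standard matrix operations, for example in NumPy:

```python
import numpy as np

# Hypothetical m x n data matrix: m = 3 objects (rows), n = 2 attributes (columns).
data = np.array([
    [1.5, 2.0],   # object 1
    [3.0, 4.0],   # object 2
    [4.5, 6.0],   # object 3
])

m, n = data.shape              # m objects, n attributes
col_means = data.mean(axis=0)  # per-attribute means
centered = data - col_means    # a standard matrix operation: mean-centering
```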
What is a sparse data matrix?
- a special case of the data matrix in which the attributes are of the same type and are asymmetric (only non-zero values are important). eg. document data
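A minimal sketch of a sparse representation of document data, keeping only the non-zero term counts per document (the documents here are invented):

```python
# Toy document data stored sparsely: each row keeps only non-zero word counts.
docs = [
    "data mining finds patterns in data",
    "sparse matrices store only non zero values",
]

sparse_rows = []
for doc in docs:
    counts = {}
    for word in doc.split():
        counts[word] = counts.get(word, 0) + 1
    sparse_rows.append(counts)
```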
What is Data with Relationships among Objects?
- data objects are mapped to nodes of the graph while the relationships among objects are captured by the links between objects and link properties such as direction and weight. eg. web pages on the world wide web
What is Data with Objects that are Graphs?
- the objects contain sub-objects that have relationships eg. structural and chemical compounds
What is sequential data / temporal data?
- an extension of record data where each record has a time associated to it eg. retail transaction
What is time series data?
- a special type of sequential data in which each record is a time series (series of measurements taken over time) eg. average monthly temperature of a city between range of dates
What is spatial data?
- some objects have spatial attributes such as positions eg. weather data (precipitation, temperature, pressure) collected from a variety of geographical locations
What is temporal autocorrelation?
- especially applies to time series data - if two measurements are close in time, then the values of those measurements are often very similar.
What is spatial autocorrelation?
- especially applies to spatial data - objects that are physically close tend to be similar
How can non-record data be handled?
Record-oriented techniques can be applied to non-record data by:
1. extracting features from the data objects
2. using these features to create a record corresponding to each object
What is the issue with applying record-oriented techniques to non record data?
It does not capture all of the information in the data.
What two principles does data mining focus on when referring to data quality?
1. the detection and correction of data quality problems
2. the use of algorithms that can tolerate poor data quality
What is data cleaning?
The detection and correction of data quality problems.
Name 3 errors that can occur with data collection
- Human error
- Limitations of measuring devices
- Flaws in the data collection process
Define measurement error
- refers to any problem resulting from the measurement process - eg. the recorded value differs from the true value to some extent
Define error (for continuous attributes)
the numerical difference between the measured and true value
Define data collection error
errors such as: omitting data objects or attribute values or inappropriately including a data object
Define noise with respect to data mining
The distortion of a value or the addition of spurious objects
This term is often used in connection with data that has a spatial or temporal component. Techniques from signal or image processing can be used to reduce it.
noise
Define robust algorithms
algorithms that produce acceptable results even when noise is present
Define artifacts with respect to data mining
deterministic distortions of the data eg. a streak in a photograph
Define precision
the closeness of repeated measurements (of the same quantity) to one another
eg. scale - measure the same object 5 times using the scale - the precision of the scale is the standard deviation of the 5 measurements
Define bias
a systematic variation of measurements from the quantity being measured
eg. scale - measure the same object 5 times using the scale - the bias is the difference between the mean of the 5 measurements and the known true value
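The scale example can be sketched numerically (the measurement values are made up for illustration):

```python
# Hypothetical: an object with a known mass of 10.0 g is weighed 5 times.
measurements = [10.2, 10.1, 10.3, 10.2, 10.2]
true_value = 10.0

mean = sum(measurements) / len(measurements)

# Precision: the standard deviation of the repeated measurements.
variance = sum((x - mean) ** 2 for x in measurements) / len(measurements)
precision = variance ** 0.5

# Bias: the difference between the mean of the measurements and the true value.
bias = mean - true_value
```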
This is often measured by the standard deviation of a set of values
precision
Define standard deviation
The standard deviation is a measure of how spread out values are.
It is represented by the Greek letter sigma (σ).
It is the square root of the variance.
This is often measured by taking the difference between the mean of the set of values and the known value of the quantity being measured
bias
This can only be determined for objects whose measured quantity is known by means external to the current situation
Bias
Define Accuracy
The closeness of measurements to the true value of the quantity being measured
Define significant digits with respect to accuracy and data mining
the goal is to use only as many digits to represent the result of a measurement or calculation as are justified by its precision
What are the two definitions of outliers?
1. data objects that have characteristics that are different from most of the other data objects in the data set
2. values of an attribute that are unusual with respect to the typical values for that attribute
What is the difference between outliers and noise?
Outliers can be legitimate data objects or values and, unlike noise, may be of interest.
What 3 things can be done when an object or one or more attribute values are missing from a data set?
1. Eliminate data objects or attributes
2. Estimate missing values
3. Ignore the missing values during analysis
When an object or one or more attribute values are missing from a data set, what is the disadvantage of eliminating the object altogether?
Even a partial data object contains some information, and the analysis may be unreliable if many objects or attribute values are eliminated.
When an object or one or more attribute values are missing from a data set explain estimating missing values.
- missing data can sometimes be reliably estimated - can use interpolation
When an object or one or more attribute values are missing from a data set explain ignoring the missing values during analysis.
- many data mining approaches can be modified to ignore missing values
Define interpolation
An estimation of a value within two known values in a sequence of values
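A minimal sketch of linear interpolation for estimating a missing value between two known neighbours (the numbers are illustrative):

```python
def interpolate(x0, y0, x1, y1, x):
    """Linearly estimate y at x, where x lies between known points (x0, y0) and (x1, y1)."""
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

# eg. a reading was 10.0 at hour 2 and 20.0 at hour 4; estimate the missing hour 3
estimated = interpolate(2, 10.0, 4, 20.0, 3)  # 15.0
```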
Explain inconsistent values
When you can tell the data is not right - eg. a negative height - a zip code that doesn’t belong to a city
Is it possible to correct inconsistent values?
Sometimes. For example, credit cards or product codes might have "check" digits. The correction of an inconsistency requires consistent or redundant information.
What two main issues need to be addressed with duplicate data?
1. If there are two objects that actually represent a single object, any differing attribute values need to be resolved
2. Care needs to be taken to avoid accidentally combining data objects that are similar but not duplicates, such as two distinct people with identical names
Define deduplication
The process of dealing with duplicates
What are three issues related to the application of data quality?
1. Timeliness - data can age as soon as it is collected
2. Relevance - the data must contain the information necessary for its application
3. Knowledge about the data - data is normally accompanied by documentation that can affect analysis, eg. a missing value is represented by -9999
What are 7 topics related to data preprocessing?
Feature subset selection
Dimensionality reduction
Feature creation
Discretization and binarization
Aggregation
Sampling
Variable transformation
Define aggregation with respect to data mining (data preprocessing)
- a data mining preprocessing step - combining two or more objects into a single object eg. combining chain of store transactions into aggregated single store transactions
What are the advantages of aggregation
- smaller data sets require less memory and processing time and may permit the use of more expensive data mining algorithms - can provide a high-level view of the data - the behaviour of groups of objects or attributes is often more stable than that of individual objects or attributes
What is a disadvantage of using aggregation?
The potential loss of interesting details
Define sampling with respect to data mining (data preprocessing)
- a data mining preprocessing step - selecting a subset of the data objects to be analyzed
When is a sample representative with respect to data mining, preprocessing, sampling
- when the sample has approximately the same property (of interest) as the original set of data. For example, if the mean (average) is the property of interest, a sample is representative if it has approximately the same mean as the original data set.
What are 2 common sampling techniques used in data mining
1. simple random sampling
2. stratified sampling
Explain the simple random sampling, sampling technique used in data mining, preprocessing, and the two variations of simple random sampling.
There is an equal probability of selecting any particular item.
There are 2 variations:
1) sampling without replacement (as each item is selected, it is removed from the population)
2) sampling with replacement (objects are not removed from the population and can be selected more than once)
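The two variations can be sketched with Python's standard `random` module (the population and seed are arbitrary choices):

```python
import random

population = list(range(100))
rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Without replacement: each selected item is removed, so no repeats are possible.
without = rng.sample(population, 10)

# With replacement: items stay in the population and may be selected again.
with_repl = [rng.choice(population) for _ in range(10)]
```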
Explain the stratified sampling technique used in data mining, preprocessing, and the two variations of stratified sampling.
Starts with prespecified groups of objects
There are two variations:
1) equal numbers of objects are drawn from each group even though the groups are different sizes
2) number of objects drawn from each group is proportional to the size of that group
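The proportional variation can be sketched as follows, with hypothetical groups of different sizes:

```python
import random

# Hypothetical prespecified groups (strata) of different sizes.
groups = {
    "A": list(range(0, 50)),    # 50 objects
    "B": list(range(50, 80)),   # 30 objects
    "C": list(range(80, 100)),  # 20 objects
}
rng = random.Random(0)

# Variation 2: draw from each group in proportion to its size (10% overall).
fraction = 0.10
sample = []
for members in groups.values():
    k = round(len(members) * fraction)
    sample.extend(rng.sample(members, k))
```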
Explain adaptive / progressive sampling
Starts with a small sample and then increases the sample size until a sample of sufficient size has been obtained
What is required for adaptive / progressive sampling?
A way to evaluate the sample to judge if it is large enough
Explain dimensionality reduction (data preprocessing)
Techniques that reduce the dimensionality (number of attributes) in a data set
What are some of the advantages of dimensionality reduction?
- many data mining algorithms work better if the number of attributes in the data is lower
- it can eliminate irrelevant features and reduce noise
- can make the data more easily visualized
Define Singular Value Decomposition (SVD)
A linear algebra technique for dimensionality reduction that is related to the Principal Components Analysis (PCA) technique.
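A small sketch of dimensionality reduction with SVD using NumPy (the data is synthetic and the choice of k = 2 is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))   # 20 objects, 5 attributes (synthetic)
X = X - X.mean(axis=0)         # centre the data, as in PCA

# Truncated SVD: project onto the top-k right singular vectors.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
X_reduced = X @ Vt[:k].T       # 20 objects, now described by 2 attributes
```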
Define Feature Subset Selection (data preprocessing)
A way of reducing the dimensionality of data by using only a subset of the features.
When using feature subset selection for dimensionality reduction, is information lost?
Ideally not: no information is lost if the discarded features are redundant or irrelevant.
Define redundant features (dimensionality reduction)
features that duplicate much or all of the information contained in one or more other attributes eg. the purchase price of a product, and the amount of sales tax paid
Define irrelevant features (dimensionality reduction, data preprocessing)
contain almost no useful information for the data mining task at hand
What are the 3 standard approaches for feature subset selection (data preprocessing)?
1. Embedded approaches - the data mining algorithm itself decides which attributes to use and which to ignore
2. Filter approaches - features are selected before the data mining algorithm is run
3. Wrapper approaches - use the data mining algorithm as a black box to find the best subset of attributes without enumerating all possible subsets
Define feature weighting
- an alternative to keeping or eliminating features - more important features are assigned a higher weight
Define feature creation (data preprocessing)
- create a new set of attributes from the original attributes that captures the important information in a data set much more effectively
What are 3 methodologies for creating new attributes (feature creation, data preprocessing)
1. Feature extraction - the creation of a new set of features from the original raw data
2. Mapping the data to a new space - provides a new view of the data - techniques such as a Fourier transform or wavelet transform
3. Feature construction - one or more new features, more useful than the original features, are constructed out of the original features
Define discretization (data preprocessing)
transforming a continuous attribute into a categorical attribute
Define binarization (data preprocessing)
transforming both continuous and discrete attributes into one or more binary attributes
Explain how to binarize a categorical attribute
1. uniquely assign each value to an integer
2. convert each of these integers into a binary number
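The two steps can be sketched as follows (the five ordinal values are a made-up example):

```python
import math

# Step 1: uniquely assign each categorical value to an integer.
values = ["awful", "poor", "ok", "good", "great"]
to_int = {v: i for i, v in enumerate(values)}   # awful -> 0, ..., great -> 4

# Step 2: convert each integer into a fixed-width binary number.
width = math.ceil(math.log2(len(values)))       # 3 bits are enough for 5 values

def binarize(value):
    return format(to_int[value], f"0{width}b")
```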
What two subtasks are involved in transforming a continuous attribute to a categorical one (discretization of continuous attributes)
1. decide how many categories to have
2. map the values of the continuous attribute to these categories
What is the difference between unsupervised and supervised discretization
If class information is used, the discretization is supervised; when no class information is used, it is unsupervised.
Describe the equal width discretization approach
- divides the range of the attribute into a user-specified number of intervals, each having the same width - can be badly affected by outliers
Describe the equal frequency (equal depth) discretization approach
- preferred over equal width approach - tries to put the same number of objects into each interval
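Both unsupervised approaches can be sketched in NumPy; note how the outlier distorts the equal width intervals (the values are invented):

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100.0])  # includes an outlier

# Equal width: split [min, max] into 3 intervals of the same width.
width_edges = np.linspace(values.min(), values.max(), 4)
width_labels = np.digitize(values, width_edges[1:-1])

# Equal frequency: pick cut points so each interval holds about the same count.
freq_edges = np.quantile(values, [1 / 3, 2 / 3])
freq_labels = np.digitize(values, freq_edges)
```

With the outlier present, equal width crowds eight of the nine values into the first interval, while equal frequency still yields three intervals of three values each.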
What are 3 common approaches to unsupervised discretization?
1. equal width
2. equal frequency
3. K-means
Define entropy
Entropy is best understood as a measure of uncertainty rather than certainty: entropy is larger for more random sources. A source is characterized by the probability distribution of the samples drawn from it. The idea is that the less likely an event is, the more information it provides when it occurs.
This type of approach to discretization is the most promising
entropy-based
What is a simple approach for partitioning a continous attribute?
- Start by bisecting the initial values so that the resulting two intervals give minimum entropy.
- The splitting process is then repeated with another interval, the one with the worst (highest) entropy, until a user-specified number of intervals is reached or a stopping criterion is satisfied.
What is a variable transformation?
a mathematical transformation that is applied to all values of a variable.
What is standardization or normalization of a variable?
another type of variable transformation to make an entire set of values have a particular property
Why is similarity and dissimilarity important to data mining?
they are used by a number of data mining techniques such as clustering, nearest neighbor classification, and anomaly detection.
Define proximity
- used to either refer to similarity or dissimilarity
- the proximity between two objects is a function of the proximity between the corresponding attributes of the two objects
Define similarity
- The similarity between two objects is a numerical measure of the degree to which the two objects are alike.
- Similarities are higher for pairs of objects that are more alike
- usually not negative and between 0 (no similarity) and 1 (complete similarity)
Define dissimilarity
- The dissimilarity between two objects is a numerical measure of the degree to which two objects are different
- dissimilarities are lower for more similar pairs of objects
- the term distance is used as a synonym for dissimilarity
- sometimes fall between 0 and 1, but also common to range from 0 to infinity
What is a monotonic decreasing function used for in data mining?
- to convert dissimilarities to similarities and vice versa
- or to transform the values of a proximity measure to a new scale
What is the Euclidean distance?
The straight-line distance between two points on a plane. Euclidean distance, or distance “as the crow flies,” can be calculated using the Pythagorean theorem.
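A minimal sketch:

```python
import math

def euclidean(p, q):
    """Straight-line distance between two points (Pythagorean theorem)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

d = euclidean((0, 0), (3, 4))  # the classic 3-4-5 right triangle -> 5.0
```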
What is a distance matrix?
a matrix (two-dimensional array) containing the distances, taken pairwise, of a set of points. This matrix will have a size of N×N, where N is the number of points, nodes, or vertices (often in a graph).
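A sketch of building an N×N Euclidean distance matrix with NumPy broadcasting (the three points are arbitrary):

```python
import numpy as np

points = np.array([[0.0, 0.0], [3.0, 4.0], [0.0, 4.0]])  # N = 3 points

# Pairwise Euclidean distances: the result is an N x N matrix.
diff = points[:, None, :] - points[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))
```

The result is symmetric with a zero diagonal, matching the distance properties listed below.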
What are some well known properties for distances such as the Euclidean distance?
- Positivity
a) d(x,x) >= 0 for all x and y
b) d(x,y) = 0 only if x = y - Symmetry
d(x,y) = d(y,x) for all x and y
- Triangle Inequality
d(x,z) < = d(x,y) + d(y,z) for all points x, y and z
Define similarity coefficients
- Similarity measures between objects that contain only binary attributes
What are the valid range of values for similarity coefficients?
- They typically have values between 0 and 1.
- A value of 1 indicates that the two objects are completely similar, while a value of 0 indicates that the objects are not at all similar.
Define the Simple Matching Coefficient (SMC)
- a type of similarity coefficient
SMC = number of matching attribute values / number of attributes
= f11 + f00
f01 + f10 + f11 + f00
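A sketch of the SMC computed from the frequency counts (the binary vectors are made up):

```python
def smc(x, y):
    """Simple Matching Coefficient between two binary vectors."""
    f11 = sum(a == 1 and b == 1 for a, b in zip(x, y))
    f00 = sum(a == 0 and b == 0 for a, b in zip(x, y))
    return (f11 + f00) / len(x)

s = smc([1, 0, 0, 1], [1, 0, 1, 1])  # 3 matching positions out of 4
```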
This similarity measure counts both presences and absences.
The Simple Matching Coefficient (SMC)
This similarity measure is useful when both positive and negative values carry equal information (symmetry). For example, gender (male and female).
Simple Matching Coefficient
This similarity test could be used to find students who answered questions similarly on a test that consisted only of true and false questions.
Simple Matching Coefficient, similarity coefficient
This similarity measure is frequently used to handle objects consisting of asymmetric binary attributes.
Jaccard coefficient
Define the Jaccard Coefficent
- a similarity measure that is frequently used to handle objects consisting of asymmetric binary attributes.
- Often symbolized by J
What is the formula for the Jaccard Coefficent?
J = number of matching presences / number of attributes not involved in 00 matches
  = f11 / (f01 + f10 + f11)
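A sketch mirroring the formula (the vectors are made up):

```python
def jaccard(x, y):
    """Jaccard coefficient between two binary vectors; 0-0 matches are ignored."""
    f11 = sum(a == 1 and b == 1 for a, b in zip(x, y))
    f10 = sum(a == 1 and b == 0 for a, b in zip(x, y))
    f01 = sum(a == 0 and b == 1 for a, b in zip(x, y))
    return f11 / (f01 + f10 + f11)

j = jaccard([1, 0, 0, 1], [1, 0, 1, 1])  # f11 = 2, f01 = 1, f10 = 0
```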
Define cosine similarity
ignores 0-0 matches like the Jaccard measure, but also handles non-binary vectors
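A sketch showing that cosine similarity ignores magnitude (the vectors are arbitrary):

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two vectors; only direction matters."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# A vector and a scaled copy of it point in the same direction.
c = cosine_similarity([1, 2, 0], [2, 4, 0])
```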
This type of similarity measure is one of the most common measures of document similarity
cosine similarity
This type of similarity measure does not take the magnitude of the two data objects into account when computing similarity
cosine similarity
This coefficient can be used for document data and reduces to the Jaccard coefficient in the case of binary attributes
Tanimoto coefficient
Define correlation
- the correlation between two data objects that have binary or continuous variables is a measure of the linear relationship between the attributes of the objects.
What is the range of correlation?
Correlation is always in the range -1 to 1.
A correlation of 1 ( -1 ) means that x and y have a perfect positive (negative) linear relationship; that is, xk = ayk + b, where a and b are constants.
If the correlation is 0, then there is no linear relationship between the attributes of the two data objects.
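The perfect-linear-relationship cases can be sketched with NumPy (the data is synthetic):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0                 # a perfect linear relationship (a = 2, b = 1)

r_pos = np.corrcoef(x, y)[0, 1]   # perfect positive linear relationship
r_neg = np.corrcoef(x, -y)[0, 1]  # perfect negative linear relationship
```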
Define Bregman Divergences
- a family of proximity functions that share some common properties
- are loss or distortion functions
- can be used as dissimilarity functions
What are 3 issues with proximity calculation?
- how to handle the case in which attributes have different scales and/or are correlated
- how to calculate proximity between objects that are composed of different types of attributes, e.g. quantitative / qualitative
- how to handle proximity calculation when attributes have different weights, eg. when not all attributes contribute equally to the proximity of objects
How are distance measures handled when attributes do not have the same range of values?
The Mahalanobis distance is useful when attributes are correlated, have different ranges of values (different variances), and the distribution of the data is approximately Gaussian (normal).
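A sketch of the Mahalanobis distance on synthetic correlated Gaussian data (the covariance matrix and sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic Gaussian data with correlated attributes of different variances.
X = rng.multivariate_normal([0.0, 0.0], [[3.0, 1.0], [1.0, 2.0]], size=500)

cov_inv = np.linalg.inv(np.cov(X, rowvar=False))

def mahalanobis(p, q):
    """Distance that accounts for correlation and differing variances."""
    d = p - q
    return float(np.sqrt(d @ cov_inv @ d))

dist = mahalanobis(X[0], X[1])
```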
What is Gaussian?
The graph of a Gaussian is a characteristic symmetric “bell curve” shape.
How are similarity measures handled when attributes are of different types?
Compute the similarity between each attribute separately, and then combine these similarities using a method that results in a similarity between 0 and 1.
The overall similarity is defined as the average of all the individual attribute similarities.
When computing proximity, how do you handle the case where some attributes are more important to the definition of proximity than others?
The formulas for proximity can be modified by weighting the contribution of each attribute.

This type of proximity measure is often used for many types of dense, continuous data.
metric distance measures such as the Euclidean distances
Classify the following attribute as binary, discrete, or continuous; also, classify it as qualitative (nominal or ordinal) or quantitative (interval or ratio):
Time in terms of AM or PM
Binary, qualitative, ordinal
Classify the following attribute as binary, discrete, or continuous; also, classify it as qualitative (nominal or ordinal) or quantitative (interval or ratio):
Brightness as measured by a light meter.
Continuous, quantitative, ratio
Classify the following attribute as binary, discrete, or continuous; also, classify it as qualitative (nominal or ordinal) or quantitative (interval or ratio):
Brightness as measured by people’s judgments.
Discrete, qualitative, ordinal
Classify the following attribute as binary, discrete, or continuous; also, classify it as qualitative (nominal or ordinal) or quantitative (interval or ratio):
Angles as measured in degrees between 0◦ and 360◦.
Continuous, quantitative, ratio
Classify the following attribute as binary, discrete, or continuous; also, classify it as qualitative (nominal or ordinal) or quantitative (interval or ratio):
Bronze, Silver, and Gold medals as awarded at the Olympics.
Discrete, qualitative, ordinal
Classify the following attribute as binary, discrete, or continuous; also, classify it as qualitative (nominal or ordinal) or quantitative (interval or ratio):
Height above sea level.
Continuous, quantitative, interval/ratio (depends on whether sea level is regarded as an arbitrary origin)
Classify the following attribute as binary, discrete, or continuous; also, classify it as qualitative (nominal or ordinal) or quantitative (interval or ratio):
Number of patients in a hospital.
Discrete, quantitative, ratio
Classify the following attribute as binary, discrete, or continuous; also, classify it as qualitative (nominal or ordinal) or quantitative (interval or ratio):
ISBN numbers for books. (Look up the format on the Web.)
Discrete, qualitative, nominal (ISBN numbers do have order information, though).
Classify the following attribute as binary, discrete, or continuous; also, classify it as qualitative (nominal or ordinal) or quantitative (interval or ratio):
Ability to pass light in terms of the following values: opaque, translucent, transparent.
Discrete, qualitative, ordinal
Classify the following attribute as binary, discrete, or continuous; also, classify it as qualitative (nominal or ordinal) or quantitative (interval or ratio):
Military rank.
Discrete, qualitative, ordinal
Classify the following attribute as binary, discrete, or continuous; also, classify it as qualitative (nominal or ordinal) or quantitative (interval or ratio):
Distance from the center of campus.
Continuous, quantitative, interval/ratio (depends)
Classify the following attribute as binary, discrete, or continuous; also, classify it as qualitative (nominal or ordinal) or quantitative (interval or ratio): Coat check number.
Discrete, qualitative, nominal
Can you think of a situation in which identification numbers would be useful for prediction?
One example: Student IDs are a good predictor of graduation date.
Which of the following quantities is likely to show more temporal autocorrelation: daily rainfall or daily temperature? Why?
A feature shows temporal autocorrelation if measurements that are closer in time are more similar with respect to the values of that feature than measurements that are farther apart in time. It is more common for temperatures close in time to be similar than for amounts of rainfall to be similar, since rainfall can be very localized; i.e., the amount of rainfall can change abruptly from one time (or location) to another. Therefore, daily temperature shows more temporal autocorrelation than daily rainfall.
Give at least two advantages to working with data stored in text files instead of in a binary format.
(1) Text files can be easily inspected by typing the file or viewing it with a text editor.
(2) Text files are more portable than binary files, both across systems and programs.
(3) Text files can be more easily modified, for example, using a text editor or perl.
Distinguish between noise and outliers. Be sure to consider the following questions.
Is noise ever interesting or desirable? Outliers?
No, by definition. Yes. (See Chapter 10.)
Distinguish between noise and outliers. Be sure to consider the following questions.
Can noise objects be outliers?
Yes. Random distortion of the data is often responsible for outliers.
Are noise objects always outliers?
No. Random distortion can result in an object or value much like a
normal one.
Are outliers always noise objects?
No. Often outliers merely represent a class of objects that are different
from normal objects.
Can noise make a typical value into an unusual one, or vice versa?
Yes.
What is the range of values that are possible for the cosine measure?
In general, the range is [-1, 1]. Many times the data has only positive entries, and in that case the range is [0, 1].
If two objects have a cosine measure of 1, are they identical? Explain.
Not necessarily. All we know is that the values of their attributes differ by a constant factor.
Proximity is typically defined between a pair of objects.
How might you define the distance between two sets of points in Euclidean space?
One approach is to compute the distance between the centroids of the two sets of points.
Explain why computing the proximity between two attributes is often simpler than computing the similarity between two objects.
In general, an object can be a record whose fields (attributes) are of different types. To compute the overall similarity of two objects in this case, we need to decide how to compute the similarity for each attribute and then combine these similarities. In contrast, the values of an attribute are all of the same type, and thus, if another attribute is of the same type, then the computation of similarity is conceptually and computationally straightforward.
What is the curse of dimensionality?
The phenomenon that many types of data analysis become significantly harder as the dimensionality of the data increases.
As the dimensionality of the data increases, the data becomes increasingly sparse in the space that it occupies.
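One way to see this concentration effect is to compare the spread of pairwise distances between random points at low and high dimensionality (a sketch; the sample size and dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(dim, n=200):
    """(max - min) / min over pairwise distances of n random points."""
    X = rng.random((n, dim))
    diff = X[:, None, :] - X[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    d = d[np.triu_indices(n, k=1)]  # keep each distinct pair once
    return (d.max() - d.min()) / d.min()

low_dim = relative_contrast(2)
high_dim = relative_contrast(500)  # distances concentrate as dimension grows
```

At high dimensionality the nearest and farthest points are nearly the same distance away, which is one reason proximity-based techniques degrade.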