Chapter 2 - Data Flashcards
Why is the type of data important to data mining?
The type of data determines which tools and techniques can be used to analyze it.
Why is data quality important?
Improving data quality typically improves the quality of the resulting analysis.
What is a data set?
A collection of data objects
What is a data object?
An object described by a collection of attributes. Also known as a record, point, vector, pattern, event, case, sample, observation, or entity.
What are attributes?
A property or characteristic of an object that may vary from one object to another
What is a **measurement scale** when referring to data mining?
A rule (function) that associates a numerical or symbolic value with an attribute of an object.
Describe the process of measurement when referring to data mining.
Using a measurement scale to associate a value with a particular attribute of a specific object.
What are the 4 properties (operations) of numbers that are typically used to describe attributes?
1. Distinctness: = and ≠ 2. Order: <, ≤, >, and ≥ 3. Addition: + and − 4. Multiplication: × and /
What are the 4 types of attributes?
1. Nominal 2. Ordinal 3. Interval 4. Ratio
What is a nominal type of attribute?
The values of a nominal attribute are just different names that provide only enough information to distinguish one object from another.
What is an ordinal type of attribute?
The values of an ordinal attribute provide enough information to order objects.
What is an interval type of attribute?
The differences between values of an interval attribute are meaningful (a unit of measurement exists).
What is a ratio type of attribute?
The ratio and differences are both meaningful in a ratio attribute.
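Example (a minimal Python sketch; the names `MEANINGFUL_OPS` and `is_meaningful` are illustrative assumptions, not from the chapter). It encodes the idea that each attribute type supports its own operation plus those of the types listed before it:

```python
# Operations that are meaningful for each attribute type.
# Each type keeps the operations of the previous type and adds one more.
MEANINGFUL_OPS = {
    "nominal": {"distinctness"},
    "ordinal": {"distinctness", "order"},
    "interval": {"distinctness", "order", "addition"},
    "ratio": {"distinctness", "order", "addition", "multiplication"},
}


def is_meaningful(attribute_type: str, operation: str) -> bool:
    """Return True if the operation is meaningful for the attribute type."""
    return operation in MEANINGFUL_OPS[attribute_type]


print(is_meaningful("ordinal", "order"))            # True
print(is_meaningful("interval", "multiplication"))  # False
```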
What type of attribute is the following: zip codes
Nominal
What type of attribute is the following: employee ID numbers
Nominal
What type of attribute is the following: eye colour
Nominal
What type of attribute is the following: gender
Nominal
What type of attribute is the following: hardness of minerals, or a rating such as {good, better, best}
Ordinal
What type of attribute is the following: grades
Ordinal
What type of attribute is the following: street numbers
Ordinal
What type of attribute is the following: calendar dates
Interval
What type of attribute is the following: temperature in Celsius or Fahrenheit
Interval
What type of attribute is the following: monetary quantities
Ratio
What type of attribute is the following: counts
Ratio
What type of attribute is the following: age
Ratio
What type of attribute is the following: mass
Ratio
What type of attribute is the following: length
Ratio
What type of attribute is the following: electric current
Ratio
What two types of attributes are categorical / qualitative?
Nominal and ordinal
What are qualitative attributes?
They lack most of the properties of number (even if they are represented as numbers) and should be treated more like symbols. Also known as categorical attributes.
What are categorical attributes?
They lack most of the properties of number (even if they are represented as numbers) and should be treated more like symbols. Also known as qualitative attributes.
What are the two types of quantitative attributes?
Interval and ratio
What are quantitative attributes?
They are represented by numbers and have most of the properties of numbers. Also known as numeric attributes.
What are numeric attributes?
They are represented by numbers and have most of the properties of numbers. Also known as quantitative attributes.
What are permissible transformations?
Transformations of an attribute's values that do not change its meaning. For example, the meaning of a length attribute is unchanged if a different measurement scale is used: measuring in metric or imperial units does not change the length.
Given the following transformation, what type of attribute could be used? If all employee ID numbers are reassigned, it would not make a difference
Nominal
Given the following transformation, what type of attribute could be used? An attribute encompassing the notion of good, better, best can be represented equally well by the values 1, 2, 3
Ordinal
Given the following transformation, what type of attribute could be used? The Fahrenheit and Celsius temperature scales differ in their zero value and the size of a degree (unit).
Interval
Given the following transformation, what type of attribute could be used? Length can be measured in meters or feet
Ratio
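Example (a minimal Python sketch; the function names and sample values are illustrative assumptions). It suggests why temperature in Celsius or Fahrenheit is only an interval attribute while length is a ratio attribute: a change of units preserves ratios of lengths, but not ratios of temperatures.

```python
def celsius_to_fahrenheit(c: float) -> float:
    # Changes both the unit size (factor 9/5) and the zero point (+32).
    return c * 9 / 5 + 32


def meters_to_feet(m: float) -> float:
    # Changes only the unit size; zero stays at zero.
    return m * 3.28084


# Interval attribute: the ratio of two temperatures depends on the scale.
c_low, c_high = 10.0, 20.0
ratio_celsius = c_high / c_low
ratio_fahrenheit = celsius_to_fahrenheit(c_high) / celsius_to_fahrenheit(c_low)
print(ratio_celsius, ratio_fahrenheit)  # the two ratios differ

# Ratio attribute: the ratio of two lengths is the same in either unit.
m_short, m_long = 2.0, 4.0
ratio_meters = m_long / m_short
ratio_feet = meters_to_feet(m_long) / meters_to_feet(m_short)
print(ratio_meters, ratio_feet)  # the two ratios agree
```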
What is a coefficient?
A number or symbol multiplied with a variable or an unknown quantity in an algebraic term. For example, 4 is the coefficient in the term 4x, and x is the coefficient in x(a + b).
A way of distinguishing between attributes is by the number of values they can take. What are the two types of attributes in this case?
Discrete - has a finite or countably infinite set of values. Continuous - values are real numbers.
What is a real number?
In mathematics, a real number is a value that represents a quantity along a continuous line.
What is a discrete attribute?
- has a finite or countably infinite set of values
- is often represented using integer variables
- binary attributes are a special case of discrete attributes
What is a continuous attribute?
- values are real numbers
- is typically represented using floating-point variables
- in practice, can only be measured and represented with limited precision
Typically, nominal and ordinal attributes are: a) continuous b) binary or discrete
b) binary or discrete
Typically, interval and ratio attributes are: a) continuous b) binary or discrete
a) continuous
What is an asymmetric attribute?
An attribute for which only presence (a non-zero attribute value) is regarded as important, e.g., whether or not a student took a particular course.
What are asymmetric binary attributes?
- binary attributes where only non-zero values are important
What are the general characteristics of data sets?
1. Dimensionality 2. Sparsity 3. Resolution
When describing the general characteristics of data sets, what is dimensionality?
- the number of attributes that the objects in the data set possess
When describing the general characteristics of data sets, what is sparsity?
- the extent to which the attribute values in the data are zero; in a sparse data set, most values are zero
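Example (a minimal Python sketch; it assumes sparsity is measured as the fraction of zero entries, and the helper names and sample data are made up for illustration):

```python
def dimensionality(matrix):
    """Number of attributes (columns) each object has."""
    return len(matrix[0])


def sparsity(matrix):
    """Fraction of entries that are zero (an assumed working definition)."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for value in row if value == 0)
    return zeros / total


# Three objects (rows), four attributes (columns); 10 of the 12 values are zero.
data = [
    [0, 0, 3, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

print(dimensionality(data))
print(sparsity(data))
```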
When describing the general characteristics of data sets, what is resolution?
- it is frequently possible to obtain data at different levels of resolution
- the properties of the data are often different at different resolutions
- e.g., the surface of the Earth looks bumpy at a fine resolution but flat at a coarse one
What are 3 examples of record data sets?
1. Transaction or Market Basket Data 2. The Data Matrix 3. Sparse Data Matrix
What are 2 examples of graph-based data sets?
1. Data with Relationships among Objects 2. Data with Objects that are Graphs
What are 4 examples of ordered data sets?
- Sequential Data
- Sequence Data
- Time Series Data
- Spatial Data
What is transaction or market basket data?
- each record (transaction) involves a set of items, e.g., the items purchased at a grocery store
- fields are typically asymmetric attributes (most often binary)
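Example (a minimal Python sketch; the transactions and variable names are made up for illustration). It turns a few market baskets into records whose fields are asymmetric binary attributes, one per item:

```python
# Each transaction is the set of items bought together (illustrative data).
transactions = [
    {"bread", "milk"},
    {"beer", "bread"},
    {"milk", "diapers", "beer"},
]

# One field per distinct item, in a fixed (sorted) order.
items = sorted(set().union(*transactions))

# 1 means the item is present in the transaction, 0 means it is absent.
records = [[1 if item in basket else 0 for item in items] for basket in transactions]

print(items)
for record in records:
    print(record)
```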
What is a data matrix?
- if the data objects in a collection all have the same fixed set of numeric attributes, they can be thought of as points (vectors) in a multidimensional space, where each dimension represents a distinct attribute describing the object
- can be interpreted as an m by n matrix with m rows, one for each object, and n columns, one for each attribute
- standard matrix operations can be applied to transform and manipulate the data
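Example (a minimal Python sketch using plain lists; the values and names are illustrative assumptions). It stores m objects with n numeric attributes as an m by n matrix and applies a simple column-wise transformation, subtracting each attribute's mean:

```python
# m = 3 objects (rows), n = 2 numeric attributes (columns).
data_matrix = [
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
]

m = len(data_matrix)
n = len(data_matrix[0])

# Mean of each attribute (column).
column_means = [sum(row[j] for row in data_matrix) / m for j in range(n)]

# Center the data by subtracting the column mean from every entry.
centered = [[row[j] - column_means[j] for j in range(n)] for row in data_matrix]

print(m, n)
print(column_means)
print(centered)
```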
What is a sparse data matrix?
- a special case of a data matrix in which the attributes are of the same type and are asymmetric (only non-zero values are important), e.g., document data
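Example (a minimal Python sketch; the two documents and the helper `to_dense` are made up for illustration). It represents document data as term counts and stores only the non-zero entries of each row:

```python
from collections import Counter

documents = [
    "data mining finds patterns in data",
    "graph data",
]

# Sparse representation: each row keeps only the terms that actually occur.
sparse_rows = [Counter(doc.split()) for doc in documents]

# The full set of attributes (one per distinct term), in sorted order.
vocabulary = sorted({term for row in sparse_rows for term in row})


def to_dense(row):
    """Expand a sparse row into a full row with explicit zeros."""
    return [row.get(term, 0) for term in vocabulary]


print(vocabulary)
for row in sparse_rows:
    print(dict(row), to_dense(row))
```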
What is Data with Relationships among Objects?
- data objects are mapped to nodes of the graph, while the relationships among objects are captured by the links between objects and by link properties such as direction and weight, e.g., web pages on the World Wide Web
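Example (a minimal Python sketch; the page names, weights, and the nested-dictionary layout are illustrative assumptions). Pages are nodes, and each link has a direction (source to target) and a weight:

```python
# (source page, target page, weight) for each directed link (illustrative data).
links = [
    ("home", "about", 1.0),
    ("home", "blog", 2.0),
    ("blog", "home", 0.5),
]

# graph[source][target] = weight; every page appears as a node,
# even if it has no outgoing links.
graph = {}
for source, target, weight in links:
    graph.setdefault(source, {})[target] = weight
    graph.setdefault(target, {})

print(graph)
```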
What is Data with Objects that are Graphs?
- the objects contain sub-objects that have relationships, e.g., the structure of chemical compounds
What is sequential data / temporal data?
- an extension of record data where each record has a time associated with it, e.g., retail transactions
What is time series data?
- a special type of sequential data in which each record is a time series (a series of measurements taken over time), e.g., the average monthly temperature of a city over a range of dates
What is spatial data?
- some objects have spatial attributes such as positions, e.g., weather data (precipitation, temperature, pressure) collected from a variety of geographical locations
What is temporal autocorrelation?
- especially applies to time series data
- if two measurements are close in time, then their values are often very similar
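Example (a minimal Python sketch; it uses the usual sample lag-1 autocorrelation as one possible way to quantify the idea, and the monthly temperatures are made-up values). A value near 1 suggests that consecutive measurements tend to be similar:

```python
def lag1_autocorrelation(series):
    """Sample autocorrelation between each value and the next one."""
    n = len(series)
    mean = sum(series) / n
    numerator = sum(
        (series[t] - mean) * (series[t + 1] - mean) for t in range(n - 1)
    )
    denominator = sum((x - mean) ** 2 for x in series)
    return numerator / denominator


# Illustrative average monthly temperatures for one year.
monthly_temps = [2.0, 4.0, 9.0, 14.0, 18.0, 21.0, 23.0, 22.0, 18.0, 12.0, 7.0, 3.0]

# A series that flips sign at every step, for contrast.
alternating = [1.0, -1.0, 1.0, -1.0]

print(lag1_autocorrelation(monthly_temps))  # positive: neighbours are similar
print(lag1_autocorrelation(alternating))    # negative: neighbours differ
```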
What is spatial autocorrelation?
- especially applies to spatial data
- objects that are physically close tend to be similar
How can non-record data be handled?
Record-oriented techniques can be applied to non-record data by 1. extracting features from the data objects and 2. using these features to create a record corresponding to each object.
What is the issue with applying record-oriented techniques to non record data?
It does not capture all of the information in the data.
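Example (a minimal Python sketch; the small link graph and the chosen features, in-degree and out-degree, are illustrative assumptions). It extracts features from each node of a graph and builds one record per node. The records no longer say which specific pages link to which, so some information is lost:

```python
# Non-record data: a directed graph of pages and their outgoing links.
graph = {
    "home": ["about", "blog"],
    "about": [],
    "blog": ["home"],
}


def extract_features(node):
    """Step 1: compute features for one data object (a node)."""
    out_degree = len(graph[node])
    in_degree = sum(node in targets for targets in graph.values())
    return {"node": node, "out_degree": out_degree, "in_degree": in_degree}


# Step 2: one record per object, built from the extracted features.
records = [extract_features(node) for node in graph]

for record in records:
    print(record)
```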
What two principles does data mining focus on when referring to data quality?
1. the detection and correction of data quality problems 2. the use of algorithms that can tolerate poor data quality
What is data cleaning?
The detection and correction of data quality problems.
Name 3 errors that can occur with data collection
- Human error
- Limitations of measuring devices
- Flaws in the data collection process
Define measurement error
Any problem resulting from the measurement process, e.g., the recorded value differs from the true value to some extent.
Define error (for continuous attributes)
The numerical difference between the measured value and the true value.
Define data collection error
errors such as: omitting data objects or attribute values or inappropriately including a data object
Define noise with respect to data mining
The distortion of a value or the addition of spurious objects
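Example (a minimal Python sketch; the true values, the Gaussian noise model, and its standard deviation of 0.5 are all illustrative assumptions). It distorts a few true values with random noise and then computes the error of each measurement as the measured value minus the true value:

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# Illustrative "true" values of a continuous attribute.
true_values = [20.0, 21.5, 19.8]

# Noise modelled here as a random distortion added to each true value.
measured_values = [value + rng.gauss(0.0, 0.5) for value in true_values]

# Error for a continuous attribute: measured value minus true value.
errors = [measured - true for measured, true in zip(measured_values, true_values)]

for true, measured, error in zip(true_values, measured_values, errors):
    print(true, measured, error)
```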