Lecture 3 Flashcards
What are the 5 steps of the data preparation process?
Collect data
Prepare data
Build model
Evaluate model
Deploy model
Name 7 data preprocessing techniques
Feature subset selection
Discretization and binarization
Dimensionality reduction
Aggregation
Attribute creation
Attribute transformation
Sampling
Explain the preprocessing technique, aggregation.
Combining two or more attributes (or objects) into a single attribute (or object).
What are two techniques of aggregation?
Data reduction: reduce the number of attributes or objects
Change of scale: cities are aggregated into countries, etc…
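A minimal Python sketch of aggregation as a change of scale. The cities, countries, and sales figures are made up for illustration; the point is only that several city-level objects collapse into one country-level object.

```python
from collections import defaultdict

# Hypothetical city-level records: (city, country, sales)
city_sales = [
    ("Paris", "France", 120),
    ("Lyon", "France", 80),
    ("Berlin", "Germany", 150),
    ("Munich", "Germany", 50),
]


def aggregate_by_country(records):
    """Sum city-level sales into a single value per country."""
    totals = defaultdict(int)
    for _city, country, sales in records:
        totals[country] += sales
    return dict(totals)


country_sales = aggregate_by_country(city_sales)
print(country_sales)  # four city objects become two country objects
```

The result has fewer objects than the input (data reduction) and describes the data at a coarser level (change of scale).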
Describe the 4 sampling techniques used for preprocessing data.
Simple Random Sampling - each object is selected with equal probability
Sampling without Replacement - each selected object is removed from the population, so it can be picked at most once
Sampling with Replacement - selected objects are not removed from the population, so the same object can be picked more than once
Stratified Sampling - split the data into several partitions and take random samples from each partition
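The sampling techniques can be sketched with Python's standard `random` module. The function names, the `key` argument, and the `seed` parameter are assumptions made for this example, not part of the lecture.

```python
import random


def simple_random_sample(data, k, replace=False, seed=None):
    """Draw k objects, each chosen with equal probability."""
    rng = random.Random(seed)
    if replace:
        # With replacement: the same object may be drawn more than once.
        return [rng.choice(data) for _ in range(k)]
    # Without replacement: every drawn object is distinct.
    return rng.sample(data, k)


def stratified_sample(data, key, k_per_group, seed=None):
    """Partition data by key(item), then sample randomly inside each partition."""
    rng = random.Random(seed)
    groups = {}
    for item in data:
        groups.setdefault(key(item), []).append(item)
    sample = []
    for members in groups.values():
        sample.extend(rng.sample(members, min(k_per_group, len(members))))
    return sample
```

For example, `stratified_sample(students, key=lambda s: s["year"], k_per_group=10)` would take up to 10 students from each year group, where `students` is a hypothetical list of dictionaries.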
What are the two techniques of feature subset selection?
Remove redundant features - features that duplicate much or all of the information contained in one or more other attributes, eg. the purchase price of a product and the amount of sales tax paid
Remove irrelevant features - features that contain no information that is useful for the data mining task at hand, eg. student IDs are often irrelevant to the task of predicting students’ GPAs
What are the 3 techniques of discretization without using class labels?
Equal interval width
Equal frequency
K-means
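A minimal sketch of the first two techniques in plain Python. The function names and the sample values are assumptions for illustration. K-means discretization is omitted here because it needs a clustering step.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def equal_frequency_bins(values, n_bins):
    """Assign values to n_bins bins holding roughly the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        # Bin is decided by rank, so tied values can end up in different bins.
        bins[i] = rank * n_bins // len(values)
    return bins


values = [1, 2, 3, 4, 5, 6, 7, 8, 100]
print(equal_width_bins(values, 3))      # the outlier 100 sits alone in the last bin
print(equal_frequency_bins(values, 3))  # three values per bin
```

The outlier shows the trade-off: equal width leaves the middle bin empty, while equal frequency keeps the bins balanced at the cost of very different interval widths.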
Describe the curse of dimensionality
When the dimensionality increases, data becomes increasingly sparse in the space that it occupies
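One way to see the sparsity, as a rough sketch: split every axis into a fixed number of intervals and count the resulting grid cells. The choice of 10 intervals per axis and 1000 points is an assumption for illustration.

```python
def max_occupied_fraction(n_points, n_dims, bins_per_dim=10):
    """Upper bound on the fraction of grid cells that can contain a point."""
    n_cells = bins_per_dim ** n_dims
    return min(1.0, n_points / n_cells)


# With a fixed number of points, the occupiable share of the space collapses
# as the number of dimensions grows.
for d in (1, 2, 3, 6):
    print(d, max_occupied_fraction(1000, d))
```

With 1000 points, every cell can be filled in up to 3 dimensions, but in 6 dimensions at most 0.1% of the cells can contain any data.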
What is Big O notation?
A measure used to evaluate the performance of an algorithm with respect to space or time. It describes how the run time and/or space requirements of an algorithm grow as the problem size grows.
What is the name of this type of Big O notation? O ( 1 )
Constant
- something can be done in constant time
eg. pick a number from a data set and return it
What is the name of this type of Big O notation? O ( n )
Linear eg. Finding an item in an unsorted list
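A small sketch of why searching an unsorted list is linear: in the worst case every item is compared once. The function name and the comparison counter are assumptions added for illustration.

```python
def linear_search(items, target):
    """Return (index, comparisons); index is -1 if target is absent."""
    comparisons = 0
    for index, item in enumerate(items):
        comparisons += 1
        if item == target:
            return index, comparisons
    return -1, comparisons
```

Searching a list of n items for a value that is not present performs exactly n comparisons, so the work grows in proportion to the list size.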
What is the name of this type of Big O notation? O ( n log n )
Linearithmic
eg. performing a mergesort or heapsort
and used by decision trees
What is the name of this type of Big O notation?
O ( n^2 )
Quadratic
What is the name of this type of Big O notation?
O ( n^3 )
Cubic
What is the name of this type of Big O notation?
O ( 2^n )
Exponential
eg. exhaustive feature subset selection, which must consider every possible subset of the n features
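The exponential cost comes from the number of candidate subsets: n features have 2^n subsets, counting the empty one. A minimal sketch of exhaustive enumeration, with a hypothetical function name and placeholder feature names:

```python
from itertools import combinations


def all_feature_subsets(features):
    """Enumerate every subset of the given features, including the empty set."""
    subsets = []
    for size in range(len(features) + 1):
        subsets.extend(combinations(features, size))
    return subsets


# Each extra feature doubles the number of subsets to evaluate.
for n in (3, 10):
    names = [f"f{i}" for i in range(n)]
    print(n, len(all_feature_subsets(names)))
```

Three features give 8 subsets and ten give 1024, which is why exhaustive search stops being practical once the feature count gets large.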