Midterm Flashcards
What fields are a part of data mining?
Machine learning, databases, visualization, application domain, statistics
Explain the data mining process
1. Collect data 2. Prepare data 3. Build a model 4. Evaluate the model 5. Deploy the model
What are the two types of data mining techniques?
Predictive (supervised): predict a (discrete or continuous) class attribute based on other attribute values. This is like learning from a teacher. Descriptive (unsupervised): discover the structure of the data without prior knowledge of class labels.
What is big data?
Data that is too large to be analyzed with today’s resources
Describe the algorithm for k-means (clustering)
1. Randomly pick k cluster centers 2. Assign every object to its nearest cluster center 3. Move each cluster center to the mean of the objects assigned to it 4. Repeat steps 2 and 3 until the stopping criterion is satisfied
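The four steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names, the squared Euclidean distance, the `max_iters` cap, and the "assignments stopped changing" stopping criterion are all assumptions chosen for the example.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, max_iters=100, seed=0):
    """Cluster a list of numeric tuples into k groups (illustrative sketch)."""
    rng = random.Random(seed)
    # Step 1: randomly pick k of the points as initial cluster centers.
    centers = rng.sample(points, k)
    assignments = None
    for _ in range(max_iters):
        # Step 2: assign every point to its nearest cluster center.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Step 4 (assumed stopping criterion): stop when nothing changes.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: move each center to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # leave an empty cluster's center where it is
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centers, assignments

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = k_means(points, k=2)
```

On this toy data the two well-separated groups end up with different labels; on real data the result depends on the random initial centers.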
Describe the different types of attributes and the properties associated with each.
Nominal: distinctness. Ordinal: distinctness, order. Interval: distinctness, order, +, -. Ratio: distinctness, order, +, -, *, /.
What are the 3 types of data sets?
Record: data matrix, document data, transaction data. Graph: objects with relationships to other objects, objects that have sub-objects. Ordered: spatial data, temporal data, sequence data, sequential data.
Describe the different sampling techniques
Simple random sampling: each object is selected with equal probability (with or without replacement). Stratified sampling: split the data into several partitions and take random samples from each partition.
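A small sketch of both techniques using only the standard library. The function names, the `key` callback that decides which partition an item belongs to, and the fixed number of samples per partition are assumptions made for illustration.

```python
import random
from collections import defaultdict

def simple_random_sample(items, n, replace=False, rng=random):
    """Draw n items, each chosen with equal probability."""
    if replace:
        # With replacement: the same item may be drawn more than once.
        return [rng.choice(items) for _ in range(n)]
    # Without replacement: every drawn item is distinct.
    return rng.sample(items, n)

def stratified_sample(items, key, n_per_stratum, rng=random):
    """Partition items by key(item), then sample randomly inside each partition."""
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        # Take at most n_per_stratum items from each partition.
        sample.extend(rng.sample(members, min(n_per_stratum, len(members))))
    return sample

records = [("a", i) for i in range(5)] + [("b", i) for i in range(50)]
picked = stratified_sample(records, key=lambda r: r[0], n_per_stratum=3)
```

Here `picked` holds three records from partition "a" and three from "b", even though "b" is ten times larger.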
What is under sampling?
1. Include all samples from the minority class 2. Sample randomly from the majority class, with or without replacement
What is over sampling?
It’s the opposite of under sampling: 1. Include all samples from the majority class 2. Sample the minority class with replacement
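Under sampling and over sampling can be sketched side by side. The function names are hypothetical, and balancing the two classes to exactly the same size is an assumption for the example; the cards do not say how many samples to draw.

```python
import random

def undersample(minority, majority, rng=random):
    """Keep every minority sample; shrink the majority to the minority's size."""
    # Sampling without replacement here; rng.choice in a loop would sample with it.
    return list(minority) + rng.sample(list(majority), len(minority))

def oversample(minority, majority, rng=random):
    """Keep every majority sample; draw minority samples with replacement."""
    # Assumed target: as many minority draws as there are majority samples.
    drawn = [rng.choice(minority) for _ in range(len(majority))]
    return list(majority) + drawn

minority = ["pos"] * 3
majority = ["neg"] * 12
small_balanced = undersample(minority, majority)   # 3 pos + 3 neg
large_balanced = oversample(minority, majority)    # 12 neg + 12 pos
```

Under sampling discards majority data, while over sampling repeats minority samples, so the two trade information loss against duplication.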
What are the two ways of mapping categorical values to numerical data (transformation for preprocessing)?
1 of n: create one new attribute for each possible value of the categorical attribute. m of n: create new attributes such that each categorical value is mapped to a unique representation using the new attributes.
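Both mappings can be illustrated with a short sketch. The 1 of n function follows the card directly. For m of n, the card only requires a unique representation per value, so the binary code used here is just one possible scheme and is an assumption, as are the function names.

```python
def one_of_n(values):
    """1 of n: one new 0/1 attribute per distinct categorical value."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

def m_of_n_binary(values):
    """m of n (assumed scheme): encode each category's index in binary."""
    categories = sorted(set(values))
    # Number of bits needed to give every category a distinct code.
    width = max(1, (len(categories) - 1).bit_length())
    index = {c: i for i, c in enumerate(categories)}
    return [
        [(index[v] >> bit) & 1 for bit in reversed(range(width))]
        for v in values
    ]

colors = ["red", "green", "blue", "green"]
print(one_of_n(colors))       # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
print(m_of_n_binary(colors))  # [[1, 0], [0, 1], [0, 0], [0, 1]]
```

With three colors, 1 of n needs three new attributes while the binary scheme needs only two, at the cost of attributes that no longer correspond to a single value.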
What are the 3 ways values can be missing?
1. Missing Completely at Random (MCAR): e.g. randomly crossing out things in your data. This is the only time it is acceptable to substitute the average of the known values, but doing so reduces the variance of the set. 2. Missing at Random (MAR): the reason why the value is missing is related to the value of another attribute (e.g. a biased Facebook page) 3. Non-Ignorable: the reason why the value is missing is related to the value itself (e.g. a sensor instrument that cannot measure below a certain threshold)
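The claim that mean substitution reduces variance can be checked on a toy example. The helper name and the use of `None` to mark missing values are assumptions for this sketch.

```python
from statistics import mean, pvariance

def impute_mean(values):
    """Replace each missing value (None) with the mean of the known values."""
    known = [v for v in values if v is not None]
    fill = mean(known)
    return [fill if v is None else v for v in values]

observed = [2.0, None, 4.0, None, 6.0]
known = [v for v in observed if v is not None]
filled = impute_mean(observed)

# The mean is unchanged, but the filled-in values sit exactly on it,
# so they add no spread and the variance shrinks.
print(mean(known), mean(filled))
print(pvariance(known), pvariance(filled))
```

The second line of output shows a smaller variance for the imputed list than for the known values alone.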
What is normalization and what is it used for?
Normalization equalizes the weights associated with attribute values, which gives each attribute the same importance.
What are the 3 types of normalization?
1. Min-max normalization 2. Z-score normalization 3. Decimal scaling
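Each of the three can be written in a few lines using the usual textbook definitions. The function names and the default target range of 0 to 1 are assumptions, and the sketch assumes the values are not all identical (otherwise the first two functions would divide by zero).

```python
from statistics import mean, pstdev

def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale linearly so the smallest value maps to new_min, the largest to new_max."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def decimal_scaling(values):
    """Divide by the smallest power of 10 that brings every |value| below 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

print(min_max([10, 20, 30]))            # [0.0, 0.5, 1.0]
print(decimal_scaling([-991, 45, 120])) # [-0.991, 0.045, 0.12]
```

Min-max is sensitive to outliers because they define the range, whereas z-score output is unbounded but centered on zero.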
What is the significance of sample size?
With sample size, there is a trade-off between accuracy and speed: a larger sample is more accurate but slower to process.
Describe feature subset selection.
1. Remove redundant features: those that duplicate information contained in other attributes 2. Remove irrelevant features: those that are not relevant to the data mining task at hand
Describe feature creation
Creating new attributes from the original ones, for example by mapping the data to a new space
What are the different types of attribute transformations?
1. Categorical to numeric (1 of n and m of n) 2. Numeric to categorical (binning and discretization) 3. PCA 4. Normalization and standardization
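As an illustration of the numeric to categorical direction, here is a minimal equal-width binning sketch. Equal-width bins are only one binning strategy and are an assumption here, as is the function name; the sketch also assumes the values are not all identical.

```python
def equal_width_bins(values, n_bins):
    """Map each number to a bin index 0..n_bins-1 using bins of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

print(equal_width_bins([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], 3))
# [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
```

The bin indices can then be treated as ordinal categories such as low, medium, and high.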