Quiz 2 Flashcards
Truths about Data Warehouses
Data will not be modified by the end user.
Data may be integrated and cleaned from many large sources.
Data discretization is part of data reduction.
True
Which of the following is true about data normalization?
Normalization scales the data into some (generally smaller) specified range.
Z-score normalization is useful for finding outliers because each point is represented by how far it lies from the mean.
When we subtract an offset and divide by a range, we change the mean and standard deviation of the data without actually changing the shape of its distribution (as seen in a histogram).
Which of the following are issues in data integration (i.e., which would actually cause conflicts)?
Two different databases may have different column names for the same actual information (e.g. customerID vs cust-id).
An attribute named ‘weight’ may be in different units in different databases.
There may be discrepancies between entries in two different databases for the same actual real-life entity (e.g. for an employee).
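For a concrete feel of these conflicts, here is a minimal sketch of resolving a naming conflict and a unit conflict during integration. It assumes pandas, and the tables, column names, and units are made up for illustration:

```python
import pandas as pd

# Hypothetical extracts from two source databases.
db_a = pd.DataFrame({"customerID": [1, 2], "weight": [150.0, 180.0]})  # weight in pounds
db_b = pd.DataFrame({"cust-id": [3, 4], "weight": [70.0, 82.0]})       # weight in kilograms

# Naming conflict: map both ID columns onto one schema.
db_b = db_b.rename(columns={"cust-id": "customerID"})

# Unit conflict: convert pounds to kilograms before combining.
db_a["weight"] = db_a["weight"] * 0.45359237

combined = pd.concat([db_a, db_b], ignore_index=True)
print(combined)
```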
z-score normalization (standardization)
the new values tell how many standard deviations the sample is from the mean of the original data.
min-max normalization
the values are linearly scaled from one interval into another; the middle value means nothing special.
decimal scaling
the result is guaranteed to be between -1 and 1, but original zeros stay zero.
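To make the three normalizations concrete, here is a small numpy sketch (the sample values are hypothetical):

```python
import numpy as np

x = np.array([12.0, 45.0, 7.0, 300.0, 22.0])

# Z-score: how many standard deviations each value lies from the mean.
z = (x - x.mean()) / x.std()

# Min-max: linear rescale into [0, 1]; any target interval works the same way.
minmax = (x - x.min()) / (x.max() - x.min())

# Decimal scaling: divide by the smallest power of 10 that puts every
# value inside [-1, 1]; note that original zeros stay zero.
j = int(np.ceil(np.log10(np.abs(x).max())))
decimal = x / (10 ** j)

print(z, minmax, decimal, sep="\n")
```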
The two major types of data reduction
Dimensionality reduction and numerosity reduction (reducing the number of variables and the number of data points, respectively)
Which of the following are methods of dimension reduction?
Feature selection
Feature extraction
Forward selection and backward selection
Attribute relevance analysis (e.g. information gain)
We discussed one method of Feature Extraction, Principal Component Analysis (PCA). Which of the following describes PCA?
PCA creates new features from the original attributes which can efficiently account for most of the variance of the data with fewer dimensions.
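A minimal sketch using scikit-learn's PCA on made-up correlated data; because the five columns here only span two underlying directions, two components recover essentially all of the variance:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))                        # two underlying factors
X = np.hstack([base, base @ rng.normal(size=(2, 3))])   # five correlated attributes

pca = PCA(n_components=2)          # keep the two directions of greatest variance
X_reduced = pca.fit_transform(X)   # new features are linear combos of the originals

# Fraction of total variance the two new features account for (~1.0 here).
print(pca.explained_variance_ratio_.sum())
```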
What are other names for features?
Attributes
Predictors
Explanatory variables
Binning numerical data into chunks (bins) can be useful for
dealing with noisy data by smoothing out variation within bins that cover reasonable ranges
drawing a histogram
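A short numpy sketch of equal-width binning followed by smoothing by bin means (the data values are hypothetical):

```python
import numpy as np

x = np.array([4.0, 8.0, 9.0, 15.0, 21.0, 21.0, 24.0, 25.0, 26.0, 28.0, 29.0, 34.0])

n_bins = 3
edges = np.linspace(x.min(), x.max(), n_bins + 1)
# digitize assigns each value a bin index; clip so the max lands in the last bin.
idx = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)

# Smoothing by bin means: replace each value with the mean of its bin.
smoothed = np.array([x[idx == i].mean() for i in idx])
print(edges)
print(smoothed)
```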
Which of these are true of using clustering for smoothing?
We replace data points by an average or representatives of points in their cluster.
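A minimal sketch, assuming scikit-learn's KMeans, of smoothing by replacing each point with the mean of its cluster (the three noisy levels are made up):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Noisy 1-D data scattered around three levels.
x = np.concatenate([rng.normal(10, 1, 20), rng.normal(50, 2, 20), rng.normal(90, 3, 20)])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(x.reshape(-1, 1))

# Replace each point by its cluster centroid (an "average representative").
smoothed = km.cluster_centers_[km.labels_].ravel()
print(np.unique(smoothed))   # only three distinct values remain
```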
If all available data cleaning algorithms are run in sequence, there is no need to include human judgement in the process.
False
Which of the following are ways to deal with missing data values?
Use a special value like “unknown” to capture that there is meaning in the fact that the value is missing.
Replace with the average value of the attribute among data points with the same class.
Predict missing value with a model based on the data you do have (i.e. classification or regression).
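A small pandas sketch of the first two strategies; the table, the -1 sentinel, and the class labels are illustrative assumptions:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "class":  ["a", "a", "b", "b", "a"],
    "income": [50.0, np.nan, 80.0, np.nan, 60.0],
})

# Strategy 1: a sentinel value that records the "missingness" itself.
flagged = df["income"].fillna(-1)   # or a label like "unknown" for categoricals

# Strategy 2: class-conditional mean -- the average among rows of the same class.
by_class = df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))

print(flagged.tolist())
print(by_class.tolist())
```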
Text data can be stored in a matrix with a “bag-of-words” model.
each row represents a unit of text (e.g. document) and each column represents a word.
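A minimal sketch with scikit-learn's CountVectorizer building such a matrix from two toy documents:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat"]

vec = CountVectorizer()
X = vec.fit_transform(docs)   # rows = documents, columns = words, entries = counts

print(vec.get_feature_names_out())
print(X.toarray())
```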
We’ve discussed several uses of clustering. Which of the following are included?
Smoothing noise
Numerosity reduction
Finding outliers
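As a concrete example of the numerosity-reduction use, here is a sketch (random data, assuming scikit-learn) that summarizes many points by a few size-weighted centroids:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 2))   # 1000 hypothetical data points

km = KMeans(n_clusters=20, n_init=10, random_state=0).fit(X)

# Numerosity reduction: keep 20 centroids plus their cluster sizes
# as a compact stand-in for the original 1000 points.
sizes = np.bincount(km.labels_)
print(km.cluster_centers_.shape, sizes)
```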
Which of the following are true about Forward Selection?
Forward selection is a feature selection method, keeping a subset of the original variables to make a reduced-complexity model.
Forward selection is a greedy algorithm that runs a classification algorithm over and over as part of evaluating subsets of features.
Using forward selection can result in a model that generalizes better, i.e. is less subject to overfitting.
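A minimal greedy forward-selection loop, assuming scikit-learn; logistic regression, cross-validated accuracy, the iris data, and the stopping rule are all illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
selected, remaining = [], list(range(X.shape[1]))

def cv_score(cols):
    # Re-run the classifier on a candidate feature subset (the "over and over" part).
    return cross_val_score(LogisticRegression(max_iter=1000), X[:, cols], y, cv=5).mean()

while remaining:
    # Greedy step: try adding each remaining feature, keep the best one.
    best = max(remaining, key=lambda f: cv_score(selected + [f]))
    if selected and cv_score(selected + [best]) <= cv_score(selected):
        break   # stop when no candidate improves the current subset
    selected.append(best)
    remaining.remove(best)

print("selected feature indices:", selected)
```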
The main criterion optimized in methods for projecting high-dimensional data to 2D (like MDS)
Pairwise distances between points in the new 2D space are as close as possible to the corresponding distances in high-dimensional space.
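A short sketch, assuming scikit-learn's MDS and scipy, checking that pairwise distances in the 2-D embedding track the original high-dimensional ones (the random data are illustrative):

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.manifold import MDS

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 10))   # hypothetical high-dimensional data

X2 = MDS(n_components=2, random_state=0).fit_transform(X)

# MDS minimizes "stress": the mismatch between original and 2-D pairwise distances.
print("distance correlation:", np.corrcoef(pdist(X), pdist(X2))[0, 1])
```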
A classifier is used to
discover a pattern that can predict the class that a new data instance falls into.
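A minimal supervised sketch, assuming scikit-learn: fit a classifier on labeled data, then predict the class of new instances:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier().fit(X_train, y_train)  # learn a pattern from labeled data
print(clf.predict(X_test[:5]))                        # predict classes for unseen instances
```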
The key difference between supervised and unsupervised machine learning problems is the presence or absence of labeled data.
True