Quiz 2 Flashcards
Truths about Data Warehouses
Data will not be modified by the end user.
Data may be integrated and cleaned from many large sources.
Data discretization is part of data reduction.
True
Which of the following is true about data normalization?
Normalization scales the range of the data into some (generally smaller) specified range.
Z-score normalization is useful for finding outliers because each point is represented by how far it is from the mean.
Subtracting an offset and dividing by a range changes the mean and standard deviation of the data without changing the shape of its distribution (as seen in a histogram).
Which of the following are issues in data integration? (which would actually cause conflicts)
Two different databases may have different column names for the same actual information (e.g. customerID vs cust-id).
An attribute named ‘weight’ may be in different units in different databases.
There may be discrepancies between entries in two different databases for the same actual real-life entity (e.g. for an employee).
z-score normalization (standardization)
the new values tell how many standard deviations the sample is from the mean of the original data.
min-max normalization
the values are linearly scaled from one interval into another; the middle value means nothing special.
decimal scaling
the result is guaranteed to be between -1 and 1, but original zeros stay zero.
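A minimal sketch of the three normalizations above in plain Python (the sample values are made up for illustration):

```python
# Example data (illustrative values only)
data = [200.0, 300.0, 400.0, 600.0, 1000.0]

# z-score: (x - mean) / std -> each value in units of standard deviations from the mean
mean = sum(data) / len(data)
std = (sum((x - mean) ** 2 for x in data) / len(data)) ** 0.5
z_scores = [(x - mean) / std for x in data]

# min-max: linearly map [min, max] onto a target interval, here [0, 1]
lo, hi = min(data), max(data)
min_max = [(x - lo) / (hi - lo) for x in data]

# decimal scaling: divide by 10^j, the smallest power that brings all |values| to at most 1
j = len(str(int(max(abs(x) for x in data))))
decimal_scaled = [x / 10 ** j for x in data]
```

Note how min-max sends the minimum to 0 and the maximum to 1, while z-scores are centered at 0, matching the definitions above.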
The two major types of data reduction
Dimensionality reduction and numerosity reduction (the number of variables and the number of points)
Which of the following are methods of dimension reduction?
Feature selection
Feature extraction
Forward selection and backward selection
Attribute relevance analysis (e.g. information gain)
We discussed one method of Feature Extraction, Principal Component Analysis (PCA). Which of the following describes PCA?
PCA creates new features from the original attributes which can efficiently account for most of the variance of the data with fewer dimensions.
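A toy PCA sketch via the eigendecomposition of the covariance matrix (the 2-D data here is synthetic, generated to vary mostly along one direction):

```python
import numpy as np

# Synthetic 2-D points that mostly vary along the line y = x
rng = np.random.default_rng(0)
t = rng.normal(size=(100, 1))
X = np.hstack([t, t]) + rng.normal(scale=0.1, size=(100, 2))

Xc = X - X.mean(axis=0)                 # center the data
cov = Xc.T @ Xc / (len(Xc) - 1)         # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvectors are the principal axes
order = np.argsort(eigvals)[::-1]       # sort by variance explained, descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals[0] / eigvals.sum()  # fraction of variance on the first component
X_reduced = Xc @ eigvecs[:, :1]         # 1-D representation of the 2-D data
```

Because the points lie close to one line, the first principal component alone accounts for nearly all the variance, so one new feature replaces two original ones.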
Which of the following are true about Forward Selection?
Forward selection is a feature selection method, keeping a subset of the original variables to make a reduced-complexity model.
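A greedy forward-selection sketch: start with no features and repeatedly add the single feature that most reduces least-squares error (the data and target are synthetic; only two of the four features actually matter):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 4))
# Target depends only on features 0 and 2, plus a little noise
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(scale=0.1, size=60)

def sse(cols):
    """Sum of squared residuals of a least-squares fit on the chosen columns."""
    A = X[:, cols]
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ coef
    return float(r @ r)

selected, remaining = [], [0, 1, 2, 3]
for _ in range(2):  # keep the best 2 of the original 4 features
    best = min(remaining, key=lambda c: sse(selected + [c]))
    selected.append(best)
    remaining.remove(best)
```

The selected subset keeps original variables rather than constructing new ones, which is what distinguishes feature selection from feature extraction methods like PCA.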
What are other names for features?
Attributes
predictors
explanatory variables
Binning numerical data into chunks (bins) can be useful for
dealing with noisy data by smoothing out lots of variation into chunks with reasonable ranges
drawing a histogram
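An equal-width binning sketch in plain Python, smoothing each bin by its mean (the values are illustrative):

```python
data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
n_bins = 3
lo, hi = min(data), max(data)
width = (hi - lo) / n_bins  # equal-width bins spanning the data's range

bins = [[] for _ in range(n_bins)]
for x in data:
    i = min(int((x - lo) / width), n_bins - 1)  # clamp the max value into the last bin
    bins[i].append(x)

# Smoothing by bin means: every value in a bin is replaced by the bin's average
smoothed = {i: sum(b) / len(b) for i, b in enumerate(bins) if b}
```

The same bin boundaries double as histogram bar edges, which is why binning and drawing a histogram go together.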
Which of these are true of using clustering for smoothing?
We replace data points by an average or representatives of points in their cluster.
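A tiny 1-D k-means sketch of smoothing by clustering: cluster the points, then replace each one with its cluster's centroid as the representative (toy data, crude initialization):

```python
data = [2.0, 2.5, 3.0, 10.0, 10.5, 11.0]
centroids = [data[0], data[-1]]  # crude initialization: first and last point

for _ in range(10):  # a few iterations are enough for this toy data
    clusters = [[], []]
    for x in data:
        nearest = min(range(2), key=lambda k: abs(x - centroids[k]))
        clusters[nearest].append(x)
    centroids = [sum(c) / len(c) for c in clusters]

# Smoothing: each point is replaced by its cluster's centroid
smoothed = [centroids[min(range(2), key=lambda k: abs(x - centroids[k]))]
            for x in data]
```

The six noisy values collapse to two representative levels, one per cluster, which is exactly the smoothing effect described above.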
If all available data cleaning algorithms are run in sequence, there is no need to include human judgement in the process.
False