Quiz 2 Flashcards
Truths about Data Warehouses
Data will not be modified by the end user.
Data may be integrated and cleaned from many large sources.
Data discretization is part of data reduction.
True
Which of the following is true about data normalization?
Normalization scales the data into some (generally smaller) specified range.
Z-score normalization is useful for finding outliers because each point is represented by how far it lies from the mean.
When we subtract an offset and divide by a range, we change the mean and standard deviation of the data without actually changing the shape of its distribution (as seen in a histogram).
Which of the following are issues in data integration (i.e., which would actually cause conflicts)?
Two different databases may have different column names for the same actual information (e.g. customerID vs cust-id).
An attribute named ‘weight’ may be in different units in different databases.
There may be discrepancies between entries in two different databases for the same actual real-life entity (e.g. for an employee).
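For a concrete feel of these conflicts, here is a minimal sketch of resolving a naming conflict and a unit conflict during integration. It assumes pandas, and the tables, column names, and units are made up for illustration:

```python
import pandas as pd

# Hypothetical extracts from two source databases.
db_a = pd.DataFrame({"customerID": [1, 2], "weight": [150.0, 180.0]})  # weight in pounds
db_b = pd.DataFrame({"cust-id": [3, 4], "weight": [70.0, 82.0]})       # weight in kilograms

# Naming conflict: map both ID columns onto one schema.
db_b = db_b.rename(columns={"cust-id": "customerID"})

# Unit conflict: convert pounds to kilograms before combining.
db_a["weight"] = db_a["weight"] * 0.45359237

combined = pd.concat([db_a, db_b], ignore_index=True)
print(combined)
```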
z-score normalization (standardization)
the new values tell how many standard deviations the sample is from the mean of the original data.
min-max normalization
the values are linearly scaled from one interval into another; the middle value means nothing special.
decimal scaling
the result is guaranteed to be between -1 and 1, but original zeros stay zero.
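To make the three normalizations concrete, here is a small numpy sketch (the sample values are hypothetical):

```python
import numpy as np

x = np.array([12.0, 45.0, 7.0, 300.0, 22.0])

# Z-score: how many standard deviations each value lies from the mean.
z = (x - x.mean()) / x.std()

# Min-max: linear rescale into [0, 1]; any target interval works the same way.
minmax = (x - x.min()) / (x.max() - x.min())

# Decimal scaling: divide by the smallest power of 10 that puts every
# value inside [-1, 1]; note that original zeros stay zero.
j = int(np.ceil(np.log10(np.abs(x).max())))
decimal = x / (10 ** j)

print(z, minmax, decimal, sep="\n")
```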
The two major types of data reduction
Dimensionality reduction and numerosity reduction (reducing the number of variables and the number of data points, respectively)
Which of the following are methods of dimension reduction?
Feature selection
Feature extraction
Forward selection and backward selection
Attribute relevance analysis (e.g. information gain)
We discussed one method of Feature Extraction, Principal Component Analysis (PCA). Which of the following describes PCA?
PCA creates new features from the original attributes which can efficiently account for most of the variance of the data with fewer dimensions.
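A minimal sketch using scikit-learn's PCA on made-up correlated data; because the five columns here only span two underlying directions, two components recover essentially all of the variance:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))                        # two underlying factors
X = np.hstack([base, base @ rng.normal(size=(2, 3))])   # five correlated attributes

pca = PCA(n_components=2)          # keep the two directions of greatest variance
X_reduced = pca.fit_transform(X)   # new features are linear combos of the originals

# Fraction of total variance the two new features account for (~1.0 here).
print(pca.explained_variance_ratio_.sum())
```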
What are other names for features?
Attributes
Predictors
Explanatory variables
Binning numerical data into chunks (bins) can be useful for
dealing with noisy data by smoothing out variation within bins that cover reasonable ranges
drawing a histogram
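A short numpy sketch of equal-width binning followed by smoothing by bin means (the data values are hypothetical):

```python
import numpy as np

x = np.array([4.0, 8.0, 9.0, 15.0, 21.0, 21.0, 24.0, 25.0, 26.0, 28.0, 29.0, 34.0])

n_bins = 3
edges = np.linspace(x.min(), x.max(), n_bins + 1)
# digitize assigns each value a bin index; clip so the max lands in the last bin.
idx = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)

# Smoothing by bin means: replace each value with the mean of its bin.
smoothed = np.array([x[idx == i].mean() for i in idx])
print(edges)
print(smoothed)
```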
Which of these are true of using clustering for smoothing?
We replace data points by an average or representatives of points in their cluster.
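A minimal sketch, assuming scikit-learn's KMeans, of smoothing by replacing each point with the mean of its cluster (the three noisy levels are made up):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Noisy 1-D data scattered around three levels.
x = np.concatenate([rng.normal(10, 1, 20), rng.normal(50, 2, 20), rng.normal(90, 3, 20)])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(x.reshape(-1, 1))

# Replace each point by its cluster centroid (an "average representative").
smoothed = km.cluster_centers_[km.labels_].ravel()
print(np.unique(smoothed))   # only three distinct values remain
```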
If all available data cleaning algorithms are run in sequence, there is no need to include human judgement in the process.
False
Which of the following are ways to deal with missing data values?
Use a special value like “unknown” to capture that there is meaning in the fact that the value is missing.
Replace with the average value of the attribute among data points with the same class.
Predict missing value with a model based on the data you do have (i.e. classification or regression).
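A small pandas sketch of the first two strategies; the table, the -1 sentinel, and the class labels are illustrative assumptions:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "class":  ["a", "a", "b", "b", "a"],
    "income": [50.0, np.nan, 80.0, np.nan, 60.0],
})

# Strategy 1: a sentinel value that records the "missingness" itself.
flagged = df["income"].fillna(-1)   # or a label like "unknown" for categoricals

# Strategy 2: class-conditional mean -- the average among rows of the same class.
by_class = df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))

print(flagged.tolist())
print(by_class.tolist())
```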
Text data can be stored in a matrix with a “bag-of-words” model.
each row represents a unit of text (e.g. document) and each column represents a word.
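A minimal sketch with scikit-learn's CountVectorizer building such a matrix from two toy documents:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat"]

vec = CountVectorizer()
X = vec.fit_transform(docs)   # rows = documents, columns = words, entries = counts

print(vec.get_feature_names_out())
print(X.toarray())
```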
We’ve discussed several uses of clustering. Which of the following are included?
Smoothing noise
Numerosity reduction
Finding outliers
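As a concrete example of the numerosity-reduction use, here is a sketch (random data, assuming scikit-learn) that summarizes many points by a few size-weighted centroids:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 2))   # 1000 hypothetical data points

km = KMeans(n_clusters=20, n_init=10, random_state=0).fit(X)

# Numerosity reduction: keep 20 centroids plus their cluster sizes
# as a compact stand-in for the original 1000 points.
sizes = np.bincount(km.labels_)
print(km.cluster_centers_.shape, sizes)
```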
Which of the following are true about Forward Selection?
Forward selection is a feature selection method, keeping a subset of the original variables to make a reduced-complexity model.
Forward selection is a greedy algorithm that runs a classification algorithm over and over as part of evaluating subsets of features.
Using forward selection can result in a model that generalizes better, i.e. is less subject to overfitting.
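A minimal greedy forward-selection loop, assuming scikit-learn; logistic regression, cross-validated accuracy, the iris data, and the stopping rule are all illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
selected, remaining = [], list(range(X.shape[1]))

def cv_score(cols):
    # Re-run the classifier on a candidate feature subset (the "over and over" part).
    return cross_val_score(LogisticRegression(max_iter=1000), X[:, cols], y, cv=5).mean()

while remaining:
    # Greedy step: try adding each remaining feature, keep the best one.
    best = max(remaining, key=lambda f: cv_score(selected + [f]))
    if selected and cv_score(selected + [best]) <= cv_score(selected):
        break   # stop when no candidate improves the current subset
    selected.append(best)
    remaining.remove(best)

print("selected feature indices:", selected)
```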
The main criterion optimized in methods for projecting high-dimensional data to 2D (like MDS)
Pairwise distances between points in the new 2D space are as close as possible to the corresponding distances in high-dimensional space.
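A short sketch, assuming scikit-learn's MDS and scipy, checking that pairwise distances in the 2-D embedding track the original high-dimensional ones (the random data are illustrative):

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.manifold import MDS

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 10))   # hypothetical high-dimensional data

X2 = MDS(n_components=2, random_state=0).fit_transform(X)

# MDS minimizes "stress": the mismatch between original and 2-D pairwise distances.
print("distance correlation:", np.corrcoef(pdist(X), pdist(X2))[0, 1])
```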
A classifier is used to
discover a pattern that can predict the class that a new data instance falls into.
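A minimal supervised sketch, assuming scikit-learn: fit a classifier on labeled data, then predict the class of new instances:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier().fit(X_train, y_train)  # learn a pattern from labeled data
print(clf.predict(X_test[:5]))                        # predict classes for unseen instances
```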
The key difference between supervised and unsupervised machine learning problems is the presence or absence of labeled data.
True