Quiz 2 Flashcards

1
Q

Truths about Data Warehouses

A

Data will not be modified by the end user.

Data may be integrated and cleaned from many large sources.

2
Q

Data discretization is part of data reduction

A

True

3
Q

Which of the following is true about data normalization?

A

Normalization scales the range of the data into some (generally smaller) specified range.

Z-score normalization is useful for finding outliers because each point is represented by how far it is from the mean.

When we subtract an offset and divide by a range, we change the mean and standard deviation of the data without changing the shape of its distribution (as seen in a histogram).

4
Q

Which of the following are issues in data integration? (which would actually cause conflicts)

A

Two different databases may have different column names for the same actual information (e.g. customerID vs cust-id).

An attribute named ‘weight’ may be in different units in different databases.

There may be discrepancies between entries in two different databases for the same actual real-life entity (e.g. for an employee).

5
Q

z-score normalization (standardization)

A

The new values tell how many standard deviations the sample lies from the mean of the original data.
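As an illustration (not part of the deck), a minimal z-score sketch in Python with NumPy on made-up values:

```python
import numpy as np

# Hypothetical sample values (for illustration only).
x = np.array([50.0, 60.0, 70.0, 80.0, 90.0])

# z-score normalization: each new value is the signed number of
# standard deviations the original value lies from the mean.
z = (x - x.mean()) / x.std()
print(z)
```

Values with large magnitude (e.g. |z| > 3) are candidates for outliers, since they lie far from the mean of the original data.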

6
Q

min-max normalization

A

The values are linearly scaled from one interval into another; the middle value has no special meaning.
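A sketch of the linear rescaling with NumPy, using hypothetical values and a [0, 1] target interval:

```python
import numpy as np

x = np.array([50.0, 60.0, 70.0, 80.0, 90.0])  # hypothetical values

# Min-max normalization: map [x.min(), x.max()] linearly onto
# a chosen target interval [new_min, new_max].
new_min, new_max = 0.0, 1.0
scaled = (x - x.min()) / (x.max() - x.min()) * (new_max - new_min) + new_min
print(scaled)
```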

7
Q

decimal scaling

A

The result is guaranteed to be between -1 and 1, and original zeros stay zero.
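A sketch of decimal scaling in NumPy on made-up values: divide by 10^j, where j is the smallest integer that brings the largest absolute value below 1.

```python
import numpy as np

x = np.array([-991.0, 120.0, 45.0, 0.0])  # hypothetical values

# Smallest j with max(|x|) / 10^j < 1.
j = int(np.floor(np.log10(np.abs(x).max()))) + 1
scaled = x / 10 ** j
print(scaled)
```

Here the largest magnitude is 991, so j = 3 and every value is divided by 1000; the zero stays exactly zero.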

8
Q

The two major types of data reduction

A

Dimensionality reduction and numerosity reduction (the number of variables and the number of points)

9
Q

Which of the following are methods of dimension reduction?

A

Feature selection
Feature extraction
Forward selection and backward selection
Attribute relevance analysis (e.g. information gain)

10
Q

We discussed one method of Feature Extraction, Principal Component Analysis (PCA). Which of the following describes PCA?

A

PCA creates new features from the original attributes which can efficiently account for most of the variance of the data with fewer dimensions.
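A NumPy-only PCA sketch (not from the deck) on synthetic 2-D data that mostly varies along one direction, showing how one new feature can capture most of the variance:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: two attributes that vary together along one line.
t = rng.normal(size=200)
X = np.column_stack([t, 2.0 * t + 0.1 * rng.normal(size=200)])

# Center the data, then eigendecompose the covariance matrix.
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]          # sort eigenvalues descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project onto the first principal component: 2 dimensions -> 1.
scores = Xc @ eigvecs[:, :1]

explained = eigvals[0] / eigvals.sum()
print(f"variance explained by PC1: {explained:.3f}")
```

One principal component accounts for nearly all the variance here, so the reduced representation loses very little information.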

11
Q

Which of the following are true about Forward Selection?

A

Forward selection is a feature selection method, keeping a subset of the original variables to make a reduced-complexity model.

12
Q

What are other names for features?

A

Attributes
Predictors
Explanatory variables

13
Q

Binning numerical data into chunks (bins) can be useful for

A

Dealing with noisy data by smoothing variation into bins with reasonable ranges.

Drawing a histogram.
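A sketch of equal-width binning used for smoothing, with NumPy on made-up values: each value is replaced by the mean of its bin.

```python
import numpy as np

x = np.array([4.0, 8.0, 15.0, 21.0, 21.0, 24.0, 25.0, 28.0, 34.0])  # hypothetical

# Equal-width binning into 3 bins, then smoothing by bin means.
n_bins = 3
edges = np.linspace(x.min(), x.max(), n_bins + 1)   # bin boundaries
idx = np.clip(np.digitize(x, edges[1:-1]), 0, n_bins - 1)  # bin index per value
bin_means = np.array([x[idx == b].mean() for b in range(n_bins)])
smoothed = bin_means[idx]
print(smoothed)
```

After smoothing, only one value per bin remains, which removes small fluctuations; the same bin counts could also be plotted directly as a histogram.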

14
Q

Which of these are true of using clustering for smoothing?

A

We replace data points by an average or representatives of points in their cluster.
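A minimal sketch of this idea (not from the deck): a few iterations of 1-D k-means in NumPy on hypothetical noisy values, then replacing each point with its cluster mean.

```python
import numpy as np

x = np.array([1.0, 1.2, 0.9, 5.0, 5.3, 4.8, 9.1, 9.0])  # hypothetical noisy values
k = 3
centers = np.array([0.0, 5.0, 10.0])  # simple initial guesses

# A few iterations of 1-D k-means: assign each point to the nearest
# center, then move each center to the mean of its assigned points.
for _ in range(10):
    labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
    centers = np.array([x[labels == c].mean() for c in range(k)])

# Smoothing: replace each point by its cluster's mean (representative).
smoothed = centers[labels]
print(smoothed)
```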

15
Q

If all available data cleaning algorithms are run in sequence, there is no need to include human judgement in the process.

A

False

16
Q

Which of the following are ways to deal with missing data values?

A

Use a special value like “unknown” to capture that there is meaning to the fact that value is missing.

Replace with the average value of the attribute among data points with the same class.

Predict missing value with a model based on the data you do have (i.e. classification or regression).
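A sketch of the class-conditional mean approach with NumPy, on made-up data where `nan` marks the missing value:

```python
import numpy as np

# Hypothetical data: one numeric attribute with a missing value (nan),
# plus a class label per row.
values = np.array([10.0, 12.0, np.nan, 20.0, 22.0])
labels = np.array(["a", "a", "a", "b", "b"])

# Replace each missing value with the mean of the attribute among
# data points of the same class.
filled = values.copy()
for cls in np.unique(labels):
    mask = labels == cls
    cls_mean = np.nanmean(values[mask])       # mean ignoring missing entries
    filled[mask & np.isnan(values)] = cls_mean

print(filled)  # the nan becomes the mean of class "a": 11.0
```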

17
Q

Text data can be stored in a matrix with a “bag-of-words” model.

A

Each row represents a unit of text (e.g. a document) and each column represents a word.
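A tiny bag-of-words sketch in plain Python on two made-up documents: the vocabulary gives the columns, and each cell counts how often that word appears in that document.

```python
# Hypothetical documents (for illustration only).
docs = [
    "data mining finds patterns",
    "data cleaning removes noise",
]

# Vocabulary: one column per distinct word, in sorted order.
vocab = sorted({w for d in docs for w in d.split()})

# Document-term matrix: rows = documents, columns = word counts.
matrix = [[d.split().count(w) for w in vocab] for d in docs]

print(vocab)
print(matrix)
```

Note that word order is discarded; only counts survive, which is what makes the representation a "bag" of words.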

18
Q

We’ve discussed several uses of clustering. Which of the following are included?

A

Smoothing noise
Numerosity reduction
Finding outliers

19
Q

Which of the following are true about Forward Selection?

A

Forward selection is a feature selection method, keeping a subset of the original variables to make a reduced-complexity model.

Forward selection is a greedy algorithm that runs a classification algorithm over and over as part of evaluating subsets of features.

Using forward selection can result in a model that generalizes better, i.e. is less subject to overfitting.
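The greedy loop can be sketched in NumPy (a toy version, not the course's exact algorithm): starting from no features, repeatedly add the single feature that most improves a model score, stopping when the gain is negligible. Here the score is a simple least-squares fit; in class the inner model would be a classifier.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: y depends on features 0 and 2; feature 1 is pure noise.
X = rng.normal(size=(100, 3))
y = X[:, 0] + 2.0 * X[:, 2] + 0.1 * rng.normal(size=100)

def score(cols):
    """R^2-style score of a least-squares fit using only the given columns."""
    if not cols:
        return 0.0
    A = X[:, cols]
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - resid.var() / y.var()

selected, remaining = [], [0, 1, 2]
while remaining:
    # Greedy step: try adding each remaining feature, keep the best.
    best = max(remaining, key=lambda c: score(selected + [c]))
    if score(selected + [best]) - score(selected) < 0.05:  # stop if no real gain
        break
    selected.append(best)
    remaining.remove(best)

print(selected)  # greedily chosen feature subset
```

The noise feature never gets selected, which is how discarding irrelevant variables reduces overfitting.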

20
Q

The main criteria optimized in methods for projecting high dimensional data to 2D (like MDS)

A

Pairwise distances between points in the new 2D space are as close as possible to the corresponding distances in high-dimensional space.
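As an illustration of that criterion, a NumPy sketch of classical (metric) MDS on made-up low-rank 5-D data: the 2-D coordinates are chosen so their pairwise distances reproduce the high-dimensional ones.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical 5-D points that secretly lie near a 2-D subspace.
Z = rng.normal(size=(20, 2))
A = rng.normal(size=(2, 5))
X = Z @ A + 0.01 * rng.normal(size=(20, 5))

n = X.shape[0]
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)  # squared pairwise distances
J = np.eye(n) - np.ones((n, n)) / n                       # centering matrix
B = -0.5 * J @ D2 @ J                                     # double-centered Gram matrix
eigvals, eigvecs = np.linalg.eigh(B)
top = np.argsort(eigvals)[::-1][:2]                       # two largest eigenvalues
Y = eigvecs[:, top] * np.sqrt(np.maximum(eigvals[top], 0.0))

print(Y.shape)  # 20 points embedded in 2-D
```

Because the data were nearly rank-2, the 2-D pairwise distances track the 5-D ones almost exactly.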

21
Q

A classifier is used to

A

Discover a pattern that can predict the class that a new data instance falls into.

22
Q

The key difference between supervised and unsupervised machine learning problems is due to the presence or absence of labeled data.

A

True