Data Flashcards

1
Q
Data processing follows this order:
1- ___
2- ___
3- ___
4- ___
5- ___
A
1- Input Data
2- Preprocess
3- Data Mining
4- Post Processing
5- Knowledge
2
Q

Classification is the problem of identifying to which of a set of ___ a new ___ belongs

A

classes (categories, labels)

observation (input)

3
Q

Clustering is categorization in the absence of ___

It finds ___ in the data that share ___

A

labels
groups
similar characteristics

4
Q

Forecasting is the process of making ___ of the ___ based on ___ and ___ data and most commonly by ___

A

predictions
future
past and present
analysis of trends

5
Q

Optimization is the process of finding the ___ among ___

A

best solution

all possible solutions

6
Q

Heuristic optimization is the process of finding a ___ in a reasonable ___

A

near optimal solution

time frame

7
Q

Data types include:
1- ___ - categories, states, or “names of things”
2- ___ - attribute with only two states
3- ___ - Values have a meaningful order
4- ___ - Quantity / Interval / Ratio
5- ___ Attributes - finite or countably infinite set of values
6- ___ Attributes - real numbers as attribute values

A
1- Nominal
2- Binary/Boolean
3- Ordinal
4- Numerical
5- Discrete
6- Continuous
8
Q

To measure central tendency of data we can use:
1- ___ - that can be arithmetic, weighted or trimmed
2- ___ - estimated by interpolation
3- ___ - value that occurs most frequently in the data

A

1- mean
2- median
3- mode

triple M
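
As a quick illustration of the three measures, here is a minimal sketch using Python's standard `statistics` module. The sample values, the weights, and the helper names `weighted_mean` and `trimmed_mean` are made up for this card and are not part of the original deck.

```python
from statistics import mean, median, mode

values = [2, 3, 3, 5, 7, 10, 40]     # hypothetical sample; 40 is an extreme value
weights = [1, 1, 1, 1, 1, 1, 0.5]    # hypothetical weights, one per value


def weighted_mean(xs, ws):
    """Each value contributes in proportion to its weight."""
    return sum(x * w for x, w in zip(xs, ws)) / sum(ws)


def trimmed_mean(xs, k):
    """Arithmetic mean after dropping the k smallest and k largest values."""
    ordered = sorted(xs)
    return mean(ordered[k:len(ordered) - k])


print(mean(values))                    # arithmetic mean: 10
print(weighted_mean(values, weights))  # the extreme value counts for less
print(trimmed_mean(values, 1))         # 5.6, the extremes 2 and 40 are dropped
print(median(values))                  # 5
print(mode(values))                    # 3
```

The trimmed mean and the median stay close to the bulk of the data, while the plain arithmetic mean is pulled toward the extreme value.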

9
Q

To measure the dispersion of data we can use:
1- ___, ___ and ___
2- ___ and ___

A

1- quartiles, outliers and boxplots

2- Variance and standard deviation
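
The dispersion measures above can be sketched with the standard `statistics` module. The data is hypothetical, and the 1.5 x IQR fence used to flag outliers is the usual boxplot convention, assumed here rather than taken from the card.

```python
from statistics import quantiles, pvariance, pstdev

data = [4, 5, 5, 6, 7, 8, 9, 30]   # hypothetical sample

# Quartiles: the three cut points that split the sorted data into four parts
q1, q2, q3 = quantiles(data, n=4)
iqr = q3 - q1                      # interquartile range (the box of a boxplot)

# Assumed boxplot convention: values beyond 1.5 * IQR from the box are outliers
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < low_fence or x > high_fence]

print(q1, q2, q3)
print(outliers)                    # [30]
print(pvariance(data))             # population variance
print(pstdev(data))                # population standard deviation (its square root)
```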

10
Q

The steps of data preprocessing are:
1- Data ___ - fill missing values, smooth noisy data and identify or remove outliers
2- Data ___ - with multiple datasets
3- Data ___ - data compression and dimensionality reduction
4- Data ___ and data ___ - normalization, aggregation and discretization

A

1- cleaning
2- integration
3- reduction
4- transformation and discretization

11
Q

In the Data Cleaning step, to handle missing data we can:
1- ___
2- Fill in ___
3- Fill in ___

A

1- Ignore it
2- Fill manually
3- Fill automatically
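
A minimal sketch of options 1 and 3, assuming missing values are stored as `None` and that "fill automatically" means filling with the attribute mean (one common choice; the card does not say which rule to use). The `ages` column is hypothetical.

```python
from statistics import mean

ages = [23, None, 31, 27, None, 35]   # hypothetical column with gaps

# Option 1: ignore the records with a missing value
ignored = [a for a in ages if a is not None]

# Option 3: fill automatically, here with the mean of the known values
fill_value = mean(ignored)
filled = [fill_value if a is None else a for a in ages]

print(ignored)   # [23, 31, 27, 35]
print(filled)    # [23, 29, 31, 27, 29, 35]
```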

12
Q

In the Data Cleaning step, to handle noisy data we can use:
1- ___ - sort data and partition it
2- ___ - smooth by fitting the data into functions
3- ___ - detect and remove outliers
4- combined ___ and ___ inspection - detect suspicious values and check by human

A

1- binning
2- regression
3- clustering
4- computer and human
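
Binning can be sketched as follows: sort the data, partition it into equal-size bins, then replace every value with the mean of its bin. The function name `smooth_by_bin_means` and the `prices` list are illustrative assumptions. Smoothing by bin means is only one variant; bin medians or bin boundaries are alternatives.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort, partition into bins of bin_size items, replace each item by its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        current_bin = ordered[start:start + bin_size]
        bin_mean = sum(current_bin) / len(current_bin)
        smoothed.extend([bin_mean] * len(current_bin))
    return smoothed


prices = [21, 4, 28, 15, 8, 34, 21, 25, 24]   # hypothetical noisy values
print(smooth_by_bin_means(prices, 3))
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```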

13
Q

The purpose of Data Reduction is to obtain a ___ representation of the data set that is much ___ in volume but produces the same (or almost the same) ___

A

reduced
smaller
analytical results

14
Q

Some Data Reduction strategies are:
1- ___ Reduction
2- ___ Reduction
3- Data ___

A

1- Dimensionality Reduction
2- Numerosity Reduction
3- Data Compression

DND

15
Q

The purpose of Data Transformation is to ___ the entire set of values of a given ___ to a new set of ___

A

map
attribute
replacement values

16
Q

Some Data Transformation strategies are:
1- ___ - scale values to fall within smaller, specified range
2- ___- divide the range of continuous attribute into intervals
3- ___ - values of multiple objects are grouped together to form a single summary value

A

1- Normalization
2- Discretization
3- Aggregation
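
Two of these strategies are easy to sketch in a few lines. The example assumes min-max scaling as the normalization rule and a sum as the summary value. The names `min_max`, `incomes` and `monthly_sales` are hypothetical.

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Normalization: rescale values into [new_min, new_max]. Assumes max > min."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


incomes = [20, 30, 50, 100]               # hypothetical attribute values
print(min_max(incomes))                   # [0.0, 0.125, 0.375, 1.0]

# Aggregation: group several objects into one summary value (months -> quarters)
monthly_sales = [10, 12, 8, 15, 11, 9]    # hypothetical monthly totals
quarterly_sales = [sum(monthly_sales[i:i + 3]) for i in range(0, len(monthly_sales), 3)]
print(quarterly_sales)                    # [30, 35]
```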

17
Q

Similarity is the numerical measure of how ___ two data objects are, while dissimilarity is the numerical measure of how ___ two data objects are

A

alike

different
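
One possible pair of measures, as a sketch: Euclidean distance as the dissimilarity, and `1 / (1 + distance)` as a similarity derived from it. Both choices are assumptions made for illustration; the card does not prescribe a specific measure.

```python
import math


def dissimilarity(a, b):
    """Euclidean distance: 0 for identical objects, larger when they differ more."""
    return math.dist(a, b)   # available in Python 3.8+


def similarity(a, b):
    """Maps distance into (0, 1]: 1 for identical objects, near 0 for distant ones."""
    return 1 / (1 + dissimilarity(a, b))


p, q = (0, 0), (3, 4)        # hypothetical data objects with two numeric attributes
print(dissimilarity(p, q))   # 5.0
print(similarity(p, p))      # 1.0
```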

18
Q

An outlier is a data object that ___ significantly from the ___ objects as if it were generated by a ___

A

deviates
normal
different mechanism

19
Q
Some Discretization Methods are:
1- ___
2- ___ analysis
3- ___ analysis
4- ___ / ___ analysis
5- ___ analysis
A
1- Binning
2- Histogram analysis
3- Clustering analysis
4- Decision-tree / classification analysis
5- Correlation analysis
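
Binning as a discretization method can be sketched with equal-width intervals. The function name `equal_width_bins` and the `ages` list are hypothetical. Equal-frequency binning would be an equally valid variant.

```python
def equal_width_bins(values, k):
    """Map each value to a bin index 0..k-1 using k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k            # assumes hi > lo
    # The maximum value is placed in the last bin instead of opening a (k+1)-th one
    return [min(int((v - lo) / width), k - 1) for v in values]


ages = [1, 5, 12, 18, 25, 40, 61]    # hypothetical continuous attribute
print(equal_width_bins(ages, 3))     # [0, 0, 0, 0, 1, 1, 2]
```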
20
Q

The five-number summary corresponds to the ___, ___, ___ quartile, ___ quartile and ___ of a distribution

A
minimum
maximum
lower
upper
median
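
A minimal sketch of computing the five numbers with the standard library. The helper name `five_number_summary` is hypothetical, and the `"inclusive"` quartile method is one of several conventions, so other tools may report slightly different quartiles.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, med, q3, max(values)


print(five_number_summary([1, 2, 3, 4, 5, 6, 7, 8, 9]))
# (1, 3.0, 5.0, 7.0, 9)
```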
21
Q

The five-number summary is useful to verify where the data is ___

A

concentrated

22
Q

Histograms show what ___ of cases fall into each of several ___

A

proportion

categories
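
The idea can be sketched by counting cases per category and dividing by the total. The `grades` list is hypothetical data.

```python
from collections import Counter

grades = ["A", "B", "B", "C", "B", "A", "C", "B"]   # hypothetical cases

counts = Counter(grades)                             # cases per category
proportions = {g: c / len(grades) for g, c in counts.items()}

# Text histogram: one bar per category, bar length = number of cases
for grade in sorted(proportions):
    print(grade, "#" * counts[grade], proportions[grade])
```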

23
Q

Z-score normalization handles ___ better than min-max normalization

A

outliers
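
A minimal z-score sketch, assuming the population standard deviation is used. The `readings` list and the helper name `z_scores` are hypothetical. Unlike min-max normalization, the result does not depend on the exact minimum and maximum, so a single extreme value does not define the whole scale, although it still shifts the mean and the standard deviation.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Rescale so the result has mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)   # population standard deviation; assumes sigma > 0
    return [(v - mu) / sigma for v in values]


readings = [10, 12, 11, 13, 1000]   # hypothetical data with one outlier
scores = z_scores(readings)
print(scores)
```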

24
Q

In symmetric data, the mean, median and mode all have the same ___

A

value
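
A quick check of this on a small hypothetical sample that is symmetric around 3 and has a single peak:

```python
from statistics import mean, median, mode

symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]   # hypothetical, mirror-symmetric around 3

print(mean(symmetric))     # 3
print(median(symmetric))   # 3
print(mode(symmetric))     # 3
```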