Data Flashcards
Data processing follows this order: 1- ___ 2- ___ 3- ___ 4- ___ 5- ___
1- Input Data 2- Preprocess 3- Data Mining 4- Post Processing 5- Knowledge
Classification is the problem of identifying to which of a set of ___ a new ___ belongs
classes (categories, labels)
observation (input)
Clustering is categorization in the absence of ___
It finds ___ in the data that share ___
labels
groups
similar characteristics
Forecasting is the process of making ___ of the ___ based on ___ and ___ data and most commonly by ___
predictions
future
past and present
analysis of trends
Optimization is the process of finding the ___ among ___
best solution
all possible solutions
Heuristic optimization is the process of finding a ___ in a reasonable ___
near optimal solution
time frame
Data types include:
1- ___ - categories, states, or “names of things”
2- ___ - attribute with only two states
3- ___ - Values have a meaningful order
4- ___ - Quantity / Interval / Ratio
5- ___ Attributes - finite or countably infinite set of values
6- ___ Attributes - real numbers as attribute values
1- Nominal 2- Binary/Boolean 3- Ordinal 4- Numerical 5- Discrete 6- Continuous
To measure central tendency of data we can use:
1- ___ - can be a weighted arithmetic one or a trimmed one
2- ___ - estimated by interpolation
3- ___ - value that occurs most frequently in the data
1- mean
2- median
3- mode
triple M
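A minimal sketch of the three measures, using Python's standard `statistics` module. The data and the helper names `weighted_mean` and `trimmed_mean` are illustrative assumptions, not part of the cards.

```python
from statistics import mean, median, mode

values = [2, 3, 3, 4, 5, 6, 40]  # illustrative data; 40 is an extreme value

def weighted_mean(xs, weights):
    """Arithmetic mean where each value counts in proportion to its weight."""
    return sum(x * w for x, w in zip(xs, weights)) / sum(weights)

def trimmed_mean(xs, k):
    """Mean after dropping the k smallest and the k largest values."""
    ordered = sorted(xs)
    return mean(ordered[k:len(ordered) - k])

arithmetic = mean(values)                       # 63 / 7 = 9
trimmed = trimmed_mean(values, 1)               # mean of [3, 3, 4, 5, 6] = 4.2
weighted = weighted_mean([1, 2, 3], [3, 1, 1])  # (3 + 2 + 3) / 5 = 1.6
middle = median(values)                         # 4
most_common = mode(values)                      # 3
```

The extreme value 40 pulls the plain mean up to 9, while the trimmed mean and the median stay close to the bulk of the data.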
To measure the dispersion of data we can use:
1- ___, ___ and ___
2- ___ and ___
1- quartiles, outliers and boxplots
2- Variance and standard deviation
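A small sketch of these dispersion measures with the standard `statistics` module (Python 3.8+). The data is made up, and the 1.5 x IQR fences are the usual boxplot convention, assumed here rather than taken from the cards.

```python
from statistics import pvariance, pstdev, quantiles

values = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative data

variance = pvariance(values)  # population variance: 4
std_dev = pstdev(values)      # square root of the variance: 2.0

# Quartiles; the "inclusive" method treats the data as the whole population.
q1, q2, q3 = quantiles(values, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range

# Assumed boxplot rule: points beyond 1.5 * IQR from the quartiles are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_fence or v > upper_fence]
```

With this data the quartiles are 4.0, 4.5 and 5.5, so the fences sit at 1.75 and 7.75 and only the value 9 is flagged.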
The steps of data preprocessing are:
1- Data ___ - fill missing values, smooth noisy data and identify or remove outliers
2- Data ___ - with multiple datasets
3- Data ___ - data compression and dimensionality reduction
4- Data ___ and data ___ - normalization, aggregation and discretization
1- cleaning
2- integration
3- reduction
4- transformation and discretization
In the Data Cleaning step, to handle missing data we can:
1- ___
2- Fill in ___
3- Fill in ___
1- Ignore it
2- Fill manually
3- Fill automatically
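A sketch of options 1 and 3, assuming missing values are marked with `None` and that the automatic fill uses the mean of the observed values (the median or a constant are equally valid choices).

```python
from statistics import mean

ages = [23, None, 31, 27, None, 35]  # illustrative data; None marks a missing value

# Option 1: ignore the records with missing values.
observed = [a for a in ages if a is not None]

# Option 3: fill automatically, here with the mean of the observed values.
fill_value = mean(observed)  # (23 + 31 + 27 + 35) / 4 = 29
filled = [fill_value if a is None else a for a in ages]
```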
In the Data Cleaning step, to handle noisy data we can use:
1- ___ - sort data and partition it
2- ___ - smooth by fitting the data into functions
3- ___ - detect and remove outliers
4- combined ___ and ___ inspection - detect suspicious values and have a human check them
1- binning
2- regression
3- clustering
4- computer and human
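Binning can be sketched as follows. This version assumes equal-frequency bins and smoothing by bin means; the function name `smooth_by_bin_means` and the data are illustrative.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of bin_size items,
    and replace every value by the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed

prices = [21, 4, 28, 15, 8, 34, 24, 21, 25]  # illustrative, unsorted data
smoothed = smooth_by_bin_means(prices, 3)
# Sorted bins: [4, 8, 15], [21, 21, 24], [25, 28, 34] with means 9, 22 and 29
```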
The purpose of Data Reduction is to obtain a ___ representation of the data set that is much ___ in volume but produces the same (or almost the same) ___
reduced
smaller
analytical results
Some Data Reduction strategies are:
1- ___ Reduction
2- ___ Reduction
3- Data ___
1- Dimensionality Reduction
2- Numerosity Reduction
3- Data Compression
DND
The purpose of Data Transformation is to ___ the entire set of values of a given ___ to a new set of ___
map
attribute
replacement values
Some Data Transformation strategies are:
1- ___ - scale values to fall within a smaller, specified range
2- ___ - divide the range of a continuous attribute into intervals
3- ___ - values of multiple objects are grouped together to form a single summary value
1- Normalization
2- Discretization
3- Aggregation
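A rough sketch of the three strategies. It assumes min-max scaling for normalization, equal-width intervals for discretization, and a plain sum for aggregation; the function names and data are illustrative, and both functions assume the values are not all equal.

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Normalization: rescale values linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

def equal_width_bins(values, n_bins):
    """Discretization: map each value to the index of its equal-width interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # The maximum value is placed in the last bin rather than opening a new one.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

incomes = [20, 30, 50, 80, 100]
normalized = min_max(incomes)        # [0.0, 0.125, 0.375, 0.75, 1.0]
bins = equal_width_bins(incomes, 4)  # intervals of width 20: [0, 0, 1, 3, 3]

# Aggregation: several daily values are summarized by a single weekly total.
daily_sales = {"Mon": 120, "Tue": 90, "Wed": 150}
weekly_total = sum(daily_sales.values())  # 360
```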
Similarity is the numerical measure of how ___ two data objects are, while Dissimilarity is the numerical measure of how ___ two data objects are
alike
different
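One possible pair of measures, as a sketch: Euclidean distance as the dissimilarity, and 1 / (1 + distance) as a similarity in (0, 1]. Both choices are assumptions for illustration; the cards do not name a specific measure.

```python
import math

def euclidean_distance(a, b):
    """Dissimilarity: 0 for identical objects, larger for more different ones."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def similarity(a, b):
    """Similarity derived from the distance: 1 for identical objects,
    approaching 0 as the objects get further apart."""
    return 1.0 / (1.0 + euclidean_distance(a, b))

p, q = (1.0, 2.0), (4.0, 6.0)
d = euclidean_distance(p, q)  # sqrt(3**2 + 4**2) = 5.0
s = similarity(p, q)          # 1 / 6
```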
An outlier is a data object that ___ significantly from the ___ objects as if it were generated by a ___
deviates
normal
different mechanism
Some Discretization Methods are: 1- ___ 2- ___ analysis 3- ___ analysis 4- ___ / ___ analysis 5- ___ analysis
1- Binning 2- Histogram analysis 3- Clustering analysis 4- Decision-tree / classification analysis 5- Correlation analysis
The five-number summary corresponds to the ___, ___, ___ quartile, ___ quartile and ___ of a distribution
minimum, maximum, lower, upper, median
The five-number summary is useful for checking where the data is ___
concentrated
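A sketch of the five-number summary with `statistics.quantiles` (Python 3.8+). Quartile conventions differ between tools; the "inclusive" method is an assumption made here, and the function name is illustrative.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, med, q3, max(values)

summary = five_number_summary([1, 3, 5, 7, 9, 11, 13])
# (1, 4.0, 7.0, 10.0, 13): half of the data lies between 4 and 10
```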
Histograms show what ___ of cases fall into each of several ___
proportion
categories
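A minimal text histogram, assuming categorical data counted with `collections.Counter`; the grades are made up.

```python
from collections import Counter

grades = ["A", "B", "B", "C", "B", "A", "C", "B"]  # illustrative data

counts = Counter(grades)
# Proportion of cases that fall into each category.
proportions = {category: count / len(grades) for category, count in counts.items()}

# One bar per category, with one '#' per case.
for category in sorted(proportions):
    bar = "#" * counts[category]
    print(f"{category} {bar} {proportions[category]:.2f}")
```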
Z-score normalization handles ___ better than min-max normalization
outliers
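One way to read this card, as a sketch: min-max scaling is pinned to the observed minimum and maximum, so a later extreme value lands outside the target range, while a z-score has no fixed bounds and simply reports how many standard deviations the value is from the mean. The data and variable names are illustrative assumptions.

```python
from statistics import mean, pstdev

train = [10, 12, 14, 16, 18]  # illustrative data used to fit both scalers
lo, hi = min(train), max(train)
mu, sigma = mean(train), pstdev(train)

new_value = 30  # an outlier that arrives after fitting

# Min-max expects results in [0, 1]; the outlier breaks that assumption.
min_max_scaled = (new_value - lo) / (hi - lo)  # 20 / 8 = 2.5

# The z-score is unbounded, so the outlier is just a large score.
z_scaled = (new_value - mu) / sigma  # 16 / sqrt(8), about 5.66
```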
In symmetric, unimodal data, the mean, median and mode all have the same ___
value