Descriptive analytics Flashcards
The term about questioning if we have confidence and belief in the data source is called ___________
Data source reliability
The term about questioning if we have the right data for the job is called ______________
Data source accuracy
The term about questioning if we can easily get to the data is called _____________
Data accessibility
The terms about questioning if the data source is secured so that only those who are allowed to consult the data can access it are called ___________ and ___________
Data security
Data privacy
__________ means all the requested data elements are included in the data set
Data richness
________ means the data is accurately collected and combined. It diminishes the possibility that two records get mixed up during a data merge.
Data consistency
_________ or ___________ means the data is as up to date as needed
data currency or data timeliness
__________ means that the data is at the lowest level of detail as intended for use of the data
Data granularity
Data ______ is the term used to describe a mismatch between the actual and expected value of a variable
validity
___________ means that the data in the data set are all relevant for the study
Data relevancy
Data is a collection of _______ usually obtained by e_____, o______, transactions or e_______
facts, experiments, observations, experiences
data is the lowest/highest level of abstraction from which information is derived
lowest
structured data is what data mining techniques use and can be classified as _________ or _______
categorical or numeric
categorical data is: ____________
labels of classes used to divide a variable into specific groups: education level, race, gender, etc
categorical data can be divided into _______ and ______
nominal and ordinal
nominal classification is ____________
simple codes assigned to objects as labels. Marital status = 1,2 or 3
ordinal classification is ____________
assigning codes to objects as labels that ALSO represent RANK order
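The difference between nominal and ordinal codes can be illustrated with a minimal sketch. The code values, the education ranking, and the sample records below are hypothetical and chosen only for illustration.

```python
# Nominal: codes are labels only; their numeric order carries no meaning.
marital_status_codes = {"single": 1, "married": 2, "divorced": 3}  # hypothetical coding

# Ordinal: codes are labels that also preserve rank order.
education_rank = {"high school": 1, "bachelor": 2, "master": 3, "doctorate": 4}  # hypothetical ranking

people = [
    {"marital_status": "married", "education": "master"},
    {"marital_status": "single", "education": "high school"},
    {"marital_status": "divorced", "education": "doctorate"},
]

# Sorting by an ordinal code is meaningful: it orders people by education level.
by_education = sorted(people, key=lambda p: education_rank[p["education"]])
education_order = [p["education"] for p in by_education]

# For a nominal variable only equality checks and counts are meaningful.
married_count = sum(
    1 for p in people if marital_status_codes[p["marital_status"]] == 2
)
```

Sorting by the marital status code would run without error, but the resulting order would mean nothing, which is the practical difference between the two types.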
what is the difference between ‘numeric data’ and ‘ratio data’
ratio data has values that can be compared to a non-arbitrary zero point: weight, angle, energy, temperature in kelvin, velocity, etc
Neural networks, support vector machines and logistic regression expect a certain form of data. Which form is that?
Numeric data
A _____ variable has an infinite value range
continuous
A discrete variable has a ________ value range
finite countable
missing values in a collected data set due to an anomaly need to be _______ or ________
imputed (most probable value) or ignored
Reasons for missing values in data are: ________ or ________
anomaly or intended
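The two ways of handling missing values can be sketched as follows. The sample values are hypothetical, and using the mean as the "most probable value" is an assumption; the median or mode may fit better depending on the variable.

```python
from statistics import mean

ages = [34, None, 29, 41, None, 38]  # hypothetical sample; None marks a missing value

observed = [a for a in ages if a is not None]

# Option 1: ignore (drop) the records with a missing value.
ignored = observed

# Option 2: impute with a probable value; the mean is assumed here.
fill_value = mean(observed)
imputed = [a if a is not None else fill_value for a in ages]
```

Ignoring shrinks the data set from six to four records, while imputing keeps all six at the cost of introducing estimated values.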
Noisy data (outliers) should be
smoothed out
Sometimes the data of a variable is ______ between a certain minimum and maximum to ______ the potential _____
normalized
mitigate the potential bias
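One common way to normalize between a minimum and a maximum is min-max scaling. The sketch below assumes that technique; the function name and the sample values are hypothetical.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no range to scale.
        return [new_min for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


incomes = [20000, 50000, 80000]  # hypothetical variable with large values
ages = [20, 35, 50]              # hypothetical variable with small values

norm_incomes = min_max_normalize(incomes)
norm_ages = min_max_normalize(ages)
```

After scaling, both variables span the same range, so the one with larger raw values no longer dominates distance-based or weight-based methods.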
What are some transformation tasks?
normalization
discretization
aggregation
convert numerical data to a categorical value
reduce the number of nominal values for a variable
reduce complexity: blood match 1 or - instead of blood groups
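Discretization and the reduction of nominal values can be sketched as below. The age cut-offs, the state-to-region mapping, and the sample data are hypothetical choices for illustration, not fixed rules.

```python
from collections import Counter


def discretize_age(age):
    """Convert a numeric age into a categorical age group (cut-offs are assumed)."""
    if age < 18:
        return "minor"
    if age < 65:
        return "adult"
    return "senior"


# Reduce the number of nominal values: many states collapse into a few regions.
REGION_OF_STATE = {
    "Texas": "South",
    "Florida": "South",
    "Oregon": "West",
    "California": "West",
}

ages = [12, 30, 70, 45]
states = ["Texas", "Oregon", "Florida", "California"]

age_groups = [discretize_age(a) for a in ages]
regions = [REGION_OF_STATE[s] for s in states]

# Aggregation: summarize the records per region.
records_per_region = Counter(regions)
```

Both steps trade detail for simplicity: four distinct ages become three groups, and four states become two regions.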
The final step in transformation of data is called ________
data reduction
In ‘predictive analysis’ and ‘data mining’ data sets have different dimensions that describe the phenomenon, when that data set needs to be reduced it is called __________(or _________)
dimensional reduction (or variable selection)
Data reduction can not only be managed by reducing variables (columns) but also by __________ also called ___________
reducing records, sampling
In a skewed data set, it has been shown that ______ the highly represented classes and __________ the less represented classes produces better prediction models than unbalanced data sets
undersampling, oversampling
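A minimal sketch of balancing a skewed data set by random undersampling and oversampling follows. The function name, the target size, and the sample records are hypothetical; oversampling here simply repeats existing records at random, which is only one of several possible approaches.

```python
import random
from collections import Counter


def balance(records, label_of, target_size, seed=0):
    """Resample each class to target_size records."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    by_class = {}
    for record in records:
        by_class.setdefault(label_of(record), []).append(record)

    balanced = []
    for group in by_class.values():
        if len(group) > target_size:
            # Undersample: keep a random subset of the large class.
            balanced.extend(rng.sample(group, target_size))
        else:
            # Oversample: add random repeats of the small class.
            extra = [rng.choice(group) for _ in range(target_size - len(group))]
            balanced.extend(group + extra)
    return balanced


# Hypothetical skewed data set: 90 "ok" records and 10 "fraud" records.
records = [("ok", i) for i in range(90)] + [("fraud", i) for i in range(10)]
balanced = balance(records, label_of=lambda r: r[0], target_size=30)
class_counts = Counter(label for label, _ in balanced)
```

The result holds 30 records of each class, so a prediction model trained on it no longer sees the majority class nine times as often as the minority class.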