Chapter 3 Flashcards
analytics ready
data that has been identified as relevant to the task at hand, meets the quality and quantity requirements, and displays a consistent structure in terms of key fields and variables
arithmetic mean
the average: the sum of all values divided by the number of values
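As a worked formula, the arithmetic mean can be written as follows. The symbols are assumptions introduced here for illustration: n is the number of values and x_i is the i-th value.

```latex
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
```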
box-and-whiskers plot
a plot in which the box is bounded by the lower and upper quartiles, the median is a straight line across the box, an X marks the mean, the top and bottom whiskers represent the maximum and minimum values (excluding outliers), and outliers appear as dots beyond the whiskers
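The numbers behind such a plot can be sketched in a few lines of Python. This is a minimal illustration, not a definitive method: the sample values are hypothetical, and the 1.5 × IQR rule used to flag outliers is a common convention assumed here (the card itself does not say how outliers are identified).

```python
from statistics import mean, quantiles

# Hypothetical sample, chosen only for illustration.
values = sorted([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

# Lower quartile, median, and upper quartile bound and split the box.
q1, med, q3 = quantiles(values, n=4, method="inclusive")

# Assumption: values beyond 1.5 * IQR from the box count as outliers.
iqr = q3 - q1
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr

inside = [v for v in values if low_fence <= v <= high_fence]
outliers = [v for v in values if v < low_fence or v > high_fence]

# Whiskers reach the minimum and maximum values that are not outliers.
whisker_low, whisker_high = min(inside), max(inside)

# The X on the plot marks the mean of all values.
box_mean = mean(values)

print(q1, med, q3, whisker_low, whisker_high, outliers, box_mean)
```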
box plot
alias of box-and-whiskers plot
bubble chart
an extension of the scatter plot that uses color and size to add additional information
business report
an artifact generated to convey useful information to decision makers, derived from any number of data sources using an ETL (extract, transform, load) procedure.
categorical data
data in which each value of a variable is assigned to one of a set of specific groups or labels.
centrality
an indication of where most of the data fits, using methods such as mean, mode, and median.
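The three centrality measures named above can be compared on one small sample. This is a minimal sketch using Python's standard statistics module; the sample values are hypothetical.

```python
from statistics import mean, median, mode

# Hypothetical sample values, chosen only for illustration.
values = [2, 3, 3, 5, 7, 10]

center_mean = mean(values)      # sum of the values divided by their count
center_median = median(values)  # middle value of the sorted data
center_mode = mode(values)      # most frequently occurring value

print(center_mean, center_median, center_mode)
```

The three measures need not agree; a single large value pulls the mean upward while leaving the median and mode unchanged.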
correlation
makes no a priori assumption of whether one variable is dependent on the other(s) and is not concerned with the relationship between variables; instead it gives an estimate of the degree of association between the variables.
a statistical relation between variables that may indicate some association or connection between these variables. Note that correlation does not imply causation.
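One common way to estimate the degree of association is the Pearson correlation coefficient. The card does not name a specific coefficient, so Pearson's r is an assumption here, and the function name pearson_r and the sample data are hypothetical. The sketch also shows that swapping the two variables gives the same value, which matches the point that neither variable is treated as dependent.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical, perfectly linear data: the coefficient is at its maximum.
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 6, 8, 10]

r_forward = pearson_r(hours, scores)
r_reversed = pearson_r(scores, hours)  # same value: no variable is "dependent"
print(r_forward, r_reversed)
```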
dashboards
a visual representation of data designed to be easily digestible, with sufficient detail to inform decision making. Typically allows the user to “drill down” into the information to explore the situation more deeply.
data preprocessing
(consolidate, clean, transform, reduce)
a procedure designed to make data usable in a data mining scenario.
(1) Consolidation (collect data, select data, integrate data)
(2) Cleaning (impute values, reduce noise, remove duplicates)
(3) Transformation (normalize data, discretize data, create attributes)
(4) Reduction (reduce dimension, reduce volume, balance data)
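The four stages can be sketched end to end on a toy data set. This is a minimal illustration in plain Python, showing one representative task per stage; the records, field names, mean imputation, and min-max normalization are all assumptions chosen for the example rather than prescribed methods.

```python
# Hypothetical records from two sources.
source_a = [
    {"id": 1, "age": 34, "income": 52000, "zip": "74074"},
    {"id": 2, "age": None, "income": 61000, "zip": "74075"},
]
source_b = [
    {"id": 2, "age": None, "income": 61000, "zip": "74075"},  # duplicate of id 2
    {"id": 3, "age": 46, "income": 75000, "zip": "74074"},
]

# (1) Consolidation: integrate the two sources into one collection.
records = source_a + source_b

# (2) Cleaning: remove duplicates by id, then impute missing ages
#     with the mean of the known ages.
unique = list({r["id"]: r for r in records}.values())
known_ages = [r["age"] for r in unique if r["age"] is not None]
mean_age = sum(known_ages) / len(known_ages)
cleaned = [
    {**r, "age": r["age"] if r["age"] is not None else mean_age}
    for r in unique
]

# (3) Transformation: min-max normalize income onto the range [0, 1].
low = min(r["income"] for r in cleaned)
high = max(r["income"] for r in cleaned)
transformed = [
    {**r, "income_norm": (r["income"] - low) / (high - low)} for r in cleaned
]

# (4) Reduction: drop a variable assumed irrelevant to the task (zip),
#     which reduces the dimension of the data.
reduced = [{k: v for k, v in r.items() if k != "zip"} for r in transformed]

for row in reduced:
    print(row)
```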
data quality
The holistic quality of data, including their accuracy, precision, completeness, and relevance
data security
ensuring that only those with the proper permissions have access to the data
data taxonomy
a structure used to define types of data at varying layers of abstraction.
Data [Structured, Unstructured or Semi-structured]
Structured [Categorical, Numerical]
Categorical [Nominal, Ordinal]
Numerical [Interval, Ratio]
Unstructured or Semi-structured [Textual, Multimedia, XML/JSON]
Multimedia [Image, Audio, Video]
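The taxonomy above can be held in a nested dictionary, with each layer of abstraction one level deeper. This is a sketch under the assumption that a plain dict is an adequate representation; the helper name path_to is hypothetical.

```python
# Each key is a data type; its value holds the subtypes one layer down.
taxonomy = {
    "Data": {
        "Structured": {
            "Categorical": {"Nominal": {}, "Ordinal": {}},
            "Numerical": {"Interval": {}, "Ratio": {}},
        },
        "Unstructured or Semi-structured": {
            "Textual": {},
            "Multimedia": {"Image": {}, "Audio": {}, "Video": {}},
            "XML/JSON": {},
        },
    }
}

def path_to(node, target, trail=()):
    """Return the chain of types from the root down to target, or None."""
    for name, children in node.items():
        here = trail + (name,)
        if name == target:
            return here
        found = path_to(children, target, here)
        if found:
            return found
    return None

print(" > ".join(path_to(taxonomy, "Ordinal")))
```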
data visualization
A graphical, animation, or video presentation of data and the results of data analysis.
datum
smallest atomic unit of data, i.e. a single record of facts
descriptive statistics
statistics that describe the sample data on hand, typically employing centrality measures (mean, median, mode).
dimensional reduction
removing variables (i.e., reducing columns) through variable selection; part of stage 4 (reduction) of data preprocessing
dispersion
a statistical measure of how “spread out” the data are, i.e. the degree of variation of a given variable; measures include range, variance, and standard deviation.
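The three dispersion measures can be computed on a small sample as follows. This is a minimal sketch: the values are hypothetical, and the population forms of variance and standard deviation (pvariance, pstdev) are an assumption; the sample forms (variance, stdev) divide by n - 1 instead and give slightly larger results.

```python
from statistics import pstdev, pvariance

# Hypothetical sample values, chosen only for illustration.
values = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(values) - min(values)  # distance between the extremes
variance = pvariance(values)            # mean squared deviation from the mean
std_dev = pstdev(values)                # square root of the variance

print(data_range, variance, std_dev)
```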
high-performance computing
a set of techniques including in-memory analytics, in-database analytics, grid computing, and appliances.