CAP Data Flashcards
Line intercept sampling
Sampling method where elements in a region are selected for the sample if a chosen line segment (a transect) intersects them
Theoretical sampling
Sample method where individuals are added to the sample based on analysis of the data already collected
Non-standard values data transformation
Identify categories represented by multiple categorical values and replace with a standard value
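A minimal pandas sketch of this cleanup; the column name and category mappings are illustrative:

    import pandas as pd

    # Hypothetical column where one category appears under several spellings
    df = pd.DataFrame({"state": ["CA", "Calif.", "California", "NY", "N.Y."]})

    # Map each non-standard value onto a single standard value
    standard = {"Calif.": "CA", "California": "CA", "N.Y.": "NY"}
    df["state"] = df["state"].replace(standard)
    print(df["state"].tolist())  # ['CA', 'CA', 'CA', 'NY', 'NY']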
Principal Component Analysis (PCA)
Dimensionality reduction method that uses an orthogonal transformation to map the data set into a new coordinate system in which the first coordinate captures the most variance, the second coordinate the second-most variance, and so on
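A minimal sketch using scikit-learn's PCA; the data set itself is illustrative random noise:

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))         # toy data set: 100 observations, 3 variables

    pca = PCA(n_components=2)
    scores = pca.fit_transform(X)         # data expressed in the new coordinate system
    print(pca.explained_variance_ratio_)  # variance captured per component, in decreasing order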
Data volume
The quantity of data stored in a warehouse
Probability proportionate to size sampling
Sample method where probability of an individual being chosen for the sample is proportional to the size of its subpopulation
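A minimal NumPy sketch, assuming hypothetical subpopulation sizes; selection probabilities are set proportional to size:

    import numpy as np

    rng = np.random.default_rng(1)
    units = ["A", "B", "C", "D"]
    sizes = np.array([100, 50, 30, 20])  # hypothetical subpopulation sizes

    probs = sizes / sizes.sum()          # probability proportional to size
    sample = rng.choice(units, size=2, replace=False, p=probs)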
Smoothing data transformation
Apply a simple moving average or a LOESS regression to the data
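A minimal pandas sketch of the moving-average case; the window size is illustrative, and statsmodels' lowess covers the LOESS case:

    import pandas as pd

    s = pd.Series([3, 5, 4, 8, 7, 9, 6, 10])
    smoothed = s.rolling(window=3, center=True).mean()  # 3-point simple moving average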
Panel sampling
Sample method where individuals randomly chosen for an experiment are asked for information in waves of data collection
Stratified sampling
Sample method where population is divided into subpopulations and individuals are randomly chosen for the sample from these subpopulations
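A minimal pandas sketch, drawing the same fraction from each hypothetical stratum:

    import pandas as pd

    df = pd.DataFrame({
        "stratum": ["urban"] * 6 + ["rural"] * 4,
        "value": range(10),
    })

    # Randomly draw 50% of individuals within each stratum
    sample = df.groupby("stratum", group_keys=False).sample(frac=0.5, random_state=0)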
Binning data transformation
Divide the values of a continuous variable into discrete intervals
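A minimal pandas sketch; the cut points and labels are illustrative:

    import pandas as pd

    ages = pd.Series([3, 17, 25, 42, 58, 71])
    binned = pd.cut(ages, bins=[0, 18, 40, 65, 100],
                    labels=["child", "young adult", "middle-aged", "senior"])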
Data strategy
A plan designed to improve the enterprise’s acquisition, storage, management, sharing, and use of data
Statistical uncertainty
Natural randomness in a process that affects each experimental trial
Voluntary sampling
Sample method where individuals choose to join the sample
Skewness data transformation
Transform the distribution using a function such as a logarithm, a square root, or an inverse
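A minimal NumPy sketch of the three functions named above, applied to illustrative right-skewed data:

    import numpy as np

    right_skewed = np.array([1, 2, 2, 3, 5, 8, 40, 90])  # long right tail

    log_t = np.log(right_skewed)    # strong compression of large values
    sqrt_t = np.sqrt(right_skewed)  # milder compression
    inv_t = 1.0 / right_skewed      # inverse transformation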
Normalization data transformation
Scale the data to remove differences in magnitude between continuous variables; examples include min-max, z-scores, and decimal scaling
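A minimal NumPy sketch of the three examples; the data are illustrative:

    import numpy as np

    x = np.array([10.0, 20.0, 30.0, 50.0])

    min_max = (x - x.min()) / (x.max() - x.min())           # rescales into [0, 1]
    z_scores = (x - x.mean()) / x.std()                     # mean 0, standard deviation 1
    decimal = x / 10 ** np.ceil(np.log10(np.abs(x).max()))  # decimal scaling: divide by a power of 10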
Structured data
Information organized into a formatted repository so that its elements are easily searchable by basic algorithms
Interval scale
Items in the scale are differentiated by degree of difference with no absolute zero as part of the scale
Data value
The worth of data stored in and extracted from a warehouse
Master data
Data objects agreed on and shared across the enterprise
Precision of measurements
The closeness of agreement between independent measurements; arises primarily from random error
Ratio scale
Items in the scale are differentiated by degree of difference and there is an absolute zero in the scale
Data variety
The different forms of data contained in a warehouse
Box-Cox Transformation
A method for transforming non-normal dependent variables into a normal shape
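A minimal SciPy sketch on illustrative lognormal data; boxcox requires strictly positive input and fits the transformation parameter by maximum likelihood:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    y = rng.lognormal(size=200)                     # positive, right-skewed toy data

    y_transformed, fitted_lambda = stats.boxcox(y)  # transformed data and the fitted lambda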
Noisy data transformation
Use a smoothing function or funnel samples into bins
Outlier data transformation
Apply a logarithm to the data or, if a value is erroneous, discard it from the data set entirely
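A minimal NumPy sketch of both options; the 1.5 * IQR fence used to flag the erroneous point is a common convention, not part of the definition above:

    import numpy as np

    x = np.array([12.0, 14.0, 13.0, 15.0, 14.0, 95.0])

    log_x = np.log(x)  # compresses the outlier instead of removing it

    # Alternatively, discard values outside the 1.5 * IQR fences
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    kept = x[(x >= q1 - 1.5 * iqr) & (x <= q3 + 1.5 * iqr)]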
Metadata
Data that provides information about other data
Epistemic uncertainty
Uncertainty in a process due to things that are knowable in principle but not known while conducting an experiment
Non-relational database
Database in which data is not organized so that each row of information contains a unique key identifying the row
Data veracity
The truthfulness and provenance of data contained in a warehouse
Cluster sampling
Sample method where the population is divided into mutually homogeneous, internally heterogeneous subpopulations (clusters); every individual in the randomly chosen clusters forms the sample (one-stage), or a simple random sample is drawn within the randomly chosen clusters (two-stage)
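A minimal pandas/NumPy sketch of both stages; the clusters and sizes are illustrative:

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(3)
    df = pd.DataFrame({
        "cluster": np.repeat(["school_A", "school_B", "school_C", "school_D"], 25),
        "student_id": range(100),
    })

    # Stage one: randomly choose whole clusters
    chosen = rng.choice(df["cluster"].unique(), size=2, replace=False)
    one_stage = df[df["cluster"].isin(chosen)]  # every member of the chosen clusters

    # Stage two: simple random sample within the chosen clusters
    two_stage = one_stage.groupby("cluster", group_keys=False).sample(n=10, random_state=3)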
Fitting data transformation
Find a function that describes common features of the training and testing data, then apply that function to the data; one such example is the Fourier transform
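A minimal NumPy sketch of the Fourier example: recover the dominant frequency of a noisy sine wave (the signal parameters are illustrative):

    import numpy as np

    t = np.linspace(0, 1, 256, endpoint=False)
    rng = np.random.default_rng(4)
    signal = np.sin(2 * np.pi * 5 * t) + 0.3 * rng.normal(size=t.size)  # 5 Hz sine plus noise

    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(t.size, d=t[1] - t[0])
    print(freqs[np.argmax(np.abs(spectrum))])  # dominant frequency, ~5.0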
Data velocity
The speed at which new data is added to the warehouse; the speed at which a user accesses existing data in the warehouse
Data steward
Job role involving the use of an organization's data governance processes to ensure the fitness of data elements
Nominal scale
Items in the scale are differentiated by name alone
Design of Experiments (DOE)
An approach to problem solving involving the collection of data that supports valid, defensible, and supportable conclusions
Ordinal scale
Items in the scale are differentiated by rank with no degree of difference between items specified
Quota sampling
A non-probabilistic version of stratified sampling where judgment or convenience identifies what kinds of individuals are chosen for the sample
Missing data transformation
Replace missing values with placeholder values, remove rows or columns containing missing values, or use a statistical method to infer the missing values
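A minimal pandas sketch of all three options on an illustrative frame:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"a": [1.0, np.nan, 3.0, np.nan],
                       "b": [4.0, 5.0, np.nan, 7.0]})

    filled = df.fillna(0)           # replace with a placeholder value
    dropped = df.dropna()           # remove rows containing missing values
    imputed = df.fillna(df.mean())  # infer from a statistic (column means)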
Simple random sampling
Sample method where every individual in the population has the same odds of being chosen for the sample
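A minimal sketch with Python's standard library; the population and sample size are illustrative:

    import random

    population = list(range(1, 101))          # every individual has the same odds
    random.seed(0)
    sample = random.sample(population, k=10)  # simple random sample without replacement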
Relational database
Database based on organizing data into one or more tables of columns and rows where each row contains a unique key identifying the row
Systematic sampling
Sample method where individuals are chosen for the sample from an ordered sampling frame; to be used only if the population is homogeneous
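A minimal Python sketch: pick a random start, then take every k-th individual from the ordered frame (the frame and interval are illustrative):

    import random

    frame = list(range(1000))    # ordered sampling frame
    k = 100                      # sampling interval for a sample of 10
    random.seed(0)
    start = random.randrange(k)  # random start within the first interval
    sample = frame[start::k]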
Accidental sampling; convenience sampling
Non-probabilistic sample method where chosen individuals are readily available and convenient
Data governance
Controls that ensure data entry meets precise standards
Accuracy
The agreement between independent measurements and the true value of what is measured; arises primarily from systematic error
Unstructured data
Information that does not fit into a pre-defined repository and is not organized in an easily searchable manner
Snowball sampling
Sampling method where individuals already in the sample are asked to identify new individuals to add to the sample
Data collection strategy
Iterative process of determining data needs, reviewing existing data, setting priorities for data, agreeing on roles and responsibilities for collecting the data, producing the data, and determining if data meets all needs
Minimax sampling
Sample method where the sampling ratio does not follow the population statistics in order to ease binary classification tasks
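A minimal NumPy sketch of one way to impose a non-proportional ratio: draw a 50/50 sample from a 90/10 population (the balanced ratio is an illustrative choice, not part of the definition above):

    import numpy as np

    rng = np.random.default_rng(5)
    labels = np.array([0] * 900 + [1] * 100)  # imbalanced binary classes

    # Keep every minority case, undersample the majority to match
    idx_minority = np.where(labels == 1)[0]
    idx_majority = rng.choice(np.where(labels == 0)[0],
                              size=idx_minority.size, replace=False)
    balanced_idx = np.concatenate([idx_majority, idx_minority])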
Response surface methodology (RSM)
A way to explore the relationships between explanatory variables and one or more response variables through a design of experiments
Data privacy
The relationship between the collection and dissemination of data, technology, the public expectation of non-observation of collected data, and the legal and political issues that surround them