Booz Selection Flashcards
What is the key question for 1. Describe?
How do I develop an understanding of the content of my data?
What is the key question for 1. Describe | Processing?
How do I clean and separate my data?
What is the key question for 1. Describe | Processing | Filtering?
How do I identify data based on its absolute or relative values?
What is the key question for 1. Describe | Processing | Imputation?
How do I fill in missing values in my data?
What is the key question for 1. Describe | Processing | Dimensionality Reduction?
How do I reduce the number of dimensions in my data?
What is the key question for 1. Describe | Processing | Normalization & Transformation?
How do I reconcile duplicate representations in the data?
What is the key question for 1. Describe | Processing | Feature Extraction?
It depends on the domain of the data; a variety of methods exist.
For 1. Describe | Processing | Filtering, If you want to add or remove data based on its value, start with:
Relational algebra projection and selection
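A minimal sketch of selection (keep rows by value) and projection (keep columns) over a list of dictionaries. The records and field names are hypothetical, and unlike strict relational algebra this projection does not remove duplicate rows.

```python
# Hypothetical records; the field names are illustrative only.
rows = [
    {"id": 1, "city": "Austin", "temp_c": 31.0},
    {"id": 2, "city": "Boston", "temp_c": 18.5},
    {"id": 3, "city": "Austin", "temp_c": 29.0},
]

def select(rows, predicate):
    """Selection: keep only the rows for which predicate(row) is true."""
    return [row for row in rows if predicate(row)]

def project(rows, columns):
    """Projection: keep only the named columns of every row."""
    return [{col: row[col] for col in columns} for row in rows]

warm = select(rows, lambda r: r["temp_c"] > 25.0)
result = project(warm, ["id", "temp_c"])
```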
For 1. Describe | Processing | Filtering, If early results are uninformative and duplicative, start with:
Outlier removal, Exponential smoothing, Gaussian filter, Median filter
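A minimal sketch of one of these options, a median filter, assuming a one-dimensional series and a window that is simply truncated at the edges (other edge policies are possible). The function name and sample values are illustrative.

```python
from statistics import median

def median_filter(values, window=3):
    """Replace each value with the median of a centered window.

    Near the edges the window is truncated rather than padded.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    out = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        out.append(median(values[lo:hi]))
    return out

noisy = [1.0, 1.1, 9.0, 1.2, 1.0]   # 9.0 is a spike
smoothed = median_filter(noisy, window=3)
```

The spike at index 2 is replaced by a neighboring value, while the rest of the series is left nearly unchanged.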
For 1. Describe | Processing | Imputation, If you want to generate values from other observations in your dataset, start with:
Random sampling, Markov Chain Monte Carlo (MCMC)
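A minimal sketch of random-sampling imputation, assuming missing entries are marked with None and that drawing uniformly from the observed values is acceptable. The function name and the fixed seed are illustrative choices.

```python
import random

def impute_by_sampling(values, seed=0):
    """Fill each None with a value drawn at random from the observed entries."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to sample from")
    return [rng.choice(observed) if v is None else v for v in values]

data = [3.2, None, 4.1, 3.9, None]
filled = impute_by_sampling(data)
```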
For 1. Describe | Processing | Imputation, If you want to generate values without using other observations in your dataset, start with:
Mean, Statistical distributions, Regression models
For 1. Describe | Processing | Dimensionality Reduction, If you need to determine whether there is multi-dimensional correlation, start with:
PCA and other factor analysis
For 1. Describe | Processing | Dimensionality Reduction, If you can represent individual observations by membership in a group, start with:
K-means clustering, Canopy clustering
For 1. Describe | Processing | Dimensionality Reduction, If you have unstructured text data, start with:
Term Frequency/Inverse Document Frequency (TF-IDF)
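A minimal sketch of TF-IDF weighting, assuming whitespace tokenization, relative term frequency, and an unsmoothed logarithmic IDF. Libraries often use smoothed variants, so the exact numbers will differ.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = ["the cat sat", "the dog sat", "the bird flew"]
scores = tf_idf(docs)
```

A term that appears in every document ("the") gets weight zero, while a term unique to one document ("cat") gets the highest weight there.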
For 1. Describe | Processing | Dimensionality Reduction, If you have a variable number of features but your algorithm requires a fixed number, start with:
Feature hashing
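A minimal sketch of feature hashing, assuming features arrive as string tokens and that an unsigned count per bucket is enough (signed hashing, which reduces collision bias, is omitted). It uses SHA-256 rather than Python's built-in hash() because the latter is randomized per process for strings.

```python
import hashlib

def hash_features(tokens, n_buckets=8):
    """Map any number of tokens to a fixed-length count vector."""
    vector = [0] * n_buckets
    for token in tokens:
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % n_buckets
        vector[index] += 1  # colliding tokens share a bucket
    return vector

short_vec = hash_features(["red", "car"])
long_vec = hash_features(["red", "car", "fast", "red", "two-door"])
```

Both inputs produce vectors of the same length regardless of how many tokens they contain.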
For 1. Describe | Processing | Dimensionality Reduction, If you are not sure which features are the most important, start with:
Wrapper methods, Sensitivity analysis
For 1. Describe | Processing | Dimensionality Reduction, If you need to facilitate understanding of the probability distribution of the space, start with:
Self-organizing maps
For 1. Describe | Processing | Normalization & Transformation, If you suspect duplicate data elements, start with:
Deduplication
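A minimal sketch of deduplication, assuming two records count as duplicates when a normalizing key function maps them to the same value. The email example and the normalization rule are hypothetical.

```python
def deduplicate(records, key=lambda r: r):
    """Keep the first record for each distinct key, preserving order."""
    seen = set()
    unique = []
    for record in records:
        k = key(record)
        if k not in seen:
            seen.add(k)
            unique.append(record)
    return unique

emails = ["Ann@Example.com", "bob@example.com", "ann@example.com "]
# Treat addresses as equal after trimming whitespace and lowercasing.
unique = deduplicate(emails, key=lambda e: e.strip().lower())
```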
For 1. Describe | Processing | Normalization & Transformation, If you want your data to fall within a specified range, start with:
Normalization
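A minimal sketch of one common form of normalization, min-max scaling, assuming a non-constant numeric series. The function name and target range are illustrative; z-score standardization is another frequent choice.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so they span [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("cannot scale a constant series")
    span = new_max - new_min
    return [new_min + (v - lo) / (hi - lo) * span for v in values]

scaled = min_max_scale([10, 20, 30, 40, 50])
```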
For 1. Describe | Processing | Normalization & Transformation, If your data is stored in a binary format, start with:
Format Conversion
For 1. Describe | Processing | Normalization & Transformation, If you are operating in frequency space, start with:
Fast Fourier Transform (FFT), Discrete wavelet transform
For 1. Describe | Processing | Normalization & Transformation, If you are operating in Euclidean space, start with:
Coordinate transform
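A minimal sketch of one coordinate transform, converting between polar and Cartesian coordinates in the plane. The choice of this particular pair is an assumption made for illustration; the card does not name a specific transform.

```python
import math

def polar_to_cartesian(r, theta):
    """Convert (radius, angle in radians) to (x, y)."""
    return (r * math.cos(theta), r * math.sin(theta))

def cartesian_to_polar(x, y):
    """Convert (x, y) to (radius, angle in radians)."""
    return (math.hypot(x, y), math.atan2(y, x))

x, y = polar_to_cartesian(2.0, math.pi / 2)
r, theta = cartesian_to_polar(x, y)  # round trip back to the input
```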
What is the key question for 1. Describe | Aggregation?
How do I collect and summarize my data?
For 1. Describe | Aggregation, If you are unfamiliar with the dataset, start with:
Basic statistics: Count, Mean, Standard deviation, Range, Scatter plots, Box plots
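A minimal sketch of the numeric summaries in this list, using only the standard library. It assumes the sample (n - 1) standard deviation is wanted; the plots are left out.

```python
from statistics import mean, stdev

def describe(values):
    """Return count, mean, sample standard deviation, and range."""
    return {
        "count": len(values),
        "mean": mean(values),
        "stdev": stdev(values),  # sample standard deviation (divides by n - 1)
        "range": max(values) - min(values),
    }

summary = describe([2, 4, 4, 4, 5, 5, 7, 9])
```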
For 1. Describe | Aggregation, If your approach assumes the data follows a distribution, start with:
Distribution fitting
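A minimal sketch of distribution fitting, assuming a normal distribution is the candidate and that estimating its mean and standard deviation from the sample is sufficient. The sample values are hypothetical; a fuller workflow would compare several candidate distributions and test goodness of fit.

```python
from statistics import NormalDist

samples = [4.8, 5.1, 5.0, 4.9, 5.2]  # hypothetical measurements

# Estimate the normal distribution's parameters from the data.
fitted = NormalDist.from_samples(samples)

# The fitted model can then answer questions about the data,
# for example the probability of a value at or below 5.1.
p_below = fitted.cdf(5.1)
```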
For 1. Describe | Aggregation, If you want to understand all the information available on an entity, start with:
“Baseball card” aggregation
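A minimal sketch of this kind of aggregation, assuming it means gathering every field known about an entity, across all records, into a single summary. The records, field names, and the build_cards helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical records about the same entities from different sources.
records = [
    {"player": "A. Lee", "source": "roster", "team": "Hawks"},
    {"player": "A. Lee", "source": "stats", "games": 42},
    {"player": "B. Cruz", "source": "roster", "team": "Owls"},
]

def build_cards(records, entity_key):
    """Collect all values of every field, grouped by entity."""
    cards = defaultdict(dict)
    for record in records:
        entity = record[entity_key]
        for field, value in record.items():
            if field != entity_key:
                cards[entity].setdefault(field, []).append(value)
    return dict(cards)

cards = build_cards(records, "player")
```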
What is the key question for 1. Describe | Enrichment?
How do I add new information to my data?
For 1. Describe | Enrichment, If you need to keep track of source information or other user-defined parameters, start with:
Annotation
For 1. Describe | Enrichment, If you often process certain data fields together or use one field to compute the value of another, start with:
Relational algebra rename, Feature addition (e.g., Geography, Technology, Weather)
What is the key question for 2. Discover?
What are the key relationships in the data?
What is the key question for 2. Discover | Clustering?
How do I segment the data to find natural groupings?
For 2. Discover | Clustering, If you want an ordered set of clusters with variable precision, start with:
Hierarchical clustering
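A minimal sketch of agglomerative clustering with single linkage on one-dimensional points. The linkage choice and the brute-force search are assumptions made for brevity. The ordered merge list is what gives variable precision: stopping at a smaller distance yields finer clusters, a larger one yields coarser clusters.

```python
def agglomerate(points):
    """Single-linkage agglomerative clustering of 1-D points.

    Returns the merges in order as (cluster_a, cluster_b, distance).
    """
    clusters = [(p,) for p in sorted(points)]
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters whose closest members are nearest.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

merges = agglomerate([1.0, 1.5, 5.0, 5.2, 9.0])
```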