Booz Selection Flashcards
What is the key question for 1. Describe?
How do I develop an understanding of the content of my data?
What is the key question for 1. Describe | Processing?
How do I clean and separate my data?
What is the key question for 1. Describe | Processing | Filtering?
How do I identify data based on its absolute or relative values?
What is the key question for 1. Describe | Processing | Imputation?
How do I fill in missing values in my data?
What is the key question for 1. Describe | Processing | Dimensionality Reduction?
How do I reduce the number of dimensions in my data?
What is the key question for 1. Describe | Processing | Normalization & Transformation?
How do I reconcile duplicate representations in the data?
What is the key question for 1. Describe | Processing | Feature Extraction?
It depends on the domain of the data; a variety of methods apply.
For 1. Describe | Processing | Filtering, If you want to add or remove data based on its value, start with:
Relational algebra projection and selection
For 1. Describe | Processing | Filtering, If early results are uninformative and duplicative, start with:
Outlier removal, Exponential smoothing, Gaussian filter, Median filter
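As an illustration of one option on this card, here is a minimal median-filter sketch. The function name `median_filter`, the `window` parameter, and the truncated-window handling at the edges are assumptions made for the example, not part of the source.

```python
from statistics import median

def median_filter(values, window=3):
    """Replace each point with the median of a centered window.

    Assumption: at the edges the window is simply truncated.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(median(values[lo:hi]))
    return smoothed

# The spike at index 2 is suppressed.
print(median_filter([1, 2, 100, 3, 4]))  # [1.5, 2, 3, 4, 3.5]
```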
For 1. Describe | Processing | Imputation, If you want to generate values from other observations in your dataset, start with:
Random sampling, Markov Chain Monte Carlo (MCMC)
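A minimal sketch of imputation by random sampling, assuming missing values are encoded as `None`. The name `impute_by_sampling` and the `seed` parameter are hypothetical and only serve the example.

```python
import random

def impute_by_sampling(values, seed=None):
    """Fill each missing entry with a random draw from the observed entries."""
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to sample from")
    return [rng.choice(observed) if v is None else v for v in values]

filled = impute_by_sampling([1, None, 3, None], seed=0)
print(filled)  # observed entries are unchanged; gaps hold either 1 or 3
```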
For 1. Describe | Processing | Imputation, If you want to generate values without using other observations in your dataset, start with:
Mean, Statistical distributions, Regression models
For 1. Describe | Processing | Dimensionality Reduction, If you need to determine whether there is multi-dimensional correlation, start with:
PCA and other factor analysis
For 1. Describe | Processing | Dimensionality Reduction, If you can represent individual observations by membership in a group, start with:
K-means clustering, Canopy clustering
For 1. Describe | Processing | Dimensionality Reduction, If you have unstructured text data, start with:
Term Frequency/Inverse Document Frequency (TF-IDF)
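TF-IDF has several variants. The sketch below assumes the plain, unsmoothed form: term frequency is count divided by document length, and inverse document frequency is log(N / document frequency). The function name `tf_idf` and the token-list input format are assumptions for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = [["data", "science"], ["data", "mining"]]
for row in tf_idf(docs):
    print(row)
```

A term that appears in every document ("data" here) gets weight 0 under this variant, which is why the distinguishing terms dominate.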
For 1. Describe | Processing | Dimensionality Reduction, If you have a variable number of features but your algorithm requires a fixed number, start with:
Feature hashing
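A minimal sketch of feature hashing (the "hashing trick"), assuming string tokens and a simple count per bucket. The name `hash_features`, the choice of MD5, and the default of 8 buckets are illustrative assumptions. Distinct tokens can collide in the same bucket.

```python
import hashlib

def hash_features(tokens, n_buckets=8):
    """Map any number of tokens to a fixed-length count vector."""
    vector = [0] * n_buckets
    for token in tokens:
        # A stable hash keeps the mapping identical across runs.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % n_buckets
        vector[index] += 1
    return vector

print(hash_features(["red", "green", "blue", "red"]))
```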
For 1. Describe | Processing | Dimensionality Reduction, If you are not sure which features are the most important, start with:
Wrapper methods, Sensitivity analysis
For 1. Describe | Processing | Dimensionality Reduction, If you need to facilitate understanding of the probability distribution of the space, start with:
Self-organizing maps
For 1. Describe | Processing | Normalization & Transformation, If you suspect duplicate data elements, start with:
Deduplication
For 1. Describe | Processing | Normalization & Transformation, If you want your data to fall within a specified range, start with:
Normalization
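One common way to force data into a specified range is min-max scaling. The sketch below assumes that technique; the name `min_max_scale` and the `low`/`high` parameters are hypothetical.

```python
def min_max_scale(values, low=0.0, high=1.0):
    """Linearly rescale values so the minimum maps to low and the maximum to high."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("cannot scale constant data")
    return [low + (v - lo) * (high - low) / (hi - lo) for v in values]

print(min_max_scale([2, 4, 6]))  # [0.0, 0.5, 1.0]
```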
For 1. Describe | Processing | Normalization & Transformation, If your data is stored in a binary format, start with:
Format Conversion
For 1. Describe | Processing | Normalization & Transformation, If you are operating in frequency space, start with:
Fast Fourier Transform (FFT), Discrete wavelet transform
For 1. Describe | Processing | Normalization & Transformation, If you are operating in Euclidean space, start with:
Coordinate transform
What is the key question for 1. Describe | Aggregation?
How do I collect and summarize my data?
For 1. Describe | Aggregation, If you are unfamiliar with the dataset, start with:
Basic statistics: Count, Mean, Standard deviation, Range, Scatter plots, Box plots
For 1. Describe | Aggregation, If your approach assumes the data follows a distribution, start with:
Distribution fitting
For 1. Describe | Aggregation, If you want to understand all the information available on an entity, start with:
“Baseball card” aggregation
What is the key question for 1. Describe | Enrichment?
How do I add new information to my data?
For 1. Describe | Enrichment, If you need to keep track of source information or other user-defined parameters, start with:
Annotation
For 1. Describe | Enrichment, If you often process certain data fields together or use one field to compute the value of another, start with:
Relational algebra rename, Feature addition (e.g., Geography, Technology, Weather)
What is the key question for 2. Discover?
What are the key relationships in the data?
What is the key question for 2. Discover | Clustering?
How do I segment the data to find natural groupings?
For 2. Discover | Clustering, If you want an ordered set of clusters with variable precision, start with:
Hierarchical
For 2. Discover | Clustering, If you have an unknown number of clusters, start with:
X-means, Canopy, Apriori
For 2. Discover | Clustering, If you have text data, start with:
Topic modeling
For 2. Discover | Clustering, If you have non-elliptical clusters, start with:
Fractal, DBSCAN
For 2. Discover | Clustering, If you want soft membership in the clusters, start with:
Gaussian mixture models
For 2. Discover | Clustering, If you have a known number of clusters, start with:
K-means
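A minimal sketch of K-means using Lloyd's iteration (assign each point to its nearest centroid, then move each centroid to the mean of its points). The function name `k_means`, the random initialization from the data, and the `max_iterations` and `seed` parameters are assumptions for the example.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, k, max_iterations=100, seed=0):
    """points: list of equal-length tuples. Returns (centroids, clusters)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: start from k random data points
    clusters = []
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return centroids, clusters

centroids, clusters = k_means([(0, 0), (0, 1), (10, 10), (10, 11)], k=2)
print(sorted(centroids))  # [(0.0, 0.5), (10.0, 10.5)]
```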
What is the key question for 2. Discover | Regression?
How do I determine which variables may be important?
For 2. Discover | Regression, If your data has unknown structure, start with:
Tree-based methods
For 2. Discover | Regression, If statistical measures of importance are needed, start with:
Generalized linear models
For 2. Discover | Regression, If statistical measures of importance are not needed, start with:
Regression with shrinkage (e.g., LASSO, Elastic net), Stepwise regression
What is the key question for 2. Discover | Hypothesis Testing?
How do I test ideas?
For 2. Discover | Hypothesis Testing, If you want to compare two groups, start with:
T-test
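The card does not say which t-test variant is meant. The sketch below assumes Welch's unequal-variance form and returns only the test statistic and the approximate degrees of freedom; turning these into a p-value requires the t-distribution CDF, which is not in the Python standard library. The name `welch_t` is hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    # Squared standard error of each sample mean.
    se2_a = variance(sample_a) / len(sample_a)
    se2_b = variance(sample_b) / len(sample_b)
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(sample_a) - 1) + se2_b ** 2 / (len(sample_b) - 1)
    )
    return t, df

t, df = welch_t([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
print(t, df)  # t is -1.0 and df is 8.0, up to floating-point rounding
```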
For 2. Discover | Hypothesis Testing, If you want to compare multiple groups, start with:
ANOVA
What is the key question for 3. Predict?
What are the likely future outcomes?
What is the key question for 3. Predict | Classification?
How do I predict group membership?
For 3. Predict | Classification, If you have known dependent relationships between variables, start with:
Bayesian network
For 3. Predict | Classification, If you are unsure of feature importance, start with:
Neural nets, Random forests, Deep learning
For 3. Predict | Classification, If you require a highly transparent model, start with:
Decision trees
For 3. Predict | Classification, If you have fewer than 20 data dimensions, start with:
K-nearest neighbors
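A minimal K-nearest-neighbors classifier, assuming Euclidean distance and a simple majority vote. The name `knn_predict` and the `(features, label)` training format are assumptions for illustration; `math.dist` requires Python 3.8 or later.

```python
from collections import Counter
from math import dist

def knn_predict(train, query, k=3):
    """train: list of (features, label) pairs. Returns the majority label of the k nearest."""
    neighbors = sorted(train, key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((0, 0), "a"), ((0, 1), "a"), ((5, 5), "b"), ((6, 5), "b"), ((5, 6), "b")]
print(knn_predict(train, (0.2, 0.3)))  # a
print(knn_predict(train, (5, 5.5)))    # b
```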
For 3. Predict | Classification, If you have a large dataset with an unknown classification signal, start with:
Naive Bayes
For 3. Predict | Classification, If you want to estimate an unobservable state based on observable variables, start with:
Hidden Markov model
For 3. Predict | Classification, If you don’t know where else to begin, start with:
Support vector machines (SVM), Random forests
What is the key question for 3. Predict | Regression?
How do I predict a future value?
For 3. Predict | Regression, If the data structure is unknown, start with:
Tree-based methods
For 3. Predict | Regression, If you require a highly transparent model, start with:
Generalized linear models
For 3. Predict | Regression, If you have fewer than 20 data dimensions, start with:
K-nearest neighbors
What is the key question for 3. Predict | Recommendation?
How do I predict relevant conditions?
For 3. Predict | Recommendation, If you only have knowledge of how people interact with items, start with:
Collaborative filtering
For 3. Predict | Recommendation, If you have a feature vector of item characteristics, start with:
Content-based methods
For 3. Predict | Recommendation, If you only have knowledge of how items are connected to one another, start with:
Graph-based methods
What is the key question for 4. Advise?
What course of action should I take?
What is the key question for 4. Advise | Logical Reasoning?
How do I sort through different evidence?
For 4. Advise | Logical Reasoning, If you have expert knowledge to capture, start with:
Expert systems
For 4. Advise | Logical Reasoning, If you’re looking for basic facts, start with:
Logical reasoning
What is the key question for 4. Advise | Optimization?
How do I identify the best course of action when my objective can be expressed as a utility function?
For 4. Advise | Optimization, If your problem is represented by a non-deterministic utility function, start with:
Stochastic search
For 4. Advise | Optimization, If approximate solutions are acceptable, start with:
Genetic algorithms, Simulated annealing, Gradient search
For 4. Advise | Optimization, If your problem is represented by a deterministic utility function, start with:
Linear programming, Integer programming, Non-linear programming
For 4. Advise | Optimization, If you have limited resources to search with, start with:
Active learning
For 4. Advise | Optimization, If you want to try multiple models, start with:
Ensemble learning
What is the key question for 4. Advise | Simulation?
How do I characterize a system that does not have a closed-form representation?
For 4. Advise | Simulation, If you must model discrete entities, start with:
Discrete event simulation (DES)
For 4. Advise | Simulation, If there is a discrete set of possible states, start with:
Markov models
For 4. Advise | Simulation, If there are actions and interactions among autonomous entities, start with:
Agent-based simulation
For 4. Advise | Simulation, If you do not need to model discrete entities, start with:
Monte Carlo simulation
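As a toy illustration of Monte Carlo simulation, the sketch below estimates pi by sampling random points in the unit square and counting how many fall inside the quarter circle. The example problem, the name `estimate_pi`, and the `seed` parameter are assumptions chosen for brevity, not taken from the source.

```python
import random

def estimate_pi(n_samples, seed=None):
    """Estimate pi as 4 times the fraction of random points inside the unit quarter circle."""
    rng = random.Random(seed)
    inside = sum(
        1 for _ in range(n_samples)
        if rng.random() ** 2 + rng.random() ** 2 <= 1.0
    )
    return 4.0 * inside / n_samples

print(estimate_pi(100_000, seed=1))  # close to 3.14; the error shrinks roughly as 1/sqrt(n_samples)
```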
For 4. Advise | Simulation, If you are modeling a complex system with feedback mechanisms between actions, start with:
System dynamics
For 4. Advise | Simulation, If you require continuous tracking of system behavior, start with:
Activity-based simulation
For 4. Advise | Simulation, If you already have an understanding of what factors govern the system, start with:
Ordinary differential equations (ODEs), Partial differential equations (PDEs)
For 4. Advise | Simulation, If you have imprecise categories, start with:
Fuzzy logic