Study Guide - questions B Flashcards
What type of sampling has been shown to lead to significant bias?
Convenience sampling - surveying subjects who are easy to identify or reach
In rare event analysis, it may be advantageous to bias the sampling toward those individuals most likely to have _______.
Experienced the event of interest. This approach is known as stratified random sampling.
Which sampling method ensures that each subgroup of a given population is adequately represented within the whole sample of a research study?
Stratified random sampling
It is common to use __________ (and especially regression) to specify the value of interest as a function of the covariates (characteristics).
Response surface modeling
When the variable is ratio scale, _________ are often used to achieve normality.
Box-Cox Transformations
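A minimal sketch of a Box-Cox transformation, assuming SciPy is available; the skewed sample here is synthetic and only for illustration.

```python
# Minimal sketch: applying a Box-Cox transformation to skewed ratio-scale data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=500)  # positive, right-skewed sample

transformed, lam = stats.boxcox(skewed)                # lam is the fitted lambda parameter
print(f"fitted lambda: {lam:.3f}")
print(f"skewness before: {stats.skew(skewed):.2f}, after: {stats.skew(transformed):.2f}")
```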
When the dependent variable is categorical, the regression model is typically ______?
logistic
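A minimal sketch of a logistic regression for a binary (categorical) dependent variable, using scikit-learn on synthetic data.

```python
# Minimal sketch: logistic regression with a categorical (binary) dependent variable.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))                                   # three covariates
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200) > 0).astype(int)

model = LogisticRegression().fit(X, y)
print(model.coef_, model.intercept_)
print(model.predict_proba(X[:5]))                               # class probabilities
```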
When the dependent variable is ordinal, the regression model is typically ordered ______?
logit
When the dependent variable is ratio, ________ is often used?
standard regression
If Y is the dependent variable and X1…Xn represent the independent variables, then the typical regression model has the form ________?
Y = E[Y] + e
where e is a normally distributed error term and E[Y], the expected value of Y, is a parameterized function of X1…Xn.
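Written out, assuming the common linear parameterization (the card leaves the function unspecified):

```latex
Y = \mathrm{E}[Y] + \varepsilon, \qquad \varepsilon \sim N(0,\sigma^{2}), \qquad
\mathrm{E}[Y] = f(X_1, \dots, X_n; \beta)
\quad\text{e.g. } \mathrm{E}[Y] = \beta_0 + \beta_1 X_1 + \cdots + \beta_n X_n
```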
Time series analysis typically corrects for _____?
Seasonal patterns; it also provides a natural way of identifying trends
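A minimal sketch of a seasonal decomposition, assuming statsmodels; the monthly series is synthetic.

```python
# Minimal sketch: decomposing a series into trend, seasonal, and residual parts.
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

idx = pd.date_range("2020-01-01", periods=48, freq="MS")        # 4 years of monthly data
t = np.arange(48)
series = pd.Series(10 + 0.3 * t + 5 * np.sin(2 * np.pi * t / 12), index=idx)

result = seasonal_decompose(series, model="additive", period=12)
print(result.trend.dropna().head())                             # the underlying trend
print(result.seasonal.head(12))                                 # the repeating seasonal pattern
```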
Sampling plan - A simple rule of thumb is that _______ the number of individuals sampled reduces the uncertainty by half.
Quadrupling - the standard error of an estimate scales as 1/√n, so multiplying the sample size by four cuts the uncertainty in half.
Sampling plan - _______ is a common way to measure uncertainty
Standard deviation
Sampling plan - If the standard deviation does not exist, then the difference between the ____________ is more appropriate.
third and first fractile of the uncertainty distribution
Sampling plan - If our uncertainty is described by an exponential family distribution, it will have how many parameters?
two
Determining questions to be asked - A key issue in designing the experiment is determining what?
The nature of the variable being assessed (e.g. categorical)
YES/NO questions or multiple choice are typically used for what type of scale?
Nominal scales
For ordinal scales, it is possible to define the normalized quantity for each response x by the fraction of responses __________?
Less than or equal to x
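A minimal sketch of that normalization (the cumulative fraction of responses at or below each level); the response values and their ordering are assumptions for illustration.

```python
# Minimal sketch: normalizing ordinal responses by the fraction of responses <= x.
import pandas as pd

responses = pd.Series(["low", "med", "med", "high", "low", "med", "high", "high", "high"])
order = ["low", "med", "high"]                       # assumed ordinal level ordering

counts = responses.value_counts().reindex(order, fill_value=0)
cum_fraction = counts.cumsum() / counts.sum()        # fraction of responses <= each level
print(cum_fraction)
```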
For semantic differential survey responses of the form “very hard, somewhat hard, okay, somewhat easy, very easy”, where the two ends of the scale represent opposites, the response is _______?
Ordinal
What survey approach asks individuals to rate various factors in order of importance?
Rank- order
Determining a control group - measurements are typically only meaningful if there is reference to some kind of _____________?
Underlying standard
When the item is an uncertain quantity, the score of an item is the probability of the item outranking a randomly chosen item from the __________?
Benchmark group
The benchmark group is commonly referred to as a _____ with the item’s score being called its _______?
Control; effect size
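A minimal sketch of that score computed by brute force over all pairs; both samples are synthetic, and the choice of normal draws is purely illustrative.

```python
# Minimal sketch: score = P(item draw outranks a random draw from the control group).
import numpy as np

rng = np.random.default_rng(2)
item_draws = rng.normal(loc=1.0, scale=1.0, size=1000)     # uncertain item quantity
control_draws = rng.normal(loc=0.0, scale=1.0, size=1000)  # benchmark (control) group

# Fraction of (item, control) pairs in which the item comes out ahead.
score = np.mean(item_draws[:, None] > control_draws[None, :])
print(f"effect size (probability of outranking the control): {score:.3f}")
```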
The purpose of extraction is to collect all this data from the many sources so that it can eventually be loaded into a common ________.
database
In extracting data, it is critical to know the _______ from which each data element was taken.
data source
What is needed if there is a change in the client's analysis and it is important to transition the database to reflect the data sources the new client considers important?
Traceability - this typically requires careful documentation
What are three reasons why survey quality may be deficient?
- Respondents get fatigued and enter arbitrary values
- Respondents may be offended by questions and deliberately fill in false answers
- Respondents refuse to fill out the survey
Data cleaning involves the following 6 items:
- Identifying the range of valid responses
- Identifying invalid data responses
- Identifying inconsistent data encodings
- Identifying suspicious data responses
- Identifying suspicious distribution of values
- Identifying suspicious interrelationships between fields.
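A minimal sketch of a few of these checks in pandas; the column names, valid ranges, and valid codes are assumptions for illustration.

```python
# Minimal sketch: a few of the data-cleaning checks above, applied to a toy table.
import pandas as pd

df = pd.DataFrame({
    "age": [34, 27, 180, 45, -3],               # 180 and -3 are suspicious/invalid
    "gender": ["M", "F", "F", "X", "M"],         # "X" is not a valid encoding here
})

valid_age = df["age"].between(0, 120)            # identify the range of valid responses
valid_gender = df["gender"].isin(["M", "F"])     # identify inconsistent encodings

print(df[~valid_age])                            # rows with invalid or suspicious ages
print(df[~valid_gender])                         # rows with unexpected gender codes
print(df["age"].describe())                      # eyeball the distribution of values
```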
A key part of data cleaning is determining whether the data makes sense, and also involves handling _______.
Null or missing values
What are four possible solutions to missing values?
- Deletion
- Deletion when necessary
- Imputing a value
- Randomly imputing a value
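A minimal sketch of those options in pandas; the toy series is synthetic, and "deletion when necessary" is simply applying the same dropna selectively.

```python
# Minimal sketch: deletion, imputation, and random imputation of missing values.
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0, 6.0])

dropped = s.dropna()                                   # deletion
imputed_mean = s.fillna(s.mean())                      # imputing a value (here, the mean)
random_fill = s.copy()
mask = random_fill.isna()
random_fill[mask] = rng.choice(s.dropna().to_numpy(), size=mask.sum())  # random imputation
print(dropped, imputed_mean, random_fill, sep="\n\n")
```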
What are the 10 “Cs” checks on quality of the data?
- Completeness
- Correctness
- Consistency (is data under a given field consistent with definition of that field?)
- Currency (is data obsolete?)
- Collaborative (is data based on one opinion or a consensus of experts?)
- Confidential
- Clarity (is data legible and comprehensible)
- Common format
- Convenient (can data be conveniently and quickly accessed)
- Cost-effective (is cost of collecting data commensurate with its value).
A data warehouse is generally used to describe these three things:
- A staging area
- Data integration in centralized source
- Access layers in OLAP data marts
Data marts are organized along a single point of view for efficient data retrieval. They allow analysts to do these 5 things:
- Slice data (filtering)
- Dice data (grouping)
- Drill down
- Roll-up
- Pivot
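A minimal sketch of slicing, dicing, rolling up, and pivoting with pandas on a toy fact table; the column names are assumptions, and a real data mart would do this in the OLAP layer.

```python
# Minimal sketch: slice (filter), dice (group), roll-up, and pivot on a toy fact table.
import pandas as pd

sales = pd.DataFrame({
    "region":  ["East", "East", "West", "West", "West"],
    "quarter": ["Q1", "Q2", "Q1", "Q1", "Q2"],
    "product": ["A", "A", "A", "B", "B"],
    "revenue": [100, 120, 90, 150, 160],
})

sliced = sales[sales["region"] == "East"]                      # slice: filter one dimension
diced = sales.groupby(["region", "product"])["revenue"].sum()  # dice: group on two dimensions
rolled_up = sales.groupby("region")["revenue"].sum()           # roll-up: coarser aggregation
pivoted = sales.pivot_table(index="region", columns="quarter",
                            values="revenue", aggfunc="sum")   # pivot: rotate dimensions
print(sliced, diced, rolled_up, pivoted, sep="\n\n")
```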
What are three examples of fact tables?
- Transaction fact tables
- Snapshot fact tables (at point in time)
- Accumulating fact tables (aggregate facts)
Do dimension tables have a larger or smaller number of records compared to fact tables?
smaller
What are 5 examples of dimension tables?
- time
- geography
- product
- employee
- range
Discovering relationships in data - what are 5 methods to reduce dimensions in the data?
- PCA or factor analysis (can determine if there is correlation across different dimensions)
- Term frequency-inverse document frequency (TF-IDF)
- Feature hashing (creating fixed number of features)
- Sensitivity analysis and wrapper methods
- Self-organizing maps and Bayes nets
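A minimal sketch of the first item above, PCA for dimension reduction with scikit-learn; the four-column input is synthetic and deliberately correlated.

```python
# Minimal sketch: reducing correlated features to a few principal components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
base = rng.normal(size=(300, 2))
noise = 0.05 * rng.normal(size=(300, 2))
X = np.column_stack([base, base @ np.array([[1.0, 0.5], [0.5, 1.0]]) + noise])
# X has 4 columns, but the last two are nearly linear combinations of the first two.

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(pca.explained_variance_ratio_)   # nearly all variance in the first two components
print(X_reduced.shape)                 # (300, 2)
```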
When data has a variable number of features, _________ is an efficient method of creating a fixed number of features which form the indices of an array.
Feature hashing
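A minimal sketch of feature hashing with scikit-learn's FeatureHasher; the token lists are made up and vary in length on purpose.

```python
# Minimal sketch: hashing a variable number of features into a fixed-length vector.
from sklearn.feature_extraction import FeatureHasher

hasher = FeatureHasher(n_features=8, input_type="string")
records = [
    ["red", "small", "glass"],          # each record has a variable number of tokens
    ["blue", "large"],
    ["red", "metal", "heavy", "round"],
]
X = hasher.transform(records)           # sparse matrix with exactly 8 columns
print(X.toarray())
```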
For unstructured text data, __________ identifies the importance of a word in a document in a collection by comparing the frequency with which the word appears in that document with its frequency across the collection as a whole.
Term frequency-inverse document frequency (TF-IDF)
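A minimal sketch of TF-IDF weighting with scikit-learn; the three documents are made up.

```python
# Minimal sketch: TF-IDF weighting of words across a small document collection.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are common pets",
]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)            # rows = documents, columns = terms
print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))                   # high weight = frequent here, rare elsewhere
```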
_______ and _______ are typically essential when you don’t know which features of your data are important.
Sensitivity analysis and wrapper methods
Wrapper methods, unlike sensitivity analysis, typically involve identifying a set of features on a small sample and then testing that set on a ________.
holdout sample
________ and _______ are helpful in understanding the probability distribution of the data.
Self-organizing maps and Bayes nets
Extracting features - ________ is required to ensure your data stays within common ranges.
Normalization
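A minimal sketch of two common normalizations with scikit-learn; the toy matrix is synthetic.

```python
# Minimal sketch: rescaling features to comparable ranges before modeling.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 1000.0]])  # columns on very different scales

print(MinMaxScaler().fit_transform(X))      # each column rescaled to [0, 1]
print(StandardScaler().fit_transform(X))    # each column to zero mean, unit variance
```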
Format conversion is typically required when data is in __________?
binary format
Fast Fourier transforms and discrete wavelet transforms are used for _________?
frequency data
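A minimal sketch of pulling dominant frequencies out of a signal with NumPy's FFT; the sampling rate and the two component frequencies are assumptions.

```python
# Minimal sketch: extracting dominant frequencies from a sampled signal with an FFT.
import numpy as np

fs = 100.0                                    # sampling rate in Hz (assumed)
t = np.arange(0, 2.0, 1.0 / fs)
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 20 * t)

spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(signal.size, d=1.0 / fs)
print(freqs[np.argsort(spectrum)[-2:]])       # the two strongest frequencies (~5 and ~20 Hz)
```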
Coordinate transformations are used for geometric data defined over ________?
Euclidean space
Collecting and summarizing data - These three plots provide compact representations of how data is distributed?
- Box plots
- Scatter plots
- box and whisker plots
Collecting and summarizing data - when the data can be reasonably described by parametric distributions, ___________ is an even more efficient way of summarizing the data.
Distribution fitting
Collecting and summarizing data - ___________ aggregation is an effective way of summarizing all the information available on an entity
Baseball card
Adding new information to the data - ________ is recommended for tracking source information and other user-defined parameters.
Annotation
Adding new info to the data - ____________ and _______ can be helpful in processing certain data fields together or in using one field to compute the value of another.
Relational algebra rename and feature addition
What are the 6 methods for segmenting data to find natural groupings?
- Connectivity-based methods (hierarchical clustering)
- Centroid-based methods
- Distribution-based methods
- Density-based methods
- Graph-based methods
- Topic modeling (text data)
segmentation -A connectivity-based method called _________ generates an ordered set of clusters with variable precision.
Hierarchical clustering
segmentation - A centroid-based method with a known number of clusters
K-means clustering
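A minimal sketch of k-means with a known number of clusters, using scikit-learn on synthetic blobs.

```python
# Minimal sketch: k-means with a known number of clusters (k = 3) on synthetic data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)
print(kmeans.labels_[:10])          # cluster assignment for the first few points
```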
segmentation - A centroid-based method with an unknown number of clusters.
x-means clustering
segmentation - A centroid-based method that is an alternate way of enhancing k-means when the number of clusters is unknown
canopy clustering
segmentation - A distribution-based method that typically uses the expectation-maximization (EM) algorithm and is appropriate if you want any data elements’ membership in a segment to be ‘soft’
Gaussian mixture models
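A minimal sketch of a Gaussian mixture fit by EM in scikit-learn, showing the 'soft' membership probabilities; the blobs are synthetic.

```python
# Minimal sketch: a Gaussian mixture fit by EM, giving 'soft' segment membership.
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=1)
gmm = GaussianMixture(n_components=3, random_state=1).fit(X)
print(gmm.predict_proba(X[:5]).round(3))   # each row: probability of belonging to each segment
```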
segmentation - Two density-based methods used for non-elliptical clusters are _________?
Fractal clustering and DBSCAN
segmentation - _________ methods are often based on constructing cliques and semi-cliques, and are useful when you only have knowledge of how one item is connected to another.
Graph-based methods
segmentation - For text data, this method allows for segmentation of the data.
topic modeling
variable importance - When the structure of the data is unknown, these methods are helpful.
tree-based methods
variable importance - If statistical measures of importance are needed, these models are appropriate.
Generalized linear models
variable importance - if statistical measures of importance are NOT needed, these two methods are useful.
- regression with shrinkage (e.g. Lasso or elastic net)
- stepwise regression
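A minimal sketch of regression with shrinkage (lasso and elastic net) in scikit-learn on synthetic data; the alpha values are arbitrary and would normally be tuned.

```python
# Minimal sketch: regression with shrinkage (lasso and elastic net).
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)
print((lasso.coef_ != 0).sum(), "features kept by the lasso")   # many coefficients shrink to 0
print((enet.coef_ != 0).sum(), "features kept by the elastic net")
```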
classifying data into groups - These two methods are helpful if you’re unsure of feature importance.
- neural nets
- random forests
classifying data into groups - If you require a highly transparent model, this type of model can be preferable.
decision trees (e.g. CART, CHAID)
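A minimal sketch contrasting a random forest with a single, more transparent decision tree, using scikit-learn on synthetic data.

```python
# Minimal sketch: a random forest (robust when feature importance is unknown)
# next to a single decision tree (more transparent, CART-style).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

print("forest accuracy:", forest.score(X_test, y_test))
print("tree accuracy:  ", tree.score(X_test, y_test))
print(export_text(tree))     # the tree's rules can be read directly, hence 'transparent'
```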