Chapter 1 Flashcards
What is “big data”?
explosion in secondary data typified by increases in the volume, variety, and velocity of the data being made available from a myriad of sources
What is “bivariate partial correlation”?
simple (two-variable) correlation between 2 sets of residuals (unexplained variance) that remain after the association of other independent variables is removed
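A minimal numpy sketch of the idea (data and variable names are illustrative): regress x and y each on a control variable z, then take the simple correlation of the two sets of residuals.

```python
import numpy as np

# Illustrative data: z drives both x and y, and x also drives y
rng = np.random.default_rng(0)
z = rng.normal(size=200)
x = 0.6 * z + rng.normal(size=200)
y = 0.4 * z + 0.5 * x + rng.normal(size=200)

def residuals(v, z):
    """Residuals of v after a simple regression on z (with intercept)."""
    Z = np.column_stack([np.ones_like(z), z])
    beta, *_ = np.linalg.lstsq(Z, v, rcond=None)
    return v - Z @ beta

# Partial correlation of x and y, controlling for z: the simple
# correlation between the two sets of residuals
r_partial = np.corrcoef(residuals(x, z), residuals(y, z))[0, 1]
print(round(r_partial, 2))
```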
What is “bootstrapping”?
approach to validating a multivariate model by drawing a large number of subsamples and estimating models for each subsample
● Doesn't rely on statistical assumptions about the population to assess statistical significance; instead, the assessment is based solely on the sample data
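A minimal sketch assuming only numpy: resample the rows with replacement many times, re-estimate a regression slope on each subsample, and build a confidence interval from the sample alone.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(size=100)

def slope(x, y):
    """OLS slope of y on x (with intercept)."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

# Draw a large number of subsamples and estimate the model for each
boot = []
for _ in range(2000):
    idx = rng.integers(0, len(x), len(x))  # sample rows with replacement
    boot.append(slope(x[idx], y[idx]))

# Percentile confidence interval based solely on the sample data
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"95% bootstrap CI for slope: [{lo:.2f}, {hi:.2f}]")
```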
What is “causal inference”?
methods that move beyond statistical inference to the stronger statement of “cause and effect” in non-experimental situations
What is “cross validation”?
original sample is divided into a number of smaller subsamples (validation samples); the validation fit is the “average” fit across all subsamples
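A short sketch assuming scikit-learn is available: 5-fold cross-validation of a linear regression, averaging fit across the validation subsamples.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Simulated data with a known linear structure
rng = np.random.default_rng(2)
X = rng.normal(size=(120, 3))
y = X @ np.array([1.0, -0.5, 0.25]) + rng.normal(size=120)

# Each fold serves once as the held-out validation sample
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print(scores.mean())  # the "average" fit across all 5 subsamples
```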
What are “data mining models”?
based on algorithms that are widely used in big data applications
● Emphasis on predictive accuracy rather than statistical inference and explanation, as seen in statistical/data models such as multiple regression
What is “dependence technique”?
classification of statistical techniques distinguished by having a variable or set of variables identified as the dependent variable(s) and the remaining variables as independent
● Objective = prediction of the DV(s) by IV(s)
● Dependent variable → presumed effect of, or response to, a change in the IV(s)
● Independent variable → presumed cause of any change in the DV
What is “dimensional reduction”?
reduction of multicollinearity among variables by forming composite measures of multicollinear variables through such methods as exploratory factor analysis
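A hedged sketch: the definition names exploratory factor analysis; here principal components stands in as one common way to collapse several collinear items into a single composite (data are simulated).

```python
import numpy as np
from sklearn.decomposition import PCA

# Four items driven by one latent factor, so they are highly collinear
rng = np.random.default_rng(3)
f = rng.normal(size=(200, 1))
X = f @ np.ones((1, 4)) + 0.3 * rng.normal(size=(200, 4))

# One composite measure replaces the four multicollinear variables
composite = PCA(n_components=1).fit_transform(X)
print(composite.shape)  # (200, 1)
```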
What is “directed acyclic graph (DAG)”?
Graphical portrayal of causal relationships used in causal inference analysis to identify all “threats” to causal inference. Similar in some ways to path diagrams used in structural equation modeling.
What is a “dummy variable”?
nonmetrically measured variable transformed into a metric variable
○ Assigning a 1 or 0 to a subject
○ Always have one dummy variable less than the number of levels for the nonmetric variable
■ The omitted category is the reference category
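A minimal pandas sketch: a three-level nonmetric variable becomes 3 − 1 = 2 dummy (0/1) columns, with the dropped level serving as the reference category.

```python
import pandas as pd

df = pd.DataFrame({"occupation": ["physician", "attorney", "professor",
                                  "attorney", "physician"]})

# Three levels -> two 0/1 dummy columns; the dropped level
# ("attorney", first alphabetically) is the reference category
dummies = pd.get_dummies(df["occupation"], drop_first=True, dtype=int)
print(dummies)
```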
Effect size
estimate of the degree to which the phenomenon being studied (e.g. correlation or difference in means) exists in the population
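A small worked example of one common effect size, Cohen's d for a difference in means (data are illustrative).

```python
import numpy as np

# Two illustrative groups of scores
a = np.array([5.1, 4.8, 5.5, 5.0, 4.9])
b = np.array([4.2, 4.0, 4.5, 4.1, 4.3])

# Pooled standard deviation, then Cohen's d = mean difference / pooled SD
sp = np.sqrt(((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
             / (len(a) + len(b) - 2))
d = (a.mean() - b.mean()) / sp
print(round(d, 2))
```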
Estimation sample
portion of original sample used for model estimation in conjunction with validation sample
Validation sample
portion of the sample “held out” from estimation and then used for an independent assessment of model fit on data that wasn’t used in estimation (holdout sample)
General linear model (GLM)
Fundamental linear dependence model which can be used to estimate many model types (e.g., multiple regression, ANOVA/MANOVA, discriminant analysis) with the assumption of a normally distributed dependent measure.
Generalized linear model (GLZ or GLIM)
similar in form to GLM, but able to accommodate non-normal dependent measures such as binary variables
● Logistic regression model
● Uses maximum likelihood estimation rather than ordinary least squares
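A short sketch assuming statsmodels is available: the same linear predictor as a GLM, but with a Binomial family fit by maximum likelihood, i.e., logistic regression as a GLZ.

```python
import numpy as np
import statsmodels.api as sm

# Simulated binary dependent measure from a logistic model
rng = np.random.default_rng(4)
X = sm.add_constant(rng.normal(size=(300, 2)))
p = 1 / (1 + np.exp(-(X @ np.array([0.2, 1.0, -0.7]))))
y = rng.binomial(1, p)

# Binomial family = logistic regression, fit by maximum likelihood
model = sm.GLM(y, X, family=sm.families.Binomial()).fit()
print(model.params)
```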
Indicator
single variable used in conjunction with one or more other variables to form a composite measure
● Composite measure → combination of two or more indicators
Measurement error
inaccuracies of measuring the “true” variable values due to the fallibility of the measurement instrument, data entry errors, or respondent errors
Metric data
Also called quantitative data, interval data, or ratio data, these measurements identify or describe subjects (or objects) not only on the possession of an attribute but also by the amount or degree to which the subject may be characterized by the attribute. For example, a person’s age and weight are metric data.
Non-metric Data
Also called qualitative data, these are attributes, characteristics, or categorical properties that identify or describe a subject or object. They differ from metric data by indicating the presence of an attribute, but not the amount. Examples are occupation (physician, attorney, professor) or buyer status (buyer, non-buyer). Also called nominal data or ordinal data.
● Difference from metric → these indicate the presence of an attribute, but not the amount
Multicollinearity
Extent to which a variable can be explained by the other variables in the analysis.
- As multicollinearity increases, it complicates the interpretation of the variate because it is more difficult to ascertain the effect of any single variable, owing to their interrelationships.
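A minimal sketch of one standard diagnostic, the variance inflation factor (VIF), assuming statsmodels is available; x2 is built to be nearly redundant with x1.

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(5)
x1 = rng.normal(size=150)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=150)  # nearly redundant with x1
x3 = rng.normal(size=150)
X = np.column_stack([x1, x2, x3])

# High VIF = the variable is largely explained by the other variables
for i in range(X.shape[1]):
    print(f"VIF x{i + 1}: {variance_inflation_factor(X, i):.1f}")
```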
Multivariate analysis
Analysis of multiple variables in a single relationship or set of relationships.
Multivariate measurement
the use of two or more variables as indicators of a single composite measure
- For example, a personality test may provide the answers to a series of individual questions (indicators), which are then combined to form a single score (summated scale) representing the personality trait.
Overfitting
estimation of model parameters that over-represent the characteristics of the sample at the expense of generalizability to the population
Practical significance
assessing multivariate analysis results based on the substantive findings rather than their statistical significance
● E.g., assesses whether the result is useful in achieving research objectives, not just whether the result is attributable to chance
Reliability
extent to which a (set of) variable(s) is consistent in what it’s intended to measure
● If multiple measurements are taken, reliable measures will all be consistent in their values
- It differs from validity in that it relates not to what should be measured, but instead to how it is measured.
● Consistency of the measure
Validity
extent to which a (set of) measure(s) correctly represents the concept of study
● Degree to which it’s free from any systematic or nonrandom error
● Concerned with how well the concept is defined by the measure(s) (vs. the consistency of measures, as with reliability)
Specification error
omitting a key variable from the analysis, affecting the estimated effects of included variables
Statistical model
specific model is proposed, then estimated and a statistical inference is made as to its generalizability to the population through statistical tests
Summated scales
method of combining several variables that measure the same concept into a single variable in an attempt to increase the reliability of the measurement through multivariate measurement
- In most instances, the separate variables are summed and then their total or average score is used in the analysis.
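A minimal pandas sketch (column names are hypothetical): three items measuring the same concept averaged into one summated-scale score.

```python
import pandas as pd

# Three indicators of the same concept (hypothetical survey items)
items = pd.DataFrame({
    "q1": [4, 5, 3, 4],
    "q2": [5, 5, 2, 4],
    "q3": [4, 4, 3, 5],
})

# Summated scale: average the items into a single score
items["trait_score"] = items[["q1", "q2", "q3"]].mean(axis=1)
print(items)
```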
Treatment
Independent variable the researcher manipulates to see the effect (if any) on the dependent variable(s), such as in an experiment (e.g., testing the appeal of color versus black-and-white advertisements).
Type I error
Type I error → probability of incorrectly rejecting H0
● Saying an effect exists when it actually doesn’t
● = Alpha (α)
Type II error
Type II error → probability of incorrectly failing to reject H0
● Chance of not finding an effect when it does exist
● = Beta (β)
● 1 - β = power
Power
probability of correctly rejecting H0 (null hypothesis) when it’s false → correctly finding a hypothesized relationship when it exists
● Function of
1. Statistical significance level set by the researcher for a Type I error (α)
2. Sample size used
3. Effect size being examined
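A short sketch with statsmodels’ power tools (an assumed dependency), showing power as a function of exactly those three inputs for a two-sample t test.

```python
from statsmodels.stats.power import TTestIndPower

# Power of a two-sample t test given alpha, per-group n, and effect size
power = TTestIndPower().power(effect_size=0.5, nobs1=64, alpha=0.05)
print(round(power, 2))  # about 0.80 for a medium effect, n = 64 per group
```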
Univariate analysis of variance (ANOVA)
statistical technique used to determine, on the basis of one DV, whether samples are from populations with equal means
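A minimal sketch assuming scipy is available: one-way ANOVA on three illustrative samples, testing H0 of equal population means.

```python
from scipy.stats import f_oneway

# Three illustrative samples; H0: all population means are equal
g1 = [5.1, 4.8, 5.5, 5.0]
g2 = [4.2, 4.0, 4.5, 4.1]
g3 = [5.0, 5.2, 4.9, 5.1]

F, p = f_oneway(g1, g2, g3)
print(F, p)
```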
Variate
linear combination of variables formed in the multivariate technique by deriving empirical weights applied to a set of variables specified by the researcher
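A minimal sketch: the variate as a weighted linear combination of the researcher’s variables; the weights here are illustrative, not empirically derived.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(10, 3))    # three researcher-specified variables
w = np.array([0.5, -0.2, 0.8])  # illustrative empirical weights

# The variate: one weighted linear combination score per observation
variate = X @ w
print(variate[:3])
```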