stats midterm Flashcards
Statistics:
- The practice or science of collecting and analyzing numerical data in large quantities to interpret, summarize, and present it in a meaningful way.
- A numerical fact or datum: a piece of data that provides information on a particular subject, often used in reference to quantitative research or studies.
Data
-Information, especially facts or numbers, collected to be examined and considered and used to help decision-making
-Information in an electronic form that can be stored and used
by a computer
Data literacy
the combination of skills and mindsets that allows individuals to find insights and meaning within their data to enable effective, data-informed decision-making
Data literacy imparts the skills and mindset to find
meaning
within data
Politics
-The activities of the government, members of law-making organizations, or people who try to influence the way a country is governed
-The relationships within a group or organization that allow
particular people to have power over others
Political science
uses data to figure out the correct answer to important questions like these
Two styles of research
-Qualitative
-Quantitative
Qualitative research
based on information that cannot be easily measured, such as people’s feelings, rather than on information that can be shown in numbers
Quantitative research
related to information that can be shown in numbers and amounts
Topic
a matter dealt with in a text, discourse, or conversation; a
subject
Theory
a plausible general principle or body of principles offered to
explain phenomena
A causal theory differs from a theory in that it
explicitly states the
relationship between two variables
Variable
a characteristic, number, or quantity that can be measured
or counted and can take on different values
Invariance
The property of remaining unchanged regardless of changes in the conditions of measurement
A hypothesis is even more - than a causal theory
specific
A hypothesis - the variables
operationalizes
Operationalization
precisely defining the variables and how they are measured
Pre-registration
makes your hypothesis and plan for
hypothesis testing public
Once you have - your research plan, you can test your hypotheses
pre-registered
Hypothesis testing
the use of statistics on data to test a hypothesis
Methodology
the use of statistics
Empirical analysis
the use of statistics on observational
data – not experimental data
Empirical testing
the use of statistics on observational data to test a hypothesis
Hypothesis testing uses statistics to test:
- whether an association exists between the two variables,
- the strength of any association between the two variables, and
- the probability that the association between the two variables is
due to random chance
Normative arguments include words like
“should” or “ought to.”
parsimonious
Time dimension
the points in time at which your data changes
Time-series data
a sequence of data points collected or recorded at successive points
in time, typically at equally spaced intervals, that represents how a particular variable or set of variables changes over time
Hierarchical dimension
the level at which your data changes
Multi-level data
data that is structured in multiple nested levels, where observations are grouped within higher-level units
Spatial dimension
geographic locations in which your data changes
Cross-sectional data
data collected at a single point in time from multiple units, such as states or countries, to analyze variations across those units
Moderator (Z)
a variable that influences the strength or direction of the
relationship between an independent and a dependent variable in a study.
Mediator (Z)
a variable that explains the process or mechanism through
which an independent variable affects a dependent variable, acting as an
intermediary in the relationship
Formal theory
a framework that uses mathematical models and logical structures to rigorously analyze and predict the behavior of complex systems or phenomena
Rational choice theory
individuals make decisions by systematically evaluating the costs and benefits to maximize their personal utility or advantage
Utility
the sum of all benefits of an action minus the sum of all costs from that
action
Utility maximizer
an individual who seeks to make choices that yield the highest possible level of benefit based on their preferences and available options
Expected utility
the overall anticipated satisfaction or benefit (utility) derived from a particular choice or outcome
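One common way to make this concrete, offered here as a sketch rather than the course's own formula, is to weight each possible outcome's utility by its probability and add the results. The option names, probabilities, and utilities below are invented for illustration, and `expected_utility` is a hypothetical helper name.

```python
def expected_utility(outcomes):
    """Probability-weighted sum of utilities over (probability, utility) pairs."""
    return sum(p * u for p, u in outcomes)

# Hypothetical options, each a list of (probability, utility) pairs.
options = {
    "safe": [(1.0, 4.0)],
    "risky": [(0.5, 10.0), (0.5, 0.0)],
}

scores = {name: expected_utility(pairs) for name, pairs in options.items()}

# A utility maximizer picks the option with the highest expected utility.
best = max(scores, key=scores.get)
print(scores, best)
```

With these made-up numbers the risky option has the higher expected utility, so a utility maximizer who only cares about expected utility would choose it.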
Game theory
a branch of formal modeling that focuses on analyzing strategic interactions between rational decision-makers, where the outcome for each participant depends not only on their own choices but also on the choices of others
The prisoner’s dilemma
a classic game theory scenario where two individuals, who cannot communicate, face a choice between cooperating with each other or betraying one another
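A small sketch of why betrayal is tempting in this game. The payoff numbers are an assumption (any values with the same ordering would work), and `best_response` is a hypothetical helper name.

```python
# Payoffs are (row player, column player); higher is better.
# The specific numbers are assumed for illustration only.
C, D = "cooperate", "defect"
payoffs = {
    (C, C): (3, 3),
    (C, D): (0, 5),
    (D, C): (5, 0),
    (D, D): (1, 1),
}

def best_response(opponent_move):
    """Row player's payoff-maximizing move against a fixed opponent move."""
    return max((C, D), key=lambda move: payoffs[(move, opponent_move)][0])

responses = {opp: best_response(opp) for opp in (C, D)}
print(responses)
```

Defecting is the best response to either move, yet mutual defection leaves both players worse off than mutual cooperation, which is what makes the scenario a dilemma.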
Social choice theory
a domain within formal modeling that examines how individual
preferences can be aggregated to make collective decisions
Intransitive Preferences
a preference structure that violates the transitivity condition. For example, an individual might prefer option A over option B, option B over
option C, but still prefer option C over option A (A > B, B > C, but C > A).
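The transitivity condition can be checked mechanically. This is a minimal sketch, assuming preferences are stored as a set of (preferred, less preferred) pairs; `is_transitive` is a hypothetical helper name.

```python
from itertools import permutations

def is_transitive(prefers):
    """prefers is a set of (x, y) pairs meaning 'x is preferred to y'."""
    items = {x for pair in prefers for x in pair}
    for a, b, c in permutations(items, 3):
        # Transitivity requires: a > b and b > c implies a > c.
        if (a, b) in prefers and (b, c) in prefers and (a, c) not in prefers:
            return False
    return True

cyclic = {("A", "B"), ("B", "C"), ("C", "A")}    # the example from the card
ordered = {("A", "B"), ("B", "C"), ("A", "C")}   # a transitive ordering

print(is_transitive(cyclic), is_transitive(ordered))
```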
Spatial models
a specialized form of formal modeling that incorporates spatial or geographic
dimensions into the analysis of strategic interactions.
Spatial models of voting
a formal modeling approach used to analyze how voters’ preferences
and spatial positioning influence electoral outcomes
Preference mapping
voters and candidates are positioned on a spatial map (often a one-dimensional or two-dimensional continuum) based on their ideological or policy preferences
Vote maximization
candidates choose positions or policies to maximize their votes, typically moving towards the median voter or the center of voter preferences to appeal to the largest
segment of the electorate
Equilibrium analysis
The model identifies equilibrium points, where candidates’ positions stabilize because any deviation would result in fewer votes. The best-known result is the median voter theorem, under which candidates converge to the preferences of the median voter
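A rough illustration of the median voter logic on a one-dimensional scale, assuming each voter supports whichever candidate is closer to their ideal point. The voter positions and the helper `vote_share` are invented for this sketch.

```python
import statistics

def vote_share(candidate, rival, voters):
    """Share of voters closer to candidate than to rival; ties count as half."""
    points = 0.0
    for v in voters:
        d_cand, d_rival = abs(v - candidate), abs(v - rival)
        if d_cand < d_rival:
            points += 1.0
        elif d_cand == d_rival:
            points += 0.5
    return points / len(voters)

# Hypothetical voter ideal points on a left-right line.
voters = [1, 2, 3, 5, 8, 9, 10]
median = statistics.median(voters)

# Pit a candidate at the median against rivals at every other integer position.
rivals = [p for p in range(0, 12) if p != median]
shares = [vote_share(median, r, voters) for r in rivals]
print(median, min(shares))
```

In this example the candidate at the median wins a majority against every rival position, which is why moving away from the median costs votes.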
Causal relationship
a connection between two variables where one variable directly influences or determines the outcome of the other
Confounder
a variable that influences both the independent and dependent variables, potentially leading to a misleading or spurious association between them.
Spurious relationship
a false or misleading association between two variables that is actually caused by a third, confounding variable, rather than a direct causal link between the two
Control variable
a variable or condition that is held constant or regulated in an
experiment or study to isolate the effect of the independent variable on the dependent variable, ensuring that the results are not influenced by extraneous factors
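The confounder, spurious relationship, and control variable cards can be tied together with a small simulation. This is a sketch under stated assumptions: the data are simulated, X and Y each depend only on Z, and "controlling for Z" is done by correlating the residuals left after regressing each variable on Z. All function names are hypothetical.

```python
import random

def mean(values):
    return sum(values) / len(values)

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def residuals(y, z):
    """Residuals from a one-variable least-squares regression of y on z."""
    my, mz = mean(y), mean(z)
    slope = (sum((a - mz) * (b - my) for a, b in zip(z, y))
             / sum((a - mz) ** 2 for a in z))
    return [b - my - slope * (a - mz) for a, b in zip(z, y)]

rng = random.Random(0)
z = [rng.gauss(0, 1) for _ in range(5000)]   # the confounder
x = [zi + rng.gauss(0, 1) for zi in z]       # X depends on Z only
y = [zi + rng.gauss(0, 1) for zi in z]       # Y depends on Z only, not on X

raw = pearson(x, y)                                       # spurious association
controlled = pearson(residuals(x, z), residuals(y, z))    # after controlling for Z
print(round(raw, 2), round(controlled, 2))
```

The raw correlation between X and Y is clearly positive even though neither causes the other; once Z is held constant the association shrinks to roughly zero.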
Deterministic relationship
a connection between two variables where one variable’s value is precisely determined by the value of the other, with no randomness or uncertainty involved
Probabilistic relationship
a connection between two variables where changes in one variable are associated with changes in the likelihood or probability of different
outcomes in the other variable, but the relationship is not perfectly predictable
Observational data
information collected from real-world observations or measurements without conducting experiments
Experimental data
information collected from experiments where variables are
systematically manipulated to observe their effects on other variables, allowing for causal inferences
Randomized controlled trials (RCTs)
experimental studies where participants are randomly assigned to either a treatment group or a control group to evaluate the effectiveness of an intervention while minimizing biases
Treatment group
a group of participants in a study that receives the treatment or intervention being tested, allowing researchers to assess its effects compared to a control group
Random assignment
the process of randomly allocating
participants to control and treatment groups in a study to ensure that each group is comparable and to eliminate selection bias
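A minimal sketch of random assignment, assuming two equally sized groups; the participant labels and the function name `random_assignment` are made up for illustration.

```python
import random

def random_assignment(participants, seed=None):
    """Shuffle participants and split them into treatment and control halves."""
    rng = random.Random(seed)
    shuffled = list(participants)   # copy so the original order is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

people = [f"participant_{i}" for i in range(10)]
treatment, control = random_assignment(people, seed=42)
print(treatment)
print(control)
```

Because group membership depends only on the shuffle, no characteristic of a participant can systematically steer them into one group or the other.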
Selection bias
when the sample of participants in a study is not representative of the population being studied, leading to distorted or unrepresentative results
Randomized controlled trials are considered the gold standard for
causal research because they can cross the
four causal hurdles.
Experiments can exhibit low levels of
external validity
External validity
the degree to which one can be confident that the results of an analysis apply to the broader population
Natural experiments
experiments that leverage naturally occurring random variations or events to investigate causal effects, without direct manipulation of the independent variable by the researcher
Natural experiments exhibit high levels of
internal validity
Controlled experiments
studies that compare the effects of an intervention or treatment between pre-selected groups that are not randomly assigned, aiming to assess causal relationships while
controlling for confounding variables
Quasi-experiments
research designs that aim to evaluate
interventions or treatments without full randomization, often using
pre-existing groups or natural conditions to infer causal relationships
Observational research
research designs in which the
researcher does not have control over values of the independent
variable because the independent variable occurs naturally
Survey item
a specific question or statement in a survey designed to gather
data on a particular aspect of a respondent’s attitudes, opinions, or behaviors
Open-ended items
items that allow respondents to provide their answers in
their own words
Ranking item
item that asks respondents to rank a list of choices according to their preferences or importance
Likert scale
response options that allow respondents to rate their level of
agreement or disagreement with a series of statements on an interval scale, typically ranging from “strongly disagree” to “strongly agree”
Binary response option
a type of response with only two choices
Multi-item scales
multiple questions or items that measure a single underlying construct
Scale validation
the process of assessing whether a multi-item scale accurately and reliably captures the construct it is intended to measure, ensuring
that it reflects the intended attributes and performs consistently across different contexts and populations
Demographic items
data collected about respondents’ characteristics, such as age, gender, education level, income, and ethnicity
Population
the entire group of individuals or units from which a sample is drawn
and to whom the survey findings are intended to generalize
Sample
a subset of individuals or units selected from a larger population
for the purpose of conducting a survey or study to draw conclusions about the entire population
Sample (verb)
to select and examine a subset of a population or data set to draw conclusions or make inferences about the larger population
Sample size (N)
the number of individual units or observations selected from a
population for a study, used to ensure the results are statistically reliable and representative of the larger group
Statistical power
the probability that a statistical test will correctly reject a false null hypothesis, thereby detecting an effect or relationship if one truly exists
Representative sample
a subset of a population that accurately reflects the characteristics and diversity of the larger group, allowing the results to be generalized to the entire population
Probability sample
when each member of the population has a known, non-zero chance of being selected for the sample, allowing for statistical inference and generalization to the population
Non-probability sample
when members of the sample are not selected at random, making it difficult to determine the likelihood of any member being chosen and limiting the ability to generalize the findings
Convenience samples
a type of non-probability sample where participants are selected based on their easy availability and proximity to the researcher, rather than through random sampling, which can lead to biases and limited generalizability
Quantitative research
a method of inquiry that focuses on collecting and analyzing numerical data to identify patterns, test hypotheses, and make generalizations about a population
Conceptual clarity
forming a precise definition for and clear understanding of the concepts being studied
Concept
a broad, abstract idea or general notion that provides a
foundational understanding
Construct
a specific, measurable version of a concept used in research
to operationalize and test theoretical ideas
Face validity
the extent to which a measurement tool appears to measure what it is supposed to measure, based on casual inspection
Construct validity
the extent to which a variable or measurement is related to other measures that theory suggests should be related
Content validity
the extent to which a variable or measurement accurately represents all of the elements that define the concept it is intended to measure
Reliability
the consistency and stability of a measurement tool across
repeated applications
Survivorship bias
when only the entities that have “survived” a particular process are considered, leading to a skewed understanding or conclusion.
Qualitative research
a method of inquiry that focuses on understanding and interpreting the meanings, experiences, and perspectives of individuals or groups through non-numerical data, such as interviews, observations, and texts
Categorical variables
represent categories or groups and do not have a numeric value
Nominal variables
categorical variables with no inherent order or ranking among the categories.
Ordinal variables
categorical variables that have a meaningful order or ranking, but the intervals between the categories are not necessarily equal.
Numerical variables
represent quantities and can be measured on a numeric scale
Continuous variables
can take any value within a range and can be subdivided into finer increments with equal unit distances
Discrete variables
can only take specific, distinct values, often counts or integers
Rank statistics
a class of statistics used to describe the variation of continuous variables based on their ranking from lowest to highest values
Quartile
one of the cut points that divide an ordered dataset into four equal parts, each containing 25% of the data; the term is also used for each of those four parts
Box-whisker plot
a graphical representation of data
that displays the median, quartiles, and potential
outliers, using a box to show the interquartile range
and “whiskers” to indicate the range of the data
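The numbers behind a box-whisker plot can be computed with Python's standard library. This is a sketch with invented data; the 1.5 × IQR rule for flagging potential outliers is a common convention assumed here, not something the cards specify, and different software may compute quartiles slightly differently.

```python
import statistics

# Hypothetical data with one unusually large value.
data = [2, 4, 4, 5, 7, 8, 9, 10, 12, 13, 15, 40]

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1   # interquartile range: the length of the box

# Assumed convention: points beyond 1.5 * IQR from the box are potential outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q2, q3, iqr, outliers)
```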
Moments
numerical measures derived from the data values themselves and their positions relative to the mean or origin
The zero-sum property of the mean
if you subtract the mean of a dataset
from each data point, the sum of these deviations will always be zero
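A quick numerical check of this property, using an invented dataset:

```python
data = [2.0, 4.0, 4.0, 5.0, 10.0]   # hypothetical values

mean = sum(data) / len(data)
deviations = [x - mean for x in data]   # each point minus the mean
total = sum(deviations)

print(mean, deviations, total)
```

The negative and positive deviations cancel, so the total is zero (with other data, floating-point rounding may leave a tiny nonzero remainder).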
The mean of a variable is often called its
expected value because it is the
long-run average value you would expect the variable to take across many observations.
Variance (second moment)
a measure of the dispersion of a variable around its mean
Standard deviation
another measure of the dispersion of a variable around
its mean, equal to the square root of the variance.
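A short worked example of both measures, using made-up data. The cards do not say whether to divide by n or by n - 1, so both forms are shown and labeled; treat the choice as an assumption to confirm against the course materials.

```python
data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical values
n = len(data)

mean = sum(data) / n
squared_devs = [(x - mean) ** 2 for x in data]

pop_variance = sum(squared_devs) / n            # population form: divide by n
sample_variance = sum(squared_devs) / (n - 1)   # sample form: divide by n - 1

pop_sd = pop_variance ** 0.5        # standard deviation = square root of variance
sample_sd = sample_variance ** 0.5

print(mean, pop_variance, pop_sd, sample_variance, sample_sd)
```

The standard deviation is in the same units as the data, which is why it is usually easier to interpret than the variance.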
Kernel density plot
a visual depiction of the distribution of a single variable based on a smoothed calculation of the density of cases across the range of values
Skewness (third moment)
a measure that indicates the direction and degree of asymmetry of the variable’s distribution around the mean
Kurtosis (fourth moment)
a measure that indicates how heavy the tails of a variable’s distribution are, often described informally as its peakedness or steepness
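Skewness and kurtosis can be sketched as standardized third and fourth central moments. The population (divide by n) versions are assumed here; statistical packages often apply small-sample corrections or report excess kurtosis (kurtosis minus 3), so their output may differ. All function names and data are illustrative.

```python
def central_moment(data, k):
    """Average of (x - mean) ** k over the data."""
    m = sum(data) / len(data)
    return sum((x - m) ** k for x in data) / len(data)

def skewness(data):
    """Third central moment divided by the variance raised to the power 1.5."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def kurtosis(data):
    """Fourth central moment divided by the squared variance."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2

symmetric = [1, 2, 3, 4, 5]        # mirror image around its mean
right_skewed = [1, 1, 2, 2, 10]    # long tail to the right

print(skewness(symmetric), skewness(right_skewed), kurtosis(symmetric))
```

A symmetric dataset has skewness of zero, while a long right tail produces positive skewness.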
Even when we go all out to get information about every U.S. citizen in the Census, we still have
lots of nonrespondents.
Convenience sample
a sample such that each member of the underlying population does NOT necessarily have an equal probability of being selected.
Statistical inference
the process of using what we
know about a sample to make probabilistic statements about the broader population.
Parameters
numerical values that describe certain characteristics or features of an entire population, such as the mean, variance, or proportion; the corresponding values computed from a sample are called statistics.
Central limit theorem
a fundamental result from
statistics indicating that if one were to collect an infinite number of random samples of the same size and plot the resulting sample means, those sample means would be approximately normally distributed around the true population mean
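The theorem can be illustrated by simulation, with a large but finite number of samples standing in for the infinite number in the definition. This sketch assumes a skewed population (an exponential distribution with mean 1 and standard deviation 1); the sample size and number of samples are arbitrary choices.

```python
import random
import statistics

rng = random.Random(1)

def sample_mean(n):
    """Mean of one random sample of size n from an exponential population."""
    return statistics.fmean([rng.expovariate(1.0) for _ in range(n)])

n = 50                                            # size of each sample
means = [sample_mean(n) for _ in range(2000)]     # many sample means

center = statistics.fmean(means)   # should sit near the population mean, 1
spread = statistics.stdev(means)   # should sit near 1 / sqrt(n)

print(round(center, 3), round(spread, 3), round(1 / n ** 0.5, 3))
```

Even though individual draws are strongly skewed, the sample means cluster symmetrically around the population mean, with a spread close to the population standard deviation divided by the square root of the sample size.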
Distribution
a mathematical function that describes the probabilities of different outcomes in a random variable or set of data
Data generating process
the underlying mechanism or
model that describes how data is produced and collected
Independent outcomes
outcomes for which the occurrence of one does not influence the occurrence of another.
Normal distribution
a bell-shaped statistical distribution that can be entirely characterized by its mean and standard deviation.
standard deviation numbers
- One standard deviation in each direction captures 68.3% of the area under the curve.
- Two standard deviations in each direction capture 95.5% of the area under the curve.
- Three standard deviations in each direction capture 99.7% of the area under the curve.
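These areas can be reproduced from the normal curve itself. The sketch below uses the standard identity that the area within k standard deviations of the mean equals erf(k / sqrt(2)); `area_within` is a hypothetical helper name.

```python
import math

def area_within(k):
    """Area under a normal curve within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, f"{100 * area_within(k):.2f}%")
```

To two decimals the areas are 68.27%, 95.45%, and 99.73%, which match the figures on the card after rounding.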
Standard error (of the mean)
the standard deviation of the sampling distribution of the mean.
-It is the measure of the variability or dispersion of sample means around the population mean
Confidence intervals
a probabilistic statement about the likely value of a population characteristic based on the observations in a sample.
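A sketch of how the standard error of the mean feeds into a 95% confidence interval. The sample values are invented, and the normal critical value is an assumption made for simplicity; with a sample this small, a t critical value would usually be used instead.

```python
import statistics

# Hypothetical sample of 10 observations.
sample = [52, 48, 55, 60, 45, 50, 58, 47, 53, 52]
n = len(sample)

mean = statistics.fmean(sample)
s = statistics.stdev(sample)    # sample standard deviation
se = s / n ** 0.5               # standard error of the mean

# Critical value for a 95% interval from the standard normal (about 1.96).
z = statistics.NormalDist().inv_cdf(0.975)

ci = (mean - z * se, mean + z * se)
print(mean, round(se, 3), tuple(round(b, 2) for b in ci))
```

A larger sample shrinks the standard error and therefore narrows the interval.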
hypothesis
a testable statement predicting a relationship or effect between variables, often framed as an expectation of what will happen
null hypothesis
a specific type of hypothesis that assumes no effect or no difference between variables and serves as a baseline to test against
Counterfactual
an alternative scenario or condition that contrasts with the proposed effect or relationship in the hypothesis, effectively serving as the null hypothesis which assumes no effect or difference
Critical value
a predetermined threshold derived from a particular statistical distribution used to conduct a statistical test
Significance level
the probability of rejecting the null hypothesis when it is actually true, representing the threshold for statistical significance.
Test statistic
a value calculated by:
* identifying the sample statistic (e.g., the mean),
* determining its standard error (e.g. standard error of the mean), and
* using a specific formula to assess how far the sample result deviates from the null hypothesis
p-value
the probability of obtaining a test statistic as extreme as, or more extreme than, the one observed, assuming the null hypothesis is true
In the social sciences, the standard p-value threshold is
p = 0.05.
Statistical significance
an indication that an observed effect or relationship in the data is unlikely to have occurred by random chance alone. (Assuming the null hypothesis is true and the study is repeated an infinite number of times by drawing random samples from the same population, fewer than 5% of these results will be more extreme than the current result.)
When a result is statistically significant, that does not mean that
the alternative hypothesis is proven to be true. It just means you can reject the null hypothesis
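The test statistic, p-value, and significance cards can be combined into one small worked example. This is a sketch assuming a two-sided z-test on a sample mean; the sample mean, null value, and standard error are invented numbers, and `two_sided_p` is a hypothetical helper name.

```python
from statistics import NormalDist

def two_sided_p(sample_mean, null_mean, standard_error):
    """Return the z test statistic and its two-sided p-value."""
    z = (sample_mean - null_mean) / standard_error
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

alpha = 0.05   # the conventional significance level from the cards

# Hypothetical inputs: sample mean 52, null hypothesis mean 50, standard error 1.
z, p = two_sided_p(52.0, 50.0, 1.0)
reject_null = p < alpha

print(z, round(p, 4), reject_null)
```

Here the sample mean lies two standard errors from the null value, giving a p-value just under 0.05, so the null hypothesis is rejected at the 5% level. That is still not proof that the alternative hypothesis is true.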
Chi-squared test of tabular association
a statistical test that
evaluates whether observed categorical data align with the expected frequencies based on a specific hypothesis
Contingency table
a matrix that displays the frequency distribution of two categorical variables, showing how their values intersect
Degrees of freedom
the number of independent values or quantities that can vary in a statistical calculation, typically indicating the number of values that are free to vary after certain constraints are applied
The shape of the Chi-square
distribution depends on the
degrees of freedom
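The chi-squared statistic and its degrees of freedom can be computed directly from a contingency table. This sketch uses an invented 2 × 2 table and compares the result with 3.841, the standard critical value for one degree of freedom at the 0.05 level; `chi_squared` is a hypothetical helper name.

```python
def chi_squared(table):
    """Return the chi-squared statistic and degrees of freedom for a table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were unrelated.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected

    df = (len(table) - 1) * (len(table[0]) - 1)   # (rows - 1) * (columns - 1)
    return stat, df

# Hypothetical contingency table: rows are groups, columns are responses.
table = [[30, 20],
         [20, 30]]

stat, df = chi_squared(table)
critical_value = 3.841   # chi-squared critical value for df = 1 at alpha = 0.05
print(stat, df, stat > critical_value)
```

Every expected count in this table is 25, so the statistic works out to 4.0 with one degree of freedom, which exceeds the critical value.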