Chapter 1, Intro to Data Flashcards
What is a summary statistic?
a single number summarizing a large amount of data
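As a small illustration of the card above, the sketch below reduces a list of values to single summary numbers. The commute times are made-up data, and the mean and median are only two of many possible summary statistics.

```python
from statistics import mean, median

# Hypothetical data: commute times in minutes for seven people.
commute_minutes = [12, 18, 25, 31, 22, 45, 15]

# Each of these is a single number that summarizes all seven values.
average_commute = mean(commute_minutes)
middle_commute = median(commute_minutes)

print("mean:", average_commute)
print("median:", middle_commute)
```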
What is a proper data set called, and what makes it “proper”?
A data matrix. Each row corresponds to a unique case and each column corresponds to a variable.
What is the formal name for a row?
case or observational unit
What do columns represent, and what is important to know about them?
Characteristics, called variables. It is important to understand what each variable means, as well as its units of measurement.
What are 2 types of variables?
Numerical and Categorical
What are the 2 kinds of numerical variable?
Discrete and continuous
What are the 2 kinds of categorical variable?
Ordinal and nominal
What is a discrete numerical variable?
a number value that can only be a whole number, e.g. population, since you can’t have half a person
What is a continuous numerical variable?
a number value that can be in between whole numbers, e.g. an hourly pay rate.
What is an ordinal categorical variable?
a categorical variable that involves an ordering, e.g. educational level attained
What is a nominal categorical variable?
a categorical variable that doesn’t involve an ordering, e.g. color
What are the possible values of a categorical variable called?
levels
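The cards on data matrices, variable types, and levels can be tied together with a tiny example. This is only a sketch: the rows, variable names, and the `levels` helper are invented for illustration.

```python
# Hypothetical data matrix: each dict is one case (row),
# and each key is one variable (column).
data_matrix = [
    {"household_size": 3, "hourly_pay": 18.50, "education": "bachelor", "color": "red"},
    {"household_size": 1, "hourly_pay": 22.75, "education": "high school", "color": "blue"},
    {"household_size": 4, "hourly_pay": 15.00, "education": "master", "color": "red"},
    {"household_size": 2, "hourly_pay": 31.20, "education": "bachelor", "color": "green"},
]

# The type of each variable, using the terms from the cards above.
variable_types = {
    "household_size": "numerical, discrete",   # whole numbers only
    "hourly_pay": "numerical, continuous",     # values between whole numbers allowed
    "education": "categorical, ordinal",       # categories with a natural ordering
    "color": "categorical, nominal",           # categories with no ordering
}

def levels(rows, variable):
    """Return the distinct values (levels) a categorical variable takes."""
    return sorted({row[variable] for row in rows})

color_levels = levels(data_matrix, "color")
print(color_levels)
```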
What makes 2 variables “associated” or “dependent”?
When they show some connection with one another.
What is a scatterplot graph useful for?
Showing whether or not 2 variables are associated, as well as trends in the relationship
What is a positive correlation between 2 variables?
A relationship in which the two variables tend to move in the same direction: as one increases, the other tends to increase as well.
What is a negative correlation between 2 variables?
A relationship in which the two variables tend to move in opposite directions: as one increases, the other tends to decrease.
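One way to check the direction of a relationship numerically is the sign of a correlation coefficient. The sketch below uses the Pearson coefficient, which the cards do not name, so treat that choice as an assumption. The data are made up.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / sqrt(spread_x * spread_y)

# Hypothetical data for five students.
hours_studied = [1, 2, 3, 4, 5]
exam_score = [52, 60, 63, 71, 80]       # rises as hours studied rises
errors_on_exam = [24, 20, 18, 14, 10]   # falls as hours studied rises

r_positive = pearson_r(hours_studied, exam_score)
r_negative = pearson_r(hours_studied, errors_on_exam)

print("positive correlation:", r_positive > 0)
print("negative correlation:", r_negative < 0)
```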
What are independent variables?
variables that aren’t associated
What 2 terms describe the roles of two variables when one might affect the other?
The explanatory variable is the one suspected of affecting the other. The response variable is the one that might be affected.
What are the 2 primary types of data collection?
observational and experimental
What makes a study observational? And why use this method?
Researchers do not interfere directly with how the data arise (e.g. they run surveys, collect data from existing records, or follow a cohort of similar individuals in studies of diseases). Observational studies can provide evidence of an association between variables, but can’t show a causal connection. They can give rise to hypotheses to be checked using experiments.
Why use an experiment?
to investigate the possibility of a causal connection
What is a sample?
A subset of the population to be studied.
Define anecdotal evidence. Why is it a problem?
Data collected in a haphazard fashion. May not be representative of the population.
What is a non-response rate and why is it important?
The non-response rate is the fraction of people selected for the sample who do not respond. A high non-response rate can bias the results, because the people who respond may differ from those who do not.
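The arithmetic behind a non-response rate is simple enough to sketch. The function name and the survey counts below are hypothetical.

```python
def non_response_rate(num_sampled, num_responded):
    """Fraction of the people selected for the sample who did not respond."""
    if num_sampled <= 0:
        raise ValueError("num_sampled must be positive")
    if not 0 <= num_responded <= num_sampled:
        raise ValueError("num_responded must be between 0 and num_sampled")
    return (num_sampled - num_responded) / num_sampled

# Hypothetical survey: 500 people were contacted and 180 answered.
rate = non_response_rate(500, 180)
print(f"non-response rate: {rate:.0%}")
```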
What is a confounding variable?
a variable that is correlated with both explanatory and response variables. E.g. sun exposure is related to both the use of sunscreen and skin cancer. Also called a lurking variable, confounding factor, or a confounder.
What are the 2 kinds of observational study?
Prospective, which identifies individuals and collects info as events unfold
Retrospective, which collects data after events have taken place (e.g. studying medical records)
What are the 4 sampling methods?
- Simple random (SRS): like a lottery
- Stratified: divide population into groups of similar individuals (strata), then take a random sample from each group.
- Cluster: divide population into groups (clusters) that are each internally diverse but similar to one another, then sample some of the clusters and use data from every individual in the selected clusters.
- Multistage: involves more than one stage of sampling: 1. take a cluster sample, then 2. take an SRS within each selected cluster.
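The four methods above can be sketched with Python's standard `random` module. This is a minimal illustration, not a survey-grade implementation: the function names, the region groups, and the sample sizes are all assumptions made for the example. The same groups are reused as strata and as clusters only to keep the code short.

```python
import random

def simple_random_sample(population, n, rng):
    """SRS: every individual has the same chance of selection, like a lottery."""
    return rng.sample(population, n)

def stratified_sample(strata, n_per_stratum, rng):
    """Stratified: take a random sample from every stratum."""
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, n_per_stratum))
    return sample

def cluster_sample(clusters, n_clusters, rng):
    """Cluster: randomly pick some clusters and keep everyone in them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [person for name in chosen for person in clusters[name]]

def multistage_sample(clusters, n_clusters, n_per_cluster, rng):
    """Multistage: randomly pick clusters, then take an SRS within each one."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = []
    for name in chosen:
        sample.extend(rng.sample(clusters[name], n_per_cluster))
    return sample

# Hypothetical population: four regions with ten people each.
groups = {
    name: [f"{name}-{i}" for i in range(10)]
    for name in ("north", "south", "east", "west")
}
population = [person for members in groups.values() for person in members]

rng = random.Random(0)  # fixed seed so the example is repeatable
srs = simple_random_sample(population, 5, rng)
stratified = stratified_sample(groups, 2, rng)
clustered = cluster_sample(groups, 2, rng)
multistage = multistage_sample(groups, 2, 3, rng)

print(len(srs), len(stratified), len(clustered), len(multistage))
```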
What is sampling variability?
The natural variation from one random sample to the next. It is unavoidable and doesn’t usually cause problems.
What are 2 biased sampling methods? (not everyone in the population has an equal chance of being part of the sample)
- Convenience sample: people who are easy to reach, like standing on a street and collecting data from people who walk by.
- Voluntary response sample: people who have chosen to include themselves in the sample. People with a strong interest in the topic are the most likely to respond.
What is a population parameter?
a number that describes something about an entire group or population
What makes a study an experiment?
researchers assign treatments to cases. When treatments are assigned randomly, it’s called a randomized experiment.
What are the 4 principles randomized experiments are built on?
- controlling for possible confounding variables such as how much water a person takes a pill with
- randomization into treatment and control groups in order to account for variables that can’t be controlled
- replication: the more often a result is replicated (through a sufficiently large sample), the more accurately the effect of the explanatory variable on the response variable can be estimated
- Blocking: using strata within the experimental population, i.e., grouping the population into blocks that share certain characteristics, if researchers suspect those characteristics (variables) may influence the response.
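Blocking and randomization can be combined by randomizing separately inside each block. The sketch below assumes an even split between treatment and control, and the block names and participant IDs are invented for illustration.

```python
import random

def randomize_within_blocks(blocks, rng):
    """Randomly split each block in half between treatment and control."""
    assignment = {}
    for members in blocks.values():
        shuffled = list(members)   # copy so the input is left unchanged
        rng.shuffle(shuffled)      # random order within this block
        half = len(shuffled) // 2
        for person in shuffled[:half]:
            assignment[person] = "treatment"
        for person in shuffled[half:]:
            assignment[person] = "control"
    return assignment

# Hypothetical blocks, formed on a characteristic suspected of
# influencing the response.
blocks = {
    "low_risk": ["p1", "p2", "p3", "p4"],
    "high_risk": ["p5", "p6", "p7", "p8"],
}

assignment = randomize_within_blocks(blocks, random.Random(1))
for block_name, members in blocks.items():
    print(block_name, {person: assignment[person] for person in members})
```

Because each block is split on its own, both groups end up with the same mix of low-risk and high-risk participants.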
What are the 2 ways to employ randomization and what is the benefit of using them?
- random sampling allows you to generalize results to the target population
- random assignment to treatment or control group strengthens the suggestion of causality between explanatory and response variables.