Ch 1 - Intro to Data Flashcards
Summary Statistic
a single number summarizing a large amount of data
Data Matrix
A common way to organize data. Each row is a unique case (aka unit of observation or observational unit), and each column corresponds to a variable.
Numerical Variable
Can take a wide range of numerical values, and it makes sense to add, subtract, or take averages of those values. Either discrete or continuous.
Ordinal Variable
a categorical variable that has levels with a natural ordering (gold, silver, bronze)
Nominal Variable
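A categorical variable whose levels have no natural ordering. Binary/dichotomous example: 0=male, 1=female. Multi-level example: blood types 0=A, 1=B, 2=AB, 3=O (the codes are labels only, not quantities).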
Associated Variables
When two variables show some connection with one another. AKA dependent variables. If not associated, they are independent variables.
Simple Random Sample
Each case in the population has an equal chance of being included and there is no implied connection between the cases in the sample.
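A minimal sketch of drawing a simple random sample with Python's standard library. The population of 100 case IDs and the sample size of 10 are hypothetical, chosen only for illustration.

```python
import random

# Hypothetical population: 100 case IDs (illustrative only).
population = list(range(1, 101))

# A fixed seed makes the draw reproducible.
rng = random.Random(0)

# random.sample draws without replacement, so every case has the
# same chance of being included and no case appears twice.
sample = rng.sample(population, k=10)
print(sample)
```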
Convenience Sample
Individuals who are easily accessible are more likely to be included in the sample.
Observational Studies
Data are collected in a way that does not directly interfere with how the data arise. Can provide evidence of naturally occurring associations between variables, but cannot by themselves show a causal relationship.
Randomized Experiment
Individuals are randomly assigned to a group, and the individuals in each group are assigned a treatment.
Confounding Variable
A variable that is correlated with both the explanatory and response variables.
Stratified Sampling
The population is divided into groups called strata, chosen so that similar cases are grouped together, then a second sampling method is employed within each stratum.
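A rough sketch of stratified sampling, assuming simple random sampling is the method used within each stratum. The case list, the stratum names ("urban", "rural"), and the helper `stratified_sample` are hypothetical.

```python
import random
from collections import defaultdict

# Hypothetical cases as (case_id, stratum) pairs.
cases = [(i, "urban" if i % 3 else "rural") for i in range(60)]

def stratified_sample(cases, per_stratum, seed=0):
    """Group cases by stratum, then draw a simple random sample in each."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for case_id, stratum in cases:
        strata[stratum].append(case_id)
    # Assumes every stratum has at least `per_stratum` cases.
    return {name: rng.sample(members, per_stratum)
            for name, members in strata.items()}

picked = stratified_sample(cases, per_stratum=5)
print(picked)
```

Every stratum is represented in the result, which a single simple random sample of the whole population would not guarantee.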
Cluster Sample
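Like a two-stage simple random sample. Break up the population into many groups, called clusters. Then sample a fixed number of clusters and collect a simple random sample within each cluster. (Some texts reserve "cluster sample" for including every case in the selected clusters, and call sampling within them a multistage sample.)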
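The two-stage procedure on this card might be sketched as follows. The cluster names, cluster sizes, and the helper `two_stage_sample` are hypothetical.

```python
import random

# Hypothetical population: 8 clusters of 20 case IDs each.
clusters = {f"cluster_{c}": [f"{c}-{i}" for i in range(20)] for c in range(8)}

def two_stage_sample(clusters, n_clusters, n_per_cluster, seed=0):
    """Stage 1: sample clusters. Stage 2: sample cases within each one."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: rng.sample(clusters[name], n_per_cluster) for name in chosen}

picked_clusters = two_stage_sample(clusters, n_clusters=3, n_per_cluster=5)
print(picked_clusters)
```

Unlike stratified sampling, only some of the groups contribute cases to the sample.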
Scatterplot
Provides a case-by-case view of data for two numerical variables
Dot Plot
One-variable scatterplot
Skew
When data trail off in one direction, the distribution has a long tail. If the long tail is on the left, it is left skewed; if on the right, it is right skewed.
Sample Mean
Average of all observed values
Sample Variance
Square each deviation from the sample mean, then take an average, dividing by n-1 rather than n for the sample variance.
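A short worked sketch of the sample mean and the sample variance with the n-1 denominator, checked against Python's `statistics.variance`. The data values are hypothetical.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations

n = len(values)
mean = sum(values) / n

# Square each deviation from the sample mean.
squared_deviations = [(x - mean) ** 2 for x in values]

# Divide the sum by n - 1 (not n) for the sample variance.
sample_variance = sum(squared_deviations) / (n - 1)

# statistics.variance also uses the n - 1 denominator.
assert abs(sample_variance - statistics.variance(values)) < 1e-12
print(mean, sample_variance)
```

Here the squared deviations sum to 32, so the sample variance is 32/7.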
Median
50th percentile
Interquartile Range (IQR)
The length of the box in a box plot = Q3 - Q1, where Q1 and Q3 are the 25th and 75th percentiles
Box Plot Whiskers
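Extend out from the box to the farthest data point that lies within 1.5 * IQR of the box edge (Q1 or Q3). Points beyond that reach are plotted individually as potential outliers.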
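A minimal sketch of computing the IQR and whisker ends, assuming the common convention that each whisker reaches the farthest data point no more than 1.5 * IQR from the box. The data are hypothetical, and quartile conventions differ between tools; the "inclusive" method of `statistics.quantiles` is just one choice.

```python
import statistics

values = [1, 2, 3, 4, 5, 6, 7, 8, 30]  # hypothetical data; 30 is extreme

# Quartiles (25th, 50th, 75th percentiles) under the "inclusive" convention.
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# The maximum reach of each whisker.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Each whisker stops at the farthest data point still inside its fence.
lower_whisker = min(x for x in values if x >= lower_fence)
upper_whisker = max(x for x in values if x <= upper_fence)

# Points beyond the fences are drawn individually.
outliers = [x for x in values if x < lower_fence or x > upper_fence]
print(q1, q3, iqr, lower_whisker, upper_whisker, outliers)
```

With these values the upper whisker stops at 8, and 30 is flagged on its own rather than stretching the whisker.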
Robust Estimates
Median and IQR are examples, because extreme observations have little effect on their values. The mean and standard deviation are not robust.
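One way to see robustness is to replace a single value with an extreme one and compare the summaries. The data and the helper `summarize` are hypothetical; this is an illustration, not a proof.

```python
import statistics

def summarize(data):
    """Return the mean, SD, median, and IQR of a list of numbers."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "mean": statistics.mean(data),
        "sd": statistics.stdev(data),
        "median": statistics.median(data),
        "iqr": q3 - q1,
    }

values = [10, 12, 13, 15, 16, 18, 20]   # hypothetical data
with_outlier = values[:-1] + [200]      # largest value replaced by an extreme one

before = summarize(values)
after = summarize(with_outlier)
print(before)
print(after)
```

In this example the median and IQR are unchanged, while the mean and standard deviation both shift substantially.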
Contingency Table
A table that summarizes counts of data for two categorical variables
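A small sketch of building a contingency table from raw cases with `collections.Counter`. The two variables (group and outcome) and their values are hypothetical.

```python
from collections import Counter

# Hypothetical cases: (group, outcome), both categorical.
observations = [
    ("treatment", "improved"), ("treatment", "improved"),
    ("treatment", "no change"), ("control", "improved"),
    ("control", "no change"), ("control", "no change"),
]

# Count how many cases fall in each (group, outcome) combination.
counts = Counter(observations)
row_levels = sorted({group for group, _ in observations})
col_levels = sorted({outcome for _, outcome in observations})

# Rows are levels of one variable, columns are levels of the other.
table = {r: {c: counts[(r, c)] for c in col_levels} for r in row_levels}
for row, cells in table.items():
    print(row, cells)
```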
(Relative) Frequency Table
A table that summarizes counts of data for one variable. A relative frequency table reports proportions or percentages instead of counts.
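A brief sketch of a frequency table and the matching relative frequency table for one categorical variable. The blood type values are hypothetical.

```python
from collections import Counter

# Hypothetical observations of one categorical variable.
blood_types = ["A", "O", "B", "O", "A", "AB", "O", "A"]

# Frequency table: the count for each level.
freq = Counter(blood_types)

# Relative frequency table: each count divided by the number of cases.
n = len(blood_types)
rel_freq = {level: count / n for level, count in freq.items()}

for level in sorted(freq):
    print(level, freq[level], rel_freq[level])
```

The relative frequencies sum to 1, which makes tables from samples of different sizes easier to compare.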