About Data 1 Flashcards
Experiment, data collection, categories
What are the rows and columns of the data table called?
Rows (observations/cases)
Columns (variables)
What are the two subclasses of numerical variables and what do they mean?
Continuous - can take any value in a range (infinitely many possible values)
Discrete - can only take separate, countable values (usually counts)
What are the two categorical variable subclasses and what do they mean?
Ordinal - has natural order
Regular (nominal) - doesn’t have natural order
Associated vs independent variables?
Associated - two variables have a connection
Independent - no connection
Anecdotal evidence?
Evidence based on a personal story instead of systematic data, e.g. Grandma says lightning cures cancer because it happened to her
Problems with taking a census
Some are hard to locate
Complex
Population changes while census is being taken
“Tasting soup”: exploratory analysis, inference and representative.
Exploratory analysis - looking at the data in your sample (tasting a spoonful of the soup)
Inference - to generalize your claims to the whole population
Representative - does your sample represent the whole population (it needs to!)
Sampling bias from these - non-response, voluntary response, convenience sample
Non-response - if only a small fraction of randomly sampled people respond, the sample may no longer be representative of the population.
Voluntary response - the only people who care to respond are those with strong opinions (not representative)
Convenience sample - people who are more easily accessible are more likely to be in the sample
Explanatory variable and response variable
It's a suggestion of which variable influences the other: the explanatory variable might affect the response variable (does not mean it is causal)
Observational study
Data is collected in a way that does not affect how the data arise; the researcher only “observes”
Experiment
subjects are assigned treatments to establish causal connections between explanatory and response variables
Confounding variable
a variable that is correlated with both the explanatory and response variables
Two types of observational studies?
prospective and retrospective studies
prospective study?
collects info as events unfold
retrospective study?
collects info after events have taken place
What are the four sampling methods?
simple random sampling
stratified “”
cluster “”
multistage “”
Simple random sample
every case in the population has an equal chance of being picked
Stratified sample
divides population into groups (strata) of similar observations. Then takes random samples from each
Cluster sample
divides population into groups (clusters), then samples every observation in some randomly chosen clusters
Multistage sample
divides population into clusters, then randomly chooses clusters to sample, then takes a simple random sample within each chosen cluster
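A rough Python sketch of the four sampling methods, to make the differences concrete. The population, the region labels, and all function names are made up for illustration, and the region is used as both the stratum and the cluster label only to keep the example short.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 12 people, 4 in each made-up region.
population = [(f"person{i}", region)
              for i, region in enumerate(["north", "south", "east"] * 4)]

def group_by_region(pop):
    """Split the population into groups keyed by region."""
    groups = {}
    for person in pop:
        groups.setdefault(person[1], []).append(person)
    return groups

def simple_random_sample(pop, n):
    """Every case has the same chance of being picked."""
    return rng.sample(pop, n)

def stratified_sample(pop, n_per_stratum):
    """Take a simple random sample from every stratum."""
    sample = []
    for members in group_by_region(pop).values():
        sample.extend(rng.sample(members, n_per_stratum))
    return sample

def cluster_sample(pop, n_clusters):
    """Randomly pick whole clusters and keep everyone in them."""
    groups = group_by_region(pop)
    chosen = rng.sample(sorted(groups), n_clusters)
    return [person for name in chosen for person in groups[name]]

def multistage_sample(pop, n_clusters, n_per_cluster):
    """Randomly pick clusters, then take a random sample inside each one."""
    groups = group_by_region(pop)
    chosen = rng.sample(sorted(groups), n_clusters)
    sample = []
    for name in chosen:
        sample.extend(rng.sample(groups[name], n_per_cluster))
    return sample

print(stratified_sample(population, 2))
print(cluster_sample(population, 2))
```

Stratified sampling touches every group; cluster and multistage sampling only touch the groups that were randomly chosen.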
Principles of experimental design (4) C R R B
Control - compare treated group with control group
Randomize - randomly assign subjects to treatment groups
Replicate - collect a large enough sample, or repeat the whole study
Block - first group subjects by a variable known or suspected to affect the response, then randomize within each group
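Blocking plus randomizing can be sketched in Python like this. The patients, the "low risk"/"high risk" blocking variable, and the function name are hypothetical; the only idea shown is that random assignment happens separately inside each block.

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical subjects, each tagged with a made-up blocking variable.
subjects = [(f"patient{i}", "low risk" if i < 4 else "high risk")
            for i in range(8)]

def blocked_assignment(subjects):
    """Randomly split each block in half: treatment vs control."""
    blocks = {}
    for name, block in subjects:
        blocks.setdefault(block, []).append(name)

    assignment = {}
    for names in blocks.values():
        shuffled = names[:]
        rng.shuffle(shuffled)          # randomize within the block
        half = len(shuffled) // 2
        for name in shuffled[:half]:
            assignment[name] = "treatment"
        for name in shuffled[half:]:
            assignment[name] = "control"
    return assignment

assignment = blocked_assignment(subjects)
for name, block in subjects:
    print(name, block, assignment[name])
```

Because each block is split evenly, the blocking variable cannot end up lopsided between the treatment and control groups.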
Scatter plot
useful for visualizing the relationship between two numerical variables
Dot plots and mean
Shows each observation as a dot along a single number line (dense regions look crowded); the mean can be marked on the same line
Sample statistic and point estimate
Sample statistic - a value calculated from the sample (e.g. the sample mean)
Point estimate - a sample statistic used as a single-number estimate of a population value
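A small Python sketch of a sample mean being used as a point estimate. The population here is simulated (made-up heights in cm), which is an assumption for illustration only; in real life the population mean is usually unknown.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Made-up population of 10,000 heights in cm.
population = [rng.gauss(170, 10) for _ in range(10_000)]

# Simple random sample of 50 people.
sample = rng.sample(population, 50)

point_estimate = statistics.mean(sample)        # sample statistic
population_mean = statistics.mean(population)   # normally unknown

print(f"point estimate:  {point_estimate:.1f}")
print(f"population mean: {population_mean:.1f}")
```

The two numbers should be close but not identical; a different sample would give a slightly different estimate.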
Stacked dot plot
dots with the same value are piled on top of each other, so taller stacks show where the data is dense
Histograms
shows data density and the shape of the data. bin (bar) width can alter the story: wide bins hide detail, narrow bins show more noise. ex. bins of 1-10 vs 1-2
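To see how bin width changes the story, here is a minimal Python sketch that counts the same made-up data with two different widths. The function name and the choice to label each bin by its left edge are assumptions for illustration.

```python
from collections import Counter

def bin_counts(data, width):
    """Count values per bin; each bin is labeled by its left edge."""
    counts = Counter((x // width) * width for x in data)
    return dict(sorted(counts.items()))

data = [1, 2, 2, 3, 5, 6, 6, 7, 9, 10]  # made-up values

wide = bin_counts(data, 5)    # few wide bins: smooth but hides detail
narrow = bin_counts(data, 2)  # many narrow bins: detailed but bumpy

print(wide)
print(narrow)
```

With width 5 the data looks like one lump; with width 2 two small peaks show up.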
4 types of histogram shapes
Unimodal - one hump (a normal distribution is one example)
Bimodal - two humps
Multimodal - 3 or more humps
Uniform - flat, no humps
3 types of skew
Right - median < mean (long tail to the right)
Left - median > mean (long tail to the left)
Symmetric - median = mean
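The mean vs median comparison can be checked with a quick Python sketch. This is only a rule of thumb, not a formal skewness measure, and the function name and data are made up.

```python
import statistics

def skew_hint(data):
    """Guess the skew direction by comparing mean and median."""
    mean = statistics.mean(data)
    median = statistics.median(data)
    if mean > median:
        return "right"      # a long right tail pulls the mean up
    if mean < median:
        return "left"       # a long left tail pulls the mean down
    return "symmetric"

print(skew_hint([1, 2, 2, 3, 3, 3, 4, 20]))        # mean 4.75, median 3
print(skew_hint([1, 17, 18, 18, 19, 19, 19, 20]))  # mean 16.375, median 18.5
print(skew_hint([1, 2, 3, 4, 5]))                  # mean 3, median 3
```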
Variance
roughly the average squared deviation from the mean. sum of (x - mean)^2 / (n - 1)
Standard deviation
sqrt( sum of (x - mean)^2 / (n - 1) )
sqrt of variance
almost all data is usually within 3 SDs of the mean
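A worked Python sketch of the sample variance and standard deviation formulas, using made-up data. The hand-written function is only there to mirror the formula; Python's built-in `statistics.variance` and `statistics.stdev` use the same n - 1 divisor.

```python
import math
import statistics

def sample_variance(data):
    """sum of (x - mean)^2 divided by (n - 1)."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** 2 for x in data) / (len(data) - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values, mean is 5

variance = sample_variance(data)  # squared deviations sum to 32, so 32 / 7
sd = math.sqrt(variance)          # standard deviation = sqrt of variance

print(variance, statistics.variance(data))
print(sd, statistics.stdev(data))
```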
median
value that splits data in half when in ascending order. if the number of values is even, average the two middle #s
Q1, Q3 and IQR
Q1=25th percentile
median=50th
Q3=75th percentile
between Q1 and Q3 = the middle 50% of the data; interquartile range = IQR = Q3 - Q1
box plot
represents the middle 50% with a box, with a line at the median. whiskers can reach up to 1.5 x IQR beyond the nearest quartile. outliers are outside of the whiskers
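A Python sketch of the numbers behind a box plot: quartiles, IQR, the 1.5 x IQR limits, and outliers. The data and function name are made up. Note that software packages compute quartiles in slightly different ways, so Q1 and Q3 here (from `statistics.quantiles` with its default method) may differ a little from a textbook or calculator.

```python
import statistics

def box_plot_summary(data):
    """Quartiles, IQR, whisker ends, and outliers for a box plot."""
    q1, median, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr   # whiskers cannot go past these limits
    upper_limit = q3 + 1.5 * iqr
    inside = [x for x in data if lower_limit <= x <= upper_limit]
    outliers = [x for x in data if x < lower_limit or x > upper_limit]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        # whiskers stop at the most extreme points that are not outliers
        "whiskers": (min(inside), max(inside)),
        "outliers": outliers,
    }

summary = box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])
print(summary)
```

Here 50 sits beyond the upper limit, so it is drawn as a separate outlier point and the upper whisker stops at 9.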
robustness
median and IQR vs mean and SD
median and IQR are more resistant to skewness and outliers. skewed distributions are best represented by median and IQR. symmetric distributions by mean and SD
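Robustness can be seen in a short Python sketch: change one value into an extreme outlier and compare what moves. The data is made up, and the IQR uses the default method of `statistics.quantiles`, which is one of several quartile conventions.

```python
import statistics

def iqr(data):
    """Interquartile range: Q3 - Q1."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    return q3 - q1

clean = [10, 11, 12, 13, 14, 15, 16, 17, 18]
with_outlier = [10, 11, 12, 13, 14, 15, 16, 17, 180]  # last value changed

for name, data in [("clean", clean), ("with outlier", with_outlier)]:
    print(name,
          "median:", statistics.median(data),
          "IQR:", iqr(data),
          "mean:", statistics.mean(data),
          "SD:", round(statistics.stdev(data), 1))
```

The median and IQR stay the same, while the mean and SD jump because of the single extreme value.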
Graph transformations
log(x) pulls in extreme values, so outliers are less extreme
Pro - easier to model
Con - harder to interpret
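A quick Python sketch of a log transformation on made-up, strongly right-skewed values. Base 10 is an arbitrary choice here, and the log only works for positive numbers.

```python
import math

values = [1, 10, 100, 1000, 10000]  # made-up, strongly right-skewed

logged = [math.log10(x) for x in values]

print(values, "range:", max(values) - min(values))
print(logged, "range:", max(logged) - min(logged))
```

After the transformation the values are evenly spaced and the largest one is no longer far from the rest, but the units are now "powers of 10", which is the harder-to-interpret part.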
contingency tables
summarizes data from two categorical variables
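A contingency table can be built by counting pairs, as in this Python sketch. The survey rows, the two variables (smoker, exercises), and the function name are made up for illustration.

```python
from collections import Counter

# Made-up survey rows: (smoker, exercises)
rows = [("yes", "no"), ("yes", "no"), ("yes", "yes"),
        ("no", "yes"), ("no", "yes"), ("no", "no")]

def contingency_table(rows):
    """Nested dict: table[first_value][second_value] = count."""
    counts = Counter(rows)
    first_values = sorted({a for a, _ in rows})
    second_values = sorted({b for _, b in rows})
    return {a: {b: counts[(a, b)] for b in second_values}
            for a in first_values}

table = contingency_table(rows)
for smoker, row in table.items():
    print("smoker =", smoker, row, "row total:", sum(row.values()))
```

Each cell is the number of cases with that combination of the two categorical variables, and the row totals add up to the sample size.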
bar plot
common way to show a single categorical variable
histogram vs bar plot
histogram - numerical variables, x axis is a number line
bar plot - categorical variables, x axis is categories
Segmented bar and mosaic plot
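Segmented bar - bar graph with legend, where each bar is split into segments for a second variable (e.g. yes and no). Mosaic plot - same idea, but bar widths also show group sizes, so all the space adds up to the total # of samples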
Pie charts
Don’t use if comparing more than 5 items. Colorful but bad.