Introduction To Data Flashcards
3 components of statistics
Collect
Analyze
Infer
Study of how best to collect, analyze and draw conclusions from data
Statistics
In a study, the group that provides the reference point against which the treatment group is compared is the
Control group
Single number summarizing a large amount of data
Summary statistic
The first step in most analyses
Effective presentation and description of data
Each row in the table is a
Case
Each column in the table is a
Variable
Rows (cases) + columns (variables)
Data matrix
Another term for case
Unit of observation or an observational unit
A variable with values that can be added, subtracted or averaged is
Numerical
A numerical variable that can take only whole, non-negative values (such as counts) is
Discrete
A variable that denotes a classification is
Categorical
The possible values of a categorical variable are called its
Levels
A categorical variable whose levels have a natural ordering is
Ordinal
When two variables show some connection with one another, they are called ___________________ or _____________________ variables.
Associated; dependent
If one variable increases as the other decreases, there is a
Negative association
If one variable increases as the other increases, there is a
Positive association
If two variables are not associated, they are
Independent
Can a pair of variables be associated and independent at the same time?
No
Each research question refers to a target
Population
A subset of cases which is a small fraction of the population is known as
Sample
Data collected in haphazard fashion is
Anecdotal evidence
If someone was permitted to pick and choose exactly the included subjects in a sample, this introduces _____________ into a sample.
Bias
Most basic random sample is
Simple random sample
In simple random sample, each case in a population has a/an __________ chance of being included
Equal
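A minimal sketch of a simple random sample, assuming a made-up population of 100 numbered cases. The names `population` and `sample` and the seed value are illustrative only; the point is that every case has the same chance of being drawn.

```python
import random

# Hypothetical population: 100 case IDs (made up for illustration).
population = list(range(1, 101))

# A fixed seed makes the draw repeatable; drop it for a fresh draw each run.
rng = random.Random(42)

# random.sample draws without replacement, and every case is equally
# likely to be included.
sample = rng.sample(population, k=10)

print(sample)
```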
Bias can crop up. If only 30% of people randomly sampled actually responded, it is unclear whether the results are __________________ of the entire population. The _____________ bias can skew results.
Representative; non-response
When individuals who are easily accessible are more likely included in the sample, this is _____________________.
Convenience sample
Explanatory variable might affect
Response variable
Association implies causation. True or false?
False. Association alone does not imply causation.
Two primary types of data collection
Observational studies
Experiments
Collecting data in a way that does not directly interfere with how the data arise is
Observational study
When researchers want to investigate the possibility of a causal connection, they conduct a/an
Experiment
When individuals are randomly assigned to a group, the experiment is called a
Randomized experiment
In a two group experiment, the fake treatment is called a
Placebo
Causation can only be inferred from a ______________.
Randomized experiment
A variable correlated with both the explanatory and response variables
Confounding variable
Two forms of observational studies
Prospective
Retrospective
Which type of observational study identifies individuals and collects information as events unfold?
Prospective
Which type of observational study collects data after events have taken place, e.g., researchers reviewing past events in medical records?
Retrospective
Three random sampling techniques
Simple
Stratified
Cluster
Most intuitive form of random sampling
Simple random sampling
Drawing names from a fishbowl is an example of
Simple random sampling
Divide and conquer sampling strategy
Stratified sampling
When similar cases are grouped together, then simple random sampling is employed in each group, this is
Stratified sampling
A two-stage simple random sample is
A cluster sample
This is similar to stratified sampling, but there is no requirement to sample from every group
Cluster
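The contrast between stratified and cluster sampling can be sketched as follows, assuming the two-stage description of cluster sampling used in these cards. The group names, sizes, and function names are hypothetical.

```python
import random

rng = random.Random(0)

# Hypothetical population: 4 groups (A-D) with 5 cases each.
groups = {g: [f"{g}{i}" for i in range(1, 6)] for g in "ABCD"}

def stratified_sample(strata, per_stratum, rng):
    """Take a simple random sample from EVERY stratum."""
    return [case
            for members in strata.values()
            for case in rng.sample(members, per_stratum)]

def cluster_sample(clusters, n_clusters, per_cluster, rng):
    """Randomly pick some clusters, then sample only within those."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [case
            for name in chosen
            for case in rng.sample(clusters[name], per_cluster)]

strat = stratified_sample(groups, per_stratum=2, rng=rng)
clust = cluster_sample(groups, n_clusters=2, per_cluster=2, rng=rng)

print("stratified:", strat)  # cases from all four groups
print("cluster:   ", clust)  # cases from only two of the groups
```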
Studies where researchers assign treatments to cases are called
Experiments
Four principles of experimental design
Controlling
Randomization
Replication
Blocking
Asking all patients to drink a 12-ounce glass of water with the pill demonstrates
Control
To even out differences and prevent accidental bias, what is done?
Randomization
Verifying an earlier finding to make it more accurate requires
Replication
If variables other than the treatment are known or suspected to influence the response, cases are first grouped by that variable and then randomized to treatments within each group. This is
Blocking
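A rough sketch of randomization within blocks, assuming a hypothetical set of eight patients and a made-up blocking variable (risk level). The patient IDs and the half-and-half split are assumptions for illustration.

```python
import random

rng = random.Random(1)

# Hypothetical patients tagged with a risk level (the blocking variable).
patients = [("p1", "low"), ("p2", "low"), ("p3", "low"), ("p4", "low"),
            ("p5", "high"), ("p6", "high"), ("p7", "high"), ("p8", "high")]

# Step 1: group the cases into blocks by the blocking variable.
blocks = {}
for name, risk in patients:
    blocks.setdefault(risk, []).append(name)

# Step 2: randomize within each block, half to treatment, half to control.
assignment = {}
for risk, members in blocks.items():
    shuffled = members[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    for name in shuffled[:half]:
        assignment[name] = "treatment"
    for name in shuffled[half:]:
        assignment[name] = "control"

print(assignment)
```

Because the split happens inside each block, both treatment groups end up with the same mix of low-risk and high-risk patients.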
The gold standard in data collection is
Randomized experiments
When researchers keep the patients uninformed about their treatment, the study is said to be
Blind
Fake treatment
Placebo
If a fake treatment results in a slight but real improvement in patients, this is
Placebo effect
If doctors and researchers, like patients, are unaware of who is or is not receiving treatment, this is
Double blind
Provides a case-by-case view of data for two numerical variables
Scatterplot
Scatterplot helps spot
Associations
One-variable scatterplot
Dot plot
Common way to measure the center of a distribution of data
Mean
Symbol for the sample mean
x with a line above (x-bar); it equals the sum of the observations divided by n, the number of cases or observational units
What is the sample size in $\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$?
n
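The sample mean formula can be checked by hand with a few made-up observations. The list `x` below is invented for illustration.

```python
from statistics import mean

# Made-up observations x1, ..., xn.
x = [2, 4, 4, 5, 10]

n = len(x)           # sample size
x_bar = sum(x) / n   # sample mean: sum of the observations divided by n

# The standard library agrees with the hand computation.
assert x_bar == mean(x)

print(n, x_bar)  # 5 5.0
```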
The average of all observations in a population is denoted by ____; a subscript on it represents ____
mu; the variable the population mean refers to
The sample mean may provide a reasonable estimate of _____________. Although not perfect, this provides a _____________.
mu subscript x, where mu is the average of ALL observations in the population and x is the variable; rough estimate
Provides a view of the data density
Histogram
Useful when individual values are of interest
Dot plot
Useful for highlighting outliers, median and interquartile range
Box plot
What determines the direction of skew?
The side with the longer tail
Useful for highlighting spatial distribution
Intensity map
4 features to evaluate in a relationship between two variables
Direction
Shape
Strength
Outliers
3 shapes of a distribution with respect to skew
Left
Symmetric
Right
4 modalities of a distribution
Unimodal
Bimodal
Uniform
Multimodal
2 measures of variability
Variance
Standard deviation
Which one is easier to understand? Variance or standard deviation?
Standard deviation
Distance of an observation from the mean is
Deviation
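What is the symbol for sample variance?
s with superscript 2 (s^2)
Formula for sample variance?
Sum of the squared deviations from the mean, all over n-1: $s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$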
The square root of the variance is
Standard deviation
Standard deviation is the
Square root of the variance
Sum of squared deviations from the mean / (n-1) =
Sample Variance
What is variance?
Average squared distance from the mean
Square root of the variance
Standard deviation
The Greek letter used for the population standard deviation
Sigma
What is the difference between sample variance and population variance?
Sample variance uses n-1 and population variance uses n
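A small worked example of the n-1 versus n distinction, using made-up observations. The hand computation is compared against `statistics.variance` (divides by n-1) and `statistics.pvariance` (divides by n).

```python
import math
import statistics

# Made-up observations; their mean is 5.
x = [2, 4, 4, 5, 10]
n = len(x)
x_bar = sum(x) / n

# Deviation: distance of each observation from the mean.
deviations = [xi - x_bar for xi in x]
sum_sq = sum(d ** 2 for d in deviations)   # 9 + 1 + 1 + 0 + 25 = 36

sample_var = sum_sq / (n - 1)        # sample variance s^2 uses n - 1
population_var = sum_sq / n          # population variance uses n
sample_sd = math.sqrt(sample_var)    # standard deviation = sqrt(variance)

assert math.isclose(sample_var, statistics.variance(x))
assert math.isclose(population_var, statistics.pvariance(x))

print(sample_var, sample_sd, population_var)  # 9.0 3.0 7.2
```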
Summarizes a data set using five statistics while plotting unusual observations
Box plot
The first step in building a box plot is denoting the
Median
To find the median, arrange the observations from
Smallest to largest
The second step in building a box plot is
Drawing a rectangle to represent the middle 50% of the data
The total length of the box in a box plot is the
Interquartile range (IQR)
The two boundaries of the box are called
First quartile and third quartile
The more variable the data, the _____________ the standard deviation
Larger
25% of the data fall below this value
Q1
25% of the data fall above this value (in a vertical box plot)
Q3
What is the formula for IQR?
IQR = Q3-Q1
In a box plot, the ____________ attempt to capture the data outside of the box
Whiskers
The whiskers are never allowed to reach farther than
1.5 x IQR beyond the box (below Q1 or above Q3)
Observations beyond the whiskers, i.e., unusually distant observations, are called
Outliers
An observation that appears extreme relative to the rest of the data
Outlier
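The box plot quantities from the cards above can be computed directly. This is a sketch on made-up data; quartile conventions differ between textbooks and software, so the `method="inclusive"` choice is an assumption, not the only valid one.

```python
from statistics import quantiles

# Made-up observations, sorted from smallest to largest.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]

# Q1, median, Q3 under the "inclusive" convention (an assumption).
q1, med, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1   # IQR = Q3 - Q1, the length of the box

# Whiskers may reach at most 1.5 x IQR beyond the box.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Each whisker stops at the most extreme observation inside its limit.
inside = [x for x in data if lower_limit <= x <= upper_limit]
lower_whisker, upper_whisker = min(inside), max(inside)

# Observations beyond the limits are plotted individually as outliers.
outliers = [x for x in data if x < lower_limit or x > upper_limit]

print(q1, med, q3, iqr)
print(lower_whisker, upper_whisker, outliers)
```

With this data the only observation flagged as an outlier is 30.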
Why is it important to look for outliers?
Insight into interesting properties of the data
Identifying errors in data entry or collection (reexamine such values)
Identifying strong skew
Extreme observations have little effect on the
Median and IQR
Median and IQR are called ______________ estimates
Robust
Why are median and IQR robust estimates?
They are only sensitive to the numbers near Q1, the median and Q3.
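To illustrate robustness, the sketch below replaces the largest value in a made-up data set with an extreme one and compares the summaries. The data and the "inclusive" quartile convention are assumptions for illustration.

```python
from statistics import mean, median, quantiles, stdev

def iqr(values):
    """IQR = Q3 - Q1, using the 'inclusive' quartile convention (an assumption)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1

original = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
with_extreme = original[:-1] + [1000]   # largest value replaced by an extreme one

for label, values in [("original", original), ("with extreme", with_extreme)]:
    print(label,
          "| median:", median(values),
          "| IQR:", iqr(values),
          "| mean:", mean(values),
          "| sd:", round(stdev(values), 1))
```

The median and IQR are the same for both lists, while the mean and standard deviation change substantially.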
A table that summarizes data for two categorical variables is called a
Contingency table
Provides total counts across each row
Row totals
Provides total counts down each column
Column totals
A table for a single variable is
Frequency table
A frequency table replaced with percentages and proportions is called a
Relative frequency table
Common way to display a single categorical variable
Bar plot
Counts divided by their row totals
Row proportions
Counts divided by their column totals
Column proportions
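A minimal sketch of a contingency table with row and column proportions, built from made-up cases. The variable names and counts (treatment group versus outcome) are hypothetical.

```python
from collections import Counter

# Hypothetical cases: (group, outcome) pairs, invented for illustration.
cases = ([("treatment", "improved")] * 6 +
         [("treatment", "not improved")] * 4 +
         [("control", "improved")] * 3 +
         [("control", "not improved")] * 7)

counts = Counter(cases)                       # cell counts of the contingency table
row_totals = Counter(r for r, _ in cases)     # total counts across each row
col_totals = Counter(c for _, c in cases)     # total counts down each column

# Row proportions: counts divided by their row totals.
row_props = {cell: n / row_totals[cell[0]] for cell, n in counts.items()}
# Column proportions: counts divided by their column totals.
col_props = {cell: n / col_totals[cell[1]] for cell, n in counts.items()}

print(row_props[("treatment", "improved")])   # 6 of 10 treated patients
print(col_props[("treatment", "improved")])   # 6 of 9 improved patients
```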
When do you use bar plots? Histograms?
Bar plot: categorical variable
Histogram: numerical variable
X axis on histogram is
Numerical
X axis on barplot
Category
Rescaling of the data using a function
Transformation
A transformation that is useful when much of the data cluster near zero relative to the larger values in the data set
Natural log transformation
Why transform the variables in a scatterplot?
Make the relationship between variables more linear
Goals of transformation
See data structure differently
Skew reduction to assist in modeling
Straighten a nonlinear relationship in a scatterplot
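A rough sketch of a natural log transformation on made-up, right-skewed data. The ratio of mean to median is used here only as an informal indicator of skew, which is an assumption of this example rather than a formal measure.

```python
import math
from statistics import mean, median

# Made-up right-skewed data: most values near zero, a few very large.
x = [1, 2, 2, 3, 5, 8, 20, 150, 1000]

# Natural log transformation; it requires strictly positive values.
log_x = [math.log(v) for v in x]

# Informal skew check: a mean far above the median suggests a long right tail.
ratio_before = mean(x) / median(x)
ratio_after = mean(log_x) / median(log_x)

print(round(ratio_before, 2), round(ratio_after, 2))
```

After the transformation the mean sits much closer to the median, which is consistent with reduced skew.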
To visualize two categorical variables
Segmented bar plot
Useful for visualizing conditional frequency distributions
Segmented bar plot
To explore relationships between variables in a segmented bar plot, we need to compare
Relative frequencies
Segmented bar plot that uses proportion is
Relative frequency segmented bar plot
It displays the marginal distribution of a variable through the width of its bars
Mosaic plot
A mosaic plot is used only for
Categorical variables