Exam 1 Flashcards
Data file
the format in which statistical data is organized, typically in spreadsheet form. Rows contain measurements for a particular subject, columns contain measurements for a particular characteristic
Simulation
use of a computer to mimic what would actually happen if you selected a sample and used statistics in real life. These are done when it is not practical to physically perform an experiment. Probability sampling is used in designing simulations
Response variable
variable we are interested in measuring
component
what you are simulating through use of a random device
trial
One repetition of a simulation/experiment
Steps for building simulations
- Identify component to be repeated/simulated
- Explain how you will model the component’s outcome
- State response variable clearly
- Explain how to combine the components into a trial to model the response variable
- Run several trials
- Collect and summarize the results of the trials
- State your conclusion
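The steps above can be sketched in Python. This is only an illustration under assumed details that are not in the flashcards: the component is one roll of a fair six-sided die, a trial is rolling until a 6 appears, and the response variable is the number of rolls needed. The names `run_trial` and `simulate` and the seed value are hypothetical.

```python
import random

def run_trial(rng):
    """One trial: roll a fair die (the component) until a 6 appears.
    Returns the response variable: the number of rolls needed."""
    rolls = 0
    while True:
        rolls += 1
        if rng.randint(1, 6) == 6:  # randint includes both endpoints
            return rolls

def simulate(num_trials, seed=0):
    """Run several trials and collect the response variable from each."""
    rng = random.Random(seed)
    return [run_trial(rng) for _ in range(num_trials)]

# Run the trials, then summarize the results
results = simulate(10_000)
average_rolls = sum(results) / len(results)
print(f"Average rolls until a 6: {average_rolls:.2f}")
```

The conclusion step would be a sentence about the summary, for example that the average number of rolls lands near 6 for a fair die.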
3 reasons for studying stats
- being informed
- making good decisions
- evaluate decisions that affect you
Definition of statistics
The science of learning from data in the presence of variability. Variability is everywhere
Statistical problem solving process
- formulate a statistical research question
- collect data
- analyze data
- interpret results
Main components of statistics
- design: plan on how to obtain data to answer the question
- description: summarize and analyze the data
- probability: determine how sample differs from population
- Inference: make decisions and predictions
Variable
any characteristic observed in a study
data
the values of a variable for one or more people or things
Observation
(subject) an individual piece of data
data set
the collection of all observations for a particular variable
Categorical variable
(qualitative) Non-numerical variable with different categories. A value can still be a number if the number only labels a category (for example, a zip code)
Quantitative variable (and types)
a numerical variable
Types
- Discrete: values form a set of separate numbers. Typically something we count
- Continuous: values form a continuum of values, infinite number of possible values. Typically something we measure
Reasons for identifying different data types
- Choose appropriate graphical display
- Choose correct statistical method for inferential procedures
W’s and H for data
How, What, Where, When, Why, Who
Frequency distribution
A listing of distinct categories and their frequencies
Relative frequency distribution
A listing of distinct values and their relative frequencies (proportions or percentages). Used to compare samples of unequal size
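A frequency distribution and its relative frequency distribution can be tallied in a few lines. This is a minimal sketch; the `eye_colors` sample is made up for illustration.

```python
from collections import Counter

# Hypothetical sample of a categorical variable
eye_colors = ["brown", "blue", "brown", "green", "brown", "blue", "brown", "blue"]

# Frequency distribution: each distinct category and its count
frequencies = Counter(eye_colors)

# Relative frequency distribution: each count divided by the sample size
n = len(eye_colors)
relative_frequencies = {category: count / n for category, count in frequencies.items()}

for category, count in frequencies.items():
    print(f"{category}: frequency={count}, relative frequency={relative_frequencies[category]:.3f}")
```

Because relative frequencies always sum to 1, they can be compared across samples of different sizes.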
Joint event
Event with two or more characteristics
How to tell if there is an association or not?
Association: relative frequencies differ
No association: relative frequencies are similar
Dot plots
- easy to make
- useful for comparing 2 or more data sets
- display individual values of data set
- good for smaller data sets
- shows raw data
Stem plots
- not useful with large data sets
- Usually displays more info than histograms
- include raw data
- useful for comparing 2 or more data sets
- Have a “stem” (can have more than one digit) and a “leaf” (cannot have more than one digit)
- arranged in ascending order
- must have a key
Histograms
- analogous to bar charts
- horizontal axis has classes of quantitative data
- frequency, relative frequency or percent
- bars touch
- good for larger data sets
- good if you need more flexibility
Time plots
- show changes over time
- vertical axes show each observation
- horizontal axes show time when observation was measured
- trends can be seen by connecting points
what does “n” usually indicate?
sample size
Which measures of center are resistant to outliers and which aren't?
- Resistant: Median
- Not resistant: Mean
Which measures of center are useful with quantitative data and which are useful with qualitative/categorical data?
Mean and median can only be used with quantitative data. Mode can be used with both
What can you know about the distribution if the mean is greater than the median? What about if the mean is less than the median?
Mean is greater: right skewed
Mean is less than: left skewed
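A small example can show both ideas at once: the median resists an outlier while the mean does not, and a mean pulled above the median points to right skew. The data values here are hypothetical.

```python
from statistics import mean, median

data = [2, 3, 4, 5, 6]        # hypothetical symmetric data
with_outlier = data + [50]    # same data plus one large value

# Without the outlier, the mean and median agree
print("mean:", mean(data), "median:", median(data))

# With the outlier, the mean jumps while the median barely moves
print("mean:", round(mean(with_outlier), 2), "median:", median(with_outlier))
```

In the second data set the mean is well above the median, which is the pattern listed for a right-skewed distribution.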
Measures of variation (purpose and types)
Indicate amount of spread in a distribution
types
1. Range: maximum minus minimum, uses only the two most extreme observations, not resistant to outliers
2. Standard deviation: accounts for all observations, indicates how far on average observations lie from the mean, not resistant to outliers
3. Interquartile range (IQR): based on the quartiles of the data, used with boxplots
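The three measures can be computed as follows. This is a sketch on made-up data; it assumes the sample standard deviation (dividing by n - 1) and takes the quartiles as the medians of the lower and upper halves of the sorted data, which is one common convention.

```python
from statistics import median, stdev

data = sorted([4, 7, 8, 10, 12, 15, 18, 21])  # hypothetical sample

# Range: largest value minus smallest value
data_range = max(data) - min(data)

# Sample standard deviation (statistics.stdev divides by n - 1)
sample_sd = stdev(data)

# Quartiles as medians of the lower and upper halves of the sorted data
half = len(data) // 2
q1 = median(data[:half])
q3 = median(data[-half:])
iqr = q3 - q1

print(f"range={data_range}, sd={sample_sd:.2f}, IQR={iqr}")
```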
which types of graphical displays are for quantitative data?
- dot plots
- stem and leaf plots
- histograms
- time plots
Graphical displays for categorical data
- Frequency distribution
- Relative frequency distributions
- Pie charts: use relative frequencies, aka circle graph, difficult to construct by hand, best for data sets with few categories
- Bar charts: easiest way to graph, horizontal axis is distinct values of categorical data, vertical axis is frequencies or relative frequencies
- Pareto charts: bar graph with bars from tallest to shortest
Response variable
measured to make comparisons between groups
Explanatory variable
(predictor) explains the values of the response variable
Association
relationship between 2 variables
Contingency table
Frequency distribution for bivariate data, also called a two way or cross tabulation table
Conditional proportions
Proportions for the categories of the response variable, calculated within each category of the explanatory variable
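Conditional proportions can be computed row by row from a contingency table. In this sketch the table, its group names, and its counts are all hypothetical; rows are categories of the explanatory variable and columns are categories of the response variable.

```python
# Hypothetical contingency table (counts are made up)
# rows: explanatory variable, columns: response variable
table = {
    "treatment": {"improved": 30, "not improved": 20},
    "control": {"improved": 15, "not improved": 35},
}

# Divide each count by its row total to condition on the explanatory variable
conditional = {}
for group, counts in table.items():
    row_total = sum(counts.values())
    conditional[group] = {outcome: count / row_total for outcome, count in counts.items()}

for group, proportions in conditional.items():
    print(group, proportions)
```

Here the proportion improved differs between the two rows (0.6 versus 0.3), which is the pattern described earlier as indicating an association.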
Empirical rule
Applies to bell shaped distributions
68% of data falls within 1 standard deviation of the mean
95% falls within 2 standard deviations
99.7% falls within 3
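The three percentages can be checked against an exact normal curve. This sketch uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later) with a standard normal distribution, chosen here only for convenience.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

within = {}
for k in (1, 2, 3):
    # Probability of falling within k standard deviations of the mean
    within[k] = standard_normal.cdf(k) - standard_normal.cdf(-k)
    print(f"within {k} SD: {within[k]:.4f}")
```

The results round to about 68%, 95%, and 99.7%, so the rule is an approximation of the normal curve's areas.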
Percentile
- measure of relative standing
- indicate the value below which a certain percentage of observations fall
- resistant to outliers
- often preferred over mean and STD
- Divides data into 100 equal parts, there are 99 percentiles
Types of percentiles
- Deciles: divide data into tenths
- Quartiles: divide data into fourths
• First quartile: aka lower quartile, median of lower half of data, divides lower 25% from upper 75%
• Second quartile: median
• Third quartile: aka upper quartile, median of upper half of data, divides bottom 75% from top 25%
5 number summary and its graph
- Minimum
- Q1
- Median
- Q3
- Maximum
represented by a boxplot
Interquartile range
- Preferred measure of variation when median is used
- IQR=Q3-Q1
- more resistant to outliers
Finding potential outliers with IQR
- less than Q1-1.5•IQR
- greater than Q3+1.5•IQR
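The 1.5•IQR criterion can be sketched as a short function. The function name `potential_outliers` and the data are hypothetical, and the quartiles are taken as the medians of the lower and upper halves of the sorted data, matching the quartile definitions above.

```python
from statistics import median

def potential_outliers(values):
    """Return the values outside the fences Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    data = sorted(values)
    half = len(data) // 2
    q1 = median(data[:half])    # median of the lower half
    q3 = median(data[-half:])   # median of the upper half
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [x for x in data if x < lower_fence or x > upper_fence]

flagged = potential_outliers([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
print(flagged)
```

For this data Q1 is 3 and Q3 is 8, so the fences are -4.5 and 15.5 and only the value 30 is flagged.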
Difference between potential outlier and outliers
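A potential outlier is an observation flagged by a rule such as the 1.5•IQR criterion. An outlier is an observation that is far removed from the rest of the data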
SOCS
- Acronym for Shape, Outliers, Center, Spread
- Used to describe distributions of quantitative data
Components of graph shape
Modality: number of peaks, can be unimodal, bimodal or multimodal
Skewness and symmetry
Outlier criterion using z scores
|z|>3
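A z-score is the number of standard deviations an observation lies from the mean, z = (x - mean) / standard deviation, and the criterion flags an observation when the absolute value of z exceeds 3. A minimal sketch, with hypothetical function names and made-up numbers:

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

def is_outlier(x, mean, sd):
    """Flag x when its z-score is more than 3 in absolute value."""
    return abs(z_score(x, mean, sd)) > 3

# Hypothetical distribution with mean 70 and standard deviation 5
print(z_score(88, 70, 5), is_outlier(88, 70, 5))
print(z_score(60, 70, 5), is_outlier(60, 70, 5))
```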
How to know whether to use mean or median for measure of center
- Use mean if possible because it takes into account all of the actual observations
- mean is good for symmetric distributions or discrete data with a small number of distinct values
- median is good for skewed distributions or when potential outliers are present
What is reported with the mean? With the median?
Mean and standard deviation are reported together while IQR and range are reported with median
Probability
The science of uncertainty, used to evaluate and control the likelihood that a statistical inference is correct. It quantifies uncertainty
Types of probability
- Subjective: guessing a probability based off personal judgement
- Theoretical: Based on formulas
- Experimental/empirical: results of a random experiment
Common cutoff values for an event to be considered “unusual”
1%, 5%, 10%(mainly 5%)
Law of large numbers
As the number of repetitions of a random experiment grows, the proportion of times an event occurs gets closer to the event's probability. Treating probability as this long-run proportion is the frequentist interpretation. Ignores black swan events. Helps understand and visualize the meaning of probability
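The long-run behavior can be seen in a simulation. This sketch assumes a fair coin (probability of heads 0.5) and records the running proportion of heads at a few checkpoints; the seed and checkpoint values are arbitrary choices.

```python
import random

rng = random.Random(1)  # fixed seed so the run is repeatable
checkpoints = (10, 100, 1_000, 10_000, 100_000)

heads = 0
proportions = {}
for flip in range(1, 100_001):
    if rng.random() < 0.5:  # one flip of a fair coin
        heads += 1
    if flip in checkpoints:
        proportions[flip] = heads / flip

for flips, proportion in proportions.items():
    print(f"{flips:>7} flips: proportion of heads = {proportion:.4f}")
```

Early proportions can wander, but the later ones should settle close to 0.5.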
Sample space
all possible outcomes for an experiment
Ways to visualize a sample space
Tree diagram or Venn diagram
Event
A subset of the sample space. A collection of 1 or more outcomes
Complement of an event
- All outcomes in the sample space that are not in the event (the event does not occur)
- denoted as A^c
- P(A^c)=1-P(A)
Disjoint events
- aka mutually exclusive events
- events that do not have any outcomes in common
- events that can't happen at the same time
- complement events are disjoint
Intersection
- consists of outcomes that are in both events, the overlap
- disjoint events: P(A and B)=0
Union
- A or B
- Outcomes that are in one or the other, or both
P(A or B)
Disjoint: = P(A)+P(B)
Not disjoint: = P(A)+P(B)-P(A and B)
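The complement and addition rules can be checked on a small sample space. This sketch assumes one roll of a fair die, so every outcome is equally likely and a probability is a count of outcomes divided by 6. The events `A` and `B` and the helper `prob` are hypothetical.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}  # one roll of a fair die (assumed)
A = {2, 4, 6}                      # roll is even
B = {4, 5, 6}                      # roll is greater than 3

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

# Complement rule: P(A^c) = 1 - P(A)
p_complement = 1 - prob(A)

# Addition rule for events that are not disjoint
p_union_rule = prob(A) + prob(B) - prob(A & B)

# Same probability found by listing the outcomes in A or B
p_union_direct = prob(A | B)

print(p_complement, p_union_rule, p_union_direct)
```

Subtracting P(A and B) removes the outcomes 4 and 6, which would otherwise be counted twice.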
Conditional probability
The probability of an event occurring when you know that another event has occurred
P(A|B)=P(A and B)/P(B)
Probability that event A will occur given that B has occurred. We are conditioning on event B, meaning B is known to have occurred
Formula for intersection of two events using conditional probability
P(A and B)= P(A)•P(B|A)
P(A and B)=P(B)•P(A|B)
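The conditional probability formula and both forms of the multiplication rule can be verified on one roll of a fair die. This is a sketch with equally likely outcomes assumed; the events `A` and `B` and the helper `prob` are hypothetical.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}  # one roll of a fair die (assumed)
A = {2, 4, 6}                      # roll is even
B = {4, 5, 6}                      # roll is greater than 3

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

# Conditional probabilities from the definition
p_a_given_b = prob(A & B) / prob(B)
p_b_given_a = prob(A & B) / prob(A)

# Multiplication rule, written both ways
p_and_via_a = prob(A) * p_b_given_a
p_and_via_b = prob(B) * p_a_given_b

print(p_a_given_b, p_and_via_a, p_and_via_b)
```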
Methods for determining if events are independent
- P(A|B)=P(A)
- P(B|A)=P(B)
- P(A and B)=P(A)•P(B)
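Any one of the three checks is enough to decide independence. This sketch applies the third check, P(A and B)=P(A)•P(B), to events on one roll of a fair die; the event names and helper functions are hypothetical.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}  # one roll of a fair die (assumed)

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

def independent(a, b):
    """True when P(A and B) equals P(A) times P(B)."""
    return prob(a & b) == prob(a) * prob(b)

even = {2, 4, 6}
at_most_two = {1, 2}
greater_than_three = {4, 5, 6}

print(independent(even, at_most_two))         # 1/6 compared with 1/2 * 1/3
print(independent(even, greater_than_three))  # 1/3 compared with 1/2 * 1/2
```

Exact fractions are used so the equality check is not affected by floating-point rounding.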
Sensitivity
The probability that the test will give a positive result, given that the condition tested for is present
P(Positive result|condition present)
Specificity
The probability that the test will give a negative result, given that the condition tested for is not present
P(Negative result|Condition isn't present)
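Sensitivity and specificity are both conditional proportions, so they can be computed from the four counts of a test-versus-condition table. All counts below are made up for illustration.

```python
# Hypothetical screening results (counts are made up)
true_positives = 90    # condition present, test positive
false_negatives = 10   # condition present, test negative
true_negatives = 855   # condition not present, test negative
false_positives = 45   # condition not present, test positive

# P(Positive result | condition present)
sensitivity = true_positives / (true_positives + false_negatives)

# P(Negative result | condition not present)
specificity = true_negatives / (true_negatives + false_positives)

print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")
```

Each measure conditions on the true state of the condition, so the denominators are the totals with and without the condition, not the totals of positive or negative tests.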
Parameter
- Numerical summary of a population
- Numerical summary of a probability distribution
- Denoted by greek letters
Random variable
A numerical measurement of the outcome of a random event
Expected value
the mean
Mean of a discrete probability distribution
mean=Σ x•p(x)
compute “x•p(x)” for each possible value of x, then add the results
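The mean of a discrete probability distribution weights each possible value by its probability and adds the products. In this sketch the `distribution` dictionary is a made-up example mapping each value x to p(x).

```python
# Hypothetical discrete probability distribution: value x -> probability p(x)
distribution = {0: 0.2, 1: 0.5, 2: 0.3}

# A valid probability distribution has probabilities that sum to 1
total_probability = sum(distribution.values())

# Expected value: add up x * p(x) over every possible value x
expected_value = sum(x * p for x, p in distribution.items())

print(f"total probability={total_probability:.1f}, mean={expected_value:.1f}")
```

Here the products are 0, 0.5, and 0.6, giving a mean of 1.1.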
What type of graph represents continuous distributions?
A smooth curve (density curve); probabilities are areas under the curve
Normal distribution
- used for continuous random variables
- symmetric and bell shaped
Properties of empirical rule
- Data must be unimodal and approximately bell-shaped
- Probabilities are approximate
Rounding rules when working with normal distributions
Round to 4 decimal places
Conditions for binomial distribution
- Fixed number of trials(n)
- each trial has 2 possible outcomes
- the probability of success (p) is the same for each trial
- Trials are independent
What happens to a binomial distribution if p isn't 0.50?
p<0.5: right skewed
p>0.5: left skewed
How do you know if n is large enough in a binomial distribution?
np≥15 and n(1-p)≥15
Mean and standard deviation formulas for binomial distributions
Mean=np
Std=√(np(1-p))
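The binomial mean np and standard deviation √(np(1-p)) can be computed together with the sample size check that both np and n(1-p) are at least 15. The function name `binomial_summary` and the example values of n and p are hypothetical.

```python
import math

def binomial_summary(n, p):
    """Return the binomial mean, standard deviation, and whether
    n is large enough (np and n(1-p) both at least 15)."""
    mean = n * p
    sd = math.sqrt(n * p * (1 - p))
    large_enough = n * p >= 15 and n * (1 - p) >= 15
    return mean, sd, large_enough

mean, sd, large_enough = binomial_summary(100, 0.3)
print(f"mean={mean:.1f}, sd={sd:.3f}, large enough: {large_enough}")

# With n=20 and p=0.5, np is only 10, so the check fails
print(binomial_summary(20, 0.5))
```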
Ways to obtain information
census, sampling, experimentation
Mean and median in symmetric distributions
Mean and median can be used, they should be close in value
What is spread measured by?
Standard deviation and IQR
How to gauge symmetry
Look at how different the mean and median are
What type of statistics is probability?
Inferential
How do you measure spread for discrete random variables?
Range
What is used to find the center of probability distributions?
Mean
Purpose of descriptive statistics
Reduce the data to simple summaries without distorting too much information
Types of proportion distributions
- Population distribution: almost never observed, we learn about it from sample distributions
- Sample distribution: aka data distribution, consists of sample data you observe and analyze, should resemble population distribution if good sampling techniques were used
- Sampling distributions: Describes long run behavior of the statistic, specifies probabilities for all possible values of the statistic for a sample of a given size
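A sampling distribution can be approximated by simulation: draw many samples of the same size and record the statistic from each. This sketch uses the sample proportion as the statistic; the population proportion `p`, sample size `n`, number of samples, and seed are all assumed values.

```python
import random
from statistics import mean

rng = random.Random(42)              # fixed seed so the run is repeatable
p, n, num_samples = 0.3, 50, 2_000   # hypothetical settings

sample_proportions = []
for _ in range(num_samples):
    # One sample of size n: count the successes, then take the proportion
    successes = sum(rng.random() < p for _ in range(n))
    sample_proportions.append(successes / n)

center = mean(sample_proportions)
print(f"mean of {num_samples} sample proportions: {center:.3f}")
```

Each single sample is one data distribution, while the collection of 2,000 proportions approximates the sampling distribution, whose center should sit close to the population proportion.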
How to tell if the sampling distribution of a proportion is approximately normal?
n•p and n(1-p) are at least 15
Central limit theorem assumptions and conditions for the sampling distribution of p
- Randomization condition: values are randomly obtained
- Independence assumption: Sampled values are independent
- 10% condition: n is no more than 10% of the population
- Sample size assumption: n has to be large enough to expect at least 15 successes and failures
Central limit theorem assumptions and conditions for the sampling distributions of the mean of observations
- Randomization condition: values are sampled randomly
- Independence assumption: sampled values are independent
- 10% condition: n is no more than 10% of the population
- Sample size assumption: There is no one size fits all rule, small samples work if the population is unimodal and symmetric, a large sample is needed if it is skewed