Exam 1 Real Flashcards
What is statistics?
Study of methods for measuring aspects of populations from samples and for quantifying the uncertainty of the measurements
What is a population versus a sample?
A population is all of the individual units of interest and a sample is a subset of the population
What are variables?
Characteristics that differ among individuals
What is a parameter?
A quantity describing a population (real)
What is an estimate or statistic?
A related quantity calculated from a sample (a subset of the population)
What does the error of an estimate or statistic depend on?
It depends on the variability within the population and on the sample size
What is estimation?
The process of inferring an unknown quantity of a population using sample data
What is a random sample?
In a random sample each member of the population has an equal and independent chance of being selected
What do random samples achieve?
They minimize bias and make it possible to measure the amount of sampling error
What is a sample of convenience?
A collection of individuals that are easily available to the researcher
What is the parameter?
The truth
What is sampling error?
The difference between an estimate and the population parameter that is caused by chance
What is bias?
- Bias is a systematic discrepancy between the estimates we would obtain if we could sample a population again and again and the true population parameter
- The error is in the same direction each time the sampling is repeated
What is volunteer bias?
Bias resulting from systematic differences between the pool of volunteers and the population to which they belong
What is accurate?
An estimate is accurate when it is close to the truth; the closer it is, the more accurate
What is precise?
Describes how repeatable an estimate is; high precision can result from low variability in the population
Data can be ________ or ________
Categorical or numerical
Categorical data can be ________ or ________
Nominal - no inherent order
Ordinal - inherent order
Numerical data can be ________ or ______
Continuous - any real number
Discrete - indivisible units (# of children)
What is a frequency distribution?
The number of times each value of a variable occurs in a sample
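As a small illustration, a frequency distribution can be tallied with Python's `collections.Counter`. The sample values below are made up for the example.

```python
from collections import Counter

# Hypothetical sample: a categorical variable observed on 8 individuals
sample = ["red", "blue", "red", "green", "blue", "red", "green", "red"]

# Frequency distribution: how many times each value occurs in the sample
freq = Counter(sample)

# Relative frequencies: each count divided by the sample size
n = len(sample)
rel_freq = {value: count / n for value, count in freq.items()}

print(freq)
print(rel_freq)
```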
What are two types of studies?
Experimental and observational
What are two types of variables?
explanatory and response variables
How are variables graphed?
Explanatory variable on the x axis and response variable on the y axis
What is a lurking/confounding variable?
A variable that masks or distorts the causal relationship between measured variables in a study
What are 3 problems with 3D bar graphs?
Takes the average, difficult to make comparisons because of the way data is displayed, magnitudes are distorted making the differences out of proportion
What is good about graphs?
Good when you want to show trends or patterns in values
When are tables good?
When you want to report/compare specific values with precision
What is a bar graph used for?
Uses the height of rectangular bars to display the frequency distribution of a categorical variable
What is a grouped bar graph?
Uses the height of rectangular bars to display the frequency distributions of two or more categorical variables
Which is better a bar graph or a pie chart?
The textbook prefers bar graphs to pie charts; use a pie chart only if there are just two categories
What is a histogram?
Like a bar graph, but the x axis shows a numerical variable
Describe the aspects of a histogram shape.
- the mode is the highest peak in the frequency distribution
- skew refers to asymmetry in the shape
- an outlier is an observation far from the rest of the data
What is a plot with area of rectangles?
mosaic plot
What does a mosaic plot display?
- uses the area of rectangles to display the relative frequency of occurrence of combinations of two categorical variables
What is a scatter plot?
graphical display of two numerical variables, each observation a point on a graph of two axes
What is a strip plot?
a graphical display of a numerical variable and a categorical variable in which each observation is represented as a dot
What is useful about a strip plot?
Gives a good idea of sample size
What is a box plot?
a graph that uses lines and a rectangular box to display the median, quartiles, range, and extreme measurements of the data
What is a violin plot?
a graph that shows an approximation of the frequency distribution of a numerical variable in each group and its mirror image
What does the width of a violin plot indicate?
- distribution of the data
- width is proportional to the density of data points
What is a good tip for multiple histograms?
- better to stack vertically rather than side by side because it is easier to compare groups
- use the same scale for the x axis
What is interquartile range?
upper quartile - lower quartile
What describes the spread of a distribution?
standard deviation and variance
What is deviation?
difference between a data point and the mean
What is the sum of squares?
the sum of the squared deviations
What is variance?
s.d. squared
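A minimal sketch of how deviations, the sum of squares, the variance, and the standard deviation relate, using made-up data. It assumes the usual sample formula, which divides the sum of squares by n - 1.

```python
import math

# Hypothetical sample of 5 measurements
y = [2.0, 4.0, 4.0, 5.0, 10.0]
n = len(y)
mean = sum(y) / n

# Deviation: each data point minus the mean
deviations = [yi - mean for yi in y]

# Sum of squares: the sum of the squared deviations
sum_of_squares = sum(d ** 2 for d in deviations)

# Sample variance (assumption: divide by n - 1) and standard deviation
variance = sum_of_squares / (n - 1)
sd = math.sqrt(variance)

print(mean, sum_of_squares, variance, sd)
```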
What is standard deviation?
the square root of the variance
Why is there a preference for standard deviation?
- never negative
- in the same units as the observation
- helpful rule of thumb properties
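What are the rule of thumb properties of standard deviation?
For bell-shaped distributions, about two-thirds of observations fall within one standard deviation of the mean and about 95% fall within two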
What is the issue with comparing the spread of distributions in different populations?
Mouse vs. elephant weights: a larger standard deviation does not by itself mean a larger spread relative to the mean
What is useful for comparing the spread of distributions in different populations?
coefficient of variation
What is the coefficient of variation?
- the standard deviation expressed as a percentage of the mean
- CV = (s / mean) × 100%
- Larger CV = wider spread relative to the mean
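The mouse and elephant comparison can be sketched with the CV formula. The function name and all numbers below are hypothetical, chosen only to show that a much larger standard deviation can still mean a smaller relative spread.

```python
def coefficient_of_variation(sd, mean):
    """Return the standard deviation expressed as a percentage of the mean."""
    return 100 * sd / mean

# Made-up weights in grams, for illustration only
mouse_cv = coefficient_of_variation(sd=3.0, mean=20.0)
elephant_cv = coefficient_of_variation(sd=400_000.0, mean=4_000_000.0)

# The elephant SD is far larger in grams, but its relative spread is smaller
print(mouse_cv, elephant_cv)
```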
What is the median?
the middle measurement of a set of observations
What do percentiles indicate?
The xth percentile is the value below which x percent of the observations lie
What is the line in the middle of a box plot?
the median
Explain the box and whiskers in a box plot.
- The box covers the entire IQR, from quartile 1 to quartile 3
- The upper whisker extends to the highest data point at or below quartile 3 + 1.5*IQR
- The lower whisker extends to the lowest data point at or above quartile 1 - 1.5*IQR
- Data points beyond the whiskers are drawn as individual dots (outliers)
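The whisker rules can be sketched as follows, with made-up data. Quartiles can be computed under several conventions; the `"inclusive"` method of `statistics.quantiles` is an assumption here, and other software may give slightly different quartiles.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 30]  # made-up sample

# Quartiles (assumption: the "inclusive" convention)
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Limits beyond which points are treated as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points inside the fences
lower_whisker = min(x for x in data if x >= lower_fence)
upper_whisker = max(x for x in data if x <= upper_fence)
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, median, q3, iqr)
print(lower_whisker, upper_whisker, outliers)
```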
Where should the median be in a bell shaped curve?
right in the middle of the box
What is the plot with frequency lines?
- Cumulative frequency distribution: the cumulative relative frequency at a given measurement is the fraction of observations less than or equal to that measurement
- A steep jump indicates the clustering of a lot of data points
- A horizontal line indicates a gap in data points
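A short sketch of cumulative relative frequency, using a hypothetical helper function and made-up data. The cluster of 2s produces a steep jump, and the gap between 3 and 7 leaves the value flat.

```python
def cumulative_relative_frequency(sample, x):
    """Fraction of observations less than or equal to x."""
    return sum(1 for value in sample if value <= x) / len(sample)

sample = [1, 2, 2, 2, 3, 7, 8, 10]  # made-up measurements

for x in sorted(set(sample)):
    print(x, cumulative_relative_frequency(sample, x))
```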
What is the IQR?
the difference between the third and first quartiles of the data. It is the span of the middle 50% of the data
Median is ____ mean is_____
Median is the middle value, while the mean is the center of gravity
What is proportion?
- Proportion of observations in a given category
- P = num in category / n
- The p has a little hat on it when you are estimating the proportion in a sample
Describe how sampling distributions change with different numbers of samples.
- The spread of the sampling distribution depends on the sample size (the number of observations in each sample)
- As the number of observations per sample increases, the spread (standard deviation) of the sampling distribution decreases
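One way to see this is a small simulation. The sketch below assumes a made-up, roughly normal population and compares the spread of sample means for two sample sizes; the population, function name, and sizes are all hypothetical.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population of 10,000 values
population = [random.gauss(100, 15) for _ in range(10_000)]

def sd_of_sample_means(n, n_samples=2_000):
    """Standard deviation of the means of many random samples of size n."""
    means = [statistics.mean(random.sample(population, n)) for _ in range(n_samples)]
    return statistics.stdev(means)

spread_small = sd_of_sample_means(5)   # samples of 5 observations
spread_large = sd_of_sample_means(50)  # samples of 50 observations

# The sampling distribution is narrower for the larger sample size
print(spread_small, spread_large)
```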
What is the standard error of an estimate?
- The standard error of an estimate is the standard deviation of the estimate’s sampling distribution
- SE_Ȳ = s / √n
- Reflects the precision of the estimate
- The smaller the standard error the less uncertainty there is in the estimate of the target parameter
What is the standard error of the mean?
σ_Ȳ = σ / √n
- We usually don’t know the population standard deviation σ, so we use the sample standard deviation s as an estimate of it
σ
population standard deviation
s
sample standard deviation
What is a confidence interval?
a range of values surrounding the sample estimate that is likely to contain the population parameter
What is the normal confidence interval?
The 95% confidence interval provides a most plausible range for a parameter.
How do you describe confidence interval certainty?
- Right: We are 95% confident that the true mean lies between ___ and ____
- Wrong: there is a 95% probability that the true mean falls between 2827.8 and 3828.4
What are error bars?
- lines on a graph extending outward from the sample estimate to illustrate uncertainty about the value of the parameter being estimated
- used to display the uncertainty, not the spread of the data
What is the 2SE rule?
A rough approximation of the 95% confidence interval for a mean can be calculated as the sample mean plus and minus two standard errors
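A minimal sketch of the 2SE rule with made-up data. The function name is hypothetical, and the result is only a rough approximation of the 95% confidence interval, not an exact one.

```python
import math
import statistics

def two_se_interval(sample):
    """Approximate 95% CI for the mean: sample mean plus and minus 2 standard errors."""
    mean = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(len(sample))  # SE = s / sqrt(n)
    return mean - 2 * se, mean + 2 * se

sample = [2.0, 4.0, 4.0, 5.0, 10.0]  # made-up data
lower, upper = two_se_interval(sample)
print(lower, upper)
```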
What is a random trial?
- a process or experiment that has two or more possible outcomes
- die, coins
What is an event in a random trial?
- Event (of interest): any potential subset of all possible outcomes
- Flipping coin: heads
- Rolling die: 3
What is probability?
the proportion of times the event would occur if we repeated a random trial over and over again under the same conditions
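This long-run idea can be sketched by simulating many rolls of a fair die and tracking how often the event "roll a 3" occurs. The seed and number of trials are arbitrary choices.

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

n_trials = 100_000

# Event of interest: rolling a 3 on a fair six-sided die
hits = sum(1 for _ in range(n_trials) if random.randint(1, 6) == 3)

# Proportion of trials in which the event occurred; should be close to 1/6
proportion = hits / n_trials
print(proportion)
```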
How do you abbreviate probability?
Pr[A] means “the probability of event A”
What does mutually exclusive mean?
Two events are mutually exclusive if they cannot occur at the same time
What is a probability distribution?
a list of the probabilities of all mutually exclusive outcomes of a random trial
How do you represent the probability distribution of different variables?
- A discrete variable is measured in indivisible units
- All categorical variables (present or absent) and many numerical variables (number of mates)
- Continuous variables can take on any real number value within some range
- Probability of Y being in some range is indicated by the area under the curve
What is the addition rule?
if two events A and B are mutually exclusive then Pr[A or B] = Pr[A] + Pr[B]
What is the general addition rule?
- Not all events are mutually exclusive, so extra term is needed so you don’t double count outcomes
- Pr[A or B] = Pr[A] +Pr[B] – Pr[A and B]
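The general addition rule can be checked on one roll of a fair die. The events below are chosen for illustration: they overlap, so the outcomes in both events would be double counted without the extra term.

```python
from fractions import Fraction

outcomes = {1, 2, 3, 4, 5, 6}  # one roll of a fair die
A = {2, 4, 6}                  # event A: the roll is even
B = {4, 5, 6}                  # event B: the roll is greater than 3

def pr(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(outcomes))

direct = pr(A | B)                   # count the outcomes in A or B directly
by_rule = pr(A) + pr(B) - pr(A & B)  # Pr[A] + Pr[B] - Pr[A and B]

print(direct, by_rule)
```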
What are independent events?
- Two events are independent if the occurrence of one does not inform us about the probability that the second will occur
- Examples: two flips of a coin or two rolls of a die
What is the multiplication rule?
If two events are independent then the probability that they both occur is the probability of the first event multiplied by the probability of the second event
What are dependent events?
the probability of a particular event in the second trial depends on what happened in the first trial
What is the general multiplication rule?
- Finds the probability that both of two events occur even if the two are dependent
- Pr[A and B] = Pr[A]Pr[B|A]
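As a worked example of the general multiplication rule, consider drawing two cards from a standard 52-card deck without replacement. The card scenario is an illustration and is not taken from the flashcards.

```python
from fractions import Fraction

# Dependent events: draw two cards without replacement
pr_a = Fraction(4, 52)          # Pr[A]: the first card is an ace
pr_b_given_a = Fraction(3, 51)  # Pr[B|A]: the second card is an ace, given the first was
pr_both_aces = pr_a * pr_b_given_a

# Independent events: two flips of a fair coin, so Pr[B|A] = Pr[B]
pr_two_heads = Fraction(1, 2) * Fraction(1, 2)

print(pr_both_aces)  # 1/221
print(pr_two_heads)  # 1/4
```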
Standard deviation, standard error, 95% confidence interval
SD > 95% CI > SE (ordering by size for typical sample sizes)
Explain the difference between a bar plot and a histogram.
Bar graphs are used to show the frequency distribution of a categorical variable whereas histograms are used to show the frequency distribution of a numerical variable.
How do you identify a skew?
The skew is in the direction of the longer tail (a long tail to the right means right-skewed)
The standard error of a sample mean is ___.
the standard deviation of the means of randomly drawn samples from the population
Select the proper interpretation of a confidence interval for a mean at a confidence level of C%.
A range of values _____.
produced by a method such that C% of confidence intervals produced by the same method contain the population mean
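This interpretation can be illustrated by simulation. The sketch below assumes a hypothetical normal population with a known mean, builds an approximate 95% interval (mean plus and minus two standard errors) from each of many random samples, and counts how often the interval contains the true mean. All constants are made up.

```python
import math
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population parameters and simulation settings
TRUE_MEAN, TRUE_SD, N, TRIALS = 50.0, 10.0, 30, 2_000

covered = 0
for _ in range(TRIALS):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
    mean = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(N)
    # Does this sample's interval contain the population mean?
    if mean - 2 * se <= TRUE_MEAN <= mean + 2 * se:
        covered += 1

# Fraction of intervals that contain the true mean; expected to be near 0.95
coverage = covered / TRIALS
print(coverage)
```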