Statistics Basics Flashcards
What means “Doing science”?
Collecting Data so that sample information is a useful representation of the world.
Summarizing Data to make it easier to understand and use for describing the real world.
Using data to critically evaluate evidence for or against a specific hypothesis.
Population of Interest (Main Problem)
Too large to study
Sample
Subset of the population of interest. Knowledge gained from measurements on a sample. Scientists can make estimates of the larger population
What determines whether or not the data collected for a study are representative of the real world?
The methods used to obtain such data. Such methods must include unbised, random procedures
What is a Variable?
Is a characteristic of an object or group of objects that can be represented with a number that has more than 1 possible value
Columns represent…
Variables
Rows represent…
Observations
Ratio-Scale Variables
Have a true absolute zero value.
Quantitative data measured on a scale that has a constant increment between successive values.
Ordinal or Rank Scale Variables
Have values that represent the ranked order of the objects or individuals or individuals with regard to a variable.
However, the actual differences between ranks can differ. Ex. Top 3 GPAs: 4.0, 3.96, 3.7
Discrete variables
Can take only specific values and are often based on counting.
Continuous variables
Can take infinite number of possible values, limited only to the number of decimal places to which the value can be precisely measured.
Categorical variables
Have values that indicate the individual belongs to a class or category. Although these values cannot be inherently represented by numbers, they are often analyzed in terms of the count or proportion of individuals that fall within that class or category.
Population of Interest
Entire group of objects or individuals about which information is desired.
Data
Refers to a collection of observations and/or measurements for one or more variables, made on one or more individuals from the population of interest for the purpose of addessing a specific question.
Statistics
Numbers that describe characteristics of a sample. These are calculated by the data obtained from individuals in a sample.
Sample Statistics and its relation to Population Parameters
Sample statistics are used to estimate or infer something about the values of population parameters.
Sample unit
Is an individual unit that comprises the sample or pop. of interest. For example, sample= people; sample unit= person.
Are Sample Statistics considered accurate with a valid study design?
Even with a valid study design, a sample statistic is more or less accurate. They never represent the true values of the pop. parameter.
Population Parameters
Numbers that describe the characteristics of the entire population of interest.
Random Sample Variation (RSV)
Is the variation in the values of a sample statistic computed from different, independent samples taken from the same population.
They will always happen as long as scientists use samples to estimate population parameters.
Why does RSV happen?
It is a consequence of the randomness of the process by which individuals are selected from a population to create a sample.
In other words, it occurs because repeated samples include different subsets of indivudals who vary with regard to the VoI.
One example as to how sample variation and consequent uncertainty (associated with the estimates of the pop. parameters) can minimized.
If data is obtained using appropiate methods and unbiased procedures.
Bias
Is any systematic deviation of sample statistics away from the true value of population parameters.
Systematic referring to consistently wrong.
Three most common reasons for bias
Confounding, Selection Bias, Information Bias
Selection Bias
Happens when individuals included in the sample are not representative of the larger population of interest.
This is determined by the method used to select individuals from the population into your sample.
Information Bias
Measurements do not adequately represent the variable of interest.
This can be determined whether the choice of method used to measure the variable of interest (VoI) calculates the wrong value of the VoI OR when you are using an appropiate measurement method; however, it is consistently calculating the VoI wrong (For example, lack of training).
Measurement Validity
Is the idea that a measurement made on the study subjects accurately quantifies the variable of interest.
Precision
Refers to the amount of variation among the values of a sample statistic derived from repeated, independent samples of the same population.
For example, if repeated samples produce very similar values of a sample statistic then the estimates are said to be precise estimates of the population parameter.
Sample size and its effect on unusual observations
Since some samples, by chance, include unusual individuals, the effect of such few unusual observations on the value of the sample statistic can be reduced by a large sample size.
In other words, if many individuals are included in a sample, random variation among individuals will average out and sample statistics computed from repeated samples will be less variable, and thus, more precise.
Precise Estimates
The values of a precise statistic derived by repeatedly sampling the same population tends to fall within a very narrow range.
Accurate estimates
The values of a precise statistic derived from an unbiased study design are very likely to fall close to the true parameter value.
What makes a statistic an unbiased estimate of the population parameter?
If all individuals in the population of interest have equal chance of being selected AND the measurement procedure produces valid data for the VoI.
How do you minimize bias?
You do this by reducing selection bias (meaning your sample is representative of the larger population), reducing information/measurement error (choosing an adequate tool to measure your VoI AND making sure your personnel is trained enough), and checking for confounding
How do you maximize precision?
Collecting data from large sample sizes and also training personnel to ensure consistent methodology
Study Design
Is a description of the methods that the investigator will be using to acquire data
Study design depends on
the requirements for statistics to be accurate estimates of parameters and the specific objectives of the study
Sampling study design
Is generally used when the study objective is to estimate the value of the pop. parameter.
Principles of a representative sample
All individuals in the pop. should have an equal chance to be included in the sample AND the sample should include a sufficient number of individuals in the sample to represent the range of variation that is present within the population.
Randomization
Process for selecting individuals who will be included in a sample based on some random mechanism
Purpose of randomization
Minimize bias
Replication in Experimental Design
Refers to the number of individuals included in a sample
RSV and Sample Size
Sample statistics computed from large sample exhibit less random sampling variation/less affected by a few unusual individuals than stats computed from small samples.
Purpose of replication in Experimental Design
Control sampling variability and increase precision
Types of sampling study designs
Completely random sampling, randomized systematic sampling, stratified sampling design
Completely random sampling
All members of the pop. must have an equal chance of being selected for the sample. This sampling can happen through the random selection from a list OR random location of sample points.
Haphazard selection
Is not random sampling. Examples of this include, picking up the phonebook to randomly pick names or walking in the woods aimlessly to pick random location. It is not possible for you to confirm that the selection is truly random
Randomized systematic sampling Application
Commonly used in field sampling when it is very difficult to travel to random points or when a relatively small numbers of sample points will be used to describe a larger area.
Randomized systematic sampling
A random starting point is chosen and then sampling locations are located at a fixed distance intervals or travel-time intervals proceeding away from this random point. It reduces cost and effort.
Stratified sampling design
Involves identifying the various subpopulations called strata and taking separate random samples from each.
Experiment
Deliberate imposition of a treatment by the investigator on a sample of subjects to evaluate the response of the subjectsto the treatment.
Primary purpose of experimental study designs
Determine cause-effect relationships between a treatment variable and a response variable
Treatment/Explanatory Variable
Measures the condition(s) that the imposes on the study subjects
Response/Outcome Variable
Measures some characteristic of the study subjects that is hypothesized to change as a result of the treatment
How are experiments conducted in relation to non-treatment variables
They are done in controlled settings to minimize the possibility that nontreatment variables might influence the response of study sujects
Experimental design establishes what?
Establishes conditions such that there are only two possible explanations for why groups that received different treatments are different at the end of the experiment: Treatment caused the difference or is it was due to chance (random sampling variation)
What happens if nontreatment factors are allowed to influence study subjects or treatment groups were different from the beginning
It is impossible to determine if the treatment was the cause of the observed differences
What is an essential good component of experimental design
Equivalent study (treatment and control) groups
What is referred to equivalent study (treatment and control) groups
Groups that prior to the imposition of treatment are similar. They have the same variation of all nontreatment variables
Randomization of Assignment does what?
It randomly assigns the treatment to one of the two groups and ensures that the groups are equivalent before the treatment is imposed.
What is adequate replication?
Since replication involves including many subjects in your sample, which in the case of an experimental design will be divided into two groups, adequate replication refers to including enough individuals to average out individual differences, minimizing hte possibility that differences between groups are due to random-chance difference between individual assigned to groups.
What is the experimental unit?
The basic unit of the population of interest, for example, if population is patients with a HBP, then the experimental unit is patient.
What does independence refer to?
The value on the outcome variable on one experimental unit (patient) should not influence or not be influence by the value measured in other units
Different types of experimental design
Completely randomized (one treatment variable, two or more levels),
Before-after (used when the effect of the treatment is expected to be small relative to the variation among the study subjects),
Matched-pairs (unlike before-after, there is no carryover effect),
Randomized block (used when a response to treatment is influenced by an extraneous/nontreatment variable that cannot be controlled or eliminated; for example gender),
Factorial (used when the researcher want s to determine the response of the study subjects to two or more treatments but has reason to believe that effect of one variable will interact with the effect of other variables)
Problems with Cause-Effect Inferences
Since the purpose of an experimental study design is to provide a realistic test of the effect of a treatment on individuals in a population of interest, an inverstigator might reach an ERRONEOUS conclusion regarding the effect of the treament on the pop. of interest due to any of several types:
Confounding factors: facotrs that were not controlled by the investigator might actually be the cause of the differences between groups;
Poor measurement validity (aka information bias): measurements that are poor representations of the phenomen of interest provide misleading information;
Groups are not similar: often happens when randomization is not done or when there is not an adequate replication;
Nonrepresentative subjects included in the experiment (aka as selection bias): often seen when researchers, who are tying to control for external factors, include subjects from a homogenous group;
Investigator bias: researchers are more likely to see or not see a treatment effect due to their preconceived expectations about how experiments should turn out.;
Placebo effect;
Lack of realim: The more realistic the treatment and experimental cinditions, the less control the researcher may have over confounding factors; the more control over confounding factors, the less realistic the experiemntal conditons and the less likely the study subjects will respond as they might in their natural environment.
What are Natural Experiments?
They are experiments that involve comparing samples obtained from two or more populations in their natural environment
Main observations about natural experiments
Researcher has limited or no control over which subjects received treatments AND he/she has limited control over how the subjects have been influenced by extraneous, nontreatment factors.
Main reasons why natural experiments are adequate
They are more realistic and they avoid ethical concerns since it was not the researcher who imposed the natural treament/exposure
Main problems with natural experiments
There is a substantial risk that the two comparison groups are nonequivalent before the exposure happened AND cannot completely control for extraneous factors HENCE observed differences may not have been caused by the natural treatment/exposure
Adequate solution to natural experiments
Selecting study subjects based on criteria that make comparison groups as similar as possble. Caution must be taken when reporting conclusions about the population of interest
How can the independence assumption be violated?
Sampling of each individual is not truly random, such that some individuals in the sample are located in close spatial proximity to each other other or are genetically related.
Pseudo-replication: multiple measurements made on individuals from the pop. of interest are treated the same as single measurements from different, randomly chosen individuals
Are the scenarios that violate independence problematic and can you still make analyses with them?
Making analyses with these types of scenarios are not problematic, The problem is assuming that your sample is independent. For pseudo-replication, you will need to average the multiple measurements to obtain one single measurment, and for measurments on multiple individuals located around a single randomly located point, they must be analyzed using procedures specficially developed to handle data from that type of study design.
How does randomization lead to equivalent groups before the treatment is imposed?
Guarenteeing each individual has equal chance of being assigned to any group.
When participants are randomly allocated to different groups, the law of large numbers ensures that, on average, each group will have similar characteristics. Randomization creates balance across groups in terms of both measured and unmeasured factors that could potentially influence the outcome being studied.
By randomizing, you’re making sure each group has a fair mix of different preferences and characteristics. It’s like shuffling the cards before dealing them out.
So, randomization helps make sure your experiment is fair and that any differences you see between the groups are because of what you’re testing, not because of other factors. It’s like setting up a level playing field for your experiment.
First rule after collecting data
LOOK at the data values for unanticipated patterns or oddities. Do not jump straight to complex statistical analyses. Remember, all statistical analyses are bounded to the GIGO principle: Garbage in, Garbage out. Meaning if oyu apply statisitcal analysis to that that data that is not appropriate for that analysis, your apparently scientific and precise results will be nonsense.
What is exploratory data analysis (EDA)?
It is based on looking at the data to assimilate and understand the information embodied within the data. The goal is to summarize the data in a way that accentuates patterns (systematic variation in data values), and that distinguish patterns from noise (random variations or deviations from the pattern).
What is the goal of the EDA?
The goal is to summarize the data in a way that accentuates patterns (systematic variation in data values), and that distinguish patterns from noise (random variations or deviations from the pattern). We are looking for evidence of ANY PATTERN without preconceived notions as to its nature.
What are associations?
They are a type of pattern that allow us to make informed predictions of future events, the true goal in science.
What are the primary tools of EDA?
Graphics and Summary Statistics
Examples of graphics
Histograms, stem-leaf plots, box plots, scatter plots
Examples of summary statistics
Mean, Median, Mode (Measure of Center)
Range, Variance, Standard Deviation, Interquartile Range, Min, Max (Measures of Spread), Percentiles (Measure of Data Value Location)
What is frequency
Is the number of times a specific data value occurs in the sample data.
What is relative frequency?
Is the proportion or percentage of a specific data value in a sample
Measures of Center
Measure the middle, the range of values are appear more frequently in your sample
Measures of spread
They describe the amount of variability in your sample or another way to phrase it “the amount of variability in the values away from the center”. Little spread = little variability. Large spread = large variability
Charcteristics of distributions
Center, spread, shapes, gaps, outliers
Shape
Refers to the number of peaks in the distribution and whether the distribution is symmetirc or assymetric (parametric or nonparametric)
Gaps
Refer to segments within the min and max value data values where there are no data values
Outliers
Refers to data values that very drifferent from all other values.
They can happen due to measurement error (mistake), sampling individuals from putside the pop of interest (mistake, or novel situations .
Hisyograms vs Bar charts
Histograms display data distribtuions for quantitative variables while bar charts display them for qualitative , categorical variables.
The concepts of center and shape apply only to quantitiave data variables, not to categorical variables displayed in bar charts. The concept of variability applies too categorical variables onlin in the sense that some of these variables have more distinc classes than others; for example, mangos.
Bin width
Refers to the width of the intervals that define classes. The selection of the best bin width is usually done by trial and error. Usuallym the optimal interval width is found only after looking at multiple histograms with different interval widths (to small a bin width, you will find many opeaks, too large a bin width, you will obscure some importnant patterns)
What are summary statistics?
They are numerical values that describe the characteristics of data distributions. They provide more objective assessments of differences or associations than graphics (visual comparisons)
What does a Percentile refer to?
The percentile of a data distribution is the value of the variable such that p% of the values in the sample are less than that value. For example, the median is the 50th percentile, meaning 50% of the values are less than the median.
Median
Is the middle value of a ranked list of data values so that 50% of data values are less than the median and 50% are greater
Mode
Is the most frequent data value in a data distribution
Variance
It is the average of the squared differences between individual data values and the mean of the distribution. However instead by dividing just by “n”, you deivde by n-1, which is called the Bessel’s correction. This method corrects the bias in the estimation of the population variance. If you didn’t subtract minus 1, you would be underestimating the variance.
Remember more spread out data from the center, more variance. The variance increases because the difference results in larger numbers, hence, increasing the numerator, but keeping the denominator fixed.
Standard deviation
Is the square root of the variance
IQR
Is the 75th percentile minus the 25th percentile. This statistic expresses the range of the middle 50% of the data values. If using the median as youe measure of center, you should use the IQR as your measure of spread because both are based on the ranking of your data values, rather than the data values themselves.
Sensitive statistic vs Robsut statistic
A statistic is sensitve if its is influenced by having outliers in your data. Usually, sensitive statistics are less sensitive if the sample size is large; however, a robust statistic is not influenced by the presence of outliers in your data.
Describing the shape of a distribution
Symmetry: Example bell-shaped normal distribution; a distribution is symmetric when both sides are very similar.
Skewness: Data is clumped together towards one end of the range of values. Right or Postively Skewed is when there is few data values in the right AND Left or Negatively Skewed is when there is few data values in the left.
Kurtosis: Refers to how the shape compares to a bell-shaped normal distribution. If there is more data in the middle and few in the tails, it is said to be peaky (leptokurtic); however if it is the opposite, it is called flat (platykurtic).
Unimodal vs multimodal: One peak vs multiple peaks
Graphics for comparing distributions
Back to Back Stem Leaf Plot: Ised when only comparing 2 groups (shows the maximum amount of info, easy to read, but unfamiliar to people)
Side by Side Histograms: 2-3 gorups. Emphasis on detail, easily understood by most people, although professional appearing it is diffcult when making specfic comparisons
Side by Side Boxplots: 2 or more gorups being compared; emphasis on clarity of comparisons but sacrifices some detail (gaps, multiple modes)
Association between 2 variables
When the value of one variable changes the value of the other variable also changes in a systematic manner.
No association between 2 variables
This happens when there is no pattern or if the data values of X and Y are in a horizontal or vertical line
Population Mean Notation
Greek Letter Mu μ
Population Variance Notation
Greek Letter Sigma Squared σ^2
Sample Mean Notation
X bar
Sample Variance Notation
S squared
Unbiased Sample Variance
Divding by a smaller number, you will get a larger sample variance. If you just divided by n, instead of n-1, your sample mean will always sit inside of your data, even though your true population mean is outside of it. However, you want to look at it, you will be underestimating the population variance.
Linear vs. Nonlinear
Linear: Regardless of the values of X, the change is Y will be constant
Nonlinear: The change in Y is not constant; it does depend on the values of X and will eventually create a slope
Strength of an association through a scatterplot
Is measured through the amount of scatter in the cloud of data points. It is also best explained in the context of a true cause-effect relationship.
Strong association
Is one where the cause vairable is the only factor that controls the response of the outcome variable
Weak association
Is one where many “cause” variables influence the value of the outcome variable.
Timeplot
Time scale is normally on the X-axis.
X-axis observations for scatterplots
Although the x-axis is often used for the “cause” variable in a cause-effect relationship, time is not a cause variable. Many variables change over time, but the changes are cause by facotrs that happen to occur over time.
Application of Stats in the Process of Science
Involves:
- Obtaining data (experiment and sampling design)
- Summarizing and describing data (exploratory data analysis and summary stats)
- Using data from samples and experiments to make estimates and test competing hypotheses about the universe (inferential stats)
Why can’t we ever say with certainty that the value of a sample stat exactly equals the true value of a pop parameter?
Because of random dampling variation. Different samples randomly taken from the same population will produce different estimates of the pop. parameter value.
What do scientists have to do because of random sampling variation?
They must quantify just how uncertain their estimates and conclusions are to convince others of the validity of their judgments
How do scientists quantify their uncertainty?
Scientists evaluate the validity of their hypotheses by determining the probability of getting the observed value of a sample statistic if the parameter value proposed by a hypothesis is true.
Probability Defintion
It can be defined as the relative frequency of an event. That is, if you observed a very large number of outcomes from a random phenomenon, the proportion of outcomes that meet the description of a specific event is an estimate of the probability of that event.
For example, when a meteorologist says there is an 80% orbability of rain today, that means that it rained on 80% of days when similar conditions prevailed in the past.
Event
Is defined as a combination of outcomes from a reandom phenomenon that meet a specific criterion.
For example, rolling a six-sided die is a random phenomenon with a sample space of 6 possible outcomes; one event might be “the number of dots is less than 3”: outcomes that meet the defintion of this event are {1,2}
Random phenomenon
Is a phenomenon that has individual outcomes that are not predictable but the probabilities associated with the possible outcomes are well-defined.
Haphazard phenomenon
The probabilities associates with the various possible outcomes are unknown
Probability distribution of outcomes for a random phenomenon
Is comprised of two parts:
- Listing all possible outocmes for a random phenomenon called sample space;
- The probabilities associated with each outcome.
Basic Rules of Probability
- The value of a prob. must fall within the range from 0 to 1. In terms of percents. valid probability values must be between 0 and 100%.
- The sum of all probabilities associated with all possible outcomes in the sample space of a random phenomenon is always 1.0.
- Complement Rule: For any Event A in the sample space, the prob. that A does not occur is 1 minus the prob. that A does occur. Ex. P[Not 1] = 1 - P[1]
Union of events
Is the combination of their outcomes. For example if Event A is Getting 1 or 2 and Event B is Getting 6, then the union of these two events, indicated as A or B, is the combination of all outcomes: sample space is {1,2,6}. The union of events is said to have occured if any one of the outcomes in the combination occurs.
Disjoint events
Are events that do not have any outcomes in common. When determining the probability associated with the union of two events, it is important to determine whether the events are disjoint.
Simple Addition Rule of Prob.
P[A or B or C] = P[A] + P[B] + P[C]. This rule is only used if the events are disjoint.
Problem with Nondisjoint Events
If you want to know the prob of drawing a king or a red, you will say P[King or Red]= P[King] + P[Red] = 4/52 + 26/52 = 30/52. This is wrong because you are double counting the kings, once for being kings and once for being red.
The double counting is why the events must be disjoint for the simple addition rule to provide correct prob. statements.
General Addition Rule of Prob.
This addresses the nondisjoint problem.
P[A or B] = P[A] + P[B] - P[A and B]
By substracting the probability associated with the union (overlap) of the nondisjoint events, the effect of this double counting is eliminated.
Intersection of Events
This refers to the event that all events will occur.
If event A is {A person selected is male} and event B is {A person selected is Republican}, then the intersection of these events, indicated as A and B, would occur if a randomly chosen indivudal in the US is both male and republican.
If events A and B are disjoint then they have no intersection and it is impossible for them to occur together P[A and B]=0
Independent Events
Two events are independent if the prob. of B, P[B], is in no way related to or affected by whether or not event A has occured and vice versa. If the outcome of one event does not affect the outcome of the next event.
When determining the the prob. associated with the intersection of two events P[A and B], it is important to determine if the two events are independent.
Simple Multiplication Rule of Prob.
P[A and B]= P[A] * P[B]. This rule is only used if the events are independent.
Problem with Nonindependent Events
When answering a question such as “what is the prob that two cards drawn from the same deck will both be face cards”, you want to apply the simple mulpication rule of prob. (it involves the prob of one event AND another event).
P[face and face] = 12/52 * 12/52
However, the answer is wrong and that is because it is violating the independence assumption. Temoving cards from the deck changes the probabilities for subsequent draws. It should be 12/52*11/51
General Multiplication Rule of Prob.
Addresses the problem with nonindependent events.
P[A and B] = P[A] * P[B|A]
By multiplying by a conditional probability, you are taking into account that the probability of B depends on A, hence the prob. of B given that A has occurred.
Random variables
Are quantitative represetations of outcomes of random phenomena. Because random sampling variation is always present in scientific studies, all sample statistics (means, medians, proportions, standard deviations) are random variables and all statistical analyses are based on probability distribtuions for random variables
Disjoint vs Independent
Events are considered disjoint if they never occur at the same time; these are also known as mutually exclusive events. Events are considered independent if they are unrelated.
Theoretical Probability Values?
They are values based on assumptions about the nature of the random phenomenon and the application of the rules of probability.
For example, given the rules of probability, there is only one probability distribution for the random phenomenon of flipping a fair coin (50% heads and 50% tails; Sample space{heads, tails}
Empirical (Data-Based) Probability
Refers to a probability estimated from observed long-term relative frequencies and it is computed as follows:
Prob of Event A (also known as the relative frequency of the occurrent of Event) = Number of ocurrences of Event A / Total number of observations
When do we used Empirical Probabilities?
When the assumptions made about the event are not correct or possible. For example, a scenario that is not as simple as “the coin is fair”. In these circumstances, we must observe many repetitions of the random phenomenon (the event) to learn the probabilities of its various possible outcomes.
Law of Large Numbers
As the number of observations increases, variation in the relative frequency of the event diminishes and the empirical probability aproaches the true probability.
Hence, the more data you have, the better will be your empirical estimate of the true probability of an event. However, a certain amount of the empirical from the true probability remains.
What is a Binomial Random Variable?
Is a discrete quantitative variable that indicates the count of the number of observations or individuals in a sample that belong to a specified category, assuming four conditions are met:
- The total number of observations is fixed (This means you counted the number of individuals with the characteristic of interest after sampling all your individuals, not until you got what you considered a good number of individuals with the characteristic of interest).
- All individuals/observations have equal probability of being selected.
- The probability that any one observation of the event will meet the specified criterion is constant.
- The outcomes of the multiple individuals/observations of the event are independent.
Conditions 3 and 4 depend on the population size being 100x greater than the sample size.
Binomial variables are based on what?
On observations of a random phenomenon that are categorized based simply on whether or not a specified outcome occurs.
How do you calculate the probability associated with each value of X of a Binomial count variable?
First, you list all possible outcomes of the random phenomenon. Ex. For a single coin flip, the possible outcomes are Heads and Tails.
Second, list all the possible values the Binomial random variable can take. In other words, create your sample space. If you flipped the coin 3 times, the sample space for the count of heads is {0, 1, 2, 3}. Assign a probability to each outcome, for example, 0.5 for heads.
Third, determine probabilities for all values in the sample space of the binomial count variable (X). Use the simple multiplication rule to do this.
Fourth, sum the probabilities for all outcomes (combinations) that produce the same value for the count variable X to determine the overall probability of that value of X.
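The four steps above can be sketched in Python for the 3-flip coin example (variable names are illustrative):

```python
# Binomial probabilities by enumeration: X = number of heads in 3 fair flips.
from itertools import product

p_heads = 0.5
probs = {x: 0.0 for x in range(4)}              # step 2: sample space {0, 1, 2, 3}
for outcome in product("HT", repeat=3):         # step 1: all 8 possible outcomes
    # step 3: simple multiplication rule -- the flips are independent
    p = 1.0
    for flip in outcome:
        p *= p_heads if flip == "H" else (1 - p_heads)
    # step 4: sum probabilities of outcomes that give the same count
    probs[outcome.count("H")] += p

print(probs)  # {0: 0.125, 1: 0.375, 2: 0.375, 3: 0.125}
```

The four probabilities sum to 1, as they must for a complete sample space.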
Empirical Estimate of P[X]
Is the relative frequency of the occurrence of each X value = Number of times that X value occurs / Total number of n observations.
Remember, the more repetitions used to compute relative frequencies, the closer they will approximate the true probabilities for each X value in the sample space.
Continuous Variables and the Binomial Distribution
Because it is not possible to list all the possible values in the sample space for a continuous variable, statistical analyses using continuous variables must be based on a probability distribution other than the Binomial distribution.
Probabilities Associated with the Pop. Mean
There is a high probability associated with values close to the population mean
Probabilities Associated with values away from the mean
These probabilities progressively decrease as the value deviates from the mean.
Probabilities for Continuous Variables
These probabilities are determined only for ranges of values (a ≤ X ≤ b). Because there are an infinite number of values in the sample space, the probability associated with getting exactly one specific value is essentially zero.
Probability Histograms for Discrete Random Variables
This type of graphic lists all the possible values of X in the sample space on the x-axis, and above each value is a histogram bar that displays the probability (relative frequency) associated with that value.
Probability Density Curves
This graphic is used to represent probability distributions for continuous random variables. The X-axis displays the range (min to max) of values for the continuous variable. The probability of any event defined by a specific range of values within the sample space (a < X < b) is represented by the area under the curve above the specified range of X-values.
The Y-axis is less important (it describes the probability density, which is a harder concept to grasp and irrelevant at this point)
What is the Normal Distribution?
Is a family of probability density curves that are symmetric, unimodal (single-peaked), bell-shaped, and defined by the mean μ and the standard deviation σ of a continuous random variable
What sample statistics are used to estimate population parameters μ and σ?
Sample mean (x bar) and the sample standard deviation (S)
What determines the location of the center of the bell-shaped distribution along the X-axis?
The mean μ
What determines the spread of the bell shaped distribution and how?
The standard deviation σ. The horizontal distance between the mean and the two points on the bell-shaped curve to either side of the mean where the curve changes from concave down to concave up (called inflection points) is equal to the standard deviation σ. (See figure saved in favorites on your phone)
Empirical Rule for Normal Distributions
- Approximately 68% of the area under the curve lies within one standard deviation of the mean. That is, P{μ − 1σ ≤ X ≤ μ + 1σ} ≈ 0.68
- Approximately 95% of the area under the curve lies within two standard deviations of the mean. That is, P{μ − 2σ ≤ X ≤ μ + 2σ} ≈ 0.95
- Approximately 99.7% of the area under the curve lies within three standard deviations of the mean. That is, P{μ − 3σ ≤ X ≤ μ + 3σ} ≈ 0.997
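The 68-95-99.7 rule can be checked numerically with the standard Normal CDF, Phi(z) = 0.5 * (1 + erf(z / sqrt(2))). A small sketch (the helper name `phi` is mine, not from the cards):

```python
# Verify the empirical rule using the exact Normal CDF via math.erf.
from math import erf, sqrt

def phi(z):
    """Standard Normal cumulative probability P[Z <= z]."""
    return 0.5 * (1 + erf(z / sqrt(2)))

for k in (1, 2, 3):
    # P[mu - k*sigma <= X <= mu + k*sigma] equals P[-k <= Z <= k]
    area = phi(k) - phi(-k)
    print(f"within {k} SD of the mean: {area:.4f}")
```

The exact areas are 0.6827, 0.9545, and 0.9973, which is where the rounded 68-95-99.7 figures come from.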
What does the empirical rule provide?
A useful approximation for determining probabilities associated with values for Normally distributed random variables.
Standard Normal Distribution (SND)
Overall, it is quite difficult to determine areas under Normal curves, which is why mathematicians have performed the calculations and produced a table of probabilities for a single SND.
This specific Normal distribution has a mean μ = 0 and a standard deviation σ = 1
Standard Normal Distribution Variable
Is given the symbol Z
Can you determine probabilities associated with values for any normally distributed variable using the SND table?
Yes
How do you determine the probabilities associated with a range of values for a continuous Normally distributed random variable X?
You transform the original x-value(s) to z-values on the standard Normal Z scale, using the formula z = (x − μ)/σ
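A worked sketch of the transformation (the values of mu, sigma, and x are made up for illustration):

```python
# Standardize an x-value to the Z scale, then look up its probability.
from math import erf, sqrt

mu, sigma = 100, 15                 # hypothetical population mean and SD
x = 130
z = (x - mu) / sigma                # z = (x - mu) / sigma = 2.0
prob = 0.5 * (1 + erf(z / sqrt(2)))  # P[X <= 130] = P[Z <= 2.0]
print(z, round(prob, 4))
```

Here the erf-based expression plays the role of the standard Normal table lookup for P[Z ≤ z].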
Can you determine probabilities associated with a specific Z-value or a range of Z-values?
A range of Z-values
The probabilities in the standard normal table are given only for the range defined by P[Z≤z], meaning?
You need to take the complement of P[Z ≤ z] to calculate the probabilities associated with events such as P[Z ≥ z] and P[z_a ≤ Z ≤ z_b].
Remember that determining P[z_a ≤ Z ≤ z_b] only makes sense if z_b is larger than z_a
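A sketch of both manipulations (the z-values 1.96 and -1.96 are chosen for illustration only):

```python
# Derive P[Z >= z] and P[z_a <= Z <= z_b] from the table form P[Z <= z].
from math import erf, sqrt

def phi(z):
    """Standard Normal cumulative probability P[Z <= z]."""
    return 0.5 * (1 + erf(z / sqrt(2)))

p_upper = 1 - phi(1.96)            # complement: P[Z >= 1.96]
p_range = phi(1.96) - phi(-1.96)   # P[-1.96 <= Z <= 1.96]; needs z_b > z_a
print(round(p_upper, 4), round(p_range, 4))
```

This reproduces the familiar pair: about 2.5% of the area lies above z = 1.96, and about 95% lies between -1.96 and 1.96.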
If a histogram of the individual data values in a sample is approximately symmetric and bell-shaped, what can you assume?
You can assume the population distribution from which the data were obtained is also Normal; however, if the sample size is small, the distribution of data values obtained from a truly Normal distribution may not appear bell-shaped in a histogram.
What do you use to evaluate whether or not the data values of a quantitative random variable X are Normally distributed?
You should plot the data in a Normal quantile (probability) plot. If the array of points in the plot forms a straight line, the values of the variable are Normally distributed. Deviations from a straight line indicate a non-Normal distribution.
What is the Sampling Distribution of a Statistic?
It is the probability distribution for the values of a sample statistic. Sampling distributions describe the range of possible values for a sample statistic and display the probabilities associated with those values.
How can the sampling distribution of a statistic be visualized?
As a probability histogram or a probability density curve. The list or range of possible values for the statistic is on the X-axis, and the height of the bars or the area under the curve represents the relative frequency (probability) of obtaining specific values of the statistic from random samples.
What is assumed of the sample statistic in relation to the true pop. parameter if the sample units are selected by a random, unbiased procedure?
If selected by a random, unbiased procedure, simple logic dictates that the value of a sample statistic is equally likely to fall above or below the true parameter value. If we compute many values of a sample statistic, derived from repeated, independent samples from the same population, and we plot a frequency histogram of these values, the true value of the parameter should be at the center of the histogram.
What is the expected value of the sample statistic?
The center of the sampling distribution. Basically, if the study design is random and unbiased, the expected value of the statistic is the true value of the pop. parameter.
What happens to the sampling distribution if the sample size n of each sample is increased?
The variation among values of a sample statistic computed from repeated samples from the same pop. should decrease.
How can we maximize the probability of getting a representative sample?
By sampling a large number of sample units with a random, unbiased procedure. The representative sample will provide a precise and accurate estimate of the true pop. parameter value.
What is the most common statistic for categorical variables?
Proportion
Sample proportion
Denoted by the symbol p hat and computed as X/n. This statistic is an empirical estimate of the proportion of individuals in the entire population that fall into the specified category.
The count X is a discrete random variable that can take only the integer values 0 through n, so p hat is likewise discrete, with possible values 0, 1/n, 2/n, …, 1.
The spread of the sampling distribution of p hat reflects what?
It reflects the amount of random sampling variation that would be exhibited by this statistic if independent samples of n observations were repeatedly taken from the same population.
Standard Deviation of p hat
sigma p hat = square root of (P(1-P)/n)
As the formula shows, both the sample size AND the value of the population proportion influence how much random variation is observed in the value of p hat
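A minimal sketch of the formula (the values of P and n are made up for illustration):

```python
# Standard deviation of p-hat: sigma_p-hat = sqrt(P(1-P)/n).
from math import sqrt

P, n = 0.3, 100                     # hypothetical population proportion and sample size
sd_p_hat = sqrt(P * (1 - P) / n)
print(round(sd_p_hat, 4))
# P(1-P) is largest at P = 0.5, and a larger n shrinks the whole quantity,
# which is how both P and n influence the random variation in p-hat.
```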
How the Sample Proportion Becomes Approximately Continuous
Although the sample proportion is a discrete random variable that can take only particular values (e.g., {0.0, 0.25, 0.5, 0.75, 1.0} when n = 4), as the sample size increases, the values in the sample space of p hat look more like those of a continuous variable.
The consequence is that the shape of the sampling distribution of p hat changes from a staircase-like probability histogram to one that looks more like a smooth curve. Hence, as the sample size increases, the sample space values for p hat become approximately continuous.
Population Proportion, Large Sample Size, and the sampling distribution of P hat
The more the value of the population proportion deviates from 0.5, the more skewed the sampling distribution of p hat is; however, if the sample size is sufficiently large, the sampling distribution of p hat becomes approximately Normal
Empirical estimates of the expected value of p hat, of the SD of p hat, and of the shape of the sampling distribution of p hat.
The mean of the p hat values from thousands of simulated samples is an empirical estimate of E(p hat)
The SD of the simulated p hat values is an empirical estimate of the SD of p hat
The shape of the relative frequency histogram for p hat values from thousands of simulated samples is an empirical estimate of the shape of the sampling distribution of p hat
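A hypothetical simulation of these empirical estimates (all settings are illustrative, not from the cards): simulate many samples, compute p hat for each, and compare the mean and SD of the simulated values against the theory E(p hat) = P and sigma p hat = sqrt(P(1-P)/n).

```python
# Simulated sampling distribution of p-hat vs. its theoretical properties.
import random
from math import sqrt

random.seed(1)
P, n, reps = 0.4, 50, 10_000        # hypothetical proportion, sample size, repetitions

# each simulated sample: count successes out of n, divide by n
p_hats = [sum(random.random() < P for _ in range(n)) / n for _ in range(reps)]

mean_p_hat = sum(p_hats) / reps                              # estimates E(p-hat) = P
sd_p_hat = sqrt(sum((p - mean_p_hat) ** 2 for p in p_hats) / (reps - 1))
print(round(mean_p_hat, 3), round(sd_p_hat, 3))
# theory: E(p-hat) = 0.4 and sigma_p-hat = sqrt(0.4*0.6/50) ~ 0.069
```

A relative frequency histogram of `p_hats` would likewise estimate the shape of the sampling distribution.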
When the empirical sampling distribution histogram for p hat is skewed, what happens with the normal probability plot?
The points occur in separate clusters. As the sample size increases, the points in the Normal probability plot begin to approximate a continuous straight line.
In real-world science, do we take repeated samples from our population of interest to document the sampling distribution of our statistics?
Never. The theoretical sampling distributions (E(p hat) = P and sigma p hat = square root of (P(1-P)/n)) are therefore powerful tools that allow us to describe a large population based on data from a single sample.
Since p hat can be considered a normally distributed variable if the sample size is large, then how can you determine its probabilities?
You can determine the probability associated with any range of values for p hat by converting the proportion value to a standard Normal Z-value and using the standard Normal distribution to obtain the probability.
P[p hat ≤ p hat observed] = P[Z ≤ (p hat observed − E[p hat]) / sigma p hat], OR the same formula with ≤ replaced by ≥
What is the sample mean?
It is used to estimate the value of the true mean for the entire pop. of interest, which is denoted by the symbol μ
We use probability distributions for the value of the sample mean for what purpose?
To arrive at appropriate conclusions based on the uncertain evidence provided by sample data. Since the sample mean is a continuous random variable, the sampling distribution for the sample mean is a probability density curve.
If individuals are sampled from the population of interest by a random, unbiased procedure, the value of the sample mean (x bar) is equally likely to fall above or below the value of the population mean μ. True or False?
True. Therefore, the center of the sampling distribution of x bar is the true value of the population mean μ, assuming a representative sample is obtained. Hence,
E(x bar) = μ
The spread of the sampling distribution of x bar reflects what?
The amount of random sampling variation that would be exhibited by this statistic if independent samples of n observations were repeatedly taken from the same population.
What is population standard deviation (sigma)?
It is the amount of variation among individuals in the population. This is a characteristic that differs both between variables and between different populations. For example, there is more variation in body weight among adults than among infants.
The greater the variation among individuals in the population (sigma), the greater…
the amount of random sampling variation we can expect to see in values of x bar computed from independent samples.
What does the SD of the sample mean (x bar) measure, and what is its formula?
sigma subscript x bar = sigma / square root of n. This quantifies the amount of random sampling variation in the value of x bar.
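A minimal sketch of the formula (sigma and n are made-up illustration values):

```python
# Standard deviation of the sample mean: sigma_x-bar = sigma / sqrt(n).
from math import sqrt

sigma, n = 10.0, 25                 # hypothetical population SD and sample size
sd_x_bar = sigma / sqrt(n)
print(sd_x_bar)
# because of the square root, quadrupling n only halves the sampling variation
```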
Problem with the SD of the sample mean formula
We rarely know sigma (population standard deviation), which is why we will usually quantify the spread of the sampling distribution of x bar by replacing sigma with an estimated SD, denoted by S.
How do you calculate the estimated SD (S)?
You calculate it using the data values from your sample
Standard Error of the Mean
This is the resulting measure of spread for the sampling distribution of the mean, and it is given the symbol S subscript x bar.
What determines the shape of the sampling distribution of x bar?
- The shape of the population distribution for variable X.
- Sample Size n
What is Population Distribution?
It is the probability distribution for individual values of variable X that would be obtained if the entire population were measured.
What happens if the shape of the population distribution is Normal?
The sampling distribution of x bar will always be Normal.
How do we assess if a population distribution is Normal?
Generally, investigators look at histograms, boxplots, or Normal quantile plots of individual data values in a sample (called data distribution)
What happens if the data distribution (and by inference the pop. distribution) is not Normal? (For example, skewed or multimodal)
The shape of the sampling distribution of x bar will depend on the sample size n. If n is small, the shape of the sampling distribution will be similar to the shape of the data distribution (hence the pop. distribution); however, as n increases the sampling distribution of x bar will become approximately Normal (See Central Limit Theorem)
Central Limit Theorem
This theorem says that when the sample size n is sufficiently large, the shape of the sampling distribution of the sample means x bar will be Normal, no matter what the shape of the population (data) distribution.
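A hypothetical Central Limit Theorem simulation (all settings are illustrative): draw repeated samples from a strongly right-skewed population, an Exponential here, and watch the skewness of the x bar values shrink toward the symmetry of a Normal curve as n grows.

```python
# CLT sketch: sample means from a skewed population become more symmetric
# (skewness closer to 0) as the sample size n increases.
import random

random.seed(7)

def skewness(xs):
    """Sample skewness: third central moment over SD cubed."""
    m = sum(xs) / len(xs)
    s2 = sum((x - m) ** 2 for x in xs) / len(xs)
    s3 = sum((x - m) ** 3 for x in xs) / len(xs)
    return s3 / s2 ** 1.5

skews = []
for n in (2, 30, 200):
    # 5,000 simulated samples of size n from Exponential(rate = 1), skewness 2
    means = [sum(random.expovariate(1.0) for _ in range(n)) / n
             for _ in range(5_000)]
    skews.append(skewness(means))
    print(f"n={n:>3}: skewness of x-bar values = {skews[-1]:.3f}")
```

For means of n Exponential observations the theoretical skewness is 2/sqrt(n), so the printed values should fall steadily toward 0.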
How large a sample is sufficiently large?
This depends on how far the data distribution is from Normal (and, by inference, how far the population distribution is from Normal). However, you can say that the more skewed and multimodal the population distribution, the larger the sample size required before the sampling distribution of x bar will be Normal
The levels of Skewness from Boxplots
A skewed distribution will have one whisker longer than the other and the median line will not usually be located in the middle of the box.
A moderately skewed distribution might have one whisker that is 2 to 5x longer than the other but no outliers.
Extremely skewed distributions might have one whisker more than 10x longer than the other, and usually include outliers off the end of the whisker.
Applying the Central Limit Theorem Example
If the largest data value is more than 10x the median, obtaining a larger sample (n ≥ 100) would be imperative, and if not possible, it would be difficult to assume the sampling distribution of x bar is Normal
Why is it so important that the sampling distribution of x bar be Normal?
The most powerful statistical analyses for sample means are based on a Normal sampling distribution for x bar. If these procedures are used under circumstances in which the sampling distribution is not Normal, the results will be inaccurate.
There are alternative statistical analyses that are not based on a Normal sampling distribution, but these procedures are often less powerful and less familiar to many scientists.
How can we determine the probability associated with any range of values for x bar?
By converting the observed value of the sample mean (x bar observed) to a standard Normal Z-value and then using the standard Normal distribution to obtain the probability.
P[x bar ≥ x bar observed] = P[Z ≥ (x bar observed − E[x bar]) / sigma subscript x bar], where sigma subscript x bar = sigma / square root of n, OR the same formula with ≥ replaced by ≤
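A worked sketch of this calculation (all numbers are made up for illustration):

```python
# Probability for an observed sample mean via the standard Normal Z scale.
from math import erf, sqrt

mu, sigma, n = 50.0, 8.0, 16        # hypothetical population mean, SD, sample size
x_bar_obs = 53.0

sigma_x_bar = sigma / sqrt(n)       # SD of the sample mean = 8 / 4 = 2.0
z = (x_bar_obs - mu) / sigma_x_bar  # standardize the observed mean: z = 1.5
p_at_least = 1 - 0.5 * (1 + erf(z / sqrt(2)))  # P[x-bar >= 53] = P[Z >= 1.5]
print(round(z, 2), round(p_at_least, 4))
```

The complement step mirrors the table rule from earlier: the table gives P[Z ≤ z], so P[Z ≥ z] is 1 minus that value.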