Statistics Basics Flashcards

1
Q

What does “doing science” mean?

A

Collecting Data so that sample information is a useful representation of the world.

Summarizing Data to make it easier to understand and use for describing the real world.

Using data to critically evaluate evidence for or against a specific hypothesis.

2
Q

Population of Interest (Main Problem)

A

It is usually too large to study in its entirety.

3
Q

Sample

A

A subset of the population of interest. Using knowledge gained from measurements on a sample, scientists can make estimates about the larger population.

4
Q

What determines whether or not the data collected for a study are representative of the real world?

A

The methods used to obtain such data. Such methods must include unbiased, random procedures.

5
Q

What is a Variable?

A

Is a characteristic of an object or group of objects that can be represented with a number that has more than 1 possible value

6
Q

Columns represent…

A

Variables

7
Q

Rows represent…

A

Observations

8
Q

Ratio-Scale Variables

A

Have a true absolute zero value.

Quantitative data measured on a scale that has a constant increment between successive values.

9
Q

Ordinal or Rank Scale Variables

A

Have values that represent the ranked order of the objects or individuals with regard to a variable.

However, the actual differences between ranks can differ. Ex. Top 3 GPAs: 4.0, 3.96, 3.7

10
Q

Discrete variables

A

Can take only specific values and are often based on counting.

11
Q

Continuous variables

A

Can take an infinite number of possible values, limited only by the number of decimal places to which the value can be precisely measured.

12
Q

Categorical variables

A

Have values that indicate the individual belongs to a class or category. Although these values cannot be inherently represented by numbers, they are often analyzed in terms of the count or proportion of individuals that fall within that class or category.

13
Q

Population of Interest

A

Entire group of objects or individuals about which information is desired.

14
Q

Data

A

Refers to a collection of observations and/or measurements for one or more variables, made on one or more individuals from the population of interest for the purpose of addressing a specific question.

15
Q

Statistics

A

Numbers that describe characteristics of a sample. These are calculated from the data obtained from individuals in a sample.

16
Q

Sample Statistics and its relation to Population Parameters

A

Sample statistics are used to estimate or infer something about the values of population parameters.

17
Q

Sample unit

A

Is an individual unit of the sample or pop. of interest. For example, sample = people; sample unit = person.

18
Q

Are Sample Statistics considered accurate with a valid study design?

A

Even with a valid study design, a sample statistic is only more or less accurate. It is an estimate, and it cannot be assumed to equal the true value of the pop. parameter exactly.

19
Q

Population Parameters

A

Numbers that describe the characteristics of the entire population of interest.

20
Q

Random Sample Variation (RSV)

A

Is the variation in the values of a sample statistic computed from different, independent samples taken from the same population.

It will always occur as long as scientists use samples to estimate population parameters.

21
Q

Why does RSV happen?

A

It is a consequence of the randomness of the process by which individuals are selected from a population to create a sample.

In other words, it occurs because repeated samples include different subsets of individuals who vary with regard to the VoI.
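
A small simulation can make random sampling variation concrete. The sketch below assumes a made-up population of 10,000 heights (the values, names, and seed are illustrative and not from the cards) and draws several independent samples from it. Each sample gives a slightly different mean.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable (arbitrary choice)

# Hypothetical population: 10,000 heights in cm (illustrative values only)
population = [rng.gauss(170, 10) for _ in range(10_000)]
population_mean = statistics.mean(population)  # the population parameter

# Draw 5 independent random samples of 30 individuals each
sample_means = [
    statistics.mean(rng.sample(population, 30))  # the sample statistic
    for _ in range(5)
]

print(f"population mean: {population_mean:.2f}")
for i, m in enumerate(sample_means, start=1):
    print(f"sample {i} mean: {m:.2f}")
```

The sample means differ from one another only because different individuals happened to be drawn each time.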

22
Q

One example of how sample variation and the consequent uncertainty (associated with the estimates of the pop. parameters) can be minimized.

A

By obtaining data using appropriate methods and unbiased procedures.

23
Q

Bias

A

Is any systematic deviation of sample statistics away from the true value of population parameters.

“Systematic” means consistently wrong in the same direction.

24
Q

Three most common reasons for bias

A

Confounding, Selection Bias, Information Bias

25
Q

Selection Bias

A

Happens when individuals included in the sample are not representative of the larger population of interest.

This is determined by the method used to select individuals from the population into your sample.

26
Q

Information Bias

A

Measurements do not adequately represent the variable of interest.

This can happen when the method chosen to measure the variable of interest (VoI) yields the wrong value of the VoI, OR when an appropriate measurement method is used but applied in a way that consistently measures the VoI incorrectly (for example, because of a lack of training).

27
Q

Measurement Validity

A

Is the idea that a measurement made on the study subjects accurately quantifies the variable of interest.

28
Q

Precision

A

Refers to the amount of variation among the values of a sample statistic derived from repeated, independent samples of the same population.

For example, if repeated samples produce very similar values of a sample statistic then the estimates are said to be precise estimates of the population parameter.

29
Q

Sample size and its effect on unusual observations

A

Since some samples, by chance, include unusual individuals, the effect of such few unusual observations on the value of the sample statistic can be reduced by a large sample size.

In other words, if many individuals are included in a sample, random variation among individuals will average out and sample statistics computed from repeated samples will be less variable, and thus, more precise.
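
The effect of sample size on precision can be checked with a short simulation. This is a sketch under assumed conditions: the population values, the sample sizes (10 and 200), and the helper name `spread_of_sample_means` are all hypothetical.

```python
import random
import statistics

rng = random.Random(0)  # arbitrary seed for repeatability
population = [rng.gauss(50, 12) for _ in range(20_000)]  # hypothetical values

def spread_of_sample_means(n, repeats=500):
    """Standard deviation of the sample mean across repeated samples of size n."""
    means = [statistics.mean(rng.sample(population, n)) for _ in range(repeats)]
    return statistics.stdev(means)

small = spread_of_sample_means(10)
large = spread_of_sample_means(200)
print(f"n=10:  spread of sample means = {small:.2f}")
print(f"n=200: spread of sample means = {large:.2f}")
```

The means of the larger samples cluster more tightly, which is what "more precise" means on this card.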

30
Q

Precise Estimates

A

The values of a precise statistic derived by repeatedly sampling the same population tend to fall within a very narrow range.

31
Q

Accurate estimates

A

The values of a precise statistic derived from an unbiased study design are very likely to fall close to the true parameter value.

32
Q

What makes a statistic an unbiased estimate of the population parameter?

A

If all individuals in the population of interest have equal chance of being selected AND the measurement procedure produces valid data for the VoI.

33
Q

How do you minimize bias?

A

You do this by reducing selection bias (making sure your sample is representative of the larger population), reducing information/measurement error (choosing an adequate tool to measure your VoI AND making sure your personnel are adequately trained), and checking for confounding.

34
Q

How do you maximize precision?

A

Collecting data from large sample sizes and also training personnel to ensure consistent methodology

35
Q

Study Design

A

Is a description of the methods that the investigator will be using to acquire data

36
Q

Study design depends on

A

the requirements for statistics to be accurate estimates of parameters and the specific objectives of the study

37
Q

Sampling study design

A

Is generally used when the study objective is to estimate the value of the pop. parameter.

38
Q

Principles of a representative sample

A

All individuals in the pop. should have an equal chance to be included in the sample AND the sample should include a sufficient number of individuals to represent the range of variation that is present within the population.

39
Q

Randomization

A

Process for selecting individuals who will be included in a sample based on some random mechanism

40
Q

Purpose of randomization

A

Minimize bias

41
Q

Replication in Experimental Design

A

Refers to the number of individuals included in a sample

42
Q

RSV and Sample Size

A

Sample statistics computed from large samples exhibit less random sampling variation (they are less affected by a few unusual individuals) than statistics computed from small samples.

43
Q

Purpose of replication in Experimental Design

A

Control sampling variability and increase precision

44
Q

Types of sampling study designs

A

Completely random sampling, randomized systematic sampling, stratified sampling design

45
Q

Completely random sampling

A

All members of the pop. must have an equal chance of being selected for the sample. This sampling can happen through the random selection from a list OR random location of sample points.
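
Random selection from a list might look like the sketch below. The sampling frame `frame` and its member names are hypothetical; the only library call used is `random.sample`, which draws without replacement.

```python
import random

rng = random.Random(7)  # arbitrary seed for repeatability

# Hypothetical sampling frame: a complete list of the population's members
frame = [f"person_{i:03d}" for i in range(1, 501)]

# random.sample draws without replacement; every member is equally likely
sample = rng.sample(frame, k=25)
print(sample[:5])
```

Because the computer does the choosing, the selection does not depend on the investigator's judgment, unlike haphazard selection.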

46
Q

Haphazard selection

A

Is not random sampling. Examples include opening the phonebook and picking names “at random”, or walking aimlessly in the woods to pick a “random” location. It is not possible to confirm that such a selection is truly random.

47
Q

Randomized systematic sampling Application

A

Commonly used in field sampling when it is very difficult to travel to random points or when a relatively small number of sample points will be used to describe a larger area.

48
Q

Randomized systematic sampling

A

A random starting point is chosen, and then sampling locations are placed at fixed distance intervals or travel-time intervals proceeding away from this random point. It reduces cost and effort.
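
As a rough illustration, randomized systematic sampling along a one-dimensional transect could be sketched as follows. The function name, the transect length, and the interval are hypothetical, and drawing the start from within the first interval is an assumed convention rather than something the card specifies.

```python
import random

def systematic_points(transect_length, interval, rng):
    """Random starting point, then sampling locations at a fixed interval.

    Assumption: the start is drawn from within the first interval so that
    the points span the whole transect.
    """
    position = rng.uniform(0, interval)  # the random starting point
    points = []
    while position <= transect_length:
        points.append(round(position, 2))
        position += interval             # fixed distance between points
    return points

rng = random.Random(3)  # arbitrary seed for repeatability
points = systematic_points(transect_length=1000, interval=100, rng=rng)
print(points)
```

Only the starting point is random; every later location follows from it, which is why the design is cheaper to carry out in the field.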

49
Q

Stratified sampling design

A

Involves identifying the various subpopulations called strata and taking separate random samples from each.
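
A minimal sketch of stratified sampling, assuming four made-up strata. The stratum names and sizes are hypothetical, and sampling 10% of each stratum (proportional allocation) is one possible choice, not a rule stated on the card.

```python
import random

rng = random.Random(11)  # arbitrary seed for repeatability

# Hypothetical strata (subpopulations); names and sizes are made up
strata = {
    "freshmen": [f"fr_{i}" for i in range(400)],
    "sophomores": [f"so_{i}" for i in range(300)],
    "juniors": [f"ju_{i}" for i in range(200)],
    "seniors": [f"se_{i}" for i in range(100)],
}

# Take a separate random sample from each stratum.
# Assumption: 10% of each stratum is sampled (proportional allocation).
stratified_sample = {
    name: rng.sample(members, k=len(members) // 10)
    for name, members in strata.items()
}

for name, chosen in stratified_sample.items():
    print(name, len(chosen))
```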

50
Q

Experiment

A

Deliberate imposition of a treatment by the investigator on a sample of subjects to evaluate the response of the subjects to the treatment.

51
Q

Primary purpose of experimental study designs

A

Determine cause-effect relationships between a treatment variable and a response variable

52
Q

Treatment/Explanatory Variable

A

Measures the condition(s) that the investigator imposes on the study subjects

53
Q

Response/Outcome Variable

A

Measures some characteristic of the study subjects that is hypothesized to change as a result of the treatment

54
Q

How are experiments conducted in relation to non-treatment variables

A

They are done in controlled settings to minimize the possibility that nontreatment variables might influence the response of study subjects

55
Q

Experimental design establishes what?

A

Establishes conditions such that there are only two possible explanations for why groups that received different treatments are different at the end of the experiment: the treatment caused the difference, or the difference was due to chance (random sampling variation)

56
Q

What happens if nontreatment factors are allowed to influence study subjects or treatment groups were different from the beginning

A

It is impossible to determine if the treatment was the cause of the observed differences

57
Q

What is an essential component of good experimental design?

A

Equivalent study (treatment and control) groups

58
Q

What is meant by equivalent study (treatment and control) groups?

A

Groups that are similar prior to the imposition of treatment. They have the same variation in all nontreatment variables

59
Q

Randomization of Assignment does what?

A

It randomly assigns each study subject to one of the groups, which makes it likely that the groups are equivalent before the treatment is imposed.
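
One simple way to randomize assignment is to shuffle the subject list and split it in half. This is a sketch with hypothetical subject IDs and two equal-sized groups assumed. Other allocation schemes exist.

```python
import random

rng = random.Random(5)  # arbitrary seed for repeatability

subjects = [f"subject_{i:02d}" for i in range(1, 21)]  # hypothetical IDs

# Shuffle a copy so every subject is equally likely to land in either group
shuffled = subjects[:]
rng.shuffle(shuffled)

half = len(shuffled) // 2
treatment_group = shuffled[:half]
control_group = shuffled[half:]

print("treatment:", sorted(treatment_group))
print("control:  ", sorted(control_group))
```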

60
Q

What is adequate replication?

A

Since replication involves including many subjects in your sample, which in the case of an experimental design will be divided into two groups, adequate replication refers to including enough individuals to average out individual differences, minimizing the possibility that differences between groups are due to random-chance differences between the individuals assigned to the groups.

61
Q

What is the experimental unit?

A

The basic unit of the population of interest. For example, if the population is patients with HBP, then the experimental unit is a patient.

62
Q

What does independence refer to?

A

The value of the outcome variable on one experimental unit (patient) should not influence, or be influenced by, the value measured on other units

63
Q

Different types of experimental design

A

Completely randomized (one treatment variable, two or more levels),

Before-after (used when the effect of the treatment is expected to be small relative to the variation among the study subjects),

Matched-pairs (unlike before-after, there is no carryover effect),

Randomized block (used when a response to treatment is influenced by an extraneous/nontreatment variable that cannot be controlled or eliminated; for example gender),

Factorial (used when the researcher wants to determine the response of the study subjects to two or more treatments but has reason to believe that the effect of one variable will interact with the effect of other variables)

64
Q

Problems with Cause-Effect Inferences

A

Since the purpose of an experimental study design is to provide a realistic test of the effect of a treatment on individuals in a population of interest, an investigator might reach an ERRONEOUS conclusion regarding the effect of the treatment on the pop. of interest due to any of several types of problems:

Confounding factors: factors that were not controlled by the investigator might actually be the cause of the differences between groups;

Poor measurement validity (aka information bias): measurements that are poor representations of the phenomenon of interest provide misleading information;

Groups are not similar: often happens when randomization is not done or when there is not adequate replication;

Nonrepresentative subjects included in the experiment (aka selection bias): often seen when researchers, who are trying to control for external factors, include subjects from a homogeneous group;

Investigator bias: researchers are more likely to see or not see a treatment effect due to their preconceived expectations about how experiments should turn out;

Placebo effect;

Lack of realism: The more realistic the treatment and experimental conditions, the less control the researcher may have over confounding factors; the more control over confounding factors, the less realistic the experimental conditions and the less likely the study subjects will respond as they might in their natural environment.

65
Q

What are Natural Experiments?

A

They are experiments that involve comparing samples obtained from two or more populations in their natural environment

66
Q

Main observations about natural experiments

A

Researcher has limited or no control over which subjects received treatments AND he/she has limited control over how the subjects have been influenced by extraneous, nontreatment factors.

67
Q

Main reasons why natural experiments are adequate

A

They are more realistic and they avoid ethical concerns since it was not the researcher who imposed the natural treatment/exposure

68
Q

Main problems with natural experiments

A

There is a substantial risk that the two comparison groups were nonequivalent before the exposure happened AND the researcher cannot completely control for extraneous factors. HENCE observed differences may not have been caused by the natural treatment/exposure

69
Q

Adequate solution to natural experiments

A

Selecting study subjects based on criteria that make comparison groups as similar as possible. Caution must be taken when reporting conclusions about the population of interest

70
Q

How can the independence assumption be violated?

A

Sampling of each individual is not truly random, such that some individuals in the sample are located in close spatial proximity to each other or are genetically related.

Pseudo-replication: multiple measurements made on individuals from the pop. of interest are treated the same as single measurements from different, randomly chosen individuals

71
Q

Are the scenarios that violate independence problematic and can you still make analyses with them?

A

Analyzing data from these scenarios is not itself problematic. The problem is assuming that your sample is independent when it is not. For pseudo-replication, you will need to average the multiple measurements to obtain a single measurement per individual. Measurements on multiple individuals located around a single randomly located point must be analyzed using procedures specifically developed to handle data from that type of study design.
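
Averaging repeated measurements before analysis might look like this. The readings and the plant labels are invented for illustration; the point is that the sample size counts individuals, not readings.

```python
import statistics

# Hypothetical repeated measurements: 3 readings on each of 4 individuals
readings = {
    "plant_A": [12.1, 11.9, 12.3],
    "plant_B": [15.0, 15.4, 14.9],
    "plant_C": [9.8, 10.1, 10.0],
    "plant_D": [13.2, 13.0, 13.1],
}

# Collapse to one value per individual so n reflects individuals, not readings
per_individual = {name: statistics.mean(vals) for name, vals in readings.items()}

n = len(per_individual)  # 4 independent units, not 12
overall_mean = statistics.mean(list(per_individual.values()))
print(n, round(overall_mean, 2))
```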

72
Q

How does randomization lead to equivalent groups before the treatment is imposed?

A

By guaranteeing that each individual has an equal chance of being assigned to any group.

When participants are randomly allocated to different groups, the law of large numbers ensures that, on average, each group will have similar characteristics. Randomization creates balance across groups in terms of both measured and unmeasured factors that could potentially influence the outcome being studied.

By randomizing, you’re making sure each group has a fair mix of different preferences and characteristics. It’s like shuffling the cards before dealing them out.

So, randomization helps make sure your experiment is fair and that any differences you see between the groups are because of what you’re testing, not because of other factors. It’s like setting up a level playing field for your experiment.

73
Q

First rule after collecting data

A

LOOK at the data values for unanticipated patterns or oddities. Do not jump straight to complex statistical analyses. Remember, all statistical analyses are bound by the GIGO principle: Garbage in, Garbage out. If you apply a statistical analysis to data that is not appropriate for that analysis, your apparently scientific and precise results will be nonsense.

74
Q

What is exploratory data analysis (EDA)?

A

It is based on looking at the data to assimilate and understand the information embodied within the data. The goal is to summarize the data in a way that accentuates patterns (systematic variation in data values) and distinguishes patterns from noise (random variations or deviations from the pattern).

75
Q

What is the goal of the EDA?

A

The goal is to summarize the data in a way that accentuates patterns (systematic variation in data values) and distinguishes patterns from noise (random variations or deviations from the pattern). We are looking for evidence of ANY PATTERN without preconceived notions as to its nature.

76
Q

What are associations?

A

They are a type of pattern that allows us to make informed predictions of future events, the true goal in science.

77
Q

What are the primary tools of EDA?

A

Graphics and Summary Statistics

78
Q

Examples of graphics

A

Histograms, stem-leaf plots, box plots, scatter plots

79
Q

Examples of summary statistics

A

Mean, Median, Mode (Measures of Center)

Range, Variance, Standard Deviation, Interquartile Range, Min, Max (Measures of Spread)

Percentiles (Measures of Data Value Location)

80
Q

What is frequency

A

Is the number of times a specific data value occurs in the sample data.

81
Q

What is relative frequency?

A

Is the proportion or percentage of a specific data value in a sample
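
Frequency and relative frequency can be tallied in a few lines. The eye-color sample below is made up for illustration.

```python
from collections import Counter

# Hypothetical sample of a categorical variable
eye_colors = ["brown", "blue", "brown", "green", "brown", "blue", "brown", "hazel"]

frequency = Counter(eye_colors)  # number of times each value occurs
n = len(eye_colors)
relative_frequency = {value: count / n for value, count in frequency.items()}

for value in frequency:
    print(f"{value:6s} frequency={frequency[value]}  relative={relative_frequency[value]:.3f}")
```

The relative frequencies always sum to 1 (or 100%), since every observation falls under exactly one value.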

82
Q

Measures of Center

A

Measure the middle of the distribution, i.e. the range of values that appear most frequently in your sample

83
Q

Measures of spread

A

They describe the amount of variability in your sample or another way to phrase it “the amount of variability in the values away from the center”. Little spread = little variability. Large spread = large variability

84
Q

Characteristics of distributions

A

Center, spread, shapes, gaps, outliers

85
Q

Shape

A

Refers to the number of peaks in the distribution and whether the distribution is symmetric or asymmetric (which can bear on the choice between parametric and nonparametric methods)

86
Q

Gaps

A

Refer to segments between the minimum and maximum data values where there are no data values

87
Q

Outliers

A

Refers to data values that are very different from all other values.

They can happen due to measurement error (mistake), sampling individuals from outside the pop. of interest (mistake), or novel situations.

88
Q

Histograms vs Bar charts

A

Histograms display data distributions for quantitative variables, while bar charts display them for qualitative, categorical variables.

The concepts of center and shape apply only to quantitative variables, not to categorical variables displayed in bar charts. The concept of variability applies to categorical variables only in the sense that some of these variables have more distinct classes than others; for example, mangos.

89
Q

Bin width

A

Refers to the width of the intervals that define classes. The selection of the best bin width is usually done by trial and error. Usually, the optimal interval width is found only after looking at multiple histograms with different interval widths (with too small a bin width, you will find many peaks; with too large a bin width, you will obscure some important patterns)
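
The trial-and-error process can be imitated with text histograms. This sketch uses invented data and a hypothetical helper, `histogram_counts`, to show how the same values look under three bin widths.

```python
from collections import Counter

# Hypothetical measurements
data = [2.1, 2.4, 3.3, 3.8, 4.0, 4.2, 4.9, 5.1, 5.5, 6.7, 7.2, 9.8]

def histogram_counts(values, bin_width):
    """Count values per bin; bin k covers [k*width, (k+1)*width)."""
    bins = Counter(int(v // bin_width) for v in values)
    return {(k * bin_width, (k + 1) * bin_width): bins[k] for k in sorted(bins)}

for width in (0.5, 2, 5):
    print(f"bin width {width}:")
    for (lo, hi), count in histogram_counts(data, width).items():
        print(f"  [{lo}, {hi}) {'#' * count}")
```

With a width of 0.5 the display is jagged, and with a width of 5 it collapses into two bars that hide most of the shape.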

90
Q

What are summary statistics?

A

They are numerical values that describe the characteristics of data distributions. They provide more objective assessments of differences or associations than graphics (visual comparisons)

91
Q

What does a Percentile refer to?

A

The pth percentile of a data distribution is the value of the variable such that p% of the values in the sample are less than that value. For example, the median is the 50th percentile, meaning 50% of the values are less than the median.

92
Q

Median

A

Is the middle value of a ranked list of data values so that 50% of data values are less than the median and 50% are greater

93
Q

Mode

A

Is the most frequent data value in a data distribution

94
Q

Variance

A

It is the average of the squared differences between individual data values and the mean of the distribution. However, instead of dividing by “n”, you divide by n-1, which is called Bessel’s correction. This corrects the bias in the estimation of the population variance. If you didn’t subtract 1, you would be underestimating the variance.

Remember: the more spread out the data are from the center, the larger the variance. The squared differences become larger, which increases the numerator while the denominator stays fixed.

95
Q

Standard deviation

A

Is the square root of the variance
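
The variance and standard deviation can be worked through by hand on a tiny sample. The five values below are hypothetical, and the manual calculation is cross-checked against Python's `statistics.variance` and `statistics.stdev`, which also divide by n-1.

```python
import math
import statistics

data = [4, 8, 6, 5, 7]  # hypothetical sample

n = len(data)
mean = sum(data) / n                             # 6.0
squared_diffs = [(x - mean) ** 2 for x in data]  # [4, 4, 0, 1, 1]

sample_variance = sum(squared_diffs) / (n - 1)   # divide by n-1 (Bessel's correction)
sample_sd = math.sqrt(sample_variance)           # square root of the variance

# The standard library gives the same results
assert math.isclose(sample_variance, statistics.variance(data))
assert math.isclose(sample_sd, statistics.stdev(data))

print(sample_variance, round(sample_sd, 3))
```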

96
Q

IQR

A

Is the 75th percentile minus the 25th percentile. This statistic expresses the range of the middle 50% of the data values. If using the median as your measure of center, you should use the IQR as your measure of spread because both are based on the ranking of your data values, rather than the data values themselves.
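
A quick way to compute the IQR is shown below, using invented data. Different software packages interpolate quartiles slightly differently, so the `method="inclusive"` setting is just one assumed convention.

```python
import statistics

data = [3, 7, 8, 5, 12, 14, 21, 13, 18]  # hypothetical sample

# statistics.quantiles with n=4 returns the three quartile cut points.
# "inclusive" is one of several interpolation rules; other tools may
# give slightly different quartiles for the same data.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1  # 75th percentile minus 25th percentile
print(q1, median, q3, iqr)
```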

97
Q

Sensitive statistic vs Robust statistic

A

A statistic is sensitive if it is influenced by outliers in your data. Sensitive statistics are usually less affected when the sample size is large. A robust statistic, by contrast, is not strongly influenced by the presence of outliers in your data.
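
The mean (sensitive) and the median (robust) can be compared by adding a single extreme value to a small, made-up data set:

```python
import statistics

data = [10, 11, 12, 13, 14]
with_outlier = data + [100]  # one hypothetical extreme value

mean_shift = statistics.mean(with_outlier) - statistics.mean(data)
median_shift = statistics.median(with_outlier) - statistics.median(data)
print(f"mean moved by {mean_shift:.2f}, median moved by {median_shift:.2f}")
```

The single outlier drags the mean far more than the median.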

98
Q

Describing the shape of a distribution

A

Symmetry: Example bell-shaped normal distribution; a distribution is symmetric when both sides are very similar.

Skewness: Data are clumped together towards one end of the range of values. Right (positively) skewed: most values are clumped on the left, with a tail of a few data values on the right. Left (negatively) skewed: most values are clumped on the right, with a tail of a few data values on the left.

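Kurtosis: Refers to how the shape compares to a bell-shaped normal distribution. A distribution with a sharper central peak is said to be peaky (leptokurtic); a flatter one is called flat (platykurtic). Strictly speaking, kurtosis reflects tail weight: leptokurtic distributions have heavier tails than the normal distribution, and platykurtic ones have lighter tails.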

Unimodal vs multimodal: One peak vs multiple peaks

99
Q

Graphics for comparing distributions

A

Back to Back Stem Leaf Plot: Used when comparing only 2 groups (shows the maximum amount of info and is easy to read, but is unfamiliar to many people)

Side by Side Histograms: 2-3 groups. Emphasis on detail; easily understood by most people and professional in appearance, but specific comparisons are difficult to make

Side by Side Boxplots: 2 or more groups being compared; emphasis on clarity of comparisons but sacrifices some detail (gaps, multiple modes)

100
Q

Association between 2 variables

A

When the value of one variable changes, the value of the other variable also changes in a systematic manner.

101
Q

No association between 2 variables

A

This happens when there is no pattern or when the data values of X and Y fall along a horizontal or vertical line

102
Q

Population Mean Notation

A

Greek Letter Mu μ

103
Q

Population Variance Notation

A

Greek Letter Sigma Squared σ^2

104
Q

Sample Mean Notation

A

X bar

105
Q

Sample Variance Notation

A

S squared

106
Q

Unbiased Sample Variance

A

By dividing by a smaller number (n-1 instead of n), you get a larger sample variance. The sample mean always sits inside your data, even when the true population mean lies outside of it, so deviations measured from the sample mean tend to be too small. However you look at it, dividing by n would underestimate the population variance.
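
The underestimation can be seen in a simulation. This is a sketch with an invented population and an assumed sample size of 5; `statistics.pvariance` divides by n and `statistics.variance` divides by n-1.

```python
import random
import statistics

rng = random.Random(1)  # arbitrary seed for repeatability
population = [rng.gauss(0, 5) for _ in range(50_000)]  # hypothetical population
true_variance = statistics.pvariance(population)

n = 5
divide_by_n, divide_by_n_minus_1 = [], []
for _ in range(5_000):
    sample = rng.sample(population, n)
    divide_by_n.append(statistics.pvariance(sample))         # divides by n
    divide_by_n_minus_1.append(statistics.variance(sample))  # divides by n-1

avg_n = statistics.mean(divide_by_n)
avg_n_minus_1 = statistics.mean(divide_by_n_minus_1)
print(f"true population variance:   {true_variance:.2f}")
print(f"average when dividing by n:   {avg_n:.2f}")
print(f"average when dividing by n-1: {avg_n_minus_1:.2f}")
```

Averaged over many samples, the n-1 version lands close to the population variance, while the divide-by-n version comes out too small.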

107
Q

Linear vs. Nonlinear

A

Linear: Regardless of the value of X, the change in Y will be constant

Nonlinear: The change in Y is not constant; it depends on the value of X, so the relationship traces a curve rather than a straight line

108
Q

Strength of an association through a scatterplot

A

Is measured through the amount of scatter in the cloud of data points. It is also best explained in the context of a true cause-effect relationship.

109
Q

Strong association

A

Is one where the cause variable is the only factor that controls the response of the outcome variable

110
Q

Weak association

A

Is one where many “cause” variables influence the value of the outcome variable.

111
Q

Timeplot

A

Time scale is normally on the X-axis.

112
Q

X-axis observations for scatterplots

A

Although the x-axis is often used for the “cause” variable in a cause-effect relationship, time is not a cause variable. Many variables change over time, but the changes are caused by factors that happen to occur over time.

113
Q

Application of Stats in the Process of Science

A

Involves:

  • Obtaining data (experiment and sampling design)
  • Summarizing and describing data (exploratory data analysis and summary stats)
  • Using data from samples and experiments to make estimates and test competing hypotheses about the universe (inferential stats)
114
Q

Why can’t we ever say with certainty that the value of a sample stat exactly equals the true value of a pop parameter?

A

Because of random sampling variation. Different samples randomly taken from the same population will produce different estimates of the pop. parameter value.

115
Q

What do scientists have to do because of random sampling variation?

A

They must quantify just how uncertain their estimates and conclusions are to convince others of the validity of their judgments

116
Q

How do scientists quantify their uncertainty?

A

Scientists evaluate the validity of their hypotheses by determining the probability of getting the observed value of a sample statistic if the parameter value proposed by a hypothesis is true.

117
Q

Probability Definition

A

It can be defined as the relative frequency of an event. That is, if you observed a very large number of outcomes from a random phenomenon, the proportion of outcomes that meet the description of a specific event is an estimate of the probability of that event.

For example, when a meteorologist says there is an 80% probability of rain today, that means that it rained on 80% of days when similar conditions prevailed in the past.
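
The relative-frequency idea can be tried out by simulating die rolls. This is a sketch: the fair six-sided die, the seed, and the helper name `relative_frequency_of_six` are assumptions made for illustration.

```python
import random

rng = random.Random(2024)  # arbitrary seed for repeatability

def relative_frequency_of_six(num_rolls):
    """Proportion of simulated fair-die rolls that show a six."""
    rolls = [rng.randint(1, 6) for _ in range(num_rolls)]
    return rolls.count(6) / num_rolls

for num_rolls in (10, 1_000, 100_000):
    print(num_rolls, relative_frequency_of_six(num_rolls))
```

With only 10 rolls the proportion can be far from 1/6, and with a very large number of rolls it settles close to it.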

118
Q

Event

A

Is defined as a combination of outcomes from a random phenomenon that meet a specific criterion.

For example, rolling a six-sided die is a random phenomenon with a sample space of 6 possible outcomes; one event might be “the number of dots is less than 3”: the outcomes that meet the definition of this event are {1,2}

119
Q

Random phenomenon

A

Is a phenomenon that has individual outcomes that are not predictable but the probabilities associated with the possible outcomes are well-defined.

120
Q

Haphazard phenomenon

A

The probabilities associated with the various possible outcomes are unknown.

121
Q

Probability distribution of outcomes for a random phenomenon

A

Is comprised of two parts:

  • A listing of all possible outcomes for a random phenomenon, called the sample space;
  • The probabilities associated with each outcome.
122
Q

Basic Rules of Probability

A
  • The value of a prob. must fall within the range from 0 to 1. In terms of percents, valid probability values must be between 0% and 100%.
  • The sum of all probabilities associated with all possible outcomes in the sample space of a random phenomenon is always 1.0.
  • Complement Rule: For any Event A in the sample space, the prob. that A does not occur is 1 minus the prob. that A does occur. Ex. P[Not 1] = 1 - P[1]
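
The three rules can be checked on a small example. This sketch assumes a fair six-sided die and uses exact fractions; the variable names are illustrative.

```python
from fractions import Fraction

# Probability distribution for one roll of a six-sided die (assumed fair).
die = {face: Fraction(1, 6) for face in range(1, 7)}

# Rule 1: every probability lies between 0 and 1.
all_valid = all(0 <= p <= 1 for p in die.values())

# Rule 2: probabilities over the whole sample space sum to 1.
total = sum(die.values())

# Complement rule: P[Not 1] = 1 - P[1].
p_not_1 = 1 - die[1]

# The event "fewer than 3 dots" is made up of the outcomes {1, 2}.
p_less_than_3 = sum(p for face, p in die.items() if face < 3)

print(all_valid, total, p_not_1, p_less_than_3)  # True 1 5/6 1/3
```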
123
Q

Union of events

A

Is the combination of their outcomes. For example, if Event A is getting a 1 or 2 and Event B is getting a 6, then the union of these two events, indicated as A or B, is the combination of all their outcomes: {1,2,6}. The union of events is said to have occurred if any one of the outcomes in the combination occurs.

124
Q

Disjoint events

A

Are events that do not have any outcomes in common. When determining the probability associated with the union of two events, it is important to determine whether the events are disjoint.

125
Q

Simple Addition Rule of Prob.

A

P[A or B or C] = P[A] + P[B] + P[C]. This rule is only used if the events are disjoint.

126
Q

Problem with Nondisjoint Events

A

If you want to know the prob. of drawing a king or a red card, you might say P[King or Red] = P[King] + P[Red] = 4/52 + 26/52 = 30/52. This is wrong because you are double counting the two red kings, once for being kings and once for being red.

The double counting is why the events must be disjoint for the simple addition rule to provide correct prob. statements.

127
Q

General Addition Rule of Prob.

A

This addresses the nondisjoint problem.

P[A or B] = P[A] + P[B] - P[A and B]

By subtracting the probability associated with the intersection (overlap) of the nondisjoint events, the effect of this double counting is eliminated.
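
As a worked check of the king-or-red example, the sketch below builds a standard 52-card deck and compares the general addition rule against a direct count. The helper `prob` and the card encoding are illustrative choices.

```python
from fractions import Fraction

# Build a standard 52-card deck as (rank, suit) pairs.
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = [(r, s) for r in ranks for s in suits]

def prob(event):
    """Probability that one randomly drawn card satisfies `event`."""
    return Fraction(sum(1 for card in deck if event(card)), len(deck))

def is_king(card):
    return card[0] == "K"

def is_red(card):
    return card[1] in ("hearts", "diamonds")

p_king = prob(is_king)                                  # 4/52
p_red = prob(is_red)                                    # 26/52
p_both = prob(lambda c: is_king(c) and is_red(c))       # 2/52, the red kings

p_general = p_king + p_red - p_both                     # general addition rule
p_direct = prob(lambda c: is_king(c) or is_red(c))      # direct count of cards

print(p_general, p_direct)  # 7/13 7/13 (that is, 28/52)
```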

128
Q

Intersection of Events

A

This refers to the event that all events will occur.

If event A is {A person selected is male} and event B is {A person selected is Republican}, then the intersection of these events, indicated as A and B, would occur if a randomly chosen individual in the US is both male and Republican.

If events A and B are disjoint then they have no intersection and it is impossible for them to occur together P[A and B]=0

129
Q

Independent Events

A

Two events are independent if the prob. of B, P[B], is in no way related to or affected by whether or not event A has occurred, and vice versa. In other words, the outcome of one event does not affect the outcome of the other.

When determining the prob. associated with the intersection of two events, P[A and B], it is important to determine if the two events are independent.

130
Q

Simple Multiplication Rule of Prob.

A

P[A and B]= P[A] * P[B]. This rule is only used if the events are independent.

131
Q

Problem with Nonindependent Events

A

When answering a question such as “what is the prob. that two cards drawn from the same deck will both be face cards”, you might want to apply the simple multiplication rule of prob. (it involves the prob. of one event AND another event).

P[face and face] = 12/52 * 12/52

However, this answer is wrong because it violates the independence assumption. Removing cards from the deck changes the probabilities for subsequent draws. It should be 12/52 * 11/51.

132
Q

General Multiplication Rule of Prob.

A

Addresses the problem with nonindependent events.

P[A and B] = P[A] * P[B|A]

By multiplying by a conditional probability, you are taking into account that the probability of B depends on A, hence the prob. of B given that A has occurred.
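
A small sketch of the two-face-cards example: it applies P[A and B] = P[A] * P[B|A] and then confirms the result by listing every ordered pair of distinct cards. Labeling the cards 0 to 51 with the first 12 as face cards is an illustrative shortcut.

```python
from fractions import Fraction
from itertools import permutations

# Label the 52 cards 0..51 and treat labels 0..11 as the 12 face cards.
deck = range(52)

def is_face(card):
    return card < 12

# General multiplication rule: P[A and B] = P[A] * P[B|A].
p_first_face = Fraction(12, 52)
p_second_face_given_first = Fraction(11, 51)   # one face card is already gone
p_rule = p_first_face * p_second_face_given_first

# Check by enumerating every ordered draw of two distinct cards.
pairs = list(permutations(deck, 2))
favorable = sum(1 for a, b in pairs if is_face(a) and is_face(b))
p_enumerated = Fraction(favorable, len(pairs))

print(p_rule, p_enumerated)  # 11/221 11/221
```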

133
Q

Random variables

A

Are quantitative representations of outcomes of random phenomena. Because random sampling variation is always present in scientific studies, all sample statistics (means, medians, proportions, standard deviations) are random variables, and all statistical analyses are based on probability distributions for random variables.

134
Q

Disjoint vs Independent

A

Events are considered disjoint if they never occur at the same time; these are also known as mutually exclusive events. Events are considered independent if they are unrelated.

135
Q

Theoretical Probability Values?

A

They are values based on assumptions about the nature of the random phenomenon and the application of the rules of probability.

For example, given the rules of probability, there is only one probability distribution for the random phenomenon of flipping a fair coin (50% heads and 50% tails; sample space {heads, tails}).

136
Q

Empirical (Data-Based) Probability

A

Refers to a probability estimated from observed long-term relative frequencies and it is computed as follows:

Prob. of Event A (also known as the relative frequency of the occurrence of Event A) = Number of occurrences of Event A / Total number of observations

137
Q

When do we use Empirical Probabilities?

A

When the assumptions made about the event are not correct or possible. For example, a scenario that is not as simple as “the coin is fair”. In these circumstances, we must observe many repetitions of the random phenomenon (the event) to learn the probabilities of its various possible outcomes.

138
Q

Law of Large Numbers

A

As the number of observations increases, variation in the relative frequency of the event diminishes and the empirical probability approaches the true probability.

Hence, the more data you have, the better will be your empirical estimate of the true probability of an event. However, a certain amount of deviation of the empirical probability from the true probability remains.
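
A quick simulation can illustrate this. The sketch below assumes a fair coin (true probability 0.5) and prints the relative frequency of heads for increasing numbers of flips; the function name is illustrative.

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

def relative_frequency_of_heads(n_flips):
    """Empirical probability of heads from n_flips simulated fair-coin flips."""
    heads = sum(random.random() < 0.5 for _ in range(n_flips))
    return heads / n_flips

# The estimates tend to settle closer to 0.5 as n grows,
# though some deviation from 0.5 usually remains.
for n in (10, 100, 10_000, 1_000_000):
    print(n, relative_frequency_of_heads(n))
```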

139
Q

What is a Binomial Random Variable?

A

Is a discrete quantitative variable that indicates the count of the number of observations or individuals in a sample that belong to a specified category, assuming four conditions are met:

  • The total number of observations is fixed (This means you counted the number of individuals with the characteristic of interest after sampling all your individuals, not until you got what you considered a good number of individuals with the characteristic of interest).
  • All individuals/observations have equal probability of being selected.
  • The probability that any one observation of the event will meet the specified criterion is constant.
  • The outcomes of the multiple individuals/observations of the event are independent.

Conditions 3 and 4 depend on the population size being 100x greater than the sample size.

140
Q

Binomial variables are based on what?

A

On observations of random phenomena that are categorized based simply on whether or not a specified outcome occurs.

141
Q

How do you calculate the probability associated with each value of X of a Binomial count variable?

A

First, list all possible outcomes of the random phenomenon and their probabilities. Ex. for a coin flip, the possible outcomes are Heads and Tails, each with probability 0.5 if the coin is fair.

Second, list all the possible values the Binomial random variable X can take; this is its sample space. If you flip the coin 3 times and count heads, the sample space of X is {0, 1, 2, 3}.

Third, determine the probability of every combination of outcomes (e.g. HHT, HTH) that can produce each value in the sample space of the binomial count variable (X). Use the simple multiplication rule to do this.

Fourth, sum the probabilities for all outcomes (combinations) that produce the same value for the count variable X to determine the overall probability of X.
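
The four steps can be sketched for three flips of a coin. This assumes a fair coin (P[heads] = 0.5) and counts heads as X; the variable names are illustrative.

```python
from collections import defaultdict
from itertools import product

p_heads = 0.5   # assumed fair coin
n_flips = 3

prob_of_x = defaultdict(float)

# Steps 1-2: list every sequence of outcomes, e.g. ('H', 'T', 'H').
for sequence in product("HT", repeat=n_flips):
    # Step 3: simple multiplication rule (the flips are independent).
    p_sequence = 1.0
    for flip in sequence:
        p_sequence *= p_heads if flip == "H" else 1 - p_heads
    # Step 4: add up the sequences that give the same count of heads X.
    prob_of_x[sequence.count("H")] += p_sequence

for x in sorted(prob_of_x):
    print(x, prob_of_x[x])
# 0 0.125
# 1 0.375
# 2 0.375
# 3 0.125
```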

142
Q

Empirical Estimate of P[X]

A

Is the relative frequency of the occurrence of each X value = Number of times that X value occurs / Total number of n observations.

Remember, the more repetitions used to compute relative frequencies, the more closely they will approximate the true probabilities for each X value in the sample space.

143
Q

Continuous Variables and the Binomial Distribution

A

Because it is not possible to list all the possible values in the sample space for a continuous variable, statistical analyses using continuous variables must be based on a probability distribution other than the Binomial distribution.

144
Q

Probabilities Associated with the Pop. Mean

A

There is a high probability associated with values close to the population mean

145
Q

Probabilities Associated with values away from the mean

A

These probabilities progressively decrease as the value deviates from the mean.

146
Q

Probabilities for Continuous Variables

A

These probabilities are determined only for ranges of values (a ≤ X ≤ b). Because there is an infinite number of values in the sample space, the probability associated with getting exactly one specific value is effectively zero.

147
Q

Probability Histograms for Discrete Random Variables

A

This type of graphic lists all the possible values of X in the sample space on the x-axis, and above each value is a histogram bar that displays the probability (relative frequency) associated with that value.

148
Q

Probability Density Curves

A

This graphic is used to represent probability distributions for continuous random variables. The X-axis displays the range (min to max) of values for the cont. variable. The probability of any event defined by a specific range of values within the sample space (a<X<b) is represented by the area under the curve above the specified range of X-values.

The Y-axis is not important (it describes the prob. density which is a harder concept to grasp and irrelevant at this point)

149
Q

What is the Normal Distribution?

A

Is a family of probability density curves that are symmetric, unimodal (single-peaked), bell-shaped, defined by the mean μ, and the standard deviation σ of a continuous random variable

150
Q

What sample statistics are used to estimate population parameters μ and σ?

A

Sample mean (x bar) and the sample standard deviation (S)

151
Q

What determines the location of the center of the bell-shaped distribution along the X-axis?

A

The mean μ

152
Q

What determines the spread of the bell shaped distribution and how?

A

The standard deviation σ. The horizontal distance between the mean and the two points on the bell-shaped curve to either side of the mean where the curve changes from being convex up to concave up (called inflection points) is equal to the standard deviation σ. (See figure saved in favorites in your phone)

153
Q

Empirical Rule for Normal Distributions

A
  • Approximately 68% of the area under the curve lies within one standard deviation of the mean. That is, P{μ - 1σ ≤ X ≤ μ + 1σ} ≈ 0.68
  • Approximately 95% of the area under the curve lies within two standard deviations of the mean. That is, P{μ - 2σ ≤ X ≤ μ + 2σ} ≈ 0.95
  • Approximately 99.7% of the area under the curve lies within three standard deviations of the mean. That is, P{μ - 3σ ≤ X ≤ μ + 3σ} ≈ 0.997
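
These areas can be checked numerically. The sketch below uses Python's `statistics.NormalDist` for the standard Normal curve; because every Normal distribution has the same areas in units of σ, μ = 0 and σ = 1 are enough.

```python
from statistics import NormalDist

z = NormalDist(mu=0, sigma=1)

# Area under the curve within k standard deviations of the mean.
areas = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, area in areas.items():
    print(f"within {k} SD: {area:.4f}")
# within 1 SD: 0.6827
# within 2 SD: 0.9545
# within 3 SD: 0.9973
```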
154
Q

What does the empirical rule provide?

A

A useful approximation for determining probabilities associated with values for Normally distributed random variables.

155
Q

Standard Normal Distribution (SND)

A

Overall, it is quite difficult to determine areas under the curves, which is why mathematicians have performed the calculations and produced the table of probabilities for a single SND.

This specific normal distribution has a mean mu=0 and a standard deviation sigma=1

156
Q

Standard Normal Distribution Variable

A

Is given the symbol Z

157
Q

Can you determine probabilities associated with values for any normally distributed variable using the SND table?

A

Yes

158
Q

How do you determine the probabilities associated with a range of values for a continuous Normally distributed random variable X?

A

You transform the original x-value(s) to z-values on the standard Normal Z scale, using the formula z=(x-mu)/sigma
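
A minimal worked example of the transformation, assuming made-up values μ = 100 and σ = 15 and the range 85 ≤ X ≤ 130. `NormalDist().cdf` stands in for the printed standard Normal table.

```python
from statistics import NormalDist

mu, sigma = 100, 15          # hypothetical population parameters
a, b = 85, 130               # range of interest for X

z_a = (a - mu) / sigma       # z = (x - mu) / sigma
z_b = (b - mu) / sigma

std_normal = NormalDist()    # mean 0, SD 1
p_range = std_normal.cdf(z_b) - std_normal.cdf(z_a)

print(z_a, z_b, round(p_range, 4))  # -1.0 2.0 0.8186
```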

159
Q

Can you determine probabilities associated with a specific Z-value or a range of Z-values?

A

A range of Z-values

160
Q

The probabilities in the standard normal table are given only for the range defined by P[Z≤z], meaning?

A

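You need to take the complement of P[Z≤z] to calculate the probability of an event such as P[Z≥z] = 1 - P[Z≤z], and you need to subtract two table values to calculate the probability of a range: P[z_a ≤ Z ≤ z_b] = P[Z≤z_b] - P[Z≤z_a].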

Remember determining this P[z_a ≤ Z ≤ z_b], only makes sense if z_b is larger than z_a
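
A short sketch of how lower-tail table values combine. The function `table` is a hypothetical stand-in for looking up P[Z ≤ z] in a printed table, and the z-values 1.5 and -0.5 are arbitrary.

```python
from statistics import NormalDist

Z = NormalDist()  # standard Normal: mean 0, SD 1

def table(z):
    """Stand-in for a standard Normal table lookup: returns P[Z <= z]."""
    return Z.cdf(z)

p_upper = 1 - table(1.5)                # P[Z >= 1.5], by the complement rule
p_between = table(1.5) - table(-0.5)    # P[-0.5 <= Z <= 1.5]

print(round(p_upper, 4), round(p_between, 4))  # 0.0668 0.6247
```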

161
Q

If a histogram of the individual data values in a sample is approximately symmetric and bell-shaped, what can you assume?

A

You can assume the population distribution from which the data were obtained is also approximately Normal; however, if the sample size is small, the data values obtained from a truly Normal distribution may not appear symmetric or bell-shaped in a histogram.

162
Q

What do you use to evaluate whether or not the data values of a quantitative random variable X are Normally distributed?

A

You should plot the data in a normal quantile (probability) plot. If the array of points in the plot forms a straight line, the values of the variable are Normally distributed. Deviations from a straight line indicate a non-Normal distribution.

163
Q

What is the Sampling Distribution of a Statistic?

A

It is the probability distribution for the values of a sample statistic. It describes the range of possible values for a sample statistic and displays the probabilities associated with those values.

164
Q

How can the sampling distribution of a statistic be visualized?

A

As a probability histogram or a probability density curve. The list or range of possible values for the statistic is on the X-axis, and the height of the bars or the area under the curve represents the relative frequency (probability) of obtaining specific values of the statistic from random samples.

165
Q

What is assumed of the sample statistic in relation to the true pop. parameter if the sample units are selected by a random, unbiased procedure?

A

If selected by a random, unbiased procedure, simple logic dictates that the value of the sample statistic is equally likely to fall above or below the true parameter value. If we compute many values of a sample statistic, derived from repeated, independent samples from the same population, and we plot a frequency histogram of these values, the true value of the parameter should be at the center of the histogram.

166
Q

What is the expected value of the sample statistic?

A

The center of the sampling distribution. Basically, if the study design is random and unbiased, the expected value of the statistic is the true value of the pop. parameter.

167
Q

What happens to the sampling distribution if the sample size n of each sample is increased?

A

The variation among values of a sample statistic computed from repeated samples from the same pop. should decrease.

168
Q

How can we maximize the probability of getting a representative sample?

A

By randomly and unbiasedly sampling a large number of sample units. The representative sample will provide a precise and accurate estimate of the true pop. parameter value.

169
Q

What is the most common statistic for categorical variables?

A

Proportion

170
Q

Sample proportion

A

Denoted by the symbol p hat and computed as X/n. This statistic is an empirical estimate of the proportion of individuals in the entire population that fall into the specified category.

It is a discrete random variable: because the count X can have only integer values, p hat can take only the values 0/n, 1/n, 2/n, ..., n/n.

171
Q

The spread of the sampling distribution of p hat reflects what?

A

It reflects the amount of random sampling variation that would be exhibited by this statistic if independent samples of n observations were repeatedly taken from the same population.

172
Q

Standard Deviation of p hat

A

sigma phat = Square Root (P(1-P)/n)

As observed from the formula, sample size AND the value of the population proportion influence how much random variation is observed in the value of p hat
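
To see both influences, the sketch below evaluates the formula for a few made-up combinations of P and n. The function name `sd_p_hat` is illustrative.

```python
import math

def sd_p_hat(P, n):
    """Standard deviation of p hat: square root of P(1 - P) / n."""
    return math.sqrt(P * (1 - P) / n)

# The spread shrinks as n grows, and for a fixed n it is largest at P = 0.5.
for P in (0.5, 0.1):
    for n in (25, 100, 400):
        print(f"P={P}, n={n}: sigma p hat = {sd_p_hat(P, n):.4f}")
```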

173
Q

Sample proportion and How it Becomes Approximately Continuous

A

Although the sample proportion is a discrete random variable that can only take particular values (for example, {0.0, 0.25, 0.5, 0.75, 1.0} when n = 4), as the sample size increases, the values in the sample space of p hat look more like those of a continuous variable.

The consequence is that the shape of the sampling distribution of p hat changes from a probability histogram that looks like a staircase to one that looks more like a smooth curve. Hence, as the sample size increases, the sample space values for p hat become approximately continuous.

174
Q

Population Proportion, Large Sample Size, and the sampling distribution of P hat

A

The more the value of Pop. proportion deviates from 0.5, the more skewed is the sampling distribution of p hat; however, if the sample size is sufficiently large, the sampling distribution of p hat becomes approximately Normal

175
Q

Empirical estimates of the expected value of p hat, of the SD of p hat, and of the shape of the sampling distribution of p hat.

A

The mean of the p hat values from thousands of simulated samples is an empirical estimate of E(p hat)

The SD of the simulated p hat values is an empirical estimate of the SD of p hat

The shape of the relative frequency histogram for p hat values from thousands of simulated samples is an empirical estimate of the shape of the sampling distribution of p hat
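
A minimal simulation sketch of these empirical estimates, assuming a made-up population proportion P = 0.3, sample size n = 50, and 5,000 simulated samples. The results are compared with the theoretical values P and the square root of P(1-P)/n.

```python
import math
import random
import statistics

random.seed(7)  # fixed seed so the sketch is repeatable

P, n, n_samples = 0.3, 50, 5_000   # hypothetical values

def one_p_hat():
    """Draw one sample of n observations and return its sample proportion."""
    x = sum(random.random() < P for _ in range(n))
    return x / n

p_hats = [one_p_hat() for _ in range(n_samples)]

mean_p_hat = statistics.mean(p_hats)      # empirical estimate of E(p hat)
sd_of_p_hats = statistics.pstdev(p_hats)  # empirical estimate of the SD of p hat
theory_sd = math.sqrt(P * (1 - P) / n)

print(f"mean of p hats: {mean_p_hat:.4f} (theory: {P})")
print(f"SD of p hats:   {sd_of_p_hats:.4f} (theory: {theory_sd:.4f})")
```

A histogram of `p_hats` would give the empirical estimate of the shape of the sampling distribution.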

176
Q

When the empirical sampling distribution histogram for p hat is skewed, what happens with the normal probability plot?

A

The points occur in separate clusters. As the sample size increases, the points in the Normal probability plot begin to approximate a continuous straight line.

177
Q

In real-world science, do we take repeated samples from our population of interest to document the sampling distribution of our statistics?

A

Never. The theoretical sampling distribution (E(p hat) = P and sigma p hat = square root of (P(1-P)/n)) is therefore a powerful tool that allows us to describe a large population based on data from a single sample.

178
Q

Since p hat can be considered a normally distributed variable if the sample size is large, then how can you determine its probabilities?

A

You can determine the probability associated with any range of values for p hat by converting the proportion value to a standard Normal Z-value and using the standard Normal distribution to obtain the probability.

P[p hat ≤ p hat observed] = P[Z ≤ (p hat observed - E[p hat]) / sigma p hat], OR the same formula but replacing ≤ with ≥
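
A worked sketch of the ≥ version, with made-up numbers: a hypothesized proportion P = 0.5, a sample of n = 100, and an observed p hat of 0.58. `NormalDist().cdf` stands in for the standard Normal table.

```python
import math
from statistics import NormalDist

P, n = 0.5, 100            # hypothesized proportion and sample size (made up)
p_hat_observed = 0.58      # made-up observed sample proportion

expected = P                                  # E[p hat] = P
sigma_p_hat = math.sqrt(P * (1 - P) / n)      # 0.05 for these values
z = (p_hat_observed - expected) / sigma_p_hat

p_upper = 1 - NormalDist().cdf(z)             # P[p hat >= p hat observed]

print(round(z, 2), round(p_upper, 4))  # 1.6 0.0548
```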

179
Q

What is the sample mean?

A

It is used to estimate the value of true mean for the entire pop. of interest which is denoted by the symbol μ

180
Q

We use probability distributions for the value of the sample mean for what purpose?

A

To arrive at appropriate conclusions based on uncertain evidence provided by sample data. Since the sample mean is a continuous random variable, the sampling distribution for the sample mean is a probability density curve.

181
Q

If individuals are sampled from the population of interest by a random, unbiased procedure, the value of the sample mean (x bar) is equally likely to fall above or below the value of the population mean μ. True or False?

A

True. Therefore, the center of the sampling distribution of x bar is the true value of the population mean μ, assuming a representative sample is obtained. Hence,

E(x bar) = μ

182
Q

The spread of the sampling distribution of x bar reflects what?

A

The amount of random sampling variation that would be exhibited by this statistic if independent samples of n observations were repeatedly taken from the same population.

183
Q

What is population standard deviation (sigma)?

A

It is the amount of variation among individuals in the population. This is a characteristic that differs both between variables and between different populations. For example, there is more variation in body weight among adults than among infants.

184
Q

The greater the variation among individuals in the population (sigma), the greater…

A

the amount of random sampling variation we can expect to see in values of x bar computed from independent samples.

185
Q

What does the SD of the sample mean (x bar) measure, and what is its formula?

A

sigma subscript x bar = sigma / square root of n. This calculates the amount of random sampling variation in the value of x bar.

186
Q

Problem with the SD of the sample mean formula

A

We rarely know sigma (population standard deviation), which is why we will usually quantify the spread of the sampling distribution of x bar by replacing sigma with an estimated SD, denoted by S.

187
Q

How do you calculate the estimated SD (S)?

A

You calculate it using the data values from your sample.

188
Q

Standard Error of the Mean

A

This is the measure of spread for the sampling distribution of the mean that results from replacing sigma with S, and it is given the symbol S subscript x bar: S subscript x bar = S / square root of n.
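
A minimal sketch of the calculation, using a made-up sample of eight measurements. It assumes the usual form of the standard error, S divided by the square root of n, where S is the sample SD computed with n - 1 in the denominator.

```python
import math
import statistics

sample = [4.1, 5.3, 4.8, 6.0, 5.5, 4.9, 5.2, 5.8]   # made-up measurements

n = len(sample)
x_bar = statistics.mean(sample)       # sample mean, estimates mu
s = statistics.stdev(sample)          # sample SD (divides by n - 1), estimates sigma
standard_error = s / math.sqrt(n)     # S subscript x bar

print(f"n = {n}, x bar = {x_bar:.3f}, S = {s:.3f}, SE = {standard_error:.3f}")
```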

189
Q

What determines the shape of the sampling distribution of x bar?

A
  1. The shape of the population distribution for variable X.
  2. Sample Size n
190
Q

What is Population Distribution?

A

It is the probability distribution for individual values of variable X that would be obtained if the entire population were measured.

191
Q

What happens if the shape of the population distribution is Normal?

A

The sampling distribution of x bar will always be Normal.

192
Q

How do we assess if a population distribution is Normal?

A

Generally, investigators look at histograms, boxplots, or Normal quantile plots of individual data values in a sample (called data distribution)

193
Q

What happens if the data distribution (and by inference the pop. distribution) is not Normal? (For example, skewed or multimodal)

A

The shape of the sampling distribution of x bar will depend on the sample size n. If n is small, the shape of the sampling distribution will be similar to the shape of the data distribution (hence the pop. distribution); however, as n increases the sampling distribution of x bar will become approximately Normal (See Central Limit Theorem)

194
Q

Central Limit Theorem

A

This theorem says that when the sample size n is sufficiently large, the shape of the sampling distribution of the sample mean x bar will be approximately Normal, no matter what the shape of the population (data) distribution.
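
A simulation sketch of the theorem, assuming a strongly right-skewed population (an exponential distribution, chosen only for illustration). It measures the skewness of 4,000 simulated sample means for a small and a larger n; a value near 0 indicates a roughly symmetric shape.

```python
import random
import statistics

random.seed(3)  # fixed seed so the sketch is repeatable

def skewness(values):
    """Average cubed z-score: about 0 for symmetric data, positive for right skew."""
    m = statistics.mean(values)
    sd = statistics.pstdev(values)
    return sum(((v - m) / sd) ** 3 for v in values) / len(values)

def simulated_sample_means(n, n_samples=4_000):
    """Means of n_samples independent samples of size n from a skewed population."""
    return [
        statistics.mean([random.expovariate(1.0) for _ in range(n)])
        for _ in range(n_samples)
    ]

skew_small_n = skewness(simulated_sample_means(2))
skew_large_n = skewness(simulated_sample_means(50))

print(f"skewness of x bar, n = 2:  {skew_small_n:.2f}")
print(f"skewness of x bar, n = 50: {skew_large_n:.2f}")
```

The skewness for n = 50 should come out much closer to 0 than for n = 2, which is the sampling distribution moving toward a Normal shape.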

195
Q

How large a sample is sufficiently large?

A

This depends on how far the data distribution is from Normal (and by inference how far the pop. distri. is from Normal). However, you can say that the more skewed and multimodal the pop. distribution, the larger the sample size required before the sampling distribution of x bar will be Normal

196
Q

The levels of Skewness from Boxplots

A

A skewed distribution will have one whisker longer than the other and the median line will not usually be located in the middle of the box.

A moderately skewed distribution might have one whisker that is 2 to 5x longer than the other but no outliers.

Extremely skewed distributions might have one whisker more than 10x longer than the other, and usually include outliers off the end of the whisker.

197
Q

Applying the Central Limit Theorem Example

A

If the largest data value is more than 10x the median, obtaining a larger sample (n ≥ 100) would be imperative, and if not possible, it would be difficult to assume the sampling distribution of x bar is Normal

198
Q

Why is it so important that the sampling distribution of x bar be Normal?

A

The most powerful statistical analyses for sample means are based on a Normal sampling distribution for x bar. If these procedures are used under circumstances in which the sampling distribution is not Normal, the results will be inaccurate.

There are alternative statistical analyses that are not based on a Normal sampling distribution, but these procedures are often less powerful and less familiar to many scientists.

199
Q

How can we determine the probability associated with any range of values for x bar?

A

By converting the observed value of the sample mean (x bar observed) to a standard Normal Z-value and then using the standard Normal distribution to obtain the probability.

P[x bar ≥ x bar observed] = P[Z ≥ (x bar observed - E[x bar]) / sigma subscript x bar], where sigma subscript x bar = sigma / square root of n, OR the same formula but replacing ≥ with ≤
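
A worked sketch with made-up numbers: a population with μ = 100 and σ = 15, a sample of n = 36, and an observed sample mean of 105. `NormalDist().cdf` stands in for the standard Normal table.

```python
import math
from statistics import NormalDist

mu, sigma, n = 100, 15, 36       # hypothetical population parameters and sample size
x_bar_observed = 105             # made-up observed sample mean

expected = mu                               # E(x bar) = mu
sigma_x_bar = sigma / math.sqrt(n)          # 15 / 6 = 2.5
z = (x_bar_observed - expected) / sigma_x_bar

p_upper = 1 - NormalDist().cdf(z)           # P[x bar >= x bar observed]

print(z, round(p_upper, 4))  # 2.0 0.0228
```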

200
Q
A