Ch. 12: Data-Based and Statistical Reasoning Flashcards

Question 1

Q

defn: measures of central tendency

Answer

A

those that describe the middle of a sample

Question 2

Q

how do we find the mean + aka?

Answer

A

aka: average, arithmetic mean

add up all the individual values within the data set and divide the result by the number of values

Question 3

Q

when are means good indicators of central tendency?

Answer

A

when all of the values tend to be fairly close to one another

Question 4

Q

defn + impact: outlier

Answer

A

an extremely large or extremely small value compared to the other data values (can shift the mean toward one end of the range)

Question 5

Q

defn + how to find: median

Answer

A

the midpoint of a set of data (half of data points are greater than the value and half are smaller)

in data sets with an odd number of values, the median will be one of the data points

in data sets with an even number of values, the median will be the mean of the two central data points

to calculate, first organize the data in increasing fashion
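
worked example (a sketch with made-up numbers, not from the card): one extreme value pulls the mean a long way but barely moves the median. uses Python's standard `statistics` module.

```python
from statistics import mean, median

# Hypothetical data: five values that sit close together.
values = [4, 5, 5, 6, 7]
with_outlier = values + [40]  # same data plus one extreme value

print(mean(values), median(values))              # 5.4 5
print(mean(with_outlier), median(with_outlier))  # mean jumps above 11, median is 5.5
```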

Question 6

Q

when is the median a good tool to use? when is it not helpful?

Answer

A

GOOD FOR: it is the least susceptible to outliers

BAD FOR: may not be useful for data sets with large ranges or multiple modes

Question 7

Q

what does it mean if the mean and median are far from each other? if they are close to each other?

Answer

A

IF FAR: this implies the presence of outliers or a skewed distribution

IF CLOSE: implies a symmetrical distribution

Question 8

Q

defn: mode

Answer

A

the number that appears the most often in a set of data

there may be multiple modes (or even no mode!)

peaks represent modes in a data set
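
quick illustration (hypothetical data, assuming Python 3.8+ for `statistics.multimode`): one mode, two modes, and a set where every value ties so there is no single mode.

```python
from statistics import multimode

# multimode returns every value that shares the highest count.
print(multimode([1, 2, 2, 3]))     # [2]
print(multimode([1, 1, 2, 3, 3]))  # [1, 3]
print(multimode([1, 2, 3]))        # [1, 2, 3]: all values tie, so no single mode
```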

Question 9

Q

is the mode a measure of central tendency?

Answer

A

not typically used as one, but the number of modes and their distance from one another are informative

Question 10

Q

what does it mean to “solve” a normal distribution?

Answer

A

we can transform any normal distribution into the STANDARD normal distribution (mean of zero, standard deviation of one) and then use that curve to get information about probabilities or percentages of populations
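
sketch of the idea (the population mean of 100 and SD of 15 are made-up numbers): convert a value to a z-score, then read the fraction of the population below it from the standard curve using `statistics.NormalDist`.

```python
from statistics import NormalDist

# Hypothetical population: mean 100, standard deviation 15.
mu, sigma = 100, 15
x = 130

z = (x - mu) / sigma         # z-score: how many SDs x lies from the mean
below = NormalDist().cdf(z)  # fraction of a standard normal curve below z
print(z, round(below, 4))    # 2.0 0.9772
```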

Question 11

Q

what is the basis of the bell curve?

Answer

A

the normal distribution

Question 12

Q

what % of the distribution (normal) is within one SD? within 2 SD? within 3 SD?

Answer

A

1 SD: 68%

2 SD: 95%

3 SD: 99.7% (often rounded to 99%)
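
the percentages on this card are rounded. a quick check, assuming Python's `statistics.NormalDist` (the helper name `within` is just for illustration):

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, SD 1

def within(k):
    """Fraction of a normal distribution within k SDs of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(within(k) * 100, 1))  # about 68.3, 95.4, 99.7
```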

Question 13

Q

defn: skewed distribution

Answer

A

one that contains a tail on one side or the other of the data set

Question 14

Q

why are skewed distributions often confusing?

Answer

A

the VISUAL shift in the data appears OPPOSITE the direction of the skew

the direction of a skew in a sample is determined by its TAIL, not the bulk of the distribution

Question 15

Q

defn: negatively vs. positively skewed distribution

Answer

A

NEGATIVELY = tail on left (negative) side

POSITIVELY = tail on right (positive) side

Question 16

Q

why is the mean of a negatively skewed distribution lower than the median?

why is the mean of a positively skewed distribution higher than the median?

Answer

A

because the mean is more susceptible to outliers than the median

Question 17

Q

defn: bimodal

Answer

A

a distribution containing 2 peaks with a valley

note: it might only have one actual MODE if one peak is slightly higher than the other

Question 18

Q

in what circumstances can we (but don’t have to!) analyze a bimodal distribution as two separate distributions?

Answer

A

if there is sufficient separation of the two peaks, or a sufficiently small amount of data within the valley region

Question 19

Q

can measures of central tendency and measures of distribution be applied to bimodal distributions?

Question 20

Q

defn: range

Answer

A

the difference between its largest and smallest values

Question 21

Q

what is range affected heavily by?

Answer

A

the presence of data outliers

Question 22

Q

what is an estimate of the SD based on the range when it is not possible to calculate the SD?

Answer

A

SD is approx. 1/4 range

Question 23

Q

defn: quartile

Answer

A

divide data (when placed in ascending order) into groups that comprise one-fourth the entire set

Question 24

Q

what are the 4 steps to calculating the quartiles?

Answer

A

to find the position of Q1 in a set of data sorted in ascending order, multiply n by 0.25
if this is a whole number, the quartile is the mean of the value at this position and the next highest position
if this is a decimal, round up to the next whole number and take that as the quartile position
to calculate the position of Q3 multiply the value of n by 0.75. Again, if this is a whole number, take the mean of this position and the next. If it is a decimal, round up to the next whole number, and take that as the quartile position
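
the steps above can be sketched as follows. the function name and the data are hypothetical, and this follows the rule on this card only; statistics libraries often use other quartile conventions and can give slightly different values.

```python
import math

def quartile_by_card_rule(data, fraction):
    """Quartile using the rule on this card (fraction is 0.25 for Q1, 0.75 for Q3)."""
    ordered = sorted(data)
    position = len(ordered) * fraction       # 1-based position in the sorted data
    if position == int(position):            # whole number: average this value and the next
        i = int(position)
        return (ordered[i - 1] + ordered[i]) / 2
    return ordered[math.ceil(position) - 1]  # decimal: round up and take that value

data = [1, 3, 4, 6, 8, 9, 12, 15]  # hypothetical, n = 8
q1 = quartile_by_card_rule(data, 0.25)
q3 = quartile_by_card_rule(data, 0.75)
print(q1, q3)  # 3.5 10.5
```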

Question 25

Q

how do you calculate the interquartile range?

Answer

A

IQR = Q3 - Q1

Question 26

Q

what is the IQR helpful for determining? + how?

Answer

A

outliers

any value that falls more than 1.5 IQRs below the first quartile or above the third quartile is considered an outlier
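
a minimal sketch of the 1.5 IQR rule. the function name and data are made up, and the quartiles are assumed to be already known.

```python
def iqr_outliers(data, q1, q3):
    """Return values more than 1.5 IQRs below Q1 or above Q3."""
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < low or x > high]

# Hypothetical data; Q1 = 3.5 and Q3 = 10.5 are taken as given.
data = [1, 3, 4, 6, 8, 9, 12, 40]
print(iqr_outliers(data, q1=3.5, q3=10.5))  # [40]
```

here IQR = 7, so anything below -7 or above 21 counts as an outlier.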

Question 27

Q

what is the most informative measure of distribution?

Answer

A

standard deviation

Question 28

Q

how is std dev calculated (in words)?

Answer

A

by taking the difference between each data point and the mean, squaring this value, dividing the sum of all of these squared values by the number of points in the data set minus one, and then taking the square root of the result
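
the same steps in code (hypothetical data; `sample_sd` is an illustrative name), checked against Python's built-in `statistics.stdev`, which also divides by n minus one:

```python
import math
from statistics import stdev

def sample_sd(data):
    """Sample standard deviation, following the steps on this card."""
    m = sum(data) / len(data)
    squared_diffs = [(x - m) ** 2 for x in data]
    return math.sqrt(sum(squared_diffs) / (len(data) - 1))

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values, mean = 5
print(sample_sd(data), stdev(data))  # the two should agree
```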

Question 29

Q

how can you use the std dev to determine whether a data point is an outlier?

Answer

A

if the data point falls more than 3 SDs from the mean, it is considered an outlier

Question 30

Q

what are the three main causes of outliers + examples?

Answer

A

a true statistical anomaly (a person who is over 7 feet tall)
a measurement error (reading the tape measure wrong)
a distribution that is not approximated by the normal distribution (a skewed distribution with a long tail)

Question 31

Q

how do you approach the data set across each of the three causes of outliers?

Answer

A

measurement error –> exclude the data from the analysis
true measurement, but not representative –> weight to reflect its rarity, include normally, or exclude (should be decided ahead of the study, not after the outlier is found)
not normal distribution? repeated or larger samples will demonstrate the truth

Question 32

Q

defn: independent vs. dependent events

Answer

A

independent events: have no effect on one another

dependent events: do have an impact on one another, such that the order changes the probability

Question 33

Q

defn: mutually exclusive

Answer

A

outcomes that cannot occur at the same time

the probability of two mutually exclusive outcomes occurring together is 0%

Question 34

Q

does the term mutually exclusive apply to events? or only outcomes?

Answer

A

only outcomes

Question 35

Q

defn + example: exhaustive

Answer

A

a group of outcomes is said to be exhaustive if there are no other possible outcomes

example: flipping heads or tails are the exhaustive outcomes of a coin flip

Question 36

Q

how do you calculate the probability of two or more independent events occurring at the same time?

Answer

A

P(A and B) = P(A) x P(B)
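
example with made-up events (a fair coin and a fair die, which do not affect each other), using exact fractions:

```python
from fractions import Fraction

# Hypothetical independent events: a fair coin lands heads, a fair die shows six.
p_heads = Fraction(1, 2)
p_six = Fraction(1, 6)

p_both = p_heads * p_six  # P(A and B) = P(A) x P(B)
print(p_both)  # 1/12
```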

Question 37

Q

how do you calculate the probability of one of two independent events occurring?

Answer

A

P(A or B) = P(A) + P(B) - P(A and B)
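
example with made-up events (heads on a fair coin, a six on a fair die). subtracting P(A and B) avoids counting the overlap twice.

```python
from fractions import Fraction

p_a = Fraction(1, 2)   # fair coin lands heads
p_b = Fraction(1, 6)   # fair die shows six
p_a_and_b = p_a * p_b  # valid here because the two events are independent

p_a_or_b = p_a + p_b - p_a_and_b
print(p_a_or_b)  # 7/12
```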

Question 38

Q

what do hypothesis testing and confidence intervals allow us to do?

Answer

A

to draw conclusions about populations based on our sample data

Question 39

Q

defn: null hypothesis

Answer

A

a hypothesis of equivalence

says that two populations are equal or that a single population can be described by a parameter equal to a given value

Question 40

Q

what are the two options for the alternative hypothesis?

Answer

A

nondirectional: the populations are not equal

directional: example, the mean of population A is greater than the mean of population B

Question 41

Q

what distributions do z-tests and t-tests rely on?

Answer

A

z-tests: standard normal distribution

t-tests: t-distribution

Question 42

Q

defn: test statistic/p-value

Answer

A

calculated from the data collected and compared to a table to determine the likelihood that that statistic was obtained by random chance under the assumption that our null hypothesis is true

Question 43

Q

func + most common value + greek letter + meaning in words: significance level

Answer

A

func: compare to our p-value

greek letter: alpha

common value: 0.05

meaning: the level of risk we are willing to accept for incorrectly rejecting the null hypothesis

Question 44

Q

how do we respond to the null hypothesis if the p-value is greater than alpha?

if the p-value is less than alpha?

AND what does it mean?

Answer

A

p-value > alpha: we fail to reject the null hypothesis = there is not a statistically significant difference between the two populations

p-value < alpha: we reject the null hypothesis = there is a statistically significant difference between the two groups
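
a sketch of the decision rule, assuming a one-sample, two-sided z-test with a known population SD. all numbers are hypothetical, and the p-value comes from `statistics.NormalDist` instead of a table.

```python
import math
from statistics import NormalDist

# Hypothetical question: is the population mean different from 100?
null_mean, known_sd = 100, 15
sample_mean, n = 106, 36
alpha = 0.05

z = (sample_mean - null_mean) / (known_sd / math.sqrt(n))
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided (nondirectional) test

decision = "reject the null" if p_value < alpha else "fail to reject the null"
print(round(z, 2), round(p_value, 4), decision)
```

with these numbers z is 2.4 and the p-value is about 0.016, which is below alpha, so the null hypothesis is rejected.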

Question 45

Q

defn: type I vs. type II error

Answer

A

type I error = the likelihood that we report a difference between two populations when one does not actually exist

type II error = occurs when we incorrectly fail to reject the null hypothesis = the likelihood that we report no difference between two populations when one actually exists

Question 46

Q

what is the greek letter for a type II error?

Answer

A

β (beta)

Question 47

Q

defn + eqn: power

Answer

A

the probability of correctly rejecting a false null hypothesis

= 1 - β

Question 48

Q

defn: confidence

Answer

A

the probability of correctly failing to reject a true null hypothesis = reporting no difference between two populations when one does not exist

Question 49

Q

defn + calc: confidence intervals

Answer

A

the reverse of hypothesis testing

we determine a range of values from the sample mean and standard deviation. rather than finding a p-value, we begin with a desired confidence level (95% is standard) and use a table to find its corresponding z- or t-score

when we multiply the z or t score by the standard deviation, and then add and subtract this number from the mean, we create a range of values
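
a minimal sketch with a made-up sample. two assumptions to flag: it uses a z-score for simplicity (with a sample this small a t-score from a table would be the usual choice), and it multiplies by the standard error of the mean (SD divided by the square root of n), which is the usual spread term for an interval around a sample mean.

```python
import math
from statistics import NormalDist, mean, stdev

data = [98, 102, 101, 97, 103, 99, 100, 104, 96, 100]  # hypothetical sample
confidence = 0.95

z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # about 1.96 for 95%
standard_error = stdev(data) / math.sqrt(len(data))  # assumption: SE, not the raw SD
margin = z * standard_error

low, high = mean(data) - margin, mean(data) + margin
print(round(low, 2), round(high, 2))
```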

Question 50

Q

defn: charts

Answer

A

present information in a visual format and are frequently used for categorical data

Question 51

Q

func + downside: pie/circle charts

Answer

A

used to represent relative amounts of entities and are especially popular in demographics

downside: as the number of represented categories increases, the visual representation loses impact and becomes confusing

Question 52

Q

func + benefit: bar charts

Answer

A

used for categorical data

likely to contain significantly more info than a pie chart for the same amount of page space

length of a bar is generally proportional to the value it represents

Question 53

Q

what should you be wary of in bar charts?

Answer

A

graphs that contain breaks! they may be enlarging the difference between bars

Question 54

Q

func: histogram + benefit

Answer

A

present numerical data rather than discrete categories

useful for determining the mode of a data set because they are used to display the distribution of a data set

Question 55

Q

func: box plots

Answer

A

used to show the range, median, quartiles, and outliers for a set of data

Question 56

Q

defn: box-and-whisker

Answer

A

a labeled box plot

the box is bounded by Q1 and Q3
Q2 (median) is the line in the middle of the box
ends of the whiskers correspond to max and min values of the data set

OR outliers can be presented as individual data points, with the ends of the whiskers corresponding to the largest and smallest values in the data set that are within 1.5 IQRs of the box (no more than 1.5 IQRs below Q1 or above Q3)

Question 57

Q

what are box-and-whisker plots useful for and why?

Answer

A

comparing data because they contain a large amount of data in a small amount of space, and multiple plots can be oriented on a single axis

Question 58

Q

how to approach a graph on test day? (3)

Answer

A

attempt to draw rough conclusions immediately
do not spend time analyzing all the details of the graph unless asked to do so by a question
look at the axes first

Question 59

Q

what makes good map data?

Answer

A

examining one or at most two pieces of information simultaneously

any further data may inhibit clarity

Question 60

Q

defn + char (2): linear graphs

Answer

A

show the relationships between two variables

typically involve two direct measurements

do not have to be a straight line

Question 61

Q

what are the 4 shapes of curves on linear graphs?

Answer

A

linear
parabolic
exponential
logarithmic

Question 62

Q

defn: slope (m)

Answer

A

change in the y-direction divided by the change in the x-direction for any two points
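
example with two made-up points (the function name is illustrative):

```python
def slope(p1, p2):
    """Change in y divided by change in x between two points."""
    (x1, y1), (x2, y2) = p1, p2
    return (y2 - y1) / (x2 - x1)

print(slope((1, 2), (4, 11)))  # 3.0
```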

Question 63

Q

defn + benefit + char (2): semilog graphs

Answer

A

a specialized representation of a logarithmic data set

may be easier to interpret because the otherwise curved nature of the log data is made linear by a change in the axis ratio

one axis (usually x) maintains traditional unit spacing

other axis assigns spacing based on a ratio (the multiples may be of any number as long as there is consistency in the ratio from one point on the axis to the next)
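
why the curve straightens out, as a sketch with made-up exponential data: taking log10 of y (what a ratio-spaced axis effectively does) turns equal steps in x into equal steps in log y.

```python
import math

# Hypothetical exponential data: y = 5 * 10**(0.5 * x)
xs = [0, 1, 2, 3, 4]
ys = [5 * 10 ** (0.5 * x) for x in xs]

log_ys = [math.log10(y) for y in ys]
steps = [b - a for a, b in zip(log_ys, log_ys[1:])]
print([round(s, 6) for s in steps])  # equal steps of 0.5, i.e. a straight line
```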

Question 64

Q

defn + char: log-log graph

Answer

A

both axes use a constant ratio from point to point on the axis

Answer 63

A

take a brief moment to glance at the title of a table
tables that do not have unusual data values should be approached especially briefly
when a table DOES contain significant organization, this structure is likely to be relevant to answering questions (e.g. a trend that suddenly appears or disappears)
you should be able to convert the table to a rough graph or linear equation

Answer 64

A

refers to a connection (direct relationship, inverse, or otherwise) between data

Answer 65

A

POSITIVE: if two variables trend together (as one increases so does the other)

NEGATIVE: if two variables trend in opposite directions (one increases as the other decreases)

Answer 66

A

a number between -1 and +1 that represents the strength of a relationship

+1 = strong positive relationship
-1 = strong negative relationship
0 = no apparent relationship
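
illustration with made-up data. the card does not say which coefficient it means; this sketch assumes the Pearson correlation coefficient, the most common one.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

xs = [1, 2, 3, 4, 5]
print(pearson_r(xs, [2, 4, 6, 8, 10]))  # close to +1: variables rise together
print(pearson_r(xs, [10, 8, 6, 4, 2]))  # close to -1: one rises as the other falls
```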

Answer 67

A

no, not necessarily. avoid this assumption

Answer 68

A

state the apparent relationships between data
begin to draw connections to other concepts in science and to our background knowledge
consider the impact of the new data on the existing hypothesis
ideally, the new data would be integrated into all future investigations on the topic
develop a plausible rationale for the results
make decisions about our data’s impact on the real world and determine whether or not our evidence is substantial and impactful enough to necessitate changes in understanding or policy