Chapter 5: Identifying good measurement Flashcards

1
Q

All statistical analysis can be divided into two branches or categories:

A

Descriptive or inferential

2
Q

Descriptive

A

Literally just describing the data in your sample, with no attempted conclusions about the entire population. Means, medians, standard deviations, etc., and some other stuff like effect sizes.

3
Q

Inferential

A

Using your data as the basis to make inferences about how things work in the overall population (i.e., beyond just your data). Assessing statistical significance.

4
Q

The overall goal of descriptive statistics is

A

to summarize data in a way that is efficient and accurate.

5
Q

Measures of central tendency tend to

A

provide a single value that represents the many numbers that make up a data set (one aspect of summarizing data efficiently).

6
Q

The most common measures of central tendency are

A

mean, median, mode

7
Q

variability

A

How dispersed or spread out the values in a data set are. Examples of such measures include standard deviation, variance, and range.

8
Q

To do a good job describing data

A

you need to provide information on central tendency AND variability. Neither one alone provides enough information.

9
Q

Bar graph

A

Top of bar = central tendency; error bars = measure of variability.

10
Q

Z scores

A

z = (x − μ) / σ
x = individual’s score
μ = mean of the sample’s scores
σ = standard deviation of the sample’s scores
The result of standardizing a score (value) with reference to the rest of the scores in the sample (a course section, in this case). It represents a score’s position relative to the mean, in units of standard deviation: where the score falls relative to the mean.
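
For instance, here is a minimal Python sketch of this standardization; the scores are hypothetical stand-ins for a course section’s grades.

```python
# A minimal sketch of z-scoring, assuming a small made-up sample
# of course-section scores (hypothetical numbers).
from statistics import mean, stdev

scores = [62, 70, 75, 81, 88, 94]   # hypothetical sample
mu = mean(scores)                   # sample mean
sigma = stdev(scores)               # sample standard deviation

# z = (x - mu) / sigma: a score's position relative to the mean, in SD units
z_scores = [round((x - mu) / sigma, 2) for x in scores]
print(z_scores)  # scores above the mean get positive z values
```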

11
Q

The correlation coefficient (r-value) is usually considered

A

a descriptive statistic, which is not surprising given that it describes an association.

12
Q

This kind of inference relates to whether the conclusions drawn from the sample data can be applied to the population. But how?

A

A statistical test (e.g., a t-test) →
a test statistic (e.g., t) →
a p-value →
is it statistically significant?

13
Q

null hypothesis testing

A

It relates explicitly to inferential statistics and is a bit different from the theory-data cycle’s concept of a hypothesis.

14
Q

Null hypothesis

A

is always that the result is NOT statistically significant, and statistical significance tells the researcher to reject the null hypothesis in favour of the alternative hypothesis.

15
Q

Rejecting the null hypothesis

A

The result is statistically significant

16
Q

Retaining the null hypothesis

A

A result of this magnitude is not statistically significant.

17
Q

As mentioned before, the threshold for significance (or alpha, α) is usually

A

p < 0.05, meaning that results at least this extreme would occur by random chance less than 5% of the time if there were no “real” effect (one applicable to the whole population).

18
Q

t-value is to t-test as

A

r-value is to correlation

19
Q

t-test

A

Common whenever you are comparing the mean and variability of two groups. The result is again a p-value.
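
As an illustration, a minimal sketch of a two-group t-test using SciPy’s ttest_ind; the groups’ scores are made up, and the library choice is an assumption (the card names no tool).

```python
# A minimal sketch of a two-sample t-test on made-up data.
from scipy import stats

group_a = [5.1, 4.8, 6.0, 5.5, 5.9]   # hypothetical scores, group A
group_b = [4.2, 4.5, 4.9, 4.1, 4.7]   # hypothetical scores, group B

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(t_stat, p_value)  # if p < alpha (e.g., 0.05), reject the null hypothesis
```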

20
Q

an analysis of variance (ANOVA)

A

a statistical test that works like a t-test but compares more than two groups. After showing that overall difference, researchers compare each pair of groups (usually called pairwise or post-hoc comparisons or tests).
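
A minimal sketch of the overall (omnibus) step with SciPy’s f_oneway on three made-up groups; the follow-up pairwise comparisons would be run separately.

```python
# A minimal sketch of a one-way ANOVA across three made-up groups.
from scipy import stats

g1 = [5.1, 4.8, 6.0, 5.5]
g2 = [4.2, 4.5, 4.9, 4.1]
g3 = [6.3, 6.8, 6.1, 6.5]

f_stat, p_value = stats.f_oneway(g1, g2, g3)
print(f_stat, p_value)  # a significant p would then prompt pairwise/post-hoc tests
```
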
21
Q

Alternative hypothesis

A

The result is statistically significant.

22
Q

Overall, there are tons of different statistical tests that can be used depending on the situation (and that you don’t have to know), but they all lead to

A

a p-value that can be used to assess statistical significance

23
Q

There are four possible outcomes of the act of making an inference about statistical significance

A

Two correct conclusions and two types of errors.
Type I error: rejecting the null hypothesis (concluding there is an effect) when there is no effect (“false positive”).
Type II error: retaining the null hypothesis (concluding there is not enough evidence for an effect) when there actually is one (a “miss”).

24
Q

Type I error can be avoided by

A

lowering the alpha level (e.g. α = 0.01), making the threshold for significance harder to reach.

25
Q

Conversely, type II error can be avoided by

A

raising α (e.g., α = 0.10).

26
Q

Decisions about α depend on

A

the research topic, hypothesis, statistical analysis, etc. E.g., a researcher might use a higher alpha when expecting a new medicine to have a small but important effect.

27
Q

What’s great about p-values?

A

They can be applied to any type of statistical test,

and p < 0.05 always means the same thing!

28
Q

Weakness of p-value and inferential tests

A

The main potential weakness is that statistical significance (p-values) depends heavily on sample size. The same relationship may be statistically significant or not depending on sample size; in theory, with a large enough sample, almost any relationship could be statistically significant. The descriptive statistics stay the same, but the p-value changes dramatically with sample size. With very large samples, it’s increasingly likely that you will flag a trivial difference or effect, or one that isn’t really there at all (Type I error). In the same spirit, a study with a small sample size might “miss” an effect that really exists (Type II error).
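
A small simulation sketch of this point, assuming NumPy and SciPy: the true (tiny) difference between groups is held fixed and only the sample size changes; the data are randomly generated, not from the text.

```python
# Simulate two groups whose true means differ by only 0.1 SD, and watch
# the p-value shrink as the sample size grows.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
for n in (20, 200, 2000):
    a = rng.normal(0.0, 1.0, n)   # group with true mean 0
    b = rng.normal(0.1, 1.0, n)   # trivially small true difference
    t, p = stats.ttest_ind(a, b)
    print(n, round(p, 4))         # p tends toward "significance" as n grows
```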

29
Q

The solution is to not base your conclusions only on statistical significance, but also look back to and use another type of measure within descriptive statistics:

A

Effect size: a quantitative measure of the size or magnitude of a relationship or effect. Like other descriptive statistics (e.g., central tendency), it can be measured in several different ways. Unlike them, it describes not just some sample of data but a relationship within the data.

30
Q

Cohen’s d

A

d = (M1 − M2) / SDpooled

In other words, the difference between two data sets’ means divided by the overall (pooled) variability in the data. Can be used to determine what sample size is necessary.

The most common measure of effect size. One strength of this and other effect sizes is that they are not dramatically influenced by sample size.
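
A minimal sketch of this formula in Python, with the pooled SD computed by weighting each group’s variance by its degrees of freedom; the example numbers are made up.

```python
# Cohen's d = (M1 - M2) / SD_pooled, on made-up data.
import numpy as np

def cohens_d(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    nx, ny = len(x), len(y)
    # pooled SD: weight each group's variance by its degrees of freedom
    sd_pooled = np.sqrt(((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1))
                        / (nx + ny - 2))
    return (x.mean() - y.mean()) / sd_pooled

print(cohens_d([5.1, 4.8, 6.0, 5.5], [4.2, 4.5, 4.9, 4.1]))
```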

31
Q

Increasingly in recent years, papers usually include both

A

inferential statistics (i.e., p-value) and a measure of effect size.

32
Q

Effect sizes typically have associated guidelines for use in interpreting the data:

A

Cohen’s d goes up from 0 and can go beyond 1, with the value indicating the strength or magnitude of the relationship:
0.2 = weak
0.5 = moderate
0.8 = strong
This means that if two groups’ means don’t differ by 0.2 standard deviations or more, the difference is trivial, even if it is statistically significant.

33
Q

There are other measures of effect size, including

A

eta squared (η²) and partial eta squared (ηp²), as well as r itself. Note the inclusion of r, which highlights that looking at effect size is not so much adding a wild new way of assessing things as simply putting stock in the descriptive statistics themselves and not focusing entirely on inferential ones.

34
Q

In terms of statistical validity, the results are statistically significant, but the effect sizes are tiny. What’s the story?

A

Big sample size

35
Q

construct validity

A

“How well did they operationalize the variable?”

AKA how did they take a construct, with its conceptual definition, and manipulate or measure it in a specific way?

36
Q

Operationalizing variables

A
  1. Self-report measures
  2. Observational measures
  3. Physiological measures
37
Q

Categorical (or nominal) variables are used when

A

the different levels of a variable are categories (e.g., gender, presence/absence of disease, etc.). In keeping with “nominal”, the numbers signify categories and don’t mean anything in a quantitative sense. An independent variable manipulated in an experiment is likely to be a categorical variable.

38
Q

One type of quantitative variable is an ordinal scale

A

where the numbers represent a ranked order

39
Q

Another quantitative variable is interval scale

A

where the numbers represent intervals between levels but there is no “true zero”. Temperature and angle are simple examples: numbers can be related to each other (e.g., 45° vs 90°), but zero doesn’t mean a complete absence. Zero degrees Celsius is not “no temperature.”

40
Q

Lastly, ratio scale works much like interval scale with the difference that

A

zero really means “nothing”. For example, in a memory task where you remember and repeat back sequences, getting a zero really means you remembered zero things.

41
Q

In terms of assessing construct validity for a variable, there are two main aspects:

A

Reliability: How consistent are the results?
Validity: Is it measuring what it’s supposed to measure?
Where’s “valid but not reliable”? No such thing…reliability is a prerequisite for validity.

42
Q

Test-retest reliability

A

For constructs that you expect to be stable (e.g., IQ), are the scores for an individual consistent across time (i.e., from test to retest)? If not, how could it be a reliable measure? Information on reliabilities can be found in either the Methods or Results section, depending on the paper. r = 0.50 or so is good test-retest reliability. The STAI is a good example, as it measures both trait anxiety (consistent over time) and state anxiety (may vary over time) via separate questions; as one would expect, test-retest reliability is high for the former but not the latter. Relevance depends on whether or not it’s a construct that you would expect to be stable over time.
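
A minimal sketch, assuming SciPy and made-up scores for the same six people at two time points; test-retest reliability here is simply the correlation between the two administrations.

```python
# Correlate the same individuals' scores at time 1 and time 2 (made-up data).
from scipy import stats

time_1 = [10, 14, 9, 16, 12, 13]   # hypothetical first-administration scores
time_2 = [11, 13, 10, 15, 12, 14]  # hypothetical retest scores

r, p = stats.pearsonr(time_1, time_2)
print(r)  # per the card, r around 0.50 or higher suggests decent stability
```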

43
Q

Test-retest reliability is usually seen

A

with questionnaires, but can apply to behavioural measures as well. For example, a disease rating scale might show test-retest reliability > 0.90 (over a finite time period, at least).

44
Q

Interrater reliability

A

How similar are the scores or measures of two or more different observers (e.g., smiles as a measure of happiness)? Usually measured via the correlation between the scores of one observer and those of another, often via an r-value; here, a good rule of thumb is r > 0.70 or so. Sometimes researchers use the intraclass correlation coefficient (ICC), which is a bit more complicated than r-values but works similarly… for example, it still runs from 0 to 1. Relevance depends on whether there is a subjective element to the operationalization (e.g., self-report vs behavioural).

45
Q

For categorical variables, Cohen’s kappa (κ, another fancy type of correlation) is used

A

to measure interrater reliability. It works like r, but statistically factors in the possibility of raters agreeing by chance.
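
A minimal sketch using scikit-learn’s cohen_kappa_score on made-up categorical ratings; the smile example echoes the earlier interrater card, and the library choice is an assumption.

```python
# Agreement between two raters' categorical judgements (made-up ratings).
from sklearn.metrics import cohen_kappa_score

rater_1 = ["smile", "smile", "neutral", "frown", "smile"]
rater_2 = ["smile", "neutral", "neutral", "frown", "smile"]

# kappa corrects raw agreement for the amount expected by chance
print(cohen_kappa_score(rater_1, rater_2))  # 1 = perfect agreement
```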

46
Q

Internal reliability (or “internal consistency”)

A

Do the various aspects of an operationalization meant to measure a certain construct change in association with each other? Put another way, subjects’ answers should generally “point in the same direction.” Measured via Cronbach’s alpha (α, also called coefficient alpha), another fancy sort of correlation; it measures the covariance among the items. Again, > 0.70 is a general rule of thumb. Needs to be addressed with almost every scale/questionnaire, since they typically feature multiple items. As with test-retest reliability, internal reliability is mostly seen with questionnaires but may also be used with other types of measures. Relevance depends on whether the operationalization features multiple components or not.
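
A minimal sketch of the standard variance-based formula for Cronbach’s alpha, α = k/(k−1) × (1 − sum of item variances / variance of total scores); the 1-5 ratings below are hypothetical.

```python
# Cronbach's alpha for a k-item questionnaire (made-up 1-5 ratings).
import numpy as np

def cronbach_alpha(items):
    """items: one row per respondent, one column per questionnaire item."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]                            # number of items
    item_vars = items.var(axis=0, ddof=1).sum()   # sum of per-item variances
    total_var = items.sum(axis=1).var(ddof=1)     # variance of total scores
    return (k / (k - 1)) * (1 - item_vars / total_var)

ratings = [[4, 5, 4], [2, 3, 2], [5, 5, 4], [3, 3, 3], [4, 4, 5]]
print(cronbach_alpha(ratings))  # rule of thumb: > 0.70 is acceptable
```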

47
Q

Assessing the validity of a measure

A

Does it measure what it is supposed to measure? Answering this involves looking at five different components.

48
Q

Two of them are subjective in nature

A

face and content validity. Both of these validities bring to mind the conceptual definition and whether the operationalization reflects it.

49
Q

Face validity

A

Does it look like it’s a good operationalization (like it measures what it should)? Since this is a judgement call, researchers might consult experts in the field.

50
Q

Content validity

A

Does it cover all parts or aspects of the construct or variable? Another judgement call for which we may get expert help.

51
Q

Assessment of content validity depends on

A

your perspective / definition.

52
Q

The three remaining, more objective components of construct validity

A

1) Criterion validity
2) Convergent validity
3) Discriminant validity

53
Q

criterion validity

A

Does the measure correlate with some “real world” or behavioural outcome associated with the conceptual definition? If your measure is a test of sales aptitude, a sensible criterion would be actual sales performance. If these don’t match up, there’s probably a problem with your measure.

54
Q

One special approach to assessing criterion validity is the known-groups paradigm

A

whether your measure differs between groups that are otherwise established to be different in some regard. A depression scale with criterion validity should generate higher scores for those diagnosed with depression (a “known group”) than for others. This approach works the same regardless of how the groups have been previously divided up.

55
Q

Criterion validity could be assessed with

A

both or either approach, depending on the variable and what is available for comparison.

56
Q

Convergent (sometimes called concurrent) validity

A

Does the measure correlate with an established measure of the same (or a very similar/related) construct? Does it correlate with something that you would expect it to correlate with? E.g., I just designed a new depression scale… does it produce results similar to the famous depression scale that everyone uses? Note that we are looking for a correlation, but not necessarily a positive one! One important part is that the scale being used for comparison should itself have been rigorously assessed for validity in the past.

57
Q

Discriminant (sometimes called divergent) validity:

A

Does the measure NOT correlate with measures of constructs that are different/unrelated? Evidence comes in the form of a lack of correlation with very different/unrelated constructs. A strong association would be a red flag.