Ch. 13 Flashcards

1
Q

statistics

A

Descriptive summary values (e.g., means, correlation coefficients) computed from one or more variables measured in a sample.

In general, however, the researcher’s goal is not to draw conclusions about that sample but to draw conclusions about the population that the sample was selected from.

2
Q

parameters

A

The corresponding values in the population (e.g., a population mean or a population correlation).

3
Q

sampling error

A

The random variability in a statistic from sample to sample.

(Note that the term error here refers to random variability and does not imply that anyone has made a mistake. No one “commits a sampling error.”)

One implication of this is that when there is a statistical relationship in a sample, it is not always clear that there is a statistical relationship in the population.

A small difference between two group means in a sample might indicate that there is a small difference between the two group means in the population.

But it could also be that there is no difference between the means in the population and that the difference in the sample is just a matter of sampling error.

4
Q

any statistical relationship in a sample can be interpreted in two ways:

A

There is a relationship in the population, and the relationship in the sample reflects this.

There is no relationship in the population, and the relationship in the sample reflects only sampling error.

The purpose of null hypothesis testing is simply to help researchers decide between these two interpretations.

5
Q

Null hypothesis testing

A

A formal approach to deciding between two interpretations of a statistical relationship in a sample.

One interpretation is called the null hypothesis (often symbolized H0 and read as “H-zero”).

This is the idea that there is no relationship in the population and that the relationship in the sample reflects only sampling error.

Informally, the null hypothesis is that the sample relationship “occurred by chance.”

The other interpretation is called the alternative hypothesis (often symbolized as H1).

This is the idea that there is a relationship in the population and that the relationship in the sample reflects this relationship in the population.

6
Q

Although there are many specific null hypothesis testing techniques, they are all based on the same general logic. The steps are as follows:

A

Assume for the moment that the null hypothesis is true. There is no relationship between the variables in the population.

Determine how likely the sample relationship would be if the null hypothesis were true.

If the sample relationship would be extremely unlikely, then reject the null hypothesis in favor of the alternative hypothesis.

If it would not be extremely unlikely, then retain the null hypothesis.
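The steps above can be sketched as a short simulation, using a hypothetical coin-flip study (the counts here are made up for illustration): assume the null hypothesis is true, generate the distribution of results it predicts, and check how often a result at least as extreme as the observed one occurs.

```python
import random

random.seed(42)

# Suppose we observe 16 heads in 20 flips and ask: how likely is a result
# at least this extreme if the coin is fair (the null hypothesis)?
observed_heads = 16
n_flips = 20
n_simulations = 10_000

extreme = 0
for _ in range(n_simulations):
    # Step 1: assume the null hypothesis is true (a fair coin).
    heads = sum(random.random() < 0.5 for _ in range(n_flips))
    # Step 2: count simulated results as far from 10 (the expected count)
    # as the observed 16 is, in either direction.
    if abs(heads - n_flips / 2) >= abs(observed_heads - n_flips / 2):
        extreme += 1

p_value = extreme / n_simulations
# Step 3: reject the null hypothesis if the result would be extremely unlikely.
reject_null = p_value <= .05
```

With these numbers, results as extreme as 16 heads are rare under a fair coin, so the null hypothesis would be rejected.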

7
Q

p value

A

A crucial step in null hypothesis testing is finding the probability of the sample result or a more extreme result if the null hypothesis were true.

This probability is called the p value.

A low p value means that the sample or more extreme result would be unlikely if the null hypothesis were true and leads to the rejection of the null hypothesis.

A p value that is not low means that the sample or more extreme result would be likely if the null hypothesis were true and leads to the retention of the null hypothesis.

8
Q

α (alpha)

A

The criterion for how low the p value must be before the sample result is considered unlikely enough to reject the null hypothesis (usually set to .05).

If there is a 5% chance or less of a result at least as extreme as the sample result if the null hypothesis were true, then the null hypothesis is rejected.

When this happens, the result is said to be statistically significant.

If there is greater than a 5% chance of a result as extreme as the sample result when the null hypothesis is true, then the null hypothesis is retained.

This does not necessarily mean that the researcher accepts the null hypothesis as true—only that there is not currently enough evidence to reject it.

Researchers often use the expression “fail to reject the null hypothesis” rather than “retain the null hypothesis,” but they never use the expression “accept the null hypothesis.”

9
Q

Role of Sample Size and Relationship Strength

A

The stronger the sample relationship and the larger the sample, the less likely the result would be if the null hypothesis were true; that is, the lower the p value.

Of course, sometimes the result can be weak and the sample large, or the result can be strong and the sample small. In these cases, the two considerations trade off against each other so that a weak result can be statistically significant if the sample is large enough and a strong relationship can be statistically significant even if the sample is small.

As a general rule, weak relationships based on medium or small samples are never statistically significant, and strong relationships based on medium or larger samples are always statistically significant.

If you keep this lesson in mind, you will often know whether a result is statistically significant based on the descriptive statistics alone.

10
Q

It is extremely useful to be able to develop this kind of intuitive judgment.

A

One reason is that it allows you to develop expectations about how your formal null hypothesis tests are going to come out, which in turn allows you to detect problems in your analyses.

A second reason is that the ability to make this kind of intuitive judgment is an indication that you understand the basic logic of this approach in addition to being able to do the computations.

11
Q

Statistical Significance Versus Practical Significance

A

A statistically significant result is not necessarily a strong one. Even a very weak result can be statistically significant if it is based on a large enough sample.

But the word significant can cause people to interpret these differences as strong and important.

This is why it is important to distinguish between the statistical significance of a result and the practical significance of that result.

12
Q

Practical significance

A

Refers to the importance or usefulness of the result in some real-world context.

Many sex differences are statistically significant—and may even be interesting for purely scientific reasons—but they are not practically significant.

13
Q

t-test

A

A test that involves looking at the difference between two means.

14
Q

one-sample t-test

A

Used to compare a sample mean (M) with a hypothetical population mean (μ0) that provides some interesting standard of comparison.

The null hypothesis is that the mean for the population (µ) is equal to the hypothetical population mean: μ = μ0.

The alternative hypothesis is that the mean for the population is different from the hypothetical population mean: μ ≠ μ0.

To decide between these two hypotheses, we need to find the probability of obtaining the sample mean (or one more extreme) if the null hypothesis were true.

But finding this p value requires first computing a test statistic called t. (A test statistic is a statistic that is computed only to help find the p value.)
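A minimal sketch of the computation, using made-up scores and a hypothetical comparison mean; the one-sample t statistic is (M − μ0) / (SD / √N), with N − 1 degrees of freedom.

```python
import math
import statistics

# Hypothetical data: 10 participants' scores, tested against a
# hypothetical population mean of 5.0.
scores = [6.1, 5.8, 7.2, 5.9, 6.5, 6.8, 5.5, 6.0, 7.1, 6.3]
mu_0 = 5.0
n = len(scores)

m = statistics.mean(scores)           # sample mean (M)
sd = statistics.stdev(scores)         # sample standard deviation
t = (m - mu_0) / (sd / math.sqrt(n))  # the one-sample t statistic
df = n - 1                            # degrees of freedom

t_critical = 2.262                    # two-tailed critical t for df = 9, alpha = .05
statistically_significant = abs(t) > t_critical
```

Here the sample mean (6.32) is far from the hypothetical mean, so t exceeds the critical value and the null hypothesis would be rejected.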

15
Q

The reason the t statistic (or any test statistic) is useful is

A

that we know how it is distributed when the null hypothesis is true.

In practice, we do not have to deal directly with the distribution of t scores.

If we were to enter our sample data and hypothetical mean of interest into one of the online statistical tools in Chapter 12 or into a program like SPSS, the output would include both the t score and the p value.

At this point, the rest of the procedure is simple.

If p is equal to or less than .05, we reject the null hypothesis and conclude that the population mean differs from the hypothetical mean of interest.

If p is greater than .05, we retain the null hypothesis and conclude that there is not enough evidence to say that the population mean differs from the hypothetical mean of interest.

16
Q

critical values

A

The absolute value that a test statistic (e.g., F, t, etc.) must exceed to be considered statistically significant.

Two-tailed critical values: each of these values should be interpreted as a pair of values, one positive and one negative.

The idea is that any t score below the lower critical value is in the lowest 2.5% of the distribution, while any t score above the upper critical value is in the highest 2.5% of the distribution.

Therefore any t score beyond the critical value in either direction is in the most extreme 5% of t scores when the null hypothesis is true and has a p value less than .05.

Thus if the t score we compute is beyond the critical value in either direction, then we reject the null hypothesis.

If the t score we compute is between the upper and lower critical values, then we retain the null hypothesis.

17
Q

two-tailed test

A

Where we reject the null hypothesis if the test statistic for the sample is extreme in either direction (+/-).

This test makes sense when we believe that the sample mean might differ from the hypothetical population mean but we do not have good reason to expect the difference to go in a particular direction.

18
Q

one-tailed test

A

Where we reject the null hypothesis only if the t score for the sample is extreme in one direction that we specify before collecting the data.

This test makes sense when we have good reason to expect the sample mean will differ from the hypothetical population mean in a particular direction.

Each one-tailed critical value can again be interpreted as a pair of values: one positive and one negative.

A t score below the lower critical value is in the lowest 5% of the distribution, and a t score above the upper critical value is in the highest 5% of the distribution.

However, for a one-tailed test, we must decide before collecting data whether we expect the sample mean to be lower than the hypothetical population mean, in which case we would use only the lower critical value, or we expect the sample mean to be greater than the hypothetical population mean, in which case we would use only the upper critical value.

Notice that we still reject the null hypothesis when the t score for our sample is in the most extreme 5% of the t scores we would expect if the null hypothesis were true—so α remains at .05.

We have simply redefined extreme to refer only to one tail of the distribution.

The advantage of the one-tailed test is that critical values are less extreme.

If the sample mean differs from the hypothetical population mean in the expected direction, then we have a better chance of rejecting the null hypothesis.

The disadvantage is that if the sample mean differs from the hypothetical population mean in the unexpected direction, then there is no chance at all of rejecting the null hypothesis.

19
Q

Dependent-Samples t-Test

A

Used to compare two means for the same sample tested at two different times or under two different conditions (sometimes called the paired-samples t-test).

This comparison is appropriate for pretest-posttest designs or within-subjects experiments.

The null hypothesis is that the means at the two times or under the two conditions are the same in the population.

The alternative hypothesis is that they are not the same.

This test can also be one-tailed if the researcher has good reason to expect the difference goes in a particular direction.

It helps to think of the dependent-samples t-test as a special case of the one-sample t-test.

20
Q

the first step in the dependent-samples t-test

A

is to reduce the two scores for each participant to a single difference score by taking the difference between them.

Difference score: A method to reduce pairs of scores (e.g., pre- and post-test) to a single score by calculating the difference between them.

At this point, the dependent-samples t-test becomes a one-sample t-test on the difference scores.

The hypothetical population mean (µ0) of interest is 0 because this is what the mean difference score would be if there were no difference on average between the two times or two conditions.

We can now think of the null hypothesis as being that the mean difference score in the population is 0 (µ = 0) and the alternative hypothesis as being that the mean difference score in the population is not 0 (µ ≠ 0).
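These two steps can be sketched with hypothetical pretest-posttest data (the scores are made up for illustration):

```python
import math
import statistics

# Hypothetical pretest and posttest scores for the same 8 participants.
pretest  = [10, 12, 9, 14, 11, 13, 10, 12]
posttest = [12, 13, 9, 16, 14, 14, 11, 13]

# Step 1: reduce each pair of scores to a single difference score.
differences = [post - pre for pre, post in zip(pretest, posttest)]

# Step 2: run a one-sample t-test on the differences against mu_0 = 0.
n = len(differences)
m_diff = statistics.mean(differences)
sd_diff = statistics.stdev(differences)
t = (m_diff - 0) / (sd_diff / math.sqrt(n))
df = n - 1

t_critical = 2.365   # two-tailed critical t for df = 7, alpha = .05
statistically_significant = abs(t) > t_critical
```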

21
Q

Independent-Samples t-Test

A

Used to compare the means of two separate samples (M1 and M2).

The two samples might have been tested under different conditions in a between-subjects experiment, or they could be pre-existing groups in a cross-sectional design (e.g., women and men, extraverts and introverts).

The null hypothesis is that the means of the two populations are the same: µ1 = µ2.

The alternative hypothesis is that they are not the same: µ1 ≠ µ2.

Again, the test can be one-tailed if the researcher has good reason to expect the difference goes in a particular direction.

The t statistic here is a bit more complicated because it must take into account two sample means, two standard deviations, and two sample sizes.

The formula includes the squared standard deviations (the variances), which appear inside the square root symbol.

Also, lowercase n1 and n2 refer to the sample sizes in the two groups or conditions (as opposed to capital N, which generally refers to the total sample size).

The only additional thing to know here is that there are N − 2 degrees of freedom for the independent-samples t-test.
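A sketch of this computation with hypothetical data for two equal-sized groups, putting the two variances and two sample sizes under the square root as described above:

```python
import math
import statistics

# Hypothetical scores for two separate groups in a between-subjects design.
group1 = [4.2, 5.1, 6.0, 5.5, 4.8, 5.9, 5.2, 4.6]
group2 = [3.1, 4.0, 3.8, 4.5, 3.3, 4.2, 3.6, 4.1]

n1, n2 = len(group1), len(group2)
m1, m2 = statistics.mean(group1), statistics.mean(group2)
var1, var2 = statistics.variance(group1), statistics.variance(group2)

# t from the two means, two variances, and two sample sizes.
t = (m1 - m2) / math.sqrt(var1 / n1 + var2 / n2)
df = (n1 + n2) - 2   # N - 2 degrees of freedom

t_critical = 2.145   # two-tailed critical t for df = 14, alpha = .05
statistically_significant = abs(t) > t_critical
```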

22
Q

The Analysis of Variance

A

When there are more than two groups or condition means to be compared, the most common null hypothesis test is the analysis of variance (ANOVA).

ANOVA: A statistical test used when there are more than two groups or condition means to be compared.

23
Q

One-Way ANOVA

A

Used for between-subjects designs with a single independent variable.

The one-way ANOVA is used to compare the means of more than two samples (M1, M2…MG) in a between-subjects design.

The null hypothesis is that all the means are equal in the population: µ1= µ2 =…= µG.

The alternative hypothesis is that not all the means in the population are equal.

The test statistic for the ANOVA is called F. It is a ratio of two estimates of the population variance based on the sample data.

One estimate of the population variance is called the mean squares between groups (MSB)

The other is called the mean squares within groups (MSW).

The F statistic is the ratio of the MSB to the MSW.

Again, the reason that F is useful is that we know how it is distributed when the null hypothesis is true.

The precise shape of the distribution depends on both the number of groups and the sample size, and there are degrees of freedom values associated with each of these.

The between-groups degrees of freedom is the number of groups minus one: dfB = (G − 1).

The within-groups degrees of freedom is the total sample size minus the number of groups: dfW = N − G.

Again, knowing the distribution of F when the null hypothesis is true allows us to find the p value.

If p is equal to or less than .05, then we reject the null hypothesis and conclude that there are differences among the group means in the population.

If p is greater than .05, then we retain the null hypothesis and conclude that there is not enough evidence to say that there are differences.

In the unlikely event that we would compute F by hand, we can use a table of critical values.

The idea is that any F ratio greater than the critical value has a p value of less than .05.

Thus if the F ratio we compute is beyond the critical value, then we reject the null hypothesis.

If the F ratio we compute is less than the critical value, then we retain the null hypothesis.
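The MSB, MSW, and F computations can be sketched with made-up scores for three groups:

```python
import statistics

# Hypothetical scores for three groups (G = 3) in a between-subjects design.
groups = [
    [2, 3, 4, 3, 3],   # group 1
    [5, 6, 5, 7, 6],   # group 2
    [8, 7, 9, 8, 9],   # group 3
]

G = len(groups)
N = sum(len(g) for g in groups)
grand_mean = statistics.mean(x for g in groups for x in g)

# Mean squares between groups: based on differences among the group means.
ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
df_between = G - 1
ms_between = ss_between / df_between

# Mean squares within groups: based on differences among scores within each group.
ss_within = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)
df_within = N - G
ms_within = ss_within / df_within

F = ms_between / ms_within   # the F ratio: MSB over MSW
```

With these scores the group means differ far more than the scores within groups do, so F lands well above the critical value of about 3.89 for 2 and 12 degrees of freedom, and the null hypothesis would be rejected.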

24
Q

mean squares between groups (MSB)

A

An estimate of the population variance based on the differences among the sample means.

25
Q

mean squares within groups (MSW)

A

An estimate of the population variance based on the differences among the scores within each group.

26
Q

Post Hoc Comparisons

A

An unplanned (not hypothesized) test of which pairs of group mean scores are different from which others.

When we reject the null hypothesis in a one-way ANOVA, we conclude that the group means are not all the same in the population.

But this can indicate different things.

With three groups, it can indicate that all three means are significantly different from each other.

Or it can indicate that one of the means is significantly different from the other two, but the other two are not significantly different from each other.

For this reason, statistically significant one-way ANOVA results are typically followed up with a series of post hoc comparisons of selected pairs of group means to determine which are different from which others.

One approach to post hoc comparisons would be to conduct a series of independent-samples t-tests comparing each group mean to each of the other group means.

But there is a problem with this approach.

In general, if we conduct a t-test when the null hypothesis is true, we have a 5% chance of mistakenly rejecting the null hypothesis.

If we conduct several t-tests when the null hypothesis is true, the chance of mistakenly rejecting at least one null hypothesis increases with each test we conduct.

Thus researchers do not usually make post hoc comparisons using standard t-tests because there is too great a chance that they will mistakenly reject at least one null hypothesis.

Instead, they use one of several modified t-test procedures, among them the Bonferroni procedure, Fisher's least significant difference (LSD) test, and Tukey's honestly significant difference (HSD) test.

The purpose of these procedures is to keep the risk of mistakenly rejecting a true null hypothesis at an acceptable level (close to 5%).
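The inflation of the Type I error rate across multiple tests, and the Bonferroni correction's response to it, can be shown in a few lines (the 1 − (1 − α)^k formula assumes the tests are independent, so this is an approximation):

```python
# If each t-test has a 5% chance of a Type I error when the null hypothesis
# is true, the chance of at least one mistaken rejection across k
# independent tests is 1 - (1 - .05) ** k.
alpha = .05

for k in (1, 3, 6, 10):
    familywise = 1 - (1 - alpha) ** k
    print(f"{k:2d} tests -> chance of at least one Type I error: {familywise:.3f}")

# The Bonferroni procedure compensates by testing each comparison at alpha / k,
# e.g., about .0167 for the three pairwise comparisons among three groups.
k = 3
bonferroni_alpha = alpha / k
```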

27
Q

Repeated-Measures ANOVA

A

Compares the means from the same participants tested under different conditions or at different times; the dependent variable is measured multiple times for each participant.

The basics of the repeated-measures ANOVA are the same as for the one-way ANOVA.

The main difference is that measuring the dependent variable multiple times for each participant allows for a more refined measure of MSW.

Imagine, for example, that the dependent variable in a study is a measure of reaction time.

Some participants will be faster or slower than others because of stable individual differences in their nervous systems, muscles, and other factors.

In a between-subjects design, these stable individual differences would simply add to the variability within the groups and increase the value of MSW (which would, in turn, decrease the value of F).

In a within-subjects design, however, these stable individual differences can be measured and subtracted from the value of MSW.

This lower value of MSW means a higher value of F and a more sensitive test.

28
Q

Factorial ANOVA

A

A statistical method to detect differences in the means between conditions when there are two or more independent variables in a factorial design.

It allows the detection of main effects and interaction effects.

The basics of the factorial ANOVA are the same as for the one-way and repeated-measures ANOVAs.

The main difference is that it produces an F ratio and p value for each main effect and for each interaction.

For example, imagine that a health psychologist tests the effect of participant major (psychology vs. nutrition) and food type (cookie vs. hamburger) in a factorial design.

A factorial ANOVA would produce separate F ratios and p values for the main effect of major, the main effect of food type, and the interaction between major and food.

Appropriate modifications must be made depending on whether the design is between-subjects, within-subjects, or mixed.

29
Q

Testing Correlation Coefficients

A

For relationships between quantitative variables, where Pearson’s r (the correlation coefficient) is used to describe the strength of those relationships, the appropriate null hypothesis test is a test of the correlation coefficient.

The basic logic is exactly the same as for other null hypothesis tests.

In this case, the null hypothesis is that there is no relationship in the population.

We can use the Greek lowercase rho (ρ) to represent the relevant parameter: ρ = 0.

The alternative hypothesis is that there is a relationship in the population: ρ ≠ 0.

As with the t-test, this test can be two-tailed if the researcher has no expectation about the direction of the relationship or one-tailed if the researcher expects the relationship to go in a particular direction.

It is possible to use the correlation coefficient for the sample to compute a t score with N − 2 degrees of freedom and then to proceed as for a t-test.

However, because of the way it is computed, the correlation coefficient can also be treated as its own test statistic.

The online statistical tools and statistical software such as Excel and SPSS generally compute the correlation coefficient and provide the p value associated with that value.

As always, if the p value is equal to or less than .05, we reject the null hypothesis and conclude that there is a relationship between the variables in the population.

If the p value is greater than .05, we retain the null hypothesis and conclude that there is not enough evidence to say there is a relationship in the population.

If we compute the correlation coefficient by hand, we can use a table which shows the critical values of r for various sample sizes when α is .05.

A sample value of the correlation coefficient that is more extreme than the critical value is statistically significant.
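A sketch with hypothetical paired scores, computing Pearson's r from its definitional formula and converting it to a t score with N − 2 degrees of freedom:

```python
import math
import statistics

# Hypothetical paired scores on two quantitative variables.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 3.8, 5.5, 6.1]

n = len(x)
mx, my = statistics.mean(x), statistics.mean(y)

# Pearson's r: sum of cross-products over the square root of the
# product of the sums of squares.
num = sum((a - mx) * (b - my) for a, b in zip(x, y))
den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
r = num / den

# The sample r can be converted to a t score with N - 2 degrees of freedom.
df = n - 2
t = r * math.sqrt(df / (1 - r ** 2))
```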

30
Q

Errors in Null Hypothesis Testing

A

In null hypothesis testing, the researcher tries to draw a reasonable conclusion about the population based on the sample. Unfortunately, this conclusion is not guaranteed to be correct.

Decision   | H0 is false      | H0 is true
Reject H0  | Correct decision | Type I error
Retain H0  | Type II error    | Correct decision

31
Q

Type I error

A

A false positive in which the researcher concludes that their results are statistically significant when in reality there is no real effect in the population and the results are due to chance.

In other words, rejecting the null hypothesis when it is true.

In fact, when the null hypothesis is true and α is .05, we will mistakenly reject the null hypothesis 5% of the time. (This possibility is why α is sometimes referred to as the “Type I error rate.”)

This provides some insight into why the convention is to set α to .05. There is some agreement among researchers that the .05 level of α keeps the rates of both Type I and Type II errors at acceptable levels.
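The 5% rate can be checked by simulation: draw both groups from the same population (so the null hypothesis is true) many times and count how often an independent-samples t-test mistakenly rejects. The data here are randomly generated, and 2.101 is the two-tailed critical t for df = 18 at α = .05.

```python
import math
import random
import statistics

random.seed(1)

n_per_group = 10
t_critical = 2.101      # two-tailed critical t for df = 18, alpha = .05
n_experiments = 4000

false_positives = 0
for _ in range(n_experiments):
    # Both groups come from the same population: the null hypothesis is true.
    g1 = [random.gauss(0, 1) for _ in range(n_per_group)]
    g2 = [random.gauss(0, 1) for _ in range(n_per_group)]
    t = (statistics.mean(g1) - statistics.mean(g2)) / math.sqrt(
        statistics.variance(g1) / n_per_group
        + statistics.variance(g2) / n_per_group
    )
    if abs(t) > t_critical:
        false_positives += 1   # a Type I error

type_i_rate = false_positives / n_experiments   # should be close to .05
```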

32
Q

Type II error

A

A missed opportunity in which the researcher concludes that their results are not statistically significant when in reality there is a real effect in the population and they just missed detecting it.

In other words, retaining the null hypothesis when it is false.

In practice, Type II errors occur primarily because the research design lacks adequate statistical power to detect the relationship (e.g., the sample is too small).

It is also possible to reduce the chance of a Type II error by setting α to something greater than .05 (e.g., .10). But making it easier to reject false null hypotheses also makes it easier to reject true ones and therefore increases the chance of a Type I error.


33
Q

The possibility of committing Type I and Type II errors has several important implications for interpreting the results of our own and others’ research.

A

One is that we should be cautious about interpreting the results of any individual study because there is a chance that it reflects a Type I or Type II error.

This possibility is why researchers consider it important to replicate their studies.

Each time researchers replicate a study and find a similar result, they rightly become more confident that the result represents a real phenomenon and not just a Type I or Type II error.

34
Q

file drawer problem

A

The problem that research results which fail to find a statistically significant effect tend not to be published.

As a consequence, the published literature fails to contain a full representation of the positive and negative findings about a research question.

When researchers obtain non-significant results, they tend not to submit them for publication, or if they do submit them, journal editors and reviewers tend not to accept them.

One effect of this tendency is that the published literature probably contains a higher proportion of Type I errors than we might expect on the basis of statistical considerations alone.

Even when there is a relationship between two variables in the population, the published research literature is likely to overstate the strength of that relationship.

The file drawer problem is a difficult one because it is a product of the way scientific research has traditionally been conducted and published.

35
Q

Solutions to file drawer problem

A
  1. Registered reports, whereby journal editors and reviewers evaluate research submitted for publication without knowing the results of that research.

The idea is that if the research question is judged to be interesting and the method judged to be sound, then a non-significant result should be just as important and worthy of publication as a significant one.

  2. Short of such a radical change in how research is evaluated for publication, researchers can still take pains to keep their non-significant results and share them as widely as possible (e.g., in publicly available repositories and at professional conferences).
  3. Many scientific disciplines now have journals devoted to publishing non-significant results.
36
Q

p-hacking

A

When researchers make various decisions in the research process to increase their chance of a statistically significant result (and of a Type I error), such as arbitrarily removing outliers, selectively reporting dependent variables, or presenting only significant results, until their analyses yield a desirable p value.

Work documenting these practices contributed to a major conversation in the field about publishing standards and improving the reliability of results that continues today.

37
Q

Statistical Power

A

In research design, it means the probability of rejecting the null hypothesis given the sample size and expected relationship strength.

For example, the statistical power of a study with 50 participants and an expected Pearson’s r of +.30 in the population is .59. That is, there is a 59% chance of rejecting the null hypothesis if indeed the population correlation is +.30.

Statistical power is the complement of the probability of committing a Type II error.

So in this example, the probability of committing a Type II error would be 1 − .59 = .41.

Clearly, researchers should be interested in the power of their research designs if they want to avoid making Type II errors.

In particular, they should make sure their research design has adequate power before collecting data.

A common guideline is that a power of .80 is adequate.

This guideline means that there is an 80% chance of rejecting the null hypothesis for the expected relationship strength.
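The .59 figure can be approximated by simulation. This is a Monte Carlo sketch: the critical value .279 for N = 50 (df = 48, α = .05) is approximate, and the bivariate data are generated purely for illustration.

```python
import math
import random

random.seed(2)

# Estimate the power to detect a population correlation of rho = .30
# with N = 50 at alpha = .05 (two-tailed).
rho, n = 0.30, 50
r_critical = 0.279   # approximate critical r for df = 48, alpha = .05
n_studies = 3000

def sample_r(rho, n):
    """Draw n pairs from a bivariate normal with correlation rho and
    return the sample Pearson r."""
    xs = [random.gauss(0, 1) for _ in range(n)]
    ys = [rho * x + math.sqrt(1 - rho ** 2) * random.gauss(0, 1) for x in xs]
    mx = sum(xs) / n
    my = sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs)
                    * sum((y - my) ** 2 for y in ys))
    return num / den

# Power: the proportion of simulated studies that reject the null hypothesis.
rejections = sum(abs(sample_r(rho, n)) > r_critical for _ in range(n_studies))
power = rejections / n_studies   # should land near the .59 cited above
```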

38
Q

What should you do if you discover that your research design does not have adequate power?

A

Given that statistical power depends primarily on relationship strength and sample size, there are essentially two steps you can take to increase statistical power: increase the strength of the relationship or increase the sample size.

Increasing the strength of the relationship can sometimes be accomplished by using a stronger manipulation or by more carefully controlling extraneous variables to reduce the amount of noise in the data (e.g., by using a within-subjects design rather than a between-subjects design).

The usual strategy, however, is to increase the sample size.

For any expected relationship strength, there will always be some sample large enough to achieve adequate power.

39
Q

Criticisms of Null Hypothesis Testing

A
  1. Some criticisms of null hypothesis testing focus on researchers’ misunderstanding of it.

We have already seen, for example, that the p value is widely misinterpreted as the probability that the null hypothesis is true. (Recall that it is really the probability of the sample result if the null hypothesis were true.)

A closely related misinterpretation is that 1 − p equals the probability of replicating a statistically significant result.

  2. Another set of criticisms focuses on the logic of null hypothesis testing.

To many, the strict convention of rejecting the null hypothesis when p is less than .05 and retaining it when p is greater than .05 makes little sense.

This criticism does not have to do with the specific value of .05 but with the idea that there should be any rigid dividing line between results that are considered significant and results that are not.

Imagine two studies on the same statistical relationship with similar sample sizes. One has a p value of .04 and the other a p value of .06.

Although the two studies have produced essentially the same result, the former is likely to be considered interesting and worthy of publication and the latter simply not significant.

This convention is likely to prevent good research from being published and to contribute to the file drawer problem.

  3. Yet another set of criticisms focuses on the idea that null hypothesis testing—even when understood and carried out correctly—is simply not very informative.

Recall that the null hypothesis is that there is no relationship between variables in the population.

So to reject the null hypothesis is simply to say that there is some nonzero relationship in the population. But this assertion is not really saying very much.

Imagine if chemistry could tell us only that there is some relationship between the temperature of a gas and its volume—as opposed to providing a precise equation to describe that relationship.

Some critics even argue that the relationship between two variables in the population is never precisely 0 if it is measured to enough decimal places.

In other words, the null hypothesis is never literally true.

So rejecting it does not tell us anything we did not already know!

40
Q

Defense of null hypothesis testing

A

One of them, Robert Abelson, has argued that when it is correctly understood and carried out, null hypothesis testing does serve an important purpose.

Especially when dealing with new phenomena, it gives researchers a principled way to convince others that their results should not be dismissed as mere chance occurrences.

41
Q

What to Do?

confidence intervals

A

A range of values that is computed in such a way that some percentage of the time (usually 95%) the population parameter will lie within that range.

Advocates of confidence intervals argue that they are much easier to interpret than null hypothesis tests.

Another advantage of confidence intervals is that they provide the information necessary to do null hypothesis tests should anyone want to.
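As a concrete sketch of the idea, here is a minimal Python example (standard library only; it uses a normal approximation, so a t-based interval would be slightly wider for a sample this small, and the data are hypothetical) that computes a 95% confidence interval for a sample mean:

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_ci(sample, confidence=0.95):
    """Normal-approximation confidence interval for a sample mean."""
    m = mean(sample)
    se = stdev(sample) / sqrt(len(sample))  # standard error of the mean
    z = NormalDist().inv_cdf((1 + confidence) / 2)  # ~1.96 for 95%
    return (m - z * se, m + z * se)

scores = [12, 15, 11, 14, 13, 16, 12, 15, 14, 13]
lo, hi = mean_ci(scores)
print(f"95% CI for the mean: ({lo:.2f}, {hi:.2f})")
```

If intervals like this were computed for many independent samples, roughly 95% of them would contain the true population mean—which is also why any null-hypothesized value lying outside the interval would be rejected at the .05 level.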

42
Q

What to Do?

A

Even those who defend null hypothesis testing recognize many of the problems with it.

But what should be done?

Some suggestions now appear in the APA Publication Manual.

One is that each null hypothesis test should be accompanied by an effect size measure such as Cohen’s d or Pearson’s r.

By doing so, the researcher provides an estimate of how strong the relationship in the population is—not just whether there is one or not.

(Remember that the p value cannot substitute as a measure of relationship strength because it also depends on the sample size. Even a very weak result can be statistically significant if the sample is large enough.)
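The parenthetical point can be made concrete with a short sketch (hypothetical data, standard library only; the p value comes from a z approximation rather than a proper t-test): the effect size stays essentially the same, but the p value collapses as the sample grows.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def cohens_d(g1, g2):
    """Cohen's d: the mean difference in pooled-standard-deviation units."""
    n1, n2 = len(g1), len(g2)
    pooled_var = ((n1 - 1) * stdev(g1) ** 2 + (n2 - 1) * stdev(g2) ** 2) / (n1 + n2 - 2)
    return (mean(g1) - mean(g2)) / sqrt(pooled_var)

def approx_p(g1, g2):
    """Two-tailed p from a z approximation (a t-test is more exact for small n)."""
    se = sqrt(stdev(g1) ** 2 / len(g1) + stdev(g2) ** 2 / len(g2))
    z = abs(mean(g1) - mean(g2)) / se
    return 2 * (1 - NormalDist().cdf(z))

small_a, small_b = [5, 6, 7, 8], [6, 7, 8, 9]
big_a, big_b = small_a * 25, small_b * 25  # same pattern of scores, 100 per group

d_small, p_small = cohens_d(small_a, small_b), approx_p(small_a, small_b)
d_big, p_big = cohens_d(big_a, big_b), approx_p(big_a, big_b)
print(f"small n: d = {d_small:.2f}, p = {p_small:.3f}")
print(f"large n: d = {d_big:.2f}, p = {p_big:.3g}")
```

The small-sample result is not significant while the large-sample one is, even though the strength of the relationship is about the same—exactly why an effect size should be reported alongside the p value.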

43
Q

What to Do?

Bayesian statistics

A

An approach in which the researcher specifies the probability that the null hypothesis and any important alternative hypotheses are true before conducting the study, conducts the study, and then updates the probabilities based on the data.
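A toy example of the updating step (the prior and likelihood numbers here are hypothetical; in a real Bayesian analysis the likelihoods come from a statistical model of the data):

```python
def update(prior_h0, likelihood_h0, likelihood_h1):
    """Bayes' rule: posterior probability of H0 after seeing the data."""
    prior_h1 = 1 - prior_h0
    evidence = prior_h0 * likelihood_h0 + prior_h1 * likelihood_h1
    return prior_h0 * likelihood_h0 / evidence

# Before the study, H0 and H1 are judged equally plausible.
# Suppose the observed data are 4 times as likely under H1 as under H0.
posterior = update(prior_h0=0.5, likelihood_h0=0.1, likelihood_h1=0.4)
print(round(posterior, 2))  # → 0.2
```

The researcher's belief in the null hypothesis drops from .50 to .20—a graded statement about the hypotheses themselves, which a p value cannot provide.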

It is too early to say whether this approach will become common in psychological research.

For now, null hypothesis testing—supported by effect size measures and confidence intervals—remains the dominant approach.

44
Q

replicability crisis

A

A phrase that refers to the inability of researchers to replicate earlier research findings.

Of course, a failure to replicate a result does not by itself discredit the original study: differences in statistical power, the populations sampled, or the procedures used, or even the effects of moderating variables, could explain the differing results.

45
Q

Although many believe that the failure to replicate research results…

A

is an expected characteristic of cumulative scientific progress, others have interpreted this situation as evidence of systematic problems with conventional scholarship in psychology. These include a publication bias that favors the discovery and publication of counter-intuitive but statistically significant findings over the duller (but incredibly vital) process of replicating previous findings to test their robustness.

46
Q

Worse still is the suggestion that the low replicability of many studies is evidence of the widespread use of questionable research practices by psychological researchers.

These may include:

A
  1. The selective deletion of outliers in order to influence (usually by artificially inflating) statistical relationships among the measured variables.
  2. The selective reporting of results, cherry-picking only those findings that support one’s hypotheses.
  3. Mining the data without an a priori hypothesis, only to claim that a statistically significant result had been originally predicted, a practice referred to as “HARKing” or hypothesizing after the results are known (Kerr, 1998[8]).
  4. A practice colloquially known as “p-hacking” (briefly discussed in the previous section), in which a researcher might perform inferential statistical calculations to see if a result was significant before deciding whether to recruit additional participants and collect more data (Head, Holman, Lanfear, Kahn, & Jennions, 2015)[9]. As you have learned, the probability of finding a statistically significant result is influenced by the number of participants in the study.
  5. Outright fabrication of data (as in the case of Diederik Stapel, described at the start of Chapter 3), although this would be a case of fraud rather than a “research practice.”
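The optional-stopping practice in item 4 can be demonstrated with a short simulation (a hypothetical sketch using only the Python standard library; the z-test is an approximation to a proper t-test). Both groups are drawn from the same population, so the null hypothesis is true, yet peeking at the result and collecting more data only when the first test fails pushes the false-positive rate above the nominal 5%:

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

def z_test_p(a, b):
    """Two-tailed p value from a z approximation to the two-sample t-test."""
    se = sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    z = abs(mean(a) - mean(b)) / se
    return 2 * (1 - NormalDist().cdf(z))

def run_study(rng, peek=False):
    """Simulate one study in which the null hypothesis is actually true."""
    a = [rng.gauss(0, 1) for _ in range(20)]
    b = [rng.gauss(0, 1) for _ in range(20)]
    if z_test_p(a, b) < 0.05:
        return True
    if peek:  # p-hack: the first test "failed," so add 20 more per group and retest
        a += [rng.gauss(0, 1) for _ in range(20)]
        b += [rng.gauss(0, 1) for _ in range(20)]
        return z_test_p(a, b) < 0.05
    return False

rng = random.Random(1)
trials = 4000
honest = sum(run_study(rng) for _ in range(trials)) / trials
hacked = sum(run_study(rng, peek=True) for _ in range(trials)) / trials
print(f"false-positive rate, honest: {honest:.3f}, with peeking: {hacked:.3f}")
```

The honest procedure stays near the nominal .05 rate, while the peeking procedure effectively tests the same hypothesis twice and so exceeds it.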
47
Q

It is important to shed light on these questionable research practices to

A

ensure that current and future researchers (such as yourself) understand the damage they wreak on the integrity and reputation of our discipline.

48
Q

However, in addition to highlighting what not to do, this so-called “crisis” has also highlighted the importance of enhancing scientific rigor by:

A
  1. Designing and conducting studies that have sufficient statistical power, in order to increase the reliability of findings.
  2. Publishing both null and significant findings (thereby counteracting the publication bias and reducing the file drawer problem).
  3. Describing one’s research designs in sufficient detail to enable other researchers to replicate your study using an identical or at least very similar procedure.
  4. Conducting high-quality replications and publishing these results (Brandt et al., 2014).
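Item 1 on this list, adequate statistical power, can be estimated before running a study. Here is a minimal sketch (normal approximation, Python standard library only) for a two-group design:

```python
from math import sqrt
from statistics import NormalDist

def power_two_group(d, n_per_group, alpha=0.05):
    """Approximate power of a two-tailed, two-group test to detect effect size d."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    expected_z = abs(d) * sqrt(n_per_group / 2)  # expected test statistic under H1
    return 1 - NormalDist().cdf(z_crit - expected_z)

# For a medium effect (d = 0.5), roughly 64 participants per group
# are needed to reach the conventional 80% power.
for n in (20, 64, 200):
    print(n, round(power_two_group(0.5, n), 2))
```

Power rises steeply with sample size, which is why underpowered studies both miss real effects and, when they do reach significance, tend to overestimate them.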
49
Q

open science practices

A

Practices in which researchers openly share their research materials with other researchers in the hope of increasing the transparency and openness of the scientific enterprise.

Journals now issue digital badges to researchers who pre-registered their hypotheses and data analysis plans, openly shared their research materials with other researchers (e.g., to enable attempts at replication), or made their raw data available to other researchers.

These initiatives, which have been spearheaded by the Center for Open Science, have led to the development of the Transparency and Openness Promotion (TOP) guidelines.