Yarkoni & Westfall (2017): Choosing Prediction over Explanation in Psychology Flashcards
The goal of scientific psychology is to understand human behaviour. Historically this has meant being able both to explain behaviour, that is, to accurately describe its causal underpinnings, and to predict behaviour. How are these two goals usually treated in practice?
These two goals are rarely distinguished. The understanding seems to be that the two are so deeply intertwined that there would be little point in distinguishing them, except perhaps as a philosophical exercise. According to this understanding, explanation necessarily facilitates prediction.
Is it the case that the model that best approximates the mental processes producing an observed behaviour will also be the best at predicting it?
Unfortunately, although explanation and prediction may be philosophically compatible, there are good reasons to think that they are often in statistical and pragmatic tension with one another.
From a statistical standpoint, why might explanatory models not have the best predictive accuracy?
From a statistical standpoint, it is simply not true that the model that most closely approximates the data-generating process will in general be the most successful at predicting real-world outcomes. Overfitting can often lead a biased, psychologically implausible model to outperform a mechanistically more accurate but also more complex model, because the complex model's extra parameters end up fitting noise in the sample.
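To make this concrete, here is a minimal simulation sketch in Python (the data-generating process, sample sizes, and use of numpy/scikit-learn are illustrative assumptions, not taken from the paper): a deliberately simplified model that ignores most of the true causes out-predicts the "correct", more complex model when the training sample is small.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# "True" process: the outcome depends weakly on 20 predictors plus noise.
n_train, n_test, p = 30, 10_000, 20
true_betas = rng.normal(0, 0.15, size=p)   # many small real effects

def simulate(n):
    X = rng.normal(size=(n, p))
    y = X @ true_betas + rng.normal(0, 1.0, size=n)
    return X, y

X_tr, y_tr = simulate(n_train)
X_te, y_te = simulate(n_test)

# Complex model: mechanistically "correct" form, estimates all 20 effects.
complex_fit = LinearRegression().fit(X_tr, y_tr)

# Simple (biased) model: ignores 19 of the 20 real causes.
simple_fit = LinearRegression().fit(X_tr[:, :1], y_tr)

print("complex test MSE:", mean_squared_error(y_te, complex_fit.predict(X_te)))
print("simple  test MSE:", mean_squared_error(y_te, simple_fit.predict(X_te[:, :1])))
# With a training sample this small, the simple model typically predicts
# better despite being "wrong": the complex model's estimates are too noisy.
```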
What may scientists in many areas of psychology have to choose between?
(a) developing complex models that can accurately predict outcomes of interest but fail to respect known psychological or neurobiological constraints and
(b) building simple models that appear theoretically elegant but have very limited capacity to predict actual human behavior.
What does this decision mean practically speaking?
A researcher cannot know in advance whether there is a relatively simple explanatory model waiting to be found, so they must decide on a case-by-case basis what to prioritise: identifying abstract, generalisable principles, or maximising predictive accuracy regardless of how that goal is achieved.
Why is it posited that explanation has been favoured in the past? Why might this change?
The methods of successful predictive science were poorly understood and rarely deployed in most fields of social and biomedical science. This may change thanks to recent advances in machine learning, where prediction of unobserved data is the gold standard and explanation is typically of little interest, together with the growing availability of large datasets of human behaviour.
What are the two separate senses in which psychologists have been deficient when it comes to predicting behaviour?
First, research papers in psychology rarely take steps to verify that the models they propose are capable of predicting the behavioral outcomes they are purportedly modelling.
Second, there is mounting evidence from the ongoing replication crisis that the published results of many papers in psychology do not, in fact, hold up when the same experiments and analyses are independently conducted at a later date.
Instead of testing predictions, what are psychological models typically evaluated on?
Instead, research is typically evaluated based either on “goodness of fit” between the statistical model and the sample data or on whether the sizes and directions of certain regression coefficients match what is implied by different theoretical perspectives.
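For contrast, a sketch of that conventional evaluation style (using statsmodels; the data and variables are placeholders, not from the paper): the researcher inspects in-sample fit statistics and coefficient signs, none of which involves predicting observations the model has not seen.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 3))
y = 0.4 * X[:, 0] - 0.2 * X[:, 1] + rng.normal(size=100)

# Conventional evaluation: fit the model to the whole sample and inspect it.
fit = sm.OLS(y, sm.add_constant(X)).fit()
print(fit.rsquared)             # "goodness of fit" to this particular sample
print(fit.params, fit.pvalues)  # do signs and sizes match the theory?
# Nothing here asks how well the fitted equation predicts new data.
```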
What implications does the lack of replicability have for prediction?
Models that are held up as good explanations of behavior in an initial sample routinely fail to accurately predict the same behaviors in future samples—even when the experimental procedures are closely matched.
What is likely the reason for this replication failure?
P-hacking: flexible, data-contingent analysis decisions made after seeing the data, which capitalise on chance patterns in the sample.
A large number of psychology articles prominently feature the word "prediction" in their titles. What is the problem with this?
Such assertions reflect the intuitive idea that a vast range of statistical models (e.g., regression) are, in a sense, predictive models. When a researcher obtains a coefficient of determination (R^2) of, say, 0.50, and thus reports that she is able to "predict" 50% of the variance in an outcome using her set of predictors, she is implicitly claiming that she would be able to make reasonably accurate predictions about that outcome for a random person drawn from the same underlying population. The problem lies in the inference that the parameter estimates obtained in the sample at hand will perform comparably well when applied to other samples drawn from the same population.

It helps to distinguish two things: equation 1, the general model form (e.g., y = b0 + b1x1 + b2x2 + e), and equation 2, that same equation with the specific coefficient values estimated in the present sample plugged in. The R^2 statistic answers the question: in repeated random samples similar to this one, if one fits a model of the form of equation 1 in each new sample, each time estimating new values of b0, b1, and b2, what will be the average proportional reduction in the sum of squared errors? In other words, R^2 does not estimate the performance of the specific equation 2 but rather of the more general equation 1. It turns out that the performance of equation 1 is virtually always an overly optimistic estimate of the performance of equation 2.
Why is the performance of equation 1 virtually always an overly optimistic estimate of the performance of equation 2?
The values of b0, b1, and b2 estimated in any given sample are specifically selected so as to minimise the sum of squared errors in that particular sample; they therefore capitalise on chance patterns in that sample that will not recur in new samples.
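A small simulation sketch of this optimism (the sample size, number of predictors, and effect sizes are arbitrary illustrative choices): the R^2 obtained in the sample used for fitting routinely exceeds the R^2 the same fitted coefficients achieve in a fresh sample from the same population.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
n, p = 50, 10
betas = rng.normal(0, 0.3, size=p)   # modest true effects

def sample(n):
    X = rng.normal(size=(n, p))
    y = X @ betas + rng.normal(size=n)
    return X, y

in_sample_r2, out_sample_r2 = [], []
for _ in range(500):
    X_a, y_a = sample(n)   # sample the coefficients are estimated in
    X_b, y_b = sample(n)   # a new sample from the same population
    fit = LinearRegression().fit(X_a, y_a)
    in_sample_r2.append(fit.score(X_a, y_a))               # what gets reported (equation 1)
    out_sample_r2.append(r2_score(y_b, fit.predict(X_b)))  # the specific fit (equation 2) in new data

print("mean in-sample R^2:    ", np.mean(in_sample_r2))
print("mean out-of-sample R^2:", np.mean(out_sample_r2))
# The coefficients were chosen to minimise error in sample A, so their
# apparent fit there overstates how well they transfer to sample B.
```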
How does machine learning avoid the same problems with overfitting?
Models are evaluated against a separate test dataset that was not used for fitting, and researchers explicitly distinguish training error from test error (often via cross-validation).
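A sketch of that workflow with scikit-learn (the dataset and model are placeholders): the model is fitted on a training set and judged on a held-out test set, and k-fold cross-validation repeats the split so that every observation is scored by a model that never saw it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 15))
y = X[:, 0] * 0.5 + rng.normal(size=200)   # only one predictor truly matters

# Training error vs test error: fit on one subset, evaluate on another.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)
print("training R^2:", model.score(X_tr, y_tr))   # flattered by overfitting
print("test R^2:    ", model.score(X_te, y_te))   # honest estimate of generalisation

# K-fold cross-validation: every case is predicted by a model that never saw it.
cv_r2 = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print("5-fold CV R^2:", cv_r2.mean())
```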
When are the problems of overfitting negligible, and when are they most pronounced?
When predictors have strong effects and researchers fit relatively compact models in large samples, overfitting is negligible. As the number of predictors increases and/or sample size and effect size drop, overfitting increases, sometimes dramatically.
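A small sketch of that pattern (sample sizes, predictor counts, and the single real effect are arbitrary assumptions): the gap between in-sample and out-of-sample R^2, i.e. the optimism, grows as predictors are added and the sample shrinks.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(3)

def optimism(n, p, reps=300):
    """Average gap between in-sample and out-of-sample R^2."""
    gaps = []
    for _ in range(reps):
        beta = np.zeros(p)
        beta[0] = 0.5   # one real effect, the rest are pure noise predictors
        X_a = rng.normal(size=(n, p)); y_a = X_a @ beta + rng.normal(size=n)
        X_b = rng.normal(size=(n, p)); y_b = X_b @ beta + rng.normal(size=n)
        fit = LinearRegression().fit(X_a, y_a)
        gaps.append(fit.score(X_a, y_a) - r2_score(y_b, fit.predict(X_b)))
    return np.mean(gaps)

for n in (400, 50):
    for p in (2, 20):
        print(f"n={n:4d}, p={p:2d}: optimism = {optimism(n, p):.2f}")
# Optimism is tiny for n=400, p=2 and large for n=50, p=20.
```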
What do Yarkoni and Westfall refer to as procedural overfitting?
P-hacking can be usefully conceptualised as a special case of overfitting. Specifically, it can be thought of as a form of procedural overfitting that takes place prior to (or in parallel with) model estimation.
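A toy simulation sketch of procedural overfitting (the data are purely null, and the number of candidate analyses is an arbitrary assumption): running many data-contingent analysis variants and keeping the best-looking one inflates the false-positive rate far beyond the nominal 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, n_variants, reps = 40, 20, 2000   # 20 "flexible" analysis choices per dataset

false_positives = 0
for _ in range(reps):
    y = rng.normal(size=(n, n_variants))   # 20 null outcomes / analysis variants
    x = rng.normal(size=n)                 # predictor with no true effect
    # "p-hacking": run every variant, keep the smallest p-value.
    pvals = [stats.pearsonr(x, y[:, j])[1] for j in range(n_variants)]
    false_positives += min(pvals) < 0.05

print("nominal alpha: 0.05")
print("actual false-positive rate:", false_positives / reps)
# Selecting the best of many analyses fits the procedure to noise in the
# sample, just as selecting coefficients fits the model to noise.
```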