lecture 4 - model fitting Flashcards

1
Q

types of models

A
  1. model animal
  2. algorithmic model
  3. artificial neural network
  4. data-driven models
2
Q

model animal

A
  • model animals allow researchers to draw conclusions that may generalize across species
  • e.g. mice are often used as models to study biological processes and behaviors relevant to humans
3
Q

algorithmic model

A
  • never touches data
  • relies on algorithms or theoretical constructs
  • they are abstract and typically focus on understanding or simulating processes in a hypothetical or idealized way
4
Q

artificial neural network

A
  • more of a tool than a scientific model
  • typically applied in engineering contexts to process data, rather than to explain underlying biological mechanisms
  • do mimic some properties of biological neural networks
5
Q

data-driven model

A
  • used by scientists to explain data
  • explicitly created to analyze and interpret real data, helping scientists draw insights directly from empirical observations
6
Q

George Box: ‘All models are wrong. But some models are useful’

A
  • emphasizes that no model can perfectly represent reality, because models are simplifications of complex systems.
  • they leave out details and assumptions, and therefore cannot fully capture the intricacies of real-world phenomena.
  • however, despite their limitations, models can still be valuable tools.
7
Q

descriptive models

A
  • mathematical description of the data
  • ‘fitting’ is important
  • fitted parameters can be assessed, but are properties of the data more than of the model
8
Q

process models

A
  • mathematical description of the process that gave rise to the data
  • ‘fitting’ is important
  • fitted parameters have meaning because they tell us something about the generative process - i.e., how the data was produced
9
Q

utility of descriptive models

A
  1. gaussian distribution (central limit theorem): models noise
  2. n-degree polynomial: describe the shape of data
10
Q

utility of process models

A
  • parameters have cognitive/neural meaning

E.G.,

  • parameters quantify the process by which the brain reaches a decision
  • modeler commits to latent variables (e.g., action value in RL)
11
Q

why are process models harder to formulate

A
  • because you have to think about the underlying process, not just the data
  • i.e., they require a deep understanding of the cognitive or neural mechanisms, not just the ability to fit a dataset
12
Q

process models or descriptive models

A
  • many models are a mix of both, partly descriptive and partly process
  • e.g., SDT
13
Q

occam’s razor/rule

A
  • helps decide what the better model is
  • lex parsimoniae: suggests that “entities are not to be multiplied without necessity.”
  • if two models give an accurate description of the data, the simpler model is to be preferred
14
Q

why choose the simpler model

A
  1. generalization: a good model is not dependent on the experiment: it generalizes.
  2. there is always a model with more parameters that gives a better fit, so raw fit alone cannot decide
15
Q

key questions for model selection

A
  1. does the data require more complexity: if a simpler model fits the data adequately, adding complexity might not be necessary.
  2. are you fitting the process or the noise: the goal is to model the actual process that generated the data, not the random fluctuations or noise within it.
16
Q

overfitting

A

happens when a model has too many parameters for the data and fits the noise in a specific dataset rather than the true underlying pattern

17
Q

cross-validation

A
  • main method against overfitting
  • splits the data into fit (train) and test datasets
  • if you’re fitting noise that is unique to the training set, your model should fail to predict the test data (cross-validated r² is then distributed around zero: CV r² ~ N(0, σ))
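The split-and-test logic can be sketched as follows. The data, the degree-1 vs. degree-15 comparison, and the r² scoring are illustrative assumptions, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a linear process plus Gaussian noise.
x = np.linspace(0, 1, 40)
y = 2.0 * x + rng.normal(0.0, 0.1, size=x.size)

train, test = slice(0, 20), slice(20, 40)  # fit (train) vs. test split

def cv_r2(degree):
    # Fit a polynomial on the training half, score r^2 on the held-out half.
    coefs = np.polyfit(x[train], y[train], degree)
    pred = np.polyval(coefs, x[test])
    ss_res = np.sum((y[test] - pred) ** 2)
    ss_tot = np.sum((y[test] - y[test].mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# A degree-1 model matches the process and predicts the test data;
# a degree-15 model fits training noise and fails on the test data.
```

A model that fit only training noise scores a cross-validated r² near zero or below, as the card describes.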
18
Q

comparing model fits

A
  1. cross validation: splits data into training and test sets to check if the model generalizes well to unseen data
  2. information criteria: downweight the quality of fit of a model by penalizing the number of parameters.
19
Q

types of information criteria

A
  1. bayesian information criterion (BIC): -2ln(L) + k * ln(n)
  2. akaike information criterion (AIC): -2ln(L) + 2k

k = number of parameters, n = sample size, ln(L) = log-likelihood
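Both criteria can be computed directly from a model's log-likelihood; the numbers below are made-up illustrations:

```python
import math

def aic(log_l, k):
    # AIC = -2 ln(L) + 2k
    return -2.0 * log_l + 2 * k

def bic(log_l, k, n):
    # BIC = -2 ln(L) + k ln(n)
    return -2.0 * log_l + k * math.log(n)

# Hypothetical comparison on n = 100 observations:
# model A: ln(L) = -120 with k = 3; model B: ln(L) = -118 with k = 6.
# B fits slightly better, but both criteria prefer the simpler model A,
# and BIC penalizes B's extra parameters more heavily than AIC does.
```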

20
Q

how do information criteria work

A
  • If two models fit the data equally well, the one with fewer parameters will be preferred because it incurs a smaller penalty.
  • the criteria favor simpler models unless the more complex model demonstrates a significantly better fit to the data
21
Q

BIC & AIC: quantity of parameters

A
  • there is flexibility in the number of parameters in a model
  • it is possible to formulate a model that decreases the number of parameters without losing predictive power
22
Q

BIC and AIC: what does it mean to say that they are conservative metrics

A
  • BIC and AIC are conservative: they favor simpler models that avoid overfitting by penalizing added complexity
  • BIC is a more conservative criterion as it includes a stronger penalty for the number of parameters than AIC, especially when the sample size is large
  • therefore, prefer cross-validation first; if cross-validation is not feasible, fall back on information criteria
23
Q

what does it mean to ‘fit a model to data’

A
  • it means we found the parameters for our model that, when used to create a prediction (simulate), best explain our data
24
Q

model fitting methods

A
  1. in how they quantify explanation (the objective, e.g., distance vs. likelihood)
  2. in how they search the parameter space (the optimizer)
25
Q

we want to find parameters that

A
  1. minimise the (euclidean) distance between the model and the data (SSE)
  2. maximise the likelihood of the data, given the model parameters (MLE)
26
Q

likelihood

A
  • p(y∣θ)
  • tells us how likely the data is given a specific set of parameters
27
Q

maximum likelihood

A

directly uses the likelihood function to find the set of “best” parameters that maximizes the likelihood of the observed data

28
Q

How does a model account for variability or noise in data?

A

By assigning probabilities to data points instead of deterministic values: the model assumes the data is generated as y = f(θ) + ε, where the noise ε follows a normal distribution N(0, σ).

29
Q

what is f(θ)

A
  • function of θ
  • can be a vector, matrix, distribution, etc. of parameters
  • each setting of θ defines a specific model instance
30
Q

What does it mean for a model to be a probability density function (pdf)?

A
  • each set of parameters defines a model instance and a probability distribution over the outcomes of y, rather than a deterministic value
  • MLE identifies the parameter set θ̂ that maximizes p(y∣θ).
31
Q

how are p(y∣θ) and p(θ|y) related to each other

A
  • p(y∣θ) - likelihood: by varying θ, we can quantify the probability of the data given θ
  • p(θ|y) - posterior: the distribution over θ given the observed data; by Bayes’ rule it is proportional to the likelihood times the prior, p(θ|y) ∝ p(y∣θ)p(θ)
32
Q

MLE vs bayes

A

MLE
1. only considers the likelihood term
2. ignores prior probability term as MLE does not use strong prior expectations
3. ignores marginal probability term since it uses fixed data

bayes
1. combines evidence, expectations, and hypotheses

33
Q

MLE problems and solutions

A
  • problem 1: MLE works with probabilities of observations; multiplying many small probabilities (e.g., p = 0.00001) makes the likelihood of the entire data set vanishingly small really fast if anything is unlikely
  • solution 1: use the logarithm of the likelihood (log-likelihood), which turns the product into a sum
  • problem 2: many optimization methods are built to work with distances (e.g., SSE), which are minimized, not maximized
  • solution 2: minimize the negative log-likelihood; this is mathematically equivalent to maximizing log(L), but it aligns with standard minimization-based optimization frameworks
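Both solutions can be seen in a small sketch; the linear model and the fixed noise level are illustrative assumptions:

```python
import math

def gaussian_nll(params, xs, ys, sigma=1.0):
    # Negative log-likelihood of ys under y = a*x + b + N(0, sigma).
    # Summing log-probabilities avoids the underflow of multiplying many
    # tiny probabilities; negating turns maximization into minimization.
    a, b = params
    nll = 0.0
    for x, y in zip(xs, ys):
        resid = y - (a * x + b)
        log_p = -0.5 * math.log(2 * math.pi * sigma ** 2) - resid ** 2 / (2 * sigma ** 2)
        nll -= log_p
    return nll

xs, ys = [0, 1, 2, 3], [1, 3, 5, 7]  # noiseless data from y = 2x + 1
# The true parameters (a=2, b=1) give a lower NLL than a mismatched guess.
```

With σ fixed, minimizing this NLL is equivalent to minimizing SSE, which is why least squares and MLE coincide under Gaussian noise.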
34
Q

smoothness of the likelihood landscape

A
  • depends on the quality and amount of data
  • with more data, the landscape becomes smoother, making optimization easier and more reliable.
  • sparse or noisy data can lead to a rugged likelihood surface with many local minima.
35
Q

relation of the parameters in the likelihood landscape

A

they are dependent on each other

36
Q

grid search

A
  • simplest optimiser
  • exhaustively goes through all parameter combinations in a grid to find the best model
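A minimal sketch of the idea, with a made-up two-parameter cost function standing in for an expensive model fit:

```python
import itertools

def cost(a, b):
    # Stand-in for an expensive model evaluation; minimum at a = 1.5, b = -0.5.
    return (a - 1.5) ** 2 + (b + 0.5) ** 2

a_grid = [i * 0.5 for i in range(-4, 5)]  # -2.0, -1.5, ..., 2.0
b_grid = [i * 0.5 for i in range(-4, 5)]

# Exhaustively evaluate every combination and keep the best.
best = min(itertools.product(a_grid, b_grid), key=lambda p: cost(*p))
```

Already 9 × 9 = 81 evaluations for two parameters; a third parameter at the same resolution would need 729, which is the exponential growth described in the next card.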
37
Q

grid search: downside

A
  • higher number of parameters leads to exponential growth of the number of evaluations, causing computational explosion
  • therefore only feasible for models with very few parameters and small search spaces
38
Q

gradient descent

A
  • more complicated optimiser
  • systematically searches for the minimum of the likelihood landscape by following the gradient
  • more efficient than grid search: converges in far fewer evaluations
  • efficient for larger parameter spaces
39
Q

gradient descent: downside

A
  • relies on a smooth likelihood landscape.
  • if the landscape has multiple local minima, e.g., for complicated fits (non convex parameter space), gradient descent might converge to a suboptimal solution
40
Q

gradient descent: how it works

A
  1. set up a cost function J(θ)
  2. set up an objective function that minimizes J(θ) based on distance (SSE) or likelihood (MLE)
  3. on each iteration, determine the direction to step in based on the gradient at the present step
  4. update based on the gradient multiplied by the learning rate
  5. repeat until convergence
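The steps above, sketched for a one-parameter cost J(θ) = (θ − 3)²; the fixed learning rate and fixed iteration count (instead of a convergence check) are simplifying assumptions:

```python
def gradient_descent(grad, theta0, lr=0.1, steps=100):
    theta = theta0
    for _ in range(steps):
        # Step against the gradient, scaled by the learning rate.
        theta -= lr * grad(theta)
    return theta

# Cost J(theta) = (theta - 3)^2, so dJ/dtheta = 2 * (theta - 3).
theta_hat = gradient_descent(lambda t: 2.0 * (t - 3.0), theta0=0.0)
```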
41
Q

learning rate value: too high

A

the optimizer may overshoot the minimum or even diverge, and becomes oversensitive to noise

42
Q

learning rate value: too low

A

slow convergence and potentially getting stuck in local minima if the parameter space is not smooth enough

43
Q

gradient descent and reinforcement learning

A

the gradient-descent learning rate involves mathematically the same speed-accuracy trade-off as the learning rate in RL: a good choice balances speed and stability

44
Q

gradient descent: problem & solution

A
  • problem: complicated fits (non-convex) run the risk of getting stuck in a local minimum/maximum
  • solution: it is difficult to verify that you really found the global solution, so choose starting values wisely and check whether multiple initializations converge to the same result
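The multiple-initializations check can be sketched on a toy non-convex cost, J(θ) = (θ² − 1)², which has two minima at θ = ±1; the starting values are arbitrary assumptions:

```python
def descend(theta, lr=0.05, steps=500):
    # Plain gradient descent on J(theta) = (theta^2 - 1)^2.
    for _ in range(steps):
        grad = 4.0 * theta * (theta ** 2 - 1.0)  # dJ/dtheta
        theta -= lr * grad
    return theta

# Different starting values converge to different minima, showing that a
# single run cannot be trusted to have found the global solution.
solutions = [round(descend(t0), 3) for t0 in (-2.0, -0.5, 0.5, 2.0)]
```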
45
Q

bayesian fitting: hierarchical bayesian modeling

A
  • allows us to specify the same model for all participants, and fit per-subject and group-level parameter estimates.
46
Q

parameter recovery

A
  • fitting comes with its own problems; e.g., sometimes different sets of parameters produce the same outputs
  • we need to know whether our problem is fittable at all
  • recovery ensures your modeling process produces results that are interpretable and replicable
47
Q

parameter recovery recipe

A
  1. simulate using fitted parameters
  2. add noise
  3. refit
  4. check if parameters can be recovered
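The recipe can be sketched end-to-end for a hypothetical one-parameter linear model; the true value, noise level, and tolerance are all illustrative assumptions:

```python
import random

random.seed(1)
true_a = 0.7  # "known" parameter to recover

def simulate(a, n=200, noise_sd=0.1):
    # Steps 1 + 2: simulate data from the model and add noise.
    xs = [random.uniform(0.0, 1.0) for _ in range(n)]
    ys = [a * x + random.gauss(0.0, noise_sd) for x in xs]
    return xs, ys

def fit(xs, ys):
    # Step 3: refit; least-squares slope for a through-the-origin line.
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

xs, ys = simulate(true_a)
a_hat = fit(xs, ys)
recovered = abs(a_hat - true_a) < 0.1  # step 4: check the parameter comes back
```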
48
Q

parameter recovery outcome

A
  • if the fitted parameters capture the generative process, they should be able to simulate data that is close to the original dataset
  • if your model cannot recover known parameters, it may be overfitting, underfitting, or poorly specified.
49
Q

George Box

A

‘All models are wrong. But some models are useful’