lecture 4 - model fitting
types of models
- model animal
- algorithmic model
- artificial neural network
- data-driven model
model animal
- model animals allow researchers to draw conclusions that may generalize across species
- e.g. mice are often used as models to study biological processes and behaviors relevant to humans
algorithmic model
- never touches data
- relies on algorithms or theoretical constructs
- they are abstract and typically focus on understanding or simulating processes in a hypothetical or idealized way
artificial neural network
- more of a tool than a scientific model
- typically applied in engineering contexts to process data, rather than to explain underlying biological mechanisms
- do mimic some properties of biological neural networks
data-driven model
- used by scientists to explain data
- explicitly created to analyze and interpret real data, helping scientists draw insights directly from empirical observations
George Box: ‘All models are wrong. But some models are useful’
- emphasizes that no model can perfectly represent reality, because models are simplifications of complex systems.
- they leave out details and assumptions, and therefore cannot fully capture the intricacies of real-world phenomena.
- however, despite their limitations, models can still be valuable tools.
descriptive models
- mathematical description of the data
- ‘fitting’ is important
- fitted parameters can be assessed, but they are properties of the data more than of any underlying process
process models
- mathematical description of the process that gave rise to the data
- ‘fitting’ is important
- fitted parameters have meaning because they tell us something about the generative process - i.e., how the data was produced
utility of descriptive models
- gaussian distribution (central limit theorem): models noise
- n-degree polynomial: describe the shape of data
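A minimal sketch of a descriptive fit, assuming NumPy and toy data (all names and values here are illustrative): the polynomial's coefficients summarize the shape of the data without saying anything about how it was generated.

```python
import numpy as np

# Hypothetical data: the polynomial will describe its shape, nothing more.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=x.size)  # noisy observations

# Fit a 3rd-degree polynomial; the coefficients describe the data,
# not the process that produced it.
coeffs = np.polyfit(x, y, deg=3)
y_hat = np.polyval(coeffs, x)

sse = np.sum((y - y_hat) ** 2)  # quality of the descriptive fit
print(coeffs, sse)
```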
utility of process models
- parameters have cognitive/neural meaning
E.G.,
- parameters quantify the process by which the brain reaches a decision
- modeler commits to latent variables (e.g., action value in RL)
why are process models harder to formulate
- because you have to think about the underlying process, not just the data
- i.e., they require a deep understanding of the cognitive or neural mechanisms, not just the ability to fit a dataset
process models or descriptive models
- many models are a bit of both descriptive and process categories
- e.g., signal detection theory (SDT)
occam’s razor/rule
- helps decide what the better model is
- lex parsimoniae: suggests that “entities are not to be multiplied without necessity.”
- if two models give an accurate description of the data, the simpler model is to be preferred
why choose the simpler model
- generalization: a good model is not dependent on the experiment: it generalizes.
- there is always a model with more parameters, giving a better fit
key questions for model selection
- does the data require more complexity: if a simpler model fits the data adequately, adding complexity might not be necessary.
- are you fitting the process or the noise: the goal is to model the actual process that generated the data, not the random fluctuations or noise within it.
overfitting
happens when a model has too many parameters for the data, so it fits the noise in a specific dataset rather than the true underlying pattern
cross-validation
- main method against overfitting
- splits the data into fit (train) and test datasets
- if you’re fitting noise that is unique to the training set, your model should fail to predict the test data: the cross-validated fit will scatter around zero, CV r² ~ N(0, σ)
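A minimal cross-validation sketch, assuming NumPy and a toy linear dataset (all variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 100)
y = 2.0 * x + rng.normal(0, 0.3, size=x.size)  # toy data

# Split into fit (train) and test halves.
idx = rng.permutation(x.size)
train, test = idx[:50], idx[50:]

# Fit on the training set only.
coeffs = np.polyfit(x[train], y[train], deg=1)

# Evaluate on held-out data: if the model fit noise unique to the
# training set, predictions here should hover around chance.
y_pred = np.polyval(coeffs, x[test])
cv_r = np.corrcoef(y_pred, y[test])[0, 1]
print("cross-validated r:", cv_r, "r^2:", cv_r ** 2)
```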
comparing model fits
- cross validation: splits data into training and test sets to check if the model generalizes well to unseen data
- information criteria: downweight the quality of fit of a model by penalizing the number of parameters.
types of information criteria
- bayesian information criterion (BIC): -2ln(L) + k * ln(n)
- akaike information criterion (AIC): -2ln(L) + 2k
k = number of parameters, n = sample size, ln(L) = log-likelihood
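A small sketch of both criteria as plain functions, assuming the log-likelihoods and parameter counts come from fits you already ran (the numbers below are hypothetical):

```python
import numpy as np

def bic(log_likelihood, k, n):
    # BIC = -2 ln(L) + k * ln(n)
    return -2 * log_likelihood + k * np.log(n)

def aic(log_likelihood, k):
    # AIC = -2 ln(L) + 2k
    return -2 * log_likelihood + 2 * k

# Hypothetical fits: model B has one more parameter and a slightly
# better log-likelihood; lower IC values are better.
print(bic(-420.0, k=3, n=200), aic(-420.0, k=3))
print(bic(-419.0, k=4, n=200), aic(-419.0, k=4))
```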
how do information criteria work
- If two models fit the data equally well, the one with fewer parameters will be preferred because it incurs a smaller penalty.
- the criteria favor simpler models unless the more complex model demonstrates a significantly better fit to the data
BIC & AIC: quantity of parameters
- there is flexibility in the number of parameters in a model
- it is possible to formulate a model that decreases the number of parameters without losing predictive power
BIC and AIC: what does it mean to say that they are conservative metrics
- BIC and AIC are conservative: they favor simpler models that avoid overfitting by penalizing added complexity
- BIC is a more conservative criterion as it includes a stronger penalty for the number of parameters than AIC, especially when the sample size is large
- therefore, use cross-validation first; if that is not possible, fall back on information criteria
what does it mean to ‘fit a model to data’
- it means we found the parameters for our model that, when used to create a prediction (simulate), best explain our data
model fitting methods
- quantify explanation differently
- search the parameters differently
we want to find parameters that
- minimise the (euclidean) distance between the model and the data (SSE)
- maximise the likelihood of the data, given the model parameters (MLE)
likelihood
- p(y∣θ)
- tells us how likely the data is given a specific set of parameters
maximum likelihood
directly uses the likelihood function to find the set of “best” parameters that maximizes the likelihood of the observed data
How does a model account for variability or noise in data?
By assigning probabilities to data points instead of deterministic values: the model assumes the data is generated as y = f(θ) + ε, where the noise ε follows a normal distribution N(0, σ).
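A minimal sketch of evaluating p(y∣θ) under this assumption, using SciPy's normal density; the linear f(θ) and all values are hypothetical:

```python
import numpy as np
from scipy.stats import norm

# Toy model: f(theta) is a straight line, y = theta * x + noise.
theta, sigma = 2.0, 0.5             # hypothetical parameter values
x = np.array([0.1, 0.4, 0.7, 1.0])
y = np.array([0.3, 0.9, 1.3, 2.1])  # observed data

# p(y | theta): each data point gets a probability density under
# N(f(theta), sigma), rather than a deterministic prediction.
log_lik = norm.logpdf(y, loc=theta * x, scale=sigma).sum()
print("log p(y | theta) =", log_lik)
```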
what is f(θ)
- function of θ
- θ can be a scalar, vector, or matrix of parameters; f(θ) maps them to a prediction or a distribution
- each setting of θ defines a specific model instance
What does it mean for a model to be a probability density function (pdf)?
- each set of parameters defines a model instance and a probability distribution over the outcomes of y, rather than a deterministic value
- MLE identifies the parameter set (θ-hat) that maximizes p(y∣θ).
how are p(y∣θ) and p(θ|y) related to each other
- p(y∣θ) - likelihood: by varying θ, we can quantify the probability of the data given θ
- p(θ|y) - posterior: the probability distribution over θ given the data; by Bayes' rule, p(θ|y) ∝ p(y∣θ)p(θ). With a flat prior, the θ that maximizes the posterior is the same θ that maximizes the likelihood
MLE vs bayes
MLE
1. only considers the likelihood term
2. ignores the prior probability term, as MLE does not assume strong prior expectations
3. ignores the marginal probability term p(y), since the data is fixed
bayes
1. combines evidence, expectations, and hypotheses
MLE problems and solutions
- problem 1: works with probabilities of observations; their product over an entire dataset becomes extremely small very fast (e.g., p = 0.00001 per unlikely observation)
- solution 1: use logarithm of the likelihood (loglikelihood)
- problem 2: many optimization methods are built around distances (e.g., SSE), which are minimized, not maximized
- solution 2: minimize negative log-likelihood, as this is mathematically equivalent to maximizing log(L), but it aligns with standard minimization-based optimization frameworks
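A minimal MLE sketch combining both solutions, assuming SciPy is available; it minimizes the negative log-likelihood of a hypothetical linear model (the log-sigma reparameterization is one common way to keep σ positive):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 80)
y = 1.5 * x + rng.normal(0, 0.4, size=x.size)  # data from a known slope

def neg_log_likelihood(params):
    slope, log_sigma = params
    sigma = np.exp(log_sigma)  # keep sigma positive
    # Negative sum of log densities: minimizing this maximizes log(L).
    return -norm.logpdf(y, loc=slope * x, scale=sigma).sum()

result = minimize(neg_log_likelihood, x0=[0.0, 0.0])
slope_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(slope_hat, sigma_hat)
```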
smoothness of the likelihood landscape
- depends on the quality and amount of data
- with more data, the landscape becomes smoother, making optimization easier and more reliable.
- sparse or noisy data can lead to a rugged likelihood surface with many local minima.
relation of the parameters in the likelihood landscape
they are interdependent: a change in one parameter can often be compensated by a change in another, producing ridges in the landscape
grid search
- simplest optimiser
- exhaustively goes through all parameter combinations in a grid to find the best model
grid search: downside
- higher number of parameters leads to exponential growth of the number of evaluations, causing computational explosion
- therefore only feasible for models with very few parameters and small search spaces
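A minimal grid search sketch over a hypothetical two-parameter model, illustrating both the exhaustive evaluation and why the evaluation count explodes as parameters are added:

```python
import itertools
import numpy as np

# Hypothetical cost: SSE of a line y = a * x + b against toy data.
x = np.linspace(0, 1, 30)
y = 2.0 * x + 0.5

def sse(a, b):
    return np.sum((y - (a * x + b)) ** 2)

# Exhaustively evaluate every (a, b) combination on a grid.
a_grid = np.linspace(0, 4, 41)
b_grid = np.linspace(-1, 2, 31)
best = min(itertools.product(a_grid, b_grid), key=lambda p: sse(*p))
print("best (a, b):", best)  # 41 * 31 = 1271 evaluations already

# Note the explosion: a third parameter with 41 values would make
# 41 * 31 * 41 = 52,111 evaluations.
```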
gradient descent
- more complicated optimiser
- systematically searches for the minimum of the cost landscape (e.g., the negative log-likelihood) by following the gradient
- more efficient than grid search — can be done in fewer iterations
- efficient for larger parameter spaces
gradient descent: downside
- relies on a smooth likelihood landscape.
- if the landscape has multiple local minima, e.g., for complicated fits (non convex parameter space), gradient descent might converge to a suboptimal solution
gradient descent: how it works
- set up a cost function J(θ) based on distance (SSE) or negative log-likelihood (MLE)
- the objective is to find the parameters θ that minimize J(θ)
- on each iteration, determine the direction to step in based on the gradient at the present step
- update based on the gradient multiplied by the learning rate
- repeat until convergence
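A minimal sketch of this recipe for a hypothetical linear model with an SSE cost; the analytic gradient and the learning rate value are illustrative:

```python
import numpy as np

x = np.linspace(0, 1, 50)
y = 3.0 * x + 1.0  # noiseless toy data for clarity

def cost(theta):
    a, b = theta
    return np.sum((y - (a * x + b)) ** 2)  # J(theta) = SSE

def gradient(theta):
    a, b = theta
    residual = y - (a * x + b)
    # Analytic gradient of the SSE with respect to a and b.
    return np.array([-2 * np.sum(residual * x), -2 * np.sum(residual)])

theta = np.array([0.0, 0.0])  # starting values
lr = 0.01                     # learning rate
for _ in range(2000):
    step = lr * gradient(theta)      # gradient times learning rate
    theta = theta - step             # move downhill
    if np.linalg.norm(step) < 1e-8:  # convergence check
        break
print(theta, cost(theta))
```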
learning rate value: too high
potentially overshooting the minimum or diverging, and becoming oversensitive to noise
learning rate value: too low
slow convergence and potentially getting stuck in local minima if the parameter space is not smooth enough
gradient descent and reinforcement learning
the gradient descent update rule is mathematically analogous to learning rules in RL; in both cases, a good choice of learning rate balances speed and stability (a speed-accuracy trade-off)
gradient descent: problem & solution
- problem: complicated fits (non-convex) run the risk of getting stuck in a local minimum/maximum
- solution: it is difficult to verify that you really found the global solution, so choose starting values wisely and check whether multiple initializations converge to the same result
bayesian fitting: hierarchical bayesian modeling
- allows us to specify the same model for all participants and fit per-subject and group-level parameter estimates simultaneously
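A minimal hierarchical sketch, assuming the PyMC library is available (other probabilistic programming tools work similarly); the data, priors, and noise level are all hypothetical:

```python
import numpy as np
import pymc as pm

# Hypothetical data: 8 subjects, 40 noisy trials each.
rng = np.random.default_rng(3)
n_subjects, n_trials = 8, 40
true_theta = rng.normal(1.0, 0.3, size=n_subjects)       # per-subject truth
subject_idx = np.repeat(np.arange(n_subjects), n_trials)
y = rng.normal(true_theta[subject_idx], 0.5)

with pm.Model():
    # Group-level parameters shared across subjects.
    group_mu = pm.Normal("group_mu", mu=0.0, sigma=1.0)
    group_sigma = pm.HalfNormal("group_sigma", sigma=1.0)
    # Per-subject parameters drawn from the group distribution.
    theta = pm.Normal("theta", mu=group_mu, sigma=group_sigma,
                      shape=n_subjects)
    pm.Normal("obs", mu=theta[subject_idx], sigma=0.5, observed=y)
    trace = pm.sample()  # posterior over group- and subject-level params
```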
parameter recovery
- fitting comes with several pitfalls; e.g., sometimes different sets of parameters produce the same outputs (the model is not identifiable)
- we need to know if our problem is fittable
- recovery ensures your modeling process produces results that are interpretable and replicable
parameter recovery recipe
- simulate using fitted parameters
- add noise
- refit
- check if parameters can be recovered
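A minimal sketch of this recipe for a hypothetical one-parameter (slope) model; a tight correlation between true and recovered parameters suggests the problem is fittable:

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0, 1, 60)

def simulate(slope, noise_sd):
    # Simulate data from the model with known parameters plus noise.
    return slope * x + rng.normal(0, noise_sd, size=x.size)

def fit(y):
    # Refit with the same routine used on real data (least squares here).
    return np.polyfit(x, y, deg=1)[0]

true_slopes = rng.uniform(0.5, 3.0, size=100)
recovered = np.array([fit(simulate(s, noise_sd=0.3)) for s in true_slopes])

# Check recovery: generating and recovered parameters should correlate
# strongly; a weak correlation signals a poorly specified model.
r = np.corrcoef(true_slopes, recovered)[0, 1]
print("recovery correlation:", r)
```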
parameter recovery outcome
- if the fitted parameters capture the generative process, simulating with them should produce data close to the original dataset
- if your model cannot recover known parameters, it may be overfitting, underfitting, or poorly specified.