Behaviour module Flashcards
What are the issues with grid search?
- If the best parameter value(s) lie outside the range of values you evaluate, the search cannot find them
- If the best parameter identified lies on the edge of the evaluated range, you are likely missing the true best parameter(s)
- The accuracy of grid search depends on how finely you evaluate the parameter range
- Grid search only works well when the number of fitted parameters is small (2-3 or less)
Why do we maximize LOG likelihood instead of just likelihood?
The likelihood is the product of many numbers between 0 and 1, so for large datasets it eventually gets rounded down to zero (numerical underflow). Taking the log converts the product into a sum of log-probabilities, which avoids underflow, and because the log is monotonic the same parameter values maximize both the likelihood and the log likelihood.
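Example (a minimal sketch, not from the course materials, assuming NumPy and an illustrative Bernoulli dataset) of how the raw likelihood underflows while the log likelihood stays well-scaled:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.7                                  # assumed "true" success probability
y = rng.binomial(1, p, size=5000)        # 5000 illustrative Bernoulli observations

# Per-observation probability under the model: p if y=1, (1-p) if y=0
probs = np.where(y == 1, p, 1 - p)

likelihood = np.prod(probs)              # product of thousands of numbers < 1
log_likelihood = np.sum(np.log(probs))   # sum of logs stays well-scaled

print(likelihood)        # 0.0 -- numerical underflow
print(log_likelihood)    # roughly -3000, perfectly representable
```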
4 steps for maximum likelihood
Step 1: Formulate a model that predicts the probabilities of all possible outcomes as a function of the parameters
Step 2: Calculate the probability of each observation given the parameters
Step 3: The product of the probabilities of all observations is the likelihood
Step 4: Search/solve for the parameters that maximize the likelihood
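Example (a minimal sketch of the four steps, assuming NumPy/SciPy and made-up Gaussian data; the parameters mu and sigma are illustrative, not from the course):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=1.5, size=200)   # illustrative dataset

# Step 1: the model is a Gaussian with parameters (mu, sigma) that gives the
# probability density of every possible observation
def neg_log_likelihood(params, y):
    mu, log_sigma = params                  # fit log(sigma) so sigma stays positive
    sigma = np.exp(log_sigma)
    # Step 2: probability density of each observation given the parameters
    dens = norm.pdf(y, loc=mu, scale=sigma)
    # Step 3: the likelihood is the product of these densities; we work with
    # the (negative) log, i.e. the sum of log densities
    return -np.sum(np.log(dens))

# Step 4: search for the parameters that maximize the likelihood
# (equivalently, minimize the negative log-likelihood)
fit = minimize(neg_log_likelihood, x0=[0.0, 0.0], args=(data,), method="Nelder-Mead")
print(fit.x[0], np.exp(fit.x[1]))           # estimates of mu and sigma
```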
What is the difference in how we fit linear and non-linear models to data?
- Linear models: we can directly solve for the parameters that best fit the data using calculus and linear algebra (done automatically by statistical software)
- Non-linear models: we have to iteratively search for the best parameters (more on this later)
Name four types of models used in behavioral sciences
- Simple general linear models with Gaussian error (General Linear Models)
  Linear regression
  Comparing groups (t-tests/ANOVA)
- Simple linear models with other error distributions (Generalized Linear Models)
  Logistic regression
  Poisson regression
- Non-linear models
  Descriptive models that are non-linear in the parameters
- Process-based models
  Aim to describe the underlying mechanisms and sequences of operations that give rise to cognitive functions and behaviour
  Typically non-linear
What is a Poisson distribution?
- A discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time
- Assumptions
  - Events occur with a constant mean rate
  - Events occur independently of the time since the last event
When (and when not) would you use a Poisson distribution and why?
- Using discrete probability distributions like the Poisson to model count data is generally only required when counts are low
- As λ increases, the Poisson distribution becomes approximately symmetric, and you can use a Normal distribution to model the data
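Example (a small sketch, assuming SciPy) comparing the Poisson PMF with a Normal PDF that has the same mean and variance; the discrepancy shrinks as λ grows:

```python
import numpy as np
from scipy.stats import poisson, norm

for lam in (2, 50):
    k = np.arange(0, int(lam + 4 * np.sqrt(lam)) + 1)
    pois = poisson.pmf(k, mu=lam)
    # Normal approximation with matching mean and variance (both equal lambda)
    approx = norm.pdf(k, loc=lam, scale=np.sqrt(lam))
    print(f"lambda={lam}: max |Poisson - Normal| = {np.max(np.abs(pois - approx)):.4f}")
```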
Process-based/computational models of behavior
- mathematical equations that link the experimentally observable variables (e.g. stimuli, outcomes, past experiences) to behaviour
- computational models represent different “algorithmic hypotheses” about how behaviour is generated
General framework for fitting/analyzing nearly all models
- Maximum Likelihood (quantify goodness of fit)
- Non-linear optimization (finding the parameters that best fit the data)
- Quantifying uncertainty – likelihood profiles and bootstrapping
- Comparing models: Information Criteria and cross-validation
What do we use instead of OLS (in this course) and why?
Likelihood. Reason: OLS does not work for all types of data (e.g. non-normally distributed or binary data)
What is likelihood
Likelihood is the joint probability (or probability density) of the data given a set of parameter values
* In other words “the probability of the data given the parameter values”
* When errors of data points are independent, the joint probability of the data is the product of the probabilities/probability densities of all observations
PMF AND PDF
PMF: probability mass function – for discrete probability distributions (gives the probability of observations as a function of parameters)
PDF: probability density function – for continuous probability distributions (gives the probability density of observations as a function of parameters)
What is the problem we are trying to solve by optimization methods?
This applies to non-linear models. General problem: we want to find the parameters that maximize the log likelihood, but we don't know what the likelihood surface looks like; we can only evaluate the likelihood one parameter combination at a time.
2 types of optimization methods
Gradient-based methods (e.g. Newton's method, gradient descent) and gradient-free methods (e.g. the Nelder-Mead simplex)
Nelder-Mead simplex
- The Nelder-Mead simplex is an algorithm for searching parameter space to find a minimum
- Start by going in what seems to be the best direction, by reflecting the high (worst) point of the simplex through the face opposite it
- If the goodness-of-fit at the new point is better than the best (lowest) other point in the simplex, expand the length of the jump in that direction
- If this jump was bad (the height at the new point is worse than the second-worst point in the simplex), try a point that's only half as far out as the initial try
- If this second try, closer to the original, is also bad, contract the simplex around the current best (lowest) point
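In practice the simplex is rarely hand-coded; a hedged sketch using SciPy's built-in Nelder-Mead to minimize a negative log-likelihood (illustrative Poisson-rate example, not from the course materials):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import poisson

rng = np.random.default_rng(2)
counts = rng.poisson(lam=4.0, size=100)     # illustrative count data

def neg_log_likelihood(params, y):
    lam = np.exp(params[0])                 # optimize log(lambda) so the rate stays positive
    return -np.sum(poisson.logpmf(y, mu=lam))

fit = minimize(neg_log_likelihood, x0=[0.0], args=(counts,), method="Nelder-Mead")
print(np.exp(fit.x[0]))                     # close to the sample mean of the counts
```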
How does gradient descent find the maximum likelihood parameters?
- Calculate the partial derivatives of the –LL with respect to the parameters
- The vector of partial derivatives of the –LL with respect to the parameters is the gradient, which points in the direction of steepest ascent of the –LL
- We want to minimize the –LL, so we move a small amount in the opposite direction: θ_new = θ_old - α·∇(–LL(θ_old)), where α is a small step size (learning rate)
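Example (a minimal sketch of gradient descent on the –LL of a Gaussian with known σ, fitting only μ; the data and step size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = 1.0                                      # assume sigma is known, fit only mu
y = rng.normal(loc=3.0, scale=sigma, size=500)   # illustrative data

def neg_ll_gradient(mu, y, sigma):
    # Partial derivative of the Gaussian -LL with respect to mu
    return -np.sum(y - mu) / sigma**2

mu = 0.0        # initial guess
alpha = 0.001   # small step size (learning rate)
for _ in range(2000):
    mu = mu - alpha * neg_ll_gradient(mu, y, sigma)   # step opposite to the gradient

print(mu)        # converges to (roughly) the sample mean of y
```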
How do we avoid local minima
- All optimizers require an initial guess for parameter values, which is where the search process begins
  - Good starting guesses for the initial parameters help avoid local minima (based on the data, previous studies, or biological interpretation)
- Generally, gradient-free methods are more robust to local minima
- Optimize multiple times with different initial parameters (e.g. set up a coarse grid of parameter values and run the optimization initialized at every combination of the gridded parameters; see the sketch below)
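The multi-start idea as a sketch (assuming SciPy and an illustrative Gaussian negative log-likelihood; your real model would replace neg_log_likelihood):

```python
import itertools
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(4)
data = rng.normal(loc=1.0, scale=2.0, size=150)   # illustrative data

def neg_log_likelihood(params, y):
    mu, log_sigma = params
    return -np.sum(norm.logpdf(y, loc=mu, scale=np.exp(log_sigma)))

# Coarse grid of starting values for each parameter
mu_starts = [-5.0, 0.0, 5.0]
log_sigma_starts = [-1.0, 0.0, 1.0]

best_fit = None
for x0 in itertools.product(mu_starts, log_sigma_starts):
    fit = minimize(neg_log_likelihood, x0=list(x0), args=(data,), method="Nelder-Mead")
    if best_fit is None or fit.fun < best_fit.fun:
        best_fit = fit                            # keep the run with the lowest -LL

print(best_fit.x)                                 # best parameters across all starts
```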
Pros and cons of grid search, gradient-based and gradient free optimization methods
Grid search
Pros:
*Easy
*unlikely to miss the global minimum if the grid is set appropriately
Cons:
*Very slow for high dimensional problems
*Only as precise as the grid
Gradient-based (e.g. Newton’s Method)
Pros:
*Fast: converges to a minimum in far fewer steps
Cons:
*Easily caught in local minima if they exist
Gradient-free (e.g. Nelder-Mead)
Pros:
*Works for models of intermediate complexity
*Faster than grid search
Cons:
*Slower to converge than gradient-based methods
*Can still get caught in local minima
What question are we trying to answer with parameter recoverability and what steps does it involve?
If this cognitive process/behavior works like I think it works, will my experiment provide sufficient information to recover the parameters with the desired precision and without bias?
Steps of Parameter recoverability
1. Use your model and known parameter values to generate a synthetic data set
2. Simulate the experiment you plan to conduct (# of replicates, etc.)
3. Fit your model to the simulated data set
4. Compare the true and fitted parameter values
5. Repeat many times and evaluate the distribution of fitted parameter estimates compared to the true value that generated the data (to estimate precision and bias)
- When we are uncertain about the likely range of parameter values, we can do parameter recovery over a range of parameter values to see under what range we get precise, unbiased estimates
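Example (a sketch of a parameter-recovery simulation for an illustrative Poisson-rate model; the true parameter, trial count, and number of simulations are made up):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import poisson

rng = np.random.default_rng(5)
true_lambda = 3.0       # known "true" parameter used to generate synthetic data
n_trials = 50           # planned number of trials per simulated experiment
n_sims = 200            # number of simulated experiments

def neg_log_likelihood(params, y):
    return -np.sum(poisson.logpmf(y, mu=np.exp(params[0])))

estimates = []
for _ in range(n_sims):
    # Steps 1-2: simulate the planned experiment from the known parameters
    y_sim = rng.poisson(lam=true_lambda, size=n_trials)
    # Step 3: fit the model to the synthetic data
    fit = minimize(neg_log_likelihood, x0=[0.0], args=(y_sim,), method="Nelder-Mead")
    estimates.append(np.exp(fit.x[0]))

estimates = np.array(estimates)
# Steps 4-5: compare the distribution of estimates to the true value
print("bias:", estimates.mean() - true_lambda)
print("precision (SD of estimates):", estimates.std())
```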
Why must we use probability density instead of probability for continuous distributions?
- If a random variable is continuous, there are an infinite number of values it could take
- The probability of observing any specific value exactly is 0
- We can only define probabilities of observing values that fall within a specific range (e.g. between 0 and 1)
3 continuous distributions other than Gaussian (normal)
- Exponential distribution
  Time to event, or time between events, when events happen at a constant rate
- Weibull distribution
  Time to event, when the probability of the event increases or decreases with time
- Inverse-Gaussian
  E.g. the expected first-passage time distribution for a drift-diffusion process with one boundary
Probability density function
- A probability density function describes the relative likelihood that a value of a random variable would be equal to a specific value
- If we draw many random numbers from a continuous probability distribution, the probability density tells us the relative likelihood of drawing values near a specific value
- Example: If the prob density of x = 0 is 0.4, and the prob density of x = 1 is 0.2, we should expect to see twice as many observations near 0 compared to near 1.
- Integrates to 1
- Probability of observing a value of x between two bounds is equivalent to the area under the curve between the two bounds
What type of distribution is usually used to model RTs?
- Non-normal continuous probability distributions are often used to model reaction times, which tend to be positively skewed
- These distributions generally have multiple parameters that influence both the mean and the shape of the distribution
- We can model how these parameters differ across treatments and groups
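Example (a hedged sketch fitting an inverse-Gaussian to simulated, positively skewed "reaction times" with SciPy; the parameter values are illustrative):

```python
import numpy as np
from scipy.stats import invgauss

rng = np.random.default_rng(6)
# Illustrative positively skewed "reaction times" (seconds)
rts = invgauss.rvs(mu=0.5, scale=1.0, size=300, random_state=rng)

# Fit the inverse-Gaussian by maximum likelihood (location fixed at 0 for simplicity)
mu_hat, loc_hat, scale_hat = invgauss.fit(rts, floc=0.0)
print(mu_hat, scale_hat)

# Mean and skew of the fitted distribution (both depend on multiple parameters)
print(invgauss.stats(mu_hat, loc=loc_hat, scale=scale_hat, moments="ms"))
```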
Standard error and CIs (in normal distribution)
- Standard error – because the sampling distribution is normally distributed, we can quantify the shape of the sampling distribution by the estimated value of the parameter and the standard deviation of the sampling distribution, which we call a standard error
- Confidence intervals – we can use standard errors to construct confidence intervals. By definition, we expect the true parameter value to fall within the 95% confidence interval 95% of the time
Sampling distribution & central limit theorem
- Sampling distribution – the distribution of parameter estimates we would get if we repeated our study many times
  - The width of this distribution depends on the model/parameter, how noisy our data are, and the sample size
- Central Limit Theorem – if our sample size is large, the sampling distribution of parameter estimates is normally distributed
How does estimation of standard errors differ between linear and non-linear models?
- Estimating standard errors from one sample:
  - In general linear models, we can estimate the standard deviation of the sampling distribution for each parameter directly from our one dataset (these are the standard errors)
  - As with parameter estimation, in non-linear models standard errors need to be computed computationally, using likelihood profiles or bootstrapping
Bootstrapping steps
- For i in range(n_bootstraps):
  - Resample the data with replacement: individual observations can be sampled more than once
    - Sample the row indices of your df with replacement
    - The number of sampled rows should equal the number of rows in the df
  - Fit the model, estimate the parameters
  - Store the fitted parameters
- Calculate the standard error/CI based on the distribution of bootstrapped parameter values
  - 0.025 and 0.975 quantiles of the bootstrapped parameters for a 95% CI
  - Standard deviation of the bootstrapped parameters for standard errors
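Example (a minimal sketch with pandas; the bootstrapped statistic here is the Poisson-rate MLE, which is just the sample mean, standing in for whatever model you actually fit):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
df = pd.DataFrame({"count": rng.poisson(lam=4.0, size=80)})   # illustrative data

n_bootstraps = 1000
boot_estimates = []
for _ in range(n_bootstraps):
    # Resample rows with replacement; same number of rows as the original data
    resampled = df.sample(n=len(df), replace=True, random_state=rng)
    # Fit the model to the resampled data (here the Poisson MLE is the mean)
    boot_estimates.append(resampled["count"].mean())

boot_estimates = np.array(boot_estimates)
se = boot_estimates.std()                                       # bootstrap standard error
ci_low, ci_high = np.quantile(boot_estimates, [0.025, 0.975])   # 95% CI
print(se, ci_low, ci_high)
```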
Why and how do we use hierarchical bootstrapping?
- In hierarchical data, observations are not independent.
- Systematic variability across individuals or groups can lead to violations of the assumption of uncorrelated errors.
- Treating each data point as independent can result in overly narrow confidence intervals for parameter estimates, misrepresenting uncertainty.
- Hierarchical bootstrapping: a method to estimate model uncertainty for hierarchical data while preserving its structure.
* Example: Measuring recall as a function of time for 10 subjects, each tested at 20 different delay periods.
* Steps:
1. Resample groups: randomly sample 10 subjects with replacement.
2. Resample observations: for each sampled subject, randomly sample 20 observations with replacement.
3. Fit the model: fit the model on this resampled dataset.
4. Repeat: perform this process many times to generate a bootstrapped sampling distribution.
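A sketch following the example above (10 subjects × 20 observations), assuming pandas; fit_model is a hypothetical stand-in for your real model fit:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(8)

# Illustrative hierarchical data: 10 subjects, 20 delay periods each
df = pd.DataFrame({
    "subject": np.repeat(np.arange(10), 20),
    "delay": np.tile(np.arange(1, 21), 10),
})
df["recall"] = np.exp(-0.1 * df["delay"]) + rng.normal(0, 0.05, size=len(df))

def fit_model(data):
    # Hypothetical stand-in for a real model fit; here just the mean recall
    return data["recall"].mean()

boot_estimates = []
for _ in range(1000):
    # 1. Resample subjects with replacement
    sampled_subjects = rng.choice(df["subject"].unique(), size=10, replace=True)
    pieces = []
    for s in sampled_subjects:
        subj_data = df[df["subject"] == s]
        # 2. Resample that subject's observations with replacement
        pieces.append(subj_data.sample(n=len(subj_data), replace=True, random_state=rng))
    boot_df = pd.concat(pieces, ignore_index=True)
    # 3. Fit the model on the resampled dataset
    boot_estimates.append(fit_model(boot_df))

# 4. The collected estimates form the bootstrapped sampling distribution
print(np.quantile(boot_estimates, [0.025, 0.975]))
```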
2 main methods for model comparison
- Leave-one-out cross-validation (LOOCV)
- Akaike information criterion (AIC)
What is a likelihood ratio and when can we use it to compare models?
- Only when the models have the same number of parameters
- Likelihood ratios are a simple way of comparing models:
  - Fit two alternative models to the data
  - Calculate the log-likelihood of each model
  - The likelihood ratio is the ratio of the likelihoods of the two models: L(m1)/L(m2), or equivalently exp(logL(m1) - logL(m2)), which avoids numerical underflow
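Example (a tiny sketch with hypothetical log-likelihood values):

```python
import numpy as np

logL_m1 = -432.1   # hypothetical log-likelihoods from two fitted models
logL_m2 = -436.8   # with the same number of parameters

likelihood_ratio = np.exp(logL_m1 - logL_m2)   # L(m1) / L(m2), computed stably
print(likelihood_ratio)    # ~110: the data are ~110 times more likely under m1
```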
What is the purpose of cross-validation?
- Adding complexity to models will almost always improve the ability to fit a data set
- But in general the goal isn't to fit the data we have as well as possible; we want our models to generalize to new data
- One way of assessing this is to fit model to part of data, and test the predictions against another part of the data (cross-validation)
LOOCV
Leave-one-out cross-validation (LOOCV)
1. Select one observation as the test data; the rest will be used to fit the model (training data)
2. Use the model to predict y for the test data, and calculate the difference between the observed and predicted values
- Do this for every data point
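Example (a minimal LOOCV sketch, assuming NumPy; a straight line fitted with np.polyfit stands in for your actual model):

```python
import numpy as np

rng = np.random.default_rng(9)
x = np.linspace(0, 10, 30)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, size=30)    # illustrative data

squared_errors = []
for i in range(len(y)):
    train = np.arange(len(y)) != i                 # leave observation i out
    # Fit the model on the training data (straight line as a stand-in)
    slope, intercept = np.polyfit(x[train], y[train], deg=1)
    y_pred = intercept + slope * x[i]              # predict the held-out point
    squared_errors.append((y[i] - y_pred) ** 2)

print("LOOCV mean squared error:", np.mean(squared_errors))
```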
Why do complex models tend to be worse at generalizing?
- In general, performance is worse on out-of-sample data (data that was not used to fit the model), because our models partly fit noise
- Models that are too complex are worse at generalizing to new data because they are more flexible and so fit more of the noise in our sample
Bias-variance tradeoff
- The bias error is an error from erroneous assumptions in the model. For example, leaving important variables out of the model leads to underfitting.
- The variance is an error from sensitivity to small fluctuations in the training set. High variance may result from a model fitting the random noise in the training data (overfitting).
What is AIC
- AIC is a way of comparing and ranking models of different complexity
- AIC = 2k - 2 log L, where k is the number of parameters and log L is the maximized log-likelihood: it depends on the log-likelihood plus a penalty for the number of parameters
- The AIC score decreases with the log-likelihood (how well the model fits the data)
- AIC increases with the number of parameters
- The goal is to find models that minimize the AIC score
- AIC is based on information theory and aims to score models by their ability to generalize to new data
- Models with lower AIC scores are generally considered better because they strike a balance between goodness-of-fit and model complexity
AIC weights
- AIC weights convert differences in AIC into relative support for each model in the candidate set
- Compute ΔAIC_i = AIC_i - AIC_min for each model
- Akaike weight: w_i = exp(-ΔAIC_i / 2) / Σ_j exp(-ΔAIC_j / 2)
- The weights sum to 1 and can be interpreted as the relative probability that model i is the best-approximating model of those considered
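Example (a sketch computing AIC = 2k - 2 log L and Akaike weights from hypothetical fitted models):

```python
import numpy as np

# Hypothetical fitted models: (log-likelihood, number of parameters)
models = {"m1": (-520.3, 3), "m2": (-518.9, 5), "m3": (-530.1, 2)}

aic = {name: 2 * k - 2 * logL for name, (logL, k) in models.items()}
aic_values = np.array(list(aic.values()))

delta = aic_values - aic_values.min()     # AIC differences from the best model
weights = np.exp(-0.5 * delta)
weights = weights / weights.sum()         # Akaike weights sum to 1

for name, w in zip(aic, weights):
    print(name, round(aic[name], 1), round(w, 3))
```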