Behaviour module Flashcards
What are the issues with grid search?
- If the best parameter value(s) lie outside the range of values you evaluate, the search cannot find them
- If the best parameter identified lies on the edge of the evaluated range, you are likely missing the true best parameter(s)
- The accuracy of grid search depends on how finely you evaluate the parameter range
- Grid search only works well when the number of fitted parameters is small (2-3 or less)
Why do we maximize LOG likelihood instead of just likelihood?
The likelihood is the product of many numbers between 0 and 1, so for large datasets it eventually gets rounded down to zero (numerical underflow). Taking the log converts the product into a sum of log-probabilities, which avoids underflow, and because the log is monotonic the same parameter values maximize both the likelihood and the log likelihood.
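Example (a minimal sketch, not from the course materials, assuming NumPy and an illustrative Bernoulli dataset) of how the raw likelihood underflows while the log likelihood stays well-scaled:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.7                                  # assumed "true" success probability
y = rng.binomial(1, p, size=5000)        # 5000 illustrative Bernoulli observations

# Per-observation probability under the model: p if y=1, (1-p) if y=0
probs = np.where(y == 1, p, 1 - p)

likelihood = np.prod(probs)              # product of thousands of numbers < 1
log_likelihood = np.sum(np.log(probs))   # sum of logs stays well-scaled

print(likelihood)        # 0.0 -- numerical underflow
print(log_likelihood)    # roughly -3000, perfectly representable
```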
4 steps for maximum likelihood
Step 1: Formulate a model that predicts the probabilities of all possible outcomes as a function of the parameters
Step 2: Calculate the probability of each observation given the parameters
Step 3: The product of the probabilities of all observations is the likelihood
Step 4: Search/solve for the parameters that maximize the likelihood
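Example (a minimal sketch of the four steps, assuming NumPy/SciPy and made-up Gaussian data; the parameters mu and sigma are illustrative, not from the course):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=1.5, size=200)   # illustrative dataset

# Step 1: the model is a Gaussian with parameters (mu, sigma) that gives the
# probability density of every possible observation
def neg_log_likelihood(params, y):
    mu, log_sigma = params                  # fit log(sigma) so sigma stays positive
    sigma = np.exp(log_sigma)
    # Step 2: probability density of each observation given the parameters
    dens = norm.pdf(y, loc=mu, scale=sigma)
    # Step 3: the likelihood is the product of these densities; we work with
    # the (negative) log, i.e. the sum of log densities
    return -np.sum(np.log(dens))

# Step 4: search for the parameters that maximize the likelihood
# (equivalently, minimize the negative log-likelihood)
fit = minimize(neg_log_likelihood, x0=[0.0, 0.0], args=(data,), method="Nelder-Mead")
print(fit.x[0], np.exp(fit.x[1]))           # estimates of mu and sigma
```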
What is the difference in how we fit linear and non-linear models to data?
- Linear models: we can directly solve for the parameters that best fit the data using calculus and linear algebra (done automatically by statistical software)
- Non-linear models: we have to iteratively search for the best parameters (more on this later)
Name four types of models used in behavioral sciences
- Simple general linear models with Gaussian error (General Linear Models)
  Linear regression
  Comparing groups (t-tests/ANOVA)
- Simple linear models with other error distributions (Generalized Linear Models)
  Logistic regression
  Poisson regression
- Non-linear models
  Descriptive models that are non-linear in the parameters
- Process-based models
  Aim to describe the underlying mechanisms and sequences of operations that give rise to cognitive functions and behaviour
  Typically non-linear
What is a Poisson distribution?
- A discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time
- Assumptions
  - Events occur with a constant mean rate
  - Events occur independently of the time since the last event
When (and when not) would you use a Poisson distribution and why?
- Using discrete probability distributions like the Poisson to model count data is generally only required when counts are low
- As λ increases, the Poisson distribution becomes approximately symmetric, and you can use a Normal distribution to model the data
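Example (a small sketch, assuming SciPy) comparing the Poisson PMF with a Normal PDF that has the same mean and variance; the discrepancy shrinks as λ grows:

```python
import numpy as np
from scipy.stats import poisson, norm

for lam in (2, 50):
    k = np.arange(0, int(lam + 4 * np.sqrt(lam)) + 1)
    pois = poisson.pmf(k, mu=lam)
    # Normal approximation with matching mean and variance (both equal lambda)
    approx = norm.pdf(k, loc=lam, scale=np.sqrt(lam))
    print(f"lambda={lam}: max |Poisson - Normal| = {np.max(np.abs(pois - approx)):.4f}")
```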
Process-based/computational models of behavior
- mathematical equations that link the experimentally observable variables (e.g. stimuli, outcomes, past experiences) to behaviour
- computational models represent different “algorithmic hypotheses” about how behaviour is generated
General framework for fitting/analyzing nearly all models
- Maximum Likelihood (quantify goodness of fit)
- Non-linear optimization (finding the parameters that best fit the data)
- Quantifying uncertainty – likelihood profiles and bootstrapping
- Comparing models: Information Criteria and cross-validation
What do we use instead of OLS (in this course) and why?
Likelihood. Reason: OLS does not work for all types of data (e.g. non-normally distributed or binary data)
What is likelihood
Likelihood is the joint probability (or probability density) of the data given a set of parameter values
* In other words “the probability of the data given the parameter values”
* When errors of data points are independent, the joint probability of the data is the product of the probabilities/probability densities of all observations
PMF AND PDF
PMF: probability mass function – for discrete probability distributions (gives the probability of observations as a function of parameters)
PDF: probability density function – for continuous probability distributions (gives the probability density of observations as a function of parameters)
What is the problem we are trying to solve by optimization methods?
This applies to non-linear models. General problem: we want to find the parameters that maximize the log likelihood, but we don't know what the likelihood surface looks like; we can only evaluate the likelihood one parameter combination at a time.
2 types of optimization methods
Gradient-based methods (e.g. Newton's method, gradient descent) and gradient-free methods (e.g. the Nelder-Mead simplex)
Nelder-Mead simplex
- The Nelder-Mead simplex is an algorithm for searching parameter space to find a minimum
- Start by going in what seems to be the best direction, by reflecting the high (worst) point of the simplex through the face opposite it
- If the goodness-of-fit at the new point is better than the best (lowest) other point in the simplex, expand the length of the jump in that direction
- If this jump was bad (the height at the new point is worse than the second-worst point in the simplex), try a point that's only half as far out as the initial try
- If this second try, closer to the original, is also bad, contract the simplex around the current best (lowest) point
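In practice the simplex is rarely hand-coded; a hedged sketch using SciPy's built-in Nelder-Mead to minimize a negative log-likelihood (illustrative Poisson-rate example, not from the course materials):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import poisson

rng = np.random.default_rng(2)
counts = rng.poisson(lam=4.0, size=100)     # illustrative count data

def neg_log_likelihood(params, y):
    lam = np.exp(params[0])                 # optimize log(lambda) so the rate stays positive
    return -np.sum(poisson.logpmf(y, mu=lam))

fit = minimize(neg_log_likelihood, x0=[0.0], args=(counts,), method="Nelder-Mead")
print(np.exp(fit.x[0]))                     # close to the sample mean of the counts
```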
How does gradient descent find the maximum likelihood parameters?
- Calculate the partial derivatives of the –LL with respect to the parameters
- The vector of partial derivatives of the –LL with respect to the parameters is the gradient, which points in the direction of steepest ascent of the –LL
- We want to minimize the –LL, so we move a small amount in the opposite direction: θ_new = θ_old - α·∇(–LL(θ_old)), where α is a small step size (learning rate)
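Example (a minimal sketch of gradient descent on the –LL of a Gaussian with known σ, fitting only μ; the data and step size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = 1.0                                      # assume sigma is known, fit only mu
y = rng.normal(loc=3.0, scale=sigma, size=500)   # illustrative data

def neg_ll_gradient(mu, y, sigma):
    # Partial derivative of the Gaussian -LL with respect to mu
    return -np.sum(y - mu) / sigma**2

mu = 0.0        # initial guess
alpha = 0.001   # small step size (learning rate)
for _ in range(2000):
    mu = mu - alpha * neg_ll_gradient(mu, y, sigma)   # step opposite to the gradient

print(mu)        # converges to (roughly) the sample mean of y
```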
How do we avoid local minima
- All optimizers require an initial guess for parameter values, which is where the search process begins
  - Good starting guesses for the initial parameters help avoid local minima (based on the data, previous studies, or biological interpretation)
- Generally, gradient-free methods are more robust to local minima
- Optimize multiple times with different initial parameters (e.g. set up a coarse grid of parameter values and run the optimization initialized at every combination of the gridded parameters; see the sketch below)
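The multi-start idea as a sketch (assuming SciPy and an illustrative Gaussian negative log-likelihood; your real model would replace neg_log_likelihood):

```python
import itertools
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(4)
data = rng.normal(loc=1.0, scale=2.0, size=150)   # illustrative data

def neg_log_likelihood(params, y):
    mu, log_sigma = params
    return -np.sum(norm.logpdf(y, loc=mu, scale=np.exp(log_sigma)))

# Coarse grid of starting values for each parameter
mu_starts = [-5.0, 0.0, 5.0]
log_sigma_starts = [-1.0, 0.0, 1.0]

best_fit = None
for x0 in itertools.product(mu_starts, log_sigma_starts):
    fit = minimize(neg_log_likelihood, x0=list(x0), args=(data,), method="Nelder-Mead")
    if best_fit is None or fit.fun < best_fit.fun:
        best_fit = fit                            # keep the run with the lowest -LL

print(best_fit.x)                                 # best parameters across all starts
```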
Pros and cons of grid search, gradient-based and gradient free optimization methods
Grid search
Pros:
*Easy
*unlikely to miss the global minimum if the grid is set appropriately
Cons:
*Very slow for high dimensional problems
*Only as precise as the grid
Gradient-based (e.g. Newton’s Method)
Pros:
*Fast: converges to a minimum in far fewer steps
Cons:
*Easily caught in local minima if they exist
Gradient-free (e.g. Nelder-Mead)
Pros:
*Works for models of intermediate complexity
*Faster than grid search
Cons:
*Slower to converge than gradient-based methods
*Can still get caught in local minima
What question are we trying to answer with parameter recoverability and what steps does it involve?
If this cognitive process/behavior works like I think it works, will my experiment provide sufficient information to recover the parameters with the desired precision and without bias?
Steps of Parameter recoverability
1. Use your model and known parameter values to generate a synthetic data set
2. Simulate the experiment you plan to conduct (# of replicates, etc.)
3. Fit your model to the simulated data set
4. Compare the true and fitted parameter values
5. Repeat many times and evaluate the distribution of fitted parameter estimates compared to the true value that generated the data (to estimate precision and bias)
- When we are uncertain about the likely range of parameter values, we can do parameter recovery over a range of parameter values to see under what range we get precise, unbiased estimates
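Example (a sketch of a parameter-recovery simulation for an illustrative Poisson-rate model; the true parameter, trial count, and number of simulations are made up):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import poisson

rng = np.random.default_rng(5)
true_lambda = 3.0       # known "true" parameter used to generate synthetic data
n_trials = 50           # planned number of trials per simulated experiment
n_sims = 200            # number of simulated experiments

def neg_log_likelihood(params, y):
    return -np.sum(poisson.logpmf(y, mu=np.exp(params[0])))

estimates = []
for _ in range(n_sims):
    # Steps 1-2: simulate the planned experiment from the known parameters
    y_sim = rng.poisson(lam=true_lambda, size=n_trials)
    # Step 3: fit the model to the synthetic data
    fit = minimize(neg_log_likelihood, x0=[0.0], args=(y_sim,), method="Nelder-Mead")
    estimates.append(np.exp(fit.x[0]))

estimates = np.array(estimates)
# Steps 4-5: compare the distribution of estimates to the true value
print("bias:", estimates.mean() - true_lambda)
print("precision (SD of estimates):", estimates.std())
```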
Why must we use probability density instead of probability for continuous distributions?
- If a random variable is continuous, there are an infinite number of values it could take
- The probability of observing any specific value exactly is 0
- We can only define probabilities of observing values that fall within a specific range (e.g. between 0 and 1)
3 continuous distributions other than Gaussian (normal)
- Exponential distribution
  Time to event, or time between events, when events happen at a constant rate
- Weibull distribution
  Time to event, when the probability of the event increases or decreases with time
- Inverse-Gaussian
  E.g. the expected first-passage time distribution for a drift-diffusion process with one boundary
Probability density function
- A probability density function describes the relative likelihood that a value of a random variable would be equal to a specific value
- If we draw many random numbers from a continuous probability distribution, the probability density tells us the relative likelihood of drawing values near a specific value
- Example: If the prob density of x = 0 is 0.4, and the prob density of x = 1 is 0.2, we should expect to see twice as many observations near 0 compared to near 1.
- Integrates to 1
- Probability of observing a value of x between two bounds is equivalent to the area under the curve between the two bounds
What type of distribution is usually used to model RTs?
- Non-normal continuous probability distributions are often used to model reaction times, which tend to be positively skewed
- These distributions generally have multiple parameters that influence both the mean and the shape of the distribution
- We can model how these parameters differ across treatments and groups
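Example (a hedged sketch fitting an inverse-Gaussian to simulated, positively skewed "reaction times" with SciPy; the parameter values are illustrative):

```python
import numpy as np
from scipy.stats import invgauss

rng = np.random.default_rng(6)
# Illustrative positively skewed "reaction times" (seconds)
rts = invgauss.rvs(mu=0.5, scale=1.0, size=300, random_state=rng)

# Fit the inverse-Gaussian by maximum likelihood (location fixed at 0 for simplicity)
mu_hat, loc_hat, scale_hat = invgauss.fit(rts, floc=0.0)
print(mu_hat, scale_hat)

# Mean and skew of the fitted distribution (both depend on multiple parameters)
print(invgauss.stats(mu_hat, loc=loc_hat, scale=scale_hat, moments="ms"))
```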
Standard error and CIs (in normal distribution)
- Standard error – because the sampling distribution is normally distributed, we can quantify the shape of the sampling distribution by the estimated value of the parameter and the standard deviation of the sampling distribution, which we call a standard error
- Confidence intervals – we can use standard errors to construct confidence intervals. By definition, we expect the true parameter value to fall within the 95% confidence interval 95% of the time
Sampling distribution & central limit theorem
- Sampling distribution – the distribution of parameter estimates we would get if we repeated our study many times
  - The width of this distribution depends on the model/parameter, how noisy our data are, and the sample size
- Central Limit Theorem – if our sample size is large, the sampling distribution of parameter estimates is normally distributed
How does estimation of standard errors differ between linear and non-linear models?
- Estimating standard errors from one sample:
  - In general linear models, we can estimate the standard deviation of the sampling distribution for each parameter directly from our one dataset (these are the standard errors)
  - As with parameter estimation, in non-linear models standard errors need to be computed computationally, using likelihood profiles or bootstrapping
Bootstrapping steps
- For i in range(n_bootstraps):
  - Resample the data with replacement: individual observations can be sampled more than once
    - Sample the row indices of your df with replacement
    - The number of sampled rows should equal the number of rows in the df
  - Fit the model, estimate the parameters
  - Store the fitted parameters
- Calculate the standard error/CI based on the distribution of bootstrapped parameter values
  - 0.025 and 0.975 quantiles of the bootstrapped parameters for a 95% CI
  - Standard deviation of the bootstrapped parameters for standard errors
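Example (a minimal sketch with pandas; the bootstrapped statistic here is the Poisson-rate MLE, which is just the sample mean, standing in for whatever model you actually fit):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
df = pd.DataFrame({"count": rng.poisson(lam=4.0, size=80)})   # illustrative data

n_bootstraps = 1000
boot_estimates = []
for _ in range(n_bootstraps):
    # Resample rows with replacement; same number of rows as the original data
    resampled = df.sample(n=len(df), replace=True, random_state=rng)
    # Fit the model to the resampled data (here the Poisson MLE is the mean)
    boot_estimates.append(resampled["count"].mean())

boot_estimates = np.array(boot_estimates)
se = boot_estimates.std()                                       # bootstrap standard error
ci_low, ci_high = np.quantile(boot_estimates, [0.025, 0.975])   # 95% CI
print(se, ci_low, ci_high)
```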
Why and how do we use hierarchical bootstrapping?
- In hierarchical data, observations are not independent.
- Systematic variability across individuals or groups can lead to violations of the assumption of uncorrelated errors.
- Treating each data point as independent can result in overly narrow confidence intervals for parameter estimates, misrepresenting uncertainty.
- Hierarchical bootstrapping: a method to estimate model uncertainty for hierarchical data while preserving its structure.
* Example: Measuring recall as a function of time for 10 subjects, each tested at 20 different delay periods.
* Steps:
1. Resample groups: randomly sample 10 subjects with replacement.
2. Resample observations: for each sampled subject, randomly sample 20 observations with replacement.
3. Fit the model: fit the model on this resampled dataset.
4. Repeat: perform this process many times to generate a bootstrapped sampling distribution.
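A sketch following the example above (10 subjects × 20 observations), assuming pandas; fit_model is a hypothetical stand-in for your real model fit:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(8)

# Illustrative hierarchical data: 10 subjects, 20 delay periods each
df = pd.DataFrame({
    "subject": np.repeat(np.arange(10), 20),
    "delay": np.tile(np.arange(1, 21), 10),
})
df["recall"] = np.exp(-0.1 * df["delay"]) + rng.normal(0, 0.05, size=len(df))

def fit_model(data):
    # Hypothetical stand-in for a real model fit; here just the mean recall
    return data["recall"].mean()

boot_estimates = []
for _ in range(1000):
    # 1. Resample subjects with replacement
    sampled_subjects = rng.choice(df["subject"].unique(), size=10, replace=True)
    pieces = []
    for s in sampled_subjects:
        subj_data = df[df["subject"] == s]
        # 2. Resample that subject's observations with replacement
        pieces.append(subj_data.sample(n=len(subj_data), replace=True, random_state=rng))
    boot_df = pd.concat(pieces, ignore_index=True)
    # 3. Fit the model on the resampled dataset
    boot_estimates.append(fit_model(boot_df))

# 4. The collected estimates form the bootstrapped sampling distribution
print(np.quantile(boot_estimates, [0.025, 0.975]))
```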
2 main methods for model comparison
- Leave-one-out cross-validation (LOOCV)
- Akaike information criterion (AIC)
What is a likelihood ratio and when can we use it to compare models?
- Only when the models have the same number of parameters
- Likelihood ratios are a simple way of comparing models:
  - Fit two alternative models to the data
  - Calculate the log-likelihood of each model
  - The likelihood ratio is the ratio of the likelihoods of the two models: L(m1)/L(m2), or equivalently exp(logL(m1) - logL(m2)), which avoids numerical underflow
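Example (a tiny sketch with hypothetical log-likelihood values):

```python
import numpy as np

logL_m1 = -432.1   # hypothetical log-likelihoods from two fitted models
logL_m2 = -436.8   # with the same number of parameters

likelihood_ratio = np.exp(logL_m1 - logL_m2)   # L(m1) / L(m2), computed stably
print(likelihood_ratio)    # ~110: the data are ~110 times more likely under m1
```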
What is the purpose of cross-validation?
- Adding complexity to models will almost always improve the ability to fit a data set
- But in general the goal isn't to fit the data we have as well as possible; we want our models to generalize to new data
- One way of assessing this is to fit model to part of data, and test the predictions against another part of the data (cross-validation)
LOOCV
Leave-one-out cross-validation (LOOCV)
1. Select one observation as the test data; the rest will be used to fit the model (training data)
2. Use the model to predict y for the test data, and calculate the difference between the observed and predicted values
- Do this for every data point
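Example (a minimal LOOCV sketch, assuming NumPy; a straight line fitted with np.polyfit stands in for your actual model):

```python
import numpy as np

rng = np.random.default_rng(9)
x = np.linspace(0, 10, 30)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, size=30)    # illustrative data

squared_errors = []
for i in range(len(y)):
    train = np.arange(len(y)) != i                 # leave observation i out
    # Fit the model on the training data (straight line as a stand-in)
    slope, intercept = np.polyfit(x[train], y[train], deg=1)
    y_pred = intercept + slope * x[i]              # predict the held-out point
    squared_errors.append((y[i] - y_pred) ** 2)

print("LOOCV mean squared error:", np.mean(squared_errors))
```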
Why do complex models tend to be worse at generalizing?
- In general, performance is worse on out-of-sample data (data that was not used to fit the model), because our models partly fit noise
- Models that are too complex are worse at generalizing to new data because they are more flexible and so fit more of the noise in our sample
Bias-variance tradeoff
- The bias error is an error from erroneous assumptions in the model. For example, leaving important variables out of the model leads to underfitting.
- The variance is an error from sensitivity to small fluctuations in the training set. High variance may result from a model fitting the random noise in the training data (overfitting).
What is AIC
- AIC is a way of comparing and ranking models of different complexity
- AIC = 2k - 2 log L, where k is the number of parameters and log L is the maximized log-likelihood: it depends on the log-likelihood plus a penalty for the number of parameters
- The AIC score decreases with the log-likelihood (how well the model fits the data)
- AIC increases with the number of parameters
- The goal is to find models that minimize the AIC score
- AIC is based on information theory and aims to score models by their ability to generalize to new data
- Models with lower AIC scores are generally considered better because they strike a balance between goodness-of-fit and model complexity
AIC weights
- AIC weights convert differences in AIC into relative support for each model in the candidate set
- Compute ΔAIC_i = AIC_i - AIC_min for each model
- Akaike weight: w_i = exp(-ΔAIC_i / 2) / Σ_j exp(-ΔAIC_j / 2)
- The weights sum to 1 and can be interpreted as the relative probability that model i is the best-approximating model of those considered
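Example (a sketch computing AIC = 2k - 2 log L and Akaike weights from hypothetical fitted models):

```python
import numpy as np

# Hypothetical fitted models: (log-likelihood, number of parameters)
models = {"m1": (-520.3, 3), "m2": (-518.9, 5), "m3": (-530.1, 2)}

aic = {name: 2 * k - 2 * logL for name, (logL, k) in models.items()}
aic_values = np.array(list(aic.values()))

delta = aic_values - aic_values.min()     # AIC differences from the best model
weights = np.exp(-0.5 * delta)
weights = weights / weights.sum()         # Akaike weights sum to 1

for name, w in zip(aic, weights):
    print(name, round(aic[name], 1), round(w, 3))
```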