Data Analysis Flashcards

1
Q

What is Maximum Likelihood, and how is the log-likelihood often utilized in statistical modeling?

A

Maximum Likelihood is a statistical method for estimating the parameters of a model by maximizing the likelihood function, that is, by finding the parameter values that make the observed data most probable given the model. The log-likelihood is often used instead of the likelihood itself because it turns a product of probabilities into a sum, which simplifies calculation and makes optimization more convenient.
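
A minimal sketch of the idea, assuming count data modeled as Poisson. The data values, the grid search, and the function name are illustrative assumptions, not part of the card. The rate that maximizes the log-likelihood should land on the sample mean.

```python
import math

def poisson_log_likelihood(lam, counts):
    """Sum of log Poisson probabilities: x*log(lam) - lam - log(x!) for each count x."""
    return sum(x * math.log(lam) - lam - math.lgamma(x + 1) for x in counts)

# Hypothetical observed counts, made up for illustration.
counts = [2, 3, 1, 4, 2, 0, 3, 5]

# Crude grid search over candidate rates from 0.01 to 10.00.
candidates = [i / 100 for i in range(1, 1001)]
best_lam = max(candidates, key=lambda lam: poisson_log_likelihood(lam, counts))

sample_mean = sum(counts) / len(counts)
print(best_lam, sample_mean)
```

A real analysis would normally use a numerical optimizer rather than a grid; the grid is used here only to keep the sketch dependency-free.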

1
Q

Explain the criteria for testing the suitability of a Poisson distribution. Why is it important to examine both the variance and the mean?

A

To test the suitability of a Poisson distribution, compare the variance with the mean: in a Poisson distribution the two are equal. If the variance is clearly higher than the mean (overdispersion), the Poisson distribution may not be suitable for describing the data.
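
One way to sketch this check is a variance-to-mean ratio. The function name and both data sets below are hypothetical, and the ratio is only a rough screen, not a formal test.

```python
from statistics import mean, variance

def dispersion_index(counts):
    """Sample variance divided by sample mean; close to 1 under a Poisson model."""
    return variance(counts) / mean(counts)

# Hypothetical counts, made up for illustration.
even_counts = [2, 3, 1, 4, 2, 0, 3, 5]        # variance close to the mean
clumped_counts = [0, 0, 0, 12, 0, 1, 0, 15]   # variance far above the mean

print(dispersion_index(even_counts))
print(dispersion_index(clumped_counts))
```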

2
Q

When is the Binomial distribution used, and what characteristic defines the datasets suitable for this distribution?

A

The Binomial distribution is used when each trial has only two possible outcomes, such as black or white, alive or dead, true or false. It describes how many times one of those outcomes occurs in a fixed number of trials.

3
Q

Provide an example of a situation where the negative binomial distribution is applicable. How does it describe the occurrence of events before a specific count is reached?

A

The negative binomial distribution describes the number of trials needed for a specified number of successes to occur. It provides a model for situations where you are interested in how many trials it takes for an event to happen a certain number of times.
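
A small sketch of the "number of trials until the r-th success" form described above. The function name and the values of r and p are assumptions chosen for illustration.

```python
from math import comb

def trials_until_r_successes_pmf(n, r, p):
    """P(the r-th success occurs on trial n), each trial succeeding with probability p."""
    if n < r:
        return 0.0  # r successes cannot happen in fewer than r trials
    # The last trial is a success; the other r-1 successes fall among the first n-1 trials.
    return comb(n - 1, r - 1) * p**r * (1 - p) ** (n - r)

# Hypothetical setting: success probability 0.5 per trial, waiting for 3 successes.
r, p = 3, 0.5
for n in range(3, 8):
    print(n, trials_until_r_successes_pmf(n, r, p))
```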

4
Q

In what context might the Beta distribution be employed, and what are the limitations on the values it can take?

A

The Beta distribution is used to model random variables that are constrained to the interval [0, 1], such as proportions and rates. An example scenario is mean seedling establishment rates among populations.

5
Q

Describe the type of data that the Gamma distribution is suitable for, and explain its characteristics in handling continuous, positive data.

A

The Gamma distribution is suitable for describing continuous, positive data. It is often used in situations where the waiting time for a Poisson-distributed event is of interest.
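
A simulation sketch of the waiting-time link, assuming events arrive at a constant rate so that gaps between events are exponential. The rate, the event count k, and the sample size are hypothetical.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the sketch is repeatable

rate = 2.0  # hypothetical event rate (events per unit time)
k = 3       # wait for the 3rd event

# Waiting time until the k-th event = sum of k exponential gaps between events.
summed_gaps = [sum(random.expovariate(rate) for _ in range(k)) for _ in range(5000)]

# Direct draws from a Gamma distribution with shape k and scale 1/rate.
gamma_draws = [random.gammavariate(k, 1 / rate) for _ in range(5000)]

# Both sample means should be near the theoretical mean k / rate.
print(mean(summed_gaps), mean(gamma_draws), k / rate)
```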

6
Q

What does a zero-inflated distribution entail, and how does it combine elements of both the binomial and integer-generating distributions?

A

A zero-inflated distribution is a mixture of a binomial distribution and one that generates integers. It is employed when there is an excess of zero values in the data, combining aspects of both binomial and count distributions.
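
As a sketch, assuming the count part is Poisson (a common choice, though the card does not name one): a two-outcome step decides whether an observation is an "extra" zero, and otherwise the count comes from the Poisson part. The function name and parameter values are illustrative.

```python
import math

def zip_pmf(k, pi_zero, lam):
    """Zero-inflated Poisson: extra zero with probability pi_zero, else Poisson(lam)."""
    poisson = math.exp(-lam) * lam**k / math.factorial(k)
    if k == 0:
        # Zeros arise either from the extra-zero step or from the Poisson part.
        return pi_zero + (1 - pi_zero) * poisson
    return (1 - pi_zero) * poisson

# Hypothetical parameters: 30% extra zeros, Poisson mean 2.
for k in range(5):
    print(k, zip_pmf(k, 0.3, 2.0))
```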

7
Q

How does the Multinomial distribution differ from the Binomial distribution, and in what scenarios would it be used to describe data?

A

The Multinomial distribution is used to describe data with a limited number of categories, where each observation falls into one of several categories. It extends the Binomial distribution to more than two outcomes.
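
A brief sketch of the multinomial probability, with hypothetical category probabilities. With only two categories it gives the same value as the Binomial.

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """Probability of observing the given count in each category."""
    n = sum(counts)
    # Number of distinct orderings that produce these counts.
    arrangements = factorial(n) // prod(factorial(c) for c in counts)
    return arrangements * prod(p**c for p, c in zip(probs, counts))

# Three categories (hypothetical probabilities), one observation in each.
print(multinomial_pmf([1, 1, 1], [0.2, 0.3, 0.5]))

# Two categories: matches the binomial probability of 2 successes in 3 trials.
print(multinomial_pmf([2, 1], [0.5, 0.5]))
```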

8
Q

Explain the significance of the Central Limit Theorem and its role in shaping the distribution of sums of independent random variables.

A

The Central Limit Theorem states that when independent random variables are added up, the resulting sums tend to form a bell-shaped ‘normal’ distribution, even if the underlying variables are not normally distributed. It is a fundamental concept in probability and statistics.
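
A simulation sketch, assuming strongly skewed exponential draws as the underlying variables. The number of terms, the number of repeats, and the seed are arbitrary choices. If the sums are roughly normal, about 68% of them should lie within one standard deviation of their mean.

```python
import random
from statistics import mean, stdev

random.seed(0)  # fixed seed so the sketch is repeatable

def sum_of_skewed_draws(n_terms):
    """Add up n_terms independent draws from a skewed exponential distribution."""
    return sum(random.expovariate(1.0) for _ in range(n_terms))

sums = [sum_of_skewed_draws(30) for _ in range(2000)]
centre, spread = mean(sums), stdev(sums)

# Share of sums within one standard deviation of the mean.
within_one_sd = sum(abs(s - centre) <= spread for s in sums) / len(sums)
print(centre, spread, within_one_sd)
```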

9
Q

How is a regression line fitted?

A

The regression line is fitted by minimizing the sum of squared differences between the actual response variable values and the values predicted by the model. These differences are the residuals, and by examining their size and distribution, we assess the fit of the model.
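
A minimal least-squares sketch for a single explanatory variable, using the standard closed-form solution. The data points and the function name are hypothetical.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical data, made up for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_line(xs, ys)

# Residuals: observed response minus the value predicted by the fitted line.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
print(intercept, slope, residuals)
```

With an intercept in the model, the residuals of a least-squares fit sum to zero, which is a quick sanity check on the calculation.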

10
Q

Steps of Fitting Regression Models:

A

The steps include data exploration, data transformation, model selection, and model validation. If the residuals are not normally distributed, a linear model may not be suitable.

11
Q

What are the assumptions in Regression Modeling?

A

Assumptions include normally distributed residuals, homogeneity of variance (no heteroscedasticity), explanatory variables whose values are known without error, residuals that are independent of the explanatory variables, and response values that are independent of one another.

12
Q

Model Selection:

A

Model selection involves considering biological relationships, exploring numerical output, and using methods like dropping variables based on hypothesis testing or information criteria (e.g., AIC).

13
Q

“AIC”:

A

Akaike’s Information Criterion is a measure of relative goodness of fit for a statistical model that balances the number of parameters (k) against the goodness of fit (L). Among the candidate models, the one with the lowest AIC is preferred: extra parameters are penalized, so a more complex model wins only if it fits sufficiently better.
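
The standard formula is AIC = 2k - 2 ln(L), where L is the maximized likelihood. A small sketch with two hypothetical models; the log-likelihood values and parameter counts are made up for illustration.

```python
def aic(log_likelihood, k):
    """AIC = 2k - 2*ln(L), given the maximized log-likelihood and parameter count k."""
    return 2 * k - 2 * log_likelihood

# Hypothetical models fitted to the same data and response variable.
simple_model = aic(log_likelihood=-120.0, k=3)
complex_model = aic(log_likelihood=-119.5, k=6)

# The complex model fits slightly better but pays for three extra parameters.
print(simple_model, complex_model)
```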

14
Q

What does Interaction in Modeling mean?

A

Interaction in modeling refers to the combined effect of two variables that is different from the sum of their individual effects. Interaction is detected by examining the significance of interaction terms in the model.
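
A sketch of what an interaction term does in a linear predictor, using made-up coefficients (b0 to b3 are hypothetical). With a non-zero interaction coefficient b3, the effect of x1 depends on the value of x2.

```python
def predict(x1, x2, b0=1.0, b1=2.0, b2=0.5, b3=1.5):
    """Linear predictor with an interaction term b3 * x1 * x2."""
    return b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2

# Effect of raising x1 from 0 to 1 at two different values of x2.
effect_when_x2_is_0 = predict(1, 0) - predict(0, 0)
effect_when_x2_is_1 = predict(1, 1) - predict(0, 1)

# The two effects differ by exactly b3; without interaction they would be equal.
print(effect_when_x2_is_0, effect_when_x2_is_1)
```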

15
Q

Model Output Examination:

A

Numerical outputs include summary statistics and the results of anova(). These outputs provide information about the fit and significance of variables in the regression model.

16
Q

When can we compare AIC values?

A

When the compared models:
- have exactly the same response variable
- are fitted to exactly the same dataset

17
Q

What is overfitting?

A

Overfitting refers to a situation where a model fits the training data too closely, capturing noise and random fluctuations rather than the true underlying patterns. This can lead to poor performance on new data.

18
Q

“PCA”

A

Principal Component Analysis is a statistical technique used for reducing the dimensionality of data while retaining most of the important information.

19
Q

“CCA”

A

Canonical Correspondence Analysis identifies new variables that maximize the inter-relationships between two data sets.

20
Q

“CA”:

A

Cellular automata are discrete computational models used to simulate complex systems, often composed of many simple, interacting components. These systems are represented as grids of cells, with each cell having a state that evolves over discrete time steps based on a set of rules.
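
A minimal sketch of a one-dimensional cellular automaton. The choice of elementary rule 90, the wrap-around edges, and the grid size are assumptions made for illustration; the card does not name a specific rule.

```python
def step(cells, rule=90):
    """Advance a ring of 0/1 cells by one time step under an elementary rule."""
    n = len(cells)
    new_cells = []
    for i in range(n):
        left, centre, right = cells[i - 1], cells[i], cells[(i + 1) % n]
        # Encode the three-cell neighbourhood as a number from 0 to 7.
        neighbourhood = (left << 2) | (centre << 1) | right
        # The matching bit of the rule number gives the cell's next state.
        new_cells.append((rule >> neighbourhood) & 1)
    return new_cells

# Start from a single live cell in the middle of the row.
cells = [0] * 7 + [1] + [0] * 7
for _ in range(8):
    print("".join("#" if c else "." for c in cells))
    cells = step(cells)
```

Under rule 90 each cell's next state is the exclusive-or of its two neighbours, so a simple local rule still produces a nested triangular pattern.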

21
Q

Random Effect

A

  • We’re interested in the variance they explain.
  • Factor levels are assumed to be drawn from a (normal) distribution.
  • They influence only the variance of the response variable in mixed effect models.

22
Q

Fixed Effect

A

  • We’re interested in model parameter values.
  • “Estimate fixed effects from data.”
  • They influence only the mean of the response variable in mixed effect models.