Data Analysis Flashcards

Question 1

Q

What is Maximum Likelihood, and how is the log-likelihood often utilized in statistical modeling?

Answer

A

Maximum Likelihood is a statistical method for estimating the parameters of a model by finding the parameter values that make the observed data most probable given the model, that is, by maximizing the likelihood function. The log-likelihood is usually maximized instead because the logarithm turns a product of probabilities into a sum, which simplifies the calculations and makes optimization more convenient. Both functions reach their maximum at the same parameter values.
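
As a rough illustration, the sketch below maximizes a Poisson log-likelihood over a coarse grid of candidate rates. The counts are made-up example data and the function name is illustrative; a real analysis would normally use a numerical optimizer rather than a grid.

```python
import math

def poisson_log_likelihood(lam, counts):
    """Log-likelihood of independent Poisson counts for a rate lam > 0."""
    return sum(k * math.log(lam) - lam - math.lgamma(k + 1) for k in counts)

# Hypothetical data: counts observed in six sampling units.
counts = [2, 3, 0, 4, 1, 2]

# Candidate rates 0.01, 0.02, ..., 10.00; keep the one with the highest log-likelihood.
grid = [i / 100 for i in range(1, 1001)]
lam_hat = max(grid, key=lambda lam: poisson_log_likelihood(lam, counts))

# For the Poisson model the analytic maximizer is the sample mean.
sample_mean = sum(counts) / len(counts)
```

Here the grid search and the sample mean agree, which is the known closed-form result for the Poisson rate.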

Question 2

Q

Explain the criteria for testing the suitability of a Poisson distribution. Why is it important to examine both the variance and the mean?

Answer

A

To test the suitability of a Poisson distribution, compare the variance of the data with the mean, because a Poisson distribution has a variance equal to its mean. If the variance is clearly higher than the mean (overdispersion), the Poisson distribution may not be suitable for describing the data.
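
A quick way to run this check is to compute the variance-to-mean ratio, which should be close to 1 for Poisson data. The sketch below uses invented counts and an illustrative function name; it is a screening step, not a formal test.

```python
import statistics

def dispersion_index(counts):
    """Sample variance divided by sample mean; roughly 1 under a Poisson model."""
    return statistics.variance(counts) / statistics.mean(counts)

# Hypothetical counts with many zeros and a few large values.
counts = [0, 0, 1, 0, 7, 0, 12, 0, 3, 0]
ratio = dispersion_index(counts)

# A ratio well above 1 suggests overdispersion relative to the Poisson.
overdispersed = ratio > 1
```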

Question 3

Q

When is the Binomial distribution used, and what characteristic defines the datasets suitable for this distribution?

Answer

A

The Binomial distribution is used when each trial has only two possible outcomes, such as black or white, alive or dead, true or false. It describes the number of "successes" in a fixed number of independent trials.

Question 4

Q

Provide an example of a situation where the negative binomial distribution is applicable. How does it describe the occurrence of events before a specific count is reached?

Answer

A

The negative binomial distribution describes the number of trials needed for a specified number of successes to occur, for example the number of coin tosses needed to obtain three heads. It provides a model for situations where you are interested in how long it takes for an event to happen a certain number of times.
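
The sketch below writes out the probability that the r-th success occurs on trial n, assuming independent trials with a constant success probability p. The function name and the values r = 3, p = 0.5 are chosen only for illustration.

```python
import math

def neg_binomial_trials_pmf(n, r, p):
    """P(the r-th success occurs on trial n) for success probability p per trial."""
    if n < r:
        return 0.0
    return math.comb(n - 1, r - 1) * p**r * (1 - p) ** (n - r)

# Hypothetical setting: wait for 3 successes with a 50% chance per trial.
r, p = 3, 0.5
probs = {n: neg_binomial_trials_pmf(n, r, p) for n in range(r, 200)}

total = sum(probs.values())                             # close to 1
expected_trials = sum(n * q for n, q in probs.items())  # close to r / p
```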

Question 5

Q

In what context might the Beta distribution be employed, and what are the limitations on the values it can take?

Answer

A

The Beta distribution is used to model continuous random variables that are constrained to the interval [0, 1], such as proportions and probabilities. An example scenario is mean seedling establishment rates among populations.

Question 6

Q

Describe the type of data that the Gamma distribution is suitable for, and explain its characteristics in handling continuous, positive data.

Answer

A

The Gamma distribution is suitable for describing continuous, positive data. It is often used where the quantity of interest is the waiting time until a given number of events has occurred in a Poisson process.

Question 7

Q

What does a zero-inflated distribution entail, and how does it combine elements of both the binomial and integer-generating distributions?

Answer

A

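A zero-inflated distribution is a mixture of a binomial distribution and a distribution that generates integers (counts), such as the Poisson. It is employed when the data contain more zeros than the count distribution alone would produce: the binomial part determines whether an observation is an extra zero, and the count part generates the remaining values.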
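
One common instance is the zero-inflated Poisson. The sketch below assumes a Poisson count component and a mixing probability pi for the extra zeros; the function names and parameter values are illustrative.

```python
import math

def poisson_pmf(k, lam):
    """Poisson probability of observing the count k."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def zip_pmf(k, pi, lam):
    """Zero-inflated Poisson: an extra zero with probability pi, else Poisson(lam)."""
    base = (1 - pi) * poisson_pmf(k, lam)
    return pi + base if k == 0 else base

# Hypothetical parameters: 30% extra zeros, Poisson rate 2.
pi, lam = 0.3, 2.0
p_zero_zip = zip_pmf(0, pi, lam)
p_zero_poisson = poisson_pmf(0, lam)
```

The probability of a zero under the mixture is larger than under the plain Poisson with the same rate, which is the excess the model is meant to capture.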

Question 8

Q

How does the Multinomial distribution differ from the Binomial distribution, and in what scenarios would it be used to describe data?

Answer

A

The Multinomial distribution is used to describe data with a limited number of categories, where each observation falls into one of several categories. It extends the Binomial distribution to more than two outcomes.

Question 9

Q

Explain the significance of the Central Limit Theorem and its role in shaping the distribution of sums of independent random variables.

Answer

A

The Central Limit Theorem states that when independent random variables are added up, the resulting sums tend to form a bell-shaped ‘normal’ distribution, even if the underlying variables are not normally distributed. It is a fundamental concept in probability and statistics.
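
A small simulation can make this concrete. The sketch below sums 30 uniform random numbers many times and summarizes the resulting sums; the seed, sample sizes, and variable names are arbitrary choices for illustration.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable
n_terms = 30

# Each value is the sum of 30 independent Uniform(0, 1) draws.
sums = [sum(rng.random() for _ in range(n_terms)) for _ in range(5000)]

mean_of_sums = statistics.mean(sums)  # theory: 30 * 0.5 = 15
sd_of_sums = statistics.stdev(sums)   # theory: sqrt(30 / 12), about 1.58

# For a normal distribution about 68% of values lie within one standard deviation.
within_one_sd = sum(abs(s - mean_of_sums) <= sd_of_sums for s in sums) / len(sums)
```

Although each uniform draw is flat rather than bell-shaped, the share of sums within one standard deviation should come out close to the normal value of roughly 0.68.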

Question 10

Q

How do you fit a regression line?

Answer

A

The regression line is fitted by minimizing the sum of squared differences between the actual response variable values and the values predicted by the model. These differences are the residuals, and by examining their size and distribution, we assess the fit of the model.
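
For a single explanatory variable, the least-squares line has a closed-form solution. The sketch below applies it to invented data; the function and variable names are illustrative.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical data.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_line(x, y)

# Residuals: observed minus predicted response values.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
rss = sum(r**2 for r in residuals)  # the quantity that the fit minimizes
```

Any other intercept or slope would give a larger residual sum of squares on these data.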

Question 11

Q

Steps of Fitting Regression Models:

Answer

A

The steps include data exploration, data transformation, model selection, and model validation. If the residuals are not normally distributed, a linear model may not be suitable.

Question 12

Q

What are the assumptions in Regression Modeling?

Answer

A

Assumptions include normally distributed residuals, homogeneity of variance (no heteroscedasticity), explanatory variables whose values are known with certainty, and independence: the residuals are independent of the explanatory variables, and the response values are independent of one another.

Question 13

Q

Model Selection:

Answer

A

Model selection involves considering biological relationships, exploring numerical output, and using methods like dropping variables based on hypothesis testing or information criteria (e.g., AIC).

Question 14

Q

“AIC”:

Answer

A

Akaike’s Information Criterion is a measure of the relative goodness of fit of a statistical model. It balances the number of parameters (k) against the goodness of fit, measured by the maximized likelihood (L): AIC = 2k - 2 ln(L). Among the models compared, the one with the lowest AIC is preferred as the best trade-off between fit and parsimony.
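
Using the standard definition AIC = 2k - 2 ln(L), a comparison can be sketched as follows. The log-likelihood values and parameter counts are made up for illustration.

```python
def aic(log_likelihood, k):
    """Akaike's Information Criterion from a maximized log-likelihood and k parameters."""
    return 2 * k - 2 * log_likelihood

# Hypothetical fits to the same data and response variable.
aic_simple = aic(log_likelihood=-120.5, k=3)   # fewer parameters, slightly worse fit
aic_complex = aic(log_likelihood=-119.8, k=6)  # more parameters, slightly better fit

preferred = "simple" if aic_simple < aic_complex else "complex"
```

In this invented case the small gain in likelihood does not offset the penalty for three extra parameters, so the simpler model has the lower AIC.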

Question 15

Q

What does Interaction in Modeling mean?

Answer

A

Interaction in modeling refers to the combined effect of two variables that is different from the sum of their individual effects. Interaction is detected by examining the significance of interaction terms in the model.
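
In a linear model an interaction is usually represented by a product term. The sketch below uses invented coefficients to show that, with such a term, the effect of one variable depends on the value of the other.

```python
def predict(x1, x2, b0=1.0, b1=2.0, b2=0.5, b12=1.5):
    """Linear predictor with an interaction term b12 * x1 * x2 (coefficients are made up)."""
    return b0 + b1 * x1 + b2 * x2 + b12 * x1 * x2

# Effect of raising x1 from 0 to 1 at two different values of x2.
effect_x1_when_x2_is_0 = predict(1, 0) - predict(0, 0)
effect_x1_when_x2_is_1 = predict(1, 1) - predict(0, 1)

# With b12 = 0 the two effects would be identical (purely additive model).
additive_gap = (predict(1, 1, b12=0.0) - predict(0, 1, b12=0.0)) - (
    predict(1, 0, b12=0.0) - predict(0, 0, b12=0.0)
)
```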

Question 16

Q

Model Output Examination:

Answer

A

Numerical outputs include summary statistics and the results of anova(). These outputs provide information about the fit and significance of variables in the regression model.

Question 17

Q

When can we compare AIC values?

Answer

A

AIC values can be compared when the models
- have the exact same response variable
- are fitted to exactly the same dataset

Question 18

Q

What is overfitting?

Answer

A

Overfitting refers to a situation where a model learns the training data too well, capturing noise and random fluctuations rather than the true underlying patterns. This can lead to poor performance on new data.

Question 19

Q

“PCA”

Answer

A

Principal Component Analysis is a statistical technique used for reducing the dimensionality of data while retaining most of the important information.
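
For two variables the idea can be sketched without any libraries: the principal components are the eigenvectors of the covariance matrix, and the eigenvalues give the variance each component captures. The data and the function name below are invented for illustration.

```python
import math

def first_pc_variance_share(points):
    """Share of total variance captured by the first principal component (2-D data)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    # Eigenvalues of a symmetric 2x2 matrix from its trace and determinant.
    half_trace = (sxx + syy) / 2
    root = math.sqrt(max(half_trace**2 - (sxx * syy - sxy**2), 0.0))
    largest, smallest = half_trace + root, half_trace - root
    return largest / (largest + smallest)

# Hypothetical, strongly correlated measurements.
points = [(1.0, 1.2), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1)]
share = first_pc_variance_share(points)
```

When the two variables are strongly correlated, as here, a single component retains nearly all of the variance, so one dimension can stand in for two.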

Question 20

Q

“CCA”

Answer

A

Canonical Correspondence Analysis is a technique with which we can identify new variables that maximize the inter-relationships between two data sets.

Question 21

Q

“CA”:

Answer

A

Cellular automata are discrete computational models used to simulate complex systems, often composed of many simple, interacting components. These systems are represented as grids of cells, with each cell having a state that evolves over discrete time steps based on a set of rules.
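
A minimal example is a one-dimensional, two-state automaton. The sketch below assumes the elementary "Rule 90" update (each cell becomes the exclusive-or of its two neighbours) on a ring of seven cells; the grid size and starting state are arbitrary.

```python
def step(cells, rule=90):
    """Advance a 1-D binary cellular automaton by one time step (wrap-around edges)."""
    n = len(cells)
    new_cells = []
    for i in range(n):
        left, centre, right = cells[(i - 1) % n], cells[i], cells[(i + 1) % n]
        # The three neighbourhood states form a number 0..7 that indexes the rule's bits.
        index = (left << 2) | (centre << 1) | right
        new_cells.append((rule >> index) & 1)
    return new_cells

# Start with a single live cell in the middle and record three updates.
cells = [0, 0, 0, 1, 0, 0, 0]
history = [cells]
for _ in range(3):
    cells = step(cells)
    history.append(cells)
```

Each row in `history` is one time step; printing the rows one under another shows the pattern spreading outwards from the initial cell.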