General Flashcards

1
Q

Define bias

A

The bias is an error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting)

2
Q

Define variance

A

The variance is an error from sensitivity to small fluctuations in the training set. High variance can cause an algorithm to model the random noise in the training data, rather than the intended outputs (overfitting).

3
Q

Define bias-variance tradeoff

A

It is the compromise of choosing a model that both accurately captures the regularities in its training data and generalises well to unseen data. High-variance learning methods represent their training set well but overfit to noisy or unrepresentative training data. In contrast, algorithms with high bias typically produce simpler models that don’t tend to overfit but may underfit their training data, failing to capture important regularities.

Models with low bias are usually more complex (e.g. higher-order regression polynomials), enabling them to represent the training set more accurately. In the process, however, they may also represent a large noise component in the training set, making their predictions less accurate - despite their added complexity. In contrast, models with higher bias tend to be relatively simple (low-order or even linear regression polynomials) but may produce lower variance predictions when applied beyond the training set.
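A minimal numpy sketch of this (all data made up): fit polynomials of increasing degree to noisy samples of a sine curve and compare train and test error.

    import numpy as np

    rng = np.random.default_rng(0)
    x_train = np.sort(rng.uniform(0, 1, 20))
    y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 20)
    x_test = np.sort(rng.uniform(0, 1, 200))
    y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, 200)

    for degree in (1, 3, 9):
        coeffs = np.polyfit(x_train, y_train, degree)  # least-squares polynomial fit
        train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
        test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
        print(degree, round(train_mse, 3), round(test_mse, 3))
    # degree 1 underfits (high bias): both errors are high.
    # degree 9 overfits (high variance): tiny train error, larger test error.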

4
Q

How to overcome overfitting

A
  • Reduce the model complexity (fewer features)
  • Regularisation (features contribute less; see the sketch below)
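A minimal sketch of the regularisation route, assuming the ridge (L2) closed form B = (X^T X + lambda I)^{-1} X^T y and made-up data: increasing lambda shrinks the weights, reducing sensitivity to noise.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(50, 5))
    y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(0, 0.5, 50)

    def ridge(X, y, lam):
        n_features = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

    print(ridge(X, y, lam=0.0))   # plain OLS
    print(ridge(X, y, lam=10.0))  # coefficients shrink towards zero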

5
Q

What is a vector norm

A

A way of measuring the length of a vector

6
Q

Give examples of vector norms

A
  • L1
  • L2

7
Q

Define length of L2 norm ||B||_2

A

||B||_2 = √(B_0^2 + B_1^2)

8
Q

Define length of L1 norm ||B||_1

A

||B||_1 = |B_0| + |B_1|
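Both definitions in numpy, just to make them concrete:

    import numpy as np

    B = np.array([3.0, -4.0])
    print(np.linalg.norm(B, ord=2))   # sqrt(3^2 + 4^2) = 5.0
    print(np.linalg.norm(B, ord=1))   # |3| + |-4|      = 7.0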

9
Q

Sketch ||B||_2 = 2 and ||B||_1 = 2

A

See https://en.wikipedia.org/wiki/File:L1_and_L2_balls.jpg (the L2 ball is a circle and the L1 ball a diamond, each crossing the axes at 2)

10
Q

Describe Ordinary Least Squares

A

OLS chooses the parameters of a linear function of a set of explanatory variables by minimizing the sum of the squares of the differences between the observed dependent variable (values of the variable being predicted) in the given dataset and those predicted by the linear function. Geometrically, this is seen as the sum of the squared distances, parallel to the axis of the dependent variable, between each data point in the set and the corresponding point on the regression line
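A minimal numpy sketch (made-up data): solve the normal equations B = (X^T X)^{-1} X^T y and check against numpy's least-squares routine.

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.uniform(0, 10, 30)
    X = np.column_stack([np.ones_like(x), x])        # intercept + slope
    y = 2.0 + 0.7 * x + rng.normal(0, 1.0, 30)

    beta_normal = np.linalg.solve(X.T @ X, X.T @ y)  # normal equations
    beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(beta_normal, beta_lstsq)                   # both ~ [2.0, 0.7]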

11
Q

What is iid?

A

A sequence or other collection of random variables is independent and identically distributed (iid) if each random variable has the same probability distribution as the others and all are mutually independent.
The assumption is important in the classical form of the central limit theorem, which states that the probability distribution of the sum (or average) of IID variables with finite variance approaches a normal distribution.

12
Q

What is the problem with highly correlated explanatory variables in OLS?

A

The coefficient estimates have very high variance between different samples, so feature weights can become abnormally large (see the sketch below)
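An illustrative sketch (made-up data): two nearly identical columns produce large, unstable, nearly cancelling weights across repeated fits.

    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.normal(size=100)
    for trial in range(3):
        X = np.column_stack([x, x + rng.normal(0, 1e-3, 100)])  # correlation ~ 1
        y = x + rng.normal(0, 0.1, 100)           # the true weight on the sum is 1
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        print(beta)  # e.g. [ 7.3, -6.3]: huge, unstable, nearly cancelling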

13
Q

What is C in Ridge Regression (L2)?

A

C is the radius of the circle (the L2 ball) within which the coefficients must lie,

where you define ||B||^2_2 <= C^2, i.e. ||B||_2 <= C

14
Q

What is the main difference in outcome between using L1 and L2 space for regularisation?

A

Given the L1 diamond shape as opposed to the L2 circle, you’re more likely to hit a corner which zeros coefficients.

15
Q

Which regularisation gives a sparse response?

A

L1, as it zeros some coefficients
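A quick illustrative comparison, assuming scikit-learn is installed; the data and alpha values are arbitrary.

    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    rng = np.random.default_rng(4)
    X = rng.normal(size=(100, 10))
    y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.5, 100)  # only 2 features matter

    print(Lasso(alpha=0.1).fit(X, y).coef_)  # most coefficients exactly 0.0
    print(Ridge(alpha=0.1).fit(X, y).coef_)  # small but non-zero everywhere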

16
Q

What is a generative model?

A

A generative model describes how data is generated, in terms of a probabilistic model.

In the scenario of supervised learning, a generative model estimates the joint probability distribution of data P(X, Y) between the observed data X and corresponding labels Y
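A tiny discrete illustration (the count table is made up): estimate the joint P(X, Y) from counts, then recover the marginal P(Y) and the posterior P(Y|X) from it.

    import numpy as np

    # rows: feature value x in {0, 1}; columns: label y in {0, 1}
    counts = np.array([[30, 10],
                       [5, 55]])
    joint = counts / counts.sum()                           # P(X, Y)
    p_y = joint.sum(axis=0)                                 # marginal P(Y)
    p_y_given_x = joint / joint.sum(axis=1, keepdims=True)  # P(Y|X), one row per x
    print(joint)
    print(p_y)
    print(p_y_given_x)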

17
Q

Give examples of generative models

A
  • Naive Bayes
  • Hidden Markov Models
  • Latent Dirichlet Allocation
  • Boltzmann Machines
18
Q

Why would you choose a discriminative model?

A

Because you don’t have enough data to estimate the density f reliably, so the variance would be massive.
A generative model must estimate the full joint:
p(x, y) = f(x|y) p(y)

19
Q

Generative versus discriminative, discuss

A
A discriminative model gives the probability of a class given an observation, P(C|x); a generative model gives the probability of an observation given a class, P(x|C). Given data, a generative model models the whole distribution, whereas a discriminative model models only the decision boundary.
https://www.youtube.com/watch?v=OWJ8xVGRyFA
20
Q

Pros and cons of discriminative model

A

Pros: easy to fit and needs fewer observations.
Cons: can classify, but cannot generate the data/observations back.

21
Q

Pros and cons of generative model

A

Pros: you get the underlying idea of what the classifier is built on.
Cons: very expensive (lots of parameters) and needs lots of data.

22
Q

Define SVM

A

A non-probabilistic binary linear classifier that separates the categories by a clear gap that is as wide as possible, using a hyperplane or set of hyperplanes defined so that the distance between the hyperplane and the nearest point x_i from either group is maximised

23
Q

How do SVM perform non-linear classification?

A

the kernel trick

24
Q

Describe the kernel trick

A

The idea is that data that isn’t linearly separable in n-dimensional space may be linearly separable in a higher-dimensional space. But, because of Lagrangian duality, we need not compute the exact transformation of our data; we just need the inner product of our data in that higher-dimensional space.
https://towardsdatascience.com/understanding-the-kernel-trick-e0bc6112ef78
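A small numeric check of the trick for the polynomial kernel K(x, z) = (x · z)^2, whose explicit feature map in 2-d is phi(x) = (x1^2, sqrt(2) x1 x2, x2^2); the vectors are made up.

    import numpy as np

    def phi(x):
        # explicit map into the higher-dimensional space
        return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

    x = np.array([1.0, 2.0])
    z = np.array([3.0, -1.0])
    print(np.dot(x, z) ** 2)       # kernel: (1*3 + 2*(-1))^2 = 1.0
    print(np.dot(phi(x), phi(z)))  # same value, via the explicit map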

25
Q

Advantages and disadvantages of SVM

A

Advantages

  • it has a regularisation parameter, which makes the user think about avoiding over-fitting.
  • it uses the kernel trick, so you can build in expert knowledge about the problem by engineering the kernel, under which the data becomes linearly separable.
  • an SVM is defined by a convex optimisation problem (no local minima) for which there are efficient methods (e.g. SMO).
  • it is an approximation to a bound on the test error rate, and there is a substantial body of theory behind it which suggests it should be a good idea.

Disadvantages

  • the regularisation and kernel parameters (and the choice of kernel itself) must be determined, typically by costly model selection. In a way the SVM moves the problem of over-fitting from optimising the parameters to model selection
  • not great with multiclass
26
Q

Describe Gradient boosting

A
  • Have data
  • Fit simple single-layer decision tree regressor (simple step function - one transition point)
  • plot out error residuals from first fit
  • fit single-layer decision tree regressor two to error residuals
  • combine models one and two for marginally more complex fit (two transition points) for model three
  • plot out error residuals from fit three
  • fit single-layer decision tree regressor four to error residuals from fit three
  • combine
  • etc. (a code sketch follows below)
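A minimal sketch of that loop, assuming scikit-learn is installed; depth-1 trees (stumps), squared-error residuals, and no learning rate, for simplicity.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(5)
    X = np.sort(rng.uniform(0, 10, 100)).reshape(-1, 1)
    y = np.sin(X.ravel()) + rng.normal(0, 0.1, 100)

    prediction = np.zeros_like(y)
    stumps = []
    for _ in range(50):
        residuals = y - prediction                 # errors of the combined model so far
        stump = DecisionTreeRegressor(max_depth=1).fit(X, residuals)
        stumps.append(stump)
        prediction += stump.predict(X)             # combine into a more complex fit

    print(np.mean((y - prediction) ** 2))          # training MSE shrinks with each stump
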
27
Q

Relative pros and cons random forest versus gradient boosting

A

Random forests train each tree independently, using a random sample of the data. This randomness helps to make the model more robust than a single decision tree, and less likely to overfit the training data. There are typically two parameters in RF: the number of trees and the number of features to select at each node.
GBTs build trees one at a time, where each new tree helps to correct errors made by the previously trained trees. With each tree added, the model becomes more expressive. There are typically three parameters: the number of trees, the depth of trees, and the learning rate; each tree built is generally shallow.
GBT training generally takes longer because the trees are built sequentially. Benchmark results have shown GBTs to be better learners than random forests, but GBTs are prone to overfitting if not handled carefully.

28
Q

Bayes’ theorem

A

P(A|B)= P(B|A)P(A) / P(B)
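A worked numeric example (all numbers made up): a disease test with a 1% prior, 99% sensitivity, and a 5% false-positive rate.

    p_a = 0.01              # P(A): prior probability of disease
    p_b_given_a = 0.99      # P(B|A): positive test given disease
    p_b_given_not_a = 0.05  # P(B|~A): false positive rate

    p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)  # total probability
    p_a_given_b = p_b_given_a * p_a / p_b                  # Bayes' theorem
    print(p_a_given_b)      # ~0.167: a positive test still leaves the disease unlikely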

29
Q

Conditional probability of P(B) (in terms of P(A))

A

P(B) = P(B|A)P(A) + P(B|¬A)P(¬A)

30
Q

What is a prior?

A

Probability before you run a test

31
Q

What is posterior?

A

It is the probability of the outcome given the prior and the evidence from the test.

32
Q

What’s the difference between a prob density function and a prob mass function?

A

Density is for continuous distributions, mass is for discrete

33
Q

Is the Dirichlet distribution discrete or continuous?

A

Continuous

35
Q

Describe the process of calculating the ROC AUC

A

The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various classification thresholds. The larger the area under the ROC curve, the better the classifier is at separating the classes.
https://www.youtube.com/watch?v=OAl6eAyP-yo
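A minimal rank-based sketch (scores and labels made up, ties ignored): the AUC equals the probability that a randomly chosen positive scores higher than a randomly chosen negative.

    import numpy as np

    labels = np.array([0, 0, 1, 0, 1, 1, 0, 1])
    scores = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.9, 0.2, 0.6])

    pos = scores[labels == 1]
    neg = scores[labels == 0]
    auc = np.mean(pos[:, None] > neg[None, :])  # fraction of correctly ranked pairs
    print(auc)                                  # 0.75 here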

36
Q

Why is ROC AUC useful for imbalanced classes?

A

As long as the ordering of observations by predicted probability remains the same, the ROC curve and AUC are identical regardless of the scale of the scores. Since all the AUC cares about is how well you’ve separated your classes, it is only sensitive to rank ordering, and is therefore robust to imbalanced classes.

37
Q

What’s Gauss–Markov theorem?

A

The Gauss–Markov theorem states that in a linear regression model in which the errors have expectation zero, are uncorrelated, and have equal variances, the best linear unbiased estimator (BLUE) of the coefficients is given by the ordinary least squares (OLS) estimator, provided it exists.

38
Q

Cost function

A

Describes how well a response surface fits the available data

39
Q

What is the major downside to a least squares cost function?

A

It’s dominated by outliers

40
Q

Describe a cost function where some regularisation exists

A

cost function = some data reconstruction error + regularisation penalty

41
Q

Define autocorrelation

A

Autocorrelation, also known as serial correlation, is the correlation of a signal with a delayed copy of itself as a function of delay. Informally, it is the similarity between observations as a function of the time lag between them.

42
Q

Explain Cohen’s Kappa

A

The Kappa statistic (or value) is a metric that compares an Observed Accuracy with an Expected Accuracy (random chance)
k = (p_o - p_e) / (1 - p_e)
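A small worked example of the formula (the confusion counts are made up): two raters labelling 100 items.

    import numpy as np

    confusion = np.array([[45, 5],    # rows: rater 1's labels
                          [10, 40]])  # columns: rater 2's labels
    n = confusion.sum()
    p_o = np.trace(confusion) / n                          # observed agreement
    p_e = (confusion.sum(0) / n) @ (confusion.sum(1) / n)  # agreement expected by chance
    kappa = (p_o - p_e) / (1 - p_e)
    print(kappa)                                           # 0.7 for these counts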

43
Q

Explain Mutual information

A

A measure of the mutual dependence between two variables; it determines how similar the joint distribution p(X,Y) is to the product of the factored marginal distributions p(X)p(Y).
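The definition as a small numpy computation (the joint table is made up): compare the joint against the product of the marginals.

    import numpy as np

    joint = np.array([[0.4, 0.1],
                      [0.1, 0.4]])                # p(X, Y)
    px = joint.sum(axis=1, keepdims=True)          # p(X)
    py = joint.sum(axis=0, keepdims=True)          # p(Y)
    mi = np.sum(joint * np.log2(joint / (px * py)))
    print(mi)  # ~0.28 bits; it would be 0 if X and Y were independent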

44
Q

Explain entropy (information theory)

A

The average amount of information produced by a stochastic source of data. The measure of information entropy associated with each possible data value is the negative logarithm of the probability mass function for the value. Thus, when the data source has a lower-probability value (i.e., when a low-probability event occurs), the event carries more “information” (“surprisal”) than when the source data has a higher-probability value. The amount of information conveyed by each event defined in this way becomes a random variable whose expected value is the information entropy.
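A worked version of the definition, H = -Σ p log2 p in bits (the distributions are made up):

    import numpy as np

    def entropy(p):
        p = np.asarray(p, dtype=float)
        p = p[p > 0]                   # treat 0 * log(0) as 0
        return -np.sum(p * np.log2(p))

    print(entropy([0.5, 0.5]))         # 1.0 bit: a fair coin
    print(entropy([0.99, 0.01]))       # ~0.08 bits: a biased coin, little surprise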

45
Q

Online learning setting

A

a method of machine learning in which data becomes available in a sequential order and is used to update our best predictor for future data at each step

46
Q

Why would you use stochastic gradient descent over batch gradient descent?

A

Batch uses every data point to calculate each step, so is computationally expensive

48
Q

Downside of user-user collaborative filtering

A
  • systems performed poorly when they had many items but comparatively few ratings (sparsity)
  • computing similarities between all pairs of users was expensive
  • user profiles changed quickly and the entire system model had to be recomputed
49
Q

limitations of pca

A
  • Scaling features arbitrarily (for example converting height in feet to inches) skews the PCs considerably
  • Depending on task, the strongly predictive information may lie in directions of small variance, which gets removed by PCA
  • Assumes underlying Gaussian distribution of data
50
Q

Bhattacharyya distance

A

measures the similarity of two discrete or continuous probability distributions. It is closely related to the Bhattacharyya coefficient which is a measure of the amount of overlap between two statistical samples or populations.

51
Q

Mean Absolute Error

A

MAE measures the average magnitude of the errors in a set of predictions, without considering their direction. It’s the average over the test sample of the absolute differences between prediction and actual observation where all individual differences have equal weight. (L1 loss)

52
Q

Root Mean Square Error

A

RMSE is the square root of the average of squared differences between prediction and actual observation.

53
Q

Difference (in outcome) between RMSE and MAE

A

Since the errors are squared before they are averaged, the RMSE gives a relatively high weight to large errors. This means the RMSE should be more useful when large errors are particularly undesirable.
RMSE does not necessarily increase with the variance of the errors. RMSE increases with the variance of the frequency distribution of error magnitudes.
RMSE has a tendency to be increasingly larger than MAE as the test sample size increases.
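An illustrative check (made-up errors): two error sets with the same MAE, where RMSE flags the one containing a single large error.

    import numpy as np

    def mae(e):  return np.mean(np.abs(e))
    def rmse(e): return np.sqrt(np.mean(e ** 2))

    even = np.array([2.0, 2.0, 2.0, 2.0])   # errors spread evenly
    spiky = np.array([0.0, 0.0, 0.0, 8.0])  # same total error, one outlier
    print(mae(even), rmse(even))            # 2.0, 2.0
    print(mae(spiky), rmse(spiky))          # 2.0, 4.0: RMSE flags the outlier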

54
Q

Why is least squares so commonly used for regression?

A

When solving the statistical linear regression problem, a very common modeling assumption is that for every possible value of “x”, the quantity “y” is normally distributed with a mean that is linear in “x”. Therefore, the likelihood function is essentially a product of PDFs of the normal distribution. As stated above, you estimate the unknown parameters (and therefore find the best fitting line) by maximizing the likelihood function. If you look at what the product of normal PDFs looks like, you will notice that maximizing this expression happens to be equivalent to minimizing the sum of squared errors.

55
Q

R-squared (regression metric)

A

the percentage of the response variable variation that is explained by a linear model.
R-squared = Explained variation / Total variation

R-squared is always between 0 and 100%:

0% indicates that the model explains none of the variability of the response data around its mean.
100% indicates that the model explains all the variability of the response data around its mean.
In general, the higher the R-squared, the better the model fits your data.

56
Q

User-User collaborative filtering

A

S(u, i) = rhat_u + SUM((r_vi - rhat_v) · w_uv) / SUM(w_uv)

S -> predicted score
u -> user
v -> a different user
i -> item
r -> rating (normalised from user v's range to user u's)
w -> weighting describing the similarity between users u & v; commonly Pearson's correlation or cosine similarity
hat -> average
Note: here SUM is over (v ∈ U). A worked example follows below.
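A direct transcription of the formula with toy numbers (all made up): predict user u's score for item i from two neighbours.

    import numpy as np

    r_hat_u = 3.5                    # user u's average rating
    r_vi = np.array([4.0, 2.0])      # each neighbour v's rating of item i
    r_hat_v = np.array([3.0, 2.5])   # each neighbour's average rating
    w_uv = np.array([0.9, 0.4])      # similarity of each neighbour to u

    s_ui = r_hat_u + np.sum((r_vi - r_hat_v) * w_uv) / np.sum(w_uv)
    print(s_ui)                      # 3.5 + 0.7 / 1.3 ~ 4.04
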
57
Q

Common modifications to generic user-user

A

v ≠ u, obviously
Limit the size of the neighbourhood (top n neighbours)
Limit the minimum similarity between people, i.e. SUM is over (v ∈ V)

58
Q

Benefits of item-item CF

A
  • works quite well
  • efficient implementation in the common case where you have many more users than items
  • the relatively static relationships between items lend themselves to precomputation
59
Q

Assumptions/limitations of item-item CF

A

item-item relationships need to be stable (most issues are temporal)
resulting recommendations have lower serendipity

60
Q

Item-Item collaborative filtering

A

S(u, i) = SUM(w_ij · r_uj) / SUM(|w_ij|)

S -> predicted score
u -> user
i -> item
j -> another item
r -> user u's rating of item j
w -> weighting describing the similarity between items i & j; commonly Pearson's correlation or cosine similarity
Note: here SUM is over (j ∈ N)
N -> neighbourhood (items similar to i)
61
Q

Gradient boosting parameters; learning_rate

A

a weight between 0 & 1 that controls the contribution of each tree
For example, if the current prediction for a particular example is 0.2 and the next tree predicts that it should actually be 0.8, the correction would be +0.6. At a learning rate of 1, the updated prediction would be the full 0.2+1(0.6)=0.8, while a learning rate of 0.1 would update the prediction to be 0.2+0.1(0.6)=0.26.

62
Q

Gradient boosting parameters; max_tree_depth

A

The maximum number of edges from the node to the tree’s root node

63
Q

What is an embedding?

A

An embedding is a mapping from discrete objects, such as words, to vectors of real numbers.

64
Q

The fundamental theorem of linear programming

A

If the max/min exists for the linear programming problem, it occurs at a vertex of the feasible region

65
Q

Steps to solving a Linear Programming Problem

A
  • Define the variables
  • Write the objective function and state whether the goal is to minimise or maximise the function
  • Write the constraints (which gives you the system of inequalities)
  • Graph the constraints to determine the feasible region (the solution to the system of inequalities)
  • Identify the vertices of the feasible region
  • Test the vertices in the objective function to determine the maximum or minimum function value (a code sketch follows below).
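The steps above in code for a made-up problem, assuming scipy is installed: maximise 3x + 2y subject to x + y <= 4, x + 3y <= 6, x, y >= 0 (linprog minimises, so the objective is negated).

    from scipy.optimize import linprog

    result = linprog(c=[-3, -2],                # negated: maximise 3x + 2y
                     A_ub=[[1, 1], [1, 3]],     # x + y <= 4, x + 3y <= 6
                     b_ub=[4, 6],
                     bounds=[(0, None), (0, None)])
    print(result.x, -result.fun)                # optimum at the vertex (4, 0), value 12
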
66
Q

A downside of binning and one hot encoding continuous values

A

You lose the sense of order, e.g. if you bin values 1-10 into 1-5 and 6-10, you lose the concept that the values 1-5 are smaller than 6-10.

67
Q

Grep for string

A

grep -rn "string" ~/location/

68
Q

Curl

A

Curl is “a command line tool for getting or sending files using URL syntax.”

69
Q

Wget

A

GNU Wget is a computer program that retrieves content from web servers. It is part of the GNU Project

70
Q

Adam optimization algorithm

A

Adam is an optimization algorithm that can be used instead of the classical stochastic gradient descent procedure to update network weights iteratively based on the training data. Stochastic gradient descent maintains a single learning rate (termed alpha) for all weight updates, and the learning rate does not change during training.
Adam adapts per-parameter learning rates using the average of the second moments of the gradients (the uncentered variance), as in RMSProp, and additionally makes use of the average of the first moments (the mean).

Specifically, the algorithm calculates an exponential moving average of the gradient and the squared gradient, and the parameters beta1 and beta2 control the decay rates of these moving averages.

Because the moving averages are initialised at zero and beta1 and beta2 are close to 1.0 (as recommended), the moment estimates are biased towards zero. This bias is overcome by first calculating the biased estimates and then calculating bias-corrected estimates.

71
Q

Adaptive Gradient Algorithm (AdaGrad)

A

AdaGrad maintains a per-parameter learning rate, which improves performance on problems with sparse gradients (e.g. natural language and computer vision problems).

72
Q

Root Mean Square Propagation (RMSProp)

A

RMSProp also maintains per-parameter learning rates, adapted based on the average of recent magnitudes of the gradients for each weight (i.e. how quickly it is changing). This means the algorithm does well on online and non-stationary problems (e.g. noisy ones).

73
Q

Likelihood function

A

In frequentist inference, a likelihood function (often simply the likelihood) is a function of the parameters of a statistical model, given specific observed data. Likelihood functions play a key role in frequentist inference, especially methods of estimating a parameter from a set of statistics. Probability in this mathematical context describes the plausibility of a random outcome, given a model parameter value, without reference to any observed data. Likelihood describes the plausibility of a model parameter value, given specific observed data.

Let X be a discrete random variable with probability mass function p depending on a parameter θ. Then the function

L(θ|x) = p_θ(x) = P_θ(X = x),

considered as a function of θ, is the likelihood function (of θ), given the outcome x of the random variable X. Sometimes the probability of “the value x of X for the parameter value θ” is written as P(X = x | θ); it is often written as P(X = x; θ), to emphasise that it is not a conditional probability.

74
Q

Why choose a recurrent neural network?

A

A recurrent neural network (RNN) is a class of artificial neural network where connections between nodes form a directed graph along a sequence. This allows it to exhibit temporal dynamic behavior for a time sequence. Unlike feedforward neural networks, RNNs can use their internal state (memory) to process sequences of inputs.

75
Q

Boltzmann machine

A

A Boltzmann machine is a network of symmetrically connected, neuron-like units that make stochastic decisions about whether to be on or off. Boltzmann machines have a simple learning algorithm (Hinton & Sejnowski, 1983) that allows them to discover interesting features that represent complex regularities in the training data. The learning algorithm is very slow in networks with many layers of feature detectors, but it is fast in “restricted Boltzmann machines” that have a single layer of feature detectors. Many hidden layers can be learned efficiently by composing restricted Boltzmann machines, using the feature activations of one as the training data for the next.
Boltzmann machines are used to solve two quite different computational problems. For a search problem, the weights on the connections are fixed and are used to represent a cost function. The stochastic dynamics of a Boltzmann machine then allow it to sample binary state vectors that have low values of the cost function.
For a learning problem, the Boltzmann machine is shown a set of binary data vectors and it must learn to generate these vectors with high probability. To do this, it must find weights on the connections so that, relative to other possible binary vectors, the data vectors have low values of the cost function. To solve a learning problem, Boltzmann machines make many small updates to their weights, and each update requires them to solve many different search problems.

76
Q

CNN

A

In machine learning, a convolutional neural network (CNN, or ConvNet) is a class of deep, feed-forward artificial neural networks, most commonly applied to analyzing visual imagery.

CNNs use a variation of multilayer perceptrons designed to require minimal preprocessing.
Each convolutional neuron processes data only for its receptive field. Although fully connected feedforward neural networks can be used to learn features as well as classify data, it is not practical to apply this architecture to images. A very high number of neurons would be necessary, even in a shallow (opposite of deep) architecture, due to the very large input sizes associated with images, where each pixel is a relevant variable. For instance, a fully connected layer for a (small) image of size 100 x 100 has 10000 weights for each neuron in the second layer. The convolution operation brings a solution to this problem as it reduces the number of free parameters, allowing the network to be deeper with fewer parameters.

77
Q

Multi layer perceptron

A

a class of feedforward artificial neural network. An MLP consists of, at least, three layers of nodes: an input layer, a hidden layer and an output layer. Except for the input nodes, each node is a neuron that uses a nonlinear activation function. MLP utilizes a supervised learning technique called backpropagation for training. Its multiple layers and non-linear activation distinguish MLP from a linear perceptron. It can distinguish data that is not linearly separable.

78
Q

Backpropagation

A

a method used in artificial neural networks to calculate a gradient that is needed in the calculation of the weights to be used in the network. It is commonly used to train deep neural networks, a term referring to neural networks with more than one hidden layer.

Backpropagation is a special case of a more general technique called automatic differentiation. In the context of learning, backpropagation is commonly used by the gradient descent optimization algorithm to adjust the weight of neurons by calculating the gradient of the loss function. This technique is also sometimes called backward propagation of errors, because the error is calculated at the output and distributed back through the network layers.

Backpropagation requires the derivative of the loss function with respect to the network output to be known, which typically (but not necessarily) means that a desired target value is known. For this reason it is considered to be a supervised learning method, although it is used in some unsupervised networks such as autoencoders. Backpropagation is also a generalization of the delta rule to multi-layered feedforward networks, made possible by using the chain rule to iteratively compute gradients for each layer. It is closely related to the Gauss–Newton algorithm, and is part of continuing research in neural backpropagation. Backpropagation can be used with any gradient-based optimizer, such as L-BFGS or truncated Newton

79
Q

Automatic differentiation

A

In mathematics and computer algebra, automatic differentiation (AD), also called algorithmic differentiation or computational differentiation, is a set of techniques to numerically evaluate the derivative of a function specified by a computer program. AD exploits the fact that every computer program, no matter how complicated, executes a sequence of elementary arithmetic operations (addition, subtraction, multiplication, division, etc.) and elementary functions (exp, log, sin, cos, etc.). By applying the chain rule repeatedly to these operations, derivatives of arbitrary order can be computed automatically, accurately to working precision, and using at most a small constant factor more arithmetic operations than the original program.

80
Q

Expectation value

A

In probability theory, the expected value of a random variable, intuitively, is the long-run average value of repetitions of the experiment it represents. For example, the expected value in rolling a six-sided die is 3.5, because the average of all the numbers that come up in an extremely large number of rolls is close to 3.5. Less roughly, the law of large numbers states that the arithmetic mean of the values almost surely converges to the expected value as the number of repetitions approaches infinity.

More practically, the expected value of a discrete random variable is the probability-weighted average of all possible values. In other words, each possible value the random variable can assume is multiplied by its probability of occurring, and the resulting products are summed to produce the expected value. The same principle applies to an absolutely continuous random variable, except that an integral of the variable with respect to its probability density replaces the sum
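The die example above as a probability-weighted average:

    import numpy as np

    values = np.arange(1, 7)       # faces of the die
    probs = np.full(6, 1 / 6)      # each equally likely
    print(np.sum(values * probs))  # 3.5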

The expected value is a key aspect of how one characterizes a probability distribution; it is one type of location parameter. By contrast, the variance is a measure of dispersion of the possible values of the random variable around the expected value. The variance itself is defined in terms of two expectations: it is the expected value of the squared deviation of the variable’s value from the variable’s expected value.

81
Q

cross entropy

A

In information theory, the cross entropy between two probability distributions p and q over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set, if a coding scheme is used that is optimized for an “unnatural” probability distribution q, rather than the “true” distribution p.
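A small numeric illustration (distributions made up): cross entropy is smallest, and equals the entropy of p, when q = p.

    import numpy as np

    p = np.array([0.5, 0.25, 0.25])   # the "true" distribution
    q = np.array([0.25, 0.5, 0.25])   # a mismatched coding distribution

    def cross_entropy(p, q):
        return -np.sum(p * np.log2(q))

    print(cross_entropy(p, p))        # 1.5 bits: the entropy of p
    print(cross_entropy(p, q))        # 1.75 bits: the cost of coding with the wrong q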

83
Q

Gaussian process

A

A distribution over functions: every finite collection of function values has a joint Gaussian distribution.

84
Q

Joint probability distribution

A

Given random variables X, Y, …, that are defined on a probability space, the joint probability distribution for X, Y, … is a probability distribution that gives the probability that each of X, Y, … falls in any particular range or discrete set of values specified for that variable. In the case of only two random variables, this is called a bivariate distribution, but the concept generalizes to any number of random variables, giving a multivariate distribution.

85
Q

Cumulative distribution function

A

In probability theory and statistics, the cumulative distribution function (CDF) of a real-valued random variable X, or just the distribution function of X, evaluated at x, is the probability that X will take a value less than or equal to x.

In the case of a continuous distribution, it gives the area under the probability density function from minus infinity to x. Cumulative distribution functions are also used to specify the distribution of multivariate random variables.

86
Q

Surrogate loss function

A

In the context of learning, say you have a classification problem with data set {(X1,Y1),…,(Xn,Yn)}, where Xn are your features and Yn are your true labels.

Given a hypothesis function h(x), the loss function l: (h(Xn), Yn) → ℝ takes your hypothesis function’s prediction, i.e. h(Xn), as well as the true label for that particular input, and returns a penalty. Now, a general goal is to find a hypothesis that minimizes the empirical risk (the chances of being wrong!):

R_l(h) = E_empirical[l(h(X), Y)] = (1/m) ∑_{i=1}^m l(h(X_i), Y_i).

In the case of binary classification, a common loss function that is used is the 0−1 loss function:

l(h(X), Y) = 0 if Y = h(X), and 1 otherwise.
In general the loss function that we care about cannot be optimized efficiently. For example, the 0−1 loss function is discontinuous. So, we consider another loss function that will make our life easier, which we call the surrogate loss function.

An example of a surrogate loss function could be ψ(h(x)) = max(1 − h(x), 0) (the hinge loss in SVMs), which is convex and easy to optimize using conventional methods. This function acts as a proxy for the actual loss we wanted to minimize in the first place. Obviously, it has its disadvantages, but in some cases a surrogate loss function actually results in being able to learn more. By this I mean that, once your classifier achieves optimal risk (i.e. highest accuracy), you can still see the loss decreasing, which means that it is trying to push the different classes even further apart to improve its robustness.
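A small numeric comparison of the 0−1 loss and the hinge surrogate at a few margins y·h(x) (values made up; labels assumed in {−1, +1}):

    import numpy as np

    margins = np.array([-1.5, -0.2, 0.3, 0.9, 2.0])  # y * h(x)
    zero_one = (margins <= 0).astype(float)          # 1 on a mistake, discontinuous at 0
    hinge = np.maximum(1 - margins, 0)               # convex upper bound on the 0-1 loss
    for m, z, h in zip(margins, zero_one, hinge):
        print(m, z, h)
    # note hinge is still positive at margin 0.9: the surrogate keeps pushing
    # correctly classified points further from the boundary.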

87
Q

GLM vs linear models; distribution of labels

A

Linear models: Y_i ~ N(mu_i, sigma^2)

GLMs: Y_i ~ exponential family

88
Q

GLM vs linear models; linear predictor, function of the covariates

A

Linear models: nu_i = alpha + beta x_i

GLMs, e.g.: nu_i = alpha + beta x_i + gamma x_i^2

89
Q

GLM vs linear models; link function, connection between linear predictor and mu_i = E(Y_i)

A

Linear models: nu_i = mu_i -> mu_i = alpha + beta x_i

GLMs, e.g.: nu_i = ln(mu_i) -> mu_i = e^{alpha + beta x_i + gamma x_i^2}

90
Q

Link function

A

provides the relationship between the linear predictor and the mean of the distribution function
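A minimal sketch of a log link with made-up numbers: the linear predictor nu_i ranges over the whole real line, while the mean mu_i of, say, a Poisson response must be positive, so the link nu_i = ln(mu_i) maps between them.

    import numpy as np

    alpha, beta = -1.0, 0.8
    x = np.array([0.0, 1.0, 2.0, 3.0])
    nu = alpha + beta * x   # linear predictor: may be negative
    mu = np.exp(nu)         # inverse of the log link: always a valid (positive) mean
    print(nu)
    print(mu)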