Evernote Flashcards

1
Q

Discrete distribution

A
  • A discrete distribution is one in which the “data can only take on certain values, for example integers (finite)”.
  • For a discrete distribution, probabilities can be assigned to the values in the distribution - for example, “the probability that the web page will have 12 clicks in an hour is 0.15.”
2
Q

Continuous distribution

A

A continuous distribution is one in which “data can take on any value within a specified range (which may be infinite).”
The probability associated with any particular value of a continuous distribution is zero.

Therefore, continuous distributions are normally “described in terms of probability density”, which can be converted into the probability that a value will fall within a certain range.

3
Q

Discrete and continuous data

A

Discrete data involves round, concrete numbers that are determined by counting.

Continuous data involves measured values that can take on any value within a range, often recorded over an interval of time.

4
Q

Conditional probability

A

Conditional probability is the “probability of one event occurring with some relationship to one or more other events.” For example:

Event A is that it is raining outside, and it has a 0.3 (30%) chance of raining today.
Event B is that you will need to go outside, and that has a probability of 0.5 (50%).

A conditional probability would look at these two events in relationship with one another, such as the probability that it is both raining and you will need to go outside.

The formula for conditional probability is:

P(B|A) = P(A and B) / P(A)
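
A minimal sketch of the formula in Python, continuing the card's rain example. The joint probability P(A and B) = 0.15 is an assumed value for illustration:

```python
# P(A): chance of rain today (from the card)
p_rain = 0.3
# P(A and B): chance it rains AND you need to go outside (assumed)
p_rain_and_out = 0.15

# Conditional probability: P(B|A) = P(A and B) / P(A)
p_out_given_rain = p_rain_and_out / p_rain
print(p_out_given_rain)  # 0.5
```

With these assumed numbers, P(B|A) equals P(B), i.e. the two events happen to be independent.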

5
Q

Bayes' theorem

A

The fundamental idea of Bayesian inference is to become “less wrong” with more data.

The process is straightforward: we have an initial belief, known as a prior, which we update as we gain additional information.

P(A|B) = P(B|A) P(A) / P(B)

Note: The conclusions drawn from Bayes' law are logical but often counterintuitive. Almost always, people pay close attention to the posterior probability but overlook the prior probability.
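
As a sketch of why the prior matters, here is the classic base-rate illustration in Python. All the numbers (1% prevalence, 90% sensitivity, 5% false-positive rate) are assumed for the example, not taken from the card:

```python
# Hypothetical illustration of Bayes' theorem and base-rate neglect.
p_disease = 0.01                 # prior P(A): prevalence (assumed)
p_pos_given_disease = 0.90       # likelihood P(B|A): sensitivity (assumed)
p_pos_given_healthy = 0.05       # false-positive rate (assumed)

# Total probability of a positive test, P(B)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Posterior: P(A|B) = P(B|A) * P(A) / P(B)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))  # ~0.154
```

Despite the strong test, a positive result implies only about a 15% chance of disease, because the prior (1%) is so small.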

6
Q

Hypothesis testing and confidence interval estimation

A
  1. Using Hypothesis Testing, we try to interpret or draw conclusions about the population using sample data.
  2. A Hypothesis Test evaluates two mutually exclusive statements about a population to determine which statement is best supported by the sample data.
  3. Whenever we want to make claims about the distribution of data, or about whether one set of results differs from another in applied machine learning, we must rely on statistical hypothesis tests (see the sketch below).
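
As a concrete illustration (not from the card), a minimal sketch of a two-sample t-test with SciPy on made-up data:

```python
# H0: the two samples come from populations with the same mean.
# A small p-value is evidence against H0.
from scipy import stats

group_a = [5.1, 4.9, 5.3, 5.0, 5.2, 4.8]  # made-up measurements
group_b = [5.6, 5.8, 5.5, 5.9, 5.7, 5.4]

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:  # conventional significance level
    print("Reject H0: the sample means differ.")
```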
7
Q

Random variable

A

A random variable, usually written X, is a variable whose possible values are numerical outcomes of a random phenomenon.

8
Q

Discrete random variable

A

A discrete random variable is one which may take on only a “countable number of distinct values” such as 0, 1, 2, 3, 4, …

Discrete random variables are usually (but not necessarily) counts.

If a random variable can take only a finite number of distinct values, then it must be discrete.

Examples of discrete random variables include the number of children in a family, the Friday night attendance at a cinema, the number of patients in a doctor’s surgery, the number of defective light bulbs in a box of ten.

9
Q

Continuous random variable

A

A continuous random variable is one which takes an infinite number of possible values. Continuous random variables are usually measurements. Examples include height, weight, the amount of sugar in an orange, the time required to run a mile.

10
Q

Cumulative distribution function

A

All random variables (discrete and continuous) have a cumulative distribution function.

For a discrete random variable, the cumulative distribution function at a value x is found by summing up the probabilities of all values less than or equal to x.
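
A minimal sketch, with a made-up probability mass function, of how summing the probabilities yields the CDF:

```python
# Made-up PMF of a discrete random variable X
values = [0, 1, 2, 3, 4]          # possible outcomes
pmf = [0.1, 0.2, 0.4, 0.2, 0.1]   # P(X = x); sums to 1

running_total = 0.0
for x, p in zip(values, pmf):
    running_total += p            # CDF(x) = P(X <= x)
    print(f"P(X <= {x}) = {running_total:.1f}")
# P(X <= 4) = 1.0, as it must be
```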

11
Q

Simple linear regression

A
  1. Simple linear regression is used to estimate the relationship between two quantitative variables.
  2. Regression problem
  3. Range = -inf to +inf

                              y = B0 + B1*x + e

Linear regression finds the “line of best fit” through your data by searching for the regression coefficient (B1) that minimizes the total error (e) of the model.

Cost function: mean squared error (see the fitting sketch below)

  • y is the predicted value of the dependent variable for any given value of the independent variable (x).
  • B0 is the intercept, the predicted value of y when x is 0.
  • B1 is the regression coefficient – how much we expect y to change as x increases.
  • x is the independent variable (the variable we expect is influencing y).
  • e is the error of the estimate, or how much variation there is in our estimate of the regression coefficient.
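
A minimal sketch of such a fit with NumPy on made-up data (all x and y values are assumed for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

# np.polyfit with degree 1 returns the least-squares [B1, B0]
b1, b0 = np.polyfit(x, y, 1)
print(f"intercept B0 = {b0:.2f}, slope B1 = {b1:.2f}")

y_pred = b0 + b1 * x
mse = np.mean((y - y_pred) ** 2)  # the cost the fit minimizes
print(f"MSE = {mse:.4f}")
```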
12
Q

Simple logistic regression

A
  1. Uses sigmoid function
  2. Better suited for classification problem
  3. Range within 0-1
         y = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))

Cost function: log loss (cross-entropy); see the link below.
https://towardsdatascience.com/logistic-regression-detailed-overview-46c4da4303bc
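
A minimal sketch of the sigmoid prediction formula above, with assumed coefficients b0 and b1 (made up for illustration):

```python
import math

def predict_probability(x, b0=-4.0, b1=1.5):
    """Return P(y = 1 | x) via the logistic function."""
    z = b0 + b1 * x
    return math.exp(z) / (1 + math.exp(z))  # equivalently 1 / (1 + e^-z)

for x in (0, 2, 4, 6):
    print(x, round(predict_probability(x), 3))
# Outputs stay within (0, 1) and rise smoothly as x grows
```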

13
Q

Types of logistic regression

A
  1. Binary Logistic Regression
    The categorical response has only two possible outcomes. Example: Spam or Not Spam
  2. Multinomial Logistic Regression
    Three or more categories without ordering. Example: Predicting which food is preferred more (Veg, Non-Veg, Vegan)
  3. Ordinal Logistic Regression
    Three or more categories with ordering. Example: Movie rating from 1 to 5
14
Q

Mean square error

A

The mean squared error (MSE) is the average of the squared differences between the predicted and actual values:

MSE = (1/n) * Σ (y_i − ŷ_i)^2

https://towardsdatascience.com/introduction-to-machine-learning-algorithms-linear-regression-14c4e325882a
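
A minimal sketch of the computation on made-up values:

```python
actual = [3.0, 5.0, 7.5, 9.0]       # observed y values (made up)
predicted = [2.8, 5.4, 7.0, 9.5]    # model predictions (made up)

squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
mse = sum(squared_errors) / len(squared_errors)
print(round(mse, 4))  # 0.175
```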

15
Q

Logistic regression cost function

A

Logistic regression uses the log loss (cross-entropy) cost function:

J = −(1/n) * Σ [ y_i * log(ŷ_i) + (1 − y_i) * log(1 − ŷ_i) ]

https://towardsdatascience.com/logistic-regression-detailed-overview-46c4da4303bc
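
A minimal sketch of the log-loss computation on made-up predictions:

```python
import math

y_true = [1, 0, 1, 1]           # actual labels (made up)
y_prob = [0.9, 0.2, 0.7, 0.6]   # predicted P(y = 1) (made up)

loss = -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
            for y, p in zip(y_true, y_prob)) / len(y_true)
print(round(loss, 4))  # ~0.299; lower is better
```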

16
Q

Gradient descent

A

Gradient descent is an “optimization algorithm” that’s used when training a machine learning model.

It’s “based on a convex function” and “tweaks its parameters iteratively to minimize a given function to its local minimum.”
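
A minimal sketch, minimizing an assumed convex function f(w) = (w − 3)^2 whose gradient is 2*(w − 3):

```python
learning_rate = 0.1
w = 0.0                            # arbitrary starting point

for step in range(50):
    gradient = 2 * (w - 3)         # slope of f at the current w
    w -= learning_rate * gradient  # step downhill, against the gradient
print(round(w, 4))                 # converges toward the minimum at w = 3
```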

17
Q

Gradient

A

“A gradient measures how much the output of a function changes if you change the inputs a little bit.” — Lex Fridman (MIT).

18
Q

Generalised linear models

A

In statistics, a generalized linear model is a “flexible generalization of ordinary linear regression” that allows the response variable (Y) to have an “error distribution other than the normal distribution”.

Applicable when the relationship between X and Y is not linear, e.g. exponential.

https://www.mygreatlearning.com/blog/generalized-linear-models/

19
Q

Components of GLM

A

There are 3 components in GLM.

Systematic Component/Linear Predictor:
It is just the linear combination of the Predictors and the regression coefficients.

β0 + β1X1 + β2X2

Link Function:
Represented as η or g(μ), it specifies the link between the random and systematic components. It indicates how the expected/predicted value of the response relates to the linear combination of the predictor variables.

Random Component/Probability Distribution:
It refers to the probability distribution, from the family of distributions, of the response variable.

The family of distributions, called the exponential family, includes the normal, binomial, and Poisson distributions, among others.
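
A minimal sketch, assuming made-up count data, of fitting a Poisson GLM (log link) with statsmodels:

```python
import numpy as np
import statsmodels.api as sm

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # predictor (made up)
counts = np.array([2, 3, 6, 8, 14, 20])        # count response (made up)

X = sm.add_constant(x)  # prepend the intercept column
# Random component: Poisson; link function: log (the Poisson default)
model = sm.GLM(counts, X, family=sm.families.Poisson())
result = model.fit()
print(result.params)    # [intercept, coefficient] on the log scale
```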

20
Q

Exponential family of distributions

Probability distributions and their corresponding link functions

A

Probability distribution → Link function

  • Normal distribution → Identity function
  • Binomial distribution → Logit/Sigmoid function
  • Poisson distribution → Log function (aka log-linear, log-link)

21
Q

Regularisation

A

To reduce overfitting of a model

“It is a form of regression that shrinks the coefficient estimates towards zero.” In other words, this technique forces us not to learn a more complex or flexible model, to avoid the problem of overfitting.

22
Q

Regularisation types

A

Ridge regression

Lasso regression

23
Q

Lasso regression

A

Lasso regression is another variant of the regularization technique used to reduce the complexity of the model. It stands for Least Absolute Shrinkage and Selection Operator.

👉 It is similar to Ridge Regression except that the penalty term uses the absolute values of the weights instead of their squares. Therefore, the optimization function becomes:

Cost function for Lasso regression: RSS + λ * Σ|β_j| (the residual sum of squares plus an L1 penalty on the coefficients)

👉 In statistics, this penalty is known as the L1 norm.

👉 In this technique, the L1 penalty has the effect of forcing some of the coefficient estimates to be exactly zero, which means some features are removed entirely from the model when the tuning parameter λ is sufficiently large. Therefore, the lasso method also performs feature selection and is said to yield sparse models.

👉 Limitations of Lasso Regression:

  • Problems with some types of dataset: If the number of predictors is greater than the number of data points (n), Lasso will pick at most n predictors as non-zero, even if all predictors are relevant.
  • Multicollinearity problem: If there are two or more highly collinear variables, then LASSO regression selects one of them at random, which is not good for the interpretation of our model.

Key Differences between Ridge and Lasso Regression

👉 Ridge regression reduces overfitting while keeping all the features present in the model; it reduces the model's complexity by shrinking the coefficients. Lasso regression reduces overfitting as well, and additionally performs automatic feature selection.

👉 Lasso regression tends to shrink some coefficients to exactly zero, whereas ridge regression never sets a coefficient exactly to zero.
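
A minimal sketch contrasting the two with scikit-learn on synthetic data (the data and the alpha values are assumed):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# Only the first two features matter in this synthetic ground truth
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

print("ridge:", np.round(ridge.coef_, 3))  # all shrunk, none exactly zero
print("lasso:", np.round(lasso.coef_, 3))  # irrelevant features zeroed out
```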

24
Q

What does Regularization achieve?

A

👉 In simple linear regression, the standard least-squares model tends to have some variance in it, i.e. this model won’t generalize well for a future data set that is different from its training data.

👉 Regularization tries to reduce the variance of the model, without a substantial increase in the bias.

25
Q

How does λ relate to the principle of the “Curse of Dimensionality”?

A

“As the value of λ rises, it significantly reduces the value of coefficient estimates and thus reduces the variance.”

Up to a point, this increase in λ is beneficial for our model, as it only reduces the variance (hence avoiding overfitting) without losing any important properties in the data. But after a certain value of λ, the model starts losing important properties, introducing bias and thus underfitting. Therefore, we have to select the value of λ carefully, and cross-validation comes in handy for choosing a good value.

26
Q

Important points about λ, the tuning parameter

A

  • λ is the tuning parameter used in regularization that decides how much we want to penalize the flexibility of our model, i.e., it controls the impact on bias and variance.
  • When λ = 0, the penalty term has no effect and the equation becomes the cost function of the linear regression model, so the estimates produced by ridge regression equal the least-squares estimates.
  • However, as λ → ∞ (tends to infinity), the impact of the shrinkage penalty increases, and the ridge regression coefficient estimates approach zero.
27
Q

Machine learning

A

Machine learning can be summarized as learning a function (f) that maps input variables (X) to output variables (Y).

Y = f(X)

28
Q

Parametric Machine Learning Algorithms

A

“Algorithms that simplify the function to a known form are called parametric machine learning algorithms.”

A learning model that summarizes data with a set of parameters of fixed size (independent of the number of training examples) is called a parametric model.
No matter how much data you throw at a parametric model, it won’t change its mind about how many parameters it needs.

29
Q

Nonparametric Machine Learning Algorithms

A

“Algorithms that do not make strong assumptions about the form of the mapping function are called nonparametric machine learning algorithms.”

By not making assumptions, they are free to learn any functional form from the training data.

30
Q

Benefits of Parametric Machine Learning Algorithms:

A
  • Simpler: These methods are easier to understand, and their results are easier to interpret.
  • Speed: Parametric models are very fast to learn from data.
  • Less Data: They do not require as much training data and can work well even if the fit to the data is not perfect.
31
Q

Limitations of Parametric Machine Learning Algorithms:

A
  • Constrained: By choosing a functional form, these methods are highly constrained to the specified form.
  • Limited Complexity: The methods are more suited to simpler problems.
  • Poor Fit: In practice the methods are unlikely to match the underlying mapping function.
32
Q

Examples of popular nonparametric machine learning algorithms

A
  • k-Nearest Neighbors
  • Decision Trees like CART and C4.5
  • Support Vector Machines
33
Q

Pros and cons of non-parametric models

A

Benefits of Nonparametric Machine Learning Algorithms:

  • Flexibility: Capable of fitting a large number of functional forms.
  • Power: No assumptions (or weak assumptions) about the underlying function.
  • Performance: Can result in higher-performance models for prediction.

Limitations of Nonparametric Machine Learning Algorithms:

  • More data: Require a lot more training data to estimate the mapping function.
  • Slower: A lot slower to train, as they often have far more parameters to train.
  • Overfitting: More risk of overfitting the training data, and it is harder to explain why specific predictions are made.
34
Q

Causal models

A

Causal models are mathematical models representing causal relationships within an individual system or population. They facilitate inferences about causal relationships from statistical data.