Hypothesis Testing, PCA & t-SNE Flashcards

Question

Are the system's eigenvectors essentially the primary components?

Answer 1

Yes, they are eigenvectors of the covariance matrix.

Answer 2

In PCA, variance is treated as a measure of the information carried by the original features. PCA tries to capture as much of that variance as possible.

Answer 3

PC1 and PC2 will often represent a "significant" percentage of variability, but usually not 100%. Each principal component is a linear combination of the original features, and we can plot the first 2 or 3 principal components (using 2D or 3D plots) to visualize the data.

Answer 4

PCA stands for Principal Component Analysis. It is used to visualize data in lower dimensions and to reduce the number of dimensions.

Answer 5

Each principal component uses all the features. It is a linear combination of all the features available.

Answer 6

Normalization is the process of converting the values of numeric columns in a dataset to a similar scale without distorting differences in the ranges of values. Not every dataset needs to be normalized for machine learning. It is only required when the features have different ranges.

Answer 7

By examining available features to identify what makes a "normal" class, the PCA-Based Anomaly Detection component solves the problem. After that, the component uses distance metrics to detect anomalous cases. This method allows you to train a model with data that is already skewed.

Answer 8

If N variables are highly correlated, they will all load heavily on the same principal component (eigenvector) rather than on different ones. This shared loading is how you identify them as being highly correlated.

Answer 9

Because the probability density graph of the normal distribution resembles a bell, it is commonly referred to as the bell curve. The Gaussian distribution is named after the German mathematician Carl Gauss, who first characterized it.

Answer 10

A probability bell curve is referred to as a normal distribution. The mean and standard deviation are the parameters of the normal distribution that define its center and its shape. The mean of a 'standard normal distribution' is 0 and the standard deviation is 1. It has a kurtosis of 3 and zero skew. Although all normal distributions are symmetrical, not all symmetrical distributions are normal.

Answer 11

The t-SNE algorithm does not preserve distances directly; instead, it converts them into probability distributions over pairs of points. In theory, t-SNE maps the input to a two- or three-dimensional map space. Similarities in the input space are modeled with a Gaussian distribution, while similarities in the mapped space are modeled with a t-distribution. The KL divergence between the two distributions is used as the loss function, which is minimized using gradient descent.
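
As a rough illustration of the loss mentioned in this card, the sketch below computes the KL divergence between two small discrete distributions. It is not a t-SNE implementation: the function name `kl_divergence` and the toy probabilities are assumptions made up for this example, and real t-SNE builds its distributions from pairwise similarities between data points.

```python
from math import log

def kl_divergence(p, q):
    """KL(P || Q) for two discrete distributions given as equal-length lists.

    Terms where p_i == 0 contribute nothing, by the usual convention.
    """
    return sum(p_i * log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)

# Hypothetical toy distributions (not taken from any real embedding).
p = [0.5, 0.3, 0.2]
q_close = [0.45, 0.35, 0.2]   # similar to p
q_far = [0.1, 0.1, 0.8]       # very different from p

kl_same = kl_divergence(p, p)
kl_close = kl_divergence(p, q_close)
kl_far = kl_divergence(p, q_far)
print(kl_same)             # 0.0
print(kl_close < kl_far)   # True
```

The closer the low-dimensional distribution is to the high-dimensional one, the smaller this value, which is why gradient descent on it pulls the embedding toward matching similarities.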

Answer 12

Use the t-distribution table for problems in which the population standard deviation (σ) is not known and the sample size is small (n < 30). General rule: if σ is not known, using the t-distribution is correct.

Answer 13

The t-SNE technique may not necessarily provide similar results on subsequent runs, and the optimization process has additional hyperparameters.

Answer 14

While it is technically possible to use PCA on discrete variables, or on categorical variables that have been one-hot encoded, you should not. Simply put, if your variables don't belong on a coordinate plane, then do not apply PCA to them.

Answer 15

It changes the nature of t-distribution used and might give slightly different results but we don't have to worry about it during implementation.

Answer 16

Ordered pairs are used to describe locations on the coordinate plane. An ordered pair gives a point's location along the x-axis (the first value of the ordered pair) and along the y-axis (the second value of the ordered pair).

Answer 17

The t-SNE technique calculates a similarity measure between pairs of instances in the high-dimensional space and in the low-dimensional space. It then uses a cost function to make these two similarity measures match as closely as possible. If the number of features is very high, it is highly advisable to first use another dimensionality reduction method such as PCA to reduce the number of dimensions to a manageable number.

Answer 18

They are used in production-level models too. They are essentially a means to understand which features or transformed features would have the most predictive power in a future model.

Answer 19

To remove the redundancy. If the Principal Components are not orthogonal they may represent the same information.

Answer 20

There is a significant difference between the two strategies for lowering the number of features in a dataset. Feature selection is just picking and choosing which features to include and exclude without changing them. Dimensionality reduction changes the dimensions' characteristics.

Answer 21

Not exactly. In A/B testing we have two drugs, A and B, and test which of them yields better results. In this scenario, however, we are testing whether "offering" mammography to people will reduce the death rate or not.

Answer 22

The breast cancer death rate is the death rate per 1000 randomized women. It is calculated by multiplying the ratio of the number of women who died to the total number of women in that group by 1000. Among people who were screened, 23 died due to breast cancer and the group size is 20,200. Hence, the breast cancer death rate for the screened group is (23/20200)*1000, or about 1.14 per 1000.
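
The arithmetic in this card can be checked with a short sketch. The function name `death_rate_per_1000` is an illustrative choice; the counts are the ones quoted in the card.

```python
def death_rate_per_1000(deaths: int, group_size: int) -> float:
    """Deaths per 1,000 women in a group."""
    return deaths / group_size * 1000

# Screened group from the card: 23 deaths among 20,200 women.
screened_rate = death_rate_per_1000(23, 20_200)
print(round(screened_rate, 2))  # 1.14
```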

Answer 23

When we are trying to gather samples from the population for our experiment, we have to make sure there is enough randomization and enough samples. Randomization means that the subjects/people that are chosen to be part of the study are assigned to test and control groups in a 'random' manner, without following any other criteria. It is done to ensure that test and control are similar to each other and therefore an unbiased/fair comparison of results/measures is possible.

Answer 24

Sometimes it is not possible to do so for ethical reasons. Here the question is about "offering" mammography or not; it is not about taking mammography, and we cannot force someone to do it. Similarly, you may ask whether smoking increases the probability of lung cancer. Here we cannot ethically perform a randomized controlled trial, since we cannot force someone to smoke.

Answer 25

No, we want to look at the effect of mammography just being offered and not receiving mammography. People have the liberty to accept or refuse the offer.

Answer 26

Yes. In general, any experiment is conducted on a sample of the population. The sample is always derived from the population hoping our sampling was random enough to capture statistics closer to population statistics.

Answer 27

After a subject was chosen for this experiment, they were put into either the treatment group or the control group, so yes, the groups are mutually exclusive and no subject could be a part of both.

Answer 28

We calculate the p-value for this data set using hypothesis testing. If the p-value is less than the significance level, then the result is statistically significant and we can reject the null hypothesis. Here, the p-value for the mammography study is 0.012, which is less than the significance level of 0.05, and hence we can conclude that the result is statistically significant and that offering mammography reduces the death rate due to breast cancer.
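
The decision rule in this card can be sketched as follows. The exact one-sided binomial test shown here is one common way to obtain such a p-value and is an assumption, since the card does not say how 0.012 was computed. The names `binomial_lower_tail` and `is_significant` and the toy counts are made up for illustration.

```python
from math import comb

def binomial_lower_tail(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p): a one-sided (lower-tail) p-value."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def is_significant(p_value: float, alpha: float = 0.05) -> bool:
    """Reject the null hypothesis when the p-value is below alpha."""
    return p_value < alpha

# Hypothetical toy experiment: 2 events in 10 trials, null probability 0.5.
toy_p_value = binomial_lower_tail(k=2, n=10, p=0.5)
print(toy_p_value)                  # 0.0546875
print(is_significant(toy_p_value))  # False

# The p-value quoted in the card for the mammography study.
print(is_significant(0.012))        # True
```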

Answer 29

It could be or could not be, but since we are gathering samples with randomization, we are not concerned with age here. That would be a different problem statement with a different data set.

Answer 30

Yes, this can happen. This is called the placebo effect. The placebo effect is when an improvement of symptoms is observed, despite using a non-active treatment. The placebo effect can add bias to our results and hence it is important to control it in experiments/trials.

Answer 31

It is not necessary and we can do our experimentation with a different number in each group too, but the way this experiment was conducted, a certain number of people were picked and half of them were in the treatment group, the other half in the control group.

Answer 32

Bernoulli distribution focuses on the outcome of a trial for a single time, whereas a binomial distribution is used if we need the outcome of that event for a certain number of repetitions.
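
A minimal simulation sketch of the distinction: a binomial draw is just the number of successes across repeated Bernoulli trials. The function names, the success probability of 0.3, and the seed are arbitrary choices for this example.

```python
import random

def bernoulli_trial(p: float, rng: random.Random) -> int:
    """One trial: 1 (success) with probability p, otherwise 0."""
    return 1 if rng.random() < p else 0

def binomial_draw(n: int, p: float, rng: random.Random) -> int:
    """Number of successes in n independent Bernoulli(p) trials."""
    return sum(bernoulli_trial(p, rng) for _ in range(n))

rng = random.Random(0)  # seeded only so the sketch is repeatable
single_outcome = bernoulli_trial(0.3, rng)    # always 0 or 1
success_count = binomial_draw(20, 0.3, rng)   # an integer from 0 to 20
```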

Answer 33

No, but since the null hypothesis is the statement of no change, it must contain the equality (=, <=, >=). The alternative hypothesis is the complement of the null hypothesis, i.e. (not equal, >, <).

Answer 34

Yes, when we do hypothesis testing, we only deal with interval estimates. Hence we approve/reject null/alternate hypotheses based on inequalities rather than equalities.

Answer 35

The probability of making a type I error is alpha, which is the level of significance you set for your hypothesis test. An alpha of 0.05 indicates that you are willing to accept a 5% chance of rejecting the null hypothesis when it is actually true.

Answer 36

It is the trade-off between business risks and costs associated with it. For example, a drug manufacturer will take a lower significance level like 0.01 because even a small error could lead to huge financial losses or health effects on patients.

Answer 37

The sample size is related to how large the effect is that we want to be able to detect. For example, in the HIP study, if we want to detect a very small effect, say the death rate is only changing from 2.0 to 1.9, then we need a very large sample size to test whether it's statistically significant or not. In general, any effect size can be statistically significant with a large enough sample.
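
To make the link between effect size and sample size concrete, here is a rough sketch based on the standard normal-approximation formula for comparing two proportions with a one-sided test. The formula choice, the 80% power, and the name `sample_size_per_group` are assumptions for illustration; the card itself does not specify a method.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(p1: float, p2: float,
                          alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate per-group sample size to detect p1 vs p2 (one-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)       # quantile for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# Death rates per woman: 2.0 per 1000 versus 1.9 per 1000 (small effect),
# compared with 2.0 per 1000 versus 1.0 per 1000 (large effect).
n_small_effect = sample_size_per_group(0.0020, 0.0019)
n_large_effect = sample_size_per_group(0.0020, 0.0010)
print(n_small_effect > n_large_effect)  # True
```

Under these assumptions the small effect needs a group size in the millions, while the large effect needs only tens of thousands.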

Answer 38

Yes, there might be a case where the p-value is much greater than alpha, and in that case we fail to reject the null hypothesis. A p-value much greater than alpha means the data provide little evidence against the null hypothesis, although this does not prove that the null hypothesis is true.

Answer 39

Yes, we set a particular significance level and decide whether or not to reject the null hypothesis depending on the p-value we get from the data.

Answer 40

The corrections are to see if all of our null/alternate hypotheses stand valid.

Answer 41

In PCA, our main objective is to reduce dimensionality so that we can further use this data for computation easily. Hence a PCA would pre-process the data for certain machine learning tasks, where we would minimize the loss. Otherwise, stand-alone PCA would not require us to minimize loss.

Answer 42

Yes, each principal component is an eigenvector.

Answer 43

Being perpendicular means that each component lies at a 90-degree angle to every other component. Every direction in the transformed data is orthogonal to the others. Hence, if we can visualize our data, the principal components will always be perpendicular to each other, like the X, Y, and Z axes in the Cartesian coordinate system.

Answer 44

Geometrically, an eigenvector, corresponding to a real nonzero eigenvalue, points in a direction in which it is stretched by the transformation, and the eigenvalue is the factor by which it is stretched. If the eigenvalue is negative, the direction is reversed.

Answer 45

Yes, the eigenvector with the largest eigenvalue is the first principal component that explains the maximum variance in the data.

Answer 46

If the variables are on the same scale/unit, we need not normalize them.

Answer 47

No, there are other linear methods (such as factor analysis), but PCA is the most popular linear dimensionality reduction technique.

Answer 48

t-SNE is a very effective nonlinear dimensionality reduction technique. It can handle outliers well. It works by minimizing the mismatch between pairwise similarities in the original space (modeled with a Gaussian) and in the embedding, and it tries to preserve the local structure of the data. It is relatively computationally expensive. PCA is a linear dimensionality reduction technique that is affected by outliers. It finds the eigenvectors of the covariance matrix to preserve variance and tries to preserve the global structure of the data. It is very computationally efficient, but it is not as good as t-SNE for visualizing data in lower dimensions.

Answer 49

Each principal component is an eigenvector of the covariance matrix of the features. The covariance matrix is always symmetric, and eigenvectors of a symmetric matrix that correspond to distinct eigenvalues are always perpendicular (for repeated eigenvalues, perpendicular eigenvectors can still be chosen). Hence, any two principal components are perpendicular to each other.
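
The perpendicularity claim can be checked numerically for two features, where the eigenvectors of the 2x2 covariance matrix have a closed form. This is a minimal sketch: the data are made-up toy values, the function names are illustrative, and the eigenvector formula assumes the two features have nonzero covariance.

```python
from math import hypot, sqrt

def covariance_matrix_2d(xs, ys):
    """Return (var_x, cov_xy, var_y) using the sample (n - 1) denominator."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    return var_x, cov_xy, var_y

def principal_components_2d(var_x, cov_xy, var_y):
    """Eigenpairs of [[var_x, cov_xy], [cov_xy, var_y]], largest eigenvalue first.

    Assumes cov_xy != 0 (the features are correlated).
    """
    centre = (var_x + var_y) / 2
    radius = sqrt(((var_x - var_y) / 2) ** 2 + cov_xy ** 2)
    pairs = []
    for eigenvalue in (centre + radius, centre - radius):
        # Solve (A - eigenvalue * I) v = 0 for a symmetric 2x2 matrix A.
        vx, vy = cov_xy, eigenvalue - var_x
        norm = hypot(vx, vy)
        pairs.append((eigenvalue, (vx / norm, vy / norm)))
    return pairs

# Hypothetical toy data: two strongly correlated features.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
var_x, cov_xy, var_y = covariance_matrix_2d(xs, ys)
(lam1, pc1), (lam2, pc2) = principal_components_2d(var_x, cov_xy, var_y)

dot_product = pc1[0] * pc2[0] + pc1[1] * pc2[1]
print(abs(dot_product) < 1e-9)  # True: the two components are perpendicular
```

The eigenvalues also sum to the total variance (`var_x + var_y`), so the larger one gives the share of variance explained by the first component.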

Answer 50

Normalization of the data means rescaling the data to fit the range [0,1]. It is commonly done by subtracting the smallest observation from each data point and dividing by the range (largest minus smallest); for non-negative data, simply dividing each data point by the largest observation also works. Standardization, on the other hand, is rescaling the data to have a mean of 0 and a standard deviation of 1. It is done by subtracting the mean from each data point and dividing each by the standard deviation.
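
A small sketch of the two rescalings, assuming the common min-max form of normalization and the population standard deviation for standardization. The function names and the toy heights are made up for illustration.

```python
from statistics import mean, pstdev

def min_max_normalize(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lowest, highest = min(values), max(values)
    return [(v - lowest) / (highest - lowest) for v in values]

def standardize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical toy feature.
heights_cm = [150, 160, 170, 180, 190]

normalized = min_max_normalize(heights_cm)
standardized = standardize(heights_cm)
print(normalized)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

After `standardize`, the values have mean 0 and standard deviation 1 up to floating-point error, but they are not confined to [0,1].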

Answer 51

PCA does not change the distribution of the data. It projects and represents the data on new axes which are called principal components.

Answer 52

Typically t-SNE is used to visualize the data in a lower dimension like a 2-dimensional or a 3-dimensional plane. If the number of dimensions is 4 or 5, then we simply get the embedding of the lower dimension but we cannot visualize them.

Answer 53

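Yes, for PCA we should generally scale the data first whenever the features are on different scales. PCA is driven by variance, so without scaling, features with larger numeric ranges would dominate the principal components. Once all features are on a comparable scale, it gets easier to identify the principal components.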

Answer 54

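Re-centering the data is not important for t-SNE, because it is a distance-based algorithm and shifting every point by the same amount does not change pairwise distances. Rescaling individual features does change those distances, however, so features on very different scales are usually still scaled before running t-SNE.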

Answer 55

Yes, if the interpretation of the original features is not very important, then we can reduce the dimension using PCA and use the top principal components, which explain the most variance in the data, to train the model.

Answer 56

The sample size is related to how large the effect is that we want to be able to detect. For example, in the HIP study, if we want to detect a very small effect, say the death rate is only changing from 2.0 to 1.9, then we need a very large sample size to test whether it's statistically significant or not. In general, any effect size can be statistically significant with a large enough sample.

Answer 57

Double-blind is used in different studies so that neither the patient nor the doctor knows what treatment is given. This is to avoid or minimize bias in the final result of the study. The placebo effect is when an improvement of symptoms is observed, despite using a non-active treatment. The placebo effect can add bias to our results and hence it is important to control it in experiments/trials.

Answer 58

It depends on the hypothesis test that we want to conduct i.e. the question we are asking. In the HIP study, we test whether offering mammography reduces the number of deaths. In this case, there is no need to consider the cancer stage.

Answer 59

Sometimes it is not possible to do so for ethical reasons. Here the question is about offering mammography or not; it is not about taking mammography, and we cannot force someone to do it. Similarly, you may ask whether smoking increases the probability of lung cancer. Here we cannot ethically perform a randomized controlled trial, since we cannot force someone to smoke.

Answer 60

Randomization means that the subjects/people that are chosen to be part of the study are assigned to test and control groups in a 'random' manner, without following any other criteria. It is done to ensure that test and control are similar to each other and therefore an unbiased/fair comparison of results/measures is possible. For example, in the HIP Mammography experiment, you have a certain number of people that will be part of the study, in this case, 62,000. Randomization means that for every person, we just flip a coin; if it's heads, we'll assign them to the treatment group; if it's tails, they'll be assigned to the control group. So, with 50% probability, we'll put a person in either the treatment or the control group. Therefore, randomization means that you don't evaluate any participant on any parameter and 'randomly' assign them to either test or control, to ensure that the two groups are very similar to each other.
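
The coin-flip assignment described in this card might be sketched as below. The function name `randomize`, the integer participant IDs, and the seed are illustrative assumptions; only the group size of 62,000 comes from the card.

```python
import random

def randomize(participants, seed=None):
    """Assign each participant to treatment or control with a fair coin flip."""
    rng = random.Random(seed)
    treatment, control = [], []
    for person in participants:
        # Heads (probability 0.5) goes to treatment, tails to control.
        if rng.random() < 0.5:
            treatment.append(person)
        else:
            control.append(person)
    return treatment, control

# Hypothetical participant IDs standing in for the 62,000 people in the study.
participant_ids = list(range(62_000))
treatment_group, control_group = randomize(participant_ids, seed=42)
print(len(treatment_group) + len(control_group))  # 62000
```

Independent coin flips give groups of roughly, not exactly, equal size. A design that needs an exact half-and-half split would shuffle the list and cut it in the middle instead.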

Answer 61

This is the death rate per 1000 randomized women. The breast cancer rate covers deaths due to breast cancer, and the all-other rate covers deaths due to other causes.

Answer 62

Yes. The study tests whether offering mammography to people will reduce the death rate or not. We can only test the impact of "offering" the treatment, as we cannot force anyone to take the treatment. That would not be random or even ethical in most cases.

Answer 63

In this study, 62,000 people are chosen randomly from the population of 700K. Randomization ensures that there is at least no selection bias in the sample selected from the population.

Answer 64

The study tests whether offering mammography will "reduce" the death rate or not. Here, the death rate from breast cancer in the control group is 0.002. So, we are performing a one-sided hypothesis test with the alternative hypothesis that the death rate is less than 0.002 if we offer mammography.

Answer 65

It is the trade-off between business risks and costs associated with it. For example, a drug manufacturer will take a lower significance level like 0.01 because even a small error could lead to huge financial losses or health effects on patients.

Answer 66

The sample size should depend on how large the effect is that we want to be able to detect. We repeat the experiment to be able to generalize the results. For example, if the hypothesis is that a drug is independent of geographical factors, then you may need to conduct the experiment in different countries to obtain a generalized result.

Answer 67

Tail heaviness is determined by a parameter of the t-distribution called degrees of freedom (dof), where smaller values give heavier tails and higher values make the t-distribution resemble a standard normal distribution. As the degrees of freedom approach infinity, the t-distribution approaches a standard normal distribution. The t-distribution gives a greater chance of extreme values than the normal distribution; hence its tails are heavier and, as a result, its variance is larger.
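
The effect of the degrees of freedom on the tails can be checked by comparing density values far from the center. This sketch evaluates the standard t and normal density formulas at x = 3; the function names and the chosen dof values are arbitrary, and `lgamma` is used only to avoid overflow for large dof.

```python
from math import exp, lgamma, pi, sqrt

def t_pdf(x: float, dof: float) -> float:
    """Density of Student's t-distribution with `dof` degrees of freedom."""
    log_coeff = lgamma((dof + 1) / 2) - lgamma(dof / 2) - 0.5 * (log_pi + log_of(dof))
    return exp(log_coeff) * (1 + x * x / dof) ** (-(dof + 1) / 2)

def normal_pdf(x: float) -> float:
    """Density of the standard normal distribution."""
    return exp(-x * x / 2) / sqrt(2 * pi)

# Small helpers kept explicit so the formula above stays readable.
from math import log as log_of
log_pi = log_of(pi)

# Density three units from the center, for increasing degrees of freedom.
tail_dof_2 = t_pdf(3.0, 2)
tail_dof_10 = t_pdf(3.0, 10)
tail_dof_100 = t_pdf(3.0, 100)
tail_normal = normal_pdf(3.0)
print(tail_dof_2 > tail_dof_10 > tail_dof_100 > tail_normal)  # True
```

With very large dof (for example 10,000) the t density at the same point is almost indistinguishable from the normal density.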

Answer 68

If you're in a business that benefits from rare events (say, an astronomical observatory with a grant to study Earth-orbit-crossing asteroids), you would naturally be more interested in outliers than in the bulk of the data.

Answer 69

We use PCA/t-SNE to visualize the data when it has a large number of features. It can be considered the first step in the model building process.

Answer 70

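Covariance signifies the direction of the linear relationship between the two variables; its magnitude depends on the units of the variables, so it is hard to interpret on its own. Correlation is covariance rescaled to lie between -1 and 1, so it signifies the direction as well as the strength of the linear relationship between them.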
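
A short sketch of the difference: changing the units of one variable changes the covariance but leaves the correlation untouched. The function names and the toy numbers are made up for this example.

```python
from math import sqrt

def covariance(xs, ys):
    """Sample covariance with the (n - 1) denominator."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical toy data: study time and test score.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 58.0, 61.0, 70.0, 74.0]
minutes = [h * 60 for h in hours]  # same quantity in different units

cov_hours = covariance(hours, scores)
cov_minutes = covariance(minutes, scores)
corr_hours = correlation(hours, scores)
corr_minutes = correlation(minutes, scores)

print(abs(cov_minutes - 60 * cov_hours) < 1e-9)  # True: covariance scales with units
print(abs(corr_minutes - corr_hours) < 1e-9)     # True: correlation does not
```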

Answer 71

PCA does not involve a dependent variable: All the variables are treated the same. It is a dimension reduction method. It uses an orthogonal transformation to form the principal components or linear combinations of the variables. However, linear regression is more about finding a straight line that best fits the data, depending on the internal data relationships.

Answer 72

There is no such thing as a good or bad PCA. It is simply an unsupervised technique.

Answer 73

We can decide the optimal number of principal components by plotting the cumulative sum of eigenvalues. If you divide each value by the total sum of eigenvalues before plotting, then your plot will show the fraction of total variance retained vs. the number of eigenvalues. The plot will then provide a good indication of when you hit the point of diminishing returns (i.e., little variance is gained by retaining additional eigenvalues).
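
The procedure in this card can be sketched without plotting by computing the cumulative fraction of variance directly. The eigenvalues below are hypothetical, and the 90% threshold is a common convention assumed for illustration rather than a rule stated in the card.

```python
from itertools import accumulate

def cumulative_explained_variance(eigenvalues):
    """Fraction of total variance retained by the first k components, for each k."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    return [running / total for running in accumulate(ordered)]

def components_needed(eigenvalues, threshold=0.9):
    """Smallest number of components whose cumulative fraction reaches threshold."""
    fractions = cumulative_explained_variance(eigenvalues)
    for k, fraction in enumerate(fractions, start=1):
        if fraction >= threshold:
            return k
    return len(fractions)

# Hypothetical eigenvalues of a covariance matrix with five features.
toy_eigenvalues = [4.0, 2.0, 1.0, 0.5, 0.5]

fractions = cumulative_explained_variance(toy_eigenvalues)
print(fractions)                           # [0.5, 0.75, 0.875, 0.9375, 1.0]
print(components_needed(toy_eigenvalues))  # 4
```

Plotting `fractions` against the component index gives the curve described above, and the point where it flattens is the point of diminishing returns.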


(97 cards)