Hypothesis Testing, PCA & t-SNE Flashcards

1
Q

How does a test become reliable?

A

Reliability is assessed by examining the consistency of results across time, among different observers, and across different sections of the test; a reliable test gives reproducible results. Validity is a separate property: a valid measurement is one that accurately captures what the test is meant to measure.

2
Q

What is a control group?

A

A control group is a group kept under normal conditions: when a test is performed on a population, the control group does not receive the treatment and serves as the baseline for comparison.

3
Q

What does random assignment fail to deal with?

A

Random assignment is sometimes impossible because the experimenters cannot control the treatment or independent variable. You can’t assign subjects to these groups at random, for example, if you want to see how people with and without depression do on a test.

4
Q

Do we need to randomize the treatments while experimenting?

A

Yes. Treatments should be assigned at random so that all kinds of people have an equal chance of landing in each group, which keeps the groups as comparable and representative as possible.

5
Q

Differentiate between control group and treatment group?

A

The treatment group receives the treatment whose effect the researcher is interested in. The control group receives either no treatment or a standard treatment whose effect is already known.

6
Q

For the medical domain, how do we adjust the alpha value?

A

For instance, a medical scientist who develops a new treatment that may revolutionize the management of an illness and replace a standard therapy must be very certain that the new approach is superior to the old one. Due to the potential impact on the field and the negative consequences of making wrong decisions, it’s very important to take a conservative approach before claiming a difference. In this case, reducing the chance of making a type 1 error is more important and making a type 2 error is more acceptable because this would suggest no change in medical treatment. This can be accomplished by the use of a more stringent alpha level, such as 0.01 or 0.001.
For less critical research decisions, decreasing the chance of a type 2 error is more appropriate. This can be accomplished by using a more liberal alpha level, say 0.1, which makes it easier to reject the null hypothesis. For example, let’s say that a researcher wants to compare two hand soaps, both known to work, to see which one cleans better. Does it really matter if it is concluded that one is better than the other if in fact there is no major difference between them? Probably not, so in this case a type 1 error is more acceptable. In summary, it’s the responsibility of the researcher to decide which error is less important and to set the alpha level accordingly.

7
Q

How do we know if we have the proper sample size for the experiment?

A

The effect size (typically the difference between two groups), the population standard deviation (for continuous data), the required power of the experiment to detect the postulated effect, and the significance level must all be known or estimated in order to calculate the sample size.
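To see how these four quantities combine, here is a minimal sketch using the standard normal-approximation formula for comparing two group means. The function name, the default values, and the choice of a two-sided z-test are assumptions made for illustration; they are not taken from the card.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(effect_size, std_dev, alpha=0.05, power=0.8):
    """Approximate n per group for a two-sided, two-sample z-test of means."""
    z = NormalDist()                    # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_power = z.inv_cdf(power)          # quantile matching the required power
    n = 2 * ((z_alpha + z_power) * std_dev / effect_size) ** 2
    return ceil(n)


# Hypothetical inputs: the smaller the effect, the larger the sample needed.
n_small_effect = sample_size_per_group(effect_size=0.5, std_dev=1.0)
n_tiny_effect = sample_size_per_group(effect_size=0.1, std_dev=1.0)
print(n_small_effect, n_tiny_effect)
```

Shrinking the effect size by a factor of five raises the required sample size by roughly a factor of twenty-five, because the effect size enters the formula squared.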

8
Q

What happens if the random selection is skewed?

A

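Our results may be harmed if the selection is biased, because the sample no longer represents the population. Separately, when the values of a variable are skewed, we can apply a transformation such as a log transformation to the entire set of values to discover patterns and make the data usable for the statistical model.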

9
Q

Is it true that all patients have the same level of cancer/stage, or is that, too, randomized?

A

Even though each person’s condition is unique, cancers of the same sort and stage generally have comparable outlooks. When doctors discuss a patient’s cancer, the cancer stage is also a means for them to define the degree of the disease.

10
Q

Why is it necessary to know how many persons refused to participate in the study?

A

Our objective is to monitor the changes if mammography is introduced as a part of health insurance. So when it is introduced, some people will refuse it. Hence we have to take everything into account and see the results.

11
Q

Do you always need equal sample sizes in control and treatment? Or what is the threshold of how different they can be?

A

It is not necessary to have an equal number of samples in each group. Percentages and ratios can be computed for groups of any size, although equal sample sizes are easier to work with.

12
Q

How can we know if your experiment is biased or not?

A

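Bias can be introduced in many ways, for example through inconsistent measurement or through sampling. Randomization and blinding help guard against these. Multiple hypothesis testing corrections address a related problem: false positives that arise when many hypotheses are tested at once.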

13
Q

What is the p-value?

A

The p-value is the probability, assuming the null hypothesis is true, of observing a result at least as extreme as the one actually obtained. We calculate it for a data set using hypothesis testing. If the p-value is less than the significance level, the result is statistically significant and we can reject the null hypothesis. Here, the p-value for the mammography study is 0.012, which is less than the significance level of 0.05, so we can conclude that the result is statistically significant and that offering mammography reduces the death rate due to breast cancer.

14
Q

What is the distinction between the T and Z statistics?

A

When you don’t know the population standard deviation, you use a t-test with a T statistic instead of a Z score. The main difference between the two is that for a T statistic the population standard deviation must be estimated from the sample.
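The difference shows up in a single term of the formula. The sketch below computes both statistics for a one-sample test of a mean; the sample values, the hypothesized mean, and the function names are hypothetical and chosen only for illustration.

```python
from math import sqrt
from statistics import mean, stdev


def z_statistic(sample, mu0, sigma):
    """Z statistic: the population standard deviation sigma is known."""
    return (mean(sample) - mu0) / (sigma / sqrt(len(sample)))


def t_statistic(sample, mu0):
    """T statistic: sigma is unknown, so the sample standard deviation is used."""
    return (mean(sample) - mu0) / (stdev(sample) / sqrt(len(sample)))


sample = [5.1, 4.9, 5.6, 5.8, 5.2, 5.4]  # hypothetical measurements
z_value = z_statistic(sample, mu0=5.0, sigma=0.3)
t_value = t_statistic(sample, mu0=5.0)
print(z_value, t_value)
```

The T statistic is then compared against a t-distribution with n - 1 degrees of freedom rather than against the standard normal distribution.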

15
Q

Should Null and Alternative be mutually exclusive and complementary?

A

Not necessarily. They will be mutually exclusive, but they don’t have to be exhaustive and complementary.

16
Q

Is there any way to anticipate how often a type 1 error will occur given a significance threshold of 0.05?

A

A significance level of 0.05 means you accept a 5% chance of rejecting the null hypothesis when it is actually true, so type 1 errors are expected in about 5% of the tests in which the null hypothesis holds.
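A small simulation can make this rate tangible. The sketch below repeatedly draws samples for which the null hypothesis is true and counts how often a two-sided z-test rejects it anyway; the sample size, number of trials, and seed are arbitrary choices for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist


def type1_error_rate(alpha=0.05, trials=5000, n=30, seed=0):
    """Fraction of z-tests that reject a null hypothesis that is actually true."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        # Draw from N(0, 1), so the null hypothesis "mean = 0" is true.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        z = (sum(sample) / n) / (1.0 / sqrt(n))  # sigma = 1 is known here
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials


rate = type1_error_rate()
print(rate)  # expected to land near alpha
```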

17
Q

Is it true that some variations are too minor to matter if your sample size is large and you can detect a small difference?

A

As the sample size grows, the hypothesis test gains statistical power to detect tiny effects. With a big enough sample size, the test can identify effects so small that they are of almost no practical significance.

18
Q

We reject null, but is there proof that the treatment works?

A

Hypothesis testing is used to determine whether there is sufficient evidence to reject the null hypothesis. If there is insufficient evidence, we cannot reject it. Rejecting the null hypothesis is therefore evidence in favor of the treatment, not proof that it works.

19
Q

Should we compare the results of changing alpha to 0.01 vs. 0.05?

A

Reducing the alpha level from 0.05 to 0.01 lowers the risk of a false positive (also known as a Type I error), but it also makes it more difficult to detect differences using a t-test. As a result, any significant results you obtain would be more reliable, but there would be fewer of them.

20
Q

Is there a chance that we don’t have any substantial false positives?

A

If our test gives a p-value of 0, it means the result is statistically significant and the null hypothesis is rejected (that is, the differences between your groups are significant).

21
Q

What is the definition of covariate? Is it the same as having variables that are confounding?

A

No. Confounders are variables that are related to both the intervention and the outcome. Covariates are variables that explain a part of the variability in the outcome.

22
Q

What’s the distinction between PC1 and PC2?

A

Each principal component is an eigenvector. PC1 (Principal Component 1) captures the most variability in the data, while PC2 (Principal Component 2), which is orthogonal to (uncorrelated with) PC1, captures less variability than PC1, and so on.

23
Q

How are the principal components formulated?

A

The principal components are the eigenvectors of the covariance matrix of the original data, and the corresponding eigenvalues give the variance captured by each component. Eigenvectors and eigenvalues are therefore the linear algebra concepts we need to compute.
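For two features the computation can be written out by hand. The following sketch builds the 2x2 covariance matrix and uses the closed-form eigenvalues of a symmetric 2x2 matrix; it assumes the two features have nonzero covariance, and the data and function names are hypothetical.

```python
from math import hypot, sqrt


def covariance_matrix_2d(xs, ys):
    """Entries (sxx, sxy, syy) of the sample covariance matrix."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    return sxx, sxy, syy


def principal_components_2d(xs, ys):
    """Return [(eigenvalue, unit eigenvector)], largest eigenvalue first."""
    a, b, c = covariance_matrix_2d(xs, ys)
    mid = (a + c) / 2
    half_gap = sqrt(((a - c) / 2) ** 2 + b ** 2)
    components = []
    for lam in (mid + half_gap, mid - half_gap):
        # (b, lam - a) solves (C - lam * I) v = 0; assumes b != 0.
        vx, vy = b, lam - a
        norm = hypot(vx, vy)
        components.append((lam, (vx / norm, vy / norm)))
    return components


xs = [2.0, 4.0, 6.0, 8.0, 10.0]  # hypothetical feature 1
ys = [1.0, 3.0, 2.0, 5.0, 4.0]   # hypothetical feature 2
(var1, pc1), (var2, pc2) = principal_components_2d(xs, ys)
print(var1, pc1)
print(var2, pc2)
```

In practice a linear algebra library would handle any number of features; the hand-written version is only meant to show where the eigenvectors and eigenvalues come from.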

24
Q

Are PC1 and PC2 always pointing in opposite directions?

A

No. They are always orthogonal (perpendicular) to each other, not opposite.

25
Q

Are the system’s eigenvectors essentially the principal components?

A

Yes, the principal components are the eigenvectors of the covariance matrix.

26
Q

What does the term “greatest variation” mean?

A

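In PCA, variance is treated as the amount of information carried over from the original features. The direction of greatest variation is the one along which the data is most spread out, and PCA tries to capture the maximum amount of variance.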

27
Q

Will you be able to identify PC1 and PC2? What do they represent in the dataset?

A

PC1 and PC2 will often represent a “significant” percentage of the variability, but usually not 100%. Each principal component is a linear combination of the original features, and we can plot the first 2 or 3 principal components (using 2D or 3D plots) to visualize the data.

28
Q

What does the PCA stand for?

A

PCA stands for Principal Component Analysis. It is used to reduce the number of dimensions and to visualize data in a lower dimension.

29
Q

Do we know which variables are included in PC1?

A

Each principal component uses all the features. It is a linear combination of all the features available.

30
Q

What is normalization?

A

Normalization is the process of converting the values of numeric columns in a dataset to a common scale without distorting differences in the ranges of values. Not every dataset needs to be normalized for machine learning; it is only required when the features have different ranges.

31
Q

Can PCA be used for finding the anomalies in the data?

A

Yes. PCA-based anomaly detection examines the available features to learn what constitutes a “normal” class, and then uses distance metrics to detect cases that deviate from it. This method allows you to train a model with existing data that is imbalanced.

32
Q

In PCA, how do you account for variables with a lot of interacting effects?

A

If N variables are highly correlated, they will all load heavily on the same principal component (eigenvector) rather than on different ones. This is how you identify them as being highly correlated.

33
Q

Why is the normal distribution referred to as a Gaussian distribution?

A

Because the probability density graph of the normal distribution resembles a bell, it is commonly referred to as the bell curve. The Gaussian distribution is named after the German mathematician Carl Gauss, who first characterized it.

34
Q

For Gaussians, what will the mean and standard deviation be?

A

The normal distribution is a bell-shaped probability curve. The mean and standard deviation are its parameters: the mean sets the center and the standard deviation sets the spread.
The mean of a ‘standard normal distribution’ is 0 and the standard deviation is 1. It has a kurtosis of 3 and zero skew. Although all normal distributions are symmetrical, not all symmetrical distributions are normal.

35
Q

What is the t-SNE algorithm?

A

The t-SNE algorithm does not preserve distances directly; instead, it models similarities between points as probability distributions. It maps the input to a two- or three-dimensional map space. Similarities in the map space are modeled with a t-distribution, while similarities in the input space are modeled with a Gaussian distribution. The KL divergence between the two distributions is used as the loss function, which is minimized using gradient descent.
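The loss itself is easy to compute for small discrete distributions. The sketch below is only the KL divergence, not a t-SNE implementation; the three-element distributions are hypothetical stand-ins for the pairwise similarities in the input space (p) and in two candidate maps (q_good, q_bad).

```python
from math import log


def kl_divergence(p, q):
    """KL(P || Q) for two discrete distributions given as equal-length lists."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.5, 0.3, 0.2]         # hypothetical similarities in the input space
q_good = [0.45, 0.35, 0.2]  # a map whose similarities match P closely
q_bad = [0.1, 0.1, 0.8]     # a map that distorts the neighborhoods

print(kl_divergence(p, q_good), kl_divergence(p, q_bad))
```

A map that reproduces the input similarities has a divergence near zero, which is why gradient descent on this quantity pulls the low-dimensional layout toward the structure of the original data.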

36
Q

When do we use t-distribution?

A

You must use the t-distribution table when working on problems where the population standard deviation (σ) is not known and the sample size is small (n<30). General correct rule: if σ is not known, then using the t-distribution is correct.

37
Q

Is it possible to restore a previously run t-SNE?

A

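Not exactly. The t-SNE technique does not necessarily produce the same result on subsequent runs, because its optimization starts from a random initialization and depends on additional hyperparameters. Keeping the random seed and the hyperparameters fixed is the usual way to make a run repeatable.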

38
Q

How to deal with categorical variables in PCA?

A

While it is technically possible to use PCA on discrete variables, or on categorical variables that have been one-hot encoded, you should not. Simply put, if your variables don’t belong on a coordinate plane, then do not apply PCA to them.

39
Q

What role do the degrees of freedom of the t-distribution play?

A

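The degrees of freedom determine the shape of the t-distribution used (fewer degrees of freedom give heavier tails), so the results can differ slightly. Statistical software normally derives the value from the sample size, so we rarely have to worry about it during implementation.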

40
Q

What do we mean by ordered pairs in the coordinate plane?

A

Ordered pairs are used to describe locations on the coordinate plane. An ordered pair gives a point’s position along the x-axis (the first value of the ordered pair) and along the y-axis (the second value of the ordered pair), which together determine its location.

41
Q

What is t-SNE, and when should PCA be used over t-SNE?

A

The t-SNE technique calculates a similarity measure between pairs of instances in the high-dimensional space and in the low-dimensional space. It then uses a cost function to make these two similarity measures agree as closely as possible. If the number of features is very high, it is highly advisable to first use another dimensionality reduction method, such as PCA, to reduce the number of dimensions to a tolerable number before running t-SNE.

42
Q

Can PCA and t-SNE be used for models in production?

A

They are used in production-level models too. They are essentially a means to understand which features or transformed features would have the most predictive power in a future model.

43
Q

What is the significance of orthogonality?

A

Orthogonality removes redundancy. If the principal components were not orthogonal, they could represent the same information.

44
Q

Is it possible to employ feature selection and dimensionality reduction interchangeably?

A

No. There is a significant difference between the two strategies for lowering the number of features in a dataset. Feature selection simply picks which features to include and exclude without changing them. Dimensionality reduction transforms the features into a smaller set of new ones.

45
Q

Is hypothesis testing the same as A/B testing?

A

Not exactly. In A/B testing we compare two options, for example two drugs A and B, to see which yields better results. In this scenario, however, we are testing whether “offering” mammography to people will reduce the death rate or not.

46
Q

Why is the cancer rate for screened women calculated by 23 / 20200?

A

The breast cancer death rate is the death rate per 1000 randomized women. It is calculated by multiplying the ratio of the number of women who died to the total number of women in that group by 1000. Among people who were screened, 23 died due to breast cancer and the group size is 20,200. Hence, the breast cancer death rate for the screened group is (23/20200)*1000.
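The arithmetic can be checked in a couple of lines. The function name below is made up for illustration; the counts are the ones quoted in the card.

```python
def rate_per_thousand(deaths, group_size):
    """Deaths per 1000 people in the group."""
    return deaths / group_size * 1000


screened_rate = rate_per_thousand(23, 20200)
print(round(screened_rate, 2))  # 1.14 deaths per 1000 screened women
```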

47
Q

Can family history bias the experiment?

A

When we are trying to gather samples from the population for our experiment, we have to make sure there is enough randomization and enough samples. Randomization means that the subjects/people that are chosen to be part of the study are assigned to test and control groups in a ‘random’ manner, without following any other criteria. It is done to ensure that test and control are similar to each other and therefore an unbiased/fair comparison of results/measures is possible.

48
Q

Why did we not add the refused people to the control since they are not receiving the treatment anyways?

A

Sometimes it is not possible to do so for ethical reasons. Here the question is about “offering” mammography or not, not about taking it, and we cannot force someone to take it. Similarly, you may ask whether smoking increases the probability of lung cancer. Here we cannot ethically perform a randomized controlled trial, since we cannot force someone to smoke.

49
Q

There’s a difference between “offering mammography” and “receiving mammography”. So, do we want to look at receiving mammography?

A

No, we want to look at the effect of mammography being offered, not of mammography being received. People have the liberty to accept or refuse the offer.

50
Q

Is the data we are working on a sample of the population?

A

Yes. In general, any experiment is conducted on a sample of the population. The sample is drawn from the population in the hope that the sampling is random enough for the sample statistics to be close to the population statistics.

51
Q

Are Treatment and Control groups mutually exclusive?

A

Once a subject has been chosen for this experiment, they are placed in either the treatment group or the control group. So yes, the groups are mutually exclusive, and no subject can be part of both.

52
Q

What is the statistical significance for this particular set of information in mammography case study?

A

We calculate the p-value for this data set using hypothesis testing. If the p-value is less than the significance level then the result is statistically significant and we can reject the null hypothesis. Here, the p-value for mammography study is 0.012 which is less than significance level of 0.05 and hence we can conclude that the result is statistically significant and offering mammography reduces the death rate due to breast cancer.

53
Q

Could age be a factor for refusal?

A

It may or may not be, but since we are assigning subjects by randomization, we are not concerned with age here. That would be a different problem statement with a different data set.

54
Q

Being in the treatment group might increase their awareness in general and change their lifestyle (healthy diet etc.) Can they still refuse to screen?

A

Yes, this can happen. This is called the placebo effect.
The placebo effect is when an improvement of symptoms is observed, despite using a non-active treatment. The placebo effect can add bias to our results and hence it is important to control it in experiments/trials.

55
Q

Do we need to have the same number of samples in both treatment and control groups?

A

It is not necessary, and we can run the experiment with a different number in each group. In this experiment, however, a certain number of people were picked, and half of them were placed in the treatment group and the other half in the control group.

56
Q

How would you evaluate whether to use Binomial or Bernoulli?

A

The Bernoulli distribution models the outcome of a single trial, whereas the binomial distribution models the number of successes over a fixed number of independent repetitions of that trial.
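The relationship is visible in the probability mass functions. This is a minimal sketch with hypothetical function names; a coin flip with success probability 0.5 is used as the example trial.

```python
from math import comb


def bernoulli_pmf(k, p):
    """P(X = k) for one trial; k is 1 (success) or 0 (failure)."""
    return p if k == 1 else 1 - p


def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


print(bernoulli_pmf(1, 0.5))    # one flip lands heads: 0.5
print(binomial_pmf(2, 4, 0.5))  # exactly 2 heads in 4 flips: 0.375
```

With n = 1 the binomial distribution reduces to the Bernoulli distribution, so the choice comes down to whether one trial or a count over several trials is being modeled.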

57
Q

Is the null hypothesis always an equality? And is the alternative always an inequality?

A

No, but since the null hypothesis is the statement of no change, it must contain the equality (=, <=, >=). The alternative hypothesis is the complement of the null hypothesis, i.e. (not equal, >, <).

58
Q

We’re considering an interval estimate here, not a point estimate, right?

A

Yes. In hypothesis testing we work with interval estimates rather than point estimates, so we reject or fail to reject the null hypothesis based on inequalities rather than equalities.

59
Q

What does the 5% alpha mean?

A

The probability of making a type I error is alpha, which is the level of significance you set for your hypothesis test. An alpha of 0.05 indicates that you are willing to accept a 5% chance that you are wrong when you reject the null hypothesis.

60
Q

If alpha is subjective, can it lead to incorrect conclusions?

A

It is the trade-off between business risks and costs associated with it. For example, a drug manufacturer will take a lower significance level like 0.01 because even a small error could lead to huge financial losses or health effects on patients.

61
Q

How do you determine how large your sample needs to be in order to give you statistically significant results?

A

The sample size is related to how large the effect is that we want to be able to detect. For example, in the HIP study, if we want to detect a very small effect, say the death rate is only changing from 2.0 to 1.9, then we need a very large sample size to test whether it’s statistically significant or not. In general, any effect size can be statistically significant with a large enough sample.

62
Q

Can I have a p-value >> alpha?

A

Yes, there might be a case where the p-value is much greater than alpha, and in that case we fail to reject the null hypothesis. A large p-value means the data are consistent with the null hypothesis; it does not prove that the null hypothesis is true.

63
Q

Do you calculate the rejection for each test based on the p-value for that test?

A

Yes, we set a particular significance level and decide whether or not to reject the null hypothesis depending on the p-value we get from that test.

64
Q

Do these corrections reject null hypotheses altogether or accept some and reject the others?

A

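The corrections do not reject or accept all the hypotheses together. They adjust the significance threshold applied to each test, so some null hypotheses may be rejected while others are not.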
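As one concrete illustration, the Bonferroni correction divides the significance level by the number of tests and then judges each test separately. The card does not say which correction is meant, so the choice of Bonferroni, the function name, and the p-values below are all assumptions.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Return one decision per test; True means that null hypothesis is rejected."""
    threshold = alpha / len(p_values)  # stricter per-test threshold
    return [p < threshold for p in p_values]


p_values = [0.001, 0.012, 0.04, 0.3]  # hypothetical p-values from four tests
decisions = bonferroni_reject(p_values)
print(decisions)  # [True, True, False, False]
```

The third test would have been significant at 0.05 on its own but is not rejected once the threshold is corrected to 0.0125.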

65
Q

In PCA, do we also want to minimize the loss?

A

In PCA, our main objective is to reduce dimensionality so that the data is easier to use in further computation. PCA therefore often pre-processes the data for machine learning tasks in which a loss is minimized. Stand-alone PCA has no training loss to tune, although choosing the components that retain the most variance is equivalent to minimizing the reconstruction error.

66
Q

Is PC a vector?

A

Yes, each principal component is an eigenvector.

67
Q

What does “being perpendicular” mean here?

A

Being perpendicular means that each component lies at a 90-degree angle to every other component. If we visualize the data, the principal components are always perpendicular to each other, like the X, Y, and Z axes of a Cartesian coordinate system.

68
Q

What’s an eigenvector?

A

Geometrically, an eigenvector, corresponding to a real nonzero eigenvalue, points in a direction in which it is stretched by the transformation, and the eigenvalue is the factor by which it is stretched. If the eigenvalue is negative, the direction is reversed.

69
Q

PC1 is the eigenvector with the biggest magnitude, correct?

A

Yes, the eigenvector with the largest eigenvalue is the first principal component that explains the maximum variance in the data.

70
Q

Even if the variables are on the same unit, is it recommendable to normalize using the correlation?

A

If the variables are on the same scale/unit, we need not normalize them.

71
Q

Is PCA the only linear dimensionality reduction method?

A

No, there are other linear methods (such as factor analysis), but PCA is the most popular linear dimensionality reduction technique.

72
Q

Can you quickly compare PCA and SNE (advantages disadvantages)?

A

t-SNE is a very effective nonlinear dimensionality reduction technique. It can handle outliers well. It works by matching Gaussian-based similarities between points in the original space to similarities in the low-dimensional map, and it tries to preserve the local structure of the data. It is relatively computationally expensive.
PCA is a linear dimensionality reduction technique that is affected by outliers. It finds the eigenvectors of the covariance matrix to preserve variance and tries to preserve the global structure of the data. It is very computationally efficient, but it is generally not as good as t-SNE for visualizing the data in lower dimensions.

73
Q

How do you find a PC2 that is perpendicular to PC1?

A

Each principal component is an eigenvector of the covariance matrix of the features. The covariance matrix is always symmetric, and eigenvectors of a symmetric matrix that belong to distinct eigenvalues are always perpendicular. Hence, any two principal components are perpendicular to each other.

74
Q

What is the difference between the scaling methods standardization and normalization in the pre-processing stage of the data?

A

Normalization of the data means rescaling the data to fit the range [0,1]. A common way to do this is min-max scaling: subtract the smallest observation from each data point and divide by the range (largest minus smallest). For non-negative data whose minimum is zero, this reduces to dividing by the largest observation. Standardization, on the other hand, rescales the data to have a mean of 0 and a standard deviation of 1. It is done by subtracting the mean from each data point and dividing each by the standard deviation.
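The two rescalings can be compared side by side. This sketch assumes the min-max form of normalization (subtract the minimum, divide by the range) and uses the population standard deviation for standardization; the data values and function names are hypothetical.

```python
from statistics import mean, pstdev


def min_max_normalize(values):
    """Rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


data = [10.0, 20.0, 30.0, 40.0, 50.0]
normalized = min_max_normalize(data)
standardized = standardize(data)
print(normalized)    # [0.0, 0.25, 0.5, 0.75, 1.0]
print(standardized)
```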

75
Q

Does PCA create different data distribution or keep the same as the original dataset?

A

PCA does not change the distribution of the data. It projects and represents the data on new axes which are called principal components.

76
Q

How many dimensions are typically used for t-SNE for real problems? We used 2 instead of 64 for digits, which runs in seconds. So if we used, say, 4 or 5, would it be better if it still runs in minutes?

A

Typically t-SNE is used to visualize the data in a lower dimension like a 2-dimensional or a 3-dimensional plane. If the number of dimensions is 4 or 5, then we simply get the embedding of the lower dimension but we cannot visualize them.

77
Q

For PCA, we should always do the scaling first, right?

A

Yes, as a rule we scale before PCA: scaled data points have standard deviations on the same scale, so no feature dominates merely because of its units, and the principal components are easier to identify.

78
Q

As a standard approach, should we scale all the variables (as opposed to standardizing) before PCA and t-SNE?

A

Scaling matters for PCA. For t-SNE, re-centering the data changes nothing, because the algorithm depends only on distances between points. Rescaling features does change those distances, however, so features measured on very different scales are usually scaled before t-SNE as well.

79
Q

Can Principal Components be used as features in supervised machine learning?

A

Yes, if the interpretation of the original features is not very important, we can reduce the dimension using PCA and use the top principal components, which explain the most variance in the data, to train the model.

80
Q

What are the considerations for determining the size of the sample?

A

The sample size is related to how large the effect is that we want to be able to detect. For example, in the HIP study, if we want to detect a very small effect, say the death rate is only changing from 2.0 to 1.9, then we need a very large sample size to test whether it’s statistically significant or not. In general, any effect size can be statistically significant with a large enough sample.

81
Q

What is a double-blind experiment and placebo effect? How does the placebo effect change our final result?

A

Double-blind is used in different studies so that neither the patient nor the doctor knows what treatment is given. This is to avoid or minimize bias in the final result of the study.
The placebo effect is when an improvement of symptoms is observed, despite using a non-active treatment. The placebo effect can add bias to our results and hence it is important to control it in experiments/trials.

82
Q

How do we ensure randomness from the baseline conditions, for example, the cancer stage say 1, 2, or 3? Do we consider this condition during random sampling?

A

It depends on the hypothesis test that we want to conduct i.e. the question we are asking. In the HIP study, we test whether offering mammography reduces the number of deaths. In this case, there is no need to consider the cancer stage.

83
Q

Could we have created tests and controls after we had screened and refused results? Test group - Screened, Control group - based on the matches criteria?

A

Sometimes it is not possible to do so for ethical reasons. Here the question is about offering mammography or not, not about taking it, and we cannot force someone to take it. Similarly, you may ask whether smoking increases the probability of lung cancer. Here we cannot ethically perform a randomized controlled trial, since we cannot force someone to smoke.

84
Q

How do you ensure randomization of the population sample?

A

Randomization means that the subjects/people that are chosen to be part of the study are assigned to test and control groups in a ‘random’ manner, without following any other criteria. It is done to ensure that test and control are similar to each other and therefore an unbiased/fair comparison of results/measures is possible. For example, in the HIP Mammography experiment, you have a certain number of people that will be part of the study, in this case, 62,000. Randomization means that for every person, we just flip a coin; if it’s heads, we’ll assign them to the treatment group; if it’s tails, they’ll be assigned to the control group. So, with 50% probability, we’ll put a person in either the treatment or the control group. Therefore, randomization means, that you don’t evaluate any participant on any parameter and ‘randomly’ assign them to either test or control, to ensure that the two groups are very similar to each other.
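The coin-flip assignment described above can be sketched as follows. The function name, the use of integer subject IDs, and the fixed seed are assumptions made for illustration; only the group count of 62,000 comes from the card.

```python
import random


def assign_groups(subject_ids, seed=None):
    """Assign each subject to treatment or control with a fair coin flip."""
    rng = random.Random(seed)
    treatment, control = [], []
    for subject in subject_ids:
        # Heads (probability 0.5) goes to treatment, tails to control.
        if rng.random() < 0.5:
            treatment.append(subject)
        else:
            control.append(subject)
    return treatment, control


treatment, control = assign_groups(range(62000), seed=42)
print(len(treatment), len(control))  # roughly, not exactly, equal
```

Because no attribute of a subject is consulted, the two groups end up similar on average in every characteristic, measured or not.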

85
Q

In HIP data, what is meant by the term ‘rate’ in death rate?

A

This is the death rate per 1000 randomized women. The breast cancer rate counts deaths due to breast cancer, and the all-other rate counts deaths due to other causes.

86
Q

Is HIP study testing the effect of “offering” mammographies?

A

Yes. The study tests whether offering mammography to people will reduce the death rate or not. We can only test the impact of “offering” the treatment, as we cannot force anyone to take the treatment. That would not be random or even ethical in most cases.

87
Q

In the HIP study, how does one ensure that there is no bias in the samples selected from the population?

A

In this study, from the 700K population, 62000 samples are chosen randomly. Randomization ensures that there is at least no selection bias in the sample selected from the population.

88
Q

In the HIP study, why are the hypothesis statements not complementary?

A

The study tests whether offering mammography will “reduce” the death rate or not. Here, the death rate from breast cancer in the control group is 0.002. So, we are performing a one-sided hypothesis test whose alternative hypothesis is that the death rate is less than 0.002 if we offer mammography.

89
Q

How do we decide the ‘alpha’ in hypothesis testing?

A

It is the trade-off between business risks and costs associated with it. For example, a drug manufacturer will take a lower significance level like 0.01 because even a small error could lead to huge financial losses or health effects on patients.

90
Q

What is the use of conducting multiple tests over different populations? And how large should the sample size be?

A

The sample size should depend on how large the effect is that we want to be able to detect.
We repeat the experiment to be able to generalize the results. For example, if the hypothesis is that a drug’s effect is independent of geographical factors, then you may need to conduct the experiment in different countries to obtain a generalized result.

91
Q

Is the variance of t-distribution more than that of the normal distribution?

A

Tail heaviness is determined by a parameter of the t-distribution, called degrees of freedom (dof), where the smaller values give heavier tails, and higher values make the t-distribution resemble a standard normal distribution. As the degrees of freedom approaches infinity, it approaches a standard normal distribution.
So yes: the t-distribution has a greater chance of extreme values than the normal distribution, hence the heavier tails, and as a result its variance is larger.

92
Q

When should we be interested in outliers?

A

If you’re in a business that benefits from rare events, say an astronomical observatory with a grant to study Earth-orbit-crossing asteroids, you would naturally be more interested in outliers than in the bulk of the data.

93
Q

What is the place of PCA/visualization in the hierarchy of building models or processing data?

A

We use PCA/t-SNE to visualize the data when it has a large number of features. It can be considered the first step in the model building process.

94
Q

What is the difference between covariance and correlation?

A

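Covariance signifies the direction of the linear relationship between two variables, whereas correlation signifies both the direction and the strength of that relationship. Correlation is covariance divided by the product of the two standard deviations, so it is unit-free and always lies between -1 and 1.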
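A quick way to see the difference is to change the units of one variable. In the sketch below the heights and weights are hypothetical, and the helper functions are written out by hand so that the formulas are visible.

```python
from math import sqrt


def covariance(xs, ys):
    """Sample covariance of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Covariance rescaled by both standard deviations."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))


heights_m = [1.5, 1.6, 1.7, 1.8]
weights_kg = [50.0, 58.0, 63.0, 72.0]
heights_cm = [h * 100 for h in heights_m]  # same data, different unit

print(covariance(heights_m, weights_kg), covariance(heights_cm, weights_kg))
print(correlation(heights_m, weights_kg), correlation(heights_cm, weights_kg))
```

Switching from metres to centimetres multiplies the covariance by 100 but leaves the correlation unchanged, which is why only the sign of a covariance is directly interpretable.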

95
Q

Is linear regression similar to PCA?

A

PCA does not involve a dependent variable: all the variables are treated the same. It is a dimension reduction method that uses an orthogonal transformation to form the principal components, which are linear combinations of the variables. Linear regression, by contrast, models a dependent variable in terms of the others by finding the straight line that best fits the data.

96
Q

Is there a way to measure the performance of PCA i.e. what is a good PCA or bad?

A

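There is no such thing as a good or bad PCA in the sense of a model score, because it is an unsupervised technique with no target to predict. What can be measured is how much of the total variance the retained components explain.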

97
Q

How do we decide on the optimal number of principal components?

A

We can decide the optimal number of principal components by plotting the cumulative sum of eigenvalues. If you divide each value by the total sum of eigenvalues before plotting, then your plot will show the fraction of total variance retained vs. the number of eigenvalues. The plot will then provide a good indication of when you hit the point of diminishing returns (i.e., little variance is gained by retaining additional eigenvalues).
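The quantity being plotted can be computed directly. The sketch below skips the plot and returns the cumulative fraction of variance instead; the eigenvalues, the 0.8 target, and the function names are hypothetical choices for illustration.

```python
from itertools import accumulate


def cumulative_variance_ratio(eigenvalues):
    """Fraction of total variance retained by the first k components, for each k."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    return [running / total for running in accumulate(ordered)]


def components_needed(eigenvalues, target=0.95):
    """Smallest number of components whose cumulative ratio reaches the target."""
    for k, ratio in enumerate(cumulative_variance_ratio(eigenvalues), start=1):
        if ratio >= target:
            return k
    return len(eigenvalues)


eigenvalues = [4.0, 3.0, 2.0, 1.0]  # hypothetical eigenvalues
ratios = cumulative_variance_ratio(eigenvalues)
print(ratios)                                  # [0.4, 0.7, 0.9, 1.0]
print(components_needed(eigenvalues, 0.8))     # 3
```

A fixed target such as 0.8 or 0.95 is a rule of thumb; reading the point of diminishing returns off the plotted curve, as described above, is the more flexible approach.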