Statistical Methods For Business Analysis Flashcards

1
Q

What types of random variables are there?

A

Discrete (countable values, such as whole numbers)
Continuous (any value within a range)

2
Q

What is the variance?

A

How much the values differ from the expected value or average: the average squared distance from the mean.

3
Q

What is standard deviation?

A

The typical distance between each data point and the mean or expected value. It is the square root of the variance.
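
A minimal sketch of both measures in Python. The data are hypothetical, and the population form (dividing by n) is an assumption of this sketch; the sample form divides by n - 1.

```python
import math

def variance(values):
    """Population variance: average squared distance from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def standard_deviation(values):
    """Square root of the variance, in the same units as the data."""
    return math.sqrt(variance(values))

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values, mean = 5
print(variance(data))            # 4.0
print(standard_deviation(data))  # 2.0
```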

4
Q

What are the requirements of the standard normal distribution?

A

Expected value = 0
Standard deviation = 1

5
Q

What does it mean if the null hypothesis is true?

A

There is no relationship

6
Q

What does it mean if the null hypothesis is false?

A

There is a relationship

7
Q

What is the T statistic?

A

It’s a statistic that measures the distance between the sample average of x and the value stated by the null hypothesis, in units of standard error. How far x is from the expected value.
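
A rough sketch of the one-sample version, t = (sample mean - null value) / (s / sqrt(n)). The function name and the measurements are hypothetical.

```python
import math

def t_statistic(sample, mu0):
    """One-sample t statistic: (sample mean - mu0) / standard error."""
    n = len(sample)
    mean = sum(sample) / n
    # Sample standard deviation (divides by n - 1).
    s = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
    return (mean - mu0) / (s / math.sqrt(n))

sample = [5.1, 4.9, 5.6, 5.8, 6.0]  # hypothetical measurements
t = t_statistic(sample, mu0=5.0)    # null hypothesis: the true mean is 5.0
print(t)  # positive, because the sample mean is above 5.0
```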

8
Q

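What do we use the P value for?

A

To measure how strong the evidence against the null hypothesis is and decide whether to reject it or not. It is the probability of seeing data at least as extreme as ours if the null hypothesis were true.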

9
Q

What does a P value close to 0 or close to 1 mean?

A

If close to 0, there is strong evidence of a relationship, so we reject the null hypothesis.
If close to 1, there is no evidence that anything is going on, so we fail to reject the null hypothesis.

10
Q

What is unsupervised statistical learning?

A

We have a lot of variables and no target variable, and we want to learn something about them without a guide.

11
Q

What is supervised statistical learning?

A

We have a target variable (y) and our objective is to learn about the relationship between x and y.

12
Q

What is regression about?

A

Learning about the relationship between Y and one or more Xs

13
Q

What does it mean if the relationship between X and Y is deterministic?

A

Y is completely determined by X, with no allowance for error.

14
Q

What does it mean if the relationship between X and Y is probabilistic?

A

There is allowance for random error and unexplained variation. The error is assumed to be independent of x.

15
Q

What is correlation?

A

Describes the strength and direction of the linear relationship between two variables: how closely one changes when the other one changes. It ranges from -1 to 1.

-1 means perfect negative linear relationship
1 means perfect positive linear relationship
0 means no linear relationship at all.
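
The three cases can be sketched with a hand-written Pearson correlation. The data are hypothetical; note that the last example has a perfect curved relationship but zero linear correlation.

```python
import math

def correlation(xs, ys):
    """Pearson correlation: co-variation divided by both spreads."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

x = [1, 2, 3, 4, 5]
print(correlation(x, [2, 4, 6, 8, 10]))  # about 1: perfect positive
print(correlation(x, [10, 8, 6, 4, 2]))  # about -1: perfect negative
print(correlation(x, [4, 1, 0, 1, 4]))   # 0: U-shaped, no linear relationship
```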

16
Q

What does linear regression mean?

A

That the relationship between the variables is linear. A change in x results in a PROPORTIONAL change in y: each one-unit increase in x changes y by the same amount (the slope).

17
Q

What is MSE?

A

The mean squared error measures the quality of our model's predictions in regression: it is the average of the squared differences between the actual and predicted values. We want it to be low.
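
A small sketch of the calculation, with hypothetical actual and predicted values:

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

actual = [3.0, 5.0, 7.0]     # hypothetical responses
predicted = [2.0, 5.0, 9.0]  # hypothetical predictions
mse = mean_squared_error(actual, predicted)
print(mse)  # (1 + 0 + 4) / 3
```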

18
Q

What are the residuals?

A

The difference between the actual response variable and the predicted response variable.

They should be scattered randomly around 0, without following a pattern.

19
Q

What does the least squares (LS) criterion do?

A

Minimizes the residual sum of squares (RSS), trying to make the residuals as small as possible.
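
For a single predictor, the least squares line has a closed form. The sketch below uses hypothetical data and function names; it fits the line and shows that nudging the coefficients only increases the RSS.

```python
def least_squares_fit(xs, ys):
    """Return (intercept, slope) that minimize the residual sum of squares."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope

def rss(xs, ys, intercept, slope):
    """Residual sum of squares for the line intercept + slope * x."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]  # hypothetical, roughly y = 1 + 2x
b0, b1 = least_squares_fit(xs, ys)
print(rss(xs, ys, b0, b1))        # smallest possible RSS for a line
print(rss(xs, ys, b0 + 0.1, b1))  # any other line does worse
```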

20
Q

What is covariance?

A

Measures to what extent two variables change together.

21
Q

What is RSE?

A

The residual standard error: an estimate of the standard deviation of the residuals (the square root of their variance).

Should be small.

22
Q

What is R^2?

A

The coefficient of determination. Tells us how well the model fits the data by measuring the proportion of the variance in y explained by x. Ranges from 0 to 1: close to 0 is bad, close to 1 is good.
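
A sketch of how R^2 and the RSE can be computed from the residuals, using R^2 = 1 - RSS/TSS and RSE = sqrt(RSS / (n - p - 1)). The values and the function name are hypothetical.

```python
import math

def rse_and_r2(actual, predicted, n_predictors=1):
    """Return (RSE, R^2) for a fitted regression with n_predictors Xs."""
    n = len(actual)
    mean_y = sum(actual) / n
    rss = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # unexplained
    tss = sum((a - mean_y) ** 2 for a in actual)                # total
    rse = math.sqrt(rss / (n - n_predictors - 1))
    r2 = 1 - rss / tss
    return rse, r2

actual = [3.0, 5.0, 7.0, 9.0]
predicted = [3.5, 4.5, 7.5, 8.5]  # hypothetical fitted values
rse, r2 = rse_and_r2(actual, predicted)
print(rse, r2)  # RSS = 1 and TSS = 20, so R^2 is about 0.95
```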

23
Q

In multiple linear regression, how can we tell if there is a relationship or not?

A

If all the parameters (B) = 0, there is no relationship. If AT LEAST one is not equal to 0, there is a relationship.

24
Q

What is the F statistic?

A

Compares the variance explained by the model with the unexplained variance. We want it to be high (well above 1).

25
Q

What is the difference between F statistic and the P value?

A

The F statistic measures the model’s OVERALL significance, while the P value of each coefficient measures the significance of INDIVIDUAL Xs.

26
Q

How many possible models are there for p variables?

A

2^p models (each variable is either included or left out).
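
The count can be checked by listing every subset of predictors. A small sketch with hypothetical variable names:

```python
from itertools import combinations

def all_subsets(variables):
    """Every possible set of predictors, from the empty model to the full one."""
    subsets = []
    for size in range(len(variables) + 1):
        subsets.extend(combinations(variables, size))
    return subsets

variables = ["x1", "x2", "x3"]  # hypothetical predictor names, p = 3
models = all_subsets(variables)
print(len(models))  # 8, i.e. 2**3
```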

27
Q

What can be used to assess the accuracy of a model in linear regression?

A

RSE, the residual standard error (should be small)

R^2, the coefficient of determination (should be close to 1) 💖

28
Q

What is a classifier?

A

A classification model: a model that assigns each observation to a category.

29
Q

What is a dummy variable?

A

A Boolean (0/1) indicator variable in the function (the Xi) that encodes a category. The number of dummy variables is k-1 (k being the number of categories).
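
A minimal sketch of k-1 dummy coding. The category names are hypothetical, and treating the first category as the baseline is an assumption of this sketch.

```python
def dummy_encode(values, categories):
    """Encode a categorical variable with k - 1 dummy (0/1) variables.

    The first category is the baseline and is encoded as all zeros.
    """
    dummy_categories = categories[1:]
    return [[1 if v == c else 0 for c in dummy_categories] for v in values]

categories = ["north", "south", "west"]  # hypothetical, k = 3
rows = dummy_encode(["south", "north", "west"], categories)
print(rows)  # [[1, 0], [0, 0], [0, 1]]
```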

30
Q

What is collinearity?

A

When 2 or more variables are strongly correlated.

31
Q

What is logistic regression for?

A

Used to model a binary categorical variable and to classify: it gives the probability that Y is equal to one of the two options, given X.

32
Q

What is the Z statistic?

A

A measure of how many standard deviations a specific observation is from the mean.

If it is positive, the observation is above the mean.
If it is negative, below the mean.
If it is 0, the observation is equal to the mean.
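
A quick sketch of the three cases, assuming a hypothetical variable with mean 100 and standard deviation 15:

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations between value and the mean."""
    return (value - mean) / std_dev

print(z_score(130, mean=100, std_dev=15))  # 2.0  (above the mean)
print(z_score(85, mean=100, std_dev=15))   # -1.0 (below the mean)
print(z_score(100, mean=100, std_dev=15))  # 0.0  (equal to the mean)
```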

33
Q

What is the threshold (t) for?

A

Helps to determine which category we will predict: if the predicted probability is above t we predict one category, otherwise the other.
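
A sketch of how a threshold turns a logistic regression probability into a predicted category. The coefficients b0 and b1 are hypothetical, not fitted to any data.

```python
import math

def predicted_probability(x, b0, b1):
    """Logistic function: P(Y = 1 | X = x)."""
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

def classify(x, b0, b1, threshold=0.5):
    """Predict category 1 if the probability is above the threshold, else 0."""
    return 1 if predicted_probability(x, b0, b1) > threshold else 0

b0, b1 = -4.0, 2.0  # hypothetical coefficients
print(predicted_probability(2.0, b0, b1))     # 0.5, since b0 + b1 * x = 0
print(classify(3.0, b0, b1))                  # 1 (probability about 0.88)
print(classify(3.0, b0, b1, threshold=0.9))   # 0 (stricter threshold)
```

Raising the threshold makes the model more reluctant to predict category 1.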

34
Q

What are the resampling methods for and how do they work?

A

To evaluate the model’s performance and to select a model. They work by drawing samples from the data many times and refitting the model to each sample. We use them because we don’t always have separate test data.

35
Q

How do we validate a model?

A
  1. Training data: used to fit the models, NOT TO PREDICT.
  2. Test / validation data: used to make the predictions with the model fitted on the training data.
36
Q

How does LOOCV work?

A

Leave-one-out cross-validation is used to evaluate the model, to say how well it will perform on unseen data:

  1. Split the data, leaving observation “i” out as VALIDATION data and using the remaining data (n-1) as the training data.
  2. Iteratively for each i, we fit the model on the training data and predict the outcome y_i for the “ith” observation.
  3. We calculate the MSE for each of the predictions.
  4. The CV estimate is the average of all the MSEs.
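
The steps above can be sketched as follows. Using a simple least squares line as the model is an assumption of this sketch, and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Least squares line; returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope

def loocv(xs, ys):
    """Leave-one-out CV estimate of the test MSE."""
    squared_errors = []
    for i in range(len(xs)):
        # Step 1: observation i is the validation data, the rest is training.
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        # Step 2: fit on the n - 1 training points and predict y_i.
        intercept, slope = fit_line(train_x, train_y)
        prediction = intercept + slope * xs[i]
        # Step 3: squared error for this single held-out prediction.
        squared_errors.append((ys[i] - prediction) ** 2)
    # Step 4: average over all n held-out errors.
    return sum(squared_errors) / len(squared_errors)

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]  # hypothetical data
cv_error = loocv(xs, ys)
print(cv_error)
```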

37
Q

How does K-fold work?

A

K-fold cross-validation evaluates the model to see how well it is expected to generalize to new data by:

  1. Dividing the data into K groups (folds) of equal size (normally 5 to 10).
  2. For each iteration, we use the rest of the data (k-1 folds) as TRAINING DATA and the separated fold as the VALIDATION DATA. We predict on the validation data, repeating for all of the folds.
  3. Then we calculate the MSE on the validation data of each fold.
  4. The CV estimate is the average of all the MSEs.
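
A rough sketch of these steps. The helper names are hypothetical, and the "model" here simply predicts the mean of the training responses, which keeps the example short; any fit/predict pair could be plugged in.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and split them into k roughly equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def k_fold_cv(xs, ys, k, fit, predict):
    """Average validation MSE over the k folds (assumes len(xs) >= k)."""
    fold_mses = []
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        # The other k - 1 folds are the training data.
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        model = fit(train_x, train_y)
        # MSE on the held-out (validation) fold.
        errors = [(ys[i] - predict(model, xs[i])) ** 2 for i in fold]
        fold_mses.append(sum(errors) / len(errors))
    return sum(fold_mses) / k

def fit_mean(train_x, train_y):
    """Hypothetical baseline model: remember the mean response."""
    return sum(train_y) / len(train_y)

def predict_mean(model, x):
    return model

xs = list(range(10))
ys = [2.0 * x + 1.0 for x in xs]  # hypothetical data
cv_error = k_fold_cv(xs, ys, k=5, fit=fit_mean, predict=predict_mean)
print(cv_error)
```
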
38
Q

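What are the AIC and BIC for?

A

The Akaike and Bayesian information criteria are used to compare and select models, to improve the performance and interpretability of the model. They reward goodness of fit and penalize the number of parameters. Lower is better.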

39
Q

What is the Adjusted R^2 for?

A

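To see how much of the data variance the model explains while penalizing unnecessary variables, so models with different numbers of variables can be compared. Needs to be close to 1.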

40
Q

What is PCA?

A

Principal component analysis transforms a number of possibly correlated variables into a smaller set of uncorrelated variables called principal components (PCs). These are the directions in which the data varies the most, each UNCORRELATED WITH THE OTHER PCs.

The maximum # of PCs is equal to the number of variables; usually we keep only the first few.

Before performing PCA, the data should be normalized.

41
Q

What does it mean to normalize the data?

A

Mean = 0
Standard deviation = 1
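
A minimal sketch of this rescaling for one variable. The values are hypothetical, and dividing by n in the standard deviation is an assumption; some software divides by n - 1 instead.

```python
import math

def standardize(values):
    """Rescale values to mean 0 and standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

scaled = standardize([10.0, 20.0, 30.0, 40.0])  # hypothetical variable
print(scaled)
```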

42
Q

What are the score vectors?

A

The coordinates of the observations on each principal component. Each one has length n, the # of observations. (The states 🇺🇸)

43
Q

What are the loading vectors?

A

The weights of each variable in a principal component. Each one has length p, the # of variables (murders…)

44
Q

What is the PVE, proportion of variance explained?

A

The % of the variance in the data that is explained by each PC.

45
Q

What is the cumulative proportion?

A

The running sum of the PVE of each PC with the previous ones (the cumulative PVE).
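
A small sketch of both quantities, assuming we already have the variance of each principal component. The numbers are hypothetical.

```python
from itertools import accumulate

def pve(component_variances):
    """Proportion of variance explained by each principal component."""
    total = sum(component_variances)
    return [v / total for v in component_variances]

variances = [2.5, 1.0, 0.4, 0.1]  # hypothetical variances of PC1..PC4
proportions = pve(variances)
cumulative = list(accumulate(proportions))
print(proportions)  # PC1 explains 2.5 / 4.0 of the variance
print(cumulative)   # the last value is (about) 1
```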

46
Q

What’s the difference between PCA and clustering?

A

PCA looks for a low-dimensional representation of the observations that explains a good fraction of the variance.

Clustering looks for homogeneous subgroups among the observations.

47
Q

What is K-means?

A

Seeks to partition the observations into a pre-specified # of clusters (k), where each observation belongs to exactly one cluster. We want the within-cluster variation to be as small as possible.

  1. We randomly assign a # from 1 to k to each observation.
  2. The algorithm computes the mean (center) of each cluster and assigns each observation to the cluster whose center is the closest.
  3. Iterate until the cluster assignments stop changing.
  4. Do it multiple times and select the run with the smallest within-cluster variation.
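
A sketch of the four steps for one-dimensional data. The points are hypothetical, and re-seeding an empty cluster with a random point is an assumption of this sketch, not part of the steps above.

```python
import random

def k_means_1d(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: randomly assign a cluster number to each observation.
    labels = [rng.randrange(k) for _ in points]
    centers = [None] * k
    for _ in range(max_iter):
        # Step 2a: compute the mean (center) of each cluster.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            # Assumption: an empty cluster is re-seeded with a random point.
            centers[c] = sum(members) / len(members) if members else rng.choice(points)
        # Step 2b: assign each observation to the closest center.
        new_labels = [min(range(k), key=lambda c: abs(p - centers[c])) for p in points]
        # Step 3: stop when the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centers

def within_cluster_variation(points, labels, centers):
    """Sum of squared distances from each point to its cluster center."""
    return sum((p - centers[l]) ** 2 for p, l in zip(points, labels))

points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]  # hypothetical observations
# Step 4: run several random starts and keep the best one.
runs = [k_means_1d(points, k=2, seed=s) for s in range(10)]
labels, centers = min(runs, key=lambda r: within_cluster_variation(points, *r))
print(labels, centers)
```
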
48
Q

What is a dendrogram?

A

A tree like visual representation of the clustered observations

49
Q

When do we use hierarchical clustering?

A

When we don’t know in advance how many clusters we want, or when we have a lot of variables with possibly non-linear relationships.

50
Q

What does it mean when a fusion in hierarchical clustering happens lower in the tree?

A

The groups are more similar to each other.

51
Q

How does hierarchical clustering build the dendrogram?

A
  1. We treat each data point as its own cluster.
  2. Iteratively merge the clusters based on similarity until all the clusters are merged into a single cluster or until the stopping criterion is met.
52
Q

What is linkage and what types of linkage are there?

A

The similarity or dissimilarity criterion between clusters that determines which clusters are merged.

  1. Single (minimum distance): sensitive to outliers
  2. Complete (maximum distance): compact clusters, less sensitive
  3. Average (average distance): less sensitive
  4. Centroid (distance between centroids)
  5. Ward’s (minimizes the within-cluster variance when merging): similar clusters, robust to outliers
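
The first three linkages can be sketched as different summaries of the pairwise distances between two clusters. One-dimensional points and the cluster values are assumptions made to keep the example short.

```python
def pairwise_distances(cluster_a, cluster_b):
    """All distances between a point in cluster_a and a point in cluster_b."""
    return [abs(a - b) for a in cluster_a for b in cluster_b]

def single_linkage(cluster_a, cluster_b):
    return min(pairwise_distances(cluster_a, cluster_b))

def complete_linkage(cluster_a, cluster_b):
    return max(pairwise_distances(cluster_a, cluster_b))

def average_linkage(cluster_a, cluster_b):
    distances = pairwise_distances(cluster_a, cluster_b)
    return sum(distances) / len(distances)

a = [1.0, 2.0]  # hypothetical clusters
b = [5.0, 9.0]
print(single_linkage(a, b))    # 3.0 (closest pair: 2 and 5)
print(complete_linkage(a, b))  # 8.0 (farthest pair: 1 and 9)
print(average_linkage(a, b))   # 5.5 (mean of 4, 8, 3, 7)
```
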
53
Q

What is shrinkage for?

A

To improve the performance and interpretability of a model by penalizing the coefficients, shrinking them toward 0. This reduces the variance and mitigates overfitting, multicollinearity, and the impact of irrelevant variables. We can do it through ridge regression or LASSO. (LASSO for simpler models, because it can set coefficients exactly to 0.)

54
Q

What is dimension reduction?

A

Aims to reduce the # of variables. PCA is a dimension reduction technique.

55
Q

What is tree pruning for?

A

A technique to reduce overfitting by removing branches (leaves) that add little predictive power on unseen data.

56
Q

When should we use a decision tree?

A

When the relationships are non-linear. But decision trees tend to overfit and are sensitive to small variations in the data.

57
Q

What is bootstrap?

A

It’s a resampling technique. We randomly sample data points, with replacement, from the original data.
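
A minimal sketch of bootstrap resampling. Estimating the variability of the mean is one common use and is an assumption of this example; the data are hypothetical.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) points from data, with replacement."""
    return [rng.choice(data) for _ in data]

def bootstrap_means(data, n_resamples=1000, seed=0):
    """Mean of each bootstrap sample; their spread estimates the uncertainty."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        sample = bootstrap_sample(data, rng)
        means.append(sum(sample) / len(sample))
    return means

data = [4.1, 5.0, 5.3, 6.2, 7.4]  # hypothetical observations
means = bootstrap_means(data)
print(min(means), max(means))  # range of the resampled means
```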

58
Q

What is bagging?

A

Bootstrap aggregating. It improves the stability and accuracy of high-variance models. We draw multiple bootstrap samples and train a model independently on each bootstrap sample, then we combine the predictions from all the individual models into a final prediction.

59
Q

What is a random forest?

A

Multiple decision trees combined to improve accuracy and reduce overfitting. (Accurate, less sensitive to outliers)

  1. Build multiple decision trees during training.
  2. Each tree is trained on a random subset of the data (bootstrap sample) & a random subset of features (considered at each split).
  3. The final prediction is made by averaging (for regression) or voting (for classification) over all the individual trees.