Quantitative Methods Flashcards

1
Q

what is random sampling?

A

Every individual in the population has an equal chance of being selected. It is difficult to achieve unless carried out by an institution with substantial resources.

2
Q

what is stratified sampling?

A

The population is divided into groups (strata) based on a shared, homogeneous characteristic, e.g. gender, and a random sample is then drawn within each group. The final sample is made up of the selections from every group.

3
Q

What is cluster sampling?

A

Similar to stratified sampling, but individuals are divided into groups (clusters) that are internally heterogeneous, so each group contains a good mixture of everything. Entire clusters are then selected as the sample.

4
Q

What are the types of sampling?

A

random, stratified, cluster
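
The three schemes can be sketched in a few lines. This is only an illustration under assumed data: the names `strata`, `clusters` and the people labels are hypothetical, and the fixed seed is there purely to make the sketch repeatable.

```python
import random

def simple_random_sample(population, k, rng):
    """Random sampling: every individual has the same chance of selection."""
    return rng.sample(population, k)

def stratified_sample(strata, k_per_stratum, rng):
    """Stratified sampling: draw a random sample from within every stratum."""
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, k_per_stratum))
    return sample

def cluster_sample(clusters, n_clusters, rng):
    """Cluster sampling: randomly pick whole clusters and keep everyone in them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [person for name in chosen for person in clusters[name]]

rng = random.Random(0)  # fixed seed, assumed only for repeatability

# Hypothetical population: strata are homogeneous (by gender),
# clusters are heterogeneous (each town mixes both genders).
strata = {"female": ["F1", "F2", "F3", "F4"], "male": ["M1", "M2", "M3", "M4"]}
clusters = {
    "town_a": ["F1", "M1"],
    "town_b": ["F2", "M2"],
    "town_c": ["F3", "M3"],
    "town_d": ["F4", "M4"],
}
population = strata["female"] + strata["male"]

simple = simple_random_sample(population, 4, rng)
stratified = stratified_sample(strata, 2, rng)
clustered = cluster_sample(clusters, 2, rng)
```

Note that only the stratified draw guarantees a fixed number from each gender; the cluster draw is balanced here only because every hypothetical town was built with one of each.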

5
Q

What are the sampling biases ?

A

Convenience sample: individuals who are easily accessible are more likely to be included in the sample.
Non-response: occurs if only a (non-random) fraction of the randomly sampled people respond to a survey, so that the sample is no longer representative of the population.
Voluntary response: occurs when the sample consists of people who volunteer to respond because they have strong opinions on the issue.

6
Q

What are the types of data?

A

1) Numerical (quantitative) 2) categorical (qualitative)

7
Q

What are the types of modality?

A

unimodal, bimodal, multimodal and uniform

8
Q

What is skewness? and what are the different types?

A

Skewness describes the asymmetry of a distribution. Types: left skewed, symmetric, right skewed.

9
Q

What is a box plot and explain IQR?

A

A box plot is a graphical summary built from the quartiles and the IQR. The IQR is the difference between the upper and lower quartiles, i.e. Q3 - Q1, which is the range of the central 50% of the data.
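
A minimal sketch of the Q3 - Q1 calculation, using Python's standard `statistics` module. The data set is made up, and the "inclusive" quartile method is an assumption: other quartile conventions can give slightly different values.

```python
import statistics

def interquartile_range(values):
    """Return Q3 - Q1, the range of the central 50% of the data."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical observations
iqr = interquartile_range(data)     # Q1 = 3, Q3 = 7 under this convention
```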

10
Q

what are the types of statistics?

A

Inferential statistics - methods used to estimate, predict and generalise a property of a population on the basis of a sample.
Descriptive statistics - methods of organising, summarising and presenting data in an informative way.

11
Q

what are the types of descriptive statistics?

A

measure of central location - mean, median, mode

measure of dispersion - range, variance, standard deviation

12
Q

what are the meanings of mean, median and mode?

A

mean - sum of the values divided by the number of items
median - middle item of the ordered data
mode - value that occurs the most often
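
As a quick illustration, the three measures can be computed with Python's standard `statistics` module. The scores are invented for the example.

```python
import statistics

scores = [2, 3, 3, 5, 7]  # hypothetical data, already in order

mean_value = statistics.mean(scores)      # (2 + 3 + 3 + 5 + 7) / 5
median_value = statistics.median(scores)  # middle item of the ordered data
mode_value = statistics.mode(scores)      # value that occurs most often
```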

13
Q

what are the unusual observations? and how do you deal with them?

A

1) errors - value not equal to the true/actual value. Check it with double entry or delete it.
2) outliers - cannot simply be eliminated! Analyse the data without them or treat them separately.

14
Q

explain the mean and trimmed mean and their +/-ve?

A

mean - can be distorted by extreme values, is often quoted to several decimal places, and does not necessarily correspond to an actual value.
trimmed mean - removes the effect of unusual values by eliminating a small proportion of the lowest/highest observations. Its -ve is that it is still quoted to several decimal places.
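
A small sketch of the difference, assuming we trim a fixed number of observations from each end (the function name and the data are illustrative only):

```python
def trimmed_mean(values, trim_each_end):
    """Mean after dropping `trim_each_end` lowest and highest observations."""
    ordered = sorted(values)
    kept = ordered[trim_each_end:len(ordered) - trim_each_end]
    if not kept:
        raise ValueError("trimming removed every observation")
    return sum(kept) / len(kept)

data = [1, 2, 3, 4, 100]           # 100 is an extreme value
plain = sum(data) / len(data)      # pulled up by the extreme value
trimmed = trimmed_mean(data, 1)    # drops 1 and 100, averages 2, 3, 4
```

Here the plain mean is 22 while the trimmed mean is 3, which shows how one extreme value distorts the ordinary mean.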

15
Q

explain the mode's -ve/+ve

A

+ve - only sensible measure for categorical data

-ve - may not be representative, unstable due to its sensitivity to the number of observations.

16
Q

compare skewness and measure of centre

A

zero skewness: mode = median = mean
positive skewness: mode < median < mean
negative skewness: mode > median > mean
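
The ordering can be checked on small made-up data sets. This is only an illustration: the ordering is a rule of thumb and real data can violate it.

```python
import statistics

def centre_measures(values):
    """Return (mode, median, mean) for a data set."""
    return (
        statistics.mode(values),
        statistics.median(values),
        statistics.mean(values),
    )

right_skewed = [1, 2, 2, 3, 4, 5, 20]       # long tail to the right
left_skewed = [-v for v in right_skewed]    # mirror image, tail to the left

mode_r, median_r, mean_r = centre_measures(right_skewed)
mode_l, median_l, mean_l = centre_measures(left_skewed)
```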

17
Q

which way skewed is a) positive b)negative

A

a) right b)left

18
Q

what is variance ?

A

the arithmetic mean of the squared deviations from the mean.

19
Q

what is standard deviation?

A

it is the square root of the variance.
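
A minimal sketch of both definitions, following the wording on the cards (mean of the squared deviations, then its square root). The data are invented. Note the assumption: this divides by n (population form), whereas the sample variance reported by most software divides by n - 1.

```python
import math

def population_variance(values):
    """Arithmetic mean of the squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def standard_deviation(values):
    """Square root of the variance, in the same units as the data."""
    return math.sqrt(population_variance(values))

data = [2, 4, 4, 4, 5, 5, 7, 9]        # hypothetical observations, mean = 5
variance = population_variance(data)   # squared deviations sum to 32, / 8
sd = standard_deviation(data)
```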

20
Q

why is standard deviation useful?

A

The units associated with the variance are squared. Taking the square root returns the same units as those used to calculate the mean, so the standard deviation can be compared directly with the sample mean.

21
Q

what is the coefficient of variation? why is it useful?

A

It is an indication of how large the standard deviation is in relation to the mean. It is useful when we want to compare the variability of variables that have different means and standard deviations.
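
A short sketch, assuming the usual definition CV = standard deviation / mean. The two data sets are hypothetical: the second is the first measured on a scale ten times larger.

```python
import statistics

def coefficient_of_variation(values):
    """Standard deviation expressed relative to the mean."""
    return statistics.pstdev(values) / statistics.mean(values)

data_a = [2, 4, 4, 4, 5, 5, 7, 9]    # mean 5, standard deviation 2
data_b = [v * 10 for v in data_a]    # mean 50, standard deviation 20

cv_a = coefficient_of_variation(data_a)
cv_b = coefficient_of_variation(data_b)
```

The standard deviations differ by a factor of ten, but both CVs are 0.4, so relative variability is the same.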

22
Q

what do we mean by lying with graphs ?

A

The range chosen for an axis can make the same data look very different on two graphs.

23
Q

benefit of measure of dispersion? and the types ?

A

can be more important than measures of the average such as the mean.

Range, IQR, Variance and standard deviation.

24
Q

positives and negatives of range?

A

+ve - simplest measure of dispersion (R = Max - Min); an unusually broad spread is useful for spotting typing errors.
-ve - only takes into account the two most extreme values

25
Q

positives of IQR?

A

+ve - not influenced by extreme values; a stable measure, i.e. it does not change a lot if we keep adding observations

26
Q

why do we use variance?

A

Squaring the deviations gets rid of negatives, so the -ve and +ve values do not cancel each other out when added together.
Squaring also increases larger deviations more than smaller ones, so that they are weighted more heavily.

27
Q

characteristics of variance?

A
  • non-negative
  • for observations whose values are near the mean, the variance will be small
  • for values widely dispersed from the mean, the variance will be large
28
Q

positives and negatives of variance?

A

+ve - uses all observations in the data set to measure the variation in the sample (vs range)
-ve - variance is measured in squared units, so its interpretation is not straightforward

29
Q

what is standard deviation and an advantage of it?

A

It is the most common/useful measure of dispersion: roughly the average distance of each observation from the mean.
Advantage - uses all values of the data set, expressed in the same unit of measure as the observations.

30
Q

what kind of relations can a scatter plot show?

A

linear (positive or negative relationship) and non-linear

31
Q

types of linear relationships and their meanings ?

A

linear positive - as one variable increases, so does the other
linear negative - as one variable increases, the other decreases
non-linear association - e.g. hours studied and test score
no association - e.g. number of people who go to the gym vs number of tickets sold at a museum

32
Q

factors involved in evaluating the plot

A

direction (positive/negative), shape (linear/curved), strength (strong/weak), outliers

33
Q

what is the correlation coefficient?

A

it is a way to measure the strength and direction of the linear association between two variables - takes values between +1 and -1

34
Q

what does +1 -1 and 0 mean?

A

+1 = perfect positive relationship
-1 = perfect negative relationship
0 = no linear relationship
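
The three landmark values can be reproduced with a small hand-written Pearson correlation. The function and data below are an illustrative sketch, not taken from the course material.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: strength and direction of a linear association."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

r_positive = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])        # perfect positive
r_negative = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])        # perfect negative
r_curved = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])  # y = x squared
```

The last case gives 0 even though y depends exactly on x, which is why 0 means "no linear relationship" rather than "no relationship".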

35
Q

types of linear regression…

A

scatter diagram= shows relationship between two variables, useful first step in correlation/regression analysis
correlation analysis= relationship between two variables to measure the strength of their association
regression analysis= relationship between variables with aim to ascertain the dependent effect of one variable upon another

36
Q

characteristics of linear regression and its uses

A
  • the dependent variable is a continuous numerical value
  • use LR to predict the value of a specific variable by using another variable
  • variable being predicted = dependent variable (y) and variable being used to predict the DV is called the independent variable (x)
37
Q

what is the regression model?

A

y= a + bx + e
where y = dependent variable (DV), x = independent variable (IV), e = residual,
a, b = estimated parameters (a = intercept, b = slope)

38
Q

B in SPSS = direction and magnitude of the relationship. Define the slope parameter (b) and intercept (a)

A
b = slope parameter; defines the impact that a one-unit change in the IV (x) has on the DV (y)
a = the intercept; the average value of y when x is equal to zero (when the IV has no effect on the dependent variable)
39
Q

what is the regression equation?

A

y(hat) = a+bx

40
Q

what is the residual (e) ?

A

difference between an observed and predicted y is called the residual i.e. e=y-y(hat)
It reflects the factors that are not considered in the model but have an influence on the DV. It can be positive or negative.

41
Q

What is the least squares principle? and the method for using it ?

A

The LSP minimises the sum of the squared deviations of the observations from the regression line, i.e. the squared residuals.
Method - the points in a scatter plot rarely all lie on a straight line, so we use least squares to find the line that best fits the data.
LSM - finds the line whose a and b values give the smallest possible sum of squared vertical distances between the line and the points in the scatter diagram, compared with any other line that can be drawn.
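
For a single IV the least squares estimates have a closed form, b = Sxy / Sxx and a = mean(y) - b * mean(x). The sketch below applies it to invented data and compares the fitted line with an arbitrary alternative line (a = 2, b = 0.7, chosen only for comparison).

```python
def fit_line(xs, ys):
    """Least squares estimates (a, b) for y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def sum_squared_residuals(xs, ys, a, b):
    """Sum of squared vertical distances e = y - y(hat)."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 3, 4, 5]   # hypothetical IV values
ys = [2, 4, 5, 4, 5]   # hypothetical DV values

a, b = fit_line(xs, ys)                              # a = 2.2, b = 0.6
sse_best = sum_squared_residuals(xs, ys, a, b)
sse_other = sum_squared_residuals(xs, ys, 2.0, 0.7)  # any other line is worse
```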

42
Q

what is significance and the P-value?

A

Significance asks whether a change in the IV is reliably associated with a change in the DV; we use a p-value of 0.05 as the threshold.
Low p-value (< 0.05) - the predictor is likely to be meaningful to the model because changes in the predictor value are related to changes in the response variable
Large p-value - insignificant, suggests changes in the predictor are not associated with changes in the response.

43
Q

what is the T-value ? why is it used/useful

A

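There is a correspondence between the t-value and the p-value: the t-value is the estimated coefficient divided by its standard error, and the larger its absolute value, the smaller the p-value. Standardised coefficients (beta) can be used to compare the relative impact of multiple IVs: a larger beta value indicates a higher impact. This is only useful when we have multiple IVs.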

44
Q

what is the coefficient of determination ? and why is it useful

A

R-squared ranges from 0 to 1 and gives us information about the relationship between the variables.
R2 = proportion of the total variation in the DV that is explained or accounted for by the variation in the IV
- in simple regression it is the squared coefficient of correlation.
e.g. an R2 of .951 means about 95% of the variation is explained
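
A sketch of the calculation as 1 - (unexplained variation / total variation). The data are invented, and the line y(hat) = 2.2 + 0.6x is taken as given for the illustration (it is the least squares line for these points).

```python
def r_squared(ys, predictions):
    """Proportion of the total variation in y explained by the model."""
    mean_y = sum(ys) / len(ys)
    ss_residual = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_residual / ss_total

xs = [1, 2, 3, 4, 5]   # hypothetical IV values
ys = [2, 4, 5, 4, 5]   # hypothetical DV values
predictions = [2.2 + 0.6 * x for x in xs]

r2 = r_squared(ys, predictions)   # about 0.6, i.e. 60% of variation explained
```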

45
Q

what is the multiple predictor model ?

A

y = a + b1x1 + b2x2 + … + bnxn + e

where y = DV, e = residual, a, b1… = parameters, x1, x2… = IVs

46
Q

what is the standardised coefficient? and when to use it ?

A

Refers to how many standard deviations the DV will change if an IV increases by one standard deviation. Use it when investigating which IV exerts the strongest effect on the DV: the unstandardised coefficients cannot be compared directly because the variables are measured in different units.
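
One common way to obtain it, assumed here, is beta = b * (standard deviation of x) / (standard deviation of y). The data and the slope b = 0.6 are illustrative values.

```python
import statistics

def standardised_coefficient(b, xs, ys):
    """Rescale slope b into standard-deviation units of x and y."""
    return b * statistics.stdev(xs) / statistics.stdev(ys)

xs = [1, 2, 3, 4, 5]   # hypothetical IV values
ys = [2, 4, 5, 4, 5]   # hypothetical DV values
b = 0.6                # unstandardised slope for these data

beta = standardised_coefficient(b, xs, ys)

# Measuring x in units 100 times smaller divides b by 100,
# but leaves the standardised coefficient unchanged.
xs_rescaled = [x * 100 for x in xs]
beta_rescaled = standardised_coefficient(b / 100, xs_rescaled, ys)
```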

47
Q

what is goodness of fit?

A

How well a model fits a set of observations; it typically summarises the discrepancy between the observed values and the values expected under the model in question. In linear regression analysis we use the R2 measure (coefficient of determination). If R2 is low, adding predictors may help improve it.

48
Q

why isnt R2 appropriate? what should you use instead for multiple regression?

A

R2 in multiple regression analysis tends to increase with the number of variables in the model, so it adds a ‘fake’ percentage to the variation in the DV that appears to be explained. Therefore, it is preferable to estimate the % of variation explained by the model using adjusted R2.

49
Q

what is the model for adjusted R2?

A

Adj R2 = 1 - (residual mean square / total mean square)
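
Written in terms of R2, this is commonly expressed as 1 - (1 - R2)(n - 1)/(n - k - 1), with n observations and k predictors. A minimal sketch with made-up inputs:

```python
def adjusted_r2(r2, n_observations, n_predictors):
    """Adjusted R2: penalises R2 for the number of predictors used."""
    n, k = n_observations, n_predictors
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Same raw R2, but the second model used one more predictor to get it.
adj_one_predictor = adjusted_r2(0.6, n_observations=5, n_predictors=1)
adj_two_predictors = adjusted_r2(0.6, n_observations=5, n_predictors=2)
```

With the same raw R2, the model that needed more predictors receives the lower adjusted value.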

50
Q

what do we use a) R2 for b)adj R2 for?

A

R2 is for regression using single predictors, adj R2 is for regression using multiple predictors.

51
Q

why are multiple predictors useful?

A

Multiple predictors can allow us to make a more accurate prediction of the values of the DV, as they allow us to explain a higher % of the difference in the values of the DV.

52
Q

write out the multiple regression model and its features

A

y= a +b1x1+b2x2+b3x3+…+e
where x1, x2, … = IVs, a = y-intercept, b1 = the net change in y for each unit change in x1, holding the other IVs constant (partial regression coefficient)
The least squares criterion is used to develop this equation. Determining b1, b2 etc. by hand is very tedious, so you need software.
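
To give a feel for what the software does, here is a minimal pure-Python sketch that solves the least squares normal equations for two IVs. The data are invented and constructed so that y = 1 + 2*x1 + 3*x2 exactly; real packages use more robust numerical methods.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * z = rhs by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def fit_multiple_regression(rows, ys):
    """Return [a, b1, b2, ...] minimising the sum of squared residuals."""
    design = [[1.0] + list(row) for row in rows]   # column of 1s for intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)

rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]   # hypothetical (x1, x2) pairs
ys = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]     # exact relationship, no noise

coefficients = fit_multiple_regression(rows, ys)  # close to [1, 2, 3]
```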

53
Q

what is linear assumption?

A

Needs the relationship between the IV and DV to be linear. It is important to check for outliers since LR is sensitive to outlier effects. Best tested with scatter plots of x against y.

54
Q

what are the five assumptions?

A

1) linear relationship, 2) multivariate normality, 3) lack of multicollinearity, 4) no autocorrelation, 5) homoscedasticity (constant variance)

55
Q

what is multivariate normality ?

A
  • linear regression requires all variables to be multivariate normal, or the errors to be normally distributed (bell curve)
  • Can be best checked with a histogram of the residuals or a P-P plot, which is a probability plot for assessing how closely two data sets agree; it plots two cumulative distribution functions against each other.
  • when the data are not normally distributed, a non-linear transformation, e.g. a log transformation, might fix this issue
56
Q

what is lack of multicollinearity?

A
  • multicollinearity occurs when the IVs are not independent from each other
  • multicollinearity might be tested with the following criteria:
    correlation matrix (computing the matrix of Pearson's correlation coefficients among all the IVs)
    or
    with the Variance Inflation Factor (VIF), defined as VIF = 1/(1-R2), where R2 comes from regressing one IV on the other IVs; VIF > 10 is an indication of multicollinearity, and if VIF > 100 it is certain.
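
A tiny sketch of the VIF formula. The R2 inputs are hypothetical values of the kind you would get from regressing one IV on the remaining IVs.

```python
def variance_inflation_factor(r2):
    """VIF = 1 / (1 - R2), where R2 is from regressing one IV on the others."""
    if r2 >= 1:
        raise ValueError("R2 of 1 means the IV is perfectly collinear")
    return 1 / (1 - r2)

vif_independent = variance_inflation_factor(0.0)   # no shared variation
vif_borderline = variance_inflation_factor(0.9)    # about 10, the warning level
vif_severe = variance_inflation_factor(0.995)      # well above 100
```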
57
Q

what do you do if you find multicollinearity ?

A

If multicollinearity is found in the data, centering the data, that is deducting the mean score from each value, might help to solve the problem.
- The aim is to remove the similarities by mean-centering the affected variables; if the high correlations are no longer found after centering, it is safe to continue with LR.

58
Q

what is no autocorrelation ?

A
  • independence of the residuals (errors)
  • autocorrelation occurs when the residuals are not independent from each other
  • typically occurs in time series such as stock prices
  • can be tested with the Durbin-Watson test
  • d is between 0 and 4; as a rule of thumb, values of 1.5 < d < 2.5 indicate no autocorrelation
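
The Durbin-Watson statistic is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. A minimal sketch with invented residual sequences:

```python
def durbin_watson(residuals):
    """d = sum((e_t - e_(t-1))^2) / sum(e_t^2); always between 0 and 4."""
    numerator = sum(
        (current - previous) ** 2
        for previous, current in zip(residuals, residuals[1:])
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Residuals that drift slowly: neighbours are similar, so d is near 0.
d_drifting = durbin_watson([-2, -1, 0, 1, 2])

# Residuals that flip sign every step: neighbours are opposite, so d is near 4.
d_alternating = durbin_watson([1, -1, 1, -1])
```

Both hypothetical sequences fall outside the 1.5 to 2.5 band, one on each side.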
59
Q

what is homoscedasticity?

A
  • it means the error terms (residuals) along the regression are equal or have constant variance
  • the scatter plot between y(hat) and e is a good way to check whether homoscedasticity is given
60
Q

what are binary variables?

A
  • a particular type of regular categorical variable

- have two values (0,1) and are often used to indicate that an event has occurred or that some characteristic is present

61
Q

natural starting point for LR is y= a+bx+e, what if its binary? what is the problem with this and what is the solution…

A

problem: y only takes the values 0 and 1, so a standard LR returns values of y(hat) that have no direct meaning as outcomes.
solution: make some changes to how y is treated so that the parameters and regression outcomes can be interpreted meaningfully.

When y is binary, the LR model becomes the Linear probability model (LPM):

  • a+bx is the probability that y=1, given x, i.e. Pr(y=1|x)
  • the predicted value y(hat) is the predicted probability that y=1, Pr(y=1|x), for a given x; changing x to x + Δx changes the probability that y=1 by b·Δx
62
Q

what are the problems with LPM?

A

1) Unbounded Predicted Probabilities
fundamental law of probability - states that the probability of an event occurring must be contained within the interval [0,1]
BUT the nature of an LPM does not ensure this fundamental law of probability is satisfied
- some predicted probabilities may take nonsensical values that are less than 0 or greater than 1.

2) Non-normality of the errors
- the errors/residuals of an LPM do not have a normal distribution (since y only takes the values of 0 and 1)
- the error has one of two possible values for a given x value (e= y-a-bx):
if y=1, then e= 1-a-bx
if y=0, then e= -a-bx

3) heteroscedasticity
- the variance of the errors depends on the independent variables and is not constant

4) non-linear relationship
- the model is linear: a unit increase in x results in a constant change of b in the probability of an event, holding all other variables constant
- the increase is the same regardless of the current value of x, which is often unrealistic
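
The first and fourth problems are easy to see numerically. In this sketch the parameter values a = 0.2 and b = 0.1 are invented purely for illustration.

```python
def lpm_probability(a, b, x):
    """Linear probability model: predicted Pr(y=1|x) = a + b*x."""
    return a + b * x

a, b = 0.2, 0.1   # hypothetical estimates

p_mid = lpm_probability(a, b, 3)     # a sensible probability
p_high = lpm_probability(a, b, 10)   # exceeds 1
p_low = lpm_probability(a, b, -5)    # falls below 0

# The change per unit of x is always b, whatever the starting value of x.
step_from_0 = lpm_probability(a, b, 1) - lpm_probability(a, b, 0)
step_from_6 = lpm_probability(a, b, 7) - lpm_probability(a, b, 6)
```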

63
Q

what are the advantages and disadvantages of LPM?

A

It models the probability as a linear function of x.
+ve: simple to estimate and to interpret, inference is the same as for multiple regression
-ve: unbounded predicted probabilities, non-normality of the errors, heteroscedasticity, non-linear relationship
THESE DISADVANTAGES CAN BE addressed by using a non-linear probability model = LOGIT MODEL

64
Q

why is the Linear probability model not as useful as logit?

A

The problem with the linear probability model is that it models the probability of y=1 as being linear and unbounded.
Pr(y=1|x) = a + bx (not sufficient)
Instead we want a non-linear transformation of a + bx

65
Q

what is the non-linear transformation?

A

The target is to find a functional form of a + bx that will only take values between 0 and 1:
a + bx ∈ (-∞, +∞) -> exp(a + bx) ∈ (0, +∞) -> exp(a + bx) / (1 + exp(a + bx)) ∈ (0, 1)

66
Q

what is the logit model?

A

Pr(y=1|x) = exp(a + bx) / (1 + exp(a + bx))
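
A minimal sketch of this formula. The parameter values a = 0 and b = 1 are arbitrary, and the two-branch form is just a way of avoiding overflow in `math.exp` for large inputs; it computes the same quantity.

```python
import math

def logit_probability(a, b, x):
    """Pr(y=1|x) = exp(a + b*x) / (1 + exp(a + b*x)), always inside (0, 1)."""
    z = a + b * x
    if z >= 0:
        return 1 / (1 + math.exp(-z))   # algebraically identical form
    ez = math.exp(z)
    return ez / (1 + ez)

a, b = 0.0, 1.0   # hypothetical parameter values

p_centre = logit_probability(a, b, 0)     # exactly one half
p_high = logit_probability(a, b, 10)      # close to 1 but never above it
p_low = logit_probability(a, b, -10)      # close to 0 but never below it

# Unlike the LPM, the effect of a one-unit change depends on where x starts.
step_near_centre = logit_probability(a, b, 1) - logit_probability(a, b, 0)
step_in_tail = logit_probability(a, b, 6) - logit_probability(a, b, 5)
```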

67
Q

non-linear transformation (2)

A
let p = Pr(y=1|x)
target = transform p ∈ (0, 1) into f(p) ∈ (-∞, +∞), where f(p) means a function of p
odds: p ∈ (0, 1) -> odds p/(1-p) ∈ (0, +∞) -> logit = ln(p/(1-p)) ∈ (-∞, +∞)
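
The chain from probability to odds to logit, and back again, can be sketched as follows. The function names are illustrative.

```python
import math

def odds(p):
    """Odds of the event: p / (1 - p), ranging over (0, +infinity)."""
    return p / (1 - p)

def logit(p):
    """Log odds: ln(p / (1 - p)), ranging over (-infinity, +infinity)."""
    return math.log(odds(p))

def inverse_logit(z):
    """Map a log-odds value back to a probability in (0, 1)."""
    return math.exp(z) / (1 + math.exp(z))

even_odds = odds(0.5)                        # 1: event as likely as not
even_logit = logit(0.5)                      # 0 on the log-odds scale
odds_of_80_percent = odds(0.8)               # about 4 to 1
round_trip = inverse_logit(logit(0.8))       # recovers about 0.8
```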
68
Q

what is the logit model formula?

A

ln(p/(1-p)) = a + bx, or equivalently p = exp(a + bx) / (1 + exp(a + bx))

69
Q

what does the logit model do?

A

It measures the relationship between the dummy (binary) categorical dependent variable and one or more independent variables by estimating probabilities using a logit link function.

70
Q

which model puts the constant a) on top and b)on the bottom?

A

a)linear b)logit

71
Q

what is unconstrained?

A

unconstrained (Mu) includes all predictors of interest

72
Q

what is constrained?

A

constrained model (Mc) is an intercept only model, excluding all predictors (i.e. constrain b=0)

73
Q

how do you check the goodness of fit on the logit model? and how is it similar to linear regression?

A

Using the likelihood ratio test, which plays a similar role to R2 in linear regression: it gives information on how much the variables (predictors) add to the model.

74
Q

why do you use the likelihood ratio test? and how?

A

To test whether there is evidence of the need to move from a simple model to a more complicated one.
You compare the likelihood function of a constrained and an unconstrained model and examine whether adding predictors in the unconstrained model significantly improves the understanding of the changes in the dependent (categorical) variable. Essentially you take the difference between the log-likelihoods of Mc and Mu; if the resulting test statistic is greater than the critical value, the predictors are valuable to include as they help us understand the DV better.
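
In its usual form the test statistic is -2 * (lnL of Mc - lnL of Mu), compared with a chi-square critical value whose degrees of freedom equal the number of constrained parameters. In this sketch the log-likelihood values are invented, and 3.841 is the 5% chi-square critical value for one degree of freedom (one added predictor).

```python
def likelihood_ratio_statistic(loglik_constrained, loglik_unconstrained):
    """LR = -2 * (lnL(Mc) - lnL(Mu)); larger values favour the bigger model."""
    return -2 * (loglik_constrained - loglik_unconstrained)

CHI2_CRITICAL_1DF_5PCT = 3.841   # chi-square critical value, 1 df, 5% level

loglik_mc = -70.0   # hypothetical intercept-only model
loglik_mu = -60.0   # hypothetical model with one predictor

lr = likelihood_ratio_statistic(loglik_mc, loglik_mu)
predictor_is_worth_including = lr > CHI2_CRITICAL_1DF_5PCT
```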

75
Q

difference between linear and logit in terms of appropriateness?

A

linear - must satisfy all LR assumptions

logit - used when the disadvantages of the LPM apply (binary DV)

76
Q

difference between linear and logit in terms of statistical model?

A

linear - regression model

logit - regression model (uses the log odds; a non-linear regression model)

77
Q

difference between linear and logit in terms of interpretation?

A

linear - magnitude, direction and significance
logit - magnitude and direction (non-linear effect, so clarify the starting value) and significance (changes in x will lead to changes in logit p)

78
Q

difference between linear and logit in terms of goodness of fit?

A

linear- Rsquare, adjusted Rsquare, evaluate the model

logit - likelihood ratio test, evaluate the model

79
Q

difference between linear and logit in terms of applications?

A

NONE- MAKE RECOMMENDATIONS BASED ON THE RESULTS AND/OR SUGGEST HOW TO FURTHER IMPROVE THE MODEL.