DSE1101 Flashcards

1
Q

What is a variable?

A

A characteristic observed in a study.

2
Q

When is a variable categorical?

A

When each observation belongs to one of a set of categories.

3
Q

When is a variable quantitative?

A

When observations take on numerical values that represent different magnitudes.

4
Q

What is the independent variable also called?

A

Explanatory variable

5
Q

What is the dependent variable also called?

A

Response variable

6
Q

What is the mean?

A

The average; it is one way to measure the center of a distribution.

7
Q

What is the sample mean?

A

The sample mean is a sample statistic and serves as a point estimate of the population mean.

8
Q

What kind of variable does a histogram show?

A

The distribution of a continuous variable.

9
Q

What is modality?

A

The number of peaks in the distribution of the data. Only prominent peaks that reflect the general pattern count; data with one such peak are called unimodal.

10
Q

What is unimodal?

A

1 peak

11
Q

What is a distribution with 2 peaks called?

A

Bimodal

12
Q

What is a distribution with more than 2 peaks called?

A

Multimodal

13
Q

What is it called when all values occur about equally often, so that no peak stands out?

A

Uniform data

14
Q

Where is the peak on negatively skewed data?

A

Long tail on the left
Peak on the right

15
Q

Give an example of negatively skewed data.

A

GPA
Age at death

16
Q

Where is the peak on positively skewed data?

A

Longer tail on the right
Peak on the left

17
Q

If a question asks whether data are left or right skewed, do we remove outliers first?

A

Yes

18
Q

If the data show some people spending $1000 in a supermarket, is that an error?

A

No, set them aside to be analysed separately.

19
Q

Why use the median over the mean?

A

It is more robust to outliers.
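
A small R sketch of this robustness, using made-up supermarket spending values (the numbers are hypothetical and only for illustration):

```r
# Hypothetical spending data in dollars (made up for illustration).
spend <- c(20, 25, 30, 35, 40)
spend_outlier <- c(spend, 1000)   # add one extreme shopper

mean(spend)            # 30
median(spend)          # 30
mean(spend_outlier)    # about 191.67: pulled up strongly by the outlier
median(spend_outlier)  # 32.5: barely moves
```

One extreme value shifts the mean by more than 160 dollars but the median by only 2.5.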

20
Q

What are the cons of using the median?

A

The median requires more computing power than the mean, because the data must be sorted first.

The mean is easier to compute: no sorting is needed.

21
Q

If a question asks whether data are left or right skewed, do we remove outliers first?

A

Yes

22
Q

If a distribution is skewed or has some extreme values, which measure of center should be used?

A

The median

23
Q

If a distribution is left skewed, where is the median in relation to the mean?

A

The mean is smaller than the median.

The median is typically closer to the peak.

24
Q

What is variance?

A

the average squared deviation from the sample mean.
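
As a rough illustration in R, the sample variance can be computed by hand and compared with the built-in `var()`. The data are made up, and the sketch assumes the usual sample version that divides by n - 1, which is what `var()` does:

```r
# Made-up data for illustration.
y <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(y)
y_bar <- mean(y)                      # 5

# Sum of squared deviations from the sample mean, divided by n - 1.
s2 <- sum((y - y_bar)^2) / (n - 1)    # 32 / 7, about 4.571

all.equal(s2, var(y))                 # TRUE
sqrt(s2)                              # standard deviation, same as sd(y)
```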

25
Q

What is the formula for variance?

A

s^2 = SUM( (yi - y bar)^2 ) / (n - 1), summed over all n observations

26
Q

Why do we square the deviations instead of taking absolute values for the variance?

A

Squaring needs less computational power and gets rid of negative values.

27
Q

What is the interquartile range?

A

The range from Q1 to Q3, i.e. IQR = Q3 - Q1

28
Q

How far do the whiskers of a box plot extend?

A

Up to 1.5 x IQR away from the lower and upper quartiles

29
Q

What is Tukey's rule?

A

Outliers are values more than 1.5 times the IQR from the quartiles: either below Q1 - 1.5 x IQR, or above Q3 + 1.5 x IQR.
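
The rule can be sketched in R as follows. The data are made up, and the quartiles come from `quantile()` with its default method, which may differ slightly from the hinges that `boxplot()` uses:

```r
# Made-up data with one extreme value.
x <- c(12, 15, 18, 20, 22, 25, 28, 30, 95)

q1 <- unname(quantile(x, 0.25))   # 18
q3 <- unname(quantile(x, 0.75))   # 28
iqr <- q3 - q1                    # 10, same as IQR(x)

lower <- q1 - 1.5 * iqr           # 3
upper <- q3 + 1.5 * iqr           # 43

outliers <- x[x < lower | x > upper]
outliers                          # 95
```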

30
Q

Where are outliers?

A

More than 1.5 times the IQR from the quartiles: either below Q1 - 1.5 x IQR, or above Q3 + 1.5 x IQR.

31
Q

What are the robust statistics for center and spread?

A

Median (center) and IQR (spread)

32
Q

What can be done with extremely skewed data?

A

Apply a natural log transformation.

33
Q

The horizontal axis of a histogram is ____

A

binned: a continuous numerical scale divided into discrete intervals (bins)

34
Q

What is denoted by omega?

A

The sample space

35
Q

What does a probability model describe?

A

the uncertainty of a random process.

36
Q

What is an outcome?

A

Outcomes are the mutually exclusive and collectively exhaustive results of a random process.

37
Q

What is an event?

A

A collection of one or more outcomes. It is a subset of the sample space.

38
Q

What is a probability distribution?

A

It lists all possible outcomes and the probabilities with which each of them occurs.

39
Q

What is a cumulative probability distribution?

A

The probability that a variable is less than or equal to a particular value.

e.g. P(X <= 2)

40
Q

What are disjoint outcomes?

A

Outcomes that cannot happen at the same time

41
Q

What does it mean for 2 events A and B to be independent?

A

The occurrence of B provides no information about A, i.e. P(A|B) = P(A).

42
Q

What is P(AnB)?

A

P (B) × P (A|B)

43
Q

In 2013, SurveyUSA interviewed a random sample of 500 residents of North Carolina, asking them whether they think widespread gun ownership protects law-abiding citizens from crime or makes society more dangerous.
58% of all respondents said it protects citizens. 67% of White respondents, 28% of Black respondents, and 64% of Hispanic respondents shared the same view.
Based on the probabilities above, opinion on gun ownership and race/ethnicity are most likely
complementary
disjoint
independent
dependent

A

Dependent (need to calculate using the given that…)

44
Q

How is the joint probability of X and Y expressed?

A

P (X = x, Y = y)

e.g. P (Rain, Long commute) = P (X = 0, Y = 0) = 0.15

45
Q

What is a random variable?

A

A numeric quantity whose value depends on the outcome of a random process.

Lowercase letters denote the values the variable takes.

46
Q

What is the difference between a discrete random variable and a continuous random variable?

A

Discrete: takes countable values (e.g. integers)

Continuous: takes any real value in an interval

47
Q

What is covariance?

A

extent to which 2 variables move in the same direction

48
Q

What is correlation?

A

covariance between two variables divided by the product of their standard deviations.
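
A quick R check of this definition on made-up data, comparing the manual ratio with the built-in `cor()`:

```r
# Made-up paired data for illustration.
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Covariance divided by the product of the standard deviations.
r_manual <- cov(x, y) / (sd(x) * sd(y))

all.equal(r_manual, cor(x, y))   # TRUE
```

Dividing by the standard deviations removes the units, so the result always lies between -1 and 1.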

49
Q

What is the Bernoulli distribution?

A

  • for discrete variables

  • binary, with only 2 possible outcomes (0 or 1)
50
Q

How is a Bernoulli distribution expressed?

A

X ∼ Bernoulli(p)
p is the probability that the value is 1

51
Q

How is a normal distribution expressed?

A

X ∼ N(µ, σ^2)

52
Q

What is error?

A

= true value of population parameter - point estimate

53
Q

What is bias?

A

the systematic tendency to over or under-estimate the true population parameter.

54
Q

What is sampling variability?

A

how much an estimate will tend to vary from one sample to the next.

55
Q

Sample average is…

A

an estimator of the population mean

56
Q

What does Y bar stand for?

A

The sample mean; Y bar is a random variable.

57
Q

What is population parameter?

A

fixed feature of a particular population

  • usually unknown in real life
58
Q

What is sample statistic?

A

A quantity that varies from one sample to another

  • easy to compute, as it is a statistic of a sample obtained by simple random sampling
59
Q

What kind of distribution is it when parameters and exact distributions are not known?

A

Asymptotic distribution (use an approximation based on the sample)

“Tending to a distribution”

60
Q

What do we rely on when using an asymptotic distribution?

A

Law of large numbers

central limit theorem

61
Q

What is the law of large numbers?

A

The sample mean approaches the population mean as the sample size increases.
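
A minimal simulation sketch in R, assuming a fair coin coded as 0/1 (the seed and sample size are arbitrary choices):

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# 10,000 simulated tosses of a fair coin (1 = heads); population mean is 0.5.
tosses <- rbinom(10000, size = 1, prob = 0.5)

# Sample mean after each additional toss.
running_mean <- cumsum(tosses) / seq_along(tosses)

# The running mean settles near 0.5 as the sample size grows.
running_mean[c(10, 100, 1000, 10000)]
```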

62
Q

What is the central limit theorem used for?

A

Using the sample mean and sample variance to approximate the distribution of the sample mean

63
Q

What does the central limit theorem state if the population variance sigma^2 is known?

A

When n is large, the sampling distribution of Y bar is approximately normal, regardless of the distribution of the underlying population.

The sample mean is approximately normally distributed with mean mu and variance (sigma^2)/n,

where n is the size of the random sample.

64
Q

If the population variance is unknown, what distribution does the sample mean follow?

A

Student's t distribution with n-1 degrees of freedom

Its tails are heavier than those of the normal distribution

The estimated variance is s^2/n

65
Q

If you want to conduct a hypothesis test on whether a coin is fair, what is the variance?

A

sigma^2 = p(1-p) = 0.25 (assuming the coin is fair, p = 0.5)

By the CLT, the sample proportion p hat is approximately normally distributed with:
var(p hat) = sigma^2 / n = 0.0025 (for n = 100 tosses)

Use a two-tailed test
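
The arithmetic can be sketched in R. Here n = 100 is what the numbers above imply (0.25 / 0.0025), and the observed proportion of 0.60 is a hypothetical value chosen only for illustration:

```r
p0 <- 0.5                  # null hypothesis: the coin is fair
n  <- 100                  # number of tosses implied by 0.25 / 0.0025

sigma2    <- p0 * (1 - p0) # 0.25
var_p_hat <- sigma2 / n    # 0.0025
se        <- sqrt(var_p_hat)  # 0.05

p_hat <- 0.60              # hypothetical observed proportion of heads
z <- (p_hat - p0) / se     # about 2

# Two-tailed p-value from the normal approximation (CLT).
p_value <- 2 * pnorm(-abs(z))
p_value                    # about 0.0455
```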

66
Q

What is a confidence interval?

A

A plausible range of values for the population parameter.

67
Q

What is 95% confidence interval?

A

point estimate +/- 1.96 x standard error

Suppose we take many samples and build a confidence interval from each sample, then about 95% of these intervals would contain the true population parameter
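
A hedged R sketch of a 95% interval for a mean, using simulated data (the seed, sample size, and population values are arbitrary assumptions):

```r
set.seed(42)                         # arbitrary seed
y <- rnorm(100, mean = 10, sd = 2)   # simulated sample, n = 100

y_bar <- mean(y)
se    <- sd(y) / sqrt(length(y))     # standard error of the sample mean

# Point estimate plus or minus 1.96 standard errors.
ci <- y_bar + c(-1, 1) * 1.96 * se
ci
```

With a small sample, the 1.96 would usually be replaced by a t quantile such as `qt(0.975, df = n - 1)`.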

68
Q

What is the standard error?

A

The standard deviation of an estimator's sampling distribution (e.g. sigma/sqrt(n) for the sample mean)

69
Q

What is margin of error?

A

Half the width of the CI (e.g. 1.96 x standard error for a 95% CI)

70
Q

Linear Regression is ____.

supervised
unsupervised

A

supervised learning

71
Q

What is a characteristic of the y variable in linear regression?

A

It is a continuous dependent variable.

72
Q

Can linear regression be used to predict discrete outcomes?

A

Yes (e.g. credit card default)

73
Q

What does a hat denote?

A

An estimate or a predicted value

74
Q

What is the typical equation of a linear regression model?

A

Y = β0 + β1X + ϵ

75
Q

What does ϵ represent in the linear regression model?

A

The error term (estimated by the residuals)

The difference between the regression line and the actual observed data

76
Q

What is the equation for a residual?

A

ei = yi − ŷi = yi − (b0 hat + b1 hat × xi)

= vertical distance between each point and the fitted line

77
Q

What is the residual sum of squares?

A

SUM( residual^2 ) over all observations

Minimising it is called least squares

It is the variation in Y that is left unexplained after fitting the regression model.

78
Q

What is the model supposed to minimise in linear regression? How?

A

RSS

  1. Write the RSS (sum of all squared residuals) as a function of b0, b1, etc.
  2. Take the derivatives wrt b0 and b1 and set them to zero
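
Setting those derivatives to zero gives the standard closed-form solution for simple regression. A small R sketch on made-up data, checked against `lm()`:

```r
# Made-up data for illustration.
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Closed-form least squares estimates from the first-order conditions.
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

fit <- lm(y ~ x)
all.equal(unname(coef(fit)), c(b0, b1))   # TRUE

# The fitted line passes through (x bar, y bar).
all.equal(b0 + b1 * mean(x), mean(y))     # TRUE
```
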
79
Q

The regression line always passes through which point?

A

(x bar, y bar)

b0 = y bar - b1(x bar)
Substitute x = x bar into the equation y = b0 + b1 x:

y = y bar - b1(x bar) + b1(x bar) = y bar

The b1(x bar) terms cancel off.

80
Q

What does the best fit line do?

A

Minimises the sum of squared deviations from the proposed line (least squares fit for the regression line)

81
Q

How to interpret the y-intercept?

A

If x is 0, then ON AVERAGE, y equals the intercept

82
Q

How to interpret the slope of a regression plot?

A

The average change in Y when X increases by one unit

83
Q

What is the residual standard error?

A

estimate of the standard deviation of the residual terms

measures the lack of fit of a model to the data

84
Q

How many degrees of freedom are there for RSE?

A

n-2 (the RSS is scaled down by n-2 instead of n, because b0 and b1 are estimated)

85
Q

What is TSS?

A

Total variation in Y

= part that can be explained by the model (TSS - RSS) + part that cannot be explained (RSS)

86
Q

What is R^2?

A

measures the goodness of fit

The proportion of the variance in Y that can be explained by the model (the larger the R^2, the better the fit)

Formula:
(TSS- RSS)/ TSS
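
A short R sketch, on made-up data, that computes RSS, TSS, R^2 and the residual standard error by hand and compares them with what `summary()` reports:

```r
# Made-up data for illustration.
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
fit <- lm(y ~ x)

rss <- sum(resid(fit)^2)          # unexplained variation
tss <- sum((y - mean(y))^2)       # total variation in y
r2  <- (tss - rss) / tss

all.equal(r2, summary(fit)$r.squared)   # TRUE

# Residual standard error uses n - 2 degrees of freedom.
rse <- sqrt(rss / (length(y) - 2))
all.equal(rse, summary(fit)$sigma)      # TRUE
```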

87
Q

What is the purpose of hypothesis testing for linear regression?

A

To assess how close the estimates b0 hat and b1 hat are to the true values of b0 and b1

88
Q

How to find the standard error of an estimator?

A

Repeat the sampling and see what values you get for b0 and b1

89
Q

How do we conduct hypothesis testing for b0 and b1?

A

t-test with n-2 degrees of freedom, where n is the sample size (because two parameters, b0 and b1, are estimated)

t = (b1 hat - 0) / se(b1 hat)
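
The t statistic can be rebuilt in R from the coefficient table. The data below are made up, and `"x"` is simply the predictor's name in this toy model:

```r
# Made-up data for illustration.
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(2.3, 3.8, 6.1, 8.2, 9.7, 12.4, 13.8, 16.1)
fit <- lm(y ~ x)

coefs <- summary(fit)$coefficients

# t = (estimate - 0) / standard error, for the slope.
t_manual <- (coefs["x", "Estimate"] - 0) / coefs["x", "Std. Error"]
all.equal(t_manual, coefs["x", "t value"])   # TRUE

# Two-sided 5% critical value with n - 2 degrees of freedom.
t_crit <- qt(0.975, df = length(y) - 2)
abs(t_manual) > t_crit
```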

90
Q

What are the assumptions for the least squares line?

A
  1. The relationship between X and Y should be linear
  2. Residuals are nearly normal
  3. Residuals have constant variability (homoscedasticity)
91
Q

What graph should we use to check whether the relationship between X and Y is linear?

A

Residuals vs Fitted plot

The red line should be roughly horizontal

92
Q

How to check whether the residuals are nearly normal?

A

Normal Q-Q plot

points should be roughly along straight diagonal line

93
Q

What is the formula for a standardised residual?

A

(ei -e hat )/ SE(e)

94
Q

How to check for constant variability?

A

Scale-Location plot (you want to have no pattern in the residuals)

red line is roughly horizontal

95
Q

How to check for influential values?

A

Residuals vs Leverage plot

Check for outlying values at the upper right or lower right

If they fall outside Cook's distance, they are influential (consider removing these points)

96
Q

How to improve the model?

A

Transforming variables (scaling)

Seeking additional variables to explain Y

Using more advanced methods

97
Q

How to read data in R?

A

read.csv("file", header = TRUE)

98
Q

How to create a linear model in R?

A

lm1 = lm(y_var ~ x_var, data = Advertising)

99
Q

How to show the coefficients?

A

summary(lm1)$coefficients

100
Q

When do we reject the null hypothesis that b1 = 0 at the 5% significance level?

A

When |t| for b1 is greater than 1.96 (large n)

We conclude that there is a relationship between the variables

101
Q

How to obtain confidence interval for b0 and b1 in R?

A

confint(lm1)

By default the level is 95%

102
Q

How to find confidence interval of 90% in R for b0 and b1?

A

confint(lm1, level = 0.90)

103
Q

How to refer to a column in a dataset?

A

data$column_name

104
Q

For a normal distribution, how much of the data lies within the following distances of the mean?

1sd

2sd

3sd

A

1sd: 68%

2sd: 95%

3sd: 99.7%
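
These percentages can be reproduced in R from the standard normal CDF (`prob_within` is just a helper name chosen for this sketch):

```r
# Probability that a normal variable falls within k standard deviations of its mean.
prob_within <- function(k) pnorm(k) - pnorm(-k)

round(prob_within(1:3), 4)   # 0.6827 0.9545 0.9973
```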