STATA Flashcards

Learn definitions and formulas!

1
Q

Systematic measurement error

A

“Mismeasurement bias”: it produces a systematic mismeasurement of the intended characteristics.

2
Q

Random Measurement Error

A

It introduces chaotic distortion in the measurement process

3
Q

Reliable Measure

A

A measure free of random measurement errors. A measure that is a consistent measure of the concept.

4
Q

Valid Measure

A

A measure that records the TRUE VALUE of the intended concept. It does not measure any UNINTENDED characteristics. It is free of any SYSTEMATIC measurement error.

5
Q

Over-time consistency

A

Used to assess RELIABILITY. * TEST-RETEST METHOD: the same test is repeated, expecting the same results. * ALTERNATIVE-FORM METHOD: the test is administered again in a roughly equivalent form.

6
Q

Internal Consistency

A

Used to assess reliability. * SPLIT-HALF METHOD: half of the questions are administered to one group, the other half to another. * CRONBACH’S ALPHA: a statistical measure of internal consistency.

7
Q

Assessing Validity

A

*FACE VALIDITY: informed judgement is used to determine whether a measurement strategy is measuring what it should. *CONSTRUCT VALIDITY: assessment of whether the measured concept is associated with other concepts as we would expect it to be.

8
Q

The question to address for “Validity”

A

Are we aiming at the correct target?

9
Q

The question to address for Reliability

A

How close have we got to the target?

10
Q

Variable

A

The result of the measurement process: the empirical measurement of a concept. Each question in a survey gives rise to a variable. A variable has a name and at least two values. Nominal, Ordinal, Interval Level Variables. (Dummies)

11
Q

What are Nominal and Ordinal variables also referred to as?

A

Categorical variables

12
Q

Nominal Variables

A

They take on values that are not numbers and cannot be ranked.

13
Q

Ordinal Variables

A

They take on values that are not numbers, but there is a criterion allowing us to RANK them. Ordinal variables communicate the RELATIVE AMOUNT of the characteristic being measured.

14
Q

Interval Level Variables

A

They take on numerical values, providing the most precise measurement of the amount of an observed characteristic.

15
Q

Decision Tree for Variable Types

A

Are the values numerical? Yes —> Interval level variable. No —> Can we rank the values? Yes —> Ordinal. No —> Nominal.
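The decision tree above can be sketched as a tiny function (illustrative Python rather than Stata; the function and argument names are hypothetical):

```python
def variable_type(values_are_numerical, values_can_be_ranked):
    """Classify a variable following the decision tree above."""
    if values_are_numerical:
        return "interval"  # numerical values -> interval level variable
    # non-numerical values: rankable -> ordinal, otherwise nominal
    return "ordinal" if values_can_be_ranked else "nominal"
```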

16
Q

Data file

A

.dta, this is the dataset with all its information. No analysis or results are recorded.

17
Q

Do file

A

.do, analysis. It records commands only. It is a file that runs all the commands that you do for your analysis.

18
Q

Log file

A

.smcl, RESULTS. It records commands and results of your analysis.

19
Q

describe varname

A

Number of observations, number of variables, size, and date of creation. For each variable it provides a description in terms of storage type, format and variable label.

20
Q

codebook varname

A

Information on each variable: type, RANGE, number of MISSINGS, value LABELS, and counts for each value

21
Q

Creating value labels: eg. Likert

A

label define likert 1 "Strongly Agree" 2 "Agree" … … 5 "Strongly Disagree" ******* label values var_1 var_2 var_3 likert

22
Q

First step of creating value labels

A

label define label_name numeric_code1 "label1" numeric_code2 "label2"

23
Q

Second step of creating value labels

A

label values var_1 var_2 .. label_name

24
Q

Missing values coding, e.g. gov_int, 6

A

mvdecode gov_int, mv(6=.)

25
Q

Variable collapsing (recode), eg. gov_inc

A

recode gov_inc (1/2=1 "agree") (3/5=0 "disagree") (.=.), gen(gov_inc_agree)

26
Q

Automatically creating indicator variables from a categorical variable

A

tab var_name, gen(var_name_dum)

27
Q

Collapsing interval level variables (upper bounds)

A

generate var_new=recode(var_old, max1, max2, max3)

28
Q

Mathematical transformations

A

generate var_new=function(var1, var2, … )

29
Q

Additive index

A

generate var_sum=var1+var2+var3

30
Q

Log of a variable

A

generate var_log=ln(old_var)

31
Q

tab {frequency table}

A

It displays: the observed distinct values of the variable; the raw frequencies/counts, i.e. the number of units falling within each category; percentages/relative frequencies; cumulative percentages

32
Q

Measures of Central Tendency

A

Mode, Median, Mean

33
Q

Mode

A

Applicable to ALL variables with a manageable number of values. It is the only measure that can be used for a categorical nominal variable. (The single value with the highest frequency.)

34
Q

Median

A

Interval level (numerical), categorical ordinal. The 50% point of the distribution.

35
Q

Mean

A

Only numerical values.

36
Q

mean computation

A

mean var_name ***** sum var_name

37
Q

Mean vs. Median

A

The median is more robust than the mean against outliers.

38
Q

Measures of dispersion or variability

A

Interquartile range, range, variance and standard deviation.

39
Q

An assessment of the dispersion or variability of a variable gives an answer to questions such as…

A

How are the frequencies distributed across the values of the variable? Are the units polarized into a few categories? Is there heterogeneity in the variable distribution?

40
Q

Measuring dispersion for a categorical ordinal variable

A

Interquartile range, UQ-LQ=IR. It measures the spread of the 50% central part of the distribution. The higher the IR, the higher the dispersion of the variable.
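The computation UQ−LQ can be sketched as follows (illustrative Python, not Stata; it uses one simple quartile convention among several, so boundary values may differ slightly from Stata’s percentiles):

```python
def interquartile_range(values):
    """IQR = upper quartile minus lower quartile: the spread of the
    central 50% of the distribution."""
    xs = sorted(values)
    n = len(xs)
    lq = xs[n // 4]        # rough lower-quartile position (illustrative convention)
    uq = xs[(3 * n) // 4]  # rough upper-quartile position
    return uq - lq
```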

41
Q

Measuring dispersion for interval level variables

A

The range (maximum minus minimum observed value). For the range: sum var_name.

42
Q

Deviation

A

The difference between each variable value and the mean. (Positive=above the mean, Negative=below). The deviation can be seen as a measure of distance between the value and the mean.

43
Q

Variance

A

The average of the squared deviations. The variance measures the dispersion of a variable as its spread around the mean. It is always non-negative; if all values are equal, variance=0. The higher the variance, the higher the dispersion of the variable around its mean. The variance is not expressed in the same unit of measurement as the variable, but in its square.

44
Q

Standard Deviation

A

The standard deviation is the square root of the variance and is therefore expressed in the same unit of measurement as the variable ** sum var_name, d *** delivers the std, variance, and the percentiles needed for the IR.
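The definitions on these two cards can be written out directly (illustrative Python, not Stata; population versions, dividing by n):

```python
import math

def variance(values):
    """Average of the squared deviations from the mean."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values) / len(values)

def std_dev(values):
    """Square root of the variance, back in the variable's own unit."""
    return math.sqrt(variance(values))
```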

45
Q

Bar chart

A

histogram var_name, d percent ******* A bar chart is a graphical representation of a frequency distribution.

46
Q

Histogram

A

hist var_name, percent ** It describes interval level variables taking several distinct values. The range of the variable is divided into a given number of intervals (bins) of the same width. On each bin, a bar is drawn having as height the percentage of units falling in the interval. hist var_name, percent bin(25)

47
Q

Boxplot

A

graph box var_name **** min, max, LQ, UQ, median, whiskers (1.5 times the IR); any value above the upper whisker or below the lower whisker is flagged as an outlier.

48
Q

Shape

A

We can assess the shape of a variable by plotting a bar chart or histogram. Tail on the right: positive skew, mean>median. Tail on the left: negative skew, mean<median.

49
Q

Goals of a quantitative research

A

1 - Measuring and describing concepts *** 2 - Suggesting and assessing explanations for (political) concepts

50
Q

Dependent variable

A

The variable that measures the concept we want to explain

51
Q

Independent/explanatory variable

A

The variable that measures the concept identified as possible determinant of the observed differences in the dependent variable.

52
Q

An hypothesis should be formulated so as to…

A

Tell us that when we compare units of analysis having different values of the independent variable we will observe differences in the dependent variable, and specify the tendency of the relationship.

53
Q

Template for formulating an hypothesis

A

In a comparison (units of analysis), those having (one or more values of the independent variable) will be more likely to have (one value of the dependent variable) than those having (a different value of the independent variable)

54
Q

Running comparisons when both the dependent and independent variables are CATEGORICAL

A

Cross-tabulations ******* tab dependent_var independent_var, col

55
Q

Running comparisons when the dependent variable is interval level and the independent variable is categorical

A

Mean comparisons ******* tab var_independent, sum(var_dependent)

56
Q

Running comparisons when both dependent and independent variables are interval level

A

Scatterplot *** scatter var_dependent var_independent

57
Q

Cross-tabulation

A

tab dependent_var independent_var, col **** A Cross-tabulation is a table delivering the frequency distributions of the dependent variable within the groups determined by the levels of the independent variable (compare PERCENTAGES, not counts)

58
Q

Graphs of dependent variable frequency distributions

A

hist dependent_variable, d percent by(independent_var)

59
Q

Graphs of dependent variable frequency distributions (worked out within each group determined by the levels of the independent variable)

A

hist dependent_var, d percent by(independent_var)

60
Q

Check recoding of a dummy variable

A

tab var var_dummy

61
Q

Mean comparison

A

tab var_independent, sum(var_dependent)

62
Q

Graphical representation of group means

A

graph bar (mean) var_dependent, over(var_independent)

63
Q

Boxplot by groups

A

graph box var_dependent, over(var_independent)

64
Q

Controlled Comparisons

A

Controlled Comparisons make it possible to establish whether the association between the dependent and the independent variable is spurious or additive, or whether there is an interaction between the independent and control variables. If the control and independent variables are both categorical, we divide the sample units into groups according to the values of the control variable, and then within each group we compare the behavior of the dependent variable across the groups identified by the independent variable.

65
Q

Spurious Relationship

A

Once we control for a third variable in assessing the relationship between a dependent variable and an independent variable, the relationship at first detected becomes weak or disappears altogether. In a spurious relationship, the detected empirical relationship is completely coincidental.

66
Q

Additive relationship

A

Once we control for a third variable in assessing the relationship between a dependent variable and an independent variable, the relation at first detected remains unchanged. The independent variable contributes to the explanation of the dependent variable, and the control variable helps as well.

67
Q

Interaction (Relationship)

A

Once we control for a third variable in assessing the relationship between a dependent variable and an independent variable, the relation at first detected is not the same for all values of the control variable: it has a different strength or direction depending on the control variable values. There is an interaction between the independent variable and the control variable.

68
Q

Controlled cross tabulations

A

bysort var_control: tab var_dependent var_independent, col

69
Q

Detecting a spurious association

A

The relationship between the dependent and the independent variable weakens or disappears for at least one level of the control variable

70
Q

Detecting an interaction (relationship)

A

The direction of the relationship between the dependent and the independent variable varies across values of the control variable

71
Q

Detecting an additive relationship

A

The strength of the relationship between the dependent and the independent variable is the same or similar for all values of the control variable

72
Q

Controlled mean comparisons

A

bysort var_control: tab var_independent, sum(var_dependent)

73
Q

Charts for Controlled Comparisons

A
  • graph bar (mean) var_dependent, over(var_independent) over(var_control) ******* - graph box var_dependent, over(var_independent) over(var_control)
74
Q

Population

A

The set of all units object of our investigation. The set of all statistical units on which we might measure concepts or characteristics of interest.

75
Q

Sample

A

A subset of statistical units drawn from the population. The sample should be representative of the population, without bias.

76
Q

Inferential Statistics

A

Inferential Statistics is a collection of techniques that make it possible to draw conclusions concerning the whole population from the analysis of a portion of it, that is, a sample drawn from the population.

77
Q

Random Sampling

A

Random sampling is a sampling scheme that prevents biases, ensuring the extraction of representative samples. In random sampling, the units are drawn at random from the population one at a time in such a way that: units have the same probability of being drawn / samples of a given size n all have the same probability of being drawn. Random sampling is also the building block for more complex sampling schemes (stratified sampling, cluster sampling)

78
Q

Random Variable

A

The random variable is a probabilistic tool that models the outcome of a random sampling experiment. It is called “variable” because, before drawing the unit, we know that different values can be observed; it is called “random” because we do not know the value we are going to observe.

79
Q

Probabilistic behaviour of a discrete random variable

A

The probabilistic behaviour of a discrete random variable is described by means of a probability function that associates to each of the possible values taken by the random variable the probability of observing it.

80
Q

Probabilistic behaviour of a continuous random variable

A

The probabilistic behaviour of a continuous random variable is described by means of a DENSITY FUNCTION. A density function is *** always non-negative, **** the area under its graph is equal to 1, ** the probability that the random variable takes value in an interval is equal to the area of the region delimited by the graph of the function and the interval.

81
Q

Bernoulli random variables

A

Discrete random variables taking only 2 values. A Bernoulli random variable models the occurrence or not of a given outcome. It takes value 1 when the outcome occurs, 0 when it doesn’t. The probability of observing value 1 is denoted by (Pi) and entirely describes the behaviour of the random variable.

82
Q

How can the behaviour of a random variable be described?

A

It can be described by means of synthetic measures: {mean} {variance} {std}

83
Q

i.i.d. random variables

A

independent and identically distributed variables, they make up a random sample. (same distribution, same mean, same std)

84
Q

What are the three inferential problems?

A
  • Point Estimation Problem: approximation of an unknown parameter. *** - Interval Estimation Problem: an interval of values containing an unknown parameter with a predetermined level of confidence. ****** - Hypothesis Testing Problem: deciding whether or not to reject a given hypothesis on a parameter
85
Q

Sample Statistics

A

Functions of the sample that summarize sample information. Being functions of random variables, they are random as well. Inferences on the population mean are based on two statistics: the sample mean and the sample std.

86
Q

Properties of the sample mean (i.i.d. sample of size n)

A

1) the sample mean distribution has mean equal to the pop. mean (Mu). ** 2) the samp. mean distribution has std=pop. std ÷ square root of the sample size. ***** 3) for a large sample size n, the distribution of the sample mean can be approximated by means of the Normal/Gaussian distribution (central limit theorem)
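Property 2 above can be written out directly (illustrative Python, not Stata; the function name is hypothetical). Note that quadrupling the sample size halves the standard error:

```python
import math

def standard_error(pop_std, n):
    """Std of the sample mean: population std divided by the square
    root of the sample size."""
    return pop_std / math.sqrt(n)
```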

87
Q

Is the sample mean an unbiased estimator of the population mean? Why?

A

Yes. If we use the sample mean as an estimator of the unknown population mean, we might under(or)over-estimate it, but we do not make a systematic error of over or under approximation.

88
Q

Standard Deviation of the Sample Mean

A

“Standard Error of the Mean”. It can be seen as an overall measure of the error we make by estimating the unknown mean using the sample mean.

89
Q

Normal/Gaussian random variable

A

A Normal/Gaussian random variable has a density that is symmetric and bell-shaped. Due to symmetry, the mean and the median coincide. The density is specified by two parameters: the mean and the standard deviation of the distribution. For a given mean, a higher std gives a flatter, wider bell-shaped curve. (Common threshold for the Normal approximation via the Central Limit Theorem: n=50)

90
Q

Sample Variance

A

The sample variance is the average of the squared deviations of each X1,…,Xn from the sample mean X̄. Inferences on the population variance are based on the sample variance.

91
Q

Sample Standard Deviation

A

The sample standard deviation is given by the square root of the sample variance and is denoted by S.

92
Q

Estimation of the Standard Error of the Mean

A

The standard error of the mean, (sigma)/√n, depends on the unknown population standard deviation (sigma). We can estimate it through the sample standard deviation S. The standard error of the mean is then estimated by S/√n.

93
Q

Standard Error

A

The standard error measures the spread of the estimates around the parameter, and therefore provides a first evaluation of the accuracy of the estimation procedure. The lower the standard error, the higher the accuracy of the estimator.

94
Q

Confidence Interval Estimation

A

A confidence interval estimation for an unknown parameter provides us with an approximation of the parameter and, at the same time, with a probabilistic assessment of the approximation error that we make. **** A confidence interval of level 100(1-alpha)% for an unknown parameter is an interval of values we are 100(1-alpha)% sure the unknown parameter belongs to. ** Typical levels of confidence are 95%, 90% and 99%, corresponding to an alpha equal respectively to 0.05, 0.1 or 0.01.

95
Q

Confidence Intervals Interpretation

A

When giving an interpretation to a (1-alpha)% confidence interval, we claim that we are (1-alpha)% confident that the unknown parameter belongs to it. We are confident that in only alpha% of cases the population mean falls outside the interval.

96
Q

Confidence intervals for the population mean in Stata

A

mean var_name (you get 95%) ********* mean var_name, level(99)

97
Q

Margin of Error

A

The margin of error is given by the std error of the estimator multiplied by a positive constant that depends on the probability distribution of the estimator and on the level of confidence. The margin of error is a building block of the confidence interval: the lower and upper bounds of a confidence interval for any parameter are obtained by respectively subtracting and adding the margin of error to the parameter estimate. parameter estimate (±) margin of error.

98
Q

Confidence intervals derivation

A

A confidence interval for any parameter is centered at the estimate and the lower and upper bounds are obtained by respectively subtracting and adding the margin of error to the parameter estimate. ** parameter estimate(+-) margin of error.

99
Q

Margin of error

A

standard error of the mean × z_(alpha/2)
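Putting the margin-of-error formula together with the confidence-interval derivation above (illustrative Python, not Stata; assumes a large sample so the standard-Normal quantile z applies, with z=1.96 for a 95% level):

```python
import math

def confidence_interval(sample_mean, sample_std, n, z=1.96):
    """Estimate +/- margin of error, where the margin of error is
    z_(alpha/2) times the estimated standard error S / sqrt(n)."""
    margin_of_error = z * sample_std / math.sqrt(n)
    return (sample_mean - margin_of_error, sample_mean + margin_of_error)
```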

100
Q

Standardization of a Variable

A

Standardization is a variable transformation consisting in subtracting from the variable its mean and dividing by the std deviation.

101
Q

Length of the Confidence Interval

A

The length of the confidence interval measures the accuracy of the estimation. The longer the interval, the lower the accuracy of the estimation.

102
Q

Mean of a Bernoulli random variable

A

The arithmetic average of numbers equal to 0 or 1 is equal to the percentage of 1s. If we consider a Bernoulli random variable, the mean of its distribution reduces to the probability (Pi) of observing the outcome of interest. The estimation of the population percentage of units on which we observe a given outcome reduces to the estimation of the unknown population mean of a random variable taking value 1 when the outcome is observed and 0 in all other cases.
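The first sentence of this card can be checked directly (illustrative Python):

```python
def proportion(outcomes):
    """Mean of 0/1 values equals the share of 1s: the Bernoulli
    parameter (Pi) is just the population proportion."""
    return sum(outcomes) / len(outcomes)
```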

103
Q

Hypothesis

A

A statement on a population parameter that specifies a value or an entire range of values. An hypothesis is said to be SIMPLE if it specifies ONE single value of the parameter; COMPOSITE if it specifies an entire RANGE of values. ** Null hypothesis H0: it specifies a single value of the parameter. Alternative Hypothesis Ha: an alternative range of values.

104
Q

Hypothesis Testing Problem

A

An hypothesis testing problem consists in testing a null hypothesis against an alternative one. The two hypotheses are formulated so that only one of the two is true; we will never know which one. An hypothesis testing problem is a DECISION problem: we have to decide whether to reject the null hypothesis H0 or not to reject it, on the basis of the empirical evidence provided by the data. When H0 is rejected, we decide to act as if H0 were false and Ha true.

105
Q

Falsification Approach in hypothesis testing

A

The null hypothesis is presumed to be true unless the data provide strong evidence against it. The alternative hypothesis is the analyst’s research hypothesis. The burden of proof falls on the researcher who claims that the alternative hypothesis is true.

106
Q

Type I Error

A

Type I Error: rejecting H0 when it is true. Courtroom analogy: convicting an innocent person; Type I error is regarded as the worse one. **** Type II Error: failing to reject H0 when it is false.

107
Q

Significance level

A

The probability of a Type I error is called the significance level of the hypothesis test and is denoted by alpha. The significance level is fixed by the analyst: usual choices are 0.05, 0.01, 0.001.

108
Q

Significance Test of level alpha

A

A significance test of level alpha is a testing procedure such that the probability of making a Type I error is alpha and, among all tests having the same level alpha, it minimizes the probability of making a Type II error.

109
Q

What questions do we ask during significance tests?

A

During significance tests we ask questions such as: are the estimated values plausible under the null hypothesis?

110
Q

One-sided test

A

Test the null hypothesis H0:(mu)=(mu)0 against the alternative Ha:(mu)>(mu)0 (or against the alternative Ha:(mu)<(mu)0).

111
Q

Test Statistics

A

The decision to reject or not to reject the null hypothesis is based on a function of the sample referred to as the test statistic. A test statistic compares the estimate of the parameter against the value specified by the null hypothesis: it measures the DISTANCE between the estimate and that value in terms of the NUMBER OF STANDARD ERRORS the estimate falls above or below the value specified by the null hypothesis. For tests on the mean, the test statistic is (X̄ − (mu)0)/se. It results from standardizing the sample mean under H0 and replacing the std error by its estimator. Under H0, that is assuming H0 is true, the test statistic has a standardized Normal distribution.
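The test statistic for the mean can be written out as follows (illustrative Python, not Stata; names are hypothetical):

```python
import math

def t_statistic(sample_mean, mu0, sample_std, n):
    """(sample mean - mu0) / estimated standard error: the number of
    standard errors the estimate falls above or below the value
    specified by the null hypothesis."""
    return (sample_mean - mu0) / (sample_std / math.sqrt(n))
```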

112
Q

P value

A

The P value is a probability statement on whose basis we decide whether or not to reject the null hypothesis in an hypothesis testing problem. By definition, the p-value is the probability, under the null hypothesis, of observing a value of the test statistic MORE EXTREME than the observed value, in the direction specified by the alternative hypothesis. Assuming that the unknown population mean is equal to (mu)0, the p value returns the probability of observing a value of the sample mean falling above (mu)0 by a higher number of std errors than the one observed. The p-value returns the probability of observing a value of the sample mean that provides an even stronger evidence against the null hypothesis than the observed one.

113
Q

Decision Rule

A

For a given significance level alpha, in testing the null hypothesis against the alternative one, we reject the null hypothesis IF P VALUE < ALPHA.

114
Q

Test on the mean

A

ttest var_name == test_value

115
Q

Two-sided test

A

Test the null hypothesis (mu)=(mu)0 against (mu)≠(mu)0. The decision is taken on the basis of the test statistic (X̄ − (mu)0)/se. In a two-sided test, the p-value is a two-tail probability. A two-sided test can also be solved by working out a CONFIDENCE INTERVAL: we work out a (1-alpha)% confidence interval for (mu) and reject the null hypothesis if the value (mu)0 specified by the null hypothesis is not an element of the confidence interval

116
Q

Group means

A

tab var_independent, sum(var_dependent). ** “The two means are different. At the level of the sample there is association between the two variables”

117
Q

Significant Association (“Is there a significant association between the two variables?”)

A

ttest var_dependent, by(var_independent). *** Look at the p-value of Ha: diff != 0; if it is smaller than alpha: “There is enough empirical evidence to conclude that the two means are significantly different. The association between the two variables is significant.” /confidence intervals of different levels: option level()/

118
Q

How to investigate population association between two categorical variables

A
  • Chi-square Test of Independence **** - Cramér’s V Measure of Association
119
Q

Statistical Independence

A

Two random variables are said to be independent if the (conditional) probability distributions of one of the two in the sub-populations determined by the other are all the SAME. Independence is assessed through the Chi2 Test of Independence.

120
Q

Chi2 Test of Independence

A

The Chi2 Test of Independence makes it possible to assess whether we can assume that at the population level two categorical variables are associated. In the Chi2 Test of Independence, the evidence against the null hypothesis (the two variables are independent) is summarized by a statistic that provides an overall measure of the distance between the observed situation and the situation that we would expect under the null hypothesis. *** tab var_dependent var_independent, chi2 *** Pearson chi2 = chi2 test statistic value. Pr = p-value of the test.

121
Q

Expected Counts

A

The expected counts are the counts we would observe under the null hypothesis of independence of the two variables. If the variables are independent, the frequency distributions of Y in the groups determined by X are all the same, equal to the frequency distribution of Y at the level of the sample. The expected counts are then obtained by applying to the size of each group the sample percentages of the response variable. ****** tab var_dependent var_independent, col expected

122
Q

Chi-square Test Statistics

A

The Chi2 Test Statistic value for a given sample is obtained through the following steps: 1) Work out the difference between the observed and expected counts; 2) Square each difference; 3) Divide each squared difference by the corresponding expected count; 4) Add up all the obtained numbers. The higher the difference between observed and expected counts, the higher the value of the Chi-square test statistic. ***** Under the null hypothesis, the chi2 test statistic has a known probability distribution referred to as the Chi2 Distribution. ** The higher the value of the chi2 test statistic, the higher the evidence against the null hypothesis
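The four steps can be written out directly (illustrative Python, not Stata; observed and expected are flattened lists of cell counts):

```python
def chi2_statistic(observed, expected):
    """Sum over cells of (observed - expected)^2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```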

123
Q

Cramer’s V

A

Cramér’s V Index of Association is obtained from the Chi-square statistic through a non-linear transformation. Cramér’s V takes values between 0 and 1. The higher it is, the stronger is the association between the two variables. **** tab var_dependent var_independent, V chi2 *** e.g. “Cramér’s V index is equal to 0.2729: there is a moderate association between the two variables.”
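The transformation can be sketched as follows (illustrative Python; this uses the standard formula V = sqrt(chi2 / (n × min(rows−1, cols−1))), which the card does not spell out, so treat it as an assumption of this sketch):

```python
import math

def cramers_v(chi2, n, n_rows, n_cols):
    """Cramer's V from the chi2 statistic, sample size n, and the
    table's row/column counts; always between 0 and 1."""
    return math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))
```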

124
Q

Scatterplot

A

Plotting a scatterplot provides a first description of the association between two NUMERICAL variables at the level of the sample. X:expl/indep, Y:dep. Each point on the graph identifies a pair of values, one for the exp var one for the dep var. ** scatter var_dependent var_independent ** scatter var_dependent var_independent, mlabel(Country?)

125
Q

Correlation (Linear Association)

A

POSITIVE LINEAR ASSOCIATION: for two numerical variables, high values of one variable tend to occur with high values of the other variable, and low values with low values. ***** NEGATIVE LINEAR ASSOCIATION: high values of one variable tend to occur with low values of the other, and vice versa.

126
Q

Pearson’s Correlation Index

A

Pearson’s Correlation Index measures the DIRECTION and the STRENGTH of the correlation between two NUMERICAL variables. It takes values between -1 and 1: a positive value indicates a positive correlation. The closer Pearson’s correlation is to -1 or 1, the stronger is the correlation. A Pearson’s correlation equal to 0 tells us that there is no correlation (no linear association). *** pwcorr var_dep var_indep “each cell provides the correlation coefficient value for the corresponding pair of variables”. The correlation matrix is symmetric. e.g. 0.6816: “Pearson’s correlation index between var_dep and var_indep is positive and closer to 1 than to 0. There is a moderately high positive linear correlation between the two variables.”
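Pearson's index can be computed from its definition (covariance term divided by the product of the deviation terms); this is a sketch with hypothetical data — in the deck, Stata's `pwcorr` does the work.

```python
import math

# Pearson's correlation index from scratch (hypothetical x and y).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 6.0]

n = len(x)
mx = sum(x) / n
my = sum(y) / n

# Sum of cross-products of deviations, and the two sums of squared deviations.
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
sy = math.sqrt(sum((yi - my) ** 2 for yi in y))

r = sxy / (sx * sy)   # always between -1 and 1
print(round(r, 4))
```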

127
Q

Linear Association

A

A Pearson’s correlation equal to 0 denotes the ABSENCE of linear association, i.e. no correlation between the two variables.

128
Q

Slopes approximation in correlation index

A

The closer Pearson’s correlation is to ±1, the closer the points in the scatterplot fall to a straight line. High positive correlation: the overall trend in the data can be approximated by a line with a positive slope.

129
Q

Significant Correlation

A

“Is the correlation between the two variables observed at the level of the sample also present at the level of the population?” Test on the significance of the correlation: pwcorr var_dep var_indep, sig. The p-value of the test is reported below each coefficient. e.g. 0.000: “There is enough empirical evidence to conclude that the correlation is significant”

130
Q

“Is the correlation significant?”

A

pwcorr dep_var indep_var, sig

131
Q

Aims of Regression Analysis

A

A regression model makes it possible to EXPLAIN the dependent variable by means of the covariate. It makes it possible to PREDICT new, not observed values of the dependent variable by means of the covariate. A multiple linear regression model makes it possible to assess the RELATIONSHIP between a numerical dependent variable, and one or more variables (independent variables, explanatory variables, covariates) that can be numerical or categorical through a model-based approach.

132
Q

Linear Regression Models

A

SIMPLE linear regression model: regression model with only ONE INDEPENDENT variable. **** MULTIPLE linear regression model: regression model with MORE than one independent variable.

133
Q

Correlation vs. Regression

A

CORRELATION: Measures the degree to which two variables are inter-related. The correlation coefficient is the same if you swap X with Y. It does not quantify the change in one variable associated with the change in another. ****** REGRESSION: The relationship is evaluated through a dependency model. The regression of X on Y is not the same as that of Y on X. The impact of one variable on another is quantified.

134
Q

Our aim of the regression analysis for the variables Internet_use and Fb_use

A

* Explaining Fb_use through Internet_use **** Predicting the Facebook penetration for countries not in the database for which we might know the level of internet penetration

135
Q

Measure of the goodness of the prediction provided by the sample mean (Prediction Evaluation)

A

The overall distance of the points from the horizontal line can be used as a measure of the goodness of the prediction provided by the sample mean. An overall measure of distance of the points from the horizontal line is given by the sum of the squared differences between the observed values of the dependent variable and the value predicted by the horizontal line, which is constant and equal to the sample mean.

136
Q

Regression Line

A

Among the several lines that we can draw in the scatterplot, the regression line is the one having the SMALLEST overall distance from the points, as measured in terms of the sum of squared vertical distances from each point to the line

137
Q

Simple Linear Regression Model

A

Consider a random sample of size n and let y_i and x_i denote the values taken respectively by the dependent and the independent variable on the i-th unit of the sample. A simple linear regression model has the following general mathematical expression: y_i = β0 + β1·x_i + ε_i. β0 and β1 are referred to as MODEL COEFFICIENTS. They are unknown parameters, object of inference. ε_i is a RANDOM variable. ** The model claims that the value of the dependent variable observed on the i-th unit is given by the sum of two components: - one depending on the independent variable through the function β0 + β1·x_i ******* - one not depending on the explanatory variable, being a random error term that summarizes the error that we can make in explaining and predicting the dependent variable only on the basis of the considered independent variable.

138
Q

Regression Model Assumptions

A

y_i = β0 + β1·x_i + ε_i * The model is studied and used under the following assumptions on the error terms. * The errors are assumed to: have mean 0 — have the same std deviation σ — be uncorrelated. *** A Normality assumption is needed for running inferential procedures when the sample size is small. ****** Interpretation of the assumptions: ε_i has mean 0, constant std deviation σ and a Normal distribution. ** y_i has mean equal to β0 + β1·x_i.

139
Q

Parameters in Regression Analysis

A

The parameters to make inferences on are: the intercept β0 ** the slope β1 *** the standard deviation σ, referred to as the “standard error of the model”

140
Q

Residual

A

The difference between the observed value of the dependent variable and the value predicted by the regression line (the fitted value).

141
Q

Least Squares Method

A

Intercept and slope are estimated by means of the Least Squares Method. The estimates of the coefficients are those values which minimize the sum of the squared residuals, as a measure of the overall distance between the regression line and the points.
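The least squares minimization has closed-form solutions for the simple model (slope = sum of cross-deviations over sum of squared x-deviations; intercept from the means); this is a sketch with hypothetical data — in practice Stata's `regress` reports the estimates.

```python
# Least squares estimates of intercept and slope (hypothetical data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 6.0]

n = len(x)
mx, my = sum(x) / n, sum(y) / n

sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
sxx = sum((xi - mx) ** 2 for xi in x)

b1 = sxy / sxx        # slope: average change in y per unit increase in x
b0 = my - b1 * mx     # intercept: the fitted line passes through (mx, my)
print(b0, b1)
```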

142
Q

İnterpretation of the slope (Regression)

A

β1, the slope, represents the average change in the response associated with a unit increase of the explanatory variable.

143
Q

Regression Formula

A

regress var_dependent var_independent

144
Q

Plot of the Regression Line

A

twoway (scatter var_dependent var_independent) (lfit var_dependent var_independent)

145
Q

Goodness of fit of the regression model

A

“How much better can we explain and predict the dependent variable knowing the independent variable rather than not knowing it?” - The goodness of fit of the regression model can be assessed through the R-SQUARED index. Total Sum of Squares: sum of the squared deviations of the dependent variable from its mean; it measures the total variability of the dependent variable.

146
Q

Total Sum of Squares

A

TSS = RSS + ESS. RSS: Regression Sum of Squares, the amount of total variability explained by the independent variable, the amount of total variability that the independent variable accounts for. ESS: Error (Residual) Sum of Squares, the amount of the total variability that is not accounted for by the independent variable. **** The higher the RSS with respect to the ESS, the higher is the amount of the total variability that has the explanatory variable as source / the higher is the goodness of fit of the model.

147
Q

R-Squared

A

Ratio of RSS over TSS. R-Squared is the percentage of the overall variability of the dependent variable that is accounted for by the independent variable. R-Squared ranges between 0 and 1; the higher it is, the higher is the goodness of fit of the model. * In the case of one single explanatory variable, the R-Squared is equal to the square of Pearson’s correlation index. **** R-Squared never decreases when a new covariate is added to the model. *** R-Squared depends on the sample size, and this can be a disadvantage when comparing models having a different number of covariates or fitted on samples having different sizes. For this, use the Adjusted R-Squared.
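The decomposition TSS = RSS + ESS and the ratio R² = RSS/TSS can be checked numerically; this sketch reuses the hypothetical data from the earlier examples.

```python
# R-squared as RSS / TSS (hypothetical data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 6.0]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
b0 = my - b1 * mx
yhat = [b0 + b1 * xi for xi in x]                     # fitted values

tss = sum((yi - my) ** 2 for yi in y)                 # total variability
rss = sum((yh - my) ** 2 for yh in yhat)              # explained (regression) SS
ess = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))  # residual SS

r_squared = rss / tss   # since TSS = RSS + ESS, this lies between 0 and 1
print(round(r_squared, 4))
```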

148
Q

Analysis of the Regression Table. (read)

A

Model: RSS (Regression Sum of Squares). ** Residual: ESS (Residual Sum of Squares). ** Total: TSS. ** e.g. R-Squared: 0.4646, “Internet_use explains 46.46% of the total variability of Fb_use.” ** Root MSE: Model Standard Error. ** Std. Err.: This column delivers the standard error for each coefficient estimate as a measure of the accuracy of the estimation procedure. ** [95% Conf. Int.]: e.g. “We are 95% confident that the unknown slope falls between 0.279 and 0.641”. 90% conf. int.: option “level(90)”.

149
Q

Significance Test on the Slope

A

“Is the average change in the dependent variable associated with a unit change in the explanatory variable significant?” “Does the explanatory variable provide a significant contribution to the explanation of the dependent variable?” “Is the simple regression model significant?” **** We reject the null hypothesis if p-value < alpha. If we reject the null hypothesis we can conclude that: - the explanatory variable provides a significant contribution to the explanation of the dependent variable —- the average change in the dependent variable associated with a unit change in the explanatory variable is significant —- the regression line is significant. ** P-value: P>|t| *** t: value of the test statistic. e.g. “The slope estimate falls 5.19 standard errors above 0, the value under the null hypothesis.”
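The t statistic in the regression table is the slope estimate divided by its standard error; this is a sketch of that computation with the same hypothetical data, assuming the usual error-variance estimate ESS/(n−2) for the simple model.

```python
import math

# t statistic for the test on the slope, H0: beta1 = 0 (hypothetical data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 6.0]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
b0 = my - b1 * mx

ess = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))  # residual SS
s2 = ess / (n - 2)              # estimate of the error variance sigma^2
se_b1 = math.sqrt(s2 / sxx)     # standard error of the slope estimate

t = b1 / se_b1   # how many standard errors the estimate falls from 0
print(round(t, 4))
```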

150
Q

T- column in Regression analysis

A

Value of the test statistic. e.g. “The slope estimate falls 5.19 standard errors above 0, the value under the null hypothesis.”

151
Q

Multiple Linear Regression Model

A

A multiple linear regression model studies the association between a numerical variable and a collection of variables that can be of any kind. *** Model assumptions: the numerical explanatory variables are assumed to be uncorrelated. The errors are assumed to have mean 0, the same std deviation σ, and to be uncorrelated; a Normality assumption is needed for running inferential procedures when the sample size is small.

152
Q

Parameters (regression)

A

The parameters to make inferences on are: - the beta coefficients β0, β1, …, βk —- the errors’ std deviation σ, the “std error of the model” (Root MSE).

153
Q

Wording for a regression significance test where p-value > alpha.

A

(If multiple: controlling for variable x), variable y is not significant. The significant relationship that can be found by regressing z only on y is SPURIOUS.

154
Q

Goodness of Fit in a multiple linear regression model

A

A multiple linear regression model’s goodness of fit can be assessed through: – descriptive indexes such as the R-Squared and Adjusted R-Squared —— hypothesis testing procedures such as the F Test

155
Q

Adjusted R-Squared

A

The Adjusted R-Squared is obtained from the R-Squared by adjusting for the effect of the sample size and the number of covariates. The Adjusted R-Squared can be used for model comparison (the R-Squared cannot, because it depends on the sample size).
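The adjustment can be sketched with the standard formula 1 − (1 − R²)·(n − 1)/(n − k − 1), where k is the number of covariates; the values below are hypothetical (Stata's `regress` output reports Adj R-squared directly).

```python
# Adjusted R-squared: penalize R-squared for the number of covariates k
# relative to the sample size n (hypothetical values).
r_squared = 8 / 11   # R-squared from a hypothetical simple regression
n = 5                # sample size
k = 1                # number of covariates

adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)
print(round(adj_r_squared, 4))
```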

156
Q

F Test

A

In the F test, under the null hypothesis none of the independent variables contributes to the explanation of the dependent variable. Under the alternative hypothesis, at least one explanatory variable contributes to the explanation of the dependent variable. If we reject the null hypothesis we claim that: - there is enough empirical evidence to conclude that at least ONE explanatory variable is significant (we do not know which); - - there is enough empirical evidence to conclude that the variables altogether provide a significant explanation of the dependent variable; - - - there is enough empirical evidence to conclude that the model is significant. If we do not reject the null hypothesis, there is not enough empirical evidence to conclude that the variables altogether provide a significant explanation of the dependent variable, and there is not enough empirical evidence to conclude that the model is significant. **** The F test statistic is worked out from the ratio of the variability explained by the model over the variability that the model does not explain. The higher it is, the stronger is the evidence against the null hypothesis. Under the null hypothesis, F has a known probability distribution referred to as the F Snedecor distribution.

157
Q

F Test Statistics

A

The F test statistic is worked out from the ratio of the variability explained by the model over the variability that the model does not explain. The higher it is, the stronger is the evidence against the null hypothesis. Under the null hypothesis, F has a known probability distribution referred to as the F Snedecor distribution.
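The ratio can be sketched numerically, assuming the usual degrees-of-freedom scaling (explained SS over k, residual SS over n − k − 1); the sums of squares below are hypothetical.

```python
# F test statistic: explained vs unexplained variability, each divided by
# its degrees of freedom (hypothetical sums of squares).
rss = 6.4   # regression (explained) sum of squares
ess = 2.4   # residual (unexplained) sum of squares
n = 5       # sample size
k = 1       # number of covariates

f = (rss / k) / (ess / (n - k - 1))
print(f)
```

With a single covariate, F equals the square of the t statistic on the slope.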

158
Q

Different kinds of explanatory variables

A

In a multiple linear regression model, we have no constraints on the side of the explanatory variables. They can be numerical, categorical binary, or multinomial. CATEGORICAL explanatory variables are entered by means of DUMMY variables.

159
Q

Dummy coefficient estimate interpretation

A

eg. coef: 1.978. “1.978 is the estimate of the average difference between participatory_index in the two sub-populations of those who show interest and those who do not show interest in politics.”

160
Q

Test on the significance of the dummy’s coefficient

A

The test on the significance of the dummy’s coefficient reduces to a test on the significance of the mean difference of the dependent variable between the two sub-populations identified by the dummy.

161
Q

Multinomial Covariates

A

In order to enter a categorical variable taking more than two levels (multinomial) into the model as an explanatory variable, we need to create a collection of dummy variables. // Let’s assume that the explanatory variable takes on m different levels. - A level is selected as the reference level or CONTROL group. - A dummy variable is associated with each of the remaining levels: dummy2, dummy3, …, dummym. - The m-1 dummy variables are entered into the model as explanatory variables.
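The coding scheme above can be sketched as follows; the variable name `education`, its levels, and the choice of reference are all hypothetical (in Stata, factor-variable notation such as `i.varname` builds the dummies automatically).

```python
# Dummy coding for a multinomial covariate: m levels become m-1 dummies,
# with one level held out as the reference (control) group. Hypothetical data.
education = ["low", "medium", "high", "medium", "low"]
levels = ["low", "medium", "high"]
reference = "low"   # control group: gets no dummy of its own

dummies = {
    lvl: [1 if obs == lvl else 0 for obs in education]
    for lvl in levels
    if lvl != reference
}
print(dummies)  # {'medium': [0, 1, 0, 1, 0], 'high': [0, 0, 1, 0, 0]}
```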