STATA Flashcards

Question

Variable collapsing (recode), e.g. gov_inc

Answer 1

recode gov_inc (1/2=1 "agree") (3/5=0 "disagree") (.=.), gen(gov_inc_agree)

Answer 2

tab var_name, gen(var_name_dum)

Answer 3

generate var_new=recode(var_old, max1, max2, max3)

Answer 4

generate var_new=function(var1, var2, ... )

Answer 5

generate var_sum=var1+var2+var3

Answer 6

generate var_log=ln(old_var)

Answer 7

It displays the distinct observed values of the variable, the raw frequencies (counts of units falling within each category), the percentages (relative frequencies), and the cumulative percentages.

Answer 8

Mode, Median, Mean

Answer 9

ALL variables with a manageable number of values. It is the only measure that can be used for a categorical nominal variable (the single value with the highest frequency).

Answer 10

Interval-level (numerical) and categorical ordinal variables. 50%.

Answer 11

Only numerical variables.

Answer 12

mean var_name ************* sum var_name

Answer 13

The median is more robust than the mean against outliers.

Answer 14

Interquartile range, range, variance and standard deviation.

Answer 15

How are the frequencies distributed across the values of the variable? Are the units polarized into a few categories? Is there heterogeneity in the variable distribution?

Answer 16

Interquartile range, UQ-LQ=IR. It measures the spread of the central 50% of the distribution. The higher the IR, the higher the dispersion of the variable.

Answer 17

The range (max. minus min. observed value). To obtain it, run sum var_name, which reports the minimum and the maximum.

Answer 18

The difference between each variable value and the mean. (Positive=above the mean, Negative=below). The deviation can be seen as a measure of distance between the value and the mean.

Answer 19

The average of the squared deviations. The variance measures the dispersion of a variable as the spread around its mean. It is never negative; if all values are equal, variance=0. The higher the variance, the higher the dispersion of a variable around its mean. The variance is not expressed in the same unit of measurement as the variable, but in its square.

Answer 20

The standard deviation is the square root of the variance and is therefore expressed in the same unit of measurement as the variable **** sum var_name, d ***** reports the std, the variance, and the percentiles needed for the IR.
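
As a rough illustration of the deviation, variance and standard deviation cards, here is a minimal Python sketch. The data values are hypothetical. It follows the card's definition (average of the squared deviations, dividing by n); Stata's sum var_name, d reports the sample version, which divides by n-1, so that figure is shown as well.

```python
import math

# Hypothetical observations, chosen only for illustration
values = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(values)
mean = sum(values) / n

# Deviation: distance of each value from the mean (positive = above the mean)
deviations = [v - mean for v in values]
squared = [d ** 2 for d in deviations]

# Variance as defined on the card: average of the squared deviations
variance = sum(squared) / n
std_dev = math.sqrt(variance)

# Sample variance (divides by n-1), the version a summary command typically reports
sample_variance = sum(squared) / (n - 1)

print(mean, variance, std_dev, sample_variance)
```

The standard deviation comes back in the original unit of measurement, while the variance is in squared units.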

Answer 21

histogram var_name, d percent *************** A bar chart is a graphical representation of a frequency distribution.

Answer 22

hist var_name, percent ********** It describes interval-level variables taking several distinct values. The range of the variable is divided into a given number of intervals (bins) of the same width. On each bin, a bar is drawn whose height is the percentage of units falling in the interval. hist var_name, percent bin(25)

Answer 23

graph box var_name ************ min, max, LQ, UQ, median, whiskers (1.5 times the IR); any value above the upper whisker or below the lower whisker is flagged as an outlier.

Answer 24

We can assess the shape of a variable by plotting a bar chart or histogram. Tail on the right: positive skew, mean>median. Tail on the left: negative skew, mean<median.

Answer 25

1 - Measuring and describing concepts ******* 2 - Suggesting and assessing explanations for (political) concepts

Answer 26

The variable that measures the concept we want to explain

Answer 27

The variable that measures the concept identified as a possible determinant of the observed differences in the dependent variable.

Answer 28

They tell us that when we compare units of analysis having different values of the independent variable we will observe a difference in the dependent variable, and they specify the tendency of the relationship.

Answer 29

In a comparison (units of analysis), those having (one or more values of the independent variable) will be more likely to have (one value of the dependent variable) than those having (a different value of the independent variable)

Answer 30

Cross-tabulations ********* tab dependent_var independent_var, col

Answer 31

Mean comparisons ********* tab var_independent, sum(var_dependent)

Answer 32

Scatterplot ******* scatter var_dependent var_independent

Answer 33

tab dependent_var independent_var, col ************** A Cross-tabulation is a table delivering the frequency distributions of the dependent variable within the groups determined by the levels of the independent variable (compare PERCENTAGES, not counts)

Answer 34

hist dependent_variable, d percent by(independent_var)

Answer 35

hist dependent_var, d percent by(independent_var)

Answer 36

tab var var_dummy

Answer 37

tab var_independent, sum(var_dependent)

Answer 38

graph bar (mean) var_dependent, over(var_independent)

Answer 39

graph box var_dependent, over(var_independent)

Answer 40

Controlled Comparisons make it possible to establish whether the association between the dependent and the independent variable is spurious or additive, or whether there is an interaction between the independent and control variables. If the control and independent variables are both categorical, we divide the sample units into groups according to the values of the control variable, and then for each group we compare the behavior of the dependent variable across the groups identified by the independent variable.

Answer 41

Once we control for a third variable in assessing the relationship between a dependent variable and an independent variable, the relationship first detected becomes weak or disappears altogether. In a spurious relationship, the detected empirical relationship is completely coincidental.

Answer 42

Once we control for a third variable in assessing the relationship between a dependent variable and an independent variable, the relationship first detected remains unchanged. The independent variable contributes to the explanation of the dependent variable and the control variable helps as well.

Answer 43

Once we control for a third variable in assessing the relationship between a dependent variable and an independent variable, the relationship first detected is not the same for all values of the control variable: it has a different strength or direction depending on the control variable values. There is an interaction between the independent variable and the control variable.

Answer 44

bysort var_control: tab var_dependent var_independent, col

Answer 45

The relationship between the dependent and the independent variable weakens or disappears for at least one level of the control variable

Answer 46

The direction of the relationship between the dependent and the independent variable varies across values of the control variable

Answer 47

The strength of the relationship between the dependent and the independent variable is the same or similar for all values of the control variable

Answer 48

bysort var_control: tab var_independent, sum(var_dependent)

Answer 49

- graph bar (mean) var_dependent, over(var_independent) over(var_control) *************** - graph box var_dependent, over(var_independent) over(var_control)

Answer 50

The set of all units that are the object of our investigation. The set of all statistical units on which we might measure concepts or characteristics of interest.

Answer 51

A subset of statistical units drawn from the population. The sample should be representative of the population, without bias.

Answer 52

Inferential Statistics is a collection of techniques that make it possible to draw conclusions concerning the whole population from the analysis of a portion of the population, that is, a sample drawn from it.

Answer 53

Random sampling is a sampling scheme that protects against bias, ensuring the extraction of representative samples. In random sampling, the units are drawn at random from the population one at a time in such a way that: units have the same probability of being drawn / all samples of a given size n have the same probability of being drawn. Random sampling is also the building block for more complex sampling schemes (stratified sampling, cluster sampling).

Answer 54

The random variable is a probabilistic tool that models the outcome of a random sampling experiment. It is called "variable" because, before drawing the unit, we know that different values can be observed; it is called "random" because we do not know the value we are going to observe.

Answer 55

The probabilistic behaviour of a discrete random variable is described by means of a probability function that associates to each of the possible values taken by the random variable the probability of observing it.

Answer 56

The probabilistic behaviour of a continuous random variable is described by means of a DENSITY FUNCTION. A density function is ******* always non-negative, ******** the area under its graph is equal to 1, ********** the probability that the random variable takes a value in an interval is equal to the area of the region delimited by the graph of the function and the interval.

Answer 57

Discrete random variables taking only 2 values. It models the occurrence or not of a given outcome. It takes value 1 when the outcome occurs, 0 when it doesn't. The probability of observing value 1 is denoted by (Pi) and it entirely describes the behaviour of the random variable.

Answer 58

It can be described by means of synthetic measures. {mean} {variance} {std}

Answer 59

independent and identically distributed variables, they make up a random sample. (same distribution, same mean, same std)

Answer 60

- Point Estimation Problem: approximation of an unknown parameter. ************* - Interval Estimation Problem: interval of values containing an unknown parameter with a predetermined level of confidence. ************** - Hypothesis Testing Problem: determination whether to reject or not a given hypothesis on a parameter

Answer 61

Functions of the sample that summarize sample information. When functions of random variables, they are random as well. Inferences on the population mean are based on two statistics: sample mean, sample std.

Answer 62

1) the sample mean distribution has mean equal to the pop. mean (Mu). ****** 2) the samp. mean distribution has std=pop. std ÷ square root of the sample size. ************* 3) for a large sample size n, the distribution of the sample mean can be approximated by means of the Normal/Gaussian distribution (central limit theorem)

Answer 63

Yes. If we use the sample mean as an estimator of the unknown population mean, we might under(or)over-estimate it, but we do not make a systematic error of over or under approximation.

Answer 64

"Standard Error of the Mean". It can be seen as an overall measure of the error we make by estimating the unknown mean using the sample mean.

Answer 65

A Normal/Gaussian random variable has a density that is symmetric and bell-shaped. Due to symmetry, the mean and the median coincide. The density is specified by two parameters: the mean and the standard deviation of the distribution. For a given mean, a higher std gives a flatter, wider bell-shaped curve. (Common threshold for the Normal approximation in the Central Limit Theorem: n=50.)

Answer 66

The sample variance is the average of the squared deviations of each X1,...,Xn from the sample mean (X with a line above). Inferences on the population variance are based on the sample variance.

Answer 67

The sample standard deviation is given by the square root of the sample variance and denoted by S.

Answer 68

The standard error of the mean, (sigma)/sqrt(n), depends on the unknown population standard deviation (sigma). We can therefore estimate the std error through the sample standard deviation S. The standard error of the mean is then estimated by S/sqrt(n).
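
The estimate S/sqrt(n) can be sketched in a few lines of Python. This is only an illustration: the sample values are hypothetical, and S is computed with the n-1 denominator (which is what statistics.stdev does).

```python
import math
import statistics

# Hypothetical sample, for illustration only
sample = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(sample)
s = statistics.stdev(sample)      # sample standard deviation S (n-1 denominator)
standard_error = s / math.sqrt(n) # estimated standard error of the mean, S/sqrt(n)

print(round(s, 4), round(standard_error, 4))
```

Quadrupling the sample size would halve the standard error, since n enters through its square root.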

Answer 69

The standard error measures the spread of the estimates around the parameter, and therefore provides a first evaluation of the accuracy of the estimation procedure. The lower the standard error, the higher the accuracy of the estimator.

Answer 70

A confidence interval estimation for an unknown parameter provides us with an approximation of the parameter and, at the same time, with a probabilistic assessment of the approximation error that we make. ***** A confidence interval of level 100(1-alpha)% for an unknown parameter is an interval of values to which we are 100(1-alpha)% sure the unknown parameter belongs. *** Typical levels of confidence are 95%, 90% and 99%, corresponding to an alpha equal respectively to 0.05, 0.1 or 0.01.

Answer 71

When interpreting a 100(1-alpha)% confidence interval, we claim that we are 100(1-alpha)% confident that the unknown parameter belongs to it. We are confident that in only 100(alpha)% of cases the population mean falls outside the interval.

Answer 72

mean var_name (you get 95%) ******************* mean var_name, level(99)

Answer 73

The margin of error is given by the std error of the estimator multiplied by a positive constant that depends on the probability distribution of the estimator and on the level of confidence. The margin of error is a building block of the confidence interval. The lower and upper bounds of a confidence interval for any parameter are obtained by respectively subtracting and adding the margin of error to the parameter estimate. parameter estimate(+-) margin of error.

Answer 74

A confidence interval for any parameter is centered at the estimate and the lower and upper bounds are obtained by respectively subtracting and adding the margin of error to the parameter estimate. ******** parameter estimate(+-) margin of error.

Answer 75

standard error of the mean × z(alpha/2)
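
A minimal Python sketch of "parameter estimate (+-) margin of error" for the mean, using the Normal critical value z(alpha/2) as on the card. The sample is hypothetical. Note, as an assumption worth checking against your own output, that Stata's mean command bases its interval on the t distribution, so its bounds will be somewhat wider for small samples.

```python
import math
from statistics import NormalDist, mean, stdev

# Hypothetical sample, for illustration only
sample = [2, 4, 4, 4, 5, 5, 7, 9]

level = 0.95
alpha = 1 - level

estimate = mean(sample)
standard_error = stdev(sample) / math.sqrt(len(sample))

# z(alpha/2): the value leaving probability alpha/2 in the upper tail
z = NormalDist().inv_cdf(1 - alpha / 2)

margin = z * standard_error
lower, upper = estimate - margin, estimate + margin

print(round(lower, 3), round(upper, 3))
```

Raising the level to 0.99 increases z and therefore lengthens the interval.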

Answer 76

Standardization is a variable transformation consisting in subtracting from the variable its mean and dividing by the std deviation.

Answer 77

The length of the confidence interval measures the accuracy of the estimation. The longer the interval, the lower the accuracy of the estimation.

Answer 78

The arithmetic average of numbers equal to 0 or 1 is equal to the proportion of 1s. If we consider a Bernoulli random variable, the mean of its distribution reduces to the probability (Pi) of observing the outcome of interest. The estimation of the population percentage of units on which we observe a given outcome reduces to the estimation of the unknown population mean of a random variable taking value 1 when the outcome is observed and 0 in all other cases.

Answer 79

A statement on a population parameter that specifies a value or an entire range of values. A hypothesis is said to be SIMPLE if it specifies ONE single value of the parameter; COMPOSITE if it specifies an entire RANGE of values. ******** Null hypothesis H0: it specifies a single value of the parameter. Alternative Hypothesis Ha: an alternative range of values.

Answer 80

A hypothesis testing problem consists in testing a null hypothesis against an alternative one. The two hypotheses are formulated so that only one of them is true; we will never know which one. A hypothesis testing problem is a DECISION problem: we have to decide whether or not to reject the null hypothesis H0, on the basis of the empirical evidence provided by the data. When H0 is rejected, we decide to act as if H0 were false and Ha true.

Answer 81

The null hypothesis is presumed to be true unless the data provide strong evidence against it. The alternative hypothesis is the analyst's research hypothesis. The burden of proof falls on the researcher who claims that the alternative hypothesis is true.

Answer 82

Type I Error: to reject H0 when it is true. Court room analogy of convicting an innocent: type I error as the worse one. ************ Type II Error: failing to reject H0 when it is false.

Answer 83

The probability of a Type I error is called the significance level of the hypothesis test and is denoted by alpha. The significance level is fixed by the analyst: usual choices are 0.05, 0.01, 0.001.

Answer 84

A significance test of level alpha is a testing procedure such that the probability of making a Type I error is alpha and, among all tests having the same level alpha, it minimizes the probability of making a Type II error.

Answer 85

During significance tests we ask questions such as: are the estimated values plausible under the null hypothesis?

Answer 86

Test the null hypothesis H0:(mu)=(mu)0 against the alternative Ha:(mu)>(mu)0 (...or, against the alternative (mu)<(mu)0)

Answer 87

The decision to reject or not to reject the null hypothesis is based on a function of the sample referred to as the test statistic. Any test statistic compares the estimate of the parameter against the value specified by the null hypothesis: it measures the DISTANCE between the two in terms of the NUMBER OF STANDARD ERRORS by which the estimate falls above or below the value specified by the null hypothesis. For tests on the mean, the test statistic has the following expression: (X-(mu)0)/se (X with a line above). It results from standardizing the sample mean under H0 and replacing the std error with its estimator. Under H0 (that is, assuming that H0 is true), the test statistic has a standardized Normal distribution.

Answer 88

The p-value is a probability statement on whose basis we decide whether or not to reject the null hypothesis in a hypothesis testing problem. By definition, the p-value is the probability, under the null hypothesis, of observing a value of the test statistic MORE EXTREME than the observed value, in the direction specified by the alternative hypothesis. Assuming that the unknown population mean is equal to (mu)0, the p-value returns the probability of observing a value of the sample mean falling above (mu)0 by a higher number of std errors than the one observed. In other words, the p-value returns the probability of observing a value of the sample mean that provides even stronger evidence against the null hypothesis than the observed one.

Answer 89

For a given significance level alpha, in testing the null hypothesis against the alternative one, we reject the null hypothesis IF P VALUE < ALPHA.
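
The test statistic, p-value and decision rule from the last few cards can be put together in a short Python sketch. The sample and the null value mu0 are hypothetical, and the p-value uses the standardized Normal distribution as stated on the cards (Stata's ttest relies on the t distribution instead, so its p-values differ slightly in small samples).

```python
import math
from statistics import NormalDist, mean, stdev

# Hypothetical sample and null value, for illustration only
sample = [2, 4, 4, 4, 5, 5, 7, 9]
mu0 = 3.5      # value specified by H0
alpha = 0.05   # significance level fixed by the analyst

standard_error = stdev(sample) / math.sqrt(len(sample))

# Number of standard errors by which the sample mean falls above mu0
test_statistic = (mean(sample) - mu0) / standard_error

# One-sided p-value for Ha: mu > mu0 (upper-tail probability under H0)
p_value = 1 - NormalDist().cdf(test_statistic)

reject_h0 = p_value < alpha
print(round(test_statistic, 3), round(p_value, 4), reject_h0)
```

For the two-sided alternative, the p-value would be the two-tail probability, that is, twice the tail area beyond the absolute value of the test statistic.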

Answer 90

ttest var_name = test_value

Answer 91

Test the null hypothesis (mu)=(mu)0 against (mu)=/=(mu)0. The decision is taken on the basis of the test statistic (X-(mu)0)/se (X with a line above). In a two-sided test, the p-value is a two-tail probability. A two-sided test can be solved by working out a CONFIDENCE INTERVAL. We work out a 100(1-alpha)% confidence interval for (mu) and reject the null hypothesis if the value specified by the null hypothesis, (mu)0, is not an element of the confidence interval.

Answer 92

tab var_independent, sum(var_dependent). ********** "The two means are different. At the level of the sample there is association between the two variables"

Answer 93

ttest var_dependent, by(var_independent). *********** Look at the p-value of Ha: diff != 0; if it is smaller than alpha: "There is enough empirical evidence to conclude that the two means are significantly different. The association between the two variables is significant." /confidence intervals of different levels: option level()/

Answer 94

- Chi-square Test of Independence ************** - Cramér's V Measure of Association

Answer 95

Two random variables are said to be independent if the (conditional) probability distributions of one of the two in the sub-populations determined by the other are all the SAME. Independence is assessed through the Chi2 Test of Independence.

Answer 96

The Chi2 Test of Independence makes it possible to assess whether we can assume that two categorical variables are associated at the population level. In the Chi2 Test of Independence, the evidence against the null hypothesis (the two variables are independent) is summarized by a statistic that provides an overall measure of the distance between the observed situation and the situation that we would observe (expected) under the null hypothesis. *************** tab var_dependent var_independent, chi2 ******* Pearson chi2 = chi2 test statistic value. Pr = p-value of the test.

Answer 97

The expected counts are the counts we would observe under the null hypothesis of independence of the two variables. If the variables are independent, the frequency distributions of Y in the groups determined by X are the same, equal to the frequency distribution of Y at the level of the sample. The expected counts are then obtained by applying to the size of each group the sample percentages of the response variable. **************** tab var_dependent var_independent, col expected

Answer 98

The Chi2 test statistic value for a given sample is obtained through the following steps: 1) Work out the difference between the observed and expected counts; 2) Square each difference; 3) Divide each squared difference by the corresponding expected count; 4) Add up all the obtained numbers. The larger the difference between observed and expected counts, the higher the value of the Chi-square test statistic. ***** Under the null hypothesis, the chi2 test statistic has a known probability distribution referred to as the Chi2 Distribution. **** The higher the value of the chi2 test statistic, the stronger the evidence against the null hypothesis
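
The four steps above can be sketched in Python on a small cross-tabulation. The counts are hypothetical; rows stand for the categories of the dependent variable and columns for the groups of the independent variable. Expected counts are computed as row total × column total ÷ n, which is equivalent to applying the overall sample percentages of the response to each group size.

```python
# Hypothetical observed counts: rows = dependent categories, columns = independent groups
observed = [
    [30, 20],
    [20, 30],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected counts under independence
expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]

# Steps 1-4: difference, square, divide by expected, add up
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

print(expected, chi2)
```

Here every expected count is 25, each cell contributes (5 squared)/25 = 1, and the statistic is 4.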

Answer 99

Cramér's V Index of Association is worked out from the Chi-square statistic through a non-linear transformation. Cramér's V takes values between 0 and 1. The higher it is, the stronger the association between the two variables. ************** tab var_dependent var_independent, V chi2 ***** e.g. "Cramér's V index is equal to 0.2729, there is a moderate association between the two variables."
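
The card does not spell out the transformation. A common textbook form, assumed here, is V = sqrt(chi2 / (n × min(rows-1, columns-1))). The helper name cramers_v and the input numbers below are hypothetical.

```python
import math

def cramers_v(chi2, n, n_rows, n_cols):
    """Cramér's V from a chi-square statistic (textbook formula, assumed here)."""
    return math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))

# Hypothetical 2x2 table with chi2 = 4.0 on n = 100 units
v = cramers_v(chi2=4.0, n=100, n_rows=2, n_cols=2)
print(round(v, 4))
```

Dividing by n removes the dependence of the chi-square statistic on sample size, which is what makes V comparable across tables.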

Answer 100

Plotting a scatterplot provides a first description of the association between two NUMERICAL variables at the level of the sample. X:expl/indep, Y:dep. Each point on the graph identifies a pair of values, one for the exp var one for the dep var. *** scatter var_dependent var_independent ******* scatter var_dependent var_independent, mlabel(Country?)

Answer 101

POSITIVE LINEAR ASSOCIATION: for two numerical variables, high values of one variable tend to occur with high values of the other variable / low-low. ************* NEGATIVE L.A.: high-low, low-high

Answer 102

Pearson's Correlation Index measures the DIRECTION and the STRENGTH of the correlation between two NUMERICAL variables. PCI takes values between -1 and 1. Pos. value, pos. correlation. The closer Pearson's correlation is to -1 or 1, the stronger the correlation. A Pearson's Correlation equal to 0 tells us that there is no correlation (linear association) *********** pwcorr var_dep var_indep "each cell provides the correlation coefficient value for the corresponding pair of variables". The correlation matrix is symmetric. e.g. 0.6816 "Pearson's correlation index between var_dep and var_indep is positive and closer to 1 than to 0. There is a moderately high positive linear correlation between the two variables."
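
For intuition, Pearson's correlation can be computed by hand as the sum of the products of the deviations divided by the square root of the product of the two sums of squared deviations. The sketch below uses hypothetical data and plain Python.

```python
import math

# Hypothetical paired observations, for illustration only
x = [1, 2, 3, 4, 5]   # independent variable
y = [2, 4, 5, 4, 5]   # dependent variable

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

dev_x = [xi - mean_x for xi in x]
dev_y = [yi - mean_y for yi in y]

sum_xy = sum(dx * dy for dx, dy in zip(dev_x, dev_y))
sum_xx = sum(dx ** 2 for dx in dev_x)
sum_yy = sum(dy ** 2 for dy in dev_y)

r = sum_xy / math.sqrt(sum_xx * sum_yy)
print(round(r, 4))
```

Swapping x and y leaves r unchanged, which matches the symmetry of the correlation matrix.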

Answer 103

The absence of linear association is denoted by a Pearson's correlation equal to 0, meaning no correlation.

Answer 104

Closer to +-1: the points in the scatterplot fall closer to a straight line. High positive correlation: the overall trend in the data can be approximated by a line with a positive slope.

Answer 105

"Is the correlation between the two variables observed at the level of the sample also present at the level of the population?" Test on the significance of the correlation: pwcorr var_dep var_indep, sig. 3rd row: p-value. e.g. 0.000 "There is enough empirical evidence to conclude that the correlation is significant"

Answer 106

pwcorr dep_var indep_var, sig

Answer 107

A regression model makes it possible to EXPLAIN the dependent variable by means of the covariate. It makes it possible to PREDICT new, not observed values of the dependent variable by means of the covariate. A multiple linear regression model makes it possible to assess the RELATIONSHIP between a numerical dependent variable, and one or more variables (independent variables, explanatory variables, covariates) that can be numerical or categorical through a model-based approach.

Answer 108

SIMPLE linear regression model: regression model with only ONE INDEPENDENT variable. ****** MULTIPLE linear regression model: regression model with MORE than one independent variable.

Answer 109

CORRELATION: Measures the degree to which two variables are inter-related. Same correlation coefficient if you swap X with Y. It does not quantify the change in one variable associated with the change in another. **************** REGRESSION: The relationship is evaluated through a dependency model. The regression of x on y is not the same as that of y on x. The impact of one variable on another is quantified.

Answer 110

***** Explaining Fb_use through Internet_use ******** Predicting the Facebook penetration for countries not in the database for which we might know the level of internet penetration

Answer 111

The overall distance of the points from the horizontal line can be used as a measure of the goodness of the prediction provided by the sample mean. An overall measure of distance of the points from the horizontal line is given by the sum of the squared difference between the observed values of dep_var and the value predicted by the horizontal line, that is constant and equal to the sample mean.

Answer 112

Among the several lines that we can draw in the scatterplot, the regression line is the one having the SMALLEST overall distance from the points, as measured in terms of the sum of squared vertical distances from each point to the line

Answer 113

Consider a random sample of size n and denote by y(i) and x(i) the values taken respectively by the dependent and the independent variable on the i-th unit of the sample. A simple linear regression model has the following general mathematical expression: y(i)=(beta)0+(beta)1x(i)+(epsilon)i. (beta)0 and (beta)1 are referred to as MODEL COEFFICIENTS. They are unknown parameters, the object of inference. (epsilon)i is a RANDOM variable. ********** The model claims that the value of the dependent variable observed on the i-th unit is given by the sum of two components: - One depending on the independent variable through the function (beta)0+(beta)1x(i) ********* - One not depending on the explanatory variable, being a random error term that summarizes the error we can make in explaining and predicting the dependent variable only on the basis of the considered independent variable.

Answer 114

y(i)=(beta)0+(beta)1x(i)+(epsilon)i *** The model is studied and used under the following assumptions on the error terms. *** The errors are assumed to: have mean 0 --- have the same std deviation (sigma) --- be uncorrelated. *** A Normality assumption is needed for running inferential procedures when the sample size is small. ************ Interpretation of the assumptions: (epsilon)i has mean 0, constant std deviation (sigma) and a Normal distribution. ** y(i) has mean equal to (beta)0+(beta)1x(i).

Answer 115

The parameters to make inferences on are: the intercept (beta)0 **** the slope (beta)1 **** the standard deviation (sigma), referred to as the "standard error of the model"

Answer 116

The difference between the observed value of the dependent variable and the value predicted by the regression line for the corresponding value of the independent variable

Answer 117

Intercept and Slope are estimated by means of the Least Squares Method. The estimates of the coefficients are those values which minimize the sum of the squared residuals, as a measure of the overall distance between the regression line and the points.
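
For a single explanatory variable, the least squares estimates have a closed form: the slope is the sum of the products of the deviations divided by the sum of squared deviations of x, and the intercept makes the line pass through the point of means. A minimal Python sketch on hypothetical data:

```python
# Hypothetical paired observations, for illustration only
x = [1, 2, 3, 4, 5]   # independent variable
y = [2, 4, 5, 4, 5]   # dependent variable

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

sum_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sum_xx = sum((xi - mean_x) ** 2 for xi in x)

slope = sum_xy / sum_xx               # estimate of (beta)1
intercept = mean_y - slope * mean_x   # estimate of (beta)0

# Residuals: observed value minus value predicted by the line
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
sum_sq_residuals = sum(e ** 2 for e in residuals)

print(round(slope, 4), round(intercept, 4), round(sum_sq_residuals, 4))
```

Any other choice of slope and intercept would give a larger sum of squared residuals on these data.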

Answer 118

The slope (beta)1 represents the average change in the response associated with a unit increase in the explanatory variable.

Answer 119

regress var_dependent var_independent

Answer 120

twoway (scatter var_dependent var_independent) (lfit var_dependent var_independent)

Answer 121

"How much better can we explain and predict the dependent variable knowing the independent variable rather than not knowing it?" - The goodness of fit of the regression model can be assessed through the R-SQUARED index. Total Sum of Squares: sum of the squared deviations of the dep var; it measures the total variability of the dependent variable

Answer 122

TSS=RSS+ESS. RSS: Regression Sum of Squares, the amount of the total variability explained by (accounted for by) the independent variable. ESS: the amount of the total variability that is not accounted for by the independent variable. ****** The higher the RSS with respect to the ESS, the higher the amount of the total variability that has the explanatory variable as its source / the higher the goodness of fit of the model.

Answer 123

Ratio of RSS over TSS. R-Squared is the percentage of the overall variability of the dependent variable that is accounted for by the independent variable. R-Squared ranges between 0 and 1; the higher it is, the higher the goodness of fit of the model. ***** In the case of one single explanatory variable, the R-squared is equal to the square of Pearson's correlation index. ******** R-Squared never decreases when a new covariate is added to the model. *** R-Squared depends on the sample size and this can be a disadvantage when comparing models having a different number of covariates or fitted on samples having different sizes. For this purpose we use the Adjusted R-Squared
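
The decomposition TSS=RSS+ESS and the resulting R-Squared can be checked numerically. The sketch below uses hypothetical data and the cards' naming (RSS = regression sum of squares, ESS = unexplained sum of squares); the line is fitted with the closed-form least squares estimates.

```python
# Hypothetical paired observations, for illustration only
x = [1, 2, 3, 4, 5]   # independent variable
y = [2, 4, 5, 4, 5]   # dependent variable

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
    (xi - mean_x) ** 2 for xi in x
)
intercept = mean_y - slope * mean_x
predicted = [intercept + slope * xi for xi in x]

tss = sum((yi - mean_y) ** 2 for yi in y)                 # total variability
ess = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted)) # not explained by x
rss = sum((pi - mean_y) ** 2 for pi in predicted)         # explained by x

r_squared = rss / tss
print(round(tss, 4), round(rss, 4), round(ess, 4), round(r_squared, 4))
```

On these data Pearson's correlation is about 0.7746, and its square is 0.6, the same value as R-Squared.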

Answer 124

Model: RSS (Regression Sum of Squares). *** Residual: ESS (Residual Sum of Squares). *** Total: TSS. *** e.g. R-Squared: 0.4646, "Internet_use explains 46.46% of the total variability of Fb_use." *** Root MSE: Model Standard Error. *** Std. Err: This column delivers the standard error for each coefficient estimate as a measure of the accuracy of the estimation procedure. *** [95% Conf. Int.]: e.g. "We are 95% confident that the unknown slope falls between 0.279 and 0.641". 90% conf. int: option "level(90)".

Answer 125

"Is the average change in the dependent variable associated with a unit change in the explanatory variable significant?" "Does the explanatory variable provide a significant contribution to the explanation of the dependent variable?" "Is the simple regression model significant?" ***** We reject the null hypothesis if p-value < alpha. If we reject the null hypothesis we can conclude that: - the explanatory variable provides a significant contribution to the explanation of the dependent variable ---- the average change in the dependent variable associated with a unit change in the explanatory variable is significant ---- the regression line is significant. *** P-value: P>|t| *** t: value of the test statistic. "The slope estimate falls 5.19 std errors above the value specified by the null hypothesis."

Answer 126

Value of the test statistic. "The slope estimate falls 5.19 std errors above the value specified by the null hypothesis."

Answer 127

A multiple linear regression model studies the association between a numerical variable and a collection of variables that can be of any kind. *** Model assumptions: the numerical explanatory variables are assumed to be uncorrelated. The errors are assumed to have mean 0 and the same std dev (sigma), and to be uncorrelated; a Normality assumption is needed for running an inferential procedure when the sample size is small.

Answer 128

The parameters to make inferences on are: - the model coefficients (beta)0, (beta)1, ..., (beta)k ---- the errors' std deviation (sigma), the "std error of the model" (Root MSE).

Answer 129

(If multiple: controlling for variable x), variable y is not significant. The significant relationship that can be found by regressing z only on y is SPURIOUS.

Answer 130

A multiple linear regression model's goodness of fit can be assessed through: -- descriptive indexes such as R-Squared and Adjusted R-Squared ------ hypothesis testing procedures such as the F Test

Answer 131

The Adjusted R-Squared is obtained from the R-Squared by adjusting for the effect of the sample size and the number of covariates. The Adjusted R-Squared can be used for Model Comparison (R-Squared cannot, because it depends on the sample size)

Answer 132

In the F test, under the null hypothesis none of the independent variables contributes to the explanation of the dependent variable. Under the alternative hypothesis, at least one explanatory variable contributes to the explanation of the dependent variable. If we reject the null hypothesis we claim that: - there is enough empirical evidence to conclude that at least ONE explanatory variable is significant (we do not know which); - there is enough empirical evidence to conclude that the variables altogether provide a significant explanation of the dependent variable; - there is enough empirical evidence to conclude that the model is significant. If we do not reject the null hypothesis, there is not enough empirical evidence to conclude that the variables altogether provide a significant explanation of the dependent variable, nor that the model is significant. ************ The F test statistic is worked out from the ratio of the variability explained by the model over the variability that the model doesn't explain. The higher it is, the stronger the evidence against the null hypothesis. Under the null hypothesis, F has a known probability distribution referred to as the F Snedecor distribution.

Answer 133

The F test statistic is worked out from the ratio of the variability explained by the model over the variability that the model doesn't explain. The higher it is, the stronger the evidence against the null hypothesis. Under the null hypothesis, F has a known probability distribution referred to as the F Snedecor distribution.

Answer 134

In a multiple linear regression model, we have no constraints on the type of the explanatory variables. They can be numerical or categorical (binary or multinomial). CATEGORICAL explanatory variables are entered by means of DUMMY variables.

Answer 135

eg. coef: 1.978. "1.978 is the estimate of the average difference between participatory_index in the two sub-populations of those who show interest and those who do not show interest in politics."

Answer 136

The test on the significance of the dummy's coefficient reduces to a test on the significance of the difference between the means of the dependent variable in the two sub-populations identified by the dummy.

Answer 137

In order to enter into the model, as an explanatory variable, a categorical variable taking more than two levels (multinomial), we need to create a collection of dummy variables. // Let's assume that the explanatory variable takes on m different levels. - One level is selected as the reference level or CONTROL group. - A dummy variable is associated with each of the remaining levels (dummy2, dummy3, dummy4). - The m-1 dummy variables are entered into the model as explanatory variables.
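
To make the m-1 dummy construction concrete, here is a small Python sketch. The helper make_dummies, the variable name ideology and its levels are all hypothetical; in Stata the equivalent step is usually done with tab var_name, gen(var_name_dum), leaving the reference dummy out of the model.

```python
def make_dummies(values, reference):
    """Build one 0/1 dummy per level, skipping the reference (control) level."""
    levels = sorted(set(values) - {reference})
    return {
        level: [1 if v == level else 0 for v in values]
        for level in levels
    }

# Hypothetical categorical variable with m = 3 levels
ideology = ["left", "centre", "right", "centre", "left"]

dummies = make_dummies(ideology, reference="left")
print(dummies)
```

Units in the reference group score 0 on every dummy, so each dummy coefficient is read as a difference with respect to that group.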

Learn definitions and formulas! (161 cards)