STATA Flashcards
Learn definitions and formulas!
Systematic measurement error
Also called “mismeasurement bias”. It produces a consistent mismeasurement of the intended characteristic.
Random Measurement Error
It introduces chaotic distortion in the measurement process
Reliable Measure
A measure free of random measurement errors. A measure that is a consistent measure of the concept.
Valid Measure
A measure that records the TRUE VALUE of the intended concept. It does not measure any UNINTENDED characteristics. It is free of any SYSTEMATIC measurement error.
Over-time consistency
Used to assess RELIABILITY. * TEST-RETEST METHOD: the same test is repeated, expecting the same results. * ALTERNATIVE-FORM METHOD: the test is administered again in a roughly equivalent form.
Internal Consistency
Used to assess RELIABILITY. * SPLIT-HALF METHOD: the items are split into two halves, and the scores that the same respondents obtain on the two halves are compared. * CRONBACH’S ALPHA: a statistical measure of internal consistency.
Assessing Validity
* FACE VALIDITY: informed judgement is used to determine whether a measurement strategy is measuring what it should. * CONSTRUCT VALIDITY: we assess whether the measured concept is associated with other concepts in the way we would expect.
The question to address for “Validity”
Are we aiming at the correct target?
The question to address for Reliability
How close have we got to the target?
Variable
The result of the measurement process: the empirical measurement of a concept. Each question in a survey gives rise to a variable. A variable has a name and at least two values. Types: nominal, ordinal and interval-level variables (plus dummies).
What are Nominal and Ordinal variables also referred to as?
Categorical variables
Nominal Variables
They take on values that are not numbers and cannot be ranked.
Ordinal Variables
They take on values that are not numbers, but there is a criterion allowing us to RANK them. Ordinal variables communicate the RELATIVE AMOUNT of the characteristic being measured.
Interval Level Variables
They take on numerical values, providing the most precise measurement of the amount of an observed characteristic.
Decision Tree for Variable Types
Are the values numerical? Yes -> interval-level variable. No -> can we rank the values? Yes -> ordinal. No -> nominal.
Data file
.dta, this is the dataset with all its information. No analysis or results are recorded.
Do file
.do, analysis. It records commands only. It is a file that runs all the commands that you do for your analysis.
Log file
.smcl, RESULTS. It records commands and results of your analysis.
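A rough illustration of how the three file types fit together: the sketch below could be saved as a do-file. The log name session1.smcl is a hypothetical choice, and the auto dataset that ships with Stata stands in for your own .dta file.

```stata
* Hypothetical do-file: it contains commands only
capture log close                    // close any log left open by an earlier run
log using "session1.smcl", replace   // start recording commands and results
sysuse auto, clear                   // load a .dta dataset shipped with Stata
describe                             // overview of the dataset
log close                            // stop recording
```

Running the do-file again reproduces the same log, which is the main reason to keep the analysis in a .do file rather than typing commands interactively.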
describe varname
Number of observations, number of variables, size, date of creation. For each variable it gives the storage type, display format and variable label.
codebook varname
information on each variable: type, RANGE, number of MISSINGS, value LABELS and counts for each value
Creating value labels: eg. Likert
label define likert 1 "Strongly Agree" 2 "Agree" … … 5 "Strongly Disagree" ******* label values var_1 var_2 var_3 likert
First step of creating value labels
label define label_name numeric_code1 "label1" numeric_code2 "label2"
Second step of creating value labels
label values var_1 var_2 .. label_name
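The two steps can be tried on a small made-up dataset. In this sketch the variables var_1 and var_2, the label name agree3 and the three-point scale are all hypothetical, chosen only for illustration.

```stata
* Hypothetical three-point item, created only for illustration
clear
set obs 6
generate var_1 = mod(_n, 3) + 1      // values 1, 2 and 3
generate var_2 = 4 - var_1           // same codes in reverse order

* Step 1: define the mapping from numeric codes to text
label define agree3 1 "Agree" 2 "Neither" 3 "Disagree"

* Step 2: attach that mapping to one or more variables
label values var_1 var_2 agree3

tab var_1                            // the table now shows the text labels
```

Note that Stata expects straight double quotes around each label; curly quotes pasted from a word processor will cause an error.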
Missing values coding, e.g. gov_int, 6
mvdecode gov_int, mv(6=.)
Variable collapsing (recode), eg. gov_inc
recode gov_inc (1/2=1 "agree") (3/5=0 "disagree") (.=.), gen(gov_inc_agree)
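A minimal sketch of missing-value coding followed by collapsing, on made-up data. It assumes a single item named gov_inc coded 1 to 5, with 6 used as a "don't know" code; the six input values are invented for illustration.

```stata
* Hypothetical data: gov_inc coded 1-5, with 6 standing for "don't know"
clear
input gov_inc
1
2
3
4
5
6
end

mvdecode gov_inc, mv(6=.)            // code 6 becomes system missing (.)
recode gov_inc (1/2=1 "agree") (3/5=0 "disagree") (.=.), gen(gov_inc_agree)
tab gov_inc gov_inc_agree, missing   // check old codes against the new ones
```

Using gen() keeps the original variable untouched, so the final cross-tabulation can confirm that every old code landed in the intended new category.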
Automatically creating indicator (dummy) variables from a categorical variable
tab var_name, gen(var_name_dum)
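As an illustration, the same command applied to rep78 in the auto dataset that ships with Stata (the choice of dataset and variable is an assumption made for this sketch):

```stata
sysuse auto, clear
tab rep78, gen(rep78_dum)    // one 0/1 indicator per observed category
describe rep78_dum*          // lists the new variables rep78_dum1, rep78_dum2, ...
```

Each new variable equals 1 for units in the corresponding category and 0 otherwise; units with a missing value on the original variable are missing on every indicator.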
Collapsing interval level variables (upper bounds)
generate var_new=recode(var_old, max1, max2, max3)
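A short sketch of the recode() function on the auto dataset that ships with Stata. The upper bounds 5000, 10000 and 20000 are arbitrary values picked for illustration.

```stata
sysuse auto, clear
* Each price is replaced by the smallest listed bound that is >= the price;
* prices above the last bound are also assigned the last bound
generate price_grp = recode(price, 5000, 10000, 20000)
tab price_grp
```

Because anything above the final bound is folded into the last group, the last bound should be at least as large as the maximum of the variable if that behavior is not wanted.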
Mathematical transformations
generate var_new=function(var1, var2, … )
Additive index
generate var_sum=var1+var2+var3
Log of a variable
generate var_log=ln(old_var)
tab {frequency table}
It displays the observed distinct values of the variable, the raw frequencies (counts) as the number of units falling within each category, the percentages (relative frequencies), and the cumulative percentages.
Measures of Central Tendency
Mode, Median, Mean
Mode
The single value with the highest frequency. Usable for ALL variables with a manageable number of values. It is the only measure of central tendency that can be used for a categorical nominal variable.
Median
Interval-level (numerical) and categorical ordinal variables. The value that splits the distribution in half: 50% of the units fall at or below it.
Mean
Only numerical values.
mean computation
mean var_name ***** sum var_name
Mean vs. Median
The median is more robust than the mean against outliers.
Measures of dispersion or variability
Interquartile range, range, variance and standard deviation.
An assessment of the dispersion or variability of a variable gives an answer to questions such as…
How are the frequencies distributed across the values of the variable? Are the units polarized into a few categories? Is there heterogeneity in the variable's distribution?
Measuring dispersion for a categorical ordinal variable
Interquartile range, UQ-LQ=IR. It measures the spread of the 50% central part of the distribution. The higher the IR, the higher the dispersion of the variable.
Measuring dispersion for interval level variables
The range: the maximum minus the minimum observed value. Both are reported by sum varname.
Deviation
The difference between each variable value and the mean. (Positive=above the mean, Negative=below). The deviation can be seen as a measure of distance between the value and the mean.
Variance
The average of the squared deviations. The variance measures the dispersion of a variable as the spread around its mean. It is never negative; if all values are equal, variance=0. The higher the variance, the higher the dispersion of a variable around its mean. The variance is not expressed in the same unit of measurement as the variable, but in its square.
Standard Deviation
The standard deviation is the square root of the variance and therefore expressed in the same unit of measurement of the variable ** sum var_name, d *** delivers std, variance, percentiles for IR.
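The relationships between these measures can be checked from the results that summarize stores after it runs. This is a sketch on the auto dataset that ships with Stata; the variable mpg is just a convenient example.

```stata
sysuse auto, clear
summarize mpg, detail

* summarize leaves its results in r(); reuse them instead of retyping numbers
display "Variance:                " r(Var)
display "Standard deviation:      " r(sd)
display "Square root of variance: " sqrt(r(Var))
display "Interquartile range:     " (r(p75) - r(p25))
display "Range:                   " (r(max) - r(min))
```

The second and third lines of output should match, since the standard deviation is the square root of the variance.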
Bar chart
histogram var_name, d percent ******* A bar chart is a graphical representation of a frequency distribution.
Histogram
hist var_name, percent ** It describes interval-level variables taking several distinct values. The range of the variable is divided into a given number of intervals (bins) of the same width. On each bin, a bar is drawn whose height is the percentage of units falling in the interval. hist var_name, percent bin(25)
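For comparison, the bar chart and histogram commands side by side on the auto dataset that ships with Stata. The variables and the choice of 10 bins are assumptions made for this sketch.

```stata
sysuse auto, clear

* Bar chart: one bar per distinct value of a variable with few values
histogram rep78, discrete percent

* Histogram: the range of mpg is cut into 10 equal-width bins
histogram mpg, percent bin(10)
```

Changing the number in bin() can noticeably alter the apparent shape, so it is worth trying a few values before describing the distribution.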
Boxplot
graph box var_name **** Shows the min, max, LQ, UQ and median. The whiskers extend up to 1.5 times the IR beyond the quartiles; any value above the upper whisker or below the lower whisker is flagged as an outlier.
Shape
We can assess the shape of a variable's distribution by plotting a bar chart or histogram. Tail on the right: positive skew, mean>median. Tail on the left: negative skew, mean<median.
Goals of a quantitative research
1 - Measuring and describing concepts *** 2 - Suggesting and assessing explanations for (political) concepts
Dependent variable
The variable that measures the concept we want to explain
Independent/explanatory variable
The variable that measures the concept identified as possible determinant of the observed differences in the dependent variable.
A hypothesis should be formulated so as to…
Tell us that when we compare units of analysis having different values of the independent variable we will observe a difference in the dependent variable, and specify the tendency of the relationship.
Template for formulating a hypothesis
In a comparison of (units of analysis), those having (one or more values of the independent variable) will be more likely to have (one value of the dependent variable) than those having (a different value of the independent variable).
Running comparisons when both the dependent and independent variables are CATEGORICAL
Cross-tabulations ******* tab dependent_var independent_var, col
Running comparisons when the dependent variable is interval level and the independent variable is categorical
Mean comparisons ******* tab var_independent, sum(var_dependent)
Running comparisons when both dependent and independent variables are interval level
Scatterplot *** scatter var_dependent var_independent
Cross-tabulation
tab dependent_var independent_var, col **** A Cross-tabulation is a table delivering the frequency distributions of the dependent variable within the groups determined by the levels of the independent variable (compare PERCENTAGES, not counts)
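A sketch of a cross-tabulation on the nlsw88 dataset that ships with Stata. Treating union membership as the dependent variable and college graduation as the independent variable is an assumption made purely for illustration.

```stata
sysuse nlsw88, clear
* Rows: dependent variable (union); columns: independent variable (collgrad)
* The col option adds column percentages, i.e. percentages within each group
tab union collgrad, col
```

Reading across a row compares the column percentages between the groups of the independent variable, which is the comparison the hypothesis is about.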
Graphs of dependent variable frequency distributions
hist dependent_variable, d percent by(independent_var)
Graphs of dependent variable frequency distributions (worked out within each group determined by the levels of the independent variable)
hist dependent_var, d percent by(independent_var)
Check recoding of a dummy variable
tab var var_dummy
Mean comparison
tab var_independent, sum(var_dependent)
Graphical representation of group means
graph bar (mean) var_dependent, over(var_independent)
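A mean comparison and its graph on the auto dataset that ships with Stata. Using mpg as the dependent variable and foreign as the independent variable is an assumption made for this sketch.

```stata
sysuse auto, clear

* Mean, standard deviation and frequency of mpg within each level of foreign
tab foreign, sum(mpg)

* The same group means drawn as bars
graph bar (mean) mpg, over(foreign)
```

The table is usually the better place to read exact values; the bar graph mainly helps to see the direction and size of the difference at a glance.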
Boxplot by groups
graph box var_dependent, over(var_independent)
Controlled Comparisons
Controlled comparisons make it possible to establish whether the association between the dependent and the independent variable is spurious or additive, or whether there is an interaction between the independent and control variables. If the control and independent variables are both categorical, we divide the sample units into groups according to the values of the control variable, and then for each group we compare the behavior of the dependent variable across the groups identified by the independent variable.
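One way to run such a comparison when all three variables are categorical is to repeat the cross-tabulation within each level of the control variable. The sketch below uses the nlsw88 dataset that ships with Stata; the roles given to union (dependent), collgrad (independent) and south (control) are assumptions made for illustration.

```stata
sysuse nlsw88, clear
* One cross-tabulation of union by collgrad for each value of south
bysort south: tabulate union collgrad, column
```

If the relationship between the dependent and independent variable looks similar in every sub-table, the control variable is not changing it; if it weakens, disappears or differs across sub-tables, that points towards a spurious relationship or an interaction.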