Terms Flashcards
validity
extent to which a concept/measurement is well-founded and likely corresponds accurately to the real world
internal vs external validity
internal validity
the obtained effect of x on y for your sample is the correct effect for the sample
-> generalization of causal findings to all cases WITHIN the sample
how to obtain:
-empirical model is correctly specified, estimators are unbiased
-> changes in the dependent variable can be attributed to the independent variable and not to other factors (the challenge is to rule out those other factors)
external validity
obtained effect of x on y in the sample is the correct effect of x on y in the population P
-> generalization of causal findings to other cases not included in the sample -> the overall population
how to obtain:
-enough cases
-sample represents the population in all relevant characteristics
why is validity important
-theory and findings need to show a causal effect for the research to be relevant
-stakeholders need to know whether it also holds for other cases
-in practice: experiments usually have high internal but low external validity, or neither perfect internal nor external validity
validity vs reliability
reliability is the degree of precision with which a specific aspect is measured
advantages of scientific observation
systematic approach of observing and generating information
-objectivity, as opposed to a selective set of observations
-avoidance of “filling in” information
-verifiability
population
all observational units to which the theory is assumed to apply
sample
a subset of the theoretically-defined population for which data is assessed
for reasons of validity, we want this subset to be representative of the population
descriptive statistics
summarize and describe the data in the sample itself
inferential statistics
use sample data to draw conclusions about the population
what is data
quantified information
information for one single case: data point
manifest variables: directly observable variables (e.g. body height)
latent variables: abstract concepts only observable through manifest indicators (e.g. democracy)
data types by source
source:
observable world -> observational data
field or lab experiment -> experimental data
an algorithm -> simulated data
data processing
-> to eliminate sources of error
processing includes:
-reduction of measurement error
-addressing of inter-coder reliability
-elimination of missing data points
-identification of outliers
how to measure data
measurements require
-measurement scale
-measurement unit
-measurement instrument
also includes
-counts
-quantifications
types of variables
can be described by three elements: instrument, measurement unit, scale
variables by scale
categorical variables
how are observations arranged?
nominal variables
-numerical values are used only as labels for types of attributes
-no intrinsic order between categories
-e.g. gender, party affiliation: SPÖ = 1, ÖVP = 2
ordinal variables:
-variables of two or more categories which can be ranked
-the numeric values and the gaps between them are not interpretable
e.g. a smartness rating (a value of 4 does not mean twice as smart as 2)
variables by scale
metric variables
interval variables
-variables have a zero point, but usually without a substantive meaning
-distance between attributes has the same meaning
ratio variables
-zero means that there is none of the measured quantity
use the dataset thedata.dta
use thedata.dta
delete all variables and data
clear
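loading and clearing can be combined; a minimal sketch, assuming thedata.dta is in the working directory:
use thedata.dta, clear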
summarize a dataset
describe, short
describe, simple
summarize
sum, detail
tabulate
list
codebook
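a minimal example session (sketch; the variable age is hypothetical and assumed to exist in the loaded dataset):
describe, short
summarize age, detail
tabulate age, missing
codebook age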
import an Excel dataset
import excel “…”
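a hedged sketch with a hypothetical file name; firstrow treats the first row as variable names:
import excel "survey.xlsx", firstrow clear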
remove var1 and var2
drop var1 var2
remove all variables except var3 and var4
keep var3 var4
distribution table of a variable
tabulate …
missing values are not shown unless the missing option is added:
tabulate …, missing
create a variable
generate
give variable another name
rename
change the value of a variable to another value
replace
add a description to a variable
label variable
add a label to a variable value
2 steps needed:
label define
label values
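a short sketch of both steps, assuming a hypothetical 0/1 variable named employed:
label define yesno 0 "no" 1 "yes"
label values employed yesno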
change order of variables in dataset
order
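a combined sketch of the commands above, using a hypothetical variable income:
generate income_k = income / 1000
rename income_k income_thousands
replace income_thousands = . if income < 0
label variable income_thousands "income in thousands"
order income_thousands, after(income)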
measuring unobservables
conceptualization and operationalization are needed
-> theoretical definitions
-> clarify how the concept is measured by specifying indicators and how information is gathered -> systematized concept
good operationalization is linked to your theory
e.g. concept: study success
-> attributes: academic achievement // acquired abilities
-> components: received prizes, amount of prize money // ability to solve problems etc
issues of conceptualization
problem of conflation
-sub-components should be conceptually in line with the attributes at the corresponding upper level -> sub-components should not relate to conceptually different attributes
problem of redundancy:
-components at the same level should be mutually exclusive
minimalist definition of attributes
+ availability of data may be enhanced
+ no redundancy with other attributes
-every case is an instance, no variation
-measure might not reflect the concept well (invalidity)
-measure may only be applicable for one situation
maximalist definition of attributes
= including too many (irrelevant) attributes
potential drawbacks of overburdening:
-lower usefulness, as the concept has no empirical referents
-tautological and of little analytical use if the main dependent variable is already included as an attribute
Median
50% -> the middle value,
value located directly in the center of collected data
to find it: sum var1, detail -> value at 50%
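a sketch for reading the median from stored results, assuming a hypothetical variable income:
quietly summarize income, detail
display r(p50)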
not normally distributed data
sum var1, detail
skewness: a positive value indicates that a variable is skewed to the right (outliers)
-> if highly skewed to the right, the median might be more representative than the mean, because the mean is affected by outliers
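the skewness can be read from the same stored results (hypothetical variable income):
quietly summarize income, detail
display r(skewness)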
interpreting a boxplot
well suited for ordinal and metric data
-lower whisker runs from the minimum value of the sample to the lower quartile (one quarter of the sample lies between them)
-then the box, with the median in the middle
-upper whisker runs from the upper quartile to the maximum
-whiskers exclude potential outliers, which are plotted separately
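a minimal boxplot sketch, assuming hypothetical variables income and gender:
graph box income, over(gender)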
Mode
modal value = the most common value of a variable
mean
arithmetic mean
average value of a variable
bivariate descriptive statistics
shows the relationship between two variables
options:
-crosstables
-comparison of key measures, e.g. the mean
-graphical comparison
-correlational measures
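hedged sketches of these options, assuming hypothetical variables income, education and gender:
tabulate education gender, column            // crosstable with column percentages
tabstat income, by(gender) statistics(mean)  // compare group means
twoway scatter income education              // graphical comparison
pwcorr income education, sig                 // correlation with significance level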
correlation vs causality
correlation:
var A and var B are correlated if higher/lower values of variable A coincide with higher/lower values in variable B
-> you don't know whether var A influences var B or vice versa
negative correlation: if values of var A are higher, values of var B tend to be lower (and vice versa)
positive correlation: if values of var A are higher, values of var B are higher as well
causality: direct relationship between var A and var B
-> a change in var A leads to a change in var B
-> more difficult to determine; requires an appropriate research design
how can data be visualized
amounts
e.g. bar charts, dots, grouped bars
distributions
e.g. histogram, boxplots
proportions
e.g. pie chart, bars
x-y relationships
e.g. scatterplot
uncertainty
e.g. error bars
geospatial data
map
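possible Stata commands for each type (sketch; variables party, age, income and education are hypothetical):
graph bar (count) age, over(party)   // amounts: observations (nonmissing age) per category
histogram age, width(5)              // distribution
graph pie, over(party)               // proportions
twoway scatter income education      // x-y relationship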
sort bars in stata descending
graph hbar, over(var1, sort(1) descending)
adjust the bin width of a histogram
hist var1, width(5)
vs hist var1, width(10) -> wider bins, so more values fall into each bar of the histogram
increase x and y title size
xtitle(, size(large))
ytitle(, size(large))
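a hedged example attaching these options to a histogram (hypothetical variable age):
histogram age, xtitle("Age in years", size(large)) ytitle("Density", size(large))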
commands for tables
tabulate
fre
color schemes
…, scheme(schemename)
assign individual colors
bar(1, color(black))
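a combined sketch with a built-in scheme and a custom bar color (hypothetical variables income and gender):
graph bar income, over(gender) bar(1, color(navy)) scheme(s1color)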
different types of inferences
descriptive inference:
-historical accuracy of scientific information
-simply observing sample data
statistical inference:
-use sample properties to infer properties of a population
-understand developments over time or relationships between variables
-focus on understanding how uncertain findings are -> t-tests
causal inferences:
-infer the existence of a causal effect from data analysis
hypothesis testing in stata
goal: infer from the sample to the population
problem: the population is usually unknown and only one sample (not infinitely many) is available
need: an estimate of the uncertainty resulting from the use of random sampling
solution: use the mean/standard deviation or proportion in the sample as an estimate for the population
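a minimal t-test sketch, assuming hypothetical variables income and gender (two groups):
ttest income, by(gender)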
stratum
a subset of elements from the population that share a characteristic (usually sociodemographic, e.g. age, gender)
sampling frame
a list of elements in a population that can be identified
convenience sample
-use of information from participants who are convenient to access
-sampling method does not need to select participants based on any set of criteria
-only use this method if representativeness is not of importance for research
quota sample
primarily used when information is to be collected on a specific, definable target population
-if it works well, a quota sample provides a structurally identical representation of the population
-volunteers could still bias the picture
stratified sample
stratified sampling involves random selection within predefined groups (e.g. gender, age)
-> people within a stratum are randomly selected
-the strata are supposed to ensure that the make-up of the population is adequately mirrored
simple random sample
selection process takes place randomly
-each member of the population has the same chance of being selected
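in Stata, a random subsample can be drawn from the data in memory (sketch; the percentage is arbitrary):
set seed 12345   // make the draw reproducible
sample 10        // keep a 10% random sample of observations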
survey weights
-used when the sample deviates from the actual population
-survey weights are estimated variables -> they even out the differences between sample and population
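a hedged sketch using a hypothetical weight variable wgt to produce weighted estimates:
svyset [pweight=wgt]
svy: mean income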