lecture 4 summary Flashcards
statistical data editing
is the process of checking observed data and correcting them if necessary
error localization
determines which values are erroneous
we recognize the following types of errors
Interviewer error: interviewers may not be giving the respondents the correct instructions
Omission: respondents often fail to answer a single question or a section of the questionnaire, either deliberately or inadvertently
Ambiguity: a response might not be legible, or it might be unclear
Inconsistencies: sometimes two responses can be logically inconsistent
Lack of cooperation: in a long questionnaire with hundreds of attitude questions, a respondent might rebel and check the same response in a long list of questions
Ineligible respondent: an inappropriate respondent may be included in the sample (e.g. underage respondents)
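As a rough illustration of error localization, the sketch below checks a few records against simple edit rules and reports which fields look erroneous. The records, the field names, the minimum age of 18, and the rule that non-employed respondents should report zero working hours are all assumptions made up for this example.

```python
# Hypothetical survey records; field names and values are invented for illustration.
records = [
    {"id": 1, "age": 34, "employed": "yes", "weekly_hours": 40},
    {"id": 2, "age": 15, "employed": "no", "weekly_hours": 0},     # underage respondent
    {"id": 3, "age": 52, "employed": "no", "weekly_hours": 38},    # contradictory answers
    {"id": 4, "age": 41, "employed": None, "weekly_hours": None},  # unanswered questions
]


def localize_errors(record, min_age=18):
    """Return a list of (field, problem) pairs for one record."""
    problems = []
    # Omission: any field without an answer.
    for field, value in record.items():
        if value is None:
            problems.append((field, "omission"))
    # Ineligible respondent: assumed eligibility rule (age >= min_age).
    age = record.get("age")
    if age is not None and age < min_age:
        problems.append(("age", "ineligible respondent"))
    # Inconsistency: assumed rule that non-employed respondents work 0 hours.
    if record.get("employed") == "no" and (record.get("weekly_hours") or 0) > 0:
        problems.append(("weekly_hours", "inconsistency"))
    return problems


report = {r["id"]: localize_errors(r) for r in records}
for record_id, problems in report.items():
    print(record_id, problems)
```

Real editing systems use far richer rule sets; the point here is only that each error type can be expressed as a check that names the suspicious field.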
Data coding
is specifying how the information should be categorized to facilitate the analysis. The main purpose is to transform the data into a form suitable for the analysis
Data matching
is the task of identifying, matching and merging records that correspond to the same entities from several databases or even within one database
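A minimal sketch of data matching, assuming two small hypothetical tables (`crm` and `orders`) that share only a name field. Names are normalized before comparison so that trivial spelling differences do not prevent a match; real record linkage usually needs fuzzier comparison than this.

```python
def normalize(name):
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(name.lower().split())


# Hypothetical source tables, invented for illustration.
crm = [
    {"name": "Anna  Schmidt", "city": "Berlin"},
    {"name": "Ben Meyer", "city": "Hamburg"},
]
orders = [
    {"name": "anna schmidt", "total": 120.0},
    {"name": "Carla Weber", "total": 45.5},
]

# Index one table by the normalized key, then look up each record of the other.
index = {normalize(r["name"]): r for r in crm}
merged = []
for order in orders:
    match = index.get(normalize(order["name"]))
    if match is not None:
        # Merge the two records that refer to the same entity.
        merged.append({**match, **order})

print(merged)
```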
Data imputation
is the process of estimating missing data and filling these values into the dataset
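One simple way to estimate missing values is mean imputation, sketched below on a hypothetical list of incomes where `None` marks a missing answer. This is only one of many imputation strategies and is chosen here for brevity.

```python
from statistics import mean

# Hypothetical income data; None marks a missing value.
incomes = [2500, None, 3100, 2800, None, 3600]

# Estimate the missing values from the observed ones.
observed = [x for x in incomes if x is not None]
fill_value = mean(observed)

# Fill the estimate into the dataset.
imputed = [fill_value if x is None else x for x in incomes]
print(imputed)
```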
Data adjusting
refers to the process of enhancing the quality of the data for the data analysis
Weighting
is the procedure by which each observation in the database is assigned a number according to some pre-specified rule
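As an example of such a pre-specified rule, the sketch below assumes the weight of a group is its population share divided by its sample share, so that an under-represented group counts more. The group labels, counts, and shares are hypothetical.

```python
# Hypothetical sample composition and known population shares.
sample_counts = {"female": 30, "male": 70}
population_share = {"female": 0.5, "male": 0.5}

n = sum(sample_counts.values())

# Assumed rule: weight = population share / sample share.
weights = {
    group: population_share[group] / (sample_counts[group] / n)
    for group in sample_counts
}

# After weighting, each group contributes according to its population share.
weighted_counts = {g: weights[g] * sample_counts[g] for g in sample_counts}
print(weights)
print(weighted_counts)
```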
Variable re-specification
is the procedure in which the existing data are modified to create new variables, or in which a large number of variables are reduced into fewer variables
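Both directions of re-specification can be sketched on one hypothetical respondent: three rating items are reduced to a single index, and a categorical variable is turned into new 0/1 dummy variables. The item names and education levels are assumptions for the example.

```python
# Hypothetical respondent record.
respondent = {"q1": 4, "q2": 5, "q3": 3, "education": "university"}

# Reduce several variables into one: average of the three rating items.
items = ["q1", "q2", "q3"]
satisfaction_index = sum(respondent[i] for i in items) / len(items)

# Create new variables from an existing one: 0/1 dummy coding.
levels = ["school", "university"]
dummies = {
    f"education_{level}": int(respondent["education"] == level)
    for level in levels
}

print(satisfaction_index)
print(dummies)
```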
scale transformation
is the procedure to adjust the scale to ensure comparability with other scales
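A common scale transformation is standardization (z-scores), sketched here under the assumption that two hypothetical rating scales, one with 5 points and one with 10 points, should be made comparable. After the transformation both scales have mean 0 and standard deviation 1.

```python
from statistics import mean, pstdev


def standardize(values):
    """Convert values to z-scores: subtract the mean, divide by the standard deviation."""
    m = mean(values)
    s = pstdev(values)  # population standard deviation
    return [(v - m) / s for v in values]


# Hypothetical answers on two differently scaled questions.
ratings_5pt = [1, 2, 3, 4, 5]
ratings_10pt = [2, 4, 6, 8, 10]

z5 = standardize(ratings_5pt)
z10 = standardize(ratings_10pt)
print(z5)
print(z10)
```

In this constructed case the two z-score lists coincide, because the second scale is just the first one doubled.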
The mode
is the value in a measurement series (category) with maximum frequency (multiple mode values are possible)
Median
is the value that lies in the middle of a frequency distribution (same number of instances above and below the median)
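Both measures are available in Python's standard `statistics` module. The small data series below is made up and deliberately has two modes.

```python
from statistics import median, multimode

# Hypothetical measurement series with two equally frequent values.
values = [1, 2, 2, 3, 3, 4, 9]

modes = multimode(values)  # all values with maximum frequency
middle = median(values)    # value in the middle of the sorted series

print(modes)   # [2, 3]
print(middle)  # 3
```

Three values lie below the median (1, 2, 2) and three above it (3, 4, 9), and the outlier 9 does not move it.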
discrete distributions
such as binomial distributions, Poisson distributions, and multinomial distributions.
Continuous distributions
such as normal distributions, log-normal distributions, t-distributions and F-distributions
A positive correlation and a negative correlation reflect
A positive correlation reflects a tendency for a high value in one variable to be associated with a high value in a second variable.
A negative correlation reflects an association between a high value in one variable and a low value in a second variable
Correlation analysis
is a measurement of the strength of the linear association between two metrically scaled variables. Values are comparable across different variables because the correlation coefficient is restricted to the interval [-1, 1]. It has the following limitations:
Only linear correlations can be depicted
No sufficient evidence for the presence of a causal relationship
Strength of the correlation in the sense of a leverage effect cannot be identified
Spurious association is possible if background variables are not controlled for
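Two of these limitations can be seen in a small sketch, assuming the Pearson correlation coefficient is the measure in question. The helper `pearson_r` and the data are written for this example only.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long number lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


x = [-2, -1, 0, 1, 2]

# Perfect linear relation: r is (approximately) 1.
r_linear = pearson_r(x, [2 * v + 1 for v in x])

# Perfect but non-linear relation (y = x^2): r is 0, the dependence is missed.
r_quadratic = pearson_r(x, [v * v for v in x])

# Much steeper slope, same r: the size of the effect is not captured.
r_steep = pearson_r(x, [10 * v for v in x])

print(r_linear, r_quadratic, r_steep)
```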
Causality means
that a change in one variable will produce a change in another. If we can claim for sure that x causes y, we can talk about a causal relationship. If there are theoretical reasons why a different variable such as z causes a change in both x and y, we need to control for this variable (e.g. strategy to increase ice cream prices), otherwise we cannot claim causality. Furthermore, we need to make sure that x causes y and not that y causes x. So the first approach to determine the direction of causation is to draw from logic and previous theories. Theory always comes first. The second approach to determine the direction of causation is to consider that there is usually a time lag between cause and effect, and so if such a time lag can be postulated, a causal relationship can be identified.
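One way to control for a background variable z is the first-order partial correlation, which removes the part of the x-y correlation that runs through z. This technique is brought in here as an illustration and is not named in the lecture; the data are constructed so that x and y are related only through z.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long number lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def partial_r(x, y, z):
    """Correlation of x and y after controlling for z."""
    r_xy, r_xz, r_yz = pearson_r(x, y), pearson_r(x, z), pearson_r(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz**2) * (1 - r_yz**2))


# Constructed data: z drives both x and y; the noise terms are unrelated.
z = [-1, -1, 1, 1]
noise_x = [-1, 1, -1, 1]
noise_y = [-1, 1, 1, -1]
x = [a + b for a, b in zip(z, noise_x)]
y = [a + b for a, b in zip(z, noise_y)]

r_xy = pearson_r(x, y)      # about 0.5: x and y look related
r_xy_z = partial_r(x, y, z)  # about 0: the association vanishes once z is controlled

print(r_xy, r_xy_z)
```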
There are three ways to identify a causal relationship
There exists theoretical evidence for a strong association, or correlation, between two variables
A change in the cause variable precedes the change in the result variable
Evidence that no rival explanation (another correlated parameter) exists for the observed association of the variables
Experiment
features a formulation of a causal relationship (hypothesis). It is an evaluation of the directional influence of one or more independent variables on one or more dependent variables
Experimental group
test subjects who are exposed to the experimental stimulus, e.g. a new advertisement
Control group
test subjects who are not exposed to the experimental stimulus
Randomizing
random assignment of test subjects to experimental / control groups
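Random assignment can be sketched as shuffling the list of test subjects and splitting it in half. The subject labels and the `randomize` helper are hypothetical; the seed is only there to make the example reproducible.

```python
import random


def randomize(subjects, seed=None):
    """Randomly split subjects into an experimental and a control group."""
    rng = random.Random(seed)   # local generator, does not touch global state
    shuffled = list(subjects)   # copy so the input list stays unchanged
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical test subjects S01 ... S10.
subjects = [f"S{i:02d}" for i in range(1, 11)]
experimental, control = randomize(subjects, seed=42)

print("experimental:", experimental)
print("control:", control)
```

Because chance alone decides the group, known and unknown subject characteristics tend to balance out between the groups as the sample grows.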
Matching
test subjects in experimental and control groups share specific criteria
stimulus
variation of a variable that should trigger a behavioural reaction in people
Laboratory experiment
is a performance of the experiment in an artificial (laboratory) environment. Test subjects are aware that they are participating in a test
Field experiment
is a performance of the experiment in a natural environment. Test subjects are not aware that they are part of an experiment
limits of correlation analysis
only linear relations can be depicted
no evidence for causal relationship
strength of the correlation in the sense of a leverage effect cannot be identified
spurious association is possible if background variables are not accounted for