L7- Evaluating interactive systems Flashcards
What is the difference between formative and summative evaluation?
Formative evaluation is used in the early stages of a project to compare, assess and refine design ideas.
Formative evaluation often involves OPEN research questions where the researcher is interested in learning further information that may inform the design
Summative evaluation is more likely to be used in the later stages of a project and involves CLOSED research questions, with the purpose of testing and evaluating systems according to predefined criteria
What’s the difference between analytical and empirical evaluation methods?
Analytical: based on applying a theory to the analysis and discussion of the design, in the absence of real-world users
Empirical: making observations and measurements of users
What’s the difference between quantitative and qualitative evaluation?
Numbers versus words/pictures/audio/video
Why are analytical methods useful for formative evaluation?
Analytical methods are useful for formative evaluation, because if the system design has not yet been completed, it may be difficult to observe how it is used (although low fidelity prototypes can be helpful here)
Give some examples of qualitative analytic methods
Qualitative analytic methods include cognitive walkthrough (useful for closed research questions) and the cognitive dimensions of notations framework (useful for open research questions).
Give examples of quantitative analytic methods
The Keystroke Level Model is a quantitative analytic method, which can be used to create numerical comparisons of closed research questions.
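As a rough illustration, a KLM prediction is just the sum of standard operator times for a sequence of primitive actions. The operator values below are the commonly quoted KLM estimates, and the two example sequences are invented for illustration, not taken from the lecture:

```python
# Keystroke-Level Model sketch: predicted task time = sum of operator times.
# Operator times are the commonly quoted KLM estimates (assumed, not measured here).
KLM_TIMES = {
    "K": 0.20,  # press a key (skilled typist)
    "B": 0.10,  # press or release a mouse button
    "P": 1.10,  # point at a target with the mouse
    "H": 0.40,  # move hands between keyboard and mouse
    "M": 1.35,  # mental preparation
}

def klm_time(operators: str) -> float:
    """Predict execution time for a sequence of KLM operators."""
    return sum(KLM_TIMES[op] for op in operators)

# Hypothetical comparison of two designs for the same command:
menu_design = "MHPBB"    # think, home to mouse, point at menu item, click
shortcut_design = "MKK"  # think, press a two-key shortcut
print(klm_time(menu_design), klm_time(shortcut_design))
```

Such numerical predictions allow closed comparisons between design alternatives without recruiting any users.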
Give examples of qualitative empirical methods
Think-aloud protocols, interviews, and field observation (ethnographic approaches)
They are usually associated with open research questions, where the objective is to learn new information relevant to system design or use
Give examples of quantitative empirical methods
Quantitative empirical methods generally require a working system, so are most often summative
Examples include the use of analytics and metrics in A/B experiments, and also controlled laboratory trials
Explain how to run RCTs
Decide on a performance measure
Find a representative sample of the target population (who have given informed consent to participate)
Randomly assign participants to treatment and control conditions (the random allocation is what makes it a randomised controlled trial)
Find an experimental task that can be used to collect performance data
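The allocation step of the procedure above can be sketched as follows; the participant IDs, group names, and fixed seed are illustrative, not part of any real protocol:

```python
import random

def assign_conditions(participants, seed=0):
    """Randomly split participants into treatment and control groups.

    A fixed seed makes the allocation reproducible for reporting; the even
    split is a simplification of real randomisation schemes.
    """
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"treatment": shuffled[:half], "control": shuffled[half:]}

# Twelve hypothetical participants, six per condition:
groups = assign_conditions([f"P{i:02d}" for i in range(1, 13)])
print(groups)
```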
How might we measure the results of an RCT?
Effect size – impact on the mean performance
Measure correlation with factors that might improve performance
Report significance measures to check whether the observed effects might have resulted from random variation or from factors other than the treatment
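A minimal sketch of effect size and significance using only the standard library; the task-completion times are made-up illustrative data, and the z-test is a large-sample approximation (a real analysis would typically use a t-test such as scipy.stats.ttest_ind):

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def cohens_d(treatment, control):
    """Effect size: standardised difference between the group means."""
    n1, n2 = len(treatment), len(control)
    pooled_sd = sqrt(((n1 - 1) * stdev(treatment) ** 2 +
                      (n2 - 1) * stdev(control) ** 2) / (n1 + n2 - 2))
    return (mean(treatment) - mean(control)) / pooled_sd

def approx_p_value(treatment, control):
    """Two-sided z-test on the difference in means (large-sample approximation)."""
    se = sqrt(stdev(treatment) ** 2 / len(treatment) +
              stdev(control) ** 2 / len(control))
    z = (mean(treatment) - mean(control)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Illustrative task-completion times (seconds) for two interface variants:
a = [41, 38, 45, 40, 37, 43, 39, 42]  # treatment
b = [48, 50, 44, 47, 52, 46, 49, 45]  # control
print(cohens_d(a, b), approx_p_value(a, b))
```

Here a negative d means the treatment group completed the task faster; a small p-value suggests the difference is unlikely to be random variation alone.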
What problems are associated with RCTs?
Overcoming natural variation needs large samples
RCTs don’t provide understanding of why a change occurred
This means that it is hard to know whether the effect will generalise (for example to commercial contexts)
If there are many relevant variables that are orthogonal to each other, such as different product features or design options, many separate experiments may be required to distinguish their effects and interactions
Thus RCTs aren’t often used for design research in commercial products
A more justifiable performance measure is profit maximisation, but sales/profit are often hard to measure with useful latency
Companies therefore tend to use PROXY MEASURES such as the number of days that customers continue to use the product actively
What is internal validity?
Was the study done right?
Reproducibility
Scientific integrity
Refutability
What is external validity?
Does the study tell us useful things?
Focusses on whether the results can be generalised to real-world situations, including factors such as the representativeness of the sample population, the experimental task and the application context
Describe two ways of analysing qualitative data
While we can use statistical comparison of quantitative measures from controlled experiments, interviews and field studies require analysis of qualitative data
Qualitative data is often recorded and transcribed as written text, so the analysis can proceed using a reproducible scientific method
What is categorical coding, and how do you do it?
Categorical coding is a qualitative data analysis method that can be used to answer ‘closed’ questions, for example, comparing different groups of people or users of different products
The first step is to create a “coding frame” of expected categories of interest
The text data is then segmented (for example on phrase boundaries)
Each segment is assigned to one category, so that frequency and correspondence can be compared
In a scientific context, categorical coding should incorporate some assessment of inter-rater reliability, where two or more people make the coding decisions independently to avoid systematic bias or misinterpretation
Compare how many decisions agree, relative to chance, using a statistical measure such as Cohen’s Kappa for two raters or Fleiss’ Kappa for more, and compare against typical levels (0.6–0.8 is considered substantial agreement)
Inter-rater reliability may take account of how many decisions still disagreed after discussion, which may involve refining and iterating the coding frame to resolve decision criteria
It is often useful to ‘prototype’ the coding frame by having the independent raters discuss a sample before proceeding to code the main corpus
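A sketch of the Cohen’s Kappa calculation for two raters; the coding frame categories and the segment codes below are invented for illustration:

```python
from collections import Counter

def cohens_kappa(codes_a, codes_b):
    """Agreement between two raters' category codes, corrected for chance."""
    assert len(codes_a) == len(codes_b)
    n = len(codes_a)
    # Proportion of segments where the raters chose the same category:
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    # Agreement expected by chance, from each rater's category frequencies:
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    expected = sum((freq_a[c] / n) * (freq_b[c] / n)
                   for c in set(freq_a) | set(freq_b))
    return (observed - expected) / (1 - expected)

# Two raters coding ten segments with a three-category frame (invented data):
rater1 = ["usability", "feature", "usability", "bug", "feature",
          "usability", "bug", "feature", "usability", "bug"]
rater2 = ["usability", "feature", "feature", "bug", "feature",
          "usability", "bug", "usability", "usability", "bug"]
print(cohens_kappa(rater1, rater2))
```

For this invented data the score falls in the 0.6–0.8 band described above as substantial agreement.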