L7- Evaluating interactive systems Flashcards
What is the difference between formative and summative evaluation?
Formative evaluation is used in the early stages of a project to compare, assess and refine design ideas.
Formative evaluation often involves OPEN research questions where the researcher is interested in learning further information that may inform the design
Summative evaluation is more likely to be used in the later stages of a project and involves CLOSED research questions, with the purpose of testing and evaluating systems against predefined criteria
What’s the difference between analytical and empirical evaluation methods?
Analytical: based on applying a theory to analysis and discussion of the design, in the absence of real-world users
Empirical: making observations and measurements of users
What’s the difference between quantitative and qualitative evaluation?
Quantitative evaluation produces numbers; qualitative evaluation produces words, pictures, audio or video
Why are analytical methods useful for formative evaluation?
Analytical methods are useful for formative evaluation because, if the system design is not yet complete, it may be difficult to observe how it is used (although low-fidelity prototypes can be helpful here)
Give some examples of qualitative analytic methods
Qualitative analytic methods include the cognitive walkthrough (useful for closed research questions) and the Cognitive Dimensions of Notations framework (useful for open research questions)
Give examples of quantitative analytic methods
The Keystroke Level Model (KLM) is a quantitative analytic method that can be used to create numerical comparisons for closed research questions
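As an illustration, a KLM prediction is just the sum of standard operator times (values from Card, Moran & Newell); the task breakdowns below are hypothetical examples, not from the notes. A minimal sketch:

```python
# Standard KLM operator times in seconds (Card, Moran & Newell);
# the two task encodings below are hypothetical examples.
OPERATORS = {
    "K": 0.2,   # press a key (average skilled typist)
    "P": 1.1,   # point at a target with the mouse
    "B": 0.1,   # press or release a mouse button
    "H": 0.4,   # move hands between keyboard and mouse
    "M": 1.35,  # mental preparation
}

def klm_time(sequence: str) -> float:
    """Predicted expert completion time for a sequence of KLM operators."""
    return sum(OPERATORS[op] for op in sequence)

# Compare two hypothetical designs for issuing the same command:
menu_design = "MPBPB"    # think, point at menu, click, point at item, click
shortcut_design = "MKK"  # think, press a two-key shortcut
print(klm_time(menu_design), klm_time(shortcut_design))
```

Such predictions give numerical answers to closed questions like "which design is faster for an expert user?" without needing empirical trials.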
Give examples of qualitative empirical methods
Think-aloud studies, interviews, and field observation (ethnographic approaches)
They are usually associated with open research questions, where the objective is to learn new information relevant to system design or use
Give examples of quantitative empirical methods
Quantitative empirical methods generally require a working system, so are most often summative
Examples include the use of analytics and metrics in A/B experiments, and also controlled laboratory trials
Explain how to run randomised controlled trials (RCTs)
Decide on a performance measure
Find a representative sample of the target population (who have given informed consent to participate)
Randomly assign participants to the control and treatment conditions
Find an experimental task that can be used to collect performance data
How might we measure the results of an RCT?
Effect size: the impact of the treatment on mean performance
Measure correlation with factors that might improve performance
Report significance measures to check whether the observed effects might have resulted from random variation, or from factors other than the treatment
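For example, Cohen's d is one common effect-size measure: the difference between group means divided by the pooled standard deviation. A minimal sketch with made-up task-time data (a significance test, e.g. a t-test, would then check whether the difference could plausibly be due to random variation):

```python
# Sketch of RCT effect-size analysis; the performance scores
# (e.g. task completion times) are made up for illustration.
from statistics import mean, stdev
from math import sqrt

control = [12.1, 11.4, 13.0, 12.7, 11.9, 12.4, 13.2, 12.0]
treatment = [10.8, 11.2, 10.5, 11.6, 10.9, 11.1, 10.4, 11.3]

# Cohen's d: difference in means relative to pooled standard deviation.
n1, n2 = len(control), len(treatment)
pooled_sd = sqrt(((n1 - 1) * stdev(control) ** 2 +
                  (n2 - 1) * stdev(treatment) ** 2) / (n1 + n2 - 2))
d = (mean(control) - mean(treatment)) / pooled_sd
print(round(d, 2))  # a large effect by conventional thresholds
```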
What problems are associated with RCTs?
Large samples are needed to overcome natural variation
RCTs don’t provide understanding of why a change occurred
This means that it is hard to know whether the effect will generalise (for example to commercial contexts)
If there are many relevant variables that are orthogonal to each other, such as different product features or design options, many separate experiments might therefore be required to distinguish between their effects and interactions
Thus RCTs aren’t often used for design research in commercial products
A more justifiable performance measure is profit maximisation, but sales/profit are often hard to measure with useful latency
Companies therefore tend to use PROXY MEASURES such as the number of days that customers continue actively to use the product
What is internal validity?
Was the study done right?
Reproducibility
Scientific integrity
Refutability
What is external validity?
Does the study tell us useful things?
External validity focuses on whether the results generalise to real-world situations, including factors such as the representativeness of the sample population, the experimental task and the application context
Describe two ways of analysing qualitative data
While we can use statistical comparison of quantitative measures from controlled experiments, interviews and field studies require analysis of qualitative data
Qualitative data is often recorded and transcribed as written text, so the analysis can proceed using a reproducible scientific method
What is categorical coding, and how do you do it?
Categorical coding is a qualitative data analysis method that can be used to answer ‘closed’ questions, for example, comparing different groups of people or users of different products
The first step is to create a “coding frame” of expected categories of interest
The text data is then segmented (for example on phrase boundaries)
Each segment is assigned to one category, so that frequency and correspondence can be compared
In a scientific context, categorical coding should incorporate some assessment of inter-rater reliability, where two or more people make the coding decisions independently to avoid systematic bias or misinterpretation
Compare how many decisions agree, relative to chance, using a statistical measure such as Cohen’s kappa for two raters or Fleiss’ kappa for more, and compare the result to typical levels (0.6 - 0.8 is considered substantial agreement)
Inter-rater reliability may take account of how many decisions still disagreed after discussion, which may involve refining and iterating the coding frame to resolve decision criteria
It is often useful to ‘prototype’ the coding frame by having the independent raters discuss a sample before proceeding to code the main corpus
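As an illustration, Cohen's kappa compares the observed agreement between two raters with the agreement expected by chance from each rater's category frequencies. A minimal pure-Python sketch, using a hypothetical coding frame and six made-up transcript segments:

```python
# Cohen's kappa for two raters' independent categorical codings.
# The coding frame ("nav", "error", "help") and the two codings
# below are hypothetical examples.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is chance agreement from the raters' marginal frequencies."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of segments coded identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's category frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / n ** 2
    return (p_o - p_e) / (1 - p_e)

a = ["nav", "nav", "error", "help", "nav", "error"]
b = ["nav", "error", "error", "help", "nav", "nav"]
print(round(cohens_kappa(a, b), 3))
```

A value this far below 0.6 would suggest the raters should discuss their disagreements and refine the coding frame before coding the main corpus.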
What is grounded theory?
Qualitative data analysis method that can be used to explore open questions where there is no prior expectation or theoretical assumption of the insights that the researcher is looking for
First step: read the data closely, looking for interesting categories (‘open coding’)
The researcher then collects fragments, writing ‘memos’ to capture insights as they occur
Emerging themes are organised using ‘axial coding’ across different sources of evidence
It is important to constantly compare memos, themes and findings to the original data in order to ensure that these can be objectively justified
The process ends when the theoretical description has reached ‘saturation’ in relation to the original data, with the main themes complete and accounted for
Explain how to get ethical clearance
Inform the ethics committee before you collect any data or recruit any participants
Describe the study, who will participate, what you will ask them to do, what data you will collect
What precautions are being taken, as appropriate to the nature of the research, including the approach taken to informed consent, and whether participants will be anonymous
What are three analytical evaluation options?
Cognitive walkthrough
KLM/GOMS
Cognitive Dimensions
When would you use cognitive walkthrough?
Cognitive Walkthrough is normally used in formative contexts: if you do have a working system, then why aren’t you observing a real user, which is far more informative than simulating or imagining one? However, Cognitive Walkthrough can be a valuable time-saving precaution before user studies start, to fix blatant usability bugs.
When would you use KLM/GOMS?
KLM/GOMS: It is unlikely that you’ll have alternative detailed UI designs in advance, so there is not much to be learned from using these methods in the context of a Part II project. If you do have a working system, a controlled observation is superior
When would you use Cognitive Dimensions?
Cognitive Dimensions is better suited to less structured tasks than Cognitive Walkthrough and KLM/GOMS, which rely on predefined user goals and task structure
What empirical approaches could you choose from?
Interviews/ethnography
Think-aloud / Wizard of Oz
Controlled experiments
When would you collect data using interviews/ethnography?
Useful in formative/preparation phase where an open research method is helpful in developing design ideas or capturing user requirements
When would you use think-aloud/wizard of oz?
Valuable for both paper prototypes and working systems
Highly effective at uncovering usability bugs as long as the verbal protocol is analysed rigorously using qualitative methods
When would you use controlled experiments?
Can help to establish the engineering aspects of the work
Important to ensure you can measure the important attributes in a meaningful way (with both internal and external validity)
You need to test significance and report confidence intervals for the observed means and effect sizes
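For instance, a 95% confidence interval for an observed mean can be reported as mean ± t × standard error. A minimal sketch with made-up scores, assuming approximately normal data (2.365 is the standard two-sided t critical value for 7 degrees of freedom):

```python
# 95% confidence interval for an observed mean; the sample of
# performance scores is made up for illustration.
from statistics import mean, stdev
from math import sqrt

scores = [7.2, 6.8, 7.9, 7.1, 6.5, 7.4, 7.0, 7.6]
n = len(scores)
se = stdev(scores) / sqrt(n)  # standard error of the mean
t_crit = 2.365                # t critical value, df = 7, 95% two-sided
lo, hi = mean(scores) - t_crit * se, mean(scores) + t_crit * se
print(round(lo, 2), round(hi, 2))
```

Reporting the interval, rather than the mean alone, shows readers how much the estimate could vary under random sampling.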
When would you use surveys and informal questionnaires?
Be clear what you are measuring
Is self-reporting likely to be accurate?
Use a mix of open questions, which capture richer qualitative information, and closed questions that make it easier to aggregate and test hypotheses
Open questions require a coding frame to structure and compare data, or grounded theory methods (if you have a broader research question)
Collecting survey data via interviews is likely to give more insight, but questionnaires are faster, so you can collect data from a larger sample
Remember to test questionnaires with a pilot study: they are easier to get wrong than interviews
When would you use field testing?
If a working product exists it may be possible to make a controlled release and collect data on how it is used
Make a risk assessment
Seek ethics approval before proceeding
When would you use standardised survey instruments?
These are standard psychometric instruments that evaluate mental states such as fatigue, stress, confusion and emotion
There are also standard methods to assess individual differences (e.g. personality, intelligence)
Use standardised approaches wherever possible, so your results can be compared to existing scientific literature
Making changes to these standardised surveys generally invalidates the results
What are some bad evaluation techniques?
Don’t use purely affective reports
Don’t ask a biased group (e.g. your friends): this introduces experimental demand effects
Don’t make claims that sound as though they result from a formative analytic process but are actually subjective
Don’t use introspective reports made by a single subject – might be biased and subjective
How would you evaluate a non-HCI project?
Approach testing as a scientific exercise
Define goals and hypotheses and understand the boundaries and performance limits of your system by exploring them
Keep in mind that it’s often necessary to test to the point of failure so that you can make comparisons or explain limits
For non-interactive projects, still necessary to decide whether evaluation should be analytic (proceeding by reasoning and argument, in which case you should ask how consistent and well-structured is your analytic framework) OR empirical (proceeding by measurement/observation, in which case you should ask what you are measuring and why, and ensure that you have achieved scientific validity, where the measurements are compatible with your claims).
All projects can include a mix of formative and summative evaluation
If you only evaluate formatively – did you finish your project?
If carrying out summative evaluation, be clear whether the evaluation criteria are internal (derived from some theory) or external (addressing some problem)
Need to establish objectivity of qualitative data (i.e. that it isn’t simply your own opinion).