Observation Flashcards
two basic classes / points in time for studies and evaluation
- formative:
at the beginning, to inform about the context and to study possible options
- summative:
to judge the impact of an HCI design
(a summative evaluation of one design might be a formative one for the next step)
why, what, where and when to evaluate
why: the study question (check users' requirements, that they can use the product, and that they like it)
what:
a conceptual model, early prototypes of a new system and later, more complete prototypes, human behaviour…
where:
in natural and laboratory settings
when:
* formative: throughout design;
* summative: finished products can be evaluated to collect information to inform new products
three classes of measures
user effectiveness
user efficiency
user satisfaction
evaluation classes
- setting
- evaluation time
- evaluation partner
- result type
controlled settings
- setting conditions are controlled
- non-controllable conditions are measured
- e.g. lab experiments, living labs
natural settings
- study in ‘everyday’ and natural conditions that cannot be controlled
- some, but not all non-controllable conditions can be measured
- e.g. field studies, in-the-wild studies
types of evaluation time
inspective:
* inspection / evaluation during the run of an experiment or during use
retrospective:
* evaluation after the run of the experiment or after use
short term: short session
long term: long session
evaluation partners
the user:
- gives direct feedback e.g. for use
- best for gaining new insights into context
- if it's an experiment: called “subject”
the expert:
- allows for best practice information
- can report experience that would otherwise require many users / test subjects to collect
Result types
subjective:
* results cannot be directly compared between subjects
objective:
* results can be directly compared between subjects
quantitative:
* results are numbers
qualitative:
* results are text
Interviews - Five key issues
- setting goals:
decide how to analyze the data once collected
- identifying participants:
decide who to gather data from
- relationship with participants:
clear and professional; informed consent when appropriate
- triangulation:
look at the data from more than one perspective;
collect more than one type of data, e.g. quantitative from experiments and qualitative from interviews
- pilot studies:
small trial of the main study
Data recording
- notes, audio, video, photographs can be used individually or in combination
- always capture a visual impression
- different challenges and advantages with each combination
three types of interviews
structured interviews
- pre-developed questions
- strictly following the pre-set wording
- easy to carry out, but limited to the question set
- results can be evaluated more precisely
semi-structured interviews
* structured part + ‘open’ questions
unstructured interviews
- used when little background information available
- minimizes the influence of the questioner
Running the interview - structure
Introduction - introduce yourself, explain the goals of the interview, reassure about the ethical issues, ask to record, present the informed consent form
warm-up - make first questions easy and non-threatening
main body - present questions in a logical order
a cool-off period - include a few easy questions to defuse tension at the end
closure - thank interviewee, signal the end, e.g. switch off the recorder
encouraging a good response
- make sure purpose of study is clear
- promise anonymity
- ensure questionnaire is well designed
- follow up with emails, phone calls, letters
- provide an incentive
- 40% response rate is good, 20% is often acceptable
Standard questionnaires used in HCI
SUS - system usability scale
TLX - NASA task load index
QUIS - Questionnaire for User Interface Satisfaction
CSUQ - Computer System Usability Questionnaire
SUS - benefits and restrictions
+ very easy to scale (Likert)
+ useful with small sample sizes, still gives o.k. results
+ validity o.k. (you see differences between bad and good designs)
- score range 0-100 invites a false association with percentages
- not diagnostic, only classifies
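As an illustration, a minimal sketch of the standard SUS scoring rule (10 items on a 1-5 Likert scale; odd items are positively worded, even items negatively worded):

```python
def sus_score(ratings):
    """Compute the System Usability Scale score (0-100).

    ratings: the 10 Likert ratings (1-5), item 1 first.
    Odd items are positively worded: contribution = rating - 1.
    Even items are negatively worded: contribution = 5 - rating.
    The summed contributions (0-40) are scaled by 2.5 to 0-100.
    """
    assert len(ratings) == 10 and all(1 <= r <= 5 for r in ratings)
    total = sum(r - 1 if i % 2 == 0 else 5 - r  # i=0 is item 1 (odd)
                for i, r in enumerate(ratings))
    return 2.5 * total

print(sus_score([4, 2, 5, 1, 4, 2, 4, 2, 5, 1]))  # -> 85.0
```

Note how a result of 85 is not “85% usable”; it only places the system relative to other SUS results, matching the restriction above.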
problems with online questionnaires
- sampling is problematic if population size is unknown
- preventing individuals from responding more than once can be a problem
- individuals have also been known to change questions in email questionnaires
Types of observation
direct observation in the field
- structuring frameworks
- degree of participation
- ethnography
direct observation in controlled environments
indirect observation: tracking users’ activities
- diaries, experience sampling method
- interaction logging
- video and photographs collected remotely by drones or other equipment
Planning and conducting observation in the field
- decide on how involved you will be: passive observer to active participant
- how to gain acceptance
- how to handle sensitive topics, e.g. culture, private spaces, etc.
- how to collect the data:
  - what data to collect
  - what equipment to use
  - when to stop observing
Ethnography
Goal: to experience the participants and their context
Ethnographers immerse themselves in the culture that they study
analyzing video and data logs can be time-consuming
collections of comments, incidents and artifacts are made
co-operation of people being observed is required
informants are useful
data analysis is continuous
interpretivist technique
questions get refined as understanding grows
reports usually contain examples
online ethnography
interactions online differ from face-to-face interaction
virtual worlds have persistence that physical worlds do not have
ethical considerations and presentations of results are different
observations and materials that might be collected
- activity or job descriptions
- rules and procedures
- descriptions of activities
- recordings
- informal interviews
- diagrams (of the physical layout,…)
- photographs, videos, workflow diagrams, process maps, …
observation in a controlled environment
direct observation
- think aloud techniques
- often used in conjunction with other techniques such as interviews and questionnaires
indirect observation
- diaries
- interaction logs
- web analytics
video, audio, photos, notes are used to capture data in both types of observation
Think Aloud
While using an application, the user constantly explains what they are thinking and what they are doing
Quality of the evaluation depends on
- selection of test candidates
- appropriate preparation of the candidates
- appropriate setting so that a natural usage can be guaranteed
Think aloud preparation
- explain the system
- explain the setting
- explain expectation
- using the scenarios prepared earlier, write a draft list of tasks
- try out the tasks and estimate how long they will take a participant to complete
- prepare a task sheet for the participants
- get ready for the test session
- tell the participants that it is the system that is under test, not them; explain and introduce tasks
- participants start the tasks. Have them give you running commentary on what they are doing, why they are doing it and difficulties or uncertainties they encounter
- encourage participants to keep talking
- When the participants have finished, interview them briefly about the usability of the prototype and the session itself. Thank them
- write up your notes as soon as possible and incorporate into a usability report
Think aloud evaluation
- mostly qualitative and subjective
- ethnographic; delivers to-the-point experience for specific issues / problems
- generalizations are very difficult and require a high level of experience
- interpretations can be based on various psychological theories and models
Living labs
- People’s use of technology in their everyday lives can be evaluated in living labs
- such evaluations are difficult to do in a usability lab
Ubicomp Studies
- are field studies, not lab studies
- in situ: results include measurements of the context
- context and situation are not controlled
- such studies are more expensive
- but more likely to yield novel insights and experiences
- ubicomp studies require additional effort
- they normally also require e.g. control conditions, prestudies, calculation of the number of participants, selection of participants, data selection and statistics
3 main types of ubicomp field studies
study current behavior:
* what are people doing now?
proof-of-concept studies:
* does my technology function in the real world?
experience studies:
* how does using my prototype change people’s behaviour or allow them to do new things?
Wizard of Oz studies
good for proof of concept
a person simulates and controls the system from behind the scenes
- use mock interface and interact with users
- good for simulating system that would be difficult to build
Experience Studies
Surveys
- often used as a prestudy
- carried out after each change of condition in a between-subject study
- regular in-between surveys during a study measure changes in the participants’ reactions
Logging
* use the mobile device to also collect data about usage
Logging - design considerations
how will you use the logged data?
* select appropriate data to log (at the right frequency)
make a list of specific questions that you expect to answer from the log data
will your logging help you know if the study is going smoothly?
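A minimal sketch of such usage logging, assuming a JSON-lines log file (file name and event fields are illustrative, not from the source):

```python
import json
import time

LOG_PATH = "study_log.jsonl"  # hypothetical output file

def log_event(participant_id, event, **details):
    """Append one timestamped usage event as a JSON line."""
    record = {"t": time.time(), "pid": participant_id,
              "event": event, **details}
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(record) + "\n")

# log only what answers a question from your list, at the right frequency
log_event("P03", "page_view", page="/settings")
log_event("P03", "task_done", task=2, duration_s=41.7)
```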
Logging - web analytics
A system of tools and techniques for optimizing web usage by measuring, collecting, analyzing and reporting web data
typically focuses on the number of web visitors and page views
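For illustration, a small sketch that derives these two typical numbers from the JSON-lines log sketched above:

```python
import json

def web_stats(path="study_log.jsonl"):
    """Count page views and unique visitors in the event log."""
    views, visitors = 0, set()
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            if rec["event"] == "page_view":
                views += 1
                visitors.add(rec["pid"])
    return {"page_views": views, "unique_visitors": len(visitors)}
```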
Experience Sampling Methodology (ESM)
ESM is a study method using questionnaires
Participants are asked to fill out short questionnaires at various points throughout the day
You get a different picture than when participants recall their experiences later
Considerations:
- how often to ask the participant
- how many questions
- collect experience or sensor information
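A minimal sketch of an ESM prompt schedule reflecting the first consideration (how often to ask); the time window and minimum gap are illustrative assumptions:

```python
import random

def esm_schedule(n_prompts=6, start_h=9.0, end_h=21.0, min_gap_h=1.0):
    """Draw random questionnaire prompt times (in hours) for one day,
    keeping a minimum gap so prompts do not cluster."""
    while True:  # rejection sampling; cheap for these parameters
        times = sorted(random.uniform(start_h, end_h)
                       for _ in range(n_prompts))
        if all(b - a >= min_gap_h for a, b in zip(times, times[1:])):
            return [round(t, 1) for t in times]

print(esm_schedule())  # e.g. [9.4, 11.2, 13.9, 15.1, 17.6, 20.2]
```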
Study Design
For any study:
A: start with a concrete research question
B: answer the following questions:
- what will your participants do during the study?
- what data will you collect?
- how long will the study be?
steps to a successful study
- Have a clear research goal and question
- Create a study design document containing
* 1. Research question / hypothesis
* 2. Detailed participant profile
* 3. Detailed method description (what will participants do)
* 4. Detailed timeline description
* 5. Types of data you collect
* 6. Analysis method
* 7. How you draw conclusions / validate the hypothesis
How long should your study be
Depends on type of study
- experience studies (several weeks) are longer than proof-of-concept studies (several days)
- studies of current behaviour may range from hours to weeks
Depends on novelty
* usage of novel systems is often very different at the start (enthusiasm or scepticism) and after longer period of use
Practical considerations
- if the study requires much effort from the participants, you have to restrict the measurement time
- frequency of interaction with participants: high frequency means shorter measurement time
Frequency of use
* high frequency of use reduces the measurement time required
Things to consider when interpreting data
- Reliability
does the method produce the same results on separate occasions?
- Validity
does the method measure what it is intended to measure? (internal validity, external validity)
- Ecological validity
does the environment of the evaluation distort the results? Is the result transferable to a general environment?
- Biases
are there biases that distort the results?
- Scope
how generalizable are the results?
selecting participants
You have to answer three questions before you start:
- representation of participants to the intended user group
- grouping of participants
- data sampling strategy
Representation of study participants
- Representative Participant Set
- Non-Representative Participant Set
- Be careful: Many statistics assume a representative set
Grouping Participants
one group only or multiple groups
group selection based on
- self-reported experience
- frequency of use
- amount of experience
- demographics
- different activities the participants have to perform
Sampling Strategy
Random sampling
* everyone has equal probability of being selected as participant based on a list
Systematic Sampling
* Based on predefined criteria, e.g. every 10th person entering the ECE Center
Stratified sampling
* additionally, select people so that they reflect the distribution in your intended user group, e.g. ensure that your final set contains 50% male and 50% female participants
Samples of convenience
* volunteer-based; must be adjusted to the intended user group
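A minimal sketch of the first three strategies, assuming a list of candidates is available (illustrative only):

```python
import random

def random_sampling(population, n):
    """Random sampling: equal selection probability from a list."""
    return random.sample(population, n)

def systematic_sampling(population, k):
    """Systematic sampling: every k-th person, e.g. every 10th."""
    return population[::k]

def stratified_sampling(population, n, key):
    """Stratified sampling: draw from each stratum in proportion to
    its share of the list; adjust the quotas if your intended user
    group has a different distribution (e.g. 50% / 50%)."""
    strata = {}
    for person in population:
        strata.setdefault(key(person), []).append(person)
    sample = []
    for group in strata.values():
        quota = round(n * len(group) / len(population))
        sample += random.sample(group, min(quota, len(group)))
    return sample
```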
Sample Size
Depends on acceptable error!
- Major problems can be identified by 3-4 people.
- Early-stage designs require fewer participants
- But this is an oversimplification
–> Perform a pre-test in which participants first have to detect known usability issues;
calculate the average percentage of usability issues found across all participants.
This gives you the average detection rate per participant (see the sketch below).
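A hedged illustration of how such a pre-test result can be used, assuming the common 1-(1-p)^n discovery model (the model is an assumption, not stated in the source):

```python
import math

def participants_needed(p, target=0.85):
    """Smallest n such that the expected share of issues found,
    1 - (1 - p)**n, reaches the target share."""
    return math.ceil(math.log(1 - target) / math.log(1 - p))

# pre-test: five participants found 40%, 25%, 35%, 30%, 20% of the known issues
rates = [0.40, 0.25, 0.35, 0.30, 0.20]
p = sum(rates) / len(rates)       # average detection rate: 0.30
print(participants_needed(p))     # -> 6 participants for ~85% of the issues
```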
Test order
Participants learn fast - test order may have a significant influence on the outcome of the experiment
–> Counterbalance the order of tasks across participants (see the sketch below)
This is not necessary with unrelated tasks.
Sometimes it is impossible because tasks depend on each other.
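A minimal sketch of such counterbalancing using a simple cyclic Latin square (each task appears once in every position; illustrative, not from the source):

```python
def latin_square_orders(tasks):
    """Rotate the task list so each task appears exactly once in
    every position; assign participants to the rows in turn."""
    n = len(tasks)
    return [[tasks[(row + col) % n] for col in range(n)]
            for row in range(n)]

for pid, order in enumerate(latin_square_orders(["A", "B", "C", "D"]), 1):
    print(f"participant {pid}: {order}")
# participant 1: ['A', 'B', 'C', 'D']
# participant 2: ['B', 'C', 'D', 'A'], and so on
```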
Types of Evaluation without user
- Experts use their knowledge of users & technology to review software usability
- Expert critiques can be formal or informal
- Heuristic evaluation is a review guided by a set of heuristics
- Walkthroughs involve stepping through a pre-planned scenario noting potential problems.
Revised version of Nielsen’s original heuristics
- Visibility of system status
- Match between system and real world
- user control and freedom
- consistency and standards
- error prevention
- recognition rather than recall
- flexibility and efficiency of use
- aesthetic and minimalist design
- help users recognize, diagnose, and recover from errors
- help and documentation
3 stages for doing heuristic evaluation
- briefing session to tell experts what to do
- Evaluation period of 1-2 hours in which
* each expert works separately
* take one pass to get a feel for the product
* take a second pass to focus on specific features
- Debriefing session in which experts work together to prioritize problems
advantages & disadvantages of heuristic evaluation
+ few ethical problems - no users involved
+ few practical problems - no users involved
- can be difficult to find experts
- important problems may get missed
- many trivial problems are often identified
- experts have biases
Cognitive Walkthroughs
- focus on ease of learning and/or usage
- designer presents an aspect of the design & usage scenarios
- Expert is told the assumptions about user population, context of use, task details
- one or more experts walk through the design prototype with the scenario
- experts are guided by questions
cognitive walkthrough questions
- Will the correct action be sufficiently evident to the user?
- Will the user notice that the correct action is available?
- Will the user associate and interpret the response from the action correctly?
- If the correct action is performed, will the user see that progress is being made towards their goals?
Pluralistic walkthrough
variation on the cognitive walkthrough, performed by a carefully managed team
The panel of experts begins by working separately
Then there is managed discussion that leads to agreed decisions
The approach lends itself well to participatory design
Criteria for Creating a Measure of Mental Workload
Sensitivity
* index must be sensitive to changes in task difficulty or resource demand
Selectivity
* index should NOT be sensitive to changes unrelated to resource demands
Diagnosticity
* index should indicate not just that workload is varying but the cause of variation
(Un)obtrusiveness
* an index should not interfere with or contaminate the primary task being assessed
Reliability (Reproducibility)
* index should produce the same estimate for a given task and operator
Bandwidth
* the index should respond to high-frequency changes in workload
4 primary approaches to workload assessment
1) primary task: direct
2) secondary task: indirect
3) physiological correlates
4) subjective ratings: does not interfere with task, but subjective
workload assessment - primary task
Measure performance metrics:
* time, speed, strength
Derived workload metrics:
* no absolute value; a difference in performance may indicate a difference in workload
workload assessment - secondary task
popular types:
- rhythmic tapping task
- random number generation
- probe reaction time task
- time estimation
- time production
workload assessment - physiological measurements
- Heart rate (ECG), Muscle Activity (EMG), Brain Activity (EEG)
- Respiration, skin conductance (GSR)
- Oxygen uptake
- Eye-Tracking
In principle precise, but
- difficult to set up
- needs extensive physiological conditioning to bring subjects to the same baseline level
- difficult to compare between subjects due to high variation in physiological condition
Conditioning:
- Baseline Phase
- Interaction Phase
- Recovery Phase
workload assessment - subjective ratings - NASA TLX - 6 dimensions
Mental demand
* how mentally demanding was the task?
Physical demand
* how physically demanding was the task?
Temporal demand
* how hurried or rushed was the pace of the task?
Performance
* how successful were you in accomplishing what you were asked to do?
Effort
* how hard did you have to work to accomplish your level of performance?
Frustration
* How insecure, discouraged, irritated, stressed and annoyed were you?
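A minimal sketch of the common “Raw TLX” scoring variant, i.e. the unweighted mean of the six subscale ratings (the full NASA TLX additionally weights the scales via 15 pairwise comparisons):

```python
def raw_tlx(ratings):
    """Raw TLX: unweighted mean of the six subscale ratings (0-100).
    Note: 'performance' is rated from perfect (0) to failure (100)."""
    dims = ["mental", "physical", "temporal",
            "performance", "effort", "frustration"]
    assert set(ratings) == set(dims)
    return sum(ratings[d] for d in dims) / len(dims)

print(raw_tlx({"mental": 70, "physical": 20, "temporal": 55,
               "performance": 30, "effort": 60, "frustration": 40}))
# -> 45.833...
```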