10 Evaluation Flashcards
Why Evaluate?
Well-designed products sell
To ensure that the system matches the users' needs
To discover unforeseen problems
To compare your solution against competitors ("We are x% better than…")
Where to Evaluate?
Naturalistic Approach: Field Studies
Usability Lab
When to Evaluate, and Who Evaluates When?
Evaluation should happen throughout the entire software development process
Early designs: evaluated by the design team, analytically and informally
Later implementations: evaluated by users, experimentally and formally
Evaluation Methods
- Determine the Goals
- Explore the Questions
- Choose the Approach and Methods
- Evaluate, Interpret & Present Data
Important aspects in creating an evaluation process?
Reliability: can the study be replicated?
Validity: is it measuring what you expected?
Biases: is the process creating biases?
Scope: can the findings be generalized?
Ethics: are ethical standards met?
External vs Internal Validity
External validity
-> confidence that results apply to real situations
-> usually good in natural settings
Internal validity
-> confidence in our explanation of experimental results
-> usually good in experimental settings
Ethics Approval
Researchers must respect the safety, welfare, and dignity of human participants in their research and treat them equally and fairly
Criteria for approval:
- research methodology
- risks or benefits
- the right not to participate, to terminate participation, etc.
- the right to anonymity and confidentiality
Ethics - Before the test (5 things)
Only use volunteers
Inform the user
Maintain privacy
Make users feel comfortable
Don't waste the user's time
Ethics - During the test (4 things)
Maintain privacy
Make users feel comfortable
Don’t waste the user’s time
Ensure participant health and safety
Ethics - After the test
Inform the user
Maintain privacy
Make users feel comfortable
Usability Testing
Focus on: how well users perform tasks with the product (time to complete task and number & type of errors)
-> Controlled environment (lab setting)
Signal & Noise Metaphor
Experiment design seeks to enhance the signal (the variable of interest),
while minimizing the noise (everything else, i.e., random influences)
Controlled Experiment: Steps
- Determine the goals, explore the questions, then formulate hypothesis
- Design experiment, define experimental variables
- Choose subjects
- Run pilot experiment
- Iteratively improve experiment design
- Run experiment
- Interpret results to accept or reject hypothesis
Experimental Variables
- Independent Variables
- Dependent Variables
- Control Variables
- Random Variables
- Confounding Variables
Independent Variable - Definition & Examples
An independent variable is under your control
Independent because it is independent of participant behavior
Interface, device, button layout, visual layout, feedback mode, age, gender, background noise, expertise, etc.
Must have at least two levels (values/settings) -> test conditions
Dependent Variable - Definition & Examples
measured human behavior, depends on what the participant does
is measured during the experiment
Task completion time, speed, accuracy, error rate, throughput, target re-entries, task retries, presses of backspace, etc.
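A minimal sketch of how test software might log two such dependent measures per trial; `perform_task` is a hypothetical callback standing in for the actual task, not anything from the flashcards:

```python
import time

def run_trial(perform_task):
    # perform_task is a hypothetical callback: it runs one task with
    # the participant and returns the number of errors they made.
    start = time.perf_counter()
    errors = perform_task()
    completion_time = time.perf_counter() - start  # seconds
    return completion_time, errors
```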
Control Variable - Definition & Examples
a circumstance that is kept constant
more control -> less variability, but less generalizable
Random Variable - Definition & Examples
circumstance that is allowed to vary randomly -> more variability (bad), but more generalizable
Confounding Variable - Definition & Examples
circumstance that varies systematically with an independent variable, so its effect cannot be separated from the effect of the independent variable
Experiment Task - Good Task Qualities:
Represent activities people do with the interface
Discriminate among the test conditions
Hypothesis vs Claim
A claim predicts the outcome of an experiment
Example: Reading a text in upper case takes longer than reading it in sentence case
A hypothesis claims that changing independent variables influences dependent variables
Example: Changing the case (independent variable) influences reading time (dependent variable)
-> Experiment goal: confirm the hypothesis
-> Statistical approach: reject the null hypothesis
Statistical Tests - 2 Types
Parametric
-> Data are assumed to come from a distribution, such as the normal distribution, t-distribution, etc.
Non-parametric
-> Data are not assumed to come from a distribution
Statistical Tests - Which test for nominal and ordinal (gender, age groups, …)
Non-parametric tests (e.g., Chi-square test)
Statistical Tests - Which test for Interval and Ratio (temperature in C or K, …)
Parametric tests (e.g., t-test, ANOVA), or Non-parametric tests
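A hedged sketch with SciPy (an assumed dependency; the data are invented) showing one test from each family, rejecting the null hypothesis when p < 0.05:

```python
from scipy import stats

# Non-parametric, nominal data: chi-square test on a contingency
# table (invented counts: preference for interface A vs. B in two
# participant groups).
observed = [[30, 10],
            [20, 20]]
chi2, p, dof, expected = stats.chi2_contingency(observed)
print(f"chi-square: chi2={chi2:.2f}, p={p:.3f}")

# Parametric, ratio data: independent-samples t-test on task
# completion times (seconds) under two conditions.
times_a = [12.1, 11.4, 13.0, 12.7, 11.9]
times_b = [14.2, 13.8, 15.1, 14.0, 13.5]
t, p = stats.ttest_ind(times_a, times_b)
print(f"t-test: t={t:.2f}, p={p:.3f}")  # reject H0 if p < 0.05
```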
Too Few vs Too Many Participants?
Too few: experimental effects fail to achieve statistical significance
Too many: statistical significance even for very small effect sizes
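A common way to choose a sample size between these extremes is an a priori power analysis; a sketch assuming statsmodels is available:

```python
from statsmodels.stats.power import TTestIndPower

# A priori power analysis: per-group n needed to detect a medium
# effect (Cohen's d = 0.5) with 80% power at alpha = 0.05 in a
# two-sided independent-samples t-test.
n = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(f"participants per group: {n:.1f}")  # roughly 64
```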
Within-subjects, Between-subjects
Within-subjects: each participant is tested on each condition
Between-subjects: each participant is tested on one condition only
Order Effects and how to avoid them
Order effects / learning effects can occur when the same participant is doing a similar task multiple times
-> only relevant for within-subject factors
Avoid them by:
- dividing participants into groups, with a different order of test conditions per group (Latin square); see the sketch below
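A minimal sketch (an illustrative helper, not from the flashcards) of the standard balanced Latin square construction 1, 2, n, 3, n-1, …, which balances both position and immediate order for an even number of conditions:

```python
def latin_square_order(conditions, participant):
    # Condition order for one participant, using the balanced Latin
    # square construction 1, 2, n, 3, n-1, ...; for an odd number of
    # conditions, odd-numbered participants get the reversed order to
    # restore the balance of immediate orders.
    n = len(conditions)
    order, j, h = [], 0, 0
    for i in range(n):
        if i < 2 or i % 2 != 0:
            val = j
            j += 1
        else:
            val = n - h - 1
            h += 1
        order.append(conditions[(val + participant) % n])
    if n % 2 != 0 and participant % 2 != 0:
        order.reverse()
    return order

# Example: orders for 4 participants and 4 test conditions
for p in range(4):
    print(p, latin_square_order(["A", "B", "C", "D"], p))
```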
Longitudinal Studies
research that seeks to promote and investigate learning
-> practice is the independent variable
Analytical Evaluation Methods (2)
heuristic evaluation
cognitive walkthrough
Golden rules of UI design
- Keep the interface simple
- Speak the user's language
- Be consistent and predictable
- Make things visible and provide feedback
- Minimize the user's memory load
- Design for error: Avoid errors, help to recover from errors, offer undo
- Design clearly marked exits and dialogs that yield closure
- Include help and documentation
- Offer shortcuts for experts
- Make the system responsive
Heuristic Evaluation- How many evaluators?
3-5 evaluators
Cognitive Walkthrough
Experts “walk” through the design prototype with usage scenario(s)
Experts analyze each step of the task by asking 3 questions:
- Will the correct action be sufficiently evident to the user?
- Will the user notice that the correct action is available?
- Will the user correctly interpret the system's response to the action?
Model-Based Evaluation - 3 Examples
GOMS
Keystroke Level Model ("daughter" model of GOMS)
Fitts' Law (see formula below)
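For reference, Fitts' Law in the Shannon formulation common in HCI (MacKenzie), where MT is the predicted movement time, D the distance to the target, W the target width, and a, b device-specific constants fitted from data:

```latex
% Fitts' Law, Shannon formulation:
% MT = predicted movement time, D = distance to target,
% W = target width, a and b = empirically fitted constants.
\[ MT = a + b \cdot \log_2\!\left(\frac{D}{W} + 1\right) \]
```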
GOMS - Name and Main Principle
uses a model of execution times for basic tasks to predict how long a sequence of actions takes
GOMS = Goals, Operators, Methods, Selection rules
(Selection rules decide which method to select when there is more than one)
Keystroke Level Model
refinement of GOMS that provides a quantitative model of execution times
assigns each operator a context-independent average duration
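A minimal sketch of a KLM prediction using the classic average operator durations from Card, Moran & Newell; the task breakdown at the end is an invented example:

```python
# Classic KLM average operator durations in seconds
# (Card, Moran & Newell, 1983).
KLM = {
    "K": 0.28,  # keystroke (average non-secretarial typist)
    "P": 1.10,  # point at a target with the mouse
    "H": 0.40,  # home hands between keyboard and mouse
    "M": 1.35,  # mental preparation
}

def predict_time(operators):
    # Predicted execution time for a sequence of KLM operators.
    return sum(KLM[op] for op in operators)

# Invented example: delete a file by homing to the mouse, mentally
# preparing, pointing at the icon, and pressing Delete.
print(predict_time(["H", "M", "P", "K"]))  # -> 3.13 seconds
```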