recover Flashcards
what is an event
any change of state that has significance for the management of a service. Typically, they are notifications from monitoring tools
what is an incident
an unplanned interruption to a service or reduction in the quality of a service.
what is a problem
- what is a known error
a cause, or potential cause, of one or more incidents
- known errors are problems that have been analysed but not resolved
what is incident management
- purpose?
- to minimise the negative impacts of incidents by restoring normal service operation as quickly as possible
- diagnose and escalate
- reactive process
- not a proactive measure
what is problem management
- purpose?
- reduce likelihood and impact of incidents by identifying actual and potential causes of incidents and managing workarounds and known errors
- reactive and proactive
- typical triggers: the same incident occurring many times, or an incident that affects many users
incident management process (4)
- identify
- log
- categorise
- prioritise
incident identification
incidents come from:
1. users: walk-ups, self-service, emails, etc
2. alerts: application monitoring software
decide whether the issue is an incident or a service request
incident logging
include:
- user’s name and contact information
- incident description
- date and time of incident
- date and time of incident report
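The four logged fields can be sketched as a small record type. This is only an illustration: the class name, field names, and the `reporting_delay` helper are hypothetical and do not come from any particular ticketing tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentRecord:
    # Field names are illustrative assumptions, one per item on the card.
    user_name: str
    user_contact: str
    description: str
    occurred_at: datetime   # date and time of the incident
    reported_at: datetime   # date and time of the incident report

    @property
    def reporting_delay(self) -> timedelta:
        """Time between the incident occurring and it being reported."""
        return self.reported_at - self.occurred_at


record = IncidentRecord(
    user_name="A. User",
    user_contact="a.user@example.com",
    description="VPN connection drops",
    occurred_at=datetime(2024, 1, 1, 9, 0),
    reported_at=datetime(2024, 1, 1, 9, 30),
)
```

Keeping both timestamps separate is what makes the gap between occurrence and report measurable later.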
incident categorisation
- purpose?
- assigning a category + at least 1 subcategory
- purpose: allows incidents to be sorted and modelled, enables automatic prioritisation, and supports accurate incident tracking so that patterns can emerge
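A rough sketch of how category + subcategory pairs let patterns emerge. The category names and sample incidents below are made up for illustration only.

```python
from collections import Counter

# Each incident carries a (category, subcategory) pair.
# These sample values are hypothetical, not real data.
incidents = [
    ("Network", "VPN"),
    ("Network", "VPN"),
    ("Hardware", "Laptop"),
    ("Network", "Wi-Fi"),
]


def count_by_category(pairs):
    """Count how often each (category, subcategory) pair occurs."""
    return Counter(pairs)


counts = count_by_category(incidents)
most_common_pair, occurrences = counts.most_common(1)[0]
```

A pair that keeps topping the count is a candidate for a problem ticket.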
incident prioritisation
determined by:
1. impact on users and the business: measure extent of potential damage
2. urgency: how quickly a resolution is required to reduce business impact
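One common way to combine the two factors is a priority matrix. The sketch below assumes three levels for each factor and a priority scale of 1 (most critical) to 5; the levels, the formula, and the names are assumptions, since the card only says that impact and urgency determine priority.

```python
from enum import IntEnum


class Level(IntEnum):
    # Three levels is an assumption; organisations define their own scales.
    HIGH = 1
    MEDIUM = 2
    LOW = 3


def priority(impact: Level, urgency: Level) -> int:
    """Return a priority from 1 (most critical) to 5 (least critical).

    Simple additive matrix, assumed for illustration only.
    """
    return int(impact) + int(urgency) - 1
```

For example, `priority(Level.HIGH, Level.HIGH)` gives 1 and `priority(Level.LOW, Level.LOW)` gives 5, while a high-impact but low-urgency incident lands in the middle at 3.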
incident tracking status (6)
- New
- Assigned
- In progress
- On hold
- Resolved
- Closed
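The six statuses can be sketched as a small state machine. The status names come from the card; the allowed transitions are an assumption for illustration, since real tools configure their own workflows.

```python
from enum import Enum


class Status(Enum):
    NEW = "New"
    ASSIGNED = "Assigned"
    IN_PROGRESS = "In progress"
    ON_HOLD = "On hold"
    RESOLVED = "Resolved"
    CLOSED = "Closed"


# Assumed workflow: which statuses may follow each status.
ALLOWED = {
    Status.NEW: {Status.ASSIGNED},
    Status.ASSIGNED: {Status.IN_PROGRESS},
    Status.IN_PROGRESS: {Status.ON_HOLD, Status.RESOLVED},
    Status.ON_HOLD: {Status.IN_PROGRESS},
    # Reopen if the user does not confirm the resolution.
    Status.RESOLVED: {Status.CLOSED, Status.IN_PROGRESS},
    Status.CLOSED: set(),
}


def can_move(current: Status, target: Status) -> bool:
    """Return True if the assumed workflow allows current -> target."""
    return target in ALLOWED[current]
```

Under these assumed rules a ticket cannot jump straight from New to Closed, and a Resolved ticket only becomes Closed after confirmation.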
post incident review
- check users’ perception
- check business process and infrastructure metrics
- decide if an underlying problem exists and raise a ticket if necessary (problem management)
incident communication
- find out what happened
- escalation
- updates
- reporting incident impact and resolution
- confirming the resolution with the users
user satisfaction surveys
- why?
- success?
- good method of monitoring user perception and expectations
- key points for success: scope, define, conduct, understand, publish, translate, follow through
incident report
- basic summary: ticket number, description, impact, resolution time
- causes found: technical analysis
- actions taken: short-term workarounds, improvements to avoid similar occurrences
- post-incident follow-up: measurements taken after the fix, root cause eliminated or problem tickets raised, user surveys