recover Flashcards

1
Q

what is an event

A

any change of state that has significance for the management of a service. Typically, they are notifications from monitoring tools

2
Q

what is an incident

A

an unplanned interruption to a service or reduction in the quality of a service.

3
Q

what is a problem
- what is a known error

A

a cause, or potential cause, of one or more incidents

  • known errors are problems that have been analysed but not resolved
4
Q

what is incident management
- purpose?

A
  • to minimise the negative impacts of incidents by restoring normal service operation as quickly as possible
  • diagnose and escalate
  • reactive process
  • not a proactive measure
5
Q

what is problem management
- purpose?

A
  • reduce likelihood and impact of incidents by identifying actual and potential causes of incidents and managing workarounds and known errors
  • reactive and proactive
  • typical triggers: the same incident occurring many times, or an incident affecting many users
6
Q

incident management process (4)

A
  1. identify
  2. log
  3. categorise
  4. prioritise
7
Q

incident identification

A

come from:
1. users: walk-ups, self-service, emails, etc
2. alerts: application monitoring software

decide if issue is an incident OR request

8
Q

incident logging

A

include:
- user’s name and contact information
- incident description
- date and time of incident
- date and time of incident report
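A minimal sketch of how the logged fields above could be held in one record. The class name, field names, and sample values are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentRecord:
    """One logged incident (hypothetical structure)."""
    user_name: str
    user_contact: str
    description: str
    occurred_at: datetime   # date and time of incident
    reported_at: datetime   # date and time of incident report

    def reporting_delay(self) -> timedelta:
        """Time between the incident occurring and it being reported."""
        return self.reported_at - self.occurred_at


# Invented sample data.
record = IncidentRecord(
    user_name="A. User",
    user_contact="a.user@example.com",
    description="Cannot log in to the payroll application",
    occurred_at=datetime(2024, 3, 1, 9, 0),
    reported_at=datetime(2024, 3, 1, 9, 20),
)
print(record.reporting_delay())
```

Keeping both timestamps makes it possible to see how long incidents go unreported, not only how long they take to resolve.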

9
Q

incident categorisation
- purpose?

A
  • assigning a category + at least 1 subcategory
  • purpose: allows incidents to be sorted and modelled, enables automatic prioritisation, supports accurate incident tracking, and lets patterns emerge
10
Q

incident prioritisation

A

determined by:
1. impact on users and the business: measure extent of potential damage
2. urgency: how quickly a resolution is required to reduce business impact
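The two factors above are often combined in a priority matrix. The sketch below assumes a three-level scale and a simple additive rule; both the scale and the `priority` function are illustrative assumptions, not a prescribed scheme.

```python
# Assumed three-level scale: 1 is the most severe.
LEVELS = {"high": 1, "medium": 2, "low": 3}


def priority(impact: str, urgency: str) -> int:
    """Combine impact and urgency into a priority from 1 (highest) to 5 (lowest)."""
    return LEVELS[impact] + LEVELS[urgency] - 1


print(priority("high", "high"))   # most critical combination
print(priority("high", "low"))    # wide impact, but can wait
print(priority("low", "low"))     # least critical combination
```

An organisation would normally agree its own matrix; the point is only that priority is derived from both inputs rather than set by feel.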

11
Q

incident tracking status (6)

A
  1. New
  2. Assigned
  3. In progress
  4. On hold
  5. Resolved
  6. Closed
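The six statuses above can be modelled as a small state machine. The status names come from the card; the allowed transitions in `ALLOWED` are an assumption for illustration, since the card does not specify them.

```python
from enum import Enum


class Status(Enum):
    NEW = "New"
    ASSIGNED = "Assigned"
    IN_PROGRESS = "In progress"
    ON_HOLD = "On hold"
    RESOLVED = "Resolved"
    CLOSED = "Closed"


# Assumed transitions; a real tool may permit others.
ALLOWED = {
    Status.NEW: {Status.ASSIGNED},
    Status.ASSIGNED: {Status.IN_PROGRESS},
    Status.IN_PROGRESS: {Status.ON_HOLD, Status.RESOLVED},
    Status.ON_HOLD: {Status.IN_PROGRESS},
    Status.RESOLVED: {Status.CLOSED, Status.IN_PROGRESS},  # reopen if the fix is rejected
    Status.CLOSED: set(),
}


def can_move(current: Status, target: Status) -> bool:
    """Return True if a ticket may move from current to target."""
    return target in ALLOWED[current]


print(can_move(Status.NEW, Status.ASSIGNED))
print(can_move(Status.NEW, Status.CLOSED))
```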
12
Q

post incident review

A
  • check users’ perception
  • check business process and infrastructure metrics
  • decide if an underlying problem exists and raise a ticket if necessary (problem management)
13
Q

incident communication

A
  • find out what happened
  • escalation
  • updates
  • reporting incident impact and resolution
  • confirming the resolution with the users
14
Q

user satisfaction surveys
- why?
- success?

A
  • good method of monitoring user perception and expectations
  • key points for success: scope, define, conduct, understand, publish, translate, follow through
15
Q

incident report

A
  • basic summary: ticket number, description, impact, resolution time
  • causes found: technical analysis
  • actions taken: short-term workarounds, improvements to avoid similar occurrences
  • post-incident follow up: measurements taken after the fix, eliminate root cause/problem tickets raised, user surveys
16
Q

summary report
- purpose?
- includes?

A
  • ensure incident management is effective

includes:
- number of incidents
- average resolution time
- type of incident reported
- % of incidents handled within the agreed response time
- % closed by service desk without escalation
- summarise in non-technical language and show where improvements could be made
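The figures listed above could be computed from closed tickets roughly as follows. The record layout, field names, and sample data are invented for illustration.

```python
from collections import Counter
from statistics import mean

# Invented sample data; field names are hypothetical.
incidents = [
    {"type": "network", "resolution_hours": 2.0, "within_agreed_time": True, "escalated": False},
    {"type": "email", "resolution_hours": 6.0, "within_agreed_time": False, "escalated": True},
    {"type": "network", "resolution_hours": 1.0, "within_agreed_time": True, "escalated": False},
    {"type": "hardware", "resolution_hours": 3.0, "within_agreed_time": True, "escalated": True},
]


def summarise(items):
    """Build the headline numbers for a summary report."""
    total = len(items)
    return {
        "count": total,
        "avg_resolution_hours": mean(i["resolution_hours"] for i in items),
        "by_type": Counter(i["type"] for i in items),
        "pct_within_agreed_time": 100 * sum(i["within_agreed_time"] for i in items) / total,
        "pct_closed_without_escalation": 100 * sum(not i["escalated"] for i in items) / total,
    }


summary = summarise(incidents)
for name, value in summary.items():
    print(name, value)
```

The numbers alone are not the report: they still need a short non-technical commentary on where improvements could be made.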

17
Q

security concerns

A
  • incident may occur due to security event (unauthorised access, virus, cyber attack)
  • elevated system access may need to be granted to resolve incident
  • data may be lost/leaked
18
Q

support team’s role in IM

A
  • Receive and communicate all incidents
  • Filter out service/change requests
  • Resolve or escalate incidents as appropriate
  • Confirm and close tickets
  • Analyse incident logs
  • Report on incident trends and suggest improvements
19
Q

types of root causes

A
  1. (special cause) random root cause:
    - hard to track down and fix
    - log but no action unless occurs again
  2. (random cause) root cause will produce more incidents if not fixed:
    - problems
    - find and fix
20
Q

risks

A

potential incidents that have not manifested yet

21
Q

risk management
- purpose?
- how?

A
  • any potential incident is a risk and should be considered as early as possible
  • ensure reliable enterprise solutions
  • avoid/mitigate/transfer/accept
22
Q

risk classification

A
  1. severity (business impact)
  2. likelihood (probability of the event to happen)
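One common way to use the two dimensions above is to multiply them into a single score and rank risks by it. The 1 to 5 scales, the multiplication rule, and the sample risks below are assumptions for illustration.

```python
def risk_score(severity: int, likelihood: int) -> int:
    """Score a risk on assumed 1-5 scales; higher means more attention needed."""
    return severity * likelihood


# Invented sample risks: (name, severity, likelihood).
risks = [
    ("database outage", 5, 2),
    ("expired certificate", 3, 4),
    ("printer failure", 1, 3),
]

# Highest score first.
ranked = sorted(risks, key=lambda r: risk_score(r[1], r[2]), reverse=True)
for name, severity, likelihood in ranked:
    print(name, risk_score(severity, likelihood))
```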
23
Q

RTO

A

recovery time objectives
- maximum agreed acceptable period of time following a service disruption that can elapse before business functions are severely impacted
- how long to recover?

24
Q

RPO

A

recovery point objectives
- the point to which information used by a business activity must be restored to enable the activity to operate on resumption of the service
- how far back is the last point at which data is in a usable format?
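A short sketch contrasting RTO and RPO for one outage: RTO is measured forward from the failure to recovery, RPO backward from the failure to the last usable copy of the data. The function names, timestamps, and targets are hypothetical.

```python
from datetime import datetime, timedelta


def meets_rto(failed_at: datetime, restored_at: datetime, rto: timedelta) -> bool:
    """Was the service back within the recovery time objective?"""
    return restored_at - failed_at <= rto


def meets_rpo(last_backup_at: datetime, failed_at: datetime, rpo: timedelta) -> bool:
    """Is the data lost since the last usable backup within the recovery point objective?"""
    return failed_at - last_backup_at <= rpo


# Invented scenario: backup at 02:00, failure at 09:00, service restored at 11:30.
last_backup = datetime(2024, 3, 1, 2, 0)
failure = datetime(2024, 3, 1, 9, 0)
restored = datetime(2024, 3, 1, 11, 30)

rto_ok = meets_rto(failure, restored, rto=timedelta(hours=4))   # 2.5 h of downtime
rpo_ok = meets_rpo(last_backup, failure, rpo=timedelta(hours=6))  # 7 h of data lost
print(rto_ok, rpo_ok)
```

In this scenario the recovery was fast enough, but the backup was too old, so only the RTO is met.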

25
Q

phases of problem management

A
  1. problem identification
  2. problem control
  3. error control
26
Q

problem identification

A
  • detect duplicate and recurring issues
  • during major incident, identify risk that an incident could recur
  • analyse information received that may point to problems, e.g. security risks, vendor reports, findings from quality assurance teams
27
Q

problem control

A
  • problem analysis (RCA) / troubleshooting
  • documenting workarounds
  • documenting known errors
28
Q

troubleshooting process

A
  1. define problem statement
  2. gather information, data, etc
  3. determine - root cause analysis
  4. recommend solutions for eliminating or mitigating the problem
29
Q

RCA

A

root cause analysis
- systematic process for identifying ‘root causes’ of problems/incidents and an approach for responding to them
- prevent problems
- pinpoint contributing factors to a problem
- creates RCI & RCR

30
Q

RCA - time analysis

A
  • understand what happened and ensure all information is available
  • get data, sort by date and time, list in time order, then look for patterns
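A minimal sketch of the sorting step: events collected from different sources are put in time order so that sequences stand out. The events and timestamp format are invented for illustration.

```python
from datetime import datetime

# Invented events, gathered out of order from several sources.
events = [
    ("2024-03-01 09:15", "users report login failures"),
    ("2024-03-01 08:50", "auth service config deployed"),
    ("2024-03-01 09:05", "error rate alert fires"),
]

# Sort by parsed timestamp to build the timeline.
timeline = sorted(events, key=lambda e: datetime.strptime(e[0], "%Y-%m-%d %H:%M"))

for timestamp, what in timeline:
    print(timestamp, what)
```

Read in order, the deployment precedes the alert and the user reports, which suggests where to look first. The ordering is only a lead, not proof of cause.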
31
Q

RCA - fishbone diagram

A
  • helps to understand and visualise relationships between causes
  • helps with troubleshooting documentation
  • progressively break down potential causes of a problem
    1. causes are grouped into categories
    2. create possible causes under each category
32
Q

problem response: troubleshoot recommendation

A
  1. design solution based on analysis
  2. decide & plan implementation
  3. follow change process
33
Q

what is a workaround

A
  • a solution that reduces or eliminates the impact of an incident/problem for which a full resolution is not yet available
34
Q

error control

A
  • manage known errors
  • identify potential permanent solution
  • regularly reassess the status of known errors not yet resolved
35
Q

disaster recovery

A
  • aims to protect an org from effects of significantly negative events
  • allows org to maintain or quickly resume mission-critical functions following a disaster
36
Q

ESM role in disaster recovery

A
  • Escalating if a situation looks like a potential disaster
  • Help test DR plans
  • Check critical business processes
  • Triage incidents
  • Check if back to normal