A Critical Site Event: Flashcards

1
Q

What is a Critical Site Event (CSE)?

A

A CSE is a highly visible, urgent event with the potential for large and direct impact on customers and a high risk of load loss. It is triggered by a loss of redundancy or resiliency in a system.

Example: A loss of supply power to the cooling systems that risks overheating without direct customer impact.

2
Q

What defines a Large Scale Event (LSE)?

A

An LSE is an event that impacts customers’ ability to connect, is reported by a customer contact, and indicates that customers are affected by a loss of service.

Example: A similar loss of power to cooling systems, but one where customers make contact.

3
Q

What is the primary difference between a CSE and an LSE?

A

CSEs are triggered by events where service remains available but may cause customer impact, while LSEs are triggered by customer contact or events that result in loss of service.

4
Q

What is the role of the DCEO team?

A

The DCEO team is responsible for maintaining all critical infrastructure within data centers globally.

5
Q

What should be done when a critical alarm is triggered?

A

DCEO needs to respond immediately by acknowledging the alarm and assessing the situation at the alarm’s location.

6
Q

List the phases in the event management process.

A
  • React: Acknowledge the alarm
  • Investigate: Engage with the equipment
  • Communicate: Update and contact stakeholders
  • Fault Find: Identify the root cause
  • Update: Keep relevant parties informed
  • Stabilize and Restore: Implement mitigation steps
7
Q

Who is the First Responder?

A

The First Responder is any person onsite responsible for responding to alarms and notifying surrounding EOTs of the alarm.

8
Q

What is the responsibility of the Incident Commander (IC)?

A

The IC is responsible for communications during the event, ensuring proper handling, providing updates, and escalating the event as needed.

9
Q

Fill in the blank: The _______ monitors and prioritizes alarms globally.

A

[Facility Operations Center (FOC)]

10
Q

What are the InfraOps Tenets?

A
  • Safe Work Environment
  • Security
  • Prepare for the Improbable
  • Automation
  • Speed Matters
  • Serviceability
  • Continuous Improvement
11
Q

What should be done if there is a loss of redundancy?

A

The First Responder should investigate potential customer impact, assess conditions, and then escalate as needed.

12
Q

True or False: The Primary On-Call is typically an On-Site Facility Manager.

A

True.

13
Q

What steps should be taken when escalations are necessary?

A

Communicate early and often, assess options, redirect network traffic, and request additional resources.

14
Q

What is the primary responsibility of the Call Leader?

A

Leads the FOC conference call during customer-impacting or redundancy/resiliency-impacting events and keeps the Response Team focused on recovery efforts.

15
Q

What happens if a critical alarm is received?

A

The First Responder EOT needs to react immediately.

16
Q

What is the significance of the FOC in the event management process?

A

The FOC provides first-level support, monitors alarms, and helps resolve events on a 24-hour basis.

17
Q

What actions should the First Responder take during an event?

A
  • Contact another member to act as IC
  • Establish communication with IC
  • Validate alarms and escalate if required
18
Q

What is the escalation process for customer-impacting LSEs?

A

All customer-impacting LSEs must be escalated to the Cluster Manager for mitigation planning and recovery approval.

19
Q

Fill in the blank: If issues require escalation to the Regional/Cluster Manager, prepare an email to Amazon Senior _______.

A

[Leadership]

20
Q

What should be assessed before attempting any resets after a power event?

A

Any damage

Important to ensure safety and proper functioning before resetting systems.

21
Q

What is the first action taken by FOC when they see an alarm?

A

Cuts a TT to the affected site

TT stands for Trouble Ticket.

22
Q

If the FOC does not pick up on an alarm, what should the on-call EOT do?

A

Get in contact with the FOC ASAP

Ensures timely response to alarms.

23
Q

What should the EOT on site do if they cannot get through to the FOC?

A

Create your own TT and work from that

This allows for independent action in critical situations.

24
Q

Which button should the EOT taking up the IC position use for events?

A

CSE Power or Thermal event button

Available through a Tampermonkey script.

25
Q

What tools are needed to address script-inserted questions?

A
  • Doors of Durin
  • AHA and its CSE Response Tool
  • GRC
  • BMS/EPMS
  • Holocron
  • The First Responder
  • The FOC
  • The DCO Team

These tools and contacts help in managing and responding to events effectively.

26
Q

What is the main focus of the First Responder during an alarm investigation?

A

Fault finding and troubleshooting

They communicate only with the IC.

27
Q

What does the IC do when the First Responder provides updates?

A
  • Contacts the FOC to spin up a conference call if required
  • Escalates using the suitable escalation path
  • Supplies regular updates to the ticket and conference call

Keeping communication clear and consistent is essential.

28
Q

What information should the IC update in the TT regarding critical load support?

A
  • Impact YES/NO
  • Potential impact
  • Staff onsite and roles
  • How critical load is supported
  • UPS Autonomy Times
  • Vendor engagement status
  • Access arrangements
  • Status of affected POD/Electrical room

This information is crucial for assessing the situation accurately.

29
Q

What is a CSE?

A

An infrastructure event impacting two or more racks

Stands for Critical Site Event.

30
Q

What is the purpose of the CSE Response Tool?

A

To diagnose a CSE by providing critical data

It offers insights into host and rack impairment, server temperature data, and critical electrical metrics.

31
Q

What should the IC do if they need to escalate an issue?

A

Speak to someone directly

Emails, texts, and Chime messages do not count as escalation.

32
Q

What is the role of the IC during a conference call?

A

Clearly identify themselves and remain on the call until the event is resolved

Important for maintaining clarity and continuity.

33
Q

What steps should the IC take if there is a loss of redundancy?

A

Take immediate action if critical load/mechanical load has not transferred

This may involve escalating the situation and conducting tests.

34
Q

Fill in the blank: The IC is expected to use the _______ button on the TT for thermal events.

A

CSE Thermal Event

This differentiates between power and thermal incidents.

35
Q

What must be updated in the TT regarding the status of AHU/CRAH modules?

A
  • Temperature in Pod
  • Difference in temps over last 5-10 mins
  • Status of remaining healthy modules
  • Number of fault-free modules
  • Status of associated HSDB/MSDB

Critical for managing thermal events effectively.
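
The "difference in temps over last 5-10 mins" item can be worked out from a short series of Pod temperature readings. The sketch below is only an illustration: the function name `temp_change` and the (timestamp, temperature) sample format are assumptions, not part of any tool named in these cards.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


def temp_change(samples: List[Tuple[datetime, float]],
                window: timedelta) -> Optional[float]:
    """Change in °C between the newest sample and the oldest sample
    that falls inside `window` of the newest one.

    Returns None when fewer than two samples fall inside the window.
    The sample format is a hypothetical one chosen for this sketch.
    """
    if not samples:
        return None
    ordered = sorted(samples)                 # oldest first
    latest_time, latest_temp = ordered[-1]
    in_window = [(t, c) for t, c in ordered if latest_time - t <= window]
    if len(in_window) < 2:
        return None
    return latest_temp - in_window[0][1]


t0 = datetime(2024, 1, 1, 12, 0)
readings = [
    (t0, 24.0),
    (t0 + timedelta(minutes=5), 26.5),
    (t0 + timedelta(minutes=10), 29.0),
]
print(temp_change(readings, timedelta(minutes=10)))  # rise over the last 10 minutes
print(temp_change(readings, timedelta(minutes=5)))   # rise over the last 5 minutes
```

Reporting both the 5-minute and 10-minute figures shows whether the rise is speeding up or levelling off.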

36
Q

What is the primary function of the CSE Response Tool?

A

To provide high-level and detailed views into the status of racks, hosts, and electrical infrastructure in a data center during Critical Site Events

37
Q

What type of data does the CSE Response Tool incorporate?

A

HotSpot data for thermal critical racks and average rack temperatures

38
Q

List the benefits of the CSE Response Tool

A
  • Shows important visuals on one page
  • Improves HotSpot alarms by monitoring aggregated thermal data
  • Initiates alarms based on rate-of-change thresholds
39
Q

Define ‘Up’ status for racks in the CSE Response Tool.

A

Fewer than 25% of the operational hosts in the rack are not reachable from the network

40
Q

What does ‘Down’ status indicate for a rack?

A

At least 25% of the operational hosts in the rack are not reachable from the network

41
Q

What temperature indicates a rack is ‘Thermal Critical’?

A

The average temperature of the hosts in the rack is equal to or greater than 35°C

42
Q

What does ‘Unknown’ status mean for a rack?

A

There is not enough information to determine the rack status
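
The Up, Down, Thermal Critical, and Unknown definitions in cards 39 to 42 can be read as simple threshold checks. The following is a minimal sketch under stated assumptions: the function names and the True/False/None encoding of ping results are hypothetical, and only the 25% and 35°C thresholds come from the cards.

```python
from typing import Optional, Sequence

DOWN_FRACTION = 0.25        # share of unreachable hosts that marks a rack Down
THERMAL_CRITICAL_C = 35.0   # average host temperature threshold in °C


def rack_network_status(reachable: Sequence[Optional[bool]]) -> str:
    """Return 'Up', 'Down', or 'Unknown' for one rack.

    `reachable` holds one entry per operational host: True, False,
    or None when no ping result is available (hypothetical encoding).
    """
    known = [r for r in reachable if r is not None]
    if not known:
        return "Unknown"            # not enough information
    unreachable_fraction = known.count(False) / len(known)
    return "Down" if unreachable_fraction >= DOWN_FRACTION else "Up"


def is_thermal_critical(host_temps_c: Sequence[float]) -> bool:
    """True when the average host temperature is at or above 35°C."""
    if not host_temps_c:
        return False
    return sum(host_temps_c) / len(host_temps_c) >= THERMAL_CRITICAL_C


print(rack_network_status([True, True, True, False]))  # exactly 25% unreachable
print(is_thermal_critical([34.0, 36.0]))               # average is exactly 35.0
```

Note the boundary: exactly 25% unreachable counts as Down, and an average of exactly 35°C counts as Thermal Critical.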

43
Q

How does the InfraMap floor map enhance rack status monitoring?

A

It includes rack status and thermal status data for spatial identification of issues

44
Q

What information does the InfraMap SOS dashboard provide?

A

Electrical infrastructure status data including utility, generator, or UPS power readings

45
Q

Who receives alerts when thermal CSE conditions are met?

A
  • A member of the FOC monitoring a potential CSE
  • A Field Engineer remotely monitoring the site
  • A local DCEO EOT
46
Q

What criteria create a potential thermal CSE alert?

A
  • Average host temperatures in a room equal to or over 35°C
  • More than 100 thermal critical racks with over 10 racks reporting inlet temperature
  • Average host temperatures diverging more than 5°C from the 30-minute trailing average
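
The three criteria above can be sketched as independent checks. This is an illustration only: the card does not say whether one criterion is enough or all are required, so the sketch simply lists which ones are met. The class and field names are hypothetical, and reading the second criterion as two separate counts is one interpretation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RoomThermalSnapshot:
    avg_host_temp_c: float        # current average host temperature in the room
    trailing_avg_temp_c: float    # 30-minute trailing average
    thermal_critical_racks: int   # racks currently classed as thermal critical
    racks_reporting_inlet: int    # racks reporting an inlet temperature


def thermal_cse_reasons(s: RoomThermalSnapshot) -> List[str]:
    """Return the criteria from the card that this snapshot meets."""
    reasons = []
    if s.avg_host_temp_c >= 35.0:
        reasons.append("room average at or above 35 C")
    if s.thermal_critical_racks > 100 and s.racks_reporting_inlet > 10:
        reasons.append("more than 100 thermal critical racks")
    if abs(s.avg_host_temp_c - s.trailing_avg_temp_c) > 5.0:
        reasons.append("more than 5 C away from the trailing average")
    return reasons


snapshot = RoomThermalSnapshot(
    avg_host_temp_c=30.0,
    trailing_avg_temp_c=24.0,
    thermal_critical_racks=5,
    racks_reporting_inlet=50,
)
print(thermal_cse_reasons(snapshot))
```

The third check is what lets a fast-rising room be flagged before it reaches 35°C.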
47
Q

What is the maximum time span for event de-duplication in thermal CSE alerts?

A

30 minutes
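
One common way to de-duplicate alerts is to suppress repeats for the same source inside a fixed window. The sketch below assumes that approach with the 30-minute span from the card. The class name, the per-room key, and the suppression rule are all assumptions, not a description of how the real alerting works.

```python
from datetime import datetime, timedelta
from typing import Dict

DEDUP_WINDOW = timedelta(minutes=30)


class AlertDeduplicator:
    """Suppress repeat alerts for the same key inside the window."""

    def __init__(self) -> None:
        self._last_sent: Dict[str, datetime] = {}

    def should_send(self, key: str, now: datetime) -> bool:
        last = self._last_sent.get(key)
        if last is not None and now - last < DEDUP_WINDOW:
            return False              # duplicate inside the window
        self._last_sent[key] = now    # remember the alert that went out
        return True


dedup = AlertDeduplicator()
t0 = datetime(2024, 1, 1, 12, 0)
print(dedup.should_send("room-1", t0))                           # first alert
print(dedup.should_send("room-1", t0 + timedelta(minutes=10)))   # inside the window
print(dedup.should_send("room-1", t0 + timedelta(minutes=30)))   # window has elapsed
```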

48
Q

What categories are racks separated into within the CSE Response Tool?

A
  • Thermal Critical
  • Thermal Impaired
  • Impaired
  • Normal
  • Unknown
49
Q

What key statuses are displayed on the CSE Dashboard?

A
  • Event-Related Rack Downs
  • At-Risk Racks
  • Event Impact
  • Electrical Infrastructure Status
  • Down Rack Count
  • IT Load and ATS Monitoring Status
50
Q

What does the Thermal Status tab show?

A

Rack thermal status, thermal impact count, and rack average temperatures

51
Q

What does the Rack Detail page provide?

A

Details about a specific rack and the hosts it contains

52
Q

What type of data does the CSE Response Tool display during a CSE?

A

Critical electrical infrastructure data alongside down racks data

53
Q

What is the purpose of the SOS Dashboard?

A

To view high-level electrical utility, generator, and UPS states for every lineup in a data center

54
Q

What does a reading of zero in the UPS column indicate?

A

The UPS is in use

55
Q

What scenario indicates the data center is operating normally?

A

All meters are green

56
Q

What does the term ‘On Generator’ indicate?

A

The generator is active and producing output load

57
Q

What indicates a lineup is on UPS?

A

Both USB and generator readings are at zero

58
Q

Describe an ‘Edge Case’ scenario.

A

The generator is tested with load output, showing greater than zero while the USB indicates power from the utility
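
The On Generator, on UPS, and Edge Case cards describe patterns in the USB and generator meter readings. A minimal sketch of those patterns follows. The function name and the "On Utility" label are hypothetical, and treating a zero USB reading as the mark of On Generator is an inference from the cards rather than something they state. A failed live load transfer shows a similar meter pattern to the Edge Case and is not separated out here.

```python
def lineup_power_state(usb_reading: float, generator_reading: float) -> str:
    """Classify one lineup from its USB and generator meter readings."""
    on_utility = usb_reading > 0
    generator_loaded = generator_reading > 0

    if not on_utility and not generator_loaded:
        return "On UPS"          # both readings at zero
    if generator_loaded and not on_utility:
        return "On Generator"    # generator producing output load
    if generator_loaded and on_utility:
        return "Edge Case"       # generator under test while utility feeds the USB
    return "On Utility"          # hypothetical label for the remaining case


print(lineup_power_state(usb_reading=0.0, generator_reading=0.0))
print(lineup_power_state(usb_reading=0.0, generator_reading=850.0))
print(lineup_power_state(usb_reading=900.0, generator_reading=120.0))
```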

59
Q

What happens in a ‘Live Load Transfer - Failed to Transfer’ scenario?

A

Load is shown on the generator meter and both USB inputs, with a UPS reading lower than 100%

60
Q

What are ‘Meter Defect States’?

A

States where lineups are not actively monitored or have faulty meters

61
Q

What does AHA stand for?

A

Amazon Hardware Atlas

62
Q

What is the main function of AHA?

A

Data center health monitoring during event recovery and regular operations

63
Q

How often does AHA ping hosts in PROD and EC2 fabrics?

A

Every minute

64
Q

What indicates a rack is thermal impaired?

A

At least 25% of hosts are impaired and the rack was thermal critical before becoming impaired
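
The rule above can be combined with the five rack categories from card 48 in a small decision function. This is a sketch under assumptions: the cards do not define Impaired on its own or give an order of precedence, so the 25% Impaired threshold and the ordering below are guesses, and all names are hypothetical.

```python
from typing import Optional


def rack_category(impaired_fraction: Optional[float],
                  avg_temp_c: Optional[float],
                  was_thermal_critical: bool) -> str:
    """Place a rack in one of the five categories from the cards."""
    if impaired_fraction is None:
        return "Unknown"                      # no impairment data
    if impaired_fraction >= 0.25:
        # Impaired racks split on whether they were thermal critical first.
        return "Thermal Impaired" if was_thermal_critical else "Impaired"
    if avg_temp_c is not None and avg_temp_c >= 35.0:
        return "Thermal Critical"
    return "Normal"


print(rack_category(0.40, 38.0, was_thermal_critical=True))
print(rack_category(0.40, 27.0, was_thermal_critical=False))
print(rack_category(0.05, 36.0, was_thermal_critical=False))
```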

65
Q

What does AHA use to monitor host impairment status?

A

Ping data and thermal data

66
Q

What regions is AHA available in?

A
  • Classic Regions
  • BJS/ZHY
  • PDT/OSU
  • DCA
  • LCK
67
Q

What does the AHA Blast Radius feature allow operators to do?

A

Identify racks downstream of power topology equipment

68
Q

What information does the downstream rack page display?

A

Fabric and rack type breakdown, along with topology nodes

69
Q

What does Rack Splits provide information about?

A

Breakdown of racks in a Pod, including fabric type and supply

70
Q

What must you do to create an Impact Analysis in AHA?

A

Be part of the security group ‘aha-impact-analysis-admin’ and enter the Datacenter and Date

71
Q

What is required to create an Impact Analysis in AHA?

A

You must be part of the security group aha-impact-analysis-admin and in the InfraOps Central Ops org

This ensures only authorized users can create event analyses.

72
Q

What information must be entered to create an Impact Analysis?

A

Datacenter, Date, and Time (in UTC)

This data is essential for accurate event creation.
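
Because the time must be entered in UTC, a local event time has to be converted first, and the date can roll over in the process. A minimal sketch using the Python standard library is shown here. The function name and the returned field names are hypothetical and do not reflect the actual AHA form.

```python
from datetime import datetime, timedelta, timezone


def to_utc_fields(local_dt: datetime) -> dict:
    """Convert an offset-aware local time into UTC date and time strings."""
    if local_dt.tzinfo is None:
        raise ValueError("local_dt must carry a UTC offset")
    utc_dt = local_dt.astimezone(timezone.utc)
    return {
        "date": utc_dt.strftime("%Y-%m-%d"),
        "time_utc": utc_dt.strftime("%H:%M"),
    }


# 18:30 local time at UTC-8 falls on the next calendar day in UTC.
local_event = datetime(2024, 3, 5, 18, 30, tzinfo=timezone(timedelta(hours=-8)))
print(to_utc_fields(local_event))
```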

73
Q

How can users view different potential impacts during event creation?

A

By selecting a different date and time

This feature allows analysis of various scenarios.

74
Q

What is displayed on the next screen after selecting a date and time for Impact Analysis?

A

A timeline of impacted racks

This helps identify the starting point of the analysis.

75
Q

What must be entered in the confirmation modal when creating an event?

A

An associated ticket (or SIM ID)

This links the event to existing support tickets.

76
Q

What information is provided on the event analysis page?

A

Analysis of the event, currently impacted racks, and number of recovered hosts

This gives insights into the event’s impact and recovery status.

77
Q

What action can be taken after viewing the event analysis?

A

Click Post to Ticket to append a list of impacted racks

This facilitates communication with the support team.

78
Q

What currently happens to events after 10 hours?

A

They close automatically

This ensures timely event management.

79
Q

What is Seismo in the context of AHA?

A

The DC Availability Anomaly Detector that detects multi-rack impairments

It triggers alarms for quick response to significant issues.

80
Q

What is the temperature threshold for Seismo to cut a ticket to the FOC?

A

Above 35 degrees Celsius

This helps manage thermal anomalies in data centers.

81
Q

What is the purpose of HotSpot Redux?

A

It aggregates and processes server-level environmental sensor readings

This data is crucial for monitoring and predicting thermal events.

82
Q

What teams utilize HotSpot data?

A

AHA/InfraMap, Seismo, and DCS Science Team

They use this data to analyze and respond to potential thermal events.

83
Q

What is the function of the HWMon team in relation to HotSpot?

A

They publish source data from servers with software-based hypervisors

This is part of the data aggregation process.

84
Q

What is the legacy system that HotSpot Redux redesigned?

A

The legacy HotSpot service

The redesign improves data handling and analysis capabilities.

85
Q

What action should be taken for special feature requests?

A

Submit a ticket to get your ideas heard

This allows users to contribute to system improvements.