A Critical Site Event: Flashcards
What is a Critical Site Event (CSE)?
A CSE is a highly visible, urgent event with the potential for large, direct customer impact and a high risk of load loss. It is triggered by a loss of redundancy or resiliency in a system.
Example: A loss of supply power to the cooling systems that risks overheating without direct customer impact.
What defines a Large Scale Event (LSE)?
An LSE is an event that impacts customers’ ability to connect, is reported by a customer contact, and indicates that customers are affected by a loss of service.
Example: A similar loss of power to the cooling systems, but one where customers make contact.
What is the primary difference between a CSE and an LSE?
CSEs are triggered by events in which service remains available but customers may be impacted, while LSEs are triggered by customer contact or by events that result in loss of service.
What is the role of the DCEO team?
The DCEO team is responsible for maintaining all critical infrastructure within data centers globally.
What should be done when a critical alarm is triggered?
DCEO needs to respond immediately by acknowledging the alarm and assessing the situation at the alarm’s location.
List the phases in the event management process.
- React: Acknowledge the alarm
- Investigate: Engage with the equipment
- Communicate: Update and contact stakeholders
- Fault Find: Identify the root cause
- Update: Keep relevant parties informed
- Stabilize and Restore: Implement mitigation steps.
Who is the First Responder?
The First Responder is any person onsite responsible for responding to alarms and notifying surrounding EOTs of the alarm.
What is the responsibility of the Incident Commander (IC)?
The IC is responsible for communications during the event, ensuring proper handling, providing updates, and escalating the event as needed.
Fill in the blank: The _______ monitors and prioritizes alarms globally.
[Facility Operations Center (FOC)]
What are the InfraOps Tenets?
- Safe Work Environment
- Security
- Prepare for the Improbable
- Automation
- Speed Matters
- Serviceability
- Continuous Improvement.
What should be done if there is a loss of redundancy?
The First Responder should investigate potential customer impact, assess conditions, and then escalate as needed.
True or False: The Primary On-Call is typically an On-Site Facility Manager.
True.
What steps should be taken when escalations are necessary?
Communicate early and often, assess options, redirect network traffic, and request additional resources.
What is the primary responsibility of the Call Leader?
Leads the FOC conference call during customer-impacting or redundancy/resiliency-impacting events and keeps the Response Team focused on recovery efforts.
What happens if a critical alarm is received?
The First Responder EOT needs to react immediately.
What is the significance of the FOC in the event management process?
The FOC provides first-level support, monitors alarms, and helps resolve events on a 24-hour basis.
What actions should the First Responder take during an event?
- Contact another member to act as IC
- Establish communication with IC
- Validate alarms and escalate if required.
What is the escalation process for customer-impacting LSEs?
All customer-impacting LSEs must be escalated to the Cluster Manager for mitigation planning and recovery approval.
Fill in the blank: If issues require escalation to the Regional/Cluster Manager, prepare an email to Amazon Senior _______.
[Leadership]
What should be assessed before attempting any resets after a power event?
Any damage
Important to ensure safety and proper functioning before resetting systems.
What is the first action taken by FOC when they see an alarm?
Cuts a TT to the affected site
TT stands for Trouble Ticket.
If the FOC does not pick up on an alarm, what should the on-call EOT do?
Get in contact with the FOC ASAP
Ensures timely response to alarms.
What should the EOT on site do if they cannot get through to the FOC?
Create your own TT and work from that
This allows for independent action in critical situations.
Which button should the EOT taking up the IC position use for events?
CSE Power or Thermal event button
Available through a TamperMonkey Script.
What tools are needed to address script-inserted questions?
- Doors of Durin
- AHA and its CSE Response Tool
- GRC
- BMS/EPMS
- Holocron
- The First Responder
- The FOC
- The DCO Team
These tools aid in managing and responding to events effectively.
What is the main focus of the First Responder during an alarm investigation?
Fault finding and troubleshooting
They communicate only with the IC.
What does the IC do when the First Responder provides updates?
- Contacts the FOC to spin up a conference call if required
- Escalates using the suitable escalation path
- Supplies regular updates to the ticket and conference call
Keeping communication clear and consistent is essential.
What information should the IC update in the TT regarding critical load support?
- Impact YES/NO
- Potential impact
- Staff onsite and roles
- How critical load is supported
- UPS Autonomy Times
- Vendor engagement status
- Access arrangements
- Status of affected POD/Electrical room
This information is crucial for assessing the situation accurately.
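As a study aid, the checklist above can be sketched as a small record that renders a plain-text TT update. This is only an illustration: the class name `CriticalLoadUpdate`, its field names, and the output layout are hypothetical and do not reflect any real ticketing tool.

```python
from dataclasses import dataclass


@dataclass
class CriticalLoadUpdate:
    """Hypothetical record of the fields the IC adds to the TT."""
    impact: bool              # Impact YES/NO
    potential_impact: str
    staff_onsite: str         # staff onsite and their roles
    load_support: str         # how critical load is supported
    ups_autonomy: str         # UPS autonomy times
    vendor_status: str        # vendor engagement status
    access_arrangements: str
    room_status: str          # status of affected POD/Electrical room

    def render(self) -> str:
        """Return the update as plain text, one checklist item per line."""
        return "\n".join([
            f"Impact: {'YES' if self.impact else 'NO'}",
            f"Potential impact: {self.potential_impact}",
            f"Staff onsite and roles: {self.staff_onsite}",
            f"Critical load supported by: {self.load_support}",
            f"UPS autonomy times: {self.ups_autonomy}",
            f"Vendor engagement: {self.vendor_status}",
            f"Access arrangements: {self.access_arrangements}",
            f"POD/Electrical room status: {self.room_status}",
        ])


# Example values are invented for illustration only.
update = CriticalLoadUpdate(
    impact=False,
    potential_impact="Loss of redundancy on one lineup",
    staff_onsite="A. Example (IC), B. Example (First Responder)",
    load_support="Generator",
    ups_autonomy="UPS-1: 12 min",
    vendor_status="Vendor called, en route",
    access_arrangements="Escort arranged at main gate",
    room_status="Electrical room stable",
)
print(update.render())
```

Keeping every field mandatory means an update cannot be built with a checklist item silently left out.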
What is a CSE?
An infrastructure event impacting two or more racks
Stands for Critical Site Event.
What is the purpose of the CSE Response Tool?
To diagnose a CSE by providing critical data
It offers insights into host and rack impairment, server temperature data, and critical electrical metrics.
What should the IC do if they need to escalate an issue?
Make sure to speak to someone directly
Emails, texts, or chimes do not count as escalation.
What is the role of the IC during a conference call?
Clearly identify themselves and remain on the call until resolved
Important for maintaining clarity and continuity.
What steps should the IC take if there is a loss of redundancy?
Take immediate action if critical load/mechanical load has not transferred
This may involve escalating the situation and conducting tests.
Fill in the blank: The IC is expected to use the _______ button on the TT for thermal events.
CSE Thermal Event
This differentiates between power and thermal incidents.
What must be updated in the TT regarding the status of AHU/CRAH modules?
- Temperature in Pod
- Temperature change over the last 5-10 minutes
- Status of remaining healthy modules
- Number of fault-free modules
- Status of associated HSDB/MSDB
Critical for managing thermal events effectively.
What is the primary function of the CSE Response Tool?
To provide high-level and detailed views into the status of racks, hosts, and electrical infrastructure in a data center during Critical Site Events
What type of data does the CSE Response Tool incorporate?
Hotspot data for thermal critical racks and average rack temperatures
List the benefits of the CSE Response Tool
- Shows important visuals on one page
- Improves HotSpot alarms by monitoring aggregated thermal data
- Initiates alarms based on rate-of-change thresholds
Define ‘Up’ status for racks in the CSE Response Tool.
Fewer than 25% of the operational hosts in the rack are not reachable from the network
What does ‘Down’ status indicate for a rack?
At least 25% of the operational hosts in the rack are not reachable from the network
What temperature indicates a rack is ‘Thermal Critical’?
The average temperature of the hosts in the rack is equal to or greater than 35°C
What does ‘Unknown’ status mean for a rack?
There is not enough information to determine the rack status
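The rack status definitions above can be expressed as a minimal sketch. The 25% and 35°C thresholds come from the cards. The function names, the input shapes, and the choice to return "Unknown" when counts are missing are assumptions made for illustration, not the tool's actual logic.

```python
from typing import List, Optional

DOWN_FRACTION = 0.25        # at least 25% of operational hosts unreachable means Down
THERMAL_CRITICAL_C = 35.0   # average host temperature of 35 °C or more


def rack_network_status(operational_hosts: Optional[int],
                        unreachable_hosts: Optional[int]) -> str:
    """Classify a rack as Up, Down, or Unknown from host reachability counts."""
    if operational_hosts is None or unreachable_hosts is None or operational_hosts <= 0:
        return "Unknown"    # assumption: missing or empty data means Unknown
    fraction = unreachable_hosts / operational_hosts
    return "Down" if fraction >= DOWN_FRACTION else "Up"


def is_thermal_critical(host_temps_c: List[float]) -> Optional[bool]:
    """Return True if the average host temperature is at or above the threshold.

    Returns None when no temperatures are available.
    """
    if not host_temps_c:
        return None
    average = sum(host_temps_c) / len(host_temps_c)
    return average >= THERMAL_CRITICAL_C


print(rack_network_status(40, 9))         # 22.5% unreachable
print(rack_network_status(40, 10))        # exactly 25% unreachable
print(is_thermal_critical([34.0, 36.0]))  # average is exactly 35.0
```

Note the boundary behavior: exactly 25% unreachable counts as Down, and an average of exactly 35°C counts as Thermal Critical, matching the "at least" and "equal to or greater than" wording.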
How does the InfraMap floor map enhance rack status monitoring?
It includes rack status and thermal status data for spatial identification of issues
What information does the InfraMap SOS dashboard provide?
Electrical infrastructure status data including utility, generator, or UPS power readings
Who receives alerts when thermal CSE conditions are met?
- A member of the FOC monitoring a potential CSE
- A Field Engineer remotely monitoring the site
- A local DCEO EOT
What criteria create a potential thermal CSE alert?
- Average host temperatures in a room equal to or over 35°C
- More than 100 thermal critical racks with over 10 racks reporting inlet temperature
- Average host temperatures diverging more than 5°C from the 30-minute trailing average
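The three criteria above can be sketched as a function that reports which ones are currently met. The thresholds are taken from the cards. Everything else is an assumption: the function and parameter names are hypothetical, "diverging" is treated as an absolute difference, and the cards do not say whether the criteria combine with AND or OR, so the sketch lists the matches and leaves that decision to the caller.

```python
from statistics import mean
from typing import List


def thermal_criteria_met(room_host_temps_c: List[float],
                         trailing_avg_c: float,
                         thermal_critical_racks: int,
                         racks_reporting_inlet: int) -> List[str]:
    """Return the names of the thermal CSE criteria that are currently met."""
    met = []
    current_avg = mean(room_host_temps_c)

    # Criterion 1: room average at or above 35 °C.
    if current_avg >= 35.0:
        met.append("room_average")

    # Criterion 2: read literally from the card.
    if thermal_critical_racks > 100 and racks_reporting_inlet > 10:
        met.append("critical_rack_count")

    # Criterion 3: more than 5 °C away from the 30-minute trailing average
    # (assumption: divergence in either direction counts).
    if abs(current_avg - trailing_avg_c) > 5.0:
        met.append("rate_of_change")

    return met


print(thermal_criteria_met([30.0, 32.0], trailing_avg_c=25.0,
                           thermal_critical_racks=0, racks_reporting_inlet=0))
```

The rate-of-change check is the one that can fire while the room is still well below 35°C, which is why it is useful as an early warning.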
What is the maximum time span for event de-duplication in thermal CSE alerts?
30 minutes
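A de-duplication window like this can be illustrated with a short sketch. Only the 30-minute value comes from the card. The class name, the per-site keying, and measuring the window from the last alert that was actually sent are assumptions.

```python
from datetime import datetime, timedelta
from typing import Dict

DEDUP_WINDOW = timedelta(minutes=30)  # from the card


class AlertDeduplicator:
    """Suppress repeat alerts for the same site inside the de-duplication window."""

    def __init__(self) -> None:
        self._last_sent: Dict[str, datetime] = {}

    def should_send(self, site: str, now: datetime) -> bool:
        """Return True and record the alert if none was sent within the window."""
        last = self._last_sent.get(site)
        if last is not None and now - last < DEDUP_WINDOW:
            return False  # duplicate within the window, suppress it
        self._last_sent[site] = now
        return True


dedup = AlertDeduplicator()
start = datetime(2024, 1, 1, 12, 0)
print(dedup.should_send("SITE-A", start))                          # first alert
print(dedup.should_send("SITE-A", start + timedelta(minutes=10)))  # inside window
print(dedup.should_send("SITE-A", start + timedelta(minutes=30)))  # window elapsed
```

Suppressed alerts do not reset the timer in this sketch, so a continuous stream of triggers still produces one alert per window.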
What categories are racks separated into within the CSE Response Tool?
- Thermal Critical
- Thermal Impaired
- Impaired
- Normal
- Unknown
What key statuses are displayed on the CSE Dashboard?
- Event-Related Rack Downs
- At-Risk Racks
- Event Impact
- Electrical Infrastructure Status
- Down Rack Count
- IT Load and ATS Monitoring Status
What does the Thermal Status tab show?
Rack thermal status, thermal impact count, and rack average temperatures
What does the Rack Detail page provide?
Details about a specific rack and the hosts it contains
What type of data does the CSE Response Tool display during a CSE?
Critical electrical infrastructure data alongside down racks data
What is the purpose of the SOS Dashboard?
To view high-level electrical utility, generator, and UPS states for every lineup in a data center
What does a reading of zero in the UPS column indicate?
The UPS is in use
What scenario indicates the data center is operating normally?
All meters are green
What does the term ‘On Generator’ indicate?
The generator is active and producing output load
What indicates a lineup is on UPS?
Both USB and generator readings are at zero
Describe an ‘Edge Case’ scenario.
The generator is being tested under load, so its reading is greater than zero while the USB still shows power from the utility
What happens in a ‘Live Load Transfer - Failed to Transfer’ scenario?
Load is shown on the generator meter and both USB inputs, with a UPS reading lower than 100%
What are ‘Meter Defect States’?
States where lineups are not actively monitored or have faulty meters
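Several of the SOS Dashboard scenarios above can be summarized in a simplified sketch that classifies a lineup from its USB and generator readings. The function name, the use of kW, the "On Utility" label, and representing an unmonitored or faulty meter as `None` are all assumptions. The sketch deliberately ignores the UPS column and the "Failed to Transfer" scenario, because the cards do not give enough detail to model them.

```python
from typing import Optional


def lineup_state(usb_kw: Optional[float], generator_kw: Optional[float]) -> str:
    """Classify a lineup from its USB (utility-side) and generator readings."""
    if usb_kw is None or generator_kw is None:
        return "Meter Defect"   # unmonitored lineup or faulty meter
    if usb_kw == 0 and generator_kw == 0:
        return "On UPS"         # both readings at zero
    if usb_kw > 0 and generator_kw > 0:
        return "Edge Case"      # e.g. generator tested under load
    if generator_kw > 0:
        return "On Generator"   # generator is producing output load
    return "On Utility"         # hypothetical label for the normal case


for readings in [(0, 0), (0, 500), (400, 50), (400, 0), (None, 0)]:
    print(readings, "->", lineup_state(*readings))
```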
What does AHA stand for?
Amazon Hardware Atlas
What is the main function of AHA?
Data center health monitoring during event recovery and regular operations
How often does AHA ping hosts in PROD and EC2 fabrics?
Every minute
What indicates a rack is thermal impaired?
At least 25% of hosts are impaired and the rack was thermal critical before becoming impaired
What does AHA use to monitor host impairment status?
Ping data and thermal data
What regions is AHA available in?
- Classic Regions
- BJS/ZHY
- PDT/OSU
- DCA
- LCK
What does the AHA Blast Radius feature allow operators to do?
Identify racks downstream of power topology equipment
What information does the downstream rack page display?
Fabric and rack type breakdown, along with topology nodes
What does Rack Splits provide information about?
Breakdown of racks in a Pod, including fabric type and supply
What must you do to create an Impact Analysis in AHA?
Be part of the security group ‘aha-impact-analysis-admin’ and enter the Datacenter, Date, and Time (in UTC)
What is required to create an Impact Analysis in AHA?
You must be part of the security group aha-impact-analysis-admin and in the InfraOps Central Ops org
This ensures only authorized users can create event analyses.
What information must be entered to create an Impact Analysis?
Datacenter, Date, and Time (in UTC)
This data is essential for accurate event creation.
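Because the time must be entered in UTC, a local timestamp has to be converted first. The sketch below shows one way to do that with the Python standard library. The function name and the output formats are hypothetical; the cards do not specify how the form expects the date and time to be written.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple


def to_utc_fields(local_dt: datetime) -> Tuple[str, str]:
    """Convert a timezone-aware local datetime to (date, time) strings in UTC."""
    if local_dt.tzinfo is None:
        # A naive datetime has no offset, so the conversion would be a guess.
        raise ValueError("local_dt must be timezone-aware")
    utc_dt = local_dt.astimezone(timezone.utc)
    return utc_dt.strftime("%Y-%m-%d"), utc_dt.strftime("%H:%M")


# Example: 22:30 at UTC-5 falls on the next calendar day in UTC.
local = datetime(2024, 3, 1, 22, 30, tzinfo=timezone(timedelta(hours=-5)))
print(to_utc_fields(local))
```

The date can change during conversion, so the Date field should be taken from the converted value as well, not from the local calendar day.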
How can users view different potential impacts during event creation?
By selecting a different date and time
This feature allows analysis of various scenarios.
What is displayed on the next screen after selecting a date and time for Impact Analysis?
A timeline of impacted racks
This helps identify the starting point of the analysis.
What must be entered in the confirmation modal when creating an event?
An associated ticket (or SIM ID)
This links the event to existing support tickets.
What information is provided on the event analysis page?
Analysis of the event, currently impacted racks, and number of recovered hosts
This gives insights into the event’s impact and recovery status.
What action can be taken after viewing the event analysis?
Click Post to Ticket to append a list of impacted racks
This facilitates communication with the support team.
What currently happens to events after 10 hours?
They close automatically
This ensures timely event management.
What is Seismo in the context of AHA?
The DC Availability Anomaly Detector that detects multi-rack impairments
It triggers alarms for quick response to significant issues.
What is the temperature threshold for Seismo to cut a ticket to the FOC?
Above 35 degrees Celsius
This helps manage thermal anomalies in data centers.
What is the purpose of HotSpot Redux?
It aggregates and processes server-level environmental sensor readings
This data is crucial for monitoring and predicting thermal events.
What teams utilize HotSpot data?
AHA/InfraMap, Seismo, and DCS Science Team
They use this data to analyze and respond to potential thermal events.
What is the function of the HWMon team in relation to HotSpot?
They publish source data from servers with software-based hypervisors
This is part of the data aggregation process.
What is the legacy system that HotSpot Redux redesigned?
The legacy HotSpot service
The redesign improves data handling and analysis capabilities.
What action should be taken for special feature requests?
Submit a ticket to get your ideas heard
This allows users to contribute to system improvements.