Critical Callouts Flashcards
What shouldn’t a TSE do without having a DE engaged on the XE8545, XE8640, XE9640 and XE9680?
Replace hardware, dispatch labor, or have a customer internally troubleshoot components.
1st step if it is suspected that internal troubleshooting is needed
Pull logs while the system is in the faulted state: debugging TSR, nvidia-bug-report.sh (if GPU), and OS-level logs (SOS report, VM support bundle).
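As a study aid for the card above, here is a minimal sketch that checks which OS-level log collectors are present on a host before anything is rebooted. The tool list is an assumption about a typical Linux or ESXi install (the `sos`/`sosreport` and `vm-support` command names are not stated in this deck), and the helper name `available_log_tools` is hypothetical. It only looks the tools up on PATH and does not collect a TSR.

```python
import shutil

# Assumed collector names for a typical host; adjust to the customer's OS.
LOG_TOOLS = {
    "nvidia-bug-report.sh": "NVIDIA driver bug report (GPU systems)",
    "sos": "SOS report (newer sos releases)",
    "sosreport": "SOS report (legacy command name)",
    "vm-support": "VMware ESXi support bundle",
}


def available_log_tools(which=shutil.which):
    """Return {tool: description} for every collector found on PATH.

    `which` is injectable so the lookup can be exercised without the tools installed.
    """
    return {name: desc for name, desc in LOG_TOOLS.items() if which(name)}


if __name__ == "__main__":
    found = available_log_tools()
    for name, desc in found.items():
        print(f"{name}: {desc}")
    if not found:
        print("No known log collectors found on PATH")
```

Running the script on the faulted system lists the collectors that can be used right away; the logs themselves still have to be gathered before any power cycle.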
2nd step if suspected internal troubleshooting is needed
Verify the system config and perform firmware updates, including the PCIe switch boards.
PCIe Switch Boards are not reported in
iDRAC or the TSR
Verifying system config is paramount to ensure
proper PCIe switch firmware is applied.
PCIe switch firmware bundles are only executable
within the OS; they can take 10 minutes or more and need a power cycle to apply.
CPLD updates have to
be applied by themselves.
3rd step if suspected internal troubleshooting is needed
The system must be powered down and AC pulled after firmware updates are applied.
If the customer is able, how would we power down in step 3 of suspected internal troubleshooting?
Physically remove the AC cables for more than 10 minutes.
If a customer is not able to physically pull AC?
Attempt to run a full AC power cycle through the F2 BIOS menus.
4th step if suspected internal troubleshooting is needed
dcgmi diag output would be needed.
Quick: dcgmi diag -r 1
Full: dcgmi diag -r 4
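The two run levels on this card can be wrapped in a small helper so the right command is issued consistently. This is a minimal sketch: the `quick`/`full` mode names and the functions `dcgmi_diag_command` and `run_diag` are hypothetical, and only the `dcgmi diag -r 1` and `dcgmi diag -r 4` invocations come from the card itself.

```python
import subprocess

# Run levels from the card: 1 = quick, 4 = full.
DIAG_LEVELS = {"quick": 1, "full": 4}


def dcgmi_diag_command(mode="quick"):
    """Build the dcgmi argv for the requested diagnostic mode."""
    if mode not in DIAG_LEVELS:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(DIAG_LEVELS)}")
    return ["dcgmi", "diag", "-r", str(DIAG_LEVELS[mode])]


def run_diag(mode="quick"):
    """Run the diagnostic and return its text output (requires DCGM on the host)."""
    result = subprocess.run(
        dcgmi_diag_command(mode),
        capture_output=True,
        text=True,
        check=False,  # keep the output even if the diagnostic reports failures
    )
    return result.stdout + result.stderr


if __name__ == "__main__":
    print(" ".join(dcgmi_diag_command("quick")))
    print(" ".join(dcgmi_diag_command("full")))
```

Saving the returned text to a file gives the diag output that the DE will ask for in the next step.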
5th step if suspected internal troubleshooting is needed
DE engagement is necessary.
What is DCGM?
NVIDIA's Data Center GPU Manager. A suite of tools for managing and monitoring NVIDIA datacenter GPUs in cluster environments.
What does DCGM include?
Active health monitoring, comprehensive diagnostics, system alerts, and governance policies including power and clock management.
Top 5 Things GPUs can be used for
Scientific computing (solving complex physics)
Machine learning