Critical Callouts Flashcards

Question 1

Q

What should a TSE not do on the XE8545, XE8640, XE9640, and XE9680 without a DE engaged?

Answer

A

Replace hardware, dispatch labor, or have the customer troubleshoot components internally.

Question 2

Q

What is the 1st step if internal troubleshooting is suspected to be needed?

Answer

A

Pull logs while the system is still in the faulted state: a debugging TSR, nvidia-bug-report.sh output (if GPUs are present), and OS-level logs (SOS report, VM Support Bundle).
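
As a rough sketch of the OS-side part of this step, the helper below prints which collection commands apply to a host. The function name and the labels `linux`, `esxi`, `gpu`, and `nogpu` are hypothetical, and mapping "VM Support Bundle" to the ESXi `vm-support` command is an assumption. The TSR is collected separately from iDRAC and is not covered here.

```shell
#!/usr/bin/env bash
# Print the OS-side log collection commands for a host.
# Run them while the fault is still present.
# Arguments are hypothetical labels: $1 = linux|esxi, $2 = gpu|nogpu.
log_collection_cmds() {
  local os="$1" gpu="${2:-nogpu}"
  # GPU systems: the NVIDIA driver installs nvidia-bug-report.sh.
  if [ "$gpu" = "gpu" ]; then
    echo "nvidia-bug-report.sh"
  fi
  case "$os" in
    linux) echo "sos report" ;;   # assumption: the sos package is installed
    esxi)  echo "vm-support" ;;   # assumption: "VM Support Bundle" means ESXi vm-support
    *)     echo "unknown OS label: $os" >&2; return 1 ;;
  esac
}

# Print the commands only; nothing is executed on the host.
log_collection_cmds linux gpu
```

The sketch prints the commands instead of running them, so the list can be reviewed with the customer first.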

Question 3

Q

What is the 2nd step if internal troubleshooting is suspected to be needed?

Answer

A

Verify the system configuration and perform firmware updates, including those for the PCIe switch boards.

Question 4

Q

PCIe Switch Boards are not reported in

Answer

A

iDRAC or the TSR

Question 5

Q

Verifying system config is paramount to ensure

Answer

A

proper PCIe switch firmware is applied.

Question 6

Q

PCIe switch firmware bundles are only executable

Answer

A

within the OS, can take 10 minutes or more, and need a power cycle to apply.

Question 7

Q

CPLD updates have to?

Answer

A

be applied by themselves.

Question 8

Q

What is the 3rd step if internal troubleshooting is suspected to be needed?

Answer

A

Power the system down and pull AC power after the firmware updates are applied.

Question 9

Q

If the customer is able, how should the system be powered down in step 3?

Answer

A

Physically remove the AC cables for more than 10 minutes.

Question 10

Q

What if the customer is not able to physically pull AC power?

Answer

A

Attempt to run a full AC power cycle through the F2 BIOS menus.

Question 11

Q

What is the 4th step if internal troubleshooting is suspected to be needed?

Answer

A

Collect dcgmi diag output.
Quick: dcgmi diag -r 1
Full: dcgmi diag -r 4
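
To help memorize the two run levels, here is a minimal sketch that maps a run type to the command on this card. The function name `dcgmi_diag_cmd` and the labels `quick` and `full` are hypothetical. The sketch only prints the command, so it is safe on a host without DCGM installed.

```shell
#!/usr/bin/env bash
# Map a run type to the dcgmi diag command listed on the card.
dcgmi_diag_cmd() {
  case "$1" in
    quick) echo "dcgmi diag -r 1" ;;
    full)  echo "dcgmi diag -r 4" ;;
    *)     echo "unknown run type: $1" >&2; return 1 ;;
  esac
}

# Print both commands; nothing is executed against the GPUs.
dcgmi_diag_cmd quick
dcgmi_diag_cmd full
```

On a real system the printed command would be run directly and its output saved for the DE.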

Question 12

Q

What is the 5th step if internal troubleshooting is suspected to be needed?

Answer

A

DE engagement is necessary.

Question 13

Q

What is DCGM?

Answer

A

NVIDIA's Data Center GPU Manager: a suite of tools for managing and monitoring NVIDIA datacenter GPUs in cluster environments.

Question 14

Q

What does DCGM include?

Answer

A

Active health monitoring, comprehensive diagnostics, system alerts, and governance policies, including power and clock management.

Question 15

Q

Top 5 Things GPUs can be used for

Answer

A

Scientific computing (solving complex physics)
Machine learning