Critical Callouts Flashcards

1
Q

What should a TSE not do on the XE8545, XE8640, XE9640, and XE9680 without a DE engaged?

A

Replace hardware, dispatch labor, or have a customer troubleshoot internal components.

2
Q

What is the 1st step when internal troubleshooting is suspected to be needed?

A

Pull logs while the system is still in the faulted state: debugging TSR, nvidia-bug-report.sh (if GPU), and OS-level logs (sos report, VM support bundle).
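
As a rough illustration of the GPU and OS log pull on a Linux host, the sketch below wraps the two collectors in one function. The names collect_fault_logs, run, DRY_RUN, and the output directory are hypothetical helpers for this example, not part of any Dell or NVIDIA procedure. It assumes the NVIDIA driver (which ships nvidia-bug-report.sh) and the sos package are installed. The TSR is collected separately through iDRAC and is not covered here.

```shell
#!/bin/sh
# Sketch only: gather GPU and OS logs while the fault is still present.
# collect_fault_logs, run, and DRY_RUN are illustrative names (assumptions).

run() {
    # In dry-run mode, print the command instead of executing it.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

collect_fault_logs() {
    out_dir=${1:-./fault-logs}
    run mkdir -p "$out_dir"

    # GPU logs; skip this line on systems without NVIDIA GPUs.
    run nvidia-bug-report.sh --output-file "$out_dir/nvidia-bug-report.log"

    # OS logs; older sos releases use the "sosreport" command instead.
    run sos report --batch --tmp-dir "$out_dir"
}
```

Running it with DRY_RUN=1 first (for example, DRY_RUN=1 followed by collect_fault_logs /tmp/case-logs) shows the commands without touching the system. Both collectors normally need root.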

3
Q

What is the 2nd step when internal troubleshooting is suspected to be needed?

A

Verify the system configuration and perform firmware updates, including the PCIe switch boards.

4
Q

PCIe Switch Boards are not reported in

A

iDRAC or the TSR.

5
Q

Verifying system config is paramount to ensure

A

the proper PCIe switch firmware is applied.

6
Q

PCIe switch firmware bundles are only executable

A

within the OS. They can take 10 minutes or more and need a power cycle to apply.

7
Q

CPLD updates have to

A

be applied by themselves.

8
Q

What is the 3rd step when internal troubleshooting is suspected to be needed?

A

The system must be powered down and AC pulled after firmware updates are applied.

9
Q

If the customer is able, how should the system be powered down in step 3?

A

Physically remove the AC cables for more than 10 minutes.

10
Q

What if the customer is not able to physically pull AC?

A

Attempt to run a full AC power cycle through the F2 BIOS menus.

11
Q

What is the 4th step when internal troubleshooting is suspected to be needed?

A

Collect dcgmi diag output.
Quick: dcgmi diag -r 1
Full: dcgmi diag -r 4
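
A minimal sketch of capturing that diagnostic output to a file so it can be attached to the case. The function name run_dcgm_diag, the DRY_RUN switch, and the default file name are assumptions made for this example. It presumes DCGM is installed and its host engine is running. Run levels 2 and 3 also exist in DCGM as intermediate levels, although the card only calls out 1 and 4.

```shell
#!/bin/sh
# Sketch only: run a DCGM diagnostic and keep the output for the DE.
# run_dcgm_diag, DRY_RUN, and the file naming are illustrative (assumptions).

run_dcgm_diag() {
    level=${1:-1}                                 # 1 = quick, 4 = full
    out_file=${2:-dcgmi-diag-r${level}.txt}

    # Reject anything that is not a known run level.
    case "$level" in
        1|2|3|4) ;;
        *) echo "unsupported run level: $level" >&2; return 2 ;;
    esac

    # In dry-run mode, only show what would be executed.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "would run: dcgmi diag -r $level > $out_file"
        return 0
    fi

    # Save both stdout and stderr so failures are captured as well.
    dcgmi diag -r "$level" > "$out_file" 2>&1
}
```

For example, run_dcgm_diag 1 gives the quick check and run_dcgm_diag 4 full-diag.txt the full one. The full run can take considerably longer and loads the GPUs, so it is best scheduled with the customer.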

12
Q

What is the 5th step when internal troubleshooting is suspected to be needed?

A

DE engagement is necessary.

13
Q

What is DCGM?

A

NVIDIA's Data Center GPU Manager: a suite of tools for managing and monitoring NVIDIA datacenter GPUs in cluster environments.

14
Q

What does DCGM include?

A

Active health monitoring, comprehensive diagnostics, system alerts, and governance policies, including power and clock management.

15
Q

Top 5 Things GPUs can be used for

A

Scientific computing (solving complex physics)
Machine learning
