Troubleshooting Flashcards
Scenario 1
Imagine you run a video streaming service similar to Netflix on your own server, which runs a Linux distribution. Around 10 customers from across the globe use the service daily. It has run without problems under this load for the last 3 months, but today customers have complained that video playback is choppy and buffers constantly.
You have both remote and physical access to the server.
Multiple failures possible:
High I/O due to a failed HDD
Connectivity problems due to a faulty NIC or cable
High resource utilization due to a stuck process
Storage Full
IMPORTANT: Verify that the candidate does not drop the customers while checking. If they do, ask whether they think it is acceptable to disconnect customers without informing them beforehand.
(Customer Obsession)
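The "Storage Full" and "High resource utilization" hypotheses can be checked with read-only commands that do not interrupt connected customers. The following is a minimal Python sketch of such a check, not a definitive procedure. The function names, the 90% disk threshold, and the load ratios of 0.7 and 1.0 are assumptions chosen for illustration only.

```python
import os
import shutil


def disk_usage_percent(path="/"):
    """Return the percentage of used space on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def classify_load(load_1min, cpu_count):
    """Label the 1-minute load average relative to the number of CPUs.

    The 0.7 and 1.0 cut-offs are illustrative assumptions, not standards.
    """
    ratio = load_1min / cpu_count
    if ratio >= 1.0:
        return "overloaded"
    if ratio >= 0.7:
        return "busy"
    return "ok"


def report(path="/", full_threshold=90.0):
    """Collect read-only health indicators; nothing is stopped or restarted."""
    used = disk_usage_percent(path)
    load_1min = os.getloadavg()[0]  # available on Unix-like systems only
    cpus = os.cpu_count() or 1
    return {
        "disk_used_percent": round(used, 1),
        "disk_nearly_full": used >= full_threshold,
        "load_1min": load_1min,
        "load_state": classify_load(load_1min, cpus),
    }
```

Calling `report()` on the server returns a small dictionary of indicators. A nearly full disk or an "overloaded" load state only narrows the search; the failing HDD and NIC hypotheses still need their own checks.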
Scenario 2
Imagine you have a computer shop and a customer brings you his desktop because it reboots randomly whenever he uses it. The customer explains that he has already reinstalled the OS and that the system still keeps restarting randomly.
The computer has the following components:
>PSU
>MOBO
>2x HDD 1 TB each
>4x DIMMs 4 GB each
>1x CPU (quad-core i7)
>1x CPU fan
>1x VGA PCIe card
Which hardware component do you think would most commonly cause this behaviour?
Multiple failures possible:
>RAM failure
>Multiple RAM failures
>Memory controller failure
>MOBO failure
>PSU failure
Scenario 3
Imagine you are working in one of the Amazon Data Centers and a customer complains that his server keeps powering off after exactly 30 minutes of running.
The computer has the following components:
>PSU
>MOBO
>4x DIMMs 4 GB each
>2x HDD 1 TB each
>1x CPU (quad-core i7)
>1x CPU fan
>1x VGA PCIe card
The installed OS is a Linux distribution.
What would you say is the most common cause for this behaviour? How would you isolate it to demonstrate it is the correct one?
Multiple failures possible:
Overheating
>Due to dust
>Fan not working
>Thermal paste (CPU)
OS cron job stopping the computer every 30 minutes
CPU failure
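One way to isolate the overheating hypothesis is to record temperatures during the 30 minutes before the shutdown and see whether they climb steadily. The sketch below assumes a Linux system that exposes the standard sysfs thermal zones under /sys/class/thermal, where each `temp` file holds a value in millidegrees Celsius. The function names are hypothetical, and some servers expose no thermal zones at all, in which case the result is simply empty.

```python
import glob


def millidegrees_to_celsius(raw):
    """Convert a sysfs temperature string (millidegrees C) to degrees C."""
    return int(raw.strip()) / 1000.0


def read_thermal_zones(pattern="/sys/class/thermal/thermal_zone*/temp"):
    """Return {path: temperature in C} for every readable thermal zone."""
    readings = {}
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as handle:
                readings[path] = millidegrees_to_celsius(handle.read())
        except (OSError, ValueError):
            # Skip zones that cannot be read or that return unexpected data.
            continue
    return readings


def hottest(readings):
    """Return the highest temperature found, or None if nothing was read."""
    return max(readings.values()) if readings else None
```

Logging `hottest(read_thermal_zones())` once a minute gives a simple trend. A temperature that rises until the power-off points towards dust, the fan, or thermal paste, while a flat temperature points towards the scheduled-job or CPU failure hypotheses instead.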
Scenario 4
Imagine you are working in one of the Amazon Data Centers, and after extracting a drive from a server to replace it, you realize it is the wrong one. This is a RAID 5 server.
What would you do?
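Reinsert the drive, then locate the drive that is actually failing and remove that one. RAID 5 tolerates the loss of a single drive, so briefly removing one drive shouldn’t be an issue as long as the remaining drives are still online.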
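The reason a single missing drive is survivable in RAID 5 is that parity is the XOR of the data blocks in a stripe, so any one missing block can be recomputed from the others. The following is a simplified Python sketch of that idea. It uses one stripe with a fixed parity block and made-up 4-byte blocks, whereas a real RAID 5 array rotates parity across the drives.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(
        reduce(lambda a, b: a ^ b, column) for column in zip(*blocks)
    )


# One stripe spread over three data drives (illustrative contents only).
data = [b"AAAA", b"BBBB", b"CCCC"]

# The parity block is the XOR of all data blocks in the stripe.
parity = xor_blocks(data)

# Simulate pulling the second drive: its block is rebuilt from the
# surviving data blocks plus the parity block.
rebuilt = xor_blocks([data[0], data[2], parity])
```

Here `rebuilt` equals the block from the pulled drive. With two blocks missing from the same stripe there is not enough information left to recover either one, which is why pulling a second drive while another is already offline is dangerous.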