Troubleshooting and Performance Optimization Flashcards
troubleshooting methodology
- identify problem
- establish theory of probable cause
- test the theory
- establish plan of action
- implement a solution/escalate
- verify functionality
- perform root cause analysis
- document the solution
refined troubleshooting
- identify problem scope
- reproduce the problem
- check log files
- read documentation
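Checking log files can be partly scripted. The following is a minimal sketch, assuming plain-text logs in which severity shows up as a keyword somewhere in the line. The keyword list, the function name, and the sample lines are all made up for illustration; real log formats vary by OS and application.

```python
from collections import Counter

# Hypothetical severity keywords (an assumption, not a standard list).
KEYWORDS = ("ERROR", "CRITICAL", "FAIL")

def count_severity(lines, keywords=KEYWORDS):
    """Count how many log lines mention each keyword, ignoring case."""
    counts = Counter()
    for line in lines:
        upper = line.upper()
        for keyword in keywords:
            if keyword in upper:
                counts[keyword] += 1
    return counts

# Made-up sample lines, used only to exercise the function.
sample = [
    "Jan 01 10:00:01 host app: started",
    "Jan 01 10:00:05 host app: error: disk not found",
    "Jan 01 10:00:09 host app: critical: volume offline",
]
summary = count_severity(sample)
```

A count like this only narrows down where to look; the matching lines still need to be read in context.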
BIOS failure possible causes
- overheating
- unsupported features
- newer options may require UEFI
BIOS failure possible solutions
- keep server rooms/data centers properly ventilated
- update (flash) BIOS
- acquire UEFI motherboards/enable UEFI options
POST failure possible causes
- TPM firmware detects a boot configuration change
- failed hardware components
POST failure possible solutions
- enter TPM recovery code/configure boot options
- search for reported POST code to identify problem
- replace failed components
memory failure possible causes
- POST failure message
- random OS freezes/reboots
memory failure possible solutions
- run memory diagnostics
- replace failed components
processor failure/performance degradation possible causes
- overheating
- throttling slows CPU as temperature increases
- VMs with manual CPU affinity specified are performing poorly
processor failure/performance degradation possible solutions
- ensure HVAC is running correctly
- don’t manually link VMs to specific CPU cores
boot sequence failure possible causes
- OS not found due to changing disk order/partitions
- booting from USB might fail if not enabled in BIOS
boot sequence failure possible solutions
- configure bootable disk order in BIOS
- configure bootable disk partitions in OS
- flash BIOS so USB boot is supported
storage failure possible causes
- drive failure
- RAID array drive failures resulting in slow performance
storage failure possible solutions
- run disk diagnostics
- replace failed drives
- have hot spare disks in place
power failure possible causes
- power supply
- power surge
power failure possible solutions
- use redundant power sources
- use UPSs
- use surge protectors
environment failure possible causes
- HVAC malfunctioning causes overheating
- accumulated dust hampers airflow/adds layer of insulation
- low humidity increases ESD
environment failure possible solutions
- ensure HVAC is running properly
- clear dust from components/air intake fans
- ensure HVAC keeps consistent relative humidity
crash cart tools
- multimeter to test power supplies
- hardware diagnostics tools for components
- can of compressed air to remove dust
- antistatic wrist strap/ESD mats
- tools for testing bad RAM chips
logon failure possible causes
- incorrect credentials
- corrupt user profile
- can’t locate authentication server
logon failure possible solutions
- reset user password
- save old user profile/remove corrupt user profile and registry references
- ensure client station points to correct DNS server
user unable to access resource possible causes
- insufficient permissions
- encryption is enabled
- Windows UAC configuration is too restrictive
- UNIX/Linux sudo is not configured to enable user access to certain commands
user unable to access resource possible solutions
- check user effective access
- check group membership
- ensure user has decryption key
- loosen UAC settings
- modify sudoers configuration file
memory leak possible causes
- poorly written software
- malware
- runaway processes
memory leak possible solutions
- reboot server to reclaim memory
- run antimalware scan
- patch software
- find functionally equivalent software
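Before rebooting, it can help to confirm that memory use really is climbing. The sketch below assumes memory readings for one process (in MB) have already been collected at regular intervals by some monitoring tool. The function name and the threshold are hypothetical, and a steady rise is only a hint of a leak, not proof.

```python
def grows_steadily(samples_mb, min_increase_mb=1.0):
    """Return True if every sample exceeds the previous one by at least min_increase_mb."""
    if len(samples_mb) < 2:
        # Not enough data to judge a trend.
        return False
    pairs = zip(samples_mb, samples_mb[1:])
    return all(later - earlier >= min_increase_mb for earlier, later in pairs)

# Illustrative readings only, not measurements from a real system.
leaking = grows_steadily([100, 120, 145, 170])
stable = grows_steadily([100, 120, 110, 115])
```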
blue screen of death (BSOD)/hang/crashes possible causes
- unstable device driver
- bad RAM chips
- memory buffer overrun due to unpatched software
BSOD/hang/crashes possible solutions
- update/replace/roll back driver
- run memory diagnostics
- patch software
- replace failed RAM chip
- restart Windows server/press F8/attempt to boot using last known good configuration (LKGC)
purple screen of death (PSoD) in VMware ESXi possible causes
- most commonly related to VMkernel critical errors
PSoD in VMware ESXi possible solutions
- apply OS and driver updates/roll back updates
- remove recently added hardware/test stability
- review memory core dump file
- check ESXi scratch partition for the vm-support output file
disk drive unmountable possible causes
- file system corruption
- not supported by local OS
disk drive unmountable possible solutions
- run disk scan to correct file system errors
- format drive with file system supported by local OS
logs can’t be written to possible causes
- log disk volume is full
logs can’t be written to possible solutions
- free up disk space
- store logs in alternate location
- archive old log messages
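A full log volume can be caught early with a periodic check. This is a minimal sketch using Python's standard `shutil.disk_usage`. The 90 percent threshold and the function names are assumptions chosen for illustration.

```python
import shutil

def percent_used(total_bytes, used_bytes):
    """Return used space as a percentage of total space."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes

def volume_nearly_full(path, threshold_percent=90.0):
    """Return True if the volume holding `path` is at or above the threshold."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return percent_used(usage.total, usage.used) >= threshold_percent

# Check the volume that holds the current directory.
nearly_full = volume_nearly_full(".")
```

In practice the path would be the directory where logs are written, and a True result would trigger cleanup or archiving.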
slow OS performance possible causes
- OS disk is full
- disks are fragmented
- system resources lacking
- CPUs are busy
- VM memory swap file (page file)/partition is on slow disk/is corrupt
slow OS performance possible solutions
- free up space on OS drive
- extend OS drive capacity
- defragment drive
- reduce number of processes running concurrently
- place VM swap file (page file) on fast disks
- enable disk write caching
software patches not being applied possible causes
- previous software dependencies aren’t present
- patches don’t match platform architecture
- software has reached end of life
- syncing updates with downstream servers is failing in enterprise environments
software patches not being applied possible solutions
- apply previous dependencies first
- acquire patches for appropriate platform architecture
- acquire newer versions of software that are supported
- check for network connectivity problems/changes
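An architecture mismatch can be checked before a patch is attempted. The sketch below compares a patch's stated architecture with the host's, as reported by Python's standard `platform.machine()`. The alias table and function names are assumptions for illustration and are not exhaustive.

```python
import platform

# Assumed aliases: different vendors name the same architecture differently.
ALIASES = {"amd64": "x86_64", "x64": "x86_64", "arm64": "aarch64"}

def normalize_arch(name):
    """Map an architecture label to a single lowercase form."""
    key = name.strip().lower()
    return ALIASES.get(key, key)

def patch_matches_host(patch_arch, host_arch=None):
    """Return True if the patch architecture matches the host architecture."""
    if host_arch is None:
        host_arch = platform.machine()  # e.g. 'x86_64', 'AMD64', 'arm64'
    return normalize_arch(patch_arch) == normalize_arch(host_arch)

ok = patch_matches_host("AMD64", "x86_64")
mismatch = patch_matches_host("arm64", "x86_64")
```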
service failure possible causes
- dependency services failed to start
- service account has insufficient permissions
- service account password has expired
service failure possible solutions
- ensure dependent services are started first
- grant service account required permissions
- set service account password
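Starting dependencies first amounts to a topological ordering of the services. Here is a minimal sketch using `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later). The service names and their dependencies are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical services, each mapped to the services it depends on.
depends_on = {
    "web": {"database", "network"},
    "database": {"network"},
    "network": set(),
}

# static_order() yields each service only after all of its dependencies.
start_order = list(TopologicalSorter(depends_on).static_order())
```

With these dependencies, `network` comes first, then `database`, then `web`. A circular dependency would raise `graphlib.CycleError`, which is itself a useful diagnostic.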
OS can’t be shut down possible causes
- hangs caused by runaway background processes
- updates still being applied
OS can’t be shut down possible solutions
- use task manager or Linux kill command to terminate processes
- wait for updates to complete
users can’t print possible causes
- Windows print spooler service isn’t responsive
- printer is offline
- incorrect/corrupt driver
users can’t print possible solutions
- restart Windows print spooler service
- ensure printer is correctly configured/online
- uninstall/reinstall updated printer driver
- remove/reconfigure printer in OS
software packages can’t be installed in Linux possible causes
- package dependencies aren’t installed/are the incorrect version