Troubleshooting and Performance Optimization Flashcards
troubleshooting methodology
- identify problem
- establish theory of probable cause
- test the theory
- establish plan of action
- implement a solution/escalate
- verify functionality
- perform root cause analysis
- document the solution
refined troubleshooting
- identify problem scope
- reproduce the problem
- check log files
- read documentation
BIOS failure possible causes
- overheating
- unsupported features
- newer options may require UEFI
BIOS failure possible solutions
- keep server rooms/data centers properly ventilated
- update (flash) BIOS
- acquire UEFI motherboards/enable UEFI options
POST failure possible causes
- TPM firmware detects a boot configuration change
- failed hardware components
POST failure possible solutions
- enter TPM recovery code/configure boot options
- search for reported POST code to identify problem
- replace failed components
memory failure possible causes
- POST failure message
- random OS freezes/reboots
memory failure possible solutions
- run memory diagnostics
- replace failed components
processor failure/performance degradation possible causes
- overheating
- throttling slows CPU as temperature increases
- VMs with manual CPU affinity specified are performing poorly
processor failure/performance degradation possible solutions
- ensure HVAC is running correctly
- don’t manually link VMs to specific CPU cores
boot sequence possible causes
- OS not found due to changing disk order/partitions
- booting from USB might fail if not enabled in BIOS
boot sequence possible solutions
- configure bootable disk order in BIOS
- configure bootable disk partitions in OS
- flash BIOS so USB boot is supported
storage failure possible causes
- drive failure
- RAID array drive failures resulting in slow performance
storage failure possible solutions
- run disk diagnostics
- replace failed drives
- have hot spare disks in place
power failure possible causes
- power supply
- power surge
power failure possible solutions
- use redundant power sources
- use UPSs
- use surge protectors
environment failure possible causes
- HVAC malfunctioning causes overheating
- accumulated dust hampers airflow/adds layer of insulation
- low humidity increases ESD
environment failure possible solutions
- ensure HVAC is running properly
- clear dust from components/air intake fans
- ensure HVAC keeps consistent relative humidity
crash cart tools
- multimeter to test power supplies
- hardware diagnostics tools for components
- can of compressed air to remove dust
- antistatic wrist strap/ESD mats
- tools for testing bad RAM chips
logon failure possible causes
- incorrect credentials
- corrupt user profile
- can’t locate authentication server
logon failure possible solutions
- reset user password
- save old user profile/remove corrupt user profile and registry references
- ensure client station points to correct DNS server
user unable to access resource possible causes
- insufficient permissions
- encryption is enabled
- Windows UAC configuration is too restrictive
- UNIX/Linux sudo is not configured to enable user access to certain commands
user unable to access resource possible solutions
- check user effective access
- check group membership
- ensure user has decryption key
- loosen UAC settings
- modify sudoers configuration file
memory leak possible causes
- poorly written software
- malware
- runaway processes
memory leak possible solutions
- reboot server to reclaim memory
- run antimalware scan
- patch software
- find functionally equivalent software
blue screen of death (BSOD)/hang/crashes possible causes
- unstable device driver
- bad RAM chips
- memory buffer overrun due to unpatched software
BSOD/hang/crashes possible solutions
- update/replace/roll back driver
- run memory diagnostics
- patch software
- replace failed RAM chip
- restart Windows server/press F8/attempt to boot using last known good configuration (LKGC)
purple screen of death (PSoD) in VMware ESXi possible causes
most commonly related to VMkernel critical errors
PSoD in VMware ESXi possible solutions
- apply OS and driver updates/roll back updates
- remove recently added hardware/test stability
- review memory core dump file
- check ESXi scratch partition for the vmware support output file
disk drive unmountable possible causes
- file system corruption
- not supported by local OS
disk drive unmountable possible solutions
- run disk scan to correct file system errors
- format drive with file system supported by local OS
logs can’t be written to possible causes
log disk volume is full
logs can’t be written to possible solutions
- free up disk space
- store logs in alternate location
- archive old log messages
slow OS performance possible causes
- OS disk is full
- disks are fragmented
- system resources lacking
- CPUs are busy
- VM memory swap file (page file)/partition is on slow disk/is corrupt
slow OS performance possible solutions
- free up space on OS drive
- extend OS drive capacity
- defragment drive
- reduce number of processes running concurrently
- place VM swap configuration file on fast disks
- enable disk write caching
software patches not being applied possible causes
- previous software dependencies aren’t present
- patches don’t match platform architecture
- software has reached end of life
- synching updates with downstream servers failing in enterprise
software patches not being applied possible solutions
- apply previous dependencies first
- acquire patches for appropriate platform architecture
- acquire newer versions of software that are supported
- check for network connectivity problems/changes
service failure possible causes
- dependency services failed to start
- service account has insufficient permissions
- service account password has expired
service failure possible solutions
- ensure dependent services are started first
- grant service account required permissions
- set service account password
OS can’t be shut down possible causes
- hangs caused by runaway background processes
- updates still being applied
OS can’t be shut down possible solutions
- use task manager or Linux kill command to terminate processes
- wait for updates to complete
users can’t print possible causes
- Windows print spooler service isn’t responsive
- printer is offline
- incorrect/corrupt driver
users can’t print possible solutions
- restart Windows print spooler service
- ensure printer is correctly configured/online
- uninstall/reinstall updated printer driver
- remove/reconfigure printer in OS
software packages can’t be installed in Linux possible causes
package dependencies aren’t installed/incorrect version
software packages can’t be installed in Linux possible solutions
- update package repositories
- install/update dependent packages
Windows resource monitor
- shows which processes are consuming most disk I/O time
- input/output operations per second (IOPS)
- not provided by other tools
DCSs
- data collector sets
- similar to performance monitor
- can control when to start/stop collecting data
- can configure alert notifications when thresholds have been exceeded
Linux systems in single user mode
- run level 1
- only a minimal set of services is running
- involves interrupting boot process/modifying boot startup file
top Linux command
lists top processes consuming resources
ps Linux command
lists running processes
kill Linux command
terminates processes
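As a rough illustration of what kill does, this Python sketch starts a throwaway child process and sends it SIGTERM, the default signal of the kill command. It assumes a Linux/UNIX host; the sleeping child and the function names are hypothetical stand-ins, not part of any real tool.

```python
import os
import signal
import subprocess
import sys


def terminate_process(pid: int) -> None:
    """Send SIGTERM to pid, as `kill <pid>` does by default."""
    os.kill(pid, signal.SIGTERM)


def terminate_demo() -> int:
    """Start a sleeping child, terminate it, and return its exit status."""
    # Hypothetical stand-in for a runaway process: a child that just sleeps.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    terminate_process(child.pid)
    # wait() reaps the child; a negative value is the number of the
    # signal that ended the process.
    return child.wait(timeout=10)
```

A process that ignores SIGTERM can be sent signal.SIGKILL instead (kill -9), which can't be caught or ignored.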
df Linux command
shows disk free space
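The same numbers df reports can be read programmatically. A minimal sketch using Python's standard shutil.disk_usage; the function name and the dictionary keys are assumptions chosen for illustration.

```python
import shutil


def free_space_report(path: str = "/") -> dict:
    """Return total, used, and free bytes for the file system holding path."""
    usage = shutil.disk_usage(path)
    # Guard against pseudo file systems that report a size of zero.
    percent_used = 0.0
    if usage.total > 0:
        percent_used = round(100 * usage.used / usage.total, 1)
    return {
        "total": usage.total,
        "used": usage.used,
        "free": usage.free,
        "percent_used": percent_used,
    }
```

Calling free_space_report("/var/log") would show whether a full log volume is the reason logs can't be written.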
Windows commands to map drives
- net use
- new-psdrive
PS get-volume
shows file system health status/size stats
slow file access possible causes
- failed RAID 5 array rebuilding data on demand in memory
- failed RAID controller disk write cache or battery
- disk array contains mismatched drive speeds
slow file access possible solutions
- ensure hot spare disks are always available
- consider using RAID 6 (tolerate 2 simultaneous drive failures)
- without write caching, RAID arrays can’t queue disk requests that can’t be serviced right away
- replace faulty components
- disk arrays with slow/fast disks will use the slower speed
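The fault-tolerance and mixed-speed points above can be sketched in a few lines of Python. This is a simplified model, not a vendor tool: the function names are hypothetical, only RAID 5 and RAID 6 are covered, and "uses the slower speed" is modeled as the minimum of the drive speeds.

```python
# Simultaneous drive failures each level survives (RAID 5: 1, RAID 6: 2).
FAILURES_TOLERATED = {5: 1, 6: 2}


def array_state(level: int, failed_drives: int) -> str:
    """Classify an array as healthy, degraded, or failed."""
    if level not in FAILURES_TOLERATED:
        raise ValueError("this sketch only covers RAID 5 and RAID 6")
    if failed_drives == 0:
        return "healthy"
    if failed_drives <= FAILURES_TOLERATED[level]:
        # Data is still available but is rebuilt on demand, so access is slow.
        return "degraded"
    return "failed"


def effective_speed(drive_speeds: list[int]) -> int:
    """A mixed array performs at the speed of its slowest member."""
    return min(drive_speeds)
```

For example, array_state(5, 2) is "failed" while array_state(6, 2) is only "degraded", which is the argument for RAID 6 plus hot spares.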
data unavailable possible causes
- failed server
- failed HBA
data unavailable possible solutions
- ensure high availability with failover clustering/data backups/data replication to other sites
- ensure redundant SAN paths
failed backup possible causes
- failed network connection
- media failure
failed backup possible solutions
- ensure redundant network connections for LAN/cloud-based backup
- ensure extra backup media is always available
- perform periodic restore drills
- have at least 2 backups of critical data
unavailable drives possible causes
- OS failure
- physical disk failure
- RAID controller failure
- blade enclosure backplane failure
- network connection failure
unavailable drives possible solutions
- view LED indicators/LCD displays/drive error lights to catch errors
- ensure redundant network paths to critical apps/data
- replace failed RAID components/attempt to rebuild array
- replace failed hardware components
unable to mount storage media possible causes
- corrupt file system
- corrupt mass storage driver
- insufficient user permissions
- incorrect partition type
unable to mount storage media possible solutions
- run Windows disk scan
- ensure user permissions are correctly configured
- some OSs can’t read disk partitions created with other OS versions
- use correct partition type
Windows command line disk management tools
- diskpart.exe (replaces fdisk command in new OS versions)
- defrag.exe
- powershell cmdlets
Windows GUI disk management tools
- disk management
- server manager
- disk defragmenter
- disk cleanup
- error checking
fsck Linux command
checks file system for corruption
xfs_repair Linux command
checks for/repairs XFS file system
iostat Linux command
shows disk I/O statistics for storage devices
lsof Linux command
lists open files/provides further details
mdadm Linux command
Linux software RAID array management
TDR
- time-domain reflectometer
- used to measure continuity of electric signals through circuit boards/network cable wires
- used to determine where problem exists (probes just identify there is a problem)
OTDRs
- optical time-domain reflectometers
- show where fiber-optic cables are terminated
- can show location of cable breaks
cause of most network issues
incorrect software protocol configuration
internet connectivity failure possible causes
- service provider outage
- incorrect IP address for subnet
- incorrect subnet mask
- incorrect default gateway
- incorrect DNS server
internet connectivity failure possible solutions
- verify IP address is in correct range for subnet
- ensure configured default gateway interface is on the LAN
- ping by IP address instead of FQDN to isolate name resolution problems
- check provider SLA to determine support options
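The address-range and gateway checks above can be automated with Python's standard ipaddress module. A minimal sketch; the function name and the sample addresses are assumptions for illustration, and very small subnets (/31, /32) are not handled.

```python
import ipaddress


def check_ipv4_config(address: str, netmask: str, gateway: str) -> list[str]:
    """Return a list of problems found in a host's IPv4 settings."""
    problems = []
    interface = ipaddress.ip_interface(f"{address}/{netmask}")
    network = interface.network
    # A host must not be assigned the network or broadcast address.
    if interface.ip in (network.network_address, network.broadcast_address):
        problems.append(f"{address} is the network or broadcast address")
    # The default gateway must sit on the same subnet as the host.
    if ipaddress.ip_address(gateway) not in network:
        problems.append(f"gateway {gateway} is outside {network}")
    return problems
```

An empty list means the three values are consistent with each other; it says nothing about whether the DNS server setting is correct.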
LAN connectivity only possible causes
IPv4 169.254 address is assigned when DHCP is not reachable
LAN connectivity only possible solutions
- ensure DHCP server is running
- ensure UDP ports 67 (server) and 68 (client) aren’t blocked
- ensure LAN’s DHCP relay is functional for DHCP servers on other subnets
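Spotting a 169.254 address is easy to script. A small Python sketch using the standard ipaddress module; the function name is hypothetical.

```python
import ipaddress

# IPv4 link-local range that a client assigns itself when DHCP is unreachable.
APIPA_NETWORK = ipaddress.ip_network("169.254.0.0/16")


def is_apipa(address: str) -> bool:
    """True if address falls in 169.254.0.0/16."""
    ip = ipaddress.ip_address(address)
    return ip.version == 4 and ip in APIPA_NETWORK
```

A True result points at DHCP reachability rather than at the NIC or cabling to local hosts, since LAN traffic still works.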
network service misconfiguration possible causes
DHCP server handing out invalid IP configurations
network service misconfiguration possible solutions
correct DHCP misconfigurations
network resource unreachable/unavailable possible causes
- name resolution problems
- IP misconfiguration
- VLAN membership
- incorrect subnet mask
- incorrect route table entry
network resource unreachable/unavailable possible solutions
- use nbtstat (Windows) to troubleshoot NetBIOS name resolution issues
- use nslookup or dig (Linux) for DNS unknown host messages
- make sure computer is part of the correct VLAN
- view routing table using route print (Windows) or ip route show (Linux)
unable to connect to network possible causes
- faulty network cable
- switch port security
- NIC speed set incorrectly
- RADIUS authentication failure
- MAC address filtering
unable to connect to network possible solutions
- replace faulty cables
- configure switch ports to enable device access
- set NIC speed/duplex settings to autodetect
- ensure proper authentication credentials/methods are used
- add device MAC address to filter list
tracert (Windows)/traceroute (Linux)
- display each router (hop) that data passes through along the network path
- more useful than ping for locating where along the path traffic stops
- use when destination hosts on different networks are unreachable
- tracert uses ICMP which may be blocked by firewalls
route (Windows)/ip route show (Linux)
use to display routing table entries; route (Windows) and ip route (Linux) can also modify them
names resolving to unexpected IP addresses
- probably entries in the local HOSTS file on system
- entries are placed into client DNS cache in memory
- client checks cache before DNS servers
ipconfig /flushdns
- clears client DNS cache
- recent DNS queries are cached in local client’s memory
- have time-to-live (TTL) value to determine how long entry is cached
- use when DNS records have recently changed
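To make the TTL and flush behavior concrete, here is a toy client-side cache in Python. It is a sketch of the idea only, not how any OS resolver is implemented; the class name, method names, and the injectable clock are all assumptions.

```python
import time


class DnsCache:
    """Toy client-side DNS cache; entries expire after their TTL."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # name -> (address, expiry time)

    def put(self, name: str, address: str, ttl_seconds: float) -> None:
        """Cache an answer until its TTL runs out."""
        self._entries[name] = (address, self._clock() + ttl_seconds)

    def get(self, name: str):
        """Return the cached address, or None if missing or expired."""
        entry = self._entries.get(name)
        if entry is None:
            return None
        address, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[name]
            return None
        return address

    def flush(self) -> None:
        """Drop every entry, which is the effect of ipconfig /flushdns."""
        self._entries.clear()
```

A None result from get() is the point at which a real client would query its DNS server. Until the TTL expires or the cache is flushed, a changed DNS record is not seen by the client.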
nslookup
- displays DNS information for local machine or FQDN
- can also change which DNS server is queried (server command)
symptoms of malware infection
- excessive/prolonged hardware resource use
- inability to reach network resources
- web browser homepage changed/not editable
- web browser opens pages user didn’t navigate to
- rogue processes/services running with improper privilege escalation
- missing log entries (cleared by attacker)
- encrypted files/messages demanding payment
- abnormal listening ports on server (backdoor)
immediate action when infection is detected
isolate server/subnet
malware removal
- vendor malware removal tools
- Windows system restore point (client OS only)
- server reinstall/reimage
- boot through alternative means to remove infection (Windows safe mode/USB boot/PXE boot)
gpupdate/gpresult /r
- gpresult /r shows resultant set of GPOs
- GPOs may be too restrictive/cause issues
security filtering
- enables admins to ensure only specific users/groups get group policy settings
- Windows management instrumentation (WMI)
- WMI query language (WQL)
SetUID special bit (Linux)
- enables executed script/binary to run as the file owner (not invoker)
- owner could be root
- used carefully it can solve issue of user not being able to run script/program
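Whether a file carries the SetUID bit can be checked from its mode. A minimal Python sketch using the standard stat module; the function name is hypothetical.

```python
import stat


def has_setuid(mode: int) -> bool:
    """True if the SetUID bit is set in a mode such as os.stat(path).st_mode."""
    return bool(mode & stat.S_ISUID)


def describe(mode: int) -> str:
    """Render a mode the way ls -l does; SetUID shows as 's' in the owner field."""
    return stat.filemode(mode)
```

For a regular file with mode 4755, describe() gives -rwsr-xr-x. Pass os.stat(path).st_mode to audit a real file; SetUID files owned by root deserve the closest review.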
icacls (Windows)/getfacl/setfacl (Linux)
use to save/restore file system ACLs
too much running (performance)
- server OSs barebones by default
- running too many services can harm performance
- use port scanning tools periodically
confidentiality
provided by encryption
cipher
- cryptographic algorithm used to encrypt/decrypt data
- incorrect cipher configuration can cause issues
integrity
- uses hashing algorithm to ensure data has not been tampered with
- packet/file level verification
- packet sniffers/checksums (can also detect use of insecure tools)
hashing commands
- get-filehash .\name.txt (powershell)
- sha256sum (Linux)
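The same integrity check can be done in Python with the standard hashlib module. A minimal sketch; the function names are assumptions, and the expected digest would come from a trusted source such as the software vendor.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the SHA-256 digest of a file as lowercase hex."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large files don't have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unmodified(path: str, expected_hex: str) -> bool:
    """Compare a file's digest with a known-good value."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Any change to the file, even a single byte, produces a different digest, so a mismatch means the file was altered or corrupted.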
sizing
selecting number of virtual CPUs/amount of RAM/disk type for VM
network optimization
- configure VLANs to group machines that communicate frequently into smaller networks
- configure NIC teaming
- network load balancing (NLB) distributes incoming traffic for network services to multiple duplicated servers
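One simple way to distribute requests across duplicated servers is round robin. The Python sketch below shows only that idea; real NLB products may use other algorithms, and the function and server names are hypothetical.

```python
import itertools


def round_robin(servers: list[str]):
    """Return an iterator that hands out servers in rotation, one per request."""
    if not servers:
        raise ValueError("at least one server is required")
    return itertools.cycle(servers)
```

With picker = round_robin(["web1", "web2", "web3"]), successive next(picker) calls return web1, web2, web3, then web1 again.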
horizontal scaling
elastically scaling number of servers in cloud in response to increased demand
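The scaling decision can be reduced to simple arithmetic. A rough Python sketch, assuming each server handles a fixed number of requests per second; the function name, parameters, and figures are hypothetical and not taken from any cloud provider.

```python
import math


def servers_needed(requests_per_second: float,
                   capacity_per_server: float,
                   minimum: int = 1) -> int:
    """Number of servers to run for the current demand."""
    if capacity_per_server <= 0:
        raise ValueError("capacity_per_server must be positive")
    # Round up so demand never exceeds total capacity; never drop below minimum.
    return max(minimum, math.ceil(requests_per_second / capacity_per_server))
```

At 950 requests per second and 200 per server this gives 5 servers; when demand falls to zero it scales back in to the minimum of 1.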
troubleshooting step involving questioning stakeholders
identify the problem
troubleshooting step involving reproducing the problem
identify the problem
troubleshooting step involving making single change at a time
implementing the solution
machine with IPv4 statically configured can’t get GPO settings/domain controller can’t be found/machine can communicate with other local and remote hosts/GPO settings worked before manual IPv4 configuration
incorrect DNS server
AD domain admin/nothing happens when trying to run install on domain-joined server
UAC issues (run as administrator)
Linux command to terminate process
kill
drawback of heuristic host/network analysis
false positives