Interview Flashcards

1
Q
How do you differentiate between incidents, problems, and changes?
A

Incident: An unplanned interruption or degradation of service.
Problem: The underlying cause of one or more incidents.
Change: A modification to an IT service or system aimed at resolving problems or improving functionality.

2
Q
How do you handle multiple simultaneous incidents?
A

Assess and prioritize based on impact and urgency.
Assign dedicated resources to each incident.
Use playbooks to ensure a structured response for high-priority incidents.
Communicate status updates effectively to all stakeholders.

3
Q
What is your experience with post-incident reviews (PIRs)?
A

Conducted PIRs within 48 hours of incident resolution.
Structured the review to include a timeline, root cause analysis, corrective actions, and lessons learned.
Facilitated open discussions to identify process gaps and ensure accountability without assigning blame.

4
Q
How do you ensure compliance with SLAs during incident management?
A

Establish a clear escalation and communication path for all teams involved.
Use global incident management tools like ServiceNow or PagerDuty.
Ensure team members are aware of time zones and organizational dependencies.

5
Q
How do you diagnose performance issues on a Linux server?
A

Use tools like top, htop, or vmstat for resource usage.
Analyze logs under /var/log.
Check disk I/O using iostat or iotop.
Inspect network performance using netstat, tcpdump, or iftop.

6
Q
What is your understanding of TCP and UDP, and when would you use each?
A

TCP: Reliable, connection-oriented protocol for use cases like file transfers and web browsing.
UDP: Lightweight, connectionless protocol for latency-sensitive traffic that tolerates some loss, such as DNS queries, VoIP, and live video streaming.
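
The contrast above can be seen in a minimal sketch using Python's standard socket module over the loopback interface. The function names udp_roundtrip and tcp_roundtrip are made up for illustration. Note that the UDP path sends a datagram with no setup, while the TCP path needs listen, connect, and accept before any bytes flow.

```python
import socket

def udp_roundtrip(payload: bytes) -> bytes:
    """Send one datagram over loopback; no connection is set up first."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as receiver, \
         socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sender:
        receiver.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
        receiver.settimeout(2.0)
        sender.sendto(payload, receiver.getsockname())
        data, _addr = receiver.recvfrom(4096)
        return data

def tcp_roundtrip(payload: bytes) -> bytes:
    """Send bytes over loopback after an explicit connect/accept handshake."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind(("127.0.0.1", 0))
        server.listen(1)
        with socket.create_connection(server.getsockname(), timeout=2.0) as client:
            conn, _addr = server.accept()
            with conn:
                conn.settimeout(2.0)
                client.sendall(payload)
                client.shutdown(socket.SHUT_WR)   # signal end of stream
                chunks = []
                while True:
                    chunk = conn.recv(4096)
                    if not chunk:
                        break
                    chunks.append(chunk)
                return b"".join(chunks)
```

On loopback both calls return the payload unchanged. On a real network only the TCP path would retransmit lost data.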

7
Q
How do you ensure containerized services are running optimally?
A

Use monitoring tools like Prometheus and Grafana for resource tracking.
Analyze container logs using docker logs.
Ensure health checks and resource limits (CPU, memory) are defined in Docker or Kubernetes configurations.
Investigate inter-container network latency or misconfigurations.

8
Q
How would you troubleshoot DNS resolution failures?
A

Check the DNS server’s availability using nslookup or dig.
Verify DNS configurations in /etc/resolv.conf.
Investigate firewall or network settings blocking DNS traffic.
Ensure TTL values are appropriate and DNS caches are updated.

9
Q
What are some common bottlenecks in CI/CD pipelines, and how do you address them?
A

Slow Builds: Optimize builds by caching dependencies or using parallel tasks.
Failed Tests: Ensure tests are modular and focus on critical areas.
Deployment Issues: Use automated rollback mechanisms or staged deployments.

10
Q
How do you configure and use Prometheus and Grafana for monitoring?
A

Install Prometheus and configure scrape targets in the prometheus.yml file.
Use exporters (e.g., node_exporter for Linux systems) to gather metrics.
Query metrics using PromQL and visualize them with Grafana.

11
Q
What is the difference between active and passive monitoring?
A

Active Monitoring: Simulates user transactions to test system performance proactively (e.g., synthetic monitoring).
Passive Monitoring: Observes live user activity to detect issues in real-time (e.g., packet sniffing).

12
Q
What do you look for in log files during incident resolution?
A

Errors or exceptions with timestamps matching the incident.
Patterns indicating system or user activity leading to failure.
Logs of dependent services to identify cascading issues.
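
A minimal sketch of the first point, matching errors to the incident time. It assumes a hypothetical log format of "<ISO timestamp> <LEVEL> <message>". The sample lines and the errors_near helper are invented for illustration.

```python
from datetime import datetime, timedelta

def errors_near(lines, incident_time, window_minutes=5):
    """Return ERROR lines whose timestamp falls within the window around the incident."""
    window = timedelta(minutes=window_minutes)
    hits = []
    for line in lines:
        # Assumed format: "<ISO timestamp> <LEVEL> <message>"
        parts = line.split(" ", 2)
        if len(parts) < 3:
            continue
        stamp, level, _message = parts
        try:
            when = datetime.fromisoformat(stamp)
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if level == "ERROR" and abs(when - incident_time) <= window:
            hits.append(line)
    return hits

# Made-up sample data for the sketch.
sample = [
    "2024-05-01T10:00:00 INFO request served",
    "2024-05-01T10:14:30 ERROR upstream timeout",
    "2024-05-01T10:16:10 ERROR connection pool exhausted",
    "2024-05-01T11:30:00 ERROR unrelated failure",
]
incident = datetime(2024, 5, 1, 10, 15)
```

Calling errors_near(sample, incident) keeps only the two errors logged within five minutes of the incident. The same filter can be run over a dependent service's logs to look for cascading failures.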

13
Q
How do you design an incident escalation matrix?
A

Define escalation tiers based on severity and impact.
Assign escalation paths to specific roles or teams.
Establish time thresholds for each tier.
Regularly review and update the matrix.
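
The steps above can be sketched as a small lookup table. The severity names, roles, and minute thresholds below are hypothetical placeholders, not values from any standard.

```python
# Hypothetical tiers: (minutes open before this tier is paged, role), per severity.
ESCALATION_MATRIX = {
    "SEV1": [(0, "on-call engineer"), (15, "team lead"), (30, "incident commander")],
    "SEV2": [(0, "on-call engineer"), (60, "team lead")],
    "SEV3": [(0, "on-call engineer")],
}

def escalation_target(severity: str, minutes_open: int) -> str:
    """Return the highest tier whose time threshold has been reached."""
    target = None
    for threshold, role in ESCALATION_MATRIX[severity]:
        if minutes_open >= threshold:
            target = role
    return target
```

Keeping the matrix as plain data makes the regular review step a matter of editing one table.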

14
Q

What are some key metrics you track to measure the success of incident management processes?

A

Mean Time to Detect (MTTD).
Mean Time to Acknowledge (MTTA).
Mean Time to Resolve (MTTR).
SLA compliance rates.
Post-incident review completion rates.
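
A minimal sketch of how the first three metrics could be computed from incident timestamps. The records are made up. MTTR is measured here from detection to resolution, which is an assumption, since some teams measure it from the start of the fault.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records: when the fault began, was detected,
# was acknowledged, and was resolved.
incidents = [
    {"start": datetime(2024, 1, 1, 9, 0), "detected": datetime(2024, 1, 1, 9, 4),
     "acked": datetime(2024, 1, 1, 9, 6), "resolved": datetime(2024, 1, 1, 9, 34)},
    {"start": datetime(2024, 1, 2, 14, 0), "detected": datetime(2024, 1, 2, 14, 6),
     "acked": datetime(2024, 1, 2, 14, 10), "resolved": datetime(2024, 1, 2, 15, 6)},
]

def mean_minutes(records, begin, end):
    """Average the elapsed minutes between two timestamp fields."""
    return mean((r[end] - r[begin]).total_seconds() / 60 for r in records)

mttd = mean_minutes(incidents, "start", "detected")     # fault start to detection
mtta = mean_minutes(incidents, "detected", "acked")     # detection to acknowledgement
mttr = mean_minutes(incidents, "detected", "resolved")  # detection to resolution
```

For the sample data this gives an MTTD of 5 minutes, an MTTA of 3 minutes, and an MTTR of 45 minutes.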

15
Q
How do you handle emergency changes during an incident?
A

Assess the impact and risk of the change with key stakeholders.
Gain approval through an expedited emergency change management process.
Test the change in a controlled environment if time permits.
Monitor the results and document the change thoroughly.

16
Q
What are some strategies for mitigating risks in high-availability systems?
A

Implement redundancy at all levels (e.g., servers, storage, networks).
Use automated failover mechanisms.
Regularly test disaster recovery and failover scenarios.
Ensure proper monitoring and alerting to catch early signs of degradation.

17
Q
A critical database server is down during peak hours. How would you handle the situation?
A

Notify stakeholders immediately and assemble the incident response team.
Investigate logs for errors or performance degradation.
Check for hardware issues or resource exhaustion.
Apply a temporary fix, such as restoring from a backup or scaling resources.
Document the incident thoroughly and schedule a follow-up for root cause analysis.

18
Q

A monitoring tool has flagged intermittent latency in a microservices-based application. What’s your approach?

A

Examine logs for specific services with high response times.
Use distributed tracing tools to identify bottlenecks.
Investigate resource usage on affected nodes.
Test inter-service communication and network latency.

19
Q
Explain the significance of ICMP in network troubleshooting.
A

ICMP is used for diagnostic and error-reporting purposes.
Common tools rely on it: ping sends ICMP echo requests to test connectivity, and traceroute uses ICMP Time Exceeded replies to map the path and per-hop latency.

20
Q
How do you ensure I/O performance optimization in a high-load application?
A

Use RAID for disk performance and redundancy.
Optimize database queries and indexes.
Implement caching layers (e.g., Redis).
Monitor and adjust kernel I/O schedulers.

21
Q
Define the Bot’s Purpose
A

Identify the problem the bot will solve (e.g., automate repetitive tasks, assist employees, or manage workflows).
Example use cases: answering FAQs, scheduling, or retrieving on-call staff information.

22
Q
How do you choose the right bot platform?
A

Decide where the bot will operate (e.g., Slack, Microsoft Teams, email, or a custom app).
Ensure it integrates well with corporate tools (e.g., Jira, ServiceNow, or internal APIs).

23
Q
Select Development Tools for the Bot
A

Bot Frameworks: Use frameworks like Microsoft Bot Framework, Dialogflow, or Botpress to streamline development.
Programming Language: Python and JavaScript (running on Node.js) are commonly used due to their simplicity and libraries.
APIs: Utilize corporate APIs (like HR systems or databases) to fetch required data.

24
Q
Build Core Functionality for the Bot
A

Write code for the bot’s tasks. For example:
Use APIs for fetching schedules, automating queries, or retrieving documents.
Build logic for processing commands like “Show today’s on-call staff.”
Implement natural language understanding (NLU) using tools like Rasa or Dialogflow for conversational bots.

25
Q
Ensure Security and Compliance for the Bot
A

Use encryption to secure sensitive data.
Follow corporate policies on data storage and processing.
Authenticate users (e.g., via Single Sign-On (SSO) or OAuth).

26
Q
Test and Deploy the Bot
A

Test the bot in a staging environment for bugs or user flow issues.
Deploy it to production on your chosen platform (e.g., Slack workspace or Teams environment).

27
Q
Monitor and Improve
A

Monitor usage logs and performance metrics.
Gather feedback from users to improve functionality.
Update the bot regularly to handle new scenarios or integrate additional features.

28
Q
Can you walk us through the lifecycle of an incident from detection to resolution?
A

Detection: Incident identified via monitoring tools, alerts, or user reports.
Classification: Assign severity level based on impact and urgency.
Response: Notify stakeholders, assemble the incident response team, and assign roles.
Diagnosis: Use logs, monitoring data, and tools to identify the root cause.
Resolution: Apply a temporary fix (if needed) and implement a permanent solution.
Communication: Regular updates to stakeholders.
Post-Incident Review: Document the incident, analyze for lessons learned, and refine processes to prevent recurrence.

29
Q
How would you prioritize incidents during a major outage?
A

Assess the impact (e.g., number of users affected, financial loss).
Analyze the urgency (e.g., SLA breaches or cascading system failures).
Focus on critical systems supporting customer-facing services.
Ensure effective delegation, while providing continuous communication to stakeholders.

30
Q
What monitoring tools are you familiar with, and how have you used them?
A

Tools: Prometheus, Grafana, Splunk, Datadog, and ELK Stack.
Used for tracking system performance, detecting anomalies, creating alerts, and diagnosing root causes.
Example: Used Grafana dashboards to monitor application health during peak traffic and proactively mitigated risks by analyzing metrics like CPU and memory usage.

31
Q
How would you explain a technical incident to non-technical stakeholders?
A

Start with the impact (e.g., “Service X is currently unavailable for 20% of users”).
Avoid technical jargon; use analogies if helpful (e.g., “It’s like a traffic jam blocking access”).
Highlight the steps being taken and the estimated time for resolution.
Keep updates concise and provide frequent updates to maintain trust.

32
Q
What’s your approach to delivering bad news about an incident?
A

Be transparent but solution-focused.
Clearly explain the situation, impact, and mitigation steps in progress.
Reassure stakeholders by emphasizing the team’s expertise and a structured resolution plan.

33
Q
Describe a challenging incident you managed and how you resolved it.
A

Context: A critical payment processing service outage.
Action: Coordinated with SREs, reviewed logs, and pinpointed a database deadlock issue.
Solution: Rolled back the faulty code deployment and implemented additional monitoring to catch similar issues proactively.
Outcome: Restored services within the SLA and conducted a root cause analysis to prevent recurrence.

34
Q
How would you troubleshoot an intermittent network issue?
A

Start with logs: Check network device logs and application-level errors.
Run diagnostics: Use tools like traceroute, ping, and packet capture.
Correlate data: Identify patterns like time of occurrence or specific affected regions.
Isolate components: Test individual elements of the network to narrow down the root cause.

35
Q
What tools do you use for incident tracking and reporting?
A

Jira: For tracking and documenting incidents.
ServiceNow: For managing incident lifecycles.
Confluence: For post-incident reporting and knowledge sharing.

36
Q
How do you ensure high-quality incident documentation?
A

Include essential details: incident timeline, root cause, impact, resolution steps, and lessons learned.
Ensure reports are concise, structured, and accessible to both technical and non-technical audiences.
Use templates to maintain consistency across reports.

37
Q
How do you ensure effective collaboration during an incident?
A

Use a centralized communication channel (e.g., Slack or Microsoft Teams).
Clearly assign roles and responsibilities to team members.
Encourage open communication and quick escalation of issues.
Regularly update stakeholders and keep the team focused on resolution goals.

38
Q
How do you manage a situation where two teams disagree on the root cause of an incident?
A

Facilitate a discussion to focus on facts rather than opinions.
Use data (e.g., logs, metrics) to guide decisions.
If unresolved, escalate to a neutral decision-maker or a higher-level incident commander.

39
Q
What’s your experience with CI/CD pipelines?
A

Familiar with Jenkins, GitLab CI/CD, and GitHub Actions.
Implemented pipelines for automated testing, deployment, and monitoring.
Example: Reduced deployment time by 30% using a well-defined CI/CD process integrated with Kubernetes.

40
Q
How do you mitigate risks when rolling out changes during an incident?
A

Implement a change freeze during critical periods.
Conduct thorough pre-deployment testing.
Use canary deployments or blue-green deployments to minimize impact.
Roll back immediately if adverse effects are detected.
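
A minimal sketch of the canary idea: compare the canary's error rate with the baseline's and decide whether to promote or roll back. The canary_verdict function, the 2x ratio, and the 0.1% floor are illustrative assumptions, not recommended values.

```python
def canary_verdict(baseline_errors, baseline_requests,
                   canary_errors, canary_requests, max_ratio=2.0):
    """Return "promote" or "rollback" by comparing error rates."""
    if canary_requests == 0:
        return "rollback"  # no traffic means no evidence the canary is healthy
    baseline_rate = baseline_errors / max(baseline_requests, 1)
    canary_rate = canary_errors / canary_requests
    # Allow a small absolute floor so a zero-error baseline does not block every canary.
    allowed = max(baseline_rate * max_ratio, 0.001)
    return "promote" if canary_rate <= allowed else "rollback"
```

In practice the counts would come from the monitoring system, and the check would run repeatedly while the canary receives a small share of traffic.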

41
Q
How do you handle stress during a critical incident?
A

Stay calm and focused by breaking the problem into smaller tasks.
Use checklists and incident response playbooks to stay organized.
Maintain clear communication and lean on the team for support.

42
Q
Describe a time you improved an incident management process.
A

Implemented a post-mortem framework to identify recurring incident trends.
Developed an automated incident notification system using Slack and ServiceNow integration.
Resulted in a 20% improvement in incident response times and better documentation quality.

43
Q
How do you prioritize incidents and allocate resources?
A

Based on the severity of the issue, the scope of the outage, and the number of users affected.

44
Q
Can you describe a time you prevented a recurring issue?
A

ERD, PRD, production. On-call bot schedules misaligned.

45
Q
How do you manage communication during high-pressure incidents?
A

Through a Lark channel with the correct POCs from the affected departments, asking for updates and providing them when I can.

46
Q
Provide an example of a critical incident you handled and its outcome.
A

Rollbacks for bad deployments.

Traffic failover to ttp2 after a ttp1 outage.

A downstream service caused error spikes in a critical PSM for my team (3C), which led to an investigation and a discussion with the team responsible.

47
Q
If you get three escalations at once, which would you prioritize? (The three escalations: suicide content, LGBT-related content, and government pressure.)
A

In such scenarios, prioritization would depend on the urgency and impact. Suicide content would take the highest priority as it involves potential loss of life and requires immediate action to protect individuals. Next, I would address government pressure, ensuring compliance with regulations to maintain operational stability. Lastly, I would manage the LGBT-related escalation, ensuring that it is handled sensitively and in alignment with TikTok’s values of inclusivity and community support. Throughout, I would ensure clear communication and resource allocation to address all issues effectively.

48
Q

What are your aspirations?

A

Throughout my career I have aspired to be more of a leader, contributing to my team by driving project ideas and resolutions while also being in the trenches getting the work done.

49
Q
Tell me about a time you’ve experienced conflict and how you dealt with it.
A

During my time at TikTok I designed a bot that updates the on-call schedules from a master schedule, regardless of region, and outputs the current on-caller. During a demo for the discovery team, a conflict between two of the SREs over the tool's impact on their team escalated to shouting. I took the reins of the conversation so we could break down the issues, since one of the discovery members didn't understand the issue the other was raising.

50
Q
Why are you interested in TikTok?
A

I'm interested in working at TikTok because it's in an ever-evolving industry, with numerous talented engineers and projects that would help me grow my career, and because I'd be supporting a platform that connects millions of people worldwide.

51
Q
Besides your professional commitment, can you share a story about a time when you helped someone?
A

One of my friends was desperate for a career change, so I suggested IT and got him started on foundational certs such as A+, Linux+, and Network+. We then went over interview prep and found some help desk jobs, and after a three-year journey he now works for Microsoft as a pen test engineer.

52
Q

AGILE

A

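An iterative approach that cycles through the following phases in short increments (sprints), rather than running them once in a strictly linear (Waterfall) sequence:
1. Requirements
2. Design
3. Implementation
4. Testing
5. Deployment
6. Maintenance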

53
Q

Q: What command in Linux shows running processes and their resource usage?

A

top or htop.

54
Q

Q: How do you check disk usage in Linux?

A

Use df -h for disk usage and du -sh for directory size.

55
Q

Q: What is the purpose of the /etc/hosts file in Linux?

A

A: It maps hostnames to IP addresses locally, bypassing DNS.
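
A minimal sketch of the file's format: each line holds an IP address followed by one or more names, and # starts a comment. The parse_hosts helper and the sample entries are made up for illustration. Treating the first entry for a name as the winner is an assumption about typical resolver behavior.

```python
def parse_hosts(text: str) -> dict:
    """Map each hostname or alias to its IP address, ignoring comments."""
    mapping = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            mapping.setdefault(name, ip)      # first entry for a name wins
    return mapping

# Hypothetical file contents for the sketch.
sample_hosts = """\
127.0.0.1   localhost
# internal services (hypothetical)
10.0.0.5    db.internal db
"""
```

With the sample above, both db.internal and its alias db map to 10.0.0.5 without any DNS query.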

56
Q

Q: What does ping do in networking?

A

A: Tests connectivity between two devices by sending ICMP echo requests.

57
Q

Q: Name three common types of virtualization technologies.

A

A: VMware, Hyper-V, KVM, and Oracle VM.

58
Q

Q: What is the difference between NAT and Bridged network modes in virtualization?

A

A: NAT shares the host’s IP, while Bridged gives VMs direct access to the network.

59
Q

Q: What does the iptables command do in Linux?

A

A: Manages firewall rules for packet filtering.

60
Q

Q: What is the purpose of a VLAN?

A

A: To segment a network into isolated virtual networks for security and efficiency.

61
Q

Q: What is the primary purpose of Kubernetes?

A

A: Orchestrates containerized applications for scaling, deployment, and management.

62
Q

Q: How do you create a Docker container from an image?

A

A: Use the command docker run <image>.

63
Q

Q: What is a Kubernetes Pod?

A

A: The smallest deployable unit in Kubernetes, containing one or more containers.

64
Q

Q: How does Kubernetes ensure high availability?

A

A: By automatically replicating and rescheduling pods on healthy nodes.

65
Q

Q: What is the role of a Dockerfile?

A

A: Defines the steps to build a custom Docker image.

66
Q

Q: How do you check the status of nodes in a Kubernetes cluster?

A

A: Use kubectl get nodes.

67
Q

Q: What is the purpose of a Kubernetes ingress?

A

A: Manages external HTTP/S access to services within the cluster.

68
Q

Q: Name one tool used to monitor Docker and Kubernetes environments.

A

A: Prometheus with Grafana or Kubernetes Dashboard.

69
Q

Q: What are the key stages of the incident management lifecycle?

A

A: Identification, logging, categorization, prioritization, investigation, resolution, closure.

70
Q

Q: What is an SLA in incident management?

A

A: Service Level Agreement – a commitment to resolve issues within a specified timeframe.

71
Q

Q: What is the main goal of incident management?

A

A: To restore normal service operation as quickly as possible with minimal business impact.

72
Q

Q: What is a P1 incident?

A

A: A Priority 1 incident with the highest severity, often causing significant business disruption.

73
Q

Q: What is the purpose of a post-incident review (PIR)?

A

A: To analyze root causes, evaluate the response, and identify process improvements.

74
Q

Q: What does ITIL stand for?

A

A: Information Technology Infrastructure Library.

75
Q

Q: Name a tool commonly used for incident tracking.

A

A: Jira or ServiceNow.

76
Q

Q: What is the role of an Incident Commander during an incident?

A

A: To coordinate efforts, ensure clear communication, and oversee resolution activities.

77
Q

Q: What is the first step in process development?

A

A: Identifying goals and the problem the process aims to solve.

78
Q

Q: What framework is commonly used for IT service management best practices?

A

A: ITIL (Information Technology Infrastructure Library).

79
Q

Q: Name a best practice for effective incident communication.

A

A: Provide regular updates with clear and concise information to stakeholders.

80
Q

Q: How can processes be continuously improved?

A

A: Use post-incident reviews, gather feedback, and implement iterative changes.

81
Q

Q: What does RACI stand for in process development?

A

A: Responsible, Accountable, Consulted, Informed.

82
Q

Q: Why is documentation critical in process development?

A

A: It ensures consistency, provides clarity, and supports training and compliance.

83
Q

Q: What is a playbook in incident management?

A

A: A predefined set of steps and procedures to handle specific incident types.

84
Q

Q: Name one tool for automating incident management processes.

A

A: Ansible, Rundeck, or Zapier.

85
Q

Q: What is the purpose of chmod in Linux?

A

A: It changes file permissions (e.g., chmod 755 file).
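
A minimal sketch of what the octal digits in chmod 755 mean, using Python's standard stat module. The describe_mode helper is made up for illustration.

```python
import stat

def describe_mode(octal_digits: str) -> str:
    """Render an octal permission string such as "755" as rwx triplets."""
    mode = int(octal_digits, 8)
    # stat.filemode expects a full st_mode, so mark it as a regular file
    # and drop the leading file-type character from the result.
    return stat.filemode(stat.S_IFREG | mode)[1:]
```

Each digit covers one triplet (owner, group, others), so 755 reads as rwxr-xr-x: full access for the owner, read and execute for everyone else.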

86
Q

Q: How do you find all .log files modified in the last 7 days in a directory?

A

A: find /path/to/dir -name "*.log" -mtime -7

87
Q

Q: What is piping in Linux, and provide an example.

A

A: Piping (|) passes the output of one command as input to another. Example: ls -l | grep "filename".

88
Q

Q: Explain the difference between > and >> in shell scripting.

A

A: > overwrites a file, while >> appends to a file.

89
Q

Q: What does the awk command do?

A

A: Processes and extracts data from text. Example: awk '{print $1}' file.txt prints the first column of a file.

90
Q

Q: How do you write a basic shell script to list all running processes?

A

#!/bin/bash

ps aux

91
Q

Q: What is the purpose of set -e in a shell script?

A

A: Exits the script immediately if a command returns a non-zero status.

92
Q

Q: How can you schedule a script to run every day at 2 AM?

A

0 2 * * * /path/to/script.sh
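
A minimal sketch of how the five fields (minute, hour, day of month, month, day of week) are read. The cron_matches helper is made up for illustration and supports only * and plain integers. Real cron also accepts ranges, lists, and steps, and it treats the two day fields differently when both are restricted.

```python
from datetime import datetime

def cron_matches(expr: str, when: datetime) -> bool:
    """Check a five-field cron expression; supports only '*' and plain integers."""
    minute, hour, day, month, weekday = expr.split()
    # cron counts Sunday as 0; datetime.isoweekday() counts Sunday as 7.
    actual = [when.minute, when.hour, when.day, when.month, when.isoweekday() % 7]
    fields = [minute, hour, day, month, weekday]
    return all(f == "*" or int(f) == a for f, a in zip(fields, actual))
```

The entry 0 2 * * * therefore matches minute 0 of hour 2 on every day, which is 2 AM daily.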

93
Q

Q: What does the grep -v flag do?

A

A: Excludes lines matching a pattern. Example: grep -v "error" file.txt.

94
Q

Q: How do you replace all occurrences of “foo” with “bar” in a file using sed?

A

A: sed -i 's/foo/bar/g' file.txt

95
Q

Q: What are the main components of a Kubernetes cluster?

A

A: Master node (API server, scheduler, etcd, controller manager) and worker nodes (kubelet, kube-proxy, container runtime).

96
Q

Q: How do you initialize a Kubernetes cluster?

A

A: Run kubeadm init, then configure kubectl and join worker nodes using the provided token.

97
Q

Q: How do you expose a Kubernetes deployment externally?

A

A: Use a Service of type NodePort or LoadBalancer, or configure an Ingress.

98
Q

Q: How do you build and push a Docker image to DockerHub?

A

docker build -t <username>/<imagename>:<tag> .
docker push <username>/<imagename>:<tag>

99
Q

Q: What command shows all running containers in Docker?

A

A: docker ps.

100
Q

Q: How do you scale a Kubernetes deployment?

A

A: Use kubectl scale deployment <name> --replicas=<count>.

101
Q

Q: What is the purpose of a ConfigMap in Kubernetes?

A

A: Stores non-sensitive configuration data as key-value pairs, which can be used by applications.

102
Q

Q: How do you restart all pods in a deployment?

A

kubectl rollout restart deployment <name>

103
Q

Q: What is the difference between docker-compose and Kubernetes?

A

docker-compose defines and runs multi-container applications on a single host, while Kubernetes orchestrates containers across a cluster of machines.

104
Q

Q: How do you create a PersistentVolumeClaim (PVC) in Kubernetes?

A

Define a YAML file specifying the storage class, access modes, and size, then apply it using kubectl apply -f.

105
Q

Q: What is the difference between an incident and a problem in ITSM?

A

A: An incident is an immediate disruption of service, while a problem is the underlying cause of one or more incidents.

106
Q

Q: What is the purpose of a Change Advisory Board (CAB) in ITSM?

A

A: To review and approve proposed changes to minimize risks.

107
Q

Q: How is priority determined for incidents?

A

A: By assessing impact (business effect) and urgency (time sensitivity).
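
A minimal sketch of an impact and urgency matrix. The three levels and the P1 to P5 scale are a common convention but are assumptions here. Real matrices are defined per organization.

```python
# Hypothetical scale: lower number means more severe.
LEVELS = {"high": 1, "medium": 2, "low": 3}

def priority(impact: str, urgency: str) -> str:
    """Combine impact and urgency into P1 (most urgent) through P5."""
    return f"P{LEVELS[impact] + LEVELS[urgency] - 1}"
```

High impact with high urgency yields P1, while low impact with low urgency yields P5.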

108
Q

Q: What are the key benefits of having a Major Incident Process?

A

A: Faster resolution, improved communication, reduced downtime, and better documentation for future prevention.

109
Q

Q: How do you define a Knowledge Base (KB) in ITSM?

A

A: A repository of solutions, troubleshooting guides, and documentation to help resolve incidents efficiently.

110
Q

Q: What is the role of a Service Desk in incident management?

A

A: Acts as the single point of contact for users to report incidents and request services.

111
Q

Q: What is the difference between proactive and reactive problem management?

A

A: Reactive addresses incidents after they occur, while proactive identifies and prevents potential issues.

112
Q

Q: How do you measure the success of incident management?

A

A: Key metrics include Mean Time to Resolution (MTTR), First Call Resolution (FCR) rate, and SLA compliance.

113
Q

Q: What is the purpose of incident escalation?

A

A: To involve higher-level support or management when the current team cannot resolve the issue within SLA timelines.

114
Q

Q: What tool would you use to automate incident resolution workflows?

A

A: ServiceNow, PagerDuty, or Jira Service Management.

115
Q

Q: Write a for loop script to create 10 files named file1.txt to file10.txt.

A

#!/bin/bash

for i in {1..10}; do
  touch "file$i.txt"
done

116
Q

Q: How do you list all open ports on a Linux system?

A

A: netstat -tuln or ss -tuln.

117
Q

Q: What is a NAT in networking, and why is it used?

A

A: Network Address Translation translates private IP addresses to a public IP address for internet communication, conserving IPv4 addresses.

118
Q

Q: How can you create a new virtual network interface in Linux?

A

A: Use ip link add <name> type bridge or configure with ifconfig or ip addr.

119
Q

Q: What is the difference between a soft link and a hard link?

A

A: Soft links (symbolic links) point to the original file’s path, while hard links are direct references to the inode, unaffected by file relocation.
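
A minimal sketch of the difference, assuming a POSIX filesystem that supports both link types. The link_demo function and the file names are made up. It creates both kinds of link in a temporary directory, moves the original, and checks which link still works.

```python
import os
import tempfile

def link_demo() -> dict:
    """Create a hard link and a symlink, move the original, and see which survives."""
    with tempfile.TemporaryDirectory() as tmp:
        original = os.path.join(tmp, "original.txt")
        hard = os.path.join(tmp, "hard.txt")
        soft = os.path.join(tmp, "soft.txt")
        with open(original, "w") as fh:
            fh.write("data")
        os.link(original, hard)      # second directory entry for the same inode
        os.symlink(original, soft)   # small file that stores the path
        same_inode = os.stat(original).st_ino == os.stat(hard).st_ino
        os.rename(original, os.path.join(tmp, "moved.txt"))
        with open(hard) as fh:
            hard_content = fh.read()
        return {
            "same_inode": same_inode,
            "hard_link_readable": hard_content == "data",
            "symlink_broken": not os.path.exists(soft),  # exists() follows the link
        }
```

After the rename, the hard link still reads the data because it references the inode, while the symlink dangles because the path it stores no longer exists.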

120
Q

Q: What is the difference between ping and traceroute?

A

A: ping tests connectivity to a host, while traceroute shows the route packets take to reach the host.

121
Q

Q: How do you set up SSH key-based authentication on a Linux system?

A

Generate a key pair: ssh-keygen.
Copy the public key to the remote server: ssh-copy-id user@remote_host.
Ensure proper permissions: chmod 700 ~/.ssh and chmod 600 ~/.ssh/authorized_keys.

122
Q

Q: What is the purpose of the xargs command in shell scripting?

A

It converts input into arguments for a command. Example: ls | xargs rm removes files listed by ls.

123
Q

Q: How do you redirect both stdout and stderr to a file in Linux?

A

A: command > file 2>&1.

124
Q

Q: What are the different escalation types in incident management?

A

Functional Escalation: Involves higher-level technical expertise.
Hierarchical Escalation: Involves senior management for visibility or decision-making.

125
Q

Q: What are SLAs, OLAs, and UCs in ITSM?

A

SLA (Service Level Agreement): Defines service delivery expectations between a provider and a customer.
OLA (Operational Level Agreement): Defines responsibilities between internal teams.
UC (Underpinning Contract): Defines obligations between a provider and third-party vendors.

126
Q

Q: How do you handle a major incident during a critical business hour?

A

Assess Impact: Identify affected systems and services.
Activate Major Incident Process: Notify stakeholders and assemble an incident response team.
Communicate Updates: Provide regular updates to users and management.
Implement Fix: Work on resolution or mitigation.
Document: Log details for post-incident review.

127
Q

Q: What is Root Cause Analysis (RCA) in ITSM?

A

A process to determine the underlying reason for an incident or problem and identify corrective measures to prevent recurrence.

128
Q

Q: What metrics are used to measure the efficiency of incident management?

A

MTTR (Mean Time to Resolve)
MTTD (Mean Time to Detect)
First Call Resolution Rate
Incident Escalation Rate
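As a worked example, MTTD and MTTR can be computed from incident timestamps. This sketch assumes MTTD is measured from start to detection and MTTR from detection to resolution (definitions vary between organizations); the Incident class and the sample data are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Incident:
    started: datetime
    detected: datetime
    resolved: datetime


def mean_delta(deltas: List[timedelta]) -> timedelta:
    """Average a non-empty list of durations."""
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents: List[Incident]) -> timedelta:
    # Assumption: time to detect = detected - started.
    return mean_delta([i.detected - i.started for i in incidents])


def mttr(incidents: List[Incident]) -> timedelta:
    # Assumption: time to resolve = resolved - detected.
    return mean_delta([i.resolved - i.detected for i in incidents])


sample = [
    Incident(datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 10),
             datetime(2024, 1, 1, 11, 10)),
    Incident(datetime(2024, 1, 2, 9, 0), datetime(2024, 1, 2, 9, 20),
             datetime(2024, 1, 2, 9, 50)),
]

print("MTTD:", mttd(sample))
print("MTTR:", mttr(sample))
```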

129
Q

Q: What are the key responsibilities of an incident manager?

A

Coordinate response teams.
Ensure SLA compliance.
Communicate with stakeholders.
Drive root cause analysis and post-incident reviews.
Identify areas for process improvement.

130
Q

Q: How would you prepare an incident response plan for a security breach?

A

Identify: Detect and verify the breach.
Contain: Isolate affected systems.
Eradicate: Remove the threat.
Recover: Restore systems and data.
Learn: Conduct a post-incident review.

131
Q

Q: What is DNS, and why is it important?

A

DNS (Domain Name System) resolves human-readable domain names (e.g., google.com) into IP addresses that computers use to identify resources.

132
Q

Q: What are the types of DNS records, and what do they do?

A

A (Address): Maps a domain to an IPv4 address.
AAAA: Maps a domain to an IPv6 address.
CNAME: Maps an alias to another domain name.
MX (Mail Exchange): Specifies mail servers for a domain.
NS (Name Server): Specifies authoritative DNS servers for a domain.
PTR (Pointer): Provides reverse DNS, mapping an IP address to a hostname.
TXT: Stores arbitrary text, often used for verification and policies (e.g., SPF, DKIM).
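To make the PTR record concrete: a reverse lookup queries a special name derived from the IP address. The sketch below builds that name with Python's standard ipaddress module; the function name ptr_name is hypothetical and the addresses come from documentation ranges.

```python
import ipaddress


def ptr_name(ip: str) -> str:
    """Return the DNS name that a PTR lookup for this address would query."""
    return ipaddress.ip_address(ip).reverse_pointer


v4_name = ptr_name("192.0.2.10")
v6_name = ptr_name("2001:db8::1")
print(v4_name)
print(v6_name)
```

IPv4 addresses map into in-addr.arpa with the octets reversed, while IPv6 addresses map into ip6.arpa with each hexadecimal digit reversed.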

133
Q

Q: What is the purpose of the dig command in Linux?

A

A: Queries DNS servers for information about domains and their records.

134
Q

Q: How do you use nslookup to find the IP of a domain?

A

nslookup google.com
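A comparable lookup can be done from a script. This is a sketch using Python's socket module; resolve_ipv4 is a hypothetical name. Unlike nslookup, which queries a DNS server directly, getaddrinfo goes through the system resolver, so it may also consult local sources such as /etc/hosts.

```python
import socket


def resolve_ipv4(hostname: str) -> list:
    """Return the sorted, de-duplicated IPv4 addresses for a host name."""
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for IPv4 the sockaddr is an (address, port) pair.
    return sorted({info[4][0] for info in infos})


# A literal address resolves to itself without any lookup.
print(resolve_ipv4("127.0.0.1"))

# A name lookup can fail (for example without network access), so guard it.
try:
    print(resolve_ipv4("localhost"))
except socket.gaierror as exc:
    print("lookup failed:", exc)
```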

135
Q

What are the most commonly used Kubernetes commands?

A

- kubectl version: Get Kubernetes client and server versions.
- kubectl get pods: List pods in the current namespace.
- kubectl describe pod <pod_name>: Get detailed info about a pod.
- kubectl apply -f <file>.yaml: Apply configuration from a YAML file.
- kubectl delete pod <pod_name>: Delete a pod.

136
Q

What is Layer 1 of the OSI model, and what does it do?

A

Physical Layer
Responsible for the actual physical connection between devices. At this layer, data is transmitted as raw bits.

137
Q

What is Layer 2 of the OSI model, and what does it do?

A

Data Link Layer (DLL)
The data link layer is responsible for the node-to-node delivery of the message.

138
Q

What is Layer 3 of the OSI model, and what does it do?

A

Network Layer
The network layer handles the transmission of data from one host to another located in a different network. It also takes care of packet routing, i.e. selecting the best path for a packet from the routes available.

139
Q

What is Layer 4 of the OSI model, and what does it do?

A

Transport Layer
The transport layer provides services to the application layer and takes services from the network layer. Data at the transport layer is referred to as segments. It is responsible for the end-to-end delivery of the complete message. With TCP, the transport layer also acknowledges successful data transmission and retransmits data when an error is found. Protocols used in the Transport Layer include TCP and UDP.

140
Q

What is Layer 5 of the OSI model, and what does it do?

A

Session Layer
The session layer is responsible for establishing, managing, and terminating sessions between two devices. It also provides authentication and security. Protocols used in the Session Layer include NetBIOS and PPTP.

141
Q

What is Layer 6 of the OSI model, and what does it do?

A

Presentation Layer
The presentation layer is also called the Translation layer. Data from the application layer is extracted here and converted into the format required for transmission over the network. Formats and protocols associated with the Presentation Layer include JPEG, MPEG, GIF, and TLS/SSL.

142
Q

What is Layer 7 of the OSI model, and what does it do?

A

Application Layer
At the very top of the OSI Reference Model stack of layers, we find the Application layer which is implemented by the network applications. These applications produce the data to be transferred over the network. This layer also serves as a window for the application services to access the network and for displaying the received information to the user. Protocols used in the Application layer are SMTP, FTP, DNS, etc.

143
Q

TCP vs UDP

A

TCP
Establishes a connection before sending data so that data is transmitted reliably. TCP acknowledges received data, checks for errors, and retransmits lost segments. It does not encrypt data by itself.
UDP
Does not establish a connection and does not confirm receipt or retransmit lost data, so some data may be lost during transmission. It provides only a checksum to detect corrupted datagrams.
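The difference shows up directly in the socket API. The sketch below sends a payload over the loopback interface with each protocol; the function names are hypothetical and the timeouts are there only to keep the example from hanging.

```python
import socket

LOOPBACK = "127.0.0.1"


def udp_roundtrip(payload: bytes) -> bytes:
    """Send one datagram; no connection is set up and nothing is acknowledged."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as receiver:
        receiver.settimeout(2.0)
        receiver.bind((LOOPBACK, 0))            # port 0: let the OS pick a port
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sender:
            sender.sendto(payload, receiver.getsockname())
        data, _addr = receiver.recvfrom(4096)
        return data


def tcp_roundtrip(payload: bytes) -> bytes:
    """Send bytes over a connection that is established before any data flows."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.settimeout(2.0)
        listener.bind((LOOPBACK, 0))
        listener.listen(1)
        # connect() performs the TCP handshake with the listening socket.
        with socket.create_connection(listener.getsockname(), timeout=2.0) as client:
            conn, _addr = listener.accept()
            with conn:
                conn.settimeout(2.0)
                client.sendall(payload)
                client.shutdown(socket.SHUT_WR)  # signal end of stream
                chunks = []
                while True:
                    chunk = conn.recv(4096)
                    if not chunk:
                        break
                    chunks.append(chunk)
                return b"".join(chunks)


print(udp_roundtrip(b"ping"))
print(tcp_roundtrip(b"hello"))
```

On loopback the UDP datagram practically always arrives, but on a real network the recvfrom call could simply time out, whereas TCP would retransmit.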

144
Q

What qualifies as a P0 incident?

A

An issue that needs to be addressed immediately and with as many resources as required. Such an issue causes a full outage or makes a critical function of the product unavailable for everyone, without any known workaround.

145
Q

What is a P0?

A

Severe impact on TikTok end users: app functions are broken and users are encountering severe experience issues.
- 3 or more teams impacted, such as TCE, RDS, HDFS
- Quantifiable revenue or advertiser impact
- Security impact: risk to customer data, security breach, data loss, vulnerabilities, hack/attack, etc.

146
Q

Incident Priority Matrix

A

Priority is determined by impact (a system-wide effect versus a single-user effect) weighed against urgency.
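A priority matrix is essentially a lookup table. The following sketch is illustrative only: the three impact and urgency levels and the P0 to P4 labels assigned to each cell are assumptions, and a real matrix should come from the organization's own incident policy.

```python
# Hypothetical matrix: keys are (impact, urgency), values are priority labels.
PRIORITY_MATRIX = {
    ("high", "high"): "P0",
    ("high", "medium"): "P1",
    ("high", "low"): "P2",
    ("medium", "high"): "P1",
    ("medium", "medium"): "P2",
    ("medium", "low"): "P3",
    ("low", "high"): "P2",
    ("low", "medium"): "P3",
    ("low", "low"): "P4",
}


def priority(impact: str, urgency: str) -> str:
    """Look up the priority for an impact/urgency pair (case-insensitive)."""
    key = (impact.lower(), urgency.lower())
    if key not in PRIORITY_MATRIX:
        raise ValueError(f"unknown impact/urgency combination: {key}")
    return PRIORITY_MATRIX[key]


# System-wide outage that must be fixed now versus a single-user annoyance.
print(priority("high", "high"))
print(priority("low", "low"))
```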

147
Q

Incident management responsibilities

A

Responsibilities:

  1. The IMT is added to a P0 incident and begins tracking the incident timeline.
  2. Ensures escalation to the correct technical teams based on the systems impacted.
  3. Ensures that the incident is being addressed in a timely manner and drives escalations to team leads and managers.
  4. Opens a fatal record.
  5. Starts the incident analysis template to begin the incident report.
  6. Tracks the incident details and drives the incident group until the impact is mitigated.
  7. Adds all relevant data to the incident report and begins the post-incident review process.
    Note: security incidents are escalated to the appropriate security channel.
148
Q

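ITSM and ITIL definitions

A

ITSM: IT Service Management.
ITIL: Information Technology Infrastructure Library, a set of best practices for IT service management.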

149
Q

Incident response steps

A
  1. Initial triage: join or create the on-call, review chat logs, request an issue summary, request POC updates, and ensure all escalation contacts needed to investigate the issue are engaged.
  2. Manage the incident: update the TTP incident thread every 15 minutes for critical issues, request regular updates from technical teams, and populate the IAT with as much data as possible.
  3. Post-incident: once the impact is mitigated, lower the priority to P1, send a final message to the appropriate groups, create a Jira epic, and create the post-mortem doc.
150
Q

Incident life cycle

A
  1. Identify: detect and log the incident
  2. Analyze: categorize and prioritize incidents
  3. Respond: investigate, diagnose, and resolve incidents
  4. Review: post-mortem and improvements
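The four phases above can be modeled as an ordered sequence in which an incident may only move to the next phase. This is a small sketch; the Phase enum and the advance helper are hypothetical names.

```python
from enum import Enum
from typing import Optional


class Phase(Enum):
    IDENTIFY = "detect and log the incident"
    ANALYZE = "categorize and prioritize"
    RESPOND = "investigate, diagnose, and resolve"
    REVIEW = "post-mortem and improvements"


def next_phase(current: Phase) -> Optional[Phase]:
    """Return the phase that follows `current`, or None after the last one."""
    phases = list(Phase)               # Enum members iterate in definition order
    position = phases.index(current)
    return phases[position + 1] if position + 1 < len(phases) else None


def advance(current: Phase, target: Phase) -> Phase:
    """Move to `target` only if it is the immediate next phase."""
    if next_phase(current) is not target:
        raise ValueError(f"cannot go from {current.name} to {target.name}")
    return target


state = Phase.IDENTIFY
state = advance(state, Phase.ANALYZE)
state = advance(state, Phase.RESPOND)
print(state.name, "-", state.value)
```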
151
Q

Incident recording process

A
  1. Identify incidents
  2. Document incidents
  3. Categorize incidents
  4. Assign ownership
152
Q

Types of roles and ownership

A
  1. Incident coordinator
  2. Technical specialists
  3. Communication manager
  4. Process owner
153
Q

Types of incidents

A
  1. Service outage
  2. Security breach
  3. Human error
  4. Natural disasters
154
Q

RASCI checklist

A

Responsible: Person(s) doing the task.
Accountable: Person with final decision-making authority.
Supporting: Person(s) providing support or resources.
Consulted: Person(s) providing input or feedback.
Informed: Person(s) kept updated on progress or decisions.

155
Q

Example P0 Incident

A

Network Traffic Drop in TTP1 OCI

156
Q

Q: What is CI/CD?

A

CI/CD stands for Continuous Integration and Continuous Delivery (or Continuous Deployment). It automates the code integration, testing, and deployment processes.

157
Q

Q: What is Continuous Integration (CI)?

A

A: CI involves merging developer code changes into a shared repository multiple times a day, with automated builds and testing to detect integration issues early.

158
Q

Q: What is Continuous Delivery (CD)?

A

A: CD automates the release process so that the application is always in a deployable state, but deploying to production still requires manual approval.

159
Q

Q: What is Continuous Deployment?

A

A: Continuous Deployment automates the release process entirely, deploying every change that passes automated tests to production without manual intervention.

160
Q

Q: What are the main steps in a CI/CD pipeline?

A

Code Commit: Developers push code to a repository.
Build: The application is compiled, and dependencies are installed.
Test: Automated tests validate the code.
Deploy: The tested application is deployed to staging or production.
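The key property of these steps is that they run in order and a failure stops everything after it. This is a toy sketch, not how any particular CI tool works; run_pipeline and the lambda stages are hypothetical stand-ins for real build, test, and deploy jobs.

```python
from typing import Callable, List, Tuple

# A stage is a name plus a callable that returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first failure."""
    log = []
    for name, step in stages:
        ok = step()
        log.append(f"{name}: {'ok' if ok else 'failed'}")
        if not ok:
            break  # later stages (such as deploy) never run
    return log


passing = run_pipeline([
    ("build", lambda: True),
    ("test", lambda: True),
    ("deploy", lambda: True),
])
failing = run_pipeline([
    ("build", lambda: True),
    ("test", lambda: False),   # a failing test blocks the deploy stage
    ("deploy", lambda: True),
])
print(passing)
print(failing)
```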

161
Q

Q: What is the purpose of automated testing in CI/CD?

A

A: To ensure that changes do not break existing functionality or introduce bugs.

162
Q

Q: What tools are commonly used for CI/CD?

A

A: Jenkins, GitHub Actions, GitLab CI/CD, CircleCI, Travis CI, Azure DevOps, etc.

163
Q

Q: What is the role of version control in CI/CD?

A

A: Version control (e.g., Git) tracks changes, enables collaboration, and integrates with CI/CD pipelines for automated builds and testing.

164
Q

Q: What is canary deployment?

A

A: A strategy where a new version is rolled out to a small subset of users before full deployment to minimize risk.
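One common way to pick the "small subset of users" is to hash each user ID into a stable bucket. This is a sketch of that idea under stated assumptions: in_canary is a hypothetical function, and real rollouts are often done at the load balancer or service mesh instead.

```python
import hashlib


def in_canary(user_id: str, percent: int) -> bool:
    """Return True if this user falls inside the canary percentage (0 to 100)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Map the hash to a stable bucket from 0 to 99 for this user.
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


users = [f"user-{n}" for n in range(1000)]
canary_users = [u for u in users if in_canary(u, 10)]
print(len(canary_users), "of", len(users), "users get the new version")
```

Because the bucket depends only on the user ID, a user stays in the canary group as the percentage is raised, which keeps their experience consistent during the rollout.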

165
Q

Q: What is a rollback in CI/CD?

A

A: Reverting to a previous stable version in case of deployment failures.

166
Q

Q: What is the purpose of a RASCI matrix?

A

A: To define and clarify roles and responsibilities within a project or process to avoid confusion.

167
Q

Q: How do you use a RASCI matrix in incident management?

A

Assign clear responsibilities for incident detection, escalation, resolution, and review.
Ensure accountability is established for critical decision-making.
Identify key stakeholders to keep informed during incidents.

168
Q

Q: Give an example of RASCI roles in an incident response.

A

Responsible: Incident responder or SRE.
Accountable: Incident manager or lead.
Supporting: System administrator or SME.
Consulted: Security team or product owner.
Informed: Leadership or affected customers.
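The example roles above can be written down as data and sanity-checked. The sketch below assumes the common convention that each task has exactly one Accountable and at least one Responsible; validate_rasci and the task names are hypothetical.

```python
from typing import Dict, List

# Hypothetical matrix: task -> RASCI letter -> list of roles.
matrix: Dict[str, Dict[str, List[str]]] = {
    "resolve incident": {
        "R": ["Incident responder"],
        "A": ["Incident manager"],
        "S": ["System administrator"],
        "C": ["Security team"],
        "I": ["Leadership", "Affected customers"],
    },
    "post-incident review": {
        "R": [],
        "A": ["Incident manager", "Product owner"],
    },
}


def validate_rasci(assignments: Dict[str, Dict[str, List[str]]]) -> List[str]:
    """Return a list of problems; an empty list means the matrix looks sane."""
    problems = []
    for task, roles in assignments.items():
        accountable = roles.get("A", [])
        if len(accountable) != 1:
            problems.append(
                f"{task}: expected exactly one Accountable, found {len(accountable)}"
            )
        if not roles.get("R"):
            problems.append(f"{task}: no Responsible assigned")
    return problems


for problem in validate_rasci(matrix):
    print(problem)
```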

169
Q

Q: What are common challenges with using RASCI?

A

Overlapping roles leading to confusion.
Lack of agreement on responsibilities.
Not keeping the matrix up to date with organizational changes.

170
Q

Q: What is the difference between incident management and problem management?

A

Incident Management focuses on restoring service as quickly as possible.
Problem Management identifies and resolves the underlying cause of incidents.

171
Q

Q: What is the ITIL framework?

A

A: A set of best practices for IT service management that aligns IT services with business needs.

172
Q

Q: What is the purpose of the ITIL Service Desk function?

A

A: To act as a single point of contact (SPOC) for users to report incidents and request services.

173
Q

Q: What is a Change Advisory Board (CAB)?

A

A: A group responsible for evaluating and approving proposed changes to IT systems.

174
Q

Q: What are ITIL’s 4 Ps?

A

People: Stakeholders and roles.
Processes: Workflows and activities.
Products: Technology and tools.
Partners: Vendors and suppliers.

175
Q

Q: What are the key metrics in incident management?

A

Mean Time to Resolve (MTTR).
Mean Time to Detect (MTTD).
Number of recurring incidents.
SLA compliance rate.
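As a worked example of the last metric, the SLA compliance rate is the share of incidents resolved within the agreed target. The function name, the four-hour target, and the sample durations below are illustrative assumptions.

```python
from datetime import timedelta
from typing import List


def sla_compliance_rate(resolution_times: List[timedelta],
                        target: timedelta) -> float:
    """Fraction of incidents resolved within the SLA target (0.0 to 1.0)."""
    if not resolution_times:
        raise ValueError("no incidents to evaluate")
    met = sum(1 for t in resolution_times if t <= target)
    return met / len(resolution_times)


times = [
    timedelta(minutes=30),
    timedelta(hours=2),
    timedelta(minutes=50),
    timedelta(hours=5),        # this one breaches a 4-hour target
]
rate = sla_compliance_rate(times, timedelta(hours=4))
print(f"SLA compliance: {rate:.0%}")
```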

176
Q

Q: What is the difference between reactive and proactive incident management?

A

Reactive: Focuses on resolving incidents after they occur.
Proactive: Prevents incidents through monitoring, analysis, and improvement.