Interview Questions Flashcards

1
Q

Can you describe the process you follow when migrating applications from on-premises servers to AWS or Azure? What are some challenges you’ve encountered, and how did you address them?

A

1. Assisted my team with migrating servers from an on-premises environment to AWS, using AWS Application Migration Service (MGN) to replicate data to EC2 instances and VPNs to keep the cutover seamless.
2. We ran into replication speed issues, so we increased the EBS volume size on a memory-optimized replication server and chose not to throttle the bandwidth, which let MGN use its five concurrent connections to speed up replication.
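
As a rough way to reason about replication speed, the sketch below estimates how long an initial sync might take for a given data size and link speed. The function name and the `efficiency` factor are assumptions for illustration only; they are not part of MGN.

```python
def estimate_replication_hours(data_gib: float, bandwidth_mbps: float,
                               efficiency: float = 0.8) -> float:
    """Estimate hours needed to replicate data_gib over a bandwidth_mbps link.

    efficiency is an assumed fraction of the link that replication can
    actually use (protocol overhead, competing traffic, and so on).
    """
    if data_gib < 0:
        raise ValueError("data_gib must not be negative")
    if bandwidth_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("bandwidth must be positive and efficiency in (0, 1]")
    total_bits = data_gib * 1024 ** 3 * 8                # GiB -> bits
    usable_bits_per_second = bandwidth_mbps * 1_000_000 * efficiency
    return total_bits / usable_bits_per_second / 3600    # seconds -> hours


if __name__ == "__main__":
    # Hypothetical numbers: 500 GiB of disk over a 200 Mbps VPN link.
    print(f"{estimate_replication_hours(500, 200):.1f} hours")
```

Doubling the usable bandwidth halves the estimate, which is why removing a bandwidth cap is usually the first thing to check when replication is slow.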

2
Q

How have you used Terraform or similar tools to manage cloud infrastructure?

A

I work with developer teams to create Terraform templates and modules that they can use for self-service. The files I have written allow them to provision EC2 instances in a VPC, managed by a Kubernetes cluster, along with S3 buckets, IAM roles, and security groups. I would like to keep working with Terraform so that I can work my way up to more complex configurations.
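
To make the self-service idea concrete, here is a minimal sketch in Python that builds a Terraform configuration in Terraform's JSON syntax (a `.tf.json` file). The real templates would normally be written in HCL; the function name, resource names, instance type, and placeholder IDs are all hypothetical.

```python
import json


def build_terraform_config(app_name: str, ami_id: str, subnet_id: str) -> dict:
    """Return a Terraform JSON configuration for one small application stack.

    app_name should use only letters, digits, and underscores so that it is
    valid inside Terraform resource names.
    """
    return {
        "resource": {
            "aws_s3_bucket": {
                f"{app_name}_artifacts": {
                    # S3 bucket names may not contain underscores.
                    "bucket": f"{app_name.replace('_', '-')}-artifacts",
                }
            },
            "aws_instance": {
                f"{app_name}_server": {
                    "ami": ami_id,
                    "instance_type": "t3.micro",  # assumed default size
                    "subnet_id": subnet_id,
                    "tags": {"Name": app_name},
                }
            },
        }
    }


if __name__ == "__main__":
    # Placeholder IDs; a team would pass its own AMI and subnet.
    config = build_terraform_config("demo", "ami-EXAMPLE", "subnet-EXAMPLE")
    print(json.dumps(config, indent=2))
```

Writing the output to a file such as `main.tf.json` lets Terraform read it alongside regular `.tf` files.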

3
Q

How do you use tools like Datadog and Splunk to set up monitoring and alerting for applications? Can you provide an example of how you’ve configured alerts based on specific SLOs?

A

The tool I’ve used the most is Datadog. For my first big project at Fidelity, I created monitors and alerts for the CPU and memory usage and the node counts of our production clusters in EKS. Then, a year ago, my team and the SRE team began an initiative to gather SLIs and SLOs from all of our developer teams and create application monitors in Datadog for them. For example, an SLI could be the endpoint load time being less than 2 ms, with an SLO that 99.8% of requests meet it.
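
The SLI/SLO example above can be sketched as plain arithmetic. This is not Datadog's API; the function names are hypothetical, and only the 2 ms threshold and 99.8% target come from the answer.

```python
def sli_compliance(latencies_ms: list, threshold_ms: float = 2.0) -> float:
    """Return the fraction of requests whose load time is under the threshold."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    good = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return good / len(latencies_ms)


def slo_met(latencies_ms: list, threshold_ms: float = 2.0,
            target: float = 0.998) -> bool:
    """Return True when the measured SLI meets or exceeds the SLO target."""
    return sli_compliance(latencies_ms, threshold_ms) >= target


if __name__ == "__main__":
    # Hypothetical window: 999 fast requests and 1 slow one.
    samples = [1.0] * 999 + [5.0]
    print(sli_compliance(samples), slo_met(samples))
```

An alert would fire when `slo_met` turns false for the evaluation window, which is the condition a monitoring tool checks on your behalf.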

4
Q

What automation scripts or tools have you developed or used to streamline server configuration or deployment processes? How did this impact system reliability and efficiency?

A

For one of my current projects, I am migrating Red Hat Linux servers to Red Hat Linux 8 servers. The old servers used TC, and the new ones run open-source Tomcat. For these types of migrations, my team has a uDeploy deployment process that installs Tomcat, the certificate, the Java binaries, the JDK, and the cron jobs. The process isn’t always completely seamless, but it is much faster than doing each step manually, and once it has been tested on DEV, all you have to do is deploy the same process to the other environments. All in all, it has greatly improved reliability and efficiency.

5
Q

How do you approach server configuration and OS upgrades to ensure compliance with company guidelines and security standards?

A

For server configuration, I take into account our developers’ needs and the business requirements. Our on-prem virtual machines are managed by a platform that keeps the servers up to the company’s security standards. On my end, I can configure user access and logging, and we have organization-wide incident response plans.

6
Q

How have you used Docker and Kubernetes (EKS) to manage application deployments? What are some best practices you follow to ensure stability and reliability?

A

I have helped support hundreds of applications in Kubernetes, which we use for redundancy and reliability. We implement autoscaling and load balancers so that the pods scale automatically; however, if there is ever an issue with an app, we are able to go in and restart pods, or even exec into the pods in order to view their logs. To ensure stability, we use managed node groups and multi-availability-zone deployments. We suggest that our application teams use blue-green deployments, and as mentioned earlier, I have experience creating monitoring and alerting for our EKS resources.
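
For pod autoscaling, the Kubernetes Horizontal Pod Autoscaler documents a simple scaling rule: desired replicas = ceil(current replicas × current metric / target metric). The sketch below applies that rule, assuming the HPA is the autoscaler in use (the answer does not say which one). The min/max bounds and function name are illustrative.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int = 1,
                     max_replicas: int = 10, tolerance: float = 0.1) -> int:
    """Apply the HPA scaling rule, clamped to the configured bounds."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    # Skip scaling when the metric is already close enough to the target.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # Hypothetical: 4 pods averaging 90% CPU against a 60% target.
    print(desired_replicas(4, 90, 60))
```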

7
Q

Can you explain how you conduct audits with Active Directory to maintain security and separation of control between different teams?

A

Every year I have assisted our security operations department in conducting separation-of-control reviews on the highly critical applications that my team helps support. My team reports to them each user group that has access to execute production changes for the applications, as well as a list of the users in those groups. We use the Active Directory console, which contains data on all of our groups and members and makes it easy to capture screenshots of who is in which group.
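
A review like this could also be checked in code once the group memberships are exported. The sketch below assumes a separation-of-control rule where nobody should be in both a developer group and a production-change group. That rule, the group names, and the users are hypothetical, not taken from the actual audit.

```python
from __future__ import annotations


def users_in(groups: set[str], group_members: dict[str, set[str]]) -> set[str]:
    """Collect every user that belongs to at least one of the given groups."""
    users: set[str] = set()
    for group in groups:
        users |= group_members.get(group, set())
    return users


def separation_conflicts(group_members: dict[str, set[str]],
                         prod_change_groups: set[str],
                         developer_groups: set[str]) -> list[str]:
    """Return users who appear on both sides of the separation boundary."""
    prod_users = users_in(prod_change_groups, group_members)
    dev_users = users_in(developer_groups, group_members)
    return sorted(prod_users & dev_users)


if __name__ == "__main__":
    # Hypothetical export of group memberships.
    members = {
        "app1-prod-deployers": {"alice", "bob"},
        "app1-developers": {"bob", "carol"},
    }
    print(separation_conflicts(members, {"app1-prod-deployers"}, {"app1-developers"}))
```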

8
Q

Describe a time you encountered a significant incident in a production environment. How did you handle it, and what steps did you take to prevent similar issues in the future?

A

I have been involved in many highly critical incidents throughout my time with Fidelity where applications have crashed and required intervention. The three most common fixes from my end are to 1) reboot the servers, 2) recycle the apps, or 3) resolve an issue with a certificate. For example, there was once an incident where external customers were having issues verifying their identity when trying to recover their username or reset their password. After some investigation, we found that the port that allowed the external website to verify the digital cert on the servers was closed. Once we opened the port, traffic began to flow again and the issue was resolved.
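
A closed port like the one in this incident can be confirmed quickly with a TCP connection test. This is a minimal sketch using only the standard library; the host and port in the usage example are placeholders, not the real systems.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Placeholder target; substitute the server and port under investigation.
    print(is_port_open("localhost", 443))
```

Running the same check from the client side and from the server itself helps show whether the block is on the host or somewhere in the network path.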

9
Q

How have you implemented strategies to reduce the cost of resources used by development and production systems? Can you provide specific examples?

A

Though I am not directly involved in determining the budget for our teams, I do suggest best practices to application teams. For example, I recommend provisioning only what is needed in the cloud in order to save costs. Although those costs come out of the app teams’ budgets rather than ours, I still urge them to make sound financial decisions when provisioning new resources.

10
Q

How do you work with SREs and software developers to communicate SLI and SLO expectations effectively?

A
11
Q

What is your approach to creating and maintaining documentation and runbooks for application support? How do you ensure they are kept up to date and useful for the team?

A
12
Q

Can you discuss your experience leading a team through an Agile sprint? What challenges did you face, and how did you overcome them?

A
13
Q

How do you build and maintain strong relationships with software developers and other stakeholders to ensure successful application deployments and ongoing support?

A
14
Q

How do you use automation tools like Ansible, Puppet, or Chef in your infrastructure management processes?

A
15
Q

Can you describe a time when you automated a repetitive task in your infrastructure, and what was the outcome?

A
16
Q

What is your experience with deploying and managing applications across public and private clouds?

A
17
Q

How have you used Python or Java in managing infrastructure tasks or automating processes?

A
18
Q

Can you provide an example of a script you wrote to solve an infrastructure-related problem?

A
19
Q

What are some challenges you have faced while migrating applications across different cloud platforms, and how did you address them?

A
20
Q

How do you ensure scalability and resilience in cloud-based applications?

A
21
Q

How do you use monitoring tools like AppDynamics or Splunk to ensure application performance and availability?

A
22
Q

Describe how you have handled a critical application outage and what steps you took for root cause analysis and remediation.

A

If it is a critical outage, I first investigate whether any updates or changes were made recently, and I check the application logs, which usually point me in the right direction. In my experience, an outage that is not caused by a developer update is usually caused by expired digital certificates or by network traffic getting stuck at some point. Figuring out where the traffic is stuck tells you what to fix.
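
Since expired certificates are a common cause, a small check of the certificate's expiry date can rule that out early. The sketch below works on the `notAfter` string that Python's `ssl` module reports for a peer certificate; the function name is hypothetical.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Return days remaining before a certificate expires.

    not_after uses the certificate time format, for example
    "Jan  5 09:34:43 2018 GMT". A negative result means already expired.
    """
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after),
                                     tz=timezone.utc)
    return (expires - now).total_seconds() / 86400


if __name__ == "__main__":
    remaining = days_until_expiry("Jan  5 09:34:43 2018 GMT",
                                  datetime.now(timezone.utc))
    print("expired" if remaining < 0 else f"{remaining:.0f} days left")
```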

23
Q

How have you implemented CI/CD pipelines using tools like Jenkins and Git?

A
24
Q

What are some DevOps practices you have adopted in your previous roles to improve infrastructure reliability?

A

Using IaC, CI/CD, monitoring and logging, scaling and load balancing, and disaster recovery and backups.

25
Q

Describe a challenging technical problem you faced and how you approached solving it.

A

A lot of my technical issues come from testing automation. Whether it is automation for certificate installations or automation to make updates to a server, there are many unique problems I can run into depending on what I’m working with. My approach is to always research the documentation first. Preparation and learning are the most important part and usually take the longest. I also ask around for examples of jobs similar to what I’m trying to accomplish and study those as well. Then I break the automation down and start building and testing it bit by bit. I run it, get an error, and fix it, over and over, until it finally works. Then I send it out to my teammates to test, and finally I write up documentation for it.

26
Q

How do you determine when to escalate an issue to higher management or technical experts?

A
27
Q

How do you make decisions when working on projects involving multiple technologies and applications?

A

When working on projects involving multiple technologies, I want to make sure I understand what I’m getting into and know how the different technologies fit together. Reading the documentation and doing any initial learning is the most important step, along with discussing best practices with my senior teammates. That way I can be sure that I am making the most informed decision.

28
Q

Can you describe a situation where you had to make a significant decision about infrastructure design or implementation?

A

I have not yet had the opportunity to be involved directly with designing new infrastructure, though I have been encouraged to suggest changes to the infrastructure builds that we have. If I were to design infrastructure, I would recommend creating a handful of infrastructure templates that you can apply to groups of similar applications and then tweaking the configurations for their specific needs, rather than having hundreds of different infrastructure implementations for hundreds of different applications. That way you increase the reliability and stability of your systems.
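
The template-plus-tweaks approach can be sketched as a shared base configuration merged with per-application overrides. The template contents and key names below are hypothetical and only illustrate the idea.

```python
import copy


def deep_merge(base: dict, overrides: dict) -> dict:
    """Return a new dict with overrides applied on top of base, recursively."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# One shared template for a group of similar applications (assumed values).
WEB_APP_TEMPLATE = {
    "instance_type": "t3.medium",
    "replicas": 2,
    "monitoring": {"enabled": True, "cpu_alert_percent": 80},
}

if __name__ == "__main__":
    # A single application only states what differs from the template.
    app_config = deep_merge(WEB_APP_TEMPLATE,
                            {"replicas": 4, "monitoring": {"cpu_alert_percent": 90}})
    print(app_config)
```

Because each application records only its differences, a fix made to the template reaches every application built from it.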

29
Q

How do you build strong relationships with development teams to ensure reliability and prevent incidents?

A

When it comes to supporting my developers, I make sure to be very communicative with them. I prefer to use team chats over email, and I let them know that I am always available to jump on a call if they need it. My team has a great reputation for developer support, and it has always been my duty to carry on that reputation.

30
Q

Describe a time when you worked closely with a team to identify and resolve capacity risks.

A