Interview Questions Flashcards
Can you describe the process you follow when migrating applications from on-premises servers to AWS or Azure? What are some challenges you’ve encountered, and how did you address them?
1. Assisted the team with migrating servers from an on-prem cloud provider to AWS, leveraging AWS Application Migration Service (MGN), transferring data to EC2 instances, and using VPNs to make the cutover seamless.
2. Ran into replication speed issues, so we increased the size of the EBS volume on a memory-optimized replication server and changed the bandwidth throttling setting so that MGN used 5 concurrent connections, which sped up the process.
How have you used Terraform or similar tools to manage cloud infrastructure?
I work with developer teams to create templates and modules for their own self-service. The files I have written allow them to provision EC2 instances on a VPC, managed by a Kubernetes cluster, with S3 buckets, IAM roles, and security groups. I would like to continue working with Terraform so that I can work my way up to creating more complex configurations.
How do you use tools like Datadog and Splunk to set up monitoring and alerting for applications? Can you provide an example of how you’ve configured alerts based on specific SLOs?
The tool I’ve used the most is Datadog. For my first big project at Fidelity, I created monitors and alerts for the CPU usage, memory usage, and node counts of our production clusters in EKS. Then, a year ago, my team and the SRE team began an initiative to gather SLIs and SLOs from all of our developer teams and create application monitors in Datadog for them. For example, an SLI might be endpoint load time under 2 ms, with an SLO of 99.8%.
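To make the relationship between an SLI threshold and an SLO target concrete, here is a minimal Python sketch. It is not how Datadog evaluates monitors; the function name, the dataclass, and the sample latencies are all assumptions made for illustration. It counts how many measurements meet the 2 ms threshold and checks that fraction against the 99.8% target.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SloResult:
    total: int         # number of measurements evaluated
    good: int          # measurements that met the SLI threshold
    compliance: float  # good / total
    met: bool          # True when compliance is at or above the SLO target


def evaluate_latency_slo(
    latencies_ms: Sequence[float],
    threshold_ms: float = 2.0,  # SLI: load time under 2 ms
    target: float = 0.998,      # SLO: 99.8% of requests meet the SLI
) -> SloResult:
    """Check a batch of latency measurements against an SLI threshold and SLO target."""
    if not latencies_ms:
        raise ValueError("need at least one measurement")
    good = sum(1 for value in latencies_ms if value < threshold_ms)
    compliance = good / len(latencies_ms)
    return SloResult(len(latencies_ms), good, compliance, compliance >= target)


# Made-up sample: 999 fast requests and 1 slow one.
result = evaluate_latency_slo([1.2] * 999 + [7.5])
print(result.compliance, result.met)
```

In practice the same comparison would run over a rolling window (for example 7 or 30 days) rather than a single batch.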
What automation scripts or tools have you developed or used to streamline server configuration or deployment processes? How did this impact system reliability and efficiency?
For one of my current projects, I am migrating Red Hat Linux servers to Red Hat Linux 8 servers. The old ones used TC and the new ones run open-source Tomcat. For these types of migrations, my team has a uDeploy deployment process that installs Tomcat, the certificate, Java binaries, the JDK, and cron jobs. The process isn’t always completely seamless, but it is much faster than doing each step manually, and once it’s tested on DEV, all you have to do is deploy the same process to the other environments. All in all, it has greatly improved reliability and efficiency.
How do you approach server configuration and OS upgrades to ensure compliance with company guidelines and security standards?
For server configuration, I take into account our developers’ needs and business requirements. Our on-prem virtual machines are managed by a platform that keeps the servers up to the company’s security standards. On my end, I configure user access and logging, and we have organization-wide incident response plans.
How have you used Docker and Kubernetes (EKS) to manage application deployments? What are some best practices you follow to ensure stability and reliability?
I have helped support hundreds of applications in Kubernetes, which we use for redundancy and reliability. We implement autoscaling groups and load balancers so that the pods scale automatically. If there is ever an issue with an app, we are able to go in and restart pods, or even exec into the pods in order to view their logs. To ensure stability, we use managed node groups and multi-availability-zone deployments. We suggest that our application teams use blue-green deployments, and as mentioned earlier, I have experience creating monitoring and alerting for our EKS resources.
Can you explain how you conduct audits with Active Directory to maintain security and separation of control between different teams?
Every year I have assisted our security operations department in conducting separation-of-control reviews on the highly critical applications that my team helps support. My team reports to them each user group that has access to execute production changes for the applications, as well as a list of the users in those groups. We use the Active Directory console as a directory that contains data on all of our groups and members, which makes it easy to get screenshots of who’s in what groups.
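One check a review like this can include is looking for users who sit in both a production-change group and a developer group. The Python sketch below is an assumption about what such a check could look like, not the process described above. It works on a plain dictionary of group memberships (for example, transcribed or exported from the directory), and every group and user name in it is hypothetical.

```python
from typing import Dict, List, Set


def users_with_conflicting_access(
    memberships: Dict[str, Set[str]],
    change_groups: Set[str],
    developer_groups: Set[str],
) -> List[str]:
    """Return users who belong to both a production-change group and a developer group."""
    changers: Set[str] = set()
    developers: Set[str] = set()
    for group, members in memberships.items():
        if group in change_groups:
            changers |= members
        if group in developer_groups:
            developers |= members
    # Users in both sets hold overlapping access and should be reviewed.
    return sorted(changers & developers)


# Hypothetical group membership data (group and user names are made up).
example = {
    "APP1-PROD-DEPLOY": {"alice", "bob"},
    "APP1-DEVELOPERS": {"bob", "carol"},
    "APP1-READONLY": {"dave"},
}
conflicts = users_with_conflicting_access(
    example,
    change_groups={"APP1-PROD-DEPLOY"},
    developer_groups={"APP1-DEVELOPERS"},
)
print(conflicts)  # ['bob']
```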
Describe a time you encountered a significant incident in a production environment. How did you handle it, and what steps did you take to prevent similar issues in the future?
I have been involved in many highly critical incidents throughout my time with Fidelity where applications have crashed and required intervention. The three most common fixes from my end are usually to 1) reboot the servers, 2) recycle the apps, or 3) fix an issue with a certificate. For example, there was once an incident where external customers were having issues verifying their identity when trying to recover their username or reset their password. After some investigation, we found that the port used to allow the external website to verify the digital cert on the servers was closed. Once we opened the port, traffic began to flow again and the issue was resolved.
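A quick way to confirm a closed-port diagnosis like this one is a plain TCP connection test. The sketch below uses only Python's standard `socket` module. The helper name is hypothetical, and the host and port in the usage note are placeholders rather than the ones from the incident.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, `port_is_open("cert-host.example.com", 443)` would return False while the port is blocked and True once it is opened. A check like this could also run on a schedule, so that a closed port raises an alert before customers notice.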
How have you implemented strategies to reduce the cost of resources used by development and production systems? Can you provide specific examples?
Though I am not directly involved in determining the budget for our teams, I do suggest best practices to application teams, such as provisioning as little as possible in the cloud in order to save costs. Though those costs would come from the app teams and not from us, I still urge them to make sound financial decisions when provisioning new resources.
How do you work with SREs and software developers to communicate SLI and SLO expectations effectively?
What is your approach to creating and maintaining documentation and runbooks for application support? How do you ensure they are kept up to date and useful for the team?
Can you discuss your experience leading a team through an Agile sprint? What challenges did you face, and how did you overcome them?
How do you build and maintain strong relationships with software developers and other stakeholders to ensure successful application deployments and ongoing support?
How do you use automation tools like Ansible, Puppet, or Chef in your infrastructure management processes?
Can you describe a time when you automated a repetitive task in your infrastructure, and what was the outcome?