Situational Interview 4 Flashcards
Tell me about a time when you went above and beyond for a customer?
(account future date) Nurse cannot log in; she mentioned she was recently hired. Found the user account was set to activate on a future date. The client owns security, so the standard process here is “call your helpdesk.” Instead: anyone else hired at the same time? Identified 15 others with a 7am start time. Called the client’s helpdesk, conferenced the user in, explained the issue and the broader impact coming; they paged their team and resolved it. Nurses were able to document on patients and give care instead of going to downtime.
The nurse couldn’t log in; I searched the system and found her user account was set to become active on a future date.
She said she was a recent hire, herself and 15 others, though she wasn’t sure whether that was relevant.
Security and account setup are handled entirely by the client, but I wanted to identify what issue she was having.
The nurse didn’t know anyone who could help?
I identified the newly onboarded nurses and found they would all run into the same issue when they came into the hospital in a few hours.
I searched our ticketing system to see if we had any information about the client’s service desk and located their contact details.
I gave the nurse the phone number but told her I would call over, conference us all together, and explain what was happening.
I called and conferenced everyone in. The service desk had to engage their security team; once they were pulled in, they updated the currently impacted user along with the 15 other people who would have been affected.
As a result, she was able to document on patients overnight rather than going to downtime procedures, and so were the nurses coming in at 7am.
(tight deadline) Tell me about a time when you had to deliver an important project under a tight deadline?
(Exceed downtime) Happens frequently - upgrade team about to exceed maintenance window in 1 HOUR - critical services not starting - would be a DOWN system - major impact to doctors and nurses - my role: troubleshoot/expedite resolution to meet SLAs - tech bridge - app tier - critical services crashing - missing parameters pointing to config files - updated params - critical services started - upgrade continued and completed within the timeline and system released to users - NO HOSPITAL-WIDE DOWNTIME, NO PAPER CHARTING FOR DOCTORS/NURSES
This happens often, as I’m responsible for ensuring we maintain our SLAs for clients’ system availability. One of my more recent incidents:
A Cerner upgrade team engaged me because they were about to exceed their maintenance window within the next hour. The upgrade was going poorly, and the technical team was behind schedule due to issues with critical services not starting within the application tier. This is the type of issue that would keep the entire system down and prevent it from being released back to users; effectively, the electronic health record software wouldn’t be available, and all charting would need to be done on paper. That would cause major impact to nurses and doctors.
My role is to troubleshoot and expedite the resolution of critical issues to meet SLAs.
I started an internal technical voice bridge and pulled in the upgrade teams and the resources performing the upgrade.
I connected to the app tier and found the services were crashing because they were missing the parameters required to locate their configuration files.
I updated the parameters to point to the correct file locations and launched the core services. With the core services up and running, the upgrade continued, and the system was released back to users on time without causing hospital-wide impact to doctors and nurses.
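The parameter format here is Cerner-internal, so as a purely illustrative sketch: the shape of the fix was “make sure each service’s config-path parameters actually resolve to real files before starting services.” A minimal Python version of that sanity check, assuming a hypothetical key=value parameter file and made-up key names:

```python
from pathlib import Path

def load_params(param_file: str) -> dict:
    """Parse a simple key=value parameter file (hypothetical format, not Cerner's)."""
    params = {}
    for line in Path(param_file).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, value = line.split("=", 1)
            params[key.strip()] = value.strip()
    return params

def missing_config_paths(params: dict, path_keys: list[str]) -> list[str]:
    """Return the parameter keys whose configuration-file paths are unset or don't exist."""
    return [
        key for key in path_keys
        if not params.get(key) or not Path(params[key]).is_file()
    ]

if __name__ == "__main__":
    # Hypothetical file name and key names, for illustration only.
    params = load_params("service.params")
    bad = missing_config_paths(params, ["CONFIG_FILE", "LOG_CONFIG"])
    if bad:
        print("Fix these parameters before starting services:", ", ".join(bad))
    else:
        print("All config-file parameters resolve; safe to start services.")
```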
*** (DIG INTO DETAILS) Tell me about a time when you were trying to understand a complex problem on your team and you had to dig into the details to figure it out?
Interface issue - the client wasn’t seeing anything reach their system (messages going from Cerner to external systems). Three interfaces weren’t making it across to them, which was causing the downstream systems to go into downtime:
- Admit (admissions), discharge, and transfer messages (critical because as new patients were admitted, downstream healthcare tools weren’t seeing them admitted)
- Medication orders going to the pharmacies - pharmacies weren’t seeing any electronic medication orders or scripts
- Radiology images - radiologists weren’t seeing ordered X-rays and MRIs
There were no backlogs on our interfaces, but the client hadn’t seen anything new reach their system for 45 minutes.
I connected to their app tier and confirmed nothing was queueing in the interface system - the last system messages pass through before exiting to the foreign system. I then had to locate charts and diagrams of the services and Oracle tables to pinpoint where the breakdown was happening deeper inside the core application. I identified the services that weren’t functioning and found that their file system had run out of disk space. I cleared space and they started working again, dropping messages into the outbound external interface and exiting the Cerner system.
The client confirmed they were seeing the messages cross and data appearing in the other systems. This allowed admits, medication orders, and radiology images to flow again and made the external systems usable. With the three systems functional, doctors, nurses, and staff could stop documenting information on paper (which had made their workflow about 5x slower) and return to the electronic record.
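Since the breakdown came down to a full file system stopping the services from writing outbound messages, here is a minimal sketch of that kind of disk-space check in Python - the mount points and the 90% threshold are illustrative assumptions, not the monitoring actually used:

```python
import shutil

# Hypothetical mount points the interface services write to; illustrative only.
WATCHED_PATHS = ["/cerner/interfaces/outbound", "/cerner/logs"]
ALERT_THRESHOLD = 0.90  # flag any file system that is more than 90% full

def check_disk_usage(paths, threshold):
    """Return a warning line for each file system whose usage exceeds the threshold."""
    alerts = []
    for path in paths:
        usage = shutil.disk_usage(path)
        used_fraction = usage.used / usage.total
        if used_fraction >= threshold:
            alerts.append(f"{path}: {used_fraction:.0%} full")
    return alerts

if __name__ == "__main__":
    for alert in check_disk_usage(WATCHED_PATHS, ALERT_THRESHOLD):
        print("Disk space warning -", alert)
```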
(Change in Direction, pushback) Tell me about a time when you had to communicate a change in direction that you anticipated people would have concerns with.
(Became SME, then discovered potential to help with rollovers - TEAM OPPOSITION) Direct Care - BMDI - ICU patients - triage team of 2 per shift - couldn’t resolve these in IRC - shadowed, wrote KB articles, distributed to shifts. WHILE training SMEs on shift, I saw an uptick during/after upgrades and KNEW WE COULD DO BETTER - PUSHBACK from several teams who were concerned we’d take over entirely - initiated flowout/rollover to handle the additional volume - the team came to realize it benefitted the working relationship between IRC/DC, and we are still using this system today. Saved 70K per shift, 280K across all shifts - like SPOT instances.
We have a team called Direct Care, which is responsible for the bedside medical devices that send vitals from patients into the electronic chart. These are patients in the intensive care unit, and as patients go, they are some of the most vulnerable and sensitive.
An overnight team exists for this solution specifically to triage and try to resolve incoming issues. I had identified a trend: we were receiving calls for this solution and were not able to resolve them within our support organization.
I planned to shadow the Direct Care team, write knowledge base articles, distribute them to the shifts, and train 10-12 SMEs so our teams could handle issues on first contact.
I then saw a very significant uptick in calls during and after upgrades to BMDI - calls were stacking up waiting on this two-person overnight team. This is where I knew our org could do better. I worked with our leadership and Direct Care to ensure that the subject matter experts in my org received rollover calls to handle the volume spikes the team was experiencing.
There was pushback from several teams who were concerned we’d take over the solution entirely.
The teams came to realize it benefitted the working relationship between IRC and Direct Care, and we are still using this system today.
Direct Care is often able to handle the throughput on their own, and with the SMEs from my support org available for rollover, we were able to scale out to handle the peaks in volume and save the company a minimum of 70K per shift in additional workforce cost (280K total). We were kind of like SPOT instances.
(subject matter, deeper level of knowledge) Tell me about a time when you realized you needed a deeper level of subject matter expertise to do your job well.
Fetalink (page-outs from 7 down to 2 for a support team of 4, and MTTR cut in half)
A software solution called Fetalink monitors mothers and babies during labor & delivery. Necessary new releases had been pushed out by the solutions team, but incidents were occurring frequently, so Cerner leadership changed my org’s escalation process around the software and deemed all issues coming into my organization high severity.
Our role in the immediate response center is to try to resolve issues on first contact before we engage other teams. We follow a metric called mean time to resolution, which measures how long it takes from the start of an issue until it is entirely resolved - regardless of which teams were engaged.
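As a concrete illustration of that metric (my own sketch, not Cerner’s actual reporting tooling), MTTR is just the average of resolution time minus start time across incidents:

```python
from datetime import datetime, timedelta

def mean_time_to_resolution(incidents):
    """Average (resolved - started) across incidents, regardless of which teams were engaged."""
    durations = [resolved - started for started, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)

# Illustrative timestamps only.
incidents = [
    (datetime(2021, 3, 1, 2, 15), datetime(2021, 3, 1, 3, 45)),  # resolved in 1h30m
    (datetime(2021, 3, 2, 23, 0), datetime(2021, 3, 3, 1, 0)),   # resolved in 2h00m
]
print(mean_time_to_resolution(incidents))  # 1:45:00
```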
I quickly noticed an uptick in the number of on-call page-outs to the solution team, happening because my organization didn’t know the solution well enough to resolve the issues on first contact without engaging teams external to my org.
I immediately reached out to my leadership and the Fetalink team leads to schedule time to shadow and sit with the Fetalink support team, which was only 4 people. I shadowed them during my time off for several days and collected common issues, fundamentals of the architecture, and quick wins.
I created wiki documentation, quick wins, and issue scenarios to best lead to next steps and resolutions - this information was shared with my entire support organization.
We work 5 different shifts around the clock; in total we have about 80 people.
I scheduled time with the other shifts and taught 2-3 people per shift to be subject experts for Fetalink within my org - they in turn taught their teammates.
With this initiative, my support organization was able to resolve issues on first contact, and we reduced overall page-outs to this team by more than half, taking weekly page-outs from 7 down to 2. That helped greatly for a team of 4 who rotated the pager, and I received several thank-yous from the team for helping alleviate their burnout. My support org was also able to cut our mean time to resolution in half, because we could fix issues without waiting for other teams to accept the page, sign on, and investigate - that delay was removed.
(Took on more, not your responsibility) Tell me about a time when you took on something significant outside your area of responsibility.
John became an acting technical lead.
There are 2 tech leads on night shift; they are the most senior, most knowledgeable people on the team and the go-to people for hard issues. This role does not turn over often - for example, my 2 technical leads have both been in their roles for over 20 years.
Cerner offers a sabbatical. When one lead goes away for 6 weeks, the single remaining tech lead can’t take vacation and is responsible for everyone. John spoke to his manager and asked to take on acting tech lead duties, making everyone aware that John was available for issues.
He took this on in addition to his current work, with 10-12 other people potentially reaching out at any time. John kept it up for the 6 weeks the other lead was out, which was not remotely required by the job.
John’s own lead told him he was working so hard he might burn out and warned him not to.
John was doing everything he possibly could.
Any time a new incident came up, John would review the incident logs for active issues - well outside John’s job.
He reviewed everyone’s issues to make sure he could help when they needed it.
After the 6 weeks were up and the tech lead came back, John continued doing that same work - continuing to be a point of presence, a go-to person, and so on. The team lead considers John a tech lead, even though he doesn’t hold the title.
Because John is so involved, issues are resolved faster: there are only 2 tech leads at any time, but John’s team effectively has 3, so the team does the best job it can. There is no budget available for a 3rd tech lead, which is part of why John is looking for a new challenge - he can’t get a tech lead role, because once people enter that position, no openings come up until someone leaves the company.
(Initiative/Bigger Better) Tell me about a time when you were working on an initiative or goal and saw an opportunity to do something much bigger or better than the initial focus.
(Became SME, then discovered potential to help with rollovers) Direct Care - BMDI - ICU patients - triage team of 2 per shift - couldn’t resolve these in IRC - shadowed, wrote KB articles, distributed to shifts. WHILE training SMEs on shift, I saw an uptick during/after upgrades and KNEW WE COULD DO BETTER - initiated flowout/rollover to handle the additional volume - saved 70K per shift, 280K across all shifts - like SPOT instances.
We have a team called Direct Care, which is responsible for the bedside medical devices that send vitals from patients into the electronic chart. These are patients in the intensive care unit, and as patients go, they are some of the most vulnerable and sensitive.
An overnight team exists for this solution specifically to triage and try to resolve incoming issues. I had identified a trend: we were receiving calls for this solution and were not able to resolve them within our support organization.
I planned to shadow the Direct Care team, write knowledge base articles, distribute them to the shifts, and train 10-12 SMEs so our teams could handle issues on first contact.
I then saw a very significant uptick in calls during and after upgrades to BMDI - calls were stacking up waiting on this two-person overnight team. This is where I knew our org could do better. I worked with our leadership and Direct Care to ensure that the subject matter experts in my org received rollover calls to handle the volume spikes the team was experiencing. Direct Care is often able to handle the throughput on their own, and with the SMEs from my support org available for rollover, we were able to scale out to handle the peaks in volume and save the company a minimum of 70K per shift in additional workforce cost (280K total). We were kind of like SPOT instances.
Accident
I was tasked with deleting files that were filling up disk space and accidentally deleted a CORE.IL file, which caused client impact; a more technical team resolved it. I sent notifications to our orgs about the possibility of accidentally deleting this file, added special documentation to the work instructions to ensure no one else runs into this, and created delete scripts that would only search for and delete files known to be safe to delete (as a FIRST PASS) before anyone goes hunting for unique offenders to alleviate system issues.
No one has since deleted a CORE.IL file, and the scripts have reduced mean time to resolution for these alarms compared to the initial manual method.
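A minimal sketch of that first-pass idea - delete only files matching a known-safe allowlist and leave everything else, including CORE.IL files, untouched. The directory and patterns here are hypothetical placeholders, not the actual Cerner work instructions:

```python
from pathlib import Path

# Hypothetical allowlist of patterns known to be safe to delete; illustrative only.
SAFE_PATTERNS = ["*.tmp", "*.trc.old", "archive_*.log"]

def safe_first_pass(directory: str, dry_run: bool = True) -> list[Path]:
    """Delete only files matching the known-safe patterns; never touch anything else."""
    deleted = []
    root = Path(directory)
    for pattern in SAFE_PATTERNS:
        for path in root.rglob(pattern):
            if path.is_file():
                if not dry_run:
                    path.unlink()
                deleted.append(path)
    return deleted

if __name__ == "__main__":
    # Dry run first so the operator can review exactly what would be removed.
    for path in safe_first_pass("/cerner/temp", dry_run=True):
        print("Would delete:", path)
```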
Harsh Truth
A client called in reporting they weren’t able to chart on a single patient. I gathered information and investigated, but the issue couldn’t be resolved with the methods my team or I had - I exhausted the resources internal to my team. I can’t page an on-call out of their sleep in the middle of the night to investigate a single-patient issue.
I called the client back and let them know they would need to go to downtime procedures for this patient until the team was in the office. The client was very upset - the system needs to be available 24/7 for all patients. I told them I was sorry, but we could not engage the on-call after hours for this issue.
I gave them my name, contact information, and direct line, told them to call me if anything changed or other patients were impacted, and let them know my available overnight hours.
The client was still upset but did accept it. Sometimes we do receive single-patient issues, and I need to be able to protect our on-calls from non-critical page-outs.
Calculated Risk - speed is critical
Authentication container - all users unable to log in to applications - downtime procedures - paper for nurses/doctors - I knew what to do to resolve it - NOT SUPPOSED TO MAKE CHANGES (page and delay, or break protocol - especially since we don’t have documentation for this container) - updated the parameters for the container - possibility of downstream impact, but it could be reversed - paged out the change/management teams - kept the SLA - worked with the solution team to find out what the maximum safe limits were for the configuration and updated the documentation.
A time you were wrong
I thought we were going to meet our maintenance window with the interfaces down, but we weren’t able to.
A planned maintenance period came up and John was engaged 1 hour out; it related to software upgrades, patches, etc. During a certain window this could impact the client’s system - interfaces or downtime. The outbound interfaces were down: medication orders, RAD orders, and admit/discharge orders couldn’t be sent, so a lot was riding on this window. John was engaged by the upgrade team because it was possible they wouldn’t hit the deadline and they needed to get hold of different teams to get the system going. John paged the interface team, and during that time was contacted by the client-side team. The client had called in to ask about the maintenance window; a teammate sent the call to John, and John spoke with the client and told them the planned maintenance was still within the window, even though they were running into an issue where the interfaces weren’t processing - or wouldn’t be by the time the window was over. John let them know they were engaging resources to resolve it now, with around 45 minutes left by that point. In most cases this type of issue is resolved within 45 minutes.
Incomplete data
Slowness; can’t create an MTA; need to check everything that I can while we wait on a client-only MTA (we can’t recreate it in house).
The client was reporting slowness in applications. We use a very specific troubleshooting tool to identify workflow problems, but the client wasn’t aware of how to gather the information, and we couldn’t recreate the issue in house.
I engaged the client-side team and asked them to gather information on the PowerChart slowness and run the tool.
While they were trying to recreate it, I checked round-trip timers, database contention, CPU, load, backlogs, queueing, and any recent changes to the system.
Tough Decision -
Authentication container - all users unable to log in to applications - downtime procedures - paper for nurses/doctors - I knew what to do to resolve it - NOT SUPPOSED TO MAKE CHANGES (page and delay, or break protocol - especially since we don’t have documentation for this container) - updated the parameters for the container - possibility of downstream impact, but it could be reversed - paged out the change/management teams - kept the SLA - worked with the solution team to find out what the maximum safe limits were for the configuration and updated the documentation.
A coworker who is also my friend posted on Reddit, in a subreddit for ‘funny tech support calls’. The post clearly talked disparagingly about one of our clients. I immediately contacted him and told him he should remove it, since those are our clients and we cannot disrespect them by posting stories of support or bridge calls with them. He said he had no intention of taking it down. Afterward, I notified his manager; he was given a written warning for his conduct and has since never posted stories like that.
We now communicate far less outside of work, but we still have a professional relationship.
No clear way forward
Slowness; can’t create an MTA; need to check everything that I can while we wait on a client-only MTA (we can’t recreate it in house).
The client was reporting slowness in applications. We use a very specific troubleshooting tool to identify workflow problems, but the client wasn’t aware of how to gather the information, and we couldn’t recreate the issue in house.
I engaged the client-side team and asked them to gather information on the PowerChart slowness and run the tool.
While they were trying to recreate it, I checked round-trip timers, database contention, CPU, load, backlogs, queueing, and any recent changes to the system.
Superior Knowledge or Observation:
ESO not exiting the system -
HVH_NY - identifying a misbehaving service by its handle usage. Every third day, a server that handles document imaging, scanning insurance cards, and so on would go down and become unresponsive. The issue appeared with a consistent frequency. My role is not root cause investigation, but this was an issue I was determined to figure out, as I had seen it come in repeatedly to myself and my work colleagues.
I checked the times when the system went down and found nothing to identify. I took nightly snapshots of what the system was doing and compared one day to the next to see what was occurring over a multi-day period. That’s when I saw a monitoring service that started with a very low handle count in Task Manager and used more and more handles every day. Cycling the service would bring the handles back in check, and the system would live on for several more days. After confirming my theory, I created a batch script to auto-cycle the monitoring service nightly.
I also logged tickets to the monitoring team to notify them of what I was seeing and uploaded the monitoring service’s log files so they could pursue a permanent bug fix.
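The actual fix was a Windows batch script run nightly from the task scheduler; a rough Python equivalent of the same idea, with a made-up service name standing in for the real monitoring service, might look like this:

```python
import subprocess
import sys

# Hypothetical Windows service name; the real monitoring service isn't named here.
SERVICE_NAME = "ExampleMonitoringService"

def cycle_service(name: str) -> None:
    """Stop and restart a Windows service so its handle count resets."""
    # 'net stop' returns non-zero if the service is already stopped; that's fine here.
    subprocess.run(["net", "stop", name], check=False)
    subprocess.run(["net", "start", name], check=True)

if __name__ == "__main__":
    try:
        cycle_service(SERVICE_NAME)
    except subprocess.CalledProcessError as err:
        # Exit non-zero so the scheduled task shows the restart failed.
        sys.exit(f"Failed to restart {SERVICE_NAME}: {err}")
```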