Incident Management and Response
Incident management is the process of managing and responding to unplanned events or disruptions that affect an organization's services or infrastructure. Its goal is to minimize the impact of these events and restore normal service operation as quickly as possible. It typically involves identifying and classifying incidents, investigating and diagnosing the root cause, implementing a resolution, and communicating with relevant stakeholders throughout. It is a critical component of an organization's overall service management strategy, because effective incident management helps ensure the availability and reliability of services and infrastructure. Our incident management is based on general best practices such as ITIL v4 and on well-proven methodologies from both AWS and KeyCore.
In KeyCore Managed Services, incident management and response involves the following steps:
- Incident identification: Incidents are identified through various means, such as monitoring systems, user reports, or problem management processes.
- Incident classification: Once an incident has been identified, it is classified according to its severity and impact on the organization's services. This helps prioritize the incident and determine the appropriate response.
- Incident investigation: KeyCore Managed Services and AWS Managed Services (AMS) use their expertise to investigate the incident and identify the root cause.
- Incident resolution: Based on the findings of the investigation, KeyCore Managed Services and AMS work to resolve the incident as quickly as possible, using their expertise and resources to restore normal service operation.
- Incident communication: Throughout the incident management and response process, KeyCore Managed Services and AMS communicate with relevant stakeholders, such as customers and users, to keep them informed of the status of the incident and any actions being taken to resolve it.
- Incident review: Once the incident has been resolved, KeyCore Managed Services and AMS conduct a review to identify any lessons learned and areas for improvement in the incident management and response process.
Overall, the goal of incident management and response in this setting is to minimize the impact of incidents on the organization's services and ensure timely and effective resolution.
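The classification step can be illustrated with a small priority lookup. The sketch below is a minimal example, assuming a three-level scale for impact and urgency that maps to the priorities P1 through P4. The scale, the matrix values, and the names `Level`, `PRIORITY_MATRIX`, `classify`, and `is_major_incident` are illustrative assumptions, not the actual prioritization scheme, which follows the company's own prioritization.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical three-level scale used for both impact and urgency."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Hypothetical matrix: keys are (impact, urgency), values are priorities.
# The real mapping is defined by the company's prioritization scheme.
PRIORITY_MATRIX = {
    (Level.HIGH, Level.HIGH): "P1",
    (Level.HIGH, Level.MEDIUM): "P2",
    (Level.MEDIUM, Level.HIGH): "P2",
    (Level.HIGH, Level.LOW): "P3",
    (Level.MEDIUM, Level.MEDIUM): "P3",
    (Level.LOW, Level.HIGH): "P3",
    (Level.MEDIUM, Level.LOW): "P4",
    (Level.LOW, Level.MEDIUM): "P4",
    (Level.LOW, Level.LOW): "P4",
}


def classify(impact: Level, urgency: Level) -> str:
    """Return the priority (P1..P4) for a given impact and urgency."""
    return PRIORITY_MATRIX[(impact, urgency)]


def is_major_incident(priority: str) -> bool:
    """A P1 incident triggers the critical/major incident handling process."""
    return priority == "P1"
```

For example, `classify(Level.HIGH, Level.HIGH)` yields `"P1"` under this assumed matrix, which would route the incident into critical/major incident handling.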
Delivery Process
KeyCore Managed Services - Incident Management Process
This process starts with the detection of an incident and the raising of a corresponding ticket.
Each incident is recorded so that it can be tracked, monitored, and updated throughout its life cycle.
Act No: 1 | Act Name: Raise ticket in ITSM | Owner: Incident Requester / Service Desk Agent
Description: Once an Incident is detected, its details are logged in the ITSM tool to raise an incident ticket. The Service Desk consults the Known Error Database (KEDB) to check whether it is a known error or issue.
Decision Box | Act Name: Is it an Incident? | Owner: Service Desk
Description: The Service Desk determines whether the ticket is an Incident or a Service Request (the Service Request process holds those definitions). Service Requests are small, repeatable requests for work, such as password resets, access requests, or requests for information. If the ticket is a Service Request, follow the Service Request process.
Output: Service Request or Incident
Act No: 2 | Act Name: Categorize and Prioritize Incident | Owner: Service Desk
Description: Categorize and prioritize the Incident in the ITSM tool. Categorization assigns the Category, Type and Item (CTI) so that the ticket can be routed correctly. Some incidents relate to third parties and are not assigned to the L2 teams; the Service Desk assigns tickets in these categories directly to the third-party vendor. Prioritization is based on the impact and urgency of the issue: Incidents are prioritized as P1, P2, P3 or P4 according to the company's prioritization scheme, and each Incident is handled according to its criticality.
Output: Categorized and Prioritized Incident
Decision Box | Act Name: Is this a P1 Incident? | Owner: Service Desk / Incident Manager
Description: If it is a critical incident (P1), it triggers the critical/major incident handling process.
Output: Prioritized as a Critical/Major Incident or continued as a normal Incident
Act No: 3 | Act Name: Assign to L2 Resolver Group | Owner: Service Desk
Description: Assign the Incident to the appropriate resolver group. Assignment is based on the categorization of the Incident.
Output: Resolver group identified
Act No: 4 | Act Name: Review and Update Incident | Owner: L2 Team
Description: Upon receipt of an Incident, the L2 Team reviews and updates the ticket, ensuring that the following have been captured correctly:
- Priority
- Assignment
- Categorization
If any additional information is required to understand the issue, directly contact the customer who raised the ticket with the Service Desk.
Output: Reviewed and updated Incident
Act No: 5 | Act Name: Investigate and Diagnose Incident | Owner: L2 Team
Description: Carry out investigation and diagnosis activities to identify a workaround or resolution for the issue. Update the ITSM Incident with any investigation and diagnosis activities.
Output: Resolution identified
Act No: 6 | Act Name: Resolution Provided | Owner: L2 Team
Description: Resolution provided to the Incident. Update the ticket with resolution activities.
Output: Resolved Incident
Decision Box | Act Name: L3 Support required? | Owner: L2 Team
Description: If the L2 Team is able to resolve the Incident, the resolution is recorded in the ticket. If the L2 Team is unable to resolve the Incident, it is functionally escalated to the respective L3 vendor.
Output: L3 Support required or not
Act No: 7 | Act Name: Engage respective L3 Vendor | Owner: L2 Team
Description: If the L2 resolver group cannot find a resolution and determines that L3 support is required, the Incident is assigned to the respective L3 vendor. All communication with the L3 vendor is recorded and kept up to date in the ticket.
Output: L3 Vendor engaged
Act No: 8 | Act Name: Update status of Ticket to ‘Pending’ | Owner: L2 Team
Description: Update the ITSM tool to reflect that the issue has been raised with the vendor, and set the ticket status to “Pending” to stop the SLA clock.
Output: Ticket status set to “Pending”
Act No: 9 | Act Name: Investigate and Diagnose Incident | Owner: L3 Vendor
Description: Carry out investigation and diagnosis activities to identify a workaround or resolution for the issue. Update the Incident with any investigation and diagnosis activities. L2 continues to update the ticket based on the updates provided by the L3 vendor.
Output: Resolution identified
Act No: 10 | Act Name: Resolution Provided | Owner: L3 Vendor
Description: Apply the resolution identified during investigation and diagnosis, and inform the L2 team about it. If the L3 team does not have access to the respective system, L2 applies the resolution provided by the L3 team.
Output: Resolution implemented
Act No: 11 | Act Name: Verify Resolution | Owner: Service Desk
Description: Verify the resolution by contacting the customer who raised the Incident, checking the alarm, or running other tests. L2 support may be involved at this stage if required. The ticket is transferred back to the respective L2 team if the user is not satisfied with the resolution.
Output: Resolution verified
Act No: 12 | Act Name: Close Ticket | Owner: Service Desk
Description: The Incident Requester closes the Incident once the resolution is verified to be correct and the customer is satisfied. The process also checks that the Incident record is fully updated and assigns a closure category.
Output: Ticket closed
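The hand-offs in the process above can be sketched as a small ticket model. This is a minimal illustration, not the actual ITSM implementation: the `Ticket` class, the status names other than "Pending", the category value in `THIRD_PARTY_CATEGORIES`, and the function names are all hypothetical and chosen only to mirror Acts 2-3 (assignment), Acts 7-8 (L3 escalation), and Acts 11-12 (verification and closure).

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical category names; real CTI values live in the ITSM tool.
THIRD_PARTY_CATEGORIES = {"third-party-saas"}


@dataclass
class Ticket:
    category: str
    priority: str
    status: str = "New"              # hypothetical initial status
    assignee: str = "Service Desk"
    history: List[str] = field(default_factory=list)


def assign(ticket: Ticket) -> Ticket:
    """Acts 2-3: third-party categories go to the vendor, the rest to L2."""
    if ticket.category in THIRD_PARTY_CATEGORIES:
        ticket.assignee = "Third-party vendor"
    else:
        ticket.assignee = "L2 resolver group"
    ticket.status = "Assigned"
    ticket.history.append(f"assigned to {ticket.assignee}")
    return ticket


def escalate_to_l3(ticket: Ticket) -> Ticket:
    """Acts 7-8: engage the L3 vendor and set the ticket to Pending."""
    ticket.assignee = "L3 vendor"
    ticket.status = "Pending"        # Pending stops the SLA clock
    ticket.history.append("escalated to L3 vendor")
    return ticket


def verify_resolution(ticket: Ticket, customer_satisfied: bool) -> Ticket:
    """Acts 11-12: close when verified, otherwise hand back to L2."""
    if customer_satisfied:
        ticket.status = "Closed"
        ticket.history.append("closed after verification")
    else:
        ticket.assignee = "L2 resolver group"
        ticket.status = "Assigned"
        ticket.history.append("returned to L2 after failed verification")
    return ticket
```

Keeping a `history` list on the ticket reflects the requirement that every incident is recorded and updated throughout its life cycle.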
Process Actors
The incident reporter may be an end user who has experienced an issue with a service or product, or it may be an automated alarm.
The role of the incident reporter is to provide as much information as possible about the issue, including details on the symptoms, any error messages or other relevant information, and the steps taken to reproduce the issue. This information is crucial for the incident management team to properly identify and resolve the issue.
In KeyCore Managed Services, the incident reporter can be one of the following:
- Customer team member - when a problem is identified outside the IT part of the workload
- KeyCore Specialist - our team will log problems when we identify architectural issues or potential problems in your solutions.
- AMS Specialist - AWS / AMS will log problems when they detect them through repeated alerts or through advanced pattern recognition.
The incident manager is responsible for coordinating and managing the resolution of incidents that occur in the cloud environment. This includes identifying the root cause of the incident, coordinating with relevant teams to resolve the issue, and communicating with stakeholders about the status of the incident and any necessary actions.
The incident manager works closely with KeyCore and AMS to ensure that the cloud infrastructure and applications are operating efficiently and effectively. They may also work with the customer's development team to ensure that any issues with the application are addressed and resolved.
Overall, the incident manager plays a key role in ensuring the reliability and availability of the cloud environment, as well as ensuring that any issues are quickly and effectively addressed to minimize disruption to the customer's business.
An AWS Certified Application specialist from KeyCore
One or more representatives from the team responsible for the application. Depending on application and setup this role may vary - it is defined during on-boarding.
For infrastructure-related problems, the dedicated team from AWS Managed Services will participate in the problem management process.
Both the KeyCore and AMS teams may need to escalate incidents to the relevant service teams in AWS; this is done through AWS Support.
Service Level Agreement
Incidents are covered by SLAs that depend on which Service Tier has been selected.
Service commitment: 24/7/365
Key Performance Indicator | Metric | SLA Tier Premium
Incident Management | Incident P1 (Critical) Response | <=15 min
Incident Management | Incident P2 (High) Response | <=4 hours
Incident Management | Incident P3 (Moderate) Response | <=12 hours
Incident Management Restoration/Resolution Time | Incident P1 (Critical) Resolution | <=4 hours Restoration
Incident Management Restoration/Resolution Time | Incident P2 (High) Resolution | <=8 hours Restoration
Incident Management Restoration/Resolution Time | Incident P3 (Moderate) Resolution | <=24 hours Restoration
Service commitment: 24/7/365
Key Performance Indicator | Metric | SLA Tier Business
Incident Management | Incident P1 (Critical) Response | <=1 hour
Incident Management | Incident P2 (High) Response | <=4 hours
Incident Management | Incident P3 (Moderate) Response | <=12 hours
Incident Management Restoration/Resolution Time | Incident P1 (Critical) Resolution | <=4 hours Restoration
Incident Management Restoration/Resolution Time | Incident P2 (High) Resolution | <=8 hours Restoration
Incident Management Restoration/Resolution Time | Incident P3 (Moderate) Resolution | <=24 hours Restoration
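The targets above can be expressed as a lookup table and combined with the rule that a “Pending” status stops the SLA clock. The sketch below is a minimal illustration under stated assumptions: it reads the second group of rows in each tier as restoration targets, it assumes that time spent in Pending is excluded from the restoration clock, and the names `SLA_TARGETS`, `active_time`, and `restoration_breached` are hypothetical rather than part of any KeyCore or AWS tooling.

```python
from datetime import datetime, timedelta

# Targets taken from the SLA tables: tier -> priority -> (response, restoration).
SLA_TARGETS = {
    "Premium": {
        "P1": (timedelta(minutes=15), timedelta(hours=4)),
        "P2": (timedelta(hours=4), timedelta(hours=8)),
        "P3": (timedelta(hours=12), timedelta(hours=24)),
    },
    "Business": {
        "P1": (timedelta(hours=1), timedelta(hours=4)),
        "P2": (timedelta(hours=4), timedelta(hours=8)),
        "P3": (timedelta(hours=12), timedelta(hours=24)),
    },
}


def active_time(opened, now, pending_intervals=()):
    """Elapsed time since `opened`, excluding time spent in Pending.

    `pending_intervals` is a sequence of (start, end) datetimes and is
    assumed to be non-overlapping (an assumption of this sketch).
    """
    paused = timedelta()
    for start, end in pending_intervals:
        # Clip each interval to the window between opening and now.
        start, end = max(start, opened), min(end, now)
        if end > start:
            paused += end - start
    return (now - opened) - paused


def restoration_breached(tier, priority, opened, now, pending_intervals=()):
    """True if the active (non-paused) time exceeds the restoration target."""
    _, restoration = SLA_TARGETS[tier][priority]
    return active_time(opened, now, pending_intervals) > restoration
```

Under these assumptions, a Premium P1 incident that has been open for five hours is in breach of the 4-hour restoration target, unless at least one of those hours was spent in Pending while waiting for an L3 vendor.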
Output and delivered value
With Incident Management as a core component of your cloud operations, you can minimize the impact of incidents on your organization and your customers.
Improved service availability and reliability
By quickly and effectively responding to and resolving incidents, you can minimize downtime and ensure that your services are available and reliable for your customers.
Reduced impact of incidents
Effective incident management can help reduce the impact of incidents on your organization and your customers, such as by minimizing the duration of disruptions or by identifying and addressing root causes to prevent future incidents.
Improved customer satisfaction
By quickly and effectively responding to and resolving incidents, you can improve the overall customer experience and satisfaction with your services.
Increased efficiency
By streamlining incident management processes and identifying and addressing root causes, you can increase efficiency and reduce the overall cost of managing and responding to incidents.
Pricing Information
During on-boarding, we will define how to handle identified incidents. We can participate in your existing Incident Management process, run the process for you, or forward everything application-related to your own team.
We will run the Incident Management process for every identified incident and make sure to include the relevant parties from AWS, AMS and your application team in resolving the issue.
Frequently Asked Questions
When possibilities are almost endless, it is crucial to have a partner who has in-depth expert knowledge. Not just in opportunities and benefits, but in challenges.
Incident management is the process of identifying, responding to, and resolving incidents that occur in the IT environment. KeyCore Managed Services provides cloud managed services using AWS Managed Services under the Partner-led AMS model. This means that KeyCore and AWS are responsible for managing and monitoring the IT environment, including responding to and resolving any incidents that occur.
KeyCore Managed Services has a dedicated team of experienced technicians who are responsible for incident management - part of the team is provided by AWS under the AMS model. They are available 24/7 to respond to and resolve incidents in the IT environment. They use a variety of tools and techniques to identify and resolve incidents, including monitoring tools, problem management processes, and incident management processes.
To report an incident to KeyCore Managed Services, you can contact the service desk using any of the methods available under your specific tier. Our team will then triage the incident and take appropriate action to resolve it.
The time it takes to resolve an incident will depend on the severity of the incident and the resources required to resolve it. KeyCore Managed Services has established service level agreements (SLAs) with our clients, which outline the expected response and resolution times for different types of incidents. Our team will work as quickly as possible to resolve the incident and minimize any impact on your business.
Yes, you can request updates on the status of an incident by logging a ticket through the support portal or contacting our support hotline. Our team will provide regular updates on the status of the incident and the actions being taken to resolve it.