The Monitor phase involves tasks that measure various system metrics to determine the health, usability, performance and availability of the system. This phase is important because it provides continuous feedback to the development and operations teams on how the system is behaving and performing. The feedback includes the following:
- Customer experience
- Customer insights
- Application performance
- Application usability
- Infrastructure performance and availability
Note: Refer to Appendix B for a list of key metrics for the above.
2.11.1 Application Performance Monitoring (APM) vs Traditional Monitoring
Application Performance Monitoring (APM) refers to a tool's capability to monitor and diagnose application performance from the user experience (APDEX) down to the code level, including database queries. Traditional monitoring (infrastructure monitoring) refers to tools that monitor only system resources such as CPU, memory, disk and I/O.
APM:
- Monitor user experience
- Monitor application performance and correlate it with server performance
- Able to monitor up to the code level
- Able to map end-to-end transactional flow and performance
- Able to map the service dependencies
- Able to support cloud and container platforms

Traditional Monitoring:
- Monitor infrastructure performance such as CPU, memory, disk and I/O
- Monitor logs
- Manual correlation of various monitoring data
- Only supports physical and virtual servers/devices
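To illustrate the traditional monitoring column, the following minimal Python sketch evaluates one sample of infrastructure metrics against alert thresholds. The metric names, the threshold values and the `check_thresholds` helper are assumptions made for illustration only; they are not taken from this guideline or from any specific monitoring tool.

```python
from typing import Dict, List

# Hypothetical alert thresholds in percent utilisation (illustrative values only).
THRESHOLDS: Dict[str, float] = {
    "cpu_percent": 85.0,
    "memory_percent": 90.0,
    "disk_percent": 80.0,
}


def check_thresholds(sample: Dict[str, float],
                     thresholds: Dict[str, float] = THRESHOLDS) -> List[str]:
    """Return the sorted names of metrics in `sample` that exceed their threshold.

    Metrics without a configured threshold are ignored.
    """
    breached = []
    for name, value in sample.items():
        limit = thresholds.get(name)
        if limit is not None and value > limit:
            breached.append(name)
    return sorted(breached)


if __name__ == "__main__":
    sample = {"cpu_percent": 92.5, "memory_percent": 71.0, "disk_percent": 80.0}
    print(check_thresholds(sample))  # only the CPU value is above its threshold
```

Each metric is checked in isolation here, which reflects why traditional monitoring leaves the correlation of different data sources to the operator.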
2.11.2 Monitoring Tools
The monitoring tool shall enable the operations team to continuously monitor and diagnose abnormalities detected from the user, application and infrastructure perspectives. The table below gives examples of monitoring tools:
Tools/Platform - Descriptions
- Dynatrace - A commercial APM tool that monitors user experience, application performance and server performance. It then correlates these measurements and identifies the dependency mapping between different components.
- Nagios - A freemium infrastructure monitoring tool that monitors server and network device performance.
- Prometheus - An open source service and infrastructure monitoring tool that monitors the performance of servers and containers. It also provides data visualisation and query capabilities.

Figure 2-24 Dynatrace - Operations Dashboard - APM Tool
Figure 2-25 Nagios XI - Operations Dashboard - Infrastructure Monitoring Tool
2.11.3 Active User Engagement
Once the system has gone live, the operations team should continuously seek feedback from users on the usability of the system, on whether it meets user and business requirements, and on its performance. This exercise should be carried out continuously while the warranty period is still in effect.
2.11.4 Incident Management
An incident is a production-related issue, such as an operational issue, performance slowness or system unavailability, that hinders daily operations.
An incident escalation process should be defined to assist the operations and development teams in managing an incident. The process should include:
- Categorizing the incident (low, medium, severe)
  Category - Descriptions
  Low - System is operational but citizens or users are experiencing slowness in performing their transactions.
  Medium - System is operational but citizens or users are intermittently unable to complete their transactions.
  Severe - System is intermittently operational or unavailable. Citizens and users are frequently or constantly unable to complete their transactions.
- Incident response team
  - If the root cause of an incident is unknown, the incident response team shall consist of the incident manager and representatives from the operations team, infrastructure team, application team and security team.
  - If the root cause has been determined, the incident response team shall consist of the incident manager and representatives from the application team and the team related to the root cause.
- Root cause investigation
  - The incident response team shall immediately investigate the incident by analysing the monitoring data and logs.
  - Assess the impact of the root cause and the time needed to resolve it.
- Incident resolution
  - Upon assessing the impact and the time required to resolve the root cause, the incident manager should plan and execute the remedial actions within an acceptable time period. For example:
    - High impact - within 2 hours
    - Medium impact - within 4 hours
    - Low impact - within 8 hours
- Post mortem
  - A post mortem shall be conducted to review the root cause and any processes or issues that led to the incident. The post mortem report shall be stored in the knowledge base for reference.
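The categorisation, team composition and resolution rules above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names, the two boolean inputs to `categorise` and the choice to start the resolution clock at the time of assessment are all hypothetical. The guideline lists incident categories (low, medium, severe) and impact levels (high, medium, low) separately and does not state how they map to each other, so the sketch keeps them as two distinct types.

```python
from datetime import datetime, timedelta
from enum import Enum
from typing import List, Optional


class Category(Enum):
    LOW = "low"
    MEDIUM = "medium"
    SEVERE = "severe"


class Impact(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


# Example resolution windows quoted in the list above.
RESOLUTION_WINDOW = {
    Impact.HIGH: timedelta(hours=2),
    Impact.MEDIUM: timedelta(hours=4),
    Impact.LOW: timedelta(hours=8),
}


def categorise(system_operational: bool, transactions_failing: bool) -> Category:
    """Map two observed symptoms (hypothetical inputs) to an incident category."""
    if not system_operational:
        return Category.SEVERE   # intermittently operational or unavailable
    if transactions_failing:
        return Category.MEDIUM   # operational, but transactions fail intermittently
    return Category.LOW          # operational, users only experience slowness


def response_team(root_cause_team: Optional[str]) -> List[str]:
    """Return the response team; pass None while the root cause is unknown."""
    if root_cause_team is None:
        return ["incident manager", "operations team", "infrastructure team",
                "application team", "security team"]
    team = ["incident manager", "application team"]
    if root_cause_team not in team:
        team.append(root_cause_team)
    return team


def resolution_deadline(assessed_at: datetime, impact: Impact) -> datetime:
    """Assumption: the resolution window starts when the impact is assessed."""
    return assessed_at + RESOLUTION_WINDOW[impact]
```

For example, `response_team("infrastructure team")` yields the incident manager, the application team and the infrastructure team, while `response_team(None)` yields the full cross-team group used while the root cause is still unknown.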
Figure 2-26 Incident Management Workflow
Incident Management Workflow:
1) Once the system has been detected to be suffering from an abnormality, performance slowness or inaccessibility, the incident manager shall immediately notify and communicate with the operations team.
2) The incident manager shall assemble the operations team to investigate and diagnose the issue by analysing the monitoring data and logs. If a workaround or temporary measure is available that allows the business to operate as usual, it should be put in place while the operations team tries to identify the root cause.
3) Once the root cause has been identified, the incident manager shall proceed with resolution and recovery. The resolution shall not be a temporary measure.
4) Closure should only be done by the incident manager once the root cause has been identified and resolved, such that the risk of the same issue recurring is minimal.
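The four workflow steps can be read as a small state machine. The sketch below is one possible way to encode them in Python; the class, stage and method names are hypothetical and are not part of this guideline. It enforces the two rules stated above: a workaround does not advance the incident, and closure is refused until the root cause has been identified and resolved.

```python
from enum import Enum
from typing import Optional


class Stage(Enum):
    DETECTED = 1
    INVESTIGATION = 2
    RESOLUTION = 3
    CLOSED = 4


class Incident:
    """Tracks one incident through the four workflow steps (illustrative only)."""

    def __init__(self) -> None:
        self.stage = Stage.DETECTED
        self.workaround_in_place = False
        self.root_cause: Optional[str] = None
        self.resolved = False

    def _require(self, expected: Stage) -> None:
        if self.stage is not expected:
            raise ValueError(f"expected stage {expected.name}, got {self.stage.name}")

    def start_investigation(self) -> None:
        # Steps 1 and 2: the team is notified and assembled for diagnosis.
        self._require(Stage.DETECTED)
        self.stage = Stage.INVESTIGATION

    def apply_workaround(self) -> None:
        # Step 2: a temporary measure keeps the business running but
        # does not move the incident forward.
        self._require(Stage.INVESTIGATION)
        self.workaround_in_place = True

    def identify_root_cause(self, description: str) -> None:
        # Step 3 begins once the root cause is known.
        self._require(Stage.INVESTIGATION)
        self.root_cause = description
        self.stage = Stage.RESOLUTION

    def resolve(self) -> None:
        # Step 3: a permanent fix, not a temporary measure.
        self._require(Stage.RESOLUTION)
        self.resolved = True

    def close(self) -> None:
        # Step 4: closure requires an identified and resolved root cause.
        self._require(Stage.RESOLUTION)
        if not self.resolved:
            raise ValueError("root cause has not been resolved")
        self.stage = Stage.CLOSED
```

Calling `close()` directly after `apply_workaround()` raises `ValueError`, which mirrors the requirement that a temporary measure is never sufficient for closure.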
2.11.5 Feedback Loop to Development Team
One of the key practices in DevOps is an effective feedback loop that reports incidents and bugs back to the development team. The operations team must ensure that continuous feedback is given to the development team in order to improve system stability and performance.
Ticketing - To report an issue or incident that users have encountered during production to the development team.
Monitoring and Reporting - Proactively monitor and report any abnormalities and areas for improvements to the development team for periodic enhancements.
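As a rough illustration of what a ticket passed through this feedback loop might carry, the following Python sketch defines a ticket record and serialises it to JSON. All field names, the allowed `source` values and the `new_ticket` helper are assumptions for illustration; the guideline does not prescribe a ticket format or a ticketing tool.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional

SEVERITIES = {"low", "medium", "severe"}   # incident categories from section 2.11.4
SOURCES = {"user", "monitoring"}           # hypothetical labels for the two channels


@dataclass
class ServiceTicket:
    title: str
    description: str
    severity: str
    source: str
    reported_at: str  # ISO 8601 timestamp

    def to_json(self) -> str:
        """Serialise the ticket so it can be handed to a ticketing system."""
        return json.dumps(asdict(self), sort_keys=True)


def new_ticket(title: str, description: str, severity: str, source: str,
               now: Optional[datetime] = None) -> ServiceTicket:
    """Validate the inputs and build a ticket stamped with the current UTC time."""
    if severity not in SEVERITIES:
        raise ValueError(f"unknown severity: {severity}")
    if source not in SOURCES:
        raise ValueError(f"unknown source: {source}")
    timestamp = now if now is not None else datetime.now(timezone.utc)
    return ServiceTicket(title, description, severity, source, timestamp.isoformat())
```

A ticket raised from a user report would use `source="user"`, while one raised from proactive monitoring would use `source="monitoring"`, so the development team can tell the two feedback channels apart.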
2.11.6 Compliance and User Experience
In production, the system must be continuously measured and monitored to ensure its compliance with the quality policy and KPIs defined in an earlier phase. In addition, user experience should be measured and monitored to provide further customer insights. APM tools can assist by quantifying user experience as response time and an application performance index (APDEX), rather than relying only on user complaints or feedback, which are quite subjective.
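To show how response times can be turned into a single score, the sketch below computes an Apdex value using the commonly published formula: with a target threshold T, samples at or below T count as satisfied, samples above T and up to 4T count as tolerating (weighted one half), and slower samples count as frustrated. The function name and the 0.5 second threshold in the example are assumptions; this guideline does not specify a threshold.

```python
from typing import Iterable


def apdex(response_times_s: Iterable[float], threshold_s: float) -> float:
    """Return the Apdex score (0.0 to 1.0) for the given response times in seconds."""
    times = list(response_times_s)
    if not times:
        raise ValueError("at least one response time is required")
    if threshold_s <= 0:
        raise ValueError("threshold must be positive")
    satisfied = sum(1 for t in times if t <= threshold_s)
    tolerating = sum(1 for t in times if threshold_s < t <= 4 * threshold_s)
    # Frustrated samples (slower than 4T) contribute nothing to the score.
    return (satisfied + tolerating / 2) / len(times)


if __name__ == "__main__":
    # Hypothetical samples: 3 satisfied, 1 tolerating, 1 frustrated at T = 0.5 s.
    samples = [0.2, 0.4, 0.5, 1.2, 2.5]
    print(round(apdex(samples, threshold_s=0.5), 2))  # (3 + 1/2) / 5 = 0.7
```

A score close to 1.0 indicates that most users experienced response times within the target, which gives the team an objective figure to track alongside user complaints.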
2.11.7 Periodic Assessment
The system shall undergo periodic assessment to ensure that the baseline is continuously updated and risks are kept under control. Suggested periodic assessments are:
- Critical applications - Online government services - Monthly basis
- Support applications - HR system, room booking system - Half-yearly basis
Periodic assessment can assist in risk mitigation, capacity planning and management, continuous service improvement and baselining.
Note: Refer to Appendix B for a list of key metrics that can be used to baseline or assess.
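The suggested frequencies can be turned into a simple scheduling helper. The Python sketch below computes the next assessment date from the last one; the tier keys `"critical"` and `"support"`, the function names and the rule of clamping to the last day of a shorter month are assumptions made for illustration.

```python
import calendar
from datetime import date

# Assessment intervals in months, following the suggested frequencies above.
ASSESSMENT_INTERVAL_MONTHS = {
    "critical": 1,  # monthly basis
    "support": 6,   # half-yearly basis
}


def add_months(start: date, months: int) -> date:
    """Shift `start` by `months`, clamping the day to the target month's length."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def next_assessment(last_assessed: date, tier: str) -> date:
    """Return the date on which the next periodic assessment falls due."""
    if tier not in ASSESSMENT_INTERVAL_MONTHS:
        raise ValueError(f"unknown application tier: {tier}")
    return add_months(last_assessed, ASSESSMENT_INTERVAL_MONTHS[tier])
```

For instance, a critical application last assessed on 31 January 2024 would be due again on 29 February 2024, and a support application last assessed on 15 August 2024 would be due on 15 February 2025.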
2.11.8 Monitoring Deliverables
- Performance assessment reports - A report that highlights application and server performance and areas for improvement
- Service tickets - An escalation note to the respective stakeholders that includes all the necessary details and the severity level of an incident