Monitoring & Event Management

Updated: Apr 26

Introduction

Monitoring and Event Management within ITIL frameworks aims to ensure IT services' optimal performance and reliability.


It involves the continuous observation and control of the service environment to detect, interpret, and act upon changes that could affect service quality.


Monitoring and Event Management is crucial for maintaining the health and effectiveness of IT services, which in turn supports the broader business objectives of an organisation.


Scope

The practice covers a range of activities essential for the proactive monitoring and management of IT service components.


These include setting up monitoring tools and techniques, defining events of interest, logging and analysing events, and initiating appropriate responses to maintain or restore regular service operations.


The scope extends across various IT components such as networks, servers, applications, and data centres, providing comprehensive oversight of potential or actual deviations from planned service performance.


Key Benefits


Implementing effective monitoring and event management brings several benefits:


  1. Enhanced Service Reliability: By continuously monitoring IT services and components, organisations can detect and resolve issues before they escalate into major incidents, thus improving overall service reliability.

  2. Proactive Problem Management: This practice allows IT teams to identify and address the root causes of incidents preemptively, reducing the frequency and impact of service disruptions.

  3. Improved Operational Efficiency: Systematic monitoring and event management streamline operations, help optimise resource use, and reduce the cost of unexpected downtime.

  4. Better Compliance and Security: Regular monitoring ensures that services adhere to required standards and regulations while also enhancing security by detecting and responding swiftly to potential threats.


Basic Concepts and Terms


Event: In the context of ITIL's Monitoring and Event Management, an event is any detectable or discernible occurrence that has significance for the management of an IT service or its infrastructure.


Events are notifications created by an IT service, configuration item, or monitoring tool, signalling status changes that may require attention. These can range from regular informational messages indicating normal operations to warnings and alerts about potential issues that could disrupt services.


Monitoring: Monitoring involves the continuous and systematic observation of the performance and health of IT services and their underlying infrastructure. It serves to identify trends, potential issues, and opportunities for improvement, ensuring that services operate within their defined thresholds.


Monitoring can be either:


  • Active monitoring: where the system regularly checks and polls services and components to report their status and performance.

  • Passive monitoring: where the system receives alerts and events generated by the services themselves, triggered by predefined conditions.
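
To make the distinction concrete, here is a minimal Python sketch. The `Monitor` class, its `poll` and `receive` methods, and the component names are hypothetical and invented for illustration; ITIL does not prescribe any particular implementation. The point is only who initiates the exchange: the monitor (active) or the component (passive).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Monitor:
    """Toy monitor supporting both active polling and passive event receipt."""
    # Hypothetical health probes keyed by component name; each returns True when healthy.
    probes: Dict[str, Callable[[], bool]] = field(default_factory=dict)
    # Messages pushed to the monitor by the components themselves.
    received: List[str] = field(default_factory=list)

    def poll(self) -> Dict[str, bool]:
        """Active monitoring: the monitor asks each component for its status."""
        return {name: probe() for name, probe in self.probes.items()}

    def receive(self, message: str) -> None:
        """Passive monitoring: a component reports an event on its own initiative."""
        self.received.append(message)


monitor = Monitor(probes={"web": lambda: True, "db": lambda: False})
status = monitor.poll()                       # active: the monitor initiates the check
monitor.receive("db: disk usage above 90%")   # passive: the component initiates the report
print(status)
print(monitor.received)
```

In practice, most environments combine both styles: polling catches components that have gone silent, while pushed events carry detail that a simple health check cannot see.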

Metrics and Thresholds: Metrics are quantifiable measures used to assess IT services' performance and health. Standard metrics include response time, system availability, error rates, and usage patterns.


Thresholds are predefined values for these metrics that, when breached, trigger an event or alert. Setting appropriate thresholds is critical to effective event management as they help determine when an incident might develop and when proactive measures are needed.
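
As a rough illustration of how a metric and a threshold combine to produce an event, consider the sketch below. The function name `check_threshold`, the metric name, and the 500 ms limit are assumptions chosen for the example, not values taken from ITIL guidance.

```python
from typing import Optional


def check_threshold(metric: str, value: float, threshold: float) -> Optional[str]:
    """Return an event message if the value exceeds the threshold, otherwise None."""
    if value > threshold:
        return f"{metric} breached threshold: {value} > {threshold}"
    return None


breach = check_threshold("response_time_ms", 850.0, 500.0)
normal = check_threshold("response_time_ms", 120.0, 500.0)
print(breach)
print(normal)
```

The first call produces an event because 850 ms exceeds the 500 ms limit; the second returns nothing because the service is within its defined threshold.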



Significance of Events: Events are categorised based on their potential impact on services, helping prioritise responses and allocate resources efficiently.


Categories include:


  • Informational: Routine events that do not require immediate action; they are logged for record-keeping and analysis.

  • Warning: Events that indicate a potential issue that may not immediately impact the service but warrant closer investigation.

  • Critical: Events that require immediate attention as they pose a direct threat to service continuity and performance.
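
One simple way to assign these categories is to compare a metric against a warning level and a critical level. The sketch below assumes that approach; the `Severity` enum, the `categorise` function, and the 80/90 percent levels are illustrative choices rather than anything mandated by the practice.

```python
from enum import Enum


class Severity(Enum):
    INFORMATIONAL = "informational"
    WARNING = "warning"
    CRITICAL = "critical"


def categorise(value: float, warning_at: float, critical_at: float) -> Severity:
    """Map a metric value to a category using two ascending thresholds."""
    if value >= critical_at:
        return Severity.CRITICAL
    if value >= warning_at:
        return Severity.WARNING
    return Severity.INFORMATIONAL


# Example: disk usage in percent, with assumed levels of 80 (warning) and 90 (critical).
for usage in (40.0, 85.0, 95.0):
    print(usage, categorise(usage, warning_at=80.0, critical_at=90.0).value)
```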

Processes

The processes involved in Monitoring and Event Management in ITIL are designed to ensure that IT services are continuously observed and managed to maintain optimal performance and reliability.


Here is a detailed look at each key process:


Monitoring Planning

This process involves setting up the monitoring environment. It includes defining what elements of the IT infrastructure need monitoring, selecting appropriate tools and technologies, and determining the metrics and thresholds that will trigger events.


The planning phase ensures that monitoring is comprehensive, targeted, and aligned with business needs, helping to address potential service disruptions preemptively.


Activities:

  • Objective Setting: This involves defining the goals of monitoring activities grounded in service design and stakeholder requirements and ensuring the alignment of monitoring objectives with business goals.

  • Selection of Monitoring Targets: Critical service components and performance indicators are identified and prioritised based on their impact on service delivery and business outcomes.

  • Event Definition and Classification: Different events (informational, warning, and critical) are defined, and appropriate responses are mapped, setting the groundwork for effective event management.

  • Threshold Setting: Operational thresholds are established to trigger alerts for various event types, allowing for preemptive action before service quality is impacted.
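
The output of these planning activities can be captured as data, which makes it easy to review and sanity-check before it is loaded into a tool. The following is a sketch under stated assumptions: the `MonitoringTarget` fields, the component names, and the rule that a warning threshold must sit below the critical one are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class MonitoringTarget:
    component: str
    metric: str
    warning_at: float
    critical_at: float
    business_priority: int  # 1 = highest impact on service delivery


def validate(plan: List[MonitoringTarget]) -> List[str]:
    """Return a list of problems found in the plan (empty if none)."""
    problems = []
    for target in plan:
        if target.warning_at >= target.critical_at:
            problems.append(
                f"{target.component}/{target.metric}: "
                "warning threshold must be below critical"
            )
    return problems


plan = [
    MonitoringTarget("file-server", "disk_used_pct", 95, 90, 2),  # deliberately wrong
    MonitoringTarget("payments-api", "response_time_ms", 500, 1000, 1),
]
ordered = sorted(plan, key=lambda t: t.business_priority)  # highest priority first
problems = validate(plan)
print([t.component for t in ordered])
print(problems)
```

Here the plan is ordered by business priority, and the validation step flags the file server entry because its warning level has been set above its critical level.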



Event Handling

Once an event is detected, it must be logged, categorised, and assessed to determine the appropriate response.


Event handling processes include:


  • Event Detection: Automated systems or manual checks identify changes in the system that may indicate an issue or deviation from regular operation.

  • Event Logging: All detected events are recorded in a log for audit purposes and further analysis.

  • Event Categorisation and Prioritisation: Events are classified based on their nature and impact on the system, which helps prioritise them for response.

  • Event Response: Based on the category and priority, specific actions are initiated to address the event. This could range from simple notifications to more complex recovery procedures.
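
The steps above can be sketched as a small pipeline. This is not how any particular monitoring product works; the `Event` class, the `handle` function, and the three response actions are hypothetical names used to show logging, prioritisation, and response in sequence, with detection assumed to have already produced the events.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Event:
    source: str
    message: str
    severity: str  # "informational", "warning", or "critical"


event_log: List[Event] = []  # stands in for a persistent event log

# Lower number = handled first.
PRIORITY: Dict[str, int] = {"critical": 0, "warning": 1, "informational": 2}

# Assumed response actions per category; real ones would call other systems.
RESPONSES: Dict[str, Callable[[Event], str]] = {
    "critical": lambda e: f"PAGE on-call: {e.source}: {e.message}",
    "warning": lambda e: f"TICKET: {e.source}: {e.message}",
    "informational": lambda e: f"LOGGED: {e.source}: {e.message}",
}


def handle(events: List[Event]) -> List[str]:
    """Log every event, order by priority, then run the mapped response."""
    event_log.extend(events)                                      # event logging
    ordered = sorted(events, key=lambda e: PRIORITY[e.severity])  # prioritisation
    return [RESPONSES[e.severity](e) for e in ordered]            # event response


events = [
    Event("backup-job", "nightly backup completed", "informational"),
    Event("db-01", "replication stopped", "critical"),
    Event("web-02", "memory at 85%", "warning"),
]
actions = handle(events)
for action in actions:
    print(action)
```

Note that every event is logged regardless of category, while only the response differs. That separation keeps the audit trail complete even for events that need no action.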


Monitoring and Event Management Review

This process involves regular reviews of the monitoring and event management practices to identify areas for improvement. It includes analysing the effectiveness of the current monitoring setup, the appropriateness of responses to events, and the overall impact on IT service performance.


Recommendations from these reviews are used to refine monitoring strategies, update tools and processes, and enhance the overall resilience of IT services.


Relationship with Other Practices

Monitoring and Event Management is a core ITIL practice that interacts closely with other ITIL practices. Understanding these relationships can enhance the effectiveness of IT service management by ensuring cohesive and comprehensive service delivery.


Here are some fundamental relationships between Monitoring and Event Management and other ITIL practices:


1. Incident Management: Monitoring and event management plays a critical role in the early detection of incidents. By identifying and classifying events that could lead to service disruptions, this practice helps trigger the incident management process. Furthermore, effective monitoring ensures that incident response can be initiated swiftly, minimising the impact on service continuity.


2. Problem Management: The data gathered through monitoring activities is vital for problem management. It helps identify recurring issues and underlying problems that affect service quality. This collaboration ensures that incidents are resolved quickly and that the root causes are addressed to prevent future occurrences.


3. Change Management: Monitoring is crucial in the post-implementation phase of change management. It helps assess the impact of changes made to the IT infrastructure by providing data on service performance before and after the changes. This feedback loop can validate the success of changes or highlight areas that may need further adjustment.


4. Service Level Management: Monitoring tools provide essential data to measure service performance against the standards defined in Service Level Agreements (SLAs). Continuous monitoring ensures service providers meet contractual obligations by proactively managing service components to adhere to agreed-upon performance levels.


5. Capacity and Performance Management: Effective monitoring provides capacity planning and performance management data. It allows IT teams to predict future demands based on current trends and performance patterns, facilitating timely upgrades and optimisations to meet these demands without overspending or resource wastage.


6. Security Management: Monitoring and event management also extends to the security aspects of IT services. It helps detect potential security breaches by monitoring unusual activities and triggering security incident management processes if necessary. This proactive approach is essential for maintaining the confidentiality, integrity, and availability of IT services.



7. Continual Improvement: Feedback from monitoring and event management is crucial for continual service improvement. It provides a factual basis for assessing the effectiveness of ITIL practices and identifying areas where processes can be optimised to improve overall service delivery.


Roles & Responsibilities




In the Monitoring and Event Management practice within ITIL, clearly defined roles and responsibilities are crucial for ensuring the effectiveness and efficiency of monitoring operations.


Here is an outline of the key roles typically involved in this practice and their primary responsibilities:


Monitoring Specialist:

  • Responsibilities: This role involves setting up and maintaining the monitoring tools and systems. Monitoring Specialists are responsible for configuring the monitoring software, defining what needs to be monitored, and setting the appropriate thresholds and alerts. They also analyse monitoring data to identify trends that could indicate potential issues.


Event Manager:

  • Responsibilities: The Event Manager oversees the event handling process. This includes ensuring that all events are logged, categorised and responded to according to their priority and impact. The manager coordinates with other ITIL practices, such as Incident and Problem Management, to ensure a coherent response to events.


IT Service Manager:

  • Responsibilities: The IT Service Manager has a broader role overseeing the Monitoring and Event Management practice. They ensure the monitoring strategies align with the business objectives and IT service management policies. They also play a crucial role in integrating the monitoring practice with other service management activities to enhance service delivery.


Technical Support Team:

  • Responsibilities: This team supports the Monitoring Specialist by providing technical expertise and assistance. They help troubleshoot and resolve issues identified through monitoring and implement changes required to address the problems identified by the monitoring systems.


Service Owner:

  • Responsibilities: Service Owners are accountable for the end-to-end management of specific services. They use data provided by the monitoring systems to ensure that their services meet the agreed performance criteria and service levels. Based on monitoring insights, they are also involved in strategic decisions about service improvements.


Quality Assurance Team:

  • Responsibilities: This team uses monitoring data to verify that IT services adhere to quality standards and compliance requirements. They are involved in periodic reviews and audits of the monitoring processes to ensure they are effective and compliant with internal and external standards.


Implementation Advice

Implementing an effective Monitoring and Event Management practice within an ITIL framework requires careful planning and consideration of several factors.


Here are some key pieces of advice and metrics to guide the implementation process:


Key Metrics


  • Performance Metrics: Measures such as response time, uptime, and throughput help gauge the effectiveness of IT services and ensure they meet the agreed service levels.

  • Availability Metrics: Track the availability of IT services and components, which are crucial for maintaining business continuity.

  • Event Response Metrics: Measure the efficiency and timeliness of responses to detected events, ensuring that issues are addressed swiftly to minimise impact.
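
Two of these metrics are straightforward to compute once the raw data is available. The sketch below assumes downtime is tracked in minutes and that each event has a detection time and a response time; the function names and sample figures are illustrative only.

```python
from statistics import mean
from typing import List, Tuple


def availability_pct(total_minutes: float, downtime_minutes: float) -> float:
    """Availability as the percentage of the period the service was up."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


def mean_response_minutes(events: List[Tuple[float, float]]) -> float:
    """Average delay between detection and response, given (detected, responded) pairs."""
    return mean(responded - detected for detected, responded in events)


# A 30-day month has 43,200 minutes; 43.2 minutes of downtime is roughly 99.9% availability.
monthly = availability_pct(total_minutes=43_200, downtime_minutes=43.2)
delay = mean_response_minutes([(0, 5), (10, 25), (30, 40)])
print(round(monthly, 3))
print(delay)
```

Figures like these only become meaningful when compared against the targets agreed for the service, so it helps to report them alongside the relevant service level.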

Things to Avoid


  • Over-Monitoring: Avoid setting up too many unnecessary alerts, which leads to "alert fatigue", where important alerts may be overlooked.

  • Under-Monitoring: Conversely, insufficient monitoring can miss critical events, leading to unaddressed issues that could escalate into serious problems.

  • Lack of Integration: Ensure the monitoring tools are well integrated with other IT service management processes. Poor integration can lead to silos, inefficient processes, and increased risk of errors.

  • Ignoring Context: Monitoring should be context-aware. Not all metrics are always essential; the significance of events often depends on the current operational context.
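
One common way to reduce alert fatigue is to suppress repeats of the same alert within a time window. The sketch below shows that idea only; the `suppress_duplicates` function, the alert keys, and the 300-second window are assumptions for illustration, and the input is assumed to be sorted by timestamp.

```python
from typing import Dict, List, Tuple


def suppress_duplicates(
    alerts: List[Tuple[float, str]], window_seconds: float
) -> List[Tuple[float, str]]:
    """Keep an alert only if the same key was not emitted within the window.

    Each alert is a (timestamp_in_seconds, key) pair, ordered by timestamp.
    """
    last_emitted: Dict[str, float] = {}
    kept: List[Tuple[float, str]] = []
    for timestamp, key in alerts:
        previous = last_emitted.get(key)
        if previous is None or timestamp - previous >= window_seconds:
            kept.append((timestamp, key))
            last_emitted[key] = timestamp
    return kept


alerts = [(0, "db:cpu"), (30, "db:cpu"), (45, "web:latency"), (400, "db:cpu")]
kept = suppress_duplicates(alerts, window_seconds=300)
print(kept)
```

The repeat at 30 seconds is dropped, while the alert at 400 seconds is kept because the window has elapsed. A window that is too long risks hiding a genuine recurrence, so the value should reflect how quickly the underlying issue is expected to be addressed.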

Frequently Asked Questions

What is the difference between monitoring and event management?

Monitoring involves continuously observing IT systems to ensure that they operate as expected and meet performance standards. Event management involves responding to events, that is, significant changes in the IT infrastructure or operations that need attention. Monitoring provides the data that event management uses to take action.


How do you determine the right thresholds for alerts in monitoring systems?

Setting suitable thresholds involves understanding the normal operating parameters of your IT systems and the business impact of deviations. Start by analysing historical data to establish baseline performance levels, then set thresholds that allow enough time for intervention before performance degrades significantly or risks increase.
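
As one possible starting point, a threshold can be derived from historical data as the mean plus a multiple of the standard deviation. This is a common statistical heuristic rather than an ITIL rule; the function name, the sample history, and the multipliers of 2 and 3 are assumptions, and any values it produces should be tuned against the business impact of a breach.

```python
from statistics import mean, stdev
from typing import Sequence


def baseline_threshold(history: Sequence[float], k: float) -> float:
    """Threshold set k sample standard deviations above the historical mean."""
    return mean(history) + k * stdev(history)


# Hypothetical response times in milliseconds from a normal operating period.
history = [100, 110, 90, 105, 95]
warning_at = baseline_threshold(history, k=2.0)   # earlier signal, leaves time to act
critical_at = baseline_threshold(history, k=3.0)  # clear deviation from the baseline
print(round(warning_at, 1), round(critical_at, 1))
```

Using a lower multiplier for the warning level than for the critical level gives the team a window to intervene before performance degrades significantly.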


Can monitoring and event management help in reducing IT operational costs?

Yes, effective monitoring and event management can significantly reduce costs by preventing downtime, optimising resource use, and improving system efficiency. Proactive management minimises the need for emergency repairs and reduces the severity and frequency of service disruptions.


What are the best practices for integrating monitoring with other ITIL practices?

Integration involves ensuring that other ITIL practices like Incident, Problem, and Change Management effectively use data and alerts generated by monitoring systems. This can be achieved by using common tools that share data across practices, setting up workflows that trigger actions in other processes, and regularly reviewing inter-process dependencies and communications.


How often should monitoring systems be reviewed and updated?

Monitoring systems should be reviewed regularly to ensure they continue to meet the organisation's needs as IT systems and business processes evolve. A good practice is to review the monitoring setup at least annually or whenever significant changes to the IT infrastructure or business operations occur.


Are there specific types of monitoring tools that are recommended for ITIL practices?

While no specific tools are recommended universally, the choice of monitoring tools should depend on the specific needs of your IT environment, the complexity of your infrastructure, and integration capabilities with other management tools. Tools should provide comprehensive coverage, real-time analytics, and customisable alerting functions.


About the author

Hi, I'm Alan, and I have been working within the IT sector for over 30 years.

For the last 15 years, I've focused on IT Governance, Information Security, Projects and Service Management across various styles of organisations and markets.

I hold a degree in Information Systems, ITIL Expert certificate, PRINCE2 Practitioner and CISMP (Information Security Management).
