Basics of Network Monitoring
- General Networking
- IP Addresses and Subnetting
- Switching and Routing
- Domain Name System (DNS)
- Dynamic Host Configuration Protocol (DHCP)
- General Windows Monitoring Elements
- General Monitoring Techniques and Protocols
- Simple Network Management Protocol (SNMP)
- Leveraging the Power of Scripts
- Database Basics
- Relational database design
- SQL queries
- Tips and Resources
Network Monitoring Design Philosophy
- Reporting and Alerts
- Thresholds, repeat-count, and time delays
- Suppression and de-duplication
- Data Storage Aggregation
- Overview of Agent-based Monitoring
- Overview of Agentless Monitoring
- Tips and Resources
Network Monitoring Common Practices
- Availability Monitoring
- Interface Monitoring
- Disk Monitoring
- Tips and Resources
Network Monitoring Best Practices
- Baseline Network Behavior
- Escalation Matrix
- Reports at Every Layer
- Implement High Availability with Failover Options
- Configuration Management
- Capacity Planning and Growth
- Tips and Resources
Basics of Network Monitoring
Networks have evolved from flat designs, where there were only a handful of elements and everything was connected, to more complex designs that involve many more technologies, such as cloud, wireless, remote users, VPN, IoT, mobile devices, and so on.
In spite of all the evolution that has occurred, one factor that has been constant is the need for network monitoring software. Monitoring allows network admins to know what is going on in their network, be it with their WAN, LAN, VoIP, MPLS, and other connections or the state of various network elements or nodes such as the access, distribution and core switches, routers, firewalls, servers, client systems, and so on.
Before you begin with network monitoring, it is necessary to understand networking in general, as well as the essentials of Windows® systems, since Windows is the major OS used in enterprises worldwide. Knowledge about the essentials of networking and the elements that make up a computer network helps with better network management and monitoring.
1. General Networking
A network is a collection of devices that are connected and can communicate with one another over a common transport or communication protocol. Here communication can refer to the transfer of data among users or instructions between nodes in the network, such as computers, mobile devices, output devices, management elements, servers, routing and switching devices, etc.
Networks can be categorized by the geographic area they span as LAN, WAN, or Internet. The design or topology of a network can also differ based on user and organizational requirements, such as star, ring, bus, mesh, etc.
Whatever the design or topology, every network follows the reference design described in the OSI model for data transmission and communication. Open Systems Interconnection (OSI) is a reference model for a network and describes how information from an application installed on a device or system moves through various nodes in the network to another device within the same network or to an external network. Many components make up a network and enable communication between its nodes, such as network addresses, data transport and communication protocols, and the methods used to transfer packets between nodes within the same network or different networks. Below are some of the basic components that are part of every computer network; these are also the factors that form the essentials of network monitoring.
IP Addresses and Subnetting
An IP address is the reference label assigned to each node in a network and is used by other nodes for location and communication. IP addresses are binary numbers but are written in a human-readable format, either as an IPv4 address or an IPv6 address. The elements with an IP address that make up a network can be divided into different subnetworks based on device type, location, access, etc. The devices in the same subnet all have a common network prefix in their IP addresses.
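The common-prefix idea can be sketched with Python's standard ipaddress module. This is a minimal illustration only; the subnet and host addresses below are made up.

```python
import ipaddress

# Hypothetical subnet and host addresses, chosen only for illustration.
subnet = ipaddress.ip_network("192.168.10.0/24")
hosts = ["192.168.10.25", "192.168.10.200", "192.168.20.7"]

def in_subnet(address, network):
    """Return True when the address shares the network's prefix."""
    return ipaddress.ip_address(address) in network

for host in hosts:
    print(host, "in", subnet, "->", in_subnet(host, subnet))

# The same address space can be split into smaller subnets,
# for example one per location or device type.
smaller = list(subnet.subnets(new_prefix=26))
print([str(n) for n in smaller])
```

The first two hosts share the 192.168.10 prefix and belong to the subnet; the third does not.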
Switching and Routing
Switching here refers to packet switching, the process in which data is divided into smaller packets before being sent and transported over the network. Routing is the act of finding a path for those packets to traverse from a source node in one network to a destination node in a different network.
Domain Name System (DNS)
Each element in a network, in addition to an IP address, can also have a reference name. This allows a user to communicate with a resource using an easy-to-remember alphabetical name rather than a difficult-to-remember IP address. DNS maps the name of a resource to its IP address or translates an IP address back to a name.
Dynamic Host Configuration Protocol (DHCP)
DHCP is a network protocol that allows a management server (DHCP server) to dynamically assign an IP address to the resources in its network. Without DHCP, network admins would have to assign IP addresses for each host in their network manually, making management of IP addresses difficult.
2. General Windows Monitoring Elements
Enterprises use various business applications that are installed on servers within the enterprise network or datacenter to provide services to hosts within the organization. There are also additional network and user management services, such as DNS, Active Directory, DHCP, etc., that are provided from servers. Additionally, users or clients in an organization require an operating system. Among the multiple choices available, Windows-based operating systems are the most widely used in an enterprise, both for servers and for client hosts.
The presence of business applications on servers necessitates their constant monitoring for visibility about resource usages, such as memory, disk space, cache, CPU, and more. Monitoring also helps identify possible issues that are affecting server performance. In addition to servers, client devices too require monitoring to provide a trouble-free experience to the end-user.
Windows-based systems can provide data to monitoring systems, which then process and use the data to report on the performance and health of the servers and host machines. The data that is used for monitoring can be collected from a Windows machine using any of the available methods discussed below.
3. General Monitoring Techniques and Protocols
Now that you know what makes up a network and the components available for Windows monitoring, let us look at general monitoring techniques used by network and systems admins.
To successfully monitor your network, or even servers and systems, the following are necessary:
- Data or information from various elements in the network. Data includes information about the working, current status and performance, and health of the element being monitored.
- An application or monitoring software that can collect, process, and present data in a user-friendly format. The software should also alert users about impending problems based on thresholds.
- A protocol or method for transmitting information between the monitored element and the monitoring software.
Information collected from the network helps with better management and control over the network, identification of possible network issues before they cause downtime, and quick resolution of issues when something goes wrong. In short, constant monitoring will help create a high performing network.
Below are some of the general techniques available for monitoring. These techniques are used for collection of monitoring data from the network.
Ping
Ping is a network admin tool that is used to test the reachability and availability of a host in an IP network. The data from ping results can determine whether a host in the network is active or not. Furthermore, it can measure the transmission time and packet loss when communicating with a host.
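The bookkeeping behind a ping-based check can be sketched as follows. This is a minimal illustration that assumes the round-trip times have already been collected; the function name and the sample values are hypothetical, and no ICMP packets are actually sent.

```python
from statistics import mean

def summarize_ping(rtts_ms):
    """Summarize one probe cycle.

    rtts_ms holds one entry per echo request: the round-trip time in
    milliseconds, or None when no reply arrived before the timeout.
    """
    replies = [r for r in rtts_ms if r is not None]
    sent = len(rtts_ms)
    loss_pct = 100.0 * (sent - len(replies)) / sent if sent else 0.0
    return {
        "reachable": bool(replies),
        "packet_loss_pct": loss_pct,
        "avg_rtt_ms": mean(replies) if replies else None,
    }

# Hypothetical results: four requests sent, one went unanswered.
print(summarize_ping([12.0, 14.0, None, 16.0]))
```

A host is treated as reachable when at least one reply arrives; a stricter policy could require a minimum number of replies.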
Simple Network Management Protocol (SNMP)
SNMP is a network management protocol that is used for exchanging information between hosts in a network that includes network monitoring software. This is the most widely used protocol for management and monitoring of the network and includes the below components:
- Managed device: The node in the network that supports SNMP and access to specific information.
- Agent: Software that is part of the monitored device. An agent has access to the MIB (Management Information Base) of the device and allows the NMS to read and write to the MIB.
- Network Management System (NMS): An application on a system that monitors and controls the managed devices through the agent using SNMP commands.
SNMP data is collected from, or sent to, a managed device. Collection happens either by polling or using traps. Traps allow an agent to send information to an NMS about events on the device.
The MIB holds information about the structure of the data on a device for management. MIBs contain OIDs (object identifiers), which are the actual identifiers for the variables to be read from the device or set on the device.
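To make the relationship between OIDs and the values behind them concrete, here is a toy sketch of polling. It does not speak SNMP on the wire; the dictionary stands in for an agent's MIB, the function names are hypothetical, and the values are invented. The three OIDs themselves are standard MIB-II identifiers.

```python
# A toy stand-in for an agent's view of its MIB. The OIDs are standard
# MIB-II identifiers; the values are invented for illustration.
MIB = {
    "1.3.6.1.2.1.1.1.0": ("sysDescr", "Example core switch"),
    "1.3.6.1.2.1.1.3.0": ("sysUpTime", 8640000),  # hundredths of a second
    "1.3.6.1.2.1.1.5.0": ("sysName", "core-sw-01"),
}

def oid_key(oid):
    """Turn a dotted OID into a tuple of ints so OIDs sort numerically."""
    return tuple(int(part) for part in oid.split("."))

def snmp_get(oid):
    """Return (name, value) for an exact OID, like an SNMP GET."""
    return MIB[oid]

def snmp_get_next(oid):
    """Return the first (oid, name, value) after the given OID,
    like an SNMP GETNEXT; None when the end of the MIB is reached."""
    later = sorted((o for o in MIB if oid_key(o) > oid_key(oid)), key=oid_key)
    if not later:
        return None
    return (later[0],) + MIB[later[0]]

print(snmp_get("1.3.6.1.2.1.1.5.0"))
print(snmp_get_next("1.3.6.1.2.1.1.1.0"))
```

Repeatedly calling snmp_get_next with the OID it last returned walks the whole tree, which is the idea behind tools such as snmpwalk.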
Syslog
Syslog (not to be confused with Windows Eventlog) is a message logging system that allows a device to send event notifications in IP networks. The information from these messages can be used for system management, as well as security auditing. Syslog is supported on a variety of devices ranging from printers to routers and firewalls.
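Each syslog message starts with a priority value in angle brackets that encodes where the message came from (facility) and how serious it is (severity). The sketch below decodes that field; the function name and the sample message are hypothetical, while the formula PRI = facility * 8 + severity comes from the syslog RFCs.

```python
import re

SEVERITIES = ["emergency", "alert", "critical", "error",
              "warning", "notice", "informational", "debug"]

def parse_pri(message):
    """Split the leading <PRI> value into facility and severity.

    PRI = facility * 8 + severity, per the syslog RFCs.
    """
    match = re.match(r"<(\d{1,3})>", message)
    if match is None:
        raise ValueError("message does not start with a <PRI> field")
    pri = int(match.group(1))
    facility, severity = divmod(pri, 8)
    return facility, SEVERITIES[severity]

# Hypothetical message from a router; <187> is facility 23 (local7), severity 3.
sample = "<187>Oct 11 22:14:15 edge-rtr-01 %LINK-3-UPDOWN: Interface Gi0/1, changed state to down"
print(parse_pri(sample))
```

A monitoring system can use the decoded severity to decide which messages deserve an alert and which are only stored for auditing.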
Leveraging the Power of Scripts
In networks where an NMS is not available for monitoring, where the existing NMS does not support specific functions, or where the functionality of the existing NMS needs to be extended, network admins can make use of scripts. Scripts use common commands, such as ping, netstat, lynx, snmpwalk, etc., that are supported by most network elements to perform an action, such as collecting information from elements, making changes to device configurations, or performing a scheduled task. Bash, Perl, etc. are common scripting tools used by network admins.
4. Database Basics
A database is a structured collection of data or information. Every database involves a DBMS (Database Management System), a software application that performs actions such as data creation, update, retrieval, or deletion based on input from users or other applications. In addition to the data management functions, a DBMS provides data security, helps with data backup and recovery, and maintains data integrity. The actual data and the DBMS, because of their close relation, are sometimes referred to together as the database. Some of the popular DBMS in the market today are MySQL, Microsoft SQL Server, PostgreSQL, Oracle, DB2, SAP ASE, and others.
Databases and their related DBMS are usually run on dedicated servers, referred to as database servers. These servers may leverage RAID technology available on storage arrays for redundancy and performance.
Relational database design
While there have been multiple database models, the most popular ones in the market have all used the relational database model. A relational DBMS (RDBMS) allows users to create and maintain all data in objects called tables. Each table is a collection of related data entries and consists of rows and columns. The table structure of an RDBMS allows for viewing the same database in multiple ways.
SQL queries
Structured Query Language (SQL) is a standard language for accessing information or data from databases. SQL queries can be used to perform actions, such as create, delete, update, and other manipulations to data stored in a database.
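The create, update, delete, and read actions can be tried out with the SQLite engine that ships with Python. This is only a sketch: the table name, columns, and rows are hypothetical, and a production monitoring system would use its own schema and DBMS.

```python
import sqlite3

# In-memory database with a hypothetical table of monitored nodes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (id INTEGER PRIMARY KEY, name TEXT, ip TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO nodes (name, ip, status) VALUES (?, ?, ?)",
    [("core-sw-01", "10.0.0.1", "up"),
     ("edge-rtr-01", "10.0.0.2", "down"),
     ("db-srv-01", "10.0.1.10", "up")],
)

# Update one row, delete another, then query what is left.
conn.execute("UPDATE nodes SET status = 'up' WHERE name = 'edge-rtr-01'")
conn.execute("DELETE FROM nodes WHERE name = 'db-srv-01'")
rows = conn.execute("SELECT name, status FROM nodes ORDER BY name").fetchall()
print(rows)
conn.close()
```

Each row of the table describes one node and each column one attribute of it, which is the row-and-column structure described above.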
5. Tips and Resources
Network Monitoring Design Philosophy
Monitoring helps network and systems administrators identify possible issues before they affect business continuity and find the root cause of problems when something goes wrong in the network. Be it a small business with fewer than 50 nodes or a large enterprise with more than 1000 nodes, continuous monitoring helps to develop and maintain a high performing network with little downtime.
For network monitoring to be a value addition to a network, the monitoring design should adopt basic principles. For one, a monitoring system should be comprehensive and cover every aspect of an enterprise, such as the network and connectivity, systems, and security. It would also be preferable if the system provides a single-pane-of-glass view into everything about the network and includes reporting, problem detection, resolution, and network maintenance. Further, every monitoring system should provide reports that cater to different audiences: the network and systems admins, as well as management such as the CEO, CIO, and CTO. Most importantly, a monitoring system should not be too complex to understand and use, nor should it lack basic reporting and drill-down functionalities.
Network management is an extensive field that includes various functions. The objectives of network management are classified into five categories, namely Fault management (F), Configuration management (C), Accounting management (A), Performance management (P), and Security management (S), together known as FCAPS. In networks where billing is not needed, accounting is replaced with administration.
Fault management deals with the process of recognizing, isolating, and resolving a fault that occurs in the network. Identification of potential network issues also falls under Fault management.
Configuration management deals with the process of gathering, storing, and tracking the configurations of devices in the network. Backing up working configurations and controlling who can change them also fall under Configuration management.
Accounting applies to service-provider networks where network resource utilization is tracked and then the information is used for billing or charge-back. In networks where billing does not apply, accounting is replaced with administration, which refers to administering end-users in the network with passwords, permissions, etc.
Performance management involves managing overall network performance. Data for parameters associated with performance, such as throughput, packet loss, response times, utilization, etc., are collected mostly using SNMP.
Security is another important area of network management. Security management in FCAPS covers the process of controlling access to resources in the network which includes data as well as configurations and protecting user information from unauthorized users.
2. Reporting and Alerts
The basic components of network monitoring are the collection of data from network elements and the processing and presentation of the collected data in a user understandable format. This process itself can be referred to as reporting. Reporting helps the network admin understand the performance of network nodes, current status of the network, and what is normal in the network. With data from reports, an administrator can make informed decisions for capacity planning, network maintenance, troubleshooting, and network security.
Reporting alone would not help an admin maintain a high performance network. Another important requirement is the ability to identify what can go wrong within the network. While reports help understand what is normal and the current status of the network, alerts based on thresholds and trigger points help a network administrator identify possible network issues related to performance and security before they bring down the network. Alerts and reports complement each other: alerts let the administrator know of potential problems, and reports provide data to identify the root cause of network issues.
Every network has a baseline, which describes what is normal in the network as far as network performance and network behavior are concerned. The baseline differs from one network to another. When the values pertaining to a parameter change from an established baseline value, the change has the potential to become an issue that can affect network uptime. In such scenarios, alerting based on the deviation from the mean value can help with early detection and resolution of issues, which in turn contributes towards the smooth functioning of the network with little or no downtime. Alerting helps administrators find what can possibly go wrong in the network in relation to performance and security. There are various options based on which an alert can be generated. Here are a few terms associated with alerts:
Trigger refers to the event that causes an alert to be generated. An event here can refer to the change in state of a node or a value related to the node, deviation from mean value of a parameter, crossing the threshold value of a parameter, and so on.
Thresholds, repeat-count, and time delays
Most alerts are set to be generated based on thresholds. When the threshold value set for a network parameter is crossed, a threshold violation occurs, and this can be set to trigger an alert. Alerts can also be set to be generated only when a threshold is violated a given number of times within a given period (e.g., 2 times in 10 minutes).
An alert that is generated based on a threshold violation will reset when the value of the parameter that triggered the alert returns to its baseline value.
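The interplay of threshold, repeat count, time window, and reset can be sketched as a small state machine. This is one possible reading under stated assumptions, not how any particular product behaves: the class name and parameters are hypothetical, and a return below the threshold is treated as the reset condition and also clears the violation count.

```python
from collections import deque

class ThresholdAlert:
    """Raise an alert after `repeat` violations within `window_s` seconds."""

    def __init__(self, threshold, repeat=2, window_s=600):
        self.threshold = threshold
        self.repeat = repeat
        self.window_s = window_s
        self.violations = deque()  # timestamps of recent violations
        self.active = False

    def observe(self, timestamp, value):
        """Feed one sample; return True while the alert is active."""
        if value > self.threshold:
            self.violations.append(timestamp)
            # Forget violations that fell out of the time window.
            while timestamp - self.violations[0] > self.window_s:
                self.violations.popleft()
            if len(self.violations) >= self.repeat:
                self.active = True
        else:
            # Value is back under the threshold: reset the alert
            # and start counting from scratch (an assumed policy).
            self.violations.clear()
            self.active = False
        return self.active

# Hypothetical CPU samples as (seconds, percent), with a 90 % threshold.
alert = ThresholdAlert(threshold=90, repeat=2, window_s=600)
samples = [(0, 95), (300, 96), (600, 50), (900, 93)]
states = [alert.observe(t, v) for t, v in samples]
print(states)
```

The first violation alone does not raise the alert, the second one within ten minutes does, and the sample at 50 % resets it.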
Suppression and de-duplication
Certain events are expected even though they cross a threshold value. In such cases, alerts are suppressed. In other cases, the same event may cause threshold violations on multiple elements, which in turn will trigger multiple alerts. To prevent such floods of alerts, monitoring systems support de-duplication or even consolidation of alerts based on the event that triggered them.
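Suppression and consolidation can be illustrated with a few lines of code. This is a sketch under assumptions: the alert dictionaries, their 'event' and 'node' keys, and the set of suppressed events are all hypothetical, and real systems usually correlate on richer data.

```python
def deduplicate(alerts):
    """Consolidate alerts that share the same root event.

    Each alert is a dict with hypothetical keys 'event' (what happened)
    and 'node' (where it was seen). One consolidated alert is kept per
    event, listing every affected node once.
    """
    consolidated = {}
    for alert in alerts:
        entry = consolidated.setdefault(
            alert["event"], {"event": alert["event"], "nodes": []}
        )
        if alert["node"] not in entry["nodes"]:
            entry["nodes"].append(alert["node"])
    return list(consolidated.values())

# Hypothetical burst: one uplink failure reported three times by two
# switches, plus an unrelated disk alert.
raw = [
    {"event": "uplink-1 down", "node": "sw-a"},
    {"event": "uplink-1 down", "node": "sw-b"},
    {"event": "uplink-1 down", "node": "sw-a"},
    {"event": "disk full", "node": "db-srv-01"},
]
# Events that are expected are suppressed (hypothetical choice).
suppressed_events = {"disk full"}
result = [a for a in deduplicate(raw) if a["event"] not in suppressed_events]
print(result)
```

Four raw alerts collapse into a single consolidated alert that names both affected switches.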
4. Data Storage Aggregation
Monitoring systems collect and use data from network elements for various monitoring related functions. Networks also need continuous monitoring to ensure that problems are detected before they cause network downtime. Continuous collection for monitoring leads to an accumulation of large volumes of data. This can lead to:
- A slow-down in the performance of the monitoring solution as the tool has to analyze more data to generate required reports
- Impact on the storage space required to store monitoring data, which in turn increases the Total Cost of Ownership of the monitoring system
- Slower troubleshooting due to the larger volume of data to be analyzed
Monitoring systems make use of data aggregation to avoid the above-mentioned scenarios. Data aggregation is the process in which information gathered over time is summarized and rolled up into less granular data and used for quicker generation of historical reports. The granularity of a report generated from aggregated data will depend on the aggregation pattern of the monitoring system. Many monitoring systems start by storing data at 1-minute granularity. Over time the data is averaged out and rolled up into less granular data tables, such as 10-minute, hourly, or weekly tables. This allows a monitoring system to generate reports about a node that go back in time or span a large time period, without performance issues or strain on storage space.
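The roll-up step can be sketched as a simple averaging function. This is a minimal illustration: the function name and sample data are hypothetical, and averaging is only one of several possible aggregation choices (a system might also keep minimum and maximum values per bucket).

```python
from collections import defaultdict
from statistics import mean

def roll_up(samples, bucket_s):
    """Average (timestamp_s, value) samples into buckets of bucket_s seconds.

    Returns a sorted list of (bucket_start_s, average_value).
    """
    buckets = defaultdict(list)
    for timestamp, value in samples:
        buckets[timestamp - timestamp % bucket_s].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]

# Hypothetical 1-minute CPU samples covering 20 minutes: 10 % then 30 %.
one_minute = [(m * 60, 10.0 if m < 10 else 30.0) for m in range(20)]
ten_minute = roll_up(one_minute, 600)
print(ten_minute)  # 20 rows reduced to 2
```

The trade-off is visible in the output: storage shrinks by a factor of ten, but a short spike inside a bucket is no longer distinguishable after averaging.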
5. Overview of Agent-based Monitoring
Network and systems monitoring tools are either agent-based, agentless, or a combination of both. An agent is software on a monitored device that has access to the performance data of the device. This data is then sent to an NMS based on requests triggered from the NMS or, in some cases, based on policies defined within the agent. The presence of an agent on the monitored device provides access to granular data, which in turn helps with better monitoring, reporting, and troubleshooting of issues.
The most common approach for an agent based monitoring system is to provide data to the NMS at set intervals. The presence of an agent allows the monitoring station to perform specific actions on the client that aid with better management and monitoring.
Agent-based monitoring provides advantages, such as more granular data, the capability to monitor even non-standard metrics on the device, and the ability to perform actions on the monitored device. But an agent based approach can also be time consuming as it requires agents to be installed on each device that has to be monitored, as well as additional tasks related to update and maintenance of all agents that are deployed in the network.
6. Overview of Agentless Monitoring
Agentless monitoring, as the name suggests, does not deploy an agent on the monitored device. Instead, it makes use of remote APIs exposed by the service that needs to be monitored, or analyzes data packets being transferred to and from the monitored device. SNMP is the most common agentless method used to monitor network elements, while WMI (Windows Management Instrumentation) is used to monitor Windows systems.
Agentless monitoring provides advantages, such as not having to deploy agents on each monitored device, lower deployment and maintenance costs, and almost zero impact on the client due to the absence of an agent application or software running on it. But agentless monitoring has its own set of disadvantages too, the most important being the lack of in-depth reports compared to what agent-based monitoring can provide. Agentless monitoring is also limited in the support it can provide for custom-built devices or servers whose MIBs or data are not exposed via APIs for agentless data collection.
7. Tips and Resources
Network Monitoring Common Practices
Having a network management and monitoring strategy in place for a network is as important as network design and implementation. Without network monitoring, a well-planned and designed network can be brought down by the smallest of issues.
When implementing a network monitoring solution, there are a few common practices that are followed by organizations and network administrators. These common practices help define a basic strategy to get started with information on the nodes and parameters that need to be monitored.
This does not mean that network monitoring is limited to these common practices alone. The common practices define the basics that are a part of network monitoring. In addition to the common practices specified here, the network admin has to understand the design and requirements of the network they own and be able to implement additional monitoring strategies to bring all metrics and elements in the network under their purview.
1. Availability Monitoring
Availability monitoring defines the monitoring of all resources in the IT infrastructure to ensure they are available to cater to the requirements of the organization and its users. Today’s IT infrastructure requires 100% uptime to meet the business demands. The network and services offered in the network need to be available at all times to ensure business continuity. This is where availability monitoring can help. Continuous monitoring of resources and services ensures that the node or service is up and running and available to meet requirements. Some examples of availability monitoring include monitoring devices in the network to ensure the network is trouble free, bandwidth availability to ensure data delivery, availability of storage space to store organizational data, monitoring system level services to ensure enterprise critical applications are functioning smoothly, etc.
Some commonly used technologies for availability monitoring are:
- Ping: The most widely used method. ICMP pings are sent to a monitored device and based on the replies, the availability of a device or service is measured
- Telnet: Used to check the device availability in networks where ping is blocked
- SNMP: Used to measure availability or current status of a service on a device
- WMI: Used to check the availability of services running on Windows systems
- IP SLA: Cisco feature that can measure availability of WAN links and their capacity to carry specific services
2. Interface Monitoring
There are multiple types of interfaces used in a network, ranging from Fast Ethernet and Gigabit Ethernet to very high-speed Fibre Channel interfaces. The interface on a device is the entry and exit point for packets that provide a service to the organization. If there is an error, packet loss, or even if the interface itself goes down, it can result in a poor quality of experience.
Interface monitoring involves monitoring the interfaces on a device for errors, packet loss, discards, utilization limits, etc. The information from interface monitoring will help identify possible network issues that are the cause of poor application or service performance.
Network monitoring systems make use of ping or SNMP to collect interface statistics from network devices. While ping using ICMP packets reports on interface stats, such as packet loss, Round Trip Time, etc., SNMP based data collection helps monitor interface bandwidth utilization, traffic speed on the interface, errors, discards, etc. Together, this information helps identify application performance issues in the network.
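Bandwidth utilization is typically derived from two readings of an interface octet counter taken some time apart. The sketch below shows the arithmetic only; the function name, the readings, and the interface speed are hypothetical, and the counter is assumed to wrap at most once between readings.

```python
def utilization_pct(octets_then, octets_now, interval_s, if_speed_bps, counter_bits=32):
    """Percent utilization from two readings of an octet counter.

    Handles a single counter wrap between the two readings.
    """
    delta = octets_now - octets_then
    if delta < 0:  # counter wrapped past its maximum and restarted at 0
        delta += 2 ** counter_bits
    bits_per_second = delta * 8 / interval_s
    return 100.0 * bits_per_second / if_speed_bps

# Hypothetical readings of ifInOctets taken 60 s apart on a 100 Mbit/s port.
print(utilization_pct(1_000_000, 376_000_000, 60, 100_000_000))
```

On fast interfaces a 32-bit counter can wrap more than once within a polling interval, which is why 64-bit counters or shorter polling intervals are preferable there.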
3. Disk Monitoring
Data or information is one of the most important resources for an organization. Organizations need data for business planning, as well as for their smooth functioning. The data that is needed by an organization also has to be stored for record keeping or for later use. In enterprises, data is collected and stored on storage arrays that have multiple disks. Any issues that arise on disks or the storage arrays that store business data can have serious consequences for business continuity.
Disk monitoring includes management of disk space for effective utilization, and monitoring of disk errors, large file stats, free space, changes to disk space usage, I/O performance, etc. Monitoring allows admins to plan in advance for upgrades to the system and to storage space, detect storage-related problems, and reduce downtime if an issue occurs.
4. Hardware Monitoring
A network involves many hardware devices, such as devices used for routing and switching, storage, connectivity, application servers, etc. The hardware forms the backbone of the entire IT infrastructure. If hardware critical to the day-to-day operations of the network goes down, that also will lead to network downtime. For example, a faulty power supply on the core switch or overheating of the edge router can cause a network outage. To ensure the smooth functioning of the network, it is important to monitor the health and performance of hardware devices in the network.
To understand details about hardware health, there are multiple metrics that need to be monitored. Here are a few important metrics and why they should be monitored:
- CPU: Tasks for a device are handled by its CPU. If the CPU utilization reaches its maximum value, the device performance can take a hit
- Temperature: When tasks are performed, the CPU usage of a device too can increase. This in turn can increase the temperature. Temperature shoot-ups can cause a device to malfunction thus bringing down the network
- Fan speed and status: Temperature and fan performance go hand-in-hand. Fan speed monitoring helps ensure the fan is working and even balances cooling, thus keeping the device temperature at its optimum value
- Power supply state: A faulty power supply or a spike in power to a device can cause it to malfunction, ultimately leading to downtime. Monitoring with alerts based on thresholds helps an admin find potential issues
5. Tips and Resources
Network Monitoring Best Practices
The fact that monitoring is one of the most important components of a network has been reiterated multiple times throughout this document. It is only with continuous monitoring that a network admin can maintain a high performance IT infrastructure for an organization.
Like the common practices discussed earlier, there are also best practices that are applicable to network monitoring. While common practices define the basic components that are essential for network monitoring and are applicable to every network, best practices are guidelines for implementing a good network monitoring strategy. Adopting the best practices can help network admins streamline their network monitoring to identify and resolve issues much faster, with a very low MTTR (Mean Time To Resolve). Let us look at a few network monitoring best practices that are followed in many enterprises worldwide to help create a high performing network.
1. Baseline Network Behavior
To be able to identify potential problems even before users start complaining, the admin needs to be aware of what is normal in the network. Baselining network behavior over a couple of weeks or even months will help the network admin understand what normal behavior in the network is. Once the normal or baseline behavior of the various elements and services in the network is understood, the information can be used by the admin to set threshold values for alerts.
When an element in the network is malfunctioning, some of the metrics associated with the node performance would display a deviation from their mean value. For example, the temperature of a core switch in the network may shoot-up. The increase in temperature can be due to an increase in CPU utilization on the switch. Understanding the normal temperature and CPU utilization of the device will help the network admin detect the deviation and take corrective actions before a malfunction occurs.
Knowledge of baseline behavior in regards to network elements helps an admin decide the thresholds at which an alert has to be triggered. This aids proactive troubleshooting and even prevents network downtime rather than being reactive after users in the network start complaining.
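One simple way to turn a baseline into an alert condition is to compare each new value with the historical mean and spread. This is a sketch under assumptions: the function names, the temperature history, and the three-standard-deviation tolerance are all hypothetical choices, not a recommendation from this document.

```python
from statistics import mean, pstdev

def build_baseline(history):
    """Return (mean, standard deviation) of historical samples."""
    return mean(history), pstdev(history)

def deviates(value, baseline, tolerance=3.0):
    """True when value is more than `tolerance` standard deviations
    away from the baseline mean."""
    avg, spread = baseline
    return abs(value - avg) > tolerance * spread

# Hypothetical core-switch temperatures (deg C) gathered during baselining.
history = [40, 42, 41, 39, 38, 40, 41, 39]
baseline = build_baseline(history)
print(baseline)
print(deviates(41, baseline), deviates(55, baseline))
```

A reading of 41 stays within the normal spread, while 55 is flagged as a deviation that deserves attention before the device malfunctions.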
2. Escalation Matrix
One of the reasons why potential network issues become actual network problems is that the alerts triggered based on a threshold are ignored or the right person is not alerted. In a large network, there can be multiple administrators or people who take care of different aspects of the network. There can be the security admin who looks at firewall devices and Intrusion Prevention Systems, the systems admin, or even an admin responsible only for virtualization.
When setting up monitoring and reporting, the organization should have a policy on who has to be alerted when a malfunction occurs or a potential problem is detected. Based on the policy, the person who administers the network aspect that is having an issue can be alerted. This in turn can reduce the time needed for analysis, which further reduces the MTTR.
In addition to alerting the right admin, an escalation matrix is also necessary. An escalation plan ensures that issues are looked at and resolved on time, specifically when the person in charge of that element is not available or takes a long time to resolve the issue. The implementation of a well-thought-out escalation matrix prevents small issues from growing into large-scale, organization-wide problems.
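An escalation matrix can be represented as plain data that the monitoring system consults when an alert stays unacknowledged. The sketch below is illustrative only; the areas, contact names, delays, and function name are hypothetical.

```python
# Hypothetical escalation matrix: for each area, who is notified and after
# how many minutes an unacknowledged alert reaches that level.
ESCALATION = {
    "security": [(0, "security-admin"), (30, "security-lead"), (60, "cio")],
    "systems": [(0, "systems-admin"), (45, "it-manager")],
}

def who_to_notify(area, minutes_unacknowledged):
    """Return everyone who should have been alerted by now."""
    levels = ESCALATION[area]
    return [contact for delay, contact in levels if minutes_unacknowledged >= delay]

print(who_to_notify("security", 0))
print(who_to_notify("security", 35))
```

Keeping the matrix as data rather than logic makes it easy to review and to update when responsibilities change.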
3. Reports at Every Layer
Networks function based on the OSI model, and every communication in a network involves transfer of data from one system to another through various nodes, devices, and links. Each element in the network that contributes to data transfer functions at one of the layers, such as cables at the physical layer, IP addresses at the network layer, transport protocols at the transport layer, and so on.
When a data connection fails, the failure can happen at any one of the layers or even at multiple points. Using a monitoring system that supports multiple technologies to monitor at all layers, as well as different types of devices in the network would make problem detection and troubleshooting easier. Thus, when an application delivery fails, the monitoring system can alert whether it is a server issue, a routing problem, a bandwidth problem, or a hardware malfunction.
4. Implement High Availability with Failover Options
Most monitoring systems are set up in the network they monitor. This allows for quicker and better data collection from monitored devices. But if a problem occurs and the network goes down, the monitoring system can go down too, rendering all the collected monitoring data useless or inaccessible for analysis.
This is why it is recommended to implement a monitoring strategy with High-Availability through failover. High Availability (HA) ensures that the monitoring system does not have a single point of failure and so even when the entire network goes down, the monitoring system is accessible, providing data to the network engineer for issue detection and resolution. One method for HA is failover where the monitoring data collected by an NMS is replicated and stored in a remote site. In case of failure at the primary monitoring system, the failover system can be brought up (or automatically come up) and provide data needed for troubleshooting. And to avoid a single point of failure, it is recommended to set up the failover system at a remote DR site.
5. Configuration Management
Most network issues originate from incorrect configurations. There are several instances where even minor configuration mistakes have led to network downtime or loss of data. For example, when a new service is implemented in the network and firewall rules are being added, the person adding the new firewall rule may end up blocking a business-critical application or allowing non-business traffic.
This is where configuration management is applicable. When configurations are changed on network and security devices, like routers, switches, or firewalls, configuration management lets the network administrator verify that the changes being made do not break an already working feature. Configuration management can also be used to back up working configurations, to make bulk configuration changes that would otherwise take a significant amount of time, and to prevent unauthorized changes. Unauthorized configuration changes to devices can lead to serious security lapses that include hacking and data theft. With configuration management, the admin can keep an eye on who is making a change, what change is being made, and even provide access control to configuration changes.
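Reviewing what a change actually alters is easier when the running configuration is compared against the proposed one. The sketch below uses Python's standard difflib module for that comparison; the function name is hypothetical, and the firewall rules are written in a made-up syntax purely for illustration.

```python
import difflib

def config_changes(running, proposed):
    """Return the lines added ('+ ') or removed ('- ') between two configs."""
    diff = difflib.ndiff(running.splitlines(), proposed.splitlines())
    return [line for line in diff if line.startswith(("+ ", "- "))]

# Hypothetical firewall rules in a made-up syntax.
running = "permit tcp any host 10.0.1.10 eq 443\ndeny ip any any"
proposed = (
    "permit tcp any host 10.0.1.10 eq 443\n"
    "permit tcp any any eq 23\n"
    "deny ip any any"
)
print(config_changes(running, proposed))
```

Here the diff surfaces a single added rule that opens Telnet to every host, the kind of change a reviewer would want to catch before it is applied.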
Configuration management is the proactive part of network monitoring. Furthermore, configuration management helps prevent issues from occurring in the network, rather than alerting about potential problems after they begin.
6. Capacity Planning and Growth
This applies to both the network in general and network management. When an organization grows, the IT infrastructure associated with the organization also should grow. An increase in business or addition of employees for an organization has effects on the number of devices needed, network and WAN bandwidth, storage space, and many more factors.
Monitoring systems allow you to keep tabs on resources in the network. Whether the tools are free, open-source, or licensed, there is always a limit on the number of resources and elements that can be monitored with a specific configuration or installation. In some cases, the server on which the monitoring system is installed may need upgrades to processing power and memory. In other instances, add-on installations may be needed to increase functionality, or the license for the monitoring system may need to be extended.
When setting up a monitoring system, account for future growth. Growth affects the server sizing for installation and the licensing, which controls the number of resources that can be monitored. Separate purchases, upgrades, or even moving to a new monitoring system as the network grows are much more expensive than spending a bit more capex when setting up the monitoring system.