Your server is the backbone of your network, yet it’s often overlooked when it comes to monitoring your business’s operational health. Often, it’s not until a server’s performance has noticeably degraded that anyone thinks to check. Sometimes a change in your server’s health warns of a possible hardware failure; other times, it points to an application flaw.
Considering recent events like the SolarWinds security breach, it’s clear that every type of business needs ongoing server health monitoring. Careful monitoring should identify unusual behavior, such as increased resource consumption, which can be an indicator of malicious activity. The sooner a potential compromise is identified, the sooner it can be contained. With the average time to detect a compromise at over 200 days, every business can benefit from early detection.
What is Server Health?
Server health refers to how well a server functions, but it is more than the performance of hard drives and power supplies. Health checks depend on the server. For example, web servers have different performance metrics than file servers, and email and application servers are not the same as database servers. Each implementation has its own set of evaluation criteria when it comes to server health.
A physical check may include CPU usage, memory availability, and disk capacity. Other checks may encompass connectivity, dependency, and anomaly evaluation.
The checks are designed to establish a baseline using historical data.
The baseline is then used to identify deviations that can be addressed to ensure optimum performance.
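As a rough illustration of the baseline idea, the sketch below summarizes historical samples as a mean and standard deviation and flags a new reading that falls too far from that baseline. The function names, the sample CPU figures, and the three-sigma cutoff are all assumptions made for this example, not part of any particular monitoring product.

```python
from statistics import mean, stdev


def build_baseline(history):
    """Summarize historical samples (at least two) as a mean and standard deviation."""
    return {"mean": mean(history), "stdev": stdev(history)}


def deviates(sample, baseline, max_sigma=3.0):
    """Return True when a sample is more than max_sigma deviations from the mean."""
    if baseline["stdev"] == 0:
        # A perfectly flat history: any different value counts as a deviation.
        return sample != baseline["mean"]
    distance = abs(sample - baseline["mean"]) / baseline["stdev"]
    return distance > max_sigma


# Hypothetical CPU usage percentages collected over time.
cpu_history = [22.0, 25.0, 24.0, 23.0, 26.0, 24.0, 25.0, 23.0]
baseline = build_baseline(cpu_history)

print(deviates(24.5, baseline))  # a reading close to the historical mean
print(deviates(91.0, baseline))  # a reading far above the historical mean
```

A real deployment would likely keep separate baselines per metric and per time of day, since a nightly backup window can look very different from a quiet afternoon.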
Why Are Health Checks Important?
Understanding how servers are functioning is essential to business success. Without regular checks, servers can fail unexpectedly or perform erratically. Unexpected downtime because of a server failure can be costly. It’s estimated unplanned downtime can cost $100,000 per hour.
But what about the impact on customer experience?
How many times have you heard “Sorry, the computer is running slow today” as you wait in line or on the phone? Waiting for the server to respond degrades the customer experience. After all, no one likes to wait. Given that 84% of customers will move to a competitor after three bad experiences, you just can’t afford to have a customer wait because of server performance problems.
Customer experience is not the only concern. Underperforming servers or unpredictable behavior can also be a sign of unauthorized access and use. Cyber security threats continue to rise, with an attempt made every 11 seconds.
Deploying server health checking tools can result in faster detection and remediation.
Given that the average security breach takes over 200 days to detect, using a server health check tool can significantly reduce the time cybercriminals can roam freely through your network.
Server health monitoring provides data that can be used to anticipate problems in the future. By comparing current with historical data, companies can identify potential failures and address them before they impact the bottom line. They can also use monitoring data to make informed decisions about server replacement, performance optimization, and operational adjustments.
Who Should Perform Server Health Checks?
Practically anyone can perform a server health check. It just takes time. Lots of time.
How much time depends on the number of servers and their location. Server health check tools can help you automate the process, but interpreting the data needs a level of expertise. Putting together a comprehensive monitoring plan also requires experience to identify the critical metrics to evaluate.
How to Conduct a Server Health Check
How a health check is conducted depends on the server being tested. Certain physical checks apply no matter the server type; however, an SQL server health check uses different performance metrics than an application server check. An infrastructure check that exercises servers and network functionality should deliver the following:
- Hardware metrics – Fans, power supply, disk drives, CPU, storage, memory, environmental conditions
- Reports – Information on procurement, usage, and status to inform future purchases
- Alarms – Notifications of changes in server health for faster resolution
- Baselines – Historical metrics for setting thresholds for alerts
- Visualization – Graphical representation instead of just reports to provide a quick server health assessment
Once the measurable metrics are established, thresholds are set using historical data so that alarms can be triggered.
From the data, informed decisions can be made to improve network performance.
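One way to turn historical metrics into alarm thresholds is to take percentiles of past readings. The sketch below is only an illustration: the choice of the 95th and 99th percentiles, the function names, and the sample history are assumptions, and a monitoring tool may set thresholds quite differently.

```python
from statistics import quantiles


def thresholds_from_history(history, warn_pct=95, crit_pct=99):
    """Derive warning and critical thresholds from percentiles of past readings."""
    # quantiles(..., n=100) returns 99 cut points; cuts[k - 1] is the k-th percentile.
    cuts = quantiles(history, n=100)
    return {"warning": cuts[warn_pct - 1], "critical": cuts[crit_pct - 1]}


def alarm_level(sample, thresholds):
    """Classify a new reading as 'ok', 'warning', or 'critical'."""
    if sample >= thresholds["critical"]:
        return "critical"
    if sample >= thresholds["warning"]:
        return "warning"
    return "ok"


# Hypothetical history: one reading for each value from 1 to 100.
history = [float(value) for value in range(1, 101)]
limits = thresholds_from_history(history)

for reading in (50.0, 97.0, 100.0):
    print(reading, alarm_level(reading, limits))
```

Percentile thresholds adapt to each server’s normal range, which tends to produce fewer false alarms than a single fixed limit applied to every machine.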
What Should Be Evaluated
Although the checks may vary, here are some essential server health assessments to conduct.
Servers are part of the network’s infrastructure, so their ability to connect is a critical metric to check. These checks may be performed using a load balancer or external monitoring agent. At a minimum, the tests should involve:
- Confirming that the server is listening on the expected port and that new connections are being established
- Performing HTTP requests to ensure the server responds within baseline parameters
- Checking that basic statuses are being sent
- Pinging the server as a simple test of whether the configuration is viable
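The first two tests in the list can be sketched with Python’s standard library alone. The function names, the default timeouts, and the one-second response limit below are assumptions for illustration; in practice the limit would come from your own baseline.

```python
import socket
import time
import urllib.error
import urllib.request


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_check(url, max_seconds=1.0, timeout=5.0):
    """Request url and report the status code, elapsed time, and a health flag."""
    started = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        status = err.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        status = None  # no usable answer at all
    elapsed = time.monotonic() - started
    healthy = status is not None and 200 <= status < 400 and elapsed <= max_seconds
    return {"status": status, "seconds": elapsed, "healthy": healthy}


# Example usage against a hypothetical host:
# print(port_is_open("server.example.com", 443))
# print(http_check("https://server.example.com/health"))
```

Running the same checks from an external agent as well as from inside the network helps separate a server problem from a routing or firewall problem.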
Local Health Checks
These health checks go beyond uptime checks. They verify that applications can operate on the server and that the resources needed for application performance are available. These checks include:
- Read and Write To Disk – Most applications write to disk for logging or error tracking. Assuming that disk access is not required can lead to fatal consequences when software attempts to access a resource that is not available.
- Processes Functioning – Liveness checks may test proxy processes, but they may not check the proxy and application link. Local health checks go beyond the basic check to ensure that the processes are running and responding correctly.
- Missing Processes – Ensure that support processes are operational. If monitoring doesn’t go deep enough to check support processes, organizations run the risk of having a service fail. Sometimes these failures are difficult to detect and take longer to remedy.
Performing checks local to the server ensures that the server is executing as it should.
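The read-and-write-to-disk check described above can be as small as writing a temporary file, reading it back, and cleaning up. This is a minimal sketch; the function name and the choice of directory to test are assumptions, and you would point it at whichever directories your applications actually log to.

```python
import os
import tempfile


def disk_is_writable(directory):
    """Write a small file in directory, read it back, and remove it."""
    payload = b"health-check"
    try:
        # delete=False so the file can be reopened by name on every platform.
        with tempfile.NamedTemporaryFile(dir=directory, delete=False) as handle:
            handle.write(payload)
            path = handle.name
        try:
            with open(path, "rb") as handle:
                return handle.read() == payload
        finally:
            os.remove(path)
    except OSError:
        # Missing directory, full disk, or insufficient permissions.
        return False


print(disk_is_writable(tempfile.gettempdir()))
```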
Dependency health checks inspect the interactions among servers. For example, an application may need to send data to the SQL server. If the two servers fail to interact, the application may fail. Dependency checks can catch expired credentials or misconfigured servers that prevent an application from interacting with a database server. Dependency checks may include:
- Configuration or Metadata – Checking for misconfigurations can catch disconnects that lead to unpredictable behavior. For example, automated updates may stop working on a dependency server without the server being able to determine why. Finding misconfigurations or missing metadata helps ensure that servers continue to perform as needed.
- Communication – When servers can’t communicate with one another, the result may be difficult-to-detect discrepancies that lead to network instability.
- Software Flaws – Faulty software applications can lead to memory leaks or data corruption that impact server performance. Checking that server performance is being maintained reduces the chances of fatal errors.
As networks become more complex, the interdependency of servers becomes critical to successful operations. Ignoring that dependency can have ramifications that far exceed the server-specific error.
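A dependency check often boils down to running several small checks and reporting every failure, not just the first one. The sketch below shows that pattern under stated assumptions: `run_dependency_checks`, `all_healthy`, and the two placeholder checks are hypothetical names, and the simulated credential failure stands in for whatever a real database client would raise.

```python
def run_dependency_checks(checks):
    """Run each named check and collect every result instead of stopping early."""
    results = {}
    for name, check in checks.items():
        try:
            ok = bool(check())
            detail = "ok" if ok else "check returned a failing result"
        except Exception as err:
            # A check that crashes counts as a failed dependency.
            ok = False
            detail = f"{type(err).__name__}: {err}"
        results[name] = {"ok": ok, "detail": detail}
    return results


def all_healthy(results):
    """Return True only when every dependency check passed."""
    return all(entry["ok"] for entry in results.values())


def database_reachable():
    return True  # placeholder for a real connection test


def credentials_valid():
    raise PermissionError("credentials expired")  # simulated failure


results = run_dependency_checks(
    {"database": database_reachable, "credentials": credentials_valid}
)
for name, entry in results.items():
    print(name, entry["ok"], entry["detail"])
print(all_healthy(results))
```

Keeping the failure detail alongside each result makes it easier to tell an expired credential from an unreachable server when an alert fires.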
Checking whether a server is behaving differently from its baseline or from similar servers in the network should be part of any server performance monitoring.
These checks can identify such anomalies as:
- Clock Skewing – Many server and application functions depend on the server’s clock for executing code. If the clock is off, the system may fail, or the application may return an invalid response. For example, time limits on resetting passwords can result in user frustration if the clocks do not agree. In some instances, the difference may result in a system shutdown.
- Outdated Software – Bringing a server online, especially one that has been disconnected for a while, can introduce errors. Confirming that the server is up to date may not catch every possible error. Checking for anomalies can help identify software that is still outdated.
- Failures – Anomaly checking can be a last line of defense for problems that may impact performance. Although perfect performance is the ideal, hardware and software rarely reach that goal. As a result, it is always prudent to check for aberrant behavior.
Anomalies occur for multiple reasons, many of which may not even be defined. That’s why checking for unusual behavior is essential to server performance.
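Clock skew is one anomaly that is easy to sketch: compare each server’s reported time against the median of its peers and flag the outliers. The example assumes the timestamps were collected at effectively the same moment; the server names, the five-second tolerance, and the function name are made up for illustration.

```python
from statistics import median


def clock_skew_outliers(reported_times, max_skew_seconds=5.0):
    """Return servers whose clock differs from the peer median by more than the limit."""
    reference = median(reported_times.values())
    return {
        name: timestamp - reference
        for name, timestamp in reported_times.items()
        if abs(timestamp - reference) > max_skew_seconds
    }


# Hypothetical timestamps (in seconds) reported by four servers at the same moment.
reported = {"web-1": 1000.0, "web-2": 1000.4, "web-3": 1042.0, "db-1": 999.8}
print(clock_skew_outliers(reported))
```

Using the median as the reference keeps a single badly skewed clock from dragging the comparison point with it, which an average would allow.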
How Often to Check a Server’s Health
The short answer to how often to check a server’s health is continuously. Unfortunately, 24/7 monitoring can absorb an entire IT support team’s resources. Easy-to-use server monitoring tools are available to help with monitoring and troubleshooting servers. Whether it is a web server or a file server, monitoring tools exist that can help maintain optimum performance.
Tools for Server Health Checks
Monitoring a server’s health should be part of any maintenance plan. Without the details that come from monitoring and checking the infrastructure, companies are leaving themselves open to system failure or compromise. The following three tools are examples of available solutions.
PRTG is a network monitoring tool suitable for companies of all sizes, with the ability to scale to meet the needs of an enterprise. It does more than monitor a network’s infrastructure. It can check:
- CPU load
- Hard disk capacity
- Overall performance
- RAM usage
Customizable dashboards and reports let administrators see their server environment in one place. Adding graphs and analytics displays makes it easy for IT personnel to respond to deviations.
Templates can speed the creation of dashboards and reports.
In addition to its health checks, the monitoring tool delivers the following:
- Flexible alerts
- Customizable user interfaces
- Failover-tolerant monitoring
- Distributed monitoring
- Customizable mapping
- Dynamic setup
The solution’s monitoring capabilities are designed to adjust to a company’s business requirements.
Datadog provides monitoring, analytics, and security tools for developers, security engineers, and IT departments working with cloud-based infrastructure. It combines and automates application performance monitoring, log management, and infrastructure monitoring. It offers dashboards, customizable alerts, and integrations.
The solution is a cloud-hosted model with the following features:
- Customizable views
- Aggregated metrics and events
- Automation tools
- Source control
- Bug tracking
- Common server components
- Monitoring and instrumentation
- Database monitoring
Datadog provides a server monitoring solution for development shops looking to incorporate source control and bug tracking into a single system.
Observium monitors network equipment and servers. Once configured, it detects network devices and can collect and display information on each port. The tool supports a long list of devices using the SNMP protocol. The solution requires its own server with a dedicated URL.
The graphical user interface offers statistical displays, diagrams, and graphs. It shows information on:
- Power supply
Data collection can extend to Apache, MySQL, BIND, and Postfix.
With its auto-discovery capabilities, it expedites the installation and configuration of networks as well as the addition of devices.
Navigating the world of server health checks can be overwhelming. Whether you perform them yourself or partner with an experienced IT service provider, health checks are a vital part of maintaining a safe and secure operating environment. Contact us to discuss how we can get server health checking tools in place.