Our users and prospective clients often ask us about the reliability of the infrastructure that we at Cobase use to store and manage our customers’ information. This is a legitimate concern: after all, a client of Concept is handing over complex, rich, unique data to be stored and managed by remote computing infrastructure about which little is known. Our clients’ business continuity, and ultimately their success, depend on the availability, reliability and predictability of this infrastructure.
Delivering an available, reliable and predictable service is a complex endeavour. In this post I will focus on one particular aspect: the mechanisms that we at Cobase use to monitor the computers that store, manage and deliver your data to you over the web.
These computers are collectively called the “server pool”, and are located in a secure, ISO 9001- and ISO 27001-compliant data centre in the United Kingdom. When we designed the architecture of the server pool, we had two major goals in mind:
- Extremely rapid detection of server health issues, such as low network, disk or CPU performance.
- Continuous exploitation of improvement opportunities in the server pool configuration, such as load balancing or database server to backup server mapping.
To meet these goals, we took a two-pronged approach. First, we deployed a well-proven, mature, commercial monitoring system that constantly measures thirty-three different parameters of each server in the pool, aggregates the results, and reports back to an alert centre if any of the parameters falls below a certain threshold. This is what we call reactive monitoring, and it allows our support engineers to react immediately to lower-than-normal performance.
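In essence, reactive monitoring boils down to comparing each measured parameter against its threshold and raising an alert when one falls short. The sketch below illustrates the idea; the parameter names and threshold values are assumptions for illustration, not our actual configuration.

```python
# Hypothetical minimum acceptable values for a few server parameters.
THRESHOLDS = {
    "cpu_idle_pct": 20.0,         # alert if idle CPU drops below 20%
    "free_disk_pct": 10.0,        # alert if free disk drops below 10%
    "net_throughput_mbps": 50.0,  # alert if throughput drops below 50 Mbps
}

def check_server(name, measurements):
    """Return alert messages for every parameter below its threshold."""
    alerts = []
    for param, minimum in THRESHOLDS.items():
        value = measurements.get(param)
        if value is not None and value < minimum:
            alerts.append(f"{name}: {param}={value} below threshold {minimum}")
    return alerts

# Example: a server with low idle CPU but healthy disk space.
print(check_server("db-01", {"cpu_idle_pct": 12.0, "free_disk_pct": 35.0}))
```

A real alert centre would of course poll servers continuously and route these messages to on-call engineers; the comparison logic, however, is as simple as shown.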
Second, we developed a custom-made proactive monitoring toolset that examines the server pool, compiles various health indicators from each server, and presents them as aggregated health levels through a web-based user interface. This toolset is aware of the particularities of how Concept works, and it uses this knowledge to interpret the raw performance data it collects in ways that make it more meaningful to the support engineer than any generic commercial tool could achieve. For example, the proactive monitoring toolset interprets raw measurements of free disk space and disk queue length obtained from database servers by taking into account the architecture of the massively scalable database subsystem of Concept, and translates them into easily understandable disk health levels. This lets our support team look at a monitoring dashboard that offers a global picture of the server pool’s health, all condensed into a single screen.
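Condensing many indicators into one level per server requires an aggregation rule. A natural choice, sketched below under the assumption that a server is only as healthy as its worst indicator, uses the four health levels described in the next paragraph (the worst-of rule itself is an illustrative assumption, not necessarily our exact algorithm).

```python
# Health levels, ordered from best to worst.
LEVELS = ["Optimal", "Acceptable", "At Risk", "Critical"]

def aggregate_health(indicator_levels):
    """A server's overall health is its worst individual indicator."""
    return max(indicator_levels, key=LEVELS.index)

# Example: one Acceptable indicator drags the whole server to Acceptable.
print(aggregate_health(["Optimal", "Acceptable", "Optimal"]))
```

The dashboard then only has to render one colour-coded level per server, which is what makes a single-screen overview of the whole pool possible.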
Possible health levels are Optimal, Acceptable, At Risk and Critical. Any server with an Acceptable health level gets a raised eyebrow and a quick check aimed at determining what the issue is. We consider an Acceptable health level tolerable if the reason why it is not Optimal is well known. For example, a server whose free disk space drops below 75% automatically displays an Acceptable (rather than Optimal) health level; if the reason for this (e.g. the disk being only 68% free) is known, then the condition is considered tolerable, and a non-urgent action is started to remediate it (e.g. moving data out or expanding the server pool).
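Using free disk space as the running example, the translation from a raw measurement to a health level can be sketched as a simple banded classification. Only the 75% Optimal/Acceptable boundary comes from the description above; the lower two boundaries are hypothetical values chosen for illustration.

```python
def disk_health(free_pct):
    """Translate free disk space (%) into one of the four health levels."""
    if free_pct >= 75.0:
        return "Optimal"
    if free_pct >= 40.0:   # assumed boundary, not our actual threshold
        return "Acceptable"
    if free_pct >= 15.0:   # assumed boundary, not our actual threshold
        return "At Risk"
    return "Critical"

# Example from the text: a disk that is only 68% free is Acceptable.
print(disk_health(68.0))
```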
On the other hand, a server with an At Risk health level gets immediate, urgent attention aimed at diagnosing and fixing the issue, whatever it may be. As for the Critical health level, we have yet to see our first one.
In summary, by combining automatic, generic reactive monitoring with Concept-specific proactive monitoring, we are able to rapidly detect any performance issues in the server pool, as well as continuously exploit opportunities to keep the health of our servers at optimal levels.