
AlertBot System Redundancy and Scalability Explained

The AlertBot test systems are engineered to achieve the highest possible level of redundancy, scalability, and distribution. Simply stated, our goal is "no single or multiple point of failure." This article explains some of the design decisions that have helped us achieve this goal. Please note that it is written from a semi-technical standpoint.

There are two layers of systems in our design. The first layer is the Test Stations, which are located in multiple locations around the world. Each Test Station is a cluster of servers whose purpose is to test our customers' sites and servers using the test data the customer provides. These distributed Test Stations operate independently of each other. If one or more Test Stations go offline, the remaining Test Stations automatically pick up the slack, with no interruption to customers.
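As a rough illustration of how the remaining Test Stations can absorb the work of one that is offline, here is a minimal sketch. It is not AlertBot's actual implementation: the station names, the test names, and the round-robin assignment are all assumptions made for the example.

```python
from itertools import cycle


def assign_tests(tests, stations, online):
    """Spread tests across the stations that are currently online."""
    available = [s for s in stations if s in online]
    if not available:
        raise RuntimeError("no Test Station is online")
    rotation = cycle(available)  # simple round-robin, an assumed policy
    return {test: next(rotation) for test in tests}


# Hypothetical station and test names, for illustration only.
stations = ["us-east", "eu-west", "ap-south"]
tests = ["check-1", "check-2", "check-3", "check-4"]

all_up = assign_tests(tests, stations, online=set(stations))
one_down = assign_tests(tests, stations, online={"us-east", "ap-south"})
```

With "eu-west" offline, every test is still assigned, only to the two stations that remain.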

The second layer of systems is referred to as the sub systems. These sub systems are hosted at different locations from the Test Stations themselves. They are responsible for scheduling and tracking every Test and Retest across all Test Stations in our network, and for alerting customers of failures. Because the sub systems track every test in our network, any Test that fails or is not completed is automatically and independently rerouted to another Test Station, no matter what caused the Test Station interruption.
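The rerouting idea can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the real scheduler: `run_on` is a hypothetical callable standing in for "send this test to that station", and the station names are invented.

```python
def run_with_reroute(test, stations, run_on):
    """Try each station in turn until one completes the test.

    run_on(station, test) is a hypothetical callable: it returns True
    when the station completes the test, and returns False or raises
    ConnectionError when it does not.
    """
    attempts = []  # the tracking record: (station, completed) pairs
    for station in stations:
        try:
            completed = bool(run_on(station, test))
        except ConnectionError:
            completed = False
        attempts.append((station, completed))
        if completed:
            return station, attempts
    return None, attempts


def fake_run(station, test):
    # Simulated behaviour: the first station is unreachable.
    if station == "us-east":
        raise ConnectionError("station unreachable")
    return True


station, attempts = run_with_reroute(
    "check-1", ["us-east", "eu-west", "ap-south"], fake_run
)
```

Because every attempt is recorded, the tracker knows both which station finally completed the test and which ones did not.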

Even at the sub system level, no one server or location is solely responsible for a particular task. Many servers work together independently over a distributed network to maintain the highest level of redundancy and scalability. If any server in this sub system fails, its duties and responsibilities are automatically picked up by the other servers that make up the sub system.
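One way to picture "no one server is solely responsible" is a map in which every duty is covered by more than one server. The sketch below is illustrative only: the server names and the duty-to-server mapping are assumptions, not a description of the actual deployment.

```python
def coverage(duty_map, healthy):
    """For each duty, list the healthy servers that can still perform it."""
    return {
        duty: [server for server in servers if server in healthy]
        for duty, servers in duty_map.items()
    }


# Hypothetical layout: every duty is shared by two servers.
duty_map = {
    "scheduling": ["sub-a", "sub-b"],
    "tracking": ["sub-b", "sub-c"],
    "alerting": ["sub-a", "sub-c"],
}

# Simulate the failure of "sub-b".
remaining = coverage(duty_map, healthy={"sub-a", "sub-c"})
all_covered = all(remaining.values())
```

In this layout, losing any single server still leaves at least one healthy server for every duty.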

We know that customers like you depend on us and trust us to alert you when your sites and servers fail. We know how critical this task is to you, and we take great responsibility in making sure that, no matter what, we are always up and running. That is a commitment our customers can depend on.

Updated on: 10/24/2022
