Genii Weblog
Resilience to failure
Fri 19 Dec 2008, 08:12 AM
by Ben Langhinrichs
One of the hallmarks of well-designed software is its resilience to failure, not just its avoidance of failure as people commonly think. Obviously, it is best if software doesn't fail, and a huge amount of time, effort and engineering goes into design, quality assurance and testing. But what happens when there is a failure anyway? This can be very hard to handle, as one has to imagine the situation where the unimagined happens. If you can foresee a failure, you can usually design around it, but if you have not foreseen it, handling it is more of a challenge.
A good example of this is Lotus Domino mail routing. As mail routing has been a key feature of the Lotus Notes product since before the server was even called Domino, just about everything that can be foreseen has been, but there are still failures, whether due to new challenges such as badly constructed MIME or to simple "random ionic bombardment", which is how we used to describe those bugs that are not repeatable or even explainable. This is where Domino's maturity kicks in. Without compromising speed, the mail router has a clever hand-off mechanism to ensure that a message is not removed until it has been verifiably passed on (and is now some other server's responsibility). If a crash happens, the mail won't get lost. If there is a massive power failure, the mail won't get lost. If there is an earthquake and the data center is in California and the building sinks into the ground, the mail might be lost, but only if the server itself can't be recovered.
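The hand-off idea can be shown in a few lines of Python. This is only a sketch of the general principle, not Domino's actual router: the `Outbox` class, its `route` method and the `send` callback are all hypothetical names, and a plain dictionary stands in for durable storage.

```python
from dataclasses import dataclass, field


@dataclass
class Outbox:
    """Hypothetical outbound queue; a dict stands in for durable storage."""

    pending: dict = field(default_factory=dict)

    def enqueue(self, msg_id, body):
        self.pending[msg_id] = body

    def route(self, send):
        """Try to hand off every pending message and return the ids delivered.

        `send(msg_id, body)` is assumed to return True only once the next
        hop has confirmed that it holds the message. Any other result,
        including an exception, leaves the message queued for a later try.
        """
        delivered = []
        for msg_id, body in list(self.pending.items()):
            try:
                acknowledged = send(msg_id, body)
            except Exception:
                acknowledged = False  # a failed hop counts as "no confirmation"
            if acknowledged:
                del self.pending[msg_id]  # remove only after confirmation
                delivered.append(msg_id)
        return delivered


def failing_hop(msg_id, body):
    raise ConnectionError("next hop went away")


def healthy_hop(msg_id, body):
    return True


outbox = Outbox()
outbox.enqueue("m1", "hello")
outbox.route(failing_hop)  # the message stays in outbox.pending
outbox.route(healthy_hop)  # the message is handed off, then removed
```

The point of the design is the order of operations: the message is deleted after the confirmation, never before, so a failure at any moment leaves at least one copy somewhere.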
This may seem obvious - most elegant systems do - but many mail systems don't work this way. They send the mail on, then hope it got to the other side. Microsoft's Exchange Connector has long been plagued with just this problem. The Exchange Connector takes the message, thus relieving the Domino router of its responsibility, and then if the Exchange Connector machine dies or backs up or freezes, all of which happen distressingly often, the mail is lost. Simply, irretrievably, silently lost. (One of the huge advantages of CoexLinks over the years has been the simple fact that it works on the Domino server, and thus gets to rely on the Domino server's reliability. It is always good to tie your cart to a strong horse.)
All of this is a long-winded way of getting to a bit of relief I had yesterday. Somebody used the Lotusphere Sessions database to ask a question, and a good question at that, and CoexEdit failed. It didn't crash or do anything dire; it just happened to be using an evaluation license on Tranquility/TurtlePublic, the public server, rather than a production license. Nonetheless, the process failed in an unanticipated way, thus giving a bit of insight into CoexEdit's resilience to failure.
Fortunately, CoexEdit passed the test. The software is designed so that the synchronization between HTML and rich text happens as a person saves, but if it doesn't happen then, it simply waits silently for the next time somebody opens the document using CoexEdit, or runs a process with CoexEdit. In this case, when I replicated to a local workstation with CoexEdit, the synchronization happened, and now the message saved on the system where CoexEdit failed is back and working, without any intervention. The Turtle Partnership made sure that CoexEdit was up and running again, so the failure is unlikely to occur again, but it is a relief to know that if it does, our software will recover on its own.
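The "try on save, catch up on the next open" pattern described above might look something like the following Python sketch. It is an illustration under stated assumptions, not CoexEdit's code: `Document`, `synchronize`, `save`, `open_document` and the `convert` callback are hypothetical names, and an `in_sync` flag stands in for however the real product detects that a document still needs synchronizing.

```python
from dataclasses import dataclass


@dataclass
class Document:
    """Hypothetical document holding both representations."""

    html: str
    rich_text: str = ""
    in_sync: bool = False


def synchronize(doc, convert):
    """Try to rebuild rich_text from html; report success and never raise."""
    try:
        doc.rich_text = convert(doc.html)
        doc.in_sync = True
    except Exception:
        doc.in_sync = False  # keep the saved content and wait for a retry
    return doc.in_sync


def save(doc, html, convert):
    """Store the new HTML first, then attempt the synchronization."""
    doc.html = html
    doc.in_sync = False
    synchronize(doc, convert)


def open_document(doc, convert):
    """Catch up on any synchronization that was missed at save time."""
    if not doc.in_sync:
        synchronize(doc, convert)
    return doc


def unlicensed_convert(html):
    raise RuntimeError("conversion unavailable on this server")


def working_convert(html):
    return html.upper()  # placeholder for a real HTML to rich text conversion


doc = Document(html="")
save(doc, "<p>question</p>", unlicensed_convert)  # saved, but not synchronized
open_document(doc, working_convert)  # synchronized on the next open
```

The saved HTML is never discarded when the conversion fails, which is what allows a later open on a working system to finish the job without anyone stepping in.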
Copyright © 2008 Genii Software Ltd.