SFT III-PEAT, Part Three

by Wayne M. Krakau - Chicago Computer Guide, October 1993

On with the continuing saga of NetWare V3.11 SFT III, Novell's maximum-reliability system. This article goes beyond theory into the real world.

The first practical aspect of the design process for an SFT III LAN is obtaining identical computers to use as mirrored servers. When I say identical here, I really mean it. If the machines have even the slightest difference in memory, architecture, or BIOS (Basic Input/Output System) chips, the system might not work. The machines that I chose for my client's LAN, DTK 486/66 EISA Towers, were actually made side by side, one right after the other. I know this because I watched them being completed! That's how I made sure that they were identical.

This undocumented requirement for absolutely identical machines precludes the conversion of an existing single server or workstation into an SFT III mirrored server. Considering the rate at which undocumented changes and fixes are incorporated into motherboards and BIOS chips during the modern PC manufacturing process, I personally wouldn't bother trying to use a pair of machines if I could not confirm that they came off the assembly line at essentially the same time. Every vendor I have dealt with during this project has mentioned this undocumented restriction as being essential. Violating it would leave you with an unstable system that might work now, but will inherently be unsupportable in the future when, not if, it starts acting "funny".

This restriction applies even more stringently to the individual components inside the servers. There had better be an exact match between the specific network interface cards, the mirrored server link cards, the disk controller cards, the disk drives, and the video cards. Note that this advice contradicts Novell's. Novell feels that mismatched systems are viable, but the evidence that I have seen refutes that theory. While there are reports of working mismatched systems, I personally wouldn't want to get involved with such projects.
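
One way to keep track of the matching requirement is to record an inventory for each server and compare the two before installation. The following is a minimal sketch in Python, not a Novell tool. The component names and revision strings are hypothetical placeholders; a real inventory would list whatever your vendors can actually report.

```python
# Hypothetical inventory records; every field name and value here is
# illustrative only and does not describe any real product revision.
def find_mismatches(server_a, server_b):
    """Return a list of (component, value_a, value_b) for entries that differ."""
    mismatches = []
    # Walk the union of keys so a component missing from one server is caught.
    for component in sorted(set(server_a) | set(server_b)):
        value_a = server_a.get(component)
        value_b = server_b.get(component)
        if value_a != value_b:
            mismatches.append((component, value_a, value_b))
    return mismatches


primary = {
    "bios_revision": "rev-A",
    "memory_mb": "32",
    "network_card": "nic-model-1",
    "msl_card": "msl-model-1",
    "disk_controller": "ctrl-model-1",
    "video_card": "video-model-1",
}
secondary = dict(primary, bios_revision="rev-B")  # one deliberate difference

for component, a, b in find_mismatches(primary, secondary):
    print(f"MISMATCH {component}: {a} vs {b}")
```

An empty result only means that the recorded fields agree. It cannot detect undocumented board-level changes, which is why watching the machines being built remains the safer check.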

The software that makes these devices work, called drivers, must be the very latest versions. This means that each individual manufacturer must be contacted to obtain the latest driver, either from CompuServe or from its corporate bulletin board system (BBS). Merely browsing through a BBS is not enough. Many manufacturers hide the latest releases until they have been "wrung out" by chosen customers. You must call to find out the latest file name and, if necessary, the password needed to download it. Only after you obtain the latest edition of these drivers can you begin the process of configuring your SFT III system. (This is also a pretty good idea for even a plain vanilla NetWare system.)

The cable between the two servers, called the Mirrored Server Link, or MSL, will be carrying data at 100 Mbps (100 million bits of data per second). Because of this, it must be held to a much higher standard of workmanship than a standard 10 Mbps Ethernet cable or a 16 Mbps Token Ring cable. If copper (as opposed to fiber) cable is used, for instance, you must be aware that the standard for how much bare wire can appear prior to a termination point has much less leeway than the standards for slower networking technologies.

Typically, this link is a direct machine-to-machine connection without an intervening concentrator. That means that one connector must be reverse-wired in a manner similar to a null modem. That is, the send and receive conductors are reversed on one end only. Be careful, and follow the manufacturers' recommendations precisely.
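
The null-modem idea can be shown with a toy model in Python. This is only a sketch of the principle: the signal names "send" and "receive" are simplified assumptions, and the actual conductor assignments must come from the MSL card manufacturer.

```python
# Toy model of a two-signal cable. Each tuple is (signal at end A, signal at end B).
# The names are simplified assumptions, not real connector pin assignments.
SWAP = {"send": "receive", "receive": "send"}


def reverse_one_end(cable):
    """Swap send and receive on end B only, leaving end A untouched."""
    return [(end_a, SWAP[end_b]) for end_a, end_b in cable]


def direct_link_works(cable):
    """A direct link needs each transmitter wired to the peer's receiver."""
    return all(SWAP[end_a] == end_b for end_a, end_b in cable)


straight = [("send", "send"), ("receive", "receive")]
crossed = reverse_one_end(straight)

print(direct_link_works(straight))  # straight-through: transmitters face each other
print(direct_link_works(crossed))   # reversed on one end: send meets receive
```

Applying the reversal to both ends undoes it and gives a straight-through cable again, which is why only one connector is reverse-wired.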

The most interesting suggestion that I have is to ignore one critical point in the documentation. The documentation explicitly states that the ACTIVATE SERVER command should never be placed in either of the IOSTART.NCF initialization files. On NetWire and via 1-800-NETWARE, I was advised to arbitrarily choose one server and put the ACTIVATE SERVER command only in its IOSTART.NCF file. That is a major contradiction. My advice to you, given careful planning and proper training, is to place an ACTIVATE SERVER command in both IOSTART.NCF files!

The documentation was written with the idea that an embedded ACTIVATE SERVER command would cause a server that was supposed to come up as a secondary to initialize instead as another primary server with exactly the same identity as the first primary. This would cause the entire internet (all directly or indirectly connected servers and workstations) to crash. The problem is that, without automating the ACTIVATE SERVER command, the failure of both servers due to a sustained power failure would cause the system to stay down until supervisory personnel manually typed "ACTIVATE SERVER". The documentation assumes that both servers would never go down at the same time! I guess there are no sustained power outages in Utah (Novell's home).

The technical support people on NetWire and 1-800-NETWARE seem to have less fear of the possibility of duelling primary servers. However, they make assumptions about the predictability of future errors, believing that the system will react differently depending upon which physical server crashes. Again, the continuous availability of technical personnel, needed to restart the system manually, is assumed. That last point is my interpretation, since I can't believe that any company with data valuable enough to protect with SFT III would allow non-technical staff to get near their servers, much less lay hands upon them. The concept of a LAN running completely unattended (quite a common occurrence) seems utterly beyond the comprehension of the documentation writers and the technical support people.

Here is my plan. Adjust the allowed uptime and the recovery time parameters in your UPS (Uninterruptible Power Supply) control software so that one server always goes down and later comes back up first. (You ARE using UPSs on all of your servers, aren't you?) This machine becomes your Preferred Primary Server (my term, not Novell's). The idea here is to prevent the two servers from starting at about the same time, within a narrow range.

If they do start within a short period (about one minute when using 1 GB drives), both will come up as primary servers, crashing the internet. If, however, one is already up as the primary when the second wakes up, the ACTIVATE SERVER command will automatically abort and that machine will come up as the secondary server! This effect is dependent upon the existence of a valid mirrored server link, so if your link is defective in any way, the servers won't see each other and mayhem will result.
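
The behavior just described can be summarized as a small decision model in Python. It is a sketch of the observations above, not of NetWare's internal logic. The 60-second window is taken from the "about one minute" figure and will vary with hardware, and the function and parameter names are hypothetical.

```python
# Rough model of the startup outcome described above. The window is an
# assumption based on "about one minute"; measure your own systems.
RACE_WINDOW_SECONDS = 60


def startup_roles(start_a, start_b, link_ok=True, window=RACE_WINDOW_SECONDS):
    """Return (role of A, role of B) given each server's start time in seconds."""
    # With a bad link, or starts too close together, neither server sees an
    # established primary, so both ACTIVATE SERVER commands succeed.
    if not link_ok or abs(start_a - start_b) < window:
        return ("primary", "primary")  # duelling primaries
    # Otherwise the later server finds a primary and comes up as secondary.
    if start_a < start_b:
        return ("primary", "secondary")
    return ("secondary", "primary")


print(startup_roles(0, 300))                  # staggered by the UPS settings
print(startup_roles(0, 30))                   # too close together
print(startup_roles(0, 300, link_ok=False))   # defective mirrored server link
```

The model makes the purpose of the UPS adjustment plain: the gap between the two recovery times should exceed the race window by a comfortable margin, and the link must be verified before relying on any of this.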

In addition to the adjustment of UPS parameters, procedural and training issues are raised. You must fully document the proper procedure for bringing up the system from scratch (when both servers are deactivated) and train your personnel appropriately. The first step is to start up the arbitrarily chosen Preferred Primary Server. Only after that server is completely awake, with all NLMs (NetWare Loadable Modules) loaded, is it altogether safe to start the Preferred Secondary Server. This procedure eliminates the possibility of duelling primary servers.

Implementing my suggestion will result in a system that will restart by itself under almost any conditions, a situation considerably more fault tolerant than that obtained by using either the documented or the Novell-suggested procedures.

One final warning about NetWare V3.11 SFT III is in order. It is far closer to the bleeding edge of technology than Novell would like to admit. The system that I have configured, and some others mentioned on NetWire, still don't work. The primary server has been chugging along happily for over two months, but the secondary server hasn't talked to it in weeks. The vendors involved are cooperating in finding the problem via Novell's Technical Support Alliance, but nothing has worked yet. I will publish the final results of this debugging effort in future columns. Meanwhile, I expect the real world to catch up to the theoretical soon, providing a premium fault-tolerant option for critical systems.

                                    © 1993, Wayne M. Krakau