This financial firm decided to build its own redundant WAN routers. Here's a no-nonsense look at the tricky parts and how it all worked.
Last summer, the investment company I work for decided to build a disaster recovery site. This location, 40 miles from New York City, would provide a mirror of our Downtown Manhattan operation. We decided to utilize Linux as much as possible for this project for the following reasons:
We primarily are a Linux operation already, so we could use our existing experience.
We could customize the configuration as much as we wanted, because everything would be open source.
We hoped Linux solutions would be less expensive than other solutions, such as Cisco's.
In this article, I focus on our use of Linux in wide-area network (WAN) routers. I define a WAN router as a system that connects to both wide-area links, such as T1 or T3 lines, and local-area networks, such as 100baseT, and forwards packets between the two networks.
We purchased dedicated connections because this is a disaster recovery site and we need the connections to be as reliable as possible. Based on our calculations, one T3 (45Mb/sec) and four T1 (1.544Mb/sec each) lines would provide sufficient bandwidth for our operations. Ultimately, we decided to use the T3 link as the primary connection and leave the T1s as a bonded 5.7Mb/sec backup link.
The choice of WAN connectivity determined our network design. For redundancy, we installed two WAN routers at each site. The routers are identical and contain hardware to connect to both the T1 and T3 links. With the use of splitter hardware, we hoped to connect all the WAN links to all the routers, as shown in Figure 1. However, that design ultimately turned out to be extremely difficult to implement, due to technical issues I discuss below.
In addition to the WAN links, we also connected the remote site to the Internet through the hosting company backbone. We operated on the principle that more connectivity was better, and this turned out to be useful when we were designing the network. There's nothing like accidentally bringing down your T3 with a mistyped command to make you appreciate a back door to your routers over the Internet.
Our space for servers at the hosting company was limited to one standard rack. This put space at a premium, because we needed to install a lot of servers. Thus, we decided to use 1U systems for the WAN routers. This was a difficult decision to make, as hardware options are limited in that form factor. In retrospect, it would have been much easier to use 2U systems for the WAN routers.
The next step was the selection of T1 and T3 interface cards. The main choice here is whether to use a card with an integrated channel service unit/data service unit (CSU/DSU) that connects directly to the incoming WAN circuit or a card with a high-speed serial connection along with a standalone CSU/DSU. Given our space constraints, an integrated card made the most sense. In previous WAN installations, we used Cisco 2620 router boxes with T1 cards installed. However, that was not appropriate for this project because we wanted to connect multiple T1 and T3 lines.
After much searching, the only vendor we found that could supply both T3 and multiport T1 cards was SBE, Inc. The market for these cards is small and the number of vendors is limited. My suggestion for finding WAN cards is to start asking tech support a lot of questions and see how they respond. Also, carefully look over the driver and hardware specs before committing to a particular vendor.
With the T3 and four T1 cards from SBE, we would require a system with two free full-height, half-length PCI slots. We decided on Tyan S5102 motherboards with a single Pentium 4 Xeon 2.4GHz CPU. For memory, we used 256MB of ECC RAM for maximum reliability.
To cut down on the chance of system failure, we used Flash-based IDE devices. We found a device from SimpleTech that connects and operates like a conventional hard drive. We decided on a 256MB device as we thought that would be enough room for Fedora Core 1 to operate.
The complete computer systems (minus the WAN cards) were purchased from a white box system supplier. This proved troublesome, though, as the supplier was not able to produce four completely identical systems. The systems had variations in CPU fan manufacturers and memory speeds.
One area where the system supplier was helpful was in finding the right case. Only one of the numerous system vendors I contacted could supply a motherboard and case combination that could hold two full-height PCI cards. We had hoped to use a stock system from a supplier such as Dell or IBM, but none of the big names could give us a system that matched all our criteria.
It's critical to have redundant circuits connecting an office to a backup site. Determine who serves your sites and find a backup site served by multiple providers. Our office is connected physically to two providers, so we ended up ordering the T3 from one and the T1s from the other. If you don't research carefully which providers have actual physical connections to your sites, you are likely to end up with all your circuits running through one vendor's cable at some point.
T1s come on standard RJ-45 cables. Typically, the provider terminates the T1s—and their responsibility—at your demarcation point (demarc). The demarc generally is where all your phone connections are made. From the demarc, it is a simple matter to run regular Ethernet cables to your racks.
T3s are more complicated. The physical connection is two coaxial cables, one for transmit and one for receive. T3s use RG-59A cable with BNC connectors. The T3 provider informed us that our server room was too far from its equipment in our building, so a T3 repeater was necessary. This required 4U of space and a 120-volt outlet in our rack. Luckily, this distance flaw wasn't repeated at the hosting facility.
Our goal was to connect all circuits to all WAN routers (Figure 1) and leave the circuits turned off on the spare system on each end. One router at each end would be the master for the T3, and the other would be the master for the four T1s. If either router failed, the circuits could be brought up on the other router.
Based on our research, in particular, some of Cisco's high-end telco equipment, we knew that splitting the circuits was possible. The key constraint is only one system on each end can be transmitting and receiving at a time. That turned out to be a large problem because SBE's hardware was not designed to be inactive while connected to a line. The critical flaw was the transmitter on the T3 cards automatically turns on when power is applied to the card. So, if you have the T3 circuit up running between two systems, one on each end, and you power-cycle the spare system on one end, the T3 goes down because both systems on one end are trying to transmit. This can be worked around partially by sending a shutoff command to the transmitter on the card. This isn't possible until the machine is loaded and the OS is installed, a potential delay of several minutes.
We also discovered that the T3 signals on the coaxial cables must be impedance-matched. The impedance on a T3 cable is 75 ohms. If you simply split that connection, the impedance on the two resulting cables is 37.5 ohms, which may or may not work, depending on your hardware. The correct way to split T3 cables is to use what's called a power splitter, which contains a transformer to balance the impedance properly at 75 ohms on all connections. We used passive power splitters from Micro Circuits, Inc.
Splitting the T1s was much simpler. It's sufficient to use RJ-45 tee connectors to turn one incoming cable into two outgoing cables. Also, the SBE 4T1 card is designed to not turn on the transmitter until the driver is loaded, so you can share the connection between systems.
We were able to make all these split connections work. However, due to the startup problems with the T3 cards and other issues, we currently do not have the splitters installed. If you want to try doing this step, you have to get everything working rock solid without splitting before even attempting it. Otherwise, you will be removing the splitters every time there is a problem with a circuit because you won't have confidence in your setup.
The choice of a 256MB Flash drive for storage dictated a compact OS install. At Telemetry, we have standardized on Fedora Core 1 for all Linux systems. Thus, it was convenient to run FC1 on the router systems as well. The two goals:
Create something similar to stock Fedora Core 1 that would fit on a small drive.
Change the system configuration to avoid unnecessary writes to the drive. This is important because Flash drives have a finite lifetime, so placing log files on them is a bad idea.
It turns out to be relatively easy to build a custom Fedora system, especially compared to what was available in previous Red Hat releases. The key is to build your own system image on another machine with a fresh RPM database and then transfer that image to the router. Listing 1, available from the Linux Journal FTP site [ftp.ssc.com/pub/lj/listings/issue126/7661.tgz], shows how to build a basic system image. The procedure is to create a new RPM database somewhere on your build system, install a minimal set of RPMs to create the system and then install all other RPMs you want. I use the --aid option to rpm to tell it to satisfy all dependencies automatically by looking in a directory where I have placed copies of all the Fedora RPMs. This saves me the work of manually determining all the dependencies. Once you have the system image built, copy it over to the router for testing. We were able to create a workable system that used 171 of the 256MB available on the Flash drive.
The goal of using a Linux router is to minimize disk reads and writes. This is necessary because the memory in a Flash drive can be written to only a fixed number of times, typically in the hundreds of thousands. The way to minimize writes is to treat the router as though it were a laptop. First, enable laptop mode in the kernel, as described in the September 2004 issue of Linux Journal. This causes the system to delay writes until a read is requested instead of sending writes to the disk as soon as they occur.
Second, adjust your filesystem mounting options to delay writes as well. For ext3, set the commit interval to 60 seconds. Then, mount the filesystem with the noatime option so that reads on files don't generate a write of modified access times.
Third, move all log files off the drive and into a RAM-based filesystem, tmpfs. Listing 2 shows how to restructure your filesystem to move all log files out of /var and into a tmpfs called /var/impermanent. For this to be really useful, you also need a script, such as the one in Listing 3, that saves all the log files in a tarball on system shutdown and restores them on boot [Listings 2 and 3 also are available on the LJ FTP site]. This script should be called as early as possible in /etc/rc.d/rc.sysinit on startup and as late as possible during shutdown in /etc/init.d/halt.
WAN links are confusing! For example, the T3 and T1 drivers use different versions of the kernel HDLC stack. This means we have to keep two different versions of the sethdlc program used to set the protocol on the WAN circuit, one built against each hdlc stack.
There are many configuration parameters to set on a T3 or T1 circuit—external or internal timing? CRC size? HDLC mode? and so on. Fortunately, SBE's tech support was helpful and supplied many configuration and troubleshooting tips.
We decided to bond the four T1s into one bonded circuit, using teql. This worked, but performance was terrible if one of the T1s was removed, even after it was reconnected. My coworker, Bill Rugolsky, tracked the problem down to a lack of link status reporting. The SBE card could report whether the link was up or down, but this message was not propagated up the stack. Thus, teql continued to try to send out packets using down interfaces. Bill resolved this by patching the SBE driver and installing patches others had created to fix teql and linkwatch notification. The driver patches were provided to SBE, and we hope they are included in the next revision of their driver.
Our boss Andy Schorr did the work to set up OSPF to handle routing over the WAN links. The open-source package Quagga, a successor to Zebra, provides the necessary framework. If one of the links goes down (remember, there are two links, the T3 and the virtual link over the bonded T1s), Quagga detects this and starts routing packets over the other interface. Traditionally, point-to-point links are configured to borrow the address of another interface, typically eth0. However, we decided to use dedicated subnets for each point-to-point link. Andy had to modify the source code to make Quagga work properly in this setup.
We also had to figure out some iptables rules to make Quagga work correctly with the bonded T1s. The teql device is send-only, so packets never appear on it. This causes Quagga to drop the packets, because they come in on the wrong interface. The fix is a couple of iptables rules to make packets arriving on all the T1 interfaces (hdlc0 through hdlc3) appear on teql0:
iptables -t mangle -A PREROUTING -i hdlc\+ -j TTL --ttl-inc 1 iptables -t mangle -A PREROUTING -i hdlc\+ -j ROUTE --iif teql0
The bottom line is that setting up WAN links is tricky work and requires much study and tweaking. Don't expect things to work simply because you connect the cables.
We had to resolve a number of problems while configuring these WAN routers. Some of the earlier ones were with the WAN drivers, as mentioned above. As I was writing this article, we discovered that our T1 performance had deteriorated badly, with highly variable ping times—up to 1 second instead of the usual 10ms. We traced this to one of the WAN cards not generating interrupts; it had come loose in the PCI slot. The widely varying packet delays were occurring because the other device sharing the same interrupt line (eth0) was sending interrupts. This, in turn, caused the SBE driver to wake up and process its interrupts. This type of non-obvious failure highlights the importance of link-quality monitoring.
We are satisfied with the basic architecture, but a number of improvements need to be made. Given the annoyances of managing multiple T1s in a bonded interface, we now are planning on upgrading the T1s to a second T3. When we do that, we may drop the circuit splitting entirely. Circuit splitting adds a whole new level of complexity to the entire system, and we are unsure if it is worth it.
We have to continue to improve our monitoring of both line status and line quality. It is difficult to complain to circuit vendors about performance if you don't have historical data to back up your assertions.
It would have been convenient to use off-the-shelf servers for the router boxes. We have been investigating the latest 1U rackmount from a major manufacturer, but for several reasons it is unsuitable. The showstopper is that the BIOS does not allow booting from any Flash IDE device. The vendor knows of this limitation but will not fix the BIOS. Thus, we see ourselves building our own systems for the foreseeable future.
We will be building additional internal router boxes for handling LAN traffic, based on the WAN router model we have developed—1U systems with Flash drives running a minimal Fedora kernel.
Although this project is not complete, I feel we've accomplished enough to take a moment to evaluate its success. The key question is: would we do it again? The answer is a qualified yes. Our WAN routers perform the task of providing redundant connections between our office and backup sites. The usefulness of splitting the WAN circuits for redundancy is a wash as it adds so much complexity to the design.
This project has taken significantly longer to complete than we anticipated, a general symptom of developing your own solutions. The answers are there, but you expend more time finding them. Having a sharp, dedicated team (as I did) is crucial to making it all work. Just make sure to budget extra time for all the annoying little problems that are sure to arise.
Resources for this article: /article/7703.