Router FIB technologies

When I wrote the I-D (to version 01) I understood that a typical approach for router design was to use TCAM (Ternary Content Addressable Memory) to implement the FIB function: throwing the destination address into the TCAM would produce a number of the first matching TCAM line, which then becomes the address of an SRAM, which reads out some previously written data which tells the router which interface to forward the packet to. There are many papers referring to the use of TCAM for this "longest prefix match" operation, such as Fast incremental updates on Ternary-CAMs for routing lookups and packet classification by Devavrat Shah Pankaj Gupta, Feb 2001. See also the Force 10 and Cisco presentations linked to below #Cisco_others_TCAM .

The lecture notes by Prof. Zhi-Li Zhang's csci5221-router-design.ppt (PowerPoint, early 2007) documents various approaches, including the RAM-based "Recursive Flow Classification" (RFC). These notes include:

Putting a small cache of the FIB on each line card, with the central CPU handling packets which miss the cache (p42).
125 million packets a second is needed for 40Gbps (p 47).
Cost, speed and power consumption of DRAM, SRAM and TCAM - 15 to30 watts in 2003/04 (p 48).
Explanation of binary tries, path-compressed trees, multi-bit tries etc.
Explanation of TCAM and other techniques.

Here is a quote from page 96:

Lookup: What's Used Out There?

Overwhelming majority of routers:

modifications of multi-bit tries (h/w optimized trie algorithms)
DRAM (sometimes SRAM) based, large number of routes (>0.25M)
parallelism required for speed/storage becomes an issue

Others mostly TCAM based:

for smaller number of routes (256K)
used more frequently in L2/L3 switches
power and cost main bottlenecks

So on this basis, most high-end routers do no use TCAM. Maybe they already have enough memory to map every IPv4 /24 as I suggest. knew some routers, including especially those from Juniper, used other techniques. I was told by two people with long experience in this field that TCAM is not normally used (for instance in Cisco routers) for the FIB function. But I think TCAM is widely used, although the latest, highest capacity routers from Cisco and Juniper use an ASIC with external DRAM or SRAM. It is these newer, core and edge routers which are important to this discussion.

Juniper documentation (Efficient Scaling for Multiservice Networks page 6) indicates TCAM is typically used (in other company's routers) for filtering, but that filtering in Juniper routers is performed by similar techniques to those used for "route lookups". These techniques are tree-based, as depicted on page 9.

Juniper's M120 and MX960 do not seem to use TCAM for any function. Both of these use a single, very large - and indeed heroic - chip called the I-Chip to process packets, including rewriting them and deciding which interface to forward them to, via a separate switching system. The I-Chip integrates the function of 10 ASICs in earlier Juniper products and uses external high speed RAM. Multiple reads of this RAM are used to traverse some kind of radix tree or trie. I guess this RAM would also store data for handling MPLS too. More on packet classification algorithms and on the memory in Juniper routers below.

Some Cisco routers certainly do use TCAM. The CRS-1 does not seem to use TCAM - I think it uses and ASIC and DRAM or SRAM.

Juniper M120 uses RLDRAM

A Juniper/Micron case study download.micron.com/pdf/case_study/juniper.pdf from 2006 states that Juniper chose RLDRAM-II (as described below) in their M120 router which was announced in mid-2006. The separate input and output pin version is used, in the 288 Mbit size. This is the first of their routers to use the I-chip. The second is the larger MX960. Scrutinising the documentation, I see there is 64 Mbytes of RLDRAM in the Forwarding Engine Board, of which there are up to 6 in each router. (M120 Internet Router Hardware Guide page 156.) RLDRAM-II with 288 Mbits are 9, 18 or 36 bit wide devices. They are also available in 8, 16 and 32 bit versions, which are 32 Mbytes each. So that means there must be two of these chips on each Forwarding Engine Board (FEB). The board also has 512 Mbytes of slower DDR DRAM, which is probably largely visible in the drawing (p19) as 3 x 4 small devices (there would be 4 x 4 on the board) to the left of the big heatsink, which presumably cools the I-Chip.

According to the case-study RLDRAM-II was chosen because it was much faster than DDR DRAM. A 15ns latency is quoted, compared to 55ns for DDR. Also, RLDRAM-II was significantly cheaper and denser (bits per chip) than DDR-II SRAM or QDR-II SRAM. Assuming the 72Mbit QDR-II SRAMs were available at design time, there would need to be 8 QDR-II SRAM chips instead of 2 similarly mechanically sized RLDRAM-II chips.

So it seems that some routers do have enough high-speed RAM to implement the direct lookup of 24 bits of IPv4 which I am proposing. However, I am not sure how much of this 64Mbytes is used for buffering packets. It seems the RAM would be kept busy with lookups and the like, so maybe buffering is on the I-Chip. The only other RAM is slower DDR DRAM for the main CPU and for buffering. Page 85 shows 512Mbytes of DRAM, with 63% devoted to buffering. The line cards (FPC and cFPC) have DRAM which is used largely for buffering - page 83 shows them with 128Mbytes of DRAM, with ~58% used for buffering.

I guess the I-Chip would be flexible enough to use 24 bits of the IPv4 destination address to look into a table in RAM. My single 72Mbit QDR-II SRAM proposal is for 16 Mbytes (actually 8M x 9, but used as 16M x 4 for routers with less than 14 interfaces.) The high speed RAM in the M120 is needed for a variety of purposes, but it doesn't seem impossible that an eighth or a quarter of this (14.5M x 4 bits for less than 14 interfaces or 14.5Mbytes for up to 254 interfaces) could be devoted to IPv4 forwarding on /24 boundaries. I imagine a significant rewrite of the operating system and microcode of the I-Chip would be required to do this.

This shows that the cost of the memory I am proposing to add is not prohibitive. Each of these two 288Mbit RLDRAM-II chips would probably cost as much, or more, than the 72Mbit QDR-II SRAM I propose using.

I see a retail price (April 2007) for the M120 FEB is USD$18,992.44. So the $70 cost of the memory I am proposing ($140 for larger routers, and less in the future) is unlikely to be a major concern. The power dissipation of the FEB is rated at 86 watts (p 203), so the additional 1 to 1.4 watts for the memory chip I am proposing (2 to 2.8 watts for larger routers) is also unlikely to be a major concern.

Further description of the M120 architecture can be found in Integrating Ethernet in Converged Networks . The M120 has up to 6 boards (4 x FPC and 2 x cFPC, all on the front of the router) for driving Ethernet links etc. each operating at up to 10Gbps duplex. There are up to 6 FEBs at the rear of the router, and each one can be selected to handle the packets being received from one cFPC or from two type 1 FPCs. Exceedingly fast serial links are used on the midplane, and one FEB can be a spare, reconfigured automatically in less than a second, depending on the arrangement of FPCs and FEBs There are various websites which state that the CPU (presumably the I-Chip) on the FEB is 667MHz, but I can't find this at the Juniper site.

Juniper MX960 uses SRAM?

Page 81 of the MX960 Hardware Guide indicates that each DPC has 256Mbytes of SRAM, which is separate from the DRAM for the CPU (1024Mbyte). Since they are not using RLDRAM, and since each DPC board handles 40Gbps full duplex, compared to the 10Gbps full duplex of each M120 FEB, I guess that they are using either DDR-II SRAM or QDR-II SRAM, both of which are much faster than RLDRAM. (However, see the statement linked to below where a Juniper engineer states they use DRAM in the MX960.) Do they use one I-Chip or two or more sharing the load, each with its own SRAM? In Integrating Ethernet in Converged Networks (page 5) Juniper states that the I-Chip is a 10Gbps packet forwarding engine, so maybe they use four of them in each DPC. The chip is said to be capable of coping with huge scaling, including in the number of routes, without hardware upgrade. (Page 6.) Maybe they use four I-Chips, each with 64Mbytes of RAM - the same as used for the M120.

My guess is that the MX960 uses the same, or similar QDR-II SRAM chips to the one I am suggesting, but probably uses 32 such chips. In April 2007, the largest capacity SRAMs are QDR-II or DDR-II SRAMs, with 72 or 64 Mbits.

I guess that the MX960 could be programmed by Juniper to perform the SRAM forwarding functions I describe.

Traditional RAM-based approaches to packet classification AKA "route lookup"

Juniper, in Efficient Scaling for Multiservice Networks, especially page 11, provides an overview of how their routers process incoming packets. The "tree lookup primitive" is a low-level function within the system. The "IP Forwarding Table" is a body of data in memory and is referred to as the FIB (Forwarding Information Base). The tree lookup primitive navigates this tree, depending on the packet's destination address, and the result is the selection of the next step in the process: the packet being sent to whichever interface was specified in the final part of the tree, perhaps with a filtering stage before it is sent there. (Also, the tree lookup could lead to the packet being dropped, I guess.) Perhaps they cache recent results in the chip, using a hash index into the cache, to speed up classification.

There are many types of classification, but what I am interested in is the FIB function: given a destination address, determine whether to drop the packet or which interface to forward it to.

EZchip has a document "The Role of Memory in NPU Design" at www.ezchip.com/html/tech_whitepapers.html . This discusses three approaches to "look-ups".

Firstly, a direct lookup involves taking N bits of address and using this as in index (AKA a "key") into a linear list of data which specifies what to do next. They say this is typically only done for N <= 16. My IPv4 SRAM-forwarding proposal involves doing it for N=24.

A second approach is to hash a larger number P of bits into a small enough number Q to use as an index to a list of 2^Q variable length lists. Since multiple input values could hash to the same smaller value, each list of more than one entry must be searched to see which one applies to the actual input bits. This would be computationally intensive, since the algorithm somehow needs to look at all Q bits and somehow determine which of multiple entries in this table corresponds to this pattern. I don't have a clear idea of how hashing works efficiently when used to find which of multiple prefixes a particular 32 bit value falls within. I can imagine it being efficient when trying to match the address with a moderate number of single IP addresses, but not with prefixes which span thousands or millions of IP addresses.

If the list has no members (its starting address is 0), our lookup ended with no result. If the list has one entry, this is the correct one. If it has two or more (a "hash collision"), we have to somehow look at the original 32 bits and the individual list items to determine which list item corresponds to the actual input bits. This is messy, but it is potentially faster than a tree based approach, as described below, and its memory requirements may be practical, whereas implementing a list of 2³² items is generally prohibitive. (Though iPods have 4Gbytes of FLASH memory . . . )

EZchip state they can preform hash lookups with no more than two memory lookups. This sounds pretty impressive - I can't imagine how they do it - but it would depend on the nature of the bits being processed and how many rules are encoded by the system.

The EZchip paper gives a rough description of a tree or trie based approach. More detailed descriptions of hash and trie approaches can be found at: www.cs.mcgill.ca/~cs535/lect_notes/Lecture14-RouterArchitectures.pdf .

The simplest approach, with the most compact utilisation of memory, involves stepping through the tree one bit at a time. This is way too slow for handling 24 or more bits of address information with 10Gbps interfaces. Stepping 16 ways, on 4 bits at a time, is faster, but leads to larger memory requirements. The fastest way to handle 24 bits is to use them as an index into RAM - which is what my IPv4 proposal entails. See the parent page for a reference to where this was first suggested in 1998 by Nick McKeown and colleagues.

EZchip states that by stepping multiple bits per cycle, they cut the number of cycles required by about a third. So I assume they are stepping 3 bits at a time. In practice, with an IPv4 FIB, it makes sense to do bits 31 to 24 bits in one cycle, since there are no rules for prefixes shorter than /8. This still leaves 16 bits or more to handle. EZchip state that by doing some of the tree lookups using memory inside their Network Processor Unit chip, and leaving the rest to external RAM, they can speed things up considerably. I haven't figured out how this could be practical.

Cisco CRS-1 - FIB implemented with ASIC and RAM?

Below #Cisco_Catalyst_6500_FIB I point to Cisco presentations and documentation which shows that they use TCAM for the FIB in at least some of their routers, notably the Catalyst 6500. Here I discuss what little I could find out about the FIB function in the CRS-1 - but after writing this, I found a paper which describes the Tree-Bitmap alogorithm, which is apparently used in the CRS: ../#Vargese_CRS_1 .

The CRS-1 has an ingress FIB and an egress FIB. It is the former we are interested in, because this uses the destination address of an IPv4 or IPv6 packet to determine which interface the packet should be forwarded to. One of the various System Description documents, such as: Cisco CRS-1 Carrier Routing System 16-Slot Line Card Chassis System Description, describes the Modular Services Card (MSC). Each MSC connects to its own PLIM card, which provides the Ethernet etc. interfaces to the outside world. Page 5-6 has a drawing.

The Ingress Packet Engine is shown on page 5-3. One of its functions is to "Perform an algorithm to determine the appropriate output interface to which to route the data.". This document says nothing more about the Ingress Packet Engine, but it does state the MSC's power consumption is 375 watts. It handles 40Gbps (full-duplex - a total data rate of 80Gbps) and is a lot heftier and no-doubt more expensive than Juniper's 10Gbps USD$19k M120 Forwarding Engine Board. Update 2007-07-25: sure enough, the list price of a CRS-MSC is USD$80,000, as shown here , here and here.

The best information I could find on the CRS-1's Ingress FIB is from Advances in Routing Architecture : The CRS-1 and IOS-XR , by Andy Chien from March 2005. From this, I ascertain:

The Ingress Packet Engine is here referred to as the "RX SPP (Silicon Packet Processor) L3 engine". This ASIC is fabricated with IBM's Cu-11 process (a 130nm process using copper for conductors, rather than aluminium).
The chip contains 188 32-bit CPUs running at 250MHz. Packets are distributed to these CPUs and their output is muxed together. All CPUs access external off-chip memories for the following functions, referring to four blocks, each of which is presumably a different kind of "memory" chip, according to page 16:

ACLs, Netflow, policing etc. use a 512k entry TCAM - block name "TCAM".
2 million entries for IPv4/IPv6/Multicast/MPLS: two blocks labeled "PLU" (Packet Lookup, for the FIB, I guess) and "TLU" (Tunnel Look Up, for MPLS, I guess).
1 million 64 bit counters, in another block labeled "STATS".

I can't find any information on what chips are used for the "PLU" system. Are they TCAM with associated SRAM? Are they fast RAM with the individual CPUs or some ASIC function doing a trie-style sequence of accesses? Is there some kind of caching? There is no mention of the size or technology of this functional block. I suspect it is one of the ASICs created by IBM. Is it a fully self-contained ASIC or an ASIC with external RAM. I suspect the latter, because of the first line of quoted text here:

The data sheet states that the MSC's memory is:

"Configurable with 2GB of route table memory.
1 GB of packet buffer memory per side (2 GB total per line card [ingress and egress])."

I don't know what "configurable" means (I guess in the next paragraph), but 2Gbytes is too much to be on an ASIC. It would also be a very large amount of TCAM - many heatsinked chips. It may be too much to implement with SRAM. If it is DRAM, then how do they handle 40Gbps of packets (I guess 300 million packets a second at maximum, or more likely a practical figure of 100 million packets a second) with DRAM? Even RLDRAM has a long recovery time - such as 15nsec - before the same block of memory can be used for another read or write cycle. SRAM has no such problems - it can read or write 250 million cycles per second.

Could the TLU device for handling MPLS packets may well have on-chip DRAM? For each interface (and the MSC could be handling 40, I think) there needs to be 2 million locations, one for each possible MPLS label value. Each location would have the next hop MPLS entry (21 bits) and probably some bits to tell the router which interface to send the relabeled packet to. This sounds like too much, so I would expect external RLDRAM or SRAM for the TLU. Perhaps the TLU and PLU functions are implemented on the one chip, with the one amount of memory . . . then, there would be a configurable amount devoted to MPLS and the rest available for IPv4 and IPv6 route lookup.

Here are some images of the MSC board. These are modified Cisco images. The first highlights the area with vertical mounted DIMMs or similar, containing presumably DRAM or SRAM. (TCAM would require heatsinks.) The second is a colour enhanced image, which shows these green vertical boards more clearly. The big heatsinks of the major ASICs are visible next to these areas of RAM. I am not suggesting we try to divine something meaningful from these - I just like to see the physicality of this important board. Can anyone send me a better photo? The blocks on the far right are heatsinks for the power regulators.

Cisco CRS-1 board. How does it implement its FIB?

Cisco CRS-1 board. How does it implement its FIB?

IBM's June 2004 press release about the ASICs used for Cisco's CRS-1 states that the SPP is an 18.3mm square chip (this is BIG, by any standards) - the worlds most sophisticated ASIC. 185 million transistors, 38 million gates. This is the largest of 10 chips for Cisco. I wonder whether one of the other chips was the "PLU" route lookup "memory" device. IBM's photo gallery has a picture of the SPP chip, with the 188 32-bit CPUs visible. Unfortunately the image is not as clear as other close-ups of chips which use similar technology here and here. These are multi-megabyte JPGs, shown small on the web page, but worth downloading and viewing with a proper editing or viewing program, with zoom. The multiple levels of copper interconnect are shown here.

How do modern high-end routers implement the Fowarding Information Base function?

Does the FIB usually involve TCAM?

Juniper M120 uses RLDRAM

Juniper MX960 uses SRAM?

Traditional RAM-based approaches to packet classification AKA "route lookup"

Cisco CRS-1 - FIB implemented with ASIC and RAM?

Cisco Catalyst 6500 and other routers' FIB implemented with TCAM