Direct RAM-based lookup IP forwarding for IPv4 and IPv6, with LISP

There is an overview for non-specialists at the end this page.

There is a looming crisis in Internet routing and addressing. The system is not scaling well to the large number of ISPs, corporate and organisational networks which are being connected. Unless something is changed, more and more expensive routers will be required in the "core" of the Internet, and for the border routers of all "multihomed" organisations: those with "Autonomous System" numbers and with with two or more connections to upstream ISPs. These routers are in the "Default Free Zone" (DFZ) and need to be aware of most, or all, of the 200,000 or so separately connected segments of address space. There are growing concerns about how far this process can continue without severe cost burdens and technical limitations which would degrade the reliability of the Internet.

I suggest that the next generation of routers for the DFZ be equipped with enough RAM (such as one or two chips of high-speed Static RAM) at each interface (or each forwarding engine) so they can do a direct lookup in a single memory cycle (4 nanoseconds) to determine which interface to route an IPv4 packet to, on /24 boundaries. This requires 14.5 million memory locations.

By 20 April 2007, I became aware (see #Already_enough_RAM? that the high-end routers of Cisco and Juniper (CRS-1, M120 and MX960) all use an ASIC and RAM for their FIB, not TCAM. Perhaps these routers already have enough RAM to handle 14.5 million /24s in the way I describe - so my proposal may translate into microcode and operating system changes, rather than hardware redesign.

On or two SRAM chips and the current /24 limit on BGP advertisements would enable flat routing to /24 granularity, with millions of separately connected prefixes (meaning a million or so AS end-users with Provider Independent addresses) if the BGP system could cope with the increased number of routes, (probably with a mechanism to discourage too-frequent changes). This would enable a vastly better utilisation of IPv4 address space.

I also suggest similar techniques for IPv6, with a granularity and restricted address range to be decided. For instance, 2 million /35s could be mapped, initially in a /14. This uses spare space in the SRAM chip, and can be extended in the future to a /13 or /12 as memory becomes cheaper and as IPv6 adoption increases.

This is not the first suggestion of directly applying the top 24 bits if IPv4 destination addresses as an address into RAM. Nick McKeown and colleagues made an elegant suggestion for this in 1998 (#Nick_McKeown) using DRAM. Now, large enough SRAMs are available - which operate 8 or so times faster than DRAM. (But see "Next step" below - we probably need 16 bits of data to specify the output interface, rather than the 4 or 8 bits I assume in the Internet Draft.)

I am also contemplating a version of LISP to complement this SRAM-based flat routing (to a certain granularity) proposal, since LISP can provide multihoming and traffic engineering, down to a single IP address. I think the combination of RAM-based forwarding and LISP would complement each other very well. This is intended as a good long-term solution to the crisis in routing and addressing with the following benefits:

Requires new router functionality in the next generation of transit and border routers, in a timeframe which matches their natural replacement cycle. The extra hardware and software required should be modest, so the cost of these routers should not be vastly increased by this proposal.
LISP functionality will, in theory, be needed only in each edge network's border router(s), but could be implemented in internal routers as well as, or in place of, at the border router(s) to spread the load of LISP processing.
Does not require any changes to host operating systems or applications.
Is intended to preserve, for users, the current functionality of IP addresses without them or their software needing to know which ones are routed via the BGP system and which are handled by LISP.
Structure the DNS arrangements for LISP 2 so each IP address can have its mapping securely controlled by a different user. This means each LISP-mapped IP address would probably have its own domain.

Next step for this proposal

This Internet Draft was made on the basis of several assumptions which I now recognise as being false, so I plan to rework this proposal in the future. The basic aim will remain - to make the BGP routing system "flat" to /24 granularity for IPv4 and for IPv6 with some granularity and range to be determined. (This is not at all popular amongst RAM list members, in part due to the difficulties it would present for BGP, which is a legitimate concern.) I also want to integrate LISP or something similar, to make the most of both techniques, with particular focus on the practicalities of implementing LISP's rapidly changing FIB efficiently in the router.

The assumptions were:

That it was sufficient to store 4 or 8 bits as the result of the lookup. Although this is true in principle, for flexibility and speed (no further processing of the stored result) many routers use 16 bit values for "next hop" or the internal destination within the router. Some routers may, for generality, store even more than this, to specify how the packet's header is to be rewritten etc. But that is overkill in the case of the main body of Internet traffic, since the packets do not have their headers rewritten. This means that practical implementations will need, for instance, 2²⁴ 16 bit words = 32 Megabytes. My initial proposal was for 4 bits (small routers) being a single QDR-II SRAM with 8 Meg x 8 (or 9) bits, or two such chips for 8 bit results.

One way of implementing the memory required for 16 bit results would be to use four SRAMs, but it seems that RLDRAM may also be sufficient, as discussed in the next point.
I assumed that only SRAM would be fast enough for high speed routers. Yet Cisco uses DRAM - presumably RLDRAM - for its 40Gbps CRS-1 Modular Services Card. I can't be sure how much they use, but it is probably 1 or 2 Gbytes. According to the later version of #Vargese_CRS_1 it is more favourable to use RLDRAM and duplicate the data four times in four banks of RLDRAM, to compensate for RLDRAM's slow total cycle rate (primarily the need to restore internal amplifiers and sense lines in the chip before another read cycle can begin), than to use 1/4 the memory for a single copy of the data in QDR-II SRAM. This seems rather odd to me, but if Cisco do it, there must be good arguments for it.

So perhaps four times the amount of RLDRAM is sufficient. For 16 bit results, this means 128 Megabytes. This is a small fraction of the memory used for route lookups in the CRS-1, although part of the demand for memory in that system may be from using a longer results record, to encode further information than which internal interface to send the packet to.
I was trying to use this flat routing (to a certain granularity) model to solve all problems. However, LISP or something like it has major advantages since it can, in principle, map individual IP addresses for multihoming and with traffic engineering.

A future version of this proposal will consider more carefully how LISP functions can be integrated into border routers.
I assumed that current routers did not have enough RAM to perform the direct lookup mapping I am proposing - but perhaps they do. The CRS-1 has a lot of memory (1 Gigabyte at least) - more than enough for my proposal. According to #Vargese_CRS_1 (page 120 of the final version) the tree-bitmap algorithm requires about 16 bytes per prefix for the search structures. This is an approximate ("marketing") figure for real IPv4 routes or for random /32s. This is not counting four copies of the data to allow interleaved access to cope with RLDRAM's poor cycle time.
I doubt that I would be able to show that current routers are capable of what I am proposing, but perhaps they would be capable of it, with suitable firmware. Current designs are optimised for a wide range of prefix lengths. I am proposing a system where most of the prefixes, and a substantial fraction of the address space, would be divided as /20 to /24. When designing for that situation, and knowing that the very few prefixes longer than /24 have low traffic and can safely be handed off to another system involving one or more extra memory lookups, it makes no sense to implement the complex, memory inefficient, multi-memory-access algorithms such as tree-bitmap. The simplest approach is to eliminate the search algorithm and use SRAM on the ASIC to map IPv4 unicast addresses to 224 linear lists of results in external memory. This is a one-access system which couldn't be faster or simpler. Each list requires 2¹⁶ results, such as 16 bits. This is 128kbytes * 224 = 56 Megabytes.

I also want to think about:

Where will the LISP mapped IP addresses come from? Will they be little chunks taken out of currently routed BGP chunks? (If so, this will further fragment the BGP routing table.)
How to enable individual users to fully control the LISP mapping of their single IP address. For instance, using a DNS system where they alone control a particular domain, which is specific to their address, and which is the master controller for LISP mapping, as in LISP 2.

So a future version of this proposal will probably be called something like "Address management for flat routing to defined granularity, combined with LISP". It will be based on working towards a global routing system (currently using BGP) which is capable of supporting millions of relatively small divisions, meaning that there would be many routes of /20, /21, /22, /23 and /24 - to help use the address space very efficiently.

Since we are now talking about a major change to the Internet's architecture, I think we should aim for a system which can efficiently be implemented in hardware and software. Otherwise, like the massive number of address bits in IPv6, we would be specifying a system which will be impossible, or inordinately costly (to the tune of billions of dollars) to build and maintain indefinitely.

My proposals now and in the future are dedicated to:

Maintaining the current functions of IP addresses as far as host operating systems and application programs are concerned. (Although within a restricted range for IPv6.)
Ensuring that the IPv4 space can be divided up freely and finely, without regard to topology, down to a /24 granularity. This will place a greater load on the BGP system, which I think can be lessened with LISP's assistance and by administratively or technically reducing the incidence of unnecessarily frequent changing advertisements . I believe it will be a lot easier to split the address space finely, put more CPU power and RAM in routers and work to optimise BGP (or better still, develop a replacement with less storage requirements, better routing flexibility and better stability) than it will be to run out of IPv4 address space, or to rely on LISP to solve all our problems whilst accepting the current limitations of the BGP routing system and its consequent requirement for large, often poorly utilised, address allocations.

I think many people are resistant to the idea that space will be divided so finely - particularly their space, since this will be taken away from them and they won't have their current large blocks in an unbroken fashion any more. But this must happen, as far as I can see, when IPv4 address space runs out, either to allocate some of this space to new AS users for BGP routing or to allocate it for LISP mapping. The alternative is to force newcomers to IPv6, which I think is not going to happen and which I think would be unfair while large amounts of IPv4 address space are under-utilised.

In short, I believe we need to plan for a "flat to /24" global routing system in the long term, because this is achievable (no-one has convinced me BGP can't be significantly improved and/or better used) and because it is the only way to give new users access to the precious IPv4 address space. Therefore, long-term LISP plans should be aimed at doing things such a "flat to /24" global system can't do, rather than the current approach of burdening that system with a much larger amount of LISP encapsulated packets. At the same time, we need to plan IPv6 address management and BGP prefix admittance limits so that IPv6 will be routable by efficient means, rather than more and more TCAM or the complex, memory-hungry, multi-memory-access techniques such as tree-bitmap.

Latest version of my Internet Draft

The plain text version, at the IETF site and the HTMLised text version there, does not contain hyperlinks or even the URLs of the references:

tools.ietf.org/html/draft-whittle-sram-ip-forwarding-01

Where this proposal fits in the scheme of things

This section will only make sense to someone who is familiar with the discussions on the RAM list and has at least a rough understanding of the SRAM-forwarding part of my proposal.

This I-D proposes several things - router functionality, address assignment policy and BGP management policy - which would enable "flat routing" to /24 granularity in IPv4 and flat routing to some yet to be determined granularity in IPv6, within a carefully chosen, restricted address range. Overall, I am thinking the best solution would be to combine direct lookup RAM table forwarding with some form of LISP (the most-discussed of the Location / Identification splitting protocols).

I intend this be the basis of a complete proposal which will serve the Internet well for the next 20+ years (with expansion of IPv6 SRAM capabilities in the mid-2010s). However, this is a big problem, being worked on by dozens or hundreds of bright people with more experience than I have. At least, I hope, this work will generate some ideas which turn out to have lasting value.

So far, there has been a general lack of interest in the SRAM proposal, in part due to the way that it would lead to further burdens on the control-plane of the BGP system: more storage, more CPU activity, more inter-router communications and (unless something is done to suppress rapid changes in advertisements) more instability. I believe there is scope for routers coping with more BGP traffic, storage and processing than at present, and I believe that the main burden of BGP at present - most of the frequent updates - can be greatly reduced. So I think it is worth working on BGP and installing SRAM in DFZ routers, because of the great improvements it would bring to the core of the Internet and to our ability to finely divide and efficiently use the IPv4 address space.

LISP 2 or 3 can extend to any granularity, such as to single IP addresses. It places a moderate extra burden on the global routing system in the form of extending the packet length with a new header, and with the query-response traffic as LISP routers query a central system to gain mapping information.

LISP provides multihoming and traffic engineering (TE) without the serious burden on the main BGP system of larger FIB, larger RIB, more routes and more frequent route updates - all of which would be required if the same multihoming and TE goals were to be achieved with current BGP-based techniques.

LISP places serious burdens on whichever routers perform the LISP encapsulation (and to a lesser extent, decapsulation) within edge networks and/or at the borders of edge networks. These must convert between ordinary IP packets for LISP-mapped addresses and their LISP-encapsulated form. This means they need to a lot more packet classification and that they need substantial caches of mapping data, a well as performing encapsulation and decapsulation. LISP 2 or 3 also requires a central system to respond to queries about LISP mapping.

My aim is to use RAM forwarding to handle a significant part of the problem which LISP is currently being proposed for. In particular, I want the RAM-forwarding system to enable a larger number of edge networks to be connected and to allow a much finer breakup of the IPv4 address space than is currently possible, in order that it can be used much more efficiently - without the traditional inefficiencies caused by the hierarchical assignment of addresses, to preserve route aggregation: RFC1715 and RFC3193 due to the requirement of reducing the total number of BGP, RIB and FIB routes.

Then, a smaller proportion of traffic will carry LISP headers than is the case with current pure-LISP proposals. Likewise, a smaller number of routers will be burdened by a smaller volume of LISP operations.

I am trying to find the minimal arrangement for making RAM-based forwarding and LISP work - the minimal number of routers which need to be upgraded:

Transit routers. SRAM-forwarding (flat routing to /24 granularity for IPv4). All transit routers are in the DFZ.
Multihomed border routers. SRAM-forwarding and LISP functions, unless the edge network places all its LISP routers inside its network. All multihomed border routers are in the DFZ.
Single-homed border routers. LISP functions, unless the edge network places all its LISP routers inside its network. Single-homed border routers are not in the DFZ and don't need an SRAM-based FIB.

Large edge networks where the LISP load is too great to place on the border routers will also need to upgrade some of their internal routers:

Routers within the edge networks. These do not need SRAM forwarding. This would be optional - I am trying to find a scheme which, for smaller edge networks at least (the ones which are the hardest to encourage changes in), involves changing only the border router(s).

This is on the assumption that it is not possible for an edge network to access LISP-mapped IP addresses unless it has one or more LISP routers. For this to be possible, it would be necessary to use BGP controlled edge and transit routers to route every LISP-mapped subnet of addresses to a LISP router in the core of the Internet, or perhaps in some other edge network. But that would seem to place a huge burden on the BGP network in terms of adding greatly to the number of advertised prefixes.

Here is how my proposal compares with some others being discussed on the RAM list.

Replacing the current notion of IP addresses with another incompatible entity, which is purely an identifier and cannot be stored or regarded as an IP address. This would require major changes to host operating systems, and probably to host programs. (Perhaps all host programs would need to be rewritten. This will never happen.)

LISP does not fall into this category. Neither does my RAM-based forwarding proposal on its own, or with the form of LISP I suggest might complement it. Host operating systems and application programs will use IPv4 and IPv6 addresses exactly as they do now.
The IETF tradition of not mandating (or perhaps even suggesting) how anything should be implemented. My suggestion does not mandate how routers implement the new functionality. It is possible that some current generation routers, such as the Juniper M120 and MX960 and the Cisco CRS-1 already have enough RAM to implement my RAM-based forwarding proposal. If so, they would require new software to do so. They already have 2 to 4 times the memory I suggest, but they may use that for other purposes in addition to the FIB function. At least it shows that the cost of the memory I am proposing is unlikely to be a serious obstacle.

My proposal involves careful alignment of address management policy with a particular set of router capabilities, for which I suggest there is at least one economical, elegant and power-efficient solution. My proposal would involve broad agreement beyond the IETF, including network operators, RIRs and router vendors.
Incremental adoption. Most proposals fail the test of providing benefits to early adopters: Does an edge network (an AS network of an ISP or non-ISP organisation) which implements the change derive sufficient benefit, or any benefit at all, until large numbers - perhaps all, or almost all edge networks - also implement the changes? My proposal fails this test too, but I suggest that it would not involve major expense. I believe it promises a good and perhaps the best set of outcomes possible without changing hosts or most of the routers in edge networks. This promise eventuates over a long time frame - to 2012 to 2014 or so - but is concrete in the near term in that when purchasing new routers, the networks will be reasonably confident that the new routers will last for 5 years or more.

My suggestion only makes sense if there is broad agreement to update all DFZ routers (or perhaps all single-homed edge routers as well, if LISP functionality is to be built into them too) to meet a new standard of capability. The timeframe for my proposal is within the natural replacement cycle of routers - say to be completed in 2012 to 2014. In order that all edge networks continue to have full connectivity, it requires complete update of these routers, after which a significant proportion of IP addresses could be mapped to LISP and taken out of the BGP system. The impetus for doing this would be multihoming and traffic engineering which would not be possible while they remained in the BGP system.

Edge networks which fail to upgrade their 2007 routers will face problems no matter what new protocols or router standards are adopted by other networks. While this proposal does not provide incremental benefits, the idea is that the cost differential for the new generation of routers will not be so high, and that with broad agreement the new routers could be in place in time to resolve the major problems of routing and addressing.
Most proposals discussed on the RAM list involve a new protocol, tunneling etc. based on separating the concept of Identifier and Locator, which are currently both performed by a single IP address. Many of the bolder proposals involve a new form of identifier, which would require major changes to host operating systems and probably host programs - but LISP does not require any host changes. Some IP addresses will be globally routable locators for LISP routers, and others will not be globally routable, but will be used within edge networks like ordinary IP addresses, with packets being sent to those addresses to be intercepted by a LISP router, encapsulated and tunneled to the appropriate LISP router (with a globally routable Location IP address) near their destination. There, the header is removed, leaving an ordinary IP packet, which is routed normally within the edge network to the destination.

I am trying to figure out a form of LISP 2 or 3 (lookups by DNS or something similar) which would nicely complement the RAM-forwarding proposal. So I am not interested in LISP 1.5 - and LISP 1.0 has no practical benefits. I want to use DNS or something similar so that individual end-users can control the LISP mapping of their IP address(es).

News

Updates

Here are some points in addition to what is in the current version of the I-D - but be sure to read Next step too .

How do routers implement the FIB function?

There is a separate page router-fib/ where I discuss how the FIB is implemented in various routers. In summary:

Juniper M120	ASIC and RLDRAM (64MB)
Juniper MX960	ASIC and SRAM?
Cisco Catalyst 6500	TCAM
Cisco CRS-1 Tree-Bitmap algorithm - see #Vargese_CRS_1.	ASIC and RLDRAM?
Force-10	TCAM
Most high end routers according to Prof. Zhi-Li Zhang	DRAM and SRAM with modifications of multi-bit tries

This page also leads to some pictures from IBM, of the Silicon Packet Processor chip which they built for Cisco - the largest ASIC ever made at that time (perhaps still true in mid 2007), with 188 independent 32 bit CPUs.

Do modern high-end routers have enough RAM to flat route IPv4 to /24 granularity?

If (as indicated by the above section and the page it links to) many of the current generation of high-end routers generally use RAM-based techniques, I wonder if they already have enough RAM and the required straightforward indexing ability in their ASICs to implement what I am suggesting in my I-D.

The Tree-Bitmap algorithm is used in the CRS-1: #Vargese_CRS_1.

RLDRAM-II and DDR-II SRAM as alternatives to QDR-II SRAM

An article by Anuj Chakrapani at NetworkSystemsDesignLine compares the suitability of three families of memory for functions such as table lookup (FIB) in networking equipment. Here is my summary of their characteristics. They all have about the same raw cycle time of 250 (or sometimes a little more) clock cycles per microsecond, with two 8 or 9 bit reads/writes per clock cycle.

Memory technology	Max Mbit early 2007	Separate inputs & outputs	Latency	Random addresses	Direct mapped FIB or recursive tree / trie based lookup
QDR-II SRAM	72	Yes	Low	No problem.	Ideal.
DDR-II SRAM	72	No	Low	No problem.	Ideal if it is OK to share the data in and out signals on the same set of pins.
RLDRAM-II	576	Both versions available	High	Controller must schedule addresses to suit internal structure of chip - 8 blocks, each of which has a long recovery time after the previous cycle.	Problematic to schedule the addresses to meet the chip requirements, but chips are 8 or 16 times denser and ~1/4 the price per bit.

Quad Data Rate SRAM: www.qdrsram.com
Double Data Rate SRAM: www.necel.com/memory/en/products/sram/ddr-72m.html
Reduced Latency DRAM: www.micron.com/products/dram/rldram/
Juniper using Micron RLDRAM: www.micron.com/applications/networking/juniper.aspx , download.micron.com/pdf/case_study/juniper.pdf ,

Links

In October 2006, the Internet Architecture Board ran a two day workshop in Amsterdam, on the crisis in routing and addressing: This site www.iab.org/about/workshops/routingandaddressing/ contains the presentations. The report can be found at: tools.ietf.org/html/draft-iab-raws-report-01 . A message from the IAB and IESG in December 2006 www1.ietf.org/mail-archive/web/ietf-announce/current/msg03255.html outlines the the strategy for solving these problems.

RADIR - Routing and Addressing Directorate

www.ietf.org/IESG/content/radir.html This group was established on 2 March 2007 and now has eight members. The archives of their mailing list are at https://www1.ietf.org/mailman/listinfo/radir and a wiki is here: www3.tools.ietf.org/group/radir/ .

LISP - Locator/ID Separation Protocol - draft-farinacci-lisp

Version 00 of LISP was replaced in early July 2007 by version 01. This drops IP-in-IP encapsulation in favor of UDP.

Version 00 is here tools.ietf.org/html/draft-farinacci-lisp-00 with some background material here: www.cs.ucla.edu/~lixia/rrg-lisp-ID-00.pdf A more detailed list of LISP versions was posted by Dino Farinacci on 2007-04-16: www1.ietf.org/mail-archive/web/ram/current/msg01289.html .

Vince Fuller posted a note about LISP 1.5 on 20 June: www1.ietf.org/mail-archive/web/ram/current/msg01562.html .

Version 01 is (or soon will be) at: tools.ietf.org/html/draft-farinacci-lisp-01 and you should always be able to find the most recent version by removing the number from this URL: tools.ietf.org/html/draft-farinacci-lisp .

An Internet Draft for "push" distribution of LISP 3.x mapping information is LISP-NERD: tools.ietf.org/html/draft-lear-lisp-nerd . This also contains a critique of DNS based "pull" distribution but suggests that it may make sense for an ITR to download a subset of the global mapping database, such as for a particular "region" (as if Internet communications are highly regionalised, which I think they are not) and to use DNS lookups for prefixes not mentioned in that database.

An Internet Draft for a "pull" based distribution of LISP 3.x mapping information is: LISP-CONS: tools.ietf.org/html/draft-meyer-lisp-cons .

The APT I-D tools.ietf.org/html/draft-jen-apt can be used for LISP 3.x but is intended for the iFIT proposal: (out-of-date, in July 2007, I-D: tools.ietf.org/html/draft-wang-ietf-efit and July 2007 paper: A Scalable Routing System for Future Internet www.cs.ucla.edu/~lixia/papers/07SIG_IP6WS.pdf ).

All these proposals are discussed on the RAM list.

RRG - IRTF Routing Research Group

www.irtf.org/charter?gtype=rg&group=rrg (Not: psg.com/~avri/irtf/rrg-page.html.)

There is a wiki, as noted above, which links to many other things beyond what you see here. The mailing list archives are: psg.com/lists/rrg/ . An Internet Draft from this group, started in April 2007, is "Design Goals for Scalable Internet Routing": tools.ietf.org/html/draft-irtf-rrg-design-goals . There was a sub-group on BGP stability but the best list to discuss BGP is the IDR list, mentioned next.

IDR - Inter-Domain Routing - Working Group

The IETF's IDR WG is responsible for BGP, the Border Gateway Protocol. The IDR WG's charter and list of active RFCs, Internet Drafts etc. is tools.ietf.org/wg/idr/ and www.ietf.org/html.charters/idr-charter.html . Mailing list archives are at: www1.ietf.org/mail-archive/web/idr/current/ .

BGP is a carefully honed protocol by which tens of thousands of routers share information to enable them to decide how best to forward packets to their destination, where there are now over 215,000 separately advertised prefixes in the global IPv4 BGP system, each of them requiring a potentially different forwarding path in any single router. BGP's messages are between peer routers and involve intentionally reduced information about each router's "best path" for a given prefix. This is sufficient to robustly protect the system against routing loops, but too brief for the routers to make well informed decisions about whether a failure which caused one "best path" to be usable would also affect another "best path", reported by another peer, which the router is considering to adopt as its "best path".

The limitations of BGP and of routers' ability to handle its workload, with more and more prefixes and with some routers having to maintain a record of all "best paths" of each of potentially dozens of peers, sets a practical upper limit on the number of prefixes which can be advertised in the BGP system. Since there are more and more ISPs and AS-end-users wanting to advertise more and more prefixes, for their legitimate needs of having portable IP addresses, multihoming for robust connections to the Net and for Traffic Engineering, either routers and/or BGP will have to drastically improve their ability to scale to large numbers of prefixes or some new routing and addressing architecture needs to be developed so the needs of ISPs and many end-users can be accommodated without burdening the BGP system with more and more prefixes.

Incremental improvements are being developed and considered for BGP. Some are considered in this IDR mailing list. Other more experimental approaches are discussed in the RRG Routing Research Group list or the RAM list.

The BGP protocol is complex. Each router's behavior is complex and depends on local policy (rules about where to send or receive traffic for various prefixes), the network topology and the behavior of its peers. The behavior of the whole BGP routing system is far more complex still, and some of its behaviors are not predictable or intended.

IETF Routing Area Work Groups

www.rtg.ietf.org

RiNG

www.ist-ring.org EC part-funded think-tank on the routing problem. They have a good reading list wiki.ist-ring.org with various papers, including some on improvements to BGP and on compact routing.

Beyond BGP

www.beyondbgp.net "BGP is generally thought to be reaching the end of its useful lifetime, although this has not been validated by analysis or measurements . . . Due to the lack of a shared understanding of the problem and lack of sufficient data and analysis, there is no consensus on where/when BGP collapses and what (if anything) should be done.". No apparent activity since 2005.

Clean Slate Design for the Internet

Some "back to the drawing board" projects for long-range architectural planning. These look well beyond IPv4 and IPv6. Anick Jesdanun's article biz.yahoo.com/ap/070413/rebuilding_the_internet.html cleanslate.stanford.edu , 100x100network.org .

HLP - critique of and potential replacement for BGP

www.cs.ucl.ac.uk/staff/M.Handley/papers/hlpsigcomm.pdf

Joel Jaeggli's Nanog 39 BOF on Pushing the FIB limits

www.nanog.org/mtg-0702/jaeggli.html This leads to a number of very informative presentations and a report, from February 2007.

On a separate page router-fib/#Cisco_Catalyst_6500_FIB I summarise some parts of these presentations which describe how TCAM or other techniques are used to implement the FIB function.

Apricot presentations on routing and addressing

The Asia Pacific Regional Internet Conference on Operational Technologies in Bali in March 2007 has some good slides: submission.apricot.net/chatter07/slides/future_of_routing/ .

Nick McKeown's papers

Nick McKeown of Stanford tiny-tera.stanford.edu/~nickm/ wrote a paper in 1998, with Pankaj Gupta and Steven Lin: Routing Lookups in Hardware at Memory Access Speeds. (IEEE INFOCOM April 1998, Vol 3, pp. 1240-1247.) This is the first (and so far only) paper I can find which suggests direct use of the most significant 24 bits of the IPv4 destination address to drive RAM for a single cycle lookup of FEC. They have an elegant solution for a second chip to handle those few prefixes which are longer than 24 bits. They also generalise that two-chip process to achieve reductions in total memory size for three level arrangements, depending on the number of prefixes which are needed of the various lengths.

So what I am suggesting in my I-D is much the same as was suggested in 1998, except they suggested using DRAM, and I am suggesting much faster QDR-II SRAM, which is was not available in 1998.

A later paper Algorithms for Packet Classification Pankaj Gupta and Nick McKeown (IEEE Network, March 2001.) doesn't mention this technique of directly mapping 24 bits as a RAM address.

Geoff Huston's web sites

www.potaroo.net Geoff Huston's main site, providing a wealth of analysis of the Internet in general and real-time information on the routing and addressing system. He has a great Introduction to BGP - the Protocol. A June 2007 article on Damping BGP concerns the statistics of updates and problems encountered trying to reduce stability. See also Geoff's Internet Draft with Tony Li on helping to solve these problems #BGP_stability. Also his report with many references, regarding the October 2006 workshop on routing and addressing: www.potaroo.net/ispcol/2006-11/raw.html .

Prof. Zhi-Li Zhang's Fundamentals of Advanced Networking

University of Minnesota reading list with many of the papers archived on the site. Of particular interest is his PowerPoint lecture 14, which I link to and quote in the router-fib/#Zhang page.

Tree-Bitmap and the CRS-1 - George Varghese and colleagues

George Varghese's site is www-cse.ucsd.edu/~varghese/ . Some papers include:

Tree bitmap: Hardware Software IP Lookups with Incremental Updates (2002 version, see below for a May 2004 version.)
W. Eatherton, Z. Dittia, and George Varghese
ACM Computer Communications Review, V34, April 2004

He also writes at www-cse.ucsd.edu/users/varghese/research.html :

"This algorithm is being used in Cisco Routers, and is currently the fastest and most compact IP lookup algorithm known in the literature, allowing both software and hardware implementations. Cisco has confirmed that the latest Cisco HFR Router (officially known as CRS-1), the fastest and largest router in the market, is using Tree Bitmap. TreeBitmap is also being proposed for placement in Cisco IOS; if so it will be widely deployed across several Cisco router platforms."

Page 3-86 of Cisco's IOS XR Troubleshooting manual mentions diagnostics for "tree bitmap hardware-related ingress and egress
information".

A much later and significantly updated version is:

Tree bitmap: hardware/software IP lookups with incremental updates
Source ACM SIGCOMM Computer Communication Review archive
Volume 34 , Issue 2 (April 2004) Pages: 97 - 122
Will Eatherton (Cisco Systems), George Varghese (UCSD), Zubin Dittia (Jibe Networks)
portal.acm.org/citation.cfm?id=997160

This version contemplates more recent memory technologies, including QDR SRAM.

On BGP instabilities and problems with flap-damping:

Route Flap Damping Exacerbates Internet Routing Convergence
Zhuoqing Morley Mao (UC Berkeley), Ramesh Govindan (ICSI), George Varghese(UCSD) and Randy Katz (UC Berkeley)
ACM Sigcomm 2002
www.sigcomm.org/sigcomm2002/papers/routedampening.html

Why the Internet only just works

Mark Handley's paper is here: www.cs.ucl.ac.uk/staff/M.Handley/papers/only-just-works.pdf . His notes csci5221-router-design.ppt (PowerPoint) discuss FIB techniques.

Search tool for selected IETF mailing lists

Lars Eggert's custom search page: www.google.com/coop/cse?cx=006728497408158459967%3Aybxjdw-bjjw

Some of these involve behavioral changes to routers, without changes in the BGP protocol. Some involve changes to the protocol and to the method of handling packets. Others may involve replacing BGP with a rather different system. Maybe this can be done incrementally.

Background on BGP includes Geoff Huston's Introduction to BGP mentioned above and the RFC: RFC 4271.

It would be best to discuss these and other proposals on the RRG list - or the IDR list for questions about current and past BGP behavior and near-term improvements to BGP. Please suggest things to add to this list. My comments are primarily to promote interest in these proposals and should not be regarded as well-informed critiques. BGP is a very deep field and a the best place to get a well informed critique of any of these would be to participate in the RRG or IDR discussions.

BGP Stability Improvements
Tony Li and Geoff Huston June 13 2007

tools.ietf.org/html/draft-li-bgp-stability

I found this to be very helpful, especially the explanation of how changes propagate in the network and how "path hunting" takes place, often causing a flurry of sub-optimal updates before finally settling down. But see a fuller discussion #BGP_hunting_MRAI_disc for the real lowdown on "path hunting" the various behaviors of the MRAI timer etc.

Tony Li (of Cisco, and with Lixia Zhang chair of the Routing Research Group) and Geoff Huston suggests some improvements to filtering changes to reduce the propagation of updates which are generally unhelpful. He also suggests the goal should not be rapid convergence to absolutely optimal routing, but to a workable path so connectivity is restored.

I think these proposals are promising. BGP is a very difficult and subtle field. I encourage anyone who is really interested in this to follow the above link and read all the discussions on the RRG and IDR mailing lists.

Path length damping. When updates cause a router to decide that the best path for a prefix is longer than it was before, this is a good sign that path hunting is taking place, so it might be best not to pass the resulting update on immediately, but set a timer and wait a while to allow multiple other updates to arrive before computing a new best path, which may involve a change and sending out another update. A likely outcome of this is that the final message would be a withdrawal - and it is better to send this than to send out longer paths and then a withdrawal. Optimal Path Hysteresis or Optimal Path Length Hysteresis.
Delayed Best Path Selection. Only compute a new best path after a timer expires, where the timer is reset every time an update arrives for the prefix in question.

BGP-RCN: Improving Convergence through Root Cause Notification
Dan Pei, Matt Azuma, Daniel Massey, and Lixia Zhang. 2005

www.beyondbgp.net/pubs/2005/bbgp_comnet05.pdf

The authors define a theoretical protocol SPVP to represent BGP, and then add their RCN algorithm to it. They write that the RCN approach is deployable in the current infrastructure.

The idea is that updates which result from a link failure will have extra attributes to this effect. This enables a distant router to know that certain paths are no longer working, and so to avoid using alternative paths which have the faulty subset within them. I guess this would work across multiple prefixes, not just within one prefix.

The information would piggyback on existing updates and include a sequence number. The sequence number would enable routers to cope with situations such as:

"For example, after a link flapping (i.e. a link goes down and then comes up quickly), the link down notification might arrive at a node later than the path announced in the link up event. Therefore, when the link down notification arrives, the node can mistakenly remove a valid path that announced in the link up event. SPVP-RCN solves this problem by letting each node v maintain a sequence number, t(v), which is incremented by 1 upon each change of node v s route to prefix p."

I have only scanned this complex paper. I think it is worthy of further consideration. The authors' simulation resulted in vastly improved responses to a failure of a link:

"In the simulation of a 110-AS Internet-derived topology, when a destination becomes unreachable (i.e. a Tdown event defined later), BGP-RCN reduces the convergence time from 715.3 s to 1.3 s, and reduces the number of BGP update messages from 30,483 to 926. When a destination becomes reachable through a longer path (i.e. a Tlong event defined later), in 95% of the cases BGP converges in 56 s or less, and BGP-RCN reduces this time to 32s."

HLP: A Next-generation Interdomain Routing Protocol
Lakshminarayanan Subramanian et al. 2005

www.sigcomm.org/sigcomm2005/paper-SubCae.pdf

This is an ambitious proposal was prompted by earlier work of the RRG, and is intended to be a practical upgrade for the current BGP system.

This is a hierarchical system in which some border routers, for instance, may be part of two or more hierarchies. Hierarchies use link-state routing, for instance between a major ISP and its various customers, which may be other ISPs and AS end-users. This is a complex system and I have only read the first few pages of the paper. I won't try to summarise HLP - please read the Introduction and Design Rationale.

Scaling IP Routing with the Core Router-Integrated Overlay
Xinyang Zhang et al. 2006

www.ieee-icnp.org/2006/papers/s4a4.pdf

By tunneling (for instance with MPLS) traffic to a distant router, with that tunnel containing traffic for multiple prefixes, a factor of 100 reduction in BGP global routing table size can theoretically be achieved, with very little increase in overall path length.

Differentiated BGP Update Processing for Improved Routing Convergence
Wei Sun, Zhuoqing Morley Mao and Kang G. Shin 2006

www.eecs.umich.edu/~zmao/Papers/bgp-updates.pdf

Without changing the BGP protocol or network infrastructure, significant improvements can be gained by algorithms which control the timing of updates according to various priorities. "30% fewer updates and 80% less convergence time."

Papers by Olivier Bonaventure

Reconsidering the Internet Routing Architecture

Internet Draft: www1.tools.ietf.org/html/draft-bonaventure-irtf-rrg-rira
Slides: www.info.ucl.ac.be/~obo/pres/IRTF-Prague.pdf

Also, some slides from presentations or courses on BGP4 interdomain routing: www.info.ucl.ac.be/~obo/pres.html

Limiting Path Exploration in BGP (AKA Path Hunting)
Jaideep Chandrashekar, Zhenhai Duan, Zhi-Li Zhang and Jeff Krasky 2005

www.cs.fsu.edu/~duan/publications/epic_infocom05.pdf

The authors propose an enhanced path vector algorithm EPIC, using forward edge sequence numbers. BGP messages contain too little information for a router to make well informed decisions about alternative routes. See also "Quantifying Path Exploration in the Internet" mentioned below.

A June 2007 paper on BGP update statistics, stability etc. from Geoff Huston

Damping BGP

Damping BGP concerns the statistics of updates and problems encountered trying to reduce stability. This has really interesting analysis of the timing of BGP updates. See also Tony Li's I-D on BGP Stability Improvements mentioned above.

A discussion about BGP path hunting, the MRAI timer and how the spec for this changed

This discussion began on the RRG list on 2007 June 20 (link to approx. start of discussions) and then moved to the IDR list on 2007 June 21 (link to approx start of discussions).

Tony Li defines "path hunting" in www1.ietf.org/mail-archive/web/idr/current/msg02422.html as what I described as the behavior of a small system of BGP routers where a withdrawal of one prefix causes the system to produce firstly a longer "best path" announcement, within a fraction of a second or so, followed by about 30 seconds later (or whatever the MRAI timer value is) a withdrawal announcement. This effect can be "amplified" by passing through further such structures. Timing differences cause more diversity and complications too.

"Path hunting" is believed to be a major contributor to the burden placed on BGP routers. Read these discussions and find out more about how changes to this MRAI timer in the BGP RFC4271, which apparently reflect the behavior of routers around 2002 or so, seem to be one of the major causes of "path hunting", also known as "path exploration".

This paper looks at how many packets are actually delivered while the routing system converges on a new set of paths after an outage.

A Study of Packet Delivery Performance during Routing Convergence
Dan Pei, Lan Wang, Daniel Massey, S. Felix Wu and Lixia Zhang 2003

www.research.att.com/~peidan/peid_packet.pdf
www.research.att.com/%7Epeidan/packet_DSN03.ppt

Another paper with recent measurements of BGP instability

Quantifying Path Exploration in the Internet
Ricardo Oliveira, Beichuan Zhang, Dan Pei, Rafit Izhak-Ratzin and Lixia Zhang 2006

www.imconf.net/imc-2006/papers/p27-pei.pdf

Here are a few wild and woolly ideas of mine. There are no-doubt scaling and perhaps robustness/security problems associated with them. Probably they have already been considered and rejected. However, this is a desperate situation, where we are planning to radically revise the Internet's architecture due to the limitations of BGP. Maybe for the future routing system some of these ideas would be worthwhile. If not, at least they are mentioned here with a note about their scaling problems.

My intention is to put each router in a better position to choose whether to adopt a new path, either by ensuring the path actually works or by fine-tuning the choice between multiple possible paths by probing the routers in each path to determine something like how much congestion there is on that path. The trouble with any such proposals are that they add to the amount of state each router holds and/or that they add to traffic and the need for other routers to respond, in situations where instability would probably focus thousands of similar probes on the one set of routers at the same time.

BGP intentionally simplifies the amount of state each router holds. Each path is not specified in terms of actual routers, but in the numbers of the autonomous systems the packet passes through. This may be a poorer measure of actual path length than if the path specified the actual routers, but it is generally good enough, and enables local policy to help decide the best path, while greatly reducing the amount of data in each BGP message.

Enable a router to know the IP address of the border router which advertises each subnet.

Plan A: Attach this to the advertisement and propagate it to other routers in updates. This has various problems, including (as best I understand BGP) that this prefix may be handled by another update covering a larger range of address space at other points in the network, making it impractical to include border routers' IP addresses. Another problem is that this propagation model involves delays and a remote router may get different versions of the IP address when the prefix is advertised at a new location, not necessarily in proper time order. Timestamps might resolve that.

I have been told that it would be quite easy to add this originating border router IP address to BGP messages. Timestamps would be trickier, though I guess it shouldn't be to hard to sync routers to an NTP server. Once the BGP system settles down, the final update(s) will contain the best path for a given router, and all best paths reported by peers will in fact carry packets to the right border router. My idea was that with an actual IP address, gained from this or the next method, the router would be able to test connectivity to the border router in real-time, rather than taking the work of peers, who took the word of their peers etc.

Plan B: use something like in-addr.arpa to have a securely controlled, real-time updated, distributed database by which any router could look up any prefix to find the IP address of the border router which currently advertises it.

Overview for non-specialists

The following is a brief account of a complex series of problems. If you think this is overly dramatising the situation, please read the report of the October 2006 IAB Workshop on Routing and Addressing, linked to below.

The routers (the electronic devices which handles packets of Internet data in a manner analogous to post-office letter sorting offices) in the core of the Internet, sometimes known as the "Default Free Zone" (DFZ) need to keep up with a growing number of routes, also known as "BGP (Border Gateway Protocol) advertised prefixes". When an ISP or the operator of another network (a user organisation with an "Autonomous System") connects their network to the Internet, they advertise through the BGP routing system that their router is the proper destination for one or more subsets of the Internet's address space. There are 4 billion IP addresses in today's "IPv4" Internet, with about 3 billion ultimately being available. About 1.7 billion are currently advertised. Each subset is known as a "prefix" - for instance a /20 prefix involves the first 20 bits of the address being fixed to a particular pattern of bits, and the other 12 varying - covering, in this example, 2¹² (4096) IP addresses. This information is passed on to neighboring routers, which pass it on to their neighbours.

Each new connection - each new BGP advertised prefix - means that tens of thousands of DFZ routers need more computational and packet handling resources to perform their task. Each router has a limit to how many routes it can handle efficiently. This means that these routers need to be upgraded or replaced with newer routers, in order to do their job. All Internet users collectively pay for these routers, since their ISPs collectively pay for capital and operational costs of the core of the Internet.

There are several serious problems arising from this:

As more and more ISPs, companies, universities, schools etc. connect to the Internet, most of them want their own dedicated ranges of IP addresses. They typically need to split the subset into smaller subsets and advertise each one of them on a separate router which connects to the wider Net via a different ISP. This is primarily to achieve "multihoming" - independence from any one ISP. So they use two or more ISPs, of their choice, and do not need to change the IP addresses of their network when they change ISPs. Most importantly, if one ISP fails, either due to technical failure for a few minutes or by going out of business, the organisation can still maintain their connection(s) to the Net via other ISPs. This need for multihoming is the primary reason for the increase in the number of BGP advertised prefixes.
The increased growth in the traffic of BGP routers passing on changes to the advertised prefixes. This takes more bytes of inter-router traffic, but is generally not prohibitive, since long distance links are getting cheaper and faster.
Problems in the routers collectively reaching a stable state, because this only happens after a complex set of interchanges prompted by each new advertised prefix. This is primarily a problem with changes to advertised prefixes, not with the total number of advertised prefixes.
Router's limited CPU and memory capacity to handle the BGP communications and to configure their routing hardware according to the tens of thousands of rules (generally each routing rule results from one advertised prefix) they must consider when deciding which of their interfaces to send each packet to. Memory capacity and CPU speeds are rapidly increasing, so new models of routers should be able to cope with this.
The hardware "packet forwarding engine" which classifies each packet as it arrives, determining which interface to send it to, based on its 32 bit (for IPv4, or 128 bit for IPv6) destination address. This hardware is expensive and power-hungry. Every packet (up to about 1500 bytes) which travels on the Net typically passes through 8 to 20 routers in the 0.1 to 0.3 seconds packets typically take to reach their destinations anywhere in the world. Each router needs to classify the packet in this way, so the process must be very fast (ideally several hundred million packets a second). Ideally, the electronics for this should be simple, inexpensive and draw little power. The highly flexible architectures used in high-end routers are expensive and power hungry. They are complex, so they can perform a vast range of packet analysis functions very rapidly. These functions are often needed in computer networking in general, but for the task of forwarding packets in the core of the Internet, a much smaller set of capabilities is required, but for a much larger number of routes than is required in almost any other area of computer networking. It is difficult or impossible to expand these routers to cope with a number of routes beyond what they were designed to handle. This is perhaps the most difficult aspect of the "crisis in routing".

Direct RAM-based lookup IP forwarding for IPv4 and IPv6, complemented by LISP

Contents

Overview