Direct RAM-based lookup IP forwarding for IPv4 and IPv6, complemented by LISP
Note
2007-06-16: The information and links below are likely to be of
value, but the actual proposal in the Internet Draft below has been
largely superseded by a new proposal I am currently formulating, called Ivip (Internet Vastly Improved Plumbing).
To the Ivip page
The
rest of this page does not reflect the Ivip proposal, and concentrates
on an earlier one which attempts to improve the FIB problem. It also
contains a lot of information on working groups, mailing lists, LISP, papers
on improving BGP (#BGP_improvements)etc. Also a link to discussions (#BGP_hunting_MRAI_disc)
about BGP's path hunting behavior, the Ivip is
intended to work within current BGP constraints and does not directly
concern the FIB of most DFZ routers. Ivip is based on some key aspects
of LISP but has an important difference which should
make it much easier to
incrementally deploy. I also propose a mobile
extension to Ivip,
which would work fine with LISP too. As I develop more ideas for Ivip,
it is developing in a very different way from LISP, so is best
considered as a separate proposal. Of course, anyone is free to
take ideas from Ivip and use them for any other purpose, including LISP.
A
simple Static (or maybe Dynamic) RAM architecture for routers to
optimise their ability to
route packets on
the Internet, despite the growth in the number of prefixes advertised
in the "global BGP routing table". This would enable "flat
routing
to a granularity of /24" for IPv4. I plan to update
this proposal to use DRAM as well as SRAM, and to extend it to use LISP . . . actually Ivip . . .
as well, and in other ways. See "Next step" below.
Please see the RRG and RAM list for latest discussions on these matters!
Robin Whittle rw@firstpr.com.au 2007-06-15
To the main IP
page
Contents
>>> Overview
>>> Next step for this proposal <<< Please read before seriously considering this proposal
>>> Latest version of my Internet Draft
>>> Where this proposal fits in the scheme of things
>>> News
>>> Updates
>> How do routers implement the FIB function?
>> Do modern high-end routers have enough RAM to flat route IPv4 to /24 granularity?
>> RLDRAM-II and DDR-II SRAM as alternatives to QDR-II SRAM
>>> Links
>> Papers concerning significant improvements to BGP
>> Discussion on BGP's path hunting behavior, the MRAI timer etc.
>>> Overview for non-specialists
Overview
There is an
overview for non-specialists at the end this page.
There
is a looming crisis in Internet routing and addressing. The
system is not
scaling well to the large number of ISPs, corporate and organisational
networks which are being connected. Unless something is changed,
more and more expensive routers will be required in the "core" of the
Internet, and for the border routers of all "multihomed"
organisations: those with "Autonomous System" numbers and with with two
or more connections to upstream ISPs. These routers are in the
"Default Free Zone" (DFZ) and need to be aware of most, or all, of the
200,000 or so separately connected segments of address space.
There are growing
concerns about how far this process can continue without severe
cost
burdens and technical limitations which would degrade the reliability
of the Internet.
I
suggest that the next generation of routers for the DFZ be equipped
with enough RAM (such as one or two chips of high-speed Static RAM) at
each interface (or each forwarding engine) so they can do a direct
lookup in a single memory cycle (4 nanoseconds) to determine which
interface to route an IPv4 packet to, on /24 boundaries. This
requires 14.5 million memory locations.
By 20 April 2007, I became aware (see
#Already_enough_RAM? that the high-end routers of Cisco and Juniper (CRS-1, M120 and MX960) all use an ASIC and RAM for their FIB, not TCAM.
Perhaps
these routers already have enough RAM to handle 14.5 million /24s in
the way I describe - so my proposal may translate into microcode and
operating system changes, rather than hardware redesign.
On or two SRAM chips and the current /24 limit on BGP advertisements would enable
flat
routing to /24 granularity, with millions of separately connected
prefixes (meaning a million or so AS end-users with Provider
Independent addresses) if the BGP system could cope with the increased number of routes, (probably with a mechanism to discourage too-frequent changes). This would enable a
vastly better
utilisation of IPv4 address space.
I also suggest similar
techniques for IPv6, with a granularity and restricted address range to be decided. For instance, 2 million
/35s could be mapped, initially in a /14. This uses spare space
in the SRAM chip, and can be extended in the future to a /13 or /12 as
memory becomes cheaper and as IPv6 adoption increases.
This
is not the first suggestion of directly applying the top 24 bits if
IPv4 destination addresses as an address into RAM. Nick McKeown and
colleagues made an elegant suggestion for this in 1998 (
#Nick_McKeown) using DRAM. Now, large enough SRAMs are available - which operate 8 or so times faster than DRAM. (But see "
Next step"
below - we probably need 16 bits of data to specify the output
interface, rather than the 4 or 8 bits I assume in the Internet Draft.)
I am also
contemplating a version of LISP to complement this SRAM-based flat
routing (to a certain granularity) proposal, since LISP can provide
multihoming and traffic engineering, down to a single IP address. I
think the combination of RAM-based forwarding and LISP would
complement each other very well. This is intended as a good
long-term solution to the crisis in routing and addressing with the
following benefits:
- Requires new router functionality in the
next generation of transit and border routers, in a timeframe which
matches their natural replacement cycle. The extra hardware and
software required should be modest, so the cost of these routers should
not be vastly increased by this proposal.
- LISP
functionality will, in theory, be needed only in each edge network's
border router(s), but could be implemented in internal routers
as well as, or in place of, at the border router(s) to spread the
load of LISP processing.
- Does not require any changes to host operating systems or applications.
- Is
intended to preserve, for users, the current functionality of IP
addresses without them or their software needing to know which ones are routed via the
BGP system and which are handled by LISP.
- Structure
the DNS arrangements for LISP 2 so each IP address can have its mapping
securely controlled by a different user. This means each
LISP-mapped IP address would probably have its own domain.
Next step for this proposal
This
Internet Draft was made on the basis of several assumptions which I now
recognise as being false, so I plan to rework this proposal in the
future. The basic aim will remain - to make the BGP routing
system "flat" to /24 granularity for IPv4 and for IPv6 with some
granularity and range to be determined. (This is not at all
popular amongst RAM list members, in part due to the difficulties it
would present for BGP, which is a legitimate concern.) I
also want to integrate LISP or something similar, to make the most of
both techniques, with particular focus on the practicalities of
implementing LISP's rapidly changing FIB efficiently in the router.
The assumptions were:
- That it was sufficient to store 4 or 8 bits as
the result of the lookup. Although this is true in principle, for
flexibility and speed (no further processing of the stored result) many
routers use 16 bit values for "next hop" or the internal destination
within the router. Some routers may, for generality, store even
more than this, to specify how the packet's header is to be rewritten
etc. But that is overkill in the case of the main body of
Internet traffic, since the packets do not have their headers
rewritten. This means that practical implementations will need,
for instance, 224 16 bit words = 32 Megabytes. My
initial proposal was for 4 bits (small routers) being a single QDR-II
SRAM with 8 Meg x 8 (or 9) bits, or two such chips for 8 bit results.
One
way of implementing the memory required for 16 bit results would
be to use four SRAMs, but it seems that RLDRAM may also be sufficient,
as discussed in the next point.
- I assumed that only
SRAM would be fast enough for high speed routers. Yet Cisco uses
DRAM - presumably RLDRAM - for its 40Gbps CRS-1 Modular Services Card.
I can't be sure how much they use, but it is probably 1 or 2
Gbytes. According to the later version of #Vargese_CRS_1
it is more favourable to use RLDRAM and duplicate the data four times
in four banks of RLDRAM, to compensate for RLDRAM's slow total cycle
rate (primarily the need to restore internal amplifiers and sense lines
in the chip before another read cycle can begin), than to use 1/4 the
memory for a single copy of the data in QDR-II SRAM. This seems
rather odd to me, but if Cisco do it, there must be good arguments for
it.
So perhaps four times the amount of RLDRAM is sufficient.
For 16 bit results, this means 128 Megabytes. This is a
small fraction of the memory used for route lookups in the CRS-1,
although part of the demand for memory in that system may be from using
a longer results record, to encode further information than which
internal interface to send the packet to.
- I
was trying to use this flat routing (to a certain granularity) model to
solve all problems. However, LISP or something like it has major
advantages since it can, in principle, map individual IP addresses for
multihoming and with traffic engineering.
A future version of this proposal will consider more carefully how LISP functions can be integrated into border routers.
- I
assumed that current routers did not have enough RAM to perform the
direct lookup mapping I am proposing - but perhaps they do. The
CRS-1 has a lot of memory (1 Gigabyte at least) - more than enough for
my proposal. According to #Vargese_CRS_1
(page 120 of the final version) the tree-bitmap algorithm requires
about 16 bytes per prefix for the search structures. This is an
approximate ("marketing") figure for real IPv4 routes or for random
/32s. This is not counting four copies of the data to allow
interleaved access to cope with RLDRAM's poor cycle time.
I
doubt that I would be able to show that current routers are capable of
what I am proposing, but perhaps they would be capable of it, with
suitable firmware. Current designs are optimised for a wide range
of prefix lengths. I am proposing a system where most of the
prefixes, and a substantial fraction of the address space, would be
divided as /20 to /24. When designing for that situation, and
knowing that the very few prefixes longer than /24 have low traffic and
can safely be handed off to another system involving one or more extra
memory lookups, it makes no sense to implement the complex, memory
inefficient, multi-memory-access algorithms such as tree-bitmap.
The simplest approach is to eliminate the search algorithm and
use SRAM on the ASIC to map IPv4 unicast addresses to 224 linear lists
of results in external memory. This is a one-access system which
couldn't be faster or simpler. Each list requires 216 results, such as 16 bits. This is 128kbytes * 224 = 56 Megabytes.
I also want to think about:
- Where
will the LISP mapped IP addresses come from? Will they be little
chunks taken out of currently routed BGP chunks? (If so, this will
further fragment the BGP routing table.)
- How to enable
individual users to fully control the LISP mapping of their single IP
address. For instance, using a DNS system where they alone
control a particular domain, which is specific to their address, and
which is the master controller for LISP mapping, as in LISP 2.
So
a future version of this proposal will probably be called something
like "Address management for flat routing to defined granularity,
combined with LISP". It will be based on working towards a global
routing system (currently using BGP) which is capable of supporting
millions of relatively small divisions, meaning that there would be
many routes of /20, /21, /22, /23 and /24 - to help use the address
space very efficiently.
Since we are now talking about a major
change to the Internet's architecture, I think we should aim for a
system which can efficiently be implemented in hardware and software.
Otherwise, like the massive number of address bits in IPv6, we
would be specifying a system which will be impossible, or inordinately
costly (to the tune of billions of dollars) to build and maintain
indefinitely.
My proposals now and in the future are dedicated to:
- Maintaining
the current functions of IP addresses as far as host operating systems
and application programs are concerned. (Although within a restricted
range for IPv6.)
- Ensuring that the IPv4 space can be
divided up freely and finely, without regard to topology, down to a /24
granularity. This will place a greater load on the BGP system,
which I think can be lessened with LISP's assistance and by
administratively or technically reducing the incidence of unnecessarily
frequent changing advertisements . I believe it will be a lot
easier to split the address space finely, put more CPU power and RAM in
routers and work to optimise BGP (or better still, develop a
replacement with less storage requirements, better routing flexibility
and better stability) than it will be to run out of IPv4 address space,
or to rely on LISP to solve all our problems whilst accepting the
current limitations of the BGP routing system and its consequent
requirement for large, often poorly utilised, address allocations.
I
think many people are resistant to the idea that space will be divided
so finely - particularly their space, since this will be taken away
from them and they won't have their current large blocks in an unbroken
fashion any more. But this must happen, as far as I can see, when
IPv4 address space runs out, either to allocate some of this space to
new AS users for BGP routing or to allocate it for LISP mapping.
The alternative is to force newcomers to IPv6, which I think is
not going to happen and which I think would be unfair while large
amounts of IPv4 address space are under-utilised.
In short, I
believe we need to plan for a "flat to /24" global routing system in
the long term, because this is achievable (no-one has convinced me BGP
can't be significantly improved and/or better used) and because it is
the only way to give new users access to the precious IPv4 address
space. Therefore, long-term LISP plans should be aimed at doing
things such a "flat to /24" global system can't do, rather than the
current approach of burdening that system with a much larger amount of
LISP encapsulated packets. At the same time, we need to plan IPv6
address management and BGP prefix admittance limits so that IPv6 will
be routable by efficient means, rather than more and more TCAM or the
complex, memory-hungry, multi-memory-access techniques such as
tree-bitmap.
Latest version of my Internet Draft
Please also
the
Next step ,
Updates and other sections below. This I-D proposes purely
SRAM-based forwarding, but DRAM may be used as well. I am also working on complementing this
with some form of LISP, but the I-D does not mention this. The
title is probably somewhat misleading, since for IPv4, my proposal only
eliminates the need for route aggregation to a granularity of /24 prefixes.
Ideally, the Internet would be "flat routed" to the granularity
of an individual IP address, rather than to the /24 granularity of
chunks of 256 IP addresses, but that would involve every router knowing
4 billion routes, which will probably never be feasible.
This
HTML version is convenient since the references contain
hyperlinks:
The plain text version, at the IETF site and
the HTMLised text version there, does not contain hyperlinks or
even the URLs
of the references:
Please
let me know of any corrections and improvements you can suggest.
I will post a changelog here for versions after 01. The
best place to discuss this proposal and other approaches to solving
routing and addressing problems is the RAM list, as linked to below:
#RAM .
Internet Drafts are works in progress. This one is "experimental" - to stimulate discussion. It is
not currently intended to become an RFC (a lasting IETF document).
Where this proposal fits in the scheme of things
This section will only make sense to someone who is familiar with the discussions on the
RAM list and has at least a rough understanding of the SRAM-forwarding part of my proposal.
This
I-D proposes several things - router functionality, address assignment
policy and BGP management policy - which would enable "flat routing" to
/24 granularity in IPv4 and flat routing to some yet to be determined
granularity in IPv6, within a carefully chosen, restricted
address range. Overall, I am thinking the best solution would be
to
combine direct lookup RAM table forwarding with some form of LISP (the most-discussed of the Location / Identification splitting protocols).
I
intend this be the basis of a complete proposal which will serve the
Internet well for the next 20+ years (with expansion of IPv6 SRAM
capabilities in the mid-2010s). However, this is a big problem,
being worked on by dozens or hundreds of bright people with more
experience than I have. At least, I hope, this work will generate some ideas
which turn out to have lasting value.
So
far, there has been a general lack of interest
in the SRAM proposal, in part due to the way that it would lead to
further burdens on the control-plane of the BGP system: more storage,
more CPU activity, more inter-router communications and (unless
something is done to suppress rapid changes in advertisements) more
instability. I believe there is scope for routers coping with
more BGP traffic, storage and processing than at present, and I believe
that the main burden of BGP at present - most of the frequent updates -
can be greatly reduced. So I think it is worth working on BGP and
installing SRAM in DFZ routers, because of the great improvements it
would bring to the core of the Internet and to our ability to finely
divide and efficiently use the IPv4 address space.
LISP
2 or 3 can extend to any granularity, such as to single IP addresses.
It places a moderate extra burden on the global routing system in
the form of extending the packet length with a new header, and
with the query-response traffic as LISP routers query a central
system to gain mapping information.
LISP
provides multihoming
and traffic engineering (TE) without the serious burden on
the main BGP system of larger FIB, larger RIB, more routes
and more
frequent route updates - all of which would be required if the
same multihoming and TE goals were to be achieved with current
BGP-based techniques.
LISP places serious burdens on
whichever routers perform the LISP encapsulation (and to a lesser
extent, decapsulation) within edge networks and/or at
the borders
of edge networks. These must convert between ordinary IP packets
for LISP-mapped addresses and their LISP-encapsulated form. This
means they need to a lot more packet classification and that they need
substantial caches of mapping data, a well as performing encapsulation
and decapsulation. LISP 2 or 3 also requires a central system to
respond to queries about LISP mapping.
My aim is to use RAM forwarding to handle a significant part of the
problem which LISP is currently being proposed for. In particular, I want the RAM-forwarding system to
enable a larger number of edge networks to be connected and to allow a
much finer breakup of the IPv4 address space than is currently possible, in order that it can be used much more efficiently - without the traditional inefficiencies caused by the hierarchical assignment of addresses, to preserve route aggregation:
RFC1715 and
RFC3193 due to the requirement of reducing the total number of BGP, RIB and FIB routes.
Then, a smaller
proportion of traffic will carry LISP headers than is the case with
current pure-LISP proposals. Likewise, a smaller number of routers
will be burdened by a smaller volume of LISP operations.
I
am trying to find the minimal arrangement for making RAM-based
forwarding and LISP work - the minimal number of routers which need to
be upgraded:
- Transit routers. SRAM-forwarding (flat routing to /24 granularity for IPv4). All transit routers are in the DFZ.
- Multihomed
border routers. SRAM-forwarding and LISP functions, unless the
edge network places all its LISP routers inside its network. All
multihomed border routers are in the DFZ.
- Single-homed
border routers. LISP functions, unless the edge network places
all its LISP routers inside its network. Single-homed
border routers are not in the DFZ and don't need an
SRAM-based FIB.
Large
edge networks where the LISP load is too great to place on the border
routers will also need to upgrade some of their internal routers:
- Routers within the edge networks.
These do not need SRAM forwarding. This would be optional - I am trying to
find a scheme which, for smaller edge networks at least (the ones which
are the hardest to encourage changes in), involves changing only the border router(s).
This
is on the assumption that it is not possible for an edge network to
access LISP-mapped IP addresses unless it has one or more LISP routers.
For this to be possible, it would be necessary to use BGP controlled
edge and transit routers to route every LISP-mapped subnet of
addresses to a LISP router in the core of the Internet, or perhaps
in some other edge network. But that would seem to place a
huge burden on the BGP network in terms of adding greatly to the number
of advertised prefixes.
Here is how my proposal compares with some others being discussed on the RAM list.
- Replacing
the current notion of IP addresses with another incompatible entity, which is
purely an identifier and cannot be stored or regarded as an IP address. This would require major changes to host
operating systems, and probably to host programs. (Perhaps all
host programs would need to be rewritten. This will never
happen.)
LISP does not fall into this category. Neither
does my RAM-based forwarding proposal on its own, or with the form of
LISP I suggest might complement it. Host operating
systems and application programs will use IPv4 and IPv6 addresses
exactly as they do now.
- The
IETF tradition of not mandating
(or perhaps even suggesting) how anything should be implemented.
My suggestion does not mandate how routers implement the new
functionality. It is possible that some current generation
routers, such as the Juniper M120 and MX960 and the Cisco CRS-1 already
have enough RAM to
implement my RAM-based forwarding proposal. If so, they would
require new software to do so. They already have 2 to 4
times the memory I suggest, but they may use that for other purposes in
addition to the FIB function. At least it shows that the cost of
the memory I am proposing is unlikely to be a serious obstacle.
My proposal involves careful alignment of address management policy
with a particular set of router capabilities, for which I suggest there
is at least one economical, elegant and power-efficient solution. My
proposal would involve broad agreement beyond the IETF, including
network operators, RIRs and router vendors.
- Incremental
adoption. Most proposals fail the test of providing benefits
to early adopters: Does an edge network (an AS network of an ISP or
non-ISP organisation) which implements the change derive
sufficient benefit, or any benefit at all, until large numbers -
perhaps all, or almost all edge networks - also implement the
changes? My proposal fails this test too, but I suggest that it
would not involve major expense. I believe it promises a
good and perhaps the best set of outcomes possible without changing
hosts or most of the routers in edge networks. This promise
eventuates over a long time frame - to 2012 to 2014 or so - but is
concrete in the near term in that when purchasing new routers, the
networks will be reasonably confident that the new routers will last for 5 years or more.
My suggestion only makes sense if there is
broad agreement to update all DFZ routers (or perhaps all single-homed
edge routers as well, if LISP functionality is to be built into them
too) to meet a new standard of capability. The timeframe for my
proposal is within the natural replacement cycle of routers - say to be
completed in 2012 to 2014. In order that all edge networks
continue to have full connectivity, it requires complete update of
these routers, after which a significant proportion of IP addresses
could be mapped to LISP and taken out of the BGP system. The
impetus for doing this would be multihoming and traffic engineering
which would not be possible while they remained in the BGP system.
Edge
networks which fail to upgrade their 2007 routers will face problems no
matter what new protocols or router standards are adopted by other
networks. While this proposal does not provide incremental
benefits, the idea is that the cost differential for the new generation
of routers will not be so high, and that with broad agreement the new
routers could be in place in time to resolve the major problems of
routing and addressing.
- Most proposals discussed on the RAM
list involve a new protocol, tunneling etc. based on separating the
concept of Identifier and Locator, which are currently both performed
by a single IP address. Many of the bolder proposals involve a
new form of identifier, which would require major changes to host
operating systems and probably host programs - but LISP does not
require any host changes. Some IP addresses will be globally
routable locators for LISP routers, and others will not be globally
routable, but will be used within edge networks like ordinary IP
addresses, with packets being sent to those addresses to be intercepted
by a LISP router, encapsulated and tunneled to the appropriate LISP
router (with a globally routable Location IP address) near their
destination. There, the header is removed, leaving an ordinary IP
packet, which is routed normally within the edge network to the
destination.
I
am trying to figure out a form of LISP 2 or
3 (lookups by DNS or something similar) which would nicely complement
the RAM-forwarding proposal. So I am not interested in LISP 1.5
- and LISP 1.0 has no practical benefits. I want to use DNS or
something similar so that individual end-users can control the
LISP mapping of their IP address(es).
News
2007-03-31 to
2007-04-02 I wrote about version 01 it to a bunch of people, primarily those who
attended the IAB Workshop on Routing and Addressing in Amsterdam (2006
October - see the links section). Joined the RRG and RAM list - and posted to the lists about this Internet Draft.
2007-04-03 The draft is not being discussed much on the RAM list:
#RAM . Most attention is focused on LISP. I have been corresponding with some members of the list.
2007-04-17... 21
Added updates, "Where this proposal fits in the scheme of
things", new links and a new page on how various routers implement
their FIB.
2007-4-23 After reading the Tree-Bitmap paper (
#Vargese_CRS_1) and thinking about the CRS-1, I wrote a new section
Next step.
2007-5-31
Added a new section on links to papers concerning significant
improvements to BGP. (Various link and BGP paper updates since then.)
Updates
Here are some points in addition to what is in the current version of the I-D - but be sure to read
Next step too .
How do routers implement the FIB function?
There is a separate page
router-fib/ where I discuss how the FIB is implemented in various routers. In summary:
Juniper M120 | ASIC and RLDRAM (64MB) |
Juniper MX960 | ASIC and SRAM? |
Cisco Catalyst 6500 | TCAM |
Cisco CRS-1 Tree-Bitmap algorithm - see #Vargese_CRS_1. | ASIC and RLDRAM?
|
Force-10 | TCAM |
Most high end routers according to Prof. Zhi-Li Zhang | DRAM and SRAM with modifications of multi-bit tries |
This
page also leads to some pictures from IBM, of the Silicon Packet
Processor chip which they built for Cisco - the largest ASIC ever made
at that time (perhaps still true in mid 2007), with 188 independent 32
bit CPUs.
Do modern high-end routers have enough RAM to flat route IPv4 to /24 granularity?
If
(as indicated by the above section and the page it links to) many
of the current generation of high-end routers generally use RAM-based
techniques, I wonder if they
already have enough RAM and the required straightforward indexing ability in their ASICs to implement what I am suggesting in my I-D.
The Tree-Bitmap algorithm is used in the CRS-1:
#Vargese_CRS_1.
RLDRAM-II and DDR-II SRAM as alternatives to QDR-II SRAM
An article by
Anuj Chakrapani at
NetworkSystemsDesignLine
compares the suitability of three families of memory for functions such
as table lookup (FIB) in networking equipment. Here is my summary
of their characteristics. They all have about the same raw cycle
time of 250 (or sometimes a little more) clock cycles per microsecond, with two 8 or 9 bit
reads/writes per clock cycle.
Memory technology | Max Mbit early 2007 | Separate inputs & outputs | Latency | Random addresses | Direct mapped FIB or recursive tree / trie based lookup |
QDR-II SRAM | 72 | Yes | Low | No problem. | Ideal. |
DDR-II SRAM | 72 | No | Low | No problem. | Ideal if it is OK to share the data in and out signals on the same set of pins. |
RLDRAM-II | 576 | Both versions available | High | Controller
must schedule addresses to suit internal structure of chip - 8 blocks,
each of which has a long recovery time after the previous cycle. | Problematic
to schedule the addresses to meet the chip requirements, but chips are 8 or 16 times denser and ~1/4 the price per bit. |
Quad Data Rate SRAM:
www.qdrsram.comDouble Data Rate SRAM:
www.necel.com/memory/en/products/sram/ddr-72m.htmlReduced Latency DRAM:
www.micron.com/products/dram/rldram/ Juniper using Micron RLDRAM:
www.micron.com/applications/networking/juniper.aspx ,
download.micron.com/pdf/case_study/juniper.pdf ,
Links
RAM List - Routing and Addressing Mailing List
IAB
Workshop on Routing and Addressing
RADIR - Routing and Addressing Directorate
LISP - Locator/ID Separation Protocol - draft-farinacci-lisp
RRG - IRTF Routing Research Group
There is a
wiki, as noted above, which links to many other things beyond what you see here. The mailing list archives are:
psg.com/lists/rrg/ . An Internet Draft from this group, started in April 2007, is "Design Goals for Scalable Internet Routing":
tools.ietf.org/html/draft-irtf-rrg-design-goals . There was a sub-group on BGP stability but the best list to discuss BGP is the IDR list, mentioned next.
IDR - Inter-Domain Routing - Working Group
The
IETF's IDR WG is responsible for BGP, the Border Gateway Protocol.
The IDR WG's charter and list of active RFCs, Internet Drafts
etc. is
tools.ietf.org/wg/idr/ and
www.ietf.org/html.charters/idr-charter.html . Mailing list archives are at:
www1.ietf.org/mail-archive/web/idr/current/ .
BGP
is a carefully honed protocol by which tens of thousands of routers
share information to enable them to decide how best to forward packets
to their destination, where there are now over 215,000 separately
advertised prefixes in the global IPv4 BGP system, each of them
requiring a potentially different forwarding path in any single router.
BGP's messages are between peer routers and involve intentionally
reduced information about each router's "best path" for a given prefix.
This is sufficient to robustly protect the system against routing
loops, but too brief for the routers to make well informed decisions
about whether a failure which caused one "best path" to be usable would
also affect another "best path", reported by another peer, which the
router is considering to adopt as its "best path".
The
limitations of BGP and of routers' ability to handle its workload, with
more and more prefixes and with some routers having to maintain a
record of all "best paths" of each of potentially dozens of peers, sets
a practical upper limit on the number of prefixes which can be
advertised in the BGP system. Since there are more and more ISPs
and AS-end-users wanting to advertise more and more prefixes, for their
legitimate needs of having portable IP addresses, multihoming for
robust connections to the Net and for Traffic Engineering, either
routers and/or BGP will have to drastically improve their ability to
scale to large numbers of prefixes or some new routing and addressing
architecture needs to be developed so the needs of ISPs and many
end-users can be accommodated without burdening the BGP system with
more and more prefixes.
Incremental improvements are being
developed and considered for BGP. Some are considered in this IDR
mailing list. Other more experimental approaches are discussed in
the RRG Routing Research Group list or the RAM list.
The BGP
protocol is complex. Each router's behavior is complex and
depends on local policy (rules about where to send or receive traffic
for various prefixes), the network topology and the behavior of its
peers. The behavior of the whole BGP routing system is far more
complex still, and some of its behaviors are not predictable or
intended.
IETF Routing Area Work Groups
RiNG
www.ist-ring.org EC part-funded think-tank on the routing problem. They have a good reading list
wiki.ist-ring.org with various papers, including some on improvements to BGP and on compact routing.
Beyond BGP
www.beyondbgp.net "BGP is generally thought to be
reaching the end of its useful lifetime, although this has
not been validated by analysis or measurements . . . Due to the lack
of a shared understanding of the problem and lack of sufficient
data and analysis, there is no consensus on where/when BGP
collapses and what (if anything) should be done.". No apparent activity since 2005.
Clean Slate Design for the Internet
HLP - critique of and potential replacement for BGP
Joel Jaeggli's Nanog 39 BOF on Pushing the FIB limits
Apricot presentations on routing and addressing
Nick McKeown's papers
Nick McKeown of Stanford
tiny-tera.stanford.edu/~nickm/ wrote a paper in 1998, with Pankaj Gupta and Steven Lin:
Routing Lookups in Hardware at Memory Access Speeds. (
IEEE INFOCOM April 1998, Vol 3, pp. 1240-1247.)
This is the first (and so far only) paper I can find which
suggests direct use of the most significant 24 bits of the IPv4
destination address to drive RAM for a single cycle lookup of FEC.
They have an elegant solution for a second chip to handle those
few prefixes which are longer than 24 bits. They also generalise
that two-chip process to achieve reductions in total memory size for
three level arrangements, depending on the number of prefixes which are
needed of the various lengths.
So what I am suggesting in my
I-D is much the same as was suggested in 1998, except they suggested
using DRAM, and I am suggesting much faster QDR-II SRAM, which is was
not available in 1998.
A later paper
Algorithms for Packet Classification Pankaj Gupta and Nick McKeown
(
IEEE Network, March 2001.) doesn't mention this technique of directly mapping 24 bits as a RAM address.
Geoff
Huston's web sites
www.potaroo.net Geoff
Huston's main site, providing a wealth of analysis of the Internet in
general and real-time information on the routing and addressing system. He has a great
Introduction to BGP - the Protocol. A June 2007 article on
Damping BGP
concerns the statistics of updates and problems encountered trying
to reduce stability. See also Geoff's Internet Draft with Tony Li on helping to solve these problems
#BGP_stability. Also his report with many references, regarding
the October 2006 workshop on routing and addressing:
www.potaroo.net/ispcol/2006-11/raw.html .
Prof. Zhi-Li Zhang's Fundamentals of Advanced Networking
University of Minnesota
reading list
with many of the papers archived on the site. Of particular
interest is his PowerPoint lecture 14, which I link to and quote in the
router-fib/#Zhang page.
Tree-Bitmap and the CRS-1 - George Varghese and colleagues
George Varghese's site is
www-cse.ucsd.edu/~varghese/ . Some papers include:
He also writes at www-cse.ucsd.edu/users/varghese/research.html :
"This algorithm is being used
in Cisco Routers, and is currently the fastest and most compact IP
lookup algorithm known in the literature, allowing both software and
hardware implementations. Cisco has confirmed that the latest Cisco HFR
Router (officially known as CRS-1), the fastest and largest router in the
market, is using Tree Bitmap. TreeBitmap is also being proposed for
placement in Cisco IOS; if so it will be widely deployed across several
Cisco router platforms."
Page 3-86 of Cisco's IOS XR Troubleshooting manual mentions diagnostics for "tree bitmap hardware-related ingress and egress
information".
A much later and significantly updated version is:
Tree bitmap: hardware/software IP lookups with incremental updates
Source ACM SIGCOMM Computer Communication Review archive
Volume 34 , Issue 2 (April 2004) Pages: 97 - 122
Will Eatherton (Cisco Systems), George Varghese (UCSD), Zubin Dittia (Jibe Networks)
portal.acm.org/citation.cfm?id=997160
This version contemplates more recent memory technologies, including QDR SRAM.
On BGP instabilities and problems with flap-damping:
Why the Internet only just works
Search tool for selected IETF mailing lists
My
Feb-March 2007 attempt to estimate various aspects of the density of
hosts (the proportion of IP addresses which are actually in use) by use
of ping. Even allowing for the inaccuracies inherent in using
ping, since some or many hosts do not respond to ping) this survey
makes me think that there are low rates of utilisation of IPv4 IP
addresses in many parts of the 1.7 billion range of currently
advertised addresses. Several people on the RAM list are of the
view that ping gives little or no useful information about usage.
I will update this page to reflect these views.
Papers concerning significant improvements to BGP
Some
of these involve behavioral changes to routers, without changes in the
BGP protocol. Some involve changes to the protocol and to the
method of handling packets. Others may involve replacing BGP
with a rather different system. Maybe this can be done
incrementally.
Background on BGP includes Geoff Huston's Introduction to BGP mentioned
above and the RFC:
RFC 4271.
It would be best to discuss these and other proposals on the
RRG
list - or the
IDR
list for questions about current and past BGP behavior and near-term
improvements to BGP. Please suggest things to add to this list.
My
comments are primarily to promote interest in these proposals and
should not be regarded as well-informed critiques. BGP
is a very deep field and a the best place to get a well informed
critique of any of these would be to participate in the
RRG or
IDR discussions.
BGP Stability Improvements
Tony Li and Geoff Huston June 13 2007
I
found this to be very helpful, especially the explanation of how
changes propagate in the network and how "path hunting" takes place,
often causing a flurry of sub-optimal updates before finally settling
down. But see a fuller discussion
#BGP_hunting_MRAI_disc for the real lowdown on "path hunting" the various behaviors of the MRAI timer etc.
Tony Li (of Cisco, and with Lixia Zhang chair of the
Routing Research Group) and Geoff Huston suggests some improvements to filtering changes
to reduce the propagation of updates which are generally unhelpful.
He also suggests the goal should not be rapid convergence
to absolutely optimal routing, but to a workable path so
connectivity is restored.
I
think these proposals are promising. BGP is a very difficult and subtle
field. I encourage anyone who is really interested in this to
follow the above link and read all the discussions on the RRG and IDR
mailing lists.
- Path
length damping. When updates cause a router to decide that the
best path for a prefix is longer than it was before, this is a good
sign that path hunting is taking place, so it might be best not to pass
the
resulting update on immediately, but set a timer and wait a while to
allow multiple other updates to arrive before computing a new best
path, which may involve a change and sending out another update.
A likely outcome of this is that the final message would be a
withdrawal - and it is better to send this than to send out longer
paths and then a withdrawal. Optimal
Path Hysteresis or Optimal Path Length Hysteresis.
- Delayed
Best Path Selection. Only compute a new best path after a timer
expires, where the timer is reset every time an update arrives for the
prefix in question.
BGP-RCN: Improving Convergence through Root Cause Notification
Dan
Pei, Matt Azuma, Daniel Massey, and Lixia Zhang. 2005
The
authors define a theoretical protocol SPVP to represent BGP, and then
add their RCN algorithm to it. They write that the RCN approach
is deployable in the current infrastructure.
The idea is
that updates which result from a link failure will have extra
attributes to this effect. This enables a distant router to know
that certain paths are no longer working, and so to avoid using
alternative paths which have the faulty subset within them. I
guess this would work across multiple prefixes, not just within one
prefix.
The information would piggyback on existing
updates and include a sequence number. The sequence number would enable
routers to cope with situations such as:
"For
example, after a link flapping (i.e. a link goes down and then comes up
quickly), the link down notification might arrive at a node later than
the path announced in the link up event. Therefore, when the link
down notification arrives, the node can mistakenly remove a valid path
that announced in the link up event. SPVP-RCN solves this problem
by letting each node v maintain a sequence number, t(v), which is
incremented by 1 upon each change of node v s route to prefix p."
I
have only scanned this complex paper. I think it is worthy of
further consideration. The authors' simulation resulted in vastly
improved responses to a failure of a link:
"In
the simulation of a 110-AS Internet-derived topology, when a
destination becomes unreachable (i.e. a Tdown event defined later),
BGP-RCN reduces the convergence time from 715.3 s to 1.3 s, and reduces
the number of BGP update messages from 30,483 to 926. When a
destination becomes reachable through a longer path (i.e. a Tlong event
defined later), in 95% of the cases BGP converges in 56 s or less, and
BGP-RCN reduces this time to 32s."
HLP: A Next-generation Interdomain Routing Protocol
Lakshminarayanan Subramanian et al. 2005
This
is an ambitious proposal was prompted by earlier work of the RRG, and
is intended to be a practical upgrade for the current BGP system.
This
is a hierarchical system in which some border routers, for instance,
may be part of two or more hierarchies. Hierarchies use link-state
routing, for instance between a major ISP and its various customers,
which may be other ISPs and AS end-users. This is a complex
system and I have only read the first few pages of the paper. I
won't try to summarise HLP - please read the Introduction and Design
Rationale.
Scaling IP Routing with the Core Router-Integrated Overlay
Xinyang Zhang et al. 2006
www.ieee-icnp.org/2006/papers/s4a4.pdf
By
tunneling (for instance with MPLS) traffic to a distant router, with
that tunnel containing traffic for multiple prefixes, a factor of 100
reduction in BGP global routing table size can theoretically be
achieved, with very little increase in overall path length.
Differentiated
BGP Update Processing for Improved Routing Convergence
Wei Sun,
Zhuoqing Morley Mao and Kang G. Shin 2006
Without
changing the BGP protocol or network infrastructure, significant
improvements can be gained by algorithms which control the timing of
updates according to various priorities. "30% fewer updates and
80% less convergence time."
Papers by Olivier Bonaventure
Reconsidering the Internet Routing Architecture
Limiting Path Exploration in BGP (AKA Path Hunting)
Jaideep Chandrashekar, Zhenhai Duan, Zhi-Li Zhang and Jeff Krasky 2005
www.cs.fsu.edu/~duan/publications/epic_infocom05.pdfThe
authors propose an enhanced path vector algorithm EPIC, using forward
edge sequence numbers. BGP messages contain too little
information for a router to make well informed decisions about
alternative routes. See also "Quantifying Path Exploration in the
Internet" mentioned below.
A June 2007 paper on BGP update statistics, stability etc. from Geoff Huston
Damping BGP
Damping BGP
concerns the statistics of updates and problems encountered trying
to reduce stability. This has really interesting analysis of the
timing of BGP updates. See also Tony Li's I-D on BGP Stability Improvements mentioned above.
A discussion about BGP path hunting, the MRAI timer and how the spec for this changed
This discussion began on the
RRG list on 2007 June 20 (
link to approx. start of discussions) and then moved to the
IDR list on 2007 June 21 (
link to approx start of discussions).
Tony Li defines "path hunting" in
www1.ietf.org/mail-archive/web/idr/current/msg02422.html
as what I described as the behavior of a small system of BGP routers
where a withdrawal of one prefix causes the system to produce firstly a
longer "best path" announcement, within a fraction of a second or so,
followed by about 30 seconds later (or whatever the MRAI timer value
is) a withdrawal announcement. This effect can be "amplified" by
passing through further such structures. Timing differences cause
more diversity and complications too.
"Path
hunting" is believed to be a
major contributor to the burden placed on BGP routers. Read these
discussions and find out more about how changes to this MRAI timer in
the BGP RFC4271, which apparently reflect the behavior of routers
around 2002 or so, seem to be one of the major causes of "path
hunting", also known as "path exploration".
This
paper looks at how many packets are actually delivered while the
routing system converges on a new set of paths after an outage.
A Study of Packet Delivery Performance during Routing Convergence
Dan Pei, Lan Wang, Daniel Massey, S. Felix Wu and Lixia Zhang 2003
Another paper with recent measurements of BGP instability
Quantifying Path Exploration in the Internet
Ricardo Oliveira, Beichuan Zhang, Dan Pei, Rafit Izhak-Ratzin and Lixia Zhang 2006
Here are a few
wild and woolly ideas
of mine. There are no-doubt scaling and perhaps
robustness/security problems associated with them. Probably they
have already been considered and rejected. However, this is a
desperate situation, where we are planning to radically revise the
Internet's architecture due to the limitations of BGP. Maybe for
the future routing system some of these ideas would be worthwhile.
If not, at least they are mentioned here with a note about their
scaling problems.
My intention is to put each router in a better position
to choose
whether to adopt a new path, either by ensuring the path actually works
or by fine-tuning the choice between multiple possible paths by probing
the routers in each path to determine something like how much
congestion there is on that path. The trouble with any such
proposals are that they add to the amount of state each router holds
and/or that they add to traffic and the need for other routers to
respond, in situations where instability would probably focus thousands
of similar probes on the one set of routers at the same time.
BGP
intentionally simplifies the amount of state each router holds.
Each path is not specified in terms of actual routers, but in the
numbers of the autonomous systems the packet passes through. This
may be a poorer measure of actual path length than if the path
specified the actual routers, but it is generally good enough, and
enables local policy to help decide the best path, while greatly
reducing the amount of data in each BGP message.
- Enable a router to know the IP address of the border router which advertises each subnet.
Plan
A: Attach this to the advertisement and propagate it to other routers
in updates. This has various problems, including (as best I
understand BGP) that this prefix may be handled by another update
covering a larger range of address space at other points in the network,
making it impractical to include border routers' IP addresses.
Another problem is that this propagation model involves delays
and a remote router may get different versions of the IP address when
the prefix is advertised at a new location, not necessarily in proper
time order. Timestamps might resolve that.
I
have been told that it would be quite easy to add this originating
border router IP address to BGP messages. Timestamps would be
trickier, though I guess it shouldn't be to hard to sync routers to an
NTP server. Once the BGP system settles down, the final update(s)
will contain the best path for a given router, and all best paths
reported by peers will in fact carry packets to the right border
router. My idea was that with an actual IP address, gained from
this or the next method, the router would be able to test connectivity
to the border router in real-time, rather than taking the work of peers, who took the word of their peers etc.
Plan B: use
something like in-addr.arpa to have a securely controlled, real-time
updated, distributed database by which any router could look up any
prefix to find the IP address of the border router which currently
advertises it.
- (This
idea has serious, probably prohibitive, scaling problems. As soon
as there is instability and a bunch of routers - say tens of thousands
of them - start to make fresh decisions, they ping the hapless border
router, which cannot necessarily cope, even when it is well connected.)
Based
on this, enable a router to check connectivity to a particular border
router, through one or more of its various interfaces. A simple
ping response would do the trick. A modified ping system may
involve the data of the ping packet requesting the border router
confirm that it is the one advertising this prefix. The reply
could be in the ICMP echo reply data. The border router might get
the response on one of its interfaces, with IP address X, but in fact
now be advertising the prefix on another interface with address Y.
The response could confirm the border router was advertising this
prefix and also list one or more IP addresses of the interface(s) which
should be used for it. Maybe the responses should be sent out of
all DFZ-connected interfaces for Justin - Just In Case.
- (This too has serious, probably prohibitive, scaling problems . . . ) Likewise,
use traceroute to the border router from a particular interface to
figure out which BGP routers are en-route via a path
which arises from using that interface. Then use some other kind
of probe-response mechanism to find out how congested they are.
This is probably impractical, since a router may have dozens of
interfaces and the upstream and downstream traffic on each could be
varying widely. To report on this for a particular path, the
router would need to look at incoming data rates on the egress
interface and outgoing for the exit interface, with their maximum data
rates and then the same for the other direction.
Overview for non-specialists
The
following is a brief account of a complex series of problems. If
you think this is overly dramatising the situation, please read the
report of the October 2006 IAB Workshop on Routing and Addressing,
linked to
below.
The
routers (the electronic
devices which handles packets of Internet data in a manner analogous
to post-office letter sorting offices) in the core of the
Internet, sometimes known as the "Default Free Zone" (DFZ) need to keep
up with a growing number of routes, also known as "BGP (Border Gateway
Protocol) advertised prefixes". When an ISP or the operator of
another network (a user organisation with an "Autonomous System")
connects their network to the Internet, they advertise through the BGP
routing system that their
router is the proper destination for one or more subsets of the
Internet's address space. There are 4 billion IP addresses in today's
"IPv4" Internet, with about 3 billion ultimately being available.
About 1.7 billion are currently advertised. Each subset is known as a
"prefix" -
for instance a /20 prefix involves the first 20 bits of the address
being fixed to a particular pattern of bits, and the other 12 varying -
covering, in this example, 2
12 (4096) IP addresses. This information is passed on to neighboring routers, which pass it on to their neighbours.
Each
new connection - each new BGP advertised prefix - means that tens of
thousands of DFZ routers need more computational and packet handling
resources to perform their task. Each router has a limit to how many
routes it can handle efficiently. This means that these routers need
to be upgraded or replaced with newer routers, in order to do their
job. All Internet users collectively
pay for these routers, since their ISPs collectively pay for
capital and operational costs of the core of the
Internet.
There are several serious problems arising from this:
- As
more and more ISPs, companies, universities, schools etc. connect to
the Internet, most of them want their own dedicated ranges of IP
addresses. They typically need to split the subset into
smaller subsets and advertise each one of them on a separate router
which connects to the wider Net via a different ISP. This is
primarily to achieve "multihoming" - independence from any one
ISP. So they use two or more ISPs, of their choice, and do not
need to change the IP addresses of their network when they change ISPs.
Most importantly, if one ISP fails, either due to technical failure for a few minutes or by
going out of business, the organisation can still maintain their
connection(s) to the Net via other ISPs. This need for
multihoming is the primary reason for the increase in the number of
BGP advertised prefixes.
- The increased
growth in the traffic of BGP routers passing on changes to the
advertised prefixes. This takes more bytes of inter-router
traffic, but is generally not prohibitive, since long distance links
are getting cheaper and faster.
- Problems in the routers
collectively reaching a stable state, because this only happens after a
complex set of interchanges prompted by each new advertised prefix.
This is primarily a problem with changes to advertised prefixes,
not with the total number of advertised prefixes.
- Router's
limited CPU and memory capacity to handle the BGP communications and to
configure their routing hardware according to the tens of thousands of
rules (generally each routing rule results from one advertised prefix)
they must consider when deciding which of their interfaces to send each
packet to. Memory capacity and CPU speeds are rapidly increasing,
so new models of routers should be able to cope with this.
- The
hardware "packet forwarding engine" which classifies each packet as it
arrives, determining which interface to send it to, based on its 32 bit
(for IPv4, or 128 bit for IPv6) destination address. This hardware is
expensive and power-hungry. Every packet (up to about 1500 bytes)
which travels on the Net typically passes through 8 to 20 routers in
the 0.1 to 0.3 seconds packets typically take to reach their
destinations anywhere in the world. Each router needs to classify
the packet in this way, so the process must be very fast (ideally
several hundred million packets a second). Ideally, the
electronics for this should be simple, inexpensive and draw little
power. The highly flexible architectures used in
high-end routers are expensive and power hungry. They are complex, so
they can perform a vast range of packet analysis functions very rapidly.
These functions are often needed in computer networking in general,
but for the task of forwarding packets in the core of the Internet, a
much smaller set of capabilities is required, but for a much larger
number of routes than is required in almost any other area of computer
networking. It is difficult or impossible to expand these routers to
cope with a number of routes beyond what they were designed to handle.
This is perhaps the most difficult aspect of the "crisis in
routing".
I propose a
new, simple, router hardware architecture based on powerful new memory
chips, purely to handle the main bulk of Internet traffic. This
will be an addition to the current, complex, hardware mechanisms, but
will mean that those mechanisms don't need to be expanded to handle the
continuing growth in the number of advertised prefixes. By
carefully aligning the router design with
address management policies for the whole Internet, I believe the
hardware forwarding problem can be solved elegantly and forever.
This requires coordination between router manufacturers and the
Internet Engineering Task Force.
I
am also working on combining the new proposed protocol "LISP" with
SRAM-based forwarding, to enable a finer control of IP addresses than
is possible with router hardware.
.