IPTM - Ivip's approach to
solving the problems with encapsulation overhead, MTU, fragmentation
and Path MTU Discovery
Robin Whittle
rw@firstpr.com.au 2010-01-28
The remainder of this page is unchanged from 2008-04-22:
Please see these
messages or discussions on the RRG list for things I will probably add
to this page soon:
psg.com/lists/rrg/2008/msg01207.html
(2008-04-22)
Discussion of how
my goals with IPTM differ significantly from what I understand are the
goals of SEAL.
Also, some thoughts about sending the ordinarily
encapsulated (IPv4-in-IPv4) packets into the tunnel with DF=0, in case
there is a PMTU limit there which is lower than the ITR expects,
enabling those packets to be fragmented and so most likely be delivered
to the ETR, which will defragment them, and decapsulate the original
packet.
psg.com/lists/rrg/2008/msg01208.html
(2008-04-22)
Discussion of the
ITR fragmenting IPv4 DF=0 traffic packets before encapsulation, and the
need to send Synthetic Probe packets while doing this to explore the
Real PMTU to the ETR, if there are no DF=1 packets to do this with.
Sending                                                      Destination
Host  ->--(R1)----(ITR2)==(R3)==>==(R4)===(ETR5)---(R6)---->-   Host
                      \_______Tunnel_________/
With Ivip, routers R3 & R4
in the ITR to ETR tunnel can't send
ICMP Packet Too Big (PTB) messages to the sending host in a form
which the
sending host recognises. With LISP, APT and TRRP,
the outer header's source
address is the ITR's address, so R3
and R4
can't send PTBs to the sending host - and it is highly
impractical for the ITR
to securely respond to R3's or R4's
PTB messages in a way
which enables the ITR to send a PTB
message to the sending host in a form
it will recognise.
So, if the encapsulated packet is
too big for any router (or other tunnel arrangement) in the
ITR --> ETR tunnel, then the packet will be dropped, without either
the ITR or the Sending Host knowing about it.
Introduction
- new proposal in April 2008
The
material on this page up until 21 April 2008 (from October 2007 and
updated somewhat to February 2008) was for a proposal which is now
obsolete. I have archived that material here
/archive-2008-04-20/ . That old
page may still be of interest - such as for how the Sending Host
(SH) would traceroute the routers in the tunneled portion of the
path. Some of that page's links to older discussions
might still be valuable, but the
new proposal below is what I am currently working on.
On 19
April 2008 I devised a completely different approach to handling these
problems. Below is a reasonably descriptive summary of the
proposal, which I will write up in an Internet Draft as
soon as I can, probably later in 2008.
For now, I am calling
the new proposal the same as the
old:
IPTM: ITR Probes Tunnel
MTU.
Part of this new proposal is a robust method of
probing the PMTU,
and possibly delivering a traffic packet at the same time. This
sub-protocol is called RPD2:
Robust PMTU Discovery and traffic Packet Delivery Protocol.
(Don't blame me for these acronyms - I need a term for these
things and I am trying not to reuse any existing IETF acronym!)
The
ITR doesn't do anything special with traffic packets
which, once encapsulated (the ENCAPS overhead with IPv4 Ivip is the
20 byte IP-in-IP header), are still short enough to be sent with confidence
to any ETR in the world. There would be some globally agreed
constant
MPMTU (Minimum Path
MTU) for this, such as
1200
bytes, enforced by a BCP (Best Current Practice recommendation) that
all ITRs and ETRs be placed at locations where they can exchange
packets of at least this length with routers in the DFZ.
The
task is how to handle traffic packets which are longer than this when
the ITR has no knowledge, or still has incomplete knowledge, of the
Real PMTU (Path Maximum
Transmission Unit) to the ETR. The Real PMTU is typically
some value which remains reasonably stable, due to a stable path of
routers, with their various individual MTU limits according to link
MTUs, any fancy tunneling arrangements for inter-router links etc.
However, the Real PMTU may vary over time, as the path changes to
other routers, as the routers change their behavior. Also, due to
routing vagaries, sometimes packets from one ITR to one ETR may be
sent by one path with one
Real PMTU and sometimes over another path with another Real PMTU.
These scenarios of flaky variable Real PMTU are a problem for any
system.  I intend the following proposal to cope reasonably well with
them, generally choosing lower limits and so limiting efficiency
somewhat, rather than causing packets to be lost without any
appropriate
PTB (Packet Too
Big) message being sent to the
SH
(Sending Host). However any path which, for instance, has a lower
PMTU for 1% of packets, is probably going to cause trouble for 1%
of packets which are longer than that PMTU.
The example below
assumes a single sending host, a single ITR and a single ETR.
There will usually be multiple ITRs sending packets to one ETR,
but each ITR is operating in isolation from other ITRs, and performs
the following IPTM and RPD2 processes with the ETR. The one
or more ITRs may be
sending encapsulated packets to the ETR because it handles the
micronet (EID prefix in LISP or APT terminology) of a single
DH (Destination Host), but more
likely, there will be multiple DHs in each micronet, and probably
multiple micronets handled by the one ETR. Any PMTU problems
between the ETR and the DH are outside the scope of this IPTM process.
They would be handled either by the ETR, or by the PMTU limiting
router between
the ETR and DH, sending a PTB to the SH.
For a given ETR, the
ITR
might be sending packets from a single SH, including from a single
application in that SH, which sends to a single DH which is currently
reachable via that ETR. However, there may be multiple
applications in that SH sending to that DH, or to other DHs in the same
micronet, or other DHs in other micronets which are currently mapped to
the same ETR. The ITR will often be accepting packets from
multiple SHs.
Relies only on RFC 1191, not on RFC 4821
The following proposal only assumes
ordinary, currently ubiquitous (I understand) sending host support for
RFC 1191 (1990) style PMTUD.
With this, it is vital that for any DH, the SH only get
PTBs with
MTU values which do not unreasonably restrict the packet lengths that
host will send to that DH. This is because RFC 1191 requires each
SH to send packets no longer than that value to that DH for the next 10
minutes. After that, the SH can try its luck with larger packets,
in an effort to discover whether the Real PMTU to that DH has increased
since it got the last PTB message.
I understand that most
hosts send IPv4 packets with DF=1 (Do not Fragment), as part of their
RFC 1191 approach to PMTUD. This is supported well by this IPTM
proposal. IPv6 packets are non-fragmentable as well. A
section below discusses how the ITR might handle IPv4 packets with DF=0.
RFC
4821 Packetization Layer
PMTUD is not required for this new IPTM - RPD2 process to work
fine.
However, RFC 4821 PMTUD should work fine over any map-encap
system (or any other tunneling system) which uses IPTM - RPD2.
RFC
4821 is a recent RFC (March 2007), and I understand from Fred Templin
that it has been implemented in the Linux kernel. I haven't
verified this, however I believe that the adoption rate of RFC 4821
will be slow at best. (Try Googling "RFC 4821 adoption"
etc.) It requires the OS to work with
applications which choose packet lengths and which are able to
determine the success or not of packets being sent to the application
at the other end - in the DHs. The applications are supposed to
report this back to the
OS on the success or otherwise of packets of different lengths being
sent to particular DHs. The OS is supposed to support the
applications by feeding back
an estimate of the PMTU to each DH, so each application can make a
choice about what
sized packets to send. (I think the TCP part of the stack counts
as an "application" in this context, because like the other
applications, it has a "Packetization Layer" which needs to decide what
length packets to break long streams of data into.)
This
sounds really messy, and I think it
will in practice be a slow process, if it happens at all, for a
significant number of applications to be adapted to this, in most or
all major operating systems. Not least, each OS would need a new
method of achieving this unusual two-way flow of information
between applications and the OS. Debugging this could be highly
problematic, with each host having different applications and operating
in different conditions.
RFC 4821 does not rely on PTB
messages, but can use them, since it builds on the basic RFC 1191
approach. It was prompted in part (or very largely?) by an
upsurge in unwise filtering which prevents PTB messages from reaching
the SH. IPTM's RPD2 protocol does not rely on PTBs to
successfully discover PMTU to each ETR, but it would work somewhat
better with them.
Advantages
The new process looks quite promising
for reasons including the following points. Maybe there are
gotchas and problems I haven't seen yet - please let me know your
critiques, concerns and suggestions for improvement.
There are
costs in complexity, but it doesn't seem inordinately complex or
expensive considering the difficulty of the problem it is solving.
Hopefully
there are no major security problems in what I propose below.
Please point out any weaknesses I have missed.
These notes
on advantages are somewhat simplified, and assume that there are no
unfortunate and typically rare packet losses which upset
the process.
Generally, the process should be robust against random packet
loss.
- Application data is not lost due to PMTU problems.
Whenever the ITR can't get a traffic packet to the ETR, it knows
about this and sends a PTB message to the sending host, which will try
again with a shorter packet.
- No
requirement for hosts to use RFC 4821.
- Does not embody any
assumptions about the currently widely used ~1500 PMTU limit (100Mbps
Ethernet etc.) or of jumbo-frame supporting equipment, which may have
MTUs of 9000 bytes (1Gbps Ethernet and beyond). So the system
should work optimally now and in the future as more of the Net adopts
jumbo-frame compatible equipment.
- The ITR
automatically and rapidly adjusts its two variables to converge on the
Real PMTU to each ETR. As the remaining Zone of Uncertainty
diminishes, ultimately perhaps to zero, so does the need for sending
traffic packets as probes. So the PMTUD process is self limiting,
for each ITR to ETR tunnel.
- There seems to be no need for
Synthetic Probe packets (long probes to test PMTU limits, but which do
not carry traffic data).  (But see the discussion here:
psg.com/lists/rrg/2008/msg01208.html - it looks like they will be
needed for probing if the only too-long packets are IPv4 DF=0.)  These
might have been useful for detecting if the PMTU has risen, but it
looks like they will not be needed.  They are not required
during the most important phase, when the ITR is trying to quickly
figure out the PMTU to an ETR it has never heard of before, and to
which it needs to send packets which exceed MPMTU (1200) bytes.
- Every
traffic packet which, when encapsulated, would be of a length the ITR
is uncertain about, in terms of whether it can be sent to the ETR
without PMTU problems (the Zone of Uncertainty), will be used as a
probe and will generally
enable the ITR to adjust its variables to improve its estimated PMTU in
a robust and helpful manner.
- When
a probe packet is received by the ETR, the traffic packet it embodies
is sent to the DH, and the ITR gets a message that the packet was
successfully received by the ETR.
- When a probe packet runs into
PMTU difficulties, the ITR will get a PTB message, or the probe will
not be received by the ETR, or both.  In any of these cases the ITR sends
a PTB to
the SH, and the SH tries again, with a shorter packet. No loss of
application data results from these probe packets which were too long
for the PMTU of the ITR --> ETR tunnel.
- The ITR does
very little additional work, such as sending Synthetic Probe packets.
(But see psg.com/lists/rrg/2008/msg01208.html
) It only probes PMTUD to the ETR when traffic packets of
suitable
length need to be tunneled to it.
- The protocol is not
overly complex for either the ITR or ETR. The state
requirements in both are modest.  In particular, the ITR does not
hold state for most tunneled packets. It only needs to hold some
state for those few which are probe
packets.
- The natural behavior of a SH to a sequence of PTB
messages, each with a still lower MTU value (as this IPTM will
generally create) will be to send shorter and shorter packets, which is
ideal for the ITR discovering more about the Real PMTU, and so
fine-tuning its two variables towards each other, diminishing the Zone
of Uncertainty.
- There is no fragmentation
or segmentation (a non-fragmentation approach to chopping them into
smaller packets) of traffic packets, except for fragmentation of IPv4
DF=0 packets as noted
below:
Fragmentation
This discussion and the example below
assumes the traffic packet is non-fragmentable (IPv6 or
IPv4 with DF=1). I understand that most IPv4 traffic packets are
sent according to RFC 1191 with DF=1.
A fragmentable IPv4 packet
could be
fragmented
by the ITR, into however many pieces of a size, once encapsulated, that
its current knowledge of the PMTU to this ETR indicates could be safely
sent. (This maximum safe size to send into the tunnel is the
LPME variable for this ETR, as described below.) These would be
reassembled by the destination host.
There may be some problems with reassembly of really
large numbers of such packets due to the 16 bit
fragment ID wraparound problem (delayed fragments matching the fragment
identifier of a packet generated 64k fragments later).  Fred
Templin's SEAL proposal (
SEAL ID)
outlines those problems and provides a solution which involves extra
headers. I will look into this and try to find a solution to
whatever problem there may be, hopefully without extra or longer
headers. Probably the solution would be to do something like SEAL
does: break the packet into "segments" and reassemble them at the ETR,
using a more robust protocol with a 32 bit identification system,
rather than the 16 bit fragment identifier which ordinary IPv4
fragmentation provides.
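As a rough illustration of the sizing involved, here is a minimal
Python sketch (hypothetical names; it ignores the fragment ID
wraparound issue just discussed) of how an ITR might choose fragment
payload sizes for a DF=0 IPv4 packet so that each fragment, once
encapsulated, is no longer than LPME:

  IPV4_HEADER = 20   # inner IPv4 header, repeated in every fragment
  ENCAPS      = 20   # Ivip IPv4 IP-in-IP encapsulation overhead

  def fragment_payload_sizes(total_len, lpme):
      """Split a DF=0 IPv4 packet of total_len bytes (header included) into
      per-fragment payload sizes such that each fragment, once encapsulated,
      is no longer than lpme.  All fragments except the last must carry a
      multiple of 8 payload bytes (the IPv4 fragment offset unit)."""
      payload = total_len - IPV4_HEADER
      max_payload = lpme - ENCAPS - IPV4_HEADER
      max_payload -= max_payload % 8
      if max_payload <= 0:
          raise ValueError("LPME too small to carry any fragment")
      sizes = []
      while payload > 0:
          chunk = min(payload, max_payload)
          sizes.append(chunk)
          payload -= chunk
      return sizes

  # A 7000 byte DF=0 packet with LPME = 1200: six fragments carrying 1160
  # payload bytes each, plus a final 20 byte fragment.
  print(fragment_payload_sizes(7000, 1200))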
Explanation
by way of example
For each ETR that
an ITR needs to do PMTUD (Path MTU Discovery) with, here is an
outline, by way of an example, of what happens. This example
has some details for Ivip, but the same process is adaptable
to other map-encap schemes (LISP, APT and TRRP). I think it could
also be adapted as the basis for Fred Templin's SEAL proposal.
An
ITR takes no special action when tunneling packets of length L
<= (MPMTU - ENCAPS) to any ETR.  So as long as the traffic packet was 1180
bytes or less, then the ITR encapsulates it with
Ivip's ordinary IP-in-IP encapsulation, making a packet no longer than
1200 bytes, and sends it to the ETR.  All the example below
relates to what the ITR does after it receives one or more traffic
packets which, once encapsulated, would be longer than MPMTU (1200
bytes).
3 variables for each ETR
For each ETR it needs to send packets to which, once encapsulated, are
longer than MPMTU, the ITR creates 3 variables, initialised as follows:
IMTU  (Interface MTU for the Interface via which the ITR sends packets
to this ETR)

Initial value: Whatever the MTU is for that interface.  In this example
the ITR uses a jumbo-frame capable interface, so the value is 9000
bytes.

IMTU is not adjusted as part of the process, so it is effectively a
constant.  There are circumstances in which the value does change, and
in all these cases, the ITR starts with a clean slate: setting IMTU to
the new value and the other two variables to their initial values as
noted below.  These changes of IMTU include:

- The Interface itself changes its MTU (seems pretty unlikely).

- The ITR has multiple interfaces, and due to routing or whatever other
changes, now uses a different interface to send the packets to the ETR
- and that interface has a different MTU.  (Still, sending the packets
out another interface is a good reason to restart the PMTUD process by
initialising the variables again, since the path could be quite
different from what it was from the initial interface.)
UPME  (Upper Path MTU Estimate)

Initial value: IMTU (MTU of the ITR's interface on which it sends
packets to this ETR).  In our example, the ITR uses a
jumbo-frame-capable interface, so the initial value of UPME is 9000
bytes.

As the ITR initialises and later may reduce the value of UPME, this
variable specifies the longest packet the ITR considers it might be
able to send to the ETR without PMTU limits.  For instance, if 6011
bytes is the length of the shortest packet the ITR knows it cannot get
to the ETR - due to one or more failed attempts to do this, or due to
an attempt to send a packet of this or a longer length which resulted
in a router in the ITR --> ETR tunnel returning a PTB with MTU = 6010 -
then the ITR would set UPME to 6010.

By comparing the length L of a traffic packet with (UPME - ENCAPS) and
with the other variable (LPME - ENCAPS), the ITR decides how to handle
each traffic packet it needs to encapsulate and tunnel to this ETR.
LPME  (Lower Path MTU Estimate)

Initial value: MPMTU - a worldwide agreed-to packet length which all
ITRs and ETRs can send to each other without PMTU problems.  In our
example, LPME is initialised to 1200 bytes.

As the ITR initialises and later typically increases the value of LPME,
this variable specifies the longest packet the ITR can confidently send
to this ETR.  Therefore, if a traffic packet has length L which is
equal to or less than (LPME - ENCAPS) then the ITR will send it
normally.  For Ivip, this means ordinary IP-in-IP encapsulation with a
20 byte header overhead.  Such packets are not a part of the ITR's
process of PMTUD.
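For concreteness, here is a minimal sketch in Python (with hypothetical
names - not part of the proposal itself) of this per-ETR state and its
initialisation:

  MPMTU  = 1200   # globally agreed minimum PMTU between ITRs and ETRs (assumed BCP value)
  ENCAPS = 20     # Ivip IPv4 IP-in-IP encapsulation overhead

  class EtrPmtuState:
      """The three variables an ITR keeps for each ETR to which it must
      send packets which, once encapsulated, exceed MPMTU."""

      def __init__(self, interface_mtu):
          self.imtu = interface_mtu   # MTU of the outgoing interface (9000 here)
          self.upme = interface_mtu   # Upper Path MTU Estimate - moves down
          self.lpme = MPMTU           # Lower Path MTU Estimate - typically moves up

      def zone_of_uncertainty(self):
          # Encapsulated packet lengths from (LPME + 1) to UPME inclusive.
          return range(self.lpme + 1, self.upme + 1)

  state = EtrPmtuState(interface_mtu=9000)
  print(state.lpme, state.upme, state.imtu)   # 1200 9000 9000 initially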
Adjusting
LPME and UPME to reduce the Zone of Uncertainty
As PMTUD progresses, the ITR typically
adjusts these variables towards each other. UPME will move down
(unless there is a clear jumbo-frame supporting path to the ETR)
and LPME will move up (unless perhaps the ITR is unable to send a
packet longer than 1200 bytes to the ETR, which in practice will almost
never be the case).
The range of packet sizes between (LPME + 1)
and UPME inclusive is called the Zone
of Uncertainty.  The
ITR does not yet have adequate knowledge to predict whether a packet
whose encapsulated length (L + ENCAPS) is within the Zone of
Uncertainty could be sent to the ETR without PMTU problems.
It is possible, and in many cases likely,
that within a few seconds, with various traffic packets being sent as
probes, the ITR will raise LPME and/or drop UPME until they are the
same. In that case, there would be no Zone of Uncertainty.
Outer
algorithm for handling traffic packets
Here is the algorithm by which the ITR
decides how to handle each traffic packet, according to how its length
compares with these variables.
With Ivip's 20 byte
ENCAPS overhead, this means the ITR
will treat traffic packets in these ways according to their length L.
For each case below, the first line gives the in-principle test and the
second gives the test with values derived from the constants as initialised above.
Short enough to send normally:

  L <= (LPME - ENCAPS)
  L <= 1180
Tunnel
the packet to the ETR by the normal method: Ivip's IP-in-IP
encapsulation with the outer source address being that of the Sending
Host
(SH). (This is to assist with the ETR supporting the source
address
filtering of border routers of the ISP network it is within, and also
could be used by modified traceroute code in the SH to trace into the
ITR --> ETR tunnel.)
Length is within the Zone of Uncertainty:

  ( L > (LPME - ENCAPS) )  &&  ( L <= (UPME - ENCAPS) )
  ( L > 1180 )             &&  ( L <= 8980 )
Perform the following RPD2 protocol with
the traffic packet. In almost all cases, not counting unusual and
unfortunate instances of packet loss, this will result in:
Packet received: LPME will be increased to
(L + ENCAPS).
Packet not delivered, meaning one of the following
occurred:
- A PTB arrived from a router in the ITR --> ETR
tunnel.
- The ETR told the ITR the packet did not arrive.
- (The
ITR heard nothing, in which case it may try again, and assume the
packet did not arrive if there is no response the second time.
Actually, this probably means the ETR is unreachable)
In
this case, the ITR will adjust the value of UPME to the size of the
encapsulated packet, or to the MTU value reported in the PTB message,
whichever is lower. The ITR then sends the SH a PTB with an MTU
value of UPME.
Too long for known PMTU limit:

  ( L > (UPME - ENCAPS) )  &&  ( L <= (IMTU - ENCAPS) )
  ( L > 8980 )             &&  ( L <= 8980 )
Later, UPME may be adjusted to a lower
value than IMTU. If the packet length matches this test, then in
general, it will be dropped and a PTB sent to the SH, with an MTU value
of (UPME - ENCAPS). However, some of these packets may be sent
with the RPD2 protocol to explore the possibility that the Real MTU has
grown since the last tests indicated it was as low as UPME currently
indicates.
Too long for the ITR's outgoing Interface:

  L > (IMTU - ENCAPS)
  L > 8980
The packet, once encapsulated, is too long
for the interface which the ITR currently uses to send packets to this
ETR. Drop the packet and send a PTB to the SH with an MTU value
equal to (UPME - ENCAPS).
The above four options
explain the overall algorithm for handling
packets, and it can be imagined how a succession of packets which meet
the Zone of Uncertainty criteria will result in the ITR learning more
about
the Real PMTU to this ETR, and so adjusting LPME and UPME accordingly
to reduce the span of the Zone.
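To make the four-way decision concrete, here is a minimal Python sketch
(hypothetical names, not a specification) of how an ITR might classify
each traffic packet of length L against the three variables:

  ENCAPS = 20   # Ivip IPv4 IP-in-IP encapsulation overhead

  def classify_traffic_packet(length, lpme, upme, imtu):
      """Decide, from the three per-ETR variables, which of the four
      treatments above applies to a traffic packet of the given
      (unencapsulated) length."""
      if length <= lpme - ENCAPS:
          return "send normally with IP-in-IP encapsulation"
      if length <= upme - ENCAPS:
          return "within Zone of Uncertainty: send with RPD2 as a probe"
      if length <= imtu - ENCAPS:
          return "too long for known PMTU: drop, send PTB (MTU = UPME - ENCAPS)"
      return "too long for outgoing interface: drop, send PTB (MTU = UPME - ENCAPS)"

  # With the variables as initialised in the example (LPME 1200, UPME = IMTU = 9000):
  for length in (1180, 1181, 7000, 8980, 8981):
      print(length, "->", classify_traffic_packet(length, 1200, 9000, 9000))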
RPD2
protocol for reliable probing and potential delivery
Before giving an example of IPTM operation,
here are the
details of the RPD2 protocol, which is normally used so that the full
contents of a traffic packet are sent to the ETR, in a way which both
reliably probes the PMTU at a length that packet would have if sent via
normal encapsulation, and which will deliver the traffic packet to the
ETR if no PMTU problems are encountered.
At present, there are
two ways this could be done. The first approach splits the
original packet, sending most of it in a long Packet B probe packet,
and the rest in a short Packet A, of which 2 or 3 are sent just in
case one of the Packet As is dropped due to random packet loss.
A
potential second approach involves sending the traffic packet via
ordinary encapsulation, to test the PMTU, but has the Packet A role
performed in a different way. This has some elements of
simplicity, but I think it is not such a good idea, since it would be
difficult to reliably tell the ETR exactly which encapsulated traffic
packet to look out for. Also, this second approach would burden
the FIB, main traffic path or whatever of the ETR with the task of
always looking out for such probe packets. I describe this second
approach later, but in the meantime, please consider that the rather
awkward looking initial approach is probably a good approach,
since each Packet B can
easily and unambiguously be associated with its Packet A.
As
described here, the RPD2 algorithm in the ITR takes as its input the
traffic packet:
A packet from a
SH to a DH, for which the ITR has already looked up the mapping, found
which micronet the DH is in, and thereby found the ETR address it needs
to be sent to.  From that, it knows which interface to send it out
of.
and
therefore the ETR address, and the three variables for this ETR listed
above.
In the examples which follow, the traffic packets handled
by RPD2 are all of a length which means that once
encapsulated, each would be of a length which matches the Zone of
Uncertainty.
However, as discussed below, there may be a need to send packets
with
RPD2 when they are outside this zone:
- Shorter than
(LPME - ENCAPS) - to detect the possibility that the Real PMTU has
fallen below LPME.
- Longer than (UPME -
ENCAPS) - to detect the possibility that the Real PMTU has risen above
UPME.
RPD2 could also be used with Synthetic Probe
packets, in which case the Packet B is mainly zeros, and Packet A
contains only the required headers, most particularly the same nonce as
is in Packet B. With a Synthetic Probe, Packet A does not contain
any data which didn't fit into
Packet B. I am not yet sure whether we need Synthetic Probe
packets. At most they would be used for occasionally testing to
see whether the Real PMTU has increased. Maybe we don't need
them at all.
IPTM uses this RPD2 protocol only when needed.
Most traffic packets are sent with ordinary IP-in-IP, and any
PMTU problems those packets encounter in the ITR --> ETR tunnel are
not visible to the ITR. This is because the outer source address
is that of the SH. With LISP, APT and TRRP, the outer source
address is that of the ITR, so the ITR would get a PTB (except for
packet losses and any filtering of such messages). However, it is
highly impractical to require ITRs to cache sufficient details of all
encapsulated packets they send to securely identify a PTB as
being genuine and then to do something useful with this information,
including sending a suitably crafted PTB to the sending host.
Nonetheless,
see a section below where I discuss a "Protocol X" alternative to RPD2,
in which it might be feasible for a non-Ivip ITR to achieve the same
detection of PMTU problems in the tunnel to each ETR, while only
storing state for a fraction of the encapsulated traffic packets.
First,
I describe the construction of two packets A and B. Then the way
in which they are sent and how the ETR reports back to the ITR.
Following that is an example which puts together the outer
IPTM algorithm for handling packets of various lengths, with the 3
Variables for this ETR, and the RPD2 probing protocol to show how an
ITR would rapidly adjust its variables according to what it discovers
about the Real PMTU to this ETR.
Packet B
Packet B is the Big packet. It has
exactly the same length as
the traffic packet would have had if sent by ordinary encapsulation.
For IPv4 Ivip, this is 20 bytes longer than the raw traffic
packet.
It is vital that Packet B have the same length as a
normally encapsulated traffic packet would have, since a SH may
get a PTB regarding this traffic packet, and it is important that this
only result from a genuine problem in the network which relates to a
packet of that length. For instance, if the traffic packet length
L was used as part of a probe to create a probe packet of length longer
than (L + ENCAPS) then if the probe failed, the ITR would need to send
a PTB to the SH (unless the ITR tried to resend the traffic packet,
which is messy and probably too slow to be workable). In order
for PTB values to be valid from the point of view of the SH's RFC
1191 or RFC 4821 functions, the packet which caused the trouble on the
ITR --> ETR tunnel needs to be exactly the same length as would
result from ordinary, non-probe, encapsulation.
The ITR takes
the initial traffic packet (7000
bytes in this example) and chops it into two pieces.
From the
start, CHUNK1 bytes are kept
to form the basis of "Packet A". The remaining CHUNK2 bytes are kept to form the
basis of "Packet B".
CHUNK1 is relatively small, and so is
"Packet A". Most of the original packet winds up in "Packet B",
which has the following format:
-------
Outer IP header   (Actually, there is no inner header, but I keep this
                   term because it fits with the Packet A structure.)

    Outer Source Address = ITR's address
    Outer Dest Address   = ETR's address
    Next header: UDP

UDP Header  (well known port on ETR, length of all that follows, etc.)

RPD2 header:

    Flags etc. to the effect:

        This is a "Packet B".  Therefore, the ETR should
        link it with the first "Packet A" it correctly
        receives with the same nonce.

        ETR must acknowledge the receipt of this packet
        by a procedure noted below.

    CRC for the whole packet.  (Maybe put this at the end, after the
    large slice of the original traffic packet?)

    NONCE   A nonce which is unique for this traffic packet and is
            also used in the Packet A.

    CHUNK2  Length of the segment of the original packet
            contained herein.

CHUNK2 bytes of the original packet.
-------
CHUNK1 + CHUNK2 = original traffic packet length = 7000.

The total length of Packet B is defined as the original packet's length
plus ENCAPS.  In this case: 7020 bytes.  CHUNK2 is chosen to be 7020
bytes minus the length of Packet B's:

    Outer IP header     20 bytes
    UDP header           8 bytes
    RPD2 header         20 bytes (let's assume)
                      = 48 bytes total

So in this example, CHUNK2 = 7020 - 48 = 6972 bytes.  Therefore,
CHUNK1 = 28 bytes.
Packet A
The ITR assembles a "Packet A":
-------
Outer IP header

    Outer Source Address = SH's address  (This is for Ivip.  For other
                           map-encap schemes or for SEAL, use the ITR's
                           address.)
    Outer Dest Address   = ETR's address
    Next header: UDP

UDP Header  (well known port on ETR where RPD2 packets are expected,
             length of all that follows, etc.)

RPD2 header:

    Flags etc. to the effect:

        This is a "Packet A".  Therefore, the ETR should
        link it with a "Packet B" it correctly receives
        with the same nonce.

        ETR must acknowledge the receipt of this packet
        by a procedure noted below.

    CRC for the whole packet.  (Maybe put this at the end?)

    NONCE   Same as for the matching Packet B.

    CHUNK1  Length of the segment of the original packet
            contained herein.

CHUNK1 bytes of the original traffic packet.  This will consist of:

    Inner IP header:
        Inner source address = SH's address
        Inner dest address   = Destination Host's address
        etc.

    The rest of the original traffic packet, up to CHUNK1 bytes
    into that packet.
-------
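To illustrate the splitting arithmetic and the relationship between the
two packets, here is a minimal Python sketch.  The RPD2 header layout
used here is entirely hypothetical (and omits the CRC field) - it is
only meant to show that Packet B's UDP payload, plus the outer IP and
UDP headers, comes to exactly the length an ordinarily encapsulated
packet would have:

  import os
  import struct

  ENCAPS       = 20   # Ivip IPv4 IP-in-IP overhead Packet B must match in length
  OUTER_IP_HDR = 20
  UDP_HDR      = 8
  RPD2_HDR     = 20   # assumed RPD2 header length, as in the example above

  def split_for_rpd2(traffic_packet: bytes):
      """Split the traffic packet into CHUNK1 (carried in Packet A) and
      CHUNK2 (carried in Packet B) so that Packet B has exactly the length
      an ordinarily encapsulated packet would have."""
      packet_b_len = len(traffic_packet) + ENCAPS
      chunk2 = packet_b_len - (OUTER_IP_HDR + UDP_HDR + RPD2_HDR)
      chunk1 = len(traffic_packet) - chunk2
      nonce = os.urandom(4)            # unique per traffic packet
      # Hypothetical RPD2 header: flags(1) pkt-type(1) chunk-len(2) nonce(4) pad(12)
      hdr_a = struct.pack("!BBH4s12x", 0, ord("A"), chunk1, nonce)
      hdr_b = struct.pack("!BBH4s12x", 0, ord("B"), chunk2, nonce)
      packet_a_payload = hdr_a + traffic_packet[:chunk1]
      packet_b_payload = hdr_b + traffic_packet[chunk1:]
      return nonce, packet_a_payload, packet_b_payload

  nonce, pa, pb = split_for_rpd2(bytes(7000))
  # Packet B's UDP payload plus outer IP and UDP headers comes to 7020 bytes,
  # the length an ordinary IP-in-IP encapsulation of this packet would have.
  print(len(pb) + OUTER_IP_HDR + UDP_HDR)   # 7020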
Sending
Packets A and B
The ITR now does this:

  Send a Packet A.
  Send another Packet A.
  Send Packet B (the long one).

If it
receives no response from the ETR within some short time, say 0.1
seconds, it sends another Packet A.
It is really important
that the ETR gets at least one of these packets so it sends a report to
the ITR.  Without a report from the ETR, the ITR would have to wait
for some time, maybe half a second or a second or two, waiting for a
response, and then would have to abandon this attempt at RPD2 probing.
Sending
two little packets before the big one, and then another little packet a
moment later, seems like a good way of getting at least one packet to
the ETR. The ETR ignores any second or third Packet A it receives
with the same nonce.
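A sketch of this sending sequence (Python; send_packet and
report_received are placeholders for the ITR's transmit path and its
RPD2 state table, and the timings are the rough figures suggested
above):

  import time

  def send_rpd2_probe(send_packet, report_received, packet_a, packet_b, nonce,
                      retry_delay=0.1, give_up_after=0.5):
      """Send two copies of Packet A, then Packet B; send a third Packet A
      if no report has arrived after retry_delay seconds; give up after
      give_up_after seconds with no report from the ETR."""
      start = time.monotonic()
      send_packet(packet_a)
      send_packet(packet_a)        # a second copy, in case one is lost
      send_packet(packet_b)        # the long probe carrying most of the traffic packet
      sent_third_a = False
      while time.monotonic() - start < give_up_after:
          if report_received(nonce):
              return True          # a report arrived; act on its contents
          if not sent_third_a and time.monotonic() - start >= retry_delay:
              send_packet(packet_a)    # one more little packet, a moment later
              sent_third_a = True
          time.sleep(0.01)         # a real ITR would be event-driven, not polling
      return False                 # abandon this attempt at RPD2 probing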
When the ETR receives either its first
Packet A or the Packet B, it waits for a moment (maybe 0.1 secs?) for
the other to arrive.
Then if only one has arrived, it reports
back to the ITR what has happened, in a UDP packet, with some suitable
flags and fields to the effect:
  Packet A received (yes / no)
  Packet B received (yes / no)
  NONCE
The
ETR keeps listening for the other packet, and after some other time -
say 0.3 seconds - times out and forgets all about the one it
received. (However, for a few seconds it continues to send report
messages to the ITR until it gets an Ack from the ITR.)
Whenever
it receives the other packet, it reports back similarly, but with "yes"
for both fields.
When it gets both Packets A and B, the ETR
reassembles the original traffic packet and verifies that its source
address matches the outer source address of Packet A. (This is
Ivip only - other map-encap schemes and SEAL don't need this.)
Then the ETR passes the packet on to the destination host.
Acknowledging
Packets A, B and ETR messages.
The
ETR is expected to send one or two report messages to the ITR, as
described above when it either receives one of the ITR's packets A
and B, or both of them. The ETR is required to resend those
report messages, every 0.5 second for 1.5 secs or so (depending on
how quickly the ITR times out of this RPD2 process) until it gets
an Ack from the ITR. The nonce secures the ETR -> ITR
message and likewise the ITR --> ETR acknowledgment of that
message.
The ITR sends a single Ack for each report message it
receives from the ETR. If the first report message or Ack is
lost, the ETR will send the report message again and
probably the second Ack will be received by the ETR.
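On the ETR side, the report and retransmission behaviour just described
might be kept per nonce in something like the following sketch (Python,
hypothetical names; the timer values are the rough figures from the
description above, and the caller is assumed to record the time of each
report it sends in last_report):

  import time

  # Rough timings from the description above (seconds); all open to tuning.
  WAIT_FOR_OTHER = 0.1   # wait this long for the other of Packet A / Packet B
  RESEND_REPORT  = 0.5   # resend the report this often until the ITR Acks
  GIVE_UP_AFTER  = 1.5   # stop resending reports after roughly this long

  class Rpd2Record:
      """What an ETR might remember, per nonce, while an RPD2 exchange is open."""

      def __init__(self):
          self.got_a = False
          self.got_b = False
          self.acked = False
          self.first_seen = time.monotonic()
          self.last_report = None

      def on_packet(self, kind):
          # Duplicate Packet As (or Bs) carrying the same nonce are ignored.
          if kind == "A":
              self.got_a = True
          elif kind == "B":
              self.got_b = True

      def report_due(self, now=None):
          """True when a report (Packet A received yes/no, Packet B received
          yes/no, NONCE) should be sent or resent to the ITR."""
          now = time.monotonic() if now is None else now
          if self.acked or now - self.first_seen > GIVE_UP_AFTER:
              return False
          if not (self.got_a or self.got_b):
              return False
          if self.last_report is None:
              # First report: send at once if both packets have arrived,
              # otherwise after the short wait for the other packet.
              return (self.got_a and self.got_b) or \
                     now - self.first_seen >= WAIT_FOR_OTHER
          return now - self.last_report >= RESEND_REPORT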
Responses
to the ITR
After sending these 2 or
3 copies of Packet A, and the Packet B, the ITR could receive a number
of things:
PTB from router
in the tunnel
The ITR caches the
initial bytes of Packet B, so it can verify the validity of this
PTB. (Maybe put NONCE up front in the RPD2 header to make this
easier?)
Assuming the PTB checks out OK, the ITR has now found
out something concrete about the real PMTU to this ETR: it is equal to
or less than the MTU value in this PTB.
This value is written to
UPME (Upper Path MTU Estimate).
Then, the ITR sends back a PTB
to the SH, with an MTU value (UPME - ENCAPS).
That ends this
instance of RPD2, other than the ITR acknowledging any report it
receives from the ETR.
No application data has been lost.
Assuming the sending host gets the PTB message from the ITR, it will
resend the data in smaller packets.
ETR report: Packet A arrived OK, but not
Packet B
If this report arrives and
no other report arrives a moment later, then the ITR can reasonably
suspect that Packet B was either lost (or delayed excessively) or that
it was too long for some router in the tunnel, but that no PTB has yet
been received.  In the latter case, maybe the PTB was not sent,
maybe it was filtered or maybe it was dropped or delayed somewhere.
If
this is the situation, the ITR might try again - a fresh RPD2 cycle
with a new nonce, and the same traffic packet. If the same thing
happens the second time, the ITR is probably justified in concluding
that the Real PMTU to this ETR is lower than the length of Packet B.
(Maybe use the same nonce again, so in case the ETR does receive
the first Packet A and B OK, that there won't be duplicate traffic
packets sent to the DH.)
If, in the second attempt, the ETR
reports that Packet B arrived, then this means the length of this
Packet B can be written into LPME, raising it and therefore reducing
the ITR's Zone of Uncertainty for this ETR.
ETR report: Packet B arrived OK but none of
the 3 Packet As
In Ivip, the most
likely cause of this is that the network in which the ETR is located
has its border routers set up to drop incoming packets which match one
of the network's prefixes. In this case it means that the
sending host's address matched one of these prefixes, in which case the
ISP's network considers it to be a packet with a spoofed source
address. Ivip ETRs will drop any inner packet with a source
address which does not match the source address in the outer header.
So this indicates the traffic packet had a spoofed source address.
The
ITR could retry the RPD2 approach again, with the same traffic packet
and a different (same?) nonce, on the off-chance that the failure of
any of the Packet As to arrive was just bad luck with random packet
loss.
At least the ITR has got a new higher value to write into
LPME, again closing the Zone of Uncertainty between this and UPME.
If
the same result occurs on a second attempt, the Ivip ITR should forget
about this traffic packet, since it is reasonable to assume it has a
source address the ETR will reject, due to the filtering of the border
routers in the network within which the ETR is located.
ETR report: Packet B arrived OK and at
least one Packet A arrived OK
This
might be the initial report, or it may arrive up to half a second or so
later. (I am keen to have ITRs not waiting around for bedraggled
packets which arrive well after they should.)
The ITR writes the
length of Packet B to LPME.
The ITR has successfully delivered
the packet and established a new lower limit for the final value of
LPME.
No response after 0.5
seconds or so
The ITR concludes
that either the ETR is unreachable at present - or that the device
there is not an ETR.
Ivip ITRs generally are not concerned
with the reachability of ETRs. However, as part of this PMTU
stuff, they may discover an ETR is unreachable. If this happens
repeatedly, the ITR could drop all traffic packets which should
be sent to this ETR. Maybe it should send back an ICMP host
unreachable message to the SH, but the ITR would never discover
this unreachability if the packets were no longer than MPMTU, or no
longer than the current value of LPME. So the SH should not rely
on any such ICMP host unreachable messages from the ITR.
Any
end-user who manages their mapping wisely will soon find it is not
working - for instance if they point it at some device which is not an
ETR, or at an ETR which is unreachable - so generally the mapping
will soon be changed to point to a reachable ETR.
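Putting these outcomes together, the way the ITR adjusts its two
variables and decides whether to send the SH a PTB might be sketched as
follows (Python, hypothetical names; the outcome strings simply label
the cases described above):

  from types import SimpleNamespace

  ENCAPS = 20   # Ivip IPv4 IP-in-IP encapsulation overhead

  def handle_rpd2_outcome(state, probe_len, outcome, ptb_mtu=None):
      """Adjust the per-ETR variables after an RPD2 probe whose Packet B
      was probe_len bytes long (the encapsulated length).  Returns the MTU
      value to put in a PTB to the SH, or None if no PTB is needed."""
      if outcome == "ptb_from_tunnel":
          # A router in the tunnel reported its limit; take the lower of
          # the probe length and the MTU in the PTB, as described above.
          state.upme = min(state.upme, probe_len, ptb_mtu)
          return state.upme - ENCAPS
      if outcome == "a_only_twice":
          # Two attempts where Packet A arrived but Packet B did not:
          # treat the probe length as beyond the Real PMTU.
          state.upme = min(state.upme, probe_len)
          return state.upme - ENCAPS
      if outcome == "a_and_b":
          # Probe delivered; the traffic packet reached the ETR and the DH.
          state.lpme = max(state.lpme, probe_len)
          return None
      if outcome == "b_only_twice":
          # Packet B arrived but no Packet A: for Ivip, most likely a
          # spoofed source address.  Drop the packet, but the Packet B
          # arrival still raises LPME.
          state.lpme = max(state.lpme, probe_len)
          return None
      return None   # "no_response": the ETR appears unreachable

  state = SimpleNamespace(lpme=1200, upme=9000)
  print(handle_rpd2_outcome(state, 7520, "ptb_from_tunnel", ptb_mtu=7400))  # 7380
  print(state.lpme, state.upme)                                            # 1200 7400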
Example
with multiple longer packets
Here
is a sequence of packet lengths, the first set of which appear in
this order while the initial and subsequent RPD2 processes are in
progress.
This example shows a flurry of traffic packets,
perhaps from the same SH and perhaps not, which all go to micronets
which are currently mapped to one ETR, for which the ITR initially has
no PMTU information.
If a single SH with a single application
was the only host sending packets through the ITR to this ETR, the
sequence of packet lengths would be simpler. Assuming the SH made
the most of its (assumed to be 9000 byte MTU) path to the ETR, it would
start with some high value, such as a 9000 byte packet, which the ITR
would reject with a PTB (MTU = 8980) because once encapsulated, the
packet would be 9020, and therefore too big for its outgoing interface.
Then the SH would probably send an 8980 byte long packet and the
ITR would try to send that to the ETR. If there was one or more
PMTU limits in the path, usually a PTB would come back indicating the
first limit below 9000. The ITR would adjust down its UPME
variable accordingly and the SH would get a PTB (MTU = UPME - ENCAPS),
and so would try again with a packet which, once encapsulated, should
pass this first router which has a PMTU limit. Perhaps that is
the only router, in which case, the packet would arrive, the ITR would
set its LPME variable to this length, the Zone of Uncertainty would
cease to exist for this ETR, and all subsequent packets (for 10 minutes
at least) from this SH would be the new lower value, and therefore be
sent with ordinary IP-in-IP encapsulation.
So the
following example is inordinately complex and unrepresentative of a
typical initial PMTUD process for any one ETR. (To-do: make
a simpler example.)
The values of the variables are in the left
two columns, and don't change until a report comes back from the ETR:
LPME   UPME   Traffic  RPD2     Action or event
              packet   packet
              length   length

1200   9000   7000     7020     Send with RPD2 #1.

1200   9000   6000     6020     Send with RPD2 #2.

1200   9000   6500     6520     Send with RPD2 #3.

1200   9000   7500     7520     Send with RPD2 #4.

1200   9000   7000     7020     Send with RPD2 #5 - because the
                                ITR hasn't yet got a report from the
                                ETR.  Maybe the #1 process failed due
                                to packet loss, in which case this #5
                                will be a second attempt at testing
                                the path with a 7020 byte packet.

1200   9000   8000     8020     Send with RPD2 #6.
Now some
reports from the ETR come in. In reality, responses and traffic
packets would probably be arriving at the same time, but for
simplicity, I am showing the replies arriving after this initial flurry
of traffic packets.
LPME   UPME   Traffic  RPD2     Action or event
              packet   packet
              length   length

1200   9000                     #1 reply (7020) A & B received.
                                Write 7020 to LPME.
                                If this reply had been received
                                before the ITR handled the 6000 and
                                6500 length packets described above,
                                both of those would have been sent
                                with ordinary IP-in-IP encapsulation.
                                Now the range of packet sizes for
                                ordinary IP-in-IP encapsulation has
                                been dramatically expanded from 1180
                                to 7000 - and the Zone of Uncertainty
                                greatly decreased.

7020   9000                     #2 reply (6020) A & B received.
                                Nothing to do, since LPME is already
                                above this value.

7020   9000                     #3 reply (6520) A & B received.
                                Nothing to do, since LPME is already
                                above this value.

7020   9000   6800     6820     Send with ordinary IP-in-IP.

7020   9000   8800     8820     Send with RPD2 #7.

7020   9000                     #4 reply (7520) PTB received from
                                router in tunnel.  MTU = 7400.
                                Write this value to UPME.

7020   7400                     #6 reply (8020) PTB received from
                                router in tunnel.  MTU = 7400.
                                Nothing to do since UPME is
                                already set to 7400.
Now
the span of the Uncertain Zone has been reduced dramatically, from 1200
<--> 9000 to 7020 <--> 7400.
Traffic packets of 7000 or
less bytes will now routinely be sent with IP-in-IP encapsulation with
very high confidence they won't be
clobbered by some PMTU
restriction.
Packets longer than 7380 bytes will be dropped and
a PTB sent to the SH with an MTU of (UPME - ENCAPS) = 7380.
Packets
of length 7001 to 7380 inclusive will be sent with RPD2 and so will
help the ITR further adjust these two variables, further narrowing the
Zone of Uncertainty.
Unless there is an unfortunate and unlikely
series of lost packets, all packets sent with RPD2 will result in one
of these outcomes:
1 - The packet is delivered.
The ITR will increase LPME.
(Uncertain Zone diminished.)
2 - The packet is not
delivered.
The ITR may learn
something about the PMTU to this ETR, in which case the UPME will be
reduced. (Uncertain Zone diminished.)
There will be no
application data loss - the SH will get a PTB message and so will
resend the data in smaller packets.
The MTU value sent in
successive PTBs to any SH from this ITR will always decrease, as its
UPME is decreased. These decreases only happen due to either a
PTB (which is highly reliable) or one or two instances where the ETR
received one or more copies of Packet A, but did not receive a Packet
B. This is pretty good evidence there is a PMTU problem in the
path to this ETR.
In Ivip, there won't be ordinarily
recognisable PTBs sent to the ITR for any packets sent with ordinary
IP-in-IP encapsulation, from any routers in the tunneled part of the
path, since Ivip's IP-in-IP encapsulation uses the SH's address as the
outer header's source address.  Those PTBs would be addressed to the SH, but a
properly implemented SH won't recognise them. Other map-encap schemes
use the ITR's address in the outer header, but this is not much help
with PTBs, since it is not practical for the ITR to keep enough
state about all the packets it tunnels to reliably
distinguish a genuine PTB from a spoofed one.
IPTM
requires the ITR to keep state about those few packets sent
with RPD2 - and this number rapidly diminishes as the ITR closes
the
Zone of Uncertainty for each ETR.
As long as the Zone
of Uncertainty is greater than 1, and as long as packets arrive which
once encapsulated would have a length matching that zone, then there
will still be RPD2 packets being sent. But this is a
self-limiting process. As long as RPD2 packets are being sent,
the ITR is reliably probing the real PMTU at that time and so adjusting
its UPME and/or its LPME variables towards each other.
According
to the above algorithm, before long, the Zone of Uncertainty would be
reduced or disappear, so that the ITR would never send any packets
by RPD2. This means the ITR would not discover any change
in the Real PMTU: It would not confirm or test the validity of
any mistakenly low value of UPME (as would be the case if the Real PMTU
has increased) or a mistakenly high value of LPME and/or UPME (as would
be the case if the Real PMTU has decreased).
Discovering
changes in Real PMTU
As long as the
Real PMTU lies in range of LPME to UPME inclusive, the system is
working fine. Ideally, the Zone of Uncertainty (UPME - LPME) will
be small, or zero, which means that the ITR will be sending all the
traffic packets which would fit into the tunnel (once encapsulated) by
the highly efficient ordinary encapsulation (IP-in-IP for Ivip), and
that this would be making the most of the Real PMTU of the tunnel.
If
the Real PMTU has dropped below the current value of LPME,
packets sent via ordinary IP-in-IP would be dropped, without the ITR or
the SH being aware of it. This would lead to packet loss, without
the SH getting a PTB message. This would be a serious failing, so
the system needs to try to avoid this as much as possible.
If
the Real PMTU rises above the current value of UPME, there is no loss
of packets, but it would be best to discover this sooner rather than
later so larger packets could be sent via ordinary encapsulation.
Detecting
an increase in the Real PMTU
If the
Real PMTU rises above the current values of LPME and UPME, then no
packets will be dropped due to PMTU problems. The application
will continue to work with the efficiency it already has, but it would
be missing out on some potential efficiency gain by sending longer
packets. For an RFC 1191 host (I think all hosts, for all
practical purposes in the foreseeable future) the SH will not try a
longer packet length for 10 minutes after it received a PTB message
which set its current limit. There's nothing the ITR can do to
improve this.
What happens when the Real PMTU is higher than the
current value of LPME and UPME, and some other SH tries sending a
packet longer than both LPME and UPME, but which would in fact fit
(after encapsulation) in the Real PMTU, which the ITR is currently
unaware of? As described above, the ITR would always, or at least
usually, drop any such traffic packet and send the SH a PTB with a
value (UPME - ENCAPS). But this would be a bad outcome, since
these packets could be coming from either a new SH, or from the
original SH which has been sending packets for 10 minutes and is now
testing to see whether it can send longer ones.
As defined
above, the ITR wouldn't let any of these packets out to the tunnel,
because it "knows" they are too long . . . but does it really know what
is happening in the tunnel right now? Not necessarily.
Reasons
for not sending these long packets, and for sending the SH a PTB,
include:
- It looks like they
won't get to the ETR, so to save the ITR's outgoing bandwidth and
the burden on at least some routers in the tunnel, the packet is not
sent via ordinary encapsulation.
- Likewise, the packet is not
sent with RPD2 encapsulation, to save ITR resources and so as not to
burden the ETR control plane with Packet As and the requirement to
send a report to the ITR, which the ITR must acknowledge.
If the ITR established just a few seconds
ago that a 7050 byte packet in the tunnel results in a PTB, or if it
repeatedly finds it can get 7020 byte packets to the ETR, but not 7021
byte packets, then why should it try again by sending a traffic packet
via RPD2 when the resulting packet in the tunnel would be longer
than 7020 bytes? There is no good reason to do this.
However,
if it was a few minutes (1, 5, 10?) since the last RPD2 attempt to send
a packet longer than 7020 bytes into the tunnel, maybe it would be a
good idea to send this traffic packet as an RPD2 probe, to see if the
Real PMTU has changed.
There is more work to do on this IPTM
proposal, of course. I think that it would be good to have some
algorithm, including with parameters set by the operator of the ITR, to
allow some adventurously long traffic packets (those which would exceed
UPME once encapsulated) into the tunnel, as RPD2 probes.
Initially,
I thought there was a role for the ITR occasionally sending Synthetic
Probe packets, longer than UPME, with RPD2 to periodically test whether
the Real PMTU has risen since the last tests set the value of UPME.
But why should an ITR send these bulky things into the tunnel
just in case the Real PMTU has changed? There probably isn't a
good reason. (But see
psg.com/lists/rrg/2008/msg01208.html
). It only needs to know if a SH is able to send, and
so is actually sending, traffic packets which would be longer than UPME
once encapsulated.
Perhaps there
is no role for Synthetic Probes at all . . . except as discussed
in
psg.com/lists/rrg/2008/msg01208.html
(2008-04-22).
The ITR
should use some longer traffic packets as RPD2 probes, but with some
algorithm to ensure this is not done "too often".
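One possible shape for such a "not too often" rule is sketched below
(Python; the interval would be an operator-set parameter - the value
here is only an illustration, not a recommendation):

  import time

  UPWARD_PROBE_INTERVAL = 300.0   # seconds between attempts above UPME (illustrative only)

  class UpwardProbeGate:
      """Decide whether a traffic packet which, once encapsulated, would
      exceed UPME may be sent as an RPD2 probe (to see whether the Real
      PMTU has risen) instead of being dropped with a PTB to the SH."""

      def __init__(self, interval=UPWARD_PROBE_INTERVAL):
          self.interval = interval
          self.last_attempt = float("-inf")

      def allow(self, now=None):
          now = time.monotonic() if now is None else now
          if now - self.last_attempt >= self.interval:
              self.last_attempt = now
              return True    # send this one with RPD2, above UPME
          return False       # drop it and send the SH a PTB (UPME - ENCAPS)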
Detecting
a decrease in the Real PMTU
This is
what the ITR needs to detect as rapidly as possible.
To do this
with all ordinarily encapsulated packets, some brute force approaches
include:
- Have the ETR acknowledge every encapsulated packet
it gets from this ITR, or at least those longer than some specified
length.
- With the outer header's source address being that of
the ITR (LISP, APT and TRRP - not Ivip) cache sufficient details of all
encapsulated packets (or those beyond a certain length) to be able to
reliably detect PTBs which arrive from routers in the tunnel to the
ETR, if one of these packets is too big. (As previously noted,
this is extremely onerous - probably completely impractical.)
Lighter-weight
approaches include expecting the ETR to acknowledge every 10th packet
above a certain length, every 100th, or at least one packet above a
certain length every minute. These approaches look promising for
non-Ivip systems (LISP, APT, TRRP, maybe SEAL) - but for an Ivip ETR,
ordinarily encapsulated traffic packets have no sign in them of which
ITR they were encapsulated by.
For this discussion, let's say
there is a continual stream of packets to some ETR, which are longer
than MPMTU (1200 bytes once encapsulated) and which are (as they should
be) shorter than the current value of LPME. Ivip ITRs will never
get a PTB if one or more of those packets is too long for a router in
the tunnel, and non-Ivip ITRs can't cache enough information about all
such packets.
For non-Ivip ITRs, it may be sufficient to watch
out for PTBs in general, and if some come in, then to start caching
sufficient information for at least some of the packets going to that
ETR (the PTBs will have the ETR's address as the destination address of
the initial bytes of the offending packet) that a secure test can be
run on one or a few of those PTBs, to make sure they are not from an
attacker. Then, the ITR could decide that the Real PMTU has
dropped below LPME. Probably the best response would be to
re-initialise LPME and UPME as if the ITR knew nothing about the path
to this ETR, and let the usual process of traffic packets in RPD2
probes adjust these variables to values which reflect the current Real
PMTU.
For Ivip ITRs, one approach would be to send a small
proportion of traffic packets - ideally close to, or at the length
limit imposed by (LPME - ENCAPS) - to each ETR with RPD2 encapsulation.
That will detect any downwards change in the Real PMTU, by a PTB
arriving and/or the ETR reporting that the Packet B did not arrive.
Exactly
how often to do this is a difficult question, which would depend a lot
on the circumstances. In practice, the PMTU to most ETRs might
remain stable from one year to the next, so it would be undesirable to
pepper the ETR with repeated RPD2 packets (which tie up the ETR's CPU -
while ordinary IP-in-IP packets are handled by its fast data-path, FIB
or whatever) in a forlorn search for changes in the Real PMTU.
Maybe some algorithm to send such a packet as an RPD2 probe every
10 minutes or so would be an acceptable trade-off between burdening the
ITR and ETR with RPD2 chores, and detecting a drop in Real PMTU to this
ETR.
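A sketch of how an Ivip ITR might gate these occasional RPD2 "canary"
probes per ETR (Python, hypothetical names; the 10 minute figure and
the "close to the limit" margin are only illustrative):

  import time
  from types import SimpleNamespace

  ENCAPS = 20
  CANARY_INTERVAL = 600.0    # roughly every 10 minutes per ETR, as suggested above
  NEAR_LIMIT_MARGIN = 64     # "close to the limit" margin - an arbitrary choice

  def should_probe_for_decrease(state, traffic_len, now=None):
      """Return True if this ordinarily-encapsulable traffic packet, which is
      close to the (LPME - ENCAPS) limit, should be sent with RPD2 instead,
      so that a drop in the Real PMTU below LPME would be noticed."""
      now = time.monotonic() if now is None else now
      near_limit = traffic_len >= state.lpme - ENCAPS - NEAR_LIMIT_MARGIN
      if near_limit and now - state.last_canary >= CANARY_INTERVAL:
          state.last_canary = now
          return True
      return False

  state = SimpleNamespace(lpme=7020, last_canary=float("-inf"))
  print(should_probe_for_decrease(state, 6990))   # True - first canary probe
  print(should_probe_for_decrease(state, 6990))   # False - not due again yet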
For Ivip and non-Ivip ITRs, it seems that occasional
sending of traffic packets by some more expensive means (RPD2 for Ivip)
would be sufficient to catch decreases in Real PMTU in a reasonable
time. The idea of non-Ivip ITRs taking special trouble to do this
only if they get PTBs is promising, but an attacker can easily generate
packets which look enough like PTBs to trigger this activity, so such
an approach opens a door to DoS attacks.
Various low-key
matters
The following sections
consider some issues related to the PMTUD problem, but which are not
essential to understanding the current IPTM - RPD2 proposal.
An
alternative to the RPD2 approach of splitting the traffic packet
RPD2 uses an odd-looking arrangement
of sending most of the traffic packet in Packet B and the rest in
three or so identical, small, Packet As.
Here is an attempt at
an alternative "Protocol X" approach, which involves the Packet B
function being performed by a packet which contains the entire
original traffic packet. This does not look like a good idea to
me at present, but perhaps it will be of interest.
In Protocol
X, the sole purpose of the one or more Packet A received by the ETR
would be to ensure the ETR gets instructions to tell the ITR
whether or not Packet B arrives properly.
First I will consider
a non-Ivip setting.
Packet B should have the same length as the
traffic packet would have with ordinary IP-in-IP (or whatever else)
encapsulation. The most obvious way of doing that is to use
that ordinary encapsulation. For IP-in-IP, the only way is to use
IP-in-IP. However, all non-Ivip map-encap schemes use a more
elaborate encapsulation scheme. Here I will discuss LISP.
LISP sends an IP header (source = ITR's address, destination =
ETR's address) followed by a UDP packet (destination port = some
particular port for all ETRs) followed by a special LISP header.
The LISP header has variable length, which the ETR figures out
from its initial bytes. Following the LISP header is the entire
traffic packet.
If every LISP header which precedes an
encapsulated traffic packet (or every traffic packet longer than ~1200
bytes) had a nonce which was unique to that traffic packet, then the
ETR would have no trouble receiving a Packet A, with the same nonce,
and finding which incoming encapsulated packet matches it. Then,
there would be no trouble generating the report back to the ITR.
However, there are two arguments against this:
- The
encapsulated packet which matches a particular Packet A arrives at the
ETR in the same way as every other encapsulated traffic packet.
In a high capacity ETR (such as a "big-iron" router from Cisco,
Juniper etc. - not an ETR implemented in software in a server) those
encapsulated packets are not going to be seen by the ETR's CPU, which
gets the Packet As and has to report to the ITR. How is the
fast data-path, FIB etc. going to trawl through all the incoming
encapsulated traffic packets looking for those with particular nonces?
That would be expensive. The RPD2 approach gets around
this, since the Packet B is addressed to a UDP port which can lead
straight to CPU involvement, while the main body of encapsulated
traffic packets (IP-in-IP) can be handled by the fast data-path.
- Having
a 32 bit nonce in each encapsulated traffic packet is a waste of space,
unless it is already needed for some aspect of LISP - which it may well
be. (Still, it remains a waste if Ivip can do it without any such
extra baggage in encapsulated traffic packets.)
So a nonce in
every traffic packet, or at least in every traffic packet which, when
encapsulated, is longer than ~1200 bytes, is a good way to do the
Packet B function of Protocol X for non-Ivip map-encap schemes.  Nonce
creation at the ITR
involves some cost, so it would be acceptable just to have the space
for a nonce, and to include one only when the packet is acting as a
Protocol X probe.
Side-note on nonce security:
A 32 bit nonce is a very powerful tool for
simply securing all sorts of queries and responses. However, if
an attacker knew the algorithm which generated the stream of nonces, he
or she could probably predict the nonces emitted by a particular ITR by
sending some traffic packets to the ITR, with a destination address
such that each would be encapsulated and sent to the attacker's own
"ETR". Then, the attacker might be able to make a good guess of
the nonce in packets being sent to an ETR by this ITR, which would
drastically reduce the security of the nonce protection against DoS
attacks. A physical noise source generating genuinely random
numbers would fix this problem decisively. Any PRNG
(Pseudo-Random Number Generator) approach might be vulnerable to
compromise like this.
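In software terms, drawing nonces from a cryptographically strong
random source (rather than a predictable PRNG) would address the same
concern; a minimal Python illustration of the difference:

  import random
  import secrets

  # A plain PRNG with a guessable seed: anyone who learns the seed, or
  # observes enough outputs of a weak generator, may predict later nonces.
  weak = random.Random(12345)
  predictable_nonce = weak.getrandbits(32)

  # Nonces drawn from the operating system's cryptographically strong
  # source are not predictable from previously observed values.
  strong_nonce = secrets.randbits(32)

  print(hex(predictable_nonce), hex(strong_nonce))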
A lightweight approach to this might be
to have a 32 bit field, which if zero, indicates it is a non-probe
encapsulated traffic packet. If the field is non-zero, the ETR's
fast data-path identifies this and informs the CPU of the value, and
the fact that the packet arrived correctly. This is a
"self-reporting" probe approach, where all packets are ordinarily
encapsulated, and a few of them are also probes.  If the ITR
sends a number of these and gets no report of their arrival, it can
reasonably conclude that either the ETR is unreachable, or that the
packets were too long for some PMTU limit en-route. The ITR
may already know the ETR is reachable by exchanging shorter packets
with it.
Still, I think that something like "Packet A" to tell
the ETR to make a report is a pretty good idea.
As long as the
goal is to have no additional headers in each normally encapsulated
traffic packet, then for Ivip there is no way of sending the traffic
packet, in a packet of the same length, with an added nonce or any
other distinctive marker by which the ETR could recognise this Packet B,
as instructed by the one or more Packet As it receives. This is why
RPD2 uses the odd-looking splitting of the traffic packet.
One
potential approach might be to have the ETR generate a 32 bit hash of
every encapsulated traffic packet it receives. That could be expensive
in a software-only router, but perhaps it would not be excessively
expensive on suitably programmed forwarding hardware. The ETR's
CPU could be given a list of received traffic packets, with their
lengths, and could fish through the list looking for those which
matched the hash and length in the Packet A. With Ivip, the
ordinarily encapsulated packet does not include the ITR's address, but
non-Ivip map-encap schemes do include this, enabling the ETR to only do
this for packets with a particular ITR's source address.
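A rough Python sketch of this hash-matching idea follows. zlib.crc32
stands in for whatever cheap 32 bit hash suitably programmed forwarding
hardware might compute; the history depth and all the names are
illustrative assumptions.

  # Hedged sketch only: ETR keeps (hash, length, outer source) for recent
  # encapsulated traffic packets so its CPU can answer a Packet A which
  # quotes the probe's hash and length.
  import zlib
  from collections import deque

  class HashHistory:
      def __init__(self, depth=1024):
          self.history = deque(maxlen=depth)

      def on_encapsulated_packet(self, raw_bytes, outer_src):
          self.history.append((zlib.crc32(raw_bytes) & 0xFFFFFFFF,
                               len(raw_bytes), outer_src))

      def on_packet_a(self, want_hash, want_len, itr_addr=None):
          for h, length, src in self.history:
              # Non-Ivip schemes can narrow the search to packets whose outer
              # source address is the ITR's; with Ivip that field carries the
              # sending host's address, so itr_addr would be None here.
              if itr_addr is not None and src != itr_addr:
                  continue
              if h == want_hash and length == want_len:
                  return "arrived"
          return "not-seen"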
Would
it be good enough to do something even lighter weight, and perhaps not
so secure?
What if the Packet A told the ETR to report, on all
conventionally encapsulated traffic packets which arrived from a given
ITR in the next 0.5 seconds, or which arrived with a particular length
in such a time-frame? (With Ivip, there is no way of doing this,
since these packets have the sending host's address in their outer
header's source address. So perhaps Packet A specifies a sending
host address instead, which would do the same job of greatly narrowing
the search for the particular incoming encapsulated traffic packet.)
Packet A would contain a nonce which would secure the ETR's reply
against such replies being spoofed by an attacker. If the ETR
reported packets coming from a particular ITR (actually, just packets
arriving with the ITR's address in their outer source address) and one
or more packets were received at the requisite length, that would be
pretty good proof the packets were getting through.
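A rough Python sketch of the ETR side of this lighter-weight idea is
below. The 0.5 second window is taken from the text; everything else
(the names and the report layout) is an illustrative assumption.

  # Hedged sketch only: ETR reports, for a short window, on encapsulated
  # traffic packets whose outer source address (and optionally length)
  # matches what a Packet A asked about.
  import time

  class WindowedReporter:
      WINDOW = 0.5    # seconds, as suggested in the text

      def __init__(self):
          self.watch = None    # (src_addr, length_or_None, nonce, expiry)

      def on_packet_a(self, src_addr, length, nonce):
          # "Report any matching packets seen in the next 0.5 s; echo this
          # nonce so the ITR knows the reply is genuine."
          self.watch = (src_addr, length, nonce, time.time() + self.WINDOW)

      def on_encapsulated_packet(self, outer_src, length):
          if self.watch is None:
              return None
          src_addr, want_len, nonce, expiry = self.watch
          if time.time() > expiry:
              self.watch = None
              return None
          if outer_src == src_addr and (want_len is None or length == want_len):
              return {"nonce": nonce, "matched_length": length}   # report to ITR
          return None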
But a DoS
attack is still possible: an attacker could fire spoofed traffic packets -
with the same ITR's address in their outer source address - at the ETR,
on the basis that one of them will match the length of a packet the ITR
really sent, thereby tricking the ITR into deciding it can send longer
packets to this ETR than it actually can.
Also, the attacker could be generating the traffic packets the
ITR is unwittingly using as probes . . .
This approach could
be hardened against attack by requiring the ETR to report back not just
that such packets were received, but to include a CRC of the entire
packet, outer header and all. This would be pretty robust.
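A rough sketch of the CRC-hardened report, in Python, is below. The
message layout is invented for illustration; the point is only that the
ITR believes a report when the nonce matches the one it sent in Packet A
and the CRC matches the exact packet, outer header included, that it
tunneled.

  # Hedged sketch only: CRC-hardened reporting.
  import zlib

  def etr_report(nonce_from_packet_a, received_packet_bytes):
      # The nonce protects the report against spoofed replies; the CRC is
      # evidence the ETR saw this exact packet, outer header and all.
      return {"nonce": nonce_from_packet_a,
              "length": len(received_packet_bytes),
              "crc": zlib.crc32(received_packet_bytes) & 0xFFFFFFFF}

  def itr_accepts(report, expected_nonce, sent_packet_bytes):
      return (report["nonce"] == expected_nonce
              and report["length"] == len(sent_packet_bytes)
              and report["crc"] == (zlib.crc32(sent_packet_bytes) & 0xFFFFFFFF))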
The attacker could be generating the traffic packets and firing
spoofed packets at the ETR which pretend to be the encapsulated traffic
packets from the ITR. But if the attacker can predict the
contents of the encapsulating header, then he/she could construct an
identical spoofed encapsulated packet which would generate the CRC the
ITR is expecting. The simpler the encapsulation format, the
easier this would be to do. The attacker may be able to predict
things about the ITR's encapsulating headers by running its own "ETR"
and having the ITR encapsulate his/her traffic packets and tunnel them
to this "ETR".
There may well be some lighter-weight alternatives
to the RPD2 approach outlined above, but without a nonce in the actual
probe packet, it could be quite difficult to do this well, in terms of
ETR resources and in terms of robustness against DoS attack.
The Sending Host's perspective
The
SH's experience of all this depends on a number of factors.
As
long as the mapping stays the same, packets sent by a SH to a
particular DH will always go via one ETR. (This is true only of Ivip;
LISP, APT and TRRP do explicit load sharing in the ITR, so in principle
packets could be sent to multiple ETRs, each via a tunnel with a
different PMTU. However, I think these other map-encap
schemes try to keep the packets from any one SH going to the one
ETR.)
As long as the ITR
uses
the one interface for reaching that ETR (it could change, if the ITR is
a router and the routing situation changes so packets are sent via
another interface) and as long as the routing path to that ETR doesn't
change (it could change at any time, perhaps to a path through routers
which results in the tunnel's Real PMTU being greater or less
than it was a moment ago) and provided there is no flaky routing
(sending packets randomly via different paths with different Real
PMTUs) then the SH will have a pretty easy time with all this.
When
the SH first sends packets to this DH, the SH presumably has no cached
MTU
limit associated with this DH. Either the ITR has no 3 Variables
(described
below) for the ETR these initial packets are to be tunneled to, or it
has been sending longer-than-MPMTU packets to this ETR in the recent
past, in which case the ITR probably has UPME and LPME set to values
which are close to the current Real PMTU.
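As a rough illustration (in Python) of the per-ETR state being referred
to here: only UPME and LPME are shown, assuming they are upper and lower
estimates bracketing the Real PMTU to that ETR; the full 3 Variables are
described below, and the field names and values here are mine, not part
of the proposal.

  # Hedged sketch only: per-ETR PMTU state at an ITR.
  from dataclasses import dataclass

  MPMTU = 1200    # globally agreed minimum PMTU, e.g. ~1200 bytes

  @dataclass
  class EtrPmtuState:
      upme: int            # upper estimate of the Real PMTU to this ETR
      lpme: int = MPMTU    # lower estimate: lengths known to get through

      def send_with_confidence(self, encapsulated_length):
          # Packets no longer than LPME can be tunneled with confidence;
          # lengths between LPME and UPME are candidates for probing (RPD2).
          return encapsulated_length <= self.lpme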
Generally,
the SH's
experience
will be of a stable Real PMTU to its DH. However, the Real PMTU
will
change, and probably the ITR's conception of the PMTU (in the 3
Variables for each ETR) to whatever ETR handles this DH, will change in
circumstances including the following:
- The mapping of this
DH's micronet changes to another ETR, and that ETR has a
different Real PMTU from this ETR, and/or the ITR's 3 Variables
for that ETR differ from those of the original ETR. Unless the SH
does
its own ITR functions (Ivip's ITFH) - and this will not typically be
the case - the SH has no idea about changed mapping because it has no
knowledge of the map-encap system. So the Real PMTU will simply
change, without notice, just as it would now if the routing system sent
packets via a path with a different overall PMTU. Mapping changes
in
Ivip typically occur due to multihoming service restoration and
mobility, but also due to slower changes such as the destination
network using a different ISP (portability).
- If an
end-user uses frequent Ivip mapping changes to dynamically move traffic
from one ETR to another, in order to balance traffic over multiple
links, and if the tunnels to these ETRs from the various ITRs
handling this traffic have different PMTUs, then this end-user will
create considerable
havoc for the ITRs and the SHs regarding PMTUD. Ideally, the
end-user would be able to use (for a fee) some commercial, global,
widely distributed, monitoring system to look at the PMTUs to its ETR
from various parts of the Net, in order to understand whatever PMTU
impacts their frequent changes of mapping might create.
- The
SH's outgoing
packets go to a different ITR. This new ITR might have a
different
Real PMTU to the ETR. This could happen if a nearby ITR initially,
for some reason, lets the packets go to some other ITR, perhaps an
OITRD (Open ITR in the DFZ), and later handles the packets itself. (A
similar situation occurs with APT, where initial packets are
encapsulated by the Default Mapper and the rest by a local caching ITR.)
- The
ITR and ETR remain stable, but there are either routing changes between
them (quite likely) in a way which affects Real PMTU, or the one stable
routing path has a changed Real PMTU (unlikely). The SH won't be
able
to detect this, and it is important that the ITR adapt to it reasonably
quickly. If the ITR is a real router, it may be able to
detect some
changed path, and so know that this would be a good time to retest the
PMTU to this ETR. However, this can't be assured for any router,
and
some or many ITRs may be servers, which only participate in the local
or BGP routing system sufficiently to attract traffic packets which are
destined for mapped addresses (Ivip micronets, LISP/APT EID
prefixes
etc.). When the Real PMTU increases, there is not much trouble,
since
the existing communication sessions continue with the current packet
lengths. All that is lost is some efficiency. When the Real
PMTU
drops, there could be more serious trouble. See the discussion above
about how the ITR might best discover changes to the Real PMTU to each
ETR it is currently tunneling packets to.