IPTM - Ivip's approach to
solving the problems with encapsulation overhead, MTU, fragmentation
and Path MTU Discovery
Robin Whittle
rw@firstpr.com.au 2010-01-28
The remainder of this page is unchanged from 2008-04-22:
Please see these
messages or discussions on the RRG list for things I will probably add
to this page soon:
psg.com/lists/rrg/2008/msg01207.html
(2008-04-22)
Discussion of how
my goals with IPTM differ significantly from what I understand are the
goals of SEAL.
Also, some thoughts about sending the ordinarily
encapsulated (IPv4-in-IPv4) packets into the tunnel with DF=0, in case
there is a PMTU limit there which is lower than the ITR expects,
enabling those packets to be fragmented and so most likely be delivered
to the ETR, which will defragment them, and decapsulate the original
packet.
psg.com/lists/rrg/2008/msg01208.html
(2008-04-22)
Discussion of the
ITR fragmenting IPv4 DF=0 traffic packets before encapsulation, and the
need to send Synthetic Probe packets while doing this to explore the
Real PMTU to the ETR, if there are no DF=1 packets to do this with.
Sending                                                      Destination
Host  ->--(R1)----(ITR2)==(R3)==>==(R4)===(ETR5)---(R6)---->-   Host
                      \_______Tunnel_________/
With Ivip, routers R3 & R4
in the ITR to ETR tunnel can't send
ICMP Packet Too Big (PTB) messages to the sending host in a form
which the
sending host recognises. With LISP, APT and TRRP,
the outer header's source
address is the ITR's address, so R3
and R4
can't send PTBs to the sending host - and it is highly
impractical for the ITR
to securely respond to R3's or R4's
PTB messages in a way
which enables the ITR to send a PTB
message to the sending host in a form
it will recognise.
So, if the encapsulated packet is
too big for any router (or other tunnel arrangement) in the
ITR --> ETR tunnel, then the packet will be dropped, without either
the ITR or the Sending Host knowing about it.
Introduction
- new proposal in April 2008
The
material on this page up until 21 April 2008 (from October 2007 and
updated somewhat to February 2008) was for a proposal which is now
obsolete. I have archived that material here
/archive-2008-04-20/ . That old
page may still be of interest - such as for how the Sending Host
(SH) would traceroute the routers in the tunneled portion of the
path. Some of that page's links to older discussions
might still be valuable, but the
new proposal below is what I am currently working on.
On 19
April 2008 I devised a completely different approach to handling these
problems. Below is a reasonably descriptive summary of the
proposal, which I will write up in an Internet Draft as
soon as I can, probably later in 2008.
For now, I am calling
the new proposal the same as the
old:
IPTM: ITR Probes Tunnel
MTU.
Part of this new proposal is a robust method of
probing the PMTU,
and possibly delivering a traffic packet at the same time. This
sub-protocol is called RPD2:
Robust PMTU Discovery and traffic Packet Delivery Protocol.
(Don't blame me for these acronyms - I need a term for these
things and I am trying not to reuse any existing IETF acronym!)
The
ITR doesn't do anything special with traffic packets
which, once encapsulated (the ENCAPS overhead with IPv4 Ivip is the
20 byte IP-in-IP header), are still short enough to be sent with confidence
to any ETR in the world. There would be some globally agreed
constant
MPMTU (Minimum Path
MTU) for this, such as
1200
bytes, enforced by a BCP (Best Current Practice recommendation) that
all ITRs and ETRs be placed at locations where they can exchange
packets of at least this length with routers in the DFZ.
The
task is how to handle traffic packets which are longer than this when
the ITR has no knowledge, or still has incomplete knowledge, of the
Real PMTU (Path Maximum
Transmission Unit) to the ETR. The Real PMTU is typically
some value which remains reasonably stable, due to a stable path of
routers, with their various individual MTU limits according to link
MTUs, any fancy tunneling arrangements for inter-router links etc.
However, the Real PMTU may vary over time, as the path changes to
other routers, as the routers change their behavior. Also, due to
routing vagaries, sometimes packets from one ITR to one ETR may be
sent by one path with one
Real PMTU and sometimes over another path with another Real PMTU.
These scenarios of flaky variable Real PMTU are a problem for any
system.  I intend the following proposal to cope reasonably well with
them, generally choosing lower limits and so limiting efficiency
somewhat, rather than causing packets to be lost without any
appropriate
PTB (Packet Too
Big) message being sent to the
SH
(Sending Host). However any path which, for instance, has a lower
PMTU for 1% of packets, is probably going to cause trouble for 1%
of packets which are longer than that PMTU.
The example below
assumes a single sending host, a single ITR and a single ETR.
There will usually be multiple ITRs sending packets to one ETR,
but each ITR is operating in isolation from other ITRs, and performs
the following IPTM and RPD2 processes with the ETR. The one
or more ITRs may be
sending encapsulated packets to the ETR because it handles the
micronet (EID prefix in LISP or APT terminology) of a single
DH (Destination Host), but more
likely, there will be multiple DHs in each micronet, and probably
multiple micronets handled by the one ETR. Any PMTU problems
between the ETR and the DH are outside the scope of this IPTM process.
They would be handled either by the ETR, or by the PMTU limiting
router between
the ETR and DH, sending a PTB to the SH.
For a given ETR, the
ITR
might be sending packets from a single SH, including from a single
application in that SH, which sends to a single DH which is currently
reachable via that ETR. However, there may be multiple
applications in that SH sending to that DH, or to other DHs in the same
micronet, or other DHs in other micronets which are currently mapped to
the same ETR. The ITR will often be accepting packets from
multiple SHs.
Relies only on RFC 1191, not on RFC 4821
The following proposal only assumes
ordinary, currently ubiquitous (I understand) sending host support for
RFC 1191 (1990) style PMTUD.
With this, it is vital that for any DH, the SH only get
PTBs with
MTU values which do not unreasonably restrict the packet lengths that
host will send to that DH. This is because RFC 1191 requires each
SH to send packets no longer than that value to that DH for the next 10
minutes. After that, the SH can try its luck with larger packets,
in an effort to discover whether the Real PMTU to that DH has increased
since it got the last PTB message.
I understand that most
hosts send IPv4 packets with DF=1 (Do not Fragment), as part of their
RFC 1191 approach to PMTUD. This is supported well by this IPTM
proposal. IPv6 packets are non-fragmentable as well. A
section below discusses how the ITR might handle IPv4 packets with DF=0.
RFC
4821 Packetization Layer
PMTUD is not required for this new IPTM - RPD2 process to work
fine.
However, RFC 4821 PMTUD should work fine over any map-encap
system (or any other tunneling system) which uses IPTM - RPD2.
RFC
4821 is a recent RFC (March 2007), and I understand from Fred Templin
that it has been implemented in the Linux kernel. I haven't
verified this, however I believe that the adoption rate of RFC 4821
will be slow at best. (Try Googling "RFC 4821 adoption"
etc.) It requires the OS to work with
applications which choose packet lengths and which are able to
determine the success or not of packets being sent to the application
at the other end - in the DHs. The applications are supposed to
report this back to the
OS on the success or otherwise of packets of different lengths being
sent to particular DHs. The OS is supposed to support the
applications by feeding back
an estimate of the PMTU to each DH, so each application can make a
choice about what
sized packets to send. (I think the TCP part of the stack counts
as an "application" in this context, because like the other
applications, it has a "Packetization Layer" which needs to decide what
length packets to break long streams of data into.)
This
sounds really messy, and I think it
will in practice be a slow process, if it happens at all, for a
significant number of applications to be adapted to this, in most or
all major operating systems. Not least, each OS would need a new
method of achieving this unusual two-way flow of information
between applications and the OS. Debugging this could be highly
problematic, with each host having different applications and operating
in different conditions.
RFC 4821 does not rely on PTB
messages, but can use them, since it builds on the basic RFC 1191
approach. It was prompted in part (or very largely?) by an
upsurge in unwise filtering which prevents PTB messages from reaching
the SH. IPTM's RPD2 protocol does not rely on PTBs to
successfully discover PMTU to each ETR, but it would work somewhat
better with them.
Advantages
The new process looks quite promising
for reasons including the following points. Maybe there are
gotchas and problems I haven't seen yet - please let me know your
critiques, concerns and suggestions for improvement.
There are
costs in complexity, but it doesn't seem inordinately complex or
expensive considering the difficulty of the problem it is solving.
Hopefully
there are no major security problems in what I propose below.
Please point out any weaknesses I have missed.
These notes
on advantages are somewhat simplified, and assume that there are no
unfortunate and typically rare packet losses which upset
the process.
Generally, the process should be robust against random packet
loss.
- Application data is not lost due to PMTU problems.
Whenever the ITR can't get a traffic packet to the ETR, it knows
about this and sends a PTB message to the sending host, which will try
again with a shorter packet.
- No
requirement for hosts to use RFC 4821.
- Does not embody any
assumptions about the currently widely used ~1500 PMTU limit (100Mbps
Ethernet etc.) or of jumbo-frame supporting equipment, which may have
MTUs of 9000 bytes (1Gbps Ethernet and beyond). So the system
should work optimally now and in the future as more of the Net adopts
jumbo-frame compatible equipment.
- The ITR
automatically and rapidly adjusts its two variables to converge on the
Real PMTU to each ETR. As the remaining Zone of Uncertainty
diminishes, ultimately perhaps to zero, so does the need for sending
traffic packets as probes. So the PMTUD process is self limiting,
for each ITR to ETR tunnel.
- There seems to be no need for
Synthetic Probe packets (long probes to test PMTU limits, but which do
not carry traffic data).  (But see the discussion here:
psg.com/lists/rrg/2008/msg01208.html - it looks like they will be
needed for probing if the only too-long packets are IPv4 DF=0.)  These
might have been useful for detecting if the PMTU has risen, but it
looks like they will not be needed.  They are not required
during the most important phase, when the ITR is trying to quickly
figure out the PMTU to an ETR it has never heard of before, and to
which it needs to send packets which exceed MPMTU (1200) bytes.
- Every
traffic packet which, when encapsulated, would be of a length the ITR
is uncertain about, in terms of whether it can be sent to the ETR
without PMTU problems (the Zone of Uncertainty), will be used as a
probe and will generally
enable the ITR to adjust its variables to improve its estimated PMTU in
a robust and helpful manner.
- When
a probe packet is received by the ETR, the traffic packet it embodies
is sent to the DH, and the ITR gets a message that the packet was
successfully received by the ETR.
- When a probe packet runs into
PMTU difficulties, the ITR will get a PTB message, or the probe will
not be received by the ETR, or both.  In any of these cases the ITR sends
a PTB to
the SH, and the SH tries again, with a shorter packet. No loss of
application data results from these probe packets which were too long
for the PMTU of the ITR --> ETR tunnel.
- The ITR does
very little additional work, such as sending Synthetic Probe packets.
(But see psg.com/lists/rrg/2008/msg01208.html
) It only probes PMTUD to the ETR when traffic packets of
suitable
length need to be tunneled to it.
- The protocol is not
overly complex for either the ITR or ETR. The state
requirements in both are modest.  In particular, the ITR does not
hold state for most tunneled packets. It only needs to hold some
state for those few which are probe
packets.
- The natural behavior of a SH to a sequence of PTB
messages, each with a still lower MTU value (as this IPTM will
generally create) will be to send shorter and shorter packets, which is
ideal for the ITR discovering more about the Real PMTU, and so
fine-tuning its two variables towards each other, diminishing the Zone
of Uncertainty.
- There is no fragmentation
or segmentation (a non-fragmentation approach to chopping them into
smaller packets) of traffic packets, except for fragmentation of IPv4
DF=0 packets as noted
below:
Fragmentation
This discussion and the example below
assumes the traffic packet is non-fragmentable (IPv6 or
IPv4 with DF=1). I understand that most IPv4 traffic packets are
sent according to RFC 1191 with DF=1.
A fragmentable IPv4 packet
could be
fragmented
by the ITR, into however many pieces of a size, once encapsulated, that
its current knowledge of the PMTU to this ETR indicates could be safely
sent. (This maximum safe size to send into the tunnel is the
LPME variable for this ETR, as described below.) These would be
reassembled by the destination host.
There may be some problems with reassembly of really
large numbers of such packets due to the 16 bit
fragment ID wraparound problem (delayed fragments matching the fragment
identifier of a packet generated 64k fragments later).  Fred
Templin's SEAL proposal (
SEAL ID)
outlines those problems and provides a solution which involves extra
headers. I will look into this and try to find a solution to
whatever problem there may be, hopefully without extra or longer
headers. Probably the solution would be to do something like SEAL
does: break the packet into "segments" and reassemble them at the ETR,
using a more robust protocol with a 32 bit identification system,
rather than the 16 bit fragment identifier which ordinary IPv4
fragmentation provides.
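As a rough illustration of the sizing involved, here is a minimal
Python sketch (hypothetical names; it ignores the fragment ID
wraparound issue just discussed) of how an ITR might choose fragment
payload sizes for a DF=0 IPv4 packet so that each fragment, once
encapsulated, is no longer than LPME:

  IPV4_HEADER = 20   # inner IPv4 header, repeated in every fragment
  ENCAPS      = 20   # Ivip IPv4 IP-in-IP encapsulation overhead

  def fragment_payload_sizes(total_len, lpme):
      """Split a DF=0 IPv4 packet of total_len bytes (header included) into
      per-fragment payload sizes such that each fragment, once encapsulated,
      is no longer than lpme.  All fragments except the last must carry a
      multiple of 8 payload bytes (the IPv4 fragment offset unit)."""
      payload = total_len - IPV4_HEADER
      max_payload = lpme - ENCAPS - IPV4_HEADER
      max_payload -= max_payload % 8
      if max_payload <= 0:
          raise ValueError("LPME too small to carry any fragment")
      sizes = []
      while payload > 0:
          chunk = min(payload, max_payload)
          sizes.append(chunk)
          payload -= chunk
      return sizes

  # A 7000 byte DF=0 packet with LPME = 1200: six fragments carrying 1160
  # payload bytes each, plus a final 20 byte fragment.
  print(fragment_payload_sizes(7000, 1200))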
Explanation
by way of example
For each ETR that
an ITR needs to do PMTUD (Path MTU Discovery) with, here is an
outline, by way of an example, of what happens. This example
has some details for Ivip, but the same process is adaptable
to other map-encap schemes (LISP, APT and TRRP). I think it could
also be adapted as the basis for Fred Templin's SEAL proposal.
An
ITR takes no special action when tunneling packets of length L
<= (MPMTU - ENCAPS) to any ETR.  So as long as the traffic packet was 1180
bytes or less, then the ITR encapsulates it with
Ivip's ordinary IP-in-IP encapsulation, making a packet no longer than
1200 bytes, and sends it to the ETR.  All the example below
relates to what the ITR does after it receives one or more traffic
packets which, once encapsulated, would be longer than MPMTU (1200
bytes).
3 variables for each ETR
For each ETR it needs to send packets to which, once encapsulated, are
longer than MPMTU, the ITR creates 3 variables, initialised as follows:
IMTU  (Interface MTU for the Interface via which the ITR sends packets
to this ETR)

Initial value: Whatever the MTU is for that interface.  In this example
the ITR uses a jumbo-frame capable interface, so the value is 9000
bytes.

IMTU is not adjusted as part of the process, so it is effectively a
constant.  There are circumstances in which the value does change, and
in all these cases, the ITR starts with a clean slate: setting IMTU to
the new value and the other two variables to their initial values as
noted below.  These changes of IMTU include:

- The Interface itself changes its MTU (seems pretty unlikely).

- The ITR has multiple interfaces, and due to routing or whatever other
changes, now uses a different interface to send the packets to the ETR
- and that interface has a different MTU.  (Still, sending the packets
out another interface is a good reason to restart the PMTUD process by
initialising the variables again, since the path could be quite
different from what it was from the initial interface.)
UPME  (Upper Path MTU Estimate)

Initial value: IMTU (MTU of the ITR's interface on which it sends
packets to this ETR).  In our example, the ITR uses a
jumbo-frame-capable interface, so the initial value of UPME is 9000
bytes.

As the ITR initialises and later may reduce the value of UPME, this
variable specifies the longest packet the ITR considers it might be
able to send to the ETR without PMTU limits.  For instance, if 6011
bytes is the length of the shortest packet the ITR knows it cannot get
to the ETR - due to one or more failed attempts to do this, or due to
an attempt to send a packet of this or a longer length which resulted
in a router in the ITR --> ETR tunnel returning a PTB with MTU = 6010 -
then the ITR would set UPME to 6010.

By comparing the length L of a traffic packet with (UPME - ENCAPS) and
with the other variable (LPME - ENCAPS), the ITR decides how to handle
each traffic packet it needs to encapsulate and tunnel to this ETR.
LPME  (Lower Path MTU Estimate)

Initial value: MPMTU - a worldwide agreed-to packet length which all
ITRs and ETRs can send to each other without PMTU problems.  In our
example, LPME is initialised to 1200 bytes.

As the ITR initialises and later typically increases the value of LPME,
this variable specifies the longest packet the ITR can confidently send
to this ETR.  Therefore, if a traffic packet has length L which is
equal to or less than (LPME - ENCAPS) then the ITR will send it
normally.  For Ivip, this means ordinary IP-in-IP encapsulation with a
20 byte header overhead.  Such packets are not a part of the ITR's
process of PMTUD.
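For concreteness, here is a minimal sketch in Python (with hypothetical
names - not part of the proposal itself) of this per-ETR state and its
initialisation:

  MPMTU  = 1200   # globally agreed minimum PMTU between ITRs and ETRs (assumed BCP value)
  ENCAPS = 20     # Ivip IPv4 IP-in-IP encapsulation overhead

  class EtrPmtuState:
      """The three variables an ITR keeps for each ETR to which it must
      send packets which, once encapsulated, exceed MPMTU."""

      def __init__(self, interface_mtu):
          self.imtu = interface_mtu   # MTU of the outgoing interface (9000 here)
          self.upme = interface_mtu   # Upper Path MTU Estimate - moves down
          self.lpme = MPMTU           # Lower Path MTU Estimate - typically moves up

      def zone_of_uncertainty(self):
          # Encapsulated packet lengths from (LPME + 1) to UPME inclusive.
          return range(self.lpme + 1, self.upme + 1)

  state = EtrPmtuState(interface_mtu=9000)
  print(state.lpme, state.upme, state.imtu)   # 1200 9000 9000 initially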
Adjusting
LPME and UPME to reduce the Zone of Uncertainty
As PMTUD progresses, the ITR typically
adjusts these variables towards each other. UPME will move down
(unless there is a clear jumbo-frame supporting path to the ETR)
and LPME will move up (unless perhaps the ITR is unable to send a
packet longer than 1200 bytes to the ETR, which in practice will almost
never be the case).
The range of packet sizes between (LPME + 1)
and UPME inclusive is called the Zone
of Uncertainty.  The
ITR does not yet have adequate knowledge to predict whether a packet
whose encapsulated length (L + ENCAPS) is within the Zone of
Uncertainty could be sent to the ETR without PMTU problems.
It is possible, and in many cases likely,
that within a few seconds, with various traffic packets being sent as
probes, the ITR will raise LPME and/or drop UPME until they are the
same. In that case, there would be no Zone of Uncertainty.
Outer
algorithm for handling traffic packets
Here is the algorithm by which the ITR
decides how to handle each traffic packet, according to how its length
compares with these variables.
With Ivip's 20 byte
ENCAPS overhead, this means the ITR
will treat traffic packets in these ways according to their length L.
For each case below, the first line gives the in-principle test and the
second gives the test with values derived from the constants as initialised above.
Short enough to send normally:

  L <= (LPME - ENCAPS)
  L <= 1180
Tunnel
the packet to the ETR by the normal method: Ivip's IP-in-IP
encapsulation with the outer source address being that of the Sending
Host
(SH). (This is to assist with the ETR supporting the source
address
filtering of border routers of the ISP network it is within, and also
could be used by modified traceroute code in the SH to trace into the
ITR --> ETR tunnel.)
Length is within the Zone of Uncertainty:

  ( L > (LPME - ENCAPS) )  &&  ( L <= (UPME - ENCAPS) )
  ( L > 1180 )             &&  ( L <= 8980 )
Perform the following RPD2 protocol with
the traffic packet. In almost all cases, not counting unusual and
unfortunate instances of packet loss, this will result in:
Packet received: LPME will be increased to
(L + ENCAPS).
Packet not delivered, meaning one of the following
occurred:
- A PTB arrived from a router in the ITR --> ETR
tunnel.
- The ETR told the ITR the packet did not arrive.
- (The
ITR heard nothing, in which case it may try again, and assume the
packet did not arrive if there is no response the second time.
Actually, this probably means the ETR is unreachable)
In
this case, the ITR will adjust the value of UPME to the size of the
encapsulated packet, or to the MTU value reported in the PTB message,
whichever is lower. The ITR then sends the SH a PTB with an MTU
value of UPME.
Too long for known PMTU limit:

  ( L > (UPME - ENCAPS) )  &&  ( L <= (IMTU - ENCAPS) )
  ( L > 8980 )             &&  ( L <= 8980 )
Later, UPME may be adjusted to a lower
value than IMTU. If the packet length matches this test, then in
general, it will be dropped and a PTB sent to the SH, with an MTU value
of (UPME - ENCAPS). However, some of these packets may be sent
with the RPD2 protocol to explore the possibility that the Real MTU has
grown since the last tests indicated it was as low as UPME currently
indicates.
Too long for the ITR's outgoing Interface:

  L > (IMTU - ENCAPS)
  L > 8980
The packet, once encapsulated, is too long
for the interface which the ITR currently uses to send packets to this
ETR. Drop the packet and send a PTB to the SH with an MTU value
equal to (UPME - ENCAPS).
The above four options
explain the overall algorithm for handling
packets, and it can be imagined how a succession of packets which meet
the Zone of Uncertainty criteria will result in the ITR learning more
about
the Real PMTU to this ETR, and so adjusting LPME and UPME accordingly
to reduce the span of the Zone.
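To make the four-way decision concrete, here is a minimal Python sketch
(hypothetical names, not a specification) of how an ITR might classify
each traffic packet of length L against the three variables:

  ENCAPS = 20   # Ivip IPv4 IP-in-IP encapsulation overhead

  def classify_traffic_packet(length, lpme, upme, imtu):
      """Decide, from the three per-ETR variables, which of the four
      treatments above applies to a traffic packet of the given
      (unencapsulated) length."""
      if length <= lpme - ENCAPS:
          return "send normally with IP-in-IP encapsulation"
      if length <= upme - ENCAPS:
          return "within Zone of Uncertainty: send with RPD2 as a probe"
      if length <= imtu - ENCAPS:
          return "too long for known PMTU: drop, send PTB (MTU = UPME - ENCAPS)"
      return "too long for outgoing interface: drop, send PTB (MTU = UPME - ENCAPS)"

  # With the variables as initialised in the example (LPME 1200, UPME = IMTU = 9000):
  for length in (1180, 1181, 7000, 8980, 8981):
      print(length, "->", classify_traffic_packet(length, 1200, 9000, 9000))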
RPD2
protocol for reliable probing and potential delivery
Before giving an example of IPTM operation,
here are the
details of the RPD2 protocol, which is normally used so that the full
contents of a traffic packet are sent to the ETR, in a way which both
reliably probes the PMTU at a length that packet would have if sent via
normal encapsulation, and which will deliver the traffic packet to the
ETR if no PMTU problems are encountered.
At present, there are
two ways this could be done. The first approach splits the
original packet, sending most of it in a long Packet B probe packet,
and the rest in a short Packet A, of which 2 or 3 are sent just in
case one of the Packet As is dropped due to random packet loss.
A
potential second approach involves sending the traffic packet via
ordinary encapsulation, to test the PMTU, but has the Packet A role
performed in a different way. This has some elements of
simplicity, but I think it is not such a good idea, since it would be
difficult to reliably tell the ETR exactly which encapsulated traffic
packet to look out for. Also, this second approach would burden
the FIB, main traffic path or whatever of the ETR with the task of
always looking out for such probe packets. I describe this second
approach later, but in the meantime, please consider that the rather
awkward looking initial approach is probably a good approach,
since each Packet B can
easily and unambiguously be associated with its Packet A.
As
described here, the RPD2 algorithm in the ITR takes as its input the
traffic packet:
A packet from a
SH to a DH, for which the ITR has already looked up the mapping, found
which micronet the DH is in, and thereby found the ETR address it needs
to be sent to.  From that, it knows which interface to send it out
of.
and
therefore the ETR address, and the three variables for this ETR listed
above.
In the examples which follow, the traffic packets handled
by RPD2 are all of a length which means that once
encapsulated, each would be of a length which matches the Zone of
Uncertainty.
However, as discussed below, there may be a need to send packets
with
RPD2 when they are outside this zone:
- Shorter than
(LPME - ENCAPS) - to detect the possibility that the Real PMTU has
fallen below LPME.
- Longer than (UPME -
ENCAPS) - to detect the possibility that the Real PMTU has risen above
UPME.
RPD2 could also be used with Synthetic Probe
packets, in which case the Packet B is mainly zeros, and Packet A
contains only the required headers, most particularly the same nonce as
is in Packet B. With a Synthetic Probe, Packet A does not contain
any data which didn't fit into
Packet B. I am not yet sure whether we need Synthetic Probe
packets. At most they would be used for occasionally testing to
see whether the Real PMTU has increased. Maybe we don't need
them at all.
IPTM uses this RPD2 protocol only when needed.
Most traffic packets are sent with ordinary IP-in-IP, and any
PMTU problems those packets encounter in the ITR --> ETR tunnel are
not visible to the ITR. This is because the outer source address
is that of the SH. With LISP, APT and TRRP, the outer source
address is that of the ITR, so the ITR would get a PTB (except for
packet losses and any filtering of such messages). However, it is
highly impractical to require ITRs to cache sufficient details of all
encapsulated packets they send to securely identify a PTB as
being genuine and then to do something useful with this information,
including sending a suitably crafted PTB to the sending host.
Nonetheless,
see a section below where I discuss a "Protocol X" alternative to RPD2,
in which it might be feasible for a non-Ivip ITR to achieve the same
detection of PMTU problems in the tunnel to each ETR, while only
storing state for a fraction of the encapsulated traffic packets.
First,
I describe the construction of two packets A and B. Then the way
in which they are sent and how the ETR reports back to the ITR.
Following that is an example which puts together the outer
IPTM algorithm for handling packets of various lengths, with the 3
Variables for this ETR, and the RPD2 probing protocol to show how an
ITR would rapidly adjust its variables according to what it discovers
about the Real PMTU to this ETR.
Packet B
Packet B is the Big packet. It has
exactly the same length as
the traffic packet would have had if sent by ordinary encapsulation.
For IPv4 Ivip, this is 20 bytes longer than the raw traffic
packet.
It is vital that Packet B have the same length as a
normally encapsulated traffic packet would have, since a SH may
get a PTB regarding this traffic packet, and it is important that this
only result from a genuine problem in the network which relates to a
packet of that length. For instance, if the traffic packet length
L was used as part of a probe to create a probe packet of length longer
than (L + ENCAPS) then if the probe failed, the ITR would need to send
a PTB to the SH (unless the ITR tried to resend the traffic packet,
which is messy and probably too slow to be workable). In order
for PTB values to be valid from the point of view of the SH's RFC
1191 or RFC 4821 functions, the packet which caused the trouble on the
ITR --> ETR tunnel needs to be exactly the same length as would
result from ordinary, non-probe, encapsulation.
The ITR takes
the initial traffic packet (7000
bytes in this example) and chops it into two pieces.
From the
start, CHUNK1 bytes are kept
to form the basis of "Packet A". The remaining CHUNK2 bytes are kept to form the
basis of "Packet B".
CHUNK1 is relatively small, and so is
"Packet A". Most of the original packet winds up in "Packet B",
which has the following format:
-------
Outer IP header   (Actually, there is no inner header, but I keep this
                   term because it fits with the Packet A structure.)

    Outer Source Address = ITR's address
    Outer Dest Address   = ETR's address
    Next header: UDP

UDP Header  (well known port on ETR, length of all that follows, etc.)

RPD2 header:

    Flags etc. to the effect:

        This is a "Packet B".  Therefore, the ETR should
        link it with the first "Packet A" it correctly
        receives with the same nonce.

        ETR must acknowledge the receipt of this packet
        by a procedure noted below.

    CRC for the whole packet.  (Maybe put this at the end, after the
    large slice of the original traffic packet?)

    NONCE   A nonce which is unique for this traffic packet and is
            also used in the Packet A.

    CHUNK2  Length of the segment of the original packet
            contained herein.

CHUNK2 bytes of the original packet.
-------
CHUNK1 + CHUNK2 = original traffic packet length = 7000.

The total length of Packet B is defined as the original packet's length
plus ENCAPS.  In this case: 7020 bytes.  CHUNK2 is chosen to be 7020
bytes minus the length of Packet B's:

    Outer IP header     20 bytes
    UDP header           8 bytes
    RPD2 header         20 bytes (let's assume)
                      = 48 bytes total

So in this example, CHUNK2 = 7020 - 48 = 6972 bytes.  Therefore,
CHUNK1 = 28 bytes.
Packet A
The ITR assembles a "Packet A":
-------
Outer IP header

    Outer Source Address = SH's address  (This is for Ivip.  For other
                           map-encap schemes or for SEAL, use the ITR's
                           address.)
    Outer Dest Address   = ETR's address
    Next header: UDP

UDP Header  (well known port on ETR where RPD2 packets are expected,
             length of all that follows, etc.)

RPD2 header:

    Flags etc. to the effect:

        This is a "Packet A".  Therefore, the ETR should
        link it with a "Packet B" it correctly receives
        with the same nonce.

        ETR must acknowledge the receipt of this packet
        by a procedure noted below.

    CRC for the whole packet.  (Maybe put this at the end?)

    NONCE   Same as for the matching Packet B.

    CHUNK1  Length of the segment of the original packet
            contained herein.

CHUNK1 bytes of the original traffic packet.  This will consist of:

    Inner IP header:
        Inner source address = SH's address
        Inner dest address   = Destination Host's address
        etc.

    The rest of the original traffic packet, up to CHUNK1 bytes
    into that packet.
-------
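To illustrate the splitting arithmetic and the relationship between the
two packets, here is a minimal Python sketch.  The RPD2 header layout
used here is entirely hypothetical (and omits the CRC field) - it is
only meant to show that Packet B's UDP payload, plus the outer IP and
UDP headers, comes to exactly the length an ordinarily encapsulated
packet would have:

  import os
  import struct

  ENCAPS       = 20   # Ivip IPv4 IP-in-IP overhead Packet B must match in length
  OUTER_IP_HDR = 20
  UDP_HDR      = 8
  RPD2_HDR     = 20   # assumed RPD2 header length, as in the example above

  def split_for_rpd2(traffic_packet: bytes):
      """Split the traffic packet into CHUNK1 (carried in Packet A) and
      CHUNK2 (carried in Packet B) so that Packet B has exactly the length
      an ordinarily encapsulated packet would have."""
      packet_b_len = len(traffic_packet) + ENCAPS
      chunk2 = packet_b_len - (OUTER_IP_HDR + UDP_HDR + RPD2_HDR)
      chunk1 = len(traffic_packet) - chunk2
      nonce = os.urandom(4)            # unique per traffic packet
      # Hypothetical RPD2 header: flags(1) pkt-type(1) chunk-len(2) nonce(4) pad(12)
      hdr_a = struct.pack("!BBH4s12x", 0, ord("A"), chunk1, nonce)
      hdr_b = struct.pack("!BBH4s12x", 0, ord("B"), chunk2, nonce)
      packet_a_payload = hdr_a + traffic_packet[:chunk1]
      packet_b_payload = hdr_b + traffic_packet[chunk1:]
      return nonce, packet_a_payload, packet_b_payload

  nonce, pa, pb = split_for_rpd2(bytes(7000))
  # Packet B's UDP payload plus outer IP and UDP headers comes to 7020 bytes,
  # the length an ordinary IP-in-IP encapsulation of this packet would have.
  print(len(pb) + OUTER_IP_HDR + UDP_HDR)   # 7020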
Sending
Packets A and B
The ITR now does this:

  Send a Packet A.
  Send another Packet A.
  Send Packet B (the long one).

If it
receives no response from the ETR within some short time, say 0.1
seconds, it sends another Packet A.
It is really important
that the ETR gets at least one of these packets so it sends a report to
the ITR.  Without a report from the ETR, the ITR would have to wait
for some time, maybe half a second or a second or two, waiting for a
response, and then would have to abandon this attempt at RPD2 probing.
Sending
two little packets before the big one, and then another little packet a
moment later, seems like a good way of getting at least one packet to
the ETR. The ETR ignores any second or third Packet A it receives
with the same nonce.
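A sketch of this sending sequence (Python; send_packet and
report_received are placeholders for the ITR's transmit path and its
RPD2 state table, and the timings are the rough figures suggested
above):

  import time

  def send_rpd2_probe(send_packet, report_received, packet_a, packet_b, nonce,
                      retry_delay=0.1, give_up_after=0.5):
      """Send two copies of Packet A, then Packet B; send a third Packet A
      if no report has arrived after retry_delay seconds; give up after
      give_up_after seconds with no report from the ETR."""
      start = time.monotonic()
      send_packet(packet_a)
      send_packet(packet_a)        # a second copy, in case one is lost
      send_packet(packet_b)        # the long probe carrying most of the traffic packet
      sent_third_a = False
      while time.monotonic() - start < give_up_after:
          if report_received(nonce):
              return True          # a report arrived; act on its contents
          if not sent_third_a and time.monotonic() - start >= retry_delay:
              send_packet(packet_a)    # one more little packet, a moment later
              sent_third_a = True
          time.sleep(0.01)         # a real ITR would be event-driven, not polling
      return False                 # abandon this attempt at RPD2 probing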
When the ETR receives either its first
Packet A or the Packet B, it waits for a moment (maybe 0.1 secs?) for
the other to arrive.
Then if only one has arrived, it reports
back to the ITR what has happened, in a UDP packet, with some suitable
flags and fields to the effect:
  Packet A received (yes / no)
  Packet B received (yes / no)
  NONCE
The
ETR keeps listening for the other packet, and after some other time -
say 0.3 seconds - times out and forgets all about the one it
received. (However, for a few seconds it continues to send report
messages to the ITR until it gets an Ack from the ITR.)
Whenever
it receives the other packet, it reports back similarly, but with "yes"
for both fields.
When it gets both Packets A and B, the ETR
reassembles the original traffic packet and verifies that its source
address matches the outer source address of Packet A. (This is
Ivip only - other map-encap schemes and SEAL don't need this.)
Then the ETR passes the packet on to the destination host.
Acknowledging
Packets A, B and ETR messages.
The
ETR is expected to send one or two report messages to the ITR, as
described above when it either receives one of the ITR's packets A
and B, or both of them. The ETR is required to resend those
report messages, every 0.5 second for 1.5 secs or so (depending on
how quickly the ITR times out of this RPD2 process) until it gets
an Ack from the ITR. The nonce secures the ETR -> ITR
message and likewise the ITR --> ETR acknowledgment of that
message.
The ITR sends a single Ack for each report message it
receives from the ETR. If the first report message or Ack is
lost, the ETR will send the report message again and
probably the second Ack will be received by the ETR.
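On the ETR side, the report and retransmission behaviour just described
might be kept per nonce in something like the following sketch (Python,
hypothetical names; the timer values are the rough figures from the
description above, and the caller is assumed to record the time of each
report it sends in last_report):

  import time

  # Rough timings from the description above (seconds); all open to tuning.
  WAIT_FOR_OTHER = 0.1   # wait this long for the other of Packet A / Packet B
  RESEND_REPORT  = 0.5   # resend the report this often until the ITR Acks
  GIVE_UP_AFTER  = 1.5   # stop resending reports after roughly this long

  class Rpd2Record:
      """What an ETR might remember, per nonce, while an RPD2 exchange is open."""

      def __init__(self):
          self.got_a = False
          self.got_b = False
          self.acked = False
          self.first_seen = time.monotonic()
          self.last_report = None

      def on_packet(self, kind):
          # Duplicate Packet As (or Bs) carrying the same nonce are ignored.
          if kind == "A":
              self.got_a = True
          elif kind == "B":
              self.got_b = True

      def report_due(self, now=None):
          """True when a report (Packet A received yes/no, Packet B received
          yes/no, NONCE) should be sent or resent to the ITR."""
          now = time.monotonic() if now is None else now
          if self.acked or now - self.first_seen > GIVE_UP_AFTER:
              return False
          if not (self.got_a or self.got_b):
              return False
          if self.last_report is None:
              # First report: send at once if both packets have arrived,
              # otherwise after the short wait for the other packet.
              return (self.got_a and self.got_b) or \
                     now - self.first_seen >= WAIT_FOR_OTHER
          return now - self.last_report >= RESEND_REPORT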
Responses
to the ITR
After sending these 2 or
3 copies of Packet A, and the Packet B, the ITR could receive a number
of things:
PTB from router
in the tunnel
The ITR caches the
initial bytes of Packet B, so it can verify the validity of this
PTB. (Maybe put NONCE up front in the RPD2 header to make this
easier?)
Assuming the PTB checks out OK, the ITR has now found
out something concrete about the real PMTU to this ETR: it is equal to
or less than the MTU value in this PTB.
This value is written to
UPME (Upper Path MTU Estimate).
Then, the ITR sends back a PTB
to the SH, with an MTU value (UPME - ENCAPS).
That ends this
instance of RPD2, other than the ITR acknowledging any report it
receives from the ETR.
No application data has been lost.
Assuming the sending host gets the PTB message from the ITR, it will
resend the data in smaller packets.
ETR report: Packet A arrived OK, but not
Packet B
If this report arrives and
no other report arrives a moment later, then the ITR can reasonably
suspect that Packet B was either lost (or delayed excessively) or that
it was too long for some router in the tunnel, but that no PTB has yet
been received.  In the latter case, maybe the PTB was not sent,
maybe it was filtered or maybe it was dropped or delayed somewhere.
If
this is the situation, the ITR might try again - a fresh RPD2 cycle
with a new nonce, and the same traffic packet. If the same thing
happens the second time, the ITR is probably justified in concluding
that the Real PMTU to this ETR is lower than the length of Packet B.
(Maybe use the same nonce again, so in case the ETR does receive
the first Packet A and B OK, that there won't be duplicate traffic
packets sent to the DH.)
If, in the second attempt, the ETR
reports that Packet B arrived, then this means the length of this
Packet B can be written into LPME, raising it and therefore reducing
the ITR's Zone of Uncertainty for this ETR.
ETR report: Packet B arrived OK but none of
the 3 Packet As
In Ivip, the most
likely cause of this is that the network in which the ETR is located
has its border routers set up to drop incoming packets which match one
of the network's prefixes. In this case it means that the
sending host's address matched one of these prefixes, in which case the
ISP's network considers it to be a packet with a spoofed source
address. Ivip ETRs will drop any inner packet with a source
address which does not match the source address in the outer header.
So this indicates the traffic packet had a spoofed source address.
The
ITR could retry the RPD2 approach again, with the same traffic packet
and a different (same?) nonce, on the off-chance that the failure of
any of the Packet As to arrive was just bad luck with random packet
loss.
At least the ITR has got a new higher value to write into
LPME, again closing the Zone of Uncertainty between this and UPME.
If
the same result occurs on a second attempt, the Ivip ITR should forget
about this traffic packet, since it is reasonable to assume it has a
source address the ETR will reject, due to the filtering of the border
routers in the network within which the ETR is located.
ETR report: Packet B arrived OK and at
least one Packet A arrived OK
This
might be the initial report, or it may arrive up to half a second or so
later. (I am keen to have ITRs not waiting around for bedraggled
packets which arrive well after they should.)
The ITR writes the
length of Packet B to LPME.
The ITR has successfully delivered
the packet and established a new lower limit for the final value of
LPME.
No response after 0.5
seconds or so
The ITR concludes
that either the ETR is unreachable at present - or that the device
there is not an ETR.
Ivip ITRs generally are not concerned
with the reachability of ETRs. However, as part of this PMTU
stuff, they may discover an ETR is unreachable. If this happens
repeatedly, the ITR could drop all traffic packets which should
be sent to this ETR. Maybe it should send back an ICMP host
unreachable message to the SH, but the ITR would never discover
this unreachability if the packets were no longer than MPMTU, or no
longer than the current value of LPME. So the SH should not rely
on any such ICMP host unreachable messages from the ITR.
Any
end-user who manages their mapping wisely will soon find it is not
working - for instance if they point it at some device which is not an
ETR, or at an ETR which is unreachable - so generally the mapping
will soon be changed to point to a reachable ETR.
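Putting these outcomes together, the way the ITR adjusts its two
variables and decides whether to send the SH a PTB might be sketched as
follows (Python, hypothetical names; the outcome strings simply label
the cases described above):

  from types import SimpleNamespace

  ENCAPS = 20   # Ivip IPv4 IP-in-IP encapsulation overhead

  def handle_rpd2_outcome(state, probe_len, outcome, ptb_mtu=None):
      """Adjust the per-ETR variables after an RPD2 probe whose Packet B
      was probe_len bytes long (the encapsulated length).  Returns the MTU
      value to put in a PTB to the SH, or None if no PTB is needed."""
      if outcome == "ptb_from_tunnel":
          # A router in the tunnel reported its limit; take the lower of
          # the probe length and the MTU in the PTB, as described above.
          state.upme = min(state.upme, probe_len, ptb_mtu)
          return state.upme - ENCAPS
      if outcome == "a_only_twice":
          # Two attempts where Packet A arrived but Packet B did not:
          # treat the probe length as beyond the Real PMTU.
          state.upme = min(state.upme, probe_len)
          return state.upme - ENCAPS
      if outcome == "a_and_b":
          # Probe delivered; the traffic packet reached the ETR and the DH.
          state.lpme = max(state.lpme, probe_len)
          return None
      if outcome == "b_only_twice":
          # Packet B arrived but no Packet A: for Ivip, most likely a
          # spoofed source address.  Drop the packet, but the Packet B
          # arrival still raises LPME.
          state.lpme = max(state.lpme, probe_len)
          return None
      return None   # "no_response": the ETR appears unreachable

  state = SimpleNamespace(lpme=1200, upme=9000)
  print(handle_rpd2_outcome(state, 7520, "ptb_from_tunnel", ptb_mtu=7400))  # 7380
  print(state.lpme, state.upme)                                            # 1200 7400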
Example
with multiple longer packets
Here
is a sequence of packet lengths, the first set of which appear in
this order while the initial and subsequent RPD2 processes are in
progress.
This example shows a flurry of traffic packets,
perhaps from the same SH and perhaps not, which all go to micronets
which are currently mapped to one ETR, for which the ITR initially has
no PMTU information.
If a single SH with a single application
was the only host sending packets through the ITR to this ETR, the
sequence of packet lengths would be simpler. Assuming the SH made
the most of its (assumed to be 9000 byte MTU) path to the ETR, it would
start with some high value, such as a 9000 byte packet, which the ITR
would reject with a PTB (MTU = 8980) because once encapsulated, the
packet would be 9020, and therefore too big for its outgoing interface.
Then the SH would probably send an 8980 byte long packet and the
ITR would try to send that to the ETR. If there was one or more
PMTU limits in the path, usually a PTB would come back indicating the
first limit below 9000. The ITR would adjust down its UPME
variable accordingly and the SH would get a PTB (MTU = UPME - ENCAPS),
and so would try again with a packet which, once encapsulated, should
pass this first router which has a PMTU limit. Perhaps that is
the only router, in which case, the packet would arrive, the ITR would
set its LPME variable to this length, the Zone of Uncertainty would
cease to exist for this ETR, and all subsequent packets (for 10 minutes
at least) from this SH would be the new lower value, and therefore be
sent with ordinary IP-in-IP encapsulation.
So the
following example is inordinately complex and unrepresentative of a
typical initial PMTUD process for any one ETR. (To-do: make
a simpler example.)
The values of the variables are in the left
two columns, and don't change until a report comes back from the ETR:
LPME   UPME   Traffic  RPD2     Action or event
              packet   packet
              length   length

1200   9000   7000     7020     Send with RPD2 #1.

1200   9000   6000     6020     Send with RPD2 #2.

1200   9000   6500     6520     Send with RPD2 #3.

1200   9000   7500     7520     Send with RPD2 #4.

1200   9000   7000     7020     Send with RPD2 #5 - because the
                                ITR hasn't yet got a report from the
                                ETR.  Maybe the #1 process failed due
                                to packet loss, in which case this #5
                                will be a second attempt at testing
                                the path with a 7020 byte packet.

1200   9000   8000     8020     Send with RPD2 #6.
Now some
reports from the ETR come in. In reality, responses and traffic
packets would probably be arriving at the same time, but for
simplicity, I am showing the replies arriving after this initial flurry
of traffic packets.
LPME   UPME   Traffic  RPD2     Action or event
              packet   packet
              length   length

1200   9000                     #1 reply (7020) A & B received.
                                Write 7020 to LPME.
                                If this reply had been received
                                before the ITR handled the 6000 and
                                6500 length packets described above,
                                both of those would have been sent
                                with ordinary IP-in-IP encapsulation.
                                Now the range of packet sizes for
                                ordinary IP-in-IP encapsulation has
                                been dramatically expanded from 1180
                                to 7000 - and the Zone of Uncertainty
                                greatly decreased.

7020   9000                     #2 reply (6020) A & B received.
                                Nothing to do, since LPME is already
                                above this value.

7020   9000                     #3 reply (6520) A & B received.
                                Nothing to do, since LPME is already
                                above this value.

7020   9000   6800     6820     Send with ordinary IP-in-IP.

7020   9000   8800     8820     Send with RPD2 #7.

7020   9000                     #4 reply (7520) PTB received from
                                router in tunnel.  MTU = 7400.
                                Write this value to UPME.

7020   7400                     #6 reply (8020) PTB received from
                                router in tunnel.  MTU = 7400.
                                Nothing to do since UPME is
                                already set to 7400.
Now
the span of the Uncertain Zone has been reduced dramatically, from 1200
<--> 9000 to 7020 <--> 7400.
Traffic packets of 7000 or
less bytes will now routinely be sent with IP-in-IP encapsulation with
very high confidence they won't be
clobbered by some PMTU
restriction.
Packets longer than 7380 bytes will be dropped and
a PTB sent to the SH with an MTU of (UPME - ENCAPS) = 7380.
Packets
of length 7001 to 7380 inclusive will be sent with RPD2 and so will
help the ITR further adjust these two variables, further narrowing the
Zone of Uncertainty.
Unless there is an unfortunate and unlikely
series of lost packets, all packets sent with RPD2 will result in one
of these outcomes:
1 - The packet is delivered.
The ITR will increase LPME.
(Uncertain Zone diminished.)
2 - The packet is not
delivered.
The ITR may learn
something about the PMTU to this ETR, in which case the UPME will be
reduced. (Uncertain Zone diminished.)
There will be no
application data loss - the SH will get a PTB message and so will
resend the data in smaller packets.
The MTU value sent in
successive PTBs to any SH from this ITR will always decrease, as its
UPME is decreased. These decreases only happen due to either a
PTB (which is highly reliable) or one or two instances where the ETR
received one or more copies of Packet A, but did not receive a Packet
B. This is pretty good evidence there is a PMTU problem in the
path to this ETR.
In Ivip, there won't be ordinarily
recognisable PTBs sent to the ITR for any packets sent with ordinary
IP-in-IP encapsulation, from any routers in the tunneled part of the
path, since Ivip's IP-in-IP encapsulation uses the SH's address as the
outer header's source address.  Those PTBs would be addressed to the SH, but a
properly implemented SH won't recognise them. Other map-encap schemes
use the ITR's address in the outer header, but this is not much help
with PTBs, since it is not practical for the ITR to keep enough
state about all the packets it tunnels to reliably
distinguish a genuine PTB from a spoofed one.
IPTM
requires the ITR to keep state about those few packets sent
with RPD2 - and this number rapidly diminishes as the ITR closes
the
Zone of Uncertainty for each ETR.
As long as the Zone
of Uncertainty is greater than 1, and as long as packets arrive which
once encapsulated would have a length matching that zone, then there
will still be RPD2 packets being sent. But this is a
self-limiting process. As long as RPD2 packets are being sent,
the ITR is reliably probing the real PMTU at that time and so adjusting
its UPME and/or its LPME variables towards each other.
According
to the above algorithm, before long, the Zone of Uncertainty would be
reduced or disappear, so that the ITR would never send any packets
by RPD2. This means the ITR would not discover any change
in the Real PMTU: It would not confirm or test the validity of
any mistakenly low value of UPME (as would be the case if the Real PMTU
has increased) or a mistakenly high value of LPME and/or UPME (as would
be the case if the Real PMTU has decreased).
Discovering
changes in Real PMTU
As long as the
Real PMTU lies in range of LPME to UPME inclusive, the system is
working fine. Ideally, the Zone of Uncertainty (UPME - LPME) will
be small, or zero, which means that the ITR will be sending all the
traffic packets which would fit into the tunnel (once encapsulated) by
the highly efficient ordinary encapsulation (IP-in-IP for Ivip), and
that this would be making the most of the Real PMTU of the tunnel.
If
the Real PMTU has dropped below the current value of LPME,
packets sent via ordinary IP-in-IP would be dropped, without the ITR or
the SH being aware of it. This would lead to packet loss, without
the SH getting a PTB message. This would be a serious failing, so
the system needs to try to avoid this as much as possible.
If
the Real PMTU rises above the current value of UPME, there is no loss
of packets, but it would be best to discover this sooner rather than
later so larger packets could be sent via ordinary encapsulation.
Detecting
an increase in the Real PMTU
If the
Real PMTU rises above the current values of LPME and UPME, then no
packets will be dropped due to PMTU problems. The application
will continue to work with the efficiency it already has, but it would
be missing out on some potential efficiency gain by sending longer
packets. For an RFC 1191 host (I think all hosts, for all
practical purposes in the foreseeable future) the SH will not try a
longer packet length for 10 minutes after it received a PTB message
which set its current limit. There's nothing the ITR can do to
improve this.
What happens when the Real PMTU is higher than the
current value of LPME and UPME, and some other SH tries sending a
packet longer than both LPME and UPME, but which would in fact fit
(after encapsulation) in the Real PMTU, which the ITR is currently
unaware of? As described above, the ITR would always, or at least
usually, drop any such traffic packet and send the SH a PTB with a
value (UPME - ENCAPS). But this would be a bad outcome, since
these packets could be coming from either a new SH, or from the
original SH which has been sending packets for 10 minutes and is now
testing to see whether it can send longer ones.
As defined
above, the ITR wouldn't let any of these packets out to the tunnel,
because it "knows" they are too long . . . but does it really know what
is happening in the tunnel right now? Not necessarily.
Reasons
for not sending these long packets, and for sending the SH a PTB,
include:
- It looks like they
won't get to the ETR, so to save the ITR's outgoing bandwidth and
the burden on at least some routers in the tunnel, the packet is not
sent via ordinary encapsulation.
- Likewise, the packet is not
sent with RPD2 encapsulation, to save ITR resources and so as not to
burden the ETR control plane with Packet As and the requirement to
send a report to the ITR, which the ITR must acknowledge.
If the ITR established just a few seconds
ago that a 7050 byte packet in the tunnel results in a PTB, or if it
repeatedly finds it can get 7020 byte packets to the ETR, but not 7021
byte packets, then why should it try again by sending a traffic packet
via RPD2 when the resulting packet in the tunnel would be longer
than 7020 bytes? There is no good reason to do this.
However,
if it was a few minutes (1, 5, 10?) since the last RPD2 attempt to send
a packet longer than 7020 bytes into the tunnel, maybe it would be a
good idea to send this traffic packet as an RPD2 probe, to see if the
Real PMTU has changed.
There is more work to do on this IPTM
proposal, of course. I think that it would be good to have some
algorithm, including with parameters set by the operator of the ITR, to
allow some adventurously long traffic packets (those which would exceed
UPME once encapsulated) into the tunnel, as RPD2 probes.
Initially,
I thought there was a role for the ITR occasionally sending Synthetic
Probe packets, longer than UPME, with RPD2 to periodically test whether
the Real PMTU has risen since the last tests set the value of UPME.
But why should an ITR send these bulky things into the tunnel
just in case the Real PMTU has changed? There probably isn't a
good reason. (But see
psg.com/lists/rrg/2008/msg01208.html
). It only needs to know if a SH is able to send, and
so is actually sending, traffic packets which would be longer than UPME
once encapsulated.
Perhaps there
is no role for Synthetic Probes at all . . . except as discussed
in
psg.com/lists/rrg/2008/msg01208.html
(2008-04-22).
The ITR
should use some longer traffic packets as RPD2 probes, but with some
algorithm to ensure this is not done "too often".
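One possible shape for such a "not too often" rule is sketched below
(Python; the interval would be an operator-set parameter - the value
here is only an illustration, not a recommendation):

  import time

  UPWARD_PROBE_INTERVAL = 300.0   # seconds between attempts above UPME (illustrative only)

  class UpwardProbeGate:
      """Decide whether a traffic packet which, once encapsulated, would
      exceed UPME may be sent as an RPD2 probe (to see whether the Real
      PMTU has risen) instead of being dropped with a PTB to the SH."""

      def __init__(self, interval=UPWARD_PROBE_INTERVAL):
          self.interval = interval
          self.last_attempt = float("-inf")

      def allow(self, now=None):
          now = time.monotonic() if now is None else now
          if now - self.last_attempt >= self.interval:
              self.last_attempt = now
              return True    # send this one with RPD2, above UPME
          return False       # drop it and send the SH a PTB (UPME - ENCAPS)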
Detecting
a decrease in the Real PMTU
This is
what the ITR needs to detect as rapidly as possible.
To do this
with all ordinarily encapsulated packets, some brute force approaches
include:
- Have the ETR acknowledge every encapsulated packet
it gets from this ITR, or at least those longer than some specified
length.
- With the outer header's source address being that of
the ITR (LISP, APT and TRRP - not Ivip) cache sufficient details of all
encapsulated packets (or those beyond a certain length) to be able to
reliably detect PTBs which arrive from routers in the tunnel to the
ETR, if one of these packets is too big. (As previously noted,
this is extremely onerous - probably completely impractical.)
Lighter-weight
approaches include expecting the ETR to acknowledge every 10th packet
above a certain length, every 100th, or at least one packet above a
certain length every minute. These approaches look promising for
non-Ivip systems (LISP, APT, TRRP, maybe SEAL) - but for an Ivip ETR,
ordinarily encapsulated traffic packets have no sign in them of which
ITR they were encapsulated by.
For this discussion, let's say
there is a continual stream of packets to some ETR, which are longer
than MPMTU (1200 bytes once encapsulated) and which are (as they should
be) shorter than the current value of LPME. Ivip ITRs will never
get a PTB if one or more of those packets is too long for a router in
the tunnel, and non-Ivip ITRs can't cache enough information about all
such packets.
For non-Ivip ITRs, it may be sufficient to watch
out for PTBs in general, and if some come in, then to start caching
sufficient information for at least some of the packets going to that
ETR (the PTBs will have the ETR's address as the destination address of
the initial bytes of the offending packet) that a secure test can be
run on one or a few of those PTBs, to make sure they are not from an
attacker. Then, the ITR could decide that the Real PMTU has
dropped below LPME. Probably the best response would be to
re-initialise LPME and UPME as if the ITR knew nothing about the path
to this ETR, and let the usual process of traffic packets in RPD2
probes adjust these variables to values which reflect the current Real
PMTU.
For Ivip ITRs, one approach would be to send a small
proportion of traffic packets - ideally close to, or at the length
limit imposed by (LPME - ENCAPS) - to each ETR with RPD2 encapsulation.
That will detect any downwards change in the Real PMTU, by a PTB
arriving and/or the ETR reporting that the Packet B did not arrive.
Exactly
how often to do this is a difficult question, which would depend a lot
on the circumstances. In practice, the PMTU to most ETRs might
remain stable from one year to the next, so it would be undesirable to
pepper the ETR with repeated RPD2 packets (which tie up the ETR's CPU -
while ordinary IP-in-IP packets are handled by its fast data-path, FIB
or whatever) in a forlorn search for changes in the Real PMTU.
Maybe some algorithm to send such a packet as an RPD2 probe every
10 minutes or so would be an acceptable trade-off between burdening the
ITR and ETR with RPD2 chores, and detecting a drop in Real PMTU to this
ETR.
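A sketch of how an Ivip ITR might gate these occasional RPD2 "canary"
probes per ETR (Python, hypothetical names; the 10 minute figure and
the "close to the limit" margin are only illustrative):

  import time
  from types import SimpleNamespace

  ENCAPS = 20
  CANARY_INTERVAL = 600.0    # roughly every 10 minutes per ETR, as suggested above
  NEAR_LIMIT_MARGIN = 64     # "close to the limit" margin - an arbitrary choice

  def should_probe_for_decrease(state, traffic_len, now=None):
      """Return True if this ordinarily-encapsulable traffic packet, which is
      close to the (LPME - ENCAPS) limit, should be sent with RPD2 instead,
      so that a drop in the Real PMTU below LPME would be noticed."""
      now = time.monotonic() if now is None else now
      near_limit = traffic_len >= state.lpme - ENCAPS - NEAR_LIMIT_MARGIN
      if near_limit and now - state.last_canary >= CANARY_INTERVAL:
          state.last_canary = now
          return True
      return False

  state = SimpleNamespace(lpme=7020, last_canary=float("-inf"))
  print(should_probe_for_decrease(state, 6990))   # True - first canary probe
  print(should_probe_for_decrease(state, 6990))   # False - not due again yet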
For Ivip and non-Ivip ITRs, it seems that occasional
sending of traffic packets by some more expensive means (RPD2 for Ivip)
would be sufficient to catch decreases in Real PMTU in a reasonable
time. The idea of non-Ivip ITRs taking special trouble to do this
only if they get PTBs is promising, but an attacker can easily generate
packets which look enough like PTBs to trigger this activity, so such
an approach opens a door to DoS attacks.
Various low-key
matters
The following sections
consider some issues related to the PMTUD problem, but which are not
essential to understanding the current IPTM - RPD2 proposal.
An
alternative to the RPD2 approach of splitting the traffic packet
RPD2 uses an odd-looking arrangement
of sending most of the traffic packet in Packet B and the rest in
three or so identical, small, Packet As.
Here is an attempt at
an alternative "Protocol X" approach, which involves the Packet B
function being performed by a packet which contains the entire
original traffic packet. This does not look like a good idea to
me at present, but perhaps it will be of interest.
In Protocol
X, the sole purpose of the one or more Packet A received by the ETR
would be to ensure the ETR gets instructions to tell the ITR
whether or not Packet B arrives properly.
First I will consider
a non-Ivip setting.
Packet B should have the same length as the
traffic packet would have with ordinary IP-in-IP (or whatever else)
encapsulation. The most obvious way of doing that is to use
that ordinary encapsulation. For IP-in-IP, the only way is to use
IP-in-IP. However, all non-Ivip map-encap schemes use a more
elaborate encapsulation scheme. Here I will discuss LISP.
LISP sends an IP header (source = ITR's address, destination =
ETR's address) followed by a UDP packet (destination port = some
particular port for all ETRs) followed by a special LISP header.
The LISP header has variable length, which the ETR figures out
from its initial bytes. Following the LISP header is the entire
traffic packet.
If every LISP header which precedes an
encapsulated traffic packet (or every traffic packet longer than ~1200
bytes) had a nonce which was unique to that traffic packet, then the
ETR would have no trouble receiving a Packet A, with the same nonce,
and finding which incoming encapsulated packet matches it. Then,
there would be no trouble generating the report back to the ITR.
However, there are two arguments against this:
- The
encapsulated packet which matches a particular Packet A arrives at the
ETR in the same way as every other encapsulated traffic packet.
In a high capacity ETR (such as a "big-iron" router from Cisco,
Juniper etc. - not an ETR implemented in software in a server) those
encapsulated packets are not going to be seen by the ETR's CPU, which
gets the Packet As and has to report to the ITR. How is the
fast data-path, FIB etc. going to trawl through all the incoming
encapsulated traffic packets looking for those with particular nonces?
That would be expensive. The RPD2 approach gets around
this, since the Packet B is addressed to a UDP port which can lead
straight to CPU involvement, while the main body of encapsulated
traffic packets (IP-in-IP) can be handled by the fast data-path.
- Having
a 32 bit nonce in each encapsulated traffic packet is a waste of space,
unless it is already needed for some aspect of LISP - which it may well
be. (Still, it remains a waste if Ivip can do it without any such
extra baggage in encapsulated traffic packets.)
So a nonce in
every traffic packet, or at least in every traffic packet which, when
encapsulated, is longer than ~1200 bytes, is a good way to do the
Packet B function of Protocol X for non-Ivip map-encap schemes.  Nonce
creation at the ITR
involves some cost, so it would be acceptable just to have the space
for a nonce, and to include one only when the packet is acting as a
Protocol X probe.
Side-note on nonce security:
A 32 bit nonce is a very powerful tool for
simply securing all sorts of queries and responses. However, if
an attacker knew the algorithm which generated the stream of nonces, he
or she could probably predict the nonces emitted by a particular ITR by
sending some traffic packets to the ITR, with a destination address
such that each would be encapsulated and sent to the attacker's own
"ETR". Then, the attacker might be able to make a good guess of
the nonce in packets being sent to an ETR by this ITR, which would
drastically reduce the security of the nonce protection against DoS
attacks. A physical noise source generating genuinely random
numbers would fix this problem decisively. Any PRNG
(Pseudo-Random Number Generator) approach might be vulnerable to
compromise like this.
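In software terms, drawing nonces from a cryptographically strong
random source (rather than a predictable PRNG) would address the same
concern; a minimal Python illustration of the difference:

  import random
  import secrets

  # A plain PRNG with a guessable seed: anyone who learns the seed, or
  # observes enough outputs of a weak generator, may predict later nonces.
  weak = random.Random(12345)
  predictable_nonce = weak.getrandbits(32)

  # Nonces drawn from the operating system's cryptographically strong
  # source are not predictable from previously observed values.
  strong_nonce = secrets.randbits(32)

  print(hex(predictable_nonce), hex(strong_nonce))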
A lightweight approach to this might be
to have a 32 bit field, which if zero, indicates it is a non-probe
encapsulated traffic packet. If the field is non-zero, the ETR's
fast data-path identifies this and informs the CPU of the value, and
the fact that the packet arrived correctly. This is a
"self-reporting" probe approach, where all packets are ordinarily
encapsulated, and a few of them are also probes.  If the ITR
sends a number of these and gets no report of their arrival, it can
reasonably conclude that either the ETR is unreachable, or that the
packets were too long for some PMTU limit en-route. The ITR
may already know the ETR is reachable by exchanging shorter packets
with it.
Still, I think that something like "Packet A" to tell
the ETR to make a report is a pretty good idea.
As long as the
goal is to have no additional headers in each normally encapsulated
traffic packet, then for Ivip there is no way of sending the traffic
packet, in a packet of the same length, with an added nonce or any
other distinctive marker by which the ETR could recognise this Packet B,
as instructed by the one or more Packet As it receives. This is why
RPD2 uses the odd-looking splitting of the traffic packet.
One
potential approach might be to have the ETR generate a 32 bit hash of
every encapsulated traffic packet it receives. That could be expensive
in a software-only router, but perhaps it would not be excessively
expensive on suitably programmed forwarding hardware. The ETR's
CPU could be given a list of received traffic packets, with their
lengths, and could fish through the list looking for those which
matched the hash and length in the Packet A. With Ivip, the
ordinarily encapsulated packet does not include the ITR's address, but
non-Ivip map-encap schemes do include this, enabling the ETR to only do
this for packets with a particular ITR's source address.
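A rough Python sketch of this hash-matching idea follows. zlib.crc32
stands in for whatever cheap 32 bit hash suitably programmed forwarding
hardware might compute; the history depth and all the names are
illustrative assumptions.

  # Hedged sketch only: ETR keeps (hash, length, outer source) for recent
  # encapsulated traffic packets so its CPU can answer a Packet A which
  # quotes the probe's hash and length.
  import zlib
  from collections import deque

  class HashHistory:
      def __init__(self, depth=1024):
          self.history = deque(maxlen=depth)

      def on_encapsulated_packet(self, raw_bytes, outer_src):
          self.history.append((zlib.crc32(raw_bytes) & 0xFFFFFFFF,
                               len(raw_bytes), outer_src))

      def on_packet_a(self, want_hash, want_len, itr_addr=None):
          for h, length, src in self.history:
              # Non-Ivip schemes can narrow the search to packets whose outer
              # source address is the ITR's; with Ivip that field carries the
              # sending host's address, so itr_addr would be None here.
              if itr_addr is not None and src != itr_addr:
                  continue
              if h == want_hash and length == want_len:
                  return "arrived"
          return "not-seen"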
Would
it be good enough to do something even lighter weight, and perhaps not
so secure?
What if the Packet A told the ETR to report, on all
conventionally encapsulated traffic packets which arrived from a given
ITR in the next 0.5 seconds, or which arrived with a particular length
in such a time-frame? (With Ivip, there is no way of doing this,
since these packets have the sending host's address in their outer
header's source address. So perhaps Packet A specifies a sending
host address instead, which would do the same job of greatly narrowing
the search for the particular incoming encapsulated traffic packet.)
Packet A would contain a nonce which would secure the ETR's reply
against such replies being spoofed by an attacker. If the ETR
reported packets coming from a particular ITR (actually, just packets
arriving with the ITR's address in their outer source address) and one
or more packets were received at the requisite length, that would be
pretty good proof the packets were getting through.
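A rough Python sketch of the ETR side of this lighter-weight idea is
below. The 0.5 second window is taken from the text; everything else
(the names and the report layout) is an illustrative assumption.

  # Hedged sketch only: ETR reports, for a short window, on encapsulated
  # traffic packets whose outer source address (and optionally length)
  # matches what a Packet A asked about.
  import time

  class WindowedReporter:
      WINDOW = 0.5    # seconds, as suggested in the text

      def __init__(self):
          self.watch = None    # (src_addr, length_or_None, nonce, expiry)

      def on_packet_a(self, src_addr, length, nonce):
          # "Report any matching packets seen in the next 0.5 s; echo this
          # nonce so the ITR knows the reply is genuine."
          self.watch = (src_addr, length, nonce, time.time() + self.WINDOW)

      def on_encapsulated_packet(self, outer_src, length):
          if self.watch is None:
              return None
          src_addr, want_len, nonce, expiry = self.watch
          if time.time() > expiry:
              self.watch = None
              return None
          if outer_src == src_addr and (want_len is None or length == want_len):
              return {"nonce": nonce, "matched_length": length}   # report to ITR
          return None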
But a DoS
attack is still possible: an attacker could fire spoofed traffic packets -
with the same ITR's address in their outer source address - at the ETR,
on the basis that one of them will match the length of a packet the ITR
really sent, thereby tricking the ITR into deciding it can send longer
packets to this ETR than it actually can.
Also, the attacker could be generating the traffic packets the
ITR is unwittingly using as probes . . .
This approach could
be hardened against attack by requiring the ETR to report back not just
that such packets were received, but to include a CRC of the entire
packet, outer header and all. This would be pretty robust.
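A rough sketch of the CRC-hardened report, in Python, is below. The
message layout is invented for illustration; the point is only that the
ITR believes a report when the nonce matches the one it sent in Packet A
and the CRC matches the exact packet, outer header included, that it
tunneled.

  # Hedged sketch only: CRC-hardened reporting.
  import zlib

  def etr_report(nonce_from_packet_a, received_packet_bytes):
      # The nonce protects the report against spoofed replies; the CRC is
      # evidence the ETR saw this exact packet, outer header and all.
      return {"nonce": nonce_from_packet_a,
              "length": len(received_packet_bytes),
              "crc": zlib.crc32(received_packet_bytes) & 0xFFFFFFFF}

  def itr_accepts(report, expected_nonce, sent_packet_bytes):
      return (report["nonce"] == expected_nonce
              and report["length"] == len(sent_packet_bytes)
              and report["crc"] == (zlib.crc32(sent_packet_bytes) & 0xFFFFFFFF))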
The attacker could be generating the traffic packets and firing
spoofed packets at the ETR which pretend to be the encapsulated traffic
packets from the ITR. But if the attacker can predict the
contents of the encapsulating header, then he/she could construct an
identical spoofed encapsulated packet which would generate the CRC the
ITR is expecting. The simpler the encapsulation format, the
easier this would be to do. The attacker may be able to predict
things about the ITR's encapsulating headers by running its own "ETR"
and having the ITR encapsulate his/her traffic packets and tunnel them
to this "ETR".
There may well be some lighter-weight alternatives
to the RPD2 approach outlined above, but without a nonce in the actual
probe packet, it could be quite difficult to do this well, in terms of
ETR resources and in terms of robustness against DoS attack.
The Sending Host's perspective
The
SH's experience of all this depends on a number of factors.
As
long as the mapping stays the same, packets sent by a SH to a
particular DH will always go via one ETR. (This is true only of Ivip;
LISP, APT and TRRP do explicit load sharing in the ITR, so in principle
packets could be sent to multiple ETRs, each via a tunnel with a
different PMTU. However, I think these other map-encap
schemes try to keep the packets from any one SH going to the one
ETR.)
As long as the ITR
uses
the one interface for reaching that ETR (it could change, if the ITR is
a router and the routing situation changes so packets are sent via
another interface) and as long as the routing path to that ETR doesn't
change (it could change at any time, perhaps to a path through routers
which results in the tunnel's Real PMTU being greater or less
than it was a moment ago) and provided there is no flaky routing
(sending packets randomly via different paths with different Real
PMTUs) then the SH will have a pretty easy time with all this.
When
the SH first sends packets to this DH, the SH presumably has no cached
MTU
limit associated with this DH. Either the ITR has no 3 Variables
(described
below) for the ETR these initial packets are to be tunneled to, or it
has been sending longer-than-MPMTU packets to this ETR in the recent
past, in which case the ITR probably has UPME and LPME set to values
which are close to the current Real PMTU.
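As a rough illustration (in Python) of the per-ETR state being referred
to here: only UPME and LPME are shown, assuming they are upper and lower
estimates bracketing the Real PMTU to that ETR; the full 3 Variables are
described below, and the field names and values here are mine, not part
of the proposal.

  # Hedged sketch only: per-ETR PMTU state at an ITR.
  from dataclasses import dataclass

  MPMTU = 1200    # globally agreed minimum PMTU, e.g. ~1200 bytes

  @dataclass
  class EtrPmtuState:
      upme: int            # upper estimate of the Real PMTU to this ETR
      lpme: int = MPMTU    # lower estimate: lengths known to get through

      def send_with_confidence(self, encapsulated_length):
          # Packets no longer than LPME can be tunneled with confidence;
          # lengths between LPME and UPME are candidates for probing (RPD2).
          return encapsulated_length <= self.lpme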
Generally,
the SH's
experience
will be of a stable Real PMTU to its DH. However, the Real PMTU
will
change, and probably the ITR's conception of the PMTU (in the 3
Variables for each ETR) to whatever ETR handles this DH, will change in
circumstances including the following:
- The mapping of this
DH's micronet changes to another ETR, and that ETR has a
different Real PMTU from this ETR, and/or the ITR's 3 Variables
for that ETR differ from those of the original ETR. Unless the SH
does
its own ITR functions (Ivip's ITFH) - and this will not typically be
the case - the SH has no idea about changed mapping because it has no
knowledge of the map-encap system. So the Real PMTU will simply
change, without notice, just as it would now if the routing system sent
packets via a path with a different overall PMTU. Mapping changes
in
Ivip typically occur due to multihoming service restoration and
mobility, but also due to slower changes such as the destination
network using a different ISP (portability).
- If an
end-user uses frequent Ivip mapping changes to dynamically move traffic
from one ETR to another, in order to balance traffic over multiple
links, and if the tunnels to these ETRs from the various ITRs
handling this traffic have different PMTUs, then this end-user will
create considerable
havoc for the ITRs and the SHs regarding PMTUD. Ideally, the
end-user would be able to use (for a fee) some commercial, global,
widely distributed, monitoring system to look at the PMTUs to its ETR
from various parts of the Net, in order to understand whatever PMTU
impacts their frequent changes of mapping might create.
- The
SH's outgoing
packets go to a different ITR. This new ITR might have a
different
Real PMTU to the ETR. This could happen if a nearby ITR initially,
for some reason, lets the packets go to some other ITR, perhaps an
OITRD (Open ITR in the DFZ), and later handles the packets itself. (A
similar situation occurs with APT, where initial packets are
encapsulated by the Default Mapper and the rest by a local caching ITR.)
- The
ITR and ETR remain stable, but there are either routing changes between
them (quite likely) in a way which affects Real PMTU, or the one stable
routing path has a changed Real PMTU (unlikely). The SH won't be
able
to detect this, and it is important that the ITR adapt to it reasonably
quickly. If the ITR is a real router, it may be able to
detect some
changed path, and so know that this would be a good time to retest the
PMTU to this ETR. However, this can't be assured for any router,
and
some or many ITRs may be servers, which only participate in the local
or BGP routing system sufficiently to attract traffic packets which are
destined for mapped addresses (Ivip micronets, LISP/APT EID
prefixes
etc.). When the Real PMTU increases, there is not much trouble,
since
the existing communication sessions continue with the current packet
lengths. All that is lost is some efficiency. When the Real
PMTU
drops, there could be more serious trouble. See the discussion above
about how the ITR might best discover changes to the Real PMTU to each
ETR it is currently tunneling packets to.