From 6d9b41fba5326dddbe87e5f58ef0eec50bc54371 Mon Sep 17 00:00:00 2001 From: Randy Bush Date: Sun, 7 Jul 2019 10:27:54 -0700 Subject: [PATCH] full pass --- draft-ietf-lsvr-l3dl.xml | 332 ++++++++++++++++++++------------------- 1 file changed, 174 insertions(+), 158 deletions(-) diff --git a/draft-ietf-lsvr-l3dl.xml b/draft-ietf-lsvr-l3dl.xml index 7356b47..973c8d2 100644 --- a/draft-ietf-lsvr-l3dl.xml +++ b/draft-ietf-lsvr-l3dl.xml @@ -11,14 +11,14 @@ - + Layer 3 Discovery and Liveness - Arrcus & IIJ + Arrcus & Internet Initiative Japan
5147 Crystal Springs @@ -60,9 +60,9 @@ protocols are used to build topology and reachability databases. These protocols need to discover IP Layer 3 attributes of links, such as logical link IP encapsulation abilities, IP neighbor address - discovery, and link liveness. The Layer 3 Discovery and Liveness - protocol specified in this document collects these data, which are - then disseminated using BGP-SPF and similar protocols. + discovery, and link liveness. This Layer 3 Discovery and Liveness + protocol collects these data, which may then be disseminated using + BGP-SPF and similar protocols. @@ -83,10 +83,10 @@
The Massive Data Center (MDC) environment presents unusual - problems of scale, e.g. O(10,000) devices, while its homogeneity - presents opportunities for simple approaches. Approaches such as - Jupiter Rising use a central controller to - deal with scaling, while BGP-SPF use a + central controller to deal with scaling, while BGP-SPF provides massive scale-out without centralization using a tried and tested scalable distributed control plane, offering a scalable routing solution in Clos Layer 3 Discovery and Liveness (L3DL) provides brutally simple mechanisms for devices to - Discover unique identities of devices/ports/... on a logical - link, - Run Layer 2 keep-alive messages for session continuity, Discover each other's unique endpoint identification, - Discover mutually supported encapsulations, e.g. IP/MPLS, + Discover mutually supported layer 3 encapsulations, + e.g. IP/MPLS, Discover Layer 3 IP and/or MPLS addressing of interfaces of the encapsulations, - Enable layer 3 link liveness such as BFD, and finally Present these data, using a very restricted profile of a BGP-LS API, to BGP-SPF which computes the - topology and builds routing and forwarding tables. + topology and builds routing and forwarding tables, + Enable layer 3 link liveness such as BFD, and finally + Provide Layer 2 keep-alive messages for session continuity. This protocol may be more widely applicable to a range of routing @@ -133,7 +132,7 @@ external components using the BGP routing protocol. See . A hybrid protocol using BGP transport but - a Dijkstra SPF decision process. See . A hierarchic subset of a crossbar switch topology commonly used in data centers. @@ -141,7 +140,7 @@ frame. A full L3DL PDU may be packaged in multiple Datagrams. Address Family Indicator and Subsequent Address Family Indicator (AFI/SAFI). I.e. classes of - layer 2.5 and 3 addresses such as IPv4, IPv6, MPLS, ... + layer 2.5 and 3 addresses such as IPv4, IPv6, MPLS, etc. A Layer 2 packet. 
A logical connection between two logical ports on two devices. E.g. two VLANs between the same @@ -153,8 +152,8 @@ since they are used by all widely deployed Layer 2 network technologies of interest, especially Ethernet. See . - Massive Data Center, commonly thousands of - TORs. + Massive Data Center, commonly composed of + thousands of Top of Rack Switches (TORs). Maximum Transmission Unit, the size in octets of the largest packet that can be sent on a medium, see 1.3.3. @@ -201,7 +200,7 @@ in interfaces with thousands of disaggregated prefixes. Therefore the L3DL protocol is session oriented and uses - incremental announcement and widrawal with hot restart, a la BGP + incremental announcement and withdrawal with session restart, a la BGP ().
@@ -247,7 +246,7 @@ There are two protocols, the inter-device per-link layer 3 - discovery and the interface to the upper level BGP-like API: + discovery and the API to the upper level BGP-like routing protocol: Inter-device PDUs are used to exchange device and logical link @@ -272,21 +271,21 @@
Two devices discover each other and their respective identities - by sending multicast HELLO PDUs (). To allow + by sending multicast HELLO PDUs (). To assure discovery of new devices coming up on a multi-link topology, devices on such a topology send periodic HELLOs forever, see . Once a new device is recognized, both devices attempt to - negotiate and establish peering by sending unicast OPEN PDUs (). In an established peering, the Encapsulations - () configured on an end point may be - announced and modified. Note that these are only the encapsuation - and addresses on the announcing interface; though a device's - loopback interface(s) may also be announced. When two devices on a - link have compatible Encapsulations and addresses, i.e. the same - AFI/SAFI and the same subnet, the link is announced via the BGP-LS - API. + negotiate and establish a session by sending unicast OPEN PDUs + (). In an established session, the + Encapsulations () configured on an end point + may be announced and modified. Note that these are only the + encapsulation and addresses configured on the announcing interface; + though a device's loopback and overlay interface(s) may also be + announced. When two devices on a link have compatible + Encapsulations and addresses, i.e. the same AFI/SAFI and the same + subnet, the link is announced via the BGP-LS API.
@@ -302,7 +301,7 @@ PDUs are optional; though at least one encapsulation SHOULD be agreed at some point. - The following is a ladder-style sketch of the L3DL protocol + The following is a ladder-style diagram of the L3DL protocol exchanges:
@@ -380,8 +379,8 @@
L3DL PDUs are carried by a simple transport layer which allows - long PDUs to occupy many Ethernet frames. An L3DL frame is referred - to as a Datagram. + PDUs to occupy many Ethernet frames. An L3DL Ethernet frame is + referred to as a Datagram. The L3DL Transport Layer encapsulates each Datagram using a common transport header. @@ -402,7 +401,7 @@ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Datagram Length | Checksum ~ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ -~ | Payload... | +~ | Payload... ~ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
@@ -411,9 +410,9 @@ Seven-bit Version number of the protocol, - currently 0. Values other than 0 are treated as errors. The - protocol version nees to be in one and only one place, so it is in - the datagram as opposed to, for example, the PDU header. + currently 0. Values other than 0 MUST BE treated as an error. + The protocol version needs to be in one and only one place, so it + is in the datagram as opposed to, for example, the PDU header.
A bit that set to one if this Datagram is the last Datagram of the PDU. For a PDU which fits in only one @@ -436,6 +435,12 @@ thereof. + + To avoid the need for a receiver to reassemble two PDUs at the + same time, a sender MUST NOT send a subsequent PDU when a PDU is + already in flight and not yet acknowledged if it is an ACKed PDU + Type. +
@@ -528,7 +533,7 @@ uint32_t sbox_checksum_32(const uint8_t *b, const size_t n) +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Sig Type | Signature Length | ~ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + -~ Signature | +~ Signature ~ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ @@ -557,7 +562,7 @@ uint32_t sbox_checksum_32(const uint8_t *b, const size_t n) The length of the Signature, possibly including padding, in octets. If Sig Type is 0, - Signature Length must be 0. + Signature Length MUST BE 0. The result of running the signature algorithm specified in Sig Type over all octets of the PDU except @@ -636,10 +641,6 @@ uint32_t sbox_checksum_32(const uint8_t *b, const size_t n)
- WARNING: The second multicast address below is incorrect. We - need to get a new assignment. , which is what we really wanted with the second address - below. - The HELLO PDU is unique in that it is encapsulated in a multicast Ethernet frame. It solicits response(s) from other LLEI(s) on the link. See for why multicast is used. The @@ -649,13 +650,15 @@ uint32_t sbox_checksum_32(const uint8_t *b, const size_t n) Nearest Bridge = Propagation constrained to a single physical link; stopped by all types of - bridges (including MPRs (media converters)). + bridges (including MPRs (media converters)). This SHOULD BE used + when the link is known to be a simple point to point link. When a switch receives a frame with a multicast destination MAC it does not recognize, it forwards to all ports. This destination MAC is to be sent when the interface is known to be connected to a switch. See . + target="ieee"/>. This SHOULD BE used when the link may be a + multi-point link. @@ -664,11 +667,12 @@ uint32_t sbox_checksum_32(const uint8_t *b, const size_t n) exchange. When an interface is turned up on a device, it SHOULD issue a - HELLO. + HELLO if it is to participate in L3DL sessions. - If a constrained destination address configured, see above, then - the HELLO need not be repeated once a session has been created by an - exchange of OPENs. + If a constrained Nearest Bridge destination address is configured + for a point-to-point interface, see above, then the HELLO SHOULD NOT + be repeated once a session has been created by an exchange of + OPENs. If the configured destination address is one that is propagated by switches, the HELLO SHOULD be repeated at a configured interval, @@ -696,8 +700,8 @@ uint32_t sbox_checksum_32(const uint8_t *b, const size_t n) separate logical link. When a HELLO is received from a source MAC address with which - there is no established L3DL adjacency, the receiver SHOULD respond - with an OPEN PDU. 
The two devices establish an L3DL adjacency by + there is no established L3DL session, the receiver SHOULD respond + with an OPEN PDU. The two devices establish an L3DL session by exchanging OPEN PDUs. The Payload Length is zero as there is no payload. @@ -711,7 +715,7 @@ uint32_t sbox_checksum_32(const uint8_t *b, const size_t n) Each device has learned the other's MAC Address from the HELLO exchange, see . Therefore the OPEN and - subsequent PDUs are unicast, as opposed to the HELLO's multicast + subsequent PDUs MUST BE unicast, as opposed to the HELLO's multicast frame. My LLEI is the sender's LLEI, see . AttrCount is the number of attributes in the Attribute List. - Attributes are single octets whose semantics are user-defined. + Attributes are single octets the semantics of which are + operator-defined. - A node may have zero or more user-defined attributes, e.g. + A node may have zero or more operator-defined attributes, e.g.: spine, leaf, backbone, route reflector, arabica, ... Attribute syntax and semantics are local to an operator or @@ -767,19 +772,19 @@ q--> target="tlv"/>. Key Length is a 16-bit field denoting the length in octets of the - Key itself, not including the Auth Type or the Key Lengths. If - there is no Key, the Auth Type and key Length MUST both be zero. + Key itself, not including the Auth Type or the Key Length. If there + is no Key, the Auth Type and key Length MUST both be zero. The Key is specific to the operational environment. A failure to - authenticate is a failure to start the L3DL session, an ERROR PDU is - sent (Error Code 2), and HELLOs MUST be restarted. + authenticate is a failure to start the L3DL session, an ERROR PDU + MUST BE sent (Error Code 2), and HELLOs MUST be restarted. - The Serial Number is that of the last received and processed - Encapsulation PDU. 
This allows a receiver sending an OPEN to tell - the sender that the receiver wants to resume a session and the - sender only needs to send data more recent than the Serial Number. - If this OPEN is not trying to restart a lost session, the Serial - Number MUST be set to zero. + The Serial Number is that of the last received and processed PDU. + This allows a receiver sending an OPEN to tell the sender that the + receiver wants to resume a session and the sender only needs to send + data more recent than the Serial Number. If this OPEN is not trying + to restart a lost session, the Serial Number MUST BE set to + zero. The Signature fields are described in and in an asymmetric key environment serve as a proof of possession of the @@ -791,19 +796,29 @@ q--> keep the session semantics alive. The timing and acceptable drop of KEEPALIVE PDUs are discussed in . - If a sender of OPEN does not receive an ACK of the OPEN PDU Type, - then they MUST resend the same OPEN PDU, with the same Nonce. - Resending an unacknowledged OPEN PDU, like other ACKed PDUs, SHOULD - use exponential back-off, see . + If a sender of OPEN does not receive an ACK of the OPEN PDU, then + they MUST resend the same OPEN PDU, with the same Nonce. Resending + an unacknowledged OPEN PDU, like other ACKed PDUs, SHOULD use + exponential back-off, see . If a properly authenticated OPEN arrives with a new Nonce from an LLEI with which the receiving logical link endpoint believes it - already has an L3DL session (OPENs have already been exchanged), the - receiver MAY assume that the sending LLEI or entire device has been - reset. If the Serial Number in the OPEN is zero, then all - discovered encapsulation data SHOULD be withdrawn via the BGP-LS API - and the recipient MUST respond with a new OPEN. In this - circumstance encapsulations SHOULD NOT be kept. 
+ already has an L3DL session (OPENs have already been exchanged), and + the Serial Number in the OPEN is non-zero, the receiver SHOULD + establish a new session by sending an OPEN with the Serial Number of + the last data it received. Each party MUST resume sending + encapsulations etc. subsequent to the other party's Serial Number. + And each MUST retain all previously discovered encapsulation and + other data. + + If a properly authenticated OPEN arrives with a new Nonce from an + LLEI with which the receiving logical link endpoint believes it + already has an L3DL session (OPENs have already been exchanged), and + the Serial Number in the OPEN is zero, then the receiver MUST assume + that the sending LLEI or entire device has been reset. All + previously discovered encapsulation data MUST NOT be kept and MUST + be withdrawn via the BGP-LS API and the recipient MUST respond with + a new OPEN.
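Reviewer note on the hunk above: the two new paragraphs split the handling of an OPEN that arrives with a new Nonce on an existing session by whether its Serial Number is zero. The following is a minimal Python sketch of that decision, not text from the draft. The names Session and handle_reopen are hypothetical, as is the choice to reply with Serial Number zero after a reset.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """Hypothetical per-peer state; the field names are illustrative only."""
    nonce: int
    last_received_serial: int = 0
    learned_encapsulations: list = field(default_factory=list)


def handle_reopen(session, open_nonce, open_serial):
    """Decide how to react to an authenticated OPEN carrying a new Nonce.

    Returns a dict with the action taken, the Serial Number to place in the
    reply OPEN, and the encapsulations to withdraw via the BGP-LS API.
    """
    session.nonce = open_nonce
    if open_serial != 0:
        # Peer wants to resume: keep everything learned so far and tell the
        # peer the Serial Number of the last data we received.
        return {
            "action": "resume",
            "reply_serial": session.last_received_serial,
            "withdraw": [],
        }
    # Serial Number zero: the peer LLEI or device has been reset, so all
    # previously discovered encapsulation data is dropped and withdrawn.
    withdrawn = list(session.learned_encapsulations)
    session.learned_encapsulations.clear()
    session.last_received_serial = 0
    # Assumption: a fresh session starts from Serial Number zero on our side.
    return {"action": "reset", "reply_serial": 0, "withdraw": withdrawn}
```

The sketch covers only the new-Nonce case; a resent OPEN with an unchanged Nonce is a retransmission and is outside what it models.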
@@ -836,7 +851,7 @@ q--> PDU, etc.
The ACKed PDU is the PDU Type of the PDU being acknowledged, - e.g., OPEN or one of the Encapsulations. + e.g., OPEN, one of the Encapsulations, etc. If there was an error processing the received PDU, then the EType is non-zero. If the EType is zero, Error Code and Error Hint MUST @@ -848,12 +863,21 @@ q--> error. The decimal value of EType gives a strong hint how the receiver - sending the ACK believes things should proceed. The ETypes are - listed in . Someone stuck in the 1990s - might think of the error codes as 0x1zzz, 0x2zzz, etc. They might - be right. Or not. + sending the ACK believes things should proceed: + + + 0 - No Error, Error Code and Error Hint MUST be zero + 1 - Warning, something not too serious happened, continue + 2 - Session should not be continued, try to restart + 3 - Restart is hopeless, call the operator + 4-15 - Reserved + + - The Error Code indicates the type of error. + The Error Codes, noting protocol failures listed in this document, + are listed in . Someone stuck in the + 1990s might think of the catenation of EType and Error Code as an echo + of 0x1zzz, 0x2zzz, etc. They might be right; or not. The Error Hint is any additional data the sender of the error PDU thinks will help the recipient or the debugger with the particular @@ -873,8 +897,7 @@ q--> case of this ACK failure. If the link is broken at layer 2, retransmission MAY BE retried - when the link is restored.
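Reviewer note on the EType list added in this hunk: the five entries map directly onto a small lookup. The following Python sketch is illustrative only. The function names and returned strings are hypothetical, and the 0 to 15 range is inferred from the "4-15 - Reserved" entry rather than from a field width stated in this hunk.

```python
def etype_action(etype):
    """Map an ACK EType to the follow-up the list in the patch suggests."""
    if not 0 <= etype <= 15:
        # Assumption: EType values outside 0..15 cannot be encoded.
        raise ValueError("EType out of range: %d" % etype)
    if etype == 0:
        return "no-error"
    if etype == 1:
        return "warning-continue"
    if etype == 2:
        return "restart-session"
    if etype == 3:
        return "call-operator"
    return "reserved"


def ack_is_consistent(etype, error_code, error_hint):
    """With EType zero, Error Code and Error Hint MUST both be zero."""
    if etype == 0:
        return error_code == 0 and error_hint == 0
    return True
```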
@@ -887,11 +910,10 @@ q--> session is considered established, and the devices SHOULD exchange L3 interface encapsulations, L3 addresses, and L2.5 labels. - The Encapsulation types the peers exchange may be IPv4 - Announcement (), IPv6 Announcement (), MPLS IPv4 Announcement (), - MPLS IPv6 Announcement (), and/or possibly - others not defined here. + The Encapsulation types the peers exchange may be IPv4 (), IPv6 (), MPLS IPv4 (), MPLS IPv6 (), and/or + possibly others not defined here. The sender of an Encapsulation PDU MUST NOT assume that the peer is capable of the same Encapsulation Type. An ACK ( - The 24-bit Count is the number of Encapsulations in the - Encapsulation list. - An Encapsulation PDU describes zero or more addresses of the encapsulation type. + The 24-bit Count is the number of Encapsulations in the + Encapsulation list. + The Serial Number is a monotonically increasing 32-bit value representing the sender's state in time. It may be an integer, a timestamp, etc. On session restart (new OPEN), a receiver MAY @@ -950,7 +972,7 @@ q--> send newer data. If a sender has multiple links on the same interface, separate - state: data, ACKs, etc. must be kept for each peer. + state: data, ACKs, etc. must be kept for each peer session. Over time, multiple Encapsulation PDUs may be sent for an interface as configuration changes. @@ -988,9 +1010,10 @@ q--> - An Encapsulation PDU of Type T may announce new and/or withdraw - old encapsulations of Type T. It indicates this with the Ann/With - Encapsulation Flag, Announce == 1, Withdraw == 0. + Each encapsulation in an Encapsulation PDU of Type T may + announce new and/or withdraw old encapsulations of Type T. It + indicates this with the Ann/With Encapsulation Flag, Announce == + 1, Withdraw == 0. Each Encapsulation interface address in an Encapsulation PDU is either a new encapsulation be announced (Ann/With == 1) (yes, a la @@ -1006,20 +1029,18 @@ q--> be marked as primary for a particular encapsulation type. 
An Encapsulation interface address in an Encapsulation PDU MAY - be marked as a loopback, in which case the Loopback bit is - set. - - Loopback addresses are generally not seen directly on an - external interface. One or more loopback addresses MAY be exposed - by configuration on one or more L3DL speaking external interfaces, + be marked as a loopback, in which case the Loopback bit is set. + Loopback addresses are generally not seen directly on an external + interface. One or more loopback addresses MAY be exposed by + configuration on one or more L3DL speaking external interfaces, e.g. for iBGP peering. They SHOULD be marked as such, Loopback Flag == 1. Each Encapsulation interface address in an Encapsulation PDU is that of the direct 'underlay interface (Under/Over == 1), or an 'overlay' address (Under/Over == 0), likely that of a VM or - container guest bridged on to the interface with an underlay - address. + container guest bridged or configured on to the interface already + having an underlay address.
@@ -1053,7 +1074,8 @@ q--> - The 24-bit Count is the number of IPv4 Encapsulations. + The 24-bit Count is the number of IPv4 Encapsulations being + announced and/or withdrawn. @@ -1094,7 +1116,8 @@ q--> - The 24-bit Count is the number of IPv6 Encapsulations. + The 24-bit Count is the number of IPv6 Encapsulations being + announced and/or withdrawn. @@ -1160,7 +1183,8 @@ q--> - The 24-bit Count is the number of MPLSv4 Encapsulations. + The 24-bit Count is the number of MPLSv4 Encapsulations being + announced and/or withdrawn. @@ -1169,7 +1193,7 @@ q--> The MPLS IPv4 Encapsulation describes a logical link's ability to exchange labeled IPv4 packets on one or more subnets. It does so by stating the interface's addresses, the corresponding prefix - lengths, and the corresponding labels which will be accepted fpr + lengths, and the corresponding labels which will be accepted for each address. - The 24-bit Count is the number of MPLSv6 Encapsulations. + The 24-bit Count is the number of MPLSv6 Encapsulations being + announced and/or withdrawn. - The MPLS IPv6 Encapsulation describes a logical link's ability - to exchange labeled IPv6 packets on one or more subnets. It does - so by stating the interface's addresses, the corresponding prefix - lengths, and the corresponding labels which will be accepted fpr - each address. - @@ -1256,26 +1275,6 @@ q-->
- L3DL devices SHOULD beacon frequent Layer 2 KEEPALIVE PDUs to - ensure session continuity. A receiver may choose to ignore - KEEPALIVE PDUs. - - An operational deployment MUST BE configured whether to use - KEEPALIVEs or not, either globally, or down to per-link granularity. - Disagreement MAY result in repeated session break and - reestablishment. - - KEEPALIVEs SHOULD be beaconed at a configured frequency. One per - second is the default. Layer 3 liveness, such as BFD, may be more - (or less) aggressive. - - If a KEEPALIVE is not received from a peer with which a receiver - has an open session for a configurable time (default 30 seconds), - the link SHOULD BE presumed down. The devices MAY keep - configuration state and restore it without retransmission if no data - have changed. Otherwise, a new session SHOULD BE established and - new Encapsulation PDUs exchanged. - @@ -1292,6 +1291,31 @@ q--> + L3DL devices SHOULD beacon frequent Layer 2 KEEPALIVE PDUs to + ensure session continuity. A receiver may choose to ignore + KEEPALIVE PDUs. + + An operational deployment MUST BE configured whether to use + KEEPALIVEs or not, either globally, or down to per-link granularity. + Disagreement MAY result in repeated session break and + reestablishment. + + KEEPALIVEs SHOULD be beaconed at a configured frequency. One per + second is the default. Layer 3 liveness, such as BFD, may be more + (or less) aggressive. + + When a sender transmits a PDU which is not a KEEPALIVE, the + sender SHOULD reset the KEEPALIVE timer. I.e. sending any PDU acts + as a keepalive. Once the last fragment has been sent, the + KEEPALIVE timer SHOULD BE restarted. Do not wait for the ACK. + + If a KEEPALIVE or other PDUs have not been received from a peer + with which a receiver has an open session for a configurable time + (default 30 seconds), the link SHOULD BE presumed down. The devices + MAY keep configuration state and restore it without retransmission + if no data have changed. 
Otherwise, a new session SHOULD BE + established and new Encapsulation PDUs exchanged. +
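Reviewer note on the relocated KEEPALIVE text: the revised paragraphs say that sending any PDU resets the KEEPALIVE timer and that receiving any PDU counts toward liveness, with defaults of one KEEPALIVE per second and a 30 second timeout. The following Python sketch shows the two timers under those defaults. The class and method names are hypothetical, timestamps are passed in explicitly, and treating exactly 30 seconds of silence as "down" is an assumption.

```python
class LivenessTracker:
    """Hypothetical per-session timers for KEEPALIVE send and peer timeout."""

    def __init__(self, now, keepalive_interval=1.0, dead_after=30.0):
        self.keepalive_interval = keepalive_interval  # default: one per second
        self.dead_after = dead_after                  # default: 30 seconds
        self.last_sent = now
        self.last_received = now

    def pdu_sent(self, now):
        # Any PDU acts as a keepalive; call this once the last fragment of
        # the PDU has been sent, without waiting for the ACK.
        self.last_sent = now

    def pdu_received(self, now):
        # A KEEPALIVE or any other PDU from the peer counts as liveness.
        self.last_received = now

    def keepalive_due(self, now):
        return now - self.last_sent >= self.keepalive_interval

    def peer_presumed_down(self, now):
        return now - self.last_received >= self.dead_after
```

Passing the clock in keeps the sketch deterministic; a real implementation would presumably read a monotonic clock instead.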
@@ -1303,7 +1327,7 @@ q--> technique. This protocol assumes that one or more Encapsulation addresses - will be used to ping, run BFD, or whatever the operator + may be used to ping, run BFD, or whatever the operator configures.
@@ -1317,7 +1341,7 @@ q--> LLEIs and Encapsulations on each logical link interface.
Full topology discovery is not appropriate at the L3DL layer, so - Dijkstra à la IS-IS etc. is assumed to be done by higher level + Dijkstra a la IS-IS etc. is assumed to be done by higher level protocols such as BGP-SPF. Therefore the LLEIs, link Encapsulations, and state changes are @@ -1370,24 +1394,15 @@ q-->
- - A device with multiple Layer 2 interfaces, traditionally called a switch, may be used to forward frames and therefore packets from multiple devices to one logical interface (LLEI), I, on an L3DL speaking device. Interface I could discover a peer J across the switch. Later, a prospective peer K could come up across the switch. If I was not still sending and listening for HELLOs, the - potential peering with K could not be discovered. Therefore, - interfaces MUST continue to send HELLOs as long as they are turned - up. + potential peering with K could not be discovered. Therefore, + multi-link interfaces MUST continue to send HELLOs as long as they + are turned up.
@@ -1444,15 +1459,15 @@ q--> encapsulation, the implementation MAY mark it as primary by default.
- An implementation SHOULD allow optional configuration which - updates the local forwarding table with overlay and underlay data - both learned from L3DL peers and configured locally. + An implementation MAY allow optional configuration which updates + the local forwarding table with overlay and underlay data both + learned from L3DL peers and configured locally.
- The protocol as it is MUST NOT be used outside a datacenter or + The protocol as is MUST NOT be used outside a datacenter or similarly closed environment due to lack of formal definition of the authentication and authorization mechanism. Sufficient mechanisms may be described in separate documents. @@ -1588,12 +1603,13 @@ q-->
- The authors thank Cristel Pelsser for multiple reviews, Jeff Haas - for review and comments, Joe Clarke for a useful review, John - Scudder for deeply serious review and comments, Larry Kreeger for a - lot of layer 2 clue, Martijn Schmidt for his contribution, Neeraj - Malhotra for review, Russ Housley for checksum discussion and sBox, - and Steve Bellovin for checksum advice. + The authors thank Cristel Pelsser for multiple reviews, Harsha + Kovuru for comments during implementation, Jeff Haas for review and + comments, Joe Clarke for a useful review, John Scudder for deeply + serious review and comments, Larry Kreeger for a lot of layer 2 + clue, Martijn Schmidt for his contribution, Neeraj Malhotra for + review, Russ Housley for checksum discussion and sBox, and Steve + Bellovin for checksum advice.