Layer 3 Discovery and Liveness Arrcus & IIJ
5147 Crystal Springs Bainbridge Island WA 98110 United States of America randy@psg.com
Arrcus, Inc
sra@hactrn.net
Arrcus
2077 Gateway Place, Suite #400 San Jose CA 95119 United States of America keyur@arrcus.com
Used in Massive Data Centers (MDCs), BGP-SPF and similar protocols need IP neighbor discovery, logical link encapsulation data, and Layer 2 liveness. The Layer 3 Discovery and Liveness protocol provides discovery of the neighbor on a logical link, exchanges supported encapsulations (IPv4, IPv6, ...) with neighbors, discovers encapsulation addresses (Layer 3 / MPLS identifiers), and provides layer 2 liveness checking. The interface data are pushed directly to a BGP API (for LSVR), obviating the need for centralized topology distribution architectures. This protocol is intended to be more widely applicable to other upper layer routing protocols which need logical link discovery and characterisation.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 when, and only when, they appear in all capitals, as shown here.
The Massive Data Center (MDC) environment presents unusual problems of scale, e.g. O(10,000) devices, while its homogeneity presents opportunities for simple approaches. Approaches such as Jupiter Rising use a central controller to deal with scaling, while BGP-SPF provides massive scale-out without centralization using a tried and tested scalable distributed control plane, offering a scalable routing solution in Clos and similar environments. But BGP-SPF and similar higher level device-spanning protocols need logical link state and addressing data from the network to build the routing topology. They also need prompt reaction to (logical) link failure.

Layer 3 Discovery and Liveness (L3DL) provides brutally simple mechanisms for devices to:

  * Discover unique identities of devices/ports/... on a logical link,
  * Run Layer 2 keep-alive messages for session continuity,
  * Discover each other's unique IDs (ASN, RouterID, ...),
  * Discover mutually supported encapsulations, e.g. IP/MPLS,
  * Discover Layer 3 IP and/or MPLS addressing of interfaces of the encapsulations,
  * Enable layer 3 link liveness such as BFD, and finally
  * Present these data, using a very restricted profile of a BGP-LS API, to BGP-SPF, which computes the topology and builds routing and forwarding tables.

This protocol may be more widely applicable to a range of routing and similar protocols which need layer 3 discovery and characterisation.
Even though it concentrates on the inter-device layer, this document relies heavily on routing terminology. The following attempts to clarify the use of some possibly confusing terms:

  ASN:  Autonomous System Number, a BGP identifier for an originator of Layer 3 routes, particularly BGP announcements.

  BGP-LS:  A mechanism by which link-state and TE information can be collected from networks and shared with external components using the BGP routing protocol.

  BGP-SPF:  A hybrid protocol using BGP transport but a Dijkstra SPF decision process.

  Clos:  A hierarchic subset of a crossbar switch topology commonly used in data centers.

  Datagram:  The L3DL content of a single Ethernet frame. A full L3DL PDU may be packaged in multiple Datagrams.

  Encapsulation:  Address Family Indicator and Subsequent Address Family Indicator (AFI/SAFI), i.e. classes of layer 2.5 and 3 addresses such as IPv4, IPv6, MPLS, ...

  Frame:  An Ethernet Layer 2 packet.

  Link or Logical Link:  A logical connection between two logical ports on two devices. E.g. two VLANs between the same two ports are two links.

  LLEI:  Logical Link Endpoint Identifier, the unique identifier of one end of a logical link.

  MAC Address:  Media Access Control Address, essentially an Ethernet address, six octets.

  MDC:  Massive Data Center, commonly thousands of TORs.

  MTU:  Maximum Transmission Unit, the size in octets of the largest packet that can be sent on a medium, see 1.3.3.

  PDU:  Protocol Data Unit, an L3DL application layer message. A PDU may need to be broken into multiple Datagrams to make it through MTU or other restrictions.

  RouterID:  A 32-bit identifier unique in the current routing domain.

  Session:  A session between two L3DL capable link end-points, established via OPEN PDUs.

  SPF:  Shortest Path First, an algorithm for finding the shortest paths between nodes in a graph; AKA Dijkstra's algorithm.

  TOR:  Top Of Rack switch, aggregates the servers in a rack and connects to aggregation layers of the Clos tree, AKA the Clos spine.

  ZTP:  Zero Touch Provisioning gives devices initial addresses, credentials, etc. on boot/restart.
L3DL assumes a Clos type datacenter scale and topology, but can accommodate richer topologies which contain potential cycles.

While L3DL is designed for the MDC, there are no inherent reasons it could not run on a WAN; though, as it is simply a discovery protocol, it is not clear that this would be useful. The authentication and authorisation needed to run safely on a WAN need to be considered, and the appropriate level of security options chosen.

L3DL assumes a new IEEE assigned EtherType (TBD).

The number of addresses of the Encapsulations on a link may be fairly large: a TOR may have more than 20 servers, each possibly running on the order of a hundred micro-services, resulting in an inordinate number of addresses. Security will further add to the length of PDUs. PDUs longer than 10,000 octets are quite possible.
  * Devices discover each other on logical links
  * Logical Link Endpoint Identifiers are exchanged
  * Layer 2 Liveness Checks may be started
  * Encapsulation data are exchanged and IP-Level Liveness Checks enabled
  * A BGP-like protocol is assumed to use these data to discover and build a topology database
+-------------------+   +-------------------+   +-------------------+
|      Device       |   |      Device       |   |      Device       |
|                   |   |                   |   |                   |
|+-----------------+|   |+-----------------+|   |+-----------------+|
||                 ||   ||                 ||   ||                 ||
||     BGP-SPF     <+---+>     BGP-SPF     <+---+>     BGP-SPF     ||
||                 ||   ||                 ||   ||                 ||
|+--------^--------+|   |+--------^--------+|   |+--------^--------+|
|         |         |   |         |         |   |         |         |
|         |         |   |         |         |   |         |         |
|+--------+--------+|   |+--------+--------+|   |+--------+--------+|
|| Encapsulations  ||   || Encapsulations  ||   || Encapsulations  ||
|| Addresses       ||   || Addresses       ||   || Addresses       ||
|| L2 Liveness     ||   || L2 Liveness     ||   || L2 Liveness     ||
|+--------^--------+|   |+--------^--------+|   |+--------^--------+|
|         |         |   |         |         |   |         |         |
|         |         |   |         |         |   |         |         |
|+--------v--------+|   |+--------v--------+|   |+--------v--------+|
||                 ||   ||                 ||   ||                 ||
||Inter-Device PDUs<+---+>Inter-Device PDUs<+---+>Inter-Device PDUs||
||                 ||   ||                 ||   ||                 ||
|+-----------------+|   |+-----------------+|   |+-----------------+|
+-------------------+   +-------------------+   +-------------------+
There are two protocols, the inter-device per-link layer 3 discovery and the interface to the upper level BGP-like API:

  * Inter-device PDUs are used to exchange device and logical link identities and layer 2.5 and 3 identifiers (not payloads), e.g. device IDs, port identities, VLAN IDs, Encapsulations, and IP addresses.
  * A Link Layer to BGP API presents these data up the stack to a BGP protocol or another device-spanning upper layer protocol, using the BGP-LS BGP-like data format.

The upper layer BGP family routing protocols cross all the devices, though they are not part of these L3DL protocols.

To simplify this document, Layer 2 Ethernet framing is not shown. L3DL is about layer 3.
L3DL discovers neighbors on logical links and establishes sessions between the two ends of all consenting discovered logical links. A logical link is described by a pair of Logical Link Endpoint Identifiers, LLEIs.

An L3DL deployment will choose and define an LLEI which suits its needs, simple or complex. Two extremes are as follows:

A simplistic view of a link between two devices is two ports, identified by unique MAC addresses, carrying a layer 3 protocol conversation. In this case, the MAC addresses might suffice for the LLEIs.

Unfortunately, things can get more complex. Multiple VLANs can run between those two MAC addresses. In practice, real devices use the same MAC address on multiple ports and/or sub-interfaces.

Therefore, in extreme circumstances, a fully described LLEI might be as follows:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                            ifIndex                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          System MAC                           |
+                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               |        VLAN ID        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
ifIndex is the SNMP identifier of the (sub-)interface. This uniquely identifies the port.

System MAC is an identifier unique in the entire operational space. Routers and switches have internal system MACs which can be used. If none exists on a device, the local L3DL configuration SHOULD create and assign a unique one.

The VLAN ID is the 802.1Q identifier of the virtual link's VLAN if a VLAN is configured, otherwise zero.
Two devices discover each other and their respective identities by sending multicast HELLO PDUs. To allow discovery of new devices coming up on a multi-link topology, devices send periodic HELLOs forever.

Once a new device is recognized, both devices attempt to negotiate and establish peering by sending unicast OPEN PDUs. In an established peering, Encapsulations may be announced and modified. When two devices on a link have compatible Encapsulations and addresses, i.e. the same AFI/SAFI and the same subnet, the link is announced via the BGP-LS API.
The HELLO is a priming message. It is a small L3DL PDU encapsulated in an Ethernet multicast frame with the simple goal of discovering the identities of logical link endpoint(s) reachable from an LLEI.

The HELLO and OPEN PDUs, which are used to discover and exchange detailed LLEIs, are mandatory; other PDUs are optional, though at least one encapsulation MUST be agreed at some point.

The following is a ladder-style sketch of the L3DL protocol exchanges:
|            HELLO            | Logical Link Peer discovery
|---------------------------->|
|            HELLO            | Mandatory
|<----------------------------|
|                             |
|                             |
|            OPEN             | Session Open LLEIs
|---------------------------->|
|            OPEN             | Mandatory
|<----------------------------|
|                             |
|                             |
|  Interface IPv4 Addresses   | Interface IPv4 Addresses
|---------------------------->| Optional
|             ACK             |
|<----------------------------|
|                             |
|  Interface IPv4 Addresses   |
|<----------------------------|
|             ACK             |
|---------------------------->|
|                             |
|                             |
|  Interface IPv6 Addresses   | Interface IPv6 Addresses
|---------------------------->| Optional
|             ACK             |
|<----------------------------|
|                             |
|  Interface IPv6 Addresses   |
|<----------------------------|
|             ACK             |
|---------------------------->|
|                             |
|                             |
|   Interface MPLSv4 Labels   | Interface MPLSv4 Labels
|---------------------------->| Optional
|             ACK             |
|<----------------------------|
|                             |
|   Interface MPLSv4 Labels   | Interface MPLSv4 Labels
|<----------------------------| Optional
|             ACK             |
|---------------------------->|
|                             |
|                             |
|   Interface MPLSv6 Labels   | Interface MPLSv6 Labels
|---------------------------->| Optional
|             ACK             |
|<----------------------------|
|                             |
|   Interface MPLSv6 Labels   | Interface MPLSv6 Labels
|<----------------------------| Optional
|             ACK             |
|---------------------------->|
|                             |
|                             |
|       L3DL KEEPALIVE        | Layer 2 Liveness
|---------------------------->| Optional
|       L3DL KEEPALIVE        |
|<----------------------------|
L3DL PDUs are carried by a simple transport layer which allows long PDUs to occupy many Ethernet frames. The L3DL data in each frame is referred to as a Datagram. The L3DL Transport Layer encapsulates each Datagram using a common transport header.
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|    Version    |L|Datagram Num.|        Datagram Length        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           Checksum                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The fields of the L3DL Transport Header are as follows:

  Version:  Version number of the protocol, currently 0. Values other than 0 are treated as errors.

  L:  A bit that is set to one if this Datagram is the last Datagram of the PDU. For a PDU which fits in only one Datagram, it is set to one.

  Datagram Number:  0..127, a monotonically increasing value, modulo 128. Note that this does not limit an L3DL PDU to 128 frames.

  Datagram Length:  Total number of octets in the Datagram including all payloads and fields.

  Checksum:  A 32 bit hash over the Datagram to detect bit flips.
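As a non-normative illustration, the transport header could be packed and unpacked as sketched below. The sketch assumes network byte order, that L is the high-order bit of the second octet, and that the header is eight octets; the text above does not state these, and the names l3dl_hdr, l3dl_hdr_encode and l3dl_hdr_decode are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define L3DL_HDR_LEN 8u /* assumed: 4 octets of fields + 4 of Checksum */

struct l3dl_hdr {
    uint8_t  version;      /* currently 0 */
    uint8_t  last;         /* L bit: 1 on the last Datagram of a PDU */
    uint8_t  datagram_num; /* 0..127, wraps modulo 128 */
    uint16_t length;       /* octets in the whole Datagram */
    uint32_t checksum;     /* 32-bit hash over the Datagram */
};

/* Write the header into out[0..7], assuming network byte order. */
static void l3dl_hdr_encode(const struct l3dl_hdr *h,
                            uint8_t out[L3DL_HDR_LEN])
{
    out[0] = h->version;
    out[1] = (uint8_t)(((h->last & 1u) << 7) | (h->datagram_num & 0x7Fu));
    out[2] = (uint8_t)(h->length >> 8);
    out[3] = (uint8_t)(h->length & 0xFFu);
    out[4] = (uint8_t)(h->checksum >> 24);
    out[5] = (uint8_t)(h->checksum >> 16);
    out[6] = (uint8_t)(h->checksum >> 8);
    out[7] = (uint8_t)(h->checksum);
}

/* Returns 0 on success, -1 if the buffer is short or Version is not 0. */
static int l3dl_hdr_decode(const uint8_t *in, size_t n, struct l3dl_hdr *h)
{
    if (n < L3DL_HDR_LEN || in[0] != 0)
        return -1;
    h->version      = in[0];
    h->last         = (uint8_t)(in[1] >> 7);
    h->datagram_num = (uint8_t)(in[1] & 0x7Fu);
    h->length       = (uint16_t)((in[2] << 8) | in[3]);
    h->checksum     = ((uint32_t)in[4] << 24) | ((uint32_t)in[5] << 16) |
                      ((uint32_t)in[6] << 8)  |  (uint32_t)in[7];
    return 0;
}
```

A receiver would typically decode the header first and reject the Datagram on a non-zero Version before verifying the Checksum.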
There is a reason conservative folk use a checksum in UDP. And as many operators stretch to jumbo frames (over 1,500 octets) longer checksums are the prudent approach. For the purpose of computing a checksum, the checksum field itself is assumed to be zero. The following code describes the suggested algorithm.
Sum up 32-bit unsigned ints in a 64-bit long, then take the high-order section, shift it right, rotate, add it in, repeat until zero.

#include <stddef.h>
#include <stdint.h>

/* The F table from Skipjack, and it would work for the S-Box. */
static const uint8_t sbox[256] = {
  0xa3,0xd7,0x09,0x83,0xf8,0x48,0xf6,0xf4,0xb3,0x21,0x15,0x78,
  0x99,0xb1,0xaf,0xf9,0xe7,0x2d,0x4d,0x8a,0xce,0x4c,0xca,0x2e,
  0x52,0x95,0xd9,0x1e,0x4e,0x38,0x44,0x28,0x0a,0xdf,0x02,0xa0,
  0x17,0xf1,0x60,0x68,0x12,0xb7,0x7a,0xc3,0xe9,0xfa,0x3d,0x53,
  0x96,0x84,0x6b,0xba,0xf2,0x63,0x9a,0x19,0x7c,0xae,0xe5,0xf5,
  0xf7,0x16,0x6a,0xa2,0x39,0xb6,0x7b,0x0f,0xc1,0x93,0x81,0x1b,
  0xee,0xb4,0x1a,0xea,0xd0,0x91,0x2f,0xb8,0x55,0xb9,0xda,0x85,
  0x3f,0x41,0xbf,0xe0,0x5a,0x58,0x80,0x5f,0x66,0x0b,0xd8,0x90,
  0x35,0xd5,0xc0,0xa7,0x33,0x06,0x65,0x69,0x45,0x00,0x94,0x56,
  0x6d,0x98,0x9b,0x76,0x97,0xfc,0xb2,0xc2,0xb0,0xfe,0xdb,0x20,
  0xe1,0xeb,0xd6,0xe4,0xdd,0x47,0x4a,0x1d,0x42,0xed,0x9e,0x6e,
  0x49,0x3c,0xcd,0x43,0x27,0xd2,0x07,0xd4,0xde,0xc7,0x67,0x18,
  0x89,0xcb,0x30,0x1f,0x8d,0xc6,0x8f,0xaa,0xc8,0x74,0xdc,0xc9,
  0x5d,0x5c,0x31,0xa4,0x70,0x88,0x61,0x2c,0x9f,0x0d,0x2b,0x87,
  0x50,0x82,0x54,0x64,0x26,0x7d,0x03,0x40,0x34,0x4b,0x1c,0x73,
  0xd1,0xc4,0xfd,0x3b,0xcc,0xfb,0x7f,0xab,0xe6,0x3e,0x5b,0xa5,
  0xad,0x04,0x23,0x9c,0x14,0x51,0x22,0xf0,0x29,0x79,0x71,0x7e,
  0xff,0x8c,0x0e,0xe2,0x0c,0xef,0xbc,0x72,0x75,0x6f,0x37,0xa1,
  0xec,0xd3,0x8e,0x62,0x8b,0x86,0x10,0xe8,0x08,0x77,0x11,0xbe,
  0x92,0x4f,0x24,0xc5,0x32,0x36,0x9d,0xcf,0xf3,0xa6,0xbb,0xac,
  0x5e,0x6c,0xa9,0x13,0x57,0x25,0xb5,0xe3,0xbd,0xa8,0x3a,0x01,
  0x05,0x59,0x2a,0x46
};

/* non-normative example C code, constant time even */
uint32_t sbox_checksum_32(const uint8_t *b, const size_t n)
{
  uint32_t sum[4] = {0, 0, 0, 0};
  uint64_t result = 0;

  /* Substitute each octet through the S-Box and accumulate it in
     one of four lanes, selected by the octet's position modulo 4. */
  for (size_t i = 0; i < n; i++)
    sum[i & 3] += sbox[*b++];

  /* Combine the four lanes into one 64-bit value. */
  for (size_t i = 0; i < sizeof(sum)/sizeof(*sum); i++)
    result = (result << 8) + sum[i];

  /* Fold the high-order 32 bits into the low-order 32 bits, twice. */
  result = (result >> 32) + (result & 0xFFFFFFFF);
  result = (result >> 32) + (result & 0xFFFFFFFF);
  return (uint32_t) result;
}
The basic L3DL application layer PDU is a typical TLV (Type Length Value) PDU. It includes a signature to provide optional integrity and authentication. It may be broken into multiple Datagrams.
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     Type      |        Payload Length         |               ~
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               ~
~                          Payload ...                          ~
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   Sig Type    |       Signature Length        |               ~
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               ~
~                           Signature                           ~
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The fields of the basic L3DL header are as follows:

  Type:  An integer differentiating PDU payload types

      0 - HELLO
      1 - OPEN
      2 - KEEPALIVE
      3 - ACK
      4 - IPv4 Announcement
      5 - IPv6 Announcement
      6 - MPLS IPv4 Announcement
      7 - MPLS IPv6 Announcement
      8-254 - Reserved
      255 - VENDOR

  Payload Length:  Total number of octets in the Payload field.

  Payload:  The application layer content of the L3DL PDU.

  Sig Type:  The type of the Signature. Type 0, a null signature, is defined in this document. For very short PDUs, the underlying Datagram checksums may be sufficient for integrity, if not for authentication. Sig Type 1 is specified in a companion document [ref later]. Other Sig Types may be defined in other documents.

  Signature Length:  The length of the Signature, possibly including padding, in octets. If Sig Type is 0, Signature Length must be 0.

  Signature:  The result of running the signature algorithm specified in Sig Type over all octets of the PDU except for the Signature itself.
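The TLV layout lends itself to a simple bounds-checked parser. The following non-normative sketch assumes that the multi-octet lengths are in network byte order and that the Signature fields immediately follow the Payload; l3dl_pdu and l3dl_pdu_parse are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct l3dl_pdu {
    uint8_t        type;
    uint16_t       payload_len;
    const uint8_t *payload;  /* points into the caller's buffer */
    uint8_t        sig_type;
    uint16_t       sig_len;
    const uint8_t *sig;      /* points into the caller's buffer */
};

/* Read a 16-bit value, assuming network byte order. */
static uint16_t rd16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Parse one reassembled PDU. Returns 0 on success, -1 if malformed. */
static int l3dl_pdu_parse(const uint8_t *b, size_t n, struct l3dl_pdu *out)
{
    if (n < 3)
        return -1;                   /* no room for Type + Payload Length */
    out->type = b[0];
    out->payload_len = rd16(b + 1);
    if (n - 3 < (size_t)out->payload_len + 3)
        return -1;                   /* Payload or Signature header cut */
    out->payload = b + 3;

    const uint8_t *s = out->payload + out->payload_len;
    out->sig_type = s[0];
    out->sig_len = rd16(s + 1);
    if (out->sig_type == 0 && out->sig_len != 0)
        return -1;                   /* null signature must be empty */

    size_t rest = n - 3 - out->payload_len - 3;
    if (rest < out->sig_len)
        return -1;                   /* Signature truncated */
    out->sig = s + 3;
    return 0;
}
```

The parser only checks framing; verifying the Signature itself depends on the Sig Type and is out of scope for this sketch.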
The HELLO PDU is unique in that it is encapsulated in a multicast Ethernet frame. It solicits response(s) from other LLEI(s) on the link.

The destination multicast MAC Address to be used MUST be one of the following (see Clause 9.2.2 of the referenced IEEE 802 standard):

  Nearest Bridge = Propagation constrained to a single physical link; stopped by all types of bridges (including TPMRs (media converters)).

  Nearest non-TPMR Bridge = Propagation constrained by all bridges other than TPMRs; intended for use within provider bridged networks.

All other L3DL PDUs are encapsulated in unicast Ethernet frames, as the peer's destination MAC address is known after the HELLO exchange.

When an interface is turned up on a device, it SHOULD issue a HELLO periodically. The interval is set by configuration with a default of 60 seconds.
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   Type = 0    |      Payload Length = 0       | Sig Type = 0  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     Signature Length = 0      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
If more than one device responds, one adjacency is formed for each unique (source link address) response. L3DL treats each adjacency as a separate logical link.

When a HELLO is received from a source link address with which there is no established L3DL adjacency, the receiver SHOULD respond with an OPEN PDU. The two devices establish an L3DL adjacency by exchanging OPEN PDUs.

The Payload Length is zero as there is no payload.

HELLO PDUs cannot be signed as keying material has yet to be exchanged. Hence the signature MUST always be null.
Each device has learned the other's MAC Address from the HELLO exchange. Therefore the OPEN and subsequent PDUs are unicast, as opposed to the HELLO's multicast, Ethernet frames.
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   Type = 1    |        Payload Length         |               ~
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                     Nonce                     |   ID Length   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
~                                                               ~
~                             My ID                             ~
~                                                               ~
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   AttrCount   |              Attribute List ...               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   Auth Type   |          Auth Length          |               ~
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               ~
~                    Authentication Data ...                    ~
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   Sig Type    |       Signature Length        |               ~
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               ~
~                         Signature ...                         ~
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The Payload Length is the number of octets in all fields of the PDU from the Nonce to the Authentication Data, excluding the Sig Type, the Signature Length, and the Signature.

The Nonce enables detection of a duplicate OPEN PDU. It SHOULD be either a random number or the time of day. It is needed to prevent session closure due to a repeated OPEN caused by a race or a dropped or delayed ACK.

My ID is the sending LLEI. It can be an ASN with high order bits zero, a classic RouterID with high order bits zero, a catenation of the two, an 80-bit ISO System-ID, or any other identifier unique to a single logical link endpoint in the topology. IDs are big-endian.

AttrCount is the number of attributes in the Attribute List. Attributes are single octets whose semantics are user-defined.

A node may have zero or more user-defined attributes, e.g. spine, leaf, backbone, route reflector, arabica, ...

Attribute syntax and semantics are local to an operator or datacenter; hence there is no global registry. Nodes exchange their attributes only in the OPEN PDU.

Auth Type is the Signature algorithm type.

Auth Length is a 16-bit field denoting the length in octets of the Authentication Data, not including the Auth Type or the Auth Length.

If there are no Authentication Data, the Auth Type and Auth Length MUST both be zero.

The Authentication Data are specific to the operational environment. A failure to authenticate is a failure to start the L3DL session; an ERROR PDU is sent (Error Code 2), and HELLOs MUST be restarted.

The Signature fields are described with the basic L3DL PDU and, in an asymmetric key environment, serve as a proof of possession of the signing auth data by the sender.

Once two logical link endpoints know each other, and have ACKed each other's OPEN PDUs, Layer 2 KEEPALIVEs MAY be started to ensure Layer 2 liveness and keep the session semantics alive. The timing and acceptable drop of KEEPALIVE PDUs are discussed with the KEEPALIVE PDU.
If a sender of OPEN does not receive an ACK of the OPEN PDU Type, then they MUST resend the same OPEN PDU, with the same Nonce. Resending an unacknowledged OPEN PDU, like other ACKed PDUs, SHOULD use exponential back-off.

If a properly authenticated OPEN arrives with a new Nonce from an LLEI with which the receiving logical link endpoint believes it already has an L3DL session (OPENs have already been exchanged), the receiver MUST assume that the sending LLEI or entire device has been reset. All discovered encapsulation data SHOULD be withdrawn via the BGP-LS API and the recipient MUST respond with a new OPEN.

In this circumstance encapsulations SHOULD NOT be kept because, while the new OPEN is likely to be followed by new encapsulation PDUs of the same data, the old session might have an encapsulation type not in the new session.
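The Nonce rules for a received, already authenticated OPEN can be sketched as a small decision function. This is a non-normative illustration; the 32-bit Nonce width, peer_state, open_action and on_open are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

enum open_action {
    OPEN_NEW_SESSION, /* first OPEN from this LLEI: ACK and proceed */
    OPEN_DUPLICATE,   /* same Nonce again: re-ACK, keep the session */
    OPEN_PEER_RESET   /* new Nonce on a live session: peer was reset */
};

struct peer_state {
    int      have_session; /* OPENs already exchanged with this LLEI */
    uint32_t nonce;        /* Nonce of the OPEN that opened the session */
};

/* Classify an authenticated OPEN and update the per-peer state. */
static enum open_action on_open(struct peer_state *p, uint32_t nonce)
{
    if (!p->have_session) {
        p->have_session = 1;
        p->nonce = nonce;
        return OPEN_NEW_SESSION;
    }
    if (p->nonce == nonce)
        return OPEN_DUPLICATE;
    /* Caller should withdraw discovered encapsulations via the
       BGP-LS API and respond with a new OPEN. */
    p->nonce = nonce;
    return OPEN_PEER_RESET;
}
```

On OPEN_DUPLICATE the receiver simply acknowledges again, which covers the race and lost-ACK cases the Nonce is meant to handle.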
The ACK PDU acknowledges receipt of a PDU and reports any error condition which might have been raised.
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   Type = 3    |      Payload Length = 5       |   PDU Type    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| EType |      Error Code       |          Error Hint           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   Sig Type    |       Signature Length        |               ~
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               ~
~                         Signature ...                         ~
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The ACK acknowledges receipt of an OPEN, Encapsulation, VENDOR PDU, etc.

The PDU Type is the Type of the PDU being acknowledged, e.g., OPEN or one of the Encapsulations.

If there was an error processing the received PDU, then the EType is non-zero. If the EType is zero, Error Code and Error Hint MUST also be zero.

A non-zero EType is the receiver's way of telling the PDU's sender that the receiver had problems processing the PDU. The Error Code and Error Hint will tell the sender more detail about the error.

The decimal value of EType gives a strong hint how the receiver sending the ACK believes things should proceed:

  0 - No Error, Error Code and Error Hint MUST be zero
  1 - Warning, something not too serious happened, continue
  2 - Session should not be continued, try to restart
  3 - Restart is hopeless, call the operator
  4-15 - Reserved

Someone stuck in the 1990s might think of the error codes as 0x1zzz, 0x2zzz, etc. They might be right. Or not.

The Error Code indicates the type of error.

The Error Hint is any additional data the sender of the error PDU thinks will help the recipient or the debugger with the particular error.

The Signature fields are described with the basic L3DL PDU.
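The five-octet ACK payload can be unpacked as sketched below. This is non-normative and rests on an inference: with Payload Length = 5, EType values of 0..15, and the 0x1zzz remark, the sketch assumes a 4-bit EType, a 12-bit Error Code and a 16-bit Error Hint in network byte order. l3dl_ack and l3dl_ack_decode are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

struct l3dl_ack {
    uint8_t  pdu_type;   /* Type of the PDU being acknowledged */
    uint8_t  etype;      /* 0 none, 1 warning, 2 restart, 3 hopeless */
    uint16_t error_code; /* assumed to be 12 bits wide */
    uint16_t error_hint; /* assumed to be 16 bits wide */
};

/* Decode the 5-octet ACK payload. Returns 0 on success, or -1 when
   EType is zero but Error Code or Error Hint is not. */
static int l3dl_ack_decode(const uint8_t p[5], struct l3dl_ack *a)
{
    a->pdu_type   = p[0];
    a->etype      = (uint8_t)(p[1] >> 4);
    a->error_code = (uint16_t)(((p[1] & 0x0Fu) << 8) | p[2]);
    a->error_hint = (uint16_t)((p[3] << 8) | p[4]);
    if (a->etype == 0 && (a->error_code != 0 || a->error_hint != 0))
        return -1;
    return 0;
}
```

A sender receiving EType 2 would tear the session down and restart from HELLO, while EType 1 can be logged and ignored.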
If a PDU sender expects an ACK, e.g. for an OPEN, an Encapsulation, a VENDOR PDU, etc., and does not receive the ACK for a configurable time (default one second), the sender resends the PDU using exponential back-off. This cycle MAY be repeated a configurable number of times (default three) before it is considered a failure. The session is considered closed in case of this ACK failure.
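A retransmission timer following these defaults might look like the sketch below. It is non-normative: doubling the wait on each attempt and capping it are assumptions, as the back-off details are not given here, and retry_delay_ms and ack_failed are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Wait before retransmission number `attempt` (0 = first resend).
   Assumes the delay doubles each time, capped at cap_ms. */
static uint32_t retry_delay_ms(uint32_t base_ms, unsigned attempt,
                               uint32_t cap_ms)
{
    uint64_t d = base_ms;
    while (attempt-- > 0 && d < cap_ms)
        d *= 2;
    return (uint32_t)(d < cap_ms ? d : cap_ms);
}

/* Returns 1 once the configured number of resends is exhausted,
   at which point the session is considered closed. */
static int ack_failed(unsigned resends_done, unsigned max_resends)
{
    return resends_done >= max_resends;
}
```

With the defaults of one second and three repetitions, the waits under this doubling assumption would be 1000, 2000 and 4000 milliseconds.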
Once the devices know each other's LLEIs, know each other's upper layer identities, have means to ensure link state, etc., the L3DL session is considered established, and the devices SHOULD exchange interface encapsulations, addresses, (and labels).

The Encapsulation types the peers exchange may be IPv4 Announcement, IPv6 Announcement, MPLS IPv4 Announcement, MPLS IPv6 Announcement, and/or possibly others not defined here.

The sender of an Encapsulation PDU MUST NOT assume that the peer is capable of the same Encapsulation Type. An ACK merely acknowledges receipt. Only if both peers have sent the same Encapsulation Type is it safe to assume that they are compatible for that type.

A receiver of an encapsulation might recognize an addressing conflict, such as both ends of the link trying to use the same address. In this case, the receiver SHOULD respond with an ERROR (Error Code 1) instead of an ACK. As there may be other usable addresses or encapsulations, this error might log and continue, letting an upper layer topology builder deal with what works.

Further, to consider a logical link of a type to formally be established so that it may be pushed up to upper layer protocols, the addressing for the type must be compatible, e.g. on the same IPvX subnet.
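For IPv4, the compatibility test between the local and the announced address can be sketched as follows. This is non-normative; requiring equal prefix lengths and treating identical addresses as a conflict are assumptions of the example, and prefix_mask and v4_link_compatible are hypothetical names. Addresses are in host byte order.

```c
#include <assert.h>
#include <stdint.h>

/* Netmask for a prefix length of 0..32. */
static uint32_t prefix_mask(unsigned len)
{
    return len == 0 ? 0u : 0xFFFFFFFFu << (32u - len);
}

/* Returns 1 if the two ends can share the link: same prefix length,
   same subnet, and not the very same address (a conflict). */
static int v4_link_compatible(uint32_t a, unsigned alen,
                              uint32_t b, unsigned blen)
{
    if (alen > 32 || alen != blen)
        return 0;
    if (a == b)
        return 0; /* both ends claim one address */
    uint32_t m = prefix_mask(alen);
    return (a & m) == (b & m);
}
```

Only when such a check passes for some address pair would the link be presented to the upper layer for that Encapsulation Type.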
The header for all encapsulation PDUs is as follows:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     Type      |        Payload Length         |     Count     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|      ...      |             Encapsulation List...             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   Sig Type    |       Signature Length        |               ~
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               ~
~                         Signature ...                         ~
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The 16-bit Count is the number of Encapsulations in the Encapsulation list.

An Encapsulation PDU describes zero or more addresses of the encapsulation type.

An Encapsulation PDU of Type T replaces all previous encapsulations of Type T. To remove all encapsulations of Type T, the sender uses a Count of zero.

If an LLEI has multiple addresses for an encapsulation type, one and only one address SHOULD be configured to be marked as primary.

Loopback addresses are generally not seen directly on an external interface. One or more loopback addresses MAY be exposed by configuration on one or more L3DL speaking external interfaces, e.g. for iBGP peering. They SHOULD be marked as such.

If there is exactly one non-loopback address for an encapsulation type on an interface, it SHOULD be marked as primary.

If a sender has multiple links on the same interface, separate data, ACKs, etc. must be kept for each peer.

Over time, multiple Encapsulation PDUs may be sent for an interface as configuration changes.

If the length of an Encapsulation PDU exceeds the Datagram size limit on media, the PDU is broken into multiple Datagrams.

The Signature fields are described with the basic L3DL PDU.

The Receiver MUST acknowledge the Encapsulation PDU with a Type=3, ACK PDU, with the PDU Type being that of the encapsulation being announced.

If the Sender does not receive an ACK in a configurable interval (default one second), they SHOULD retransmit. After a user configurable number of failures, the L3DL session should be considered dead and the OPEN process SHOULD be restarted.
        0               1               2            3 ... 7
+---------------+---------------+---------------+---------------+
|    Primary    |   Loopback    | Reserved ...  |               |
+---------------+---------------+---------------+---------------+
Each Encapsulation interface address MAY be marked as a primary address, and/or a loopback, in which case the respective bit is set to one. Only one address MAY be marked as primary for an encapsulation type.
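A sender or receiver can check the single-primary rule with a few lines of code. This non-normative sketch assumes that bit 0 in the figure is the most significant bit of the flags octet, so Primary is 0x80 and Loopback is 0x40; the macro and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define L3DL_FLAG_PRIMARY  0x80u /* bit 0 in the figure, assumed MSB */
#define L3DL_FLAG_LOOPBACK 0x40u /* bit 1 in the figure */

/* Returns 1 if at most one of the n PrimLoop Flags octets of an
   encapsulation type has the Primary bit set. */
static int at_most_one_primary(const uint8_t *flags, size_t n)
{
    size_t primaries = 0;
    for (size_t i = 0; i < n; i++)
        if (flags[i] & L3DL_FLAG_PRIMARY)
            primaries++;
    return primaries <= 1;
}
```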
The IPv4 Encapsulation describes a device's ability to exchange IPv4 packets on one or more subnets. It does so by stating the interface's addresses and the corresponding prefix lengths.
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   Type = 4    |        Payload Length         |     Count     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|      ...      | PrimLoop Flags|         IPv4 Address          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|              ...              |   PrefixLen   |   more ...    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   Sig Type    |       Signature Length        |               ~
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               ~
~                         Signature ...                         ~
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The 16-bit Count is the number of IPv4 Encapsulations.
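The payload of an IPv4 Announcement could be serialized as in the following non-normative sketch. It assumes that the payload starts with the 16-bit Count in network byte order and that each entry is six octets (flags, address, prefix length); v4_encap and v4_payload_encode are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct v4_encap {
    uint8_t flags;      /* PrimLoop Flags */
    uint8_t addr[4];    /* IPv4 address, network byte order */
    uint8_t prefix_len; /* 0..32 */
};

/* Serialize Count followed by `count` entries into out.
   Returns the octets written, or 0 if out is too small or a
   prefix length is invalid. A Count of zero withdraws all IPv4
   encapsulations and yields a two-octet payload. */
static size_t v4_payload_encode(const struct v4_encap *e, uint16_t count,
                                uint8_t *out, size_t cap)
{
    size_t need = 2 + (size_t)count * 6;
    if (cap < need)
        return 0;
    out[0] = (uint8_t)(count >> 8);
    out[1] = (uint8_t)(count & 0xFFu);
    uint8_t *p = out + 2;
    for (uint16_t i = 0; i < count; i++) {
        if (e[i].prefix_len > 32)
            return 0;
        *p++ = e[i].flags;
        for (int j = 0; j < 4; j++)
            *p++ = e[i].addr[j];
        *p++ = e[i].prefix_len;
    }
    return need;
}
```

The returned size would become the Payload Length of the Type = 4 PDU under these assumptions.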
The IPv6 Encapsulation describes a logical link's ability to exchange IPv6 packets on one or more subnets. It does so by stating the interface's addresses and the corresponding prefix lengths.
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   Type = 5    |        Payload Length         |     Count     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|      ...      | PrimLoop Flags|                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               +
|                                                               |
+                                                               +
|                                                               |
+                                                               +
|                         IPv6 Address                          |
+                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               |   PrefixLen   |   more ...    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   Sig Type    |       Signature Length        |               ~
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               ~
~                         Signature ...                         ~
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The 16-bit Count is the number of IPv6 Encapsulations.
As an MPLS enabled interface may have a label stack, a variable length list of labels is needed.
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Label Count  |                 Label                 | Exp |S|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                 Label                 | Exp |S|   more ...    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
A Label Count of zero is an implicit withdraw of all labels for that prefix on that interface.
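The label list above can be serialized as an 8-bit Label Count followed by three octets per entry (20-bit Label, 3-bit Exp, 1-bit S). The following non-normative sketch assumes network byte order and sets the S bit only on the last entry, as is usual for the bottom of an MPLS label stack; mpls_entry and mpls_list_encode are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct mpls_entry {
    uint32_t label; /* 20 bits */
    uint8_t  exp;   /* 3 bits */
};

/* Serialize Label Count and the label entries into out.
   Returns the octets written, or 0 if out is too small or a field
   is out of range. A count of zero produces the single octet 0,
   the implicit withdraw. */
static size_t mpls_list_encode(const struct mpls_entry *e, uint8_t count,
                               uint8_t *out, size_t cap)
{
    size_t need = 1 + (size_t)count * 3;
    if (cap < need)
        return 0;
    out[0] = count;
    for (uint8_t i = 0; i < count; i++) {
        if (e[i].label > 0xFFFFFu || e[i].exp > 7u)
            return 0;
        uint32_t s = (i == count - 1) ? 1u : 0u; /* bottom of stack */
        uint32_t v = (e[i].label << 4) | ((uint32_t)e[i].exp << 1) | s;
        out[1 + 3 * (size_t)i] = (uint8_t)(v >> 16);
        out[2 + 3 * (size_t)i] = (uint8_t)(v >> 8);
        out[3 + 3 * (size_t)i] = (uint8_t)v;
    }
    return need;
}
```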
The MPLS IPv4 Encapsulation describes a logical link's ability to exchange labeled IPv4 packets on one or more subnets. It does so by stating the interface's addresses, the corresponding prefix lengths, and the corresponding labels.
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   Type = 6    |        Payload Length         |     Count     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|      ...      | PrimLoop Flags|      MPLS Label List ...      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|              ...              |         IPv4 Address          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|              ...              |   PrefixLen   |   more ...    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   Sig Type    |       Signature Length        |               ~
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               ~
~                         Signature ...                         ~
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The 16-bit Count is the number of MPLSv4 Encapsulations.
The MPLS IPv6 Encapsulation describes a logical link's ability to exchange labeled IPv6 packets on one or more subnets. It does so by stating the interface's addresses, the corresponding prefix lengths, and the corresponding labels.
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   Type = 7    |        Payload Length         |     Count     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|      ...      | PrimLoop Flags|      MPLS Label List ...      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|              ...              |                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               +
|                                                               |
+                                                               +
|                                                               |
+                                                               +
|                         IPv6 Address                          |
+                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               |   PrefixLen   |   more ...    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   Sig Type    |       Signature Length        |               ~
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               ~
~                         Signature ...                         ~
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The 16-bit Count is the number of MPLSv6 Encapsulations.
L3DL devices SHOULD beacon frequent Layer 2 KEEPALIVE PDUs to ensure session continuity. They SHOULD be beaconed at a configured frequency. One per second is the default. Layer 3 liveness, such as BFD, may be more aggressive. If a KEEPALIVE is not received from a peer with which a receiver has an open session for a configurable time (default 30 seconds), the session SHOULD be presumed closed. The devices MAY keep configuration state until a new session is established and new Encapsulation PDUs are received.
    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   Type = 2    |      Payload Length = 0       | Sig Type = 0  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     Signature Length = 0      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
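As an illustration only, the KEEPALIVE layout and the timeout rule can be sketched in Python. The field widths are read off the KEEPALIVE diagram and the 1 second and 30 second defaults come from the text. Network byte order, the helper names (build_keepalive, LivenessTracker), and the choice to pass timestamps in explicitly are assumptions of this sketch, not part of the protocol definition.

```python
import struct

KEEPALIVE_TYPE = 2        # PDU Type from the KEEPALIVE diagram
DEFAULT_INTERVAL = 1.0    # seconds between KEEPALIVEs (default from the text)
DEFAULT_TIMEOUT = 30.0    # seconds of silence before presuming the session closed


def build_keepalive() -> bytes:
    """Encode a KEEPALIVE: Type(8) PayloadLength(16) SigType(8) SigLength(16).

    Network byte order is an assumption of this sketch.
    """
    return struct.pack("!BHBH", KEEPALIVE_TYPE, 0, 0, 0)


class LivenessTracker:
    """Hypothetical helper that remembers when the last KEEPALIVE arrived."""

    def __init__(self, now: float, timeout: float = DEFAULT_TIMEOUT) -> None:
        self.timeout = timeout
        self.last_seen = now

    def keepalive_received(self, now: float) -> None:
        # Any KEEPALIVE from the peer restarts the silence interval.
        self.last_seen = now

    def presumed_closed(self, now: float) -> bool:
        # True once more than `timeout` seconds have passed without a KEEPALIVE.
        return now - self.last_seen > self.timeout
```

A caller would send build_keepalive() every DEFAULT_INTERVAL seconds and poll presumed_closed() with a monotonic clock value; passing the time in explicitly keeps the sketch easy to test.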
    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Type = 255   |        Payload Length         |      ...      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |               Enterprise Number               |   Ent Type    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                      Enterprise Data ...                      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   Sig Type    |       Signature Length        |               ~
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               ~
   ~                         Signature ...                         ~
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Vendors or enterprises may define TLVs beyond the scope of L3DL standards. This is done using a Private Enterprise Number followed by free form data. Ent Type allows a VENDOR PDU to be sub-typed in the event that the vendor/enterprise needs multiple PDU types. As with Encapsulation PDUs, a receiver of a VENDOR PDU MUST respond with an ACK or an ERROR PDU. Similarly, a VENDOR PDU MUST only be sent over an open session.
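A minimal sketch of encoding an unsigned VENDOR PDU follows. It assumes a 32-bit Enterprise Number and an 8-bit Ent Type as suggested by the VENDOR diagram, network byte order, a Null signature (Sig Type 0, Signature Length 0), and that Payload Length counts only the octets between the header and the signature fields. That last point and the function name build_vendor_pdu are assumptions of the sketch. The example uses 32473, the enterprise number reserved for documentation.

```python
import struct

VENDOR_TYPE = 255  # PDU Type from the VENDOR diagram


def build_vendor_pdu(enterprise_number: int, ent_type: int, data: bytes) -> bytes:
    """Encode a VENDOR PDU with a Null signature (hypothetical helper)."""
    # Payload: Enterprise Number (32 bits), Ent Type (8 bits), free-form data.
    payload = struct.pack("!IB", enterprise_number, ent_type) + data
    if len(payload) > 0xFFFF:
        raise ValueError("payload does not fit a 16-bit Payload Length")
    # Header: Type (8 bits), Payload Length (16 bits).
    # Assumption: Payload Length counts the payload octets only.
    header = struct.pack("!BH", VENDOR_TYPE, len(payload))
    # Trailer: Sig Type = 0, Signature Length = 0, no signature octets.
    trailer = struct.pack("!BH", 0, 0)
    return header + payload + trailer


# Example with the documentation enterprise number and sub-type 1.
example = build_vendor_pdu(32473, 1, b"hi")
```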
Layer 2 liveness may be continuously tested by KEEPALIVE PDUs, see . As layer 2.5 or layer 3 connectivity could still break, liveness above layer 2 SHOULD be frequently tested using BFD () or a similar technique. This protocol assumes that one or more Encapsulation addresses will be used to ping, BFD, or whatever the operator configures.
Thus far, a one-hop point-to-point logical link discovery protocol has been defined. The devices know their unique LLEIs and know the unique peer LLEIs and Encapsulations on each logical link interface. Full topology discovery is not appropriate at the L3DL layer, so Dijkstra à la IS-IS etc. is assumed to be done by higher level protocols. Therefore the LLEIs, link Encapsulations, and state changes are pushed North via a small subset of the BGP-LS API. The upper layer routing protocol(s), e.g. BGP-SPF, learn and maintain the topology, run Dijkstra, and build the routing database(s). For example, if a neighbor's IPv4 Encapsulation address changes, the devices seeing the change push that change Northbound.
BGP-LS defines BGP-like Datagrams describing logical link state (links, nodes, link prefixes, and many other things), and a new BGP path attribute providing Northbound transport, all of which can be ingested by upper layer protocols such as BGP-SPF; see Section 4 of . For IPv4 links, TLVs 259 and 260 are used. For IPv6 links, TLVs 261 and 262. If there are multiple addresses on a link, multiple TLV pairs are pushed North, having the same ID pairs.
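The mapping from link addresses to TLV pairs might look like the following sketch. The code points 259 to 262 are those named above; the 16-bit Type and 16-bit Length framing is the usual BGP-LS TLV layout. The function names (tlv, link_address_tlvs) and the choice to return one byte string per address pair are hypothetical.

```python
import ipaddress
import struct

# BGP-LS link descriptor TLV code points named in the text.
IPV4_INTERFACE, IPV4_NEIGHBOR = 259, 260
IPV6_INTERFACE, IPV6_NEIGHBOR = 261, 262


def tlv(code: int, value: bytes) -> bytes:
    """Frame a value as a TLV: 16-bit Type, 16-bit Length, value octets."""
    return struct.pack("!HH", code, len(value)) + value


def link_address_tlvs(pairs):
    """Return one encoded TLV pair per (local, neighbor) address pair."""
    out = []
    for local, neighbor in pairs:
        loc = ipaddress.ip_address(local)
        nbr = ipaddress.ip_address(neighbor)
        if loc.version != nbr.version:
            raise ValueError("local and neighbor address families differ")
        if loc.version == 4:
            codes = (IPV4_INTERFACE, IPV4_NEIGHBOR)
        else:
            codes = (IPV6_INTERFACE, IPV6_NEIGHBOR)
        # Multiple addresses on a link yield multiple TLV pairs.
        out.append(tlv(codes[0], loc.packed) + tlv(codes[1], nbr.packed))
    return out


tlvs = link_address_tlvs([("192.0.2.1", "192.0.2.2"),
                          ("2001:db8::1", "2001:db8::2")])
```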
The Northbound protocol needs a few minor extensions to BGP-LS. Luckily, others have needed the same extensions. Similarly to BGP-SPF, the BGP protocol is used in the Protocol-ID field specified in table 1 of . The local and remote node descriptors for all NLRI are the IDs described in . This is equivalent to an adjacency SID or a node SID if the address is a loopback address. Label Sub-TLVs from Section 2.1.1, are used to associate one or more MPLS Labels with a link.
This section explores some trade-offs taken and some considerations.
A device with multiple Layer 2 interfaces, traditionally called a switch, may be used to forward frames and therefore packets from multiple devices to one logical interface (LLEI), I, on an L3DL speaking device. Interface I could discover a peer J across the switch. Later, a prospective peer K could come up across the switch. If I was not still sending and listening for HELLOs, the potential peering with K could not be discovered. Therefore, interfaces MUST continue to send HELLOs as long as they are turned up.
Both HELLO and KEEPALIVE are periodic. KEEPALIVE might be eliminated in favor of keeping only HELLOs. But KEEPALIVEs are unicast, and thus less noisy on the network, especially if HELLO is configured to transit layer-2-only switches, see .
One can think of the protocol as an instance (i.e. state machine) which runs on each logical link of a device. As the upper routing layer must view VLAN topologies as separate graphs, L3DL treats VLAN ports as separate links. L3DL PDUs learned over VLAN-ports may be interpreted by upper layer-3 routing protocols as being learned on the corresponding layer-3 SVI interface for the VLAN. As Sub-Interfaces each have their own LLEIs, they act as separate interfaces, forming their own links.
An implementation SHOULD provide the ability to configure a logical interface as L3DL speaking or not. An implementation SHOULD provide the ability to configure whether HELLOs on an L3DL enabled interface send Nearest Bridge or Nearest non-TPMR Bridge multicast frames from that interface; see . An implementation SHOULD provide the ability to distribute one or more loopback addresses or interfaces into L3DL on an external L3DL speaking interface. An implementation SHOULD provide the ability to configure one of the addresses of an encapsulation as primary on an L3DL speaking interface. If there is only one address for a particular encapsulation, the implementation MAY mark it as primary by default.
The protocol as it is MUST NOT be used outside a datacenter or similarly closed environment due to lack of formal definition of the authentication and authorisation mechanism. Sufficient mechanisms may be described in separate documents. Many MDC operators have a strange belief that physical walls and firewalls provide sufficient security. This is not credible. All MDC protocols need to be examined for exposure and attack surface. It is generally unwise to assume that on the wire Ethernet is secure. Strange/unauthorized devices may plug into a port. Mis-wiring is very common in datacenter installations. A poisoned laptop might be plugged into a device's port. Malicious nodes/devices could mis-announce addressing, form malicious sessions, etc. If OPENs are not being authenticated, an attacker could forge an OPEN for an existing session and cause the session to be reset. For these reasons, the OPEN PDU's authentication data exchange SHOULD be used.
This document requests the IANA create a registry for L3DL PDU Type, which may range from 0 to 255. The name of the registry should be L3DL-PDU-Type. The policy for adding to the registry is RFC Required per , either standards track or experimental. The initial entries should be the following:
   PDU Code   PDU Name
   --------   -----------------------------
   0          HELLO
   1          OPEN
   2          KEEPALIVE
   3          ACK
   4          IPv4 Announce / Withdraw
   5          IPv6 Announce / Withdraw
   6          MPLS IPv4 Announce / Withdraw
   7          MPLS IPv6 Announce / Withdraw
   8-254      Reserved
   255        VENDOR
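For implementers, the initial registry contents could be mirrored as an enumeration, as in the sketch below. The member names and the classify helper are hypothetical. The helper assumes, as the PDU diagrams suggest, that the Type is the first octet of every PDU, and it reports reserved codes (8-254) as None rather than guessing a meaning for them.

```python
from enum import IntEnum


class PduType(IntEnum):
    """Initial L3DL-PDU-Type entries; member names are illustrative."""
    HELLO = 0
    OPEN = 1
    KEEPALIVE = 2
    ACK = 3
    IPV4 = 4
    IPV6 = 5
    MPLS_IPV4 = 6
    MPLS_IPV6 = 7
    VENDOR = 255


def classify(pdu: bytes):
    """Return the PduType of a received PDU, or None for a reserved code."""
    if not pdu:
        raise ValueError("empty PDU")
    try:
        # Assumption: the Type field is the first octet of the PDU.
        return PduType(pdu[0])
    except ValueError:
        return None
```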
This document requests the IANA create a registry for L3DL Signature Type, AKA Sig Type, which may range from 0 to 255. The name of the registry should be L3DL-Signature-Type. The policy for adding to the registry is RFC Required per , either standards track or experimental. The initial entries should be the following:
   Number   Name
   ------   -------------------------
   0        Null
   1        TOFU - Trust On First Use
   2-255    Reserved
This document requests the IANA create a registry for L3DL PL Flag Bits, which may range from 0 to 7. The name of the registry should be L3DL-PL-Flag-Bits. The policy for adding to the registry is RFC Required per , either standards track or experimental. The initial entries should be the following:
   Bit    Bit Name
   ----   -------------------
   0      Primary
   1      Loopback
   2-7    Reserved
This document requests the IANA create a registry for L3DL Error Codes, a 16 bit integer. The name of the registry should be L3DL-Error-Codes. The policy for adding to the registry is RFC Required per , either standards track or experimental. The initial entries should be the following:
   Error Code   Error Name
   ----------   --------------------------------
   0            Reserved
   1            Logical Link Addressing Conflict
   2            Authorisation Failure in OPEN
   3            Signature Failure in PDU
This document requires a new EtherType.
The authors thank Cristel Pelsser for multiple reviews, Jeff Haas for review and comments, Joe Clarke for a useful review, John Scudder for deeply serious review and comments, Larry Kreeger for a lot of layer 2 clue, Martijn Schmidt for his contribution, Neeraj Malhotra for review, Russ Housley for checksum discussion and sBox, and Steve Bellovin for checksum advice.
IANA Private Enterprise Numbers IEEE Standard for Local and Metropolitan Area Networks: Overview and Architecture IEEE Local and Metropolitan Area Networks: Overview and Architecture Institute of Electrical and Electronics Engineers