<?xml version="1.0" encoding="US-ASCII"?>
<!-- This template is for creating an Internet Draft using xml2rfc,
     which is available here: http://xml2rfc.tools.ietf.org. -->
<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
<!-- One method to get references from the online citation libraries.
     There has to be one entity for each item to be referenced. 
     An alternate method (rfc include) is described in the references. -->

<!ENTITY RFC2119 SYSTEM "http://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.2119.xml">
<!ENTITY RFC2629 SYSTEM "http://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.2629.xml">
<!ENTITY RFC3552 SYSTEM "http://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.3552.xml">
<!ENTITY I-D.narten-iana-considerations-rfc2434bis SYSTEM "http://xml2rfc.tools.ietf.org/public/rfc/bibxml3/reference.I-D.narten-iana-considerations-rfc2434bis.xml">
]>
<?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?>
<!-- used by XSLT processors -->
<!-- For a complete list and description of processing instructions (PIs), 
     please see http://xml2rfc.tools.ietf.org/authoring/README.html. -->
<!-- Below are generally applicable Processing Instructions (PIs) that most I-Ds might want to use.
     (Here they are set differently than their defaults in xml2rfc v1.32) -->
<?rfc strict="yes" ?>
<!-- give errors regarding ID-nits and DTD validation -->
<!-- control the table of contents (ToC) -->
<?rfc toc="yes"?>
<!-- generate a ToC -->
<?rfc tocdepth="4"?>
<!-- the number of levels of subsections in ToC. default: 3 -->
<!-- control references -->
<?rfc symrefs="yes"?>
<!-- use symbolic references tags, i.e, [RFC2119] instead of [1] -->
<?rfc sortrefs="yes" ?>
<!-- sort the reference entries alphabetically -->
<!-- control vertical white space 
     (using these PIs as follows is recommended by the RFC Editor) -->
<?rfc compact="yes" ?>
<!-- do not start each main section on a new page -->
<?rfc subcompact="no" ?>
<!-- keep one blank line between list items -->
<!-- end of list of popular I-D processing instructions -->
<rfc category="info" ipr="trust200902" docName="draft-bhagat-bgp-multiple-nexthops-00" obsoletes="" updates="" submissionType="IETF" xml:lang="en">
  <!-- category values: std, bcp, info, exp, and historic
     ipr values: full3667, noModification3667, noDerivatives3667
     you can add the attributes updates="NNNN" and obsoletes="NNNN" 
     they will automatically be output with "(if approved)" -->

  <!-- ***** FRONT MATTER ***** -->

  <front>
    <!-- The abbreviated title is used in the page header - it is only necessary if the 
         full title is longer than 39 characters -->

    <title abbrev="draft-bhagat-bgp-multiple-nexthops">BGP Multiple Nexthops</title>

    <!-- add 'role="editor"' below for the editors if appropriate -->


    <author fullname="Amit Bhagat" initials="A."
            surname="Bhagat">
      <organization>Amazon</organization>

      <address>
        <postal>
          <street></street>

          <!-- Reorder these if your country does things differently -->

          <city>Seattle</city>

          <region></region>

          <code></code>

          <country>USA</country>
        </postal>

        <phone></phone>

        <email>abhagat@amazon.com</email>

        <!-- uri and facsimile elements may also be added -->
      </address>
    </author>

    <date month="March" year="2021" />

    <!-- If the month and year are both specified and are the current ones, xml2rfc will fill 
         in the current day for you. If only the current year is specified, xml2rfc will fill 
	 in the current day and month for you. If the year is not the current one, it is 
	 necessary to specify at least a month (xml2rfc assumes day="1" if not specified for the 
	 purpose of calculating the expiry date).  With drafts it is normally sufficient to 
	 specify just the year. -->

    <!-- Meta-data Declarations -->

    <area>Routing</area>

    <workgroup>IDR Working Group</workgroup>

    <!-- WG name at the upperleft corner of the doc,
         IETF is fine for individual submissions.  
	 If this element is not present, the default is "Network Working Group",
         which is used by the RFC Editor as a nod to the history of the IETF. -->

    <keyword>Internet Draft</keyword>
    <keyword>BGP</keyword>

    <!-- Keywords will be incorporated into HTML output
         files in a meta tag but they have no effect on text or nroff
         output. If you submit your draft to the RFC Editor, the
         keywords will be used for the search engine. -->

    <abstract>
      <t>This document defines a BGP extension that lets a pair of speakers group the multiple BGP sessions between them and advertise multiple next hops for a single prefix over one session of the group. This avoids sending and receiving duplicate routes on every session.</t>
    </abstract>
  </front>

  <middle>
    <section title="Introduction">
      <t>In data center networks where Clos fabrics are built solely using BGP <xref target="RFC4271"></xref>, it is very common for a pair of routers to have multiple BGP sessions between them, one BGP session per link. Each BGP session is independent of the others, and BGP messages are sent and received over every session. There are various reasons for this design pattern. The main one is that when the links are instead bundled into a LAG and links within the LAG go down, the available bandwidth changes but the change is not reflected at the routing level. This breaks capacity models and can result in network congestion.</t>
      <t>While maintaining these independent BGP sessions is trivial, sending and receiving duplicate BGP UPDATE messages for hundreds or thousands of routes means that routers generate, process, and store routes unnecessarily. The duplicate messages carry no extra information beyond enabling the receiver to select and install multiple paths for a route. Every copy of a route in these BGP UPDATE messages has the same BGP path attributes except for the NEXT_HOP attribute.</t>
      <t>This document provides a way to advertise a route only once, with multiple NEXT_HOP attributes, and achieve the same result as advertising the route multiple times over multiple BGP sessions with a different NEXT_HOP attribute each time.</t>
    </section>

    <section anchor="capability_support" title="Capability Support" toc="default">
      <t>A new capability is carried in the Capabilities Optional Parameter of the BGP OPEN message. A BGP speaker SHOULD use the Capabilities Advertisement procedure in <xref target="RFC3392"></xref> to announce support for BGP Multiple Nexthops. The Capability Code is to be assigned by IANA.</t>

      <figure align="center" anchor="capability" title="BGP Multiple Nexthops Capability">
        <artwork align="left"><![CDATA[

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |     Type      |    Length     |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |      AFI                      |  Reserved     |  SAFI         |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    ~                                                               ~
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |      AFI                      |  Reserved     |  SAFI         |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      ]]></artwork>
      </figure>

      <t>Capability Type: TBA by IANA</t>

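      <t>Capability Length: Variable. Each AFI/SAFI entry occupies 4 octets (AFI 2 octets, Reserved 1 octet, SAFI 1 octet), so the length is a multiple of 4.</t>

      <t>Capability Value: Lists every AFI/SAFI pair configured on the BGP speaker for which the feature is supported.</t>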
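
      <t>As an illustration only, the following Python sketch shows one way to encode and decode the capability under the layout in <xref target="capability"></xref>. The code value 239 is a hypothetical placeholder (the real Capability Code is to be assigned by IANA), the Reserved octet is assumed to be sent as zero, and the function names are not defined by this document.</t>

      <figure>
        <artwork><![CDATA[
```python
import struct

# Hypothetical placeholder: the real Capability Code is TBA by IANA.
CAP_CODE_PLACEHOLDER = 239


def encode_capability(afi_safis, code=CAP_CODE_PLACEHOLDER):
    """Pack (AFI, SAFI) pairs as 4-octet AFI/Reserved/SAFI entries."""
    # Assumption: the Reserved octet is sent as zero.
    value = b"".join(
        struct.pack("!HBB", afi, 0, safi) for afi, safi in afi_safis
    )
    if len(value) > 255:
        raise ValueError("value does not fit a one-octet Length")
    return struct.pack("!BB", code, len(value)) + value


def decode_capability(data):
    """Return (code, [(AFI, SAFI), ...]) from encoded octets."""
    code, length = struct.unpack_from("!BB", data, 0)
    value = data[2:2 + length]
    if len(value) != length or length % 4 != 0:
        raise ValueError("malformed capability value")
    pairs = []
    for offset in range(0, length, 4):
        afi, _, safi = struct.unpack_from("!HBB", value, offset)
        pairs.append((afi, safi))
    return code, pairs
```
      ]]></artwork>
      </figure>

      <t>With two entries, for example IPv4 unicast (AFI 1, SAFI 1) and IPv6 unicast (AFI 2, SAFI 1), the sketch emits the Type and Length octets followed by two 4-octet entries, so the Length field is 8.</t>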
    </section>

    <section anchor="operation" title="Operation" toc="default">
      <t>During BGP session establishment, the BGP Multiple Nexthops capability is advertised and received in the BGP OPEN message for every supported AFI/SAFI. When two BGP speakers have multiple BGP sessions between them and both support the BGP Multiple Nexthops capability, the BGP Identifier and AS number of the peer are used to identify all BGP sessions that can be logically grouped together per AFI/SAFI. A BGP UPDATE message sent over one BGP session applies to all other BGP sessions within the logical group. The BGP peers MUST share the same configuration settings to be treated as a group on the speaker.</t>
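
      <t>The grouping rule can be sketched in Python as follows. This is a minimal illustration, not a normative procedure: the Session type and its field names are hypothetical, and the check that grouped peers share the same configuration settings is left out.</t>

      <figure>
        <artwork><![CDATA[
```python
from collections import defaultdict
from typing import NamedTuple


class Session(NamedTuple):
    # Hypothetical model of an established session; the field
    # names are illustrative and not defined by this document.
    local_addr: str
    peer_bgp_id: str
    peer_as: int
    afi_safis: frozenset  # AFI/SAFI pairs both speakers support


def group_sessions(sessions):
    """Key sessions by (peer BGP Identifier, peer AS, AFI, SAFI)."""
    groups = defaultdict(list)
    for s in sessions:
        for afi, safi in sorted(s.afi_safis):
            key = (s.peer_bgp_id, s.peer_as, afi, safi)
            groups[key].append(s)
    return dict(groups)
```
      ]]></artwork>
      </figure>

      <t>Two sessions to the same peer that both carry IPv4 unicast land in the same group, while a session to a peer with a different BGP Identifier or AS number forms its own group.</t>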

      <t>When a BGP UPDATE message is advertised, the rules for the next-hop information are as follows:</t>

       <t hangText="short">When sending a message to an external peer:
         <list style="symbols">
           <t>The BGP speaker SHOULD add multiple NEXT_HOP attributes, each carrying the IP address of an interface that the speaker uses to establish a BGP session with the peer.</t>
         </list>
       </t>
      

      <t hangText="short">When sending a message to an internal peer:

       <list style="symbols">
         <t>If the route is not locally originated, the BGP speaker SHOULD NOT modify the NEXT_HOP attributes unless it has been explicitly configured to announce its own IP address(es) as next-hop(s).</t>

         <t>If the route is locally originated, the BGP speaker SHOULD add multiple NEXT_HOP attributes, each carrying the IP address of an interface that the speaker uses to establish a BGP session with the peer.</t>
       </list>
      </t>
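
      <t>The next-hop rules for external and internal peers can be summarized in a short Python sketch. The function and parameter names are hypothetical; next_hop_self stands for the explicit configuration to announce the speaker's own IP address(es), and group_local_addrs is assumed to hold the local interface address of each session in the logical group.</t>

      <figure>
        <artwork><![CDATA[
```python
def next_hops_to_advertise(peer_is_external, locally_originated,
                           received_next_hops, group_local_addrs,
                           next_hop_self=False):
    """Return the NEXT_HOP values to attach to an UPDATE."""
    if peer_is_external or locally_originated or next_hop_self:
        # One NEXT_HOP per session in the group: the local
        # interface address used to establish that session.
        return list(group_local_addrs)
    # Internal peer, route not locally originated: keep as is.
    return list(received_next_hops)
```
      ]]></artwork>
      </figure>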

      <t hangText="short">A message that withdraws routes does not carry next-hop information. On receiving one, the peer SHOULD remove the route together with all NEXT_HOP attributes attached to it, even when the withdraw message arrives over a different BGP session than the one over which the original update message was sent.</t>

      <t hangText="short">When a link or BGP session that belongs to the logical group goes down, the speakers SHOULD remove only the NEXT_HOP associated with that link or session from the affected routes, rather than the routes themselves.</t>
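
      <t>The difference between a withdraw and a session failure can be sketched as follows. The GroupRib class is hypothetical and models only the routes learned from one logical group. Deleting a route once its last NEXT_HOP is gone is an assumption of this sketch and is not specified by this document.</t>

      <figure>
        <artwork><![CDATA[
```python
class GroupRib:
    """Routes learned from one logical group (illustrative)."""

    def __init__(self):
        self.routes = {}  # prefix -> set of NEXT_HOP addresses

    def update(self, prefix, next_hops):
        self.routes[prefix] = set(next_hops)

    def withdraw(self, prefix):
        # A withdraw on any session of the group removes the
        # route together with all of its NEXT_HOPs.
        self.routes.pop(prefix, None)

    def session_down(self, next_hop):
        # Drop only the NEXT_HOP tied to the failed session.
        for prefix in list(self.routes):
            self.routes[prefix].discard(next_hop)
            if not self.routes[prefix]:
                # Assumption: no NEXT_HOP left, route unusable.
                del self.routes[prefix]
```
      ]]></artwork>
      </figure>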

      <t hangText="short">Note that a BGP UPDATE message is sent over a single BGP session in the logical group. For example, if there are 8 independent BGP sessions between two speakers, the speaker chooses only 1 of the 8 sessions over which to send the BGP UPDATE message. The speaker can choose the session at random, in round-robin fashion, or by some other means; the selection method is out of scope for this document.</t>
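
      <t>Purely as an example of one selection method, round-robin selection over the sessions of a logical group could look like the Python sketch below. The class name is hypothetical, and an implementation is free to choose any other method.</t>

      <figure>
        <artwork><![CDATA[
```python
import itertools


class RoundRobinSelector:
    """Pick one session of a logical group per UPDATE message."""

    def __init__(self, sessions):
        if not sessions:
            raise ValueError("logical group has no sessions")
        self._cycle = itertools.cycle(list(sessions))

    def pick(self):
        return next(self._cycle)
```
      ]]></artwork>
      </figure>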
    </section>

    <section anchor="multiprotocol" title="Multiprotocol Extensions" toc="default">
      <t><xref target="RFC4760"></xref> defines the MP_REACH_NLRI path attribute, which carries routes together with their next-hop information. Details of the next-hop information for MP_REACH_NLRI are given in Section 3 of <xref target="RFC4760"></xref>. This document allows adding multiple NEXT_HOP attributes when advertising routes with the MP_REACH_NLRI path attribute, using the same mechanism described in <xref target="operation"></xref>.</t>
    </section>

    <section anchor="IANA" title="IANA Considerations">
      <t>IANA is requested to assign a new Capability Code for the BGP Multiple Nexthops capability.</t>
    </section>

    <section anchor="ack" title="Acknowledgements">
      <t>The author would like to thank the members of the IDR Working Group for their review and comments.</t>
    </section>
  </middle>

  <!--  *****BACK MATTER ***** -->

  <back>
    <references title="Normative References">
      <!-- References are pulled in with the rfc include processing instruction. -->
      <?rfc include="reference.RFC.4271.xml"?>
      <?rfc include="reference.RFC.3392.xml"?>
      <?rfc include="reference.RFC.4760.xml"?>
    </references>
  </back>
</rfc>