<?xml version='1.0' encoding='ascii'?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<?xml-stylesheet type="text/xsl" href="rfc2629.xslt" ?>
<?rfc strict="yes" ?>
<?rfc toc="yes"?>
<?rfc tocdepth="2"?>
<?rfc symrefs="yes"?>
<?rfc sortrefs="yes" ?>
<?rfc compact="yes" ?>
<?rfc subcompact="no" ?>
<rfc category="std" ipr="trust200902" docName="draft-cel-nfsv4-rpcrdma-version-two-02" obsoletes="" updates="" submissionType="IETF" xml:lang="en">
  <front>
    <title abbrev="RDMA Transport for RPC V2">RPC-over-RDMA Version Two Protocol </title>
    <author initials="C.L." surname="Lever" fullname="Charles Lever" role="editor">
      <organization abbrev="Oracle">Oracle Corporation </organization>
      <address>
        <postal>
          <street>1015 Granger Avenue</street>
          <city>Ann Arbor</city>
          <region>MI</region>
          <code>48104</code>
          <country>USA</country>
        </postal>
        <phone>+1 734 274 2396</phone>
        <email>chuck.lever@oracle.com</email>
      </address>
    </author>
    <author initials="D.N." surname="Noveck" fullname="David Noveck">
      <organization abbrev="HPE">Hewlett Packard Enterprise </organization>
      <address>
        <postal>
          <street>165 Dascomb Road</street>
          <city>Andover</city>
          <region>MA</region>
          <code>01810</code>
          <country>USA</country>
        </postal>
        <phone>+1 978 474 2011</phone>
        <email>davenoveck@gmail.com</email>
      </address>
    </author>
    <date/>
    <area>Transport</area>
    <workgroup>Network File System Version 4</workgroup>
    <keyword>NFS-Over-RDMA</keyword>
    <abstract>
      <t>This document specifies an improved protocol for conveying Remote Procedure Call (RPC) messages on physical transports capable of Remote Direct Memory Access (RDMA), based on RPC-over-RDMA Version One.  </t>
    </abstract>
    <note title="Requirements Language">
      <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in <xref target="RFC2119" pageno="false" format="default"/>.  </t>
    </note>
  </front>
  <middle>
    <section title="Introduction" toc="default">
      <t>Remote Direct Memory Access (RDMA) <xref target="RFC5040" pageno="false" format="default"/> <xref target="RFC5041" pageno="false" format="default"/> <xref target="IB" pageno="false" format="default"/> is a technique for moving data efficiently between end nodes.  By directing data into destination buffers as it is sent on a network and placing it via direct memory access by hardware, the complementary benefits of faster transfers and reduced host overhead are obtained.  </t>
      <t>A protocol already exists that enables ONC RPC <xref target="RFC5531" pageno="false" format="default"/> messages to be conveyed on RDMA transports.  That protocol is RPC-over-RDMA Version One, specified in <xref target="I-D.ietf-nfsv4-rfc5666bis" pageno="false" format="default"/>.  RPC-over-RDMA Version One is deployed and in use, though there are some shortcomings to this protocol, such as: <list style="symbols"><t>The use of small Receive buffers forces the use of RDMA Read and Write transfers for small payloads, and limits the size of backchannel messages.  </t><t>Lack of support for potential optimizations, such as remote invalidation, that require changes to on-the-wire behavior.  </t></list> </t>
      <t>To address these issues in a way that is compatible with existing RPC-over-RDMA Version One deployments, a new version of RPC-over-RDMA is presented in this document.  RPC-over-RDMA Version Two contains only incremental changes over RPC-over-RDMA Version One to facilitate adoption of Version Two by existing Version One implementations.  </t>
      <t>The major new feature in RPC-over-RDMA Version Two is extensibility of the RPC-over-RDMA header.  Extensibility enables narrow changes to RPC-over-RDMA Version Two so that new optional capabilities can be introduced without a protocol version change and while maintaining interoperability with existing implementations.  New capabilities can be proposed and developed independently of each other, and implementers can choose among them.  It should be straightforward to create and document experimental features and then bring them through the standards process.  </t>
      <t>In addition to extensibility, the default inline threshold value is larger in RPC-over-RDMA Version Two.  This change is driven by the increase in average size of RPC messages containing common NFS operations.  With NFSv4.1 <xref target="RFC5661" pageno="false" format="default"/> and later, compound operations convey more data per RPC message.  The default 1KB inline threshold in RPC-over-RDMA Version One prevents attaining the best possible performance.  </t>
      <t>Other new features include support for Remote Invalidation.  </t>
    </section>
    <section title="Inline Threshold" toc="default">
      <section title="Terminology" toc="default">
        <t>The term "inline threshold" is defined in Section 4 of <xref target="I-D.ietf-nfsv4-rfc5666bis" pageno="false" format="default"/>.  An "inline threshold" value is the largest message size (in octets) that can be conveyed in one direction on an RDMA connection using only RDMA Send and Receive.  Each connection has two inline threshold values: one for messages flowing from requester-to-responder (referred to as the "call inline threshold"), and one for messages flowing from responder-to-requester (referred to as the "reply inline threshold").  Inline threshold values are not advertised to peers via the base RPC-over-RDMA Version Two protocol.  </t>
        <t>A connection's inline threshold determines when RDMA Read or Write operations are required because the RPC message to be sent cannot be conveyed via RDMA Send and Receive.  When an RPC message does not contain DDP-eligible data items, a requester prepares a Long Call or Reply to convey the whole RPC message using RDMA Read or Write operations.  </t>
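        <t>As a non-normative illustration of the rule above, the following Python sketch compares the size of an encoded message against a connection's inline threshold to decide whether RDMA Read or Write operations are needed.  The function name is hypothetical and is not defined by this protocol; the sketch assumes that the Transport Header Stream and the RPC Payload Stream are conveyed together in a single RDMA Send, and it ignores DDP-eligible data items.  </t>
        <figure title="" suppress-title="false" align="left" alt="" width="" height="">
          <artwork xml:space="preserve" name="" type="" align="left" alt="" width="" height=""><![CDATA[
```python
def needs_long_message(header_len: int, payload_len: int,
                       inline_threshold: int) -> bool:
    """Return True when the message cannot be conveyed inline.

    Illustrative only: all lengths are in octets, and the header and
    payload are assumed to share one RDMA Send.
    """
    if min(header_len, payload_len, inline_threshold) < 0:
        raise ValueError("lengths must be non-negative")
    # A message fits inline only if its total size does not exceed
    # the inline threshold for the direction in which it is sent.
    return header_len + payload_len > inline_threshold
```
]]></artwork>
        </figure>
        <t>Under these assumptions, a 1036-octet message requires a Long Call or Reply with a 1024-octet threshold, but fits inline with a 4096-octet threshold.  </t>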
      </section>
      <section title="Motivation" toc="default">
        <t>RDMA Read and Write operations require that each data payload resides in a region of memory that is registered with the RNIC.  When an RPC is complete, that region is invalidated, fencing it from the responder.  </t>
        <t>Both registration and invalidation have a latency cost which is insignificant compared to data handling costs.  When a data payload is small, however, the cost of registering and invalidating the memory where the payload resides becomes a relatively significant part of total RPC latency.  Therefore the most efficient operation of RPC-over-RDMA occurs when RDMA Read and Write operations are used for large payloads, and avoided for small payloads.  </t>
        <t>When RPC-over-RDMA Version One was conceived, the typical size of RPC messages that did not involve a significant data payload was under 500 bytes.  A 1024-byte inline threshold adequately minimized the frequency of inefficient Long Calls and Replies.  </t>
        <t>Starting with NFSv4.1 <xref target="RFC5661" pageno="false" format="default"/>, NFS COMPOUND RPC messages are larger and more complex than before.  With a 1024-byte inline threshold, RDMA Read or Write operations are needed for frequent operations that do not bear a data payload, such as GETATTR and LOOKUP, reducing the efficiency of the transport.  </t>
        <t>To reduce the need to use Long Calls and Replies, RPC-over-RDMA Version Two increases the default inline threshold size.  This also increases the maximum size of backward direction RPC messages.  </t>
      </section>
      <section title="Default Values" toc="default">
        <t>RPC-over-RDMA Version Two receiver implementations MUST support an inline threshold of 4096 bytes, but MAY support larger inline threshold values.  A mechanism for discovering a peer's preferred inline threshold value (not defined in this document) may be used to optimize RDMA Send operations further.  In the absence of such a mechanism, senders MUST assume a receiver's inline threshold is 4096 bytes.  </t>
        <t>The new default inline threshold size is no larger than the size of a hardware page on typical platforms.  This conserves the resources needed to Send and Receive base level RPC-over-RDMA Version Two messages, enabling RPC-over-RDMA Version Two to be used on a broad variety of hardware.  </t>
      </section>
    </section>
    <section title="Remote Invalidation" toc="default">
      <t>An STag that is registered using the FRWR mechanism (in a privileged execution context), or is registered via a Memory Window (in user space), may be invalidated remotely <xref target="RFC5040" pageno="false" format="default"/>.  These mechanisms are available only when a requester's RNIC supports MEM_MGT_EXTENSIONS.  </t>
      <t>For the purposes of this discussion, there are two classes of STags.  Dynamically-registered STags are used in a single RPC, then invalidated.  Persistently-registered STags live longer than one RPC.  They may persist for the life of an RPC-over-RDMA connection, or longer.  </t>
      <t>An RPC-over-RDMA requester may provide more than one STag in one transport header.  It may provide a combination of dynamically- and persistently-registered STags in one RPC message, or any combination of these in a series of RPCs on the same connection.  Only dynamically-registered STags using Memory Windows or FRWR (i.e., registered via MEM_MGT_EXTENSIONS) may be invalidated remotely.  </t>
      <t>There is no transport-level mechanism by which a responder can determine how a requester-provided STag was registered, nor whether it is eligible to be invalidated remotely.  A requester that mixes persistently- and dynamically-registered STags in one RPC, or mixes them across RPCs on the same connection, must therefore indicate which handles may be invalidated via a mechanism provided in the Upper Layer Protocol.  RPC-over-RDMA Version Two provides such a mechanism.  </t>
      <t>The RDMA Send With Invalidate operation is used to invalidate an STag on a remote system.  It is available only when a responder's RNIC supports MEM_MGT_EXTENSIONS, and must be utilized only when a requester's RNIC supports MEM_MGT_EXTENSIONS (can receive and recognize an IETH).  </t>
      <section title="Backward-Direction Remote Invalidation" toc="default">
        <t>Existing RPC-over-RDMA protocol specifications <xref target="I-D.ietf-nfsv4-rfc5666bis" pageno="false" format="default"/> <xref target="I-D.ietf-nfsv4-rpcrdma-bidirection" pageno="false" format="default"/> do not forbid direct data placement in the backward-direction, even though there is currently no Upper Layer Protocol that may use it.  </t>
        <t>When chunks are present in a backward-direction RPC request, Remote Invalidation allows the responder to trigger invalidation of a requester's STags as part of sending a reply, the same as in the forward direction.  </t>
        <t>However, in the backward direction, the server acts as the requester, and the client is the responder.  The server's RNIC, therefore, must support receiving an IETH, and the server must have registered the STags with an appropriate registration mechanism.  </t>
      </section>
    </section>
    <section title="Protocol Extensibility" anchor="protocol-extensibility" toc="default">
      <t>The core RPC-over-RDMA Version Two header format is specified in <xref target="xdr-protocol-definition" pageno="false" format="default"/> as a complete and stand-alone piece of XDR.  Any change to this XDR description requires a protocol version number change.  </t>
      <section title="Optional Features" anchor="optional-features" toc="default">
        <t>RPC-over-RDMA Version Two introduces the ability to extend the core protocol via optional features.  Extensibility enables minor protocol issues to be addressed and incremental enhancements to be made without the need to change the protocol version.  The key capability is that both sides can detect whether a feature is supported by their peer or not.  With this ability, OPTIONAL features can be introduced over time to an otherwise stable protocol.  </t>
        <t>The rdma_opttype field carries a 32-bit unsigned integer.  The value in this field denotes an optional operation that MAY be supported by the receiver.  The values of this field and their meaning are defined in other Standards Track documents.  </t>
        <t>The rdma_optinfo field carries opaque data.  The content of this field is data meaningful to the optional operation denoted by the value in rdma_opttype.  The content of this field is not defined in the base RPC-over-RDMA Version Two protocol, but is defined in other Standards Track documents.  </t>
        <t>When an implementation does not recognize or support the value contained in the rdma_opttype field, it MUST send an RPC-over-RDMA message with the rdma_xid field set to the same value as the erroneous message, the rdma_proc field set to RDMA2_ERROR, and the rdma_err field set to RDMA2_ERR_INVAL_OPTION.  </t>
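        <t>The error response described above can be sketched as follows.  This non-normative Python example encodes the transport header as five big-endian 32-bit XDR words, using the numeric values from the XDR definition in this document.  The function name is hypothetical, and the credit value is left to the caller because this section does not specify it.  </t>
        <figure title="" suppress-title="false" align="left" alt="" width="" height="">
          <artwork xml:space="preserve" name="" type="" align="left" alt="" width="" height=""><![CDATA[
```python
import struct

RDMA2_ERROR = 4             # rpcrdma2_proc value
RDMA2_ERR_INVAL_OPTION = 5  # rpcrdma2_errcode value


def encode_inval_option_error(xid: int, vers: int, credit: int) -> bytes:
    """Encode an RDMA2_ERROR header carrying RDMA2_ERR_INVAL_OPTION.

    xid and vers are copied from the message that triggered the
    error.  The RDMA2_ERR_INVAL_OPTION arm of the union is void, so
    nothing follows the error code.
    """
    return struct.pack(">5I", xid, vers, credit,
                       RDMA2_ERROR, RDMA2_ERR_INVAL_OPTION)
```
]]></artwork>
        </figure>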
      </section>
      <section title="Message Direction" toc="default">
        <t>Backward direction operation depends on the ability of the receiver to distinguish between incoming forward and backward direction calls and replies.  This needs to be done because both the XID field and the flow control value (RPC-over-RDMA credits) in the RPC-over-RDMA header are interpreted in the context of each message's direction.  </t>
        <t>A receiver typically distinguishes message direction by examining the mtype field in the RPC header of each incoming payload message.  However, RDMA2_OPTIONAL type messages may not carry an RPC message payload.  </t>
        <t>To enable RDMA2_OPTIONAL type messages that do not carry an RPC message payload to be interpreted unambiguously, the rpcrdma2_optional structure contains a field that identifies the message direction.  A similar field has been added to the rpcrdma2_chunk_lists and rpcrdma2_error structures to simplify parsing the RPC-over-RDMA header at the receiver.  </t>
      </section>
      <section title="Documentation Requirements" toc="default">
        <t>RPC-over-RDMA Version Two may be extended by defining a new rdma_opttype value, and then by providing an XDR description of the rdma_optinfo content that corresponds with the new rdma_opttype value.  As a result, a new header type is effectively created.  </t>
        <t>A Standards Track document introduces each set of such protocol elements.  Together these elements are considered an OPTIONAL feature.  Each implementation is either aware of all the protocol elements introduced by that feature, or is aware of none of them.  </t>
        <t>Documents describing extensions to RPC-over-RDMA Version Two should contain: <list style="symbols"><t>An explanation of the purpose and use of each new protocol element added </t><t>An XDR description of the protocol elements, and a script to extract it </t><t>A mechanism for reporting errors when the error is outside the choices already available in the base protocol or in other extensions </t><t>An indication of whether a Payload stream must be present, and a description of its contents </t><t>A description of interactions with existing extensions </t></list> </t>
        <t>The last bullet includes requirements that another OPTIONAL feature needs to be present for new protocol elements to work, or that a particular level of support be provided for some particular facility for the new extension to work.  </t>
        <t>Implementers combine the XDR descriptions of the new features they intend to use with the XDR description of the base protocol in this document.  This may be necessary to create a valid XDR input file because extensions are free to use XDR types defined in the base protocol, and later extensions may use types defined by earlier extensions.  </t>
        <t>The XDR description for the RPC-over-RDMA Version Two protocol combined with that for any selected extensions should provide an adequate human-readable description of the extended protocol.  </t>
      </section>
    </section>
    <section title="XDR Protocol Definition" anchor="xdr-protocol-definition" toc="default">
      <t>This section contains a description of the core features of the RPC-over-RDMA Version Two protocol, expressed in the XDR language <xref target="RFC4506" pageno="false" format="default"/>.  </t>
      <t>This description is provided in a way that makes it simple to extract into ready-to-compile form.  The reader can apply the following shell script to this document to produce a machine-readable XDR description of the RPC-over-RDMA Version Two protocol without any OPTIONAL extensions.  </t>
      <figure title="" suppress-title="false" align="left" alt="" width="" height="">
        <artwork xml:space="preserve" name="" type="" align="left" alt="" width="" height="">

&lt;CODE BEGINS&gt;

#!/bin/sh
grep '^ *///' | sed 's?^ */// ??' | sed 's?^ *///$??'

&lt;CODE ENDS&gt;

</artwork>
      </figure>
      <t>That is, if the above script is stored in a file called "extract.sh" and this document is in a file called "spec.txt" then the reader can do the following to extract an XDR description file: </t>
      <figure title="" suppress-title="false" align="left" alt="" width="" height="">
        <artwork xml:space="preserve" name="" type="" align="left" alt="" width="" height="">

&lt;CODE BEGINS&gt;

sh extract.sh &lt; spec.txt &gt; rpcrdma_corev2.x

&lt;CODE ENDS&gt;

</artwork>
      </figure>
      <t>Optional extensions to RPC-over-RDMA Version Two, published as Standards Track documents, will have similar means of providing XDR that describes those extensions.  Once XDR for all desired extensions is also extracted, it can be appended to the XDR description file extracted from this document to produce a consolidated XDR description file reflecting all extensions selected for an RPC-over-RDMA implementation.  </t>
      <section title="Code Component License" toc="default">
        <t>Code components extracted from this document must include the following license text.  When the extracted XDR code is combined with other complementary XDR code which itself has an identical license, only a single copy of the license text need be preserved.  <figure title="" suppress-title="false" align="left" alt="" width="" height=""><artwork xml:space="preserve" name="" type="" align="left" alt="" width="" height="">

&lt;CODE BEGINS&gt;

/// /*
///  * Copyright (c) 2010, 2016 IETF Trust and the persons
///  * identified as authors of the code.  All rights reserved.
///  *
///  * The authors of the code are:
///  * B. Callaghan, T. Talpey, C. Lever, and D. Noveck.
///  *
///  * Redistribution and use in source and binary forms, with
///  * or without modification, are permitted provided that the
///  * following conditions are met:
///  *
///  * - Redistributions of source code must retain the above
///  *   copyright notice, this list of conditions and the
///  *   following disclaimer.
///  *
///  * - Redistributions in binary form must reproduce the above
///  *   copyright notice, this list of conditions and the
///  *   following disclaimer in the documentation and/or other
///  *   materials provided with the distribution.
///  *
///  * - Neither the name of Internet Society, IETF or IETF
///  *   Trust, nor the names of specific contributors, may be
///  *   used to endorse or promote products derived from this
///  *   software without specific prior written permission.
///  *
///  *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS
///  *   AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
///  *   WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
///  *   IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
///  *   FOR A PARTICULAR PURPOSE ARE DISCLAIMED.  IN NO
///  *   EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
///  *   LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
///  *   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
///  *   NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
///  *   SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
///  *   INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
///  *   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
///  *   OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
///  *   IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
///  *   ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
///  */

&lt;CODE ENDS&gt;

</artwork></figure> </t>
      </section>
      <section title="RPC-Over-RDMA Version Two XDR" toc="default">
        <t>The XDR defined in this section is used to encode the Transport Header Stream in each RPC-over-RDMA Version Two message.  The terms "Transport Header Stream" and "RPC Payload Stream" are defined in Section 4 of <xref target="I-D.ietf-nfsv4-rfc5666bis" pageno="false" format="default"/>.  <figure title="" suppress-title="false" align="left" alt="" width="" height=""><artwork xml:space="preserve" name="" type="" align="left" alt="" width="" height="">

&lt;CODE BEGINS&gt;

/// /* From RFC 5531, Section 9 */
/// enum msg_type {
///         CALL = 0,
///         REPLY = 1
/// };
///
/// struct rpcrdma2_segment {
///         uint32 rdma_handle;
///         uint32 rdma_length;
///         uint64 rdma_offset;
/// };
///
/// struct rpcrdma2_read_segment {
///         uint32                  rdma_position;
///         struct rpcrdma2_segment rdma_target;
/// };
///
/// struct rpcrdma2_read_list {
///         struct rpcrdma2_read_segment rdma_entry;
///         struct rpcrdma2_read_list    *rdma_next;
/// };
///
/// struct rpcrdma2_write_chunk {
///         struct rpcrdma2_segment rdma_target&lt;&gt;;
/// };
///
/// struct rpcrdma2_write_list {
///         struct rpcrdma2_write_chunk rdma_entry;
///         struct rpcrdma2_write_list  *rdma_next;
/// };
///
/// struct rpcrdma2_chunk_lists {
///         enum msg_type               rdma_direction;
///         uint32                      rdma_inv_handle;
///         struct rpcrdma2_read_list   *rdma_reads;
///         struct rpcrdma2_write_list  *rdma_writes;
///         struct rpcrdma2_write_chunk *rdma_reply;
/// };
///
/// enum rpcrdma2_errcode {
///         RDMA2_ERR_VERS = 1,
///         RDMA2_ERR_BAD_XDR = 2,
///         RDMA2_ERR_CANT_REPLY = 3,
///         RDMA2_ERR_INVAL_PROC = 4,
///         RDMA2_ERR_INVAL_OPTION = 5
/// };
///
/// struct rpcrdma2_err_vers {
///         uint32 rdma_vers_low;
///         uint32 rdma_vers_high;
/// };
///
/// struct rpcrdma2_err_reply {
///         bool   rdma_processed;
///         uint32 rdma_segment_index;
///         uint32 rdma_length_needed;
/// };
///
/// union rpcrdma2_error switch (rpcrdma2_errcode rdma_err) {
///         case RDMA2_ERR_VERS:
///           rpcrdma2_err_vers rdma_vrange;
///         case RDMA2_ERR_BAD_XDR:
///           void;
///         case RDMA2_ERR_CANT_REPLY:
///           rpcrdma2_err_reply rdma_reply;
///         case RDMA2_ERR_INVAL_PROC:
///           void;
///         case RDMA2_ERR_INVAL_OPTION:
///           void;
/// };
///
/// struct rpcrdma2_optional {
///         enum msg_type rdma_optdir;
///         uint32 rdma_opttype;
///         opaque rdma_optinfo&lt;&gt;;
/// };
///
/// enum rpcrdma2_proc {
///         RDMA2_MSG = 0,
///         RDMA2_NOMSG = 1,
///         RDMA2_ERROR = 4,
///         RDMA2_OPTIONAL = 5
/// };
///
/// union rpcrdma2_body switch (rpcrdma2_proc rdma_proc) {
///         case RDMA2_MSG:
///           rpcrdma2_chunk_lists rdma_chunks;
///         case RDMA2_NOMSG:
///           rpcrdma2_chunk_lists rdma_chunks;
///         case RDMA2_ERROR:
///           rpcrdma2_error rdma_error;
///         case RDMA2_OPTIONAL:
///           rpcrdma2_optional rdma_optional;
/// };
///
/// struct rpcrdma2_xprt_hdr {
///         uint32        rdma_xid;
///         uint32        rdma_vers;
///         uint32        rdma_credit;
///         rpcrdma2_body rdma_body;
/// };

&lt;CODE ENDS&gt;

</artwork></figure> </t>
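        <t>To illustrate how the definition above maps onto the wire, the following non-normative Python sketch encodes an rpcrdma2_xprt_hdr for an RDMA2_MSG message that carries no chunks.  It assumes standard XDR encoding <xref target="RFC4506" pageno="false" format="default"/>, in which each absent optional-data item (declared with "*") is encoded as a single zero word.  The function name is hypothetical.  </t>
        <figure title="" suppress-title="false" align="left" alt="" width="" height="">
          <artwork xml:space="preserve" name="" type="" align="left" alt="" width="" height=""><![CDATA[
```python
import struct

RDMA2_MSG = 0        # rpcrdma2_proc value
CALL, REPLY = 0, 1   # msg_type values


def encode_msg_header(xid: int, credit: int, direction: int) -> bytes:
    """Encode a Version Two RDMA2_MSG header with empty chunk lists."""
    if direction not in (CALL, REPLY):
        raise ValueError("direction must be CALL or REPLY")
    return struct.pack(
        ">9I",
        xid,         # rdma_xid
        2,           # rdma_vers
        credit,      # rdma_credit
        RDMA2_MSG,   # rdma_proc (union discriminant)
        direction,   # rdma_direction
        0,           # rdma_inv_handle: zero, no handle to invalidate
        0,           # rdma_reads: absent
        0,           # rdma_writes: absent
        0,           # rdma_reply: absent
    )
```
]]></artwork>
        </figure>
        <t>Under these assumptions the encoded header occupies 36 octets, and the RPC Payload Stream follows it in the Send buffer.  </t>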
        <section title="Presence Of Payload" toc="default">
          <t><list style="symbols"><t>When the rdma_proc field has the value RDMA2_MSG, an RPC Payload Stream MUST follow the Transport Header Stream in the Send buffer.  </t><t>When the rdma_proc field has the value RDMA2_ERROR, an RPC Payload Stream MUST NOT follow the Transport Header Stream.  </t><t>When the rdma_proc field has the value RDMA2_OPTIONAL, all, part of, or no RPC Payload Stream MAY follow the Transport Header Stream in the Send buffer.  </t></list> </t>
        </section>
        <section title="Message Direction" toc="default">
          <t>Implementations of RPC-over-RDMA Version Two are REQUIRED to support backward direction operation as described in <xref target="I-D.ietf-nfsv4-rpcrdma-bidirection" pageno="false" format="default"/>.  <list style="symbols"><t>When the rdma_proc field has the value RDMA2_MSG or RDMA2_NOMSG, the value of the rdma_direction field MUST be the same as the value of the associated RPC message's msg_type field.  </t><t>When the rdma_proc field has the value RDMA2_ERROR, the direction of the message is always Responder-to-Requester (REPLY).  </t><t>When the rdma_proc field has the value RDMA2_OPTIONAL and a whole or partial RPC message payload is present, the value of the rdma_optdir field MUST be the same as the value of the associated RPC message's msg_type field.  </t><t>When the rdma_proc field has the value RDMA2_OPTIONAL and no RPC message payload is present, a Requester MUST set the value of the rdma_optdir field to CALL, and a Responder MUST set the value of the rdma_optdir field to REPLY.  The Requester chooses a value for the rdma_xid field from the XID space that matches the message's direction.  Requesters and Responders set the rdma_credit field in a similar fashion: a value is set that is appropriate for the direction of the message.  </t></list> </t>
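          <t>The direction rules listed above can be summarized in a non-normative Python sketch.  The function and parameter names are hypothetical; rpc_msg_type is None when no RPC message is associated with the RPC-over-RDMA message.  </t>
          <figure title="" suppress-title="false" align="left" alt="" width="" height="">
            <artwork xml:space="preserve" name="" type="" align="left" alt="" width="" height=""><![CDATA[
```python
from typing import Optional

CALL, REPLY = 0, 1                        # msg_type values
RDMA2_MSG, RDMA2_NOMSG = 0, 1             # rpcrdma2_proc values
RDMA2_ERROR, RDMA2_OPTIONAL = 4, 5


def message_direction(proc: int, rpc_msg_type: Optional[int],
                      sender_is_requester: bool) -> int:
    """Return the direction a sender associates with a message."""
    if proc == RDMA2_ERROR:
        # Error messages always flow Responder-to-Requester.
        return REPLY
    if proc in (RDMA2_MSG, RDMA2_NOMSG):
        if rpc_msg_type is None:
            raise ValueError("an associated RPC message is required")
        return rpc_msg_type
    if proc == RDMA2_OPTIONAL:
        if rpc_msg_type is not None:
            # A whole or partial payload is present: mirror msg_type.
            return rpc_msg_type
        # No payload: the sender's role determines the direction.
        return CALL if sender_is_requester else REPLY
    raise ValueError("unrecognized rdma_proc value")
```
]]></artwork>
          </figure>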
        </section>
        <section title="Remote Invalidation" toc="default">
          <t>Among the set of handles in the RPC Call's transport header, the requester selects one handle that may be invalidated remotely.  The requester sets the rdma_inv_handle field to that value.  If none of the rdma_handle values in the Call may be invalidated by the responder, the requester MUST set the rdma_inv_handle field to the value zero.  The requester MUST NOT set the value of the rdma_inv_handle field to any other value.  </t>
          <t>The responder copies the value of the rdma_inv_handle field set by the requester to the rdma_inv_handle field in the matching reply.  If the rdma_inv_handle field contains zero, the responder MUST NOT use RDMA Send With Invalidate to transmit the matching RPC reply.  Otherwise, the responder SHOULD use RDMA Send With Invalidate to transmit the reply, specifying the value in the rdma_inv_handle field as the handle to be invalidated remotely.  The responder MUST NOT specify any other handle for this operation.  </t>
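          <t>A non-normative Python sketch of both sides of this exchange follows.  The function names and the string labels for the Send operations are hypothetical.  Picking the first eligible handle is only one possible policy, since the requester is free to select any one eligible handle.  </t>
          <figure title="" suppress-title="false" align="left" alt="" width="" height="">
            <artwork xml:space="preserve" name="" type="" align="left" alt="" width="" height=""><![CDATA[
```python
from typing import Iterable, Optional, Set, Tuple


def select_inv_handle(handles: Iterable[int],
                      remotely_invalidatable: Set[int]) -> int:
    """Requester side: choose the value of rdma_inv_handle."""
    for handle in handles:
        if handle in remotely_invalidatable:
            return handle
    # No handle in this Call may be invalidated by the responder.
    return 0


def choose_send_operation(inv_handle: int) -> Tuple[str, Optional[int]]:
    """Responder side: choose how to transmit the matching reply."""
    if inv_handle == 0:
        # Send With Invalidate MUST NOT be used for this reply.
        return ("send", None)
    # SHOULD use Send With Invalidate, naming only this handle.
    return ("send_with_invalidate", inv_handle)
```
]]></artwork>
          </figure>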
        </section>
        <section title="Transport Errors" toc="default">
          <t>Error handling works the same way in RPC-over-RDMA Version Two as it does in RPC-over-RDMA Version One, with the addition of several new error codes.  Version One error handling is described in Section 5 of <xref target="I-D.ietf-nfsv4-rfc5666bis" pageno="false" format="default"/>.  </t>
          <t>In all cases below, the sender copies the values of the rdma_xid and rdma_vers fields from the incoming transport header that generated the error to the transport header of the error response.  The rdma_proc field is set to RDMA2_ERROR.  <list style="hanging"><t hangText="RDMA2_ERR_VERS"><vspace blankLines="0"/> This is the equivalent of ERR_VERS in RPC-over-RDMA Version One.  The error code value, semantics, and utilization are the same.  </t><t hangText="RDMA2_ERR_INVAL_PROC"><vspace blankLines="0"/> This is a new error code in RPC-over-RDMA Version Two.  If a receiver recognizes the value in the rdma_vers field, but it does not recognize the value in the rdma_proc field, it MUST send RDMA2_ERR_INVAL_PROC.  </t><t hangText="RDMA2_ERR_BAD_XDR"><vspace blankLines="0"/> This is the equivalent of ERR_CHUNK in RPC-over-RDMA Version One, with a few extra restrictions; the error code value is the same.  If a receiver recognizes the value in the rdma_proc field but the incoming RPC-over-RDMA transport header cannot be parsed, it MUST send RDMA2_ERR_BAD_XDR before Upper Layer Protocol processing starts.  </t><t hangText="RDMA2_ERR_CANT_REPLY"><vspace blankLines="0"/> This is a new error code in RPC-over-RDMA Version Two.  If a message is otherwise correct but the requester has not provided enough Write or Reply chunk resources to transmit the reply, the responder MUST send RDMA2_ERR_CANT_REPLY.  The responder MUST set the rdma_processed field to TRUE if the responder discovered the shortage after the Upper Layer Protocol has finished processing the request; otherwise the field MUST be set to FALSE.  The responder MUST set the rdma_segment_index field to point to the first segment in the transport header that is too short, or to zero to indicate that it was not possible to determine which segment was too small.  Indexing starts at one (1), which represents the first segment in the first Write chunk (in either the Write list or Reply chunk).  
The responder MUST set the rdma_length_needed field to the number of bytes needed in that segment in order to convey the reply.  Upon receipt of this error code, a requester may choose to terminate the operation (for instance, if the responder set both fields above to zero), or it may send the request again using the same XID and larger reply resources.  </t><t hangText="RDMA2_ERR_INVAL_OPTION"><vspace blankLines="0"/> This is a new error code in RPC-over-RDMA Version Two.  A receiver MUST send RDMA2_ERR_INVAL_OPTION when an RDMA2_OPTIONAL message is received and the receiver does not recognize the value in the rdma_opttype field.  </t></list> </t>
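          <t>The order in which a receiver might apply the header checks described above is sketched below in non-normative Python.  The function and parameter names are hypothetical.  RDMA2_ERR_CANT_REPLY is omitted because it is detected while preparing the reply, not while examining the incoming transport header.  </t>
          <figure title="" suppress-title="false" align="left" alt="" width="" height="">
            <artwork xml:space="preserve" name="" type="" align="left" alt="" width="" height=""><![CDATA[
```python
from typing import Collection, Optional

RDMA2_ERR_VERS = 1
RDMA2_ERR_BAD_XDR = 2
RDMA2_ERR_INVAL_PROC = 4
RDMA2_ERR_INVAL_OPTION = 5

RDMA2_OPTIONAL = 5
KNOWN_PROCS = frozenset({0, 1, 4, 5})  # values of rpcrdma2_proc


def header_error(vers: int, proc: int, parsed_ok: bool,
                 opttype: Optional[int] = None,
                 known_opttypes: Collection[int] = (),
                 supported_versions: Collection[int] = (2,)
                 ) -> Optional[int]:
    """Return the error code to send, or None if the header is fine."""
    if vers not in supported_versions:
        return RDMA2_ERR_VERS
    if proc not in KNOWN_PROCS:
        return RDMA2_ERR_INVAL_PROC
    if not parsed_ok:
        # Reported before Upper Layer Protocol processing starts.
        return RDMA2_ERR_BAD_XDR
    if proc == RDMA2_OPTIONAL and opttype not in known_opttypes:
        return RDMA2_ERR_INVAL_OPTION
    return None
```
]]></artwork>
          </figure>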
        </section>
      </section>
    </section>
    <section title="Protocol Version Negotiation" toc="default">
      <t>When an RPC-over-RDMA Version Two requester establishes a connection to a responder, the first order of business is to determine the responder's highest supported protocol version.  </t>
      <t>As with RPC-over-RDMA Version One, a requester MUST assume the ability to exchange only a single RPC-over-RDMA message at a time until it receives a non-error RPC-over-RDMA message from the responder that reports the responder's actual credit limit.  </t>
      <t>First, the requester sends a single valid RPC-over-RDMA message with the value two (2) in the rdma_vers field.  Because the responder might support only RPC-over-RDMA Version One, this initial message can be no larger than the Version One default inline threshold of 1024 bytes.  </t>
      <section title="Responder Does Support RPC-over-RDMA Version Two" toc="default">
        <t>If the responder does support RPC-over-RDMA Version Two, it sends an RPC-over-RDMA message back to the requester with the same XID containing a valid non-error response.  Subsequently, both peers use the default inline threshold value for RPC-over-RDMA Version Two connections (4096 bytes).  </t>
      </section>
      <section title="Responder Does Not Support RPC-over-RDMA Version Two" toc="default">
        <t>If the responder does not support RPC-over-RDMA Version Two, <xref target="I-D.ietf-nfsv4-rfc5666bis" pageno="false" format="default"/> REQUIRES that it send an RPC-over-RDMA message to the requester with the same XID, with RDMA2_ERROR in the rdma_proc field, and with the error code RDMA2_ERR_VERS.  This message also reports a range of protocol versions that the responder supports.  To continue operation, the requester selects a protocol version in the range of responder-supported versions for subsequent messages on this connection.  </t>
        <t>If the connection is lost immediately after the RDMA2_ERROR reply is received, a requester can avoid a possible version negotiation loop when establishing a new connection by assuming that that particular responder does not support RPC-over-RDMA Version Two.  A requester can make the same assumption (no responder support for RPC-over-RDMA Version Two) if the initial negotiation message is lost or dropped.  </t>
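        <t>One way a requester might remember this assumption across reconnects is sketched below in Python.  The class, its method names, and the use of a string to identify a responder are all hypothetical; this document does not prescribe how, or for how long, a requester retains such a hint.  </t>
        <figure>
          <artwork><![CDATA[
```python
class VersionHintCache:
    """Remember responders assumed not to support Version Two."""

    def __init__(self):
        self._no_v2 = set()

    def note_no_v2_support(self, responder_id):
        # Call after an RDMA2_ERR_VERS reply, or when the initial
        # negotiation message was lost or dropped.
        self._no_v2.add(responder_id)

    def first_version_to_try(self, responder_id):
        # Skip Version Two negotiation for responders already
        # assumed to lack it; this avoids a negotiation loop.
        return 1 if responder_id in self._no_v2 else 2
```
]]></artwork>
        </figure>
        <t>A requester using such a cache would consult first_version_to_try() before each new connection and call note_no_v2_support() when negotiation fails in one of the ways described above.  </t>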
        <t>Once the negotiation exchange is complete, both peers use the default inline threshold value for the protocol version that will be used for the remainder of the connection lifetime.  Because the inline threshold can therefore change as a result of version negotiation, RPC-over-RDMA Version Two implementations MUST allow inline threshold values to change without triggering a connection loss.  </t>
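        <t>The relationship between the negotiated version and the inline threshold can be sketched in Python as follows.  The class and attribute names are hypothetical; the two default threshold values (1024 bytes for Version One, 4096 bytes for Version Two) are those stated in this section.  </t>
        <figure>
          <artwork><![CDATA[
```python
# Default inline thresholds in bytes, keyed by protocol version.
DEFAULT_INLINE_THRESHOLD = {1: 1024, 2: 4096}

class Connection:
    """Track the inline threshold across version negotiation."""

    def __init__(self):
        self.version = None
        # Until negotiation completes, assume the smaller
        # Version One default.
        self.inline_threshold = DEFAULT_INLINE_THRESHOLD[1]
        self.connected = True

    def complete_negotiation(self, version):
        # The threshold is updated in place; the connection is
        # not torn down when the value changes.
        self.version = version
        self.inline_threshold = DEFAULT_INLINE_THRESHOLD[version]

    def fits_inline(self, message_length):
        return message_length <= self.inline_threshold
```
]]></artwork>
        </figure>
        <t>In this sketch a 2048-byte message does not fit inline before negotiation completes, but does fit once Version Two has been selected, and the connection remains established throughout.  </t>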
      </section>
      <section title="Requester Does Not Support RPC-over-RDMA Version Two" toc="default">
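        <t><xref target="I-D.ietf-nfsv4-rfc5666bis" pageno="false" format="default"/> requires that a responder send Replies with the same RPC-over-RDMA protocol version that the requester uses to send its Calls.  A responder that supports both versions therefore replies using Version One to a requester that sends only Version One Calls.  </t>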
      </section>
    </section>
    <section title="Security Considerations" anchor="security-considerations" toc="default">
      <t>The security considerations for RPC-over-RDMA Version Two are the same as those for RPC-over-RDMA Version One.  </t>
    </section>
    <section title="IANA Considerations" anchor="iana-considerations" toc="default">
      <t>There are no IANA considerations at this time.  </t>
    </section>
  </middle>
  <back>
    <references title="Normative References">
      <reference anchor="RFC2119" target="http://www.rfc-editor.org/info/rfc2119">
        <front>
          <title>Key words for use in RFCs to Indicate Requirement Levels</title>
          <author initials="S." surname="Bradner" fullname="S. Bradner">
            <organization/>
          </author>
          <date year="1997" month="March"/>
          <abstract>
            <t>In many standards track documents several words are used to signify the requirements in the specification.  These words are often capitalized. This document defines these words as they should be interpreted in IETF documents.  This document specifies an Internet Best Current Practices for the Internet Community, and requests discussion and suggestions for improvements.</t>
          </abstract>
        </front>
        <seriesInfo name="BCP" value="14"/>
        <seriesInfo name="RFC" value="2119"/>
        <seriesInfo name="DOI" value="10.17487/RFC2119"/>
      </reference>
      <reference anchor="RFC4506" target="http://www.rfc-editor.org/info/rfc4506">
        <front>
          <title>XDR: External Data Representation Standard</title>
          <author initials="M." surname="Eisler" fullname="M. Eisler" role="editor">
            <organization/>
          </author>
          <date year="2006" month="May"/>
          <abstract>
            <t>This document describes the External Data Representation Standard (XDR) protocol as it is currently deployed and accepted.  This document obsoletes RFC 1832.  [STANDARDS-TRACK]</t>
          </abstract>
        </front>
        <seriesInfo name="STD" value="67"/>
        <seriesInfo name="RFC" value="4506"/>
        <seriesInfo name="DOI" value="10.17487/RFC4506"/>
      </reference>
      <reference anchor="RFC5531" target="http://www.rfc-editor.org/info/rfc5531">
        <front>
          <title>RPC: Remote Procedure Call Protocol Specification Version 2</title>
          <author initials="R." surname="Thurlow" fullname="R. Thurlow">
            <organization/>
          </author>
          <date year="2009" month="May"/>
          <abstract>
            <t>This document describes the Open Network Computing (ONC) Remote Procedure Call (RPC) version 2 protocol as it is currently deployed and accepted.  This document obsoletes RFC 1831.   [STANDARDS-TRACK]</t>
          </abstract>
        </front>
        <seriesInfo name="RFC" value="5531"/>
        <seriesInfo name="DOI" value="10.17487/RFC5531"/>
      </reference>
    </references>
    <references title="Informative References">
      <reference anchor="I-D.ietf-nfsv4-rfc5666bis">
        <front>
          <title>Remote Direct Memory Access Transport for Remote Procedure Call, Version One</title>
          <author initials="C" surname="Lever" fullname="Chuck Lever">
            <organization/>
          </author>
          <author initials="W" surname="Simpson" fullname="William Simpson">
            <organization/>
          </author>
          <author initials="T" surname="Talpey" fullname="Tom Talpey">
            <organization/>
          </author>
          <date month="May" day="27" year="2016"/>
          <abstract>
            <t>This document specifies a protocol for conveying Remote Procedure Call (RPC) messages on physical transports capable of Remote Direct Memory Access (RDMA).  It requires no revision to application RPC protocols or the RPC protocol itself.  This document obsoletes RFC 5666.</t>
          </abstract>
        </front>
        <seriesInfo name="Internet-Draft" value="draft-ietf-nfsv4-rfc5666bis-07"/>
        <format type="TXT" target="http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-rfc5666bis-07.txt"/>
        <format type="PS" target="http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-rfc5666bis-07.ps"/>
        <format type="PDF" target="http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-rfc5666bis-07.pdf"/>
      </reference>
      <reference anchor="I-D.ietf-nfsv4-rpcrdma-bidirection">
        <front>
          <title>Bi-directional Remote Procedure Call On RPC-over-RDMA Transports</title>
          <author initials="C" surname="Lever" fullname="Chuck Lever">
            <organization/>
          </author>
          <date month="June" day="9" year="2016"/>
          <abstract>
            <t>Minor versions of NFSv4 newer than NFSv4.0 work best when ONC RPC transports can send Remote Procedure Call transactions in both directions on the same connection.  This document describes how RPC-over-RDMA transport endpoints convey RPCs in both directions on a single connection.</t>
          </abstract>
        </front>
        <seriesInfo name="Internet-Draft" value="draft-ietf-nfsv4-rpcrdma-bidirection-05"/>
        <format type="TXT" target="http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-rpcrdma-bidirection-05.txt"/>
        <format type="PS" target="http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-rpcrdma-bidirection-05.ps"/>
        <format type="PDF" target="http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-rpcrdma-bidirection-05.pdf"/>
      </reference>
      <reference anchor="IB" target="http://www.infinibandta.org">
        <front>
          <title>InfiniBand Architecture Specifications</title>
          <author>
            <organization>InfiniBand Trade Association</organization>
          </author>
          <date/>
        </front>
      </reference>
      <reference anchor="RFC5040" target="http://www.rfc-editor.org/info/rfc5040">
        <front>
          <title>A Remote Direct Memory Access Protocol Specification</title>
          <author initials="R." surname="Recio" fullname="R. Recio">
            <organization/>
          </author>
          <author initials="B." surname="Metzler" fullname="B. Metzler">
            <organization/>
          </author>
          <author initials="P." surname="Culley" fullname="P. Culley">
            <organization/>
          </author>
          <author initials="J." surname="Hilland" fullname="J. Hilland">
            <organization/>
          </author>
          <author initials="D." surname="Garcia" fullname="D. Garcia">
            <organization/>
          </author>
          <date year="2007" month="October"/>
          <abstract>
            <t>This document defines a Remote Direct Memory Access Protocol (RDMAP) that operates over the Direct Data Placement Protocol (DDP protocol).  RDMAP provides read and write services directly to applications and enables data to be transferred directly into Upper Layer Protocol (ULP) Buffers without intermediate data copies.  It also enables a kernel bypass implementation.  [STANDARDS-TRACK]</t>
          </abstract>
        </front>
        <seriesInfo name="RFC" value="5040"/>
        <seriesInfo name="DOI" value="10.17487/RFC5040"/>
      </reference>
      <reference anchor="RFC5041" target="http://www.rfc-editor.org/info/rfc5041">
        <front>
          <title>Direct Data Placement over Reliable Transports</title>
          <author initials="H." surname="Shah" fullname="H. Shah">
            <organization/>
          </author>
          <author initials="J." surname="Pinkerton" fullname="J. Pinkerton">
            <organization/>
          </author>
          <author initials="R." surname="Recio" fullname="R. Recio">
            <organization/>
          </author>
          <author initials="P." surname="Culley" fullname="P. Culley">
            <organization/>
          </author>
          <date year="2007" month="October"/>
          <abstract>
            <t>The Direct Data Placement protocol provides information to Place the incoming data directly into an upper layer protocol's receive buffer without intermediate buffers.  This removes excess CPU and memory utilization associated with transferring data through the intermediate buffers.  [STANDARDS-TRACK]</t>
          </abstract>
        </front>
        <seriesInfo name="RFC" value="5041"/>
        <seriesInfo name="DOI" value="10.17487/RFC5041"/>
      </reference>
      <reference anchor="RFC5661" target="http://www.rfc-editor.org/info/rfc5661">
        <front>
          <title>Network File System (NFS) Version 4 Minor Version 1 Protocol</title>
          <author initials="S." surname="Shepler" fullname="S. Shepler" role="editor">
            <organization/>
          </author>
          <author initials="M." surname="Eisler" fullname="M. Eisler" role="editor">
            <organization/>
          </author>
          <author initials="D." surname="Noveck" fullname="D. Noveck" role="editor">
            <organization/>
          </author>
          <date year="2010" month="January"/>
          <abstract>
            <t>This document describes the Network File System (NFS) version 4 minor version 1, including features retained from the base protocol (NFS version 4 minor version 0, which is specified in RFC 3530) and protocol extensions made subsequently.  Major extensions introduced in NFS version 4 minor version 1 include Sessions, Directory Delegations, and parallel NFS (pNFS).  NFS version 4 minor version 1 has no dependencies on NFS version 4 minor version 0, and it is considered a separate protocol.  Thus, this document neither updates nor obsoletes RFC 3530.  NFS minor version 1 is deemed superior to NFS minor version 0 with no loss of functionality, and its use is preferred over version 0.  Both NFS minor versions 0 and 1 can be used simultaneously on the same network, between the same client and server.  [STANDARDS-TRACK]</t>
          </abstract>
        </front>
        <seriesInfo name="RFC" value="5661"/>
        <seriesInfo name="DOI" value="10.17487/RFC5661"/>
      </reference>
      <reference anchor="RFC5662" target="http://www.rfc-editor.org/info/rfc5662">
        <front>
          <title>Network File System (NFS) Version 4 Minor Version 1 External Data Representation Standard (XDR) Description</title>
          <author initials="S." surname="Shepler" fullname="S. Shepler" role="editor">
            <organization/>
          </author>
          <author initials="M." surname="Eisler" fullname="M. Eisler" role="editor">
            <organization/>
          </author>
          <author initials="D." surname="Noveck" fullname="D. Noveck" role="editor">
            <organization/>
          </author>
          <date year="2010" month="January"/>
          <abstract>
            <t>This document provides the External Data Representation Standard (XDR) description for Network File System version 4 (NFSv4) minor version 1.  [STANDARDS-TRACK]</t>
          </abstract>
        </front>
        <seriesInfo name="RFC" value="5662"/>
        <seriesInfo name="DOI" value="10.17487/RFC5662"/>
      </reference>
      <reference anchor="RFC5666" target="http://www.rfc-editor.org/info/rfc5666">
        <front>
          <title>Remote Direct Memory Access Transport for Remote Procedure Call</title>
          <author initials="T." surname="Talpey" fullname="T. Talpey">
            <organization/>
          </author>
          <author initials="B." surname="Callaghan" fullname="B. Callaghan">
            <organization/>
          </author>
          <date year="2010" month="January"/>
          <abstract>
            <t>This document describes a protocol providing Remote Direct Memory Access (RDMA) as a new transport for Remote Procedure Call (RPC).  The RDMA transport binding conveys the benefits of efficient, bulk-data transport over high-speed networks, while providing for minimal change to RPC applications and with no required revision of the application RPC protocol, or the RPC protocol itself.  [STANDARDS-TRACK]</t>
          </abstract>
        </front>
        <seriesInfo name="RFC" value="5666"/>
        <seriesInfo name="DOI" value="10.17487/RFC5666"/>
      </reference>
    </references>
    <section title="Acknowledgments" toc="default">
      <t>The authors gratefully acknowledge the work of Brent Callaghan and Tom Talpey on the original RPC-over-RDMA Version One specification <xref target="RFC5666" pageno="false" format="default"/>.  The authors also wish to thank Bill Baker, Greg Marsden, and Matt Benjamin for their support of this work.  </t>
      <t>The extract.sh shell script and formatting conventions were first described by the authors of the NFSv4.1 XDR specification <xref target="RFC5662" pageno="false" format="default"/>.  </t>
      <t>Special thanks go to nfsv4 Working Group Chair Spencer Shepler and nfsv4 Working Group Secretary Thomas Haynes for their support.  </t>
    </section>
  </back>
</rfc>
