Ganesha and NFSv4.0/4.1

Table of Contents
- Ganesha NFSv4 and NFSv41 Compliance
- NFSv4.0 (RFC 3530)
- SETCLIENTID/SETCLIENTID_CONFIRM
- NFSv4.0 Callbacks
- NFSv4.1 (RFC 5661)
- NFSv4.1 State
- Grace Period, Recovery, and Reclaim Rules
- Lease Period
- NFSv4.1 State serialization
- Wraparound
- Exactly Once Semantics (EOS)
- Current/Saved stateid
- OPEN_DOWNGRADE
- FREE_STATEID
- TEST_STATEID
- OPEN with EXCLUSIVE4_1
- GSS
- Secure State Verifier (SSV)
- SECINFO
- SECINFO_NO_NAME
- NFSv4.1 SESSION Support
- CREATE_SESSION
- CREATE_SESSION Attributes
- BIND_CONN_TO_SESSION
- EXCHANGE_ID
- DESTROY_CLIENTID
- Trunking
- NFSv4.1 callbacks
- Optional features we know we wish to support
- pNFS
- Callbacks (15.3, including CB_LAYOUTRECALL)
- Delegations (10.2)
- Referrals (11.4.3)
- Persistent Sessions
- Optional features we have no immediate plans to support
- Retention Attributes
- NFSv4.1 RDMA and RDMA transport integration
- General Stability and Performance Requirements
- Cache Inode
- Dirent cache reimplementation
- Cache Inode GC
- Cache Inode Invalidation Upcalls
- RPC
- Zero Copy
- State Management

Ganesha NFSv4 and NFSv41 Compliance

NFSv4.0 (RFC 3530)

all required operations
state-bearing operations (but they may not be correct)
1. in particular, sequence id management; Philippe and IBM seem to be fixing this
rpcsec_gss security -- supported, IBM has been testing this
unicode (some prescriptions in NFSv4 now deprecated for 4.1 and in 3530bis)

SETCLIENTID/SETCLIENTID_CONFIRM

The clientid is formed by hashing the client.id, counter to RFC 3530's statement that the server "must take care to ensure that these values are extremely unlikely to ever be regenerated." Collisions are detected by comparing the recorded and generated clientids rather than the provided and recorded client.id arguments, making spurious errors and state disposals likely.

Mutating clientid records, rather than generating new unconfirmed records on SETCLIENTID and replacing old, confirmed records on SETCLIENTID_CONFIRM, removes much of the robustness in the mechanism and violates the implementation guidelines in the spec.
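
The record handling described above can be sketched as follows. This is a minimal illustration, not Ganesha code: the `ClientTable` class, its method names, and the boot-time-plus-counter clientid layout are all assumptions made for the example, and principal checks and callback information are left out.

```python
import itertools
import time


class ClientTable:
    """Toy SETCLIENTID / SETCLIENTID_CONFIRM bookkeeping (hypothetical names)."""

    def __init__(self, boot_time=None):
        # Assumed layout: high 32 bits are the server boot time, low 32 bits a
        # per-boot counter, so a clientid is unlikely to be regenerated.
        self.boot_time = int(time.time() if boot_time is None else boot_time) & 0xFFFFFFFF
        self.counter = itertools.count(1)
        self.unconfirmed = {}  # client.id -> (clientid, verifier)
        self.confirmed = {}    # client.id -> (clientid, verifier)
        self.released = []     # clientids whose state should be released

    def _new_clientid(self):
        return (self.boot_time << 32) | (next(self.counter) & 0xFFFFFFFF)

    def setclientid(self, client_id, verifier):
        # Always create a fresh unconfirmed record, keyed by the client.id the
        # client supplied; the confirmed record is left untouched.
        clientid = self._new_clientid()
        self.unconfirmed[client_id] = (clientid, verifier)
        return clientid

    def setclientid_confirm(self, client_id, clientid):
        record = self.unconfirmed.get(client_id)
        if record is None or record[0] != clientid:
            return "NFS4ERR_STALE_CLIENTID"
        old = self.confirmed.get(client_id)
        if old is not None and old[1] != record[1]:
            # A changed verifier means a new client incarnation: the state
            # held under the old clientid must be released.
            self.released.append(old[0])
        # Only now does the new record replace the old confirmed one.
        self.confirmed[client_id] = self.unconfirmed.pop(client_id)
        return "NFS4_OK"
```

Because lookups are keyed on the client-supplied client.id rather than on a hash-derived clientid, two different clients cannot collide merely because their hashes match.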

State is not actually released when required.

NFSv4.0 Callbacks

Callbacks are optional in NFSv4.0. A client and server must support callbacks and establish and maintain a callback path if clients are to be allowed to use specific optional protocol features, in particular, delegations.

In NFSv4.0, information is provided with the SETCLIENTID/SETCLIENTID_CONFIRM operations, which effect a callback registration (address, program, port, callback_ident, and GSS-API security flavor). The server initiates callback connections using the supplied information (including GSS-API secure connection establishment). An infrastructure must exist to track and test client-provided callback information, identify path-down conditions, and alert clients of path-down. By definition, an NFSv4.0 callback transport is independent of the transport(s) the client is using to send NFSv4.0 protocol requests.

Ganesha has placeholder support for NFSv4.0 callbacks. The server does not acknowledge callback registration, and does not establish or test callback connections of any security flavor.
Ganesha has partial but throw-away support for client callback operations and compounds; further elaboration of these interfaces is required for NFSv4.0 and NFSv4.1 callback support.
Ganesha has no asynchronous client/lease/path garbage collection mechanism; an efficient mechanism of this type is needed for callback path-down checks (and probably also for other async state checks).

NFSv4.1 (RFC 5661)

NFSv4.1 open and lock state exist, but may not be correct for all cases.
session, clientid and sequence management exist but are not complete in all respects, especially the clientid and sequence rules
BIND_CONN_TO_SESSION (for callbacks [below], but also reconnection, as used by the Windows client)
rpcsec_gss security -- supported
1. but not SSV (there may be no working implementation?)

NFSv4.1 State

Grace Period, Recovery, and Reclaim Rules

It is our understanding that Jim Wahlig of IBM intends to implement this functionality.

Ganesha currently falls short of the spec in a number of ways:
1. The RECLAIM_COMPLETE implementation is a no-op, although the operation is not optional.
2. There is no implementation of a reclaim grace period for NFSv4 within the current Ganesha code (there is for NLM), although one is required.
3. The reclaim-type versions of OPEN and LOCK, which are required, are not implemented.
Minimal Compliance
1. Implement the reclaim grace period.
  1. Recognize when the grace period needs to begin; recognize when in grace period and when grace period has expired
    1. At restart
    2. When a filesystem moves to a new server
2. Implement the following "reclaim-type" operations:
  1. OPEN with claim = CLAIM_PREVIOUS (see 9.11)
  2. LOCK with reclaim = true (see 9.11)
3. Implement RECLAIM_COMPLETE operation (section 18.51. Operation 58: RECLAIM_COMPLETE - Indicates Reclaims Finished)
4. When within the grace period
  1. Process reclaims according to section 8.4.2. Server Failure and Recovery and section 11.7.7 Lock State and File System Transitions
  2. Record which clients have called RECLAIM_COMPLETE
  3. Clients that have not yet called RECLAIM_COMPLETE can only call reclaim-type operations and RECLAIM_COMPLETE; other operations result in NFS4ERR_GRACE
  4. Clients that have called RECLAIM_COMPLETE will receive NFS4ERR_COMPLETE_ALREADY if they call it again subsequently
  5. Note: from section 8.4.2.1. State Reclaim, "For a server to provide simple, valid handling during the grace period, the easiest method is to simply reject all non-reclaim locking requests and READ and WRITE operations by returning the NFS4ERR_GRACE error."
Optional Compliance
1. Implement the following "reclaim-type" operation:
  1. WANT_DELEGATION with wda_claim = CLAIM_PREVIOUS (see 10.2.1)
2. Serialize client lock information to persistent store, which will allow the server to:
  1. Truncate the grace period when no further reclaims can be made
  2. Allow some operations to take place during the grace period if there are no possible impending reclaims that will prohibit such operations

Lease Period

The server should maintain a lease period for each client, during which the client's locks remain valid.
1. The lease exists to handle the case of a client that holds locks and then fails; the lease can therefore apply to all sessions that client has on the server.
Each time the client submits a SEQUENCE operation on the server, the lease is automatically renewed.
1. The client, if it needs to renew the lease, can submit an empty SEQUENCE operation strictly for lease renewal purposes.
2. The server can renew the lease upon receipt of the SEQUENCE operation as long as it guarantees the lock does not expire during the operations.
3. The server updates the lease upon completion of the SEQUENCE operation to at least the sum of the current time and the lease period.
The server can release locks once the lease has expired, thereby allowing other clients to claim locks that would otherwise be conflicting.
1. The server could use a timer to expire the lease or, alternatively, simply wait until a conflicting lock request is made.
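
A minimal sketch of the lease rules above, using the lazy-expiry alternative (no timer; the lease is only examined when a conflicting request arrives). The `Lease` class, the helper name, and the injectable clock are assumptions made for the example.

```python
import time


class Lease:
    """Per-client lease, shared by all of that client's sessions."""

    def __init__(self, period, now=time.monotonic):
        self.period = period
        self._now = now                      # injectable clock, for testing
        self.expires = now() + period

    def renew(self):
        """Call when a SEQUENCE operation completes.

        The expiry is moved to at least current time plus the lease period
        and is never moved backwards.
        """
        self.expires = max(self.expires, self._now() + self.period)

    def expired(self):
        return self._now() >= self.expires


def may_grant_conflicting_lock(holder_lease):
    # Lazy expiry: the holder's locks may be released only if its lease ran out.
    return holder_lease.expired()
```

A timer-based variant would instead scan leases periodically and release state eagerly, at the cost of a background thread.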

NFSv4.1 State serialization

Where NFSv4.0 serialized on open and lock owners, NFSv4.1 serializes on the stateid. Two open requests for the same or different files may be fired off in parallel over multiple connections. The same is true of lock owners -- multiple locking calls can be in-flight at the same time for the same lock owner. Currently Ganesha does not support this well since it still serializes everything through the open or lock owners. Individual states are neither locked with a mutex nor reference counted. While this imposes unnecessary and incorrect serialization on opens and locks, it is an especially bad problem for layouts, which are intended to have a stream of LAYOUTGET and LAYOUTRETURN requests on the forechannel and LAYOUTRECALL requests on the backchannel. To support all three, we should have a mechanism that serializes NFSv4.1 stateids that does not apply to NFSv4.0.
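
One possible shape for such a mechanism is a per-state lock plus a reference count, so that requests on different stateids proceed in parallel while updates to one stateid are serialized. The sketch below is illustrative only; `State`, `StateTable`, and `update_state` are hypothetical names, not Ganesha interfaces.

```python
import threading


class State:
    """One open, lock, or layout state, identified by stateid.other."""

    def __init__(self, other):
        self.other = other
        self.seqid = 1
        self.refcount = 1                 # reference held by the table
        self.lock = threading.Lock()      # serializes updates to this state only


class StateTable:
    def __init__(self):
        self._lock = threading.Lock()     # protects the dict and the refcounts
        self._states = {}

    def add(self, other):
        with self._lock:
            state = self._states[other] = State(other)
            return state

    def get(self, other):
        # Take a reference so the state cannot vanish while in use.
        with self._lock:
            state = self._states[other]
            state.refcount += 1
            return state

    def put(self, state):
        with self._lock:
            state.refcount -= 1


def update_state(table, other, mutate):
    """Apply mutate to one state under that state's own lock, then bump seqid."""
    state = table.get(other)
    try:
        with state.lock:
            mutate(state)
            state.seqid += 1
            return state.seqid
    finally:
        table.put(state)
```

The table lock is held only for lookup and reference counting, so a slow LAYOUTGET on one stateid does not block an OPEN or a LAYOUTRECALL on another.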

Wraparound

There is currently no check for wraparound in stateid.seqid. The check in 12.5.5.2.1.4 should be implemented at least for LAYOUT states.
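
As a rough illustration of what wraparound-aware handling involves, the sketch below increments a 32-bit seqid while skipping zero and compares two seqids with serial-number arithmetic. It is a generic technique shown under our own assumptions (including the skip-zero rule); it is not the exact check specified in 12.5.5.2.1.4, which should be consulted for the layout-specific rule.

```python
UINT32_MAX = 0xFFFFFFFF


def next_seqid(seqid):
    """Increment a 32-bit seqid, skipping 0 on wrap (assumed reserved)."""
    nxt = (seqid + 1) & UINT32_MAX
    return 1 if nxt == 0 else nxt


def seqid_is_newer(a, b):
    """True if a is ahead of b by less than half the 32-bit space."""
    diff = (a - b) & UINT32_MAX
    return 0 < diff < 0x80000000
```

For example, `seqid_is_newer(1, UINT32_MAX)` holds, because 1 is the value that follows `UINT32_MAX` once zero is skipped.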

Exactly Once Semantics (EOS)

NFSv41 uses a finite-sized reply cache for requests to the server.
This is implemented as a slot table, where each slot has a unique identifier (0..N-1), and each slot holds a sequence_id and the cached result of a request.
Each SEQUENCE operation designates a slot in the table along with a sequence id. The client tries to use the lowest available slot in order to minimize resource requirements on the server for the slot table.
1. Using the slot number and sequence id, the server can tell if this is a new request, a resubmission of the previous, already handled, request, or a variety of error conditions and respond appropriately.
  1. In the case of a resubmission of the previous request, the cached response can be replayed back to the client.
2. The sequence id is 32 bits and can roll over.
3. The client can, for a given SEQUENCE operation, ask that the server not cache the results. The server may honor the request or cache the results anyway.
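
The slot-table decision described above can be sketched as follows. This is a simplified model rather than the code in nfs41_op_sequence.c: `Slot`, `SlotTable`, and the `execute` callback are hypothetical, slots are indexed from zero here, and the error names are the ones we understand RFC 5661 to define for these cases.

```python
class Slot:
    def __init__(self):
        self.seqid = 0            # last sequence id executed on this slot
        self.cached_reply = None  # reply to replay on a retransmission


class SlotTable:
    def __init__(self, nslots):
        self.slots = [Slot() for _ in range(nslots)]

    def sequence(self, slotid, seqid, execute, cachethis=True):
        """Run execute() at most once per (slotid, seqid) pair."""
        if not 0 <= slotid < len(self.slots):
            return "NFS4ERR_BADSLOT"
        slot = self.slots[slotid]
        if seqid == slot.seqid:
            # Retransmission of the previous request: replay, do not re-run.
            if slot.cached_reply is None:
                return "NFS4ERR_RETRY_UNCACHED_REP"
            return slot.cached_reply
        if seqid != (slot.seqid + 1) & 0xFFFFFFFF:   # 32-bit rollover
            return "NFS4ERR_SEQ_MISORDERED"
        reply = execute()
        slot.seqid = seqid
        # The server may cache regardless; this sketch honors the request.
        slot.cached_reply = reply if cachethis else None
        return reply
```

Keeping one cached reply per slot is what bounds the cache size: the client cannot reuse a slot until it has seen the previous reply, so older replies never need to be retained.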

The above all seems to be implemented between nfs41_op_sequence.c, nfs4_Compound.c, and nfs4_session.h.

NFSv41 allows for adaptive adjustments to the slot table to optimize resource allocation for it. It does not appear that such adjustments are currently implemented.

Adding support for a persistent reply cache that could be enabled for servers with SSD would improve robustness.

Current/Saved stateid

Not all functions update or respect the current stateid. The saved stateid is never updated or restored.
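
A sketch of the bookkeeping a compound would need, assuming (as we read RFC 5661) that SAVEFH and RESTOREFH carry the stateid together with the filehandle and that installing a new current filehandle invalidates the current stateid. `CompoundContext` and its method names are hypothetical.

```python
class CompoundContext:
    """Per-COMPOUND current/saved filehandle and stateid (illustrative)."""

    def __init__(self):
        self.current_fh = None
        self.current_stateid = None
        self.saved_fh = None
        self.saved_stateid = None

    def putfh(self, fh):
        # A new current filehandle makes the old current stateid meaningless.
        self.current_fh = fh
        self.current_stateid = None

    def set_stateid(self, stateid):
        # Called by operations that return a stateid, such as OPEN or LOCK.
        self.current_stateid = stateid

    def savefh(self):
        self.saved_fh = self.current_fh
        self.saved_stateid = self.current_stateid

    def restorefh(self):
        self.current_fh = self.saved_fh
        self.current_stateid = self.saved_stateid
```

Operations that accept the special "current stateid" value would then read `current_stateid` from this context instead of from their arguments.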

OPEN_DOWNGRADE

OPEN_DOWNGRADE updates the stateid.seqid but is otherwise a no-op.

FREE_STATEID

Currently Ganesha does nothing, simply returning NFS4_OK whatever the client passes in.

TEST_STATEID

This function returns NFS4_OK on every stateid.

OPEN with EXCLUSIVE4_1

NFSv4.1 requires that the verifier4 supplied with the EXCLUSIVE4_1 open flag be committed to stable storage. It recommends a dedicated location (an extended attribute, for example), but failing that recommends repurposing recommended attributes such as the access and modification time. Currently, Ganesha stores this verifier in memory, making it unable to provide an exclusivity guarantee across server reboots.

GSS

The RPCSEC_GSS security flavor MUST be implemented (2.2.11). Ganesha has GSS support but integration is not complete.

RPCSEC_GSS must support Kerberos V (2.2.1.1.1.2, complete)
RPCSEC_GSS support is required for secure NFSv4.1 backchannel if clients request RPCSEC_GSS (2.10.8.2)

Secure State Verifier (SSV)

SSV is defined in RFC 5661 as an optional (2.10.8.3) mechanism for strong protection of session state (clientid, lock, and open state).

Current status of Ganesha:

CREATE_SESSION
BACKCHANNEL_CTL -- is mandatory for fully compliant GSS callback security
- the Linux client currently has no code to call the operation, but Linux Documentation/filesystems/nfs/nfs41-server.txt calls the op mandatory to implement
SSV
- Ganesha lacks SSV functionality in
  - EXCHANGE_ID
  - BIND_CONN_TO_SESSION (which is unimplemented [Linux])
  - SECINFO
  - SECINFO_NO_NAME (which is unimplemented)
  - SET_SSV (present as a stub in Ganesha but unimplemented)
  - BACKCHANNEL_CTL (which is unimplemented)

SECINFO

The operation can return AUTH_NONE and AUTH_UNIX flavors, but not RPCSEC_GSS. Enforcing RPCSEC_GSS security on specific objects, or on all of them, is not mandatory, but going without it may be insufficient for the intended use.

SECINFO_NO_NAME

The operation is not implemented in Ganesha; the source file is missing. The operation is mandatory (18.45).

NFSv4.1 SESSION Support

The Ganesha server has basic support for session operations as used by the Linux kernel NFSv4.1 client.

CREATE_SESSION

The Ganesha next branch currently does not correctly accept a sessionid agreed on in EXCHANGE_ID. Linux Box Ganesha has a fix for this, pushed as part of the pNFS patch.

CREATE_SESSION Attributes

When the client invokes CREATE_SESSION it passes in attributes to make various requests upon the server:

General attributes:
1. Whether to persist the session reply cache for EOS operations.
  - This attribute is currently ignored and assumed to be false.
2. Whether to use the existing connection for the back channel as well as the fore channel.
  - This attribute is currently ignored and assumed to be true. That is, the connection is always used for both the fore and back channels.
3. Whether to upgrade the connection to an RDMA (remote direct memory access) connection if it is not one already.
  - This attribute is currently ignored and assumed to be false.
Attributes for the fore channel and for the back channel
- These attributes are currently accepted without analysis.
- The spec allows for the server to make appropriate adjustments on some of these attributes, which it currently does not do.
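
The kind of adjustment the spec permits can be sketched as clamping each requested channel attribute to a server-side ceiling. The field names follow the `ca_*` members of the channel attributes in RFC 5661; the `ChannelAttrs` class, the `negotiate` helper, and the limit values are assumptions chosen only for illustration.

```python
from dataclasses import astuple, dataclass


@dataclass(frozen=True)
class ChannelAttrs:
    ca_maxrequestsize: int
    ca_maxresponsesize: int
    ca_maxresponsesize_cached: int
    ca_maxoperations: int
    ca_maxrequests: int


# Hypothetical server-side ceilings, not Ganesha defaults.
SERVER_LIMITS = ChannelAttrs(
    ca_maxrequestsize=1 << 20,
    ca_maxresponsesize=1 << 20,
    ca_maxresponsesize_cached=64 << 10,
    ca_maxoperations=64,
    ca_maxrequests=32,
)


def negotiate(requested, limits=SERVER_LIMITS):
    """Grant each attribute as requested, but never above the server limit."""
    granted = (min(r, l) for r, l in zip(astuple(requested), astuple(limits)))
    return ChannelAttrs(*granted)
```

The granted values, not the requested ones, would then be returned in the CREATE_SESSION reply and used to size the slot table (`ca_maxrequests`) and the reply cache (`ca_maxresponsesize_cached`).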

BIND_CONN_TO_SESSION

BIND_CONN_TO_SESSION is mandatory to implement, and is also used by the Windows (but not the Linux) client. It is not implemented in Ganesha, even as a stub. Linux Box has a prototype implementation (see backchannel).

EXCHANGE_ID

Ganesha's clientid is just a hash of the co_owner string, which violates the prescription that two incarnations of the same client should not have the same clientid (see Grace Period, Recovery, and Reclaim Rules). Collisions are also detected by comparing the clientid produced by this hash, creating the opportunity for spurious collisions.

Currently Ganesha supports no state protection, and nfs4_op_exchange_id is hard-coded to return SP4_NONE whatever the client requests. RFC 5661 implies that a client may specify either of SP4_MACH_CRED and SP4_SSV at its option.

DESTROY_CLIENTID

DESTROY_CLIENTID is currently unimplemented, returning NFS4ERR_OP_ILLEGAL. It is required functionality.

Trunking

NFSv4.1 supports both session trunking and clientid trunking; support for both types is mandatory (2.10.5).

Ganesha appears to support clientid trunking
- clientid trunking is enabled using the CREATE_SESSION operation multiple times with a shared clientid, which logically is supported
Ganesha does not support session trunking
- session trunking is enabled using the BIND_CONN_TO_SESSION operation, which is unimplemented (a prototype implementation of BIND_CONN_TO_SESSION has been produced by Linux Box, but it is incomplete)

NFSv4.1 callbacks

The callback mechanism in NFSv4.1 is better integrated than in NFSv4.0, but remains optional. In NFSv4.1, the backchannel connections are initiated at the client rather than the server (for NAT traversal), using the new CREATE_SESSION and BIND_CONN_TO_SESSION operations. Although RFC 5661 provides for flexible "fore" and "back" channel management within sessions, the support is not well elaborated in the specification, and the Linux client and current Windows client support only a single, shared (bi-directional) channel configuration per-session.

Ganesha has no explicit backchannel support. The server unconditionally accepts whatever backchannel configuration a client requests in CREATE_SESSION, but does not make use of the backchannel thereafter.
Underlying RPC library changes are needed to support bi-directional operation
- Linux Box is working on bi-directional support for TI-RPC (a dedicated back-channel "switching" mechanism was previously implemented)
Ganesha has partial but throw-away support for client callback operations and compounds; further elaboration of these interfaces is required for NFSv4.1 callback support. Linux Box plans to implement LAYOUTRECALL at least.
Ganesha has no async client/lease/path garbage collection mechanism; an efficient mechanism of this type is needed for callback path-down checks (and probably also for other async state checks).
New GSS-API (and possibly SSV) channel identity work will be needed to support callback security; this will be completed by Linux Box.

Optional features we know we wish to support

pNFS

A generic pNFS implementation and an FSAL-based pNFS implementation have been submitted for inclusion in Ganesha.

Callbacks (15.3, including CB_LAYOUTRECALL)

Delegations (10.2)

Referrals (11.4.3)

Persistent Sessions

NFSv4.1 optionally allows information about client sessions and their associated state to be saved to persistent store, such that if the server restarts, the state can be recovered and operations can be resumed in minimal time.
The following must minimally be stored if this is implemented (2.10.6.5):
1. Session id
2. Reply cache slot table

(Optional session persist values are mentioned in the same section.)
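
A minimal sketch of persisting the two required items, assuming a JSON file per session and an atomic write-then-rename so a crash never leaves a half-written record. The file format and the `save_session` and `load_session` helpers are hypothetical; a real implementation would also need to persist on every slot update, which is why fast storage matters.

```python
import json
import os
import tempfile


def save_session(path, sessionid, slots):
    """Persist a session id (bytes) and its slot table.

    slots is a list of (seqid, cached_reply) pairs, where cached_reply is
    bytes or None.
    """
    doc = {
        "sessionid": sessionid.hex(),
        "slots": [
            {"seqid": seqid, "reply": None if reply is None else reply.hex()}
            for seqid, reply in slots
        ],
    }
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname)
    with os.fdopen(fd, "w") as f:
        json.dump(doc, f)
        f.flush()
        os.fsync(f.fileno())      # make sure the data reaches stable storage
    os.replace(tmp, path)         # atomic rename over the old record


def load_session(path):
    with open(path) as f:
        doc = json.load(f)
    slots = [
        (s["seqid"], None if s["reply"] is None else bytes.fromhex(s["reply"]))
        for s in doc["slots"]
    ]
    return bytes.fromhex(doc["sessionid"]), slots
```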

Optional features we have no immediate plans to support

Retention Attributes

"Servers MAY support or not support retention on any file object type" (5.13).

NFSv4.1 RDMA and RDMA transport integration

There is some evidence that IBM may eventually support RDMA efforts

General Stability and Performance Requirements

Cache Inode

Dirent cache reimplementation

Dirent cache and readdir result caching based on AVL trees implemented (finished, merged to next)

Cache Inode GC

Stability and replacement issues
- Will entail work by several parties

Cache Inode Invalidation Upcalls

Design in progress, includes upcall/events layer
- Initial implementation from IBM

RPC

General reorganizing cleanups (in progress)
- support only TI-RPC implementation (2x remove duplicated code)
- remove requirements for Ganesha/TI-RPC layering violations
  - direct access to transport array (or even that it is such)
  - transport copying with changed parameters
- support plug-out request activation
- support plug-out Duplicate Request Cache
- support plug-out allocator indirection (finished)
- support plug-out log channel (finished)
Changes required to interoperate with Linux and Windows client backchannel (in progress)
Channel multiplexing currently done using Unix select
- EPOLL support to generic TI-RPC; this is planned to be merged with Ganesha in tandem with bi-directional changes

Currently being done by Linux Box

Zero Copy

A Linux zero-copy I/O strategy will be necessary to achieve I/O performance competitive with kernel mode implementations.

To support POSIX-like FSALs only, a sendfile-based mechanism may be sufficient
To support user-client-based (e.g., Ceph) FSALs, as well as exported kernel file systems, a model based on tee() and splice() would be required

Linux Box is working on placement functionality.

State Management

Further unification of open, lock and pNFS state is possible
Efficiency of new state representation should be evaluated and measured (many lists)