Jonathan Bauman (firstname.lastname@example.org), CITI
wc, etc.) with measurements of total, user and system time. Decreases in system time are expected with transfers sufficiently large to overcome setup costs.
Development will be based on a dual-processor Dell PowerEdge 2650 running SuSE Linux Professional version 9.1 and a stock Linux 2.6.9 kernel. This platform is chosen to provide a common reference between CITI, Network Appliance and Mellanox in the hopes that it will minimize difficulties associated with resolving issues with the supporting software and hardware. Since this design is concerned only with the NFS version 3 server, Chuck Lever's RPC client transport switch patches and Trond Myklebust and CITI's NFS version 4 patches will not be included.
This section describes the initial design for the modifications to the Linux NFSv3 server necessary to support RDMA transport. Since NFSD and the underlying RPC server code were not initially designed to support non-socket transport types, significant modification will be required to achieve a suitably robust and general solution. As much as possible, modification to NFS and NFSv3-specific code is avoided in favor of modification to the shared RPC subsystem. This approach should serve to ease the development of RDMA-enabled versions of other RPC programs, particularly the other versions of the NFS program. Furthermore, as inclusion of RDMA transport code in the Linux kernel is a goal, appropriate code structure and style will drive design decisions. In particular, modularity will be preferred to quickness of implementation. The goal of this development is a functioning server suitable for eventual inclusion in the mainline kernel rather than a prototype.
There are two basic approaches envisioned for adding RDMA transport support to the Linux NFSv3 server. The first is to add new RDMA-specific data structures and functions without perturbing the socket-oriented code, enabling the RDMA code paths via conditional switching. However, due to the level of integration of socket-specific code in the RPC layer, this would allow virtually no reuse of existing RPC code, and would necessitate significant modifications at the NFS layer. Furthermore, this approach is unlikely to be acceptable to Linux kernel maintainers. The other approach is to add another layer of abstraction in the RPC layer, dividing it into a unified state management layer and an abstract transport layer. To an extent, this already exists to allow the RPC socket interface to use both TCP and UDP transports. This design proposes to isolate all socket-specific code and replace it with a generalized interface that can be implemented by an RDMA transport as well as by sockets. In this respect it is similar to Chuck Lever's RPC client transport switch. Rather than attempting to create a fully abstracted transport interface before beginning development, RPC functionality will be abstracted and RDMA versus socket implementations will be isolated as needed during development. The final goal is to achieve a completely abstract interface devoid of socket-specific code and suitable for implementation of new transport types. However, in order to speed development toward a working implementation, and to minimize the creation of unnecessary abstraction, this incremental approach is preferred.
In order to quantify development progress as well as simplify the design and development tasks, the implementation has been divided into three stages. Though development work will certainly overlap between them, each stage is characterized by the level of RDMA functionality provided.
This stage will involve no transfer of RPC or NFS data. It is simply concerned with configuring the RDMA hardware and software to listen for and accept a connection from an RDMA peer.
The extent of modification to NFS-specific code will be to replace calls to

int svc_makesock(struct svc_serv *serv, int protocol, unsigned short port)

with calls to the new

int svc_makexprt(struct svc_serv *serv, int protocol, unsigned short port)

and to add another call with RDMA as the transport parameter. The NFS server should then accept connections on the designated port via TCP, UDP and RDMA. The svc_makexprt function will operate in largely the same way as svc_makesock did for socket-based transports and will continue to call svc_create_socket; for RDMA, however, it will begin the sequence of calls that opens the interface adapter, registers memory, and creates an endpoint listening for connections. All RDMA-specific transport code will be isolated to a separate file. Socket-specific code will likewise be gradually migrated to a separate file, leaving only transport-agnostic code in svc.c and creating a new file, svcxprt.c, to handle the transport interface.
In order to integrate the RDMA implementation into the existing sockets-based code, significant reorganization of RPC structures is needed.
svc_sock will be replaced by an abstract
svc_xprt structure. As much of the structure as possible will be retained and a
union of structures or a pointer to a structure with transport-specific data will be added.
svcsock.h will be reorganized into
svcrdma.h. Less invasive modification will also be made to the
svc_rqst structures where they reference socket-specific structures. Finally, minor modifications will be required at all points in the RPC code that reference the old
svc_sock structure, or socket-specific fields of the other structures.
The svc_xprt structure will likely need to contain additional function pointers to satisfy the increased control required for RDMA. The exact details of the interface modifications are not specified at this time, and will be subject to the needs of development. However, in all cases, minimizing divergence from the original RPC code is preferred.
Though the resulting RDMA functionality achieved by completion of this stage is relatively modest, it also represents a significant reorganization of much of the underlying RPC code as well as RDMA initialization routines that while simple in function are considerable in volume. As such, though the development time may seem considerable, it represents a major step forward in the implementation of the full RDMA server.
Initial estimate of basic development time: 5 weeks
This stage will call for the creation of RDMA-specific send and receive routines similar to
sendto. Data for all requests and replies will be sent inline. This is similar to standard TCP operation, but will utilize RDMA Send operations. There is not expected to be any
performance gain over TCP with this stage; in fact, a moderate performance degradation may occur. However, it should function as further validation of the success of stage one, and provide a solid framework of flow control on which to base stage three.
The main tasks associated with this stage will be successfully registering the memory buffers used by the RDMA Send operation and ensuring their proper management by both the RDMA hardware and RPC/NFS software layers. Also, new code must be added to process RDMA headers. This does not appear to pose significant difficulty.
This stage appears to be relatively simple to implement, but should yield a functional NFSv3 over RDMA server.
Initial estimate of basic development time: 3 weeks
This stage will enable the use of RDMA Read and Write operations for large data transfers. At this point, socket-specific functionality should be completely abstracted out of the RPC interface, so any interface changes will be solely for the purpose of increasing the level of control for RDMA.
The majority of this stage of development will involve encoding and decoding chunk lists and managing the memory associated with the RDMA Read/Write operations. Two factors will hopefully serve to simplify this implementation:
First, RPC manages request and response data with the use of an
xdr_buf structure which contains an initial
kvec structure followed by an array of contiguous pages. The initial
kvec is used for RPC header data, as well as the data payload for short messages, while the list of pages is used exclusively for large data movement operations such as
WRITE. George Feinberg proposed taking advantage of this fact in his design for the NFSv4 RDMA client. This allows the server to transparently determine when to utilize write chunks and RDMA Send operations for RPC replies.
Second, since only the NFS RDMA server performs RDMA Read/Write operations, there is no perceived increase in security risk from pre-registering all server memory. This allows simple utilization of any desired memory region for RDMA operations, eliminating the need to specialize the page buffer allocation schemes used by the RPC layer. However, before inclusion in the Linux kernel, the potential for spoofing of RDMA steering tags and the consequences of this memory registration strategy should be reconsidered.
The primary challenge in implementing RDMA operations is in handling the different types of chunks and chunk lists. As mentioned previously, the
xdr_buf structure is designed in a way that allows the server to separate large data payloads from inline data. The three chunk types currently in use will be handled as follows.
Read chunks: svc_recv will be responsible for interpreting the chunk list in the RDMA header and performing RDMA Read operations into the xdr_buf structure's page list. The result will be the same as if the data were received via TCP or UDP: the RPC and upper layer protocols should be unaffected.

Write chunks: svc_recv will be responsible for interpreting the chunk list in the RDMA header and storing the information in the transport-specific data structure attached to the svc_rqst structure. Subsequently, the transport-specific function that is called by svc_send will access the stored write chunk data and perform RDMA Write operations with the contents of the xdr_buf structure's page list. Again, there should be no effect on the rest of the RPC and upper layer protocols.

Reply chunks: svc_recv will likewise store reply chunk information in the transport-specific svc_rqst substructure. The transport-specific function that is called by svc_send will check for the presence of reply chunks, and if present will use them to send the contents of the xdr_buf structure to the client via RDMA Write operations, followed by a null RDMA Send operation to indicate completion.
In determining the location in the protocol stack to place the modifications for handling RDMA chunks, minimal collateral code impact and opacity to RPC upper layer protocols were of primary concern. This, along with the insight into the use of the
xdr_buf structure, led to managing chunks at the RPC transport level. Placing control at the XDR layer was also examined, but proved impractical due to differences in the XDR handling of read/write operations across NFS versions. For example, the NFSv3 server performs the read system call during XDR request decoding, whereas the NFSv4 server performs the same call during XDR reply encoding. Performing chunk handling at the RPC transport layer should obviate the need to make modifications for different NFS versions (or other RPC programs) while achieving optimal performance.
The experience of Network Appliance engineers indicates that this is the most difficult stage of RDMA development. That, coupled with CITI's lack of experience with RDMA, results in this stage having the highest development-time estimate.
Initial estimate of basic development time: 6 weeks
Initial estimate of total development time: 14 weeks.
Estimates of development time in this section assume all CITI work will be completed by one developer working alone. Changes in personnel allocation may affect the schedule. Other factors that may affect the schedule include changes to the development platform and any delays incurred due to dependencies described in the previous section. Finally, it should be emphasized that this is the first use of Linux as an RDMA server platform. This work is experimental in nature, and unforeseen complications should be expected. Though these estimates attempt to provide time for addressing the known difficulties in implementing an RDMA transport, they should still be treated as rough estimates. Insofar as possible, estimates will be revised as work progresses.
As of yet, Mellanox has not been able to provide a kDAPL implementation that has been verified operational on CITI hardware. As such, NetApp's NFSv3 RDMA client cannot yet be run. A working version is expected soon, but since this is a necessary component for testing the RDMA server, all CITI time spent installing and configuring new software from Mellanox and NetApp should be added to the raw server implementation estimate. For scheduling purposes, 1 week will be assumed for now.
Network Appliance has requested that CITI give a presentation at Connectathon 2005 regarding experiences implementing NFSv4.1 sessions work on Linux. Creation of this presentation will require 1 week of work and must be completed to allow feedback from Network Appliance (1 week suggested) and timely submission to Connectathon organizers (no date posted).
|Week 1|Week 2|Week 3|Week 4|Week 5|Week 6|
|Setup test client|Stage 1|Stage 1 (continued)|
|Create Presentation|Revise Presentation|
|OpenIB Developers Workshop|Connectathon|

|Week 7|Week 8|Week 9|Week 10|Week 11|Week 12|
|Stage 1 (continued)|Stage 2|Stage 3|

|Week 13|Week 14|Week 15|Week 16|Week 17|Week 18|
|Stage 3 (continued)|