Purpose & Procedure
I wanted to determine whether cpu processing time is a significant bottleneck in the RPC. If processing time is significant, then RPC performance may be improved by streamlining the existing code; i.e., removing unnecessary copies and calculations. However, if the processing time is not significant in comparison to other factors, then our efforts will be better spent elsewhere.
In order to determine how much cpu time was being spent, and where, I developed kprofile. kprofile is a kernel patch that allows me to put timers in the kernel, and access results through a proc interface. It has a low enough overhead that it shouldn't noticeably affect the code that it's measuring.
I measured time spent in each state of the RPC's finite state machine, and time spent on queus. I also had a timer that measured the lifetime of each RPC. Hypothetically, the time spent in the different state machine states plus the time spent on queues should add up to the total execution time. However, I encountered discrepancies as large as 30%. In order to get percentages that would add up to 100%, I used an 'adjusted lifetime', which is the sum of the adjusted average times in each state and queue.
The adjusted average time accounts for an individual RPC going through a state or queue more than once, or some RPCs not going through a particular state or queue. For instance, most write requests go through call_reserve twice, so the adjusted average is twice the average time that call_reserve took.
There are three percentages I calculated. As I mentioned before, all of them use the adjusted lifetime. The first percentage accounts for all states and queues. The second one subtracts the time spent on the backlog and the schedq from the total time, which leaves in network and processing time. The final percentage subtracts out time spent in call_transmit (sending the rpc) and on the pending queue (waiting for a server reply). This leaves only actual processing time, and shows where the most processing time is being spent.
Testing was performed from one of our Dual 933 MHz PIII's against a Network Appliance F85 file server.
All time measurements are in ns.
Here are the results for read and write RPC's.
adjusted lifetime: 809937
adjusted lifetime: 8281676
The second percentage is useful for comparing client processing vs network and server processing. For a read, 2% of the time is spent transmitting the request to the server, and 95% of the time is spent waiting for the reply. Only 3% of the time is spent processing the request on the client.
Similarly, for a write, 7% of the time is spent transmitting the request, and 90% of the time is spent waiting for the server's reply.
From these results, it appears that most of the time is spent either transmitting the request, or waiting for the server to do the work and send a reply. In order to determine if client processing time is ever significant, I exercised all of the rpc types by running the sun connectathon suite, and looked for the rpc type that had the least time in call_transmit and on the pending queue. This rpc turned out to be FSINFO, which is a request for information about the file system.
adjusted lifetime: 102796
Even here, only 4% of the time is spent doing processing on the client side.
Processing on the client side is a very small fraction of the time spent executing an RPC. There is little to be gained from optimizing the code to further reduce processing time.
This does not necessarily mean that the client side RPC code cannot be improved. While there is little to be done to reduce the time it takes for an individual RPC's execution time, it still remains to be seen whether overall throughput can be improved. I believe that the best way to determine this is to find out if the server is always being kept busy. If the server always has as much work than it can handle, increasing throughput from the client will only bog down the server. If it does not, then there may be improvements to be made on the client side.
I've been unable to determine why the individual times do not add up to the total time. There are some places that time is not measured, such as in rpc_execute, between finite state machine steps. However, I believe these times are fairly trivial. This matter may warrant further investigation, but I don't think it significantly affects my results or conclusions.