
kprofile

Purpose & Procedure

I wanted to determine whether CPU processing time is a significant bottleneck in RPC execution. If processing time is significant, then RPC performance may be improved by streamlining the existing code, i.e., removing unnecessary copies and calculations. However, if processing time is insignificant compared to other factors, our efforts are better spent elsewhere.

To determine how much CPU time was being spent, and where, I developed kprofile. kprofile is a kernel patch that lets me place timers in the kernel and read the results through a /proc interface. Its overhead is low enough that it should not noticeably affect the code it measures.
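
The patch itself is not reproduced here, but the following user-space sketch illustrates the kind of per-timer bookkeeping involved: each timer accumulates a count, a running total, and the shortest and longest intervals seen, corresponding to the Count, Avg, Low, and High columns in the tables below. The names and structure are illustrative, not the actual kprofile API.

    /* Illustrative sketch of per-timer accounting in the style of
     * kprofile; names are hypothetical, not the actual patch's API. */
    #include <stdio.h>
    #include <stdint.h>
    #include <time.h>

    struct kprof_timer {
        const char *name;
        uint64_t count;  /* intervals recorded           */
        uint64_t total;  /* sum of interval lengths (ns) */
        uint64_t low;    /* shortest interval seen (ns)  */
        uint64_t high;   /* longest interval seen (ns)   */
    };

    static uint64_t now_ns(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
    }

    static void kprof_record(struct kprof_timer *t, uint64_t start, uint64_t end)
    {
        uint64_t delta = end - start;
        t->total += delta;
        if (t->count == 0 || delta < t->low)
            t->low = delta;
        if (delta > t->high)
            t->high = delta;
        t->count++;
    }

    /* Report in the shape of the tables below; the kernel patch would
     * emit this through its /proc interface rather than stdout. */
    static void kprof_report(const struct kprof_timer *t)
    {
        printf("%-16s count=%llu avg=%llu low=%llu high=%llu\n", t->name,
               (unsigned long long)t->count,
               (unsigned long long)(t->count ? t->total / t->count : 0),
               (unsigned long long)t->low, (unsigned long long)t->high);
    }

    int main(void)
    {
        struct kprof_timer t = { .name = "call_decode" };
        uint64_t start = now_ns();
        /* ... code under measurement runs here ... */
        kprof_record(&t, start, now_ns());
        kprof_report(&t);
        return 0;
    }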

I measured the time spent in each state of the RPC's finite state machine, and the time spent on queues. I also had a timer that measured the lifetime of each RPC. In principle, the time spent in the state machine states plus the time spent on queues should add up to the total execution time. However, I encountered discrepancies as large as 30%. To get percentages that add up to 100%, I used an 'adjusted lifetime', which is the sum of the adjusted average times in each state and queue.

The adjusted average time accounts for an individual RPC passing through a state or queue more than once, or some RPCs skipping a particular state or queue entirely. For instance, most write requests go through call_reserve twice, so the adjusted average is twice the average time spent in call_reserve.
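
Concretely, the adjusted average scales the per-entry average by the number of entries per RPC. Using the read results below:

    adjusted avg = avg x (entries / RPC count)
    call_reserve: 1098 ns x (27131 / 16210) = 1837 ns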

I calculated three percentages, all derived from the adjusted lifetime. The first accounts for all states and queues. The second subtracts the time spent on the backlog and the schedq from the total, which leaves network and processing time. The third also subtracts the time spent in call_transmit (sending the RPC) and on the pending queue (waiting for a server reply). This leaves only actual processing time, and shows where the most processing time is being spent.
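
For example, from the read results below:

    adjusted lifetime: 809937 ns
    w/o backlog & schedq: 809937 - 343386 - 69604 = 396947 ns, so pending is 379604 / 396947 = 95%
    w/o call_transmit & pending: 396947 - 9227 - 379604 = 8116 ns, so call_decode is 2787 / 8116 = 34%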

Testing was performed from one of our dual 933 MHz Pentium III machines against a Network Appliance F85 file server.

All time measurements are in ns.

Results

Here are the results for read and write RPCs.

READ

| State / queue | Count | % of adj. lifetime | % w/o schedq & backlog | % w/o call_transmit & pending | Avg (ns) | Adj. avg (ns) | Low (ns) | High (ns) |
|---|---|---|---|---|---|---|---|---|
| Finite state machine | | | | | | | | |
| call_allocate | 16210 | 0% | 0% | 5% | 456 | 456 | 244 | 106031 |
| call_bind | 16210 | 0% | 0% | 1% | 105 | 105 | 78 | 15781 |
| call_decode | 16210 | 0% | 0% | 34% | 2787 | 2787 | 471 | 169577 |
| call_encode | 16210 | 0% | 0% | 32% | 2654 | 2654 | 980 | 142529 |
| call_reconnect | | 0% | 0% | 0% | 0 | | | |
| call_reserve | 27131 | 0% | 0% | 22% | 1098 | 1837 | 332 | 106762 |
| call_reserve_result | 27131 | 0% | 0% | 3% | 166 | 277 | 84 | 118121 |
| call_transmit | 16210 | 1% | 2% | N/A | 9227 | 9227 | 4084 | 117031 |
| Queues | | | | | | | | |
| backlog | 10921 | 42% | N/A | N/A | 509687 | 343386 | 6721 | 14280660 |
| pending | 16210 | 46% | 95% | N/A | 379604 | 379604 | 99068 | 1981564 |
| schedq | 26857 | 8% | N/A | N/A | 42011 | 69604 | 867 | 75089244 |
| Other | | | | | | | | |
| lifetime | 16210 | N/A | N/A | N/A | 836195 | 836195 | 138785 | 75387996 |

adjusted lifetime: 809937 ns
discrepancy: 26258 ns (3%)

WRITE

| State / queue | Count | % of adj. lifetime | % w/o schedq & backlog | % w/o call_transmit & pending | Avg (ns) | Adj. avg (ns) | Low (ns) | High (ns) |
|---|---|---|---|---|---|---|---|---|
| Finite state machine | | | | | | | | |
| call_allocate | 48392 | 0% | 0% | 6% | 380 | 380 | 243 | 25545 |
| call_bind | 48392 | 0% | 0% | 1% | 108 | 108 | 78 | 18613 |
| call_decode | 48392 | 0% | 0% | 25% | 1463 | 1463 | 471 | 36846 |
| call_encode | 48392 | 0% | 0% | 26% | 1516 | 1516 | 962 | 57530 |
| call_reconnect | | 0% | 0% | 0% | 0 | | | |
| call_reserve | 84144 | 0% | 0% | 35% | 1177 | 2046 | 338 | 80692 |
| call_reserve_result | 84144 | 0% | 0% | 4% | 135 | 234 | 83 | 27407 |
| call_transmit | 48392 | 0% | 8% | N/A | 43800 | 43800 | 4371 | 606383 |
| Queues | | | | | | | | |
| backlog | 35752 | 93% | N/A | N/A | 10469450 | 7734827 | 8042 | 37075930 |
| pending | 48392 | 5% | 90% | N/A | 460852 | 460852 | 99482 | 19793949 |
| schedq | 78992 | 0% | N/A | N/A | 22330 | 36450 | 807 | 59058937 |
| Other | | | | | | | | |
| lifetime | 48392 | N/A | N/A | N/A | 8293613 | 8293613 | 129143 | 59237867 |
| otherq | | 0% | 0% | 0% | 0 | | | |

adjusted lifetime: 8281676 ns
discrepancy: 11937 ns (0%)

The second percentage is useful for comparing client processing with network and server processing. For a read, 2% of the time is spent transmitting the request to the server, and 95% of the time is spent waiting for the reply. Only 3% of the time is spent processing the request on the client.

Similarly, for a write, 8% of the time is spent transmitting the request, and 90% of the time is spent waiting for the server's reply.

From these results, it appears that most of the time is spent either transmitting the request or waiting for the server to do the work and send a reply. To determine whether client processing time is ever significant, I exercised all of the RPC types by running the Sun Connectathon suite and looked for the RPC type with the least time in call_transmit and on the pending queue. That RPC turned out to be FSINFO, a request for information about the file system.

FSINFO

| State / queue | Count | % of adj. lifetime | % w/o schedq & backlog | % w/o call_transmit & pending | Avg (ns) | Adj. avg (ns) | Low (ns) | High (ns) |
|---|---|---|---|---|---|---|---|---|
| Finite state machine | | | | | | | | |
| call_allocate | 6009 | 0% | 0% | 10% | 341 | 341 | 289 | 498 |
| call_bind | 6009 | 0% | 0% | 2% | 93 | 93 | 88 | 97 |
| call_decode | 6009 | 1% | 1% | 33% | 1039 | 1039 | 703 | 9993 |
| call_encode | 6009 | 0% | 0% | 31% | 1000 | 1000 | 900 | 7752 |
| call_reconnect | | 0% | 0% | 0% | 0 | | | |
| call_reserve | 6009 | 0% | 0% | 17% | 551 | 551 | 500 | 674 |
| call_reserve_result | 6009 | 0% | 0% | 3% | 103 | 103 | 84 | 124 |
| call_transmit | 6009 | 5% | 5% | N/A | 5881 | 5881 | 5191 | 18039 |
| Queues | | | | | | | | |
| backlog | | 0% | N/A | N/A | 0 | | | |
| pending | 6009 | 91% | 91% | N/A | 93788 | 93788 | 83324 | 327996 |
| schedq | | 0% | N/A | N/A | 0 | | | |
| Other | | | | | | | | |
| lifetime | 6009 | N/A | N/A | N/A | 114787 | 114787 | 103506 | 348849 |
| otherq | | 0% | 0% | 0% | 0 | | | |

adjusted lifetime: 102796 ns
discrepancy: 11991 ns (10%)

Even here, only 4% of the time is spent on client-side processing.

Conclusions

Processing on the client side is a very small fraction of the time spent executing an RPC. There is little to be gained from optimizing the code to further reduce processing time.

This does not necessarily mean that the client-side RPC code cannot be improved. While there is little to be done to reduce an individual RPC's execution time, it remains to be seen whether overall throughput can be improved. I believe the best way to determine this is to find out whether the server is always being kept busy. If the server always has as much work as it can handle, increasing throughput from the client will only bog down the server. If it does not, then there may be improvements to be made on the client side.

I've been unable to determine why the individual times do not add up to the total time. There are some places where time is not measured, such as in rpc_execute, between finite state machine steps. However, I believe these times are fairly trivial. This matter may warrant further investigation, but I don't think it significantly affects my results or conclusions.
