
kprofile

Purpose & Procedure

I wanted to determine whether CPU processing time is a significant bottleneck in RPC execution. If processing time is significant, then RPC performance may be improved by streamlining the existing code, i.e., removing unnecessary copies and calculations. However, if processing time is insignificant compared to other factors, our efforts are better spent elsewhere.

To determine how much CPU time was being spent, and where, I developed kprofile. kprofile is a kernel patch that lets me place timers in the kernel and read the results through a /proc interface. Its overhead is low enough that it should not noticeably affect the code it measures.
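
The patch itself is not reproduced here, but the following user-space sketch illustrates the kind of per-timer bookkeeping involved: each timer accumulates a count, a running total, and the shortest and longest intervals seen, corresponding to the Count, Avg, Low, and High columns in the tables below. The names and structure are illustrative, not the actual kprofile API.

    /* Illustrative sketch of per-timer accounting in the style of
     * kprofile; names are hypothetical, not the actual patch's API. */
    #include <stdio.h>
    #include <stdint.h>
    #include <time.h>

    struct kprof_timer {
        const char *name;
        uint64_t count;  /* intervals recorded           */
        uint64_t total;  /* sum of interval lengths (ns) */
        uint64_t low;    /* shortest interval seen (ns)  */
        uint64_t high;   /* longest interval seen (ns)   */
    };

    static uint64_t now_ns(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
    }

    static void kprof_record(struct kprof_timer *t, uint64_t start, uint64_t end)
    {
        uint64_t delta = end - start;
        t->total += delta;
        if (t->count == 0 || delta < t->low)
            t->low = delta;
        if (delta > t->high)
            t->high = delta;
        t->count++;
    }

    /* Report in the shape of the tables below; the kernel patch would
     * emit this through its /proc interface rather than stdout. */
    static void kprof_report(const struct kprof_timer *t)
    {
        printf("%-16s count=%llu avg=%llu low=%llu high=%llu\n", t->name,
               (unsigned long long)t->count,
               (unsigned long long)(t->count ? t->total / t->count : 0),
               (unsigned long long)t->low, (unsigned long long)t->high);
    }

    int main(void)
    {
        struct kprof_timer t = { .name = "call_decode" };
        uint64_t start = now_ns();
        /* ... code under measurement runs here ... */
        kprof_record(&t, start, now_ns());
        kprof_report(&t);
        return 0;
    }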

I measured the time spent in each state of the RPC's finite state machine, and the time spent on queues. I also had a timer that measured the lifetime of each RPC. In principle, the time spent in the state machine states plus the time spent on queues should add up to the total execution time. However, I encountered discrepancies as large as 30%. To get percentages that add up to 100%, I used an 'adjusted lifetime', which is the sum of the adjusted average times in each state and queue.

The adjusted average time accounts for an individual RPC passing through a state or queue more than once, or some RPCs skipping a particular state or queue entirely. For instance, most write requests go through call_reserve twice, so the adjusted average is twice the average time spent in call_reserve.
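
Concretely, the adjusted average scales the per-entry average by the number of entries per RPC. Using the read results below:

    adjusted avg = avg x (entries / RPC count)
    call_reserve: 1098 ns x (27131 / 16210) = 1837 ns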

I calculated three percentages, all derived from the adjusted lifetime. The first accounts for all states and queues. The second subtracts the time spent on the backlog and the schedq from the total, which leaves network and processing time. The third also subtracts the time spent in call_transmit (sending the RPC) and on the pending queue (waiting for a server reply). This leaves only actual processing time, and shows where the most processing time is being spent.
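
For example, from the read results below:

    adjusted lifetime: 809937 ns
    w/o backlog & schedq: 809937 - 343386 - 69604 = 396947 ns, so pending is 379604 / 396947 = 95%
    w/o call_transmit & pending: 396947 - 9227 - 379604 = 8116 ns, so call_decode is 2787 / 8116 = 34%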

Testing was performed from one of our dual 933 MHz Pentium III machines against a Network Appliance F85 file server.

All time measurements are in ns.

Results

Here are the results for read and write RPCs.

READ

| State / queue | Count | % of adj. lifetime | % w/o schedq & backlog | % w/o call_transmit & pending | Avg (ns) | Adj. avg (ns) | Low (ns) | High (ns) |
|---|---|---|---|---|---|---|---|---|
| Finite state machine | | | | | | | | |
| call_allocate | 16210 | 0% | 0% | 5% | 456 | 456 | 244 | 106031 |
| call_bind | 16210 | 0% | 0% | 1% | 105 | 105 | 78 | 15781 |
| call_decode | 16210 | 0% | 0% | 34% | 2787 | 2787 | 471 | 169577 |
| call_encode | 16210 | 0% | 0% | 32% | 2654 | 2654 | 980 | 142529 |
| call_reconnect | | 0% | 0% | 0% | 0 | | | |
| call_reserve | 27131 | 0% | 0% | 22% | 1098 | 1837 | 332 | 106762 |
| call_reserve_result | 27131 | 0% | 0% | 3% | 166 | 277 | 84 | 118121 |
| call_transmit | 16210 | 1% | 2% | N/A | 9227 | 9227 | 4084 | 117031 |
| Queues | | | | | | | | |
| backlog | 10921 | 42% | N/A | N/A | 509687 | 343386 | 6721 | 14280660 |
| pending | 16210 | 46% | 95% | N/A | 379604 | 379604 | 99068 | 1981564 |
| schedq | 26857 | 8% | N/A | N/A | 42011 | 69604 | 867 | 75089244 |
| Other | | | | | | | | |
| lifetime | 16210 | N/A | N/A | N/A | 836195 | 836195 | 138785 | 75387996 |

adjusted lifetime: 809937 ns
discrepancy: 26258 ns (3%)

WRITE

| State / queue | Count | % of adj. lifetime | % w/o schedq & backlog | % w/o call_transmit & pending | Avg (ns) | Adj. avg (ns) | Low (ns) | High (ns) |
|---|---|---|---|---|---|---|---|---|
| Finite state machine | | | | | | | | |
| call_allocate | 48392 | 0% | 0% | 6% | 380 | 380 | 243 | 25545 |
| call_bind | 48392 | 0% | 0% | 1% | 108 | 108 | 78 | 18613 |
| call_decode | 48392 | 0% | 0% | 25% | 1463 | 1463 | 471 | 36846 |
| call_encode | 48392 | 0% | 0% | 26% | 1516 | 1516 | 962 | 57530 |
| call_reconnect | | 0% | 0% | 0% | 0 | | | |
| call_reserve | 84144 | 0% | 0% | 35% | 1177 | 2046 | 338 | 80692 |
| call_reserve_result | 84144 | 0% | 0% | 4% | 135 | 234 | 83 | 27407 |
| call_transmit | 48392 | 0% | 8% | N/A | 43800 | 43800 | 4371 | 606383 |
| Queues | | | | | | | | |
| backlog | 35752 | 93% | N/A | N/A | 10469450 | 7734827 | 8042 | 37075930 |
| pending | 48392 | 5% | 90% | N/A | 460852 | 460852 | 99482 | 19793949 |
| schedq | 78992 | 0% | N/A | N/A | 22330 | 36450 | 807 | 59058937 |
| Other | | | | | | | | |
| lifetime | 48392 | N/A | N/A | N/A | 8293613 | 8293613 | 129143 | 59237867 |
| otherq | | 0% | 0% | 0% | 0 | | | |

adjusted lifetime: 8281676 ns
discrepancy: 11937 ns (0%)

The second percentage is useful for comparing client processing with network and server processing. For a read, 2% of the time is spent transmitting the request to the server, and 95% of the time is spent waiting for the reply. Only 3% of the time is spent processing the request on the client.

Similarly, for a write, 8% of the time is spent transmitting the request, and 90% of the time is spent waiting for the server's reply.

From these results, it appears that most of the time is spent either transmitting the request or waiting for the server to do the work and send a reply. To determine whether client processing time is ever significant, I exercised all of the RPC types by running the Sun Connectathon suite and looked for the RPC type with the least time in call_transmit and on the pending queue. That RPC turned out to be FSINFO, a request for information about the file system.

FSINFO

| State / queue | Count | % of adj. lifetime | % w/o schedq & backlog | % w/o call_transmit & pending | Avg (ns) | Adj. avg (ns) | Low (ns) | High (ns) |
|---|---|---|---|---|---|---|---|---|
| Finite state machine | | | | | | | | |
| call_allocate | 6009 | 0% | 0% | 10% | 341 | 341 | 289 | 498 |
| call_bind | 6009 | 0% | 0% | 2% | 93 | 93 | 88 | 97 |
| call_decode | 6009 | 1% | 1% | 33% | 1039 | 1039 | 703 | 9993 |
| call_encode | 6009 | 0% | 0% | 31% | 1000 | 1000 | 900 | 7752 |
| call_reconnect | | 0% | 0% | 0% | 0 | | | |
| call_reserve | 6009 | 0% | 0% | 17% | 551 | 551 | 500 | 674 |
| call_reserve_result | 6009 | 0% | 0% | 3% | 103 | 103 | 84 | 124 |
| call_transmit | 6009 | 5% | 5% | N/A | 5881 | 5881 | 5191 | 18039 |
| Queues | | | | | | | | |
| backlog | | 0% | N/A | N/A | 0 | | | |
| pending | 6009 | 91% | 91% | N/A | 93788 | 93788 | 83324 | 327996 |
| schedq | | 0% | N/A | N/A | 0 | | | |
| Other | | | | | | | | |
| lifetime | 6009 | N/A | N/A | N/A | 114787 | 114787 | 103506 | 348849 |
| otherq | | 0% | 0% | 0% | 0 | | | |

adjusted lifetime: 102796 ns
discrepancy: 11991 ns (10%)

Even here, only 4% of the time is spent on client-side processing.

Conclusions

Processing on the client side is a very small fraction of the time spent executing an RPC. There is little to be gained from optimizing the code to further reduce processing time.

This does not necessarily mean that the client-side RPC code cannot be improved. While there is little to be done to reduce an individual RPC's execution time, it remains to be seen whether overall throughput can be improved. I believe the best way to determine this is to find out whether the server is always being kept busy. If the server always has as much work as it can handle, increasing throughput from the client will only bog down the server. If it does not, then there may be improvements to be made on the client side.

I've been unable to determine why the individual times do not add up to the total time. There are some places where time is not measured, such as in rpc_execute, between finite state machine steps. However, I believe these times are fairly trivial. This matter may warrant further investigation, but I don't think it significantly affects my results or conclusions.
