Linux Kernel Hash Table Behavior:
analysis and improvements
Chuck Lever, Netscape Communications Corp.
chuckl@netscape.com
$Id: hashemail.html,v 1.3 1999/11/12 20:12:54 cel Exp $
Abstract


Hash tables are used in the Linux kernel for storing many
highusage data objects, such as pages, buffers, inodes, and
others.
The author found a significant performance boost with
careful analysis and tuning of these data structures.

This document is Copyright © 1999
Netscape Communications Corp.,
all rights reserved.
Trademarked material referenced in this document is copyright
by its respective owner.

Discourse
Date: Tue, 6 Apr 1999 22:11:22 +0100 (BST)
From: Stephen C. Tweedie
To: Chuck Lever
Cc: Andrea Arcangeli , Stephen C. Tweedie ,
linuxkernel@vger.rutgers.edu, linuxmm@kvack.org
Subject: Re: [patch] arcavm2.2.5
Hi,
On Tue, 6 Apr 1999 16:47:23 0400 (EDT), Chuck Lever
said:
>> but I was just assuming that the offset is always pagealigned.
>> I could write a simulation to check the hash function...
> i didn't realize that the "offset" argument would always be pagealigned.
> but still, why does it help to add the unshifted "offset"? doesn't seem
> like there's any new information in that.
It is always aligned for the page cache (hence mixing in the lower bits
to the hash function shouldn't change anything), but the swap cache uses
the lower bits extensively, concentrating the swap cache into just a few
hash buckets unless we make this change.
Stephen
Date: Tue, 6 Apr 1999 23:04:58 +0200 (CEST)
From: Andrea Arcangeli
To: Chuck Lever
Cc: Stephen C. Tweedie , linuxkernel@vger.rutgers.edu,
linuxmm@kvack.org
Subject: Re: [patch] arcavm2.2.5
On Tue, 6 Apr 1999, Chuck Lever wrote:
>but still, why does it help to add the unshifted "offset"? doesn't seem
>like there's any new information in that.
The unshifted offset is useful for swap cache pages. When the page is a
swap cache page the page>offset really is the swap entry that tell us
where the page is been swapped out. And in such case page>offset is not
page aligned.
Andrea Arcangeli
Date: Fri, 9 Apr 1999 11:44:03 +0100 (BST)
From: Stephen C. Tweedie
To: Chuck Lever
Cc: Andrea Arcangeli , linuxkernel@vger.rutgers.edu,
linuxmm@kvack.org
Subject: Re: [PFC]: hash instrumentation
Hi,
On Thu, 8 Apr 1999 13:22:25 0400 (EDT), Chuck Lever
said:
> i'm discovering that a 13 bit hash mitigates the spikey size distribution
> in the page hash *better* than the +offset change. although i've been
> able to push the system into swap, i still haven't seen any degenerate
> hash behavior that's as bad as the buffer cache's hash function.
Pushing it into swap is not sufficient: to trigger the swap cache you
need to be actively using data which has previously been swapped out,
which requires dirtying the pages, swapping them out and then touching
them again for read access. In that special case, when the
newlyswappedbackin page still matches the copy on disk, the swap
cache mechanism keeps both the inmemory and ondisk copies linked to
each other.
> i'll have more as i test this further.
Don't forget that you can use the "mem=*m" option at boot time to run
in a reduced memory configuration. That might make it a bit easier
for you to trigger this stuff. :)
Additional Work
Date: Tue, 6 Apr 1999 22:31:52 +0100 (BST)
From: Stephen C. Tweedie
To: Andrea Arcangeli
Cc: Doug Ledford , Chuck Lever ,
linuxkernel@vger.rutgers.edu, linuxmm@kvack.org,
Stephen Tweedie
Subject: Re: [patch] arcavm2.2.5
Hi,
On Tue, 6 Apr 1999 15:04:36 +0200 (CEST), Andrea Arcangeli
said:
> I also think that I'll implement the cookie thing suggested by Mark since
> I am too much courious to see how much it will help (even if my mind is
> driven by RBtrees ;).
Trees are bad for this sort of thing in general: they can be as fast as
hashing for lookup, but are much slower for insert and delete
operations. Insert and delete for the page cache _must_ be fast.
Expanding a hash table does not cost enough to offset the performance
problems.
Stephen
Date: Wed, 7 Apr 1999 00:27:21 +0200 (CEST)
From: Andrea Arcangeli
To: Stephen C. Tweedie
Cc: Doug Ledford , Chuck Lever ,
linuxkernel@vger.rutgers.edu, linuxmm@kvack.org,
Linus Torvalds , Mark Hemment
Subject: Re: [patch] arcavm2.2.5
On Tue, 6 Apr 1999, Stephen C. Tweedie wrote:
>Trees are bad for this sort of thing in general: they can be as fast as
>hashing for lookup, but are much slower for insert and delete
>operations. Insert and delete for the page cache _must_ be fast.
It's not so obvious to me. I sure agree that an O(n) insertion/deletion is
far too slow but a O(log(n)) for everything could be rasonable to me. And
trees don't worry about unluky hash behavior.
And for the record my plan would be to put the top of the tree directly in
the inode struct. And then to queue all cached pages that belongs to such
inode in the perinode cachetree. So there would be no need to always
check also the inode field in find_inode() and reading small files would
be _drammatically_ fast and immediate even if there are 5giga of cache
allocated. I think this behavior will become interesting.
Maybe you are 100% right in telling me that RBtrees will lose, but I
would like to try it out someday...
Andrea Arcangeli
Date: Wed, 7 Apr 1999 13:27:45 +0100 (BST)
From: Stephen C. Tweedie
To: Andrea Arcangeli
Cc: Stephen C. Tweedie , Doug Ledford ,
Chuck Lever , linuxkernel@vger.rutgers.edu, linuxmm@kvack.org,
Linus Torvalds , Mark Hemment
Subject: Re: [patch] arcavm2.2.5
Hi,
On Wed, 7 Apr 1999 00:27:21 +0200 (CEST), Andrea Arcangeli
said:
> It's not so obvious to me. I sure agree that an O(n) insertion/deletion is
> far too slow but a O(log(n)) for everything could be rasonable to me. And
> trees don't worry about unluky hash behavior.
Trees are O(log n) for insert/delete, with a high constant of
proportionality (ie. there can be quite a lot of work to be done even
for small log n). Trees also occupy more memory per node. Hashes are
O(1) for insert and delete, and are a _fast_ O(1). The page cache needs
fast insert and delete.
Stephen
Analysis
i've attached a diff to this message that contains the hash
instrumentation i'm using to analyze and improve hash functions. i'd like
to instrument a couple of other hashes (dentry, inode, fib), and create a
new directory in /proc called /proc/cache, with files for each of the
important caches. these may not contain the histogram information, since
it's just for testing, but they might contain important statistics about
hash performance, for example.
here's some sample output (from /var/log/messages) to show what
information you might be able to gather from the hash instrumentation:
for the page cache:
Apr 7 15:08:50 pillbox kernel: Total pages in hash: 4079 total buckets:
2048
Apr 7 15:08:50 pillbox kernel: hash table histogram:
Apr 7 15:08:50 pillbox kernel: size: 0 buckets: 379 pages: 0
Apr 7 15:08:50 pillbox kernel: size: 1 buckets: 462 pages: 462
Apr 7 15:08:50 pillbox kernel: size: 2 buckets: 507 pages: 1014
Apr 7 15:08:50 pillbox kernel: size: 1 buckets: 462 pages: 462
Apr 7 15:08:50 pillbox kernel: size: 2 buckets: 507 pages: 1014
Apr 7 15:08:50 pillbox kernel: size: 3 buckets: 385 pages: 1155
Apr 7 15:08:50 pillbox kernel: size: 4 buckets: 178 pages: 712
Apr 7 15:08:50 pillbox kernel: size: 5 buckets: 94 pages: 470
Apr 7 15:08:50 pillbox kernel: size: 6 buckets: 36 pages: 216
Apr 7 15:08:50 pillbox kernel: size: 7 buckets: 6 pages: 42
Apr 7 15:08:50 pillbox kernel: size: 8 buckets: 1 pages: 8
Apr 7 15:08:50 pillbox kernel: size: 9 buckets: 0 pages: 0
Apr 7 15:08:50 pillbox kernel: size: 10 buckets: 0 pages: 0
Apr 7 15:08:50 pillbox kernel: size: 11 buckets: 0 pages: 0
Apr 7 15:08:50 pillbox kernel: size: 12 buckets: 0 pages: 0
Apr 7 15:08:50 pillbox kernel: size: 13 buckets: 0 pages: 0
Apr 7 15:08:50 pillbox kernel: size: 14 buckets: 0 pages: 0
Apr 7 15:08:50 pillbox kernel: size: 15 buckets: 0 pages: 0
Apr 7 15:08:50 pillbox kernel: size:>15 buckets: 0 pages: 0
Apr 7 15:08:50 pillbox kernel: Max bucket size: 8
Apr 7 15:08:50 pillbox kernel: Total lookups: 58933 pages found: 52044
(hit rate: 88%)
Apr 7 15:08:50 pillbox kernel: find_page loops: 53950 loops/lookup:
915/1000
notice that the bucket sizes are nicely distributed, and that most of the
buckets are used. loops/lookup is really the number that matters here,
since it provides a good indication of how much average effort is required
to find a given page. loops/lookup is the average number of times
__find_page (or find_buffer, below) had to loop during a lookup operation.
the histogram table is organized so that each row represents buckets of a
given size. for instance, the row for empty buckets is "size:0" and the
number of buckets that were empty is 379. the number of buckets that
contained only 2 pages is 507, and there were 1014 pages stored in buckets
that only contained 2 pages. this allows us to clearly see the bucket
size distribution, and to easily tell when there are lots of large or
empty buckets (both bad situations).
here's the buffer cache:
Apr 7 15:08:50 pillbox kernel: total buffers in hash table: 2149
Apr 7 15:08:50 pillbox kernel: total buckets: 32768 largest: 59
Apr 7 15:08:50 pillbox kernel: times find_buffer looped: 53900
loops/lookup: 874/1000
Apr 7 15:08:50 pillbox kernel: hash table histogram:
Apr 7 15:08:50 pillbox kernel: size: 0 buckets: 31941 buffers: 0
Apr 7 15:08:50 pillbox kernel: size: 1 buckets: 648 buffers: 648
Apr 7 15:08:50 pillbox kernel: size: 2 buckets: 50 buffers: 100
Apr 7 15:08:50 pillbox kernel: size: 3 buckets: 26 buffers: 78
Apr 7 15:08:50 pillbox kernel: size: 4 buckets: 8 buffers: 32
Apr 7 15:08:50 pillbox kernel: size: 5 buckets: 16 buffers: 80
Apr 7 15:08:50 pillbox kernel: size: 6 buckets: 17 buffers: 102
Apr 7 15:08:50 pillbox kernel: size: 7 buckets: 18 buffers: 126
Apr 7 15:08:50 pillbox kernel: size: 8 buckets: 12 buffers: 96
Apr 7 15:08:50 pillbox kernel: size: 9 buckets: 3 buffers: 27
Apr 7 15:08:50 pillbox kernel: size: 10 buckets: 5 buffers: 50
Apr 7 15:08:50 pillbox kernel: size: 11 buckets: 4 buffers: 44
Apr 7 15:08:50 pillbox kernel: size: 12 buckets: 1 buffers: 12
Apr 7 15:08:50 pillbox kernel: size: 13 buckets: 0 buffers: 0
Apr 7 15:08:50 pillbox kernel: size: 14 buckets: 2 buffers: 28
Apr 7 15:08:50 pillbox kernel: size: 15 buckets: 0 buffers: 0
Apr 7 15:08:50 pillbox kernel: size:>15 buckets: 17 buffers: 726
the buffer hash function doesn't work well: bucket sizes are not nicely
distributed, there is a huge number of empty buckets, and there are more
than a few buckets that contain lots of buffers. this is just after the
system was booted, but after some file system intensive activity, we have
this:
page hash:
Apr 7 15:36:20 pillbox kernel: Total pages in hash: 21854 total buckets:
2048
Apr 7 15:36:20 pillbox kernel: hash table histogram:
Apr 7 15:36:20 pillbox kernel: size: 0 buckets: 0 pages: 0
Apr 7 15:36:20 pillbox kernel: size: 1 buckets: 0 pages: 0
Apr 7 15:36:20 pillbox kernel: size: 2 buckets: 3 pages: 6
Apr 7 15:36:20 pillbox kernel: size: 3 buckets: 4 pages: 12
Apr 7 15:36:20 pillbox kernel: size: 4 buckets: 20 pages: 80
Apr 7 15:36:20 pillbox kernel: size: 5 buckets: 60 pages: 300
Apr 7 15:36:20 pillbox kernel: size: 6 buckets: 76 pages: 456
Apr 7 15:36:20 pillbox kernel: size: 7 buckets: 131 pages: 917
Apr 7 15:36:20 pillbox kernel: size: 8 buckets: 198 pages: 1584
Apr 7 15:36:20 pillbox kernel: size: 9 buckets: 247 pages: 2223
Apr 7 15:36:20 pillbox kernel: size: 10 buckets: 261 pages: 2610
Apr 7 15:36:20 pillbox kernel: size: 11 buckets: 284 pages: 3124
Apr 7 15:36:20 pillbox kernel: size: 12 buckets: 224 pages: 2688
Apr 7 15:36:20 pillbox kernel: size: 13 buckets: 185 pages: 2405
Apr 7 15:36:20 pillbox kernel: size: 14 buckets: 148 pages: 2072
Apr 7 15:36:20 pillbox kernel: size: 15 buckets: 82 pages: 1230
Apr 7 15:36:20 pillbox kernel: size:>15 buckets: 125 pages: 1427
Apr 7 15:36:20 pillbox kernel: Max bucket size: 22
Apr 7 15:36:20 pillbox kernel: Total lookups: 5129003 pages found:
4590496 (hit rate: 89%)
Apr 7 15:36:20 pillbox kernel: find_page loops: 28371900 loops/lookup:
507/1000
the loops/lookup is pretty low (less than one is good) and the hash
distribution is still nicely shaped. it's clear that the hash table is
too small, though. for buffers:
Apr 7 15:36:20 pillbox kernel: total buffers in hash table: 1767
Apr 7 15:36:20 pillbox kernel: total buckets: 32768 largest: 25
Apr 7 15:36:20 pillbox kernel: times find_buffer looped: 2314427
loops/lookup: 1078/1000
Apr 7 15:36:20 pillbox kernel: hash table histogram:
Apr 7 15:36:20 pillbox kernel: size: 0 buckets: 31866 buffers: 0
Apr 7 15:36:20 pillbox kernel: size: 1 buckets: 687 buffers: 687
Apr 7 15:36:20 pillbox kernel: size: 2 buckets: 109 buffers: 218
Apr 7 15:36:20 pillbox kernel: size: 3 buckets: 19 buffers: 57
Apr 7 15:36:20 pillbox kernel: size: 4 buckets: 11 buffers: 44
Apr 7 15:36:20 pillbox kernel: size: 5 buckets: 15 buffers: 75
Apr 7 15:36:20 pillbox kernel: size: 6 buckets: 9 buffers: 54
Apr 7 15:36:20 pillbox kernel: size: 7 buckets: 9 buffers: 63
Apr 7 15:36:20 pillbox kernel: size: 8 buckets: 4 buffers: 32
Apr 7 15:36:20 pillbox kernel: size: 9 buckets: 7 buffers: 63
Apr 7 15:36:20 pillbox kernel: size: 10 buckets: 4 buffers: 40
Apr 7 15:36:20 pillbox kernel: size: 11 buckets: 5 buffers: 55
Apr 7 15:36:20 pillbox kernel: size: 12 buckets: 4 buffers: 48
Apr 7 15:36:20 pillbox kernel: size: 13 buckets: 6 buffers: 78
Apr 7 15:36:20 pillbox kernel: size: 14 buckets: 1 buffers: 14
Apr 7 15:36:20 pillbox kernel: size: 15 buckets: 1 buffers: 15
Apr 7 15:36:20 pillbox kernel: size:>15 buckets: 11 buffers: 224
for buffers the story is quite different. there's still a huge number of
empty buckets, and the size distribution is practically flat.
loops/lookup is greater than one.
i've set up a 2.2.5 kernel with some test hash functions to see what kind
of improvement is available with hash tweaking. i tweaked the page hash
function by changing PAGE_HASH_BITS to 13 (haven't tried +offset yet), and
replaced the buffer hash function with one of my own devising.
these tests were run on a dual 200Mhz PPro with 128M RAM and 2 10M/s SCSI
drivers. i've seen similar results on my 233Mhz K6 and a PII laptop. i
will also test on a halfgig machine.
after some intensive memory and file activity (no swapping, though), the
page hash looks like this:
Apr 7 16:55:27 pillbox kernel: Total pages in hash: 9664 total buckets:
8192
Apr 7 16:55:27 pillbox kernel: hash table histogram:
Apr 7 16:55:27 pillbox kernel: size: 0 buckets: 2521 pages: 0
Apr 7 16:55:27 pillbox kernel: size: 1 buckets: 2976 pages: 2976
Apr 7 16:55:27 pillbox kernel: size: 2 buckets: 1732 pages: 3464
Apr 7 16:55:27 pillbox kernel: size: 3 buckets: 691 pages: 2073
Apr 7 16:55:27 pillbox kernel: size: 4 buckets: 215 pages: 860
Apr 7 16:55:27 pillbox kernel: size: 5 buckets: 51 pages: 255
Apr 7 16:55:27 pillbox kernel: size: 6 buckets: 6 pages: 36
Apr 7 16:55:27 pillbox kernel: size: 7 buckets: 0 pages: 0
Apr 7 16:55:27 pillbox kernel: size: 8 buckets: 0 pages: 0
Apr 7 16:55:27 pillbox kernel: size: 9 buckets: 0 pages: 0
Apr 7 16:55:27 pillbox kernel: size: 10 buckets: 0 pages: 0
Apr 7 16:55:27 pillbox kernel: size: 11 buckets: 0 pages: 0
Apr 7 16:55:27 pillbox kernel: size: 12 buckets: 0 pages: 0
Apr 7 16:55:27 pillbox kernel: size: 13 buckets: 0 pages: 0
Apr 7 16:55:27 pillbox kernel: size: 14 buckets: 0 pages: 0
Apr 7 16:55:27 pillbox kernel: size: 15 buckets: 0 pages: 0
Apr 7 16:55:27 pillbox kernel: size:>15 buckets: 0 pages: 0
Apr 7 16:55:27 pillbox kernel: Max bucket size: 6
Apr 7 16:55:27 pillbox kernel: Total lookups: 4568709 pages found:
4083164 (hit rate: 89%)
Apr 7 16:55:27 pillbox kernel: find_page loops: 2322162 loops/lookup:
508/1000
as you can see, this is lots better than the 11 bit hash. in order to
test Doug's varying hash table patch, i'll need to test 11, 12, 13, 14, 15
bits, and so on. also with +offset. i tried a little of this last night
at home, and noticed that even numbered bits (12 and 14) don't work as
well as 11 and 13.
the buffer hash function i used is this:
#define _hashfn(dev,block) \
((((unsigned long)(block) * (40499UL * 65543UL)) >> 17) & bh_hash_mask)
this is the same number of instructions as the old function, except that
one of the instructions is IMUL. i haven't looked up how expensive this
might be. however, i believe a fixed expense is better than a higher
loops/lookup average, since IMUL with a constant multiplier doesn't incur
any memory overhead once it's in the instruction cache, whereas an extra
iteration of find_buffer might be another 2 memory references, either or
both of which might be cache misses.
and the size histogram produced is this:
Apr 7 16:55:27 pillbox kernel: total buffers in hash table: 13009
Apr 7 16:55:27 pillbox kernel: total buckets: 32768 largest: 7
Apr 7 16:55:27 pillbox kernel: times find_buffer looped: 1652176
loops/lookup: 933/1000
Apr 7 16:55:27 pillbox kernel: hash table histogram:
Apr 7 16:55:27 pillbox kernel: size: 0 buckets: 23539 buffers: 0
Apr 7 16:55:27 pillbox kernel: size: 1 buckets: 6571 buffers: 6571
Apr 7 16:55:27 pillbox kernel: size: 2 buckets: 1899 buffers: 3798
Apr 7 16:55:27 pillbox kernel: size: 3 buckets: 559 buffers: 1677
Apr 7 16:55:27 pillbox kernel: size: 4 buckets: 102 buffers: 408
Apr 7 16:55:27 pillbox kernel: size: 5 buckets: 39 buffers: 195
Apr 7 16:55:27 pillbox kernel: size: 6 buckets: 53 buffers: 318
Apr 7 16:55:27 pillbox kernel: size: 7 buckets: 6 buffers: 42
Apr 7 16:55:27 pillbox kernel: size: 8 buckets: 0 buffers: 0
Apr 7 16:55:27 pillbox kernel: size: 9 buckets: 0 buffers: 0
Apr 7 16:55:27 pillbox kernel: size: 10 buckets: 0 buffers: 0
Apr 7 16:55:27 pillbox kernel: size: 11 buckets: 0 buffers: 0
Apr 7 16:55:27 pillbox kernel: size: 12 buckets: 0 buffers: 0
Apr 7 16:55:27 pillbox kernel: size: 13 buckets: 0 buffers: 0
Apr 7 16:55:27 pillbox kernel: size: 14 buckets: 0 buffers: 0
Apr 7 16:55:27 pillbox kernel: size: 15 buckets: 0 buffers: 0
Apr 7 16:55:27 pillbox kernel: size:>15 buckets: 0 buffers: 0
a significant improvement over the previous hash, in terms of bucket size
distribution. in fact this function is so much better, we can reduce the
size of the hash table by 2 or even 4 times and get equal throughput.
even so, notice that loops/lookup is still high. the cache hit rate is
81%, so most of those loops are for successful lookups. would it help to
move a bucket entry to the front of the list after a successful lookup ?
i tried this, and it doesn't appear to help.
i have a fullblown patch that replaces the inode, dentry, fib, buffer,
uid, and pid hashes with the simple function listed above. in general, the
system feels more responsive, and throughput appears the same or slightly
better than an unpatched system. when i've finished testing the page hash
stuff, i'll include that and post the full patch.
how it works:
/*
* Based on Knuth, "The Art of Computer Programming", Vol. 3
* Searching and Sorting, pp. 513549.
*
* Essentially, multiplication by a number that is relatively
* prime to the wordsize of the computer results in division
* by its reciprocal, modulus the machine' word size. We
* select a number that is a golden ratio with the machine's
* word size, as that has certain desirable properties for
* hashing, as discussed in Knuth.
*
* (HASH_MULTIPLIER / wordsize) ~= ((sqrt(5)  1) / 2)
*
* A better approximation to the golden ratio could be chosen than below,
* but this works sufficiently well.
*
* 40499 and 65543 are both prime, guaranteeing that HASH_MULTIPLIER
* is relatively prime to 2^32, and doesn't exhibit certain side effects
* that might occur if we used small primes. Note that HASH_MULTIPLIER
* itself doesn't need to be prime; it simply needs to be relatively
prime
* to the machine's word size. I haven't tested how well it works
* with 64bit machines, but evidence suggests it will do well.
*/
#define HASH_MULTIPLIER (40499UL * 65543UL)
/* this is a typical hash function using Knuth's idea */
#define shash_function(key, range) \
((((unsigned long)(key) * HASH_MULTIPLIER) >> 17) & ((range)  1))
Date: Wed, 14 Apr 1999 00:42:34 +0100 (BST)
From: Stephen C. Tweedie
To: Andrea Arcangeli
Cc: linuxkernel@vger.rutgers.edu, Zlatko Calusic ,
MOLNAR Ingo , Stephen C. Tweedie ,
Chuck Lever
Subject: Re: [PFC]: hash instrumentation
Hi,
On Fri, 9 Apr 1999 21:37:57 +0200 (CEST), Andrea Arcangeli
said:
> I am also a bit worried by shrink_dcache_memory(). It looks like to me
> that it's recalled too late. You should swapout heavily or delete files in
> order to shrink the dcache right now. And the shrink in my tree is more
> finegrined than the complete shrink done by the stock kernel, and since my
> tree is very stable (no peak of low memory) also during swap I think that
> shrink_dcache memory is __never__ recalled with normal operations that
> also involve swap. So I think I'll move shrink_dcache_memory() _before_
> swap_out() to see if it's been part of the bottleneck.
Good for testing, but terrible for highmemory machines which rely on
dcache to maintain performance.
As an interesting data point, I applied Chuck's last pair of buffer and
page cache hash updates (the new buffer hash function, and the larger
PAGE_HASH_BITS) to a 2.2.5 kernel to measure the impact on a parallel
build of the current glibc2.1 rpms on a 4way Xeon. Each of those two
improvements gained about 30 seconds of wallclock time on a complete
build, taking it down from 15 minutes to 14.
Adding 4 extra bits to the dcache hash was worth 2 full minutes; 12
minutes total build time.
This was absolutely obvious from the readprofile(8) traces, where
d_lookup() was massively dominating the system CPU time (I had made both
the page and buffer hash lookup functions into externs at this point so
that they would all be profiled separately, not as inlines). d_lookup
was using about 4 times as much CPU as the nearest other functions
(which iirc were the page fault COW handler and file read() actors, both
of which perform large copies).
Shrinking the dcaches excessively in this case will simply masaccre the
performance. Apparently glibc2.1 is a particularly heavy test of
stat() performance, but then many other common server workloads are too.
Interestingly, when performing a full kernel build on the same box, the
d_lookup() cost is so low that it is hardly even on the map.
Cheers,
Stephen
Date: Tue, 13 Apr 1999 23:45:51 0400 (EDT)
From: Chuck Lever
To: Stephen C. Tweedie
Cc: Andrea Arcangeli , linuxkernel@vger.rutgers.edu,
Zlatko Calusic ,
MOLNAR Ingo
Subject: Re: [PFC]: hash instrumentation
On Wed, 14 Apr 1999, Stephen C. Tweedie wrote:
> As an interesting data point, I applied Chuck's last pair of buffer and
> page cache hash updates (the new buffer hash function, and the larger
> PAGE_HASH_BITS) to a 2.2.5 kernel to measure the impact on a parallel
> build of the current glibc2.1 rpms on a 4way Xeon. Each of those two
> improvements gained about 30 seconds of wallclock time on a complete
> build, taking it down from 15 minutes to 14.
how many hash bits did you try? 13? you might consider trying even more,
say 15 or 16. benchmarking has shown that the page hash function is
stable for any bit size between 11 and 16 (i didn't try others), so
varying it, as Doug's patch does, won't degenerate the hash.
> Adding 4 extra bits to the dcache hash was worth 2 full minutes; 12
> minutes total build time.
>
> This was absolutely obvious from the readprofile(8) traces, where
> d_lookup() was massively dominating the system CPU time (I had made both
> the page and buffer hash lookup functions into externs at this point so
> that they would all be profiled separately, not as inlines). d_lookup
> was using about 4 times as much CPU as the nearest other functions
> (which iirc were the page fault COW handler and file read() actors, both
> of which perform large copies).
readprofile traces i've taken show mostly the same thing, although the
page fault handler is by an order of magnitude the highest CPU user
(find_vma() is still up there, too). increasing the efficiency of the
dcache hash will buy probably the biggest performance improvement of any
hash modifications. i've also found that changing the xor's in the
dcache hash function to addition operations improves the bucket size
distribution and buys a bit more improvement.
> Shrinking the dcaches excessively in this case will simply masaccre the
> performance.
actually, that's not strictly true. shrinking the dcache early will
improve the lookup efficiency of the hash, i've found almost by two times.
there's probably compromise here, so, if you're not _too_ aggressive about
shrinking it (but more aggressive than shrink_dcache**() is now), you may
find even more performance improvement.
my preliminary analysis of the inode hash table is that it is also too
small. find_inode() seems to be used mostly for unsuccessful searches
these days, and the hash table size is so small that the buckets contain
long long lists of inodes, making the unsuccessful searches very
expensive.
 Chuck Lever
Date: Wed, 14 Apr 1999 14:54:30 +0100 (BST)
From: Stephen C. Tweedie
To: Chuck Lever
Cc: Stephen C. Tweedie , Andrea Arcangeli ,
linuxkernel@vger.rutgers.edu, Zlatko Calusic ,
MOLNAR Ingo
Subject: Re: [PFC]: hash instrumentation
Hi,
On Tue, 13 Apr 1999 23:45:51 0400 (EDT), Chuck Lever
said:
> On Wed, 14 Apr 1999, Stephen C. Tweedie wrote:
>> As an interesting data point, I applied Chuck's last pair of buffer and
>> page cache hash updates (the new buffer hash function, and the larger
>> PAGE_HASH_BITS) to a 2.2.5 kernel to measure the impact on a parallel
>> build of the current glibc2.1 rpms on a 4way Xeon. Each of those two
>> improvements gained about 30 seconds of wallclock time on a complete
>> build, taking it down from 15 minutes to 14.
> how many hash bits did you try? 13? you might consider trying even more,
> say 15 or 16. benchmarking has shown that the page hash function is
> stable for any bit size between 11 and 16 (i didn't try others), so
> varying it, as Doug's patch does, won't degenerate the hash.
13, but that was quite enough to eliminate __find_page as a significant
CPU cost in this instance, as reported by readprofile.
>> Adding 4 extra bits to the dcache hash was worth 2 full minutes; 12
>> minutes total build time.
> readprofile traces i've taken show mostly the same thing, although the
> page fault handler is by an order of magnitude the highest CPU user
> (find_vma() is still up there, too).
Yes; it does seem as if a glibc build is pretty bad at stressing the
dcache. However, I'd expect things like squid to show similar
dependencies on the name cache performance. Behind d_lookup,
do_anonymous_page and do_wp_page were the major costs (at about a
quarter to half of the cycles of d_lookup).
Hmm. This looks like another place where dropping the kernel lock
during the copy would be beneficial: we already hold the mm semaphore at
the time, so we're not vulnerable to too many races. I'll look at this.
> increasing the efficiency of the dcache hash will buy probably the
> biggest performance improvement of any hash modifications.
For some workloads, yes. Building the kernel spends very little time at
all in d_lookup().
>> Shrinking the dcaches excessively in this case will simply masaccre the
>> performance.
> actually, that's not strictly true. shrinking the dcache early will
> improve the lookup efficiency of the hash, i've found almost by two
> times.
Sure, but a glibc build is referencing a _lot_ of header files! My
concern is that the vmscan loop currently invokes a prune_dcache(0),
which is as aggressive as you can get. If we do that any more
frequently, getting a good balance of the dcache will be a lot harder.
> my preliminary analysis of the inode hash table is that it is also too
> small. find_inode() seems to be used mostly for unsuccessful searches
> these days, and the hash table size is so small that the buckets contain
> long long lists of inodes, making the unsuccessful searches very
> expensive.
FWIW, the profile with the new hash functions but small dcache started
like this (__find_page and find_buffer have been taken out of inline for
profiling here):
4893 d_lookup 23.5240
2741 do_anonymous_page 21.4141
1486 file_read_actor 18.5750
1475 do_wp_page 2.6721
1218 __get_free_pages 2.5805
1075 __find_page 15.8088
844 filemap_nopage 1.1405
684 brw_page 0.7403
600 lookup_dentry 1.2295
594 find_buffer 6.4565
567 page_fault 47.2500
564 handle_mm_fault 1.2261
523 __free_page 2.2543
439 free_pages 1.6140
420 do_con_write 0.2471
403 strlen_user 8.3958
391 zap_page_range 0.8806
382 do_page_fault 0.4799
and with the larger dcache,
2434 do_anonymous_page 19.0156
1451 do_wp_page 2.6286
1343 file_read_actor 16.7875
1328 __find_page 19.5294
1149 __get_free_pages 2.4343
1112 d_lookup 5.3462
847 find_buffer 9.2065
847 filemap_nopage 1.1446
628 brw_page 0.6797
580 page_fault 48.3333
577 lookup_dentry 1.1824
563 handle_mm_fault 1.2239
543 __free_page 2.3405
414 do_con_write 0.2435
397 free_pages 1.4596
377 system_call 6.7321
356 strlen_user 7.4167
354 zap_page_range 0.7973
319 do_page_fault 0.4008
Interestingly, do_anonymous_page, do_wp_page and file_read_actor are all
places where we can probably optimise things to drop the kernel lock.
That won't make them run faster but on SMP it will certainly let other
CPUs get more kernel work done. Film at 11.
Stephen
Date: Wed, 14 Apr 1999 12:26:13 0400 (EDT)
From: Chuck Lever
To: Stephen C. Tweedie
Cc: Andrea Arcangeli , linuxkernel@vger.rutgers.edu,
Zlatko Calusic ,
MOLNAR Ingo
Subject: Re: [PFC]: hash instrumentation
On Wed, 14 Apr 1999, Stephen C. Tweedie wrote:
> > how many hash bits did you try? 13? you might consider trying even more,
> > say 15 or 16. benchmarking has shown that the page hash function is
> > stable for any bit size between 11 and 16 (i didn't try others), so
> > varying it, as Doug's patch does, won't degenerate the hash.
>
> 13, but that was quite enough to eliminate __find_page as a significant
> CPU cost in this instance, as reported by readprofile.
factor in less elapsed time and better worstcase (successful
and unsuccessful) hash performance when using a larger table.
surprisingly, CPU cost is only part of the picture, it seems.
i also tested adding the raw offset in the page hash function, and across
the board i still see a measurable performance drop.
> Hmm. This looks like another place where dropping the kernel lock
> during the copy would be beneficial: we already hold the mm semaphore at
> the time, so we're not vulnerable to too many races. I'll look at this.
let me be the first to encourage you to do this! :)
> >> Shrinking the dcaches excessively in this case will simply masaccre the
> >> performance.
>
> > actually, that's not strictly true. shrinking the dcache early will
> > improve the lookup efficiency of the hash, i've found almost by two
> > times.
>
> Sure, but a glibc build is referencing a _lot_ of header files! My
> concern is that the vmscan loop currently invokes a prune_dcache(0),
> which is as aggressive as you can get. If we do that any more
> frequently, getting a good balance of the dcache will be a lot harder.
andrea's arca10 replaces prune_dcache(0) with something a little more
easygoing:
prune_dcache(dentry_stat.nr_unused / (priority+1));
however, having a good dentry replacement policy might be even better.
> FWIW, the profile with the new hash functions but small dcache started
> like this (__find_page and find_buffer have been taken out of inline for
> profiling here):
>
> 4893 d_lookup 23.5240
> 2741 do_anonymous_page 21.4141
> 1486 file_read_actor 18.5750
> 1475 do_wp_page 2.6721
> 1218 __get_free_pages 2.5805
> 1075 __find_page 15.8088
> 844 filemap_nopage 1.1405
> 684 brw_page 0.7403
> 600 lookup_dentry 1.2295
> 594 find_buffer 6.4565
> 567 page_fault 47.2500
> 564 handle_mm_fault 1.2261
> 523 __free_page 2.2543
> 439 free_pages 1.6140
> 420 do_con_write 0.2471
> 403 strlen_user 8.3958
> 391 zap_page_range 0.8806
> 382 do_page_fault 0.4799
>
> and with the larger dcache,
>
> 2434 do_anonymous_page 19.0156
> 1451 do_wp_page 2.6286
> 1343 file_read_actor 16.7875
> 1328 __find_page 19.5294
> 1149 __get_free_pages 2.4343
> 1112 d_lookup 5.3462
> 847 find_buffer 9.2065
> 847 filemap_nopage 1.1446
> 628 brw_page 0.6797
> 580 page_fault 48.3333
> 577 lookup_dentry 1.1824
> 563 handle_mm_fault 1.2239
> 543 __free_page 2.3405
> 414 do_con_write 0.2435
> 397 free_pages 1.4596
> 377 system_call 6.7321
> 356 strlen_user 7.4167
> 354 zap_page_range 0.7973
> 319 do_page_fault 0.4008
>
> Interestingly, do_anonymous_page, do_wp_page and file_read_actor are all
> places where we can probably optimise things to drop the kernel lock.
> That won't make them run faster but on SMP it will certainly let other
> CPUs get more kernel work done. Film at 11.
the normalized value for page_fault is still pretty high: +48. is there
anything that can be done about that, or is that not a concern?
also i tried benchmarking a stock 2.2.5 kernel with a 12 bit inode hash,
and found performance gains as significant as the other gains you found.
 Chuck Lever
Multiplicative v. Tabledriven hash functions
Date: Thu, 8 Apr 1999 12:51:27 0400 (EDT)
From: Chuck Lever
To: Andi Kleen
Cc: linuxkernel@vger.rutgers.edu
Subject: Re: more on hash functions
On 8 Apr 1999, Andi Kleen wrote:
> > the buffer hash function i used is this:
> >
> > #define _hashfn(dev,block) \
> > ((((unsigned long)(block) * (40499UL * 65543UL)) >> 17) & bh_hash_mask)
> >
> > this is the same number of instructions as the old function, except that
> > one of the instructions is IMUL. i haven't looked up how expensive this
> > might be. however, i believe a fixed expense is better than a higher
> > loops/lookup average, since IMUL with a constant multiplier doesn't incur
> > any memory overhead once it's in the instruction cache, whereas an extra
> > iteration of find_buffer might be another 2 memory references, either or
> > both of which might be cache misses.
>
> This is correct for the x86 picture, but breaks e.g. on earlier sparcs
> or some m68ks where multiplication is very slow. Remember Linux is not x86
> only.
yes, i'm well aware of the other archs. :)
i think "breaks" is a little strong, though. perhaps it is slower on
older hardware. can you say *how* slow an integer multiplication by a
constant is on older SPARC CPUs or on, say, a 486, compared to memory
references on these machines? that's the real tradeoff here, and i think
it would be good to measure the difference under load before dismissing
the idea.
Date: Thu, 8 Apr 1999 13:22:25 0400 (EDT)
From: Chuck Lever
To: Andrea Arcangeli
Cc: linuxkernel@vger.rutgers.edu
Subject: Re: [PFC]: hash instrumentation
> in arca8 I restored the o+offset in the hash function, feel free to remove
> it before running the bench if you think it will harm (the +offset is not
> needed until you'll be very low on memory and it looks like to me that
> your machine never gets low on memory ;).
i'm discovering that a 13 bit hash mitigates the spikey size distribution
in the page hash *better* than the +offset change. although i've been
able to push the system into swap, i still haven't seen any degenerate
hash behavior that's as bad as the buffer cache's hash function.
i'll have more as i test this further.
> Ah and also the first pagemapcachealigned patch was wrong (blame me). The
> second patch (after Eric pointed out to think about it twice) does the
> right thing (so you may consider to run the bench on it again... excuse
> me). So probably this is the reason that the pagemapaligned+irqaligned
> bench was weird...
i suspected that the strange performance behavior wasn't an implementation
problem, so i tried a plain 2.2.5 kernel with just the page struct
alignment mod in your latest patch. the numbers were about the same as
before. it would probably be a good idea if someone could actually take a
close look at what the alignment mod is doing to change hardware behavior,
'cause it's obviously not what we expect.
Date: Fri, 9 Apr 1999 09:30:35 +0200
From: Janos Farkas
To: Chuck Lever , Andi Kleen
Cc: linuxkernel@vger.rutgers.edu
Subject: Re: more on hash functions
On 19990408 at 12:51:27, Chuck Lever wrote:
...
> > > #define _hashfn(dev,block) \
> > > ((((unsigned long)(block) * (40499UL * 65543UL)) >> 17) & bh_hash_mask)
> i think "breaks" is a little strong, though. perhaps it is slower on
> older hardware. can you say *how* slow an integer multiplication by a
> constant is on older SPARC CPUs or on, say, a 486, compared to memory
> references on these machines? that's the real tradeoff here, and i think
> it would be good to measure the difference under load before dismissing
> the idea.
Although a bit unsolicited, I have 68030 numbers at hand; this is quite
a bit old CPU, not the top of the line, bit IMHO the most affordable, in
fact, about half of the registered Linux/m68k owners have machines with
a '030 (well, me too :)
On this chip, a shift (independent of count) takes about 410 clock
cycles, depending where the shifted operand is; arithmetic operations
are similar (or a few cycles faster), but a multiplication takes 28
cycles (in register only) with 16 bit values, and 44 cycles for a 32bit
multiplication. I'm out of touch with the discussion, so I don't know
how often do you want to compute hashes, and even less who cares about
the '030; but I think it's quite common, and multiplication here takes
significantly more time.
It's quite common therefore to optimize constant multiplies to
bitshifting, but it depends on the value if this is better.
Janos
Date: Fri, 9 Apr 1999 09:38:39 +0200
From: Janos Farkas
To: Chuck Lever , Andi Kleen , linuxkernel@vger.rutgers.edu
Subject: Re: more on hash functions
On 19990409 at 09:30:35, Janos Farkas wrote:
[on a 68030]
...
> are similar (or a few cycles faster), but a multiplication takes 28
> cycles (in register only) with 16 bit values, and 44 cycles for a 32bit
> multiplication.
...
Oh, forgot to tell, that this is only the worst case timing, the actual
bit patterns influence the times (just like with handmade
multiplication); so one needs to be careful what to multiply :)
Janos
Date: Sat, 10 Apr 1999 01:25:49 +0100 (BST)
From: Alan Cox
To: Chuck Lever
Subject: Re: more on hash functions
Another thing btw would be to use architecture specific hashes. The sun4c machines
top out at 32Mb for example.
Date: Sat, 10 Apr 1999 08:51:27 +0200
From: Janos Farkas
To: Chuck Lever
Cc: linuxkernel@vger.rutgers.edu
Subject: Re: more on hash functions
On 19990409 at 19:06:03, Chuck Lever wrote:
> thanks for the information. the missing piece, though, is how expensive
> is a multiplication operation relative to a couple of memory references?
> that's the direct tradeoff when tuning these hash tables.
[About '030 again]
Sorry, missed it; its two cycles for every additional memory access, in
most circumstances. This is the difference even between an instruction
operating in registers and the same doing it from/to memory, and this is
how longer an instruction takes if it has more instruction words. It's
true even for 32bit accesses on the right system (ie. 32bit memory
bus  common).
[Just to shortly recap: a multiplication is taking 28/44 cycles at worst,
(16/32 bit), and highly depends on the bit patterns; very roughly about
how much 1 bit is set in the multiplier (not true, just to get the picture)]
Janos
Date: Sat, 10 Apr 1999 17:15:01 0500
From: Paul F. Dietz
To: linuxkernel@vger.rutgers.edu
Subject: Re: more on hash functions
Chuck Lever wrote:
> agreed that multiplication should be avoided. however, in some cases
> nothing else works.
However, this is not one of those cases, if loads are
much cheaper than multiplication. A quick
and easy hash function would be to hash a 20 bit
value by using its two 10 bit words as indices into
two tables of randomly chosen values (in the
range 0..PAGE_HASH_SIZE1). Xor the two values together.
This class of hash functions (parameterized by
the table contents) is provably good, in the sense
that for any two distinct keys, a randomly chosen
hash function from this class will cause the keys
to collide with probability 1/PAGE_HASH_SIZE
(the class of hash functions is "universal";
see Cormen, Leiserson, and Rivest, "Introduction
to Algorithms", chapter 12.)
Moreover, the expected behavior of hashing with
chaining is O(n/m) time per operation for n values
stored in a table of size m (basically, pairwise
independence of the hash values is enough to
ensure this.) Again, this is regardless of
the actual keys; the randomness comes from the
tables.
Paul
Date: Mon, 12 Apr 1999 00:30:23 0400 (EDT)
From: Chuck Lever
To: Paul F. Dietz
Cc: linuxkernel@vger.rutgers.edu
Subject: Re: more on hash functions
On Sat, 10 Apr 1999, Paul F. Dietz wrote:
> Chuck Lever wrote:
>
> > agreed that multiplication should be avoided. however, in some cases
> > nothing else works.
>
> However, this is not one of those cases, if loads are
> much cheaper than multiplication. A quick
> and easy hash function would be to hash a 20 bit
> value by using its two 10 bit words as indices into
> two tables of randomly chosen values (in the
> range 0..PAGE_HASH_SIZE1). Xor the two values together.
i hadn't considered trying a tabledriven hash function because Knuth
poopoo'd the idea very early in the section on hashing in Vol. 3 (i don't
remember why at the moment). so, i took your email as a challenge to try
a simple tabledriven hash implementation for the one cache that has been
irascibly untunable: the buffer hash. (the reason for the difficulty is
the regular way in which ext2 uses block offsets  it makes it very hard
to find an "easy" hash function with the desirable properties).
i implemented something close to what you suggested here: a hash function
from 24 bits into 12 using an exclusiveor of two random table lookups.
the reason for the increased size is that the buffer cache can contain
tens of thousands of objects, and to keep positive *and* negative lookups
quick, the hash chains need to be kept short (<15). to do this, you need
to have a large hash table  10 bits (1024 buckets) simply aren't enough.
in fact, 12 bits still aren't enough, but it works well enough for a
demonstration.
every hash function i've tried, including this one, left almost half the
buckets empty during a moderate file system load. it also compared with
other hash functions in terms of average find_buffer() iterations per
lookup request. the size distribution for this hash degenerated more
quickly than the distribution for the multiplication hash.
there are two places where this style of hash function falls down when
compared to the others.
1. random table size is huge. even for a 24bit to 12bit hash, the two
random tables are 2 bytes * 2 ^ 12 buckets (2 pages) each. the pair of
tables are as large as the L1 cache on some CPUs. this is not very useful
data to have chewing up CPU caches and memory bandwidth; not to mention
there are 4 useless bits in every entry (12 bits stored in a unsigned
short).
2. hash table size is inflexible. in essence, you have to maintain a
pair of random tables for each hash table size you want to use.
3. also hash tables can't be practically increased large enough for the
number of objects we want to hash. for the buffer hash, we'd like to hash
a 32bit block number to 14 or more bits. every doubling of hash table
size means a significant decrease in average hash chain length. as well,
since even good hash functions generally only hit about half the buckets,
we want the hash table larger still (this also why one should take any
mathematical claims about the goodness of a hash function with a shaker of
salt).
a 14 bit random number table would consume 8 pages (16 total for both). if
you want 15 bits or even 16 (the size of the buffer hash table in stock
2.2.5 kernels), you're talking about an additional 64 pages total of
memory (but at least there are no wasted bits for 16 bit random tables
stored in an array of short ints!).
would you want something as useless as *random* *data* cluttering the
physical memory of a 68030class machine? while memory accesses are
comparatively cheap on such hardware, physical RAM is usually at a
premium. thus, i would try a hash function, such as one that uses
multiplication, if it:
I. allows me to avoid the use of large random tables that would be hit
pretty hard if used by all the hash functions in the kernel
b. allows me to reduce the number of buckets in my hash tables by making
use of more buckets
i'm sure this is more than anyone cares to know... but i'm obsessive.
has anyone tried using a multiplicative hash on a 68030 Linux build to see
if it impacts performance significantly? i think its arguable whether 68K
Linux needs the scalability of large hash tables and a good multiplicative
hash, anyway. as alan pointed out, it is possible (although more
complicated) to use an architecturespecific hash function for these
hashes.
Date: Sun, 11 Apr 1999 22:52:39 0500
From: Paul F. Dietz
To: Chuck Lever , linuxkernel@vger.rutgers.edu
Subject: Re: more on hash functions
Chuck Lever wrote:
[ comments on random table driven hash function ]
Memory use can be reduced by using more than two
tables. If we divide the 24 bit key into 4 6 bit
fields, the total table size is 512 bytes (assuming
each table entry is 2 bytes). This doubles the
number of loads, granted, and increases the shift/mask
work. Divided into 3 8 bit fields (that should be easy),
the memory use is 1.5K bytes.
The objection that you need more random tables
for different hash table sizes is not right.
Just mask off the irrelevant high order bits
of the hash value. For tables containing twobyte
random values, this supports hash tables of size
up to 2^16.
If the multiplicative method works better, it's
probably because the values being hashed have
nonrandomness, such as keys falling into arithmetic
progressions.
I'd be very careful about populating the entries
of the random tables  be sure the values chosen
really are random. If the low order bit is not
random, for example, you may lose half the hash
table.
I did not find a comment in Knuth vol. 3 (1st ed.)
dismissing this idea. Is it in the 2nd ed.? What did
he say? Universal classes of hash functions had not
been discovered at the time the 1st ed. was written.
Paul
Date: Mon, 12 Apr 1999 17:28:49 0500
From: Paul F. Dietz
To: Chuck Lever , linuxkernel@vger.rutgers.edu
Subject: Re: more on hash functions
Chuck Lever wrote:
> actually, this was my first implementation choice. i found that the hash
> function didn't work very well with 3 8bit tables  it had poor
> bucket size distribution characteristics. how does one combine 3 8bit
> values appropriately to get, say, a 14bit table index?
[...]
> however, i think we're making different assumptions about what goes in the
> random table. i assumed they were random numbers from 0 to table_size,
> whereas some of your comments indicate you assume that the random table is
> filled with random numbers from 0 to 2^16.
Whoa! The size of the *random* tables is not related
to the size of the *hash* table. Why should it be?
The sizes of the random tables depend on the number of
bits in each of the key fragments.
As I said in the original message, the values stored
in the random tables are in the range 0..PAGE_HASH_SIZE1.
If you've been picking only smaller values, no wonder the
hash has been performing poorly!
You were earlier perfectly correct that this kind of scheme
has a large cache footprint. On a modern processor with
a fast multiply, it would not be competitive.
Paul
Date: Mon, 12 Apr 1999 15:27:03 0400 (EDT)
From: Chuck Lever
To: Paul F. Dietz
Cc: linuxkernel@vger.rutgers.edu
Subject: Re: more on hash functions
On Sun, 11 Apr 1999, Paul F. Dietz wrote:
> Memory use can be reduced by using more than two
> tables. If we divide the 24 bit key into 4 6 bit
> fields, the total table size is 512 bytes (assuming
> each table entry is 2 bytes). This doubles the
> number of loads, granted, and increases the shift/mask
> work. Divided into 3 8 bit fields (that should be easy),
> the memory use is 1.5K bytes.
actually, this was my first implementation choice. i found that the hash
function didn't work very well with 3 8bit tables  it had poor
bucket size distribution characteristics. how does one combine 3 8bit
values appropriately to get, say, a 14bit table index?
> The objection that you need more random tables
> for different hash table sizes is not right.
> Just mask off the irrelevant high order bits
> of the hash value. For tables containing twobyte
> random values, this supports hash tables of size
> up to 2^16.
> I'd be very careful about populating the entries
> of the random tables  be sure the values chosen
> really are random. If the low order bit is not
> random, for example, you may lose half the hash
> table.
to generate the table, i used random(), seeded with the lower 16 bits of
tv_sec from gettimeofday().
initialize all entries in random_table[] to 1
for (i=0; i I did not find a comment in Knuth vol. 3 (1st ed.)
> dismissing this idea. Is it in the 2nd ed.? What did
> he say? Universal classes of hash functions had not
> been discovered at the time the 1st ed. was written.
i'm at work now, so i have my copy handy  yes it's the second edition.
i was mistaken: his first example *is* tabledriven, but the table is used
to map keys to a random index, given that there are only a small
number of known keys, which is a different problem.
he cites a number of papers, some as late as 1997, but i don't see a
specific reference to universal hash functions.
 Chuck Lever
Date: Mon, 12 Apr 1999 21:10:22 0400 (EDT)
From: Chuck Lever
To: Paul F. Dietz
Cc: linuxkernel@vger.rutgers.edu
Subject: Re: more on hash functions
On Mon, 12 Apr 1999, Paul F. Dietz wrote:
> Whoa! The size of the *random* tables is not related
> to the size of the *hash* table. Why should it be?
> The sizes of the random tables depend on the number of
> bits in each of the key fragments.
right, i understood that part. the range of the *values* in the random
tables, and not the range of the random table's *index*, is dependent on
the hash table size.
> As I said in the original message, the values stored
> in the random tables are in the range 0..PAGE_HASH_SIZE1.
> If you've been picking only smaller values, no wonder the
> hash has been performing poorly!
well, i simply set the size of the hash table to be the same as the size
of the random tables. i think the only problem i caused myself was a
little inflexibility in hash table size, since the values in the
random tables were indeed random, and in the range 0...hash_table_size1,
as you (and rivest et al.) prescribed.
 Chuck Lever
Date: Tue, 13 Apr 1999 09:14:10 0400
From: Horst von Brand
To: Chuck Lever
Cc: Paul F. Dietz , linuxkernel@vger.rutgers.edu
Subject: Re: more on hash functions
Chuck Lever said:
[...]
> actually, this was my first implementation choice. i found that the hash
> function didn't work very well with 3 8bit tables  it had poor
> bucket size distribution characteristics. how does one combine 3 8bit
> values appropriately to get, say, a 14bit table index?
a ^ (b << 3) ^ (c << 6)
Date: Wed, 14 Apr 1999 10:50:21 0700
From: Iain McClatchie
To: cel@monkey.org, dietz@interaccess.com
Subject: Re: more on hash functions
Horst> a ^ (b << 3) ^ (c << 6)
Chuck> that's what i tried... it doesn't work as well as xoring these
Chuck> values together *without* the shifts. i was wondering if there
Chuck> was a better way to do it.
Point 1

Chuck, have you tried it this way?
Your random table should be 256 entries of 32 bits each. You could
have 16bit entries if you know that the hash table will never have
more than 64K entries (which I suppose is likely...)
So then:
short random_table[256];
hash_rec_t *chain_head;
tablebits = tablesize  1;
hash_fn = random_table[ key & 0xff ] ^
random_table[ (key >> 8) & 0xff ] ^
random_table[ (key >> 16) & 0xff ];
chain_head = hash_table[ hash_fn & tablebits ];
Point 2

Paul, you seem to suggest using seperate random tables for each
of the key load lookups. I can't see why that would be necessary.
Why is it that one table will not do? Are byte permutations of
the keys really frequent?
Point 3

This random_table actually seems pretty reasonable to me: the
random_table is 512 bytes, which is a small fraction of P6's 16KB L1
data cache. The four loads take four cycles' load slots, but P6
probably
uses a radix4 or radix8 integer multiplier, which would take 9 or 5
cycles latency anyway. (The FPU has the fully pipelined multiplier on
the P6. I think Alphas actually have fully pipelined integer
multipliers.)
I appreciate this discussion you're having... I'm going to tune up the
hash package in my BDD code now.
Iain
iain@mcclatchie.com
P.S. Feel free to quote anything here to anyone. I sent a private
message because I can't figure out how to stick an InReplyTo: field
in the header of a new Netscape 4.51 mail message, so I can't get my
stuff to thread on linuxkernel.
Date: Wed, 14 Apr 1999 15:19:19 0700
From: Iain McClatchie
To: Chuck Lever
Subject: Re: more on hash functions
Chuck> my random tables were constructed such that, for a 256 entry
Chuck> random table, the table contained unique random values from 0
Chuck> to 255. making these 16 bit numbers would only add some hash
Chuck> tablesize flexibility, but i don't believe it would help the
Chuck> hash function generate better randomness.
I suspect using 16bit numbers will change your results completely.
If you could collect a stream of (unhashed) keys to the hash table,
I could run an experiment here to show you that 16bit random numbers
are fundamentally doing a different (and better) operation than 8bit
numbers. Other than that, I can't recreate your setup here. And of
course, it does you no good for me to create my own stream of madeup
keys, since the whole point in a hash function is to try to remove
any systematic bias in a particular set of keys.
But I still think that the goal that you are pursuing is a good one,
and I'd still like to help. So I can try to persuade you to try an
experiment with a wider random table, that can be compared to an
experiment with a narrower random table.
One way of thinking about your hash function is to determine how many
bits of the key affect each bit of the hash table index. One nice
thing about a multiplicative hash is that each index bit is affected
by every key bit. With your 8bit random table, each bit of the key
affects at most 8 bits of the index. By using shifts to combine the
random table lookups together, you will inevitably have at least 6
bits of a 14bit index which are affected by only 8 bits of the key.
Iain
iain@mcclatchie.com
Date: Wed, 14 Apr 1999 19:54:05 0700
From: Iain McClatchie
To: Chuck Lever
Cc: linuxkernel@vger.rutgers.edu
Subject: Re: more on hash functions
Iain> This random_table actually seems pretty reasonable to me: the
Iain> random_table is 512 bytes, which is a small fraction of P6's 16KB
Iain> L1 data cache. The four loads take four cycles' load slots, but
Iain> P6 probably uses a radix4 or radix8 integer multiplier, which
Iain> would take 9 or 5 cycles latency anyway.
I take it back. I tried putting random table hash functions into my
BDD code, and got slower results than the existing multiplicative
hashes. It appears that the bucket distribution didn't improve much
at all, and as you point out, computing the hash function is slow.
The loads don't appear to be the problem: most of the work is in
unpacking the bytes, which turns out to be more work than the multiply.
Bottom line: random table hash function is slower than multiplicative
hash function, bucket distribution is no better, so its a loss.
Sorry for the distraction.
Iain
Date: Wed, 14 Apr 1999 21:06:53 0500
From: Paul F. Dietz
To: Chuck Lever , linuxkernel@vger.rutgers.edu
Subject: Re: more on hash functions
Chuck Lever wrote:
> my random tables were constructed such that, for a 256 entry random table,
> the table contained unique random values from 0 to 255. making these 16
> bit numbers would only add some hash tablesize flexibility, but i don't
> believe it would help the hash function generate better randomness.
If the hash table itself has more than 256 entries, then it
would certainly produce better randomness. The small random
table entries would mean you are only hashing to the low
end of the hash table.
In Linux x86, the size of the hash table is 2048 (1 << PAGE_HASH_BITS),
so your hash function, as implemented, would be very bad.
Paul
Date: Wed, 14 Apr 1999 22:15:36 0500
From: Paul F. Dietz
To: Chuck Lever , linuxkernel@vger.rutgers.edu
Subject: Re: more on hash functions
Chuck Lever wrote:
> #define FIRSTBYTE(x) (0xff & (x))
> #define SECNDBYTE(x) (0xff & ((x)>>8))
> #define THIRDBYTE(x) (0xff & ((x)>>16))
>
> #define hashfn(key) ( (rand1[FIRSTBYTE(key)] << 4) ^ \
> (rand2[SECNDBYTE(key)] << 2) ^ \
> (rand1[THIRDBYTE(key)]) )
>
> to generate a 12bit hash table index. this way i don't have to mask off
> extra bits as the final step, and i can use daintily sized random tables.
This is also clearly bad. Suppose you're hashing keys that
have a common third byte? The low two bits of your hash
values will be constant.
You really have to make every bit of the result depend on
every bit in the key.
Paul
Date: Wed, 14 Apr 1999 17:02:50 0400 (EDT)
From: Chuck Lever
To: Iain McClatchie
Cc: linuxkernel@vger.rutgers.edu, dietz@interaccess.com
Subject: Re: more on hash functions
On Wed, 14 Apr 1999, Iain McClatchie wrote:
> Point 1
> 
> Chuck, have you tried it this way?
>
> Your random table should be 256 entries of 32 bits each. You could
> have 16bit entries if you know that the hash table will never have
> more than 64K entries (which I suppose is likely...)
>
> So then:
>
> short random_table[256];
> hash_rec_t *chain_head;
>
> tablebits = tablesize  1;
> hash_fn = random_table[ key & 0xff ] ^
> random_table[ (key >> 8) & 0xff ] ^
> random_table[ (key >> 16) & 0xff ];
> chain_head = hash_table[ hash_fn & tablebits ];
this is close to the way that i originally implemented it, although i
tried with two tables (first and third byte indexed one table, middle byte
indexed the second table), and with three tables.
my random tables were constructed such that, for a 256 entry random table,
the table contained unique random values from 0 to 255. making these 16
bit numbers would only add some hash tablesize flexibility, but i don't
believe it would help the hash function generate better randomness.
> Point 2
> 
> Paul, you seem to suggest using seperate random tables for each
> of the key load lookups. I can't see why that would be necessary.
> Why is it that one table will not do? Are byte permutations of
> the keys really frequent?
i didn't find it to be real critical for the real data i was hashing.
> Point 3
> 
> This random_table actually seems pretty reasonable to me: the
> random_table is 512 bytes, which is a small fraction of P6's 16KB L1
> data cache. The four loads take four cycles' load slots, but P6
> probably
> uses a radix4 or radix8 integer multiplier, which would take 9 or 5
> cycles latency anyway. (The FPU has the fully pipelined multiplier on
> the P6. I think Alphas actually have fully pipelined integer
> multipliers.)
the multiplicative hash function i created uses 3 instructions in less
than 16 bytes, consumes a fixed number of CPU cycles, and uses no memory
bandwidth (except instruction fetches) to compute the hash value. my goal
was to reduce memory references, cache pollution and evictions, i.e.
making the hash function more memorybandwidth friendly (since this is of
prime concern on big iron), and making a reasonable bucket size
distribution to lower the average lookup time.
on a CPU of older architecture, i don't think the buffer hash will be a
significant performance bottleneck, so adding a single integer
multiplication in that path will probably not be an issue. if it is, use
the old hash function.
 Chuck Lever
Date: Wed, 14 Apr 1999 20:59:20 0500
From: Paul F. Dietz
To: Iain McClatchie , linuxkernel@vger.rutgers.edu
Subject: Re: more on hash functions
Iain McClatchie wrote:
> Point 2
> 
> Paul, you seem to suggest using separate random tables for each
> of the key load lookups. I can't see why that would be necessary.
> Why is it that one table will not do? Are byte permutations of
> the keys really frequent?
I don't know if they are frequent, so I can't say that
a single table would necessarily be good.
If you do reuse tables, it would probably be a good idea to
add rather than xor the keys, so that table[i]+table[i] is
less likely to equal table[j]+table[j] mod the hash table
size. You might also consider this:
hash = (table[byte0] + table[(byte1+r1) & 0xff] +
table[(byte2+r2) & 0xff]) mod PAGE_HASH_SIZE
where r1 and r2 are two preselected random numbers in
the range 0..255. This way, for any two distinct keys,
it becomes unlikely that they access the same set of table
locations (again, randomizing the hash function rather
than depending on the randomness of the key distribution.)
Paul
Date: Wed, 14 Apr 1999 23:35:33 0400 (EDT)
From: Chuck Lever
To: Paul F. Dietz
Cc: linuxkernel@vger.rutgers.edu
Subject: Re: more on hash functions
On Wed, 14 Apr 1999, Paul F. Dietz wrote:
> > my random tables were constructed such that, for a 256 entry random table,
> > the table contained unique random values from 0 to 255. making these 16
> > bit numbers would only add some hash tablesize flexibility, but i don't
> > believe it would help the hash function generate better randomness.
>
> If the hash table itself has more than 256 entries, then it
> would certainly produce better randomness. The small random
> table entries would mean you are only hashing to the low
> end of the hash table.
i was using the random tables to mix up the bits in each byte, like so:
unsigned char rand1[256] = { ... };
unsigned char rand2[256] = { ... };
#define FIRSTBYTE(x) (0xff & (x))
#define SECNDBYTE(x) (0xff & ((x)>>8))
#define THIRDBYTE(x) (0xff & ((x)>>16))
#define hashfn(key) ( (rand1[FIRSTBYTE(key)] << 4) ^ \
(rand2[SECNDBYTE(key)] << 2) ^ \
(rand1[THIRDBYTE(key)]) )
to generate a 12bit hash table index. this way i don't have to mask off
extra bits as the final step, and i can use daintily sized random tables.
> In Linux x86, the size of the hash table is 2048 (1 << PAGE_HASH_BITS),
> so your hash function, as implemented, would be very bad.
the page cache hash function, btw, already works very well, and i don't
believe replacing it with any other kind of function will help. the only
hash function we should be concerned with improving is the buffer cache
hash function: see fs/buffer.c, ll. 42530.
i reread Knuth over dinner last night. unfortunately, he is discussing
hashes where you store data elements right in the hash table, and can then
easily do linear probing [i heard that, you in the second row]. i can't
think of a good way to apply his double hashing algorithm (algorithm D, p.
528) or Brent's variation (p. 532) to help improve the hash function
without using a multiplicative hash.
 Chuck Lever
Date: Thu, 15 Apr 1999 14:47:39 +0100 (BST)
From: Stephen C. Tweedie
To: Chuck Lever
Cc: Paul F. Dietz , linuxkernel@vger.rutgers.edu
Subject: Re: more on hash functions
Hi,
On Wed, 14 Apr 1999 23:35:33 0400 (EDT), Chuck Lever
said:
> unsigned char rand1[256] = { ... };
> unsigned char rand2[256] = { ... };
> #define FIRSTBYTE(x) (0xff & (x))
> #define SECNDBYTE(x) (0xff & ((x)>>8))
> #define THIRDBYTE(x) (0xff & ((x)>>16))
> #define hashfn(key) ( (rand1[FIRSTBYTE(key)] << 4) ^ \
> (rand2[SECNDBYTE(key)] << 2) ^ \
> (rand1[THIRDBYTE(key)]) )
> to generate a 12bit hash table index. this way i don't have to mask off
> extra bits as the final step, and i can use daintily sized random tables.
This is a bad way to go about things. You really do want all of the
input data to be significant in all bits of the final hash function.
The way this is usually achieved is to use full 32bit wide random
tables, and to xor them together with full precision all the way
through the hash computation. That gives you a final hash value where
every result bit is a nearlyrandom function of every input bit, and
you can truncate the result to any precision you need. You still only
need 256entry tables, of course.
unsigned int random[256];
static inline unsigned int hash(char key[3])
{
return random[key[0]] ^ random[key[1]] ^ random[key[2]];
}
If you want the final hash to be a fully random function of every bit
in the input data, you neeed to use separate 32bitwide 256entry
random tables for each byte of the input. You'd have to profile it to
see if that really mattered in this case: you don't want to increase
the cache profile of the random table any more than necessary.
Cheers,
Stephen.
Date: Thu, 15 Apr 1999 08:33:51 0400
From: Horst von Brand
To: Paul F. Dietz
Cc: Iain McClatchie , linuxkernel@vger.rutgers.edu
Subject: Re: more on hash functions
"Paul F. Dietz" said:
> Iain McClatchie wrote:
> > Point 2
> > 
> > Paul, you seem to suggest using separate random tables for each
> > of the key load lookups. I can't see why that would be necessary.
> > Why is it that one table will not do? Are byte permutations of
> > the keys really frequent?
> I don't know if they are frequent, so I can't say that
> a single table would necessarily be good.
If you worry about that, shift (or roll) the table entries you get by a few
bits each time.
> If you do reuse tables, it would probably be a good idea to
> add rather than xor the keys, so that table[i]+table[i] is
> less likely to equal table[j]+table[j] mod the hash table
> size. You might also consider this:
>
> hash = (table[byte0] + table[(byte1+r1) & 0xff] +
> table[(byte2+r2) & 0xff]) mod PAGE_HASH_SIZE
hash = (table[byte0] ^ (table[byte1] >> 3) ^ (table[byte2] >> 6)
& ~(PAGE_HASH_SIZE  1)
This gives you in essence bigger tables at probably less cost

Dr. Horst H. von Brand mailto:vonbrand@inf.utfsm.cl
Date: Thu, 15 Apr 1999 18:41:21 +0100 (BST)
From: Alan Cox
To: Chuck Lever
Cc: dietz@interaccess.com, linuxkernel@vger.rutgers.edu
Subject: Re: more on hash functions
> in theory that's true. however, i also tried a pair of 10 (or 12) bit
> random tables xored together without shifting, and the result wasn't much
> better. i think you can get away with this if your input data allows it.
Tables that big start to cost you in cache misses what you saved in
multiplies
Date: Thu, 15 Apr 1999 16:51:58 +0200
From: Peter Steiner
To: linuxkernel@vger.rutgers.edu
Subject: Re: more on hash functions
>You really have to make every bit of the result depend on
>every bit in the key.
What about:
unsigned int rand1[256];
#define HASH1(x) ( ((x)<<8)^rand1[(x)>>24] )
#define hashfn(key) ( { int hashx=HASH1((key)<<8);\
hashx=HASH1(hashx);\
HASH1(hashx);\
} )
And to initialize the table:
>
void genrand(void) {
unsigned int i,j,k;
for(i=0;i<256;i++){
k=i<<16;
for(j=0;j<8;j++) {
if (k<0x00800000)
k<<=1;
else
k=((k<<1)&0xFFFFFF)^0x14A445;
}
k+=i<<24;
rand1[i]=k;
}
}
>
The key may be up to 24 bits. The result looks like this:
movl %edx,%eax
sall $8,%eax
sall $16,%edx
sarl $24,%eax
xorl rand1(,%eax,4),%edx
movl %edx,%eax
sall $8,%edx
sarl $24,%eax
xorl rand1(,%eax,4),%edx
movl %edx,%eax
sall $8,%edx
sarl $24,%eax
xorl rand1(,%eax,4),%edx
movl %edx,%eax
Is that still fast enough?
Peter
Date: Thu, 15 Apr 1999 21:36:11 +0200
From: Peter Steiner
To: linuxkernel@vger.rutgers.edu
Subject: Re: more on hash functions
>What about:
>#define HASH1(x) ( ((x)<<8)^rand1[(x)>>24] )
>#define hashfn(key) ( { int hashx=HASH1((key)<<8);\
> hashx=HASH1(hashx);\
> HASH1(hashx);\
> } )
Arrrgl... Sorry, this was wrong... I hope it's correct this time. It
should be a 24bit CRC generator.
#define HASH1(x) ( ((x)<<8)^rand1[(x)>>16] )
#define hashfn(key) ( { unsigned int hashx=HASH1(key & 0xFFFFFF);\
hashx=HASH1(hashx);\
HASH1(hashx);\
} )
Peter
Date: Thu, 15 Apr 1999 11:57:34 0400 (EDT)
From: Chuck Lever
To: Paul F. Dietz
Cc: linuxkernel@vger.rutgers.edu
Subject: Re: more on hash functions
On Wed, 14 Apr 1999, Paul F. Dietz wrote:
> > #define FIRSTBYTE(x) (0xff & (x))
> > #define SECNDBYTE(x) (0xff & ((x)>>8))
> > #define THIRDBYTE(x) (0xff & ((x)>>16))
> >
> > #define hashfn(key) ( (rand1[FIRSTBYTE(key)] << 4) ^ \
> > (rand2[SECNDBYTE(key)] << 2) ^ \
> > (rand1[THIRDBYTE(key)]) )
> >
> > to generate a 12bit hash table index. this way i don't have to mask off
> > extra bits as the final step, and i can use daintily sized random tables.
>
> This is also clearly bad. Suppose you're hashing keys that
> have a common third byte? The low two bits of your hash
> values will be constant.
>
> You really have to make every bit of the result depend on
> every bit in the key.
in theory that's true. however, i also tried a pair of 10 (or 12) bit
random tables xored together without shifting, and the result wasn't much
better. i think you can get away with this if your input data allows it.
but again, using small random tables is to be desired. can you think of a
way to make a random tabledriven hash function work with an absolute
minimum amount of memory references?
 Chuck Lever
Date: Thu, 15 Apr 1999 17:49:23 0400 (EDT)
From: Chuck Lever
To: linuxkernel@vger.rutgers.edu
Subject: random tabledriven hash benchmarked
ok, i retried the random tabledriven hash function in the buffer cache.
here's my hash function, derived from the good discussion on the list:
#define BYTE1(x) (0xff & (x))
#define BYTE2(x) (0xff & (((x) >> 8) + 67))
#define BYTE3(x) (0xff & (((x) >> 16) + 131))
#define BYTE4(x) (0xff & (((x) >> 24) + 197))
#define _hashfn(dev,block) \
( random_table[BYTE1(block)] + \
random_table[BYTE2(block)] + \
random_table[BYTE3(block)] + \
random_table[BYTE4(block)] ) & bh_hash_mask
random_table is an array of 256 unsigned shorts, all random, ranging
between 0 and 65535.
the random tabledriven hash function features:
1. a single random table that takes up only 512 bytes
2. the ability to address a hash table up to 65536 buckets flexibly
(just vary the bits in the bh_hash_mask mask)
3. mapping all 32 bits of input into the hash table index
4. no multiplication or division
here's the multiplicative hash, for comparison:
/* this is 40499 * 65543  both are prime; the result is ~0.68*(2^32) */
#define MULTIPLIER 2654425957UL
#define _hashfn(dev,block) \
((((unsigned long) (block) * MULTIPLIER) >> 11) & bh_hash_mask)
i benchmarked this against a kernel that used my multiplicative hash on an
AMD K6 at 233Mhz with 64M and a 9G Seagate IDE drive. the results below
are 5run average throughput values in "scripts per hour", with standard
deviation listed afterwards. the offered load varies from 8 concurrent
scripts to 32 concurrent scripts.
8 scripts 16 scripts 24 scripts 32 scripts
mult: 1088.5 s=4.38 1051.8 s=6.59 1019.7 s=6.79 986.7 s=8.56
rand: 1084.7 s=1.16 1039.7 s=14.99 1004.8 s=6.86 987.2 s=5.47
both runs used a 16384 bucket hash table, and averaged around 0.1
find_buffer() iterations per lookup, during a hit rate of about 78%.
the hash table bucket size distributions looked about the same for both
hash functions.
the way i read the results, using the tabledriven algorithm above isn't a
win over multiplicative hashing.
i believe that the tabledriven hash function itself is expensive to
calculate on any architecture. it requires 25 instructions in 92 bytes on
ia32, not to mention all the memory references and register spillage
required to calculate a result. and hopefully the compiler has
rescheduled the instructions to avoid address generation interlock;
otherwise we can expect a few pipeline stalls during the computation.
 Chuck Lever
Date: Thu, 15 Apr 1999 15:01:54 0700
From: Iain McClatchie
To: Paul F. Dietz
Cc: linuxkernel@vger.rutgers.edu
Subject: Re: more on hash functions
I got a few suggestions about how to use multiple lookups with a
single table. All the suggestions make the hash function itself
slower, and attempt to fix an issue  hash distribution  that
doesn't appear to be a problem. I thought I should explain why
the table lookup function is slow.
A multiplication has a scheduling latency of either 5 or 9 cycles on a
P6. Four memory accesses take four cycles on that same P6. So the core
operations for the two hash function are actually very similar in delay,
and the table lookup appears to have a slight edge. The difference is
in the overhead.
A multiplicative hash, at minimum, requires the loading of a constant,
a multiplication, and a shift. Egcs actually transforms some constant
multiplications into a sequence of shifts and adds which may have
shorter latency, but essentially, the shift (and nothing else) goes in
series with the multiplication and as a result the hash function has
very little latency overhead.
A table lookup hash spends quite a lot of time unpacking the bytes
from the key, and furthurmore uses a load slot to unpack each byte.
This makes for 8 load slots, which take 1 cycle each. Even if
fully parallelized with unpacking, we end up with a fair bit of
latency. Worse yet, egcs runs out of registers and ends up shifting
the key value in place on the stack twice, which gobbles two load and
two store slots.
Bottom line: CPUs really suck at bitshuffling and even byteshuffling.
If there is some clever way to code the byte unpacking in the table
lookup hash function, perhaps using the x86's trick register file,
it might end up faster than the multiplicative hash.
Iain
Date: Thu, 15 Apr 1999 21:47:52 0500
From: Paul F. Dietz
To: Alan Cox , linuxkernel@vger.rutgers.edu
Subject: Re: more on hash functions
Alan Cox wrote:
> Tables that big start to cost you in cache misses what you saved in
> multiplies
In all fairness, the table driven approach was proposed
for processors with slow multiplies, and these would
probably also have less of a CPU clock/main memory
latency mismatch, making cache pollution perhaps less
important.
Using the tables on typical desktop processor wouldn't
make much sense, I think.
Paul
This document was written as part of the Linux Scalability Project.
For more information, see
our home page.
If you have comments or suggestions, email
linuxscalability@umich.edu