Before we begin, everyone should read up on hashtables and what open addressing / closed hashing is. The context is lines 111‥190 of Python’s Objects/dictobject.c as of today (so we get the line numbers straight).
(I’ve reworded this wlog entry a bit; I originally wrote it too late at
night for it to read coherent.) Basically, I’ve got an application where
I’d like to use a hashtable for a number of things – not as generic as
Python, and with focus on small footprint. I’d like to offer associative
arrays in a scripting language, where the keys are always arbitrary byte
strings excluding NUL. Also, I’d like to use the hashtable as backend for
indexed arrays, where the keys are uint32_t and the usual use
case is sequential. Finally, I’m using it for several internal tables,
such as a list of keywords, one of builtins, one of special variables,
etc. which is a reason for me to not use a self-balancing binary tree
as data structure (reading further below might suggest that, but getting
a sorted list of hashtable keys is not the focus, though not unimportant).
My questions on this are:
① Why is the shift on perturb done after its first use? In my experiments (using 32-bit width everywhere), for the pathological case of an 8-element (i = 3) table with three entries 0, 0x40000000 and 0x800000000, the “second round” yields 1 for all three, so it cannot have to do with the upper bits. My lookup looks like:
mask = 2ⁱ - 1; j = perturb = hash(key); goto find_first_slot; find_next_slot: j = (j << 2) + j + perturb + 1; perturb >>= PERTURB_SHIFT; /* FALLTHROUGH */ find_first_slot: entry = table[j & mask]; if (!match(entry)) goto find_next_empty_slot;
This means that my first check is always the bare hash (so “only do it if needed” is no reason) and, since I’m using gotos, I could just move the perturb >>= PERTURB_SHIFT; line before the line recalculating the next j to use. This seems to make more sense, even in the face of Python. (I actually looked at the Python file’s comments again today because I thought to use a different resolution, but they have a good rationale for using the multiplication by 5.)
② Why can’t we just use i as the PERTURB_SHIFT? Sure, this changes a shift-right by a constant, which can possibly be encoded as immediate value in assembly (unless you’re on a pre-80186, which can only do SHR AX,1 and SHR AX,CL but not SHR AX,4, but that’s outside of mksh’s scope) into a right-shift by a variable, but i is already known, and I think the behaviour is better (it wouldn’t eat any bits; assume the same 8-entry hashtable and pathologic keys 0, 8 and 16). Again: who do I think I am to go against the wisdom of the Python people, who seem to have shed more thought on this than everyone else I saw, asked, read about (including Spammipedia). That’s why I’m asking here. On that reference: I don’t support spammers or people nagging for donations or premium accounts, like Xing and Groundspeak/Geocaching.COM, at all. In fact, I urge others to do the same, so it really hurts them; it may be their business model, but not if they spam me. Besides, OpenCaching.DE exists.
Another thing is: to avoid CVE-2011-4815, I’m randomising the hash used, with one “seed” value per hashtable, changed before a resize operation. I originally thought to seed it with nonzero, but then I have to rehash on hashtable resize, so I’ll be XORing the final hash value instead (thanks ciruZ for the idea). I’m thinking of omitting that for indexed arrays, as an attacker almost certainly cannot determine the keys there. (To directly use the indexed array keys, which are already uint32_t, as hashes makes using i from ② even more important.) The hash I’m using is a modified Jenkins one-at-a-time called NZAAT: it’s my new generic standard nōn-cryptographic hash, and the changes are thus: while adding a byte, another increment of the hash is done (so NUL counts), and the finaliser got prefixed with the shift-left-add+shift-right-xor sequence of the adder (but not adding any value or the +1), to get best avalanche for all bytes. I actually compiled several versions of Hash.cs on a Windows® VM at work to analyse the original one-at-a-time and all of my modifications; these turned out to be the simplest ones (I originally had added 0x100 instead of 1, but the effect was the same, and +1 is usually cheaper on most CPUs).
Also, to avoid people being able to get to the seed, a user will always get only a sorted list of hashtable keys (numeric for indexed arrays, ASCIIbetically otherwise; see also my thoughts on JSON from the previous wlog entry). What algorithm do I use? For strings, comparisons are much more expensive, so I’d like to keep them low. Memory use is also a factor; allocating one large(r) block is better than many small ones due to the pool allocator overhead and due to portability to ancient Unicēs (which is another reason for me to use a hashtable which is a small struct plus an array of pointers, and then pass the list of keys as array of string pointers, instead of a tree). For both reasons, I’m thinking a relatively simple MergeSort: I need to allocate the result array anyway, so I can just get two and free the one that isn’t the end result, and it’s AFAICT the cheapest on comparisons other than Tree Sort (which nobody really seems to use, and which would effect to using a balanced binary tree again). Since keys are unique, stability and duplicate handling is never an issue. I’d like to use only one algorithm and one data structure, not a combination, as compactness is a design goal.
Please drop your thoughts on Freenode, e.g. by /msg MemoServ send mirabilos your text here or per eMail to the domains debian, freewrt or mirbsd, which are organisations, with the localpart tg. Or just contact me as usual, if you’re already acquainted. Or lookup 0xE99007E0. Thanks in advance! (Especially, Python Developers’ thoughts are welcome.)