python - Best prime numbers to choose for a double hashed hash table size? -


what best prime numbers choose double hashed hash table size?

side info

  • the hash table part of word analysis project, markov models, training bots model , generate text if else write (which takes lot of words, sentences, transcripts, books... bigger corpus, better)
  • i'm not familiar of math around prime numbers read on guys propose , try go there

what have in mind:

  • the prime numbers shouldn't far/close each other ----> don't have increase size frequently, hash table doesn't end half empty (less collisions, looking ideal ratio between load factor , hash table size)
  • optimal big corpus - i'm not sure how big prime numbers have choose should be, never did before...
  • i thought of implementing function (not hash function) that'd double size of hash table , closest prime number ------> has running time of o(n) because prime divisible ____( have check whether numbers number that's double size of current hash table size have remainder other zero, increment size one/go next odd number , test whole loop again)________ ------> can imagine that slow better approach have fixed set of prime numbers million (just illustration purposes) or , use these size changes

thanks, additional questions appreciated

choose high of twin prime numbers, i. e. when p , p - 2 primes, choose p double hash capacity, because hash_code % (size - 2) secondary step function double hashing algorithm, , modulo prime number more "robust" modulo composite number (if size - 2 composite).

for small sizes (somewhere around 1000 or so) choose all primes, except low ones of twin pairs, because twin pairs rare in beginning of natural numbers scale, size predictability.

add sizes of 5 , 11 (though low in twin primes) better address small table sizes.

exclude numbers used in multiplication hash functions, in java 31 used in string hash function, don't know python.

all above coded in java runnable, lot of pre-generated table sizes (trying keep 0.005 max difference between neighbouring table sizes):

https://github.com/openhft/koloboke/blob/0498951705b45be2e1528afd786c03308c36e5dc/lib/impl/src/main/java/net/openhft/koloboke/collect/impl/hash/dhashcapacities.java#l255-l272

p. s. personal belief double hashing never optimal open addressing flavor, because of modulo operations disproportionately expensive in modern cpus. consider using qhash.


Comments

Popular posts from this blog

resizing Telegram inline keyboard -

command line - How can a Python program background itself? -

php - "cURL error 28: Resolving timed out" on Wordpress on Azure App Service on Linux -