Hash Tables


I have a closed hash table with open addressing and linear probing for a dictionary of the words in a text;

The input operation works like this:

1) The program asks for the first line of the text (with a limit of 100 characters). At this point I split the line and add its words to the hash table; then it asks for the 2nd line... and it keeps asking for lines until it finds a line containing $STOP.

2) After $STOP, the program asks for words (strings) to be searched in the hash table; the result of each search is the number(s) of the line(s) where the word occurs in the text (see the sketch after the example below);

//EXAMPLE:
$START
o mundo era bonito demais
o mundo jah foi muito belo 
$STOP
mundo
bonito
//will return:
mundo: 1 2
bonito: 1
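
A minimal sketch of that flow, assuming hypothetical helpers table_insert and table_query (placeholder names standing in for the real hash table operations, not actual code from this question):

#include <stdio.h>
#include <string.h>

/* Placeholder hash table operations -- hypothetical names, to be replaced
 * by the real open-addressing implementation. */
static void table_insert(const char *word, int line_number) {
    (void)word; (void)line_number;      /* real version: insert (word, line) into the table */
}
static void table_query(const char *word) {
    printf("%s:\n", word);              /* real version: print the line numbers where word occurs */
}

int main(void) {
    char line[101];                     /* lines are limited to 100 characters */
    int line_number = 0;

    /* Phase 1: read lines until $STOP, inserting every word with its line number. */
    while (fgets(line, sizeof line, stdin) != NULL) {
        if (strncmp(line, "$STOP", 5) == 0)
            break;
        if (strncmp(line, "$START", 6) == 0)
            continue;
        line_number++;
        for (char *w = strtok(line, " \t\n"); w != NULL; w = strtok(NULL, " \t\n"))
            table_insert(w, line_number);
    }

    /* Phase 2: each remaining line is a word to be searched in the table. */
    while (fgets(line, sizeof line, stdin) != NULL) {
        line[strcspn(line, "\n")] = '\0';
        if (line[0] != '\0')
            table_query(line);
    }
    return 0;
}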

Considering that the number of distinct words (keys) can vary from 1 to n (10000, for example), how should I manage the size of my table? Should I declare a table vector[10001]?

Wouldn't that waste a lot of memory?

How should I deal with this problem for a closed hash table with open addressing and linear probing?

    
asked by anonymous 27.10.2014 / 16:59

1 answer


A very important thing to keep in mind when using hash tables, especially with open addressing, is the load factor: the ratio between the number of distinct keys in the table and the table's maximum capacity. If the load factor is too low you will be wasting memory on a hash table that is too sparse. On the other hand, if the load factor is very high, you become very prone to collisions of your hash function and to the formation of clusters in your table (long runs of occupied indexes, which also mean long linear probes...). For example, imagine what happens to a table of capacity 10001 with 10000 keys in use, like the one you proposed in the question: it will have 10000 consecutive occupied indexes and only one free index left. When trying to insert the 10001st element, a collision of the hash function is almost certain, and on average you will have to probe about 5000 indexes until you reach the empty bucket!
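
To make this concrete, here is a minimal sketch in C (the struct and function names are invented for illustration, they are not part of the answer) showing how the load factor is computed and how a linear-probing insertion walks over a cluster:

#include <stdlib.h>
#include <string.h>

/* Illustrative open-addressing table; all names are invented for this sketch. */
typedef struct {
    char   **slots;      /* NULL means the bucket is free   */
    size_t   capacity;   /* total number of buckets         */
    size_t   used;       /* number of distinct keys stored  */
} Table;

double load_factor(const Table *t) {
    return (double)t->used / (double)t->capacity;
}

/* Insert with linear probing: start at hash(key) % capacity and walk forward
 * one bucket at a time until a free bucket is found.  Inside a cluster of
 * occupied buckets this loop runs once per occupied slot, which is why long
 * clusters make insertion and search slow.  Assumes used < capacity, so the
 * table is never completely full. */
void insert(Table *t, const char *key, size_t (*hash)(const char *)) {
    size_t i = hash(key) % t->capacity;
    while (t->slots[i] != NULL) {
        if (strcmp(t->slots[i], key) == 0)
            return;                     /* key already present */
        i = (i + 1) % t->capacity;      /* linear probe: next bucket */
    }
    t->slots[i] = strdup(key);
    t->used++;
}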

Because of all this, the ideal is for your load factor to be neither too big nor too small. Wikipedia says that if you are using linear probing it is good to avoid load factors above 70% or 80% (but I could not find the source for that specific number). Of course, this all depends on the hash function you are using and on the text you use as input. I suggest you run tests with different capacities and see the results.

Otherwise, if you do not know a priori the capacity your table will need, one possibility is to resize the table dynamically. Every time the load factor exceeds a pre-set limit, you malloc a new table with double the capacity and transfer all the keys from the old table to the new one.
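
A sketch of that resizing step, continuing the illustrative Table sketch above (the 0.75 threshold and the helper names are assumptions for the example, not something fixed by the answer):

#define MAX_LOAD_FACTOR 0.75   /* assumed threshold, tune experimentally */

/* Grow the table when it gets too full: allocate a new slot array with double
 * the capacity and re-insert every existing key (the keys land in new
 * positions because the index depends on the capacity). */
void maybe_grow(Table *t, size_t (*hash)(const char *)) {
    if (load_factor(t) <= MAX_LOAD_FACTOR)
        return;

    char   **old_slots    = t->slots;
    size_t   old_capacity = t->capacity;

    t->capacity = old_capacity * 2;                 /* multiply, don't add */
    t->slots    = calloc(t->capacity, sizeof *t->slots);
    t->used     = 0;

    for (size_t i = 0; i < old_capacity; i++) {
        if (old_slots[i] != NULL) {
            insert(t, old_slots[i], hash);          /* rehash into the new table */
            free(old_slots[i]);
        }
    }
    free(old_slots);
}

In practice you would call maybe_grow from insert itself, before probing, so the load factor never exceeds the limit.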

What matters here is to always multiply the capacity by a constant greater than 1 (so the capacity grows exponentially) instead of increasing it by a fixed number of elements (so the capacity grows linearly). This makes the amortized cost of resizing asymptotically constant. The bigger the scaling factor, the less time you will spend resizing, but in return your tables will use more memory.
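
For a rough illustration with concrete numbers (not from the answer): starting at capacity 1 and doubling, inserting n keys triggers resizes that copy 1 + 2 + 4 + ... + n/2 < n elements in total, i.e. O(1) amortized per insertion; growing by a fixed step of k elements instead triggers resizes that copy k + 2k + 3k + ... + n ≈ n²/(2k) elements, i.e. O(n) amortized per insertion.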

    
27.10.2014 / 17:50