Efficient Set Membership Testing with Hashtable?

  • Thread starter CRGreathouse
  • Start date
In summary, a Bloom filter is a data structure designed to determine if a given item is a member of a set with high probability. It uses a hash table to store the data, and a binary search to find the item.
  • #1
CRGreathouse
Science Advisor
Homework Helper
2,845
0
I'm looking for a fast data structure for determining set membership. Suppose I have 10,000 numbers ranging from 1 to 10 billion or so and I want a quick test to see if a given number n is 'probably' one of those 10,000 values. A hashtable seems similar to what I want, but as I have no value to associate with these numbers it doesn't seem quite right -- and I'm pretty sure I've seen a data structure designed for just this problem.

In particular, I'd like the structure, once initialized (it won't change thereafter), to return "true" if a given member is in the set with high probability, and "false" otherwise. 99.9% would quite suffice, and even lower probabilities may be acceptable.
 
Technology news on Phys.org
  • #2
Wait, maybe I just need to sort the list and use binary search. Thoughts?
 
  • #3
If you have a single permanent list at the start and don't have to add new numbers to it that would seem to make sense.
Depending on how much memory you have and the distribution of the numbers you might do better to have a hash as a prefilter.

You could hash based on the sum of the digits which would lead to a table saying wether any numbers in the llist have this sum, or you could have an array of say 16K entries which correspond to the most/least significant word of the number, and where each entry stores just a bool of wether any numbers with this starting word are in the list.

You would then do a search into the main list to see if there was an exact match, Knuth has whole chapters on how to do this efficently!
 
  • #4
What language are you using? C++ for example provides set data structures. Have you tried whatever your language provides? How did it perform?
 
  • #5
nmtim said:
What language are you using? C++ for example provides set data structures. Have you tried whatever your language provides? How did it perform?

At the moment I'm using C++, though without any optimizations -- I'm just trying to find good algorithms for this problem. The latest version of my program (still very much in development), in broad strokes:

  • Populate the list. At the moment the list is about 40 MB, with several million members.
  • Sort the list -- since the elements beyond the first few are pseudorandom with fair dispersion, I'm using the last element as a pivot. (If this was a bottleneck I might take the median of the last three members instead.)
  • Discard duplicates, freeing space at the end of the array.
  • Run the main section (four nested loops):
    • Check if a given branch can be ruled out by congruence conditions. If not,
    • Use a binary search on the large, pregenerated list to see if the member is present.

So at the moment I'm using no special library features -- just standard arrays and my own functions to manipulate data.
 
  • #6
mgb_phys said:
If you have a single permanent list at the start and don't have to add new numbers to it that would seem to make sense.
Depending on how much memory you have and the distribution of the numbers you might do better to have a hash as a prefilter.

I have plenty of memory, and processing power is currently more of a bottleneck than memory. The table is far too large to fit into any cache, so memory access is presumably pretty bad but I can't do much about that.

The numbers in the table/hash are generated modulo a large constant, so they're pseudorandom and essentially uniformly distributed.

mgb_phys said:
You could hash based on the sum of the digits which would lead to a table saying wether any numbers in the llist have this sum, or you could have an array of say 16K entries which correspond to the most/least significant word of the number, and where each entry stores just a bool of wether any numbers with this starting word are in the list.

Hmm, that sounds like a good idea. Just right shift and lookup, that's it?
 
  • #7
Yes, you don't need to be too picky about the hash function - you just want something where you have a reasonable discrimination between present and not present,
and produces a reasonable number of ouputs, too many and you are going to use more memory to store the hash table than the data, too few and you will always get hits.

Basically you are looking for some parameter in which the numbers aren't randomly distributed - so you have good discrimination.
The high word approach would be good if the numbers are clustered.
 
  • #8
Ah, I found it! The structure I was looking for that I had read about before is the Bloom filter. Has anyone used one of these, or something similar? It's a probabilistic set membership test -- if something fails it's certainly not a member, which is the property I want. I can test positives in a number of different ways, but as it is I haven't found one yet, so for the purpose of optimization they don't exist.

mgb_phys said:
Basically you are looking for some parameter in which the numbers aren't randomly distributed - so you have good discrimination.
The high word approach would be good if the numbers are clustered.

Obviously I'll have to test it, but I think it should work. I do know that the least significant bits aren't good -- the numbers all have the same parity and there are some other trends in the next two bits.
 
  • #9
You should actually check if a Bloom filter really is an improvement over just storing a boolean in each bucket of an ordinary hash table. (equivalently, to using k=1 in your Bloom filter) The Bloom filter's strength is in space optimization, at the cost of extra memory accesses and hash calculations. The only way that it leads to a time improvement is if it allows you to use a better memory access pattern. (e.g. if you can fit the hash table into your L1 cache)

For comparison, note that just storing a 2 gigabyte array of Booleans requires just one memory access and no hash calculuations per lookup. You can avoid the bit twiddling if you unpack it to a 10 or even 80 gigabyte array. The reason this is not a time advantage is because you never get any cache hits, and it requires disk accesses if you don't have enough real RAM.
 
Last edited:
  • #10
update - as Hurkyl mentioned, just storing an array is the quickest. You stated that the range of the numbers was 0 to 10 billion. Assuming these numbers are exact (integers), and you only need to know if a number is in the list or not, then you just need 10 billion bits, or 1.25GB's of ram to do the job, 2GB of ram total should be enough (Using task manager I see 1.57GB of free physical ram on my system with 2GB, so if the rest of the stuff can fit in 250MB of ram, it's enough).

If you needed a sort:
For the sorting, a merge sort is the quickest (fewer operations, lots of cache hits), and it's very easy to skip (delete) duplicates with a merge sort algorithm. In your case, the elements are small, so just sort the elements. If the elements were larger, then it would be better to sort pointers and do a single copy/move. The overhead of a merge sort is that is uses twice as much memory (although its just twice as much pointer memory for larger elements).
 
Last edited:
  • #11
Have you considered a sparse array?
 
  • #12
Jeff Reid said:
update - as Hurkyl mentioned, just storing an array is the quickest. You stated that the range of the numbers was 0 to 10 billion. Assuming these numbers are exact (integers), and you only need to know if a number is in the list or not, then you just need 10 billion bits, or 1.25GB's of ram to do the job, 2GB of ram total should be enough (Using task manager I see 1.57GB of free physical ram on my system with 2GB, so if the rest of the stuff can fit in 250MB of ram, it's enough).

Yes, I'm finding that arrays are working pretty well. I'm trying to convert to a more efficient form right now.
 
  • #13
Hurkyl said:
You should actually check if a Bloom filter really is an improvement over just storing a boolean in each bucket of an ordinary hash table. (equivalently, to using k=1 in your Bloom filter) The Bloom filter's strength is in space optimization, at the cost of extra memory accesses and hash calculations. The only way that it leads to a time improvement is if it allows you to use a better memory access pattern. (e.g. if you can fit the hash table into your L1 cache)

Of course I don't know which way the problem will go, so if the space requirements get large I'll have to make some other structure, possibly a Bloom filter if I can write an efficient implementation. Of course depending on how I do it a pair of bit arrays using different moduli may function better. Edit: but that's not too different from a Bloom filter in the first place...
 
  • #14
jim mcnamara said:
Have you considered a sparse array?

Well, that's an interesting thing. I can use a bitarray as long as the modulus, and waste bits, or I can make an array of 32-bit ints as long as my z-limit and waste processor power. At the moment I'm comparing the methods, but it seems so far that the less-sophisticated bit array is faster, and my program crashes (!) before I hit any kind of memory limit.
 
  • #15
CRGreathouse said:
Of course I don't know which way the problem will go, so if the space requirements get large I'll have to make some other structure, possibly a Bloom filter if I can write an efficient implementation. Of course depending on how I do it a pair of bit arrays using different moduli may function better. Edit: but that's not too different from a Bloom filter in the first place...
Right -- the two implementations are essentially identical. The Bloom filter would just make both lookups in the same array, rather than in two different arrays... and while the Bloom filter should have a better false positive rate, the array is so sparse that the effect should be negligible.

Incidentally, if you're serious about speed, you should avoid doing any sort of division: it's a slow operation. If your modulus is a power of 2, you can simply do a bitmask to compute the remainder... but you only get one hash function that way. To build more, you should try and make them out of bit operations, or maybe invoke a multiplication if necessary.
 
  • #16
Jeff Reid said:
If you needed a sort:
For the sorting, a merge sort is the quickest (fewer operations, lots of cache hits), and it's very easy to skip (delete) duplicates with a merge sort algorithm. In your case, the elements are small, so just sort the elements.
Better than quicksort?

Anyways, I'd imagine a radix sort ought to beat any comparison-based sort.

But really, if you only have a few thousand integers to sort, and only need to do it once, then even a bubble sort should be good enough!
 
  • #17
Hurkyl said:
Incidentally, if you're serious about speed, you should avoid doing any sort of division: it's a slow operation. If your modulus is a power of 2, you can simply do a bitmask to compute the remainder... but you only get one hash function that way. To build more, you should try and make them out of bit operations, or maybe invoke a multiplication if necessary.

In more than three-quarters of the cases I consider I use a special case that uses only boolean operations. The hard part that dominates runtime is the remainder -- the part I'm discussing here. So in effect I'm already using that filter, in a highly optimized fashion, and this is just looking for a good way to do the rest. Now I might use a shift and AND mask as a crude filter, working only because I believe the 'middle bits' to be well-dispersed. This would be something like (n >> 3) & 0xFFFFFF.
 

FAQ: Efficient Set Membership Testing with Hashtable?

What is a Hashtable for booleans?

A Hashtable for booleans is a data structure that stores a collection of key-value pairs, where the keys are unique boolean values and the values are associated with those keys. It allows for efficient retrieval of values based on their corresponding keys.

How is a Hashtable for booleans different from a regular Hashtable?

A Hashtable for booleans only allows for boolean values to be used as keys, whereas a regular Hashtable can use any type of object as keys. This makes the Hashtable for booleans more specialized and efficient for storing boolean values.

What are the advantages of using a Hashtable for booleans?

One advantage is that it allows for fast retrieval of values based on their corresponding keys, as the lookup time is constant regardless of the size of the Hashtable. It also ensures that there are no duplicate keys, which can be useful for certain applications.

Can a Hashtable for booleans handle null values?

No, a Hashtable for booleans cannot handle null values as keys or values. If a null value is attempted to be added, an error will occur. Only non-null boolean values can be used as keys.

How do you add or remove elements from a Hashtable for booleans?

To add an element, you can use the put() method, which takes in a boolean key and the corresponding value. To remove an element, you can use the remove() method, which takes in the boolean key and removes the associated key-value pair from the Hashtable.

Similar threads

Back
Top