Expected number of collisions for hashing with chaining

In summary, the conversation discusses the concept of collisions in a hash table and how to calculate their expected number, using random variables and indicator random variables. The attempt produced an incorrect result because it did not count each colliding pair of keys exactly once: it averaged the per-key collision counts, and summing them would count every collision twice. The correct solution counts unordered pairs and yields a different result.
  • #1
CGandC
Homework Statement
Suppose we use a hash function ## h ## to hash ## n ## distinct keys into an array ## T ## of length ## m ##. Assuming simple uniform hashing, what is the expected number of collisions? More precisely, what is the expected cardinality of ## \{ \{k, l\} : k \neq l \text{ and } h(k) = h(l) \} ## ?
Relevant Equations
- Knowledge of random variables, indicator random variables.
- Simple uniform hashing means that the probability that any two distinct keys hash to the same cell is the same, i.e. ## \mathbb{P}( h(k_i) = h(k_j) ) = \frac{1}{m} ## for ## i \neq j ##.
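
As a quick numerical sanity check of this assumption, here is a minimal Python sketch that models the hash as an independent, uniformly random slot assignment for each key (an assumption consistent with the definition above); the estimated collision probability for two distinct keys should come out close to ## \frac{1}{m} ##.

```python
import random

# Sketch: model "simple uniform hashing" by giving each key an independent,
# uniformly random slot in a table of size m, then estimate the probability
# that two distinct keys land in the same slot.
def estimate_pair_collision_prob(m, trials=200_000):
    hits = 0
    for _ in range(trials):
        if random.randrange(m) == random.randrange(m):   # does h(k_i) == h(k_j)?
            hits += 1
    return hits / trials

m = 20
print(estimate_pair_collision_prob(m), 1 / m)            # both should be close to 0.05
```
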
Attempt:
Denote ## T ## as the hash table and ## h ## as the hash function.
Denote ## n ## as the number of keys in the universe, and ## m ## as the size of the hash table.

Order the set of keys from the universe as ## \{ k_1 ,..., k_n \} ## such that ## k_i \leq k_j ## whenever ## i \leq j ##.
Note that a collision occurs when we try to insert some key ## k ## into the table and find that ## T[h(k)] \neq \text{NULL} ##.

Let ## X ## be a random variable denoting a key ( ## X(\omega) = \omega ## , ## \omega \in \Omega ## is a key ).
Let ## C(X) ## be a random variable denoting the number of collisions with ## X ##.

Let ## X_{ij} ## be an indicator random variable denoting the event that a collision between ## k_i ## and ## k_j ## occurred, i.e. whether ## h(k_i) = h(k_j) ## occurred or not.

Thus, ## \mathbb{E}[ C(X)] = \sum_{i=1}^n \mathbb{E}[ C(X) | X = i] \mathbb{P}(X=i) ## ## = \frac{1}{n} \sum_{i=1}^n \mathbb{E}[ C(X) | X = i] ##
(where ## \mathbb{P}(X=i) = \frac{1}{n} ##, under the assumption that the key under consideration is equally likely to be any of the ## n ## keys stored in the table).

Note that ## C(X) | X=i ## is a random variable and ## C(X) | X=i = \sum_{\substack{j=1 \\ j \neq i}}^{n} X_{ij} ##.

Thus, ## \mathbb{E}[ C(X) | X=i] = \sum_{\substack{j=1 \\ j \neq i}}^{n} \mathbb{E}[ X_{ij}] = \sum_{\substack{j=1 \\ j \neq i}}^{n} \mathbb{P}( h(k_i) = h(k_j) ) = \sum_{\substack{j=1 \\ j \neq i}}^{n} \frac{1}{m} = \frac{n-1}{m} ##

* The sum went over all possible values of ## j ## except ## j = i ##, hence the ## n-1 ## terms in the sum.
* ## \mathbb{P}( h(k_i) = h(k_j) ) = \frac{1}{m} ## because of the assumption of simple uniform hashing.

Thus, ## \frac{1}{n} \sum_{i=1}^n \mathbb{E}[ C(X) | X = i] = \frac{1}{n} \cdot \frac{n(n-1)}{m} = \frac{n-1}{m} ##.
Hence, ## \mathbb{E}[ C(X)] = \frac{n-1}{m} ##, which is what we wanted.

Correct answer:
Denote by ## Y ## a random variable representing the total number of collisions.
Note that ## Y = \sum_{1 \le i < j \le n} X_{ij} ##.
Thus, ## \mathbb{E}[ Y] = \sum_{1 \le i < j \le n} \mathbb{E}[ X_{ij}] = {n \choose 2} \cdot \frac{1}{m} = \frac{n(n-1)}{2m} ##
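
As a sanity check, here is a minimal Monte Carlo sketch, assuming only that the hash assigns each key an independent, uniformly random slot: it counts the unordered colliding pairs over many simulated tables and compares the average with ## \frac{n(n-1)}{2m} ##.

```python
import random
from itertools import combinations

# Sketch: estimate E[Y], the expected number of unordered pairs {k, l}
# with h(k) = h(l), under the uniform-random-slot model, and compare it
# with the closed form n(n-1)/(2m).
def estimate_expected_collisions(n, m, trials=20_000):
    total = 0
    for _ in range(trials):
        slots = [random.randrange(m) for _ in range(n)]  # h(k_1), ..., h(k_n)
        total += sum(1 for i, j in combinations(range(n), 2) if slots[i] == slots[j])
    return total / trials

n, m = 10, 20
print(estimate_expected_collisions(n, m))    # simulation estimate
print(n * (n - 1) / (2 * m))                 # closed form: 2.25
```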

Question:
Obviously my answer is not correct, since ## \frac{n-1}{m} \neq \frac{n(n-1)}{2m} ##. I understand the correct answer and agree with it, but I don't understand why the reasoning in my attempt didn't yield it. What was wrong in my attempt that led to the incorrect result?

Thanks in advance for any help!
 
  • #2
CGandC said:
What was wrong in my attempt that led to the incorrect result?

You disappeared off into the woods after

CGandC said:
Let ## X_{ij} ## be an indicator random variable denoting the event that a collision between ## k_i ## and ## k_j ## occurred, i.e. whether ## h(k_i) = h(k_j) ## occurred or not.

Clearly ## P(X_{ij} = 1) = \dfrac 1 m ##, so all we have to do is count the number of ## X_{ij} ##s and remember not to count the same collision twice.
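
A minimal sketch of that counting point, again assuming each key gets an independent, uniformly random slot: summing the per-key collision counts ## C_i = \#\{ j \neq i : h(k_j) = h(k_i) \} ## counts every colliding pair twice (averaging to ## \frac{n(n-1)}{m} ##), the unordered-pair count averages to ## \frac{n(n-1)}{2m} ##, and the average of the ## C_i ## is ## \frac{n-1}{m} ##, which is exactly the quantity computed in the attempt.

```python
import random
from itertools import combinations

# Sketch (uniform-random-slot model): compare, on the same simulated tables,
#   pair_total:    unordered pairs {i, j} sharing a slot        -> averages to n(n-1)/(2m)
#   ordered_total: sum over i of C_i (each pair counted twice)  -> averages to n(n-1)/m
#   per_key:       ordered_total / n, the average of the C_i    -> averages to (n-1)/m
def simulate(n, m, trials=20_000):
    pair_total = ordered_total = 0
    for _ in range(trials):
        slots = [random.randrange(m) for _ in range(n)]
        pair_total += sum(1 for i, j in combinations(range(n), 2) if slots[i] == slots[j])
        ordered_total += sum(1 for i in range(n) for j in range(n)
                             if i != j and slots[i] == slots[j])
    return pair_total / trials, ordered_total / trials, ordered_total / (trials * n)

n, m = 10, 20
print(simulate(n, m))                                       # simulation estimates
print(n * (n - 1) / (2 * m), n * (n - 1) / m, (n - 1) / m)  # 2.25, 4.5, 0.45
```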
 

FAQ: Expected number of collisions for hashing with chaining

What is the expected number of collisions for hashing with chaining?

Under the definition used in this thread, a collision is an unordered pair of distinct keys that hash to the same slot, and the expected number of collisions is n(n-1)/(2m), where n is the number of keys and m is the number of buckets or slots in the hash table. If a collision is instead counted each time a key is inserted into an already-occupied slot, the expected count is n - m + m(1 - 1/m)^n.
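
Both figures can be checked with a short simulation; the following is a minimal sketch assuming each key independently gets a uniformly random slot (the same model as in the thread above).

```python
import random

# Sketch: compare the two collision counts on the same simulated tables.
#   colliding_keys:  keys inserted into an already-occupied slot
#                    -> expected value n - m + m*(1 - 1/m)**n
#   colliding_pairs: unordered pairs of keys sharing a slot
#                    -> expected value n*(n - 1)/(2*m)
def simulate(n, m, trials=20_000):
    keys_total = pairs_total = 0
    for _ in range(trials):
        counts = [0] * m
        for _ in range(n):
            counts[random.randrange(m)] += 1
        keys_total += sum(c - 1 for c in counts if c > 0)     # keys beyond the first in each slot
        pairs_total += sum(c * (c - 1) // 2 for c in counts)  # colliding pairs within each slot
    return keys_total / trials, pairs_total / trials

n, m = 10, 20
print(simulate(n, m))                  # simulation estimates
print(n - m + m * (1 - 1/m) ** n)      # about 1.97 for n=10, m=20
print(n * (n - 1) / (2 * m))           # 2.25 for n=10, m=20
```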

How does the number of keys and buckets affect the expected number of collisions?

The expected number of collisions increases as the number of keys increases and decreases as the number of buckets increases. With more keys there are more pairs that can hash to the same bucket, and with more buckets each pair is less likely to collide; for the pairwise count this is explicit in the formula n(n-1)/(2m).

Can the expected number of collisions be reduced?

Yes, the expected number of collisions can be reduced by increasing the number of buckets or by using a different hashing algorithm that distributes keys more evenly across the buckets.

How does the load factor affect the expected number of collisions?

The load factor, which is the ratio of the number of keys to the number of buckets (alpha = n/m), affects the expected number of collisions. As the load factor increases, the expected number of collisions also increases, since more keys compete for the same buckets.
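
For the pairwise count used in this thread the dependence is explicit: n(n-1)/(2m) = (n-1)/2 * alpha with alpha = n/m, so for a fixed number of keys the expectation grows linearly in the load factor. A minimal sketch tabulating this:

```python
# Sketch: the expected number of colliding pairs, n*(n-1)/(2*m), rewritten
# as (n-1)/2 * alpha with load factor alpha = n/m, grows linearly in alpha
# for a fixed number of keys n.
def expected_pair_collisions(n, m):
    return n * (n - 1) / (2 * m)

n = 100
for m in (400, 200, 100, 50):          # load factors 0.25, 0.5, 1.0, 2.0
    alpha = n / m
    print(f"m={m:4d}  alpha={alpha:.2f}  E[colliding pairs]={expected_pair_collisions(n, m):.2f}")
```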

Is it possible to have zero collisions with hashing and chaining?

Zero collisions can happen in a particular run, for example when the number of keys is at most the number of buckets and every key happens to land in its own slot, but under simple uniform hashing there is always a positive probability of a collision whenever two or more keys are inserted. In practice, the expected number of collisions can be minimized by choosing a suitable number of buckets and a good hashing algorithm.
