How Can I Efficiently Build a Huffman Tree for Huffman Encoding?

ktoz · Dec 16, 2007

Hi

I'm playing around with Huffman encoding and am having a major problem getting my head around how to efficiently build the tree. I Googled "huffman coding" and have a basic grasp of what needs to be done but the actual process of walking through a list of sorted values has turned into a big headache.

Assuming I start with a list of factored values and their rank, in either descending (or ascending) order, the algorithm I came up with takes almost 3 minutes on a dual core 2.16 ghz Macintosh. This is clearly ridiculous as it takes roughly half a second for the rest of my code to read the file and create the factored list of values and their frequencies.

I tried writing to the model http://en.wikipedia.org/wiki/Image:Huffman_tree_2.svg" but due to the large number of initial factors (3000+) I ran into a problem where the sum of the ranks grew so large over subsequent iterations that it didn't even fit in an unsigned long long value without truncation.

Here's a snippet showing a section of the sorted, pre-Huffman data I'm starting with

Code:

87 = ({rank = 87; word = hate; }, {rank = 87; word = GONZALEZ; }); 
88 = ({rank = 88; word = one; }); 
89 = ({rank = 89; word = mainstream; }, {rank = 89; word = "don't"; }); 
90 = (
	{rank = 90; word = like; }, 
	{rank = 90; word = "it's"; }, 
	{rank = 90; word = his; }
); 
91 = ({rank = 91; word = he; }, {rank = 91; word = all; }); 
92 = ({rank = 92; word = legislation; }); 
93 = (
	{rank = 93; word = by; }, 
	{rank = 93; word = more; }, 
	{rank = 93; word = had; }, 
	{rank = 93; word = report; }, 
	{rank = 93; word = at; }
); 
94 = ({rank = 94; word = being; }); 
95 = ({rank = 95; word = them; }); 
96 = ({rank = 96; word = right; }, {rank = 96; word = crimes; });

What this shows is an associative array sorted and grouped by rank of each word in the source file. For example, under the key "93" we see that the five words "by," "more," "had," "report," and "at" all have the same frequency in the document

From this, could someone help me out with a step by step on how to create the huffman tree?

Thanks for any help

Ken

ktoz · Dec 16, 2007

Never mind. I figured it out

Turns out I wasn't properly grouping values after adding their ranks. I was just changing their rank without combining symbol pairs into a new node.

Got the total encoding time down to .5 seconds for a 90K file. 5 seconds for a 10 MB file etc... Much better.

ravenprp · Dec 23, 2007

Hi Ken,

Building a Huffman tree can definitely be a tricky process, especially when dealing with a large number of initial factors. Here are some steps that may help guide you:

1. Start by creating a node for each unique word in the source file. These nodes will initially have a frequency of 1.

2. Sort these nodes in ascending order by frequency.

3. Take the two nodes with the lowest frequencies and combine them into a single node. This new node will have a frequency equal to the sum of the two nodes' frequencies, and its left and right children will be the two original nodes.

4. Remove the two original nodes from the list and add the new node.

5. Repeat this process until there is only one node left in the list. This node will be the root of your Huffman tree.

6. To assign codes to each word, start at the root of the tree and assign a 0 to the left child and a 1 to the right child. Continue this process for each node until you reach a leaf node, which will represent a word in the source file.

7. The code for a word will be the series of 0s and 1s that you encounter as you traverse from the root to the leaf node.

I hope this helps! Let me know if you have any further questions or if you need clarification on any of the steps. Good luck with your project!

How Can I Efficiently Build a Huffman Tree for Huffman Encoding?

FAQ: How Can I Efficiently Build a Huffman Tree for Huffman Encoding?

What is a huffman tree and why is it important in data compression?

How do you build a huffman tree?

What is the time complexity of building a huffman tree?

Can a huffman tree be used for both compression and decompression?

Are there any limitations to using a huffman tree for data compression?

Similar threads

Hot Threads

Recent Insights