Answer: Dictionary Compression: Solving the Mystery of "01

  • Comp Sci
  • Thread starter Villiers
  • Start date
  • Tags
    Compression
In summary, the conversation is discussing the concept of dictionary compression, which involves assigning shortened codes to frequently used words in a document. This reduces the amount of data needed to store the document, as opposed to using ASCII or giving each word a unique number. The conversation also clarifies that the lookup list would only include words found in the specific document, not the entire English language dictionary.
  • #1
Villiers
7
0
Homework Statement
In a text document each letter could be stored as an ASCII code of 8 bits
In this document the word ‘because’ requires 56 bits of data (7 letters x 8 bits)
Instead, the word could be added to a dictionary and assigned the binary code 01 which is a reduction of 38 bits for each occurrence
A saving for 50 occurrences of the word:
Relevant Equations
50 x 54 bits saving = 2,700 bits saved, or 338 bytes
Hi

I have the answer to the dictionary compression question- but can't understand the following in the notes:

Instead, the word could be added to a dictionary and assigned the binary code 01 which is a reduction of 38 bits for each occurrence - what does this mean?

This is the extract in full:
In compressing larger volumes, a document of 100 pages could contain the word ‘because’ 50 times, resulting in 2000 bits of data being required. Instead the word could be added to a dictionary and assigned the code 01 which is a reduction of 38 bits for each occurrence. There would still be a slight overhead in terms of the storage of the dictionary but this would only be a one-off entry per word.
 
Physics news on Phys.org
  • #2
Words could be assigned a number, which is stored instead of the entire word. Webster's Dictionary contains 470,000 words. That would require 19 bits to give each word a unique number. The explanation you quoted seems to use 18 bits instead of 19 (because 56-38=18). That would allow you to assign unique numbers to ##2^{18} = 262,144## words.

CORRECTION: It looks like it is talking about making a lookup list of only the words in the document, not an entire English language dictionary. So the word ‘because’ is the first word in the lookup list with index 01 and any occurrence of that word in the text is replaced by '01'.
 
Last edited:
  • #3
thank you for your feedback
 
  • #4
FactChecker said:
Words could be assigned a number, which is stored instead of the entire word. Webster's Dictionary contains 470,000 words. That would require 19 bits to give each word a unique number. The explanation you quoted seems to use 18 bits instead of 19 (because 56-38=18). That would allow you to assign unique numbers to ##2^{18} = 262,144## words.
I don't think this is what the question means, you are adding in assumptions which I can't see are justified.

Let's have another look:
Villiers said:
In compressing larger volumes, a document of 100 pages could contain the word ‘because’ 50 times, resulting in 2000 bits of data being required.
So each occurrence of 'because' requires 2000 ÷ 50 = 40 bits. This is clearly not ASCII so the calculation 7 x 8 = 56 bits is not relevant.

Villiers said:
Instead the word could be added to a dictionary and assigned the code 01 which is a reduction of 38 bits for each occurrence.
01 is two bits, 40 - 2 = 38.
 
  • Like
Likes FactChecker
  • #5
pbuk said:
I don't think this is what the question means, you are adding in assumptions which I can't see are justified.
I stand corrected. It looks like it is talking about making a lookup list of only the words in the document, not an entire English language dictionary. So the word ‘because’ is the first word in the lookup list with index 01 and any occurrence of that word in the text is replaced by '01'.
 
  • Like
Likes pbuk

FAQ: Answer: Dictionary Compression: Solving the Mystery of "01

What is dictionary compression?

Dictionary compression is a data compression technique that replaces repetitive phrases or words in a document with shorter codes or symbols, resulting in a smaller file size.

How does dictionary compression work?

Dictionary compression works by creating a dictionary of frequently used words or phrases in a document, assigning them shorter codes, and replacing them with those codes in the original document. This reduces the overall file size and makes it easier to store and transmit the data.

What is the purpose of dictionary compression?

The purpose of dictionary compression is to reduce the size of a file or document, making it more efficient to store and transfer. It also helps to improve data retrieval and processing speeds.

What is the "01" mystery in dictionary compression?

The "01" mystery refers to the use of binary code (0s and 1s) in dictionary compression. This code is used to represent the shorter codes assigned to frequently used words or phrases in a document.

What are the advantages of using dictionary compression?

Using dictionary compression can result in significant space and time savings, making it easier to store and transmit large amounts of data. It also helps to improve data processing speeds and reduce bandwidth usage.

Similar threads

Back
Top