True Random Numbers: Is Randomness Just an Illusion?

In summary, the thread discusses randomness and how it relates to compression. The original poster ran a compression algorithm on data from random.org and found that it displayed biases which could be exploited for significant compression rates. The discussion covers how apparent randomness can be affected by the scale of sampling, the idea that a set without frequency biases could be built by starting with equal numbers of all values and mixing them (though even that approach shows biases at other sample sizes), and the importance of considering large enough samples when discussing statistics and randomness.
  • #1
ktoz
Hi

I'm writing a compression utility based on an improvement to Huffman coding and, just for giggles, I thought I'd try it out on true random data from http://www.random.org/ to see if it did anything.

What I found was that the data from random.org (which is derived from atmospheric noise) definitely displays biases which can be exploited to generate significant compression rates (I'm getting upwards of 55% savings.)

I'm well aware of the pigeonhole principle (http://en.wikipedia.org/wiki/Pigeonhole_principle) and know that random data shouldn't be compressible, so that brings up the question: Is this data somehow not random? Is randomness really just a function of scale?

For example: if you take 1,000,000 randomly generated bits and slice them up into 8-bit samples, there are bound to be certain biases (i.e. more instances of the number 15 than 16). If you slice those 1,000,000 bits into 25-bit samples, completely different biases would result. It seems that by choosing a sample size, you automatically introduce biases into random data, which sort of makes it not random.

The only way I can think of to create a random set without biases would be to start with equal numbers of all values for a particular sample size and mix them. That way, there would be no frequency differences to exploit. But even this fails if you choose a different sample size.
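As a rough way to see how big those frequency differences actually are, one could count how often each value appears at a given sample width. A minimal sketch in Python, using the built-in random module as a stand-in for the random.org data (the spread around the mean count is just sampling noise and shrinks, relatively, as the bitstream grows):

# Slice a random bitstream into fixed-width samples and count each value.
import random
from collections import Counter

def sample_counts(num_bits, width, seed=0):
    rng = random.Random(seed)
    bits = [rng.getrandbits(1) for _ in range(num_bits)]
    counts = Counter()
    for i in range(0, num_bits - width + 1, width):
        value = int("".join(str(b) for b in bits[i:i + width]), 2)
        counts[value] += 1
    return counts

counts = sample_counts(1_000_000, 8)   # 125,000 eight-bit samples, 256 possible values
print(min(counts.values()), max(counts.values()))   # both land near the mean of ~488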

So I'm wondering: is randomness just an illusion caused by sampling scale?
 
  • #2
Statistics are meaningful only when large enough samples are considered.
For example, consider a coin. We "know" that coin tossing is completely random. However, you could try tossing it a few times and might end up with:
HHHTHHTHHH
which is obviously biased in favor of heads. However, you would expect freq(H)/freq(T) to approach 1 as you go on tossing, i.e. as your sample size becomes large enough (BTW, given a distribution, you can estimate how large "large enough" is).
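To make that concrete, here is a minimal Python sketch (my own illustration, not tied to the example above) that prints freq(H)/freq(T) at a few checkpoints; the ratio wanders early on and settles near 1 as the number of tosses grows:

# Simulate coin tosses and watch freq(H)/freq(T) approach 1.
import random

rng = random.Random(42)
heads = tails = 0
for n in range(1, 1_000_001):
    if rng.random() < 0.5:
        heads += 1
    else:
        tails += 1
    if n in (10, 100, 10_000, 1_000_000):
        print(n, heads / max(tails, 1))   # guard against a (very unlikely) zero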

Assaf
www.physicallyincorrect.com
 
  • #3
Slicing a random bitstream into groups of 8 does not make the data any less random at all.

As far as the meaning of "random," you have to consider that getting a million heads in a row is a perfectly valid result of flipping a coin a million times.

Your algorithm's ability to compress random data will disappear, on average, if you compress very large bitstreams, or, alternatively, if you individually compress a large number of small bitstreams.
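A quick way to check this for a general-purpose compressor is to feed it random bytes directly. A minimal sketch using Python's zlib, with os.urandom standing in for a true random source:

# Try to compress uniformly random bytes; on average it doesn't shrink.
import os
import zlib

data = os.urandom(100_000)              # stand-in for a true random source
packed = zlib.compress(data, level=9)
print(len(data), len(packed))           # packed is typically slightly larger, due to overhead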

- Warren
 
  • #4
ktoz said:
What I found was that the data from random.org (which is derived from atmospheric noise) definitely displays biases which can be exploited to generate significant compression rates (I'm getting upwards of 55% savings.)

How large are the strings that you're compressing?

Are you including all needed information in the encoded string? Sometimes there's outside data that's used to compress, and this outside data plus the 'encoded data' is at least as large as the original. Along those lines, what format is the encoded data in?
 
  • #5
ktoz said:
The only way I can think of to create a random set without biases would be to start with equal numbers of all values for a particular sample size and mix them. That way, there would be no frequency differences to exploit. But even this fails if you choose a different sample size.

That is, of course, a bias itself, and could be used to compress a string. 256 bytes in which each of the 256 possible values appears once can be compressed from 2048 bits to 1684 bits (though admittedly not easily).
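For reference, that 1684-bit figure is just the number of bits needed to index one of the 256! possible orderings; a quick check with Python's standard math module:

# Bits needed to pick out one ordering of 256 distinct byte values.
import math

bits = math.log2(math.factorial(256))
print(bits, math.ceil(bits))   # just under 1684, so 1684 bits versus the raw 2048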
 
  • #6
I took 10,000 bytes from random.org (integers in the range 0-255) and stored them in a 10,000-byte bin file. I ran several compression algorithms on it (using 7-zip) and wasn't able to compress it to below 10,000 bytes. The compressed files were in the range 10,200 to 10,800 bytes or so.
 
  • #7
CRGreathouse said:
How large are the strings that you're compressing? Are you including all needed information in the encoded string?

10,000 bytes and yes to including header info.
 
  • #8
CRGreathouse said:
I took 10,000 bytes from random.org (integers in the range 0-255) and stored them in a 10,000-byte bin file. I ran several compression algorithms on it (using 7-zip) and wasn't able to compress it to below 10,000 bytes. The compressed files were in the range 10,200 to 10,800 bytes or so.

On a Mac, if I use a random 10,000-number set from random.org (with spaces deleted and values restricted to the range 0-9), I can use the built-in "Create archive of ..." command in the Finder and get around 50 percent compression.

With my algorithm, I'm getting around 5 to 20 percent better compression rates than "zip" depending on file type.

It's still a bit rough around the edges, so I may be making an error somewhere in the code that's giving me these smaller sizes. I have the compress/write-to-file part working but still need to write the read-from-file/decompress part, which will show whether I've messed up or not.
 
  • #9
Ten kilobytes is a laughably small data set. Don't draw any conclusions at all from that...

- Warren
 
  • #10
ktoz said:
On a Mac if I use a random 10,000 number set (with spaces deleted) from random.org and restrict the values to the range 0-9, I can use the built-in "Create archive of ..." command in the Finder and get around 50 percent compression.

Well, that's your problem. A data stream restricted to the numbers 0-9, stored one character per byte, is hardly using those bytes at all! You only need 4 bits of information to represent a number from 0 to 15, and 0-9 needs even less than that. Therefore, if each character occupies one byte in the file, it's no wonder your Mac gets 50% compression rates, because more than half of each byte carries no information.

If you want a truly random data stream, you MUST use integers in the range 0-255.
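To put a number on that: a uniformly chosen digit 0-9 carries log2(10) bits, so even an ideal compressor can only get an ASCII-digit file down to about 41.5% of its original size. A minimal Python check:

# Information content of a uniform decimal digit versus an 8-bit ASCII byte.
import math

bits_per_digit = math.log2(10)                # about 3.32 bits of information
print(bits_per_digit, bits_per_digit / 8)     # best case is ~41.5% of the ASCII size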
 
  • #11
Err... yeah, Ben. Good catch. I was tacitly assuming that ktoz was using a random bitstream, where every bit is independent of every other bit. If he's just using a file full of ASCII characters '0' through '9', then sure, more than half of each byte is wasted. A good compression algorithm could get close to 60% compression on such a file.

- Warren
 
  • #12
Ben Niehoff said:
Well, that's your problem. A data stream restricted to numbers 0-9 isn't hardly random at all! If you want a truly random data stream, you MUST use integers in the range 0-255.

I was kind of assuming the "zip" compressor was squeezing out the extraneous zeros, but what difference does restricting the set to 0-9, as opposed to 0-255, make to whether a sequence is random or not? It doesn't seem like base 10 is intrinsically less capable of randomness than base 256.
 
  • #13
Are you actually putting the ASCII characters '0' through '9' in a file, and then trying to compress it? Just say yes or no.

If so, consider the binary values for the ASCII codes for '0' through '9':

'0': 0011 0000
'1': 0011 0001
'2': 0011 0010
'3': 0011 0011
'4': 0011 0100
'5': 0011 0101
'6': 0011 0110
'7': 0011 0111
'8': 0011 1000
'9': 0011 1001

As you can see, the first nibble doesn't change at all; it's duplicated in every single byte and carries no real information. The second nibble does change, but most of its values start with a zero bit. In other words, there are only about log2(10) ≈ 3.3 bits of "real information" in each 8-bit byte. That means well over half of your data set (roughly 4.7 of every 8 bits) is wasted, which is why any decent compression algorithm can reduce it by 50% or more.
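One way to see this directly is to measure the empirical byte entropy of the two kinds of file; the gap between them is what a byte-oriented compressor exploits. A minimal sketch in Python (my own illustration, not ktoz's code):

# Empirical byte entropy: ASCII digits versus uniformly random bytes.
import math
import os
import random
from collections import Counter

def byte_entropy(data):
    counts = Counter(data)
    total = len(data)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

rng = random.Random(0)
ascii_digits = bytes(rng.choice(b"0123456789") for _ in range(10_000))
random_bytes = os.urandom(10_000)
print(byte_entropy(ascii_digits))   # about 3.3 bits per byte
print(byte_entropy(random_bytes))   # close to 8 bits per byte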

- Warren
 
  • #14
Ten bits is enough to encode three decimal digits (allowing compression to about 42% of the original size), with enough room left over for an "end of string" code, codes for "append 0 then end of string" through "append 9 then end of string", codes for "append 0, read 4 more bits to get one more digit, then end of string" through "append 9, read 4 more bits to get one more digit, then end of string", and still three spare codes for error/override conditions.
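A minimal sketch of the basic idea (only the packing of full groups of three digits is shown; the end-of-string and append codes described above are left out):

# Pack groups of three decimal digits into 10-bit codes (values 0-999 of the 1024 available).
def pack_digits(digits):
    assert len(digits) % 3 == 0, "sketch handles full groups of three only"
    return [int(digits[i:i + 3]) for i in range(0, len(digits), 3)]

def unpack_digits(codes):
    return "".join(f"{code:03d}" for code in codes)

codes = pack_digits("141592653")
print(codes)                  # [141, 592, 653], each fits in 10 bits
print(unpack_digits(codes))   # "141592653"
# 10 bits per 3 digits is about 3.33 bits/digit, i.e. roughly 42% of 8-bit ASCII.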
 

FAQ: True Random Numbers: Is Randomness Just an Illusion?

What are true random numbers?

True random numbers are numbers generated in a completely unpredictable and unbiased manner. They are not produced by a deterministic algorithm and cannot be predicted or replicated.

Is randomness just an illusion?

Some scientists argue that true randomness does not exist and that everything in the universe follows predetermined patterns. However, others believe that true randomness does exist in certain systems and phenomena.

How are true random numbers generated?

True random numbers can be generated through a variety of methods, such as using atmospheric noise, quantum processes, or radioactive decay. These methods are believed to produce truly random and unpredictable results.

Why are true random numbers important?

True random numbers are important in fields such as cryptography, statistical analysis, and gaming. They are also used in simulations and experiments to ensure unbiased results.

Can true random numbers be controlled or manipulated?

True random numbers cannot be controlled or predicted in advance. However, the method used to generate them can introduce bias, so it is important to use a reliable and unbiased source to ensure true randomness.
