lkcl
hi folks, the last couple of times i've been on here have been extraordinarily useful, and i have a new challenge that i'm stuck on, so i thought i'd ask. this one takes a bit of explaining, apologies.
i've encountered an issue with a standard cryptographic algorithm. take a block of data (say 1024 bits) and encrypt it: the probability that any one bit of the output is a one or a zero *must* be *exactly* 0.5, across all permutations of keys and all permutations of data. that's the basic definition of a "perfect" encryption algorithm.
reality is slightly different, and the reality is also that you cannot possibly hope to run through all 2^128 keys against all 2^128 data permutations: there's not enough time in the universe.
sooo, you do a series of runs instead. say, 100,000 runs of 1024 bits each. *one possible* test is to count the number of zeros: since each bit is independent (in a perfect algorithm) with probability 0.5, the count of zeros follows a binomial distribution, and in a run of 1024 bits the expected count is 512 zeros.
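for concreteness, here's one way the per-run p-value could be computed: a sketch using the exact two-sided binomial tail (the NIST monobit test uses an erfc approximation of the same quantity; `zeros_pvalue` is just an illustrative name):

```python
from math import comb

def zeros_pvalue(k: int, n: int) -> float:
    """Exact two-sided p-value for seeing k zeros in n fair bits:
    P(|X - n/2| >= |k - n/2|) where X ~ Binomial(n, 0.5)."""
    dev = abs(k - n / 2)
    weight = 0.5 ** n                 # every n-bit pattern has probability 2^-n
    total = sum(comb(n, i) * weight
                for i in range(n + 1) if abs(i - n / 2) >= dev)
    return min(total, 1.0)
```

a run with exactly n/2 zeros gets a p-value of 1.0 (nothing is more balanced than that), while large deviations drive the p-value toward 0.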
now, here's the thing: if you get *exactly* 512 zeros in every single one of the 100,000 runs, you know intuitively that there's a problem. buuut, how do you actually test that? you compute a p-value for each run and then test the *p-value distribution* for uniformity.
in NIST's statistical test suite (SP 800-22), what they do is take a histogram with 10 buckets, assume a uniform distribution, and for each of the 100,000 runs drop the p-value into a bucket: values between 0.0 and 0.1 into the 1st bucket, between 0.1 and 0.2 into the 2nd, and so on.
once that's done, you can run a chi-squared test on the histogram (the buckets should be roughly level.. but again, *not perfectly* level!).
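as a sketch, that NIST-style uniformity check might look like this (16.919 is the 95th percentile of chi-squared with 9 degrees of freedom; NIST itself converts the statistic to a p-value with the incomplete gamma function instead of using a fixed critical value):

```python
def uniformity_chi2(pvalues):
    """Bin p-values into 10 equal buckets and return the chi-squared
    statistic against the uniform expectation of len(pvalues)/10 per bucket."""
    buckets = [0] * 10
    for p in pvalues:
        buckets[min(int(p * 10), 9)] += 1   # p == 1.0 goes in the last bucket
    expected = len(pvalues) / 10
    return sum((b - expected) ** 2 / expected for b in buckets)

# perfectly level buckets give a statistic of 0.0; compare against 16.919
level = [(b + 0.5) / 10 for b in range(10) for _ in range(1000)]
print(uniformity_chi2(level))   # 0.0
```

in practice genuinely uniform p-values land somewhere above 0 but (95% of the time) below the 16.919 threshold.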
the problem comes when the p-value distribution is *NOT* uniform but discrete, as with the "count the number of zeros in a block of 1024 bits" test. a binomial statistic can only ever produce a limited number of *discrete* p-values, so you end up with what one could call "sampling noise" between the p-values and the histogram granularity: end result, the histogram gets skewed (certain buckets collect more p-values than others), it looks non-uniform, and that's game over.
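the discreteness is easy to see by enumeration. for 256-bit blocks, a two-sided zero-count statistic can only produce one p-value per deviation from the mean of 128 (a sketch using the exact binomial tail rather than NIST's erfc approximation):

```python
from math import comb

n = 256
weight = 0.5 ** n
pmf = [comb(n, k) * weight for k in range(n + 1)]   # P(exactly k zeros)

# two-sided p-value for deviation d from the mean: P(|X - 128| >= d)
distinct = {sum(pmf[k] for k in range(n + 1) if abs(k - n // 2) >= d)
            for d in range(n // 2 + 1)}
print(len(distinct))   # 129 possible p-values -- nowhere near a continuum
```

with only 129 achievable values, any fixed 10-bucket binning is bound to show granularity artefacts.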
so, that's the background; apologies for having to explain it at such length. you can also probably guess what i need to do :)
what i need is a way to work out, based on a histogram of the counts of each *distinct* p-value, whether it is "good" or not. i know for a fact that the data below is *not* good; it just looks like it is. the table below was produced from a million runs of 256 bits each. the tail end (p-values less than 1e-6) is where i am suspicious,
but i need to *prove* that it's suspicious rather than just guess.
i therefore need two things (which i must implement in c or python):
1) a way to compute the "perfect" histogram (parameters: the number of runs and the number of bits, each bit having probability 0.5)
2) a way to then compare the *actual* histogram against the "perfect" one.
although i can describe both of these, i have nooo idea where to start :) some guidance walking through this would be greatly appreciated.
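a possible starting point for both parts (a sketch, not a definitive implementation: it assumes the per-run p-value is the NIST-monobit-style erfc of the normalised excess of zeros, which yields one p-value per deviation; swap in the actual per-run p-value formula if it differs):

```python
from math import comb, erfc, sqrt

def perfect_histogram(runs: int, nbits: int) -> dict:
    """Expected count for each distinct p-value of the count-the-zeros
    test over `runs` blocks of `nbits` fair bits."""
    weight = 0.5 ** nbits
    expected = {}
    for k in range(nbits + 1):
        s_obs = abs(2 * k - nbits) / sqrt(nbits)   # normalised excess of zeros
        p = erfc(s_obs / sqrt(2))                  # NIST monobit-style p-value
        expected[p] = expected.get(p, 0.0) + comb(nbits, k) * weight * runs
    return dict(sorted(expected.items()))

def chi_squared(observed: dict, expected: dict) -> float:
    """Pearson chi-squared between observed counts (keyed by p-value) and
    the perfect histogram.  Bins with tiny expectations should be merged
    into neighbours first, or the statistic blows up on pure noise."""
    return sum((observed.get(p, 0) - e) ** 2 / e for p, e in expected.items())
```

`perfect_histogram(10**6, 256)` yields 129 entries whose expected counts sum to one million; the tail entries have expectations well below 1, which is exactly why the raw chi-squared needs those bins merged before comparing.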
p-value     count
0.000000004 1
0.000000077 1
0.000000152 1
0.000000298 1
0.000000573 3
0.000001088 4
0.000002034 13
0.000003746 14
0.000006795 30
0.000012143 60
0.000021377 113
0.000037073 177
0.000063342 355
0.000106625 540
0.000176835 827
0.000288961 1300
0.000465258 2112
0.000738157 3321
0.001154050 5168
0.001778051 7472
0.002699796 11086
0.004040275 15609
0.005959526 22666
0.008664897 31501
0.012419331 44040
0.017548950 59411
0.024448945 79218
0.033586613 104171
0.045500264 133922
0.060792724 171872
0.080118314 216864
0.104162559 266307
0.133614403 324626
0.169131445 386622
0.211299547 457682
0.260589034 530572
0.317310508 604553
0.381573906 680090
0.453254705 754112
0.531971058 820481
0.617075077 880667
0.707660467 929469
0.802587349 965874
0.900523550 988681
1.000000000 498391
- - - Updated - - -
p.s. as i only need to test the tail end, i am guessing that a cumulative distribution function would do? but i do not know how to apply it.
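one hedged way to apply that idea: under the null hypothesis, the number of runs whose p-value falls below a threshold t is itself Binomial(N, q), where q is the *exact* probability of a single run landing below t, computed from the binomial distribution of the zero-count rather than from t itself, precisely because of the discreteness. a sketch, again assuming the erfc-style per-run p-value:

```python
from math import comb, erfc, sqrt

def tail_expectation(runs: int, nbits: int, threshold: float):
    """Expected number of runs with p-value < threshold, plus its standard
    deviation, so an observed tail count can be turned into a z-score."""
    weight = 0.5 ** nbits
    # q: exact chance a single run's p-value lands below the threshold
    q = sum(comb(nbits, k) * weight
            for k in range(nbits + 1)
            if erfc(abs(2 * k - nbits) / sqrt(nbits) / sqrt(2)) < threshold)
    mean = runs * q
    sd = sqrt(runs * q * (1 - q))
    return mean, sd

mean, sd = tail_expectation(10**6, 256, 1e-6)
# if the observed tail count sits many sd above mean, the generator fails
```

an observed tail count many standard deviations above `mean` is the *proof* of suspicion being sought, rather than a guess; the exact numbers depend on the chosen threshold and on the actual p-value formula used.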