Shuffling the numbers to create a pseudo-random distribution is usually a holistic operation. You take the list with all the elements and apply any kind of shuffling method.
However, what do you do if the range is too large and thus contains way too many elements to hold them in memory?
Wouldn't it be great if there was a stateless function that maps each number of the range to another number of the same range, so that no list has to be held in memory?
The experiments below provide an investigation of the above random distribution challenge and connected problems.
- Implement a stateless distribution function $f(n)$ to bijectively map a sequence $1, 2, 3, 4, \ldots$ of a particular range $R$ to numbers in the same range.
- Use the created distribution function to implement further utilities:
  - a random number generator
  - a hash function
  - a symmetric encryption method

The range $R$ is determined by the underlying data type (8, 16, 32, or 64 bits).
This is a pure fun project to demonstrate how certain algorithms may be constructed, how they interrelate, and how pseudo-random distribution can be judged.
There are plenty of well-tested and investigated implementations available. Relying on robust libraries will save you time and headaches. 😉
My approach is based on the assumption that prime numbers don't follow any pattern or rule. At least, none of the most brilliant minds has ever found one. So, somehow prime numbers must carry some irregularity, i.e., randomness.
If this is true, then it is also true for the bit patterns of their underlying binary representation.
Can we somehow leverage this intrinsic randomness to spread disorder over a range of numbers?
I want to try to do so by applying only the operations XOR ($\oplus$) and ROTATE (bit rotation).
- Let $P[]$ be a list of unique prime numbers.
- Step 0: Set the working state $WS$ to the input number.
- Step 1: Take the next prime number $P$ from $P[]$.
- Step 2: Perform $WS = WS \oplus P$.
- Step 3: Rotate the working state $WS$ by the number of $1$s in $WS$.
- Step 4: If there are more prime numbers, continue with Step 1, else return $result := WS$.
This is a bijective mapping; no two inputs can ever result in the same output. We can undo the operation by simply applying the inverse steps in reverse order: since rotation does not change the number of $1$s, the rotation distance can always be recovered from the output.
After a few rounds, the bit patterns in the working state should no longer show any obvious relation to the input.
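To make this concrete, here is a minimal sketch of the recipe for 64-bit values. The prime constants and method names are illustrative assumptions, not the actual `XoRotUtils`/`DistributionCodec` API:

```java
// Minimal sketch of the XOR-rotate mapping for 64-bit values.
// The prime patterns below are placeholders (small, well-known primes);
// the real PrimePatterns constants are selected with more care.
static final long[] PRIMES = {1_000_000_007L, 998_244_353L, 2_147_483_647L};

static long encode(long input) {
    long ws = input;                                  // Step 0: working state = input
    for (long p : PRIMES) {                           // Step 1: take the next prime
        ws ^= p;                                      // Step 2: WS = WS XOR P
        ws = Long.rotateLeft(ws, Long.bitCount(ws));  // Step 3: rotate by the number of 1s
    }
    return ws;                                        // Step 4: the result
}

static long decode(long value) {
    long ws = value;
    for (int i = PRIMES.length - 1; i >= 0; i--) {    // undo the rounds in reverse order
        // bitCount is invariant under rotation, so the rotation distance is recoverable.
        ws = Long.rotateRight(ws, Long.bitCount(ws));
        ws ^= PRIMES[i];
    }
    return ws;                                        // decode(encode(n)) == n for every n
}
```

The rotate-back trick in `decode` is what makes the mapping bijective: the rotation distance is derived from a property (the bit count) that the rotation itself leaves unchanged.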
Before we start, we should have a solid foundation: some tests and tools to judge the outcome, i.e., the randomness of the distribution created by our algorithm.
When we produce a result we need a defined way to tell whether it is good or bad.
With a handful of values we might be able to do this by looking closely, however, with millions or billions this is hopeless.
So, what are we looking for? Randomness, right? - Hmm ... 😕
Unfortunately, there is no rule or algorithm to simply decide whether a distribution is random or not.
Until the 1980s there was not even a proper foundation for randomness!
Here are two prominent quotes that show how shady the field of pseudo-randomness was (and is):
"Anyone who attempts to generate random numbers by deterministic means is, of course, living in a state of sin."
"A random sequence is a vague notion embodying the idea of a sequence in which each term is unpredictable to the uninitiated and whose digits pass a certain number of tests, traditional with statisticians and depending on the uses to which the sequence is to be put."
Derrick Henry Lehmer, 1951
Finally, in 1982, Andrew Yao developed the concept of computational indistinguishability. In 2000 he received the Turing Award for his work, including his contributions to pseudo-randomness.
Drastically simplified: a sequence is pseudo-random if you cannot distinguish it from a truly random sequence (e.g., from radioactive decay) using statistical methods. This leads to a set of tests along with thresholds to judge the randomness of any given distribution.
Yao's work can be seen as the theoretic foundation of the NIST test suite.
Luckily, with the Random Bitstream Tester, Zsolt Molnár provides a JavaScript implementation of the test suite with a simple web interface.
Although the Random Bitstream Tester is not an officially certified implementation, it should be sufficient for our purposes.
Whenever we implement a new algorithm or tweak/tune an existing one, we need to run all the tests again on a large number of output bits.
Obviously, this is tedious work. We cannot completely avoid these tests but maybe we can implement a quick test to detect and discard bad candidates early.
So, how can we observe a stream and tell that its output is not random?
When we perform a bijective mapping of numbers of a range on the numbers in the same range, then there are a few cheap KPIs we can gather easily.
- KPI $sucDist$: the distance between two subsequently generated (mapped) numbers. If our function produced outputs that stay close together, consecutive values would betray an obvious pattern. Reasoning: If the outputs were truly random, this distance should fluctuate strongly and average out at a predictable value.
- KPI $srcDist$: the distance between the input (sequential number) and the output (mapped number).

When we think beyond, the expected average distance can even be quantified: for two independent, uniformly distributed numbers in a range of size $R$, the average distance is $R/3$.
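For reference, the $R/3$ expectation follows from a short calculation (a standard result for two independent values $X, Y$ uniformly distributed on $[0, R]$):

$$E\big[\,|X-Y|\,\big] = \frac{1}{R^2}\int_0^R\!\!\int_0^R |x-y|\,dx\,dy = \frac{2}{R^2}\int_0^R \frac{x^2}{2}\,dx = \frac{R}{3}$$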
Clarification: This test only detects bad candidates early. You cannot take these KPIs as a proof of a good distribution!
Conclusion: When we observe the two KPIs while generating (mapping) a sequence of numbers we should be able to discard the majority of bad recipes, so we can concentrate on the remaining good ones. To those we still must apply the expensive tests.
Implementing the experiments will require some trial and error. This often leads to duplicated code and relics from previous attempts. In particular, multiple implementations of the same methods (e.g., for measurement) quickly diverge, which leads to errors, frustration, and fixing effort.
Thus, I want to start with a small set of tools encapsulating some boilerplate code, which will help me to concentrate on the actual experiments later. These tools can be found in the `util` package.
As explained before, I want to observe the source distance and the successor distance during the experiments and use them as KPIs. Collection as well as minimum, maximum, and average computation should be implemented once and once only.
Depending on the size of the range, the natural data type for collecting these statistics is `long` (64 bits), which is very fast. Unfortunately, when we analyze `long` values, the sums of the distances no longer fit into 64 bits and will cause overflows. Typically, we would now use `BigInteger` and call it a day. But incrementing vast amounts of numbers using `BigInteger` makes data collection extremely slow (compared to incrementing a `long`). `GenStats` avoids this by implementing the data collection based on `long` values and only performing the final computation with `BigInteger`.
See also GenStatsTest
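The following is a minimal sketch of how such an overflow-safe accumulation can work; it is an assumption about the technique, not the actual `GenStats` code:

```java
import java.math.BigInteger;

// Sketch of an overflow-safe accumulator: sums are collected with cheap long
// arithmetic, and BigInteger is only needed once, at the very end.
final class OverflowSafeSum {
    private long low;      // running sum, interpreted as an unsigned 64-bit value
    private long carries;  // how many times 'low' wrapped around 2^64

    void add(long value) {
        long next = low + value;
        // Unsigned overflow occurred iff the result is smaller than an operand.
        if (Long.compareUnsigned(next, low) < 0) {
            carries++;
        }
        low = next;
    }

    BigInteger total() {
        // total = carries * 2^64 + low (both interpreted as unsigned)
        return BigInteger.valueOf(carries).shiftLeft(64)
                .add(new BigInteger(Long.toUnsignedString(low)));
    }
}
```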
Working with large binary values can be painful, especially in languages like Java that provide signed data types.
It is complicated to compose or decompose values. Trivial log statements for debugging purposes suffer from problems like unexpected accidental promotion. Of course, the JDK provides solutions in the form of auxiliary methods, but they are sometimes hard to find. This causes repetitive boilerplate code for simple operations, and bugs sneak in quickly.
Thus, I decided to implement `BitUtils` along with some tests to help working with binaries in a natural (left-to-right) way, consistently across all the related types (`byte`, `short`, `int`, and `long`).
See also BitUtilsTest
To realize the idea of using prime numbers, we of course need a reasonable number of prime numbers of various sizes (according to the data type we process: 8, 16, 32, and 64 bits) with certain qualities. These constants can be found in the `PrimePatterns` class along with some documentation.
You may wonder about the bit-flip distances. The reasoning behind them is that when we XOR a prime pattern with a given bit sequence (working state), each bit should ideally change with a 50% probability. The `PrimePatterns` have been selected considering the number of `1`s in a pattern (a `0` doesn't change anything!) on the one hand and the bit-flip distance among the prime patterns on the other. This shall reduce the likelihood that applying the next prime will undo the effect of the previous round or amplify any pattern.
Try it out! Use `XoRotUtils` to apply some of the patterns to a given value and `BitUtils` to check what's happening at the bit level.
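The bit-flip distance between two patterns is simply their Hamming distance, which is cheap to compute. A quick illustration (the two primes here are arbitrary, well-known examples, not the actual `PrimePatterns` constants):

```java
// The bit-flip (Hamming) distance between two patterns is the number of 1s in their XOR.
long p1 = 1_000_000_007L; // a well-known prime, used here purely as an example
long p2 = 998_244_353L;   // another well-known prime

int flips = Long.bitCount(p1 ^ p2); // how many bits change between the two patterns
System.out.println("bit-flip distance: " + flips);
```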
Previously, I explained the basic idea of combining XOR and bit rotation to achieve my goal: mapping a stream of values to create a pseudo-random distribution.
Because I'll need the same logic for different data types multiple times it makes sense to implement a couple of utility methods along with tests.
The class `XoRotUtils` contains a set of functions. Besides the standard rotation, I also implemented sign-preserving rotations, which are useful for experiments where signed values should not change their signum.
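A sign-preserving rotation can be realized by keeping the sign bit fixed and rotating only the remaining bits. Here is a minimal sketch for `long` under that assumption (my illustration, not necessarily how `XoRotUtils` implements it):

```java
// Sketch of a sign-preserving left rotation for long: the sign bit stays fixed,
// only the lower 63 bits rotate.
static long rotateLeftPreserveSign(long value, int distance) {
    long sign = value & Long.MIN_VALUE;          // isolate the sign bit
    long body = value & Long.MAX_VALUE;          // the lower 63 bits
    int d = Math.floorMod(distance, 63);         // effective rotation distance
    long rotated = ((body << d) | (body >>> (63 - d))) & Long.MAX_VALUE;
    return sign | rotated;                       // signum of the input is preserved
}
```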
See also XoRotUtilsTest
The class `DistributionCodec` provides a number of static methods to `encode` and `decode` numbers. They cover the types `byte` (8 bits), `short` (16 bits), `int` (32 bits), and `long` (64 bits).
Out of curiosity I also added `encodePreserveSign` and `decodePreserveSign` methods. As the name states, they apply the same algorithm, but in such a way that any negative number will be mapped to a negative number and any positive number (incl. `0`) will be mapped to a positive number.
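A hypothetical round-trip to illustrate the contract (the `int` overloads and exact signatures are my assumption):

```java
// Hypothetical usage of the sign-preserving variants (int overloads assumed).
int encoded = DistributionCodec.encodePreserveSign(-42);     // result stays negative
int decoded = DistributionCodec.decodePreserveSign(encoded); // restores the input
assert decoded == -42; // the mapping is bijective
```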
The corresponding tests (simulations) can be found in DistributionCodecTest.
Based on the formula for the average distance, we can derive the expected values for the simulations below.

If $R$ is the size of the range, the expected average distance is $R/3$.

If we preserve the sign, then the effective range is just half the size; accordingly, the expectation is $R/6$:

| Type | Range size $R$ | Expected avg. distance ($R/3$) | Expected avg. distance, sign-preserving ($R/6$) |
|---|---|---|---|
| byte | $2^8$ | $\approx 85$ | $\approx 43$ |
| short | $2^{16}$ | $\approx 21{,}845$ | $\approx 10{,}923$ |
| int | $2^{32}$ | $\approx 1.43 \cdot 10^9$ | $\approx 7.16 \cdot 10^8$ |
| long | $2^{64}$ | $\approx 6.15 \cdot 10^{18}$ | $\approx 3.07 \cdot 10^{18}$ |
testByteCodecFull()
GenStats {
Total count: 256
Minimum src diff: 1
Maximum src diff: 237
Average src diff: 87
Minimum suc diff: 1
Maximum suc diff: 231
Average suc diff: 86
Self-mapped: 0
}
testByteCodecPreserveSignFull()
GenStats {
Total count: 256
Minimum src diff: 1
Maximum src diff: 115
Average src diff: 41
Minimum suc diff: 1
Maximum suc diff: 119
Average suc diff: 42
Self-mapped: 0
}
Although the range of the `byte` data type is small, we see the expected averages, which looks great. Unfortunately, with just 256 possible values in a range, we cannot apply the NIST tests with the Random Bitstream Tester.
testShortCodecFull()
GenStats {
Total count: 65536
Minimum src diff: 1
Maximum src diff: 64918
Average src diff: 21819
Minimum suc diff: 1
Maximum suc diff: 65089
Average suc diff: 21879
Self-mapped: 0
}
testShortCodecPreserveSignFull()
GenStats {
Total count: 65536
Minimum src diff: 1
Maximum src diff: 32693
Average src diff: 10889
Minimum suc diff: 1
Maximum suc diff: 32672
Average suc diff: 10931
Self-mapped: 0
}
For the `short` data type, the averages are also very close to the expected values. Unfortunately, with just 65,536 possible values in a range, we still cannot apply the NIST tests with the Random Bitstream Tester.
testIntCodecFull()
GenStats {
Total count: 4294967296
Minimum src diff: 1
Maximum src diff: 4294933813
Average src diff: 1431733275
Minimum suc diff: 1
Maximum suc diff: 4294900458
Average suc diff: 1431651907
Self-mapped: 0
}
testIntCodecPreserveSignFull()
GenStats {
Total count: 4294967296
Minimum src diff: 1
Maximum src diff: 2147445334
Average src diff: 715831073
Minimum suc diff: 1
Maximum suc diff: 2147427639
Average suc diff: 715700251
Self-mapped: 0
}
For the `int` data type with its $2^{32}$ possible values, the averages are again very close to the expected values. This time the range is big enough, so we can feed the Random Bitstream Tester.
testIntDistPattern()
I see two errors for the last two tests, for which I could not find any explanation, but the other tests look great!
We can assume that the 1 million bits created from the sequence indeed look random.
testLongCodec()
GenStats {
Total count: 4294967296
Minimum src diff: 1786079634
Maximum src diff: 18446461873646289201
Average src diff: 6143413651019470283
Minimum suc diff: 3
Maximum suc diff: 18446408542927331498
Average suc diff: 6148871959460087487
Self-mapped: 0
}
testLongCodecPreserveSign()
GenStats {
Total count: 4294967296
Minimum src diff: 3343366927
Maximum src diff: 9223041817990852893
Average src diff: 3074464191414455403
Minimum suc diff: 1
Maximum suc diff: 9223236968881150431
Average suc diff: 3073819925782603098
Self-mapped: 0
}
For the `long` data type with its $2^{64}$ possible values, we can no longer iterate the full range; the simulation above processes a sample of $2^{32}$ values.
Again, we see the observed averages close to the expected values, which is a good sign for our candidate.
Let's feed the Random Bit Stream Tester again.
testLongDistPattern()
Again, I see these errors for the last two tests, but all the other tests look great.
We can assume that the 1 million bits created from the sequence indeed look random.
Apart from the last two tests, which neither failed nor passed but ran into errors, I see the results as a success. With only a few operations we could realize an implementation that passes sophisticated statistical tests!
For the types `int` and `long` we can now assume the produced distribution is pseudo-random.
The next challenge shall be implementing a random number generator (64 bits) based on our distribution function.
Note that simply taking the next element, i.e., returning $d(1), d(2), d(3), \ldots$, is not sufficient; a proper random number generator has additional requirements:
- There should be a seed, so that we can produce different sequences of values.
- It must be possible to get the same number a few times in a row.
As our distribution function is bijective, it would just create a single long sequence and start over when the range is exhausted.
Thus, the DistributionCodecRandomGenerator uses a different approach.
In a first attempt, I just took the last value and encoded it again, but that quickly created unwanted patterns. Thus, the current implementation works with an extra internal `noise` value, which evolves independently from the last returned value and gets involved in computing the next value the generator returns.
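To illustrate the general idea, here is a minimal sketch; the class and the mixing details are my assumptions, not the actual `DistributionCodecRandomGenerator`. `encode` is the 64-bit distribution function sketched earlier:

```java
// Sketch of a generator with independent internal noise (illustrative only).
final class NoiseGeneratorSketch {
    private long noise;   // hidden state, evolves independently of the returned values
    private long counter; // keeps the hidden input sequence from cycling early

    NoiseGeneratorSketch(long seed) {
        this.noise = encode(seed); // different seeds produce different sequences
    }

    long next() {
        noise = encode(noise + ++counter); // advance the hidden state
        return encode(noise ^ counter);    // output derived from the hidden state
    }
}
```

Because the output is derived from the hidden state rather than from the previous output, the same value can legitimately appear several times in a row, while the internal sequence keeps moving on.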
You can find the related simulations in DistributionCodecRandomGeneratorTest.
testRandom()
GenStats {
Total count: 4294967296
Minimum src diff: 1307188592
Maximum src diff: 18446592376029438236
Average src diff: 6148918652859093665
Minimum suc diff: 2235341302
Maximum suc diff: 18446712165080022615
Average suc diff: 6148852070620812323
Self-mapped: 0
}
💡 To gather the typical statistics in the usual way, I needed a source value `src`, which is of course unavailable for a random generator. I could have used a simple sequence number `1, 2, 3, ...`, but that creates a heavy bias because we are not producing `src` values that cover the full range. This only affects the statistics, not the random number generation itself.
As you can see, both averages are very close to the expected ideal of $2^{64}/3 \approx 6.15 \cdot 10^{18}$.
Let's feed the Random Bitstream Tester with bits from our generator.
testRandomPattern()
Well, that looks great! Finally, all tests passed, no error this time.
Now we have our own home-made 64-bit random number generator.
After the first experiment we have a distribution method $d(n)$ which bijectively maps inputs of a range to values of the same range with pseudo-random characteristics.
Approach
- Split a source message into input blocks ($input_i$ with $i$ as block index) of the same length (add padding if required).
- Use $d(input_i)$ ($i = 1..n$) as keys to encode the working state (same length as the block length) over and over again until the full input has been considered.
- Return the working state as the result: a hash code of fixed length.
This approach roughly follows the idea of a Merkle–Damgård construction.
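A minimal sketch of this block-chaining idea, assuming 64-bit blocks and the `encode` function from earlier (the mixing step is my interpretation, not the actual `DistributionCodecHashBuilder`):

```java
// Sketch of the block-chaining hash (illustrative only).
static long hash(byte[] message) {
    // Pad with zero bytes to a multiple of the block length (8 bytes = 64 bits).
    byte[] padded = java.util.Arrays.copyOf(message, (message.length + 7) / 8 * 8);

    long ws = 0L; // working state, same length as a block
    java.nio.ByteBuffer buf = java.nio.ByteBuffer.wrap(padded);
    while (buf.hasRemaining()) {
        long block = buf.getLong();      // next 64-bit input block
        ws = encode(ws ^ encode(block)); // use d(block) as the key for this round
    }
    return ws; // fixed-length (64-bit) hash code
}
```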
My implementation can be found in DistributionCodecHashBuilder.
It generates 64-bit hashes for arbitrary inputs.
In DistributionCodecHashBuilderTest you can find the related tests and simulations.
testHash()
GenStats {
Total count: 4294967296
Minimum src diff: 60509038
Maximum src diff: 18446457744536997112
Average src diff: 6149018109453884815
Minimum suc diff: 4
Maximum suc diff: 18446356868256507009
Average suc diff: 6148884193996035760
Self-mapped: 0
}
To get an idea of the randomness of the created hashes I hashed a sample of 4 billion numbers out of the available space
The two main KPIs look excellent (expected was
testRandomPatternFromHash()
Of course a single hash value is way too short to perform any statistical tests, so I will simply hash the input and concatenate the hash values.
To make the test more challenging the available input will be a single short sentence.
"Fluffy, Tuffy and Muffy went to town. They all got killed in a terrible accident."
However, we must be sure to create different hash values in every round, so we add the current sequential number as a hash salt.
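A sketch of how such a bitstream can be produced from a single sentence (using the `hash` sketch above; the exact salting scheme is my assumption):

```java
// Build a ~1,000,000-bit stream by hashing the same sentence with a running salt.
String sentence = "Fluffy, Tuffy and Muffy went to town. They all got killed in a terrible accident.";
StringBuilder bits = new StringBuilder();
for (long salt = 0; bits.length() < 1_000_000; salt++) {
    long h = hash((salt + "|" + sentence).getBytes(java.nio.charset.StandardCharsets.UTF_8));
    // Append all 64 bits of the hash, padded with leading zeros.
    bits.append(String.format("%64s", Long.toBinaryString(h)).replace(' ', '0'));
}
```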
Let's feed the Random Bitstream Tester once again.
Excellent! All tests passed, despite the fact that the input was super repetitive.
Now we have our own home-made 64-bit hash function.
💡 If you are interested in this topic, 👉 here you can find an illustrated slide deck (German only) about hashing and its numerous use cases with a few more links for further reading at the end of the presentation.
In a final experiment I want to use my previously created functions to implement symmetric-key encryption.
The algorithm is quite simple:
- Take some plain text and a password
- Use the DistributionCodecHashBuilder to turn the password into a 64-bit hash value.
- Initialize a new instance of DistributionCodecRandomGenerator with the hash value as its seed.
- Use the random generator to create a oneTimePad of the same length as the input.
- Encrypt the message with the one-time pad using XOR: $encrypted = message \oplus oneTimePad$.
- Decrypt the encrypted message with the same operation: $message = encrypted \oplus oneTimePad$.
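Under the same assumptions as before (the `hash` and generator sketches from the previous experiments), the whole scheme fits in a few lines; this is my simplified illustration, not the actual `SecureDistributionCodec`:

```java
// XOR one-time-pad encryption driven by a password-seeded generator (illustrative).
// Calling crypt() a second time with the same password decrypts, as XOR is self-inverse.
static byte[] crypt(byte[] message, String password) {
    long seed = hash(password.getBytes(java.nio.charset.StandardCharsets.UTF_8));
    NoiseGeneratorSketch rng = new NoiseGeneratorSketch(seed); // pad source

    byte[] result = new byte[message.length];
    java.nio.ByteBuffer pad = java.nio.ByteBuffer.allocate(8);
    for (int i = 0; i < message.length; i++) {
        if (i % 8 == 0) {        // refill the pad with the next 64 random bits
            pad.clear();
            pad.putLong(rng.next());
            pad.flip();
        }
        result[i] = (byte) (message[i] ^ pad.get()); // XOR with the one-time pad
    }
    return result;
}
```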
My implementation can be found in SecureDistributionCodec.
Please review the related simulations and tests in SecureDistributionCodecTest.
testRandomPatternFromEncryption()
For this test we will again use a short input sentence and repeat (concatenate) it until we get the desired length, so that the encoded version has enough bits to feed the Random Bitstream Tester.
"The quick brown fox jumped over the lazy dog."
The resulting monotonous (boring) plain text will be used to challenge the encryption method. The password shall be `1234`.
Let's feed the Random Bitstream Tester with the encrypted binary sequence.
Once again, we have passed all tests! So we can assume our encryption method creates pseudo-random output.
First of all: I plead guilty to cheating! 😉
If you just read through this article you may wonder about the excellent test results I achieved with minor effort. I have to admit, I have cheated a little bit.
No worries, I did not fake anything. All results you see are real, and you can reproduce the tests.
However, I had full control over the selected example inputs for hashing and encryption, and over the seed of the random number generator.
Why does it matter?
In 1975, Charles Goodhart stated:
"When a measure becomes a target, it ceases to be a good measure."
Although it was hard work to get the NIST test suite within the Random Bitstream Tester fully green for a particular input, it was not impossible. Consequently, if you change the input (test sentence, seed, password, etc.), some tests may suddenly fail; especially the Runs test is painful to pass.
It would be way harder to design an algorithm that passes all the tests for arbitrary inputs!
You are invited to play with the given examples and the code to get a better understanding of this effect.
I did not further investigate these issues.
The purpose of this test is to determine if the number of visits to a state within a random walk exceeds what one would expect for a random sequence.
As I understand it, this test needs a considerable number of values to be collected in memory.
I speculate that the errors we see occasionally may stem from internal overflows when building internal structures for further analysis. So, the test neither says passed nor failed.
Another explanation might be that for this test the required sample size should be bigger (somehow depending on the characteristics of the sample?).
The KISS principle also applies to our experiments. Adding too many parameters or fancy operations to your recipe makes it hard to understand what is happening and why. Also, obscure implementations are hard to optimize.
In the shady field of pseudo-randomness you often see results which seem to be close to perfect, only a little bit off, etc. Tweaking here and there seems to be so promising!
However, one observation during my experiments was that you should not try to fix a broken concept by adding further complex operations (another rotate, invert, whatever you like).
Whenever you do so, you just increase complexity, runtime and obscurity.
It is almost guaranteed that improving one end will lead to worse results on the other.
Conclusion: Concentrate on a few operations. If your approach doesn't work well, think about a new attempt, again with only a few operations.
Temptation is high to try improving results by applying the same operation again: $d(d(n))$.
This is not a good idea. Sometimes, you may even see test results getting worse rather than better.
But why?
The NIST test suite, among others, focuses on unexpected regularities (patterns). By applying the same imperfect transformation multiple times you increase the likelihood of amplifying an unwanted pattern hidden in your approach.
If you want to apply your recipe in multiple rounds, you must ensure that every subsequent round clearly differs from the previous one (either algorithmically or through its parameters), e.g., by using a different set of prime patterns per round.
I have not further investigated the runtime of my implementations. However, it should be obvious that we will not break any speed records. 😉