Pseudo Random Number Generation by Lightweight Threads
We have brought the Data Encryption Standard (DES) block cipher out of retirement for a second career as a Pseudo Random Number Generator (PRNG). DES PRNG is intended for simulations that benefit from PRN generation at the granularity of lightweight (GPU) threads. An example of this type of simulation is Particle-in-Cell (PIC) with Monte-Carlo Collisions (MCC). In PIC-MCC a large set of charged particles are time evolved in self-consistent electromagnetic fields, with particle collisions against background neutral species modelled as probabilistic events. The MCC modeling consumes a large amount of PRNs and it is important that they are not correlated, or spurious effects can negatively affect the solution, sometimes in subtle ways. The DES PRNG has been thoroughly tested and found to produce higher-quality random numbers than all commonly used PRNGs. The simulation particles are characterized by their phase-space coordinates, commonly represented by six double-precision floating-point numbers. The DES PRNG requires only seven bytes of state, making it practical to store with the 48 bytes of coordinate data. It therefore allows PRNs to be generated with the finest granularity we can imagine to be practical, with one particle per thread. Our motivation for pursuing this fine-grained PRN generation is an expectation that it will remove congestion caused by threads consuming PRNs produced at the process level, thereby improving scalability of MCC modeling on GPUs.
This project originated at the Princeton-OLCF-NVIDIA GPU Hackathon in June, 2020. We successfully ported the Mersenne Twister (Matsumoto) PRNG to run on GPU using OpenACC directives. The Mersenne Twister (MT) has been the predominant PRNG for HPC for the last two decades. MT produces a PRN sequence with a period of 219937- 1 and the state needed to produce the next PRN is about 2.5 kB in size. For simulations using more coarse-grained parallelization (MPI processes and/or CPU threads) MT PRNs can be produced in parallel by using a different seed for each PRNG instance to initialize its state. Because the sequence period is so long, the risk of two seeds producing partially overlapping PRN sequences is statistically insignificant. The difficulty we encountered with MT on GPU is that its state is too large for the very numerous and lightweight GPU threads. A fellow Hackathon participant pointed us to the Random123 C++ library that has implementations of several GPU friendlier PRNGs (Salmon et al.). However, they all have states of 16 bytes or larger, and threading is accomplished with either CUDA or OpenCL, but not OpenACC. We therefore decided to look for a block cipher to use for a new PRNG, ideally suited for PIC-MCC with OpenACC threading.
The DES block cipher, developed by IBM and NSA, was approved as a federal standard for encryption of non-classified information in 1976 and was in use for 25 years until it was superseded by AES. DES uses an 8-byte key to encrypt an 8-byte block of clear text into an 8-byte block of cipher text. However, the least-significant bit of each byte of the key does not affect the cipher text. DES, and other block ciphers, can be used as a PRNG by using a counter as as the clear-text block to get a PRN in the form of the corresponding cipher-text block. For DES, with a block size of 8 bytes, unsigned long is a convenient data type for key, counter and PRN. For PIC-MCC one can use the particle number to generate a key, and combine time step itime and collision number icoll into a counter icount. To avoid having up to 256 particles use the identical PRN sequence, the DES PRNG library only accepts particle numbers in the range of 0 to 256-1 and calculates a unique key for each, which we refer to as the identifier. The counter is calculated as icount = (itime << 16) + icoll. DES PRNG thus supports simulations with up to 256 particles (and as many threads) and 248 time steps, each with 216 collisions.
To test the quality of PRNs, the Crush test suite, which is part of the TestU01 library, is the de facto standard. As shown in Fig. 1, MT fails two out of 144 Crush tests, but DES PRNG passes all of them.
The Crush test suite generates about 235 PRNs and is compute bound. The CPU time can therefore be used as a proxy for the number of integer operations performed, 83% more for DES PRNG than for MT. PIC-MCC simulations are typically memory bound, with more than 90% of the simulation time spent waiting for data to arrive. For realistic use cases, we therefore expect better scalability to far outweigh some more computation for the DES PRNG. DES PRNG also passes all the tests of the TestU01 Diehard and FIPS-140-2 test suites, as shown in Fig. 2.
Diehard was the standard PRN test suite before Crush. FIPS-140-2 implements the tests described by NIST document FIPS PUB 140-2, Security Requirements for Cryptographic Modules, p. 35.
DES PRNG has yet to be used for realistic PIC-MCC simulations, but the results from a very small test simulation are shown in Fig. 3.
DES PRNG is available as an open-source library, implemented in ANSI C (C89) with OpenACC directives, under the permissive Simplified (2-clause) BSD license.
Acknowledgement: The author is pleased to acknowledge that the work reported on in this paper was substantially performed using the Princeton Research Computing resources at Princeton University which is consortium of groups led by the Princeton Institute for Computational Science and Engineering (PICSciE) and Office of Information Technology's Research Computing. This work was performed under the auspices of the Princeton Plasma Physics Laboratory LDRD program (subcontract S017862 from DOE prime contract DE-ACO2-76CHO3073) and was done in close collaboration with Mat Colgrove, The Portland Group, Inc.
Matsumoto, M., and T. Nishimura. “Mersenne twister: a 623-dimensionally equidistributed uniform pseudo-random number generator.” ACM Transactions on Modeling and Computer Simulation, vol. 8, no. 1, 1998, pp. 3-30. doi:10.1145/272991.272995.
Salmon, John K., et al. “Parallel Random Numbers: As Easy as 1, 2, 3.” Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC11), 2011.doi:10.1145/2063384.2063405