Park-Miller-Carta Pseudo-Random Number Generator

The generation of random numbers is too important to be left to chance - Robert R. Coveyou
Investigating fast implementations of the well-known Park-Miller "minimal standard" (16807 and 2^31 - 1) PRNG in C, assembler and soon Fortran.  We are particularly interested in DSP applications for audio etc. (or in networking applications such as routers) and in implementing the algorithm without division, on CPUs which cannot do 32 x 32 = 64 bit multiplication.

Last substantial update: 2005-09-24.  Last minor update: 2013-02-26.  Minor update 2022-01-19.

When I complete the updates to this page, I plan to have C, C++ and dsPIC assembly code for the following arrangements:
Language    16x16=32 or     Andrew          Alternative
            32x32=64 bit    Simper's        multiplier such
            multiplies?     optimisation    as 48271?


C           32              Yes
            64              Yes             Yes

C++         32              Yes
            64              Yes             Yes

dsPIC       32              No              No


This page concerns the pseudo random number generator work of David Gerald Carta, who passed away in February 2013. 

I had a brief and enjoyable correspondence with David in 2005.  His son David Jr. wrote to me in 2013 that his father (a rocket scientist by profession) loved math and problem solving, and that he was delighted with this page and with how widely his algorithm is used.  A memorial page for David G. Carta is:

https://sites.google.com/site/dcarta/home/david-g-carta-memorial




Some early developments.  A Contents listing follows, and an update history is at the end of this page.
2005-09-23: I am corresponding with David G. Carta

2005-09-26: Please keep an eye on this page as I intend to update it with consideration of the 1969 paper "Coding the Lehmer Pseudo-random Number Generator" (PR&B69), (cited by David Carta 1990 and others) which uses the same shift and add approach as I discuss here. The history of this approach is more complex than depicted below.

2005-09-27: I am hoping to include Fortran and assembler code, with various other items of research, from Elliot in the USA, who has been working on PRNGs for 20 years and is working with this algorithm right now.  He is also investigating the multiplication constant 48271.

2005-09-28: See the Csound mailing list for this message from Istvan Varga, in which a 32 x 32 bit multiply producing a 64 bit result (uint64_t is a 64 bit integer) is used to implement the Carta algorithm without the need for the two 16 x 16 bit multiplies I use here.  My aim was to do this all within 32 bit limits, but Pentiums and the like are perfectly happy to do 32 x 32 bit multiplies, and C compilers are able to handle this too.  Istvan Varga's message mentioned "rand31" - but this is not the same code as discussed below.  It is "csoundRand31" and is probably a faster approach than my "rand31" for any C compiler and CPU combination which can do 32 x 32 bit multiplies directly.  Istvan's message includes code for "csoundRand31" and for the Mersenne Twister, and for comparing their speeds.  He finds "csoundRand31" to be marginally faster than the Mersenne Twister.  The Mersenne Twister (as far as I know) has a much better reputation for the quality of its output than any linear congruential PRNG.  Please remember I am not trying to promote linear congruential PRNGs in general, or the 16807, 2^31 - 1 PRNG, as being any better than they are.  My aim is to suggest they may be adequate for some DSP applications, especially concerning audio, to show they can be implemented efficiently on 16 x 16 multiply CPUs without division, to examine the history of this approach, and to document how the Payne / Carta approach works in a simpler manner than has previously been published.

2005-09-29: Andrew Simper, on the Music DSP mailing list, suggested a nifty optimisation to the final mod(0x7FFFFFFF) operation, which is usually a test and conditional branch, in both assembly code and high level languages.
Change 
       if (lo > 0x7FFFFFFF) lo -= 0x7FFFFFFF;
to:
       lo = (lo & 0x7FFFFFFF) + (lo >> 31);
I figured an add of the result of two logic/math operations which do not depend on each other would be faster than a test and a conditional branch.  This may be true on some CPUs, but my attempt to do this with the dsPIC code resulted in more cycles.  The dsPIC is a microcontroller - not an Intel or AMD CPU which has a go at doing 4 instructions in a single clock cycle.  The main reason for the efficiency of my original dsPIC code is that the conditional branch is taken only 25% of the time. 

Using it in C code for Pentium 3 and 4 with optimisations -O2 and a few others produced somewhat faster results.  See my message to the Music-DSP list for details.  I have a 1.7 GHz P4 Celeron producing 32.05 million 31 bit pseudo-random numbers per second, including function call overhead and a while loop comparing the result with 1.  It produces the entire 2 billion long sequence in 67 seconds.
 
2005-10-03:  In email conversations with David Carta, reflecting on the 1969 Payne, Rabung and Bogyo paper, David recognises that paper, rather than his own from 1990, as the first publication of the simpler, divisionless, approach to implementing this PRNG.

Regarding the second part of his paper, which describes very easy to implement PRNGs with limited sequence length, he also points out that there are applications for pseudo-random numbers where efficiency of calculation is important, and where "quality" of the pseudo-random numbers need not be as rigorous as many pseudo-random number folk assume.  For instance, in a router which requires pseudo-random times for retry attempts, it doesn't matter if the sequence is only a few million or tens of millions long, or if the quality of the numbers would not pass muster in a simulation, cryptographic or audio DSP setting.
Please let me know suggestions for improvement! 
D. H. Lehmer, Mathematical methods in large-scale computing units.
In Proceedings of a Second Symposium on Large-Scale Digital Calculating Machinery, Cambridge MA, pp. 141-146, 1949.
(2022-01-19 update thanks to Sebastián.)   http://www.bitsavers.org/pdf/harvard/Proceedings_of_a_Second_Symposium_on_Large-Scale_Digital_Calculating_Machinery_Sep49.pdf

D. H. LEHMER, Mathematical methods in large-scale computing units.
Annals of the Computation Laboratory of Harvard University, No. 26,
Proceedings of a Second Symposium on Large-Scale Digital Calculating Machinery, p. 141 (1951)

2005-12-03:  Michael Pohoreski wrote to me explaining that the mathematical identity I was looking for here is a type of generating function, for the power series expansion of 1/(1-x):  http://mathworld.wolfram.com/GeneratingFunction.html .


2008-01-26: The algorithm is to become part of an Internet standard for Forward Error Correction.  An Internet Draft, which will probably become an RFC - "Low Density Parity Check (LDPC) Staircase and Triangle Forward Error Correction (FEC) Schemes" tools.ietf.org/html/draft-ietf-rmt-bb-fec-ldpc-08 - uses this algorithm.  (Please check this link for later versions of this Draft or for its emergence as a standards track RFC.)   Forward Error Correction involves extra data transmitted so the original data can be recovered even when some packets are lost.

These FEC schemes are intended to be used in an Internet reliable multicast system, but they can also be used in non-Internet multicast systems such as DVB digital audio/video broadcasting.  Some background on the need for long block-length Forward Error Correction schemes is here:

Large Scale Content Distribution Protocols
Christoph Neumann, Vincent Roca and Rod Walsh
ACM SIGCOMM Computer Communication Review, Volume 35, Number 5, October 2005, page 85
planete.inrialpes.fr/people/roca/doc/ccr05_neumann.pdf  

Section 5.7 of the Internet Draft defines the Park-Miller Minimal Standard PRNG as the one which must be used for these FEC schemes and cites this page as a reference for the history of the algorithm and the file here rand31-park-miller-carta.cc.txt as an optimised implementation.

The Park-Miller "Minimal Standard" PRNG is attractive for DSP applications involving sound, vision and other applications such as spread-spectrum radio. I present C, C++ and dsPIC source code for this generator, using David G. Carta's optimisation which enables the generator to be implemented with 32 bit math and without any division instructions.  While it is not suitable for cryptography or simulation, the "minimal standard" generator is probably fine for many DSP applications. 

The dsPIC is a series of DSP oriented microcontroller chips from http://www.microchip.com which run at 20 to 30 MIPS (80 to 120 MHz clock), cost $5 to $15 and come in packages from 18 to 64 pins.  The subroutine I provide can generate a new result in 18 instruction cycles on average, including call and return overhead.  The C code produces about 13 million results a second on an 800MHz Pentium III - about 62 clock cycles per result, including call and test loop overhead.

Pseudo-random number generation is something of a Black Art and it is generally best to use a well-known PRNG, after studying its pedigree.

This page traces the history of this PRNG and provides public domain source code of fast implementations of it.  Please let me know your comments and suggestions for improvements - and any source code for other CPUs you wish to make public.
Robin Whittle rw@firstpr.com.au    2005 September 26

To the page on Pink Noise (AKA 1/f noise): ../pink-noise/  
To my main First Principles page, with the world's longest Sliiiiiinky and many other distractions ../../ .

Contents

If you don't care about the minutiae of the implementation, but just want to use the code, please read or at least review the sections marked with an asterisk. I think everyone who chooses to use a pseudo-random number generator should know its background so they can assure themselves it is good enough for their intended application.
>>>>  * Introduction

>>>>  * Primary References

>>>>  * History of the (seed * 16807 mod(2^31 - 1)) Linear Congruential Generator (LCG)

>>>>     History of the Implementation Debate

>>>>     A Simpler Explanation of David G. Carta's Optimisation

>>>>     Implementations using Schrage's and Carta's approaches

>>>>  * Quality of the Pseudo-Random Numbers

>>>>  * Source Code

>>>>     Improving the Quality of Pseudo-Random Numbers

>>>>     Extending the Carta Algorithm Towards 64 bits

>>>>     Musings on the Philosophy of Science

>>>>     Links

>>>>     Update History



C code without comments:

Park-Miller-Carta pseudo-random number generator - C and C++
Less than 33 CPU clock cycles (including function call overhead and a loop with an output comparison test) on a Pentium III, with some compiler optimisations.  (Less than 28 clock cycles with Andrew Simper's optimisation.)

dsPIC code without comments:
Park-Miller-Carta pseudo-random number generator for dsPIC
18 CPU cycles including call and return.  A dsPIC microcontroller can run at 20, 30 and now, with a new series, 40 MIPS.

* Introduction

Please read the history of this algorithm and the analysis of its qualities before using the code.  These sections have headings starting with *.

Some general information on PRNGs is at http://en.wikipedia.org/wiki/Pseudo-random_number_generator .  Most of the really rigorous work on PRNGs is for people using them for simulation and cryptography.  For instance see the team at the University of Salzburg: http://random.mat.sbg.ac.at/ .  Simulation researchers used linear congruential generators extensively from the 1960s to the 1990s, but they generally have much better methods now.  The crypto people have their own algorithms.  All these fancy modern methods are probably overkill if you just want some noise and randomness in real-time on limited hardware, such as for dither, for audio synthesis or for controlling stochastic composition processes.

Unfortunately, linear congruential generators are a minefield of lousy algorithms and implementations.  Park and Miller tried to sort this out in 1988, and the generator presented on this web page is what they recommended then as the one to use unless you really knew what you were doing and had something better.  Since the C, C++ and dsPIC code here is public domain and fast, you may like to adopt the same attitude to this code as Park and Miller tried to instill in programmers in 1988: unless you really know what you are doing, choosing this algorithm is likely to be a wiser move than picking up something from a textbook or rolling your own.

* Primary References

The story of this PRNG is told largely in these four papers in the Communications of the Association for Computing Machinery http://www.acm.org .  Such papers are generally not publicly available, but hardcopy and electronic access is available through most university libraries.  I provide copies of them according to my interpretation of the copyright notice.

Payne, Rabung & Bogyo 1969 (PR&B69)
WH Payne, JR Rabung, TP Bogyo (Added 2005-10-03)
Coding the Lehmer pseudo-random number generator
Communications of the ACM, February 1969, Vol 12 Number 2  85-86 p85-payne.pdf
After I made this page, it emerged that this paper, rather than David Carta's, is the first publication of the simple algorithm of shifting and adding bits, rather than performing division, to achieve the mod(0x7FFFFFFF) result.

Pierre L'Ecuyer 1988 (L88)
Pierre L'Ecuyer
Efficient and Portable Combined Random Number Generators
Communications of the ACM, June 1988, Vol 31 Number 6 742-774  p742-l_ecuyer.pdf
His site is http://www.iro.umontreal.ca/~lecuyer/ with many papers, but not this one. 

Park and Miller 1988 (PM88)
Stephen K. Park and Keith W. Miller
Random Number Generators: Good Ones are Hard to Find
Communications of the ACM, Oct 1988, Vol 31 Number 10 1192-1201 p1192-park.pdf

This paper is widely referred to and is recommended reading for anyone who thinks it's OK to use a library's built-in "random" function, or to grab some algorithm which has not been properly scrutinised.

Carta 1990 (C90)

David G. Carta
Two Fast Implementations of the "Minimal Standard" Random Number Generator
Communications of the ACM, Jan 1990, Vol 33 Number 1 87-88  p87-carta.pdf

Remarks 1993 (R93)
George Marsaglia; Stephen J. Sullivan; Stephen K. Park, Keith W. Miller, Paul K. Stockmeyer
Remarks on Choosing and Implementing Random Number Generators
Communications of the ACM, Jul 1993, Vol 36 Number 7 105-110 p105-crawford.pdf

* History of the (seed * 16807 mod(2^31 - 1)) Linear Congruential Generator (LCG)

According to PM88, the first proposal to use what is now known as a linear multiplicative congruential generator for random numbers was made by D. H. Lehmer in 1951. (See below for a 1949 Lehmer reference.) Some generators involve additions and other contrivances, but the type we are interested in has a simple form.
To generate the new pseudo-random value, multiply the previous value by a constant, and then take the modulus of it by another constant.
("Modulus" means the remainder after division.  11 mod 4 = 3.  This is also written as 11 % 4  or as a C floating point function fmod(11, 4) ).
This is done with integer maths, and typical values of the modulus constant are in the 2^31 to 2^32 range, though earlier, very limited range generators used constants around 2^16.

Basically, it multiplies the current number by some large number and clips the result back to the modulus range.  The result is, of course, entirely predictable, but for some or many purposes, it is sufficiently unrelated to the previous, recent and subsequent results to be valuable in a setting which requires something approaching "randomness".
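
To make the recurrence concrete, here is a minimal sketch in C.  It uses a 64 bit intermediate product and the % (modulus) operator - exactly the brute-force approach that the rest of this page is about avoiding - and the function name and use of uint64_t are just illustrative:

    #include <stdint.h>

    /* One step of the generator: new = (previous * a) mod m.
       Brute force: needs a 64 bit intermediate and a division. */
    uint32_t lcg_step(uint32_t previous)
    {
        const uint64_t a = 16807;
        const uint64_t m = 2147483647;        /* 2^31 - 1 */
        return (uint32_t)((previous * a) % m);
    }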

The most convenient modulus constants for digital computers are integer powers of 2.  If calculations are done with a precision such as 32 bits, and it is required to do mod(2^31), then it is a simple matter to calculate the result by clearing the left-most bit.  Likewise, mod(2^32) is highly convenient for machines with 32 bit registers - the calculations simply ignore overflow, since the overflow is getting rid of the integer number of 2^32s in the result, leaving the remainder, which is the modulus.

The generator we are studying here uses a modulus constant of 2^31 - 1.  Here it is in various forms:
Binary:     0111 1111 1111 1111 1111 1111 1111 1111
Hex:        0x7FFFFFFF
Decimal:    2,147,483,647
2^31 - 1 is a Mersenne Prime number ( http://mathworld.wolfram.com/MersennePrime.html ) - a prime number which is one less than an integer power of 2.  The prodigious Leonhard Euler [oi'lər] discovered this, the 8th Mersenne Prime, in 1750.  There's a whole bunch of mathematical stuff on this, which I haven't looked into, but it is clear that this number has quite a history.

The multiplier in the LCG can be any positive integer, typically less than the modulus constant.  Most numbers do not lead to good PRNGs, because the series of numbers they generate gets stuck in a short loop.  Sometimes there are multiple loops to get stuck in, depending on the seed with which the algorithm is started.

A carefully chosen multiplier ('a' in most of the literature), with the right modulus constant ('m'), can generate a sequence of numbers which includes every number from 1 to one less than the modulus constant itself.  Such a multiplier is called a "full period multiplier".  The sequence is this length - the modulus constant minus 1 - and each number appears exactly once.  For this remarkable feat to be performed, the multiplier, when raised to the 1st power, the 2nd power, the 3rd power and so on up to the (m - 1)th power (itself, squared, cubed etc.), must produce values which, when subjected to the modulus operation, all give different remainders.  (In number theory terms, 'a' must be a primitive root modulo m.)

One such number for an LCG with a modulus constant of (2^31 - 1) is 16807 (which is 7^5).  According to PM88 there are over 534 million other numbers which are also full period multipliers for a modulus constant of (2^31 - 1).  Lewis, Goodman and Miller (not Keith Miller) first suggested 16807 in 1969.

The LCG of:
previous value * 16807 mod (2^31 - 1)
has been widely used and studied since 1969. 

In 1988 Stephen K. Park and Keith W. Miller wrote their paper, pointing out some of the failings in widely used PRNGs, and making a plea to adopt the above generator as a "minimal standard" - to use unless there was something which was known to be better.

They pointed out that the choice of 16807 was not the best one for producing numbers which passed various statistical tests of randomness - such as independence of one value from its neighbours and the absence of periodic fluctuations in the values of the stream.

They reported other people's research and suggested that in the future they would probably recommend 48271 or 69621 as better alternatives, but that since the output of 16807 was better studied, it was still the best choice for a minimal standard.

A Google search http://www.google.com/search?as_q=random+16807 for random and 16807 turns up 24,900 pages!


If you want to implement an LCG with one of these higher multipliers using floating point, then there are no problems.  However, the nature of the integer approach I use here, based on the work of David Carta, means that the multiplier needs to fit in 16 bits.  As far as I can tell, 16807 is the only well respected multiplier which fits in 15 bits.  Perhaps there are no other 15 bit full period multipliers.  It may be possible to create elegant, fast, integer implementations of an LCG with m = (2^31 - 1) and with 16 bit or larger multipliers - but it would require careful thought.  I suggest that if you are fussy enough about pseudo-random numbers to be dissatisfied with 16807, you are probably doing work where quality is important enough to justify a much more respected, modern approach such as the Mersenne Twister: http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/emt.html - which is probably slower and certainly uses more memory.

So let's turn now to questions of efficiency in implementing the Park-Miller "minimal standard" PRNG.

If you want to skip the details of the implementation, and get on with reviewing the pedigree of this PRNG, please skip the next few sections and go to the Quality section now.

History of the Implementation Debate

A properly implemented program will behave identically to any other implementation.  This PRNG produces a circular list of 2,147,483,646 31 bit numbers and when started from 1 will generate the following values.  For instance, with an input of 1, the result is 16807.  With an input of 16807, the result is 282475249 and so on.  Eventually, at the end of the sequence, the output returns to 1 and the sequence starts again.
     Value          Number of results after seed of 1

     16807          1
 282475249          2
1622650073          3
 984943658          4
1144108930          5
 470211272          6
 101027544          7
1457850878          8
1458777923          9
2007237709         10

 925166085       9998
1484786315       9999
1043618065      10000
1589873406      10001
2010798668      10002

1227283347    1000000
1808217256    2000000
1140279430    3000000
 851767375    4000000
1885818104    5000000

 168075678   99000000
1209575029  100000000
 941596188  101000000

1207672015 2147483643
1475608308 2147483644
1407677000 2147483645
         1 2147483646  <<< Starting the sequence again with the original seed.
     16807 2147483647
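
As a quick check of any implementation against the values above, a small test program like the following should print 16807 for the first result and 1043618065 for the 10,000th.  (This uses the brute-force 64 bit form from the sketch earlier, so it is slow but unambiguous; the names are mine.)

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint32_t seed = 1;
        for (long n = 1; n <= 10000; n++) {
            seed = (uint32_t)(((uint64_t)seed * 16807) % 2147483647);
            if (n <= 10 || n == 10000)
                printf("%11lu %10ld\n", (unsigned long)seed, n);
        }
        return 0;
    }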

The debate is about how to write a program which performs this function, to achieve a variety of goals.
  1. To make it as portable as possible - different compilers and CPU hardware.

  2. To make it execute as fast as possible.
There are two major problems in the brute-force implementation:
  1. The (seed * 16807) operation results in a 46 bit product, but most CPUs (until recently) had registers and integer ALUs which were limited to 32 bits.

  2. The most obvious way of doing the modulus operation is to divide this 46 bit result by the 31 bit 0x7FFFFFFF value - which is a daunting task beyond the hardware of most CPUs.  Furthermore, division is generally a multi-cycle operation on most CPUs, while multiplication and addition are typically done in single cycles.
PM88 give Fortran code for three implementations:
  1. Integer Version 1: Requires the compiler to handle integers as long as 46 bits, and to be able to do division on such numbers.

  2. Integer Version 2: Based on the 1979 work of L. Schrage.  This avoids the generation of such long integers and the consequent need to subtract large numbers of 0x7FFFFFFF from them.  It requires 32 bit wide math operations, including divide.  (Schrage's approach is described in section 7.1 of Numerical Recipes in C, which I give details of below.)

  3. Real Version 2: Similar to the above, but implemented with floating point, so that the compiler can generate whatever code it likes to handle the 32 bit requirements, without relying on CPU hardware, which may be unable to support the multiplications and divisions directly.
Of these three, the Schrage Integer Version 2 has been most widely used - as recommended by Park and Miller.
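
For comparison with the Carta approach discussed next, here is a minimal C sketch of Schrage's method (my own rendering, not Park and Miller's Fortran).  With a = 16807 and m = 2^31 - 1, the constants q = m div a = 127773 and r = m mod a = 2836 allow the product to be computed without ever exceeding 32 bits - but a division by q is still needed on every call:

    #include <stdint.h>

    /* Schrage's method: (seed * 16807) mod (2^31 - 1) using only 32 bit values.
       Because r < q, neither a*lo nor r*hi can overflow a signed 32 bit word.  */
    int32_t schrage_step(int32_t seed)
    {
        const int32_t a = 16807;
        const int32_t m = 2147483647;    /* 2^31 - 1  */
        const int32_t q = 127773;        /* m div a   */
        const int32_t r = 2836;          /* m mod a   */

        int32_t hi = seed / q;
        int32_t lo = seed % q;
        int32_t t  = a * lo - r * hi;
        if (t <= 0)
            t += m;
        return t;
    }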

In 1990, David Carta (who was not working as an academic - he is described as a Pasadena based "independent PC consultant and control systems engineer with interests in physical, biological and business applications") wrote a paper describing how to implement this PRNG without division. 

He suggested generating the 46 bit multiplication product in two sections:
q - the lowest 31 bits: 30 to 0.
p - the high 15 bits: 45 to 31.
Rather than consider these as a 46 bit value, which must be reduced mod 0x7FFFFFFF by having an integer number of 0x7FFFFFFFs subtracted from it, he suggested that the correct result can be obtained by simply adding p to q, and then doing a simple mod(0x7FFFFFFF) on the 32 bit result.  See the diagram below, in the discussion of Ray Gardner's code.

Here is his explanation.  It is pretty daunting.  I have a much simpler explanation, so skip over David Carta's explanation unless you are really keen.

[David Carta's explanation, reproduced from his 1990 paper, appeared here as an image.]

David Carta did not provide any computer code to implement his proposal.  The second part of his paper considers even simpler implementations, with reduced sequence lengths, and how these degraded PRNGs might be suitable for some applications. 


In 1993 three separate pieces appeared on the subject of the "minimal standard" PRNG (R93).
George Marsaglia criticised Park and Miller's 1988 paper in a number of ways, casting aspersions on many aspects of the paper, and also on David Carta's paper.  He then proposed an LCG with the same multiplier - 16807 - and a modulus constant of (2^32 - 2) = 0xFFFFFFFE.  This produces 32 bit results, with a period twice as long as Park and Miller's minimal standard.  He also proposed another PRNG using a modulus constant of (2^32 - 1), and a variety of more complex PRNGs.

Stephen J. Sullivan pointed out what he considered to be a "serious flaw" in the Park Miller minimal standard PRNG.

Stephen K. Park, Keith W. Miller and Paul K. Stockmeyer responded to the above two pieces and commented on David Carta's 1990 paper.  I won't go into the critiques or the responses - as far as I understand them, none of the critiques shows that the Park Miller minimal standard PRNG with 16807 and 0x7FFFFFFF is anything less than Park and Miller claimed it to be: a better than average PRNG, which has been well studied and should be used unless there is a better alternative.

We now consider what Park, Miller and Stockmeyer had to say about David Carta's 1990 paper:
Comments on Carta
In retrospect, when Carta's original article was published, we should have commented on it.  We didn't, however, so now is the time to do so.  Although the article's title may suggest otherwise, the generator as implemented by Carta is not the minimal standard; it isn't even a full-period generator.  We know of no good reason to use Carta's Generator.  Moreover, for the reasons mentioned previously, we reject the associated register-level programming he advocates as antithetic to our random number generation philosophy,

Spare a thought for David Carta at this point.  He figures out a nifty, fast, apparently elegant way of simplifying the calculation of the Park Miller minimal standard PRNG and gets his paper published in the same journal as Park and Miller wrote in.  Two years later, Park and Miller reject his optimisation and his philosophy, without a word of detail, as a footnote to another discussion - not even bothering, it seems, to critique his work promptly in private or public.

But Park and Miller were wrong.  Carta's algorithm works like a charm.


My guess is that Park and Miller were bamboozled by Carta's complex explanation of why his algorithm works.

I have never seen any other explanation, and the one I came up with on 2005 September 9th is not like anything I have read.   My explanation is below.

Who was that Masked Man?  Seeking David G. Carta

Quick update: I am corresponding with David Carta!

Graduated as an Engineer at Caltech in 1962.  Details of a 1977 paper "Minimax approximation by rational fractions of the inverse polynomial type" here list him as being at the Jet Propulsion Laboratory, Pasadena.  He wrote an article http://www.reasons.org/resources/apologetics/icrvictory.shtml probably in 1992 on legal wrangling concerning the Institute for Creation Research.  The byline says: "David has a bachelor's degree in Physics from Caltech and a Ph.D. in computer sciences from Washington University, St. Louis. He is currently a consultant for personal computer applications."

A Simpler Explanation of David Carta's Optimisation

David Carta's paper says that instead of doing a mod(0x7FFFFFFF) on a 46 bit number, using division and then multiplication and subtraction, or by using Schrage's approach (which uses division but avoids ever generating a number wider than 32 bits), we can split the result into two sets of bits, "p" and "q":

          4444 4444 3333 3333 3322 2222 2222 1111 1111 11           
          7654 3210 9876 5432 1098 7654 3210 9876 5432 1098 7654 3210
   q 31                        qqq qqqq qqqq qqqq qqqq qqqq qqqq qqqq
   p 15     pp pppp pppp pppp p
and to get the final result, mod(0x7FFFFFFF) we simply add "p" and "q",
   q 31                        qqq qqqq qqqq qqqq qqqq qqqq qqqq qqqq
   p 15                     +                      ppp pppp pppp pppp
                            = Cxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx                          
and then do a mod(0x7FFFFFFF) on the result - an operation which at most removes one instance of 0x7FFFFFFF.

His explanation is above - and it is difficult to understand.  To get from his equation [2] to [3] involves an identity (can anyone tell me what it is called?  Yes!  Thanks Michael: http://mathworld.wolfram.com/GeneratingFunction.html ):
1/(x-1) = 1/x + 1/x^2 + 1/x^3 + 1/x^4 + 1/x^5 + 1/x^6 + . . .
Once I figured that out, I was able to follow the rest of it (the transition from (4) to (5) was a doozy too . . . ), but I had no physical feel for what it meant.  Here is the explanation I developed independently of Carta's explanation:

Let's define some labels:

Q = 31 bits 30 to 0.
P = 15 bits 14 to 0. 

If we were doing 46 bit math, the
multiplication product (seed * 16807)
would be:

 
     Q
  + (P * 0x80000000)
 

Observe that this is the same as:

     Q
  + (P * 0x7FFFFFFF)
  + (P * 0x00000001)
 
                                         
However, we don't need or want a 46 bit result.  We only want that result mod(0x7FFFFFFF).  Therefore we can ignore the middle line and use for our intermediate result:

      Q
    + P

This potentially 32 bit value still needs to be subjected to a mod(0x7FFFFFFF) operation to produce the final pseudo-random result.

Because of the nature of the Q and P bits, and the nature of the modulus constant, the modulus operation can be done with a simple bit 31 test, and the only action which may need to be taken is to subtract a single instance of the modulus constant.  Because that constant is 0x7FFFFFFF (and because we know that the intermediate result will never exactly equal 0x7FFFFFFF), it suffices to test bit 31, and if it is set, to zero it and increment the remaining bits, which are guaranteed not to overflow.  This is subtracting 0x80000000 and adding 1, which amounts to subtracting 0x7FFFFFFF.

That's the end of my explanation, which I updated on 2005-09-26 to cover the simpler approach to the modulus operation which follows the dramatic bit shift and add.
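
To make this concrete, here is a minimal C sketch of the whole step, written the way the explanation above describes it - two 16 x 16 bit multiplies, shifts and adds, and one conditional subtraction, with no division and no 64 bit intermediate.  This is my illustrative sketch, not a copy of the downloadable code further down the page:

    #include <stdint.h>

    /* One step of the Park-Miller generator using the Carta approach.
       seed must be in the range 1 to 0x7FFFFFFE.                       */
    uint32_t rand31_step(uint32_t seed)
    {
        uint32_t lo = 16807 * (seed & 0xFFFF);   /* 16807 * low 16 bits of seed  */
        uint32_t hi = 16807 * (seed >> 16);      /* 16807 * high 15 bits of seed */

        /* Fold the high part back in: modulo 0x7FFFFFFF this sum is
           congruent to the full 46 bit product Q + (P * 0x80000000).   */
        lo += (hi & 0x7FFF) << 16;               /* low 15 bits of hi -> bits 30..16 */
        lo += hi >> 15;                          /* remaining bits of hi -> bits 14..0 */

        /* At most one 0x7FFFFFFF needs to be removed; testing bit 31 is enough,
           because the intermediate value never exactly equals 0x7FFFFFFF.       */
        if (lo & 0x80000000)
            lo = (lo & 0x7FFFFFFF) + 1;          /* subtract 0x80000000, add 1 */

        return lo;
    }

The final test-and-fix can also be written with Andrew Simper's branchless form mentioned near the top of this page: lo = (lo & 0x7FFFFFFF) + (lo >> 31);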


Now that I see this simpler explanation, I considered a number of questions:
Was I the first person to discover this explanation? 
I can't find any published explanation other than David Carta's.
Did David Carta see it my "simple" way?
Yes: Then why didn't he explain it this way?

No: Then how did he figure out his complex solution to the problem?  Do people really lie in bed thinking in such complex terms about PRNGs?
I am unable to rule out the possibility.
Why didn't David Carta provide code examples?
He had programmed it, but didn't include code in his paper.  Perhaps he wrote his test code in assembler and didn't have a C, Pascal etc. version?  (David Carta told me he tested the code on a PC, so I guess it was in assembler.)
Why did Park and Miller reject Carta's approach?
They misunderstood the functionality of his proposal and coded or analysed that instead?
Possible.
They didn't care about the functionality - they just followed the complex math and found fault with it.
Seems pretty likely.
If Carta had explained it as simply as I now see it, would Park and Miller have accepted, adopted and promoted it?
Quite likely!
A further question is where else, apart from the papers already cited, might we expect to find discussion of optimisation techniques for the minimal standard PRNG.  Here is one such paper:

Random Number Generation
Pierre L’Ecuyer
http://random.mat.sbg.ac.at/ftp/pub/data/lecuyer_03_handstat.pdf

but there is no direct mention of Carta or any such optimisation.  This paper cites another:

Multiplicative, congruential random-number generators with multiplier ±2^k1 ±2^k2 and modulus 2^p - 1
Pei-Chi Wu June 1997          
ACM Transactions on Mathematical Software (TOMS),  Volume 23 Issue 2

Maybe the algorithm on page 259 has something to do with Carta's approach - but there is no direct reference.
Can anyone point to explanations of Carta's algorithm other than Carta's original paper?  I am corresponding with David Carta. See below.

Implementations using Schrage's and Carta's Approaches

There's nothing in this section which relates to the quality of the pseudo-random numbers.  Skip forward below to pursue that.

Following PM88, there were a bunch of implementations following Schrage's approach.  This works fine, but it involves division.

My interest here is finding implementations of Carta's approach - and any explanations which may accompany them.  So far, I have found several pieces of code and one report of another implementation.  In approximate date order, they are:

Floyd ~1990

In 2005 September at http://www.cstone.net/~bachs/ddj/3sa12.htm I found:
Park-Miller generator, 8086 vers. of Carta implementation [31-bit], in "An Existential Dictionary" (E.T. Floyd), Nov90, 20; Jan91, 8; Jun91, 8
in what is evidently a subject listing for Dr Dobbs Journal (of Computer Calisthenics and Orthodontia) http://www.ddj.com .  The articles are not available online as far as I know.  I haven't yet tried to obtain this issue from a university library.

William B. Jones ~1992

Searching for "random" and "16807" I found several sites with the following code.  A good string to search for to find other copies is: ""RANDOM.ASM--implementation of minimal standard random number".  For instance:
http://www.cecs.csulb.edu/~brewer/325disk/Utility/RANDOM.ASM
Here is the listing:

;; RANDOM.ASM--implementation of minimal standard random number
;; generator. See article by David G. Carta, CACM Jan 1990, p. 87
;;
;; Calling Sequence:
;;
;; EXTRN Random: NEAR
;; call Random
;; 0 < value < 2**31 - 1 returned in dx:ax
;;
;; Library source text from "Assembly Language for the IBM PC Family" by
;; William B. Jones, (c) Copyright 1992, Scott/Jones Inc.
;;
.MODEL SMALL
.DATA
Seed DD 1 ; Chosen to test routine
A EQU 16807
S0 EQU WORD PTR Seed ; Low order word of seed
S1 EQU WORD PTR Seed+2 ; High order word of seed

.CODE
PUBLIC Random
Random PROC
push ds ; No params or local data, so bp not needed
mov ax, @data
mov ds, ax
;
; P2:P1:P0 := (S1:S0) * A
;
mov ax, S0
mov bx, A
mul bx
mov si, dx ; si := pp01 (pp = partial product)
mov di, ax ; di := pp00 = P0
mov ax, S1
mul bx ; ax := pp11
add ax, si ; ax := pp11 + pp01 = P1
adc dx, 0 ; dx := pp12 + carry = P2
;
; P2:P1:P0 = p * 2**31 + q, 0 <= q < 2**31
;
; p = P2 SHL 1 + sign bit of P1 --> dx
; (P2:P1:P0 < 2**46 so p fits in a single word)
; q = (P1 AND 7FFFh):P0 = (ax AND 7fffh) : di
;
shl ax, 1
rcl dx, 1
shr ax, 1
;
; dx:ax := p + q
;
add dx, di ; dx := p0 + q0
adc ax, 0 ; ax := q1 + carry
xchg ax, dx
;
; if p+q < 2**31 then p+q is the new seed; otherwise whack
; off the sign bit and add 1 and THATs the new seed
;
test dx, 8000h
jz Store
and dx, 7fffh
inc ax
Store:
mov S1, dx
mov S0, ax
pop ds
ret
Random ENDP
END

Again, no indication that the author understood why the algorithm works. (I don't have time to figure out exactly what this code is doing - it wouldn't be hard if it was well commented.)

NetBSD ~1994

I found a discussion at: http://mail-index.netbsd.org/current-users/2000/10/14/0013.html referring to a "random.S" file presumably in the library for the BSD kernel.  Further mentions of this code can be found by searching for a substring:
http://www.google.com/search?q=+%22the+high+31+bits+starting+at+bit+32%22
The earliest date mentioned on any of these is October 1994:
http://safariexamples.informit.com/0201799405/netbsdsrc/sys/arch/vax/vax/random.s
The implementation is of Carta's algorithm, but still there is no explanation.  The copyright is 1994 "Ludd" of the University of Luleå, Sweden.  Here is some pseudocode in the comments:

Here is a very good random number generator.  This implementation is
based on ``Two Fast Implementations of the "Minimal Standard" Random
Number Generator'', David G. Carta, Communications of the ACM, Jan 1990,
Vol 33 No 1. Do NOT modify this code unless you have a very thorough
understanding of the algorithm. It's trickier than you think. If
you do change it, make sure that its 10,000'th invocation returns
1043618065.

p = (16807*seed)<30:0> # e.g., the low 31 bits of the product
q = (16807*seed)<62:31> # e.g., the high 31 bits starting at bit 32
if (p + q < 2^31)
seed = p + q
else
seed = ((p + q) & (2^31 - 1)) + 1
return (seed);

The result is in (0,2^31), e.g., it's always positive.

There is no explanation and the assembly code itself is rather long (longer than my dsPIC code, anyway) and completely devoid of comments, though at least it has this pseudo-code explanation and a warning about messing with the details.

Waldspurger and Weihl ~1994
This paper:
Lottery Scheduling: Flexible Proportional-Share Resource Management
Carl A. Waldspurger William E. Weihl
MIT Laboratory for Computer Science
http://www.usenix.org/publications/library/proceedings/osdi/full_papers/waldspurger.pdf
includes some MIPS CPU assembly code:

[The paper's MIPS assembly listing appeared here as an image.]

Note the Ca90 reference to Carta's paper.

This CPU is evidently capable of generating (with multu) a 64 bit product of two 32 bit multiplicands.  They write that the code executes in approximately 10 RISC instructions.  This is a very widely cited paper, and it may have increased the adoption of the Carta algorithm, despite the code above probably looking like gobbledygook to anyone but a MIPS assembly programmer.  The code is admirably commented, so it is easy for someone to copy at the pseudo-code level for another CPU or high level language.

As far as I can tell, they split the product into two separate sets of 31 bits, in 32 bit registers, and then simply add them together and perform mod(0x7FFFFFFF) on the result.  The "overflow" code does the "subtract 0x7FFFFFFF" operation by zeroing bit 31 and then incrementing.  The bit zeroing is done by shifting left, so that bit winds up in the carry flag, and then shifting right with a zero (I guess) taking its place.  Evidently this was faster than actually subtracting 0x7FFFFFFF.
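
In C terms, on an unsigned 32 bit value the same trick looks like this (my rendering, not code from the paper):

    #include <stdint.h>

    /* Equivalent of the MIPS "overflow" sequence: clear bit 31 by shifting it
       out and back, then increment - a net subtraction of 0x7FFFFFFF from a
       value which is known to have bit 31 set.                                */
    uint32_t wrap_7fffffff(uint32_t s)
    {
        return ((s << 1) >> 1) + 1;
    }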

Ray Gardner ~1995 - and an exploration of Carta's approach

Ray Gardner wrote a public-domain snippet of C code which has been widely copied and used.  Google finds older versions of it which are the same in terms of the code, but with a "Date last modified: 02-Nov-1995".  So I figure this code was written sometime between 1990 and 1995.

The code can be found at http://c.snippets.org/snip_lister.php?fname=rg_rand.c and in many other places, by searching for: longrand and "Ray Gardner".  Here is the main part of the code - the earliest dated version I could find, at: http://juggernaut.eti.pg.gda.pl/KATEDRY/kecs/lab-cpp/snippets/RG_RAND.C .
/* +++Date last modified: 02-Nov-1995 */

/*
** longrand() -- generate 2**31-2 random numbers
**
** public domain by Ray Gardner
**
** based on "Random Number Generators: Good Ones Are Hard to Find",
** S.K. Park and K.W. Miller, Communications of the ACM 31:10 (Oct 1988),
** and "Two Fast Implementations of the 'Minimal Standard' Random
** Number Generator", David G. Carta, Comm. ACM 33, 1 (Jan 1990), p. 87-88
**
** linear congruential generator f(z) = 16807 z mod (2 ** 31 - 1)
**
** uses L. Schrage's method to avoid overflow problems
*/

#define a 16807 /* multiplier */
#define m 2147483647L /* 2**31 - 1 */
#define q 127773L /* m div a */
#define r 2836 /* m mod a */

long nextlongrand(long seed)
{
      unsigned long lo, hi;

      lo = a * (long)(seed & 0xFFFF);
      hi = a * (long)((unsigned long)seed >> 16);
      lo += (hi & 0x7FFF) << 16;
      if (lo > m)
      {
            lo &= m;
            ++lo;
      }
      lo += hi >> 15;
      if (lo > m)
      {
            lo &= m;
            ++lo;
      }
      return (long)lo;
}

static long randomnum = 1;

long longrand(void)      /* return next random long */
{
      randomnum = nextlongrand(randomnum);
      return randomnum;
}

. . .

There are several things of interest.  Firstly, the code was originally written to implement Schrage's method - as the comments and the constants q and r attest.  However these constants are not used, and the code has nothing to do with Schrage's method.

It references Carta's paper, and it does implement Carta's approach - but it took me a while to figure this out.

I used this code in 1995 and it has become part of Csound.  In 2005, when I wanted to implement this for the dsPIC, I realised I did not understand how it worked.  This led to the research which this page is based on.

There are two mod(0x7FFFFFFF) operations in Ray Gardner's code:
if (lo > m)
{
      lo &= m;
      ++lo;
}
These subtract 0x7FFFFFFF by zeroing bit 31 of the word (logical AND with m, which is 0x7FFFFFFF) and then incrementing.  It is perfectly allowable in ANSI C 32 bit integer math to subtract 0x7FFFFFFF, so why did Ray Gardner do it this way?  Perhaps he copied the code from Waldspurger and Weihl, by looking at the comments:
overflow:
sll $2, $2, 1 | zero bit 31 of S'
srl $2, $2, 1
addiu $2, 1 | increment S'
Ray Gardner's code follows Carta's suggestions, except that Gardner's approach to creating the two values p and q involves a bit of work - and Gardner does a mod(0x7FFFFFFF) operation on the q partial result before adding p to it.

Carta suggests that the 31 bit seed and the 15 bit constant:
          4444 4444 3333 3333 3322 2222 2222 1111 1111 11           
          7654 3210 9876 5432 1098 7654 3210 9876 5432 1098 7654 3210

seed                           sss ssss ssss ssss ssss ssss ssss ssss
Constant 16807 = 0x41A7                            100 0001 1010 0111

be multiplied to create a 46 bit result, split into two sections of 15 and 31 bits:
          4444 4444 3333 3333 3322 2222 2222 1111 1111 11           
          7654 3210 9876 5432 1098 7654 3210 9876 5432 1098 7654 3210
   q 31                        qqq qqqq qqqq qqqq qqqq qqqq qqqq qqqq
   p 15     pp pppp pppp pppp p
Rather than regard it as one 46 bit result to do mod(0x7FFFFFFF) on, he says we can simply add p to q:
 
          4444 4444 3333 3333 3322 2222 2222 1111 1111 11           
          7654 3210 9876 5432 1098 7654 3210 9876 5432 1098 7654 3210
   q 31                        qqq qqqq qqqq qqqq qqqq qqqq qqqq qqqq
   p 15                    +                       ppp pppp pppp pppp
                           =  Cxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx
and then do a mod(0x7FFFFFFF) to achieve the final result. 

Since the sum of p and q can't be more than twice 0x7FFFFFFF, the modulus operation involves either doing nothing or subtracting a single instance of 0x7FFFFFFF, whereas a generalised modulus operation needs to be able to subtract any number of instances.  It's easy to test whether the result is more than 0x7FFFFFFF - bit 31 will be set.  However, this does not test for the result equalling 0x7FFFFFFF - which the modulus operation should reduce to 0.

Theoretically, Ray Gardner's code should check for "greater than or equal", not for "greater than".  In my dsPIC implementation, I test for "greater than", by testing bit 31.  Neither of these approaches (Ray Gardner's C code and my dsPIC assembly code) does the modulus operation correctly, because they fail to handle the situation where the input value is equal to 0x7FFFFFFF.  In that instance the output value should be 0, but these pieces of code would leave the result as 0x7FFFFFFF.

I get away with it in the dsPIC code because this PRNG never has the internal result 0x7FFFFFFF - which is another way of saying it never has an output value of zero.  My C and C++ code really should use ">=" instead of ">" - I should update this, but it makes no difference to the results.

However, theoretically at least, Ray Gardner's code is not so secure, since it also does this bodgy (suspicious, probably wrong) mod(0x7FFFFFFF) on a partial result.  The fact that his code works shows either that this partial result never occurs, or that when it does, it does not matter that it is not reduced to 0.  If it does occur, it will be wrapped around by the second modulus operation, since the second value which is added will take it above 0x7FFFFFFF without taking it to 2 * 0x7FFFFFFF.

Ray Gardner's code assembles p and q as Carta suggests, but does so in several steps. First, two 32 bit words are created as follows.
lo (31 bits) =   16807 (15 bits) * seed low (16 bits)
lo = a * (long)(seed & 0xFFFF);
hi (30 bits) = 16807 (15 bits) * seed hi  (15 bits)
hi = a * (long)((unsigned long)seed >> 16);

Here is a map of the bits:
         4444 4444 3333 3333 3322 2222 2222 1111 1111 11           
         7654 3210 9876 5432 1098 7654 3210 9876 5432 1098 7654 3210

seed low bits 16                                 ssss ssss ssss ssss

seed hi bits  15              sss ssss ssss ssss
       
Constant 16807 = 0x41A7 15                        100 0001 1010 0111

lo 31                        0xxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx           
hi 30    00aa aaaa aaaa aaaa abbb bbbb bbbb bbbb

Then he adds the low 15 bits of "hi" into bits 30 to 16 of "lo".  Above, the bits which are added are marked "b".

This creates a partial result which may carry into bit 31.  He does a so-called "mod(0x7FFFFFFF)" operation on the result.

So we have a partial result in lo, having potentially removed one amount of 0x7FFFFFFF.  This is purely to keep the value within limits so that we don't get more than 2 *  0x7FFFFFFF when we add in the "a" bits in the next step.  But thinking about this a little more, such a condition could not happen.  In principle, instead of doing this:
          4444 4444 3333 3333 3322 2222 2222 1111 1111 11           
          7654 3210 9876 5432 1098 7654 3210 9876 5432 1098 7654 3210

lo 31                         0xxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx           
                         +     bbb bbbb bbbb bbbb
                         =    Cxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx
            
                         mod  0111 1111 1111 1111 1111 1111 1111 1111
                         =     xxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx

                         +                         aaa aaaa aaaa aaaa

we could swap the bits in hi around and add it all at once:
          4444 4444 3333 3333 3322 2222 2222 1111 1111 11           
          7654 3210 9876 5432 1098 7654 3210 9876 5432 1098 7654 3210

lo 31                         0xxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx           
                         +     bbb bbbb bbbb bbbb 0aaa aaaa aaaa aaaa

Alternatively we could add the b and the a bits separately, and then do mod(0x7FFFFFFF).

So there is no need for an intermediate "mod(0x7FFFFFFF)" operation!  My C and C++ implementations had this initially, since I was working from Ray Gardner's code, but as I wrote this doco, I recognised that the first operation wasn't needed - and when I commented the line out, the code still worked perfectly well!

The result is still not going to be more than 2 * 0x7FFFFFFF, so there will be no carry outside the 32 bit register, and our simple "mod(0x7FFFFFFF)" operation can be used to constrain the final result.  Unfortunately, as far as I know, there is no direct way of telling the C or C++ compiler to transform a 32 bit word like this:
    00aa aaaa aaaa aaaa abbb bbbb bbbb bbbb
to 
    0bbb bbbb bbbb bbbb 0aaa aaaa aaaa aaaa
It would be easy to create the word already shifted one to the left, by multiplying by a constant which is 2 * 16807.  (The MIPS code above does this.)

My C and C++ code adds the two sets of bits separately.

NS-2 1997

A C++ implementation of Park-Miller-Carta copyrighted 1997 is http://rp.lip6.fr/ns-doc/ns226-doc/html/rng_8cc-source.htm or  here and in other places and can be found by searching for:
"RNGImplementation" Carta
The report http://www.tkn.tu-berlin.de/publications/papers/ns_ak_final.pdf indicates it is part of an NS-2 and later Akaroa-2 program for stochastic simulation of network protocols.  The active code itself is simple:
long RNGImplementation::next()
{
      long L, H;
      L = A * (seed_ & 0xffff);
      H = A * (seed_ >> 16);

      seed_ = ((H & 0x7fff) << 16) + L;
      seed_ -= 0x7fffffff;
      seed_ += H >> 15;

      if (seed_ <= 0) {
            seed_ += 0x7fffffff;
      }
      return (seed_);
}
but a little odd.  0x7FFFFFFF is always subtracted after the first addition and then the final modulus operation is accomplished with an optional addition of 0x7FFFFFFF rather than the usual subtraction. 

The comments include an explanation which seems to be a transcription of Carta's explanation.  This was apparently derived from some Sparc code.
/* The sparc assembly code [no longer here] is based on Carta, D.G., "Two Fast
* Implementations of the 'Minimal Standard' Random Number
* Generator," CACM 33:1, Jan. 90, pp. 87-88.
*
* ASSUME that "the product of two [signed 32-bit] integers (a, sn)
* will occupy two [32-bit] registers (p, q)."
* Thus: a*s = (2^31)p + q
*
* From the observation that: x = y mod z is but
* x = z * the fraction part of (y/z),
* Let: sn = m * Frac(as/m)
*
* For m = 2^31 - 1,
* sn = (2^31 - 1) * Frac[as/(2^31 -1)]
* = (2^31 - 1) * Frac[as(2^-31 + 2^-2(31) + 2^-3(31) + . . .)]
* = (2^31 - 1) * Frac{[(2^31)p + q] [2^-31 + 2^-2(31) + 2^-3(31) + . . .]}
* = (2^31 - 1) * Frac[p+(p+q)2^-31+(p+q)2^-2(31)+(p+q)3^(-31)+ . . .]
*
* if p+q < 2^31:
* sn = (2^31 - 1) * Frac[p + a fraction + a fraction + a fraction + . . .]
* = (2^31 - 1) * [(p+q)2^-31 + (p+q)2^-2(31) + (p+q)3^(-31) + . . .]
* = p + q
*
* otherwise:
* sn = (2^31 - 1) * Frac[p + 1.frac . . .]
* = (2^31 - 1) * (-1 + 1.frac . . .)
* = (2^31 - 1) * [-1 + (p+q)2^-31 + (p+q)2^-2(31) + (p+q)3^(-31) + . . .]
* = p + q - 2^31 + 1
*/
This typifies what I perceive as a hand-me-down pattern.  Someone writes some code based on David Carta's algorithm, but probably got it from another piece of code which was written in the same manner . . .   Since Carta's paper is not publicly available (it requires access to hardcopy in a library or a library gateway or association membership to get it on the Net) it is hard for many people to read the original paper.

I tried Googling "From the observation that: x = y mod z is but" for the original Sparc code, and found some pages with more C code, such as:
http://wine.icu.ac.kr/lxr/http/source/tools/rng.cc?a=sparc
This is archaeology - trying to dig deeper to look back into the past.  The above file includes:
* File: RngStream.cc for multiple streams of Random Numbers
* Language: C++
* Copyright: Pierre L'Ecuyer, University of Montreal
* Notice: This code can be used freely for personnal, academic,
* or non-commercial purposes. For commercial purposes,
* please contact P. L'Ecuyer at: lecuyer@iro.umontreal.ca
* Date: 14 August 2001
*
* Incorporated into rng.cc and modified to maintain backward
* compatibility with ns-2.1b8.  Users can use their current scripts
* unmodified with the new RNG.  To get the same results as with the
* previous RNG, define OLD_RNG when compiling (e.g., make -D OLD_RNG).
* - Michele Weigle, University of North Carolina (mcweigle@cs.unc.edu)
* October 10, 2001

but this is in fact later - 2001 - another instance of someone copying code from someone else, without necessarily understanding or explaining how it works any better than whoever they copied from. 

Now let's Google RngStream.cc .  A paper leads me to Pierre L'Ecuyer's site:
http://www.iro.umontreal.ca/~lecuyer/myftp/streams00/
where can be found (in each of the C++, C and Java directories) a paper documenting portable code for long streams of pseudorandom numbers - but I think this is a red herring: there's no mention of 16807 or Carta in the code.  The paper:
An Object-Oriented Random-Number Package with Many Long Streams and Substreams
Pierre L'Ecuyer, Richard Simard, E. Jack Chen and W. David Kelton  December 2000
http://www.iro.umontreal.ca/~lecuyer/myftp/streams00/c++/streams4.pdf
does mention the Park Miller minimal standard PRNG and shows an instance of its use giving bad results in a simulation program.  The code itself uses a "combined multiple recursive generator (CMRG) MRG32k3a".

I don't have time to chase all these implementations.  I just observe that in many cases there is a hand-me-down pattern, and the resulting software is typically not optimal, well explained or properly referenced.

David Mosberger-Tang 2002

Googling "carta_random.S" turns up many copies of a subroutine written in IA64 (Intel Itanium) assembly language.  For instance:
http://natasha.stmarytx.edu/pub/linux/kernel/v2.4/linux/arch/ia64/lib/carta_random.S
Some discussion of the adoption of Carta's algorithm is here: http://lkml.org/lkml/2003/10/16/131.

The code is devoid of comments and I don't know anything about the Itanium, but I am surprised that 18 instructions are needed on this humongous 64 bit CPU, when I use about the same number of instructions on a 16 bit dsPIC.

* Quality of the Pseudo-Random Numbers

If we consider the relationship between one result and the next, there can be a very clear relationship between the two results: if the first result is below 127,773, then the second result will be exactly 16,807 times the first result.  On average, this only occurs in about 1 in 16,807 cases, which is unlikely to be a problem in DSP.  Also, for audio, we are usually thinking of bipolar signals, and converting the unipolar raw random number into a bipolar one would make any such relationship practically irrelevant.

The relationship between the first result and the next can be plotted on an X<>Y graph, and forms a sawtooth starting at 0<>0, rising almost to 0x7FFFFFFF over the first 127,773 X values, before starting near 0 again at 127,774<>13,971, because 13,971 = (127,774 * 16807) - 0x7FFFFFFF.

This sawtooth shape continues, for a total of 16,807 diagonal lines over the entire graph, which is 0x7FFFFFFF in both X and Y dimensions.  Dividing the raw numbers by 0x7FFFFFFF gives a 0 to 1.0 range, and plotting a small part of that graph's horizontal range, with the full vertical range, gives:

[The plot appeared here as an image.]

Note that only 1/1000 of the X range is shown.  There are 16.807 sawtooths here - 16,807 for the entire X range.

This is from Pierre L'Ecuyer's paper L88 referred to above.

These sorts of correlations may cause trouble for certain simulation applications, but if all you want is noise, this PRNG produces fine and dandy white noise.

I haven't looked in detail at the various ways of analysing PRNG output, such as spectral analysis and three-dimensional plots of correlations between three successive results.

Two papers which mention Park and Miller's "minimal standard" PRNG are:
A collection of selected pseudorandom number generators with linear structures - advanced version
Karl Entacher 2000
http://crypto.mat.sbg.ac.at/results/karl/server/server.html (June 2000 version.) The Park-Miller minimal standard PRNG is discussed here.
(1997 version http://random.mat.sbg.ac.at/ftp/pub/data/genacpc.ps  Convert to PDF link below.)

Bad Subsequences of Well-Known Linear Congruential Pseudorandom Number Generators
Karl Entacher
ACM Transactions on Modeling and Computer Simulation (TOMACS), Vol. 8, No. 1, January 1998, Pages 61–70.
This is available in PostScript (convert to PDF via this link below) at Karl Entacher's site: http://www.users.fh-sbg.ac.at/~entacher/htdocs/publications.html

This is something of a comparison of many PRNGs.  The Park Miller minimal standard PRNG is listed on page 63 as number 2: "LCG(2^31 - 1, 7^5 = 16807, 0, 1)".  The Park Miller "Minimal Standard" PRNG is not on the list of worst or best PRNGs.

Please see a following section on improving the quality of pseudo-random numbers - especially the section 7.1 of Numerical Recipes in C which discusses various quality issues.

* Source Code

OK - here it is!  All the code is in a zip file, but you can browse .txt versions of them directly.

Copyright public domain Robin Whittle 2005 - but please do not strip out the comments if you use this code.  It is a very difficult piece of code to understand - and is completely impossible to understand without decent documentation!
rand31-park-miller-carta.zip
The C code is for GCC and the C++ code is for G++.  It should work with any ANSI C compiler or modern C++ compiler.  Please report any problems and make suggestions for improvements.


rand31-park-miller-carta.cc.txt
C++ code with classes for both generators:
The program runs the two generators side-by-side and compares their outputs, displaying some of the results.  Grab the rand31pmc class for your own projects, but don't mess with those comments daddyo . . .

rand31-park-miller-fp.c.txt
rand31-park-miller-fp.h.txt
C source code for my slow, double-precision floating point implementation of the Park-Miller minimal standard PRNG, with integer and floating point output.
rand31-park-miller-carta-int.c.txt
rand31-park-miller-carta-int.h.txt
C source code for my implementation of Carta's algorithm for the Park-Miller minimal standard PRNG.  Fast, 32 bit integer code with no divisions.  Integer and floating point output.

rand31-park-miller-carta-c-test.c.txt
C test program which includes the two pairs of files above.  It has three modes of operation.  The first prints out some results from the Park-Miller-Carta generator.  The second runs one or the other generator for a full sequence from a seed of 1 - until it produces a result of 1 again.  This is intended as a crude way of checking the speed of the code with different compilers and CPUs.

rand31-park-miller-carta-dsPIC-asm.s.txt
Code for the dsPIC microcontroller.  I really worked on this!  The subroutine call overhead, including call and return, is 18 CPU clock cycles.  Since dsPICs can run at 20 or 30 MIPS, they can generate over a million results a second.  I think the dsPIC takes a lot of getting to know, but that it is a well-designed CPU with many carefully thought out instructions and I/O features.

Improving the Quality of Pseudo-Random Numbers

There's probably a limit to how much effort should be put into improving the quality of numbers from a fast PRNG.  As long as the output of the Mersenne Twister is held in such high regard, its speed and memory requirements set some kind of upper limit to such efforts.  Does anyone know how fast it is?

First, one must characterise the problems with a particular PRNG.  I haven't really studied this, but the papers cited here are a starting point for those who want to pursue it.

One problem is that there is some kind of relationship between results one, two or three apart in the sequence.  An approach to resolving this is "shuffling".  One apparently widely used implementation of this is described in a book: Numerical Recipes in C: The Art of Scientific Computing by William H. Press (whom Park and Miller cite).  Thanks to the US government Los Alamos National Laboratory  http://library.lanl.gov/numerical/ the entire text of all the books in the Numerical Recipes series is available for download!  The relevant section is intriguingly titled Uniform Deviates:
http://library.lanl.gov/numerical/bookcpdf/c7-1.pdf
It is also available as a sample chapter from the publisher and at http://www.library.cornell.edu/nr/bookcpdf/

The Bays-Durham shuffling algorithm described by William H. Press involves an array of 32 PRNG results.  These are presumably loaded into the array before normal operation begins.  One is chosen to be the shuffler result, and each cycle of operation after that uses the previous shuffler result to choose a location in the array to become the next shuffler result.  When that result is read from the array, a new PRNG result is written in its place.  This seems like a good way of breaking up short-term correlations in the results from linear congruential generators.
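Here is a minimal sketch of that shuffling scheme, written from the description above rather than copied from Numerical Recipes.  The function rand31_next() is an assumed stand-in for whichever raw 31 bit generator is in use - it is not defined here:

    #include <stdint.h>

    #define SHUFFLE_SIZE 32

    extern uint32_t rand31_next(void);   /* assumed raw 31 bit PRNG */

    static uint32_t shuffle_table[SHUFFLE_SIZE];
    static uint32_t shuffler;            /* previous output; selects the next slot */

    /* Fill the table and pick the first shuffler result before normal use. */
    void shuffle_init(void)
    {
        int i;
        for (i = 0; i < SHUFFLE_SIZE; i++)
            shuffle_table[i] = rand31_next();
        shuffler = rand31_next();
    }

    /* Each call: the previous output chooses a slot, the value stored there
       becomes the new output, and a fresh raw result refills that slot. */
    uint32_t shuffle_next(void)
    {
        int slot = (int)(shuffler % SHUFFLE_SIZE);
        shuffler = shuffle_table[slot];
        shuffle_table[slot] = rand31_next();
        return shuffler;
    }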

Another approach is to run two linear congruential generators in tandem, adding their results.  The best outcomes are achieved when the generators have different periods.  I wonder whether an XOR would be just as good.  The trap would be that two generators with different sequence lengths would presumably have different numbers of bits in each output word, or at least different ranges even if the output was 31 bits.  As long as they were both 31 bits, I imagine that XORing would be fine.  Adding, if the ranges were not the same, would be a pain, because restoring the sum to a similar range would involve a division and a multiplication.
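As a sketch of the XOR idea, assuming two separate 31 bit generators rand31a_next() and rand31b_next() already exist (they are not defined here, and no claim is made about how much the combination actually improves the statistics):

    #include <stdint.h>

    extern uint32_t rand31a_next(void);  /* assumed 31 bit generator A */
    extern uint32_t rand31b_next(void);  /* assumed 31 bit generator B */

    /* XOR keeps the combined result within 31 bits, avoiding the range
       rescaling (division and multiplication) which adding two
       differently-ranged outputs would require. */
    uint32_t rand31_xor_combined(void)
    {
        return rand31a_next() ^ rand31b_next();
    }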

I can imagine having a short list, such as 7 words, of fixed but reasonably random-looking bits, to XOR each PRNG result with.  Each result would be XORed with the next word in the list on a rotating basis.  That would seem to be a good way of breaking up any patterns in the values of the PRNG - in this case in the "value" dimension rather than in the "time" dimension, as with shuffling.
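A sketch of that idea - the seven constants below are arbitrary placeholders chosen only to look scrambled, not values from any published source:

    #include <stdint.h>

    #define SCRAMBLE_WORDS 7

    /* Fixed, reasonably random-looking 31 bit constants (placeholders). */
    static const uint32_t scramble_table[SCRAMBLE_WORDS] = {
        0x5A827999, 0x6ED9EBA1, 0x0F1BBCDC, 0x2A62C1D6,
        0x7B752929, 0x14292967, 0x3C6EF372
    };

    static unsigned scramble_index = 0;

    /* XOR each raw 31 bit result with the next table entry, rotating
       through the table.  The mask keeps the output to 31 bits. */
    uint32_t scramble31(uint32_t raw31)
    {
        uint32_t out = (raw31 ^ scramble_table[scramble_index]) & 0x7FFFFFFF;
        scramble_index = (scramble_index + 1) % SCRAMBLE_WORDS;
        return out;
    }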

A simple, fast way of breaking up patterns would be to rearrange bits within each PRNG result, such as swapping bits or nybbles, or reversing the bit order - though that is not the sort of thing CPUs are typically capable of in a single instruction.
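For example, something along these lines - untested for its effect on randomness quality, and note that with a 31 bit input the always-zero top bit gets moved into the low half, so the output may need masking back to 31 bits:

    #include <stdint.h>

    /* Swap the two 16 bit halves of a 32 bit word, then swap the nybbles
       within each byte.  Just an illustration of cheap bit-juggling. */
    uint32_t swap_halves_and_nybbles(uint32_t x)
    {
        x = (x >> 16) | (x << 16);
        x = ((x & 0xF0F0F0F0u) >> 4) | ((x & 0x0F0F0F0Fu) << 4);
        return x;
    }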

I think there are probably some fast ways of breaking up patterns in PRNG results.  It's a rather attractive proposition - after all the focus on exactitude and rigour, to spend some effort devising, in as few CPU cycles as possible, a means of making a nice scattered mess of a stream of bits which aspires to being uncorrelated, but hasn't yet completely attained that exalted state.

If there was a need for a really fast source of pseudo-random numbers, instead of subroutine calls, the code could be placed inline wherever a number was needed, and perhaps registers could be dedicated to storing the seed and any other bits needed for shuffling, scrambling or otherwise munging the raw results into the final output. 

While I think these scrambling algorithms could be tightly integrated with the linear congruential generator, before any such efforts were made, I think it would need to be shown that the raw results were in some way inadequate for the task at hand.  So far, the Park Miller "minimal standard" PRNG is fine for my uses in music, so I haven't pursued the matter further.

Extending the Carta Algorithm Towards 64 bits

I thought a little about making the Carta optimisation work with seed lengths approaching 64 bits - but have not pursued it.  The idea would be to use 32 x 32 = 64 bit multiplication, which exists on some modern CPUs, doing the two smaller multiplies (16 x 15 and 15 x 15 bits in the standard Carta approach) as 32 x 30 bits or similar, and combining the two results with shifts and addition.

Here is a paper in a relatively obscure journal, the Journal of the South Carolina Academy of Science http://faculty.uscupstate.edu/dkferris/SCAS/Journal.shtml (Google finds only 45 references to it. It is online only and I was the 500th visitor to its homepage in 6.5 months.)
Some Notes on Multiplicative Congruential Random Number Generators with Mersenne Prime Modulus 2^61 - 1
Dr James Harris 
Department of Mathematics and Computer Science, Georgia Southern University, Statesboro, GA 30460
Journal of the South Carolina Academy of Science, Vol. 1, No. 1, On-Line Edition, Fall 2003, pp. 28-31.
http://faculty.uscupstate.edu/dkferris/SCAS/vol1/Journal_V1.pdf 
This paper cites a 1949 reference for D. H. Lehmer's work:
Mathematical methods in large-scale computing units.
In Proceedings 2nd Symposium on Large-Scale Digital Calculating Machinery. Cambridge MA 141-146.

James Harris observes that since 1991, with the introduction of the MIPS R4000 CPU (chip photo, user manual), 64 bit CPUs have been commercially available.   He explores using a modulus constant larger than 2^31 - 1 which is also a Mersenne prime.  He states that the largest Mersenne prime under 2^64 is 2^61 - 1.  He proposes generating the multiplication product in two 64 bit registers p and q, where p contains the bits 63 and above.  He then presents a complex series of transformations similar to that in David Carta's paper, and arrives at the formulation that the next pseudo-random result s(n+1) can be found, without division, by:
if (4p + q) < (2^61 - 1),         then s(n+1) = (4p + q)
else, if (4p + q) >= (2^61 - 1),  then s(n+1) = (4p + q) - (2^61 - 1)
So this is similar to the Carta approach - with a factor of 4 for p.  (Presumably the 4 arises because p holds the bits at position 63 and above, and 2^63 mod (2^61 - 1) = 4, but I haven't tried to verify the details of Harris's derivation.)

I could try to show this in the simpler way I explain Carta's algorithm, but I am not sure that what Harris suggests is exactly correct.  I have a feeling that he has lost bits 61 and 62. 
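Rather than follow Harris's p and q formulation exactly, a more conventional division-free reduction uses the fact that 2^61 is congruent to 1 modulo 2^61 - 1, so the high bits of the product can simply be folded back in.  A minimal sketch, assuming a compiler with the unsigned __int128 extension (GCC and similar); the multiplier below is an arbitrary placeholder which has not been checked for full period:

    #include <stdint.h>

    #define MOD61   ((uint64_t)0x1FFFFFFFFFFFFFFF)    /* 2^61 - 1 */
    #define MULT61  ((uint64_t)437799614237992725)    /* placeholder multiplier */

    /* One Lehmer step modulo 2^61 - 1 with no division:  the product is
       split at bit 61 and the two parts are added, because
       2^61 mod (2^61 - 1) = 1.  A second fold and one conditional
       subtraction finish the reduction. */
    uint64_t lehmer61_next(uint64_t seed)
    {
        unsigned __int128 prod = (unsigned __int128)seed * MULT61;
        uint64_t result = (uint64_t)(prod & MOD61) + (uint64_t)(prod >> 61);

        result = (result & MOD61) + (result >> 61);
        if (result >= MOD61)
            result -= MOD61;
        return result;
    }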

I can't find details on the AMD 64 bit Opteron multiplication instructions, but I read somewhere it can do 64 x 64 = 128 bit multiplication in 5 CPU cycles. 

Before the algorithm can be devised precisely, the multiplier constant must be chosen.  This represents a formidable challenge - finding a number which produces the full sequence of 2^61 - 2 results.  James Harris writes:
However, as the size of the modulus in Lehmer generators increases, the number of possible multipliers increases, making the selection process for the "best" multiplier more difficult.  A method different than exhaustive analysis is needed.
This sounds like a fine research project!  But I am happy with 31 bit random numbers, and won't be pursuing longer sequences.

Musings on the Philosophy of Science

Most of us just want to get on with things and not fuss over the details of low-level processes.

If we can find a canned algorithm to do a task, we will generally grab it and use it without too much thought.  (I did this with Ray Gardner's code in 1997 or so - and it has been a part of Csound since then, without me understanding how it worked or chasing the references.)

The result may be that a sub-optimal algorithm can be widely used because no-one has tried to make it more efficient.

Part of the reason for this lack of improvement may be lack of need to make it efficient, but in this case there was a need.  With the vast numbers of pseudo-random numbers needed for some applications, and especially with the desire to use such algorithms on embedded, low-cost, CPUs, the question of efficiency is important.  Park and Miller recognised this and promoted the Schrage approach.

However, I observe a pattern, in general in life - not just science and computing - that when we make some progress on a previously formidable problem, cutting the solution down to something impressively efficient by comparison with the previous best solution, we may decide that this is the best which could be achieved.  (My initial dsPIC code took 24 CPU cycles and I figured I was at the limit when I got it down to 21. The next day I got it to 19.5 and figured this was really the best.  The next day I got it to 18.5 cycles!  After sleeping on it, I found a way of splitting the return paths and making it a little shorter for the most commonly taken path - this resulted in 18 cycles on average.)

A few people have a vision of doing neat things even more neatly than anyone else imagines is possible.  In this instance, David Carta came up with a way of greatly optimising the Park Miller minimal standard PRNG. 

Unfortunately, the academic press system failed him by allowing a false critique to appear without his having a right of reply.  Nonetheless, his algorithm has become widely used.  I also think the traditional academic press arrangements of locking up papers on controlled access sites, rather than making them public on the Net, greatly hamper people's ability to get their programming right, and to correctly reference their code.  I hope that this page will help clean up the very messy situation which has developed, where many implementations are just copies of someone else's implementation, with or without adequate references or explanations.  Often this results in sub-optimal performance, or in a programmer misunderstanding what is needed and programming the algorithm incorrectly.

I find David Carta's explanation for his algorithm very complex - but that is only because I now have a much simpler explanation. 

Perhaps, if I had fully understood his explanation when I first tried to understand why his algorithm worked, I would not have gone beyond his explanation and would not regard it as "complex" or "overly complex".   I might simply regard it as a perfectly direct explanation of something that really is that complex.

As this research project proceeded, I didn't fully accept his explanation because (due to my limited mathematical knowledge) I didn't fully understand the transformation between his equations 2 and 3.  (I later worked out what his transformation was and proved to myself that it was OK, but I still didn't know what it was called.)  I drew a block diagram, with bit-fields, for Ray Gardner's code and suspected there was something special about this simple shift and add trick - something to do with the number concerned being a large power of 2 minus 1.  This is how I figured out my simple explanation.

Perhaps if I hadn't been forced to figure it out, I would currently believe David Carta's explanation as being optimal.


This is the fascinating thing about complex questions and seemingly sophisticated answers: Maybe there's a different way of looking at the whole thing which is a lot more elegant, satisfying, and which gives greater insights into the underlying mechanisms. 

We often can't find a radically new approach by staring at the current approach. 

I don't think my simpler understanding of Carta's optimisation would have arisen from carefully scrutinising his explanation.

To make progress, we often need to look at the thing from a completely different perspective.  The paradigm we operate within often makes this impossible - so I understand that progress is often made by people who are young, or new to the scene, and so are unaware of, or keen to overturn, the currently dominant paradigm.  Thomas Kuhn wrote about this in The Structure of Scientific Revolutions.

The paradigm within which our observations and theories are framed typically makes it impossible to view the situation in a sufficiently different way to make real progress.  Maybe the current paradigm is optimal - maybe there is no other better way of conceptualising the situation.  But how can we be sure?
Footnote: My attempts to find a more elegant explanation for the cosmological redshift and heating of the solar corona are at:  http://astroneu.com . Cosmology is a much harder field than pseudo-random numbers!  I think neuroscientists have missed a relatively straightforward explanation for Restless Legs Syndrome, and plan to publish a new research project on this at http://aminotheory.com .


Links

Please suggest sites to link to.  If you have your own PRNG code, or any relevant discussions, which you can make public, I can link to it or keep copies here.

The dsPIC Discussion Forum
http://forum.microchip.com/tt.asp?forumid=153  I announced this page with this message and it led to some discussion.

The Music-DSP Mailing List
http://shoko.calarts.edu/musicdsp  High signal/noise ratio discussion, FAQs, DSP links etc. Please follow this link for general DSP links - I won't try to replicate them here.

Csound Mailing list
Csound is a well established language for music synthesis.  Since about 1997 it has used the Park-Miller-Carta algorithm for random numbers (I contributed a cut-and-paste of Ray Gardner's code, without trying to understand it or chase the references).  Other parts of Csound also use this algorithm, though implemented in different ways.  See the list archives around 2005 September 23 for a discussion of this, and of how particular changes to the C code cause the GCC compiler to produce better code for particular CPUs (later Pentiums etc.).  The main site for Csound is: http://www.csounds.com . The list archives are under the Community heading. Only one of the archives seems to be worth looking at: http://groups.yahoo.com/group/csound--maths.ex.ac.uk/  Unfortunately Yahoo Groups trims off leading spaces on lines, screwing up source code.

The Mersenne Twister
Makoto Matsumoto's highly regarded Mersenne Twister PRNG has a ridiculously long sequence length and is reasonably fast, though probably not quite as fast as a tightly coded linear congruential generator. http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/emt.html   I am not sure how applicable it would be to assembly code and 16 x 16 = 32 bit DSP systems.  On large CPU systems, like Pentiums, the Mersenne Twister's memory requirement - 624 x 32 bit words - is probably not a problem, but this is unlikely to be acceptable in embedded DSP systems.

C bit manipulation
A page by Sean Eron Anderson might be of interest - a page of C techniques for bit manipulation:  http://graphics.stanford.edu/~seander/bithacks.html 

Usenet newsgroups

http://groups.google.com/group/comp.dsp
http://groups.google.com/group/comp.lang.c.moderated
http://groups.google.com/group/comp.arch.embedded

Web and Usenet searches

http://scholar.google.com/advanced_scholar_search

http://www.google.com/search?as_q=Park+Miller+Carta+16807
(This finds a lot of relevant pages.)

http://www.google.com/search?q=Park+Miller+dsPIC+16807  << Nothing in September 2005.

http://groups.google.com/groups?q=dsPIC

http://groups.google.com/groups?q=Park+Miller+Carta+random+-house

Advanced Pseudo-Random Number Research
As previously mentioned, for really deep analysis of pseudo-random number generators, especially for simulation, these sites are essential reading:
University of Salzburg: http://random.mat.sbg.ac.at/
Pierre L'Ecuyer http://www.iro.umontreal.ca/~lecuyer/
Software for Spectral Testing of linear congruential PRNGs: http://random.mat.sbg.ac.at/results/karl/spectraltest/

Converting Postscript (.ps) files to PDFs
Free online conversions can be done at http://www.ps2pdf.com .  This uses the Ghostscript PostScript interpreter.

Adobe's commercial Acrobat (previously known as Acrobat Distiller) does the job nicely, but it is not free.  Adobe has a no-charge service for converting PostScript and many other file formats to PDFs: http://createpdf.adobe.com .

The Ghostscript PostScript interpreter combined with GSView can do the same job - and it is free, open-source software: http://www.cs.wisc.edu/~ghost/


Update History

