Thursday, October 20, 2016

Generating probability distributions from a given probability distributions


Here, we contemplate on generating a new uniform distribution given another uniform distribution. Specifically, for example, when we have a function rand7(), which generates random numbers from 1 to 7 (or 0 to 6), we want to compute rand5(), that generated random numbers from 1 to 5. Later, we look at the reverse problem - given rand5(), can we generated rand7().

First problem: Given rand7(), can we generate rand5() i.e. random numbers between 1 and 5?. This is easy - use rand7() to generate a random number, say x, and then output the result as y = x if x <=5. What if x = 6 or 7? Then, we simply repeat the experiment, i.e. we keep calling rand7() until we get a random number between 1 to 5. So that was simple? But, we still have to prove mathematically that this definition of rand5(), generates uniform random numbers i.e. the function outputs the numbers between 1 and 5 with probability 1/5 each. Let's try!

Let, A be the probability that rand5() outputs number 1, then

A=
Pr[rand5() outputs 1]
= Pr[y = 1 ]
=  Pr[ x= 1 ] + Pr[ x = 6 or 7] . Pr [ rand5() outputs 1]
= 1/7 + 2/7 * A

Thus,  A = 1/5. Similarly, Probability for other numbers 2 till 5 is same as 1/5 ! That was easy too! Right? Let's see a tough problem.

Now consider the reverse problem - Given rand5(), we want to implement rand7(). How do we go about it? Intuitively, we see that we need to use rand5() multiple times since we want to generate numbers 6 and 7 as well, which cannot be generated by a single call to rand5(). The idea is, since rand5() generates numbers till 5, we want a formula such that it produces the remaining numbers i.e. 6 and 7. For example, if we wanted to implement rand19() using rand5(), then this formula should generate all the numbers after 5 as well. We won't mind if this formula generates number beyond 19, since we now know, how to generate only 19 numbers randomly from a big random generator (see first problem!). Coming back to rand7(), consider the formula:

F = 5 * [rand5() - 1] + rand5()

This formula uses rand5() two times. We see that, it generates all the numbers from 1 to 25 only once. You can verify this. (Actually, this formula resembles, formula for Base-5 representation, and, in fact,  it draws inspirations from it). Hence this is rand25(), each number has probability of 1/25. Now, using rand25(), we can easily generate rand7(), by using first problem's solution i.e. use rand25() and output the number if it is less than equal to 7 or else call rand25() again. But, we will be little more smarter for this problem. This approach has a subtle issue that we will talk about later. But for now consider, new solution for generating rand7() from rand25():

Use rand25(), let y be the output number of rand25(). If y <= 21, then rand7() outputs number x = y mod 7. Or else, we repeat the call to rand25(). The idea is, 21 is multiple of 7 and hence when we do mod 7 to the number between 1 and 21, then we see that each number between 1 and 7 has occurred equally i.e. 3 times

Why is second idea more smarter than the first one? Because, if we implement rand7() using rand25() in computer program, then second solution is faster! The reason is, for first solution, the probability of getting a number between 1 and 7 using rand25() is only 7/25 whereas for the second solution, it is 21/25. Hence, on an expectation, second solution would produce output within 2 iterations whereas first solution would take more than 3 iterations!




No comments: