Normal Approximation to Binomial Distribution

A qualitative analysis.

Tianqi Tang
8 min read · Nov 23, 2020

Introduction and Motivation

The binomial distribution plays an important role in real-world applications. It measures the probability of obtaining k successes out of n i.i.d. trials, each trial having success rate p. The probability mass function (PMF) of the binomial distribution with success rate p, total number of trials n, and number of successes k is

$$f(k, n, p) = \binom{n}{k} p^k (1-p)^{n-k}$$

However, it may be difficult to use this formula directly, because it contains both very large and very small factors. Typically,

$$\binom{n}{k} \gg 1, \qquad p^k (1-p)^{n-k} \ll 1$$

These large (or small) quantities may cause overflow (or underflow) when computed directly, as the sketch below illustrates. It would be more convenient to reformulate the formula in a way that avoids this issue. Is that possible? Maybe we can observe an example of this distribution to get some ideas.
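To make the concern concrete, here is a minimal sketch of the naive evaluation breaking down, along with the usual log-space workaround (assuming Python with SciPy available; n = 3000 and k = 900 are just illustrative values):

```python
import math

from scipy.stats import binom

n, p, k = 3000, 0.3, 900

# Naive evaluation: p**k underflows to 0.0 long before k reaches 900,
# and float(math.comb(n, k)) would overflow (the coefficient has ~800 digits).
print(p ** k)  # 0.0

# Stable evaluation: accumulate logarithms, exponentiate once at the end.
log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
           + k * math.log(p) + (n - k) * math.log(1 - p))
print(math.exp(log_pmf))  # ~0.0159

# Library routines such as scipy.stats.binom.pmf work in log space internally.
print(binom.pmf(k, n, p))
```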

Given the success rate p of the i.i.d. trials and the total number of trials n, the above equation gives the relationship between k and f(k, n, p). Below is an example of the binomial PMF with n=3000 and p=0.3.

The above figure looks very similar to a normal distribution, which makes intuitive sense. The peak of the distribution should correspond to k=np. Furthermore, k is likely to deviate somewhat around np, because we may well get a few more or fewer successes than expected. However, it is unlikely that k deviates too far from np. All of these intuitions can be seen in the graph. Therefore, one may ask: is it possible to use a normal distribution to approximate the binomial distribution? It turns out the answer is yes. In this project, we will study the relationship between binomial distributions and their corresponding normal approximations.
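For reproducibility, a short sketch (assuming numpy, scipy, and matplotlib) that generates a figure like the one above:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom

n, p = 3000, 0.3
k = np.arange(n + 1)

plt.plot(k, binom.pmf(k, n, p))
plt.xlabel("k")
plt.ylabel("f(k, n, p)")
plt.title(f"Binomial PMF, n={n}, p={p}")
plt.show()
```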

Derivation

We begin by presenting the derivation from this paper[1]. Here, I will only summarize the main steps, skipping much of the detailed derivation.
Before we start, it is worth noting an approximation for factorials called Stirling’s formula[2]:

$$n! \approx \sqrt{2\pi n}\, \left(\frac{n}{e}\right)^n$$

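As a quick sanity check (a minimal Python sketch), the relative error of Stirling’s formula shrinks roughly like 1/(12n):

```python
import math

def stirling(n):
    # Stirling's approximation: n! ≈ sqrt(2*pi*n) * (n / e)**n
    return math.sqrt(2 * math.pi * n) * (n / math.e) ** n

for n in (5, 10, 50, 100):
    rel_err = (math.factorial(n) - stirling(n)) / math.factorial(n)
    print(n, rel_err)  # ~1/(12n): 0.017, 0.0083, 0.0017, 0.00083
```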
Equipped with this formula, we can proceed with the approximation. Applying Stirling’s formula to the three factorials in the PMF gives

$$f(k, n, p) \approx \sqrt{\frac{n}{2\pi k (n-k)}}\, \left(\frac{np}{k}\right)^k \left(\frac{n(1-p)}{n-k}\right)^{n-k}$$

Defining δ=k−np, we have k=δ+np and n−k=n(1−p)−δ. Expanding

$$\ln\left[\left(\frac{np}{k}\right)^k \left(\frac{n(1-p)}{n-k}\right)^{n-k}\right] = -(np+\delta)\ln\!\left(1+\frac{\delta}{np}\right) - \big(n(1-p)-\delta\big)\ln\!\left(1-\frac{\delta}{n(1-p)}\right)$$

using the Taylor series ln(1+x) = x − x²/2 + x³/3 − ⋯,
we have

$$\ln\left[\left(\frac{np}{k}\right)^k \left(\frac{n(1-p)}{n-k}\right)^{n-k}\right] = -\frac{\delta^2}{2np(1-p)} + \frac{\delta^3\left[(1-p)^2 - p^2\right]}{6n^2p^2(1-p)^2} + O\!\left(\frac{\delta^4}{n^3}\right)$$

Exponentiating the result, we have

$$\left(\frac{np}{k}\right)^k \left(\frac{n(1-p)}{n-k}\right)^{n-k} \approx \exp\!\left(-\frac{\delta^2}{2np(1-p)}\right) \exp\!\left(\frac{\delta^3\left[(1-p)^2 - p^2\right]}{6n^2p^2(1-p)^2}\right)$$

Because δ is just the difference between the observed k and its expected value np, δ should be on the scale of a few standard deviations, i.e. of order √(np(1−p)). Therefore, we argue that δ∼√n, and hence

$$\frac{\delta^2}{2np(1-p)} \sim O(1), \qquad \frac{\delta^3}{n^2} \sim O\!\left(\frac{1}{\sqrt{n}}\right)$$

As n gets large, the quantity

$$\frac{\delta^3\left[(1-p)^2 - p^2\right]}{6n^2p^2(1-p)^2} = O\!\left(\frac{1}{\sqrt{n}}\right)$$

gets small.
Using

$$e^x \approx 1 + x \quad \text{for small } x,$$

the equation above becomes

$$\left(\frac{np}{k}\right)^k \left(\frac{n(1-p)}{n-k}\right)^{n-k} \approx e^{-\delta^2 / (2np(1-p))} \left(1 + O\!\left(\frac{1}{\sqrt{n}}\right)\right)$$

For the square root part, we have

$$\sqrt{\frac{n}{2\pi k(n-k)}} = \sqrt{\frac{n}{2\pi (np+\delta)\big(n(1-p)-\delta\big)}} = \frac{1}{\sqrt{2\pi np(1-p)}}\left(1 + O\!\left(\frac{1}{\sqrt{n}}\right)\right)$$

Multiplying them together, we obtain

$$f(k, n, p) \approx \frac{1}{\sqrt{2\pi np(1-p)}}\, e^{-\delta^2 / (2np(1-p))} \left(1 + O\!\left(\frac{1}{\sqrt{n}}\right)\right)$$

We know that for a binomial distribution f(k) with fixed n and p, the mean of k is μ=np, and its variance is σ²=np(1−p). Therefore, we can rewrite the formula above as

$$f(k, n, p) \approx \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(k-\mu)^2 / (2\sigma^2)} \left(1 + O\!\left(\frac{1}{\sqrt{n}}\right)\right)$$

This is exactly a normal distribution with mean μ and variance σ² with a correction term.

As n→∞, the term

$$O\!\left(\frac{1}{\sqrt{n}}\right)$$
becomes a small value compared to 1.

There’s a caveat here. We arrived at this result by assuming that δ∼√n, which means this formula is applicable only when k is within a few standard deviations of np. It may fail for k that deviates significantly. This makes intuitive sense: after all, the domain of f(k) is [0,n], whereas the domain of a normal distribution is (−∞,∞). For k smaller than 0 or larger than n, our approximation returns a positive value whereas the true binomial formula returns 0, so the ratio between them is infinite. This is not a concern, because our approximation deliberately disregards those exceptional cases, where δ∼n rather than δ∼√n.
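This can be seen numerically. Here is a small sketch (illustrative values; `normal_pdf` is just a helper defined here) comparing the exact PMF and the normal density at the peak and deep in the tail:

```python
import numpy as np
from scipy.stats import binom

n, p = 3000, 0.3
mu, sigma = n * p, np.sqrt(n * p * (1 - p))

def normal_pdf(k):
    return np.exp(-(k - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

for k in (int(mu), int(mu + 2 * sigma), int(mu + 10 * sigma)):
    f = binom.pmf(k, n, p)
    print(k, f, normal_pdf(k), f / normal_pdf(k))
# Near mu the ratio is ~1; around 10 sigma away it drifts noticeably from 1.
```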

We can now define the approximation g(k) of f(k) by neglecting the O(1/√n) term:

$$g(k) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(k-\mu)^2 / (2\sigma^2)}$$

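In code, g(k) is straightforward to implement (a minimal sketch, assuming SciPy for the exact PMF):

```python
import numpy as np
from scipy.stats import binom

def g(k, n, p):
    # Normal approximation: mean mu = n*p, variance sigma^2 = n*p*(1-p)
    mu, var = n * p, n * p * (1 - p)
    return np.exp(-(k - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Spot check at the peak for n=3000, p=0.3:
print(g(900, 3000, 0.3), binom.pmf(900, 3000, 0.3))  # both ~0.0159
```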
Approximation Error Visualizations

As previously discussed, our approximation works well when n is large and k is within a few standard deviations of μ. We can visualize both the approximation and the exact function with varying n. Here are some examples. The x-axis limits are determined by μ±5σ for all of the plots. The blue and red curves represent the approximations g(k) and the exact functions f(k), respectively.

As n gets larger, the two curves become more and more similar. At n=400, they are essentially indistinguishable. However, we may still be interested in seeing how the differences decrease with increasing n. Defining d(k)=g(k)−f(k), we can visualize d(k).

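The panels above can be reproduced with a short sketch (assuming matplotlib; p = 0.3 and the n values are chosen to mirror the figures):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom

def g(k, n, p):
    mu, var = n * p, n * p * (1 - p)
    return np.exp(-(k - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

p = 0.3
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, n in zip(axes, (100, 400, 1000)):
    mu, sigma = n * p, np.sqrt(n * p * (1 - p))
    k = np.arange(max(0, int(mu - 5 * sigma)), int(mu + 5 * sigma) + 1)
    ax.plot(k, g(k, n, p) - binom.pmf(k, n, p))  # d(k) = g(k) - f(k)
    ax.set_title(f"n={n}")
    ax.set_xlabel("k")
plt.tight_layout()
plt.show()
```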
We can see that as n increases, the magnitude of d(k) becomes smaller, but the pattern gets wider. Moreover, its general structure is more or less preserved. Comparing n=400 and n=1000: if we scale the y-coordinates of the n=400 plot down and scale its x-coordinates up, we get something very similar to the n=1000 plot. In other words, ignoring the x and y scales, the n=400 plot and the n=1000 plot look pretty much identical.

We should be aware that the correction is itself a function of k, because otherwise d(k) would be a constant function. Furthermore, the observations above suggest that on the x-axis, the pattern induced by the correction widens in proportion to σ. For example, if σ doubles, the distance between the two local minima around the center peak of d(k) also doubles; that is, the plot becomes twice as wide. For a normal distribution, distances from μ at a fixed percentile scale linearly with σ, and since σ scales with √n, doubling n widens the pattern by a factor of √2. This implies that when n is large, the contribution from the term

$$\frac{\delta^3\left[(1-p)^2 - p^2\right]}{6n^2p^2(1-p)^2}$$

dominates those from the other correction terms, as a function of the form

$$y\!\left(\frac{x}{\sqrt{n}}\right)$$

is √n times as wide as y(x), and any disproportionality of the graphs caused by other terms, if present, is visually negligible.

We may also visualize the correction term itself by defining a new function r(k) such that

$$f(k) = g(k)\,\big(1 + r(k)\big), \qquad \text{i.e.} \quad r(k) = \frac{f(k)}{g(k)} - 1$$

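In code, r(k) can be evaluated directly from this ratio (a sketch, with the same illustrative p and n values as before):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom

def g(k, n, p):
    mu, var = n * p, n * p * (1 - p)
    return np.exp(-(k - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

p = 0.3
for n in (100, 400, 1000):
    mu, sigma = n * p, np.sqrt(n * p * (1 - p))
    k = np.arange(max(0, int(mu - 5 * sigma)), int(mu + 5 * sigma) + 1)
    plt.plot(k, binom.pmf(k, n, p) / g(k, n, p) - 1, label=f"n={n}")  # r(k)
plt.xlabel("k")
plt.ylabel("r(k)")
plt.legend()
plt.show()
```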
Here’s the visualization.

The plots above confirm the claim that for large n and k not too far from μ, the correction term is small and the approximation is therefore very accurate.

An interesting observation about r(k) is that its local maxima do not seem to be at μ, but rather at some distance away from μ. We can zoom in around k=μ to see more details.

It looks like at k=μ, the correction r(k) is negative, making g(μ)>f(μ). As k deviates from μ, the correction grows and eventually becomes positive, so that g(k)<f(k). Then the correction starts decreasing again, which makes g(k)>f(k) once more.

Scale-wise, while we already know that r(k) widens roughly as √n, its amplitude decreases roughly as

$$\frac{1}{n}$$

Comparing n=100 and n=1000, r(μ) for n=1000 is roughly 10 times smaller than that for n=100.
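This scaling can be checked numerically (a small sketch; μ = np happens to be an integer for these choices of n and p):

```python
import numpy as np
from scipy.stats import binom

def r_at_mu(n, p):
    mu, var = n * p, n * p * (1 - p)
    g_mu = 1 / np.sqrt(2 * np.pi * var)   # g(k) at k = mu
    return binom.pmf(int(mu), n, p) / g_mu - 1

print(r_at_mu(100, 0.3))                       # a small negative number
print(r_at_mu(100, 0.3) / r_at_mu(1000, 0.3))  # roughly 10, consistent with 1/n
```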

Application: 1D Random Walk

A 1D random walk can be represented as a series of steps on the real number line, starting from an initial position, usually 0. Each step has a probability p of moving to the right by 1 and a probability 1−p of moving to the left by 1. We may be interested in knowing: after n steps, what is the probability that the walker ends at a signed displacement s from the initial position 0? We can represent s as

$$s = k - (n - k) = 2k - n$$

where k is the number of steps moving to the right.

Because, given n and p, there is a bijective mapping between s and k, we can simply apply the formula derived before, assuming that n is large:

$$P(s) = f\!\left(\frac{n+s}{2},\, n,\, p\right) \approx \frac{1}{\sqrt{2\pi\sigma^2}}\, \exp\!\left(-\frac{\left(\frac{n+s}{2} - \mu\right)^2}{2\sigma^2}\right)$$

Note that s has the same parity as n, so P(s) is nonzero only for every other integer.

Here is a visualization of 1000 random walks each having n=10000 and p=0.5.

The displacement of each random walk at the end is recorded, and we can visualize how the displacements are distributed. A prediction using the (scaled) P(s=2k−n) formula is also plotted for comparison.

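For completeness, here is a compact sketch of the whole experiment (assuming numpy and matplotlib; the RNG seed and bin count are arbitrary choices):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n_walks, n_steps, p = 1000, 10000, 0.5

# Each step is +1 with probability p and -1 otherwise.
steps = np.where(rng.random((n_walks, n_steps)) < p, 1, -1)
final_s = steps.sum(axis=1)  # final displacement s = 2k - n

# Normal prediction for s: mean 2np - n, standard deviation 2*sqrt(np(1-p)).
mean_s = 2 * n_steps * p - n_steps
std_s = 2 * np.sqrt(n_steps * p * (1 - p))
s = np.linspace(final_s.min(), final_s.max(), 400)
pdf = np.exp(-(s - mean_s) ** 2 / (2 * std_s ** 2)) / np.sqrt(2 * np.pi * std_s ** 2)

# A density histogram is directly comparable to the continuous density.
plt.hist(final_s, bins=50, density=True, alpha=0.6, label="observed")
plt.plot(s, pdf, "r", label="normal prediction")
plt.xlabel("final displacement s")
plt.legend()
plt.show()
```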
Our prediction of the displacement distribution using the normal approximation seems to line up well with the actual observations.

Conclusion

We have constructed a normal approximation to the binomial distribution. The error analysis indicates that the approximation error tends to be insignificant when n is large and k is not too far from μ. More specifically, the correction stays “in sync” with the approximation, in the sense that the graphs corresponding to different n’s are roughly proportional. Moreover, the amplitude of the correction decreases roughly as 1/n.

Generalizations

We may generalize our discussion of binomial distributions to multinomial distributions, using the same technique. In fact, the generalization has already been discussed[3]. For a multinomial PMF

$$f(\mathbf{k}, n, \mathbf{p}) = \frac{n!}{k_1!\, k_2! \cdots k_m!}\, p_1^{k_1} p_2^{k_2} \cdots p_m^{k_m}$$

such that

$$\sum_{i=1}^{m} k_i = n, \qquad \sum_{i=1}^{m} p_i = 1$$

for variables

$$k_1, k_2, \dots, k_m$$

where the n trials are i.i.d. and each trial lands in category i with probability pᵢ, we can approximate the multinomial PMF as

$$g(\mathbf{k}) = \frac{1}{\sqrt{(2\pi n)^{m-1}\, p_1 p_2 \cdots p_m}}\, \exp\!\left(-\frac{1}{2n}\, \boldsymbol{\delta}^{\mathsf{T}} P^{-1} \boldsymbol{\delta}\right)$$

where

$$\boldsymbol{\delta} = \mathbf{k} - n\mathbf{p}$$

and P is a diagonal matrix with diagonal entries being the elements of p.
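The multivariate formula can be sanity-checked numerically as well (a sketch with arbitrary example values; SciPy provides the exact multinomial PMF):

```python
import numpy as np
from scipy.stats import multinomial

n = 2000
p = np.array([0.2, 0.3, 0.5])
k = np.array([410, 590, 1000])  # a point near the mean n*p = (400, 600, 1000)
m = len(p)

exact = multinomial.pmf(k, n, p)

# g(k) = exp(-delta^T P^{-1} delta / (2n)) / sqrt((2*pi*n)^(m-1) * prod(p))
delta = k - n * p
approx = (np.exp(-0.5 * np.sum(delta ** 2 / p) / n)
          / np.sqrt((2 * np.pi * n) ** (m - 1) * np.prod(p)))

print(exact, approx)  # should agree to a few significant figures
```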

References

[1] The Normal Approximation to the Binomial Distribution. http://scipp.ucsc.edu/~haber/ph116C/NormalApprox.pdf

[2] Stirling’s approximation. Wikipedia. https://en.wikipedia.org/wiki/Stirling%27s_approximation

[3] Geyer, C. Stat 5101 Notes: Brand Name Distributions. http://www.stat.umn.edu/geyer/5102/notes/brand.pdf
