Probability Binning

Probability binning is the algorithm used for partitioning data for the Chi Square test.

The Probability Binning (PB) algorithm divides the distribution into a number of bins of equal height, meaning that each bin contains the same number of events. This differs from standard binning, which selects bins of equal width. The number of events falling into each bin is then compared between a test and a control sample, and a Chi Squared-like comparison is performed on the counts. Equal-count bins carry equal weight when used to estimate the probability of a difference between the test and control histograms (i.e., the binning minimizes the maximum expected variance across the bins).
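The equal-count binning described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes the bin boundaries are derived from the control sample and then applied to the test sample, and the function names `probability_bin_edges` and `bin_counts` and the sample values are hypothetical.

```python
from bisect import bisect_right

def probability_bin_edges(control, n_bins):
    """Return n_bins - 1 interior cut points so that each bin holds
    roughly the same number of control events (assumed convention)."""
    ordered = sorted(control)
    n = len(ordered)
    # Cut point k sits at the k/n_bins quantile of the control sample.
    return [ordered[(k * n) // n_bins] for k in range(1, n_bins)]

def bin_counts(sample, edges):
    """Count how many events of `sample` fall into each bin defined by `edges`."""
    counts = [0] * (len(edges) + 1)
    for x in sample:
        # bisect_right gives the number of cut points <= x, i.e. the bin index.
        counts[bisect_right(edges, x)] += 1
    return counts

# Made-up one-dimensional measurements for illustration only.
control = [0.1, 0.4, 0.5, 0.9, 1.3, 1.7, 2.2, 2.8, 3.5, 4.1, 5.0, 6.4]
test = [0.2, 0.3, 2.5, 2.9, 3.1, 3.3, 3.6, 3.8, 4.4, 4.9, 5.5, 7.0]

edges = probability_bin_edges(control, 4)
control_counts = bin_counts(control, edges)  # equal by construction
test_counts = bin_counts(test, edges)        # uneven if the test sample differs
```

With these bins, the control counts are equal by construction, so any unevenness in the test counts reflects a difference between the two distributions.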

The PB algorithm calculates a Chi Squared (χ²) value based on differences in the number of events found in each corresponding bin between the test and control samples. This χ² value ranges from 0 to 2, with 0 indicating no difference and 2 indicating the maximum difference. However, the χ² value alone does not allow several samples to be ranked according to how different they are. The T(χ) metric was therefore developed in order to rank different samples.
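The text does not give the formula for this statistic. As a hedged sketch, one form that is consistent with the stated 0 to 2 range compares the fraction of each sample falling in every bin: the sum over bins of (c - t)² / (c + t), where c and t are the control and test fractions. The formula and the function name `chi_squared_like` are assumptions made here for illustration.

```python
def chi_squared_like(control_counts, test_counts):
    """Assumed normalized Chi Squared-like statistic on per-bin fractions.

    Returns 0.0 for identical distributions and 2.0 when the two
    samples share no occupied bin.
    """
    n_control = sum(control_counts)
    n_test = sum(test_counts)
    total = 0.0
    for c, t in zip(control_counts, test_counts):
        fc = c / n_control  # fraction of control events in this bin
        ft = t / n_test     # fraction of test events in this bin
        if fc + ft > 0:     # skip bins that are empty in both samples
            total += (fc - ft) ** 2 / (fc + ft)
    return total

same = chi_squared_like([3, 3, 3, 3], [3, 3, 3, 3])
disjoint = chi_squared_like([5, 5, 0, 0], [0, 0, 5, 5])
partial = chi_squared_like([3, 3, 3, 3], [2, 0, 6, 4])
```

Because the statistic works on fractions rather than raw counts, samples with different numbers of events can be compared directly.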

As the number of dimensions used in the Chi Squared test increases, the number of bins per dimension used for Probability Binning should be decreased. This is due to the “curse of dimensionality”: as you divide data into bins in more and more dimensions (squares in 2 dimensions, cubes in 3 dimensions, and hypercubes thereafter), the number of regions created increases exponentially while the number of data points remains the same. The point where most bins are empty, and thus most differences are trivial (1 event in a bin versus none), is reached quickly. This is mitigated by dividing each dimension into fewer bins.
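The arithmetic behind this can be made concrete with a short sketch. The event count and bin counts below are arbitrary example values, and `events_per_region` is a hypothetical helper name.

```python
def events_per_region(n_events, bins_per_dim, n_dims):
    """Return (number of regions, average events per region)."""
    regions = bins_per_dim ** n_dims  # grows exponentially with n_dims
    return regions, n_events / regions

n_events = 10_000  # example sample size, held fixed

# 10 bins per dimension: average occupancy collapses as dimensions are added.
wide = [events_per_region(n_events, 10, d) for d in range(1, 5)]

# 4 bins per dimension: far fewer regions, so each keeps a usable count.
narrow = [events_per_region(n_events, 4, d) for d in range(1, 5)]

for d, ((r10, avg10), (r4, avg4)) in enumerate(zip(wide, narrow), start=1):
    print(f"{d}D: 10 bins/dim -> {r10} regions, {avg10:.2f} events each; "
          f"4 bins/dim -> {r4} regions, {avg4:.2f} events each")
```

At 10 bins per dimension, four dimensions already leave an average of one event per region, whereas 4 bins per dimension still leaves about 39 events per region for the same sample.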