MultiCell histograms

Contents

MultiCell histograms#

Usually, one has to deal with n events to histogram by sorting each event into bins based on some variable x and by adding the value of a weight to the value stored in that bin.

However, there are also cases where one has to create multiple histograms for multiple independent weights that are all sorted into the same bin per event. Each event has one variable x and a list weights with a fixed number of k weights. A usage example could be to study a nominal weight and some systematic effects that manifest in variations of this weight without changing the the variable for selecting the bin (without changing x).

In such a case, one option would be to create k histograms (one for each weight) and sort each event into these k histograms (each time providing a different weight). The downsides of this approach is, that one has to do the sorting of the events into the bins k times whereas one time would be sufficient. One also has to create k histogram objects.

A more efficient approach would be to have only one histogram that stores a list of k weights per bin. This is achieved by using the boost_histogram.storage.MultiCell storage.

[1]:
import numpy as np

import boost_histogram as bh

A MultiCell histogram h has to be provided with k, the fixed number of cells (in this example the weights) per bin, at construction time:

[2]:
k = 10
h = bh.Histogram(bh.axis.Integer(0, 7), storage=bh.storage.MultiCell(k))

To fill the MultiCell histogram with n events, one has to provide a 2-dimensional array of weights with the shape (n, k) via the weight keyword of the fill() method:

[ ]:
n = 50
x = np.arange(n)
weights = np.arange(n * k).reshape(n, k)
h.fill(x, weight=weights)
Histogram(Integer(0, 7), storage=MultiCell(0)) # Sum: [210.0, 217.0, 224.0, 231.0, 238.0, 245.0, 252.0, 259.0, 266.0, 273.0] ([12250.0, 12300.0, 12350.0, 12400.0, 12450.0, 12500.0, 12550.0, 12600.0, 12650.0, 12700.0] with flow)

Now, each bin of h stores a list of k weights:

[31]:
h[1]
[31]:
[10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0]

One can interact with a MultiCell identical to any other histogram with a different storage type.

However, 2 important keypoints to remember:

  1. A view of a MultiCell histogram has the cells (weights) as first dimension. In our case, h.view() will have the shape (k, 7) with 7 being the number of bins from the bh.axis.Integer(0, 7) axis.:

[32]:
h.view()
[32]:
array([[ 0., 10., 20., 30., 40., 50., 60.],
       [ 1., 11., 21., 31., 41., 51., 61.],
       [ 2., 12., 22., 32., 42., 52., 62.],
       [ 3., 13., 23., 33., 43., 53., 63.],
       [ 4., 14., 24., 34., 44., 54., 64.],
       [ 5., 15., 25., 35., 45., 55., 65.],
       [ 6., 16., 26., 36., 46., 56., 66.],
       [ 7., 17., 27., 37., 47., 57., 67.],
       [ 8., 18., 28., 38., 48., 58., 68.],
       [ 9., 19., 29., 39., 49., 59., 69.]])

To get the histogram corresponding to the weight with index 3 one can use h.view()[3]

  1. MultiCell histograms do NOT track the variances of their cell contents (weights) (in contrast to histograms with the Weight storage). If required, one can track the variances by adding them as another weight to the histogram:

[ ]:
h = bh.Histogram(bh.axis.Integer(0, 5), storage=bh.storage.MultiCell(k + 1))
weights = np.zeros((n, k + 1))
weights[:, :-1] = np.arange(n * k).reshape(n, k)
weights[:, -1] = weights[:, 0] ** 2
h.fill(x, weight=weights)
Histogram(Integer(0, 5), storage=MultiCell(0)) # Sum: [100.0, 105.0, 110.0, 115.0, 120.0, 125.0, 130.0, 135.0, 140.0, 145.0, 3000.0] ([12250.0, 12300.0, 12350.0, 12400.0, 12450.0, 12500.0, 12550.0, 12600.0, 12650.0, 12700.0, 4042500.0] with flow)

Now, the weight with index k + 1 is the square of the weight with index 0 and therefore tracks the sum-of-weights-squared of the first weight.

Benchmarks#

Let’s estimate how much time one can save with MultiCell histograms in a ‘real world’ example in contrast to using multiple histograms with Weight and Double storage.

For a precicsion analysis we have to process 150 million events. We can fill 10000 events per iteration. Each event has 2000 different weights due to systematic variations. We also have some reweighting that increases the number of weights by a factor 10 to 20000 weights. The events are sorted based on two variables into a 8 x 8 histogram that uses variable axis.

We are only interested in the sum-of-weights-squared for the nominal weight (which would be 10 out of these 20000 in this case), therefore we will use histograms with a Double storage type as comparison to the MultiCell histogram.

[ ]:
n_total = 150_000_000
n_batch = 10_000
k = 20_000
axes = [
    bh.axis.Variable(np.linspace(-1, 1, 9)),
    bh.axis.Variable(np.linspace(-1, 1, 9)),
]
h_MultiCell = bh.Histogram(*axes, storage=bh.storage.MultiCell(k))
h_doubles = [bh.Histogram(*axes, storage=bh.storage.Double()) for i in range(k)]
[51]:
rng = np.random.default_rng()
variables = rng.normal(size=(2, n_batch))
weights = rng.normal(size=(n_batch, k))

Filling the Double type histograms:

[53]:
%%timeit
for index, hist in enumerate(h_doubles):
    hist.fill(variables[0], variables[1], weight=weights[:, index])
26.7 s ± 1.58 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

Filling the MultiCell histogram:

[ ]:
%%timeit
h_MultiCell.fill(variables[0], variables[1], weight=weights)
325 ms ± 60.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

On my test laptop, filling the k Double type histograms took 26.7 +- 1.6 s whereas filling the MultiCell histogram only took 325 +- 60 ms which is 82 times faster.

Therefore, using MultiCell histograms instead of multiple Double type histograms for filling all 150 million events would save us 110 hours (1.4 h for the MultiCell filling vs. 111 h for iterated Double type filling)