Aghast: aggregated, histogram-like statistics, sharable as Flatbuffers.

aghast

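Aghast is a histogramming library that does not fill histograms and does not plot them. Its role is behind the scenes: to provide better communication between histogramming libraries.

Specifically, it is a structured representation of aggregated, histogram-like statistics as sharable "ghasts." It has all of the "bells and whistles" often associated with plain histograms, such as number of entries, unbinned mean and standard deviation, bin errors, associated fit functions, profile plots, and even simple ntuples (needed for unbinned fits or machine learning applications). ROOT has all of these features; Numpy has none of them.

The purpose of aghast is to be an intermediate when converting ROOT histograms into Numpy, or vice versa, or both of these into Boost.Histogram, Physt, Pandas, etc. Without an intermediate representation, converting between N libraries (to get the advantages of all) would require N(N - 1)/2 conversion routines; with an intermediate representation, we only need N, and the mapping of feature to feature can be made explicit in terms of a common language.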
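The counting argument can be checked in a few lines. This is only an illustration of the arithmetic; the function names are hypothetical and are not part of aghast.

```python
def pairwise_converters(n):
    """Conversion routines needed when every pair of libraries is linked directly."""
    return n * (n - 1) // 2


def hub_converters(n):
    """Connectors needed when every library talks to one intermediate format."""
    return n


for n in (3, 5, 10):
    print(n, pairwise_converters(n), hub_converters(n))
```

The two counts are equal at N = 3, and the intermediate format needs fewer routines for every larger N.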

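Furthermore, aghast is a Flatbuffers schema, so it can be deciphered in many languages, with lazy, random access, and it uses a small amount of memory. A collection of histograms, functions, and ntuples can be shared among processes as shared memory, used in remote procedure calls, processed incrementally in a memory-mapped file, or saved in files with future-proof schema evolution.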

Installation from packages

Install aghast like any other Python package:

pip install aghast                        # maybe with sudo or --user, or in virtualenv

(Not on conda yet.)

Manual installation

After you git-clone this GitHub repository and ensure that numpy is installed, somehow:

pip install "flatbuffers>=1.8.0"          # for the flatbuffers runtime (with Numpy)
cd python                                 # only implementation so far is in Python
python setup.py install                   # to use it outside of this directory

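Now you should be able to import aghast or from aghast import * in Python.

If you need to change flatbuffers/aghast.fbs, you will additionally need to:

Get flatc to generate Python sources from flatbuffers/aghast.fbs. I use conda install -c conda-forge flatbuffers. (The flatc executable is not included in the pip flatbuffers package, and the Python runtime is not included in the conda flatbuffers package. They are disjoint.)
In the python directory, run ./generate_flatbuffers.py (which calls flatc and does some post-processing).

Every time you change flatbuffers/aghast.fbs, re-run ./generate_flatbuffers.py.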

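Documentation

Full specification

Tutorial examples

Run this tutorial on Binder.

Conversions

The main purpose of aghast is to move aggregated, histogram-like statistics (called "ghasts") from one framework to the next. This requires a conversion of high-level domain concepts.

Consider the following example: in Numpy, a histogram is simply a 2-tuple of arrays with special meaning: bin contents, then bin edges.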

import numpy

numpy_hist = numpy.histogram(numpy.random.normal(0, 1, int(10e6)), bins=80, range=(-5, 5))
numpy_hist

(array([     2,      5,      9,     15,     29,     49,     80,    104,
           237,    352,    555,    867,   1447,   2046,   3037,   4562,
          6805,   9540,  13529,  18584,  25593,  35000,  46024,  59103,
         76492,  96441, 119873, 146159, 177533, 210628, 246316, 283292,
        321377, 359314, 393857, 426446, 453031, 474806, 489846, 496646,
        497922, 490499, 473200, 453527, 425650, 393297, 358537, 321099,
        282519, 246469, 211181, 177550, 147417, 120322,  96592,  76665,
         59587,  45776,  34459,  25900,  18876,  13576,   9571,   6662,
          4629,   3161,   2069,   1334,    878,    581,    332,    220,
           135,     65,     39,     26,     19,     15,      4,      4]),
 array([-5.   , -4.875, -4.75 , -4.625, -4.5  , -4.375, -4.25 , -4.125,
        -4.   , -3.875, -3.75 , -3.625, -3.5  , -3.375, -3.25 , -3.125,
        -3.   , -2.875, -2.75 , -2.625, -2.5  , -2.375, -2.25 , -2.125,
        -2.   , -1.875, -1.75 , -1.625, -1.5  , -1.375, -1.25 , -1.125,
        -1.   , -0.875, -0.75 , -0.625, -0.5  , -0.375, -0.25 , -0.125,
         0.   ,  0.125,  0.25 ,  0.375,  0.5  ,  0.625,  0.75 ,  0.875,
         1.   ,  1.125,  1.25 ,  1.375,  1.5  ,  1.625,  1.75 ,  1.875,
         2.   ,  2.125,  2.25 ,  2.375,  2.5  ,  2.625,  2.75 ,  2.875,
         3.   ,  3.125,  3.25 ,  3.375,  3.5  ,  3.625,  3.75 ,  3.875,
         4.   ,  4.125,  4.25 ,  4.375,  4.5  ,  4.625,  4.75 ,  4.875,
         5.   ]))

We convert that into the aghast equivalent (a "ghast") with a connector, which is a pair of functions: from_numpy and to_numpy.

import aghast

ghastly_hist = aghast.from_numpy(numpy_hist)
ghastly_hist

<Histogram at 0x7f0dc88a9b38>

This object is instantiated from a class structure built from simple pieces.

ghastly_hist.dump()

Histogram(
  axis=[
    Axis(binning=RegularBinning(num=80, interval=RealInterval(low=-5.0, high=5.0)))
  ],
  counts=
    UnweightedCounts(
      counts=
        InterpretedInlineInt64Buffer(
          buffer=
              [     2      5      9     15     29     49     80    104    237    352
                  555    867   1447   2046   3037   4562   6805   9540  13529  18584
                25593  35000  46024  59103  76492  96441 119873 146159 177533 210628
               246316 283292 321377 359314 393857 426446 453031 474806 489846 496646
               497922 490499 473200 453527 425650 393297 358537 321099 282519 246469
               211181 177550 147417 120322  96592  76665  59587  45776  34459  25900
                18876  13576   9571   6662   4629   3161   2069   1334    878    581
                  332    220    135     65     39     26     19     15      4      4])))
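The dump stores only num, low and high for the axis, whereas the Numpy tuple carried 81 explicit edges. A minimal sketch of how the edges can be recovered from those three numbers follows; regular_edges is a hypothetical helper written for this illustration, not an aghast function.

```python
def regular_edges(num, low, high):
    """Return the num + 1 bin edges of a regular binning on [low, high]."""
    width = (high - low) / num
    return [low + i * width for i in range(num + 1)]


edges = regular_edges(80, -5.0, 5.0)
print(len(edges), edges[:3], edges[-1])
```

With 80 bins on [-5, 5] the width is 0.125, so the list starts -5.0, -4.875, -4.75, matching the Numpy edge array shown earlier.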

Now it can be converted to a ROOT histogram with another connector.

root_hist = aghast.to_root(ghastly_hist, "root_hist")
root_hist

<ROOT.TH1D object ("root_hist") at 0x55555e208ef0>

import ROOT
canvas = ROOT.TCanvas()
root_hist.Draw()
canvas.Draw()

[Figure: root_hist drawn on the ROOT canvas]

And to Pandas with yet another connector.

pandas_hist = aghast.to_pandas(ghastly_hist)
pandas_hist

	unweighted
[-5.0, -4.875)	2
[-4.875, -4.75)	5
[-4.75, -4.625)	9
[-4.625, -4.5)	15
[-4.5, -4.375)	29
[-4.375, -4.25)	49
[-4.25, -4.125)	80
[-4.125, -4.0)	104
[-4.0, -3.875)	237
[-3.875, -3.75)	352
[-3.75, -3.625)	555
[-3.625, -3.5)	867
[-3.5, -3.375)	1447
[-3.375, -3.25)	2046
[-3.25, -3.125)	3037
[-3.125, -3.0)	4562
[-3.0, -2.875)	6805
[-2.875, -2.75)	9540
[-2.75, -2.625)	13529
[-2.625, -2.5)	18584
[-2.5, -2.375)	25593
[-2.375, -2.25)	35000
[-2.25, -2.125)	46024
[-2.125, -2.0)	59103
[-2.0, -1.875)	76492
[-1.875, -1.75)	96441
[-1.75, -1.625)	119873
[-1.625, -1.5)	146159
[-1.5, -1.375)	177533
[-1.375, -1.25)	210628
...	...
[1.25, 1.375)	211181
[1.375, 1.5)	177550
[1.5, 1.625)	147417
[1.625, 1.75)	120322
[1.75, 1.875)	96592
[1.875, 2.0)	76665
[2.0, 2.125)	59587
[2.125, 2.25)	45776
[2.25, 2.375)	34459
[2.375, 2.5)	25900
[2.5, 2.625)	18876
[2.625, 2.75)	13576
[2.75, 2.875)	9571
[2.875, 3.0)	6662
[3.0, 3.125)	4629
[3.125, 3.25)	3161
[3.25, 3.375)	2069
[3.375, 3.5)	1334
[3.5, 3.625)	878
[3.625, 3.75)	581
[3.75, 3.875)	332
[3.875, 4.0)	220
[4.0, 4.125)	135
[4.125, 4.25)	65
[4.25, 4.375)	39
[4.375, 4.5)	26
[4.5, 4.625)	19
[4.625, 4.75)	15
[4.75, 4.875)	4
[4.875, 5.0)	4

80 rows × 1 columns

Serialization

A ghast is also a Flatbuffers object, which has a multi-lingual, random-access, small-footprint serialization.

ghastly_hist.tobuffer()

bytearray("\x04\x00\x00\x00\x90\xff\xff\xff\x10\x00\x00\x00\x00\x01\n\x00\x10\x00\x0c\x00\x0b\x00\x04
           \x00\n\x00\x00\x00`\x00\x00\x00\x00\x00\x00\x01\x04\x00\x00\x00\x01\x00\x00\x00\x0c\x00\x00
           \x00\x08\x00\x0c\x00\x0b\x00\x04\x00\x08\x00\x00\x00\x10\x00\x00\x00\x00\x00\x00\x02\x08\x00
           (\x00\x1c\x00\x04\x00\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x14\xc0\x00\x00\x00\x00\x00
           \x00\x14@\x01\x00\x00\x00\x00\x00\x00\x00P\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x08
           \x00\n\x00\t\x00\x04\x00\x08\x00\x00\x00\x0c\x00\x00\x00\x00\x02\x06\x00\x08\x00\x04\x00\x06
           \x00\x00\x00\x04\x00\x00\x00\x80\x02\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x05\x00\x00\x00
           \x00\x00\x00\x00\t\x00\x00\x00\x00\x00\x00\x00\x0f\x00\x00\x00\x00\x00\x00\x00\x1d\x00\x00
           \x00\x00\x00\x00\x001\x00\x00\x00\x00\x00\x00\x00P\x00\x00\x00\x00\x00\x00\x00h\x00\x00\x00
           \x00\x00\x00\x00\xed\x00\x00\x00\x00\x00\x00\x00`\x01\x00\x00\x00\x00\x00\x00+\x02\x00\x00
           \x00\x00\x00\x00c\x03\x00\x00\x00\x00\x00\x00\xa7\x05\x00\x00\x00\x00\x00\x00\xfe\x07\x00
           \x00\x00\x00\x00\x00\xdd\x0b\x00\x00\x00\x00\x00\x00\xd2\x11\x00\x00\x00\x00\x00\x00\x95\x1a
           \x00\x00\x00\x00\x00\x00D%\x00\x00\x00\x00\x00\x00\xd94\x00\x00\x00\x00\x00\x00\x98H\x00\x00
           \x00\x00\x00\x00\xf9c\x00\x00\x00\x00\x00\x00\xb8\x88\x00\x00\x00\x00\x00\x00\xc8\xb3\x00\x00
           \x00\x00\x00\x00\xdf\xe6\x00\x00\x00\x00\x00\x00\xcc*\x01\x00\x00\x00\x00\x00\xb9x\x01\x00
           \x00\x00\x00\x00A\xd4\x01\x00\x00\x00\x00\x00\xef:\x02\x00\x00\x00\x00\x00}\xb5\x02\x00\x00
           \x00\x00\x00\xc46\x03\x00\x00\x00\x00\x00,\xc2\x03\x00\x00\x00\x00\x00\x9cR\x04\x00\x00\x00
           \x00\x00a\xe7\x04\x00\x00\x00\x00\x00\x92{\x05\x00\x00\x00\x00\x00\x81\x02\x06\x00\x00\x00
           \x00\x00\xce\x81\x06\x00\x00\x00\x00\x00\xa7\xe9\x06\x00\x00\x00\x00\x00\xb6>\x07\x00\x00
           \x00\x00\x00vy\x07\x00\x00\x00\x00\x00\x06\x94\x07\x00\x00\x00\x00\x00\x02\x99\x07\x00\x00
           \x00\x00\x00\x03|\x07\x00\x00\x00\x00\x00p8\x07\x00\x00\x00\x00\x00\x97\xeb\x06\x00\x00\x00
           \x00\x00\xb2~\x06\x00\x00\x00\x00\x00Q\x00\x06\x00\x00\x00\x00\x00\x89x\x05\x00\x00\x00\x00
           \x00K\xe6\x04\x00\x00\x00\x00\x00\x97O\x04\x00\x00\x00\x00\x00\xc5\xc2\x03\x00\x00\x00\x00
           \x00\xed8\x03\x00\x00\x00\x00\x00\x8e\xb5\x02\x00\x00\x00\x00\x00\xd9?\x02\x00\x00\x00\x00
           \x00\x02\xd6\x01\x00\x00\x00\x00\x00Py\x01\x00\x00\x00\x00\x00y+\x01\x00\x00\x00\x00\x00\xc3
           \xe8\x00\x00\x00\x00\x00\x00\xd0\xb2\x00\x00\x00\x00\x00\x00\x9b\x86\x00\x00\x00\x00\x00\x00
           ,e\x00\x00\x00\x00\x00\x00\xbcI\x00\x00\x00\x00\x00\x00\x085\x00\x00\x00\x00\x00\x00c%\x00
           \x00\x00\x00\x00\x00\x06\x1a\x00\x00\x00\x00\x00\x00\x15\x12\x00\x00\x00\x00\x00\x00Y\x0c
           \x00\x00\x00\x00\x00\x00\x15\x08\x00\x00\x00\x00\x00\x006\x05\x00\x00\x00\x00\x00\x00n\x03
           \x00\x00\x00\x00\x00\x00E\x02\x00\x00\x00\x00\x00\x00L\x01\x00\x00\x00\x00\x00\x00\xdc\x00
           \x00\x00\x00\x00\x00\x00\x87\x00\x00\x00\x00\x00\x00\x00A\x00\x00\x00\x00\x00\x00\x00\'\x00
           \x00\x00\x00\x00\x00\x00\x1a\x00\x00\x00\x00\x00\x00\x00\x13\x00\x00\x00\x00\x00\x00\x00\x0f
           \x00\x00\x00\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00")
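Parts of this buffer can be read by eye. The sketch below decodes a few byte fragments copied by hand from the dump, using only Python's struct module. Where these fields sit inside the buffer is decided by the Flatbuffers layout, which is not reproduced here, and reading the fragment that starts with P as an int64 is an assumption that happens to match num=80.

```python
import struct

# Fragments copied by hand from the dump above; "<" selects little-endian.
low_bytes = b"\x00\x00\x00\x00\x00\x00\x14\xc0"     # float64
high_bytes = b"\x00\x00\x00\x00\x00\x00\x14@"       # float64
num_bytes = b"P\x00\x00\x00\x00\x00\x00\x00"        # assumed int64
counts_bytes = (b"\x02\x00\x00\x00\x00\x00\x00\x00"
                b"\x05\x00\x00\x00\x00\x00\x00\x00"
                b"\t\x00\x00\x00\x00\x00\x00\x00")  # first three int64 counts

low, = struct.unpack("<d", low_bytes)
high, = struct.unpack("<d", high_bytes)
num, = struct.unpack("<q", num_bytes)
first_counts = struct.unpack("<3q", counts_bytes)

print(low, high, num, first_counts)
```

The decoded values are the interval limits, the number of bins and the first three bin contents of ghastly_hist, which shows that the counts are stored as plain 8-byte integers rather than in a compressed form.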

print("Numpy size: ", numpy_hist[0].nbytes + numpy_hist[1].nbytes)

tmessage = ROOT.TMessage()
tmessage.WriteObject(root_hist)
print("ROOT size:  ", tmessage.Length())

import pickle
print("Pandas size:", len(pickle.dumps(pandas_hist)))

print("Aghast size: ", len(ghastly_hist.tobuffer()))

Numpy size:  1288
ROOT size:   1962
Pandas size: 2984
Aghast size:  792
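The Numpy figure can be reproduced by hand, which also shows how much of the aghast buffer is payload. This is plain arithmetic, assuming 8 bytes per value (int64 counts and float64 edges), which is consistent with the 1288 bytes printed above.

```python
num_bins = 80
itemsize = 8  # bytes per int64 count or float64 edge (assumed)

numpy_size = num_bins * itemsize + (num_bins + 1) * itemsize  # 80 counts + 81 edges
counts_payload = num_bins * itemsize                          # the counts alone
aghast_overhead = 792 - counts_payload                        # everything that is not counts

print(numpy_size, counts_payload, aghast_overhead)
```

The ghast is smaller than the Numpy tuple mainly because a regular axis is described by three numbers instead of 81 stored edges.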

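Aghast is generally foreseen as a memory format, like Apache Arrow, but for statistical aggregations. Like Arrow, it reduces the need to implement N(N - 1)/2 conversion functions among N statistical libraries to just N conversion functions. (See the figure on Arrow's website.)

Translation of conventions

Aghast also intends to be as close to zero-copy as possible. This means that it must make graceful translations among conventions. Different histogramming libraries handle overflow bins in different ways.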

fromroot = aghast.from_root(root_hist)
fromroot.axis[0].binning.dump()
print("Bin contents length:", len(fromroot.counts.array))

RegularBinning(
  num=80,
  interval=RealInterval(low=-5.0, high=5.0),
  overflow=RealOverflow(loc_underflow=BinLocation.below1, loc_overflow=BinLocation.above1))
Bin contents length: 82

ghastly_hist.axis[0].binning.dump()
print("Bin contents length:", len(ghastly_hist.counts.array))

RegularBinning(num=80, interval=RealInterval(low=-5.0, high=5.0))
Bin contents length: 80

And yet we want to be able to manipulate them as though these differences did not exist.

sum_hist = fromroot + ghastly_hist

sum_hist.axis[0].binning.dump()
print("Bin contents length:", len(sum_hist.counts.array))

RegularBinning(
  num=80,
  interval=RealInterval(low=-5.0, high=5.0),
  overflow=RealOverflow(loc_underflow=BinLocation.above1, loc_overflow=BinLocation.above2))
Bin contents length: 82

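The binning structure keeps track of the existence of underflow/overflow bins and where they are located.

ROOT's convention is to put underflow before the normal bins (below1) and overflow after (above1), so that the normal bins are effectively 1-indexed.
Boost.Histogram's convention is to put overflow after the normal bins (above1) and underflow after that (above2), so that underflow is accessed via myhist[-1] in Numpy.
Numpy histograms don't have underflow/overflow bins.
Pandas could have Intervals that extend to infinity.

Aghast accepts all of these, so that it doesn't have to manipulate the bin contents buffer it receives, but it knows how to deal with them if it has to combine histograms that follow different conventions.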
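To make the bookkeeping concrete, here is a minimal sketch of how a one-dimensional buffer could be split into underflow, normal bins and overflow once the two locations are known. It is not aghast's implementation: split_flows and the string location names are hypothetical, and None stands for a bin that does not exist.

```python
ORDER = {"below2": -2, "below1": -1, "above1": 1, "above2": 2}


def split_flows(buffer, num, loc_underflow=None, loc_overflow=None):
    """Return (underflow, normal_bins, overflow) from a 1D buffer."""
    flows = {"underflow": loc_underflow, "overflow": loc_overflow}
    present = sorted((ORDER[loc], name) for name, loc in flows.items() if loc is not None)
    below = [name for rank, name in present if rank < 0]
    above = [name for rank, name in present if rank > 0]
    if len(buffer) != len(below) + num + len(above):
        raise ValueError("buffer length does not match the binning")
    found = {"underflow": None, "overflow": None}
    for i, name in enumerate(below):           # extra bins stored before the normal bins
        found[name] = buffer[i]
    for i, name in enumerate(above):           # extra bins stored after the normal bins
        found[name] = buffer[len(below) + num + i]
    normal = list(buffer[len(below):len(below) + num])
    return found["underflow"], normal, found["overflow"]


root_style = split_flows([7, 1, 2, 3, 9], 3, "below1", "above1")
boost_style = split_flows([1, 2, 3, 9, 7], 3, "above2", "above1")
numpy_style = split_flows([1, 2, 3], 3)
print(root_style, boost_style, numpy_style)
```

The first two calls describe the same histogram stored under two conventions and return identical results, which is what lets histograms be added without first rewriting either buffer.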

Binning types

All the different axis types have an equivalent in aghast (and not all are one-dimensional).

aghast.IntegerBinning(5, 10).dump()
aghast.RegularBinning(100, aghast.RealInterval(-5, 5)).dump()
aghast.HexagonalBinning(0, 100, 0, 100, aghast.HexagonalBinning.cube_xy).dump()
aghast.EdgesBinning([0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100]).dump()
aghast.IrregularBinning([aghast.RealInterval(0, 5),
                         aghast.RealInterval(10, 100),
                         aghast.RealInterval(-10, 10)],
                       overlapping_fill=aghast.IrregularBinning.all).dump()
aghast.CategoryBinning(["one", "two", "three"]).dump()
aghast.SparseRegularBinning([5, 3, -2, 8, -100], 10).dump()
aghast.FractionBinning(error_method=aghast.FractionBinning.clopper_pearson).dump()
aghast.PredicateBinning(["signal region", "control region"]).dump()
aghast.VariationBinning([aghast.Variation([aghast.Assignment("x", "nominal")]),
                         aghast.Variation([aghast.Assignment("x", "nominal + sigma")]),
                         aghast.Variation([aghast.Assignment("x", "nominal - sigma")])]).dump()

IntegerBinning(min=5, max=10)
RegularBinning(num=100, interval=RealInterval(low=-5.0, high=5.0))
HexagonalBinning(qmin=0, qmax=100, rmin=0, rmax=100, coordinates=HexagonalBinning.cube_xy)
EdgesBinning(edges=[0.01 0.05 0.1 0.5 1 5 10 50 100])
IrregularBinning(
  intervals=[
    RealInterval(low=0.0, high=5.0),
    RealInterval(low=10.0, high=100.0),
    RealInterval(low=-10.0, high=10.0)
  ],
  overlapping_fill=IrregularBinning.all)
CategoryBinning(categories=['one', 'two', 'three'])
SparseRegularBinning(bins=[5 3 -2 8 -100], bin_width=10.0)
FractionBinning(error_method=FractionBinning.clopper_pearson)
PredicateBinning(predicates=['signal region', 'control region'])
VariationBinning(
  variations=[
    Variation(assignments=[
        Assignment(identifier='x', expression='nominal')
      ]),
    Variation(
      assignments=[
        Assignment(identifier='x', expression='nominal + sigma')
      ]),
    Variation(
      assignments=[
        Assignment(identifier='x', expression='nominal - sigma')
      ])
  ])

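The meanings of these binning classes are given in the specification, but many of them can be converted into one another, and converting to CategoryBinning (strings) often makes the intent clear.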

aghast.IntegerBinning(5, 10).toCategoryBinning().dump()
aghast.RegularBinning(10, aghast.RealInterval(-5, 5)).toCategoryBinning().dump()
aghast.EdgesBinning([0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100]).toCategoryBinning().dump()
aghast.IrregularBinning([aghast.RealInterval(0, 5),
                         aghast.RealInterval(10, 100),
                         aghast.RealInterval(-10, 10)],
                       overlapping_fill=aghast.IrregularBinning.all).toCategoryBinning().dump()
aghast.SparseRegularBinning([5, 3, -2, 8, -100], 10).toCategoryBinning().dump()
aghast.FractionBinning(error_method=aghast.FractionBinning.clopper_pearson).toCategoryBinning().dump()
aghast.PredicateBinning(["signal region", "control region"]).toCategoryBinning().dump()
aghast.VariationBinning([aghast.Variation([aghast.Assignment("x", "nominal")]),
                         aghast.Variation([aghast.Assignment("x", "nominal + sigma")]),
                         aghast.Variation([aghast.Assignment("x", "nominal - sigma")])]
                        ).toCategoryBinning().dump()

CategoryBinning(categories=['5', '6', '7', '8', '9', '10'])
CategoryBinning(
  categories=['[-5, -4)', '[-4, -3)', '[-3, -2)', '[-2, -1)', '[-1, 0)', '[0, 1)', '[1, 2)', '[2, 3)',
              '[3, 4)', '[4, 5)'])
CategoryBinning(
  categories=['[0.01, 0.05)', '[0.05, 0.1)', '[0.1, 0.5)', '[0.5, 1)', '[1, 5)', '[5, 10)', '[10, 50)',
              '[50, 100)'])
CategoryBinning(categories=['[0, 5)', '[10, 100)', '[-10, 10)'])
CategoryBinning(categories=['[50, 60)', '[30, 40)', '[-20, -10)', '[80, 90)', '[-1000, -990)'])
CategoryBinning(categories=['pass', 'all'])
CategoryBinning(categories=['signal region', 'control region'])
CategoryBinning(categories=['x := nominal', 'x := nominal + sigma', 'x := nominal - sigma'])

This technique can also clear up confusion about overflow bins.

aghast.RegularBinning(5, aghast.RealInterval(-5, 5), aghast.RealOverflow(
    loc_underflow=aghast.BinLocation.above2,
    loc_overflow=aghast.BinLocation.above1,
    loc_nanflow=aghast.BinLocation.below1
    )).toCategoryBinning().dump()

CategoryBinning(
  categories=['{nan}', '[-5, -3)', '[-3, -1)', '[-1, 1)', '[1, 3)', '[3, 5)', '[5, +inf]',
              '[-inf, -5)'])
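The interval labels in these outputs follow directly from num, low and high. The sketch below rebuilds them for a regular binning without overflow bins; regular_labels is a hypothetical helper, and the "%g" formatting is an assumption chosen to mimic the printed strings, not necessarily how aghast formats them.

```python
def regular_labels(num, low, high):
    """Return one half-open interval label per bin of a regular binning."""
    width = (high - low) / num
    labels = []
    for i in range(num):
        lo = low + i * width
        hi = low + (i + 1) * width
        labels.append("[%g, %g)" % (lo, hi))
    return labels


print(regular_labels(10, -5, 5))
print(regular_labels(5, -5, 5))
```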

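Fancy binning types

You might also be wondering about FractionBinning, PredicateBinning, and VariationBinning.

FractionBinning is an axis of two bins: #passing and #total, #failing and #total, or #passing and #failing. Adding it to another axis effectively makes an "efficiency plot."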

h = aghast.Histogram([aghast.Axis(aghast.FractionBinning()),
                      aghast.Axis(aghast.RegularBinning(10, aghast.RealInterval(-5, 5)))],
                    aghast.UnweightedCounts(
                        aghast.InterpretedInlineBuffer.fromarray(
                            numpy.array([[  9,  25,  29,  35,  54,  67,  60,  84,  80,  94],
                                         [ 99, 119, 109, 109,  95, 104, 102, 106, 112, 122]]))))
df = aghast.to_pandas(h)
df

		unweighted
pass	[-5.0, -4.0)	9
	[-4.0, -3.0)	25
	[-3.0, -2.0)	29
	[-2.0, -1.0)	35
	[-1.0, 0.0)	54
	[0.0, 1.0)	67
	[1.0, 2.0)	60
	[2.0, 3.0)	84
	[3.0, 4.0)	80
	[4.0, 5.0)	94
all	[-5.0, -4.0)	99
	[-4.0, -3.0)	119
	[-3.0, -2.0)	109
	[-2.0, -1.0)	109
	[-1.0, 0.0)	95
	[0.0, 1.0)	104
	[1.0, 2.0)	102
	[2.0, 3.0)	106
	[3.0, 4.0)	112
	[4.0, 5.0)	122

df = df.unstack(level=0)
df

	unweighted
	all	pass
[-5.0, -4.0)	99	9
[-4.0, -3.0)	119	25
[-3.0, -2.0)	109	29
[-2.0, -1.0)	109	35
[-1.0, 0.0)	95	54
[0.0, 1.0)	104	67
[1.0, 2.0)	102	60
[2.0, 3.0)	106	84
[3.0, 4.0)	112	80
[4.0, 5.0)	122	94

df["unweighted", "pass"] / df["unweighted", "all"]

[-5.0, -4.0)    0.090909
[-4.0, -3.0)    0.210084
[-3.0, -2.0)    0.266055
[-2.0, -1.0)    0.321101
[-1.0, 0.0)     0.568421
[0.0, 1.0)      0.644231
[1.0, 2.0)      0.588235
[2.0, 3.0)      0.792453
[3.0, 4.0)      0.714286
[4.0, 5.0)      0.770492
dtype: float64
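The same ratios can be computed without Pandas, and a rough uncertainty can be attached to each one. The error below is the simple normal-approximation binomial error, used here only because it fits in one line; it is not the Clopper-Pearson interval named by error_method in the binning listing earlier.

```python
import math

passing = [9, 25, 29, 35, 54, 67, 60, 84, 80, 94]
total = [99, 119, 109, 109, 95, 104, 102, 106, 112, 122]

efficiency = [p / t for p, t in zip(passing, total)]

# Normal-approximation binomial error: sqrt(eff * (1 - eff) / total).
errors = [math.sqrt(e * (1.0 - e) / t) for e, t in zip(efficiency, total)]

for e, err in zip(efficiency, errors):
    print("%.6f +/- %.6f" % (e, err))
```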

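PredicateBinning means that each bin represents a predicate (if-then rule) in the filling procedure. Aghast doesn't have a filling procedure, but filling libraries can use this to encode relationships among histograms that a fitting library can take advantage of, for combined signal-control region fits, for instance. It's possible for those regions to overlap: an input datum might satisfy more than one predicate, and overlapping_fill determines which bin or bins were chosen: first, last, or all.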
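Since aghast itself does not fill, the following is only a sketch of what a filling library might do for the three overlapping_fill choices. fill_predicates, the lambda predicates and the sample data are all hypothetical.

```python
def fill_predicates(data, predicates, overlapping_fill="all"):
    """Count how many data points land in each predicate bin."""
    if overlapping_fill not in ("first", "last", "all"):
        raise ValueError("overlapping_fill must be 'first', 'last' or 'all'")
    counts = [0] * len(predicates)
    for x in data:
        matches = [i for i, (name, accepts) in enumerate(predicates) if accepts(x)]
        if overlapping_fill == "first":
            matches = matches[:1]     # only the first predicate that is satisfied
        elif overlapping_fill == "last":
            matches = matches[-1:]    # only the last predicate that is satisfied
        for i in matches:
            counts[i] += 1
    return counts


predicates = [("signal region", lambda x: x > 0),
              ("control region", lambda x: x < 2)]
data = [-1.0, 0.5, 1.5, 3.0]   # 0.5 and 1.5 satisfy both predicates

fill_all = fill_predicates(data, predicates, "all")
fill_first = fill_predicates(data, predicates, "first")
fill_last = fill_predicates(data, predicates, "last")
print(fill_all, fill_first, fill_last)
```

Only with "all" can the bin contents sum to more than the number of input points, which is information a fitter needs in order to treat the regions correctly.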

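VariationBinning means that each bin represents a variation of one of the parameters used to calculate the fill variables. This is used to determine sensitivity to systematic effects, by varying them and re-filling. In this kind of binning, the same input datum enters every bin.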

xdata = numpy.random.normal(0, 1, int(1e6))
sigma = numpy.random.uniform(-0.1, 0.8, int(1e6))

h = aghast.Histogram([aghast.Axis(aghast.VariationBinning([
                         aghast.Variation([aghast.Assignment("x", "nominal")]),
                         aghast.Variation([aghast.Assignment("x", "nominal + sigma")])])),
                     aghast.Axis(aghast.RegularBinning(10, aghast.RealInterval(-5, 5)))],
                    aghast.UnweightedCounts(
                        aghast.InterpretedInlineBuffer.fromarray(
                            numpy.concatenate([
                                numpy.histogram(xdata, bins=10, range=(-5, 5))[0],
                                numpy.histogram(xdata + sigma, bins=10, range=(-5, 5))[0]]))))
df = aghast.to_pandas(h)
df

		unweighted
x := nominal	[-5.0, -4.0)	31
	[-4.0, -3.0)	1309
	[-3.0, -2.0)	21624
	[-2.0, -1.0)	135279
	[-1.0, 0.0)	341683
	[0.0, 1.0)	341761
	[1.0, 2.0)	135675
	[2.0, 3.0)	21334
	[3.0, 4.0)	1273
	[4.0, 5.0)	31
x := nominal + sigma	[-5.0, -4.0)	14
	[-4.0, -3.0)	559
	[-3.0, -2.0)	10814
	[-2.0, -1.0)	84176
	[-1.0, 0.0)	271999
	[0.0, 1.0)	367950
	[1.0, 2.0)	209479
	[2.0, 3.0)	49997
	[3.0, 4.0)	4815
	[4.0, 5.0)	193

df.unstack(level=0)

	unweighted
	x := nominal	x := nominal + sigma
[-5.0, -4.0)	31	14
[-4.0, -3.0)	1309	559
[-3.0, -2.0)	21624	10814
[-2.0, -1.0)	135279	84176
[-1.0, 0.0)	341683	271999
[0.0, 1.0)	341761	367950
[1.0, 2.0)	135675	209479
[2.0, 3.0)	21334	49997
[3.0, 4.0)	1273	4815
[4.0, 5.0)	31	193

Collections

You can gather many objects (histograms, functions, ntuples) into a Collection, partly for the convenience of encapsulating all of them in one object.

aghast.Collection({"one": fromroot, "two": ghastly_hist}).dump()

Collection(
  objects={
    'one': Histogram(
      axis=[
        Axis(
          binning=
            RegularBinning(
              num=80,
              interval=RealInterval(low=-5.0, high=5.0),
              overflow=RealOverflow(loc_underflow=BinLocation.below1, loc_overflow=BinLocation.above1)),
          statistics=[
            Statistics(
              moments=[
                Moments(sumwxn=InterpretedInlineInt64Buffer(buffer=[1e+07]), n=0),
                Moments(sumwxn=InterpretedInlineFloat64Buffer(buffer=[1e+07]), n=0, weightpower=1),
                Moments(sumwxn=InterpretedInlineFloat64Buffer(buffer=[1e+07]), n=0, weightpower=2),
                Moments(sumwxn=InterpretedInlineFloat64Buffer(buffer=[2468.31]), n=1, weightpower=1),
                Moments(
                  sumwxn=InterpretedInlineFloat64Buffer(buffer=[1.00118e+07]),
                  n=2,
                  weightpower=1)
              ])
          ])
      ],
      counts=
        UnweightedCounts(
          counts=
            InterpretedInlineFloat64Buffer(
              buffer=
                  [0.00000e+00 2.00000e+00 5.00000e+00 9.00000e+00 1.50000e+01 2.90000e+01
                   4.90000e+01 8.00000e+01 1.04000e+02 2.37000e+02 3.52000e+02 5.55000e+02
                   8.67000e+02 1.44700e+03 2.04600e+03 3.03700e+03 4.56200e+03 6.80500e+03
                   9.54000e+03 1.35290e+04 1.85840e+04 2.55930e+04 3.50000e+04 4.60240e+04
                   5.91030e+04 7.64920e+04 9.64410e+04 1.19873e+05 1.46159e+05 1.77533e+05
                   2.10628e+05 2.46316e+05 2.83292e+05 3.21377e+05 3.59314e+05 3.93857e+05
                   4.26446e+05 4.53031e+05 4.74806e+05 4.89846e+05 4.96646e+05 4.97922e+05
                   4.90499e+05 4.73200e+05 4.53527e+05 4.25650e+05 3.93297e+05 3.58537e+05
                   3.21099e+05 2.82519e+05 2.46469e+05 2.11181e+05 1.77550e+05 1.47417e+05
                   1.20322e+05 9.65920e+04 7.66650e+04 5.95870e+04 4.57760e+04 3.44590e+04
                   2.59000e+04 1.88760e+04 1.35760e+04 9.57100e+03 6.66200e+03 4.62900e+03
                   3.16100e+03 2.06900e+03 1.33400e+03 8.78000e+02 5.81000e+02 3.32000e+02
                   2.20000e+02 1.35000e+02 6.50000e+01 3.90000e+01 2.60000e+01 1.90000e+01
                   1.50000e+01 4.00000e+00 4.00000e+00 0.00000e+00]))),
    'two': Histogram(
      axis=[
        Axis(binning=RegularBinning(num=80, interval=RealInterval(low=-5.0, high=5.0)))
      ],
      counts=
        UnweightedCounts(
          counts=
            InterpretedInlineInt64Buffer(
              buffer=
                  [     2      5      9     15     29     49     80    104    237    352
                      555    867   1447   2046   3037   4562   6805   9540  13529  18584
                    25593  35000  46024  59103  76492  96441 119873 146159 177533 210628
                   246316 283292 321377 359314 393857 426446 453031 474806 489846 496646
                   497922 490499 473200 453527 425650 393297 358537 321099 282519 246469
                   211181 177550 147417 120322  96592  76665  59587  45776  34459  25900
                    18876  13576   9571   6662   4629   3161   2069   1334    878    581
                      332    220    135     65     39     26     19     15      4      4])))
  })

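Not only for convenience: you can also define an Axis in the Collection to subdivide all contents by that Axis. For instance, you can make a collection of qualitatively different histograms all have a signal and control region with PredicateBinning, or all have systematic variations with VariationBinning.

It is not necessary to rely on naming conventions to communicate this information from filler to fitter.

Histogram → histogram conversions

I said in the introduction that aghast does not fill histograms and does not plot histograms, the two things data analysts expect to do. These would be done by user-facing libraries.

Aghast does, however, transform histograms into other histograms, and not just among formats. You can combine histograms with +. In addition to adding histogram counts, it combines auxiliary statistics appropriately (if possible).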

h1 = aghast.Histogram([
    aghast.Axis(aghast.RegularBinning(10, aghast.RealInterval(-5, 5)),
        statistics=[aghast.Statistics(
            moments=[
                aghast.Moments(aghast.InterpretedInlineBuffer.fromarray(numpy.array([10])), n=1),
                aghast.Moments(aghast.InterpretedInlineBuffer.fromarray(numpy.array([20])), n=2)],
            quantiles=[
                aghast.Quantiles(aghast.InterpretedInlineBuffer.fromarray(numpy.array([30])), p=0.5)],
            mode=aghast.Modes(aghast.InterpretedInlineBuffer.fromarray(numpy.array([40]))),
            min=aghast.Extremes(aghast.InterpretedInlineBuffer.fromarray(numpy.array([50]))),
            max=aghast.Extremes(aghast.InterpretedInlineBuffer.fromarray(numpy.array([60]))))])],
    aghast.UnweightedCounts(aghast.InterpretedInlineBuffer.fromarray(numpy.arange(10))))
h2 = aghast.Histogram([
    aghast.Axis(aghast.RegularBinning(10, aghast.RealInterval(-5, 5)),
        statistics=[aghast.Statistics(
            moments=[
                aghast.Moments(aghast.InterpretedInlineBuffer.fromarray(numpy.array([100])), n=1),
                aghast.Moments(aghast.InterpretedInlineBuffer.fromarray(numpy.array([200])), n=2)],
            quantiles=[
                aghast.Quantiles(aghast.InterpretedInlineBuffer.fromarray(numpy.array([300])), p=0.5)],
            mode=aghast.Modes(aghast.InterpretedInlineBuffer.fromarray(numpy.array([400]))),
            min=aghast.Extremes(aghast.InterpretedInlineBuffer.fromarray(numpy.array([500]))),
            max=aghast.Extremes(aghast.InterpretedInlineBuffer.fromarray(numpy.array([600]))))])],
    aghast.UnweightedCounts(aghast.InterpretedInlineBuffer.fromarray(numpy.arange(100, 200, 10))))

(h1 + h2).dump()

Histogram(
  axis=[
    Axis(
      binning=RegularBinning(num=10, interval=RealInterval(low=-5.0, high=5.0)),
      statistics=[
        Statistics(
          moments=[
            Moments(sumwxn=InterpretedInlineInt64Buffer(buffer=[110]), n=1),
            Moments(sumwxn=InterpretedInlineInt64Buffer(buffer=[220]), n=2)
          ],
          min=Extremes(values=InterpretedInlineInt64Buffer(buffer=[50])),
          max=Extremes(values=InterpretedInlineInt64Buffer(buffer=[600])))
      ])
  ],
  counts=
    UnweightedCounts(
      counts=InterpretedInlineInt64Buffer(buffer=[100 111 122 133 144 155 166 177 188 199])))

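The corresponding moments of h1 and h2 were matched and added, quantiles and modes were dropped (there is no way to combine them), and the correct minimum and maximum were picked; the histogram contents were added as well.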
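The combination rules visible in this output can be written down compactly. The sketch below applies them to plain dictionaries holding the same numbers as h1 and h2; combine_statistics and the dictionary layout are hypothetical and only mirror what the dump shows, not how aghast is implemented.

```python
def combine_statistics(a, b):
    """Combine two statistics summaries the way the dump above suggests."""
    combined = {}
    # Moments are sums over entries, so moments of the same order n simply add.
    combined["moments"] = {n: a["moments"][n] + b["moments"][n]
                           for n in a["moments"] if n in b["moments"]}
    # Extremes combine by taking the smaller minimum and the larger maximum.
    combined["min"] = min(a["min"], b["min"])
    combined["max"] = max(a["max"], b["max"])
    # Quantiles and modes cannot be derived from the two inputs, so they are dropped.
    return combined


s1 = {"moments": {1: 10, 2: 20}, "quantiles": {0.5: 30}, "mode": 40, "min": 50, "max": 60}
s2 = {"moments": {1: 100, 2: 200}, "quantiles": {0.5: 300}, "mode": 400, "min": 500, "max": 600}

stats = combine_statistics(s1, s2)
counts = [x + y for x, y in zip(range(10), range(100, 200, 10))]
print(stats, counts)
```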

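Another important histogram → histogram conversion is axis reduction, which can take three forms:

slicing an axis, either dropping the eliminated bins or adding them to underflow/overflow (if possible, depending on the binning type);
rebinning by combining neighboring bins;
projecting out an axis, removing it entirely and summing over all existing bins.

All of these operations use a Pandas-inspired loc/iloc syntax.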

h = aghast.Histogram(
    [aghast.Axis(aghast.RegularBinning(10, aghast.RealInterval(-5, 5)))],
    aghast.UnweightedCounts(
        aghast.InterpretedInlineBuffer.fromarray(numpy.array([0, 10, 20, 30, 40, 50, 60, 70, 80, 90]))))

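loc slices in the data's coordinate system. The value 1.5 falls in the bin with index 6, so the slice starts at that bin's lower edge, 1.0. The six bins below it are combined into an underflow bin: 150 = 0 + 10 + 20 + 30 + 40 + 50.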

h.loc[1.5:].dump()

Histogram(
  axis=[
    Axis(
      binning=
        RegularBinning(
          num=4,
          interval=RealInterval(low=1.0, high=5.0),
          overflow=
            RealOverflow(
              loc_underflow=BinLocation.above1,
              minf_mapping=RealOverflow.missing,
              pinf_mapping=RealOverflow.missing,
              nan_mapping=RealOverflow.missing)))
  ],
  counts=UnweightedCounts(counts=InterpretedInlineInt64Buffer(buffer=[60 70 80 90 150])))
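The numbers in this output can be reproduced by hand. The sketch below maps the data coordinate 1.5 to a bin index for a regular binning and sums the bins that the slice removes; bin_index is a hypothetical helper, and placing the summed bin after the kept bins follows the loc_underflow=above1 shown in the dump.

```python
import math


def bin_index(x, num, low, high):
    """Index of the regular bin on [low, high) that contains x."""
    width = (high - low) / num
    return math.floor((x - low) / width)


counts = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]

start = bin_index(1.5, 10, -5.0, 5.0)   # bin [1, 2)
kept = counts[start:]                   # bins that survive the slice
underflow = sum(counts[:start])         # everything below the slice
buffer = kept + [underflow]             # underflow stored after the normal bins

print(start, buffer)
```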

iloc slices by bin index number.

h.iloc[6:].dump()

Histogram(
  axis=[
    Axis(
      binning=
        RegularBinning(
          num=4,
          interval=RealInterval(low=1.0, high=5.0),
          overflow=
            RealOverflow(
              loc_underflow=BinLocation.above1,
              minf_mapping=RealOverflow.missing,
              pinf_mapping=RealOverflow.missing,
              nan_mapping=RealOverflow.missing)))
  ],
  counts=UnweightedCounts(counts=InterpretedInlineInt64Buffer(buffer=[60 70 80 90 150])))

Slices have a start, stop, and step (start:stop:step). The step parameter rebins:

h.iloc[::2].dump()

Histogram(
  axis=[
    Axis(binning=RegularBinning(num=5, interval=RealInterval(low=-5.0, high=5.0)))
  ],
  counts=UnweightedCounts(counts=InterpretedInlineInt64Buffer(buffer=[10 50 90 130 170])))

Thus, you can slice and rebin as part of the same operation.
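Rebinning with a step amounts to summing consecutive groups of bins. A minimal sketch in plain Python follows; rebin is a hypothetical helper, and refusing a bin count that is not a multiple of the step is an assumption of this sketch rather than documented aghast behavior.

```python
def rebin(counts, step):
    """Sum every `step` neighboring bins into one."""
    if len(counts) % step != 0:
        raise ValueError("number of bins must be a multiple of step")
    return [sum(counts[i:i + step]) for i in range(0, len(counts), step)]


rebinned = rebin([0, 10, 20, 30, 40, 50, 60, 70, 80, 90], 2)
print(rebinned)
```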

Projecting uses the same mechanism, except that None passed as an axis's slice projects it out.

h2 = aghast.Histogram(
    [aghast.Axis(aghast.RegularBinning(10, aghast.RealInterval(-5, 5))),
     aghast.Axis(aghast.RegularBinning(10, aghast.RealInterval(-5, 5)))],
    aghast.UnweightedCounts(
        aghast.InterpretedInlineBuffer.fromarray(numpy.arange(100))))

h2.iloc[:, None].dump()

Histogram(
  axis=[
    Axis(binning=RegularBinning(num=10, interval=RealInterval(low=-5.0, high=5.0)))
  ],
  counts=
    UnweightedCounts(
      counts=InterpretedInlineInt64Buffer(buffer=[45 145 245 345 445 545 645 745 845 945])))

Thus, all three axis reduction operations can be performed in a single syntax.
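The projected counts above are row sums of the 10 × 10 table of bin contents. The sketch below rebuilds that table in plain Python, assuming the flat buffer is read in row-major order with the first axis varying slowest, which is consistent with the printed result.

```python
# The contents of numpy.arange(100), arranged as 10 rows of 10 (row-major order).
counts_2d = [[10 * i + j for j in range(10)] for i in range(10)]

# Projecting out the second axis sums over it and leaves one value per bin of the first axis.
projected = [sum(row) for row in counts_2d]
print(projected)
```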

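In general, an n-dimensional ghastly histogram can be sliced like an n-dimensional Numpy array. This includes integer and boolean indexing (though that necessarily changes the binning to IrregularBinning).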

h.iloc[[4, 3, 6, 7, 1]].dump()

Histogram(
  axis=[
    Axis(
      binning=
        IrregularBinning(
          intervals=[
            RealInterval(low=-1.0, high=0.0),
            RealInterval(low=-2.0, high=-1.0),
            RealInterval(low=1.0, high=2.0),
            RealInterval(low=2.0, high=3.0),
            RealInterval(low=-4.0, high=-3.0)
          ]))
  ],
  counts=UnweightedCounts(counts=InterpretedInlineInt64Buffer(buffer=[40 30 60 70 10])))

h.iloc[[True, False, True, False, True, False, True, False, True, False]].dump()

Histogram(
  axis=[
    Axis(
      binning=
        IrregularBinning(
          intervals=[
            RealInterval(low=-5.0, high=-4.0),
            RealInterval(low=-3.0, high=-2.0),
            RealInterval(low=-1.0, high=0.0),
            RealInterval(low=1.0, high=2.0),
            RealInterval(low=3.0, high=4.0)
          ]))
  ],
  counts=UnweightedCounts(counts=InterpretedInlineInt64Buffer(buffer=[0 20 40 60 80])))

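loc for numerical binnings accepts

a real number
a real-valued slice
None for projection
ellipsis (...)

loc for categorical binnings accepts

a string
an iterable of strings
an empty slice
None for projection
ellipsis (...)

iloc accepts

an integer
an integer-valued slice
None for projection
integer-valued array-like
boolean-valued array-like
ellipsis (...)

Bin counts → Numpy

Frequently, one wants to extract bin counts from a histogram. The loc/iloc syntax above creates histograms from histograms, not bin counts.

A histogram's counts property has a slice syntax.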

allcounts = numpy.arange(12) * numpy.arange(12)[:, None]   # multiplication table
allcounts[10, :] = -999   # underflows
allcounts[11, :] = 999    # overflows
allcounts[:, 0]  = -999   # underflows
allcounts[:, 1]  = 999    # overflows
print(allcounts)

[[-999  999    0    0    0    0    0    0    0    0    0    0]
 [-999  999    2    3    4    5    6    7    8    9   10   11]
 [-999  999    4    6    8   10   12   14   16   18   20   22]
 [-999  999    6    9   12   15   18   21   24   27   30   33]
 [-999  999    8   12   16   20   24   28   32   36   40   44]
 [-999  999   10   15   20   25   30   35   40   45   50   55]
 [-999  999   12   18   24   30   36   42   48   54   60   66]
 [-999  999   14   21   28   35   42   49   56   63   70   77]
 [-999  999   16   24   32   40   48   56   64   72   80   88]
 [-999  999   18   27   36   45   54   63   72   81   90   99]
 [-999  999 -999 -999 -999 -999 -999 -999 -999 -999 -999 -999]
 [-999  999  999  999  999  999  999  999  999  999  999  999]]

h2 = aghast.Histogram(
    [aghast.Axis(aghast.RegularBinning(10, aghast.RealInterval(-5, 5),
                     aghast.RealOverflow(loc_underflow=aghast.RealOverflow.above1,
                                       loc_overflow=aghast.RealOverflow.above2))),
     aghast.Axis(aghast.RegularBinning(10, aghast.RealInterval(-5, 5),
                     aghast.RealOverflow(loc_underflow=aghast.RealOverflow.below2,
                                       loc_overflow=aghast.RealOverflow.below1)))],
    aghast.UnweightedCounts(
        aghast.InterpretedInlineBuffer.fromarray(allcounts)))

print(h2.counts[:, :])

[[ 0  0  0  0  0  0  0  0  0  0]
 [ 2  3  4  5  6  7  8  9 10 11]
 [ 4  6  8 10 12 14 16 18 20 22]
 [ 6  9 12 15 18 21 24 27 30 33]
 [ 8 12 16 20 24 28 32 36 40 44]
 [10 15 20 25 30 35 40 45 50 55]
 [12 18 24 30 36 42 48 54 60 66]
 [14 21 28 35 42 49 56 63 70 77]
 [16 24 32 40 48 56 64 72 80 88]
 [18 27 36 45 54 63 72 81 90 99]]

To get the underflows and overflows, set the slicing extremes to -inf and +inf.

print(h2.counts[-numpy.inf:numpy.inf, :])

[[-999 -999 -999 -999 -999 -999 -999 -999 -999 -999]
 [   0    0    0    0    0    0    0    0    0    0]
 [   2    3    4    5    6    7    8    9   10   11]
 [   4    6    8   10   12   14   16   18   20   22]
 [   6    9   12   15   18   21   24   27   30   33]
 [   8   12   16   20   24   28   32   36   40   44]
 [  10   15   20   25   30   35   40   45   50   55]
 [  12   18   24   30   36   42   48   54   60   66]
 [  14   21   28   35   42   49   56   63   70   77]
 [  16   24   32   40   48   56   64   72   80   88]
 [  18   27   36   45   54   63   72   81   90   99]
 [ 999  999  999  999  999  999  999  999  999  999]]

print(h2.counts[:, -numpy.inf:numpy.inf])

[[-999    0    0    0    0    0    0    0    0    0    0  999]
 [-999    2    3    4    5    6    7    8    9   10   11  999]
 [-999    4    6    8   10   12   14   16   18   20   22  999]
 [-999    6    9   12   15   18   21   24   27   30   33  999]
 [-999    8   12   16   20   24   28   32   36   40   44  999]
 [-999   10   15   20   25   30   35   40   45   50   55  999]
 [-999   12   18   24   30   36   42   48   54   60   66  999]
 [-999   14   21   28   35   42   49   56   63   70   77  999]
 [-999   16   24   32   40   48   56   64   72   80   88  999]
 [-999   18   27   36   45   54   63   72   81   90   99  999]]

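Also note that the underflows are now all below the normal bins and the overflows are now all above the normal bins, regardless of how they were arranged in the ghast. This allows analysis code to be independent of the histogram source.

Other types

Aghast can attach fit functions to histograms, can store standalone functions, such as lookup tables, and can store ntuples for unweighted fits or machine learning.

Acknowledgements

Support for this work was provided by NSF cooperative agreement OAC-1836650 (IRIS-HEP), grant OAC-1450377 (DIANA/HEP) and PHY-1520942 (US-CMS LHC Ops).

Thanks especially to the gracious help of aghast contributors!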

Release history

This version

0.2.1

April 12, 2019

0.2.0

April 11, 2019

0.1.0

March 31, 2019

0.1.0rc4 pre-release

March 31, 2019

0.1.0rc3 pre-release

March 31, 2019

0.1.0rc2 pre-release

March 31, 2019

0.1.0rc1 pre-release

March 31, 2019

Download files

Download the file for your platform.

Source distribution

aghast-0.2.1.tar.gz (137.7 kB)

Uploaded April 12, 2019 (source)

Built distribution

aghast-0.2.1-py2.py3-none-any.whl (136.0 kB)

Uploaded April 12, 2019 (Python 2, Python 3)

Hashes for aghast-0.2.1.tar.gz

Algorithm	Hash digest
SHA256	`5a60f84d8ecb1b5d56368d6eb839fdae93496d9d96b2a1ead8f820732679ccb4`
MD5	`47671b75afe7e3406ebed22ac0aaff1f`
BLAKE2b-256	`fdd9cfbc5921f2fa64648b3aeff0a5f02a7db1287dca0a38e560896a3e805671`

Hashes for aghast-0.2.1-py2.py3-none-any.whl

Algorithm	Hash digest
SHA256	`d945d3adb55dea3a1cd465730c49110a8667651d9d2ae1ff6643900b0d3e65c8`
MD5	`bfca193395dfd928cac5b22e24abb77c`
BLAKE2b-256	`9a15b67a8f15912dbfbbf4ae6025c4284c287a6be67e580d618afb38701af7a2`
