Flattening 3x3 arrays in uproot.pandas.iterate

Posted 2024-05-15 06:23:49


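I am trying to modify one of my existing scripts, which uses uproot.pandas.iterate to read data from ROOT files into a pandas DataFrame. At the moment it only reads branches containing simple data types (floats, ints, bools), but I would like to add the ability to read some branches that store 3x3 matrices. From the README, I understand that in this case the recommended approach is to flatten the structure by passing flatten=True as an argument to the iterate function. When I do this, however, it crashes: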

Traceback (most recent call last):
  File "genPreselTuples.py", line 338, in <module>
    data = read_events(args.decaymode, args.tag, args.year, args.polarity, chunk=args.chunk, numchunks=args.numchunks, verbose=args.verbose, testing=args.testing)
  File "genPreselTuples.py", line 180, in read_events
    for df in uproot.pandas.iterate(filename_list, treename, branches=list(branchdict.keys()), entrysteps=100000, namedecode='utf-8', flatten=True):
  File "/afs/cern.ch/work/d/djwhite/miniconda3/envs/D02HHHHml/lib/python3.8/site-packages/uproot/tree.py", line 117, in iterate
    for start, stop, arrays in tree.iterate(branches=branchesinterp, entrysteps=entrysteps, outputtype=outputtype, namedecode=namedecode, reportentries=True, entrystart=0, entrystop=tree.numentries, flatten=flatten, flatname=flatname, awkwardlib=awkward, cache=cache, basketcache=basketcache, keycache=keycache, executor=executor, blocking=blocking):
  File "/afs/cern.ch/work/d/djwhite/miniconda3/envs/D02HHHHml/lib/python3.8/site-packages/uproot/tree.py", line 721, in iterate
    out = out()
  File "/afs/cern.ch/work/d/djwhite/miniconda3/envs/D02HHHHml/lib/python3.8/site-packages/uproot/tree.py", line 678, in <lambda>
    return lambda: uproot._connect._pandas.futures2df([(branch.name, interpretation, wrap_again(branch, interpretation, future)) for branch, interpretation, future, past, cachekey in futures], outputtype, start, stop, flatten, flatname, awkward)
  File "/afs/cern.ch/work/d/djwhite/miniconda3/envs/D02HHHHml/lib/python3.8/site-packages/uproot/_connect/_pandas.py", line 162, in futures2df
    array = array.view(awkward.numpy.dtype([(str(i), array.dtype) for i in range(functools.reduce(operator.mul, array.shape[1:]))])).reshape(array.shape[0])
ValueError: When changing to a larger dtype, its size must be a divisor of the total size in bytes of the last axis of the array.
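The failing line in the traceback calls array.view with a record dtype that has one field per matrix element. The same kind of ValueError can be reproduced with NumPy alone. This is a minimal sketch, assuming the branch arrives as an array of shape (n, 3, 3); the array below is made-up stand-in data, not something read from the ROOT file, and the sketch does not show which code path uproot takes for a given branch list.

```python
import functools
import operator

import numpy as np

# Hypothetical stand-in for a branch of 3x3 matrices: 4 entries of float64.
array = np.zeros((4, 3, 3), dtype=np.float64)

# Same construction as the line in the traceback: one field per matrix element.
n_fields = functools.reduce(operator.mul, array.shape[1:])  # 3 * 3 = 9
record = np.dtype([(str(i), array.dtype) for i in range(n_fields)])

# The last axis holds 3 * 8 = 24 bytes, but one record needs 9 * 8 = 72 bytes,
# so NumPy refuses to reinterpret the array in place.
try:
    array.view(record)
    error_message = None
except ValueError as err:
    error_message = str(err)

# Collapsing the trailing dimensions first makes the last axis exactly one
# record wide (72 bytes), so the view succeeds.
flat = array.reshape(array.shape[0], -1).view(record).reshape(array.shape[0])

print(error_message)
print(flat.shape, flat.dtype.names)
```

If this is what happens inside futures2df, it would suggest that the view step is only safe for arrays with a single trailing dimension, which may be why the crash depends on which branches are requested together.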

My code is as follows:

import pandas as pd
import uproot

# prepare for file reading
data = pd.DataFrame() # create empty dataframe to hold final output data
file_counter = 0      # count how many files have been processed
event_counter = 0     # count how many events were in input files that have been processed

# loop over files in filename_list & add contents to dataframe
for df in uproot.pandas.iterate(filename_list, treename, branches=list(branchdict.keys()), entrysteps=100000, namedecode='utf-8', flatten=True):
    df.rename(branchdict, axis='columns', inplace=True)   # rename branches to custom names (defined in dictionary)
    
    file_counter += 1                # manage file counting
    event_counter += df.shape[0]     # manage event counting
    
    print(df.head(10)) # debugging
    
    # apply all cuts
    for cut in cutlist:
        df.query(cut, inplace=True)
    
    # append events to dataframe of data
    data = data.append(df, ignore_index=True)
    
    # terminal output
    print('Processed '+format(file_counter,',')+' chunks (kept '+format(data.shape[0],',')+' of '+format(event_counter,',')+' events ({0:.2f}%))'.format(100*data.shape[0]/event_counter), end='\r')
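Since speed matters here, one detail of the loop above is worth a sketch: calling data.append inside the loop copies the accumulated frame on every chunk, and DataFrame.append was removed in pandas 2.0. A common alternative is to collect the filtered chunks in a list and concatenate once at the end. The function name collect_chunks and the toy frames below are hypothetical; in the real script the chunks would come from uproot.pandas.iterate.

```python
import pandas as pd


def collect_chunks(chunks, cutlist):
    """Apply every cut to each chunk, then concatenate the survivors once."""
    kept = []
    n_seen = 0
    for df in chunks:
        n_seen += len(df)          # count events before any cut
        for cut in cutlist:
            df = df.query(cut)     # returns a filtered copy
        kept.append(df)
    data = pd.concat(kept, ignore_index=True) if kept else pd.DataFrame()
    return data, n_seen


# Toy stand-ins for the chunks yielded by the iterator.
chunks = [pd.DataFrame({"x": [1, 2, 3]}), pd.DataFrame({"x": [4, 5, 6]})]
data, n_seen = collect_chunks(chunks, ["x > 2", "x < 6"])
print(len(data), "of", n_seen, "events kept")
```

This does not address the ValueError itself; it only keeps the accumulation step from growing quadratically with the number of chunks.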

I have been able to get it working with flatten=False (when the DataFrame is printed, the values are exploded into columns like those shown here: https://github.com/scikit-hep/uproot#multiple-values-per-event-fixed-size-arrays)

   eventNumber  runNumber  totCandidates  nCandidate  ...  D0_SubVtx_234_COV_[1][2]  D0_SubVtx_234_COV_[2][0]  D0_SubVtx_234_COV_[2][1]  D0_SubVtx_234_COV_[2][2]
0     13769776     177132              3           0  ...                 -0.016343                  0.032616                 -0.016343                  0.470791
1     13769776     177132              3           1  ...                 -0.016343                  0.032616                 -0.016343                  0.470791
2     13769776     177132              3           2  ...                 -0.016343                  0.032616                 -0.016343                  0.470791
3     36250092     177132              2           0  ...                  0.004726                 -0.017212                  0.004726                  0.193447
4     36250092     177132              2           1  ...                  0.004726                 -0.017212                  0.004726                  0.193447

[5 rows x 296 columns]
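The column names in the output above follow a name[i][j] pattern, with one column per matrix element. For reference, the same layout can be produced by hand from an (n, 3, 3) NumPy array. This is a sketch under that assumption; matrix_columns is a hypothetical helper, not part of uproot, and the input array is made-up data.

```python
import itertools

import numpy as np
import pandas as pd


def matrix_columns(name, array):
    """Expand an (n, d1, d2, ...) array into one column per element."""
    dims = array.shape[1:]
    columns = {}
    for index in itertools.product(*(range(d) for d in dims)):
        label = name + "".join("[{}]".format(i) for i in index)
        # Select element `index` of every entry along the first axis.
        columns[label] = array[(slice(None),) + index]
    return pd.DataFrame(columns)


# Two made-up entries, each a 3x3 matrix.
cov = np.arange(18, dtype=np.float64).reshape(2, 3, 3)
df = matrix_columns("D0_SubVtx_234_COV_", cov)
print(df.columns.tolist())
```

Each resulting column is a plain one-dimensional array, so the frame can be joined with the scalar branches using the usual pandas tools.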

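But I understand from the README that leaving these structures unflattened is not recommended, at least for speed, and since I have O(10^8) rows to get through, speed is something of a concern. I would like to understand what is causing this so that I can find the best way to handle these objects (and later write them to a new file). Thanks.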

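Edit: I have narrowed the problem down to the branches option. If I specify a few branches by hand (e.g. branches=['eventNumber', 'D0_SubVtx_234_COV_']), it works fine with both flatten=True and flatten=False. But when I use list(branchdict.keys()), it raises the ValueError shown at the top of the original question.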

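I have checked this list, and all of its elements are real branch names (otherwise it would raise a KeyError). It contains 206 regular branches, some holding standard data types and others holding length-1 lists of a single data type, plus 10 branches containing the 3x3 matrix-like objects.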

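If I remove the matrix branches from this list, it works as expected. The same is true if I remove only the length-1 lists. The crash occurs whenever I try to read, in the same call, (separate) branches containing these length-1 lists and branches containing these 3x3 matrices.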
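When narrowing down which combination of branches triggers the crash, it can help to group the branches by the shape of a single entry. The sketch below assumes the branches have already been read into a dict of fixed-size NumPy arrays; group_by_entry_shape and the branch name length_one_branch are hypothetical, and jagged (variable-length) branches would need separate handling.

```python
import numpy as np


def group_by_entry_shape(arrays):
    """Map each per-entry shape, e.g. (), (1,), (3, 3), to its branch names."""
    groups = {}
    for name, array in arrays.items():
        entry_shape = np.asarray(array).shape[1:]
        groups.setdefault(entry_shape, []).append(name)
    return groups


# Made-up stand-ins for three kinds of branch, five entries each.
arrays = {
    "eventNumber": np.zeros(5, dtype=np.int64),      # one scalar per entry
    "length_one_branch": np.zeros((5, 1)),           # length-1 list per entry
    "D0_SubVtx_234_COV_": np.zeros((5, 3, 3)),       # 3x3 matrix per entry
}
groups = group_by_entry_shape(arrays)
for entry_shape, names in groups.items():
    print(entry_shape, names)
```

Reading one group at a time would then show whether a single group fails on its own or only in combination with another.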


0 answers

No answers yet
