A deterministic way to hash an np.array to an int

import numpy as np
from pyarrow import plasma

def int_to_bytes(x: int) -> bytes:
    # https://stackoverflow.com/questions/21017698/converting-int-to-bytes-in-python-3
    return x.to_bytes((x.bit_length() + 7) // 8, "big")

def get_object_id(arr):
    arr_id = int(arr.sum() / (arr.shape[0]))
    oid: bytes = int_to_bytes(arr_id).zfill(20)  # fill from left with zeroes, must be of length 20
    return plasma.ObjectID(oid)

arr = np.arange(12)
a1 = arr.reshape(3, 4)
a2 = arr.reshape(3, 2, 2)
assert get_object_id(a1) != get_object_id(a2), 'Hash collision'

# another good test case
assert get_object_id(np.ones(12)) != get_object_id(np.ones(12).reshape(4, 3))
assert get_object_id(np.ones(12)) != get_object_id(np.zeros(12))

1 answer

User

#1 · Posted 2024-06-16 14:58:10

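The hashlib module has routines for computing hashes from byte strings (typically used for checksums). You can use ndarray.tobytes to convert the data array to a byte string, but your examples will still fail, because those arrays contain the same bytes and differ only in shape. So you can hash the shape in as well: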

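import hashlib

def hasharr(arr):
    # Hash the raw data bytes, then mix in each dimension so that
    # arrays with identical bytes but different shapes get different digests.
    h = hashlib.blake2b(arr.tobytes(), digest_size=20)
    for dim in arr.shape:
        h.update(dim.to_bytes(4, byteorder='big'))
    return h.digest()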

Example:

>>> hasharr(a1)
b'\x9f\xd7<\x16\xb6u\xfdM\x14\xc2\xe49.\xf0P\xaa[\xe9\x0bZ'
>>> hasharr(a2)
b"Z\x18+'`\x83\xd6\xc8\x04\xd4%\xdc\x16V)\xb3\x97\x95\xf7v"

I'm no expert on blake2b, so you'll have to do your own research to work out how likely collisions are.
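Since the question asks for an int while hasharr returns bytes, here is a minimal sketch of one way to get there. The helper name hash_array_to_int is made up for illustration, and mixing the dtype into the hash is an assumption that goes beyond the answer above (it keeps an int64 array and a float64 array with identical bytes apart). On collisions: if blake2b behaves like an ideal hash, a 160-bit digest should only be expected to collide after roughly 2^80 distinct inputs, but treat that as a rule of thumb rather than a guarantee.

```python
import hashlib

import numpy as np


def hash_array_to_int(arr: np.ndarray, digest_size: int = 20) -> int:
    """Hypothetical helper: deterministic integer hash of dtype, shape, and data."""
    # Describe dtype and shape as text, e.g. "('<i8', (3, 4))".
    # Including the dtype is an assumption beyond the original answer.
    meta = repr((arr.dtype.str, arr.shape)).encode("ascii")

    h = hashlib.blake2b(digest_size=digest_size)
    # Length-prefix the metadata so it cannot run into the data bytes.
    h.update(len(meta).to_bytes(4, "big"))
    h.update(meta)
    # tobytes() emits the elements in C order regardless of memory layout.
    h.update(arr.tobytes())

    # Interpret the digest as one big-endian unsigned integer.
    return int.from_bytes(h.digest(), "big")


arr = np.arange(12)
a1 = arr.reshape(3, 4)
a2 = arr.reshape(3, 2, 2)
assert hash_array_to_int(a1) != hash_array_to_int(a2), 'Hash collision'
```

With digest_size=20 the result fits in 160 bits, so it can be turned back into a 20-byte ID with int.to_bytes(20, "big") if needed.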

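I'm not sure why you tagged pyarrow, but if you want to do the same thing on a pyarrow array without converting to numpy, you can get the array's buffers with arr.buffers() and convert them (there will be several, and some may be None) to byte strings with buf.to_pybytes(). Just hash all of the buffers. You don't need to worry about shape here, because pyarrow arrays are always one-dimensional.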
