Creating a numpy structured array from a list of strings

1 answer

Answer #1 · posted 2024-05-15 08:46:42

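Sorry that this answer is long and rambling, but that is what it took to figure out what was going on. In particular, the complexity of the dtype was hidden by its length.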

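When I tried your list as the delimiter, I got a TypeError: cannot perform accumulate with flexible type error. The traceback shows that the error occurs in LineSplitter. Without going into detail, the delimiter should be a single character (or the default "whitespace").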

From the genfromtxt docs:

delimiter : str, int, or sequence, optional The string used to separate values. By default, any consecutive whitespaces act as delimiter. An integer or sequence of integers can also be provided as width(s) of each field.

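The genfromtxt splitter is a little more powerful than the string .split that loadtxt uses, but it is not as general as a regular-expression splitter (see the re.split call further down).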
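The two delimiter forms described in the docs can be sketched as follows. The sample lines are made up for illustration, and encoding="utf-8" is passed only so that inferred string fields come back as unicode; treat this as a minimal sketch rather than the asker's actual call.

```python
import numpy as np

# Hypothetical one-line inputs, not taken from the question.
delimited = ["1|2.5|abc"]
fixed_width = ["123456789"]

# A single-character delimiter; dtype=None lets genfromtxt infer each field.
a = np.genfromtxt(delimited, delimiter="|", dtype=None, encoding="utf-8")
print(a.dtype.names)   # auto-generated field names: f0, f1, f2

# An integer delimiter is read as a field width: three columns of 3 chars.
b = np.genfromtxt(fixed_width, delimiter=3, dtype=int)
print(b)
```

A list of several different separator characters is not one of the accepted forms, which is consistent with the error coming out of LineSplitter.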

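As for the TypeError: a bytes-like object is required, not 'str': you specify the dtype 'str' for several of the fields. That is a byte string, whereas record is a unicode string (in Py3). But you have already realized that, with BytesIO(record.encode()).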
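As a small illustration of the bytes versus unicode point, the sketch below encodes a made-up unicode record and feeds the bytes to genfromtxt two ways: wrapped in BytesIO, and as a one-element list of lines. The record value is hypothetical and much shorter than the real one.

```python
import io
import numpy as np

record = "1|2.5|3"        # hypothetical unicode record (str in Py3)
raw = record.encode()     # the same text as a bytes object

# genfromtxt accepts a file-like object or simply a list of lines.
from_file = np.genfromtxt(io.BytesIO(raw), delimiter="|")
from_list = np.genfromtxt([raw], delimiter="|")

print(type(record).__name__, type(raw).__name__)
print(from_list)
```

Passing a list of lines skips the BytesIO wrapper, which makes it convenient for quick tests.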

I like to test genfromtxt cases with:

record = b'....'
np.genfromtxt([record], ....)

or, better yet:

[code block not preserved in the source]

If I let genfromtxt infer the field types, and use just one delimiter, I get 32 fields:

In [19]: A=np.genfromtxt([record],dtype=None,delimiter='|')
In [20]: len(A.dtype)
Out[20]: 32
In [21]: A
Out[21]: 
array((b'0002 00038 1', False, 3.6412123, 1.08701186, 14.1, -23.0, 69, 82, 1.8, 1.9, 1968.56, 1957.3, 3, 1.0, 3.0, 0.9, 3.0, 12.444, 0.213, 11.907, 0.189, 999, False, False, 3.64117944, 1.08706861, 1.83, 1.73, 81.0, 104.7, False, 0.0), 
      dtype=[('f0', 'S12'), ('f1', '?'), ('f2', '<f8'), ('f3', '<f8'), ('f4', '<f8'), ... ('f26', '<f8'), ('f27', '<f8'), ('f28', '<f8'), ('f29', '<f8'), ('f30', '?'), ('f31', '<f8')])

Once the whole bytes and delimiter business is sorted out,

np.array([x for x in re.split(b'\|| ',record)],dtype=dform)

does run. I now see that your dform is complex, with nested compound fields.

But to define a structured array, you need to give it a list of records (tuples), e.g.

np.array([(record1...), (record2...), ....], dtype=[(field1), (field2), ...])

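You are trying to create a single record. I can wrap your list in a tuple, but then I get a mismatch between its length and the length of dform: 66 vs 17. If you count all the subfields, dform might call for 66 values, but we cannot supply them with just one flat tuple.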
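The length mismatch can be reproduced on a much smaller nested dtype. The sketch below uses a cut-down, hypothetical dtype (three top-level fields, five primitive values) rather than the real dform; a record whose tuples nest the same way as the dtype is accepted, while a flat tuple of the five values is expected to be rejected.

```python
import numpy as np

# Cut-down stand-in for dform: 3 top-level fields, 5 primitive values.
small = np.dtype([
    ("starid", [("TYC1", "i4"), ("TYC2", "i4")]),
    ("pos", [("rightAscension", "f8"), ("declination", "f8")]),
    ("numPos", "i4"),
])

# The tuple nesting mirrors the dtype nesting, so this record is accepted.
nested = ((2, 38), (3.6412123, 1.08701186), 3)
arr = np.array([nested], dtype=small)
print(arr["starid"]["TYC2"])

# A flat tuple has 5 items but the dtype has only 3 top-level fields.
flat = (2, 38, 3.6412123, 1.08701186, 3)
try:
    np.array([flat], dtype=small)
    flat_accepted = True
except (ValueError, TypeError) as exc:
    flat_accepted = False
    print("flat tuple rejected:", exc)
```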

I have never tried to create an array from such a complex dtype, so I am hunting around for ways to make it work.

In [41]: np.zeros((1,),dform)
Out[41]: 
array([ ((0, 0, 0), '', (0.0, 0.0), (0.0, 0.0), (0, 0, 0.0, 0.0), (0.0, 0.0), 0, (0.0, 0.0, 0.0, 0.0), ((0.0, 0.0), (0.0, 0.0)), 0, '', '', (0.0, 0.0), (0.0, 0.0), (0.0, 0.0), '', 0.0)], 
      dtype=[('starid', [('TYC1', '<i4'), ('TYC2', '<i4'), ('TYC3', '<i4')]), ('pflag', '<U'), ('starBearing', [('rightAscension', '<f8'), ('declination', '<f8')]), ('properMotion', [('rightAscension', '<f8'), ('declination', '<f8')]), ('uncertainty', [('rightAscension', '<i4'), ('declination', '<i4'), ('pmRA', '<f8'), ('pmDc', '<f8')]), ('meanEpoch', ....('solutionType', '<U'), ('correlation', '<f8')])

In [64]: for name in A.dtype.names:
   ....:     print(A[name].dtype)
   ....:     
[('TYC1', '<i4'), ('TYC2', '<i4'), ('TYC3', '<i4')]
<U1
[('rightAscension', '<f8'), ('declination', '<f8')]
[('rightAscension', '<f8'), ('declination', '<f8')]
[('rightAscension', '<i4'), ('declination', '<i4'), ('pmRA', '<f8'), ('pmDc', '<f8')]
[('rightAscension', '<f8'), ('declination', '<f8')]
int32
[('rightAscension', '<f8'), ('declination', '<f8'), ('pmRA', '<f8'), ('pmDc', '<f8')]
[('BT', [('mag', '<f8'), ('err', '<f8')]), ('VT', [('mag', '<f8'), ('err', '<f8')])]
int32
<U1
<U1
[('rightAscension', '<f8'), ('declination', '<f8')]
[('rightAscension', '<f8'), ('declination', '<f8')]
[('rightAscension', '<f8'), ('declination', '<f8')]
<U1
float64

I count 34 primitive dtype fields. Most are "scalar", some have 2-4 terms, and one has a further level of nesting.
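The count can be checked mechanically. The sketch below rebuilds dform from the dtype printed in this answer (writing the string fields as U1, as in the per-field listing) and walks it recursively; count_leaf_fields is a helper name introduced here for illustration, not a numpy function.

```python
import numpy as np

radec = [("rightAscension", "f8"), ("declination", "f8")]
dform = np.dtype([
    ("starid", [("TYC1", "i4"), ("TYC2", "i4"), ("TYC3", "i4")]),
    ("pflag", "U1"),
    ("starBearing", radec),
    ("properMotion", radec),
    ("uncertainty", [("rightAscension", "i4"), ("declination", "i4"),
                     ("pmRA", "f8"), ("pmDc", "f8")]),
    ("meanEpoch", radec),
    ("numPos", "i4"),
    ("fitGoodness", [("rightAscension", "f8"), ("declination", "f8"),
                     ("pmRA", "f8"), ("pmDc", "f8")]),
    ("magnitude", [("BT", [("mag", "f8"), ("err", "f8")]),
                   ("VT", [("mag", "f8"), ("err", "f8")])]),
    ("starProximity", "i4"),
    ("tycho1flag", "U1"),
    ("hipparcosNumber", "U1"),
    ("observedPos", radec),
    ("observedEpoch", radec),
    ("observedError", radec),
    ("solutionType", "U1"),
    ("correlation", "f8"),
])

def count_leaf_fields(dt):
    """Count primitive (non-structured) fields, descending into nested ones."""
    dt = np.dtype(dt)
    if dt.names is None:          # not a structured dtype: one primitive value
        return 1
    return sum(count_leaf_fields(dt[name]) for name in dt.names)

print(len(dform), count_leaf_fields(dform))
```

len(dform) only sees the top-level fields, which is where the 17 comes from; the recursive count gives the number of values a flat, delimited record has to supply.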

If I replace the first two separating spaces with |, record.split(b'|') gives me 34 strings.
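One way to do that replacement is bytes.replace with a count of 2, sketched below on a shortened, hypothetical record (only the first few fields, with values borrowed from the output shown earlier). It assumes the first two spaces in the record are the ones inside the star id.

```python
# Shortened stand-in for the real record; only the first few fields.
record = b"0002 00038 1| |3.6412123|1.08701186"

# Replace only the first two spaces, i.e. the ones inside the star id.
fixed = record.replace(b" ", b"|", 2)
pieces = fixed.split(b"|")

print(fixed)
print(len(pieces))
```

With the full record, the same two calls should produce the 34 strings mentioned above.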

Let's try that with genfromtxt:

In [79]: np.genfromtxt([record],delimiter='|',dtype=dform)
Out[79]: 
array(((2, 38, 1), '', (3.6412123, 1.08701186), (14.1, -23.0), 
   (69, 82, 1.8, 1.9), (1968.56, 1957.3), 3, (1.0, 3.0, 0.9, 3.0),
   ((12.444, 0.213), (11.907, 0.189)), 999, '', '', 
   (3.64117944, 1.08706861), (1.83, 1.73), (81.0, 104.7), '', 0.0), 
      dtype=[('starid', [('TYC1', '<i4'), ('TYC2', '<i4'), ('TYC3', '<i4')]), 
 ('pflag', '<U'), 
 ('starBearing', [('rightAscension', '<f8'), ('declination', '<f8')]),  
 ('properMotion', [('rightAscension', '<f8'), ('declination', '<f8')]),
 ('uncertainty', [('rightAscension', '<i4'), ('declination', '<i4'), ('pmRA', '<f8'), ('pmDc', '<f8')]), 
 ('meanEpoch', [('rightAscension', '<f8'), ('declination', '<f8')]),   
 ('numPos', '<i4'), 
 ('fitGoodness', [('rightAscension', '<f8'), ('declination', '<f8'), ('pmRA', '<f8'), ('pmDc', '<f8')]), 
 ('magnitude', [('BT', [('mag', '<f8'), ('err', '<f8')]), ('VT', [('mag', '<f8'), ('err', '<f8')])]), 
 ('starProximity', '<i4'), ('tycho1flag', '<U'), ('hipparcosNumber', '<U'), 
 ('observedPos', [('rightAscension', '<f8'), ('declination', '<f8')]),
 ('observedEpoch', [('rightAscension', '<f8'), ('declination', '<f8')]), 
 ('observedError', [('rightAscension', '<f8'), ('declination', '<f8')]), ('solutionType', '<U'), ('correlation', '<f8')])

That looks reasonable. genfromtxt can actually split the values across compound fields. That is what I was trying to get np.array() to do.
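The same behaviour can be seen on a small scale. The sketch below uses a cut-down, hypothetical nested dtype with six primitive values and a made-up single-delimiter record; genfromtxt reads the six flat columns and distributes them across the compound fields.

```python
import numpy as np

# Cut-down nested dtype: 3 top-level fields, 6 primitive values.
small = np.dtype([
    ("starid", [("TYC1", "i4"), ("TYC2", "i4"), ("TYC3", "i4")]),
    ("pos", [("rightAscension", "f8"), ("declination", "f8")]),
    ("numPos", "i4"),
])

# Hypothetical record: six values, one delimiter throughout.
record = b"0002|00038|1|3.6412123|1.08701186|3"

out = np.genfromtxt([record], delimiter="|", dtype=small)
print(out)
print(out["starid"]["TYC2"], out["pos"]["declination"])
```

The number of delimited columns has to match the number of primitive fields, not the number of top-level fields.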

So if you sort out the delimiter and the bytes/unicode issue, genfromtxt can handle this mess.
