How to use a regex as the delimiter function when reading a file into a numpy array or similar

Posted 2024-05-12 20:21:49


I have a txt file:

.xsh 1:
..sxi
..kuxz
...iucdb
...khjub
..kjb
.hjub 2:
..ind
..ljnasdc
...kicd
...lijnbcd
.split 3:
..asd

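I want to load this file into a numpy array (because numpy is fast to work with), so that the parsing is already under way at load time. In other words, I want it to split the file at each delimiter, meaning each header line such as .xsh 1: (the pattern assigned to delim in the code below).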

This is what I have tried so far:

import numpy as np
import os, re
path = 'C:\\temp'
filename = 'file.txt'
delim = r'(^\.\w+\s\d+\:)'
delimFunc = (lambda s: re.split(delim, s))
fname = os.path.join(path, filename)
# attempt: pass a regex-splitting function as the delimiter
ar = np.loadtxt(fname, dtype=str, delimiter=delimFunc)
print(len(ar))

This does not split the way I want; instead it splits at every newline. Is it possible to make numpy, pandas, or any other fast library behave the way I want here?

The result I want:

[[.xsh 1:
..sxi
..kuxz
...iucdb
...khjub
..kjb]
[.hjub 2:
..ind
..ljnasdc
...kicd
...lijnbcd]
[.split 3:
..asd]]

3 answers

I had to solve the problem a different way, but it is faster than before:

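import os
import re
import time

import numpy as np

t1 = time.time()
path = 'C:\\temp'
filename = 'file.txt'
delim = r'(^\.\w+\s\d+\:)'
fname = os.path.join(path, filename)
# one array element per line; read directly, since newer NumPy releases may reject '\n' as a loadtxt delimiter
with open(fname) as f:
    ar = np.array(f.read().splitlines())
# indices of the header lines that start each section
x = np.array([i for i, v in enumerate(ar) if re.search(delim, v)], dtype=np.intp)

t2 = time.time()
# when the file starts with a header, x[0] is 0, so chunk [0] is empty and [1] is the first section
print(np.split(ar, x)[1])
print('Number of delimiters found: {0:d}, took {1:.2f} s to complete'.format(len(x), t2 - t1))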
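The answer above scans the lines one by one to find the headers. As an alternative sketch that is not part of the original answer, and assuming the whole file fits in memory, the standard-library re module can do the split in a single call by using a zero-width lookahead with re.MULTILINE (this needs Python 3.7 or later, where re.split accepts empty matches). The sample string and the names header_start, chunks and sections are made up for illustration:

```python
import re

# In-memory stand-in for the txt file from the question
sample = (
    ".xsh 1:\n..sxi\n..kuxz\n...iucdb\n...khjub\n..kjb\n"
    ".hjub 2:\n..ind\n..ljnasdc\n...kicd\n...lijnbcd\n"
    ".split 3:\n..asd\n"
)

# Zero-width match at the start of every header line such as ".xsh 1:"
header_start = re.compile(r"^(?=\.\w+\s\d+:)", re.MULTILINE)

# Splitting at position 0 yields an empty first piece, so filter it out
chunks = [c for c in header_start.split(sample) if c]

# One list of lines per section
sections = [c.splitlines() for c in chunks]

for section in sections:
    print(section)
```

Each element of sections is a plain list of strings; it can be wrapped in np.array afterwards if numpy operations are needed.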

I would go about it like this:

...
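d = re.compile(delim)
# np.nonzero returns a 1-tuple of arrays here, so unwrap it with [0]
ixs = np.nonzero([d.search(item) is not None for item in ar])[0]
# if the first record is a delimiter (ixs[0] == 0), drop that index to avoid an empty leading chunk
splitted = np.split(ar, ixs if ixs[0] else ixs[1:])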
...

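The expression ixs if ixs[0] else ixs[1:] accounts for whether the first record contains a valid "delimiter", so that you get the kind of result shown in your original question (i.e., without an empty leading record).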
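To make the snippet above easier to try, here is a minimal self-contained sketch of the same idea. It assumes the file has already been read into one string per line; the in-memory sample and the names is_header, split_at and sections are illustrative additions, and np.flatnonzero is used in place of np.nonzero(...)[0] purely for brevity.

```python
import re

import numpy as np

# In-memory stand-in for the txt file from the question
sample = """.xsh 1:
..sxi
..kuxz
...iucdb
...khjub
..kjb
.hjub 2:
..ind
..ljnasdc
...kicd
...lijnbcd
.split 3:
..asd"""

ar = np.array(sample.splitlines())
d = re.compile(r"^\.\w+\s\d+:")

# Boolean mask of header lines, then their positions
is_header = np.array([d.search(item) is not None for item in ar])
ixs = np.flatnonzero(is_header)

# Drop a leading 0 so np.split does not emit an empty first chunk
split_at = ixs[1:] if ixs.size and ixs[0] == 0 else ixs
sections = np.split(ar, split_at)

for section in sections:
    print(section)
```

The ixs.size check is a small guard so that a file without any header line does not raise an IndexError; in that case np.split returns the whole array as a single section.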

I think pandas supports this out of the box, if that is an option for you.

See https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

The sep parameter:

sep : str, default ‘,’

Delimiter to use. If sep is None, the C engine cannot automatically detect the separator, but the Python parsing engine can, meaning the latter will be used and automatically detect the separator by Python’s builtin sniffer tool, csv.Sniffer. In addition, separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex example: '\r\t'

You can also convert a pandas DataFrame back to a numpy array with the .values attribute, iirc

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.values.html
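One caveat worth checking before relying on sep: it separates fields within a line, so a regex sep on its own may not produce the multi-line groups asked for in the question. If pandas is an option, the following is a rough sketch of one way to get those groups; it is an assumption about how one might do it, not something taken from the pandas documentation quoted above. It labels each line with a running count of header lines and groups on that label. The sample data and the names lines, is_header, section_id and sections are illustrative.

```python
import pandas as pd

# In-memory stand-in for the txt file from the question
sample = """.xsh 1:
..sxi
..kuxz
...iucdb
...khjub
..kjb
.hjub 2:
..ind
..ljnasdc
...kicd
...lijnbcd
.split 3:
..asd"""

# One row per line of the file
lines = pd.Series(sample.splitlines())

# True on header lines such as ".xsh 1:" (str.match anchors at the start)
is_header = lines.str.match(r"\.\w+\s\d+:")

# Running count of headers: every line gets the number of its section
section_id = is_header.cumsum()

# One numpy array per section, in file order
sections = [group.to_numpy() for _, group in lines.groupby(section_id)]

for section in sections:
    print(section)
```

Any lines that appear before the first header would end up in a group labelled 0, which can be dropped if it is not wanted.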
