How to use a regex as the delimiter function when reading a file into a numpy array or similar

import numpy as np
import os, re

path = 'C:\\temp'
filename = 'file.txt'
delim = r'(^\.\w+\s\d+\:)'
delimFunc = lambda s: re.split(delim, s)
fname = os.path.join(path, filename)
# Attempted call: np.loadtxt expects a string delimiter, not a function
ar = np.loadtxt(fname, dtype=str, delimiter=delimFunc)
print(len(ar))

3 Answers

User

Answer 1 · edited 2024-05-12 20:21:49

I had to solve the problem a different way, but it is faster than before:

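import numpy as np
import os, re
import time

t1 = time.time()
path = 'C:\\temp'
filename = 'file.txt'
delim = r'(^\.\w+\s\d+\:)'
fname = os.path.join(path, filename)
# Read one line per array element (recent NumPy releases may reject a newline delimiter)
ar = np.loadtxt(fname, dtype=str, delimiter='\n')
# Collect the indices of the lines that match the delimiter pattern
x = np.array([], np.int32)
for (i, v) in enumerate(ar):
    if re.search(delim, v):
        x = np.append(x, i)

t2 = time.time()
# Split the array at the matching lines and show the second chunk
print(np.split(ar, x)[1])
print('Length of array: {0:d}, took {1:.2f} s to complete'.format(len(x), t2 - t1))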

User

Answer 2 · edited 2024-05-12 20:21:49

I would do it like this:

...
d = re.compile(delim)
# np.nonzero returns a 1-tuple of arrays here, so unwrap it with [0]
ixs = np.nonzero([d.search(item) is not None for item in ar])[0]
# Skip a leading index 0 so the first chunk is not empty
splitted = np.split(ar, ixs if ixs[0] else ixs[1:])
...

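The expression ixs if ixs[0] else ixs[1:] checks whether the first record already contains a valid "delimiter". If it does, the leading index 0 is dropped, because splitting at index 0 would produce an empty first chunk. This gives the kind of result shown in your original question (i.e., no empty record).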
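To make the snippet above easier to try, here is a minimal self-contained sketch of the same idea. The sample lines are hypothetical, since the contents of the original file are not shown, and the explicit size check is an added assumption to cover the case where no line matches at all.

```python
import re
import numpy as np

# Hypothetical sample lines; the real file's contents are not shown in the question.
lines = np.array([
    ".alpha 1:",
    "a b c",
    "d e f",
    ".beta 2:",
    "g h i",
])

delim = re.compile(r"^\.\w+\s\d+:")

# Boolean mask marking the lines that match the delimiter pattern.
is_header = np.array([delim.search(s) is not None for s in lines])
# np.nonzero returns a tuple with one index array per dimension; take the first.
ixs = np.nonzero(is_header)[0]

# Drop a leading index 0 so np.split does not emit an empty first chunk.
if ixs.size and ixs[0] == 0:
    ixs = ixs[1:]

chunks = np.split(lines, ixs)
for chunk in chunks:
    print(chunk.tolist())
```

Each element of chunks is a sub-array that starts with a matching header line, followed by the lines that belong to it.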

User

Answer 3 · edited 2024-05-12 20:21:49

I think pandas supports this out of the box, if that is an option for you.

Take a look at https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

The sep parameter:

sep : str, default ‘,’
Delimiter to use. If sep is None, the C engine cannot automatically detect the separator, but the Python parsing engine can, meaning the latter will be used and automatically detect the separator by Python’s builtin sniffer tool, csv.Sniffer. In addition, separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex example: '\r\t'

You can also convert the pandas DataFrame back to a numpy array with the .values attribute, IIRC

(https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.values.html)
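As a rough illustration of the sep behaviour quoted above, the sketch below reads a small in-memory table whose fields are separated by either ";" or ",". The data and the pattern are hypothetical and are not taken from the question; engine="python" is passed explicitly because regex separators need the Python parsing engine.

```python
import io
import pandas as pd

# Hypothetical data with mixed delimiters (not the file from the question).
text = "a;b,c\n1;2,3\n4;5,6\n"

# A separator longer than one character is treated as a regular expression.
df = pd.read_csv(io.StringIO(text), sep=r"[;,]", engine="python")

# Convert the DataFrame back to a numpy array.
arr = df.values
print(arr)
```

Note that this splits fields within each line. It does not group lines into records the way the np.split answers do, so whether it fits depends on the file layout.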
