Reading lines with a special end-of-line character/string from a text file in Python
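I need to read lines from a text file, but the "end of line" marker is not necessarily \n or \x; it may be any combination of characters, such as 'xyz' or '|'. The marker is fixed and known for each type of file.
The text file may be large, so I have to consider performance and memory use. What is the best solution? At the moment I use string.read(1000) combined with split(myendofline) or partition(myendofline), but I would like to know whether there is a more elegant, standard solution.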
4 Answers
TextFileData.split(EndOfLine_char)
This looks like the solution you are after. If it is not fast enough, consider a lower-level approach.
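As a minimal sketch of this approach (the file name records.txt, the helper name read_records and the '|' delimiter are placeholders of mine, not from the question), the whole file can be read in one go and split. This keeps the entire content in memory, so it only suits files that fit comfortably in RAM.

```python
import os
import tempfile

def read_records(path, eol):
    """Return all records of the file as a list, split on the eol string."""
    # newline='' stops Python from translating \r\n or \r inside the data
    with open(path, newline='') as fh:
        return fh.read().split(eol)

# Build a small throwaway file to try it on.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, 'records.txt')
with open(path, 'w', newline='') as fh:
    fh.write('first|second|third')

records = read_records(path, '|')
print(records)
```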
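The easiest way is to read the whole file in and split it with .split('|').
If that is not acceptable because it holds everything in memory, you can read the file in chunks and split each chunk. You can write a class that reads another chunk whenever the current one is used up, so the rest of the program does not need to know about it.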
Here is the input file, zen.txt:
The Zen of Python, by Tim Peters||Beautiful is better than ugly.|Explicit is better than implicit.|Simple is better than complex.|Complex is better than complicated.|Flat is better than nested.|Sparse is better than dense.|Readability counts.|Special cases aren't special enough to break the rules.|Although practicality beats purity.|Errors should never pass silently.|Unless explicitly silenced.|In the face of ambiguity, refuse the temptation to guess.|There should be one-- and preferably only one --obvious way to do it.|Although that way may not be obvious at first unless you're Dutch.|Now is better than never.|Although never is often better than *right* now.|If the implementation is hard to explain, it's a bad idea.|If the implementation is easy to explain, it may be a good idea.|Namespaces are one honking great idea -- let's do more of those!
Here is my small test case, which works for me. It does not handle many special cases and is not especially pretty, but it should get you started.
class SpecialDelimiters(object):
    def __init__(self, filehandle, terminator, chunksize=10):
        self.file = filehandle
        self.terminator = terminator
        self.chunksize = chunksize
        self.chunk = ''
        self.lines = []
        self.done = False

    def __iter__(self):
        return self

    def __next__(self):
        if self.done:
            raise StopIteration
        try:
            return self.lines.pop(0)
        except IndexError:
            # The lines list is empty, so let's read some more!
            while True:
                # Looping so even if our chunksize is smaller than one line we get at least one chunk
                newchunk = self.file.read(self.chunksize)
                self.chunk += newchunk
                rawlines = self.chunk.split(self.terminator)
                if len(rawlines) > 1 or not newchunk:
                    # we want to keep going until we have at least one block
                    # or reached the end of the file
                    break
            self.lines.extend(rawlines[:-1])
            self.chunk = rawlines[-1]
            try:
                return self.lines.pop(0)
            except IndexError:
                # The end of the road, return last remaining stuff
                self.done = True
                return self.chunk

# Text mode; newline='' keeps any \r or \n in the data untranslated
zenfh = open('zen.txt', newline='')
zenBreaker = SpecialDelimiters(zenfh, '|')
for line in zenBreaker:
    print(line)
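Here is a generator function that acts as an iterator over a file, splitting it into lines according to the special newline used in that file.
It reads the file in chunks of lenchunk characters and yields the lines of each chunk in turn.
Because the newline in my example is 3 characters long (':;:'), a chunk may end in the middle of a newline; the generator function handles this case and still yields the lines correctly.
With a one-character newline the function could be simplified. I wrote only the function for the most delicate case.
Using this function, a file can be read one line at a time without loading the whole file into memory.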
from random import randrange, choice

# this part creates an example file whose newline is :;:
alphabet = 'abcdefghijklmnopqrstuvwxyz '
ch = ':;:'.join(''.join(choice(alphabet) for nc in range(randrange(0, 40)))
                for i in range(50))
with open('fofo.txt', 'w', newline='') as g:
    g.write(ch)

# this generator function is an iterator for a file
# if nl receives an argument whose bool is True,
# the newlines :;: are returned in the lines
def liner(filename, eol, lenchunk, nl=0):
    # nl = 0 or 1 acts as 0 or 1 in splitlines()
    L = len(eol)
    NL = len(eol) if nl else 0
    with open(filename, newline='') as f:
        chunk = f.read(lenchunk)
        tail = ''
        while chunk:
            last = chunk.rfind(eol)
            if last == -1:
                kept = chunk
                newtail = ''
            else:
                kept = chunk[0:last+L]    # here: L
                newtail = chunk[last+L:]  # here: L
            chunk = tail + kept
            x = y = 0
            while y+1:
                y = chunk.find(eol, x)
                if y+1: yield chunk[x:y+NL]  # here: NL
                else: break
                x = y+L  # here: L
            # carry over whatever follows the last complete newline, so that
            # a chunk containing no newline at all is not silently dropped
            tail = chunk[x:] + newtail
            chunk = f.read(lenchunk)
        yield tail

# lenchunk is required; any positive number of characters will do
for line in liner('fofo.txt', ':;:', 65536):
    print(line)
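The simplification mentioned above for a one-character newline could look roughly like the sketch below. The name liner_1char is mine, and it takes a file handle rather than a file name so that it can be tried on an in-memory io.StringIO. Because a single character can never be cut in two by a chunk boundary, each new chunk can be split on its own and only the unfinished last piece has to be carried over.

```python
import io

def liner_1char(fh, eol, lenchunk):
    """Yield records from fh, split on a one-character delimiter."""
    if len(eol) != 1:
        raise ValueError('this shortcut is only valid for a 1-character eol')
    tail = ''
    while True:
        chunk = fh.read(lenchunk)
        if not chunk:
            break
        pieces = chunk.split(eol)
        # the first piece continues the record left unfinished last time
        pieces[0] = tail + pieces[0]
        # the last piece may itself be unfinished: keep it for the next round
        tail = pieces.pop()
        for piece in pieces:
            yield piece
    yield tail

fh = io.StringIO('Beautiful|Explicit|Simple')
lines = list(liner_1char(fh, '|', 4))
print(lines)
```

With a delimiter of two or more characters this shortcut would miss a delimiter that straddles two chunks, which is the case the longer function above is written for.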
Here is the same code with some print statements added, so that the algorithm can be followed.
from random import randrange, choice

# this part creates an example file whose newline is :;:
alphabet = 'abcdefghijklmnopqrstuvwxyz '
ch = ':;:'.join(''.join(choice(alphabet) for nc in range(randrange(0, 40)))
                for i in range(50))
with open('fofo.txt', 'w', newline='') as g:
    g.write(ch)

# this generator function is an iterator for a file
# if nl receives an argument whose bool is True,
# the newlines :;: are returned in the lines
def liner(filename, eol, lenchunk, nl=0):
    L = len(eol)
    NL = len(eol) if nl else 0
    with open(filename, newline='') as f:
        ch = f.read()
        the_end = '\n\nxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'+\
                  '\nend of the file=='+ch[-50:]+\
                  '\nxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n'
        f.seek(0, 0)
        chunk = f.read(lenchunk)
        tail = ''
        while chunk:
            if (chunk[-1]==':' and chunk[-3:]!=':;:') or chunk[-2:]==':;':
                wr = [' ##########---------- cut newline cut ----------##########'+\
                      '\nchunk== '+chunk+\
                      '\n---------------------------------------------------']
            else:
                wr = ['chunk== '+chunk+\
                      '\n---------------------------------------------------']
            last = chunk.rfind(eol)
            if last == -1:
                kept = chunk
                newtail = ''
            else:
                kept = chunk[0:last+L]    # here: L
                newtail = chunk[last+L:]  # here: L
            wr.append('\nkept== '+kept+\
                      '\n---------------------------------------------------'+\
                      '\nnewtail== '+newtail)
            chunk = tail + kept
            wr.append('\n---------------------------------------------------'+\
                      '\ntail + kept== '+chunk+\
                      '\n---------------------------------------------------')
            print(''.join(wr))
            x = y = 0
            while y+1:
                y = chunk.find(eol, x)
                if y+1: yield chunk[x:y+NL]  # here: NL
                else: break
                x = y+L  # here: L
            # carry over whatever follows the last complete newline, so that
            # a chunk containing no newline at all is not silently dropped
            tail = chunk[x:] + newtail
            print('\n\n===================================================')
            chunk = f.read(lenchunk)
        yield tail
        print(the_end)

# The chunk size of 100 is an arbitrary choice for this trace (the call as
# posted did not pass one); the final 1 asks for the newlines to be kept.
for line in liner('fofo.txt', ':;:', 100, 1):
    print('line== '+line)
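Besides tracing by eye, one way to gain confidence in a chunked splitter is to compare its output with str.split for every chunk size, so that the newline gets cut at every possible position. The sketch below does this on an in-memory string. Note that split_stream is a compact stand-in of my own, not the liner() above; liner() could be checked the same way by writing the text to a file first.

```python
import io

def split_stream(fh, eol, lenchunk):
    """Hypothetical compact splitter: yield the records of fh split on eol."""
    buf = ''
    while True:
        chunk = fh.read(lenchunk)
        if not chunk:
            break
        buf += chunk
        pieces = buf.split(eol)
        # the last piece may be an unfinished record or end in a cut newline
        buf = pieces.pop()
        for piece in pieces:
            yield piece
    yield buf

# Includes an empty record, a lone ';:' inside a record and a trailing newline.
text = 'ab:;:cd:;::;:ef;:gh:;:'
expected = text.split(':;:')
mismatches = [size for size in range(1, len(text) + 2)
              if list(split_stream(io.StringIO(text), ':;:', size)) != expected]
print(mismatches)  # an empty list means every chunk size agreed with str.split
```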
Edit
I compared the execution times of my code and of chmullig's code.
I used a 'fofo.txt' file of about 10 MB, created with:
from random import randrange, choice

alphabet = 'abcdefghijklmnopqrstuvwxyz '
ch = ':;:'.join(''.join(choice(alphabet) for nc in range(randrange(0, 60)))
                for i in range(324000))
with open('fofo.txt', 'w', newline='') as g:
    g.write(ch)
and measured the times like this:
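# time.clock() no longer exists in current Python; perf_counter() replaces it
from time import perf_counter

te = perf_counter()
for line in liner('fofo.txt', ':;:', 65536):
    pass
print(perf_counter() - te)

fh = open('fofo.txt', newline='')
zenBreaker = SpecialDelimiters(fh, ':;:', 65536)
te = perf_counter()
for line in zenBreaker:
    pass
print(perf_counter() - te)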
Over several runs, the minimum times I observed were:
my code          0.7067 s
chmullig's code  0.8373 s
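Taking the minimum over several runs can be automated with the standard timeit module. The following is only a sketch: consume and best_time are helper names of mine, and the workload is a small in-memory stand-in rather than the 10 MB file, so the figure it prints is not comparable with the times reported here.

```python
import io
import timeit

def consume(make_iterator):
    """Exhaust the iterator returned by make_iterator()."""
    for _ in make_iterator():
        pass

def best_time(make_iterator, repeat=5):
    """Smallest time, in seconds, over `repeat` complete passes."""
    timings = timeit.repeat(lambda: consume(make_iterator),
                            number=1, repeat=repeat)
    return min(timings)

# Stand-in workload: split an in-memory text on ':;:'.
data = ':;:'.join(['some text'] * 10000)
best = best_time(lambda: iter(io.StringIO(data).read().split(':;:')))
print('%.4f s' % best)
```

To time one of the readers from this thread, pass a function that opens the file and returns the iterator, so that each pass starts again from the beginning of the file.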
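Edit 2
I modified my generator function into liner2(), which now takes a file handle instead of a file name. This leaves the opening of the file outside the timed section, as it is for chmullig's code.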
from time import perf_counter

def liner2(fh, eol, lenchunk, nl=0):
    L = len(eol)
    NL = len(eol) if nl else 0
    chunk = fh.read(lenchunk)
    tail = ''
    while chunk:
        last = chunk.rfind(eol)
        if last == -1:
            kept = chunk
            newtail = ''
        else:
            kept = chunk[0:last+L]    # here: L
            newtail = chunk[last+L:]  # here: L
        chunk = tail + kept
        x = y = 0
        while y+1:
            y = chunk.find(eol, x)
            if y+1: yield chunk[x:y+NL]  # here: NL
            else: break
            x = y+L  # here: L
        # carry over whatever follows the last complete newline, so that
        # a chunk containing no newline at all is not silently dropped
        tail = chunk[x:] + newtail
        chunk = fh.read(lenchunk)
    yield tail

fh = open('fofo.txt', newline='')
te = perf_counter()
for line in liner2(fh, ':;:', 65536):
    pass
print(perf_counter() - te)
After several runs to find the minimum time, the results are:
liner()          0.7067 s
liner2()         0.7064 s
chmullig's code  0.8373 s
In fact, the time spent opening the file is negligible compared with the total.