Reading lines from a text file with a special end-of-line character/string in Python

1 vote
4 answers
729 views
Asked 2025-04-16 13:31

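I need to read lines from a text file, but the file's "end of line" marker is not necessarily \n or \r: it may be any combination of characters, such as 'xyz' or '|'. The marker is fixed and known in advance for each type of file.

Because the text file can be very large, I have to consider performance and memory usage, so what is the best solution? At the moment I use string.read(1000) combined with split(myendofline) or partition(myendofline), but I would like to know whether there is a more elegant, standard solution.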

4 Answers

1

TextFileData.split(EndOfLine_char) looks like the solution you are after. If it turns out not to be fast enough, you could consider a lower-level approach.
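
As a minimal sketch of this whole-file approach: the helper name `read_records` and the sample text are assumptions for illustration, and `io.StringIO` stands in for a real file opened with `open(path, newline='')`. Everything is held in memory at once, so this only suits files that fit comfortably in RAM.

```python
import io


def read_records(fh, delimiter):
    """Read everything from an open text file and split it on the delimiter."""
    return fh.read().split(delimiter)


# io.StringIO plays the role of the opened file in this sketch.
sample = io.StringIO('Beautiful is better than ugly.|Explicit is better than implicit.')
for record in read_records(sample, '|'):
    print(record)
```
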

2

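The simplest approach is to read the whole content in and split it with .split('|').

If that is not suitable because it holds everything in memory, you can read the file in chunks and split each chunk. You could write a class that transparently reads another chunk whenever the current one is used up, so the rest of the program never needs to know about it.

Here is the input file, zen.txt: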

The Zen of Python, by Tim Peters||Beautiful is better than ugly.|Explicit is better than implicit.|Simple is better than complex.|Complex is better than complicated.|Flat is better than nested.|Sparse is better than dense.|Readability counts.|Special cases aren't special enough to break the rules.|Although practicality beats purity.|Errors should never pass silently.|Unless explicitly silenced.|In the face of ambiguity, refuse the temptation to guess.|There should be one-- and preferably only one --obvious way to do it.|Although that way may not be obvious at first unless you're Dutch.|Now is better than never.|Although never is often better than *right* now.|If the implementation is hard to explain, it's a bad idea.|If the implementation is easy to explain, it may be a good idea.|Namespaces are one honking great idea -- let's do more of those!

Here is a small test case that works for me. It does not handle many special cases and is not particularly pretty, but it should get you started.

class SpecialDelimiters(object):
    """Iterate over an open file, returning lines split on a custom terminator."""

    def __init__(self, filehandle, terminator, chunksize=10):
        self.file = filehandle
        self.terminator = terminator
        self.chunksize = chunksize
        self.chunk = ''    # unterminated text carried over between reads
        self.lines = []    # complete lines waiting to be returned
        self.done = False

    def __iter__(self):
        return self

    def __next__(self):
        if self.done:
            raise StopIteration
        try:
            return self.lines.pop(0)
        except IndexError:
            # The lines list is empty, so let's read some more!
            while True:
                # Loop so that even if chunksize is smaller than one line
                # we keep reading until we have at least one complete line.
                newchunk = self.file.read(self.chunksize)
                self.chunk += newchunk
                rawlines = self.chunk.split(self.terminator)
                if len(rawlines) > 1 or not newchunk:
                    # Stop once we have at least one complete line
                    # or have reached the end of the file.
                    break
            self.lines.extend(rawlines[:-1])
            self.chunk = rawlines[-1]
            try:
                return self.lines.pop(0)
            except IndexError:
                # The end of the road: return whatever is left.
                self.done = True
                return self.chunk


# Text mode so the str terminator matches; newline='' disables newline translation.
zenfh = open('zen.txt', 'r', newline='')
zenBreaker = SpecialDelimiters(zenfh, '|')
for line in zenBreaker:
    print(line)
zenfh.close()

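
For comparison, the same chunked-reading idea can be written as a generator, which removes the explicit iterator state. This is only a sketch and not the answer's own code: the name `iter_records` and the default chunk size are assumptions, and `io.StringIO` stands in for a real file.

```python
import io


def iter_records(fh, terminator, chunksize=4096):
    """Yield records from fh, split on terminator, reading chunksize characters at a time."""
    buffer = ''
    while True:
        chunk = fh.read(chunksize)
        if not chunk:
            break
        buffer += chunk
        # Everything before the last terminator is complete. The final piece
        # may be a partial record (or end in a partial terminator), so it
        # stays in the buffer until more data arrives.
        *complete, buffer = buffer.split(terminator)
        for record in complete:
            yield record
    # Whatever follows the last terminator is the final record.
    yield buffer


sample = io.StringIO('one:;:two:;:three')
print(list(iter_records(sample, ':;:', chunksize=4)))  # ['one', 'two', 'three']
```

The buffer is re-split on every read, so a record much longer than the chunk size gets scanned repeatedly; a larger chunksize keeps that cost small.
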
2

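Here is a generator function that acts as an iterator over a file, cutting it into lines on the file's own special newline string.

It reads the file in chunks of lenchunk characters and yields the lines found in each chunk, chunk after chunk.

Because the newline in my example is 3 characters long (':;:'), a chunk may end in the middle of a newline. The generator function handles this case and still yields the lines correctly.

When the newline is a single character, the function can be simplified. I only wrote the function for the most delicate case.

Using this function lets you read a file one line at a time without loading the whole file into memory.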

from random import randrange, choice


# This part creates an example file whose newline is :;:
alphabet = 'abcdefghijklmnopqrstuvwxyz '
ch = ':;:'.join(''.join(choice(alphabet) for nc in range(randrange(0, 40)))
                for i in range(50))
with open('fofo.txt', 'w', newline='') as g:
    g.write(ch)


# This generator function is an iterator over a file.
# If nl receives an argument whose bool is True,
# the newlines :;: are kept at the end of the returned lines.

def liner(filename, eol, lenchunk, nl=0):
    # nl = 0 or 1 acts like keepends = 0 or 1 in splitlines()
    L = len(eol)
    NL = len(eol) if nl else 0
    with open(filename, 'r', newline='') as f:
        chunk = f.read(lenchunk)
        tail = ''
        while chunk:
            last = chunk.rfind(eol)
            if last == -1:
                kept = chunk
                newtail = ''
            else:
                kept = chunk[0:last+L]    # here: L
                newtail = chunk[last+L:]  # here: L
            chunk = tail + kept
            tail = newtail
            x = y = 0
            while y+1:
                y = chunk.find(eol, x)
                if y+1: yield chunk[x:y+NL]  # here: NL
                else: break
                x = y+L  # here: L
            # Carry over any text after the last newline found, so nothing is
            # lost when a chunk contains no complete newline.
            tail = chunk[x:] + tail
            chunk = f.read(lenchunk)
        yield tail


# lenchunk is required; 256 is an arbitrary chunk size for this small file.
for line in liner('fofo.txt', ':;:', 256):
    print(line)
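
The answer mentions that the function can be simplified when the newline is a single character, without showing how. One possible simplification is sketched here: the name `liner_single` is hypothetical, it takes an open file handle rather than a file name, and `io.StringIO` stands in for a real file. Since a one-character newline can never be cut in two by a chunk boundary, each chunk can be split on its own.

```python
import io


def liner_single(fh, eol, lenchunk=65536):
    """Yield lines from fh split on a one-character terminator eol."""
    if len(eol) != 1:
        raise ValueError('eol must be exactly one character')
    tail = ''
    # iter(callable, sentinel) keeps calling read() until it returns ''.
    for chunk in iter(lambda: fh.read(lenchunk), ''):
        pieces = chunk.split(eol)
        # Only the first piece can continue the leftover of the previous chunk.
        pieces[0] = tail + pieces[0]
        # The last piece is not terminated yet, so hold it back.
        tail = pieces.pop()
        for line in pieces:
            yield line
    yield tail


sample = io.StringIO('Beautiful is better than ugly.|Explicit is better than implicit.')
for line in liner_single(sample, '|', lenchunk=10):
    print(line)
```
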

Here is the same liner() code with some print statements added, so that you can follow the algorithm as it runs.

from random import randrange, choice


# This part creates an example file whose newline is :;:
alphabet = 'abcdefghijklmnopqrstuvwxyz '
ch = ':;:'.join(''.join(choice(alphabet) for nc in range(randrange(0, 40)))
                for i in range(50))
with open('fofo.txt', 'w', newline='') as g:
    g.write(ch)


# This generator function is an iterator over a file.
# If nl receives an argument whose bool is True,
# the newlines :;: are kept at the end of the returned lines.

def liner(filename, eol, lenchunk, nl=0):
    L = len(eol)
    NL = len(eol) if nl else 0
    with open(filename, 'r', newline='') as f:
        # Read the whole file once, only to show its last 50 characters at the end.
        ch = f.read()
        the_end = '\n\nxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'+\
                  '\nend of the file=='+ch[-50:]+\
                  '\nxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n'
        f.seek(0, 0)
        chunk = f.read(lenchunk)
        tail = ''
        while chunk:
            # Flag chunks that end in the middle of a :;: newline.
            if (chunk[-1]==':' and chunk[-3:]!=':;:') or chunk[-2:]==':;':
                wr = [' ##########---------- cut newline cut ----------##########'+\
                     '\nchunk== '+chunk+\
                     '\n---------------------------------------------------']
            else:
                wr = ['chunk== '+chunk+\
                     '\n---------------------------------------------------']
            last = chunk.rfind(eol)
            if last == -1:
                kept = chunk
                newtail = ''
            else:
                kept = chunk[0:last+L]    # here: L
                newtail = chunk[last+L:]  # here: L
            wr.append('\nkept== '+kept+\
                      '\n---------------------------------------------------'+\
                      '\nnewtail== '+newtail)
            chunk = tail + kept
            tail = newtail
            wr.append('\n---------------------------------------------------'+\
                      '\ntail + kept== '+chunk+\
                      '\n---------------------------------------------------')
            print(''.join(wr))
            x = y = 0
            while y+1:
                y = chunk.find(eol, x)
                if y+1: yield chunk[x:y+NL]  # here: NL
                else: break
                x = y+L  # here: L
            # Carry over any text after the last newline found, so nothing is
            # lost when a chunk contains no complete newline.
            tail = chunk[x:] + tail
            print('\n\n===================================================')
            chunk = f.read(lenchunk)
        yield tail
        print(the_end)


# 100 is an arbitrary small chunk size so that several chunks are traced;
# nl=1 keeps the :;: newlines visible in the printed lines.
for line in liner('fofo.txt', ':;:', 100, 1):
    print('line== '+line)
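
The trace above hard-codes ':;:' when it flags a chunk that ends inside a newline. A more general version of that check might look like the sketch below; the helper name `ends_with_cut_eol` is hypothetical. Like the original test, it only inspects the end of the chunk, so it is a tracing aid rather than part of the splitting logic.

```python
def ends_with_cut_eol(chunk, eol):
    """Return True if chunk ends with a proper, non-empty prefix of eol.

    A chunk that ends with the complete eol is not counted as cut.
    """
    if chunk.endswith(eol):
        return False
    # Try every proper prefix of eol: eol[:1], eol[:2], ..., eol[:len(eol)-1].
    return any(chunk.endswith(eol[:k]) for k in range(1, len(eol)))


for chunk in ('abc:;', 'abc:', 'abc:;:', 'abc'):
    print(repr(chunk), ends_with_cut_eol(chunk, ':;:'))
```
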


Edit

I compared the execution time of my code with that of chmullig's code.

I used a 'fofo.txt' file of about 10 MB, created like this:

from random import randrange, choice

alphabet = 'abcdefghijklmnopqrstuvwxyz '
ch = ':;:'.join(''.join(choice(alphabet) for nc in range(randrange(0, 60)))
                for i in range(324000))
with open('fofo.txt', 'w', newline='') as g:
    g.write(ch)

and measured the times like this:

from time import perf_counter

te = perf_counter()
for line in liner('fofo.txt', ':;:', 65536):
    pass
print(perf_counter() - te)


fh = open('fofo.txt', 'r', newline='')
zenBreaker = SpecialDelimiters(fh, ':;:', 65536)

te = perf_counter()
for line in zenBreaker:
    pass
print(perf_counter() - te)
fh.close()

Taking the minimum time observed over several runs, I got:

my code          0.7067 seconds
chmullig's code  0.8373 seconds
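
Taking the minimum over several runs can also be automated with the standard timeit module. The sketch below is an assumption-laden illustration rather than the harness used for the figures above: `best_time` and `consume_split` are hypothetical names, and the workload is a plain in-memory split that you would swap for a loop over liner() or SpecialDelimiters.

```python
import timeit


def best_time(func, repeats=5):
    """Call func() `repeats` times and return the fastest run in seconds."""
    # number=1 makes each measurement one full call; min() discards runs
    # that were slowed down by other activity on the machine.
    return min(timeit.repeat(func, repeat=repeats, number=1))


data = ':;:'.join('line %d' % i for i in range(100000))


def consume_split():
    # Stand-in workload: iterate over every line and discard it.
    for line in data.split(':;:'):
        pass


print('%.4f seconds' % best_time(consume_split))
```
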


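Edit 2

I modified my generator function: liner2() now takes a file handle instead of a file name. That way the opening of the file can be kept out of the timed section, as it is for chmullig's code.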

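from time import perf_counter


def liner2(fh, eol, lenchunk, nl=0):
    L = len(eol)
    NL = len(eol) if nl else 0
    chunk = fh.read(lenchunk)
    tail = ''
    while chunk:
        last = chunk.rfind(eol)
        if last == -1:
            kept = chunk
            newtail = ''
        else:
            kept = chunk[0:last+L]    # here: L
            newtail = chunk[last+L:]  # here: L
        chunk = tail + kept
        tail = newtail
        x = y = 0
        while y+1:
            y = chunk.find(eol, x)
            if y+1: yield chunk[x:y+NL]  # here: NL
            else: break
            x = y+L  # here: L
        # Carry over any text after the last newline found, so nothing is
        # lost when a chunk contains no complete newline.
        tail = chunk[x:] + tail
        chunk = fh.read(lenchunk)
    yield tail

# The file is opened before the clock starts, so only the iteration is timed.
fh = open('fofo.txt', 'r', newline='')
te = perf_counter()
for line in liner2(fh, ':;:', 65536):
    pass
print(perf_counter() - te)
fh.close()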

After several runs, the minimum times observed were:

liner()          0.7067 seconds
liner2()         0.7064 seconds
chmullig's code  0.8373 seconds

In fact, the time spent opening the file is negligible compared with the total time.
