How to focus on a subset of a list in Python

1 vote
3 answers
609 views
Asked 2025-04-16 12:11

I run into this problem a lot. Suppose I have a text file that I have read into a list with file.readlines().

Suppose the file looks roughly like this:

stuff stuff stuff stuff stuff
stuff stuff stuff stuff stuff
stuff stuff stuff stuff stuff
stuff stuff stuff stuff stuff #indeterminate number of line \
The text I want is set off by something distinctive
I want this
I want this
I want this
I want this # indeterminate number of lines
The end is also identifiable by something distinctive
stuff stuff stuff stuff stuff
stuff stuff stuff stuff stuff
stuff stuff stuff stuff stuff

This is how I handle it:

themasterlist=[]
for file in filelist:
    count=0
    templist=[]
    for line in file:
        if line=='The text I want is set off by something distinctive':
            count=1
        if line=='The end is also identifiable by something distinctive':
            count=0
        if count==1:
            templist.append(line)
    themasterlist.append(templist)

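I have considered reading the file as a string (file.read()), splitting on the end markers, and then converting the result to a list, but I really want a method I can use for other kinds of content as well. For example, suppose I am iterating over the elements from lxml.fromstring(somefile) and want to process a subset of the elements based on whether an element's text contains some phrase, and so on.

Note that I may need to process 200,000 to 300,000 files at a time.

My solution works, but it feels clumsy, and I think I am missing something important about Python.

There were three very good answers, and I learned something useful from each of them. I have to pick one as the accepted answer, but I appreciate every poster's reply; they were really helpful.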

3 Answers

1

If each file contains only one block of interest, you can do this:

from itertools import dropwhile, takewhile
startline = "The text I want is set off by something distinctive"
endline = "The end is also identifiable by something distinctive"
masterlist = []
for file in filelist:
    next(dropwhile(lambda line: line != startline, file))
    masterlist.append(list(takewhile(lambda line: line != endline, file)))

But if each file contains an indeterminate number of blocks, it is not as simple and pretty:

for file in filelist:
    templist = []
    while True:
        try:
            # skip ahead to (and consume) the next start line
            next(dropwhile(lambda line: line != startline, file))
            # collect this block's lines into the per-file list
            templist += takewhile(lambda line: line != endline, file)
        except StopIteration:
            break
    masterlist.append(templist)

Note that this code assumes filelist is a list of open file objects.
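To try the multi-block approach without real files, io.StringIO objects can stand in for open file objects. The following is only a sketch under some assumptions: the helper name `blocks` and the constants `START` and `END` are made up for illustration, and each block is kept as its own list rather than flattened. It also strips the trailing newline from each line, since lines read from a file end with "\n" and would otherwise never compare equal to the marker strings.

```python
import io
from itertools import dropwhile, takewhile

START = "The text I want is set off by something distinctive"
END = "The end is also identifiable by something distinctive"


def blocks(file):
    """Return a list of blocks; each block is the list of lines between START and END.

    Hypothetical helper, not part of the original answer.
    """
    # Strip the trailing newline so lines can compare equal to the markers.
    lines = (line.rstrip("\n") for line in file)
    found = []
    while True:
        try:
            # Skip up to and including the next start marker.
            next(dropwhile(lambda line: line != START, lines))
        except StopIteration:
            return found  # no further start marker in this file
        # Collect lines until the end marker (the marker itself is consumed).
        found.append(list(takewhile(lambda line: line != END, lines)))


# io.StringIO stands in for an open file object here.
sample = io.StringIO(
    "stuff\n" + START + "\nI want this\nI want this\n" + END + "\n"
    "stuff\n" + START + "\nsecond block\n" + END + "\n"
)
result = blocks(sample)
print(result)
```

Because `dropwhile` and `takewhile` both pull from the same underlying iterator, each pass through the loop resumes where the previous one stopped.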

2

You can do this:

data = list(filelist)
topindex = data.index('The text I want is set off by something distinctive')
endindex = data.index('The end is also identifiable by something distinctive')
themasterlist = data[topindex+1:endindex]

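The code above raises an error (a ValueError from list.index) if the text you specify is not found, so be prepared for that. Also, I made sure data is a list because, despite the name filelist, I can't be sure it really is a list (it could be an iterator).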
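One way to prepare for the missing-marker case is to catch the ValueError that list.index raises. This is a minimal sketch; the function name `slice_between` and the choice to return an empty list on failure are assumptions for illustration, not something the question asked for.

```python
def slice_between(lines, start, stop):
    """Return the items strictly between start and stop, or [] if either is missing.

    Hypothetical helper for illustration.
    """
    data = list(lines)  # accept any iterable, not only lists
    try:
        top = data.index(start)
        # Look for the end marker only after the start marker.
        end = data.index(stop, top + 1)
    except ValueError:
        return []
    return data[top + 1:end]


example = ["stuff", "START", "I want this", "I want this", "END", "stuff"]
print(slice_between(example, "START", "END"))
print(slice_between(example, "START", "MISSING"))
```

Passing `top + 1` as the second argument to `index` keeps an end marker that happens to appear before the start marker from producing an empty or wrong slice.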

4

I like something like this:

def findblock( lines, start, stop ):
    it = iter(lines)
    for line in it:
        if start in line:
            # now we are in the block, so yield till we find the end
            for line in it:
                if stop in line:
                    # lets just look for one block
                    return # leave this generator
                    # break # would keep looking for the next block
                yield line                

for line in findblock(lines, start="something distinctive",
                             stop="something distinctive"):
    print(line)
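The question also mentions wanting the same approach for other iterables, such as elements from lxml. The generator idea carries over if the start and stop tests are passed in as functions instead of substrings. A rough sketch, where `between`, `is_start` and `is_stop` are hypothetical names introduced here:

```python
def between(items, is_start, is_stop):
    """Yield the items strictly between the first item matching is_start
    and the next item matching is_stop. Hypothetical helper."""
    it = iter(items)
    for item in it:
        if is_start(item):
            break
    else:
        return  # no start marker found, so yield nothing
    for item in it:
        if is_stop(item):
            return
        yield item


lines = ["stuff", "START", "I want this", "I want this too", "END", "stuff"]
wanted = list(between(lines,
                      is_start=lambda s: s == "START",
                      is_stop=lambda s: s == "END"))
print(wanted)
```

For XML elements, the predicates could presumably inspect each element's text instead of comparing whole strings; the generator itself does not care what kind of items it receives.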

What you are missing is "yield" and list comprehensions. Here is your code, modified:

def findblock( lines, start='The text I want is set off by something distinctive', 
                      stop='The end is also identifiable by something distinctive'):
    inblock = False  # must be set once, before the loop, or it resets on every line
    for line in lines:
        if line==start:
            inblock=True
        if line==stop:
            inblock=False # or return mb?
        if inblock:
            yield line

themasterlist = [list(findblock( file )) for file in files]
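Since the question mentions 200,000 to 300,000 files, it may be worth opening them one at a time rather than holding every file object open at once. The sketch below assumes the input is a list of paths; `blocks_from_paths` and its `extract` parameter are hypothetical names, and `extract` stands for any function that takes an open file and returns an iterable of lines, such as a block-finding generator.

```python
import os
import tempfile


def blocks_from_paths(paths, extract):
    """Open one file at a time, apply extract to it, and yield the result as a list.

    Each file is closed before the next one is opened. Hypothetical helper.
    """
    for path in paths:
        with open(path) as handle:
            # list() forces the extraction to finish while the file is still open.
            yield list(extract(handle))


# Tiny demonstration with a temporary file and a trivial extractor.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.txt")
    with open(path, "w") as out:
        out.write("a\nb\n")
    results = list(blocks_from_paths([path], lambda f: (line.strip() for line in f)))
print(results)
```

Wrapping the extraction in `list()` inside the `with` block matters: a lazy generator returned from a closed file would fail when it was finally consumed.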
