How to focus on a subset of a list in Python

1 vote

3 answers

609 views

Asked 2025-04-16 12:11

I run into this problem a lot. Suppose I have a text file that I have read into a list with file.readlines().

Suppose the file looks roughly like this:

stuff stuff stuff stuff stuff
stuff stuff stuff stuff stuff
stuff stuff stuff stuff stuff
stuff stuff stuff stuff stuff #indeterminate number of line \
The text I want is set off by something distinctive
I want this
I want this
I want this
I want this # indeterminate number of lines
The end is also identifiable by something distinctive
stuff stuff stuff stuff stuff
stuff stuff stuff stuff stuff
stuff stuff stuff stuff stuff

This is how I handle it:

themasterlist=[]
for file in filelist:
    count=0
    templist=[]
    for line in file:
        if line=='The text I want is set off by something distinctive':
            count=1
        if line=='The end is also identifiable by something distinctive':
            count=0
        if count==1:
            templist.append(line)
    themasterlist.append(templist)

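I had considered using a string (file.read()), splitting on the end points, and then converting the result to a list, but I really want to use this approach for other kinds of content as well. For example, suppose I am iterating over the elements of lxml.fromstring(somefile) and want to process a subset of the elements depending on whether an element's text contains some phrase, and so on.

Note that I may be processing 200,000 to 300,000 files at a time.

My solution works, but it feels clumsy, and I suspect I am missing something important about Python.

There were three very good answers, and I learned something useful from each. I have to pick one as the accepted answer, but I appreciate every poster's reply; they were really helpful.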

3 Answers

If each file contains only one block of interest, you can do this:

from itertools import dropwhile, takewhile
startline = "The text I want is set off by something distinctive"
endline = "The end is also identifiable by something distinctive"
masterlist = []
for file in filelist:
    next(dropwhile(lambda line: line != startline, file))
    masterlist.append(list(takewhile(lambda line: line != endline, file)))

But if each file contains an indeterminate number of blocks, it is not as simple or as pretty:

for file in filelist:
    templist = []
    while True:
        try:
            # skip ahead to the next start line and consume it
            next(dropwhile(lambda line: line != startline, file))
            # collect this block's lines into the per-file list
            templist += takewhile(lambda line: line != endline, file)
        except StopIteration:
            break
    masterlist.append(templist)

Note that this code assumes filelist is a list of open file objects.
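
One detail worth keeping in mind: lines read from a file object keep their trailing newline, so equality tests against the marker strings only match if the lines are stripped first (or the markers include the newline). Below is a minimal sketch of the single-block case with that stripping added. The helper name `extract_block` and the use of `io.StringIO` as a stand-in for an open file are assumptions made for illustration, not part of the original answer.

```python
import io
from itertools import dropwhile, takewhile

START = "The text I want is set off by something distinctive"
END = "The end is also identifiable by something distinctive"

def extract_block(fileobj):
    """Return the lines strictly between START and END (hypothetical helper)."""
    # Strip the trailing newline so the equality tests can match the markers.
    lines = (line.rstrip("\n") for line in fileobj)
    after_start = dropwhile(lambda line: line != START, lines)
    # Discard the start marker itself; the default avoids StopIteration
    # when the marker is missing.
    next(after_start, None)
    return list(takewhile(lambda line: line != END, after_start))

# io.StringIO stands in for an open text file here.
sample = io.StringIO(
    "stuff stuff\n" + START + "\nI want this\nI want this\n" + END + "\nstuff stuff\n"
)
print(extract_block(sample))
```

If the start marker never appears, this version returns an empty list instead of raising, which may or may not be what you want across a large batch of files.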

Answered 2025-04-16 by Python大师

You can do this:

data = list(filelist)
topindex = data.index('The text I want is set off by something distinctive')
endindex = data.index('The end is also identifiable by something distinctive')
themasterlist = data[topindex+1:endindex]

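The code above raises an error (ValueError) if the text you specify is not found, so be prepared for that. Also, I made sure data is a list because, despite its name, I cannot be sure that filelist is a list (it could be an iterator).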
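
As a rough sketch of what "being prepared" could look like, the version below wraps the two `index` calls in a `try`/`except ValueError` and returns an empty list when either marker is missing. The function name `slice_between` and the choice to return `[]` on failure are assumptions for illustration; searching for the end marker only after the start marker is also an addition not present in the answer above.

```python
def slice_between(data, start, end):
    """Return the items strictly between start and end (hypothetical helper)."""
    data = list(data)  # accept any iterable, as the answer above does
    try:
        top = data.index(start)
        # Begin the search for the end marker after the start marker.
        bottom = data.index(end, top + 1)
    except ValueError:
        # One of the markers is missing; treat that as "no block found".
        return []
    return data[top + 1:bottom]

lines = ["stuff", "BEGIN", "I want this", "I want this", "FINISH", "stuff"]
print(slice_between(lines, "BEGIN", "FINISH"))
print(slice_between(lines, "BEGIN", "no such marker"))
```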

Answered 2025-04-16 by Python大师

I like something like this:

def findblock( lines, start, stop ):
    it = iter(lines)
    for line in it:
        if start in line:
            # now we are in the block, so yield till we find the end
            for line in it:
                if stop in line:
                    # lets just look for one block
                    return # leave this generator
                    # break # would keep looking for the next block
                yield line                

for line in findblock(lines, start="something distinctive", 
                             stop="something distinctive"):
    print(line)

What you are missing is "yield" and list comprehensions. Here is your code, modified:

def findblock( lines, start='The text I want is set off by something distinctive', 
                      stop='The end is also identifiable by something distinctive'):
    # the flag must be set once, before the loop, so it persists across lines
    inblock = False
    for line in lines:
        if line==start:
            inblock=True
        if line==stop:
            inblock=False # or return mb?
        if inblock:
            yield line

themasterlist = [list(findblock( file )) for file in files]
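
Since the question also asks about applying the same idea to things other than lines (for example, elements whose text contains some phrase), here is one possible generalization, offered as a sketch rather than a definitive answer. The generator name `iter_blocks` and its predicate parameters `is_start` and `is_stop` are hypothetical; it yields every complete block, excludes both markers, and silently drops a block that never reaches a stop marker.

```python
def iter_blocks(items, is_start, is_stop):
    """Yield each run of items found between a start and a stop item."""
    block = None  # None means "currently outside a block"
    for item in items:
        if block is None:
            if is_start(item):
                block = []  # entered a block; the marker itself is not kept
        elif is_stop(item):
            yield block
            block = None
        else:
            block.append(item)

lines = [
    "stuff",
    "set off by something distinctive",
    "I want this",
    "I want this",
    "the end is identifiable",
    "stuff",
]
blocks = list(iter_blocks(
    lines,
    is_start=lambda s: "set off" in s,
    is_stop=lambda s: "the end" in s,
))
print(blocks)
```

Because the predicates are plain callables, the same generator could be fed any iterable, with predicates that inspect whatever attribute marks the boundaries.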

Answered 2025-04-16 by Python大师