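How to focus on a subset of a list in Python
I run into this problem a lot. Say I have a text file that I have read into a list with file.readlines().
Suppose the file looks roughly like this: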
stuff stuff stuff stuff stuff
stuff stuff stuff stuff stuff
stuff stuff stuff stuff stuff
stuff stuff stuff stuff stuff # indeterminate number of lines
The text I want is set off by something distinctive
I want this
I want this
I want this
I want this # indeterminate number of lines
The end is also identifiable by something distinctive
stuff stuff stuff stuff stuff
stuff stuff stuff stuff stuff
stuff stuff stuff stuff stuff
This is how I have been handling it:
themasterlist=[]
for file in filelist:
    count=0
    templist=[]
    for line in file:
        if line=='The text I want is set off by something distinctive':
            count=1
        if line=='The end is also identifiable by something distinctive':
            count=0
        if count==1:
            templist.append(line)
    themasterlist.append(templist)
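I considered reading the file as a string (file.read()), splitting on the end points, and converting the result back into a list, but I really want an approach I can reuse for other kinds of content. For example, suppose I am iterating over the elements of lxml.fromstring(somefile) and want to process a subset of them based on whether an element's text contains a certain phrase.
Note that I may be processing 200,000 to 300,000 files at a time.
My solution works, but it feels clumsy, and I suspect I am missing something important about Python.
There are three very good answers, and I learned something useful from each. I have to pick one as the accepted answer, but I appreciate every poster's reply; they were really helpful.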
3 Answers
1
If each file contains only one block of interest, you can do this:
from itertools import dropwhile, takewhile

startline = "The text I want is set off by something distinctive"
endline = "The end is also identifiable by something distinctive"
masterlist = []
for file in filelist:
    # Skip everything up to and including the start line
    next(dropwhile(lambda line: line != startline, file))
    # Collect lines until the end line is reached
    masterlist.append(list(takewhile(lambda line: line != endline, file)))
If each file can contain an indeterminate number of blocks, though, it is not quite as neat:
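for file in filelist:
    templist = []
    while True:
        try:
            # Skip past the next start line; raises StopIteration at end of file
            next(dropwhile(lambda line: line != startline, file))
            # Collect this block's lines into the per-file list
            templist += takewhile(lambda line: line != endline, file)
        except StopIteration:
            break
    masterlist.append(templist)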
Note that this code assumes filelist is a list of open file objects.
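One caveat with the snippets above: lines read from a file normally keep their trailing newline, so a plain equality test against the marker text may never match. Here is a minimal sketch that strips newlines first and also copes with a missing start marker. The helper name extract_block and the io.StringIO sample are my own assumptions for illustration, not part of the original code.

```python
import io
from itertools import dropwhile, takewhile

START = "The text I want is set off by something distinctive"
END = "The end is also identifiable by something distinctive"

def extract_block(fileobj, start=START, end=END):
    """Return the lines strictly between start and end, newline-stripped.

    extract_block is a hypothetical helper name used only for this sketch.
    """
    lines = (line.rstrip("\n") for line in fileobj)
    after_start = dropwhile(lambda line: line != start, lines)
    # Consume the start marker itself; the None default avoids a
    # StopIteration when the marker never appears.
    if next(after_start, None) is None:
        return []
    return list(takewhile(lambda line: line != end, after_start))

# In-memory stand-in for an open file object
sample = io.StringIO(
    "stuff\n"
    "The text I want is set off by something distinctive\n"
    "I want this\n"
    "I want this too\n"
    "The end is also identifiable by something distinctive\n"
    "stuff\n"
)
block = extract_block(sample)
```

Because everything is lazy, the file is read only up to the end marker, which may matter when you have hundreds of thousands of files.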
2
You can do this:
data = list(filelist)
topindex = data.index('The text I want is set off by something distinctive')
endindex = data.index('The end is also identifiable by something distinctive')
themasterlist = data[topindex+1:endindex]
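The code above raises an error if it cannot find the text you specify, so be prepared for that. Also, I made sure data is a list because, despite the name filelist, I cannot be sure it is one (it might be an iterator).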
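To make the "be prepared" part concrete, here is one way it could look: list.index raises ValueError when the item is missing, and its optional second argument lets the search for the end marker begin after the start marker. The function name slice_between and the BEGIN/END markers are placeholders I made up for this sketch.

```python
def slice_between(lines, start, end):
    """Return the items strictly between start and end, or [] if either is missing.

    slice_between is a hypothetical name used only for this sketch.
    """
    data = list(lines)  # works for lists and one-shot iterators alike
    try:
        top = data.index(start)
        # Look for the end marker only after the start marker.
        bottom = data.index(end, top + 1)
    except ValueError:
        return []
    return data[top + 1:bottom]

lines = ["stuff", "BEGIN", "I want this", "I want this", "END", "stuff"]
wanted = slice_between(lines, "BEGIN", "END")
```

Returning an empty list is just one choice; depending on your data you might prefer to log the file name or re-raise.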
4
I like something along these lines:
def findblock( lines, start, stop ):
    it = iter(lines)
    for line in it:
        if start in line:
            # now we are in the block, so yield till we find the end
            for line in it:
                if stop in line:
                    # let's just look for one block
                    return # leave this generator
                    # break # would keep looking for the next block
                yield line

for line in findblock(lines, start="something distinctive",
                      stop="something distinctive"):
    print(line)
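If a file can hold several blocks, the same nested-loop idea can be adapted to yield each block as its own list, which is roughly what swapping return for break gets you. This is only a sketch; find_blocks and the BEGIN/END markers are names I chose for the example.

```python
def find_blocks(lines, start, stop):
    """Yield each block (as a list) found between a start line and a stop line.

    find_blocks is a hypothetical name used only for this sketch.
    """
    it = iter(lines)
    for line in it:
        if start in line:
            block = []
            # Both loops share one iterator, so the outer loop resumes
            # right after the stop line.
            for line in it:
                if stop in line:
                    break
                block.append(line)
            # Note: a final block with no stop line is still yielded.
            yield block

text = ["x", "BEGIN a", "1", "2", "END a", "y", "BEGIN b", "3", "END b"]
blocks = list(find_blocks(text, "BEGIN", "END"))
```

Yielding whole blocks keeps them separate, at the cost of holding one block in memory at a time.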
What you are missing is "yield" and list comprehensions. Here is your code, modified:
def findblock( lines, start='The text I want is set off by something distinctive',
               stop='The end is also identifiable by something distinctive'):
    inblock = False # must be set once, before the loop, or it resets on every line
    for line in lines:
        if line==start:
            inblock=True
        if line==stop:
            inblock=False # or return mb?
        if inblock:
            yield line

themasterlist = [list(findblock( file )) for file in files]
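Since the question also mentions walking elements rather than lines, the same generator can take predicates instead of fixed strings. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in for lxml (an assumption on my part, so the example runs without extra installs). The function name between and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET

def between(items, is_start, is_stop):
    """Yield the items strictly between the first is_start match and the next is_stop match.

    between is a hypothetical name used only for this sketch.
    """
    inblock = False
    for item in items:
        if not inblock:
            # Still searching for the start of the block
            inblock = is_start(item)
        elif is_stop(item):
            return
        else:
            yield item

root = ET.fromstring(
    "<doc><p>intro</p><p>begin here</p><p>keep 1</p>"
    "<p>keep 2</p><p>end here</p><p>outro</p></doc>"
)
kept = [
    el.text
    for el in between(
        root,  # iterating an element yields its direct children
        is_start=lambda el: "begin" in (el.text or ""),
        is_stop=lambda el: "end" in (el.text or ""),
    )
]
```

Unlike the modified code above, this version leaves out the start marker itself; drop the elif branch ordering to taste if you want it included.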