Find multiple words and print the next line using Python
I have a large text file that looks roughly like this:
> <Enzymologic: Ki nM 1>
257000
> <Enzymologic: IC50 nM 1>
n/a
> <ITC: Delta_G0 kJ/mole 1>
n/a
> <Enzymologic: Ki nM 1>
5000
> <Enzymologic: EC50/IC50 nM 1>
1000
.....
Now I want to write a Python script that searches for markers such as > <Enzymologic: Ki nM 1> and > <Enzymologic: EC50/IC50 nM 1>, and prints the line following each one in tab-separated format, like this:
> <Enzymologic: Ki nM 1> > <Enzymologic: EC50/IC50 nM 1>
257000 n/a
5000 1000
....
I tried the following code:
infile = path of the file
lines = infile.readlines()
infile.close()
searchtxt = "> <Enzymologic: IC50 nM 1>", "> <Enzymologic: Ki nM 1>"
for i, line in enumerate(lines):
    if searchtxt in line and i+1 < len(lines):
        print(lines[i+1])
But it did not work. Can anyone suggest some code to do this?
Thanks in advance!
6 Answers
0
You are facing two separate problems:
Parsing the file and extracting the data
# let's imitate a file
pseudo_file = """
> <Enzymologic: Ki nM 1>
257000
> <Enzymologic: IC50 nM 1>
n/a
> <ITC: Delta_G0 kJ/mole 1>
n/a
> <Enzymologic: Ki nM 1>
5000
> <Enzymologic: EC50/IC50 nM 1>
1000
""".split('\n')

def iterate_on_couple(iterable):
    """
    Iterate over the items two at a time: (name, value), (name, value), ...
    """
    iterable = iter(iterable)
    for x in iterable:
        # assumes an even number of items; an unpaired trailing name
        # makes next() fail (a RuntimeError on Python 3.7+)
        yield x, next(iterable)

plain_lines = (l for l in pseudo_file if l.strip())  # ignore empty lines

results = {}
# store all results in a dictionary
for name, value in iterate_on_couple(plain_lines):
    results.setdefault(name, []).append(value)

# now you have a dictionary with all values linked to a name
print(results)
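This code assumes your file is not corrupted and always has this structure:
- empty line
- name
- value
If that is not the case, you may need a more robust solution.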
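As a rough idea of what "more robust" could look like, here is a minimal sketch that does not rely on strict name/value alternation. It treats only lines starting with "> <" as names and takes the next non-empty line as the value. The parse_records helper and the "> <" prefix test are my own assumptions based on the sample in the question, not something the file format guarantees.

```python
from collections import defaultdict

def parse_records(lines, prefix="> <"):
    """Group the value that follows each header line by header name.

    Hypothetical helper: assumes headers start with `prefix`, and that a
    header directly followed by another header simply has no value.
    """
    results = defaultdict(list)
    current = None  # header still waiting for its value
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines are skipped wherever they appear
        if line.startswith(prefix):
            current = line  # a new header replaces any pending one
        elif current is not None:
            results[current].append(line)
            current = None  # only the first line after a header counts
        # lines with no pending header are ignored
    return dict(results)

sample = [
    "> <Enzymologic: Ki nM 1>", "257000", "",
    "> <Enzymologic: IC50 nM 1>",  # value missing on purpose
    "> <Enzymologic: Ki nM 1>", "5000", "stray line",
]
parsed = parse_records(sample)
print(parsed)
```

Whether a header without a value should be skipped silently, as here, or recorded as a gap depends on how you want the columns to line up later.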
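Secondly, this code keeps all the values in memory, which can become a problem if you have a lot of them. In that case you should look at a storage solution such as the shelve module or sqlite.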
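For the sqlite route, a minimal sketch with the standard sqlite3 module could look like the following. The store_pairs helper, the measurements table and its column names are hypothetical, and ":memory:" is used only to keep the demo self-contained; pass a file path to actually keep the data on disk.

```python
import sqlite3

def store_pairs(pairs, db_path=":memory:"):
    """Write (name, value) pairs to a table as they arrive.

    Hypothetical helper: table and column names are illustrative only.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS measurements (name TEXT, value TEXT)"
    )
    # executemany consumes any iterable, so a generator works here too
    conn.executemany(
        "INSERT INTO measurements (name, value) VALUES (?, ?)", pairs
    )
    conn.commit()
    return conn

pairs = [
    ("> <Enzymologic: Ki nM 1>", "257000"),
    ("> <Enzymologic: IC50 nM 1>", "n/a"),
    ("> <Enzymologic: Ki nM 1>", "5000"),
]
conn = store_pairs(iter(pairs))

# read back one column, in insertion order
ki_values = [
    row[0]
    for row in conn.execute(
        "SELECT value FROM measurements WHERE name = ? ORDER BY rowid",
        ("> <Enzymologic: Ki nM 1>",),
    )
]
print(ki_values)
conn.close()
```

Because the pairs are consumed one at a time, a generator over the input file can be passed in without first building a full list of pairs in memory.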
Saving the results to a file
import csv

def get(iterable, index, default):
    """
    Return the item at index, or default if the index is out of range
    """
    try:
        return iterable[index]
    except IndexError:
        return default

names = list(results)  # get a list of all names

# get the size of the longest column
max_size = max(len(results[name]) for name in names)

# now we write our tab-separated file using the csv module
with open('/tmp/test.csv', 'w', newline='') as f:
    out = csv.writer(f, delimiter='\t')
    # first the header
    out.writerow(names)
    # then write the lines one by one
    for i in range(max_size):
        line = [get(results[name], i, "-") for name in names]
        out.writerow(line)
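Since I am writing the complete code for you, I deliberately used some more advanced Python idioms, so that you have something to think about while using it.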
1
I think your problem is that you test "if searchtxt in line" instead of testing "if pattern in line" for each of your patterns. I would do it like this:
>>> path = 'D:\\temp\\Test.txt'
>>> lines = open(path).readlines()
>>> searchtxt = "Enzymologic: IC50 nM 1", "Enzymologic: Ki nM 1"
>>> from collections import defaultdict
>>> dict_patterns = defaultdict(list)
>>> for i, line in enumerate(lines):
...     for pattern in searchtxt:
...         if pattern in line and i+1 < len(lines):
...             dict_patterns[pattern].append(lines[i+1])
...
>>> dict_patterns
defaultdict(<type 'list'>, {'Enzymologic: Ki nM 1': ['257000\n', '5000\n'],
'Enzymologic: IC50 nM 1': ['n/a\n', '1000']})
Using a dictionary lets you group the results by pattern (defaultdict is a convenient way to avoid initializing the lists by hand).
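To see why the original test fails, note that searchtxt is a tuple, and Python refuses to check whether a tuple is contained in a string. The small sketch below reproduces that and shows a per-pattern test with any(); the variable names are only illustrative.

```python
patterns = ("> <Enzymologic: IC50 nM 1>", "> <Enzymologic: Ki nM 1>")
line = "> <Enzymologic: Ki nM 1>\n"

# A tuple on the left of "in" with a string on the right is a type error
try:
    patterns in line
    outcome = "no error"
except TypeError:
    outcome = "TypeError"

# Checking each pattern separately is what was intended
matched = any(pattern in line for pattern in patterns)
print(outcome, matched)
```

Use any() when you only need to know that some pattern matched. The explicit inner loop above is the better fit when you also need to know which one.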
1
s = '''Enzymologic: Ki nM 1
257000
Enzymologic: IC50 nM 1
n/a
ITC: Delta_G0 kJ/mole 1
n/a
Enzymologic: Ki nM 1
5000
Enzymologic: IC50 nM 1
1000'''
from collections import defaultdict
lines = [x for x in s.splitlines() if x]
keys = lines[::2]
values = lines[1::2]
result = defaultdict(list)
for key, value in zip(keys, values):
    result[key].append(value)
print(dict(result))
# {'Enzymologic: Ki nM 1': ['257000', '5000'], 'Enzymologic: IC50 nM 1': ['n/a', '1000'], 'ITC: Delta_G0 kJ/mole 1': ['n/a']}
Then format the output however you like.
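One possible way to get the tab-separated layout asked for in the question is to line the columns up with itertools.zip_longest. This is only a sketch: the to_tsv helper and the "-" placeholder for missing values are my own choices, and result is re-declared here so the snippet runs on its own.

```python
from itertools import zip_longest

result = {
    "Enzymologic: Ki nM 1": ["257000", "5000"],
    "Enzymologic: IC50 nM 1": ["n/a", "1000"],
    "ITC: Delta_G0 kJ/mole 1": ["n/a"],
}

def to_tsv(columns, wanted, fill="-"):
    """Return a header row plus one row per index, tab-separated.

    Hypothetical helper: columns shorter than the longest one are
    padded with `fill`.
    """
    rows = ["\t".join(wanted)]
    selected = (columns.get(name, []) for name in wanted)
    for row in zip_longest(*selected, fillvalue=fill):
        rows.append("\t".join(row))
    return "\n".join(rows)

table = to_tsv(result, ["Enzymologic: Ki nM 1", "Enzymologic: IC50 nM 1"])
print(table)
```

Note that rows are paired purely by position, so the n-th Ki value ends up next to the n-th IC50 value even if they came from different records in the file.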