使用Python查找多个单词并打印下一行

3 投票
6 回答
4627 浏览
提问于 2025-04-16 19:32

我有一个很大的文本文件,内容大致如下:

> <Enzymologic: Ki nM 1>
 257000

> <Enzymologic: IC50 nM 1>
n/a

> <ITC: Delta_G0 kJ/mole 1>
n/a

> <Enzymologic: Ki nM 1>
5000

> <Enzymologic: EC50/IC50 nM 1>
1000

.....

现在我想写一个Python脚本,来查找像(> <Enzymologic: Ki nM 1>> <Enzymologic: EC50/IC50 nM 1>)这样的词,并把每个词后面的那一行以制表符分隔的格式打印出来,格式如下:

> <Enzymologic: Ki nM 1>     > <Enzymologic: EC50/IC50 nM 1>
257000                       n/a
5000                         1000
.... 

我试过以下代码:

infile = path of the file
lines = infile.readlines()
infile.close()
searchtxt = "> <Enzymologic: IC50 nM 1>", "> <Enzymologic: Ki nM 1>"
for i, line in enumerate(lines): 
     if searchtxt in line and i+1 < len(lines):
         print lines[i+1]

但是它没有成功,有人能建议一些代码来实现这个功能吗?

提前谢谢大家!

6 个回答

0

你面临两个独立的问题:

解析文件并提取数据

import itertools

# let's imitate a file
pseudo_file = """
> <Enzymologic: Ki nM 1>
 257000

> <Enzymologic: IC50 nM 1>
n/a

> <ITC: Delta_G0 kJ/mole 1>
n/a

> <Enzymologic: Ki nM 1>
5000

> <Enzymologic: EC50/IC50 nM 1>
1000
""".split('\n')

def iterate_on_couple(iterable):
  """
    Iterate on two elements, by two elements
  """
  iterable = iter(iterable)
  for x in iterable:
    yield x, next(iterable)

plain_lines = (l for l in pseudo_file  if l.strip()) # ignore empty lines

results = {}

# store all results in a dictionary
for name, value in iterate_on_couple(plain_lines):
  results.setdefault(name, []).append(value)

# now you got a dictionary with all values linked to a name
print results

这段代码假设你的文件没有损坏,并且总是有这样的结构:

  • 空行
  • 名字
  • 数值

如果不是这样,你可能需要更强大的解决方案。

其次,这段代码会把所有的数值都存储在内存中,如果你的数值很多,这可能会成为一个问题。在这种情况下,你需要考虑一些存储方案,比如 shelve 模块或者 sqlite

将结果保存到文件中

import csv

def get(iterable, index, default):
  """
    Return an item from array or default if IndexError
  """
  try:
      return iterable[index]
  except IndexError:
      return default

names = results.keys() # get a list of all names

# now we write our tab separated file using the csv module
out = csv.writer(open('/tmp/test.csv', 'w'), delimiter='\t')

# first the header
out.writerow(names)

# get the size of the longest column
max_size = list(reversed(sorted(len(results[name]) for name in names)))[0]

# then write the lines one by one
for i in xrange(max_size):
    line = [get(results[name], i, "-") for name in names]
    out.writerow(line)

因为我正在为你写完整的代码,所以我故意使用了一些高级的Python写法,这样在使用的时候你可以思考一下。

1

我觉得你的问题出在你用的是 if searchtxt in line,而不是对你每个 patternif pattern in line。我会这样做:

>>> path = 'D:\\temp\\Test.txt'
>>> lines = open(path).readlines()
>>> searchtxt = "Enzymologic: IC50 nM 1", "Enzymologic: Ki nM 1"
>>> from collections import defaultdict
>>> dict_patterns = defaultdict(list)
>>> for i, line in enumerate(lines):
    for pattern in searchtxt:
        if pattern in line and i+1 < len(lines):
             dict_patterns[pattern].append(lines[i+1])

>>> dict_patterns
defaultdict(<type 'list'>, {'Enzymologic: Ki nM 1': ['257000\n', '5000\n'],
                            'Enzymologic: IC50 nM 1': ['n/a\n', '1000']})

使用字典可以把结果按模式分组(defaultdict 是一种方便的方式,可以让你不必手动初始化对象)。

1
s = '''Enzymologic: Ki nM 1

257000

Enzymologic: IC50 nM 1

n/a

ITC: Delta_G0 kJ/mole 1

n/a

Enzymologic: Ki nM 1

5000

Enzymologic: IC50 nM 1

1000'''
from collections import defaultdict

lines = [x for x in s.splitlines() if x]
keys = lines[::2]
values = lines[1::2]
result = defaultdict(list)
for key, value in zip(keys, values):
    result[key].append(value)
print dict(result)

>>> {'ITC: Delta_G0 kJ/mole 1': ['n/a'], 'Enzymologic: Ki nM 1': ['257000', '5000'], 'Enzymologic: IC50 nM 1': ['n/a', '1000']}

然后按照你喜欢的方式格式化输出。

撰写回答