Using awk in Python: how do I use an awk script inside a Python class?

Asked 2025-04-17 04:43

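I want to run an awk script from Python so that I can process some data.

Is there any way to run an awk script inside a Python class without using the system class to call it as a shell process? The framework in which I run these Python scripts does not allow subprocess calls, so I am stuck either converting my awk script to Python or, if it is possible, running the awk script from within Python.

Any suggestions? My awk script basically reads a text file and isolates the protein blocks that contain a specific chemical compound (the output is generated by our framework; I have added an example of what it looks like below), then prints them to a different file.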

    buildProtein compoundA compoundB
    begin fusion
    Calculate : (lots of text here on multiple lines)
    (more lines)
    Final result - H20: value CO2: value Compound: value 
    Other Compounds X: Value Y: value Z:value

    [...another similar block]

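For example, if I build a protein, I need to check whether the compound CH3COOH appears in the final result line. If it does, I have to take the whole block, from the "buildProtein" command up to the start of the next block, and save it to a file. Then I move on to the next block and check whether it also has the compound I am looking for. If it does not, I skip to the next one, and so on until the end of the file. The file contains multiple occurrences of the compound; sometimes the matching blocks are contiguous, and sometimes they alternate with blocks that do not have the compound.

Any help is more than welcome; I have been banging my head against this for weeks, and after finding this site I decided to ask for some help.

Thanks in advance for your kindness!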


3 Answers

I have only just started learning AWK, so I can't offer any advice on that front. However, here is some Python code that does what you need:

class ProteinIterator:
    "iterates over a build log, returning one Protein per 'buildProtein' block"
    def __init__(self, file):
        self.file = open(file, 'r')
        self.first_line = self.file.readline()
    def __iter__(self):
        return self
    def __next__(self):
        "returns the next protein build"
        if not self.first_line:     # reached end of file
            self.file.close()       # release the file handle once exhausted
            raise StopIteration
        file = self.file
        protein_data = [self.first_line]
        while True:
            line = file.readline()
            # a new block starts at 'buildProtein '; an empty string means EOF
            if line.startswith('buildProtein ') or not line:
                self.first_line = line
                break
            protein_data.append(line)
        return Protein(protein_data)

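class Protein:
    def __init__(self, data):
        self._data = data
        # default to empty tuples so that a block missing one of these
        # lines does not raise AttributeError when the attribute is read
        self.initial_compounds = ()
        self.final_compounds = ()
        self.other_compounds = ()
        for line in data:
            if line.startswith('buildProtein '):
                self.initial_compounds = tuple(line[13:].split())
            elif line.startswith('Final result - '):
                pieces = line[15:].split()[::2]   # every other piece is a name
                self.final_compounds = tuple([p[:-1] for p in pieces])  # drop the trailing ':'
            elif line.startswith('Other Compounds '):
                pieces = line[16:].split()[::2]   # every other piece is a name
                self.other_compounds = tuple([p[:-1] for p in pieces])  # drop the trailing ':'
    def __repr__(self):
        return "Protein(%s)" % self._data[0]
    @property
    def data(self):
        "the full text of the block, exactly as read from the file"
        return ''.join(self._data)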

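This iterates over a buildprotein text file and returns one protein at a time as a Protein object. The Protein object is smart enough to know its inputs, final results, and other results. You may have to modify some of the code if the actual text in the file is not exactly as shown in the question. Here is a short test of the code with example usage: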

if __name__ == '__main__':
    test_data = """\
buildProtein compoundA compoundB
begin fusion
Calculate : (lots of text here on multiple lines)
(more lines)
Final result - H20: value CO2: value Compound: value 
Other Compounds X: Value Y: value Z: value"""

    with open('testPI.txt', 'w') as f:   # close the file before reading it back
        f.write(test_data)
    for protein in ProteinIterator('testPI.txt'):
        print(protein.initial_compounds)
        print(protein.final_compounds)
        print(protein.other_compounds)
        print()
        if 'CO2' in protein.final_compounds:
            print(protein.data)

I didn't bother saving the values, but you can add that if you like. Hopefully this gets you going.
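If the values are needed too, one possible extension is to turn a `name: value` line into a dictionary. This is only a sketch and not part of the classes above: the helper name `parse_pairs` is made up for illustration, and it assumes every name is followed by a colon and a space and that every value is a single token without spaces, as in the sample.

```python
def parse_pairs(line, prefix):
    """Return {name: value} for a line such as 'Final result - H20: 1 CO2: 2'.

    Assumes the tokens after the prefix alternate 'name:' / 'value'.
    """
    if not line.startswith(prefix):
        raise ValueError("line does not start with %r" % prefix)
    tokens = line[len(prefix):].split()
    # Even positions hold the names (with a trailing ':'), odd positions the values.
    names = [tok[:-1] if tok.endswith(':') else tok for tok in tokens[0::2]]
    values = tokens[1::2]
    return dict(zip(names, values))
```

The values stay as strings; convert them with `float()` or similar once the real format of the numbers is known.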

Answered 2025-04-17 by Python大师

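Python's re module can help with this. If you can't be bothered with regular expressions and just need to separate some fields quickly, you can use the built-in string methods .split() and .find().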
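A rough sketch of the regular-expression route, assuming the file looks like the sample in the question: one pattern cuts the text into blocks that each start at a `buildProtein` line, and a second one picks the compound names out of the `Final result` line. The names `split_blocks`, `final_compounds` and `blocks_with_compound` are made up for this illustration, and the name pattern assumes that values never contain a colon.

```python
import re

# One block runs from a line starting with 'buildProtein ' up to (not including)
# the next such line, or to the end of the text.
BLOCK_RE = re.compile(r'^buildProtein .*?(?=^buildProtein |\Z)',
                      re.MULTILINE | re.DOTALL)
# On the 'Final result' line, a compound name is the token right before a colon.
NAME_RE = re.compile(r'([^\s:]+):')

def split_blocks(text):
    """Return the list of 'buildProtein' blocks found in text."""
    return BLOCK_RE.findall(text)

def final_compounds(block):
    """Return the compound names listed on the block's 'Final result' line."""
    for line in block.splitlines():
        if line.startswith('Final result - '):
            return NAME_RE.findall(line[len('Final result - '):])
    return []

def blocks_with_compound(text, compound):
    """Keep only the blocks whose final result names the given compound."""
    return [b for b in split_blocks(text) if compound in final_compounds(b)]
```

The matching blocks can then be written out with something like `out.write(''.join(blocks_with_compound(text, 'CH3COOH')))`. Comparing whole names avoids the false hit that a plain `.find('CO')` would give on `CO2`.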

Answered 2025-04-17 by Python大师

If you cannot use the subprocess module, your best bet is to rewrite your AWK script in Python. To that end, the fileinput module is a great transition tool with an AWK-like feel.
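As a sketch of what that could look like for the blocks in the question (an assumption about the format, not a translation of the actual AWK script): the loop below reads lines one at a time, the way an AWK program does, collects the current block, and emits it when the next `buildProtein` line or the end of input is reached. The function name `matching_blocks` is hypothetical.

```python
import fileinput

PREFIX = 'Final result - '

def matching_blocks(paths, compound):
    """Yield each 'buildProtein' block whose 'Final result' line names compound.

    paths is a file name or a sequence of file names, as accepted by fileinput.
    """
    block, wanted = None, False          # block is None until the first header
    with fileinput.FileInput(files=paths) as lines:
        for line in lines:
            if line.startswith('buildProtein '):
                if block is not None and wanted:
                    yield ''.join(block)
                block, wanted = [], False
            if block is None:
                continue                 # skip anything before the first block
            block.append(line)
            if line.startswith(PREFIX):
                # Tokens ending in ':' are taken to be compound names.
                names = [tok[:-1] for tok in line[len(PREFIX):].split()
                         if tok.endswith(':')]
                wanted = compound in names
    if block is not None and wanted:     # the last block has no following header
        yield ''.join(block)
```

Writing the result to another file is then a matter of opening the output file and calling `out.writelines(matching_blocks('input.txt', 'CH3COOH'))`, where both file names are placeholders.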

Answered 2025-04-17 by Python大师
