Parallel file matching, Python

Question

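I am trying to improve a script that scans files for malicious code. We have a file containing regular expression patterns, one pattern per line. These regexes were written for grep, since our current implementation is essentially a bash script combining find and grep. The bash script takes 358 seconds on my benchmark directory. I wrote a Python script that does the same job in 72 seconds, but I want to improve it further. First I'll post the base code, and then the tweaks I have tried: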

import os, sys, Queue, threading, re

fileList = []
rootDir = sys.argv[1]

class Recurser(threading.Thread):

    def __init__(self, queue, dir):
        self.queue = queue
        self.dir = dir
        threading.Thread.__init__(self)

    def run(self):
        self.addToQueue(self.dir)

    ## HELPER FUNCTION FOR INTERNAL USE ONLY
    def addToQueue(self, rootDir):
        # Walk the tree and enqueue the full path of every file
        for root, subFolders, files in os.walk(rootDir):
            for file in files:
                self.queue.put(os.path.join(root, file))
        # Sentinels: each scanner thread stops when it dequeues a -1
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)

class Scanner(threading.Thread):

    def __init__(self, queue, patterns):
        self.queue = queue
        self.patterns = patterns
        threading.Thread.__init__(self)

    def run(self):
        nextFile = self.queue.get()
        # "is not -1" only works because CPython caches small ints; "!= -1" would be safer
        while nextFile is not -1:
            #print "Trying " + nextFile
            self.scanFile(nextFile)
            nextFile = self.queue.get()


    #HELPER FUNCTION FOR INTERNAL USE ONLY
    def scanFile(self, file):
        fp = open(file)
        contents = fp.read()
        fp.close()  # release the file handle before scanning
        i = 0
        #for patt in self.patterns:
        if self.patterns.search(contents):
            print "Match " + str(i) + " found in " + file

############MAIN MAIN MAIN MAIN##################


fileQueue = Queue.Queue()

#Get the shell scanner patterns
patterns = []
fPatt = open('/root/patterns')
giantRE = '('
for line in fPatt:
   #patterns.append(re.compile(line.rstrip(), re.IGNORECASE))
   giantRE = giantRE + line.rstrip() + '|'

giantRE = giantRE[:-1] + ')'
giantRE = re.compile(giantRE, re.IGNORECASE)

#start recursing the directories
recurser = Recurser(fileQueue,rootDir)
recurser.start()

print "starting scanner"
#start checking the files
for scanner in xrange(0,8):
   scanner = Scanner(fileQueue, giantRE)
   scanner.start()

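This is obviously debugging code and it is ugly. Don't mind the pile of queue.put(-1) calls, I will clean those up later. Some of the indentation may not display properly, particularly in scanFile.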
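For illustration, the repeated queue.put(-1) calls could be collapsed into a loop that pushes exactly one sentinel per scanner thread. The following is only a minimal sketch, written for Python 3 (where the module is named queue) rather than the Python 2 used above. The names NUM_SCANNERS, SENTINEL, fill_queue, drain and run_demo are hypothetical and not part of the original script.

```python
import queue
import threading

NUM_SCANNERS = 8   # hypothetical: number of scanner threads
SENTINEL = None    # hypothetical end-of-queue marker (the script above uses -1)

def fill_queue(q, paths, num_workers=NUM_SCANNERS):
    """Producer: enqueue every path, then one sentinel per worker."""
    for path in paths:
        q.put(path)
    for _ in range(num_workers):
        q.put(SENTINEL)

def drain(q, results):
    """Worker: consume items until a sentinel is dequeued."""
    while True:
        item = q.get()
        if item is SENTINEL:
            break
        results.append(item)  # list.append is thread-safe in CPython

def run_demo(paths, num_workers=NUM_SCANNERS):
    """Start the workers, feed the queue, and return what they consumed."""
    q = queue.Queue()
    results = []
    workers = [threading.Thread(target=drain, args=(q, results))
               for _ in range(num_workers)]
    for w in workers:
        w.start()
    fill_queue(q, paths, num_workers)
    for w in workers:
        w.join()
    return sorted(results)
```

Because each worker consumes exactly one sentinel before exiting, the number of sentinels only needs to equal the number of workers.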

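Anyway, I have noticed a few things. Using 1, 4, or even 8 threads (the xrange(0,???) in the scanner loop) makes no difference. Execution time stays at about 72 seconds regardless. I assume this is due to Python's global interpreter lock (GIL).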
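If the GIL is indeed the bottleneck, one common way around it for CPU-bound work is to use separate processes instead of threads. The sketch below uses the standard multiprocessing module (available since Python 2.6) and is written for Python 3. It is an assumption about what might help, not a measured result, and the names init_worker, scan_file, iter_files and scan_tree are hypothetical.

```python
import multiprocessing
import os
import re

_PATTERN = None  # compiled once per worker process by init_worker

def init_worker(pattern_text):
    """Compile the combined regex once in each worker process."""
    global _PATTERN
    _PATTERN = re.compile(pattern_text, re.IGNORECASE)

def scan_file(path):
    """Return the path if its contents match the pattern, else None."""
    try:
        with open(path, "r", errors="replace") as fp:
            contents = fp.read()
    except OSError:
        return None  # unreadable files are skipped
    return path if _PATTERN.search(contents) else None

def iter_files(root_dir):
    """Yield the full path of every file under root_dir."""
    for root, _dirs, files in os.walk(root_dir):
        for name in files:
            yield os.path.join(root, name)

def scan_tree(root_dir, pattern_text, processes=8):
    """Scan a directory tree with a pool of worker processes."""
    with multiprocessing.Pool(processes, init_worker, (pattern_text,)) as pool:
        hits = pool.imap_unordered(scan_file, iter_files(root_dir), chunksize=32)
        return sorted(p for p in hits if p is not None)
```

When run as a script, the call to scan_tree should sit under an if __name__ == "__main__": guard so that worker processes can import the module safely on platforms that spawn rather than fork.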

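To avoid the giant regex, I tried putting each line (pattern) into a list as its own compiled regex and iterating over that list in my scanFile function. This resulted in a longer execution time.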
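The two approaches being compared here can be sketched side by side. This is a Python 3 illustration with hypothetical helper names (combine_patterns, compile_each, matches_any). It wraps each pattern in a non-capturing group so that alternations inside one pattern cannot leak into its neighbours, and it skips blank lines, since an empty alternative would match every file. It also assumes the grep patterns are valid Python regexes, which may not hold for every line.

```python
import re

def combine_patterns(lines):
    """Join one-per-line patterns into a single compiled alternation."""
    parts = [line.rstrip("\r\n") for line in lines]
    parts = [p for p in parts if p]  # drop blank lines
    return re.compile("|".join("(?:%s)" % p for p in parts), re.IGNORECASE)

def compile_each(lines):
    """Compile every non-blank line as its own regex."""
    return [re.compile(line.rstrip("\r\n"), re.IGNORECASE)
            for line in lines if line.strip()]

def matches_any(compiled_list, text):
    """Try each regex in turn; stop at the first match."""
    return any(rx.search(text) for rx in compiled_list)
```

With the list, a clean file is scanned once per pattern, while the combined regex makes a single pass through the regex engine, which is consistent with the list version being slower.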

To get around Python's GIL, I tried having each thread fork out to grep, like so:

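#HELPER FUNCTION FOR INTERNAL USE ONLY
# (requires "import subprocess" at the top of the script)
def scanFile(self, file):
    # Run one grep process per file; -l prints the file name on a match
    s = subprocess.Popen(("grep", "-El", "--file=/root/patterns", file), stdout = subprocess.PIPE)
    output = s.communicate()[0]
    if output != '':
        print 'Match found in ' + file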

This resulted in a longer execution time.
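Starting one grep per file means paying process start-up and pattern loading for every single file, which likely accounts for part of the slowdown. A possible variation is to hand grep many files per invocation. The sketch below is in Python 3 and uses hypothetical names (build_grep_commands, run_grep_batches) and a hypothetical batch_size. The grep flags are the ones used above, plus -- to end option parsing. It assumes file names contain no newlines.

```python
import subprocess

def build_grep_commands(paths, pattern_file, batch_size=500):
    """Return one grep argv per batch of paths instead of one per file."""
    base = ["grep", "-El", "--file=" + pattern_file, "--"]
    return [base + list(paths[i:i + batch_size])
            for i in range(0, len(paths), batch_size)]

def run_grep_batches(paths, pattern_file, batch_size=500):
    """Run each batch and collect the file names that grep -l reports."""
    matched = []
    for argv in build_grep_commands(paths, pattern_file, batch_size):
        proc = subprocess.run(argv, stdout=subprocess.PIPE,
                              stderr=subprocess.DEVNULL)
        output = proc.stdout.decode(errors="replace")
        matched.extend(line for line in output.splitlines() if line)
    return matched
```

Whether this beats the in-process regex would have to be measured on the benchmark directory.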

Any suggestions for improving performance?

:::::::::::::EDIT::::::::

I can't answer my own question yet, but here are responses to a few of the points raised:

@David Nehme - Just to let people know, I am aware that I have a pile of queue.put(-1) calls.

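@Blender - That is to mark the bottom of the queue. My scanner threads keep dequeuing until they hit the -1 at the bottom (while nextFile is not -1:). There are 8 processor cores, but because of the GIL, using 1, 4, or 8 threads makes no difference. Spawning 8 subprocesses made the code run significantly slower (142 seconds versus 72 seconds).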

@ed - Yes, and it is just as slow as the find\grep combo, actually slower, because it indiscriminately greps files that don't need to be scanned.

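@Ron - Can't upgrade, this has to be universal. Do you think this would be faster than 72 seconds? The bash grep approach takes 358 seconds. My Python giant-regex method does it in 72 seconds with 1 to 8 threads. The popen method with 8 threads (8 subprocesses) ran in 142 seconds. So far the Python giant-regex method is the clear winner.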

@intued

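Here is the core of our current find\grep combo (not my script). It is pretty simple. There are some extra things in there, like ls, but nothing that should cause a 5x slowdown. Even if grep -r is slightly more efficient, 5x is a huge slowdown.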

 find "${TARGET}" -type f -size "${SZLIMIT}" -exec grep -Eaq --file="${HOME}/patterns" "{}" \; -and -ls | tee -a "${HOME}/found.txt"

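The Python code is more efficient. I don't know why, but I have tested it experimentally. I would prefer to do this in Python. I have already achieved a 5x speedup with Python, and I would like to speed it up further.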

:::::::::::::WINNER WINNER WINNER:::::::::::::::::

Looks like we have a winner.

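intued's shell script comes in second at 34 seconds, while @steveha's comes in first at 24 seconds. Since many of our machines do not have python2.6, I had to use cx_freeze on it. I can write a shell script wrapper to wget a tarball and unpack it. I do like intued's for its simplicity, though.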

Thank you all for your help. I now have an efficient sysadmin tool.



4 Answers
