Python, GTK, WebKit and scraping: a large-memory problem

Question

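I'm trying to mirror the content of a website, but unfortunately most of the site is driven by JavaScript, including the code that generates the links. That defeats most standard web-scraping tools (such as httrack), because their JavaScript handling is either nonexistent or very unreliable.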

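So I decided to write my own program in Python and use the WebKit engine to process the HTML. The logic seemed simple enough: I build a dictionary whose keys are the links I've found and whose values are 0 or 1, indicating whether each link has been processed yet. I got the basic logic working reasonably well with PyQt4, but it kept crashing at random, which made me lose confidence in it. Then I found this: http://blog.motane.lu/2009/06/18/pywebkitgtk-execute-javascript-from-python/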
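The bookkeeping described above can be sketched independently of any rendering engine. This is only a minimal illustration, not the actual crawler: `crawl_all` and the `fetch_links` callback are hypothetical names, and the made-up `SITE` table stands in for pages that WebKit would render.

```python
def crawl_all(start, fetch_links):
    """Visit every URL reachable from `start`.

    `fetch_links(url)` is a hypothetical callback that returns the links
    found on one page; in the real crawler WebKit would render the page.
    """
    seen = {start: 0}  # url -> 0 (pending) or 1 (processed)
    while True:
        # Snapshot the pending keys so the dict is not mutated while iterating.
        pending = [url for url, done in seen.items() if done == 0]
        if not pending:
            return seen
        for url in pending:
            seen[url] = 1
            for link in fetch_links(url):
                seen.setdefault(link, 0)  # only adds links not seen before


# A made-up three-page site used in place of real pages.
SITE = {'/': ['/a', '/b'], '/a': ['/', '/b'], '/b': []}
result = crawl_all('/', lambda url: SITE.get(url, []))
```

Taking a snapshot of the pending keys on each pass avoids changing the dictionary while it is being iterated, which is easy to do by accident with this kind of loop.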

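That script is nice and it works, but I had never touched GTK from Python before. Adding my logic to it was fairly easy, but it seems to use a lot of memory. Profiling with meliae shows nothing that accounts for the usage, even when the Python process has reached 2 GB. The site has a lot of pages, so the script eventually hits the 32-bit memory limit and crashes. My guess is that the code keeps creating more and more WebKit windows. I don't know how to actually close or destroy them. I've tried destroy, and there is also a main_quit, but neither seems to close them.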
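One way to test that guess is to compare the size of the whole process with the size of the Python objects in it. As far as I understand, meliae reports on Python objects, so memory allocated natively by WebKit would presumably not show up in its dump. The sketch below assumes a Unix-like system (the `resource` module is not available on Windows); the helper names `peak_rss_bytes`, `python_object_bytes` and `memory_report` are made up for illustration, and the Python-side total is only a rough estimate.

```python
import gc
import resource
import sys


def peak_rss_bytes():
    """Peak resident set size of this process, in bytes."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    return peak if sys.platform == 'darwin' else peak * 1024


def python_object_bytes():
    """Rough total size of the objects tracked by the garbage collector."""
    return sum(sys.getsizeof(obj, 0) for obj in gc.get_objects())


def memory_report():
    """Return (peak RSS, Python object total, difference) in bytes."""
    rss = peak_rss_bytes()
    py = python_object_bytes()
    return rss, py, rss - py
```

If the difference keeps growing with every page while the Python total stays roughly flat, the growth is likely happening outside the Python heap, which would be consistent with widgets or views that are never released.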

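Here are what I believe (hopefully) are the relevant parts, with the target URL changed. I was originally using dictionaries for urls and foundurls, but switched to anydbm in case they were eating memory for some odd reason. I'll probably switch back to dictionaries later: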

#!/usr/bin/env python
import sys, thread
import gtk
import webkit
import warnings
from time import sleep
from BeautifulSoup import BeautifulSoup
import re
import os
import anydbm
import copy
from meliae import scanner

warnings.filterwarnings('ignore')

class WebView(webkit.WebView):
    def get_html(self):
        self.execute_script('oldtitle=document.title;document.title=document.documentElement.innerHTML;')
        html = self.get_main_frame().get_title()
        self.execute_script('document.title=oldtitle;')
        self.destroy  # note: without parentheses this only references the method and does not call it
        return html

class Crawler(gtk.Window):
    def __init__(self, url, file):
        gtk.gdk.threads_init() # suggested by Nicholas Herriot for Ubuntu Koala
        gtk.Window.__init__(self)
        self._url = url
        self._file = file
        self.connect("destroy",gtk.main_quit)

    def crawl(self):
        view = WebView()
        view.open(self._url)
        view.connect('load-finished', self._finished_loading)
        self.add(view)
        gtk.main()
        return view.get_html()

    def _finished_loading(self, view, frame):
        with open(self._file, 'w') as f:
            f.write(view.get_html())
        gtk.main_quit()

... various subroutines: the BeautifulSoup handling, page processing, link extraction, link cleanup, and so on ...

def main():
    urls=anydbm.open('./urls','n')
    domain = "stackoverflow.com"
    baseUrl = 'http://'+domain
    urls['/']='0'
    while (check_done(urls) == 0):
        count = 0
        foundurls=anydbm.open('./foundurls','n')
        for url, done in urls.iteritems():
            if done == '1': continue  # anydbm values are strings, so compare with '1'
            print "Processing",url
            urls[str(url)] = '1'
            if (re.search(".*\/$",url)):
                outfile=domain+url+"index.html"
            elif (os.path.isdir(os.path.dirname(os.path.abspath(domain+url)))):  # check the candidate path; outfile is not assigned yet here
                outfile=domain+url+"index.html"
            else:
                outfile=domain+url
            if not os.path.exists(os.path.dirname(os.path.abspath(outfile))):
                os.makedirs(os.path.dirname(os.path.abspath(outfile)))
            crawler = Crawler(baseUrl+url, outfile)
            html=crawler.crawl()
            soup = BeautifulSoup(html.__str__())
            for link in hrefs(soup,baseUrl):
                if not foundurls.has_key(str(link)):
                    foundurls[str(link)] = '0'
            del(html)   #  this is an attempt to get the object to vanish, tried del(Crawler) to no avail
            if count==5:
                scanner.dump_all_objects( 'filename' )
                count = 0
            else:
                count=count+1
        for url, done in foundurls.iteritems():
            if not urls.has_key(str(url)):
                urls[str(url)]='0'
        foundurls.close()
        os.remove('./foundurls')
    urls.close()
    os.remove('./urls')

if __name__ == '__main__':
    main()
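The URL bookkeeping in main() relies on anydbm, which stores keys and values as strings, so the flags come back as '0' and '1' rather than as integers. The following is a minimal sketch of that store on its own, written so that it also runs on Python 3, where the module is called dbm and returns bytes. The names `PENDING`, `DONE` and `pending_urls` are hypothetical and not part of the script above, and a temporary directory is used instead of ./urls.

```python
import os
import shutil
import tempfile

try:
    import anydbm as dbm  # Python 2, as in the script above
except ImportError:
    import dbm            # Python 3 name for the same module

PENDING, DONE = b'0', b'1'


def pending_urls(db):
    """Return a list of the keys whose flag is still PENDING."""
    # keys() returns a list, so the store can be updated while looping over it.
    return [key for key in db.keys() if db[key] == PENDING]


workdir = tempfile.mkdtemp()
urls = dbm.open(os.path.join(workdir, 'urls'), 'n')
urls[b'/'] = PENDING
urls[b'/about'] = PENDING
urls[b'/'] = DONE          # mark the front page as processed
remaining = pending_urls(urls)
urls.close()
shutil.rmtree(workdir)     # remove the temporary database files
```

Comparing against string (or bytes) constants keeps the check consistent with what the database actually returns.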

Tags: javascript, memory management, gtk, web scraping, beautifulsoup, link extraction, crawling, webkit
