How can I get the final rendered HTML of a page with PyQt?

1 vote
1 answer
899 views
Asked 2025-04-18 04:08

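I have recently been trying to scrape data from Google search results, and I found that PyQt can execute the JavaScript in an HTML page and return the final rendered HTML. This approach seems to work on other sites, but it always fails on Google Search. I followed the example here: http://webscraping.com/blog/Scraping-JavaScript-webpages-with-webkit/

The code is as follows: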

import sys
import time
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *

class Render(QWebPage):

    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()

url1 = 'http://www.google.com/search?start=0&client=firefox-a&q=adidas&safe=off&pws=0&tbs=cdr%3A1%2Ccd_min%3A1%2F1%2F2002%2Ccd_max%3A1%2F1%2F2001&filter=0&num=10&access=a&oe=UTF-8&ie=UTF-8'   
url2 = 'http://www.google.com/search?start=0&client=firefox-a&q=adidas&safe=off&pws=0&tbs=cdr%3A1%2Ccd_min%3A1%2F1%2F2009%2Ccd_max%3A7%2F1%2F2009&filter=0&num=10&access=a&oe=UTF-8&ie=UTF-8'
r = Render(url1)
html = r.frame.toHtml()
print type(html)

outfile = open('page.html','w')
outfile.write(html.toUtf8())
outfile.close()
print 'finished!'

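However, url1 and url2 always produce the same result, and that result is also identical to what I see in Chrome with JavaScript disabled. How should this be handled? How can I get the final HTML of a Google search?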
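Before looking at the rendering step, it may help to check what actually differs between the two requests. The following is a minimal sketch, not part of the original post: it uses only the Python 3 standard library (the code above is Python 2), fetches nothing, and the helper names `query_params`, `diff_params`, and `date_range` are hypothetical. It also assumes the `cd_min`/`cd_max` dates are written as month/day/year.

```python
# Python 3, standard library only; this does not contact Google.
from datetime import datetime
from urllib.parse import parse_qs, urlsplit

URL1 = 'http://www.google.com/search?start=0&client=firefox-a&q=adidas&safe=off&pws=0&tbs=cdr%3A1%2Ccd_min%3A1%2F1%2F2002%2Ccd_max%3A1%2F1%2F2001&filter=0&num=10&access=a&oe=UTF-8&ie=UTF-8'
URL2 = 'http://www.google.com/search?start=0&client=firefox-a&q=adidas&safe=off&pws=0&tbs=cdr%3A1%2Ccd_min%3A1%2F1%2F2009%2Ccd_max%3A7%2F1%2F2009&filter=0&num=10&access=a&oe=UTF-8&ie=UTF-8'


def query_params(url):
    """Return the percent-decoded query string as {name: [values]}."""
    return parse_qs(urlsplit(url).query)


def diff_params(url_a, url_b):
    """Return {name: (values_a, values_b)} for parameters that differ."""
    a, b = query_params(url_a), query_params(url_b)
    return {k: (a.get(k), b.get(k))
            for k in sorted(set(a) | set(b))
            if a.get(k) != b.get(k)}


def date_range(url):
    """Extract (cd_min, cd_max) from the tbs parameter as datetimes.

    Assumption: the dates are formatted as month/day/year.
    """
    tbs = query_params(url)['tbs'][0]
    fields = dict(part.split(':', 1) for part in tbs.split(','))
    fmt = '%m/%d/%Y'
    return (datetime.strptime(fields['cd_min'], fmt),
            datetime.strptime(fields['cd_max'], fmt))


if __name__ == '__main__':
    print(diff_params(URL1, URL2))
    for name, url in (('url1', URL1), ('url2', URL2)):
        start, end = date_range(url)
        status = 'ok' if start <= end else 'cd_min is after cd_max'
        print(name, start.date(), end.date(), status)
```

Run on the two URLs from the question, the only parameter that differs is `tbs`, and in url1 the `cd_min` date (1/1/2002) is later than the `cd_max` date (1/1/2001). Whether that inverted range contributes to the identical results is not established here; it is just something to rule out.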

1 Answer

0
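import sys
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *

class Render(QWebPage):
  def __init__(self, url):
    # A QApplication must exist before any QWebPage is created
    self.app = QApplication(sys.argv)
    QWebPage.__init__(self)
    self.loadFinished.connect(self._loadFinished)
    self.mainFrame().load(QUrl(url))
    # Block in the Qt event loop until _loadFinished calls quit()
    self.app.exec_()

  def _loadFinished(self, result):
    # Keep a reference to the loaded frame, then leave the event loop
    self.frame = self.mainFrame()
    self.app.quit()

url = 'http://webscraping.com'
r = Render(url)
# toHtml() returns the frame's current markup
html = r.frame.toHtml()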

Source: http://webscraping.com/blog/Scraping-JavaScript-webpages-with-webkit/
