Python web scraping returns the wrong source code

import urllib2

url = "http://www.amazon.com/s/ref=sr_nr_n_11?rh=n%3A283155%2Cn%3A%2144258011%2Cn%3A2205237011%2Cp_n_feature_browse-bin%3A2656020011%2Cn%3A173507&bbn=2205237011&sort=titlerank&ie=UTF8&qid=1393984161&rnid=1000"
webpage = urllib2.urlopen(url).read()
doc = open("test.html", "w")
doc.write(webpage)
doc.close()

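2 Answers

User

Answer 1 · Edited 2024-04-25 02:08:54

To complement falsetru's answer:

Another solution is to use python-ghost, which is based on Qt. It is much heavier to install, so I would also recommend Selenium.

Using Firefox opens a browser window while the script runs. To keep it out of your way, use PhantomJS: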

apt-get install nodejs  # you get npm, the Node Package Manager
npm install -g phantomjs  # install globally
[…]
driver = webdriver.PhantomJS()

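User

Answer 2 · Edited 2024-04-25 02:08:54

The page relies on JavaScript execution.

urllib2.urlopen(..).read() simply reads the raw content served at the URL and never runs any script. That is why the saved file differs from what the browser shows.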
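The difference can be illustrated without any network access. The following is a minimal sketch in Python 3 using only the standard library; RAW_HTML is a made-up page (not the Amazon page from the question), and VisibleText and visible_text are hypothetical helper names introduced for this example. It shows that the markup a plain HTTP fetch returns can contain a script that would fill in the results, while the results themselves are not yet part of the document.

```python
from html.parser import HTMLParser

# Hypothetical page: an empty container plus a script that a browser
# would execute to fill it in. A plain HTTP fetch returns exactly this.
RAW_HTML = """<html><body>
<div id="results"></div>
<script>
document.getElementById("results").innerHTML = "<p>Book A</p>";
</script>
</body></html>"""


class VisibleText(HTMLParser):
    """Collect text that appears outside <script> elements."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Script source is delivered as data too, so skip it explicitly.
        if not self.in_script and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the non-script text found in the raw markup."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return parser.chunks


# The string "Book A" exists only inside the script source; no text node
# holds it until a JavaScript engine runs the script.
print(visible_text(RAW_HTML))
```

The parser never executes the script, so the text the browser would display is simply absent from what it sees. The same holds for anything that only downloads and parses the response body.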

To get the same content as the browser, you need a library that can handle JavaScript.

For example, the following code uses selenium:

from selenium import webdriver

url = 'http://www.amazon.com/s/ref=sr_nr_n_11?...161&rnid=1000'
driver = webdriver.Firefox()
driver.get(url)
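# page_source is text; encode it and write in binary mode so the bytes
# are saved unchanged on both Python 2 and Python 3.
with open('test.html', 'wb') as f:
    f.write(driver.page_source.encode('utf-8'))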
driver.quit()
