Python Mechanize doesn't handle redirects correctly

Question

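I'm building a web scraper with Python's Mechanize and Beautiful Soup, and for some reason redirects aren't working. Here's my code (sorry for naming the variables "thing" and "stuff"; I don't normally do that, trust me):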

stuff = soup.find('div', attrs={'class': 'paging'}).ul.findAll('a', href=True)
for thing in stuff:
    # href exactly as scraped from the page
    pageUrl = thing['href']
    print pageUrl

    req = mechanize.Request(pageUrl)
    response = browser.open(req)
    searchPage = response.read()

    # parse the fetched page and dump it for inspection
    soup = BeautifulSoup(searchPage)
    soupString = soup.prettify()
    print soupString

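Anyway, on the Kraft website, products whose search results span more than one page show a link to the next page. For example, the page source lists this as the next-page link for Kraft steak sauces and marinades, which redirects to this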

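In any case, thing['href'] holds the old link, since it was scraped from the page. One would expect that calling browser.open() on that link makes mechanize follow the redirect to the new link and return the result. However, running the code gives this: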

http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx?catalogtype=1&brandid=1&searchtext=a.1. steak sauces and marinades&pageno=2
Traceback (most recent call last):
File "C:\Development\eclipse\mobile development\Crawler\src\Kraft.py", line 58, in <module>
response = browser.open(req)
File "build\bdist.win-amd64\egg\mechanize\_mechanize.py", line 203, in open
File "build\bdist.win-amd64\egg\mechanize\_mechanize.py", line 255, in _mech_open
mechanize._response.httperror_seek_wrapper: HTTP Error 408: Request Time-out

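I get a timeout; I suspect it's because mechanize keeps requesting the old URL and isn't being redirected to the new link (I also tried urllib2, with the same result). What's going on here?

Thanks for your help, and let me know if you need more information.

Update: OK, I enabled logging; my code is now: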

req = mechanize.Request(pageUrl)
print logging.INFO

When I run it, I get this:

url argument is not a URI (contains illegal characters) u'http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx?catalogtype=1&brandid=1&searchtext=a.1. steak sauces and marinades&pageno=2'

20
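A side note on the output above: the 20 is simply the numeric value of the logging.INFO constant, so printing it does not by itself turn logging on. A minimal sketch of how log output could be routed to the console with the standard logging module follows; it assumes mechanize logs under the logger name "mechanize", which is an assumption here rather than something shown in the post.

```python
import logging
import sys

# Assumption: mechanize emits its log records under the logger name "mechanize".
logger = logging.getLogger("mechanize")

# Send records to stdout and let INFO-level (and higher) messages through.
logger.addHandler(logging.StreamHandler(sys.stdout))
logger.setLevel(logging.INFO)

# logging.INFO is just an integer level constant, which is why
# printing it shows a bare number.
print(logging.INFO)
```

With a handler attached like this, any INFO-level messages the library emits would show up next to the scraper's own prints.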

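Update 2 (which happened while I was writing the first update): it turns out the spaces in my string were causing the problem! All I needed to do was: pageUrl = thing['href'].replace(' ', "+"), and then it worked perfectly.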
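The replace(' ', "+") fix handles spaces only. As a hedged sketch of a slightly more general approach, the scraped href could be percent-encoded with the standard library's quote function before it is handed to mechanize. The helper name escape_url and the choice of safe characters are illustrative assumptions, not part of the original code; spaces come out as %20 rather than +.

```python
try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote  # Python 2

# Hypothetical helper: percent-encode characters that are illegal in a URI
# (such as spaces) while leaving URL delimiters untouched. '%' is kept in
# the safe set so already-encoded sequences are not encoded a second time.
def escape_url(url):
    return quote(url, safe="/:?&=#+%;@,$!*'()~[]")

raw = ("http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx"
       "?catalogtype=1&brandid=1&searchtext=a.1. steak sauces and marinades&pageno=2")
print(escape_url(raw))
```

In the loop from the question this would be used as pageUrl = escape_url(thing['href']). Whether the server treats %20 and + identically inside the query string is worth checking against the site itself.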
