A simple web crawler in Python

Published 2024-05-16 23:36:02


I am teaching myself Python and have come up with a simple web crawling engine. The code is below:

def find_next_url(page):
    # Find the first '<a href' tag and return (url, index of its closing quote).
    start_of_url_line = page.find('<a href')
    if start_of_url_line == -1:
        return None, 0
    else:
        start_of_url = page.find('"http', start_of_url_line)
        if start_of_url == -1:
            return None, 0
        else:
            end_of_url = page.find('"', start_of_url + 1)
            one_url = page[start_of_url + 1 : end_of_url]
            return one_url, end_of_url

def get_all_url(page):
    # Collect every URL on the page by scanning it link by link.
    p = []
    while True:
        url, end_pos = find_next_url(page)
        if url:
            p.append(url)
            page = page[end_pos + 1:]
        else:
            break
    return p

def union(a, b):
    # Append each element of b that is not already in a.
    for e in b:
        if e not in a:
            a.append(e)
    return a

def webcrawl(seed):
    tocrawl = [seed]
    crawled = []
    while True:
        page = tocrawl.pop()
        if page not in crawled:
            import urllib.request
            intpage = urllib.request.urlopen(page).read()
            openpage = str(intpage)
            union(tocrawl, get_all_url(openpage))
            crawled.append(page)
    return crawled
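
For example, the URL extraction behaves as I expect on a small made-up snippet:

    >>> html = '<a href="http://example.com/a">A</a> and <a href="http://example.com/b">B</a>'
    >>> get_all_url(html)
    ['http://example.com/a', 'http://example.com/b']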

But I keep getting an HTTP 403 error.


3 Answers

You probably need to add request headers or some other form of authentication. Try adding a User-Agent so the request is not rejected in some cases.

Example:

    User-Agent:Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36
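
With urllib, that string can be passed through the headers argument of urllib.request.Request (a minimal sketch; the URL is only a placeholder):

    import urllib.request

    req = urllib.request.Request(
        'http://example.com/',  # placeholder URL
        headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                               'AppleWebKit/537.36 (KHTML, like Gecko) '
                               'Chrome/62.0.3202.94 Safari/537.36'})
    html = urllib.request.urlopen(req).read()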

As others have said, the error is not caused by the code itself, but there are a couple of things you may want to try:

  • Make sure your crawler does not fall over when a request fails, by adding exception handling:

    import sys
    import urllib.request

    def webcrawl(seed):
        tocrawl = [seed]
        crawled = []
        while tocrawl: # replace `while True` with an actual condition,
                       # otherwise you'll be stuck in an infinite loop
                       # until you hit an exception
            page = tocrawl.pop()
            if page not in crawled:
                try:
                    intpage = urllib.request.urlopen(page).read()
                    openpage = str(intpage)
                    union(tocrawl, get_all_url(openpage))
                    crawled.append(page)
                except urllib.error.HTTPError as e:  # catch an exception
                    if e.code == 401:  # check the status code and take action
                        pass  # or anything else you want to do in case of an `Unauthorized` error
                    elif e.code == 403:
                        pass  # or anything else you want to do in case of a `Forbidden` error
                    elif e.code == 404:
                        pass   # or anything else you want to do in case of a `Not Found` error
                    # etc
                    else:
                        print('Exception:\n{}'.format(e))  # print an unexpected exception
                        sys.exit(1)  # finish the process with exit code 1 (indicates there was a problem)
        return crawled
    
  • Try adding a User-Agent header to the request. From the urllib.request docs:

This is often used to “spoof” the User-Agent header, which is used by a browser to identify itself – some HTTP servers only allow requests coming from common browsers as opposed to scripts. For example, Mozilla Firefox may identify itself as "Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11", while urllib's default user agent string is "Python-urllib/2.6" (on Python 2.6).

So setting the User-Agent explicitly may help avoid the 403 errors.

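A minimal sketch of that idea, assuming a hypothetical fetch_page() helper and reusing the Firefox User-Agent string from the quote above:

    import urllib.request

    # Pretend to be a regular browser; the default "Python-urllib/x.y"
    # User-Agent is refused by some servers with a 403.
    HEADERS = {'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686) '
                             'Gecko/20071127 Firefox/2.0.0.11'}

    def fetch_page(url):
        req = urllib.request.Request(url, headers=HEADERS)
        return str(urllib.request.urlopen(req).read())

Inside webcrawl, openpage = fetch_page(page) would then replace the two lines that call urlopen and str().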

An HTTP 403 error has nothing to do with your code. It means access to the URL being crawled is forbidden. In most cases that means the page is only available to logged-in or otherwise authorized users.


I actually ran your code and got 403s for the creativecommons links. The cause is the headers urllib sends by default: many servers inspect them (in particular the User-Agent) and decide what content to serve, so you should set them manually to avoid the error. You can also use the Requests python package instead of the built-in urllib; it sends friendlier default headers and is, IMO, more pythonic.
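
For instance, a sketch of the fetch done with Requests (the package has to be installed separately; the header value and timeout are only examples):

    import requests

    def fetch_with_requests(url):
        # requests decodes the body for you and can raise on HTTP error codes.
        resp = requests.get(
            url,
            headers={'User-Agent': 'Mozilla/5.0'},  # example browser-like UA
            timeout=10)                             # example timeout in seconds
        resp.raise_for_status()  # turns 403/404/... into requests.HTTPError
        return resp.text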

I added a try/except clause to catch and log errors and then continue crawling the remaining links. There are plenty of broken links on the web.

from urllib.request import urlopen
from urllib.error import HTTPError
...
def webcrawl(seed):
    tocrawl = [seed]
    crawled = []
    while True:
        page = tocrawl.pop()
        if page not in crawled:
            try:
                intpage = urlopen(page).read()
                openpage = str(intpage)
                union(tocrawl, get_all_url(openpage))
                crawled.append(page)
            except HTTPError as ex:
                print('got http error while crawling', page)
    return crawled
