A simple web crawler in Python

Published 2024-05-16 23:36:02


I am teaching myself Python and have come up with a simple web crawling engine. The code is below:

def find_next_url(page):
    # Find the first '<a href' tag and return (url, index of its closing quote).
    start_of_url_line = page.find('<a href')
    if start_of_url_line == -1:
        return None, 0
    else:
        start_of_url = page.find('"http', start_of_url_line)
        if start_of_url == -1:
            return None, 0
        else:
            end_of_url = page.find('"', start_of_url + 1)
            one_url = page[start_of_url + 1 : end_of_url]
            return one_url, end_of_url

def get_all_url(page):
    # Collect every URL on the page by scanning it link by link.
    p = []
    while True:
        url, end_pos = find_next_url(page)
        if url:
            p.append(url)
            page = page[end_pos + 1:]
        else:
            break
    return p

def union(a, b):
    # Append each element of b that is not already in a.
    for e in b:
        if e not in a:
            a.append(e)
    return a

def webcrawl(seed):
    tocrawl = [seed]
    crawled = []
    while True:
        page = tocrawl.pop()
        if page not in crawled:
            import urllib.request
            intpage = urllib.request.urlopen(page).read()
            openpage = str(intpage)
            union(tocrawl, get_all_url(openpage))
            crawled.append(page)
    return crawled
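
For example, the URL extraction behaves as I expect on a small made-up snippet:

    >>> html = '<a href="http://example.com/a">A</a> and <a href="http://example.com/b">B</a>'
    >>> get_all_url(html)
    ['http://example.com/a', 'http://example.com/b']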

But I keep getting an HTTP 403 error.


3 Answers

You probably need to add request headers or some other form of authentication. Try adding a User-Agent so the request is not rejected in some cases.

Example:

    User-Agent:Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36
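
With urllib, that string can be passed through the headers argument of urllib.request.Request (a minimal sketch; the URL is only a placeholder):

    import urllib.request

    req = urllib.request.Request(
        'http://example.com/',  # placeholder URL
        headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                               'AppleWebKit/537.36 (KHTML, like Gecko) '
                               'Chrome/62.0.3202.94 Safari/537.36'})
    html = urllib.request.urlopen(req).read()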

As others have said, the error is not caused by the code itself, but there are a couple of things you may want to try:

  • Make sure your crawler does not fall over when a request fails, by adding exception handling:

    import sys
    import urllib.request

    def webcrawl(seed):
        tocrawl = [seed]
        crawled = []
        while tocrawl: # replace `while True` with an actual condition,
                       # otherwise you'll be stuck in an infinite loop
                       # until you hit an exception
            page = tocrawl.pop()
            if page not in crawled:
                try:
                    intpage = urllib.request.urlopen(page).read()
                    openpage = str(intpage)
                    union(tocrawl, get_all_url(openpage))
                    crawled.append(page)
                except urllib.error.HTTPError as e:  # catch an exception
                    if e.code == 401:  # check the status code and take action
                        pass  # or anything else you want to do in case of an `Unauthorized` error
                    elif e.code == 403:
                        pass  # or anything else you want to do in case of a `Forbidden` error
                    elif e.code == 404:
                        pass   # or anything else you want to do in case of a `Not Found` error
                    # etc
                    else:
                        print('Exception:\n{}'.format(e))  # print an unexpected exception
                        sys.exit(1)  # finish the process with exit code 1 (indicates there was a problem)
        return crawled
    
  • Try adding a User-Agent header to the request. From the urllib.request docs:

This is often used to “spoof” the User-Agent header, which is used by a browser to identify itself – some HTTP servers only allow requests coming from common browsers as opposed to scripts. For example, Mozilla Firefox may identify itself as "Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11", while urllib's default user agent string is "Python-urllib/2.6" (on Python 2.6).

So setting the User-Agent explicitly may help avoid the 403 errors.

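A minimal sketch of that idea, assuming a hypothetical fetch_page() helper and reusing the Firefox User-Agent string from the quote above:

    import urllib.request

    # Pretend to be a regular browser; the default "Python-urllib/x.y"
    # User-Agent is refused by some servers with a 403.
    HEADERS = {'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686) '
                             'Gecko/20071127 Firefox/2.0.0.11'}

    def fetch_page(url):
        req = urllib.request.Request(url, headers=HEADERS)
        return str(urllib.request.urlopen(req).read())

Inside webcrawl, openpage = fetch_page(page) would then replace the two lines that call urlopen and str().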

An HTTP 403 error has nothing to do with your code. It means access to the URL being crawled is forbidden. In most cases that means the page is only available to logged-in or otherwise authorized users.


I actually ran your code and got 403s for the creativecommons links. The cause is the headers urllib sends by default: many servers inspect them (in particular the User-Agent) and decide what content to serve, so you should set them manually to avoid the error. You can also use the Requests python package instead of the built-in urllib; it sends friendlier default headers and is, IMO, more pythonic.
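
For instance, a sketch of the fetch done with Requests (the package has to be installed separately; the header value and timeout are only examples):

    import requests

    def fetch_with_requests(url):
        # requests decodes the body for you and can raise on HTTP error codes.
        resp = requests.get(
            url,
            headers={'User-Agent': 'Mozilla/5.0'},  # example browser-like UA
            timeout=10)                             # example timeout in seconds
        resp.raise_for_status()  # turns 403/404/... into requests.HTTPError
        return resp.text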

I added a try/except clause to catch and log errors and then continue crawling the remaining links. There are plenty of broken links on the web.

from urllib.request import urlopen
from urllib.error import HTTPError
...
def webcrawl(seed):
    tocrawl = [seed]
    crawled = []
    while True:
        page = tocrawl.pop()
        if page not in crawled:
            try:
                intpage = urlopen(page).read()
                openpage = str(intpage)
                union(tocrawl, get_all_url(openpage))
                crawled.append(page)
            except HTTPError as ex:
                print('got http error while crawling', page)
    return crawled
