Using Beautiful Soup in Python 2

Posted 2024-05-16 04:54:11

I am trying to build a basic web crawler using BeautifulSoup in Python 2.7. Here is my code:

import re
import httplib
import urllib2
from urlparse import urlparse
from bs4 import BeautifulSoup

regex = re.compile(
        r'^(?:http|https)s?://' # http:// or https://
        r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|' #domain...
        r'localhost|' #localhost...
        r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})' # ...or ip
        r'(?::\d+)?' # optional port
        r'(?:/?|[/?]\S+)$', re.IGNORECASE)

def isValidUrl(url):
    if regex.match(url) is not None:
        return True;
    return False

def crawler(SeedUrl):
    tocrawl=[SeedUrl]
    crawled=[]
    while tocrawl:
        page=tocrawl.pop()
        print 'Crawled:'+page
        pagesource=urllib2.urlopen(page)
        s=pagesource.read()
        soup=BeautifulSoup.BeautifulSoup(s)
        links=soup.findAll('a',href=True)        
        if page not in crawled:
            for l in links:
                if isValidUrl(l['href']):
                    tocrawl.append(l['href'])
            crawled.append(page)   
    return crawled

crawler('https://www.google.co.in/?gfe_rd=cr&ei=SfWxVs65JK_v8we9zrj4AQ&gws_rd=ssl')

I get this error:

Crawled:https://www.google.co.in/?gfe_rd=cr&ei=SfWxVs65JK_v8we9zrj4AQ&gws_rd=ssl
Traceback (most recent call last):
  File "web_crawler_python_2.py", line 38, in <module>
    crawler('https://www.google.co.in/?gfe_rd=cr&ei=SfWxVs65JK_v8we9zrj4AQ&gws_rd=ssl')
  File "web_crawler_python_2.py", line 29, in crawler
    soup=BeautifulSoup.BeautifulSoup(s)
AttributeError: type object 'BeautifulSoup' has no attribute 'BeautifulSoup'

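I have tried many times but cannot seem to debug it. Can anyone point out the problem? (By the way, I know many sites do not allow crawling; I am doing this only to learn.)

Thanks, any help would be greatly appreciated.

The source I based my code on: simple web crawler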

1 Answer

User

#1 · Posted 2024-05-16 04:54:11

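The BeautifulSoup class has no BeautifulSoup attribute. Because of `from bs4 import BeautifulSoup`, the name `BeautifulSoup` already refers to the class itself, so it should be called directly. An example from the documentation: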

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
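To make the cause of the error concrete, here is a minimal sketch of what the import binds. The `html_doc` string and the variable names are made up for illustration. A likely explanation, though this is an assumption, is that the tutorial the question followed targeted the older BeautifulSoup 3 package, where `import BeautifulSoup` produced a module containing a class of the same name.

```python
import bs4
from bs4 import BeautifulSoup

# `from bs4 import BeautifulSoup` binds the class itself, so the name
# already refers to the same object as bs4.BeautifulSoup.
same_object = BeautifulSoup is bs4.BeautifulSoup

# Looking up `.BeautifulSoup` on the class therefore fails,
# which is the AttributeError shown in the question.
try:
    BeautifulSoup.BeautifulSoup
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

# Either spelling below constructs a parser object.
# html_doc is a hypothetical sample document.
html_doc = '<p><a href="http://example.com/">example</a></p>'
soup_a = BeautifulSoup(html_doc, 'html.parser')
soup_b = bs4.BeautifulSoup(html_doc, 'html.parser')

# Collect the href of every <a> tag that has one.
hrefs = [a['href'] for a in soup_a.find_all('a', href=True)]
```

Passing `'html.parser'` explicitly selects the parser from the standard library, so the result does not depend on which optional parsers happen to be installed.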

You need to replace:

soup=BeautifulSoup.BeautifulSoup(s)

with:

soup=BeautifulSoup(s, 'html.parser')
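Putting the fix into the crawler, one possible version looks like the sketch below. It is not the asker's exact program: the `fetch_page` parameter, the `max_pages` limit, the `extract_links` helper and the breadth-first queue (instead of the original `pop()` from the end of the list) are all assumptions added here. The regex check is swapped for `urljoin` plus a scheme test so that relative links are followed too, and each page is marked as seen before it is fetched a second time. The imports are written to work on both Python 2.7 and Python 3.

```python
from collections import deque

from bs4 import BeautifulSoup

try:  # Python 3
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen
except ImportError:  # Python 2.7
    from urlparse import urljoin, urlparse
    from urllib2 import urlopen


def fetch(url):
    """Download a page and return its raw content."""
    return urlopen(url).read()


def extract_links(html, base_url):
    """Return absolute http(s) links found in <a href=...> tags."""
    soup = BeautifulSoup(html, 'html.parser')
    links = []
    for a in soup.find_all('a', href=True):
        # Resolve relative links against the page they came from.
        absolute = urljoin(base_url, a['href'])
        if urlparse(absolute).scheme in ('http', 'https'):
            links.append(absolute)
    return links


def crawler(seed_url, fetch_page=fetch, max_pages=10):
    """Breadth-first crawl from seed_url, visiting at most max_pages pages."""
    to_crawl = deque([seed_url])
    seen = set([seed_url])  # every URL ever queued, to avoid repeats
    crawled = []            # pages fetched successfully, in visit order
    while to_crawl and len(crawled) < max_pages:
        page = to_crawl.popleft()
        try:
            html = fetch_page(page)
        except Exception:
            continue  # skip pages that fail to download
        crawled.append(page)
        for link in extract_links(html, page):
            if link not in seen:
                seen.add(link)
                to_crawl.append(link)
    return crawled
```

Calling `crawler('https://www.google.co.in/')` would use the network through `fetch`. For experiments, `fetch_page` can be any function that maps a URL to HTML, for example a lookup in a dictionary of sample pages.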
