Beautiful Soup not working

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests
import urllib
from bs4 import BeautifulSoup

glimit = 100

def my_spider(max_pages):
    page = 2
    while page <= max_pages:
        url = 'http://www.bbb.org/search/?type=name&input=constrution&location=Austin%2c+TX&filter=combined&accredited=&radius=5000&country=USA&language=en&codeType=YPPA'
        url_2 = url + '&page='+ str(page) +'&source=bbbse'
        source_code = requests.get(url_2)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html5lib")
        limit = glimit
        li = soup.find('h4', {'class': 'hcolor'})
        children = li.find_all("a")
        for result in children:
            href = "http://www.bbb.org" + result.get('href')
            owl = (result.string)
            print owl
            get_single_item_data(href)
        page += 1

def get_single_item_data(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "html5lib")
    limit = glimit
    mysoup = soup.findAll('h3',{'class': 'address__heading' })[:limit]
    mysoup2 = mysoup.find_all("a")
    for item in mysoup2:
        href = "http://www.bbb.org" + item.get('href')
        print (item.string)

my_spider(2)

1 answer

User

#1 · Posted on 2024-05-23 21:01:26

There are several problems in your code.

1) You don't need href = "http://www.bbb.org" +. Remove "http://www.bbb.org", because the links already include the host.

2)

mysoup = soup.findAll('h3',{'class': 'address__heading' })[:limit]
mysoup2 = mysoup.find_all("a")

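You are calling find_all("a") on a list. findAll returns a list of tags (and slicing it with [:limit] still gives a list), and a list has no find_all method. Either iterate over mysoup and call find_all on each element, or use find instead of findAll to get a single tag.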
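The two points above can be sketched roughly as follows. This is only an illustration, not the updated code mentioned in this answer: it is written for Python 3, the SAMPLE_HTML markup and the extract_links helper are made up for the example, and it uses the built-in "html.parser" instead of "html5lib" so that it runs without the extra dependency. Using urljoin is a swapped-in way to handle point 1: it leaves an absolute link untouched and only adds the host to a relative one.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "http://www.bbb.org"  # host taken from the question's code

# Hypothetical markup, only here so the example runs without a network request.
SAMPLE_HTML = """
<h3 class="address__heading"><a href="http://www.bbb.org/one">First</a></h3>
<h3 class="address__heading"><a href="/two">Second</a></h3>
"""


def extract_links(html, limit=100):
    """Return (text, absolute_url) pairs for links inside the matching h3 tags."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all returns a list of tags; slicing it still gives a list.
    headings = soup.find_all("h3", {"class": "address__heading"})[:limit]
    links = []
    # Point 2: iterate over the list and search inside each tag.
    for heading in headings:
        for anchor in heading.find_all("a", href=True):
            # Point 1: do not blindly prepend the host; urljoin returns an
            # absolute href unchanged and resolves a relative one.
            links.append((anchor.get_text(strip=True),
                          urljoin(BASE_URL, anchor["href"])))
    return links


for text, url in extract_links(SAMPLE_HTML):
    print(text, url)
```

In get_single_item_data, the same loop would replace the mysoup.find_all("a") line.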

I have updated your code. Find it here.
