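I have the code below, which successfully scrapes business categories from "https://www.canpages.ca/business/AB/edmonton/restaurants/183-720200-p41.html", but on page 42 there is one company with no category (the class inside each "result-id" container is "result__business-category").
In the HTML, this particular company does not actually have that class, while the other results do. I'm not sure what the best approach is, because my program crashes as soon as it finds that the class doesn't exist. The error is "AttributeError: 'NoneType' object has no attribute 'text'", and the code is below: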
import re  # regex
import requests  # fetches html page content
from requests import get
from bs4 import BeautifulSoup  # parses html page content
import pandas as pd
import numpy as np

# request headers; not shown in the original post, so this value is an assumption
headers = {"User-Agent": "Mozilla/5.0"}

# initialize empty list where we can store data
categories = []

# get the contents of the page we're looking at by requesting the URL
results = requests.get("https://www.canpages.ca/business/AB/edmonton/restaurants/183-720200-p42.html", headers=headers)
soup = BeautifulSoup(results.text, "html.parser")

# grab the container of each company by result id
companies_div = soup.find_all('div', {'id': re.compile('result-id-.*')})

for x in companies_div:
    # Extract category class and split by white space. Category should follow
    # [City Category] but sometimes typos result in [Category].
    # This line raises AttributeError when find() returns None.
    categoryChunk = x.find('div', class_='result__business-category').text.split()
    # if list does not have [City Category] format and therefore list length of 2, mark as "-"
    category = categoryChunk[1] if len(categoryChunk) == 2 else '-'
    categories.append(category)

# initialize pd dataframe
companies = pd.DataFrame({
    'category': categories,
})
print(companies)
companies.to_csv('companiestest6.csv')
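I don't know how to basically tell the program "if the class can't be found, mark the category as '-'", and I would really appreciate any help.
Update
I have updated the code as follows: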
# inside the `for x in companies_div:` loop
categoryDiv = x.find('div', class_='result__business-category')
if categoryDiv:
    categoryChunk = categoryDiv.text.split()
    # expect [City Category], i.e. a list of length 2
    if len(categoryChunk) == 2:
        category = categoryChunk[1]
        categories.append(category)
    else:
        category = '-'
        categories.append(category)
else:
    category = '-'
    categories.append(category)
This seems to work well.
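The nested branches in the update could probably be folded into one small helper, so that the '-' fallback is written only once. The following is a minimal sketch, not the code from the post: `extract_category` is a hypothetical name, and the inline `sample_html` is made-up markup that only imitates the structure described in the question, so that the example runs without fetching the real page.

```python
import re
from bs4 import BeautifulSoup


def extract_category(company_div, default="-"):
    """Return the category word from one result container, or `default`.

    Hypothetical helper: covers both the missing-class case and the
    case where the text is not in [City Category] form.
    """
    category_div = company_div.find("div", class_="result__business-category")
    if category_div is None:
        # find() found no matching element
        return default
    chunk = category_div.text.split()
    return chunk[1] if len(chunk) == 2 else default


# Made-up markup (an assumption) imitating the three situations in the question
sample_html = """
<div id="result-id-1"><div class="result__business-category">Edmonton Restaurants</div></div>
<div id="result-id-2"><div class="result__business-category">Restaurants</div></div>
<div id="result-id-3"><span>no category element</span></div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
companies_div = soup.find_all("div", {"id": re.compile("result-id-.*")})
categories = [extract_category(x) for x in companies_div]
print(categories)  # ['Restaurants', '-', '-']
```

Keeping the fallback in one place means a later change to the placeholder value only touches the `default` argument.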
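It seems you should be able to test what .find returns fairly simply. This doesn't explicitly test for the length-2 case, but I assume that check was only there to try to handle the case where nothing is found.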
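As an illustration of that idea (a sketch under assumptions, not the answerer's original snippet, which is not preserved here): `find` returns `None` when no element matches, so the result can be checked before `.text` is accessed. The `sample_html` markup and the `found` list are made up for the example, and the length-2 check is deliberately left out.

```python
from bs4 import BeautifulSoup

# Made-up markup (an assumption): one result with a category, one without
sample_html = """
<div id="result-id-1"><div class="result__business-category">Edmonton Restaurants</div></div>
<div id="result-id-2"><span>no category element</span></div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

found = []
for x in soup.find_all("div", id=True):  # only the divs that carry an id
    category_div = x.find("div", class_="result__business-category")
    # find() returns None when nothing matches, so test before using .text
    found.append(category_div.text if category_div is not None else "-")

print(found)  # ['Edmonton Restaurants', '-']
```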