BeautifulSoup (bs4) cannot get the element with find_all, select, or select_one

def get_websites():
    for yso in Company.objects.filter(crawled=False, source='YAG'):
        r = requests.get(yso.url)
        soup = BeautifulSoup(r.content, 'lxml')
        if soup.select_one(".g-recaptcha") is not None:
            sys.exit("Captcha")
        soup_select = soup.select_one("a[href*='biz_redir']")
        try:
            yso.website = soup_select.text
            print('website for %s added' % (yso.website))
        except Exception as e:
            print(e)
            print('no website for %s added' % yso.name)
        if not yso.crawled:
            yso.crawled = True
            yso.save()

1 answer

User

#1 · Posted on 2024-06-16 09:07:45

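The link is loaded dynamically by JavaScript, so the HTML that requests downloads does not contain the <a> element you are selecting, and select_one returns None. However, the same link is embedded in the page as JSON inside a <script> tag, so you can extract it with the json module: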

import re
import json
import requests
from bs4 import BeautifulSoup

URL = "https://www.yelp.com/biz/dallas-marketing-rockstar-dallas?adjust_creative=3cZu3ieq3omptvF-Yfj2ow&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=3cZu3ieq3omptvF-Yfj2ow%27"

soup = BeautifulSoup(requests.get(URL).content, "html.parser")

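# The class name is "main-content-wrap--full"; written with a space instead,
# "full" would be parsed as a descendant element and the selector would not match.
script = soup.select_one(
    "#wrap > div.main-content-wrap.main-content-wrap--full > yelp-react-root > script"
).string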

json_data = json.loads(re.search(r"({.*})", script).group(1))

print(
    "https://yelp.com"
    + json_data["bizDetailsPageProps"]["bizContactInfoProps"]["businessWebsite"]["href"]
)
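
To look at the regex and json step on its own, here is a minimal offline sketch. The sample script text, its `<!-- ... -->` wrapper, the `example.com` href, and the `extract_website_href` helper are all made up for illustration; only the key path (`bizDetailsPageProps`, `bizContactInfoProps`, `businessWebsite`, `href`) is taken from the code above, and the real page may be structured differently or change over time.

```python
import json
import re

# Made-up stand-in for the text of the <script> tag. The surrounding
# "<!--" and "-->" are an assumption about how the JSON might be wrapped.
SAMPLE_SCRIPT_TEXT = (
    '<!--{"bizDetailsPageProps": {"bizContactInfoProps": '
    '{"businessWebsite": {"href": "/biz_redir?url=https%3A%2F%2Fexample.com"}}}}-->'
)


def extract_website_href(script_text):
    """Return the businessWebsite href from the embedded JSON, or None."""
    # Grab everything from the first "{" to the last "}".
    match = re.search(r"\{.*\}", script_text, re.DOTALL)
    if match is None:
        return None
    try:
        data = json.loads(match.group(0))
        return data["bizDetailsPageProps"]["bizContactInfoProps"]["businessWebsite"]["href"]
    except (json.JSONDecodeError, KeyError, TypeError):
        # Malformed JSON or a missing key: report "no website" instead of crashing.
        return None


href = extract_website_href(SAMPLE_SCRIPT_TEXT)
print("https://yelp.com" + href)
```

Returning None for a missing key keeps a crawl loop like the one in the question running when a business has no website listed.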

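Another approach is to use Selenium, which drives a real browser and therefore sees the dynamically rendered content.

Install it with: pip install selenium

Download the ChromeDriver version that matches your installed Chrome.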

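from time import sleep
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup


URL = "https://www.yelp.com/biz/dallas-marketing-rockstar-dallas?adjust_creative=3cZu3ieq3omptvF-Yfj2ow&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=3cZu3ieq3omptvF-Yfj2ow%27"
# Selenium 4 takes the driver path through a Service object rather than
# as a positional argument to webdriver.Chrome.
driver = webdriver.Chrome(service=Service(r"c:\path\to\chromedriver.exe"))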
driver.get(URL)
# Wait for the page to fully render
sleep(5)

soup = BeautifulSoup(driver.page_source, "html.parser")
print("https://yelp.com" + soup.select_one("a[href*='biz_redir']")["href"])

driver.quit()

Output:

https://yelp.com/biz_redir?url=https%3A%2F%2Fwww.rockstar.marketing&website_link_type=website&src_bizid=CodEpKvY8ZM7IbCEWxpQ0g&cachebuster=1607826143&s=d214a1df7e2d21ba53939356ac6679631a458ec0360f6cb2c4699ee800d84520
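
The `biz_redir` link is a Yelp redirect; the business's own address sits URL-encoded in its `url` query parameter. If that address is what you want to store, a small standard-library sketch can decode it. The `target_from_redirect` helper name is made up for illustration, and the redirect URL is a shortened copy of the output above.

```python
from urllib.parse import parse_qs, urlparse


def target_from_redirect(redirect_url):
    """Return the decoded `url` query parameter, or None if it is missing."""
    query = parse_qs(urlparse(redirect_url).query)
    values = query.get("url")
    return values[0] if values else None


redirect = (
    "https://yelp.com/biz_redir?url=https%3A%2F%2Fwww.rockstar.marketing"
    "&website_link_type=website&src_bizid=CodEpKvY8ZM7IbCEWxpQ0g"
)
print(target_from_redirect(redirect))  # https://www.rockstar.marketing
```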
