BeautifulSoup (bs4) cannot find elements with find_all, select, or select_one

Posted 2024-06-16 09:07:45

Example URL to crawl: www.yelp.com/biz/dallas-marketing-rockstar-dallas?adjust_creative=3cZu3ieq3omptvF-Yfj2ow&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=3cZu3ieq3omptvF-Yfj2ow

My code:

def get_websites():

    for yso in Company.objects.filter(crawled=False, source='YAG'):
        r = requests.get(yso.url)
        
        soup = BeautifulSoup(r.content, 'lxml')
        if soup.select_one(".g-recaptcha") is not None:
            sys.exit("Captcha")
        soup_select = soup.select_one("a[href*='biz_redir']")
        try:
            yso.website = soup_select.text
            print('website for %s added' % (yso.website))
        except Exception as e:
            print(e)
            print('no website for %s added' % yso.name)

        if not yso.crawled:
            yso.crawled = True
            yso.save()

With either the lxml or the html.parser parser, the CSS selector call soup.select_one("a[href*='biz_redir']") returns None, soup.select("a[href*='biz_redir']") returns an empty list, and soup.find_all("a[href*='biz_redir']") also returns an empty list.

lxml version 4.5.0

beautifulsoup version 4.9.3

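Edit: changing "a[href*='biz_redir']" to a gives the same result. So even if the selector syntax is wrong, something more fundamental than the syntax is going wrong.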


1 answer
User
#1 · Posted 2024-06-16 09:07:45

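The data is loaded dynamically, and requests does not run the JavaScript that loads it, so the link is missing from the HTML that requests returns. However, the link is also embedded in the page as JSON, so you can extract it with the json module: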

import re
import json
import requests
from bs4 import BeautifulSoup

URL = "https://www.yelp.com/biz/dallas-marketing-rockstar-dallas?adjust_creative=3cZu3ieq3omptvF-Yfj2ow&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=3cZu3ieq3omptvF-Yfj2ow%27"

soup = BeautifulSoup(requests.get(URL).content, "html.parser")

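# The page data is embedded as JSON in a <script> tag under the React root.
script = soup.select_one(
    "#wrap > div.main-content-wrap.main-content-wrap--full > yelp-react-root > script"
).string

# Grab the outermost {...} object from the script text and parse it.
json_data = json.loads(re.search(r"({.*})", script).group(1))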

print(
    "https://yelp.com"
    + json_data["bizDetailsPageProps"]["bizContactInfoProps"]["businessWebsite"]["href"]
)
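
The regex-plus-json.loads step above fails with an exception if the script text or any key is missing. A more defensive variant might look like the following minimal standard-library sketch. It assumes the script text wraps a single JSON object, possibly inside an HTML comment. The helper names extract_embedded_json and dig and the sample string are hypothetical and only for illustration; the key path mirrors the one used in the answer, and Yelp's real payload may differ.

```python
import json
import re


def extract_embedded_json(script_text):
    """Return the JSON object embedded in script_text, or None if absent."""
    # DOTALL lets the match span several lines; the greedy match runs from
    # the first "{" to the last "}", so nested objects are kept intact.
    match = re.search(r"\{.*\}", script_text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None


def dig(data, *keys):
    """Walk nested dicts, returning None as soon as a key is missing."""
    for key in keys:
        if not isinstance(data, dict) or key not in data:
            return None
        data = data[key]
    return data


# Hypothetical sample imitating the shape used in the answer above.
sample_script = (
    '<!--{"bizDetailsPageProps": {"bizContactInfoProps": '
    '{"businessWebsite": {"href": "/biz_redir?url=https%3A%2F%2Fexample.com"}}}}-->'
)

payload = extract_embedded_json(sample_script)
href = dig(payload, "bizDetailsPageProps", "bizContactInfoProps", "businessWebsite", "href")
print(href)
```

With this shape, a business that lists no website yields None instead of a KeyError, which fits the "no website" branch in the question's loop.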

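Another approach is to scrape the page with Selenium, which supports dynamic content.

Install it with: pip install selenium

Download the ChromeDriver build that matches your Chrome version.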

from time import sleep
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup


URL = "https://www.yelp.com/biz/dallas-marketing-rockstar-dallas?adjust_creative=3cZu3ieq3omptvF-Yfj2ow&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=3cZu3ieq3omptvF-Yfj2ow%27"
# Selenium 4 takes the driver path through a Service object.
driver = webdriver.Chrome(service=Service(r"c:\path\to\chromedriver.exe"))
driver.get(URL)
# Wait for the page to fully render
sleep(5)

soup = BeautifulSoup(driver.page_source, "html.parser")
print("https://yelp.com" + soup.select_one("a[href*='biz_redir']")["href"])

driver.quit()

Output:

https://yelp.com/biz_redir?url=https%3A%2F%2Fwww.rockstar.marketing&website_link_type=website&src_bizid=CodEpKvY8ZM7IbCEWxpQ0g&cachebuster=1607826143&s=d214a1df7e2d21ba53939356ac6679631a458ec0360f6cb2c4699ee800d84520
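
The printed link is a Yelp redirect, whereas the question is after the business's own website. As a minimal sketch, assuming the real address is carried in the url query parameter as in the output above, it can be recovered with the standard library. The function name extract_target_url is hypothetical.

```python
from urllib.parse import parse_qs, urlparse


def extract_target_url(redirect_url):
    """Return the decoded `url` query parameter of a redirect link, or None."""
    query = parse_qs(urlparse(redirect_url).query)
    values = query.get("url")
    return values[0] if values else None


# Shortened form of the redirect link printed above.
redirect = (
    "https://yelp.com/biz_redir?url=https%3A%2F%2Fwww.rockstar.marketing"
    "&website_link_type=website"
)
print(extract_target_url(redirect))
```

parse_qs percent-decodes the value, so no separate unquote call is needed.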
