Finding a specific word in ID or CLASS names with BeautifulSoup

Asked 2025-04-18 15:18 by 数据工程师

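I'm using BeautifulSoup to extract information from product pages on e-commerce sites. The rule I want to use to identify a product page is:

"The class or id attribute contains the word 'thumb'." For example: class="product_thumbs" or id="thumbimages", and so on.

At the moment my program only looks for .html in the URL, which works for just one e-commerce site. Instead, I want it to search the whole HTML document and find id and class attributes that contain the word "thumb".

My current code is: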

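    from urllib.request import urlopen  # Python 3 replacement for urllib2.urlopen
    from bs4 import BeautifulSoup

    if ".html" in childurl:  # store details in the product_details table if it's a product page
        print("Product found!")
        print(childurl)
        # Name the parser explicitly so results do not depend on which parsers are installed
        soup = BeautifulSoup(urlopen(childurl).read(), "html.parser")
        priceele = soup.find(itemprop='price').string.strip()
        brandname = soup.find(itemprop='brand').string.strip()
        nameele = soup.find(itemprop='name').string.strip()
        image = soup.find(itemprop='image').get('src')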

Tags: data extraction, web scraping, HTML parsing, beautifulsoup, class name lookup, e-commerce sites, id attribute

1 Answer

Try a regular expression pattern:

    import re
    import bs4

    html = """<html><body><div class="foo_thumb"></div><p class="wrong"></p><a id="barthumb"></a></body></html>"""
    soup = bs4.BeautifulSoup(html, "html.parser")

    # One attribute filter per search. Beautiful Soup applies a compiled regex
    # with search(), so 'thumb' matches anywhere in the attribute value.
    predicates = [
        {'id': re.compile('thumb')},
        {'class': re.compile('thumb')},
    ]
    for p in predicates:
        print(soup.find_all(**p))
    # prints [<a id="barthumb"></a>], then [<div class="foo_thumb"></div>]
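
If the goal is only to decide whether a page is a product page, the two attribute searches can probably be folded into one pass with a filter function. The following is a minimal sketch, not a tested production check: the helper names `has_word_in_id_or_class`, `find_tags_with_word` and `is_product_page` are made up for illustration, and it assumes that a plain case-insensitive substring test on "thumb" is good enough for the sites being crawled.

```python
from bs4 import BeautifulSoup


def has_word_in_id_or_class(tag, word="thumb"):
    """Return True if the tag's id or any of its class names contains `word`."""
    word = word.lower()
    tag_id = tag.get("id") or ""       # id is a single string (or missing)
    classes = tag.get("class") or []   # class is parsed as a list of names
    return word in tag_id.lower() or any(word in c.lower() for c in classes)


def find_tags_with_word(html, word="thumb"):
    """Return every tag whose id or class mentions `word`, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find_all(lambda tag: has_word_in_id_or_class(tag, word))


def is_product_page(html, word="thumb"):
    """Hypothetical heuristic: treat the page as a product page if any tag matches."""
    return len(find_tags_with_word(html, word)) > 0


sample = (
    '<html><body><div class="foo_thumb"></div>'
    '<p class="wrong"></p><a id="barthumb"></a></body></html>'
)
print([tag.name for tag in find_tags_with_word(sample)])
print(is_product_page(sample))
```

In the crawler from the question, `is_product_page(page_html)` could stand in for the `".html" in childurl` test once the page has been downloaded. Keep in mind that a word as common as "thumb" may also show up on listing or gallery pages, so the heuristic might need tightening per site.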

Answered 2025-04-18 by Python大师
