Cleaning up a scraped HTML list

Posted 2024-06-07 08:08:11


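I'm trying to extract names from a wiki page. With BeautifulSoup I can get a very dirty list (it includes many irrelevant items) that I want to clean up, but when I try to "sanitise" the list, it comes back unchanged.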

#1).
#Retrieve the page
import requests
from bs4 import BeautifulSoup
weapons_url = 'https://escapefromtarkov.gamepedia.com/Weapons'
weapons_page = requests.get(weapons_url)
weapons_soup = BeautifulSoup(weapons_page.content, 'html.parser')

#2).
#Obtain the data I need, plus a lot of unhelpful data
flithy_scraped_weapon_names = weapons_soup.find_all('td', href="", title="")

#3a).
#Identify keywords that recur in unhelpful/extraneous list items
dirt = ["mm", "predecessor", "File", "image"]
#3b). - Fails
#Remove extraneous data containing above-defined keywords
weapon_names_sanitised = [s for s in flithy_scraped_weapon_names\
    if not any(xs in s for xs in dirt)]

#4).
#Check data
print(weapon_names_sanitised)
#Returns a list identical to flithy_scraped_weapon_names


1 answer

User

Answer 1 · Posted 2024-06-07 08:08:11

The problem is in this part:

weapon_names_sanitised = [s for s in flithy_scraped_weapon_names\
    if not any(xs in s for xs in dirt)]

It should be:

weapon_names_sanitised = [s for s in flithy_scraped_weapon_names\
    if not any(xs in str(s) for xs in dirt)]

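The reason is that flithy_scraped_weapon_names contains Tag objects, not strings. They are converted to strings when printed, which is why the output looks like text, but the in operator does not do that conversion for you. On a Tag, xs in s checks whether xs is one of the tag's direct children rather than doing a substring search, so none of the keywords ever match and nothing gets filtered out. Converting explicitly with xs in str(s) makes the test a substring search over the tag's markup, which is what the filter was meant to do.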
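To see the difference without hitting the network, here is a minimal sketch. The HTML snippet is made up for illustration and is not taken from the real wiki page, and the names `cells`, `unfiltered`, `filtered` and `names` are placeholders. It runs the same `dirt` keywords against the Tag objects directly and against their string form.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the wiki table (an assumption,
# not the real page content).
html = """
<table>
  <tr><td><a href="/AK-74N" title="AK-74N">AK-74N</a></td></tr>
  <tr><td>5.45x39mm</td></tr>
  <tr><td><a href="/File:ak.png" title="File:ak.png">image</a></td></tr>
  <tr><td><a href="/MP-133" title="MP-133">MP-133</a></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
cells = soup.find_all("td")  # a list of Tag objects, not strings

dirt = ["mm", "predecessor", "File", "image"]

# Original filter: `xs in c` tests membership among the tag's direct
# children, so no keyword matches and every cell is kept.
unfiltered = [c for c in cells if not any(xs in c for xs in dirt)]

# Fixed filter: str(c) is the tag's markup, so `in` is a substring search.
filtered = [c for c in cells if not any(xs in str(c) for xs in dirt)]

# Pull the visible text out of the surviving cells.
names = [c.get_text(strip=True) for c in filtered]

print(len(cells), len(unfiltered), names)
```

Note that str(c) includes attribute values such as href and title, so a keyword can match markup as well as visible text. If only the visible text should be searched, testing against c.get_text() instead may be closer to the intent.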
