Cleaning up a scraped HTML list

Posted 2024-06-07 08:08:11

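I am trying to extract names from a wiki page. Using BeautifulSoup, I get a very dirty list (with many irrelevant items) that I want to clean up, but when I try to "clean" the list, it stays unchanged.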

#1).
#Retrieve the page
import requests
from bs4 import BeautifulSoup
weapons_url = 'https://escapefromtarkov.gamepedia.com/Weapons'
weapons_page = requests.get(weapons_url)
weapons_soup = BeautifulSoup(weapons_page.content, 'html.parser')

#2).
#Get the data I need, plus a lot of unhelpful data
flithy_scraped_weapon_names = weapons_soup.find_all('td', href="", title="")

#3a).
#Identify keywords that recur in unhelpful/extraneous list items
dirt = ["mm", "predecessor", "File", "image"]
#3b). - Fails
#Remove extraneous data containing above-defined keywords
weapon_names_sanitised = [s for s in flithy_scraped_weapon_names\
    if not any(xs in s for xs in dirt)]

#4).
#Check data
print(weapon_names_sanitised)
#Returns a list identical to flithy_scraped_weapon_names

1 answer
User
#1 · Posted 2024-06-07 08:08:11

The problem is in this part:

weapon_names_sanitised = [s for s in flithy_scraped_weapon_names\
    if not any(xs in s for xs in dirt)]

It should be:

weapon_names_sanitised = [s for s in flithy_scraped_weapon_names\
    if not any(xs in str(s) for xs in dirt)]

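The reason is that flithy_scraped_weapon_names contains Tag objects. They are converted to strings when printed, which is why the list looks like text, but xs in s on a Tag checks whether xs is one of the tag's direct children, not whether it is a substring of the markup, so no keyword ever matches. Converting explicitly with xs in str(s) makes the substring test work as expected.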
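
To see the difference without hitting the wiki, here is a minimal sketch that runs the same filter on a small inline table. The HTML snippet and the names cells, unfiltered, by_markup and by_text are made up for illustration and are not taken from the real page. It also shows get_text() as one possible alternative to str(), on the assumption that you only care about the visible cell text and not about attributes such as title.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the wiki table (not the real page content).
html = """
<table>
  <tr><td><a href="/AK-74" title="AK-74">AK-74</a></td></tr>
  <tr><td><a href="/File:ak.png" title="image">File:ak.png</a></td></tr>
  <tr><td>5.45x39mm</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
cells = soup.find_all("td")  # a list of bs4 Tag objects, not strings
dirt = ["mm", "predecessor", "File", "image"]

# `xs in cell` asks whether xs is one of the Tag's direct children,
# so no keyword matches and nothing is filtered out.
unfiltered = [c for c in cells if not any(xs in c for xs in dirt)]

# str(cell) is the cell's markup, so `in` becomes a substring test.
by_markup = [c for c in cells if not any(xs in str(c) for xs in dirt)]

# get_text() tests only the visible text and ignores attribute values.
by_text = [
    c.get_text(strip=True)
    for c in cells
    if not any(xs in c.get_text() for xs in dirt)
]

print(len(cells), len(unfiltered))
print([c.get_text(strip=True) for c in by_markup])
print(by_text)
```

Which variant fits depends on where the unwanted keywords live: str() also catches keywords inside attributes such as href or title, while get_text() only looks at what a reader would see in the cell.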
