Filling in a placeholder value with BeautifulSoup when a tag is empty

Published 2024-04-25 13:42:24


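I am trying to parse out the contents of all td tags of a certain class on a web page, but I would like some kind of placeholder value even when the tag itself has no content. For example, the HTML contains td tags like these: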

<td class="odds bdevtt moneylineodds " cfg="">+134</td>
<td class="odds bdevtt moneylineodds " cfg=""></td>
<td class="odds bdevtt moneylineodds " cfg="">-140</td>

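I am trying to get a list like ['+134', '-', '-140'] as output, so that the number of entries in the list equals the number of matching tags, with '-' as a placeholder indicating that a tag was empty. However, the following only gives me ['+134', '-140']: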

soup.find_all('td', attrs={'class': 'odds bdevtt moneylineodds '})

3 answers

One possible solution is to use the or operator:

out = [td.get_text(strip=True) or '-' for td in soup.select('td.odds.bdevtt.moneylineodds')]
print(out)

Prints:

['+134', '-', '-140']
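
The or idiom works because an empty string is falsy in Python, so `'' or '-'` evaluates to `'-'`. The sketch below uses plain strings rather than parsed tags; the list `texts` is a made-up stand-in for the text of each cell, and it includes a whitespace-only entry to show why stripping first (as `get_text(strip=True)` does) can matter.

```python
# Hypothetical stand-in for the text extracted from each <td>;
# the third entry is whitespace-only, which is still truthy.
texts = ['+134', '', '  ', '-140']

# `t or '-'` keeps t when it is non-empty, otherwise falls back to '-'.
filled_raw = [t or '-' for t in texts]

# Stripping first turns whitespace-only text into '' so it is replaced too.
filled_stripped = [t.strip() or '-' for t in texts]

print(filled_raw)
print(filled_stripped)
```

If the cells may contain only whitespace or newlines, the stripped variant is probably the safer choice.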

Some quick benchmarks:

txt = '''<td class="odds bdevtt moneylineodds " cfg="">+134</td>
<td class="odds bdevtt moneylineodds " cfg=""></td>
<td class="odds bdevtt moneylineodds " cfg="">-140</td>'''

from bs4 import BeautifulSoup
from timeit import timeit

soup = BeautifulSoup(txt, 'html.parser')

def using_or():
    return [td.get_text(strip=True) or '-' for td in soup.select('td.odds.bdevtt.moneylineodds')]

def using_if_else_1():
    return [td.text if td.text else '-' for td in soup.select('td.odds.bdevtt.moneylineodds')]

def using_if_else_2():
    return [t if (t := td.get_text(strip=True)) else '-' for td in soup.select('td.odds.bdevtt.moneylineodds')]


t1 = timeit(lambda: using_or(), number=10_000)
t2 = timeit(lambda: using_if_else_1(), number=10_000)
t3 = timeit(lambda: using_if_else_2(), number=10_000)

print(t1)
print(t2)
print(t3)

This prints:

0.7735823660041206
0.8084569670027122
0.776867889042478

It looks like the three solutions perform about the same.

from bs4 import BeautifulSoup

html = """
<td class="odds bdevtt moneylineodds " cfg="">+134</td>
<td class="odds bdevtt moneylineodds " cfg=""></td>
<td class="odds bdevtt moneylineodds " cfg="">-140</td>
"""
soup = BeautifulSoup(html,"html.parser")
# No trailing space in the class string: the parsed class value does not keep it.
values = [i.text if i.text != "" else "-" for i in soup.find_all('td', attrs={'class': 'odds bdevtt moneylineodds'})]
print(values)

# output: ['+134', '-', '-140']

Remove the trailing space from the value of the class attribute in your search and you will get the expected result.

Code:

for elm in soup.find_all('td', attrs={'class': 'odds bdevtt moneylineodds'}):
  print(elm.text)

Output:

+134

-140

The reason becomes clear when you run the following code:

html = """
<td class="odds bdevtt moneylineodds " cfg="">+134</td>
<td class="odds bdevtt moneylineodds " cfg=""></td>
<td class="odds bdevtt moneylineodds " cfg="">-140</td>
"""
soup = BeautifulSoup(html,"html.parser")   # class is split into individual names, so the trailing space is dropped
print(soup)

Output:

<td cfg="" class="odds bdevtt moneylineodds">+134</td>
<td cfg="" class="odds bdevtt moneylineodds"></td>
<td cfg="" class="odds bdevtt moneylineodds">-140</td>
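
To see why the trailing space disappears, it may help to inspect the parsed attribute directly. Beautiful Soup treats class as a multi-valued attribute and stores it as a list of names. The sketch below is only an illustration with a single cell; the variable names are made up, and the search with the trailing space is printed without a stated result, since per the answer above it is not expected to match.

```python
from bs4 import BeautifulSoup

html = '<td class="odds bdevtt moneylineodds " cfg="">+134</td>'
soup = BeautifulSoup(html, "html.parser")
td = soup.find("td")

# The class attribute is stored as a list of names, not as the raw string.
classes = td["class"]
print(classes)

# Searching with the exact string, with and without the trailing space.
with_space = soup.find_all("td", attrs={"class": "odds bdevtt moneylineodds "})
without_space = soup.find_all("td", attrs={"class": "odds bdevtt moneylineodds"})
print(len(with_space), len(without_space))

# A CSS selector matches on the individual class names, so spacing
# and ordering in the source attribute do not matter.
by_selector = soup.select("td.odds.bdevtt.moneylineodds")
print(len(by_selector))
```

Using `select` with chained class names avoids depending on how the class string happens to be written in the page.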
