Problem extracting text from multiple tags while ignoring child tags

Posted 2024-06-02 08:28:13


I have the following sample HTML:

soup=BeautifulSoup('''<ul>
 <li class="item">
 <span class="letter">A. </span>
Text I want </li>
 <li class="item">
 <span class="letter">B.</span>                           
Second text I want</li></ul>''')

I'm trying to extract "Text I want" and "Second text I want" while ignoring the span tags. What I've done so far:

soup.li.find_all(text=True,recursive=False)

This returns ['\n', '\nText I want ']

If I try:

for s in soup.ul:
    print(s.find(text=True,recursive=False))

I get an error:

TypeError: find() takes no keyword arguments
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-93-f253cd430e83> in <module>
      1 for s in soup.ul:
----> 2     print(s.find(text=True,recursive=False))

TypeError: find() takes no keyword arguments

Thanks for your help.


1 Answer
User
#1 · Posted 2024-06-02 08:28:13

You can use a list comprehension to extract the text:

from bs4 import BeautifulSoup

soup = BeautifulSoup(
    """<ul>
 <li class="item">
 <span class="letter">A. </span>
Text I want </li>
 <li class="item">
 <span class="letter">B.</span>                           
Second text I want</li>""",
    "html.parser",
)

texts = [
    txt
    for li in soup.select("li.item")
    for t in li.find_all(text=True, recursive=False)
    if (txt := t.strip())
]
print(texts)

Prints:

['Text I want', 'Second text I want']

Or remove the <span> elements first, then get the text:

for span in soup.select("span"):
    span.extract()

texts = [li.get_text(strip=True) for li in soup.select("li.item")]
print(texts)

Prints:

['Text I want', 'Second text I want']
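
Note that span.extract() modifies soup in place, so the <span> elements are gone afterwards. If the original tree has to stay intact, one possible variation is to strip the spans from a copy of each <li> instead. The following is only a sketch under that assumption: text_without_spans is a hypothetical helper name, not part of BeautifulSoup, and it relies on copy.copy() producing a detached copy of a Tag.

```python
import copy

from bs4 import BeautifulSoup

html = """<ul>
 <li class="item">
 <span class="letter">A. </span>
Text I want </li>
 <li class="item">
 <span class="letter">B.</span>
Second text I want</li></ul>"""

soup = BeautifulSoup(html, "html.parser")


def text_without_spans(li):
    """Hypothetical helper: text of `li` with its <span> descendants left out."""
    li_copy = copy.copy(li)  # detached copy, so the original tree is untouched
    for span in li_copy.find_all("span"):
        span.decompose()  # remove the span from the copy only
    return li_copy.get_text(strip=True)


texts = [text_without_spans(li) for li in soup.find_all("li", class_="item")]
print(texts)

# The spans are still present in the original soup.
print(len(soup.find_all("span")))
```

This should give the same list as above while leaving soup usable for further queries, at the cost of copying each <li>.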

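Or, starting again from the freshly parsed soup (the previous snippet removes the spans), find each <span> and then call .find_next_sibling(text=True):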

texts = [
    # Each selected element is a <span>; its next text sibling is the wanted text.
    span.find_next_sibling(text=True).strip()
    for span in soup.select("li.item span")
]
print(texts)

Prints:

['Text I want', 'Second text I want']
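
As for the TypeError in the question: as far as I can tell, iterating over soup.ul yields all of its direct children, and those include the whitespace strings between the <li> tags. These are NavigableString objects, which subclass str, so calling .find(...) on them most likely reaches str.find, which accepts no keyword arguments. A minimal sketch of the same loop that skips non-tag children is shown below. It uses string=True, the newer name for the text=True argument, which assumes a reasonably recent bs4 version.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """<ul>
 <li class="item">
 <span class="letter">A. </span>
Text I want </li>
 <li class="item">
 <span class="letter">B.</span>
Second text I want</li></ul>"""

soup = BeautifulSoup(html, "html.parser")

texts = []
for child in soup.ul:
    # Direct children of <ul> include whitespace strings, not only <li> tags;
    # only Tag objects provide BeautifulSoup's find()/find_all().
    if not isinstance(child, Tag):
        continue
    # Join the <li>'s own text nodes, ignoring text nested inside child tags.
    own_text = "".join(child.find_all(string=True, recursive=False)).strip()
    texts.append(own_text)

print(texts)
```

Iterating with soup.select("li.item") or soup.find_all("li"), as in the snippets above, avoids the problem altogether because only tags are returned.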
