BeautifulSoup: getting the text after an <em> tag

Published 2024-05-29 10:42:48


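I am using the <strong> tag to identify the header. However, whenever I try to grab the rest of the content to store it as "info", I only get back <em>Parade </em> rather than everything else in the <p> tag.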

Here is my code:

<p><strong>High School Honors: </strong><em>Parade </em>All-American; <em>Chicago Sun-Times </em>Illinois Player of the Year honors; rushed for 2,100 yards and 31 TDs as a senior; led team to 14-0 record and Class 4A State Championship as a junior with 1,820 yards and 26 TDs; also lettered in baseball.</p>

for strong_tag in soup.find_all('strong'):
    headers = strong_tag.text.replace(':', '').replace('\xa0', ' ').strip()

    info = strong_tag.next_sibling

    headerList.append(headers)
    infoList.append(info)

print(headerList)
print(infoList)

2 Answers

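I think this is what you are looking for. It finds the parent p element, converts that soup object to a string, removes the strong element from the string, and then parses the string back into a soup object.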

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><strong>High School Honors: </strong><em>Parade </em>All-American; <em>Chicago Sun-Times </em>Illinois Player of the Year honors; rushed for 2,100 yards and 31 TDs as a senior; led team to 14-0 record and Class 4A State Championship as a junior with 1,820 yards and 26 TDs; also lettered in baseball.</p>", 'html.parser')
headerList = []
infoList = []

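for strong_tag in soup.find_all('strong'):
    # Locate the enclosing <p> element.
    parent = strong_tag.find_parent('p')
    # Serialize it and cut the serialized <strong> element out of the string.
    content = str(parent).replace(str(strong_tag), '')
    # Re-parse what remains so it is a soup object again.
    souped_content = BeautifulSoup(content, 'html.parser')
    infoList.append(souped_content)
    headerList.append(strong_tag)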

print(headerList)
print(infoList)

This produces the following result:

[<strong>High School Honors: </strong>]
[<p><em>Parade </em>All-American; <em>Chicago Sun-Times </em>Illinois Player of the Year honors; rushed for 2,100 yards and 31 TDs as a senior; led team to 14-0 record and Class 4A State Championship as a junior with 1,820 yards and 26 TDs; also lettered in baseball.</p>]
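The string round-trip above can also be avoided. The following is only a sketch of one alternative, not part of the original answer: it copies the parent <p> with copy.copy(), removes the <strong> element from the copy with decompose(), and reads the remaining text. It assumes each <p> contains exactly one <strong>; the shortened sample markup and the names header_list, info_list and p_copy are chosen here for illustration.

```python
import copy

from bs4 import BeautifulSoup

# Shortened sample markup, used here only for illustration.
html = (
    "<p><strong>High School Honors: </strong><em>Parade </em>"
    "All-American; also lettered in baseball.</p>"
)
soup = BeautifulSoup(html, "html.parser")

header_list = []
info_list = []

for strong_tag in soup.find_all("strong"):
    header_list.append(strong_tag.get_text().replace(":", "").strip())

    # Work on a detached copy so the original tree is left untouched.
    p_copy = copy.copy(strong_tag.find_parent("p"))
    # Assumption: one <strong> per <p>, so .strong is the header element.
    p_copy.strong.decompose()
    info_list.append(p_copy.get_text().strip())

print(header_list)
print(info_list)
```

Because only the copy is modified, soup.p still contains its <strong> element afterwards.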

Edit

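You can also use contents, but you then have to iterate over the children and handle the NavigableString objects separately from the tags.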

info = ''
for text in soup.p.contents[1:]:
    if isinstance(text, NavigableString):
        info += text
    else:
        info += text.get_text()

Example

from bs4 import BeautifulSoup, NavigableString

html='''
<p><strong>High School Honors: </strong><em>Parade </em>All-American; <em>Chicago Sun-Times </em>Illinois Player of the Year honors; rushed for 2,100 yards and 31 TDs as a senior; led team to 14-0 record and Class 4A State Championship as a junior with 1,820 yards and 26 TDs; also lettered in baseball.</p>
'''
soup = BeautifulSoup(html,'html.parser')

headers = soup.p.strong.get_text().replace(':', '')

info = ''
for text in soup.p.contents[1:]:
    if isinstance(text, NavigableString):
        info += text
    else:
        info += text.get_text()
print(headers)
print(info)

Output

High School Honors 
Parade All-American; Chicago Sun-Times Illinois Player of the Year honors; rushed for 2,100 yards and 31 TDs as a senior; led team to 14-0 record and Class 4A State Championship as a junior with 1,820 yards and 26 TDs; also lettered in baseball.
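Slicing contents[1:] relies on the <strong> being the first child of the <p>. As a sketch that drops that assumption (not taken from the original answer), the loop can start from the <strong> tag itself and walk its next_siblings. The question's next_sibling returns only the single node that follows, which is why it yielded just <em>Parade </em>. The shortened sample markup and the names header_list, info_list and parts are illustrative.

```python
from bs4 import BeautifulSoup, NavigableString

# Shortened sample markup, used here only for illustration.
html = (
    "<p><strong>High School Honors: </strong><em>Parade </em>All-American; "
    "<em>Chicago Sun-Times </em>Illinois Player of the Year honors.</p>"
)
soup = BeautifulSoup(html, "html.parser")

header_list = []
info_list = []

for strong_tag in soup.find_all("strong"):
    header_list.append(strong_tag.get_text().replace(":", "").strip())

    # next_siblings yields every node after the <strong> inside the same parent.
    parts = []
    for sibling in strong_tag.next_siblings:
        if isinstance(sibling, NavigableString):
            parts.append(str(sibling))        # bare text node
        else:
            parts.append(sibling.get_text())  # nested tag such as <em>
    info_list.append("".join(parts).strip())

print(header_list)
print(info_list)
```

This also carries over to pages with several <p> blocks, since each <strong> only collects the siblings inside its own parent.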

Using get_text() and split()

headers = soup.p.get_text(strip=True).split(':')[0]
info = soup.p.get_text().split(':')[1].strip()

Example

from bs4 import BeautifulSoup

html='''
<p><strong>High School Honors: </strong><em>Parade </em>All-American; <em>Chicago Sun-Times </em>Illinois Player of the Year honors; rushed for 2,100 yards and 31 TDs as a senior; led team to 14-0 record and Class 4A State Championship as a junior with 1,820 yards and 26 TDs; also lettered in baseball.</p>
'''
soup = BeautifulSoup(html,'lxml')

headers = soup.p.get_text(strip=True).split(':')[0]
info = soup.p.get_text().split(':')[1].strip()

print(headers)
print(info)

Output

High School Honors
Parade All-American; Chicago Sun-Times Illinois Player of the Year honors; rushed for 2,100 yards and 31 TDs as a senior; led team to 14-0 record and Class 4A State Championship as a junior with 1,820 yards and 26 TDs; also lettered in baseball.
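One caveat with split(':')[1] is that it keeps only the text up to the next colon, so it truncates the info whenever the body itself contains a colon. A small sketch of a more defensive variant, using str.partition to split on the first colon only, is shown below. The sample markup is made up for this illustration and does not come from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup whose body text contains a second colon.
html = (
    "<p><strong>Career Notes: </strong>Record: 14-0 as a junior; "
    "<em>Parade </em>All-American.</p>"
)
soup = BeautifulSoup(html, "html.parser")

text = soup.p.get_text()

# split(':')[1] stops at the second colon and loses the rest.
naive_info = text.split(":")[1].strip()

# partition(':') splits once, at the first colon only.
header, _, info = text.partition(":")
header = header.strip()
info = info.strip()

print(naive_info)
print(header)
print(info)
```

text.split(':', 1) behaves the same way as partition here; either form keeps everything after the first colon.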
