BeautifulSoup: getting the text after an <em> tag

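2 answers

User

#1 · edited 2024-05-29 10:42:48

I think this is what you are looking for. It finds the parent p element, converts that element to a string, removes the strong element from the string, and then parses the result back into a soup object.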

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><strong>High School Honors: </strong><em>Parade </em>All-American; <em>Chicago Sun-Times </em>Illinois Player of the Year honors; rushed for 2,100 yards and 31 TDs as a senior; led team to 14-0 record and Class 4A State Championship as a junior with 1,820 yards and 26 TDs; also lettered in baseball.</p>", 'html.parser')
headerList = []
infoList = []

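for strong_tag in soup.find_all('strong'):
    # Nearest enclosing <p> of this <strong> header.
    parent = strong_tag.find_parent('p')
    # Drop the <strong> markup from the serialized <p>, then re-parse what is left.
    content = str(parent).replace(str(strong_tag), '')
    souped_content = BeautifulSoup(content, 'html.parser')
    infoList.append(souped_content)
    headerList.append(strong_tag)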

print(headerList)
print(infoList)

This produces the following output:

[<strong>High School Honors: </strong>]
[<p><em>Parade </em>All-American; <em>Chicago Sun-Times </em>Illinois Player of the Year honors; rushed for 2,100 yards and 31 TDs as a senior; led team to 14-0 record and Class 4A State Championship as a junior with 1,820 yards and 26 TDs; also lettered in baseball.</p>]
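If you only need the text and not a soup object, the string round trip can be skipped. Here is a minimal sketch of that variation, not part of the answer above: it detaches the strong element from the tree with extract() and then reads the remaining text of the parent. The shortened HTML and the names headers and infos are just for illustration, and note that extract() modifies the soup in place.

```python
from bs4 import BeautifulSoup

# Shortened sample markup, same shape as in the question.
html = ("<p><strong>High School Honors: </strong><em>Parade </em>All-American; "
        "<em>Chicago Sun-Times </em>Illinois Player of the Year honors.</p>")
soup = BeautifulSoup(html, "html.parser")

headers = []
infos = []
for strong_tag in soup.find_all("strong"):
    parent = strong_tag.find_parent("p")
    # Detach <strong> from the tree; the tag object itself stays usable.
    strong_tag.extract()
    # Header text without surrounding whitespace or the trailing colon.
    headers.append(strong_tag.get_text().strip().rstrip(":"))
    # Whatever text is left in the <p>, including the text inside <em>.
    infos.append(parent.get_text().strip())

print(headers)
print(infos)
```

Because find_all() returns an already built list, removing tags inside the loop does not disturb the iteration.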

User

#2 · edited 2024-05-29 10:42:48

You can also use contents, but then you have to iterate over every child node and handle NavigableStrings and tags separately:

info = ''
for text in soup.p.contents[1:]:
    if isinstance(text, NavigableString):
        info += text
    else:
        info += text.get_text()

Example

from bs4 import BeautifulSoup, NavigableString

html='''
<p><strong>High School Honors: </strong><em>Parade </em>All-American; <em>Chicago Sun-Times </em>Illinois Player of the Year honors; rushed for 2,100 yards and 31 TDs as a senior; led team to 14-0 record and Class 4A State Championship as a junior with 1,820 yards and 26 TDs; also lettered in baseball.</p>
'''
soup = BeautifulSoup(html,'html.parser')

headers = soup.p.strong.get_text().replace(':', '')

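info = ''
# contents[0] is the <strong> header, so start from the second child.
for text in soup.p.contents[1:]:
    if isinstance(text, NavigableString):
        # Bare text node: append it as is.
        info += text
    else:
        # Tag such as <em>: append the text inside it.
        info += text.get_text()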
print(headers)
print(info)

Output

High School Honors 
Parade All-American; Chicago Sun-Times Illinois Player of the Year honors; rushed for 2,100 yards and 31 TDs as a senior; led team to 14-0 record and Class 4A State Championship as a junior with 1,820 yards and 26 TDs; also lettered in baseball.

Using get_text() and split():

headers = soup.p.get_text(strip=True).split(':')[0]
info = soup.p.get_text().split(':')[1].strip()

Example

from bs4 import BeautifulSoup

html='''
<p><strong>High School Honors: </strong><em>Parade </em>All-American; <em>Chicago Sun-Times </em>Illinois Player of the Year honors; rushed for 2,100 yards and 31 TDs as a senior; led team to 14-0 record and Class 4A State Championship as a junior with 1,820 yards and 26 TDs; also lettered in baseball.</p>
'''
# 'lxml' needs the lxml package installed; 'html.parser' also works for this snippet.
soup = BeautifulSoup(html, 'lxml')

headers = soup.p.get_text(strip=True).split(':')[0]
info = soup.p.get_text().split(':')[1].strip()

print(headers)
print(info)

Output

High School Honors
Parade All-American; Chicago Sun-Times Illinois Player of the Year honors; rushed for 2,100 yards and 31 TDs as a senior; led team to 14-0 record and Class 4A State Championship as a junior with 1,820 yards and 26 TDs; also lettered in baseball.
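One caveat with split(':')[1]: if the info part itself contains a colon, everything after that second colon is dropped. A possible workaround, sketched here as an assumption rather than as part of the answer, is to split on the first colon only with str.partition(). The helper name split_header and the sample string (including the "final score" part) are made up for illustration; in practice the input would be the string returned by soup.p.get_text().

```python
def split_header(text):
    """Split 'Header: info' on the first colon only (hypothetical helper)."""
    header, sep, info = text.partition(":")
    if not sep:
        # No colon at all: treat the whole string as info.
        return None, text.strip()
    return header.strip(), info.strip()


# Made-up sample with a second colon in the info part.
text = "High School Honors: led team to 14-0 record; final score: 28-7"
header, info = split_header(text)
print(header)
print(info)
```

With this input, text.split(':')[1] would stop at "final score", while partition() keeps the rest of the string.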
