Python BeautifulSoup: removing self-closing tags

 Doctor of Philosophy ( Software Engineering ), Universiti Teknologi Petronas Master of Science (Computer Science), Government College University Lahore Master of Science ( Computer Science ), University of Agriculture Faisalabad Bachelor of Science (Hons) ( Agriculture ),University of Agriculture Faisalabad 

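3 answers

User

Answer 1 · edited 2024-04-16 15:23:11

Because none of these <br> tags has a closing counterpart, Beautiful Soup adds the closing tags automatically, which produces the following HTML: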

In [23]: soup = BeautifulSoup(html)

In [24]: soup.br
Out[24]: 
<br>
Master of Science (Computer Science), Government College University Lahore
<br>
Master of Science ( Computer Science ), University of Agriculture Faisalabad
<br>
Bachelor of Science (Hons) ( Agriculture ),University of Agriculture Faisalabad
<br/></br></br></br>

When you call Tag.extract on the first <br> tag, you remove all of its descendants along with the strings they contain.
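The effect of Tag.extract on a tag with nested content can be sketched as follows. This is a minimal illustration, not the session from the original answer: the markup is hypothetical, and nested <div> elements stand in for the nested <br> elements so that the result does not depend on how a particular parser treats <br>.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: nested <div> elements stand in for the nested <br>
# elements shown in the parse above.
html = (
    '<span class="qualification">Doctor of Philosophy'
    '<div>Master of Science<div>Bachelor of Science</div></div>'
    '</span>'
)
soup = BeautifulSoup(html, 'html.parser')

# extract() detaches the tag, together with all of its descendants,
# from the tree and returns the detached tag.
removed = soup.div.extract()

remaining_text = soup.span.get_text()
removed_text = list(removed.stripped_strings)
print(remaining_text)
print(removed_text)
```

Only the string that preceded the extracted tag is left in the span; everything nested inside the tag leaves the tree with it.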

It looks like you just want to extract all of the text from the span element. If so, there is no need to remove anything:

In [28]: soup.span.text
Out[28]: '\nDoctor of Philosophy ( Software Engineering ), Universiti Teknologi Petronas\n\nMaster of Science (Computer Science), Government College University Lahore\n\nMaster of Science ( Computer Science ), University of Agriculture Faisalabad\n\nBachelor of Science (Hons) ( Agriculture ),University of Agriculture Faisalabad\n'

The Tag.text property extracts all of the strings from the given tag.
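If the surrounding newlines are unwanted, get_text accepts separator and strip arguments. The sketch below assumes a shortened, hypothetical version of the question's markup and the built-in html.parser.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical stand-in for the question's markup.
html = (
    '<span class="qualification">\n'
    'Doctor of Philosophy\n<br>\n'
    'Master of Science\n<br>\n'
    'Bachelor of Science\n<br>\n'
    '</span>'
)
soup = BeautifulSoup(html, 'html.parser')

# strip=True trims each string and skips whitespace-only ones;
# separator is placed between the strings that remain.
text = soup.span.get_text(separator='\n', strip=True)
print(text)
```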

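User

Answer 2 · edited 2024-04-16 15:23:11

Using unwrap should work:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html)  # html is the markup from the question
for match in soup.find_all('br'):
    match.unwrap()  # replace each <br> with its own contents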
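A self-contained version of this approach, as a sketch under a few assumptions: the markup is a shortened, hypothetical stand-in for the question's HTML, and html.parser is used, which treats <br> as an empty element.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the question's markup.
html = '<span class="qualification">PhD<br>MSc<br>BSc<br></span>'
soup = BeautifulSoup(html, 'html.parser')

# unwrap() replaces a tag with whatever it contains. A <br> parsed as an
# empty element contains nothing, so the tag simply disappears.
for br in soup.find_all('br'):
    br.unwrap()

remaining_br = soup.find_all('br')
serialized = str(soup.span)
print(serialized)
```

Note that once the <br> tags are gone, nothing separates the strings in the serialized markup, so this suits cases where the line breaks themselves are not needed.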

User

Answer 3 · edited 2024-04-16 15:23:11

Here is one way to do it:

for link2 in soup.findAll('span',{'class':'qualification'}):
    for s in link2.stripped_strings:
        print(s)

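There is no need to remove the <br> tags unless you need them gone for later processing. Here, link2.stripped_strings is a generator that yields each string in the tag with leading and trailing whitespace stripped. The print loop can also be written more concisely.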
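One possible compact form joins the generator's output with newlines. This is an illustrative sketch and not necessarily the form the answer originally showed; the markup is a shortened, hypothetical stand-in for the question's HTML.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the question's markup.
html = '<span class="qualification">\nPhD\n<br>\nMSc\n<br>\nBSc\n<br>\n</span>'
soup = BeautifulSoup(html, 'html.parser')

for link2 in soup.find_all('span', {'class': 'qualification'}):
    # stripped_strings yields each non-empty string with surrounding
    # whitespace removed; collect them once and print them in one call.
    lines = list(link2.stripped_strings)
    print('\n'.join(lines))
```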
