Using BeautifulSoup to remove specific content from a result

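import urllib2                 # Python 2 standard library
from bs4 import BeautifulSoup  # assumption: the bs4 package, since the answers call get_text()

def get_description(link):
    # Download the page and read its HTML
    redditFile = urllib2.urlopen(link)
    redditHtml = redditFile.read()
    redditFile.close()
    # Parse the HTML and return the text of the description DIV
    soup = BeautifulSoup(redditHtml)
    desc = soup.find('div', attrs={'class': 'op_gd14 FL'}).text
    return desc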

<div class="op_gd14 FL"> <p><span class="bigT">P</span>restige Estates Projects Ltd has informed BSE that the 18th Annual General Meeting (AGM) of the Company will be held on September 30, 2015.Source : BSE<br><br> <a href="../../company-notices/nestleindia/notices/PEP02">Read all announcements in Prestige Estate</a> </p><p> </p> </div>

2 Answers

User

Answer 1 · edited 2024-04-20 00:50:25

You can use extract() to remove unwanted tags from the find() result:

descItem = soup.find('div', attrs={'class': 'op_gd14 FL'})  # get the DIV
for s in descItem('a'):                                     # remove <a> tags
    s.extract()
return descItem.get_text()                                  # return the text
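The fragment above can be tried without a network request. The following is a minimal sketch that runs the same extract() approach on the sample DIV from the question; it assumes the bs4 package with the built-in "html.parser" backend, and the names SAMPLE_HTML and description_without_links are hypothetical, introduced only for illustration. The whitespace normalization at the end is an extra step that is not part of the original answer.

```python
from bs4 import BeautifulSoup

# The sample markup from the question, used here instead of a live page.
SAMPLE_HTML = (
    '<div class="op_gd14 FL"> <p><span class="bigT">P</span>restige Estates '
    'Projects Ltd has informed BSE that the 18th Annual General Meeting (AGM) '
    'of the Company will be held on September 30, 2015.Source : BSE<br><br> '
    '<a href="../../company-notices/nestleindia/notices/PEP02">Read all '
    'announcements in Prestige Estate</a> </p><p> </p> </div>'
)


def description_without_links(html):
    """Return the text of the description DIV with all <a> tags removed."""
    soup = BeautifulSoup(html, "html.parser")
    desc_item = soup.find("div", attrs={"class": "op_gd14 FL"})
    if desc_item is None:
        # The DIV is missing, so there is nothing to clean.
        return None
    # Detach every <a> element (and its text) from the tree.
    for anchor in desc_item.find_all("a"):
        anchor.extract()
    # Collapse runs of whitespace left behind by the removed markup.
    return " ".join(desc_item.get_text().split())


text = description_without_links(SAMPLE_HTML)
print(text)
```

Checking for None before touching the result avoids an AttributeError on pages where the DIV is absent, which the one-line version in the question would raise.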

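User

Answer 2 · edited 2024-04-20 00:50:25

Just change the last line and import the re module. Judging by the output below, this assumes desc holds the DIV's HTML as a string (for example str(...) of the find() result) rather than its .text, since .text contains no tags for the pattern to match.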

...
return re.sub(r'<a\b.*?</a>', '', desc)  # non-greedy, so each link is removed separately

Output:

'<div class="op_gd14 FL">\n    <p><span class="bigT">P</span>restige Estates Projects Ltd has informed BSE that the 18th Annual General Meeting (AGM) of the Company will be held on September 30, 2015.Source : BSE<br><br>  \n  </p><p>
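As a rough illustration of the regular-expression approach, the sketch below strips anchors from an HTML string using only the standard library. The names SAMPLE_HTML, ANCHOR_RE, and strip_anchors are hypothetical. The pattern is a non-greedy variant with a word boundary, an assumption made here so that two links in one string are removed separately and tags such as <abbr> are left alone; it is not a general HTML parser and will not cope with nested or malformed markup.

```python
import re

# The sample markup from the question, as a plain string.
SAMPLE_HTML = (
    '<div class="op_gd14 FL"> <p><span class="bigT">P</span>restige Estates '
    'Projects Ltd has informed BSE that the 18th Annual General Meeting (AGM) '
    'of the Company will be held on September 30, 2015.Source : BSE<br><br> '
    '<a href="../../company-notices/nestleindia/notices/PEP02">Read all '
    'announcements in Prestige Estate</a> </p><p> </p> </div>'
)

# \b keeps "<a" from matching tags like <abbr>; .*? stops at the first </a>;
# DOTALL lets a link span several lines.
ANCHOR_RE = re.compile(r"<a\b.*?</a>", re.DOTALL | re.IGNORECASE)


def strip_anchors(html):
    """Remove every <a>...</a> element from an HTML string."""
    return ANCHOR_RE.sub("", html)


cleaned = strip_anchors(SAMPLE_HTML)
print(cleaned)
```

The result still contains the remaining tags (div, p, span, br), so this route suits cases where the HTML itself is wanted; for plain text, the parser-based answer is usually the simpler choice.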
