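I want to remove the header and footer from an HTML page. The header and footer show up as the last two lines of each div. Can someone tell me how to extract everything from a div except its last two lines, given input like this: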
<div class="page"><p />
<p></p>
<p>First line required
</p>
<p>Second line required
</p>
<p>Third line required
</p>
<p>Line 1 not required
</p>
<p>Line 2 not required
</p>
<p></p>
</div>
<div class="page"><p />
<p>line required 1
</p>
<p></p>
<p>line required 2
</p>
<p>line required 3
</p>
<p></p>
<p>line required 4
</p>
<p>line required 5
</p>
<p>line required 6
</p>
<p>Line 1 not required
</p>
<p>Line 2 not required
<p />
</div>
The existing code is as follows:
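from bs4 import BeautifulSoup

soup = BeautifulSoup(file_content, 'html.parser')
for num, page in enumerate(soup.select('.page'), 1):
    # Collapse each page's text into one line, stripping whitespace around every piece
    content = page.get_text(strip=True, separator=' ').replace("\n", " ")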
It seems to achieve the expected result.
Notes: the paragraph <p>Line 2 not required is never closed, and the self-closing <p /> tag does not seem to be a good idea (see: Should I use the <p /> tag in markup?). Best regards.
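One possible way to drop the last two lines of each div is sketched below. It is not taken from the original thread. It assumes that a "line" means a non-empty line of text inside the div, so empty <p></p> and <p /> elements are ignored and the unclosed <p> does not matter. The helper name `page_lines`, its `skip_last` parameter, and the shortened `file_content` sample are hypothetical and used for illustration only.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the page in the question (assumption: same structure,
# including the empty paragraphs and the unclosed <p> in the second div).
file_content = """
<div class="page"><p />
<p></p>
<p>First line required
</p>
<p>Second line required
</p>
<p>Line 1 not required
</p>
<p>Line 2 not required
</p>
<p></p>
</div>
<div class="page"><p />
<p>line required 1
</p>
<p></p>
<p>line required 2
</p>
<p>Line 1 not required
</p>
<p>Line 2 not required
<p />
</div>
"""


def page_lines(page, skip_last=2):
    """Return the non-empty text lines of one page, minus the last `skip_last`."""
    # Work on the text rather than on <p> elements, so malformed or nested
    # paragraphs cannot shift the count.
    text = page.get_text(separator="\n")
    lines = [line.strip() for line in text.splitlines()]
    lines = [line for line in lines if line]
    return lines[:-skip_last] if skip_last else lines


soup = BeautifulSoup(file_content, "html.parser")
# One string per page, with the trailing header/footer lines removed.
pages = [" ".join(page_lines(page)) for page in soup.select(".page")]
for num, content in enumerate(pages, 1):
    print(num, content)
```

Counting text lines instead of <p> tags is a deliberate choice here: the sample markup contains empty paragraphs and an unclosed tag, so slicing a list of <p> elements could remove the wrong items. If the header and footer ever span more than one text line each, `skip_last` would need adjusting.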
Latest answer:
Output: