I want to extract all (in this case two) hashtags from a web page.
<html>
<head>
</head>
<body>
<div class="predefinition">
<p class="part1">
<span class="part1-head">Entries:</span>
<a class="pr" href="/go_somewhere/">#hashA with space</a>,
<a class="pr" href="/go_somewhere/">#hashBwithoutspace</a>,
</p>
<span class="part2">Boundaries:</span>
<p>some boundary statement</p>
</div>
<div class="wrapper"> <!-- I only want to search here -->
<p class="part1">
<span class="part1-head">Entries:</span>
<a class="pr" href="/go_somewhere/">#hash1 with space</a>, <!-- I only want to find this -->
<a class="pr" href="/go_somewhere/">#hash2withoutspace</a>, <!-- and this -->
</p>
<span class="part2">Boundaries:</span>
<p>some other boundary statement</p>
</div>
</body>
</html>
But I am only interested in the hashtags in one branch (in this example, the wrapper div): "#hash1 with space" and "#hash2withoutspace". Right now my code looks like this:
from bs4 import BeautifulSoup
import io
import re
f = io.open("minimal.html", mode="r", encoding="utf-8")
contents = f.read()
soup = BeautifulSoup(contents, 'lxml')
mydivs = soup.findAll("a", {"class": "pr"})
for div in mydivs:
    print(re.findall(r'(?i)\#\w+', str(div)))
You can find the text of all a tags with class pr and then select the last two.
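A minimal sketch of that suggestion, not the answer's original code. It assumes a trimmed copy of the question's markup inlined as a string (the name HTML is hypothetical) and swaps in the built-in "html.parser" for lxml so nothing extra needs installing. The second lookup, which searches only inside the div with class wrapper, is an alternative added here and does not rely on the wanted links being the last two on the page.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup, inlined so the sketch is self-contained.
HTML = """
<html><body>
<div class="predefinition">
<p class="part1">
<a class="pr" href="/go_somewhere/">#hashA with space</a>,
<a class="pr" href="/go_somewhere/">#hashBwithoutspace</a>,
</p>
</div>
<div class="wrapper">
<p class="part1">
<a class="pr" href="/go_somewhere/">#hash1 with space</a>,
<a class="pr" href="/go_somewhere/">#hash2withoutspace</a>,
</p>
</div>
</body></html>
"""

# Assumption: "html.parser" instead of the question's "lxml" parser.
soup = BeautifulSoup(HTML, "html.parser")

# Suggested approach: collect every <a class="pr"> and keep the last two.
last_two = [a.get_text() for a in soup.find_all("a", class_="pr")[-2:]]

# Alternative (an assumption): search only inside <div class="wrapper">.
wrapper = soup.find("div", class_="wrapper")
wrapper_tags = [a.get_text() for a in wrapper.find_all("a", class_="pr")]

print(last_two)
print(wrapper_tags)
```

Using get_text() returns the full link text, including the part after the space in "#hash1 with space", which the \w+ pattern in the question would cut off at the first space.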