Scraping email addresses from websites with Python

-3 votes

1 answer

58 views

Asked 2025-04-13 02:53

In my Python code I use a regular expression to find email addresses:

soup = BeautifulSoup(driver.page_source, "html.parser")
text_email = soup.get_text()
emails1 = re.findall(r'([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6})', str(text_email))

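In about 90% of cases this code returns the email addresses correctly.

But here is an example where it returns a malformed address.

On the web page:

https://s7health.pl/kontakt/

we have a phone number, an email address, and some text: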

71 342 88 41
info@s7health.pl
Infolinia medyczna

The HTML source for that text is:

<a class="text-decoration-underline" href="tel:+48713428841">71 342 88 41</a><br /><a class="text-decoration-underline" href="mailto:info@s7health.pl">info@s7health.pl</a></div><style>.porto-u-3166.porto-u-heading{text-align:left}</style></div><div class="porto-u-heading  wpb_custom_95aa9a11c17ad45cfabaf210d84ee7cc porto-u-4257"><div class="porto-u-main-heading"><h3   style="font-weight:700;color:#0c6d70;font-size:1em;line-height:24px;">Infolinia medyczna</h3></div>

The email address my code returns is:

41info@s7health.plInfo

But the address it should return is:

info@s7health.pl

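Leaving aside the option of searching for the mailto prefix, which may not always be present, why do these extra characters end up in the email address, and how can I fix this?

Regards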

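Tags: regular expressions, automation scripts, data cleaning, information extraction, page source, web scraping, email extraction, format validation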

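1 Answer

The problem is not the regular expression but the text it is run against. soup.get_text() joins adjacent text nodes with no separator, so the end of the phone number, the address, and the heading that follows fuse into "41info@s7health.plInfolinia", and the pattern then matches "41info@s7health.plInfo". Running the same pattern against the raw HTML, where tags still separate the pieces, avoids this: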

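from bs4 import BeautifulSoup
import requests
import re

url = 'https://s7health.pl/kontakt/'

# Fetch the raw HTML of the contact page.
response = requests.get(url)

# Parsed tree, kept from the original answer; the regex search below does not use it.
soup = BeautifulSoup(response.content, 'html.parser')

# Search the raw HTML rather than soup.get_text(), so tags still separate the text.
email_regex = r'([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6})'
email_addresses = re.findall(email_regex, response.text)

for email in email_addresses:
    print(email)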

The result is:

info@s7health.pl
info@s7health.pl
ajax-loader@2x.gif
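If the visible text is still preferred over the raw HTML, one possible fix is to pass a separator to get_text(). The following is a minimal sketch under assumptions: the `html` string is a trimmed copy of the markup quoted in the question, and the names `EMAIL_RE`, `glued_matches`, and `clean_matches` are illustrative only.

```python
import re
from bs4 import BeautifulSoup

# Same pattern as in the question, without the capturing group.
EMAIL_RE = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6}')

# Trimmed-down copy of the markup quoted in the question (illustrative).
html = (
    '<div><a href="tel:+48713428841">71 342 88 41</a><br />'
    '<a href="mailto:info@s7health.pl">info@s7health.pl</a></div>'
    '<div><h3>Infolinia medyczna</h3></div>'
)

soup = BeautifulSoup(html, "html.parser")

# Default get_text(): text nodes are concatenated with nothing in between,
# so the phone number, the address, and the heading run together.
glued_matches = EMAIL_RE.findall(soup.get_text())

# With a separator, each text node is kept apart by a space.
clean_matches = EMAIL_RE.findall(soup.get_text(separator=" "))

print(glued_matches)  # ['41info@s7health.plInfo']
print(clean_matches)  # ['info@s7health.pl']
```

With Selenium, the same call should work on BeautifulSoup(driver.page_source, "html.parser"), since only the get_text() step changes.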

Answered 2025-04-13 by Python大师
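The output above also shows two side effects of searching raw HTML: the address appears twice (once in the href, once in the link text), and an image filename, ajax-loader@2x.gif, is matched as if it were an address. A rough sketch of a post-filter follows, assuming a hand-picked list of asset extensions; `ASSET_EXTENSIONS`, `extract_emails`, and the `/img/` path in the sample are hypothetical and not part of the original answer.

```python
import re

EMAIL_RE = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6}')

# Hypothetical blocklist: file extensions that look like a top-level domain
# to the pattern above but belong to image assets such as "name@2x.gif".
ASSET_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp")


def extract_emails(text):
    """Return unique email-like matches in order, skipping asset filenames."""
    seen = set()
    result = []
    for candidate in EMAIL_RE.findall(text):
        key = candidate.lower()
        # Drop retina-style image names and repeated addresses.
        if key.endswith(ASSET_EXTENSIONS) or key in seen:
            continue
        seen.add(key)
        result.append(candidate)
    return result


# Illustrative input modelled on the output shown above.
sample = (
    '<a href="mailto:info@s7health.pl">info@s7health.pl</a>'
    '<img src="/img/ajax-loader@2x.gif">'
)
print(extract_emails(sample))  # ['info@s7health.pl']
```

This is a heuristic, not validation: it only removes the kinds of false positives seen here, and the extension list would need adjusting for other sites.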
