Can't get rid of unwanted matches when scraping email addresses

import re
import requests

links = (
    'http://www.acupuncturetx.com',
    'http://www.hcmed.org',
    'http://www.drmindyboxer.com',
    'http://wendyrobinweir.com',
)

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"}

for link in links:
    r = requests.get(link, headers=headers)
    emails = re.findall(r"[\w\.-]+@[\w\.-]+", r.text)
    print(emails)

Current output:

['react@16.5.2', 'react-dom@16.5.2', 'bai@acupuncturetx.com', 'bai@acupuncturetx.com', 'bai@acupuncturetx.com', 'bai@acupuncturetx.com']
['hh-logo@2x.png', 'hh-logo@2x.png', 'hh-logo@2x.png', 'hh-logo@2x-300x47.png']
['leaflet@1.7.1']
['8b4e078a51d04e0e9efdf470027f0ec1@sentry.wixpress.com', 'requirejs-bolt@2.3.6', 'wendyrobin16@gmail.com', 'wendyrobin16@gmail.com', 'wendyrobin16@gmail.com', 'wendyrobin16@gmail.com', 'wendyrobin16@gmail.com', 'wendyrobin16@gmail.com', 'wixstores-client-cart-icon@1.797.0', 'wixstores-client-gallery@1.1634.0']

Expected output:

['bai@acupuncturetx.com', 'bai@acupuncturetx.com', 'bai@acupuncturetx.com', 'bai@acupuncturetx.com']
[]
[]
['wendyrobin16@gmail.com', 'wendyrobin16@gmail.com', 'wendyrobin16@gmail.com', 'wendyrobin16@gmail.com', 'wendyrobin16@gmail.com', 'wendyrobin16@gmail.com']

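3 answers

User

#1 · Edited 2024-04-19 09:12:01

Rather than trying to capture only email addresses, you can test everything you captured with the validate_email package (pip install validate_email) and keep only the valid addresses. The code could be some version of the following: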

from validate_email import validate_email

# Keep only the candidates that pass validation
emails = [x for x in list_of_potential_emails if validate_email(x)]

The package can also check with the corresponding server whether the email address (or the server) exists.

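User

#2 · Edited 2024-04-19 09:12:01

Picking up where you left off, you can use a simple checker to verify whether each match really is a valid email address.

First, we define the check function: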

def check(email):
    # Raw string avoids invalid-escape warnings for the regex
    regex = r'^[a-z0-9]+[._]?[a-z0-9]+@\w+\.\w+$'
    return bool(re.match(regex, email))

Then we use it to check the items in your email list:

for link in links:
    r = requests.get(link, headers=headers)
    emails_list = re.findall(r"[\w\.-]+@[\w\.-]+", r.text)
    emails_list = [email for email in emails_list if check(email)]
    print(emails_list)

Output:

['bai@acupuncturetx.com', 'bai@acupuncturetx.com', 'bai@acupuncturetx.com', 'bai@acupuncturetx.com']
[]
[]
['wendyrobin16@gmail.com', 'wendyrobin16@gmail.com', 'wendyrobin16@gmail.com', 'wendyrobin16@gmail.com', 'wendyrobin16@gmail.com', 'wendyrobin16@gmail.com']
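
To see the filter at work without any network access, here is a minimal sketch that runs the same pattern as check() over a few raw matches copied from the question's output. The de-duplication step with dict.fromkeys is an addition for illustration and is not part of the answer above.

```python
import re

# Same pattern as check() above, precompiled once
SIMPLE_EMAIL = re.compile(r'^[a-z0-9]+[._]?[a-z0-9]+@\w+\.\w+$')


def check(email):
    """Return True if the candidate matches the simple email pattern."""
    return bool(SIMPLE_EMAIL.match(email))


# A few raw matches taken from the question's output
candidates = [
    'react@16.5.2',
    'bai@acupuncturetx.com',
    'bai@acupuncturetx.com',
    'hh-logo@2x.png',
    '8b4e078a51d04e0e9efdf470027f0ec1@sentry.wixpress.com',
    'wendyrobin16@gmail.com',
]

# Keep valid-looking addresses; dict.fromkeys drops duplicates but keeps order
unique_emails = list(dict.fromkeys(e for e in candidates if check(e)))
print(unique_emails)
```

Note that this pattern is deliberately strict: it only accepts a domain of the form name.tld, so it would also reject a real address on a subdomain or one with uppercase letters.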

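User

#3 · Edited 2024-04-19 09:12:01

I happen to have a regex for this that respects RFC 5321. It will help you weed out many bogus (i.e. non-local) email addresses, but not all of them.

For example, the email 8b4e078a51d04e0e9efdf470027f0ec1@... certainly looks bogus, but according to the RFC its "local name" part is technically valid. You can add extra checks on the local name part (match.group(1) in my snippet below).

Here is my code snippet with the RFC-compliant regex: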

import re
import requests

# See https://www.rfc-editor.org/rfc/rfc5321
# Groups: (1) local part, (2) all domain labels before the TLD, (3) top-level domain
EMAIL_REGEX = re.compile(
    r"([\w.~%+-]{1,64})@((?:[\w-]{1,64}\.){1,16})(\w{2,16})",
    re.IGNORECASE | re.UNICODE,
)


# Cache this (it doesn't change often), all official top-level domains
TLD_URL = "https://datahub.io/core/top-level-domain-names/r/top-level-domain-names.csv.json"
OFFICIAL_TLD = requests.get(TLD_URL).json()
# Lowercase set: fast, case-insensitive membership tests
OFFICIAL_TLD = {x["Domain"].lstrip(".").lower() for x in OFFICIAL_TLD}


def extracted_emails(text):
    for match in EMAIL_REGEX.finditer(text):
        top_level = match.group(3)
        if top_level.lower() in OFFICIAL_TLD:
            email = match.group(0)
            # Additionally, max length of domain should be at most 255
            # You could also simplify this to simply: len(email) < 255
            if len(match.group(2)) + len(top_level) <= 255:
                yield email


# ... 8< ... stripped unchanged code for brevity

for link in links:
    r = requests.get(link,headers=headers)
    emails = list(extracted_emails(r.text))
    print(emails)

This yields your expected results plus one bogus (but technically valid) 8b4e078a51d04e0e9efdf470027f0ec1@... email.
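
The extra check on the local name part mentioned earlier could look something like the sketch below. The heuristic (rejecting a local part that is one long run of hexadecimal digits) and the names HEX_ID, looks_machine_generated and human_emails are hypothetical, chosen only for illustration. The TLD check is left out here so the example runs offline.

```python
import re

# Same shape as EMAIL_REGEX above: (1) local part, (2) domain labels, (3) TLD
EMAIL_REGEX = re.compile(
    r"([\w.~%+-]{1,64})@((?:[\w-]{1,64}\.){1,16})(\w{2,16})",
    re.IGNORECASE,
)

# Hypothetical heuristic: a local part made only of 24+ hex digits is more
# likely a machine-generated ID than a person's mailbox
HEX_ID = re.compile(r"[0-9a-f]{24,}", re.IGNORECASE)


def looks_machine_generated(local_part):
    """Return True if the local part is entirely a long hexadecimal string."""
    return HEX_ID.fullmatch(local_part) is not None


def human_emails(text):
    """Yield matched addresses whose local part passes the heuristic."""
    for match in EMAIL_REGEX.finditer(text):
        if not looks_machine_generated(match.group(1)):
            yield match.group(0)


sample = (
    "dsn: 8b4e078a51d04e0e9efdf470027f0ec1@sentry.wixpress.com "
    "contact: wendyrobin16@gmail.com"
)
print(list(human_emails(sample)))
```

A threshold of 24 characters is arbitrary; a shorter one risks dropping real addresses whose local part happens to use only the letters a to f and digits.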

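It uses a regex that strictly follows RFC 5321 and, for each substring that looks like a valid email, double-checks the top-level domain against the official list.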
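
Since the TLD list rarely changes, it may be worth caching it on disk rather than downloading it on every run. The following is one possible sketch, not part of the original answer: load_tlds, its cache_path argument and the fetch callable are hypothetical names, and fake_fetch stands in for the real download so the example runs offline.

```python
import json
import os
import tempfile


def load_tlds(cache_path, fetch):
    """Return the TLD set from cache_path, calling fetch() only on a cache miss.

    fetch is any zero-argument callable returning an iterable of TLD strings.
    """
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as fh:
            return set(json.load(fh))
    # Normalise: strip the leading dot and lowercase before storing
    tlds = sorted({t.lstrip(".").lower() for t in fetch()})
    with open(cache_path, "w", encoding="utf-8") as fh:
        json.dump(tlds, fh)
    return set(tlds)


# Stand-in for the real download, so the example runs offline
calls = []


def fake_fetch():
    calls.append(1)
    return [".com", ".org", ".NET"]


cache_file = os.path.join(tempfile.mkdtemp(), "tlds.json")
first = load_tlds(cache_file, fake_fetch)   # cache miss: calls fake_fetch
second = load_tlds(cache_file, fake_fetch)  # cache hit: reads the file
print(sorted(first), len(calls))
```

In the real script, fetch could wrap the requests.get(TLD_URL).json() call from the snippet above and return each entry's "Domain" value.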

