Parsing multipart emails with subparts using Python

2 votes

3 answers

7178 views

Asked 2025-04-16 10:45

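I'm using this function to parse email. I can parse "simple" multipart emails, but when an email defines multiple boundaries (subparts), it fails with an error (UnboundLocalError: local variable 'html' referenced before assignment). I'd like the script to separate the text and HTML parts and return only the HTML part (or the text if there is no HTML part).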

def get_text(msg):
    text = ""
    if msg.is_multipart():
        for part in msg.get_payload():
            if part.get_content_charset() is None:
                charset = chardet.detect(str(part))['encoding']
            else:
                charset = part.get_content_charset()
            if part.get_content_type() == 'text/plain':
                text = unicode(part.get_payload(decode=True),str(charset),"ignore").encode('utf8','replace')
            if part.get_content_type() == 'text/html':
                html = unicode(part.get_payload(decode=True),str(charset),"ignore").encode('utf8','replace')
        if html is None:
            return text.strip()
        else:
            return html.strip()
    else:
        text = unicode(msg.get_payload(decode=True),msg.get_content_charset(),'ignore').encode('utf8','replace')
        return text.strip()

Tags: error handling, email parsing, multipart email, boundary handling, text and HTML

3 Answers

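I made the following changes to the code: I replaced unicode with str, and changed str(part) to bytes(part) in this line: charset = chardet.detect(str(part))['encoding']. I also used bs4 to process the HTML. This code was very helpful for my project. Thanks.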
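The changes described above might look roughly like this in Python 3. This is only a sketch under stated assumptions, not the answerer's actual code: the helper name decode_part is hypothetical, chardet is treated as an optional third-party dependency, the UTF-8 fallback is an assumption, and the bs4 step is left out.

```python
from email import message_from_string

try:
    import chardet  # third-party; optional in this sketch
except ImportError:
    chardet = None


def decode_part(part):
    """Return the decoded text of a non-multipart message part as str."""
    payload = part.get_payload(decode=True)  # bytes, or None for a multipart container
    if payload is None:
        return ""
    charset = part.get_content_charset()
    if charset is None and chardet is not None:
        # chardet.detect() needs bytes in Python 3, hence bytes(part) instead of str(part)
        charset = chardet.detect(bytes(part))["encoding"]
    # str(bytes, encoding, errors) takes the place of the Python 2 unicode(...) call;
    # falling back to UTF-8 is an assumption made for this sketch
    return str(payload, charset or "utf-8", "ignore")


msg = message_from_string("Content-Type: text/plain; charset=utf-8\n\nhello\n")
print(decode_part(msg).strip())
```

Because the result is already a str, the trailing .encode('utf8','replace') from the Python 2 version is no longer needed unless the caller really wants bytes.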

Answered 2025-04-16 by Python大师

Here is the code modified according to OlliM's suggestion. Without this change, you cannot correctly parse "multipart/alternative" containers in an email.

import chardet

def get_text(msg):
    """ Parses email message text, given message object
    This doesn't support infinite recursive parts, but mail is usually not so naughty.
    """
    text = ""
    if msg.is_multipart():
        html = None
        for part in msg.get_payload():
            if part.get_content_charset() is None:
                charset = chardet.detect(str(part))['encoding']
            else:
                charset = part.get_content_charset()
            if part.get_content_type() == 'text/plain':
                text = unicode(part.get_payload(decode=True),str(charset),"ignore").encode('utf8','replace')
            if part.get_content_type() == 'text/html':
                html = unicode(part.get_payload(decode=True),str(charset),"ignore").encode('utf8','replace')
            if part.get_content_type() == 'multipart/alternative':
                for subpart in part.get_payload():
                    if subpart.get_content_charset() is None:
                        charset = chardet.detect(str(subpart))['encoding']
                    else:
                        charset = subpart.get_content_charset()
                    if subpart.get_content_type() == 'text/plain':
                        text = unicode(subpart.get_payload(decode=True),str(charset),"ignore").encode('utf8','replace')
                    if subpart.get_content_type() == 'text/html':
                        html = unicode(subpart.get_payload(decode=True),str(charset),"ignore").encode('utf8','replace')

        if html is None:
            return text.strip()
        else:
            return html.strip()
    else:
        text = unicode(msg.get_payload(decode=True),msg.get_content_charset(),'ignore').encode('utf8','replace')
        return text.strip()

Writing a more elegant structure that avoids the duplicated code is left as an exercise for the reader.

Also, take a look at this useful diagram of the container structure.
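One possible take on that exercise, written for Python 3, is to let Message.walk() from the standard library visit every nested part so that no nesting level has to be handled by hand. This is a sketch rather than a drop-in replacement for the code above: it keeps the first part of each type instead of the last, and it assumes UTF-8 when a part declares no charset instead of calling chardet.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def get_text(msg):
    """Return the HTML body if one exists, otherwise the plain-text body."""
    text = None
    html = None
    # walk() yields the message itself and then every nested part, depth-first
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no text of their own
        content_type = part.get_content_type()
        if content_type not in ("text/plain", "text/html"):
            continue
        charset = part.get_content_charset() or "utf-8"  # assumed fallback
        body = part.get_payload(decode=True).decode(charset, "ignore")
        if content_type == "text/html" and html is None:
            html = body
        elif content_type == "text/plain" and text is None:
            text = body
    chosen = html if html is not None else text
    return (chosen or "").strip()


# Build a multipart/mixed message that wraps a multipart/alternative container
outer = MIMEMultipart("mixed")
alternative = MIMEMultipart("alternative")
alternative.attach(MIMEText("plain body", "plain", "utf-8"))
alternative.attach(MIMEText("<p>html body</p>", "html", "utf-8"))
outer.attach(alternative)

print(get_text(outer))  # <p>html body</p>
```

Since walk() also yields a non-multipart message itself, the same loop covers the single-part case without a separate else branch.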

Answered 2025-04-16 by Python大师

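As the comments say, you always check html, but you only assign it in certain cases. That is what the error message is telling you: you referenced html before assigning it. In Python, you can't test whether a variable is None if it has never been assigned. For example, open an interactive Python prompt: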

>>> if y is None:
...   print 'none'
... 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'y' is not defined

As you can see, you can't just check for None to determine whether a variable exists. Back to your specific case.

You need to set html to None first, and then check whether it is still None. In other words, change your code to this:

def get_text(msg):
    text = ""
    if msg.is_multipart():
        html = None
        for part in msg.get_payload():
            if part.get_content_charset() is None:
                charset = chardet.detect(str(part))['encoding']
            else:
                charset = part.get_content_charset()
            if part.get_content_type() == 'text/plain':
                text = unicode(part.get_payload(decode=True),str(charset),"ignore").encode('utf8','replace')
            if part.get_content_type() == 'text/html':
                html = unicode(part.get_payload(decode=True),str(charset),"ignore").encode('utf8','replace')
        if html is None:
            return text.strip()
        else:
            return html.strip()
    else:
        text = unicode(msg.get_payload(decode=True),msg.get_content_charset(),'ignore').encode('utf8','replace')
        return text.strip()

This explains it in more detail:

http://code.activestate.com/recipes/59892-testing-if-a-variable-is-defined/
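The prompt example raises NameError because y is looked up at module level. Inside a function, the same mistake raises UnboundLocalError, which is a subclass of NameError and is the error reported in the question. A small Python 3 illustration follows; the function names pick and pick_fixed and the (kind, body) tuples are hypothetical stand-ins for the message parts.

```python
def pick(parts):
    # 'html' is only bound if the loop actually sees an HTML part
    for kind, body in parts:
        if kind == "html":
            html = body
    return html  # raises UnboundLocalError when no HTML part was seen


def pick_fixed(parts):
    html = None  # bind the name up front so the later read is always valid
    for kind, body in parts:
        if kind == "html":
            html = body
    return html


try:
    pick([("plain", "hi")])
except UnboundLocalError as exc:
    print("pick failed:", exc)

print(pick_fixed([("plain", "hi")]))  # None
```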

Answered 2025-04-16 by Python大师
