Python email quoted-printable encoding problem

6 votes

3 answers

9962 views

Asked 2025-04-16 06:12

I'm extracting emails from Gmail using the following:

import email
import imaplib
import sys

# username, password and subject are assumed to be defined elsewhere in the script.

def getMsgs():
  try:
    conn = imaplib.IMAP4_SSL("imap.gmail.com", 993)
  except Exception:
    print('Failed to connect')
    print('Is your internet connection working?')
    sys.exit()
  try:
    conn.login(username, password)
  except Exception:
    print('Failed to login')
    print('Is the username and password correct?')
    sys.exit()

  conn.select('Inbox')
  # typ, data = conn.search(None, '(UNSEEN SUBJECT "%s")' % subject)
  typ, data = conn.search(None, '(SUBJECT "%s")' % subject)
  for num in data[0].split():
    typ, data = conn.fetch(num, '(RFC822)')
    # On Python 3, fetch returns bytes; use email.message_from_bytes there.
    msg = email.message_from_string(data[0][1])
    yield walkMsg(msg)

def walkMsg(msg):
  # Return the payload of the first text/plain part (not transfer-decoded).
  for part in msg.walk():
    if part.get_content_type() != "text/plain":
      continue
    return part.get_payload()

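However, in some emails the dates are nearly impossible to extract, because encoding-related characters such as '=' show up at seemingly random places in various text fields. Here is an example of a date range I want to extract:

Name: KIRSTI Email: kirsti@blah.blah Phone #: + 999 99995192 Total in party: 4 total, 0 children Arrival/Departure: Oct 9= , 2010 - Oct 13, 2010 - Oct 13, 2010

Is there a way to remove these encoding characters?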

3 Answers

That's called quoted-printable encoding. You probably want to use something like quopri.decodestring - http://docs.python.org/library/quopri.html
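
As a rough sketch of that suggestion, assuming Python 3: quopri.decodestring takes bytes and returns bytes, so the result still has to be decoded with the part's charset. The sample text below is made up for illustration (it is not the asker's actual mail), and UTF-8 is an assumed charset.

```python
import quopri

# Hypothetical quoted-printable body: "=\n" is a soft line break,
# "=C3=A9" is the UTF-8 byte sequence for "é".
raw = b"Arrival/Departure: Oct 9=\n, 2010 - Oct 13, 2010\nCaf=C3=A9\n"

decoded_bytes = quopri.decodestring(raw)   # bytes in, bytes out
text = decoded_bytes.decode("utf-8")       # charset is an assumption
print(text)
```

Note that this only works on the body of a single part; it knows nothing about headers or multipart structure, so the raw RFC822 message still needs to be parsed first.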

Answered 2025-04-16 by Python大师

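If you're on Python 3.6 or later, you can use the email.message.EmailMessage.get_content() method to decode the text automatically. It supersedes get_payload(), although get_payload() is still available.

Suppose you have a string s containing the following email (adapted from the examples in the documentation):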

Subject: Ayons asperges pour le =?utf-8?q?d=C3=A9jeuner?=
From: =?utf-8?q?Pep=C3=A9?= Le Pew <pepe@example.com>
To: Penelope Pussycat <penelope@example.com>,
 Fabrette Pussycat <fabrette@example.com>
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0

    Salut!

    Cela ressemble =C3=A0 un excellent recipie[1] d=C3=A9jeuner.

    [1] http://www.yummly.com/recipe/Roasted-Asparagus-Epicurious-203718

    --Pep=C3=A9
   =20

The non-ASCII characters in the string are encoded as quoted-printable, as declared in the Content-Transfer-Encoding header.

Next, create an email object:

import email
from email import policy

msg = email.message_from_string(s, policy=policy.default)

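The policy has to be set here; otherwise policy.compat32 is used, which returns a legacy Message instance that has no get_content method. policy.default will eventually become the default policy, but as of Python 3.7 the default is still policy.compat32.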
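
To make the difference between the two policies concrete, here is a minimal sketch that parses the same text both ways. The tiny message is invented for illustration, and the names legacy and modern are hypothetical.

```python
import email
from email import policy
from email.message import EmailMessage, Message

# Tiny made-up message, used only for illustration.
s = (
    "Subject: test\n"
    'Content-Type: text/plain; charset="utf-8"\n'
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "d=C3=A9jeuner\n"
)

legacy = email.message_from_string(s)                          # implicit policy.compat32
modern = email.message_from_string(s, policy=policy.default)   # explicit modern policy

# The legacy object is a plain Message and has no get_content().
print(type(legacy).__name__, hasattr(legacy, "get_content"))
# The modern object is an EmailMessage, which does.
print(type(modern).__name__, hasattr(modern, "get_content"))
print(modern.get_content())
```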

The get_content() method handles the decoding automatically:

print(msg.get_content())

Salut!

Cela ressemble à un excellent recipie[1] déjeuner.

[1] http://www.yummly.com/recipe/Roasted-Asparagus-Epicurious-203718

--Pepé

If you have a multipart message, get_content() has to be called on each part individually, like so:

for part in msg.iter_parts():
    print(part.get_content())
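
A self-contained sketch of the multipart case follows. It builds a small multipart/alternative message in memory to stand in for a fetched mail (the content is made up), round-trips it through bytes as would happen with IMAP data, and reads the text/plain part. The variable names are hypothetical.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

# Build a small multipart/alternative message as a stand-in for a real mail.
original = EmailMessage()
original["Subject"] = "Reservation"
original.set_content("Arrival: Oct 9, 2010 - Pepé\n", cte="quoted-printable")
original.add_alternative("<p>Arrival: Oct 9, 2010 - Pep&eacute;</p>", subtype="html")

# Round-trip through bytes, similar to what an IMAP fetch would hand back.
msg = message_from_bytes(original.as_bytes(), policy=policy.default)

# Option 1: iterate over the direct sub-parts and decode the plain-text one.
for part in msg.iter_parts():
    if part.get_content_type() == "text/plain":
        print(part.get_content())

# Option 2: let get_body() pick the preferred part.
body = msg.get_body(preferencelist=("plain",))
plain_text = body.get_content() if body is not None else None
```

iter_parts() only yields the immediate children of a multipart message; for nested structures, walk() visits every part, and get_content() should then be called only on the non-multipart ones.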

Answered 2025-04-16 by Python大师

You can use the email.parser module to decode email messages. Here is a simple (very rough!) example:

from email.parser import FeedParser
f = FeedParser()
f.feed("<insert mail message here, including all headers>")
rootMessage = f.close()

# Now you can access the message and its submessages (if it's multipart)
print(rootMessage.is_multipart())

# Or check for errors
print(rootMessage.defects)

# If it's a multipart message, you can get the first submessage and then its payload
# (i.e. content) like so:
rootMessage.get_payload(0).get_payload(decode=True)

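With the "decode" argument of Message.get_payload, the module automatically decodes the content according to its Content-Transfer-Encoding, such as the quoted-printable encoding you mention.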
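
One detail worth sketching, assuming Python 3: get_payload(decode=True) returns bytes, so the text still has to be decoded with the part's charset. The sample message below is made up to mimic the stray '=' from the question, and the UTF-8 fallback is an assumption.

```python
import email

# Made-up single-part message with a quoted-printable soft line break ("=\n").
raw = (
    'Content-Type: text/plain; charset="utf-8"\n'
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "Arrival/Departure: Oct 9=\n"
    ", 2010 - Oct 13, 2010\n"
)

part = email.message_from_string(raw)

undecoded = part.get_payload()              # str, still contains the "=" soft break
payload = part.get_payload(decode=True)     # bytes, transfer encoding removed
charset = part.get_content_charset() or "utf-8"   # fallback is an assumption
text = payload.decode(charset, errors="replace")
print(text)
```

In the asker's walkMsg function, the equivalent change would be to return part.get_payload(decode=True) instead of part.get_payload().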

Answered 2025-04-16 by Python大师
