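Exporting Gmail with Python imaplib - text garbled by line-break handling
I am using the code below to export every message in a particular Gmail folder.
It works well and pulls out all the messages I expect, but it seems to mishandle line breaks.
Code: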
    import imaplib
    import email
    import codecs

    mail = imaplib.IMAP4_SSL('imap.gmail.com')
    mail.login('myUser@gmail.com', 'myPassword')  # user / password
    mail.list()
    mail.select("myFolder")  # connect to folder with matching label
    result, data = mail.uid('search', None, "ALL")  # search and return uids instead
    i = len(data[0].split())
    for x in range(i):
        latest_email_uid = data[0].split()[x]
        result, email_data = mail.uid('fetch', latest_email_uid, '(RFC822)')
        raw_email = email_data[0][1]
        email_message = email.message_from_string(raw_email)
        save_string = str("C:\\googlemail\\boxdump\\email_" + str(x) + ".eml")  # set to save location
        myfile = open(save_string, 'a')
        myfile.write(email_message)
        myfile.close()
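My problem is that by the time I get hold of this object it is full of '=0A', which I assume are incorrectly interpreted line feeds or carriage returns.
I can find it in hex, [d3 03 03 0a], but because it is not a "character" I cannot find a way to strip these sequences out with str.replace(). I do not actually want these line breaks.
I could convert the whole string to hex and run some kind of replace or regex over it, but that seems like overkill when the real problem lies in how the source data is encoded or read.
What I see: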
====
CAUTION: This email message and any attachments con= tain information that may be confidential and may be LEGALLY PRIVILEGED. If yo= u are not the intended recipient, any use, disclosure or copying of this messag= e or attachments is strictly prohibited. If you have received this email messa= ge in error please notify us immediately and erase all copies of the message an= d attachments. Thank you.
====
What I want:
====
CAUTION: This email message and any attachments contain information that may be confidential and may be LEGALLY PRIVILEGED. If you are not the intended recipient, any use, disclosure or copying of this message or attachments is strictly prohibited. If you have received this email message in error please notify us immediately and erase all copies of the message and attachments. Thank you.
====
2 Answers
0
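I want to share two extra lessons, because I spent a painful day learning them.
First, do the decoding at the payload level, on each message part rather than on the whole message, so that you can still extract things like email addresses from the mail.
Second, you also need to decode the character set. I ran into trouble with messages where people had copied and pasted HTML content from web pages or Word documents into the email.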
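    # Python 2; maintype and text are assumed to be set up like this
    maintype = email_message.get_content_maintype()
    text = ''
    if maintype == 'multipart':
        for part in email_message.get_payload():
            if part.get_content_type() == 'text/plain':
                # undo quoted-printable first, then decode with the part's charset
                text += part.get_payload().decode("quoted-printable").decode(part.get_content_charset())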
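For readers on Python 3, where `str` has no `.decode()` method, the same idea can be sketched roughly as follows. This is only an illustration, not the code from this answer: the function name `extract_plain_text` is hypothetical, and falling back to UTF-8 when a part declares no charset is an assumption. It relies on `get_payload(decode=True)`, which undoes the Content-Transfer-Encoding (quoted-printable or base64) and returns bytes.

```python
import email


def extract_plain_text(raw_email: bytes) -> str:
    """Collect the decoded text/plain parts of a raw RFC 822 message."""
    message = email.message_from_bytes(raw_email)
    chunks = []
    # walk() visits the message itself and every nested part
    for part in message.walk():
        if part.get_content_type() != "text/plain":
            continue  # skips multipart containers, HTML parts, attachments
        # decode=True reverses quoted-printable / base64 and returns bytes
        payload = part.get_payload(decode=True)
        # Assumption: treat a missing charset as UTF-8
        charset = part.get_content_charset() or "utf-8"
        chunks.append(payload.decode(charset, errors="replace"))
    return "".join(chunks)
```

Using `walk()` instead of iterating over `get_payload()` also covers single-part messages and nested multipart structures.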
Hope this helps someone!
Dave
2
What you are seeing is Quoted Printable encoding.
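As a small illustration of what that encoding does, here is a sketch using the standard-library `quopri` module on a made-up fragment modelled on the text in the question. An `=` at the end of a line is a soft line break that disappears on decoding, and `=0A` is an encoded line feed. The sample bytes are an assumption, not data from the actual mailbox.

```python
import quopri

# "=\r\n" is a soft line break; "=0A" is an encoded line feed (hex 0A)
encoded = b"any use of this messag=\r\ne or attachments=0Ais strictly prohibited."

# decodestring() takes bytes and returns bytes
decoded = quopri.decodestring(encoded)

print(decoded.decode("ascii"))
```

After decoding, the word split by the soft break is whole again and the `=0A` has become a real newline.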
Try changing:
    email_message = email.message_from_string(raw_email)
to:
    email_message = str(email.message_from_string(raw_email)).decode("quoted-printable")
For more information, see Standard Encodings in the documentation for Python's codecs module.
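Note that `.decode("quoted-printable")` on a `str` is a Python 2 idiom. In Python 3, `str` has no `.decode()` method and `bytes.decode()` only accepts text encodings, so the codec has to be reached through `codecs.decode()` instead. A minimal sketch, assuming the input is already bytes; the helper name `decode_quoted_printable` is hypothetical:

```python
import codecs


def decode_quoted_printable(data: bytes) -> bytes:
    """Decode quoted-printable bytes with the codecs module (Python 3)."""
    # "quopri_codec" is a bytes-to-bytes codec listed under Standard Encodings
    return codecs.decode(data, "quopri_codec")
```

The result is still bytes, so it needs a further `.decode()` with the message's charset before it can be handled as text.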