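Parsing email headers in Python with regular expressions
I'm a Python beginner trying to extract data from email headers. I have thousands of emails stored in a single text file, and from each one I want to extract the sender's address, the recipients' addresses, and the date, then write them to a new file with the fields separated by semicolons.
It's not a great approach, but here is what I came up with: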
import re
emails = open("demo_text.txt","r") #opens the file to analyze
results = open("results.txt","w") #creates new file for search results
resultsList = []
for line in emails:
    if "From - " in line: #recognizes the beginning of an email message and adds a linebreak
        newMessage = re.findall(r'\w\w\w\s\w\w\w.*', line)
        if newMessage:
            resultsList.append("\n")
    if "From: " in line:
        address = re.findall(r'[\w.-]+@[\w.-]+', line)
        if address:
            resultsList.append(address)
            resultsList.append(";")
    if "To: " in line:
        if "Delivered-To:" not in line: #avoids confusion with 'Delivered-To:' tag
            address = re.findall(r'[\w.-]+@[\w.-]+', line)
            if address:
                for person in address:
                    resultsList.append(person)
                    resultsList.append(";")
    if "Date: " in line:
        date = re.findall(r'\w\w\w\,.*', line)
        resultsList.append(date)
        resultsList.append(";")
for result in resultsList:
    results.writelines(result)
emails.close()
results.close()
Here is my 'demo_text.txt' file:
From - Sun Jan 06 19:08:49 2013
X-Mozilla-Status: 0001
X-Mozilla-Status2: 00000000
Delivered-To: somebody_1@hotmail.com
Received: by 10.48.48.3 with SMTP id v3cs417003nfv;
        Mon, 15 Jan 2007 10:14:19 -0800 (PST)
Received: by 10.65.211.13 with SMTP id n13mr5741660qbq.1168884841872;
        Mon, 15 Jan 2007 10:14:01 -0800 (PST)
Return-Path: <nobody@hotmail.com>
Received: from bay0-omc3-s21.bay0.hotmail.com (bay0-omc3-s21.bay0.hotmail.com [65.54.246.221])
        by mx.google.com with ESMTP id e13si6347910qbe.2007.01.15.10.13.58;
        Mon, 15 Jan 2007 10:14:01 -0800 (PST)
Received-SPF: pass (google.com: domain of nobody@hotmail.com designates 65.54.246.221 as permitted sender)
Received: from hotmail.com ([65.54.250.22]) by bay0-omc3-s21.bay0.hotmail.com with Microsoft SMTPSVC(6.0.3790.2668);
        Mon, 15 Jan 2007 10:13:48 -0800
Received: from mail pickup service by hotmail.com with Microsoft SMTPSVC;
        Mon, 15 Jan 2007 10:13:47 -0800
Message-ID: <BAY115-F12E4E575FF2272CF577605A1B50@phx.gbl>
Received: from 65.54.250.200 by by115fd.bay115.hotmail.msn.com with HTTP;
        Mon, 15 Jan 2007 18:13:43 GMT
X-Originating-IP: [200.122.47.165]
X-Originating-Email: [nobody@hotmail.com]
X-Sender: nobody@hotmail.com
From: =?iso-8859-1?B?UGF1bGEgTWFy7WEgTGlkaWEgRmxvcmVuemE=?=
        <nobody@hotmail.com>
To: somebody_1@hotmail.com, somebody_2@gmail.com, 3_nobodies@yahoo.com.ar
Bcc:
Subject: fotos
Date: Mon, 15 Jan 2007 18:13:43 +0000
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="----=_NextPart_000_d98_1c4f_3aa9"
X-OriginalArrivalTime: 15 Jan 2007 18:13:47.0572 (UTC) FILETIME=[E68D4740:01C738D0]
Return-Path: nobody@hotmail.com
The output is:
somebody_1@hotmail.com;somebody_2@gmail.com;3_nobodies@yahoo.com.ar;Mon, 15 Jan 2007 18:13:43 +0000;
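This output would be fine, except that the 'From:' field in my 'demo_text.txt' (line 24) contains a line break, so I miss the address 'nobody@hotmail.com'.
I'm not sure how to tell my code to skip the line break and still find the email address in the 'From:' field.
More generally, I'm sure there are many more sensible ways to do this. If anyone could point me in the right direction, I would be very grateful.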
2 Answers
0
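To skip the line breaks, you can't read the file line by line. Try loading the whole file and using your keywords (From, To, etc.) as boundaries. For example, when you search for 'From -', the other keywords can act as boundaries so that they are not included in that part of the list.
Also, since you mention that you're a beginner: in Python, names of non-class variables usually separate words with underscores, so resultsList should be results_list.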
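One way to sketch the "load the whole text, don't go line by line" idea is to match a header together with its folded continuation lines, meaning the following lines that begin with a space or tab. The helper name `header_value` and the shortened `sample` text are made up for illustration, and the sketch assumes continuation lines keep their leading whitespace and that line endings are plain `\n`:

```python
import re

# Same loose address pattern as in the question.
ADDRESS = re.compile(r"[\w.-]+@[\w.-]+")


def header_value(message_text, name):
    """Return the unfolded value of the first `name:` header, or ''."""
    # ^ with re.MULTILINE anchors at a line start, so "To" does not
    # match inside "Delivered-To". The trailing group picks up folded
    # continuation lines (lines starting with a space or tab).
    pattern = re.compile(
        r"^" + re.escape(name) + r":[ \t]*(.*(?:\n[ \t]+.*)*)",
        re.MULTILINE,
    )
    match = pattern.search(message_text)
    if match is None:
        return ""
    # Join the folded lines back into one logical line.
    return re.sub(r"\n[ \t]+", " ", match.group(1))


# Hypothetical, shortened version of the headers in the question.
sample = (
    "Delivered-To: somebody_1@hotmail.com\n"
    "From: =?iso-8859-1?B?UGF1bGE=?=\n"
    "        <nobody@hotmail.com>\n"
    "To: somebody_1@hotmail.com, somebody_2@gmail.com\n"
    "Date: Mon, 15 Jan 2007 18:13:43 +0000\n"
)

print(ADDRESS.findall(header_value(sample, "From")))
print(ADDRESS.findall(header_value(sample, "To")))
```

For a file containing many messages, the text would first have to be split on the 'From - ' separator lines, which this sketch does not attempt.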
2
Your sample text is actually in mbox format, which can be handled cleanly by the appropriate object from the mailbox module:
from mailbox import mbox
import re
PAT_EMAIL = re.compile(r"[0-9A-Za-z._-]+\@[0-9A-Za-z._-]+")
mymbox = mbox("demo.txt")
for email in mymbox.values():
    # A missing header comes back as None, so fall back to an empty string
    from_address = PAT_EMAIL.findall(email["from"] or "")
    to_address = PAT_EMAIL.findall(email["to"] or "")
    date = [email["date"] or ""]
    print(";".join(from_address + to_address + date))
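As a possible variation on the mailbox approach, the standard library's `email.utils.getaddresses` can pull the addresses out of the headers instead of a hand-written regex. This is only a sketch: the function name `summarize`, the `SAMPLE` text, and the temporary file name are invented for the example, and it assumes folded header lines keep their leading whitespace.

```python
import mailbox
import os
import tempfile
from email.utils import getaddresses

# Hypothetical one-message mbox, shortened from the question's sample.
SAMPLE = (
    "From - Sun Jan 06 19:08:49 2013\n"
    "From: =?iso-8859-1?B?UGF1bGE=?=\n"
    "        <nobody@hotmail.com>\n"
    "To: somebody_1@hotmail.com, somebody_2@gmail.com\n"
    "Date: Mon, 15 Jan 2007 18:13:43 +0000\n"
    "\n"
    "body\n"
)


def summarize(mbox_path):
    """Return one 'sender;recipient;...;date' string per message."""
    lines = []
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for message in box:
            # get_all() returns the given default ([]) if the header is absent.
            senders = getaddresses(message.get_all("From", []))
            recipients = getaddresses(message.get_all("To", []))
            # Each entry is a (display name, address) pair; keep the address.
            fields = [addr for _name, addr in senders + recipients if addr]
            fields.append(message.get("Date", ""))
            lines.append(";".join(fields))
    finally:
        box.close()
    return lines


# Write the sample to a temporary file so the sketch runs on its own.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sample.mbox")
    with open(path, "w", newline="\n") as handle:
        handle.write(SAMPLE)
    summary = summarize(path)

print("\n".join(summary))
```

To produce the semicolon-separated results file from the question, each returned string could be written out followed by a newline.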