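Parsing an email and getting a number from the body
I want to extract the first number from the body of an email. Using the email library, I have pulled the body out of the message as a string. The problem is that before the actual plain-text body starts there is some encoding information, and that information contains digits. How can I reliably skip it so that I get the first number of the body itself, no matter which client sent the mail?
If I do this:
match = re.search(r'\d+', string, re.MULTILINE)
it finds the first match in the encoding information rather than in the actual mail content.
OK, I've added an example. This is what it might look like (I would want to extract 123). I suppose it may look different when sent from other clients, though.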
--14dae93404410f62f404b2e65e10
Content-Type: text/plain; charset=ISO-8859-1
Junk 123 Junk
--14dae93404410f62f404b2e65e10
Content-Type: text/html; charset=ISO-8859-1
<p>Junk 123 Junk</p>
--14dae93404410f62f404b2e65e10--
Update: now I'm stuck on the iterators :-/ I really tried, but I can't figure it out. This code:
import email
import email.iterators

msg = email.message_from_string(raw_message)
for part in email.iterators.typed_subpart_iterator(msg, 'text', 'plain'):
    print(part)
prints:
--14dae93404410f62f404b2e65e10
Content-Type: text/plain; charset=ISO-8859-1
Junk 123 Junk
--14dae93404410f62f404b2e65e10
Content-Type: text/html; charset=ISO-8859-1
<p>Junk 123 Junk</p>
--14dae93404410f62f404b2e65e10--
Why doesn't it simply print:
Junk 123 Junk
?
2 Answers
0
You can use this:
import re

# subject is the string holding the raw message body
match = re.search(r"Content-Type:.*?[\n\r]+\D*(\d+)", subject)
if match:
    result = match.group(1)
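As a quick sanity check, here is a minimal sketch that runs this pattern against the sample from the question. The `sample` string is re-typed with `\n` line endings and a blank line after each part's headers, which is an assumption about what the raw text looks like, and `first_number_after_content_type` is a hypothetical helper name used only for illustration.

```python
import re

# Sample from the question, re-typed with "\n" line endings and a blank line
# after each part's headers (an assumption about the raw text).
sample = (
    "--14dae93404410f62f404b2e65e10\n"
    "Content-Type: text/plain; charset=ISO-8859-1\n"
    "\n"
    "Junk 123 Junk\n"
    "--14dae93404410f62f404b2e65e10\n"
    "Content-Type: text/html; charset=ISO-8859-1\n"
    "\n"
    "<p>Junk 123 Junk</p>\n"
    "--14dae93404410f62f404b2e65e10--\n"
)

PATTERN = re.compile(r"Content-Type:.*?[\n\r]+\D*(\d+)")


def first_number_after_content_type(text):
    """Return the first digit run after the first Content-Type line, or None."""
    match = PATTERN.search(text)
    return match.group(1) if match else None


print(first_number_after_content_type(sample))  # prints 123
```

One caveat: `\D*` also runs across any further header lines, so a part that carries, say, a `Content-Transfer-Encoding: 7bit` header after `Content-Type` would yield 7 rather than the number in the body.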
Explanation of the pattern:
"
Content-Type: # Match the characters “Content-Type:” literally
. # Match any single character that is not a line break character
*? # Between zero and unlimited times, as few times as possible, expanding as needed (lazy)
[\n\r] # Match a single character present in the list below
# A line feed character
# A carriage return character
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
\D # Match a single character that is not a digit 0..9
* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
( # Match the regular expression below and capture its match into backreference number 1
\d # Match a single digit 0..9
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
)
"
6
You probably want to use the iterators to skip the subpart headers.
http://docs.python.org/library/email.iterators.html#module-email.iterators
This example prints the body of every message subpart whose content type is text/plain:
for part in email.iterators.typed_subpart_iterator(msg, 'text', 'plain'):
    for body_line in email.iterators.body_line_iterator(part):
        print(body_line)
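Putting the iterator approach together with the original goal, here is a minimal sketch rather than a definitive implementation. It assumes the string handed to the parser is the complete message, including the top-level `Content-Type: multipart/alternative; boundary=...` header; that header is added to the sample below as an assumption. The output shown in the question may be what you get when that header is missing, since the parser then has no boundary to split on and treats the whole text as a single text/plain body. `RAW_MESSAGE` and `first_number_in_plain_text` are hypothetical names.

```python
import email
import email.iterators
import re

# Hypothetical complete message: the first two header lines are an assumption,
# added so the parser knows the boundary and can split the parts.
RAW_MESSAGE = (
    "MIME-Version: 1.0\n"
    "Content-Type: multipart/alternative; "
    "boundary=14dae93404410f62f404b2e65e10\n"
    "\n"
    "--14dae93404410f62f404b2e65e10\n"
    "Content-Type: text/plain; charset=ISO-8859-1\n"
    "\n"
    "Junk 123 Junk\n"
    "--14dae93404410f62f404b2e65e10\n"
    "Content-Type: text/html; charset=ISO-8859-1\n"
    "\n"
    "<p>Junk 123 Junk</p>\n"
    "--14dae93404410f62f404b2e65e10--\n"
)


def first_number_in_plain_text(raw_message):
    """Return the first digit run found in a text/plain part, or None."""
    msg = email.message_from_string(raw_message)
    for part in email.iterators.typed_subpart_iterator(msg, "text", "plain"):
        # decode=True undoes base64 / quoted-printable and returns bytes.
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "ascii"
        text = payload.decode(charset, errors="replace")
        match = re.search(r"\d+", text)
        if match:
            return match.group(0)
    return None


print(first_number_in_plain_text(RAW_MESSAGE))  # prints 123
```

Because the search only ever sees the decoded body of a text/plain part, digits in the boundary string or in `charset=ISO-8859-1` cannot be matched by accident.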