Parsing an email and getting a number from the body

1 vote
2 answers
1182 views
Asked 2025-04-17 07:21

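I want to extract the first number from the body of an email. Using the email library, I extracted the message body as a string. The problem is that some encoding information, which itself contains digits, comes before the actual plain-text body. How can I reliably skip it so that I get the first number in the body no matter which client sent the message?

If I do this: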

match = re.search(r'\d+', string, re.MULTILINE)

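it finds the first match in the encoding information rather than in the actual message content.

OK, I've added an example. This is what it might look like (I want to extract 123), although I suppose it could look different when sent from other clients.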

--14dae93404410f62f404b2e65e10
Content-Type: text/plain; charset=ISO-8859-1

Junk 123 Junk

--14dae93404410f62f404b2e65e10
Content-Type: text/html; charset=ISO-8859-1

<p>Junk 123 Junk</p>

--14dae93404410f62f404b2e65e10--

Update: now I'm stuck on the iterators :-/ I really tried, but I can't figure it out. This code:

msg = email.message_from_string(raw_message)
for part in email.iterators.typed_subpart_iterator(msg, 'text', 'plain'):
    print(part)

prints:

--14dae93404410f62f404b2e65e10
Content-Type: text/plain; charset=ISO-8859-1

Junk 123 Junk

--14dae93404410f62f404b2e65e10
Content-Type: text/html; charset=ISO-8859-1

<p>Junk 123 Junk</p>

--14dae93404410f62f404b2e65e10--

Why doesn't it just print:

Junk 123 Junk

?

2 Answers

0

You can use this (here `subject` is presumably the string holding the message text):

match = re.search(r"Content-Type:.*?[\n\r]+\D*(\d+)", subject)
if match:
    result = match.group(1)

Explanation:

"
Content-Type:    # Match the characters “Content-Type:” literally
.                # Match any single character that is not a line break character
   *?               # Between zero and unlimited times, as few times as possible, expanding as needed (lazy)
[\n\r]           # Match a single character present in the list below
                    # A line feed character
                    # A carriage return character
   +                # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
\D               # Match a single character that is not a digit 0..9
   *                # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
(                # Match the regular expression below and capture its match into backreference number 1
   \d               # Match a single digit 0..9
      +                # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
)
"
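
To see how the pattern explained above behaves, here is a minimal sketch that runs it against the sample from the question. The helper name `first_number_after_content_type` and the variable `raw_body` are made up for this illustration; only the regular expression comes from the answer.

```python
import re

# Sample body copied from the question.
raw_body = """--14dae93404410f62f404b2e65e10
Content-Type: text/plain; charset=ISO-8859-1

Junk 123 Junk

--14dae93404410f62f404b2e65e10
Content-Type: text/html; charset=ISO-8859-1

<p>Junk 123 Junk</p>

--14dae93404410f62f404b2e65e10--
"""

# The pattern from the answer: skip the rest of the Content-Type line,
# then any non-digits, then capture the first run of digits.
PATTERN = re.compile(r"Content-Type:.*?[\n\r]+\D*(\d+)")


def first_number_after_content_type(text):
    """Return the first digit run after a Content-Type line, or None."""
    match = PATTERN.search(text)
    return match.group(1) if match else None


print(first_number_after_content_type(raw_body))  # prints 123
```

One caveat: the pattern only skips the Content-Type line itself. If a client adds another header after it, such as `Content-Transfer-Encoding: 7bit`, the first digit run found is the `7` from that header, so this approach is fragile across mail clients.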

6

You probably want to use the iterators to skip over the subpart headers.

http://docs.python.org/library/email.iterators.html#module-email.iterators

This example prints the body of every message subpart whose content type is text/plain:

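import email.iterators

# msg is the parsed message, e.g. from email.message_from_string(raw_message)
for part in email.iterators.typed_subpart_iterator(msg, 'text', 'plain'):
    for body_line in email.iterators.body_line_iterator(part):
        print(body_line)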
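
For Python 3, the same idea can be sketched with `Message.walk()` instead of the iterators module. This is only a sketch under some assumptions: the function name `first_number_in_plain_text` is made up, and the top-level `MIME-Version` and `Content-Type: multipart/alternative` headers in `raw_message` are added here for illustration, since the sample in the question does not show them. If the string handed to the parser lacks those headers, the parser cannot know the message is multipart and treats the whole text as a single text/plain body, which may be why the code in the question printed everything.

```python
import email
import re

# The question's sample, with assumed top-level headers so that the
# parser can recognise the multipart structure and its boundary.
raw_message = """MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="14dae93404410f62f404b2e65e10"

--14dae93404410f62f404b2e65e10
Content-Type: text/plain; charset=ISO-8859-1

Junk 123 Junk

--14dae93404410f62f404b2e65e10
Content-Type: text/html; charset=ISO-8859-1

<p>Junk 123 Junk</p>

--14dae93404410f62f404b2e65e10--
"""


def first_number_in_plain_text(raw):
    """Return the first digit run found in a text/plain part, or None."""
    msg = email.message_from_string(raw)
    for part in msg.walk():
        # Skip the multipart container and any non-plain-text parts.
        if part.get_content_type() != "text/plain":
            continue
        # decode=True undoes base64 / quoted-printable and returns bytes.
        payload = part.get_payload(decode=True)
        charset = part.get_content_charset() or "ascii"
        text = payload.decode(charset, errors="replace")
        match = re.search(r"\d+", text)
        if match:
            return match.group(0)
    return None


print(first_number_in_plain_text(raw_message))  # prints 123
```

Because the search runs only on the decoded payload of each text/plain part, digits in boundaries and headers are never seen by the regular expression, as long as the full message including its top-level headers is passed in.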
