Parsing an email and getting a number from the body

Asked 2025-04-17 07:21

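I want to extract the first number from the body of an email. Using the email library, I extracted the body of the message as a string. The problem is that before the actual plain-text body starts, there is some information about the encoding, and that information contains digits. How can I reliably skip it so that I get the first number in the actual content, regardless of which client sent the mail?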

If I do this:

match = re.search(r'\d+', string, re.MULTILINE)

it finds the first match in the encoding information rather than in the actual message content.

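OK, here is an example. This is what it might look like (I would want to extract 123). I suppose it could look different when sent from other clients, though.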

--14dae93404410f62f404b2e65e10
Content-Type: text/plain; charset=ISO-8859-1

Junk 123 Junk

--14dae93404410f62f404b2e65e10
Content-Type: text/html; charset=ISO-8859-1

<p>Junk 123 Junk</p>

--14dae93404410f62f404b2e65e10--

Update: now I'm stuck on the iterators :-/ I really tried, but I can't figure it out. This code:

import email
import email.iterators

msg = email.message_from_string(raw_message)
for part in email.iterators.typed_subpart_iterator(msg, 'text', 'plain'):
    print(part)

Output:

--14dae93404410f62f404b2e65e10
Content-Type: text/plain; charset=ISO-8859-1

Junk 123 Junk

--14dae93404410f62f404b2e65e10
Content-Type: text/html; charset=ISO-8859-1

<p>Junk 123 Junk</p>

--14dae93404410f62f404b2e65e10--

Why doesn't it simply output:

Junk 123 Junk

2 Answers

You can use this:

import re

match = re.search(r"Content-Type:.*?[\n\r]+\D*(\d+)", string)
if match:
    result = match.group(1)

Explanation:

"
Content-Type:    # Match the characters “Content-Type:” literally
.                # Match any single character that is not a line break character
   *?               # Between zero and unlimited times, as few times as possible, expanding as needed (lazy)
[\n\r]           # Match a single character present in the list below
                    # A line feed character
                    # A carriage return character
   +                # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
\D               # Match a single character that is not a digit 0..9
   *                # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
(                # Match the regular expression below and capture its match into backreference number 1
   \d               # Match a single digit 0..9
      +                # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
)
"
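
To make the behaviour of the pattern above concrete, here is a minimal sketch that runs it against the sample body from the question. The function name `first_number_after_content_type` and the variable `raw_body` are made up for this illustration and are not part of the original answer.

```python
import re

# Sample body copied from the question.
raw_body = (
    "--14dae93404410f62f404b2e65e10\n"
    "Content-Type: text/plain; charset=ISO-8859-1\n"
    "\n"
    "Junk 123 Junk\n"
    "\n"
    "--14dae93404410f62f404b2e65e10\n"
    "Content-Type: text/html; charset=ISO-8859-1\n"
    "\n"
    "<p>Junk 123 Junk</p>\n"
    "\n"
    "--14dae93404410f62f404b2e65e10--\n"
)


def first_number_after_content_type(text):
    """Return the first run of digits after the first Content-Type line, or None."""
    match = re.search(r"Content-Type:.*?[\n\r]+\D*(\d+)", text)
    return match.group(1) if match else None


result = first_number_after_content_type(raw_body)
print(result)
```

One caveat worth keeping in mind: the pattern only skips the Content-Type line itself. If a client adds another header after it, such as `Content-Transfer-Encoding: 7bit`, the digits in that header would be captured instead of the number in the body, so this approach depends on the exact header layout.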

Answered 2025-04-17 by Python大师

You may want to use the iterators to skip the headers of the subparts.

http://docs.python.org/library/email.iterators.html#module-email.iterators

This example prints the body of every subpart of the message whose type is text/plain:

import email.iterators

for part in email.iterators.typed_subpart_iterator(msg, 'text', 'plain'):
    for body_line in email.iterators.body_line_iterator(part):
        print(body_line)
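
As a fuller illustration, here is a sketch in Python 3 that walks the message, decodes the first text/plain part, and only then searches for digits. The top-level headers in `raw_message` are an assumption: the question only shows text starting at the first boundary line. If the real string also starts there, the parser likely has no multipart header to work with and treats everything as one plain-text body, which would explain the output shown in the question. The function name `first_number_in_plain_text` is hypothetical.

```python
import email
import re

# Hypothetical complete message: the top-level headers are an assumption,
# since the question only shows the body starting at the first boundary.
raw_message = (
    "MIME-Version: 1.0\n"
    "Subject: example\n"
    'Content-Type: multipart/alternative; boundary="14dae93404410f62f404b2e65e10"\n'
    "\n"
    "--14dae93404410f62f404b2e65e10\n"
    "Content-Type: text/plain; charset=ISO-8859-1\n"
    "\n"
    "Junk 123 Junk\n"
    "\n"
    "--14dae93404410f62f404b2e65e10\n"
    "Content-Type: text/html; charset=ISO-8859-1\n"
    "\n"
    "<p>Junk 123 Junk</p>\n"
    "\n"
    "--14dae93404410f62f404b2e65e10--\n"
)


def first_number_in_plain_text(raw):
    """Return the first run of digits in a text/plain part, or None."""
    msg = email.message_from_string(raw)
    for part in msg.walk():
        # Skip the multipart container and any non-plain-text parts.
        if part.get_content_type() != "text/plain":
            continue
        # decode=True undoes base64 / quoted-printable and returns bytes.
        payload = part.get_payload(decode=True)
        charset = part.get_content_charset() or "utf-8"
        text = payload.decode(charset, errors="replace")
        match = re.search(r"\d+", text)
        if match:
            return match.group(0)
    return None


print(first_number_in_plain_text(raw_message))
```

Because the search runs only on the decoded payload of the part, the boundary lines and the per-part headers never reach the regular expression, so a plain `\d+` is enough.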

Answered 2025-04-17 by Python大师
