Python: read text only after a specific pattern

Published 2024-05-16 00:37:58


I have a PDF of an email exchange that looks like this:

Jerrmy Bret <jeremy.brett@mnop.com>
To: Jonathan Small <j.small@xyz.com>

FYI...

From: Keven Koster <keve.koster@mnop.com>
To: Jerrmy Bret <jeremy.brett@mnop.com>
Date: 21 Sep 2019
Subject: Approval Required for Travel

Can't Approve as Ruth's approval is required

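Goal: I want to read the email body, i.e. "Can't Approve as Ruth's approval is required".

My current approach:

I am using a regular expression. First the whole PDF file is converted to text, and then I split that text into a list of lines.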

import re

txt = pdf_to_text(email)  # assume pdf_to_text() is a function that does the conversion
txt = txt.split('\n')
pat = re.compile(r'appro.*', re.I)  # case-insensitive match on "appro..."
extract_txt = [f for f in txt if pat.search(f)]

The code above produces a list like this:

['Approval', 'Approve','approval']

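What I want is to run the regex only on the email body, not on the subject line.

Some assumptions:

  1. The email body should contain the word "approve"
  2. The subject line may or may not contain the word "approve"

How should I approach this? One way to make sure I only pick up the email body is to apply the regex only after the subject line. Any pointers?

Also, I cannot use any Python email libraries such as imaplib.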


2 Answers

If it does not matter whether Approve/Approved/Approval appears in the subject or in the body of the email, you can do this:

import re

text = '''From: Jerrmy Bret <jeremy.brett@mnop.com>
To: Jonathan Small <j.small@xyz.com>
Date: 21 Sep 2019
Subject: Stuff

FYI...

From: Keven Koster <keve.koster@mnop.com>
To: Jerrmy Bret <jeremy.brett@mnop.com>
Date: 21 Sep 2019
Subject: Approval Required for Travel

Can't Approve as Ruth's approval is required

From: Jerrmy Bret <jeremy.brett@mnop.com>
To: Keven Koster <keve.koster@mnop.com>
Date: 21 Sep 2019
Subject: Approval Required for Travel

ok thanks Keven, will talk to Ruth
'''

email_regex = re.compile(
    r'(From:(?:(?!From:).)+)',
    re.DOTALL|re.MULTILINE
)
approval_regex = re.compile(
    r'approv(?:e|ed|al)',
    re.IGNORECASE
)
approved_emails = [
   email for email in email_regex.findall(text)
   if approval_regex.search(email)
]
print(approved_emails)

# output
[
   "From: Keven Koster <keve.koster@mnop.com>\nTo: Jerrmy Bret <jeremy.brett@mnop.com>\nDate: 21 Sep 2019\nSubject: Approval Required for Travel\n\nCan't Approve as Ruth's approval is required\n\n",
   'From: Jerrmy Bret <jeremy.brett@mnop.com>\nTo: Keven Koster <keve.koster@mnop.com>\nDate: 21 Sep 2019\nSubject: Approval Required for Travel\n\nok thanks Keven, will talk to Ruth\n'
]

If it does matter, you can change approval_regex to this:

approval_regex = re.compile(
    r'Subject:.+\n.*approv(?:e|ed|al)',
    re.IGNORECASE|re.DOTALL|re.MULTILINE
)
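
To see the difference, the stricter pattern can be run over the same sample thread. The following is only a minimal sketch: the names `sample`, `body_approval_regex`, `matching` and `senders` are illustrative and not part of the answer above, and it assumes every message starts with a `From:` line and that `Subject:` is the last header.

```python
import re

sample = """From: Jerrmy Bret <jeremy.brett@mnop.com>
To: Jonathan Small <j.small@xyz.com>
Date: 21 Sep 2019
Subject: Stuff

FYI...

From: Keven Koster <keve.koster@mnop.com>
To: Jerrmy Bret <jeremy.brett@mnop.com>
Date: 21 Sep 2019
Subject: Approval Required for Travel

Can't Approve as Ruth's approval is required

From: Jerrmy Bret <jeremy.brett@mnop.com>
To: Keven Koster <keve.koster@mnop.com>
Date: 21 Sep 2019
Subject: Approval Required for Travel

ok thanks Keven, will talk to Ruth
"""

# Split the thread into messages: each runs from "From:" up to the next "From:".
email_regex = re.compile(r'From:(?:(?!From:).)+', re.DOTALL)

# Require a newline between "Subject:" and the keyword, so that a hit
# in the subject line alone does not count as a match.
body_approval_regex = re.compile(
    r'Subject:.+\n.*approv(?:ed|e|al)',
    re.IGNORECASE | re.DOTALL,
)

matching = [m for m in email_regex.findall(sample) if body_approval_regex.search(m)]
senders = [m.splitlines()[0] for m in matching]
print(senders)  # ['From: Keven Koster <keve.koster@mnop.com>']
```

The third message is dropped here because its only "Approval" sits in the subject line. Note that this sketch would split a message in the wrong place if the literal text `From:` ever appeared inside a body.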

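Assuming you have already converted everything into lines of text, and assuming the email format is consistent (a "From" field marks the start of a new email and the end of the previous email's body, and the "Subject" field is the last header and marks the start of the body), you can track the body with a flag. When you see a Subject line, the next line belongs to the body, so set the flag to True. When you see a From line, the body has ended, so set the flag back to False.

While the flag is True you are inside a body and can do whatever you want with the line. In the sample code below, I simply collect all body lines (excluding blank lines) into a list. I can then do whatever I like with that list, such as checking whether it contains lines with "approve".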

import re

emails = """
From: Jerrmy Bret <jeremy.brett@mnop.com>
To: Jonathan Small <j.small@xyz.com>
Date: 21 Sep 2019
Subject: Stuff

FYI...

From: Keven Koster <keve.koster@mnop.com>
To: Jerrmy Bret <jeremy.brett@mnop.com>
Date: 21 Sep 2019
Subject: Approval Required for Travel

Can't Approve as Ruth's approval is required

From: Jerrmy Bret <jeremy.brett@mnop.com>
To: Keven Koster <keve.koster@mnop.com>
Date: 21 Sep 2019
Subject: Approval Required for Travel

ok thanks Keven, will talk to Ruth

"""
body = False
email_bodys = []
for line in emails.splitlines():
    if not line:
        continue
    if line.startswith("From: "):
        body = False
    if body:
        email_bodys.append(line)
    if line.startswith("Subject: "):
        body = True
print("email bodys detected in the text are:\n\t" + "\n\t".join(email_bodys))

print("text in body which contain approve:")
for email_body in email_bodys:
    if re.findall(r'approve', email_body, re.I):
        print("\t" + email_body)

Output:

email bodys detected in the text are:
    FYI...
    Can't Approve as Ruth's approval is required
    ok thanks Keven, will talk to Ruth
text in body which contain approve:
    Can't Approve as Ruth's approval is required
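
The flat list above no longer records which message each line came from. One possible variation, under the same format assumptions (`From: ` starts a message, `Subject: ` is the last header), keeps one body per message and searches only those bodies. This is a sketch, not the answerer's code: the function name `split_bodies`, the shortened `thread` sample with made-up `example.com` addresses, and the widened pattern `approv(?:ed|e|al)` are all illustrative choices.

```python
import re


def split_bodies(text):
    """Return one body string per message found in `text`.

    Assumes each message starts with a "From: " line and that the
    "Subject: " line is the last header before the body.
    """
    bodies = []
    current = None  # body lines of the message being read; None while in headers
    for line in text.splitlines():
        if line.startswith("From: "):
            # A new message starts: store the body collected so far, if any.
            if current is not None:
                bodies.append("\n".join(current).strip())
            current = None
        elif line.startswith("Subject: "):
            # The body starts on the line after the subject.
            current = []
        elif current is not None:
            current.append(line)
    # Store the body of the last message.
    if current is not None:
        bodies.append("\n".join(current).strip())
    return bodies


# Shortened, made-up sample thread for illustration.
thread = """From: A <a@example.com>
To: B <b@example.com>
Subject: Approval Required for Travel

Can't Approve as Ruth's approval is required

From: B <b@example.com>
To: A <a@example.com>
Subject: Approval Required for Travel

ok thanks, will talk to Ruth
"""

pattern = re.compile(r"approv(?:ed|e|al)", re.IGNORECASE)
bodies = split_bodies(thread)
for body in bodies:
    # Print each body together with the keyword hits found in it.
    print(repr(body), pattern.findall(body))
```

Because the subject lines never reach `bodies`, the "Approval" in each subject is ignored and only the first message reports hits. A body line that itself begins with `From: ` or `Subject: ` would confuse this sketch, so it is only suitable when the format assumptions hold.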
