Parsing an email with regex in Python

RECEIVED: 2012-11 20 09:59:24 SUBJECT: Get Boddy --- Original Sender: Mark Twain. --- ----- Original Message ----- From: Boby Indo To: Obum Hunter At: 11/20 9:59:22 ***NEW ISSUE SUPPORTED THROUGH UNIVERSALITY vs 104-13 on AY 3s JAN 10+BB {MYXV ABC 4116 SM MYXV YA 102-15 <DO>} | 2010/11 4.0s 4.0s 6+ BB {MYXV ABC 4132 NS MYXV YT 102-22 <DO>} | 2010 4.5s 4.5s ABO 2006-OP1 M1 00442PAG5 19-24 p5 ***SECOND SUPPORTED TRHOUGH INVERSALITY GEVINGS 10+BB {NXTW VXA 4061 SL MYXV YA 103-22 <DO>} | 11 wala 3.5s 3.5s 10+BB {NXTW VXA 12-47 SP MYXV YA 106-20 <DO>} | 22 wala 4.0s 4.0s ------------------------------------------------------------ © Copyright 2012 The Ridgly Group, Inc. All rights reserved. See http://www.examply.html for important information disclosure.

2 answers

User

Answer 1 · edited 2024-04-25 08:24:03

>>> import re
>>> s="""RECEIVED: 2012-11 20 09:59:24
... SUBJECT: Get Boddy
... --- Original Sender: Mark Twain. ---
... 
... ----- Original Message -----
... From: Boby Indo
... To: Obum Hunter 
... At: 11/20  9:59:22
... 
... ***NEW ISSUE SUPPORTED THROUGH UNIVERSALITY   vs 104-13 on AY 3s JAN   
... 10+BB {MYXV ABC 4116    SM  MYXV YA 102-15 <DO>} | 2010/11 4.0s             4.0s
... 6+ BB {MYXV ABC 4132    NS  MYXV YT 102-22 <DO>} | 2010 4.5s                4.5s
... ABO 2006-OP1 M1     00442PAG5     19-24      p5 
... ***SECOND SUPPORTED TRHOUGH INVERSALITY GEVINGS
... 10+BB  {NXTW VXA 4061   SL  MYXV YA 103-22 <DO>} | 11 wala 3.5s             3.5s
... 10+BB  {NXTW VXA 12-47  SP  MYXV YA 106-20 <DO>} | 22 wala 4.0s             4.0s
... 
... ------------------------------------------------------------
... © Copyright 2012 The Ridgly Group, Inc. All rights reserved. See
... http://www.examply.html for important information disclosure."""
>>> r=r'(?P<header>\*\*\*[^\n]*)\n(?P<body>[\s\S]*?\n)\n'
>>> for match in re.finditer(r, s):
...     print(match.group('body'))
... 
10+BB {MYXV ABC 4116    SM  MYXV YA 102-15 <DO>} | 2010/11 4.0s             4.0s
6+ BB {MYXV ABC 4132    NS  MYXV YT 102-22 <DO>} | 2010 4.5s                4.5s

10+BB  {NXTW VXA 4061   SL  MYXV YA 103-22 <DO>} | 11 wala 3.5s             3.5s
10+BB  {NXTW VXA 12-47  SP  MYXV YA 106-20 <DO>} | 22 wala 4.0s             4.0s
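
The pattern above relies on a blank line closing each section. For mail where one `***` section runs straight into the next, as in the string `s` shown in this answer, a lookahead can end the body at the next header, at a blank line, or at the end of the text instead. This is only a sketch: `SAMPLE`, `SECTION_RE`, and `split_sections` are names made up for illustration, and `SAMPLE` is a shortened copy of the question's data.

```python
import re

# Shortened, hypothetical copy of the email from the question.
SAMPLE = """At: 11/20  9:59:22

***NEW ISSUE SUPPORTED THROUGH UNIVERSALITY
10+BB {MYXV ABC 4116 SM MYXV YA 102-15 <DO>} | 2010/11 4.0s 4.0s
6+ BB {MYXV ABC 4132 NS MYXV YT 102-22 <DO>} | 2010 4.5s 4.5s
ABO 2006-OP1 M1 00442PAG5 19-24 p5
***SECOND SUPPORTED TRHOUGH INVERSALITY GEVINGS
10+BB {NXTW VXA 4061 SL MYXV YA 103-22 <DO>} | 11 wala 3.5s 3.5s
10+BB {NXTW VXA 12-47 SP MYXV YA 106-20 <DO>} | 22 wala 4.0s 4.0s

------------------------------------------------------------
Copyright notice
"""

# header: one line starting with ***; body: lazily up to the next
# header line, a blank line, or the end of the text.
SECTION_RE = re.compile(
    r"^(?P<header>\*\*\*[^\n]*)\n(?P<body>.*?)(?=^\*\*\*|^\s*$|\Z)",
    re.MULTILINE | re.DOTALL,
)

def split_sections(text):
    """Return a list of (header, [body lines]) tuples."""
    return [
        (m.group("header").strip(), m.group("body").splitlines())
        for m in SECTION_RE.finditer(text)
    ]

for header, lines in split_sections(SAMPLE):
    print(header)
    for line in lines:
        print("   ", line)
```

Keeping the header in its own named group means the body lines can be reported per section without a second pass over the text.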

User

Answer 2 · edited 2024-04-25 08:24:03

See if this works for you. The lines you need start with digits followed by a plus sign:

^[0-9]*\+.*$

This matches the lines in the expected output. The pattern breaks down as follows:

^ Matches the beginning of the string.
[0-9] Matches any single character in the range 0-9.
* Matches 0 or more of the preceding token. This is a greedy match, and will match as many characters as possible before satisfying the next token.
\+ Matches a + character.
. Matches any character except a line break.
$ Matches the end of the string.
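
To see the tokens above in action, the pattern can be run against a few lines held in memory. In this sketch, `LINES` is a hypothetical sample shortened from the question's email. The second pattern swaps `*` for `+` so that at least one digit is required; it is an assumption about what may be wanted, not part of the answer.

```python
import re

# Hypothetical sample lines, shortened from the question's email.
LINES = [
    "10+BB {MYXV ABC 4116 SM MYXV YA 102-15 <DO>} | 2010/11 4.0s 4.0s",
    "6+ BB {MYXV ABC 4132 NS MYXV YT 102-22 <DO>} | 2010 4.5s 4.5s",
    "ABO 2006-OP1 M1 00442PAG5 19-24 p5",
    "+ no leading digits",
]

loose = re.compile(r"^[0-9]*\+.*$")   # pattern from the answer: zero or more digits
strict = re.compile(r"^[0-9]+\+.*$")  # assumed variant: at least one digit

loose_hits = [line for line in LINES if loose.match(line)]
strict_hits = [line for line in LINES if strict.match(line)]

print(len(loose_hits))   # 3: both digit lines plus the bare "+" line
print(len(strict_hits))  # 2: only the lines that start with digits
```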

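#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re

# Lines of interest start with optional digits followed by a plus sign.
# A raw string keeps "\+" from being read as a string escape.
pattern = re.compile(r"^[0-9]*\+.*$")

with open("/path/to/file", "r") as fileInput:
    # Iterate over the file line by line and keep the stripped matches.
    listLines = [line.strip() for line in fileInput if pattern.match(line)]

for line in listLines:
    print(line)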

>>> 10+BB {MYXV ABC 4116    SM  MYXV YA 102-15 <DO>} | 2010/11 4.0s             4.0s
>>> 6+ BB {MYXV ABC 4132    NS  MYXV YT 102-22 <DO>} | 2010 4.5s                4.5s
>>> 10+BB  {NXTW VXA 4061   SL  MYXV YA 103-22 <DO>} | 11 wala 3.5s             3.5s
>>> 10+BB  {NXTW VXA 12-47  SP  MYXV YA 106-20 <DO>} | 22 wala 4.0s             4.0s

Updated to meet the new requirements:

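#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re

# A message starts at "***" and runs up to the next "*" or up to an
# empty or dash-only line, whichever comes first.
regex = re.compile(r"\*{3}[^\*]*?(?:(?=^-*$)|(?=\*))", re.MULTILINE)

with open("/path/to/file", "r") as fileInput:
    # One inner list per message, without the "***" header and blank lines.
    listMsg = [
        [
            line.strip()
            for line in message.split("\n")
            if not line.startswith("*") and line.strip()
        ]
        for message in regex.findall(fileInput.read())
    ]

for msg in listMsg:
    for line in msg:
        print(line)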

>>> 10+BB {MYXV ABC 4116    SM  MYXV YA 102-15 <DO>} | 2010/11 4.0s             4.0s
>>> 6+ BB {MYXV ABC 4132    NS  MYXV YT 102-22 <DO>} | 2010 4.5s                4.5s
>>> ABO 2006-OP1 M1     00442PAG5     19-24      p5
>>> 10+BB  {NXTW VXA 4061   SL  MYXV YA 103-22 <DO>} | 11 wala 3.5s             3.5s
>>> 10+BB  {NXTW VXA 12-47  SP  MYXV YA 106-20 <DO>} | 22 wala 4.0s             4.0s
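
The nested list above drops the `***` headers. If it helps to know which section each line came from, the same regular expression can feed a dictionary keyed by header. A minimal sketch, assuming the text is already in memory; `TEXT`, `SECTION`, and `group_by_header` are hypothetical names, and `TEXT` is a shortened copy of the question's data.

```python
import re

# Same pattern as in the answer: "***", then everything up to the next "*"
# or up to an empty or dash-only line.
SECTION = re.compile(r"\*{3}[^\*]*?(?:(?=^-*$)|(?=\*))", re.MULTILINE)

TEXT = """At: 11/20  9:59:22

***NEW ISSUE SUPPORTED THROUGH UNIVERSALITY
10+BB {MYXV ABC 4116 SM MYXV YA 102-15 <DO>} | 2010/11 4.0s 4.0s
ABO 2006-OP1 M1 00442PAG5 19-24 p5
***SECOND SUPPORTED TRHOUGH INVERSALITY GEVINGS
10+BB {NXTW VXA 4061 SL MYXV YA 103-22 <DO>} | 11 wala 3.5s 3.5s

------------------------------------------------------------
"""

def group_by_header(text):
    """Map each *** header line to the non-empty lines that follow it."""
    sections = {}
    for message in SECTION.findall(text):
        lines = [line.strip() for line in message.split("\n") if line.strip()]
        header, body = lines[0], lines[1:]
        sections[header.lstrip("*")] = body
    return sections

for header, body in group_by_header(TEXT).items():
    print(header, "->", len(body), "line(s)")
```

Because the pattern stops at any `*`, a data line that itself contains an asterisk would cut a section short; that limitation carries over from the original expression.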

Updated to extract the entire body of the email:

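#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re

# Skip the rest of the "At:" line (group 1), then capture the body (group 2)
# up to the first line that consists only of dashes.
regex = re.compile(r"(?<=^At:)([^\n\r]*)(.*?)(?=^-*-$)", re.MULTILINE | re.DOTALL)

with open("/path/to/file", "r") as fileInput:
    print(regex.search(fileInput.read()).group(2))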

>>> ACE 2006-OP1 ZZ 111111111 19-24 Z5 ZZW 2012-0P1 SD 222222222 77-00 150
>>> ***NEW ISSUE SUPPORTED THROUGH UNIVERSALITY   vs 104-13 on AY 3s JAN   
>>> 10+BB {MYXV ABC 4116    SM  MYXV YA 102-15 <DO>} | 2010/11 4.0s             4.0s
>>> 6+ BB {MYXV ABC 4132    NS  MYXV YT 102-22 <DO>} | 2010 4.5s                4.5s
>>> ABO 2006-OP1 M1     00442PAG5     19-24      p5 
>>> ***SECOND SUPPORTED TRHOUGH INVERSALITY GEVINGS                      
>>> 10+BB  {NXTW VXA 4061   SL  MYXV YA 103-22 <DO>} | 11 wala 3.5s             3.5s
>>> 10+BB  {NXTW VXA 12-47  SP  MYXV YA 106-20 <DO>} | 22 wala 4.0s             4.0s
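
`regex.search` returns `None` when the mail has no `At:` line or no dashed footer, so chaining a call onto its result raises an `AttributeError` in that case. A hedged sketch of the same extraction wrapped in a function that returns `None` instead; `BODY`, `extract_body`, and `MAIL` are hypothetical names, and `MAIL` is a shortened copy of the question's data.

```python
import re

# Same pattern as in the answer: skip the rest of the "At:" line, then
# capture everything up to the first line made only of dashes.
BODY = re.compile(r"(?<=^At:)([^\n\r]*)(.*?)(?=^-*-$)", re.MULTILINE | re.DOTALL)

MAIL = """From: Boby Indo
To: Obum Hunter
At: 11/20  9:59:22

***NEW ISSUE SUPPORTED THROUGH UNIVERSALITY
10+BB {MYXV ABC 4116 SM MYXV YA 102-15 <DO>} | 2010/11 4.0s 4.0s
------------------------------------------------------------
Copyright notice
"""

def extract_body(text):
    """Return the text between the "At:" line and the dashed footer, or None."""
    match = BODY.search(text)
    if match is None:
        return None
    return match.group(2).strip()

print(extract_body(MAIL))
print(extract_body("no header here"))  # None
```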
