Need to modify a regex so that it also works in other cases

Posted 2024-05-23 23:41:11


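I just discovered that the structure of my files can vary, and because of that variation my regular expression only works some of the time. My regex is: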
v6 = re.findall(r'(?s)----------\s*LOW VOLTAGE SUMMARY BY AREA.*?\rACTIVITY.+?',wholefile)

It currently matches the following part of the file:

----------           LOW VOLTAGE SUMMARY BY AREA            ----------

         BUS   NAME   BASKV    VOLT    TIME       AREA     ZONE

       12006  [AMISTAD 69.0]   0.971   1.8700  10 NEW MEXICO    121
       11223  [WHITESA213.8]   0.918   1.9900  11 EL PASO       110
       70044  [B.HYDROB4.16]   0.955   2.3233  70 PSCOLORADO    703
       70044  [B.HYDROB4.16]   0.955   2.3233  70 PSCOLORADO    703
       79086  [PAGOSA   115]   0.937   2.0333  73 WAPA R.M.     791

ACTIVITY? 
PDEV

ENTER OUTPUT DEVICE CODE:
 0 FOR NO OUTPUT
 1 FOR PROGRESS WINDOW

However, this part of the file sometimes looks like this instead:

    ----------           LOW VOLTAGE SUMMARY BY AREA            ----------

         BUS   NAME   BASKV    VOLT    TIME       AREA     ZONE

       12006  [AMISTAD 69.0]   0.742  13.2060  10 NEW MEXICO    121
       11223  [WHITESA213.8]   0.916   1.8367  11 EL PASO       110
       70187  [FTGARLND69.0]   0.936  19.6099  70 PSCOLORADO    710
       73216  [WINDRIVR 115]   0.858   3.6100  73 WAPA R.M.     750

(VFSCAN) AT TIME = 20.0000 UP TO  100 BUSES WITH LOW FREQUENCY BELOW 59.600:

X ----- BUS ------ X    FREQ       X ----- BUS ------ X    FREQ
12063 [ROSEBUD 13.8]   59.506     

In both cases, I only want to capture the following part:

----------           LOW VOLTAGE SUMMARY BY AREA            ----------

     BUS   NAME   BASKV    VOLT    TIME       AREA     ZONE

   12006  [AMISTAD 69.0]   0.971   1.8700  10 NEW MEXICO    121
   11223  [WHITESA213.8]   0.918   1.9900  11 EL PASO       110
   70044  [B.HYDROB4.16]   0.955   2.3233  70 PSCOLORADO    703
   70044  [B.HYDROB4.16]   0.955   2.3233  70 PSCOLORADO    703
   79086  [PAGOSA   115]   0.937   2.0333  73 WAPA R.M.     791

How can I make my regex return the section above regardless of which version of the file I am looking at?


2 answers

I would not recommend a regular expression here; do some simple parsing instead. Assuming your data is in a string named data:

# splitlines() handles \n, \r\n and bare \r line endings
lines = data.splitlines()

# find the header line
for index, line in enumerate(lines):
    if "LOW VOLTAGE SUMMARY BY AREA" in line:
        start_index = index
        break

# find the first data entry (a line whose first token is a number)
for index, line in enumerate(lines[start_index:]):
    if line.strip() and line.split()[0].isdigit():
        first_entry_index = start_index + index
        break

# find the last data entry: skip blank lines, remember each entry seen,
# and stop at the first non-blank line that is not an entry
# (the data may also simply end with entries and whitespace)
last_entry_index = first_entry_index
for index, line in enumerate(lines[first_entry_index:]):
    if not line.strip():
        continue
    if not line.split()[0].isdigit():
        break
    last_entry_index = first_entry_index + index

# print all lines from the header through the last data entry
print("\n".join(lines[start_index:last_entry_index + 1]))
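
The question uses re.findall, which suggests a file may contain more than one summary block, while the loops above stop at the first header. As a rough sketch of the same line-based idea that yields every block, assuming each block ends at the first non-blank line that does not start with a bus number (the function name iter_low_voltage_sections and the shortened sample text are made up for illustration):

```python
HEADER = "LOW VOLTAGE SUMMARY BY AREA"

def iter_low_voltage_sections(text):
    """Yield each summary block, from its header through its last data row."""
    lines = text.splitlines()
    i = 0
    while i < len(lines):
        if HEADER not in lines[i]:
            i += 1
            continue
        start = i
        last_entry = None
        i += 1
        while i < len(lines):
            if HEADER in lines[i]:
                break                  # next block begins; let the outer loop handle it
            tokens = lines[i].split()
            if tokens and tokens[0].isdigit():
                last_entry = i         # a data row
            elif tokens and last_entry is not None:
                break                  # first non-blank, non-data line after the rows
            i += 1
        if last_entry is not None:
            yield "\n".join(lines[start:last_entry + 1])

# Shortened, hypothetical sample containing both layouts from the question
sample = (
    "----------   LOW VOLTAGE SUMMARY BY AREA   ----------\n"
    "\n"
    "   BUS   NAME   BASKV    VOLT\n"
    "\n"
    "  12006  [AMISTAD 69.0]   0.971\n"
    "  11223  [WHITESA213.8]   0.918\n"
    "\n"
    "ACTIVITY? \n"
    "PDEV\n"
    "    ----------   LOW VOLTAGE SUMMARY BY AREA   ----------\n"
    "\n"
    "  73216  [WINDRIVR 115]   0.858\n"
    "\n"
    "(VFSCAN) AT TIME = 20.0000\n"
    "12063 [ROSEBUD 13.8]   59.506\n"
)
sections = list(iter_low_voltage_sections(sample))
```

Because the block boundary is defined by the shape of the lines rather than by a specific keyword, this variant does not need to know whether ACTIVITY or (VFSCAN) follows the table.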

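This should work. The alternation is written as a non-capturing group, (?:...), because with a capturing group re.findall returns only the group's text instead of the whole match:

v6 = re.findall(r'(?s)     \s*LOW VOLTAGE SUMMARY BY AREA.*?\r(?:ACTIVITY|\(VFSCAN\)).+?',wholefile)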
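
Since the goal is to capture only the header through the last data row, another option is a lookahead that stops the match just before the terminating line instead of consuming it. The following is a minimal sketch, assuming ACTIVITY and (VFSCAN) are the only two lines that can follow the table (as in the two layouts shown in the question) and that line endings may be \n, \r\n or \r. The names SECTION_RE, variant_a and variant_b are illustrative, and the sample strings are shortened.

```python
import re

SECTION_RE = re.compile(
    r"-{10}[ \t]*LOW VOLTAGE SUMMARY BY AREA"          # start of the header line
    r".*?"                                              # lazily take everything up to ...
    r"(?=\s*[\r\n][ \t]*(?:ACTIVITY|\(VFSCAN\)))",      # ... just before the terminator line
    re.DOTALL,
)

# Layout 1: the table is followed by an ACTIVITY prompt
variant_a = (
    "----------   LOW VOLTAGE SUMMARY BY AREA   ----------\n"
    "\n"
    "   BUS   NAME   BASKV    VOLT\n"
    "\n"
    "  12006  [AMISTAD 69.0]   0.971\n"
    "  79086  [PAGOSA   115]   0.937\n"
    "\n"
    "ACTIVITY? \n"
    "PDEV\n"
)

# Layout 2: indented header, \r\n line endings, followed by a (VFSCAN) line
variant_b = (
    "    ----------   LOW VOLTAGE SUMMARY BY AREA   ----------\r\n"
    "\r\n"
    "  12006  [AMISTAD 69.0]   0.742\r\n"
    "  73216  [WINDRIVR 115]   0.858\r\n"
    "\r\n"
    "(VFSCAN) AT TIME = 20.0000 UP TO  100 BUSES\r\n"
)

# No capturing groups in the pattern, so findall returns whole matches
matches_a = SECTION_RE.findall(variant_a)
matches_b = SECTION_RE.findall(variant_b)
```

The lookahead also swallows the blank lines between the last row and the terminator, so each match ends at the last data row. If other lines can follow the table in practice, they would need to be added to the alternation.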
