How to print only the part of a data file that matches a pattern in Python

Published 2024-05-13 23:56:18


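I want to use Python to print certain blocks of data from a file. Basically, it should work as a parser that outputs only the blocks matching my criteria. My file contains call-center logs. I want the sections that start with "####" and end with "</soap:Body>>", but a section must also contain a specific number, which appears in my file as msisdn: "<msisdn>any number</msisdn>".

The file is also fairly large. When I read it with readlines() and loop with for i, data in enumerate(line), I can't use a regex, because the data is split into lines there and I can't search the whole block that I need.

Part of the file looks like this: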

####<Jun 4, 2016 12:05:50 PM IST> <Debug> <MessagingBridgeRuntimeVerbose> <ggneai29> <AircelESB_MS1> <[ACTIVE] ExecuteThread: '13' for queue: 'weblogic.kernel.Default (self-tuning)'> <<WLS Kernel>> <> <> <1465022150722> <BEA-000000> <Bridge NPGBridge doTrigger(): state = 4 stopped = false> 
####<Jun 4, 2016 12:05:50 PM IST> <Error> <ALSB Logging> <ggneai29> <AircelESB_MS1> <[ACTIVE] ExecuteThread: '13' for queue: 'weblogic.kernel.Default (self-tuning)'> <<anonymous>> <> <> <1465022150886> <BEA-000000> < [PipelinePairNode1, PipelinePairNode1_request, CreateVASReportingStage, REQUEST] *** CreateVASWrapper Reprting Stage VAS V-3.0 ***: <soap:Body xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <VASProxyType xmlns="http://xmlns.aircel.com/AircelTransformation/ProxyService/OrderProxy/1.0/CreateVASSubscriptionConsumerSchema">
    <TransactionId>DATA030620160431128801011429ADD</TransactionId>
    <msisdn>8801011429</msisdn>
    <productCode>DATA</productCode>
    <action>ADD</action>
    <IMSI>405801124044563</IMSI>
    <SubsType>PrePaid</SubsType>
  </VASProxyType>
</soap:Body>> 
####<Jun 4, 2016 12:05:50 PM IST> <Error> <ALSB Logging> <ggneai29> <AircelESB_MS1> <[ACTIVE] ExecuteThread: '13' for queue: 'weblogic.kernel.Default (self-tuning)'> <<anonymous>> <> <> <1465022150889> <BEA-000000> < [PipelinePairNode1, PipelinePairNode1_request, Authentication, REQUEST] ***REQUEST FOR VAS V-3.0 ****: <soap:Body xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <VASProxyType xmlns="http://xmlns.aircel.com/AircelTransformation/ProxyService/OrderProxy/1.0/CreateVASSubscriptionConsumerSchema">
    <TransactionId>DATA030620160431128801011429ADD</TransactionId>

The output should be:

<[ACTIVE] ExecuteThread: '13' for queue: 'weblogic.kernel.Default (self-tuning)'> <> <> <1465022150886> [PipelinePairNode1, PipelinePairNode1_request, CreateVASReportingStage, REQUEST] *** CreateVASWrapper Reprting Stage VAS V-3.0 ***

DATA030620160431128801011429ADD 8801011429 DATA ADD 405801124044563 PrePaid >

Please help!


2 answers

As suggested in the comments: your XML is invalid. It would be best to make sure the XML is valid and then use a parser such as etree or Beautiful Soup.
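As a rough sketch of the parser route, assuming each log block contains one complete `<soap:Body>...</soap:Body>` element once the log header and the stray trailing `>` are cut away, the standard library's `xml.etree.ElementTree` can read the fields. The function name `extract_fields`, the prefix label `vas`, and the abbreviated sample block are illustrative choices, not part of the original post.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the sample log; "vas" is an arbitrary prefix label.
VAS_NS = ("http://xmlns.aircel.com/AircelTransformation/ProxyService/"
          "OrderProxy/1.0/CreateVASSubscriptionConsumerSchema")
NS = {"vas": VAS_NS}
END_TAG = "</soap:Body>"


def extract_fields(block_text):
    """Return the VASProxyType children as a dict, or None if no full soap:Body."""
    start = block_text.find("<soap:Body")
    end = block_text.find(END_TAG)
    if start == -1 or end == -1:
        return None
    # Keep only the XML element; this drops the log header and the extra ">".
    root = ET.fromstring(block_text[start:end + len(END_TAG)])
    proxy = root.find("vas:VASProxyType", NS)
    if proxy is None:
        return None
    # ElementTree stores tags as "{namespace}name"; keep just the name.
    return {child.tag.split("}", 1)[-1]: child.text for child in proxy}


# Abbreviated sample block (shortened header, fewer fields than the real log).
sample_block = (
    '####<Jun 4, 2016 12:05:50 PM IST> <Error> <ALSB Logging> '
    '<soap:Body xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">\n'
    '  <VASProxyType xmlns="' + VAS_NS + '">\n'
    '    <msisdn>8801011429</msisdn>\n'
    '    <action>ADD</action>\n'
    '  </VASProxyType>\n'
    '</soap:Body>> \n'
)

fields = extract_fields(sample_block)
print(fields["msisdn"])  # prints 8801011429
```

This only works on a block that has already been isolated from the log; a truncated block, like the last one in the sample, returns None because its closing tag is missing.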

But if you want to use regex, you could try:

import re

mytext = [
    '####<Jun 4, 2016 12:05:50 PM IST> <Debug> <MessagingBridgeRuntimeVerbose> <ggneai29> <AircelESB_MS1> <[ACTIVE] ExecuteThread: \'13\' for queue: \'weblogic.kernel.Default (self-tuning)\'> <<WLS Kernel>> <> <> <1465022150722> <BEA-000000> <Bridge NPGBridge doTrigger(): state = 4 stopped = false>',
    '####<Jun 4, 2016 12:05:50 PM IST> <Error> <ALSB Logging> <ggneai29> <AircelESB_MS1> <[ACTIVE] ExecuteThread: \'13\' for queue: \'weblogic.kernel.Default (self-tuning)\'> <<anonymous>> <> <> <1465022150886> <BEA-000000> < [PipelinePairNode1, PipelinePairNode1_request, CreateVASReportingStage, REQUEST] *** CreateVASWrapper Reprting Stage VAS V-3.0 ***: <soap:Body xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">',
    '<VASProxyType xmlns="http://xmlns.aircel.com/AircelTransformation/ProxyService/OrderProxy/1.0/CreateVASSubscriptionConsumerSchema">',
    '    <TransactionId>DATA030620160431128801011429ADD</TransactionId>',
    '    <msisdn>8801011429</msisdn>',
    '    <productCode>DATA</productCode>',
    '    <action>ADD</action>',
    '    <IMSI>405801124044563</IMSI>',
    '    <SubsType>PrePaid</SubsType>',
    '</VASProxyType>',
    '</soap:Body>',
    '<Jun 4, 2016 12:05:50 PM IST> <Error> <ALSB Logging> <ggneai29> <AircelESB_MS1> <[ACTIVE] ExecuteThread: \'13\' for queue: \'weblogic.kernel.Default (self-tuning)\'> <<anonymous>> <> <> <1465022150889> <BEA-000000> < [PipelinePairNode1, PipelinePairNode1_request, Authentication, REQUEST] ***REQUEST FOR VAS V-3.0 ****: <soap:Body xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">',
    '    <VASProxyType xmlns="http://xmlns.aircel.com/AircelTransformation/ProxyService/OrderProxy/1.0/CreateVASSubscriptionConsumerSchema">',
    '        <TransactionId>DATA030620160431128801011429ADD</TransactionId>',
]

searches = [
    {
        "if_in": "<[ACTIVE] ExecuteThread:",
        # raw strings keep backslashes such as \[ and \d intact for the regex engine
        "search": r"<\[ACTIVE[^<>]+> <<WLS Kernel>> <> <> <\d+>",
    },
    {
        "if_in": "PipelinePairNode1, PipelinePairNode1_request, Create",
        "search": r"< \[PipelinePairNode1, PipelinePairNode1_request, Create[^\[\]]+\]",
    },
    {
        "if_in": "CreateVASWrapper Reprting Stage VAS",
        "search": "CreateVASWrapper Reprting Stage VAS[^*]+",
    },
    {
        "if_in": "<TransactionId>",
        "search": "(?<=<TransactionId>)[^<>]+",
    },
    {
        "if_in": "<msisdn>",
        "search": "(?<=<msisdn>)[^<>]+",
    },
    {
        "if_in": "<action>",
        "search": "(?<=<action>)[^<>]+",
    },
    {
        "if_in": "<IMSI>",
        "search": "(?<=<IMSI>)[^<>]+",
    },
    {
        "if_in": "<SubsType>",
        "search": "(?<=<SubsType>)[^<>]+",
    },
]

result = ""
found_once = []

for item in mytext:
    for search in searches:
        if search['if_in'] in item and search['if_in'] not in found_once:
            f = re.findall(search['search'], item)
            if f:
                result += f[0] + " "
                found_once.append(search['if_in'])

print(result)

If you want to find anything else, add it to searches.

The result is:

<[ACTIVE] ExecuteThread: '13' for queue: 'weblogic.kernel.Default (self-tuning)'> <<WLS Kernel>> <> <> <1465022150722> < [PipelinePairNode1, PipelinePairNode1_request, CreateVASReportingStage, REQUEST] CreateVASWrapper Reprting Stage VAS V-3.0  DATA030620160431128801011429ADD 8801011429 ADD 405801124044563 PrePaid
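To illustrate adding an entry, here is a hypothetical extra item for the `<productCode>` element, which the list above does not extract. It follows the same `if_in` / `search` shape; the variable names are made up for this sketch.

```python
import re

# Hypothetical extra entry, in the same shape as the items in `searches`.
product_code_search = {
    "if_in": "<productCode>",
    "search": r"(?<=<productCode>)[^<>]+",
}

line = "    <productCode>DATA</productCode>"

matches = []
if product_code_search["if_in"] in line:
    # The lookbehind anchors the match right after the opening tag.
    matches = re.findall(product_code_search["search"], line)

print(matches[0] if matches else None)  # prints DATA
```

Appending this dict to `searches` would place the product code in the result string at the position where the loop reaches that line of the log.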

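The standard way to handle this kind of problem is to write some sort of "event-based" parser (like SAX XML parsers): the parser reads the file line by line (there is no need to load the whole file into memory) and scans each line according to its own rules. This is where you may want to use regexps, although plain string methods sometimes work just as well. Depending on the line's content, the parser emits a given "event" (to be handled by a callback method) together with the associated data.

In your case, there would be one event for the line that starts an interesting data block (a line starting with "####"), another event for lines containing XML data, and one more for the last line of the block (the line containing "</soap:Body>"), something like this: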

class Parser(object):

    def parse(self, logfile):
        self.in_block = False
        for line in logfile:
            if self.is_block_start(line):
                self.in_block = True
                self.handle_block_start(line)
            elif self.in_block:
                if self.is_data(line):
                    self.handle_data(line)
                elif self.is_block_end(line):
                    self.in_block = False
                    self.handle_block_end(line)
            else:
                continue

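    def is_block_start(self, line):
        # your code here
        raise NotImplementedError

    def is_data(self, line):
        # your code here
        raise NotImplementedError

    def is_block_end(self, line):
        # your code here
        raise NotImplementedError

    def handle_block_start(self, line):
        # your code here
        raise NotImplementedError

    def handle_data(self, line):
        # your code here
        raise NotImplementedError

    def handle_block_end(self, line):
        # your code here
        raise NotImplementedError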
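One possible way to make the line-by-line idea concrete is sketched below. It swaps the callback methods for a generator, which is a simplification rather than the structure shown above, and it assumes that every block starts on a line beginning with "####" and ends on the line containing "</soap:Body>". The names `iter_blocks` and `blocks_for_msisdn` are hypothetical.

```python
def iter_blocks(lines, start_marker="####", end_marker="</soap:Body>"):
    """Yield each block, from a start_marker line through the end_marker line."""
    block = None
    for line in lines:
        if line.startswith(start_marker):
            # A new header discards any unfinished block (e.g. plain log lines).
            block = [line]
        elif block is not None:
            block.append(line)
        if block is not None and end_marker in line:
            yield "".join(block)
            block = None


def blocks_for_msisdn(lines, msisdn):
    """Yield only the complete blocks that mention the given msisdn."""
    wanted = "<msisdn>{}</msisdn>".format(msisdn)
    for block in iter_blocks(lines):
        if wanted in block:
            yield block


# Abbreviated sample lines standing in for the real log file.
sample_lines = [
    "####<Jun 4, 2016 12:05:50 PM IST> <Debug> <Bridge NPGBridge doTrigger()>\n",
    "####<Jun 4, 2016 12:05:50 PM IST> <Error> <soap:Body>\n",
    "    <msisdn>8801011429</msisdn>\n",
    "</soap:Body>> \n",
    "####<Jun 4, 2016 12:05:50 PM IST> <Error> <soap:Body>\n",
    "    <TransactionId>DATA030620160431128801011429ADD</TransactionId>\n",
]

found = list(blocks_for_msisdn(sample_lines, "8801011429"))
for block in found:
    print(block)
```

Because a file object is itself an iterator over lines, passing the result of `open(...)` in place of `sample_lines` keeps only the current block in memory, which matters for a large log. The truncated final block is never yielded because its end marker never arrives.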
