Python: splitting a log message into tokens

1 vote

5 answers

1274 views

Asked 2025-04-17 18:06

I have a log message in this format:

[2013-Mar-05 18:21:45.415053] (ThreadID) <Module name> [Logging level]    Message Description : This is the message.

I want to turn it into a dictionary like this:

{'time stamp': '2013-Mar-05 18:21:45.415053', 'ThreadId': '4139', 'Module name': 'ModuleA', 'Message Description': 'My Message', 'Message': 'This is the message'}

I tried splitting the log message on spaces and then picking out pieces to build a list, roughly like this:

for i in line1.split(" "):

That gives pieces roughly like these:

['2013-Mar-05', '18:21:45.415053]', '(ThreadID)', '<Module name>', '[Logging level]', 'Message Description', ':', 'This is the message.']

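I can then pick out these pieces and put them into the list I need.

Is there a better way to extract these parts? The format follows a pattern: the timestamp is inside [], the thread ID is inside (), and the module name is inside <>. Can we use that information to extract the fields directly?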


5 Answers

How about the following? (The comments explain what is going on.)

log = '[2013-Mar-05 18:21:45.415053] (ThreadID) <Module name> [Logging level]    Message Description : This is the message.'

# Define functions on how to process the different kinds of tokens
time_stamp = logging_level = lambda x: x.strip('[ ]')
thread_ID = lambda x: x.strip('( )')
module_name = lambda x: x.strip('< >')
# Strip surrounding whitespace from the free-text fields
message_description = message = lambda x: x.strip()

# Names of the tokens used to make the dictionary keys
keys = ['time stamp', 'ThreadId',
        'Module name', 'Logging level',
        'Message Description', 'Message']
# Define functions on how to process the message
funcs = [time_stamp, thread_ID,
         module_name, logging_level,
         message_description, message]
# Define the tokens at which to split the message
split_on = [']', ')', '>', ']', ':']

msg_dict = {}

for i in range(len(split_on)):
    # Split up the log one token at a time
    temp, log = log.split(split_on[i], 1)
    # Process the token using the defined function
    msg_dict[keys[i]] = funcs[i](temp) 

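# Whatever is left after the last split is the message itself.
# Use the last key and function; keys[i] would overwrite 'Message Description'.
msg_dict[keys[-1]] = funcs[-1](log)
print(msg_dict)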
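The same peel-off-one-delimiter-at-a-time idea can be wrapped in a function so it can be applied to many lines. This is only a sketch: the name `split_log` and the `fields` table are made up for illustration, and it assumes every line contains all five delimiters in order (otherwise `str.split` yields a single piece and the unpacking raises `ValueError`).

```python
def split_log(line):
    """Split one log line into a dict by peeling off one delimiter at a time."""
    # (dictionary key, delimiter that ends the field, characters to strip)
    fields = [
        ('time stamp', ']', '[ '),
        ('ThreadId', ')', '( '),
        ('Module name', '>', '< '),
        ('Logging level', ']', '[ '),
        ('Message Description', ':', ' '),
    ]
    result = {}
    rest = line
    for key, delimiter, strip_chars in fields:
        # Cut at the first occurrence only, and keep the remainder for the next field
        token, rest = rest.split(delimiter, 1)
        result[key] = token.strip(strip_chars)
    # Everything after the ':' is the message text
    result['Message'] = rest.strip()
    return result


parsed = split_log('[2013-Mar-05 18:21:45.415053] (4139) <ModuleA> [DEBUG]    '
                   'Message Description : This is the message.')
print(parsed)
```

Keeping the key, delimiter and strip characters together in one table avoids the three parallel lists getting out of step.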

Answered 2025-04-17 by Python大师


This uses a regular expression; I hope it helps!

import re

string = '[2013-Mar-05 18:21:45.415053] (4444) <Module name> [Logging level]  Message Description : This is the message.'

regex = re.compile(r'\[(?P<timestamp>[^\]]*?)\] \((?P<threadid>[^\)]*?)\) \<(?P<modulename>[^\>]*?)\>[^:]*?\:(?P<message>.*?)$')

for match in regex.finditer(string):
    # Build the dictionary from the named groups ("result" avoids shadowing the built-in dict)
    result = {'timestamp': match.group('timestamp'), 'threadid': match.group('threadid'), 'modulename': match.group('modulename'), 'message': match.group('message')}
    print(result)

Output:

{'timestamp': '2013-Mar-05 18:21:45.415053', 'threadid': '4444', 'modulename': 'Module name', 'message': ' This is the message.'}

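Explanation: I use named groups in the regular expression so that the pieces are easy to refer to later in the script. For more information, see http://docs.python.org/2/library/re.html. In short, the pattern scans the line from left to right, looking for delimiters such as [, ( and <.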
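To make the named-group idea concrete, here is a small sketch that wraps a similar pattern in a function and returns `None` for lines that do not match. The names `LINE_RE` and `parse_line` are hypothetical, and the pattern is a slight variation (it allows any amount of whitespace between fields and drops the leading space from the message), not the exact regex above.

```python
import re

# Each (?P<name>...) group captures the text between one pair of delimiters
LINE_RE = re.compile(
    r'\[(?P<timestamp>[^\]]*)\]\s*'   # text inside [...]
    r'\((?P<threadid>[^)]*)\)\s*'     # text inside (...)
    r'<(?P<modulename>[^>]*)>'        # text inside <...>
    r'[^:]*:\s*(?P<message>.*)$'      # skip to the first ':' and keep the rest
)


def parse_line(line):
    """Return a dict of the named groups, or None if the line does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    names = ('timestamp', 'threadid', 'modulename', 'message')
    return {name: match.group(name) for name in names}


good = parse_line('[2013-Mar-05 18:21:45.415053] (4444) <Module name> '
                  '[Logging level]  Message Description : This is the message.')
bad = parse_line('this line has no delimiters')
print(good)
print(bad)
```

Checking for `None` matters in practice, because calling `.group()` on a failed match raises `AttributeError`.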

Answered 2025-04-17 by Python大师


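Here is an answer very similar to @Oli's, though I find this regex easier to read. I use groupdict(), so there is no need to build a dictionary by hand: the match object produces one from the named groups. The log string is parsed from left to right, and each match consumes the text it covers.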

import re

fmt = re.compile(
    r'\[(?P<timestamp>.+?)\]\s+'       # everything within [] goes to group timestamp
    r'\((?P<thread_id>.+?)\)\s+'       # everything within () goes to group thread_id
    r'\<(?P<module_name>.+?)\>\s+'     # everything within <> goes to group module_name
    r'\[(?P<log_level>.+?)\]\s+'       # everything within [] goes to group log_level
    r'(?P<message_desc>.+?)(\s:\s|$)'  # everything before " : " or end of line goes to group message_desc
    r'(?P<message>.+$)?'               # optional: if there was a " : ", everything after it goes to group message
)

Example:

log = '[2013-Mar-05 18:21:45.415053] (4139) <ModuleA> [DEBUG]  Message Description : An example message!'

match = fmt.search(log)

print(match.groupdict())
{'log_level': 'DEBUG',
 'message': 'An example message!',
 'message_desc': 'Message Description',
 'module_name': 'ModuleA',
 'thread_id': '4139',
 'timestamp': '2013-Mar-05 18:21:45.415053'}

Here is an example using the first test string you mentioned in the comments on this answer:

log = '[2013-Mar-05 18:21:45.415053] (0x7aa5e3a0) <Logger> [Info] Opened settings file : /usr/local/ABC/ABC/var/loggingSettings.ini'

match = fmt.search(log)

print(match.groupdict())
{'log_level': 'Info',
 'message': '/usr/local/ABC/ABC/var/loggingSettings.ini',
 'message_desc': 'Opened settings file',
 'module_name': 'Logger',
 'thread_id': '0x7aa5e3a0',
 'timestamp': '2013-Mar-05 18:21:45.415053'}

And here is an example using the second test string you mentioned in the comments on this answer:

log = '[2013-Mar-05 18:21:45.415053] (0x7aa5e3a0) <Logger> [Info] Creating a new settings file'

match = fmt.search(log)

print(match.groupdict())
{'log_level': 'Info',
 'message': None,
 'message_desc': 'Creating a new settings file',
 'module_name': 'Logger',
 'thread_id': '0x7aa5e3a0',
 'timestamp': '2013-Mar-05 18:21:45.415053'}
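Building on the examples above, the following sketch applies the same pattern to several lines, skips lines that do not match, and converts the timestamp into a `datetime`. The function name `parse_lines` and the sample list are made up for illustration; the `strptime` format string assumes English month abbreviations (the default C locale), since `%b` is locale-dependent. The separator group is written as non-capturing here, which does not change what `groupdict()` returns.

```python
import re
from datetime import datetime

fmt = re.compile(
    r'\[(?P<timestamp>.+?)\]\s+'
    r'\((?P<thread_id>.+?)\)\s+'
    r'<(?P<module_name>.+?)>\s+'
    r'\[(?P<log_level>.+?)\]\s+'
    r'(?P<message_desc>.+?)(?:\s:\s|$)'
    r'(?P<message>.+$)?'
)


def parse_lines(lines):
    """Parse an iterable of log lines into a list of dicts, skipping bad lines."""
    records = []
    for line in lines:
        match = fmt.search(line.strip())
        if match is None:
            continue  # not a log line in the expected format
        record = match.groupdict()
        # Turn the timestamp text into a datetime object
        record['timestamp'] = datetime.strptime(
            record['timestamp'], '%Y-%b-%d %H:%M:%S.%f')
        records.append(record)
    return records


sample = [
    '[2013-Mar-05 18:21:45.415053] (0x7aa5e3a0) <Logger> [Info] '
    'Opened settings file : /usr/local/ABC/ABC/var/loggingSettings.ini',
    'not a log line',
    '[2013-Mar-05 18:21:45.415053] (0x7aa5e3a0) <Logger> [Info] '
    'Creating a new settings file',
]
records = parse_lines(sample)
for record in records:
    print(record)
```

Because `parse_lines` accepts any iterable of strings, an open file object can be passed in place of the list.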

Edit: corrected to handle the asker's examples.

Answered 2025-04-17 by Python大师

