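Parsing a text log file to extract certain data fields from log messages

3 answers

User

Answer 1 · edited 2024-05-13 04:00:47

Use the string's built-in replace() method on each line you read. For a list of string methods, see http://docs.python.org/library/stdtypes.html#string-methods.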

new_string = my_string.replace(' ', ' | ')

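If you also need to remove "columns", you will have better luck splitting the string first, dropping the unwanted columns with slices, and then joining the remaining list with the separator.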

cols = my_string.split(' ')
cols = cols[:2] + cols[4:8] + cols[11:]  #Just making up some arbitrary removed columns
new_string = ' | '.join(cols)

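Note: this assumes your input is always separated by single spaces and that your data fields contain no spaces themselves. If your input data is more complex, the splitting code gets a bit more interesting.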
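As a minimal sketch of one slightly more forgiving case: calling split() with no argument splits on any run of whitespace, so repeated spaces do not produce empty columns. The sample line and the column indices here are made up for illustration and are not taken from the real log.

```python
# Hypothetical sample line with irregular spacing (not from the real log).
my_string = "a  b c   d e f"

# split(' ') yields an empty string for every extra space...
naive_cols = my_string.split(' ')

# ...while split() with no argument collapses runs of whitespace.
cols = my_string.split()

# Keep an arbitrary subset of columns, then join them with ' | '.
kept = cols[:2] + cols[4:]
new_string = ' | '.join(kept)
print(new_string)
```

This still cannot handle fields that contain spaces themselves; for that, a regular expression is the more likely tool.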

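User

Answer 2 · edited 2024-05-13 04:00:47

Your input file format is annoying. We could split the input on whitespace, but some of the fields you want to capture contain whitespace. We could split the input by column position, but I am not sure every field always has the same length; the numbers may vary in how many digits they have. So the best solution probably involves regular expressions.

A single regular expression that parses the whole line would be rather hairy to write and to understand. But we can build the pattern up from shorter patterns, and I think the result is easy to follow. Also, if the file format changes or the fields you want to capture change, I think you can adapt it easily.

Note that we use the Python string repetition operator * to repeat the shorter patterns. If there are two words in a row that we want to recognize and capture, we can write c*2 to repeat the capture pattern twice.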
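To make the repetition trick concrete, here is a small sketch that uses the same c and s building blocks as this answer's code. The four-word input string is a made-up example, not part of the log format.

```python
import re

c = r'(\S+)\s+'   # a word we want to capture
s = r'\S+\s+'     # a word we want to skip

# c * 2 is simply the pattern text written out twice.
assert c * 2 == r'(\S+)\s+(\S+)\s+'

# Capture two words, skip one, capture one more.
pattern = re.compile(r'^' + c * 2 + s * 1 + c * 1)

# Hypothetical input; note the trailing space required by the final \s+.
m = pattern.match("alpha beta gamma delta ")
groups = m.groups()
print(groups)
```

Because each building block ends with \s+, every word must be followed by at least one whitespace character, which is why the last word in the full pattern is handled by a separate tail pattern.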

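Your example of the desired output contains some extra whitespace. I wrote the patterns so that they capture no surrounding whitespace, but if you really need it, you can edit the patterns accordingly.

If you do not know regular expressions, you should read the documentation for the Python re module. In brief: the parts of a pattern enclosed in parentheses are captured, while the other parts are matched but not captured. \s matches whitespace and \S matches non-whitespace. In a pattern, + means "1 or more" and * means "0 or more". ^ and $ match the start and the end of the string, respectively.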
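The rules in the previous paragraph can be checked with a short sketch. The input strings are invented purely to exercise each rule.

```python
import re

# Parentheses capture; the middle \S+ matches a word but does not capture it.
m = re.match(r'^(\S+)\s+\S+\s+(\S+)\s*$', "first  skipped third")
captured = m.groups()
print(captured)

# + needs at least one whitespace character, * accepts zero of them.
plus_result = re.match(r'^a\s+b$', "ab")
star_result = re.match(r'^a\s*b$', "ab")
print(plus_result is None, star_result is not None)
```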

import re

# Define patterns we want to recognize.

c = r'(\S+)\s+'  # a word we want to capture
s = r'\S+\s+'  # a word we want to skip
mesg = r'(\S.*\S)\s+--Sev\s+'  # mesg to capture; terminated by string '--Sev'
w2 = r'(\S+\s+\S+)\s+'  # two words separated by some white space
w2semi = r'(\S+\s+\S+)\s*;\s+'  # two words terminated by a semicolon
tail = r'(.*\S)\s*;'  # rest of the line, up to the final semicolon

# Join together the above patterns to make one giant pattern that parses
# the input.
s_pat = ( r'^\s*' + 
    c*2 + s*3 + c*1 + s*10 + c*2 + s*14 + c*1 + s*14 +
    mesg + w2 + w2semi*2 + tail +
    r'\s*$')

# Pre-compile the pattern for speed.
pat = re.compile(s_pat)

# Test string and the expected output result.
s_input = "83b14af0-949b-71e0-18d5-0ad781020000 40ba8352-8dd2-71dc-12b8-0ad781020000 1 -1407714483 20 COLG-GRA-617-RD1.oss 1 181895426 12 oss-ap-1.oss 0 0 48 0 0 0 1307845644 1307845647 0 2 12 0 0 0  0 0 12 0 0 0  0 0 1307845918 3 OpC 6 opcecm 9 SNMPTraps 8 IBB_COLG 4 ATM0 0  0  0  69 Cisco Agent Interface Up (linkUp Trap) on interface ATM0 --Sev Normal 372 Generic: 3; Specific: 0; Enterprise: .1.3.6.1.4.1.9.1.569;"
s_correct = "83b14af0-949b-71e0-18d5-0ad781020000|40ba8352-8dd2-71dc-12b8-0ad781020000|COLG-GRA-617-RD1.oss|1307845644|1307845647|1307845918|Cisco Agent Interface Up (linkUp Trap) on interface ATM0|Normal 372|Generic: 3|Specific: 0|Enterprise: .1.3.6.1.4.1.9.1.569"

# pat.match() returns a match object, or None if the string does not match
m = pat.match(s_input)
# m.groups() returns a tuple of the captured strings; join with '|'
s_output = '|'.join(m.groups())

# sanity check
if s_correct == s_output:
    print("excellent")
else:
    print("bogus")

# prints: excellent

With the pattern written, tested, and debugged, writing a program that actually processes the file is straightforward.

# use the pattern defined above, named "pat"
# input_file and output_file are paths that you supply
with open(input_file, "r") as f_in, open(output_file, "w") as f_out:
    for line_num, line in enumerate(f_in, 1):
        m = pat.match(line)
        if m is None:
            # report lines the pattern cannot parse and keep going
            print("unable to parse line %d: %s" % (line_num, line))
            continue
        f_out.write('|'.join(m.groups()) + '\n')

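This reads the file one line at a time, processes each line, and writes the processed line to the output file.

Note that I put multiple context managers in a single with statement. This works in Python 2.7 and in 3.1 or later, but not in 2.5, 2.6, or 3.0.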

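User

Answer 3 · edited 2024-05-13 04:00:47

If you are on Linux, replacing characters is easy with the sed command. Since your file is very large, it should also be faster than reading it line by line in Python.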

sed -i 's/pattern/|/g' inputfile

The command above replaces every occurrence of the pattern string with |, editing the file in place.
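For comparison, a rough Python counterpart of that substitution could look like the sketch below. The pattern (one or more spaces) and the helper name replace_in_lines are assumptions made for illustration, and Python's regular expression syntax is not identical to sed's, so a real pattern may need adjusting.

```python
import re

# Hypothetical pattern: one or more spaces. Substitute the real one.
pattern = re.compile(r' +')

def replace_in_lines(lines):
    """Yield each line with every match of the pattern replaced by '|'."""
    for line in lines:
        yield pattern.sub('|', line)

# In-memory sample; an open file object could be passed in the same way.
sample = ["a b  c\n", "d e\n"]
result = list(replace_in_lines(sample))
print(result)
```

Because the helper is a generator, it processes one line at a time and does not need to hold a large file in memory.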
