Preprocessing an IIS log with Python
I have a 12 GB IIS log file. The beginning of the file looks like this:
#Software: Microsoft Internet Information Services 7.5
#Version: 1.0
#Date: 2014-01-05 00:00:00
#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) sc-status sc-substatus sc-win32-status time-taken
2014-01-05 00:00:00 192.168.1.208 GET /air/onlineMIS/AutoUpdateGV.aspx - 80 - 117.194.39.88 Mozilla/5.0+(Windows+NT+5.1;+rv:26.0)+Gecko/20100101+Firefox/26.0 200 0 0 75
2014-01-05 00:00:00 192.168.1.208 GET /air/onlineMIS/AutoUpdateGV.aspx - 80 - 59.180.241.153 Mozilla/5.0+(Windows+NT+6.1;+WOW64;+Trident/7.0;+rv:11.0)+like+Gecko 200 0
I want to remove
#Software: Microsoft Internet Information Services 7.5
#Version: 1.0
#Date: 2014-01-05 00:00:00
#Fields:
without creating a new file.
In the end I wrote the script below, thanks to alecxe for the help.
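import re

# The original post shows only a method body; the enclosing signature
# below is assumed so that the `return` and `self` references make sense.
def check_format(self, file):
    line = file.readline()
    if not line.startswith('#Software: Microsoft Internet Information Services '):
        # Not an IIS log: rewind and give up.
        file.seek(0)
        return
    # Skip the next 2 lines (#Version and #Date).
    for i in range(2):
        file.readline()
    # Build a regex from the 4th line (#Fields).
    full_regex = []
    line = file.readline()
    # 'date' and 'time' share one group: joined with a space they
    # become (?P<date>^\d+[-\d+]+ [\d+:]+).
    fields = {
        'date': r'(?P<date>^\d+[-\d+]+',
        'time': r'[\d+:]+)',
        'cs-uri-stem': r'(?P<path>/\S*)',
        'cs-uri-query': r'(?P<query_string>\S*)',
        'c-ip': r'(?P<ip>[\d*.]*)',
        'cs(User-Agent)': r'(?P<user_agent>\S+)',
        'cs(Referer)': r'(?P<referrer>\S+)',
        'sc-status': r'(?P<status>\d+)',
        'sc-bytes': r'(?P<length>\S+)',
        'cs-host': r'(?P<host>\S+)',
    }
    # Skip the '#Fields: ' prefix (9 characters).
    line = line[9:]
    for field in line.split():
        try:
            regex = fields[field]
        except KeyError:
            # Unknown fields match any run of non-whitespace.
            regex = r'\S+'
        full_regex.append(regex)
    self.regex = re.compile(' '.join(full_regex))
    # Peek at the first data line, then rewind to the start of the data.
    start_pos = file.tell()
    nextline = file.readline()
    file.seek(start_pos)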
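To see what the regex built from the #Fields line actually captures, here is a minimal standalone sketch. It reuses the field patterns from the script above and applies them to the first sample line of the log. The names `FIELD_PATTERNS` and `build_line_regex` are hypothetical and exist only for this illustration; they are not part of the original script.

```python
import re

# Patterns copied from the script above (subset relevant to this log).
# Any field not listed here falls back to \S+.
FIELD_PATTERNS = {
    'date': r'(?P<date>^\d+[-\d+]+',
    'time': r'[\d+:]+)',
    'cs-uri-stem': r'(?P<path>/\S*)',
    'cs-uri-query': r'(?P<query_string>\S*)',
    'c-ip': r'(?P<ip>[\d*.]*)',
    'cs(User-Agent)': r'(?P<user_agent>\S+)',
    'sc-status': r'(?P<status>\d+)',
}

def build_line_regex(fields_line):
    """Hypothetical helper: compile a regex from a '#Fields: ...' header."""
    names = fields_line[len('#Fields: '):].split()
    return re.compile(' '.join(FIELD_PATTERNS.get(name, r'\S+') for name in names))

fields_line = ('#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port '
               'cs-username c-ip cs(User-Agent) sc-status sc-substatus '
               'sc-win32-status time-taken')
sample = ('2014-01-05 00:00:00 192.168.1.208 GET /air/onlineMIS/AutoUpdateGV.aspx '
          '- 80 - 117.194.39.88 '
          'Mozilla/5.0+(Windows+NT+5.1;+rv:26.0)+Gecko/20100101+Firefox/26.0 '
          '200 0 0 75')

line_regex = build_line_regex(fields_line)
match = line_regex.match(sample)
hit = match.groupdict() if match else {}
print(hit)
```

Because the `date` and `time` patterns open and close a single group, the `date` key holds both the date and the time of the request.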
1 Answer
You can use the fileinput module to modify the file in place. Whatever is printed inside the loop is written back to the file:
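import fileinput

for line in fileinput.input("input.txt", inplace=True):
    if line.startswith('#Fields: '):
        # Keep the column names, minus the 9-character '#Fields: ' prefix.
        print(line[9:].strip())
    elif not line.startswith('#'):
        # Ordinary log entry: keep it. Other '#' directives are dropped.
        print(line.strip())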
As you can see, there is no need for a complicated regular expression here. It is enough to check whether the line starts with #Fields: or with #.
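For anyone who wants to try this before running it on the real log, below is a minimal Python 3 sketch of the same approach applied to a throwaway temporary file. The function name `strip_iis_header` and the shortened sample data are made up for the demo; only the `fileinput` usage comes from the answer.

```python
import fileinput
import os
import tempfile

def strip_iis_header(path):
    """Rewrite `path` in place: drop '#' directives, keep the field names."""
    with fileinput.input(path, inplace=True) as stream:
        for line in stream:
            if line.startswith('#Fields: '):
                # Keep the column names, minus the '#Fields: ' prefix.
                print(line[len('#Fields: '):].strip())
            elif not line.startswith('#'):
                # Ordinary log entry: write it back unchanged.
                print(line.strip())

# Hypothetical sample data: the real header with a shortened field list.
sample = (
    '#Software: Microsoft Internet Information Services 7.5\n'
    '#Version: 1.0\n'
    '#Date: 2014-01-05 00:00:00\n'
    '#Fields: date time s-ip cs-method\n'
    '2014-01-05 00:00:00 192.168.1.208 GET\n'
)
handle, demo_path = tempfile.mkstemp(suffix='.log')
with os.fdopen(handle, 'w') as demo_file:
    demo_file.write(sample)

strip_iis_header(demo_path)

with open(demo_path) as demo_file:
    result = demo_file.read()
os.remove(demo_path)
print(result)
```

One caveat for a 12 GB file: with `inplace=True`, `fileinput` moves the original to a backup file and writes a fresh file under the original name, then deletes the backup when it finishes. No extra file is left behind, but the disk temporarily needs room for roughly a second copy of the log.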