Need to preprocess IIS logs with Python

1 vote

1 answer

1285 views

Asked 2025-04-18 01:51

I have a 12 GB IIS log file. It starts like this:

#Software: Microsoft Internet Information Services 7.5
#Version: 1.0
#Date: 2014-01-05 00:00:00
#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) sc-status sc-substatus sc-win32-status time-taken
2014-01-05 00:00:00 192.168.1.208 GET /air/onlineMIS/AutoUpdateGV.aspx - 80 - 117.194.39.88 Mozilla/5.0+(Windows+NT+5.1;+rv:26.0)+Gecko/20100101+Firefox/26.0 200 0 0 75
2014-01-05 00:00:00 192.168.1.208 GET /air/onlineMIS/AutoUpdateGV.aspx - 80 - 59.180.241.153 Mozilla/5.0+(Windows+NT+6.1;+WOW64;+Trident/7.0;+rv:11.0)+like+Gecko 200 0

I want to remove

#Software: Microsoft Internet Information Services 7.5
#Version: 1.0
#Date: 2014-01-05 00:00:00
#Fields:

without creating a new file.

In the end I wrote the script below, thanks to alecxe's help.

        line = file.readline()
        if not line.startswith('#Software: Microsoft Internet Information Services '):
            file.seek(0)
            return
        # Skip the next 2 lines.
        for _ in range(2):
            file.readline()
        # Parse the 4th line (regex)
        full_regex = []
        line = file.readline()
        fields = {
            # 'date' opens the named group and 'time' closes it, so the two
            # adjacent fields are captured together as one timestamp.
            'date': r'(?P<date>^\d+[-\d+]+',
            'time': r'[\d+:]+)',
            'cs-uri-stem': r'(?P<path>/\S*)',
            'cs-uri-query': r'(?P<query_string>\S*)',
            'c-ip': r'(?P<ip>[\d*.]*)',
            'cs(User-Agent)': r'(?P<user_agent>\S+)',
            'cs(Referer)': r'(?P<referrer>\S+)',
            'sc-status': r'(?P<status>\d+)',
            'sc-bytes': r'(?P<length>\S+)',
            'cs-host': r'(?P<host>\S+)',
        }
        # Skip the '#Fields: ' prefix (9 characters).
        line = line[9:]
        for field in line.split():
            try:
                regex = fields[field]
            except KeyError:
                regex = r'\S+'
            full_regex.append(regex)
        self.regex = re.compile(' '.join(full_regex))

        start_pos = file.tell()
        nextline = file.readline()
        file.seek(start_pos)
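
As a rough standalone illustration of what the fragment above does, the sketch below builds a regex from the #Fields: header and applies it to the first sample line from the log. The helper name build_regex and the trimmed FIELD_PATTERNS table are hypothetical and exist only for this example; the patterns themselves are copied from the fragment.

```python
import re

# Patterns copied from the fragment above. 'date' opens a named group that
# 'time' closes, so both fields must sit next to each other in the header.
FIELD_PATTERNS = {
    'date': r'(?P<date>^\d+[-\d+]+',
    'time': r'[\d+:]+)',
    'cs-uri-stem': r'(?P<path>/\S*)',
    'cs-uri-query': r'(?P<query_string>\S*)',
    'c-ip': r'(?P<ip>[\d*.]*)',
    'cs(User-Agent)': r'(?P<user_agent>\S+)',
    'sc-status': r'(?P<status>\d+)',
}

PREFIX = '#Fields: '


def build_regex(fields_line):
    """Hypothetical helper: turn a '#Fields: ...' header into a compiled regex."""
    names = fields_line[len(PREFIX):].split()
    # Fields without a known pattern are matched but not captured.
    parts = [FIELD_PATTERNS.get(name, r'\S+') for name in names]
    return re.compile(' '.join(parts))


header = ('#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port '
          'cs-username c-ip cs(User-Agent) sc-status sc-substatus '
          'sc-win32-status time-taken')
sample = ('2014-01-05 00:00:00 192.168.1.208 GET /air/onlineMIS/AutoUpdateGV.aspx '
          '- 80 - 117.194.39.88 '
          'Mozilla/5.0+(Windows+NT+5.1;+rv:26.0)+Gecko/20100101+Firefox/26.0 '
          '200 0 0 75')

match = build_regex(header).match(sample)
print(match.group('date'), match.group('path'), match.group('ip'), match.group('status'))
```

Because the header drives the regex, a log with a different #Fields: line would only need a different header string here, as long as date and time still appear side by side.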

file-operations  data-cleaning  iis-logs  log-preprocessing

1 Answer

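You can use the fileinput module to modify the file in place; whatever you print inside the loop is written back to the file. (Behind the scenes, fileinput moves the original to a temporary backup and writes a new file under the same name, so you need enough free disk space for a second copy while it runs.)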

import fileinput

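for line in fileinput.input("input.txt", inplace=True):
    # The header reads '#Fields: ' (with a colon), which is 9 characters long.
    if line.startswith('#Fields: '):
        # Keep the column names, drop the '#Fields: ' prefix.
        print(line[9:].strip())
    elif not line.startswith('#'):
        # Regular log entry: write it back unchanged.
        print(line.strip())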

As you can see, there is no need for a complicated regular expression here; it is enough to check whether the line starts with #Fields or with #.
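
Here is a minimal Python 3 sketch of the same idea, with the filtering pulled out into a generator so it can be tried on a handful of lines before it touches the real 12 GB file. The names clean_lines and strip_headers_in_place are hypothetical, and the check uses the full `#Fields: ` prefix, colon included, as it appears in the sample log.

```python
import fileinput

FIELDS_PREFIX = '#Fields: '


def clean_lines(lines):
    """Yield log lines with '#' directives dropped and the '#Fields: ' prefix removed."""
    for line in lines:
        if line.startswith(FIELDS_PREFIX):
            # Keep the column names as a plain header row.
            yield line[len(FIELDS_PREFIX):].strip()
        elif not line.startswith('#'):
            yield line.strip()


def strip_headers_in_place(path):
    """Hypothetical wrapper: rewrite `path` through fileinput's in-place mode."""
    with fileinput.input(path, inplace=True) as stream:
        for cleaned in clean_lines(stream):
            # While the file is being read, stdout is redirected into it.
            print(cleaned)


# Dry run on a few in-memory lines; no file is modified here.
demo = [
    '#Software: Microsoft Internet Information Services 7.5\n',
    '#Version: 1.0\n',
    '#Date: 2014-01-05 00:00:00\n',
    '#Fields: date time s-ip\n',
    '2014-01-05 00:00:00 192.168.1.208\n',
]
result = list(clean_lines(demo))
```

Once the dry run looks right, calling strip_headers_in_place("input.txt") would apply the same filter to the file itself.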

Answered 2025-04-18 by Python大师
