Parsing IP addresses from log files in Python 3

Published 2024-04-28 03:05:04


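I'm new to Python, and I'm trying (with Python 3) to go through a large number of big custom log files, extract parameters from certain GET requests, and gather some statistics from them. I've gotten fairly far, but I'm stuck on two problems, and neither my colleagues nor I can figure out why they are giving us such a headache. I'll post the two problems separately so as not to confuse you.

My log files look like this: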

80 172.23.131.149 "2018-07-05 13:08:25 860" "POST /bios/servlet/bios.servlets.sso.WaffleLoginServlet HTTP/1.1" 401 5 891 891 "-" "Java/1.8.0_171"
8080 172.23.131.251 "2018-07-05 13:08:26 594" "HEAD /bios/servlet/bios.servlets.web.Ping?level=3 HTTP/1.0" 200 - 1953 1953 "-" "-"
8080 172.23.131.252 "2018-07-05 13:08:26 594" "HEAD /bios/servlet/bios.servlets.web.Ping?level=3 HTTP/1.0" 200 - 953 953 "-" "-"
80 172.23.131.149 "2018-07-05 13:08:28 188" "GET /bios/wms/app/baggis/web/WMS_STHLM_STOCKHOLMSKARTA_HYBRID_INTERN?TILED=TRUE&SERVICE=WMS&VERSION=1.1.1&REQUEST=GetMap&FORMAT=image%2Fpng&TRANSPARENT=false&LAYERS=p_1002095&SRS=EPSG%3A3011&STYLES=&r=n2q&WIDTH=256&HEIGHT=256&BBOX=156240.234375%2C6576777.34375%2C156269.53125%2C6576806.640625 HTTP/1.1" 200 133210 3547 3516 "http://tkkarta3.stockholm.se/astolmap/v3/kopplet/tkkarta.htm" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36"
80 172.23.131.149 "2018-07-05 13:08:28 188" "GET /bios/wms/app/baggis/web/WMS_STHLM_STOCKHOLMSKARTA_HYBRID_INTERN?TILED=TRUE&SERVICE=WMS&VERSION=1.1.1&REQUEST=GetMap&FORMAT=image%2Fpng&TRANSPARENT=false&LAYERS=p_1002095&SRS=EPSG%3A3011&STYLES=&r=n2q&WIDTH=256&HEIGHT=256&BBOX=156240.234375%2C6576748.046875%2C156269.53125%2C6576777.34375 HTTP/1.1" 200 108066 3547 3532 "http://tkkarta3.stockholm.se/astolmap/v3/kopplet/tkkarta.htm" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36"
80 172.23.131.149 "2018-07-05 13:08:28 188" "POST /bios/servlet/bios.servlets.GetGeometryComponents HTTP/1.1" 401 4 2484 2484 "-" "Java/1.8.0_171"
80 172.23.131.149 "2018-07-05 13:08:28 204" "GET /bios/wms/app/baggis/web/WMS_STHLM_STOCKHOLMSKARTA_HYBRID_INTERN?TILED=TRUE&SERVICE=WMS&VERSION=1.1.1&REQUEST=GetMap&FORMAT=image%2Fpng&TRANSPARENT=false&LAYERS=p_1002095&SRS=EPSG%3A3011&STYLES=&r=n2q&WIDTH=256&HEIGHT=256&BBOX=156210.9375%2C6576806.640625%2C156240.234375%2C6576835.9375 HTTP/1.1" 200 123953 3563 3547 "http://tkkarta3.stockholm.se/astolmap/v3/kopplet/tkkarta.htm" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36"
80 172.23.131.149 "2018-07-05 13:08:28 204" "GET /bios/wms/app/baggis/web/WMS_STHLM_STOCKHOLMSKARTA_HYBRID_INTERN?TILED=TRUE&SERVICE=WMS&VERSION=1.1.1&REQUEST=GetMap&FORMAT=image%2Fpng&TRANSPARENT=false&LAYERS=p_1002095&SRS=EPSG%3A3011&STYLES=&r=n2q&WIDTH=256&HEIGHT=256&BBOX=156210.9375%2C6576777.34375%2C156240.234375%2C6576806.640625 HTTP/1.1" 200 147132 3563 3547 "http://tkkarta3.stockholm.se/astolmap/v3/kopplet/tkkarta.htm" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36"
80 172.23.131.149 "2018-07-05 13:08:28 204" "GET /bios/wms/app/baggis/web/WMS_STHLM_STOCKHOLMSKARTA_HYBRID_INTERN?TILED=TRUE&SERVICE=WMS&VERSION=1.1.1&REQUEST=GetMap&FORMAT=image%2Fpng&TRANSPARENT=false&LAYERS=p_1002095&SRS=EPSG%3A3011&STYLES=&r=n2q&WIDTH=256&HEIGHT=256&BBOX=156269.53125%2C6576777.34375%2C156298.828125%2C6576806.640625 HTTP/1.1" 200 145701 3563 3547 "http://tkkarta3.stockholm.se/astolmap/v3/kopplet/tkkarta.htm" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36"
80 172.23.137.120 "2018-07-06 10:04:32 856" "GET /bios/wms/app/baggis/web/WMS_STHLM_STOCKHOLMSKARTA_GRA?FORMAT=image%2Fpng&TILED=TRUE&SERVICE=WMS&VERSION=1.1.1&REQUEST=GetMap&STYLES=&SRS=EPSG%3A5850&BBOX=150000,6580000,151875,6581875&WIDTH=256&HEIGHT=256 HTTP/1.1" 200 58443 0 0 "https://iservice.stockholm.se/open/TyckTill/Pages/TyckTill.aspx?systemId=synpunktsportalen" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"
80 172.23.137.120 "2018-07-06 10:04:25 400" "GET /bios/dpwebmap/cust_sth/slk/tycktill/app.htmlclient.gwt.DPWebApp.nocache.js HTTP/1.1" 200 3924 0 0 "https://iservice.stockholm.se/open/TyckTill/Pages/TyckTill.aspx?systemId=synpunktsportalen" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"

What I want to do is extract the IP address from every line that contains the string REQUEST=GetMap. The regular expression I am using is:

(?P<ip>(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}))

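In my code I use the key ip to count the occurrences of every IP address in the log files.

I have been staring at the regex and tweaking it back and forth, but it still does not work. But it works in Regex101, which is very confusing.

The complete code for the task is: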

#!/usr/bin/env python3

import os
import re
from collections import Counter

# regular expression

#rexp = [r'(?P<timestamp>\d{1,2}-\w{3}-\d{4} \d{2}:\d{2}:\d{2}\.\d{3}) client (?P<client>(?:\d{1,3}\.){3}\d{1,3}).+query: (?P<domain>.+) IN (?P<qtype>[A-Z]+) \+.+\({2}(?P<server>(?:\d{1,3}\.){3}\d{1,3})\){2}'
#rexp = r"(^.+layers=(?P<domain>.*?)&)" # search for LAYERS= or layers=

rexp_layer = r"(^.+layers=(?P<domain>.*?)[&\s])"                # search for the name of the requested layer (between the string 'LAYERS=' or 'layers=' and a ampersand '&' or blankspace ' ') in each line and give it the key 'domain'
rexp_port = r"(?P<port>\d{2,4} )"                               # search for the 2 or 4 digit value in the beginning of each line
rexp_ip = r"(?P<ip>(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}))"

rexp_date = r"(?P<datum>\d{4}\-\d{2}\-\d{2})"                   # search for the date in format XXXX-XX-XX and give it the key 'datum'
rexp_time = r"(?P<tid>\d{2}\:\d{2}\:\d{2})"                     # search for the timestamp XX:XX:XX and give it the key 'tid'

rexp_name = r"(^.+/bios/wms/app/(?P<name>.+?)\?)"     # search for the name of the called WMS-service (are between the string '/bios/wms/app/' and a '?') and give it the key 'name'to the FIRST occurrence of "?", "+?" makes the "+" non-greedy

rexp_coordsys = r"(^.+&\wRS=(?P<koordsys>.*?)[&\s])"             # search for the coordinate system between the string '&SRS=' or '&CRS=' and a ampersand '&' and give it the key 'koordsys'

rexp_width = r"(^.+WIDTH=(?P<width>.*?)&)"                      # search for the width of the requested picture (are between the string 'WIDTH=' and a ampersand '&') and give it the key 'width'
rexp_height = r"(^.+HEIGHT=(?P<height>.*?)[&\s])"               # search for the height of the requested picture (are between the string 'HEIGHT=' and a ampersand '&') and give it the key 'height'

# rexp_bbox = r"(((?P<bbox_xmin>-?\d+\.?\d*)%2C)((?P<bbox_ymin>-?\d+\.?\d*)%2C)((?P<bbox_xmax>-?\d+\.?\d*)%2C)((?P<bbox_ymax>-?\d+\.?\d*)[\s&]))"  # DOES NOT WORK YET, CAN CONTINUE FROM HERE

# create counter dictionaries
cnt_domains = Counter()                 # for counting the occurrences of a certain layer
cnt_port = Counter()                    # for counting the occurrences of a certain port
cnt_ip = Counter()                      # for counting the occurrences of an IP address
#cnt_date = Counter()                    # for counting the occurrences of a certain date  -- I probably will not use that

cnt_name = Counter()                    # for counting the occurrences of a certain service
cnt_coordsys = Counter()                # for counting the occurrences of a certain coordinate system
cnt_width = Counter()                   # for counting the occurrences of a certain requested width
cnt_height = Counter()                  # for counting the occurrences of a certain requested height
cnt_bbox = Counter()

# Compile regular expression for faster computing
rexp_layer_compile = re.compile(rexp_layer, re.IGNORECASE)      # get the regex to look for occurrences of LAYERS or layers - seems to work
rexp_port_compile = re.compile(rexp_port)
rexp_ip_compile = re.compile(rexp_ip)
rexp_name_compile = re.compile(rexp_name, re.IGNORECASE)        # No diffenence with re.IGNORECASE
rexp_coordsys_compile = re.compile(rexp_coordsys)               # mixes in regex for layers
rexp_width_compile = re.compile(rexp_width, re.IGNORECASE)
rexp_height_compile = re.compile(rexp_height, re.IGNORECASE)
# rexp_bbox_compile = re.compile(rexp_bbox)

# Path to folder with log files
#path = '/home/uwestephan/Logg-file-parsing/ws00848'
# path = '/home/uwestephan/Logg-file-parsing/ws00524'
# path = '/home/uwestephan/Logg-file-parsing/ws00524_test'

path = '/home/uwestephan/Logg-file-parsing/ws00848_test'

# setting the line counters to zero
matchedGETMAP = 0
failedGETMAP = 0
failed = 0
failedLAYER = 0

# open file
for filename in os.listdir(path):
    filmedsokvag = (path+"/"+filename)
    print (filmedsokvag)

    # read file / gather data
    f = open(filmedsokvag, 'r')

    # exclude all lines that do not have the string 'GetMap' in it
    for line in f:
        if re.findall('GetMap',line):                   # check if there is a string 'GetMap' in the line in the log file

            m = re.match(rexp_layer_compile, line)      # match the name of the requested layer
            p = re.match(rexp_port_compile, line)       # match the port
            i = re.match(rexp_ip_compile, line)                 # match the IP-adress
            n = re.match(rexp_name_compile, line)       # match the name of the WMS-service thats requested
            c = re.match(rexp_coordsys_compile, line)   # match the coordinate system
            w = re.match(rexp_width_compile, line)      # match the width of the requested picture that the WMS-service is sending
            h = re.match(rexp_height_compile, line)     # match the height of the requested picture that the WMS-service is sending
#            b = re.match(rexp_bbox_compile, line)

            if m:
                cnt_domains.update([m.group('domain')])     # here I try to count the occurrences of a the layer names
                # matchedGETMAP += 1                          # add 1 to the line counter that count processed lines in the file (as i do not process all lines in this if sentence)
            else:
#                failedGETMAP += 1
                failedLAYER += 1                        # Counts the number of lines with a getmap request who do NOT have the parameter LAYER called


            if p:
                cnt_port.update([p.group('port')])          # here I try to count the occurrences of a the differnt ports
#            else:
#               continue

            if i:
                cnt_ip.update([i.group('ip')])              # here I try to count the occurrences of the IP-adresses - THAT ONE DOES NOT WORK

            #For debugging only - the regular expression for the IP adress seems not to work
            else:
                print("Cannot find IP address")

            if n:
                cnt_name.update([n.group('name')])          # here I try to count the occurrences of a the names of the WMS-services
                matchedGETMAP += 1                          # add 1 to the line counter that count processed lines in the file (as i do not process all lines in this if sentence)
            else:
                failedGETMAP += 1

            if c:
                cnt_coordsys.update([c.group('koordsys')])  # here I try to count the occurrences of a coordinate systems
#            else:
#                continue

            if w:
                cnt_width.update([w.group('width')])        # here I try to count the occurrences of the widths of the requested pictures that the WMS-service is sending
#            else:
#                continue

            if h:
                cnt_height.update([h.group('height')])        # here I try to count the occurrences of the heights of the requested pictures that the WMS-service is sending
#            else:
#                continue

#            if b:
#                cnt_bbox.update([b.group('bbox_xmin')])        # here I try to count the occurrences of the heights of the requested pictures that the WMS-service is sending
#            else:
#                continue



        else:
            failed += 1         # add 1 to the counter that counts the lines that NOT processed by the if sentence above
            continue


# Remove quotation marks from the keys of cnt_domains - not really necessary -> IT DOES NOT CREATE A COUNTER BUT A NORMAL DICTIONARY
# cnt_domains = {key.replace('"',''): val for key,val in cnt_domains.items()}


# Create an empty dictionary for my replace values
f100 = open('Oversattningstabell_for_lagernamn_csv.csv', 'r')
DictionaryReplaceValues = {}
for line in f100:
    x = line.split(",")
    a = x[0]
    b = x[1]
    c = len(b)-1        # Removes the \n from the end of each line: take the length of b minus one ...
    b = b[0:c]          # ... and then reassign the shortened string back to b
    DictionaryReplaceValues[a]=b

print("\n\nThis is my replacement dictionary")
for key in DictionaryReplaceValues.keys():
    print (key, " = ", DictionaryReplaceValues[key])


# Create an empty dictionary for the translated dictionary - not really necessary
cnt_domains_newname = {}

# Replace the old dictionary with an new one using the translating dictionary DictionaryReplaceValues
cnt_domains_newname = dict((DictionaryReplaceValues.get(key, key), value) for (key, value) in cnt_domains.items())

# Make a counter out of the dictionary created above
new_counter_cnt_domains_newname = Counter(cnt_domains_newname)



# Output Results
print('[*] %d Number of GetMap request that matched the regular expression' % (matchedGETMAP))
print('[*] %d Number of GetMap request that failed to match the regular expression' % (failedGETMAP), end='\n\n')
print('[*] %d Number of other request in the log files ' % (failed), end='\n\n')
print('[*] %d Number of GetMap requests that request the Top layer of the WMS' % (failedLAYER), end='\n\n')
print('[*] ============================================')
print('[*] 100 Most Frequently Occurring Layer Queried')
print('[*] ============================================')
#for domain, count in cnt_domains_newname.most_common(100):
for domain, count in new_counter_cnt_domains_newname.most_common(100):
    print('[*] %60s: %d' % (domain, count))
print('[*] ============================================')
print('[*] 100 Most Frequently Occurring Port Queried')
print('[*] ============================================')
for port, count in cnt_port.most_common(100):
    print('[*] %60s: %d' % (port, count))
print('[*] ============================================')
print('[*] 100 Most Frequently Occurring IP-adresses Queried')
print('[*] ============================================')
for ip, count in cnt_ip.most_common(100):
    print('[*] %60s: %d' % (ip, count))
#    print(ip, count)
print('[*] ============================================')
print('[*] ============================================')
print('[*] 100 Most Frequently Occurring WMS-name Queried')
print('[*] ============================================')
for name, count in cnt_name.most_common(100):
    print('[*] %60s: %d' % (name, count))
print('[*] ============================================')
print('[*] ============================================')
print('[*] 100 Most Frequently Occurring Coordinate Systemes Queried')
print('[*] ============================================')
for koordsys, count in cnt_coordsys.most_common(100):
    print('[*] %60s: %d' % (koordsys, count))
print('[*] ============================================')
print('[*] ============================================')
print('[*] 100 Most Frequently Occurring Picture Widths Queried')
print('[*] ============================================')
for width, count in cnt_width.most_common(100):
    print('[*] %60s: %d' % (width, count))
print('[*] ============================================')
print('[*] ============================================')
print('[*] 100 Most Frequently Occurring Picture Heights Queried')
print('[*] ============================================')
for height, count in cnt_height.most_common(100):
    print('[*] %60s: %d' % (height, count))
print('[*] ============================================')
#print('[*] ============================================')
#print('[*] 100 Most Frequently Occurring BBOX_xmin Queried')
#print('[*] ============================================')
#for bbox_xmin, count in cnt_bbox.most_common(100):
#    print('[*] %30s: %d' % (bbox_xmin, count))
#print('[*] ============================================')

# Output results to file
with open('parseroutput.txt', 'w') as fd:
    print('[*] %d Number of GetMap request that matched the regular expression' % (matchedGETMAP), file=fd)
    print('[*] %d Number of GetMap request that failed to match the regular expression' % (failedGETMAP), end='\n\n', file=fd)
    print('[*] %d Number of other request in the log files ' % (failed), end='\n\n', file=fd)
    print('[*] %d Number of GetMap requests that request the Top layer of the WMS' % (failedLAYER), end='\n\n', file=fd)
    print('[*] ============================================', file=fd)
    print('[*] 100 Most Frequently Occurring Layer Queried', file=fd)
    print('[*] ============================================', file=fd)
    for domain, count in new_counter_cnt_domains_newname.most_common(100):
        print('%s: %d' % (domain, count), file=fd)
    print('[*] ============================================', file=fd)
    print('[*] 100 Most Frequently Occurring Port Queried', file=fd)
    print('[*] ============================================', file=fd)
    for port, count in cnt_port.most_common(100):
        print('%s: %d' % (port, count), file=fd)
    print('[*] ============================================', file=fd)
    print('[*] 100 Most Frequently Occurring IP-adresses Queried', file=fd)
    print('[*] ============================================', file=fd)
    for ip, count in cnt_ip.most_common(100):
        print('%s: %d' % (ip, count), file=fd)
        print(ip, count)
    print('[*] ============================================', file=fd)
    print('[*] ============================================', file=fd)
    print('[*] 100 Most Frequently Occurring WMS-name Queried', file=fd)
    print('[*] ============================================', file=fd)
    for name, count in cnt_name.most_common(100):
        print('%s: %d' % (name, count), file=fd)
    print('[*] ============================================', file=fd)
    print('[*] ============================================', file=fd)
    print('[*] 100 Most Frequently Occurring Coordinate Systemes Queried', file=fd)
    print('[*] ============================================', file=fd)
    for koordsys, count in cnt_coordsys.most_common(100):
        print('%s: %d' % (koordsys, count), file=fd)
    print('[*] ============================================', file=fd)
    print('[*] ============================================', file=fd)
    print('[*] 100 Most Frequently Occurring Picture Widths Queried', file=fd)
    print('[*] ============================================', file=fd)
    for width, count in cnt_width.most_common(100):
        print('%s: %d' % (width, count), file=fd)
    print('[*] ============================================', file=fd)
    print('[*] ============================================', file=fd)
    print('[*] 100 Most Frequently Occurring Picture Heights Queried', file=fd)
    print('[*] ============================================', file=fd)
    for height, count in cnt_height.most_common(100):
        print('%s: %d' % (height, count), file=fd)
    print('[*] ============================================', file=fd)

Do you have any ideas on how to write a regular expression that extracts the IP address?


2 Answers

The following expression captures the IP address:

rexp_ip = r".*\s(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}).*"
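
The leading `.*\s` matters because `re.match` only matches at the start of the string, and every log line starts with the port rather than the IP. A pattern that begins directly with the IP group therefore never matches under `re.match`, even though it is found when the whole line is searched (which is most likely why it appeared to work in Regex101). A minimal sketch of the alternative, using `re.search` with the unanchored pattern; the shortened `sample_lines` and the name `ip_pattern` are assumptions made up for illustration:

```python
import re
from collections import Counter

# Shortened sample lines in the same layout as the log (made up for illustration)
sample_lines = [
    '80 172.23.131.149 "2018-07-05 13:08:28 188" "GET /bios/wms/app/x?REQUEST=GetMap&WIDTH=256 HTTP/1.1" 200 1 2 3 "-" "-"',
    '80 172.23.137.120 "2018-07-06 10:04:32 856" "GET /bios/wms/app/y?REQUEST=GetMap&WIDTH=256 HTTP/1.1" 200 1 0 0 "-" "-"',
    '80 172.23.131.149 "2018-07-05 13:08:28 204" "GET /bios/wms/app/x?REQUEST=GetMap&WIDTH=256 HTTP/1.1" 200 1 2 3 "-" "-"',
    '8080 172.23.131.251 "2018-07-05 13:08:26 594" "HEAD /bios/servlet/bios.servlets.web.Ping?level=3 HTTP/1.0" 200 - 1953 1953 "-" "-"',
]

ip_pattern = re.compile(r"(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})")

# match() anchors at position 0, where the port is, so it finds nothing
assert ip_pattern.match(sample_lines[0]) is None

cnt_ip = Counter()
for line in sample_lines:
    if "REQUEST=GetMap" not in line:
        continue                      # skip non-GetMap requests
    found = ip_pattern.search(line)   # search() scans the whole line
    if found:
        cnt_ip[found.group("ip")] += 1

print(cnt_ip.most_common())
```

Either fix should work: keep `re.match` and add the leading `.*\s` as above, or keep the original pattern and call `search` instead.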

You can use re.findall to pull out the main fields you need (IP, request time, port, request type, etc.), then use urllib.parse to find the remaining values:

import re
from urllib.parse import parse_qs, urlsplit

def parse_line(_d: str, flag='datum'):
  _headers = {'datum': ['datum', 'tid'], 'server': ['WMS_service', 'coord', 'width', 'height']}
  if flag == 'datum':
    # date (YYYY-MM-DD) and time (HH:MM:SS) from the quoted timestamp field
    return dict(zip(_headers[flag], re.findall(r'\d+-\d+-\d+|\d+:\d+:\d+', _d)))
  # _d looks like '"GET /path?query HTTP/1.1"'; parse only the query string of the request target
  _parts = _d.split()
  new_d = parse_qs(urlsplit(_parts[1]).query) if len(_parts) > 1 else {}
  return dict(zip(_headers[flag], [*re.findall(r'/bios/wms/app/(.*?)\?', _d), *new_d.get('SRS', new_d.get('CRS', [])), *new_d.get('WIDTH', []), *new_d.get('HEIGHT', [])]))

with open('filename.txt') as f:
  file_data = [i.strip('\n') for i in f]
# first list: bare numeric fields outside the quotes (port, ip, ...); second list: the quoted fields
new_data = [[re.findall(r'\d+\.\d+\.\d+\.\d+|\d+', re.sub(r'".*?"', '', i)), re.findall(r'".*?"', i)] for i in file_data]
final_results = []
for a, b in new_data:
  _temp = dict(zip(['port', 'ip'], a))
  # b[0] is the timestamp field, b[1] the request line
  _temp1 = {**_temp, **parse_line(b[0])} if len(b) == 1 else {**_temp, **parse_line(b[0]), **parse_line(b[1], 'server')}
  final_results.append(_temp1)

for i in final_results:
  print(i)

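
As an alternative sketch of the same idea (not the answer's own code), the standard-library `shlex.split` can split a line on whitespace while keeping the quoted fields together, and `urlsplit` plus `parse_qs` can then read the query parameters. The names `parse_log_line` and `first` and the returned keys are hypothetical, and the sketch assumes every line has port, IP, timestamp, and request as its first four fields, with no escaped quotes inside them:

```python
import shlex
from urllib.parse import parse_qs, urlsplit

def first(params, *keys):
    """Return the first value of the first key present, else None."""
    for key in keys:
        if key in params:
            return params[key][0]
    return None

def parse_log_line(line):
    # shlex.split keeps each double-quoted field as one item
    port, ip, timestamp, request = shlex.split(line)[:4]
    datum, tid = timestamp.split()[:2]           # drop the milliseconds
    method, target, protocol = request.split(" ", 2)
    url = urlsplit(target)
    # upper-case the keys so LAYERS= and layers= are treated alike
    params = {k.upper(): v for k, v in parse_qs(url.query).items()}
    return {
        "port": port,
        "ip": ip,
        "datum": datum,
        "tid": tid,
        "path": url.path,
        "coord": first(params, "SRS", "CRS"),
        "width": first(params, "WIDTH"),
        "height": first(params, "HEIGHT"),
    }

# One line from the log in the question, with the referer and user agent shortened
line = ('80 172.23.137.120 "2018-07-06 10:04:32 856" '
        '"GET /bios/wms/app/baggis/web/WMS_STHLM_STOCKHOLMSKARTA_GRA?FORMAT=image%2Fpng'
        '&TILED=TRUE&SERVICE=WMS&VERSION=1.1.1&REQUEST=GetMap&STYLES=&SRS=EPSG%3A5850'
        '&BBOX=150000,6580000,151875,6581875&WIDTH=256&HEIGHT=256 HTTP/1.1" '
        '200 58443 0 0 "-" "-"')

result = parse_log_line(line)
print(result)
```

Because `parse_qs` decodes percent-escapes, the coordinate system comes back as `EPSG:5850` rather than `EPSG%3A5850`, which keeps differently encoded requests from being counted as separate values.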
