How to combine the data from .log files inside a zip file

import fnmatch
from zipfile import ZipFile

import pandas as pd

with ZipFile("path/HTWebLog_p1.zip") as zipfiles:
    file_list = zipfiles.namelist()
    # get only the .log files
    csv_files = fnmatch.filter(file_list, "*.log")
    # iterate with a list comprehension to get the individual dataframes
    data = [pd.read_csv(zipfiles.open(file_name), delimiter=',', header=0) for file_name in csv_files]

# combine into one dataframe
df = pd.concat(data)
df.head()

#Software: Microsoft Internet Information Services 6.0
#Version: 1.0
#Date: 2006-11-01 00:00:08
#Fields: date time s-sitename s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) cs(Referer) sc-status sc-substatus sc-win32-status
2006-11-01 00:00:08 W3SVC1 127.0.0.1 GET /Default.aspx - 80 - 70.80.84.76 Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1) http://www.google.com/search?sourceid=navclient&aq=t&ie=UTF-8&rls=GGLD,GGLD:2005-19,GGLD:en&q=Tulip+hotel 200 0 0
2006-11-01 00:00:08 W3SVC1 127.0.0.1 GET /Tulip/home/en-us/home_index.aspx - 80 - 70.80.84.76 Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1) - 200 0 0
2006-11-01 00:00:08 W3SVC1 127.0.0.1 GET /Tulip/includes/js/CommonUtil.js - 80 - 70.80.84.76 Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1) http://www.hotelTulip.com.hk/Tulip/home/en-us/home_index.aspx 200 0 0

1 answer

User

#1 · Posted 2024-06-06 07:58:47

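Log files are not CSV (comma-separated values) files, so of course the CSV parser chokes on them.

If you don't yet know what you want to extract from the log files, try the following: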

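import fnmatch
from zipfile import ZipFile

data = []
with ZipFile("path/HTWebLog_p1.zip") as zipfiles:
    file_list = zipfiles.namelist()
    # keep only the .log members of the archive
    log_files = fnmatch.filter(file_list, "*.log")
    for file_name in log_files:
        # members open in binary mode, so each line is a bytes object
        with zipfiles.open(file_name) as lines:
            data.extend(lines.readlines())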

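This only reads the raw lines into data. If you want to parse the individual fields out of them, you will probably need something more elaborate, but hopefully this at least gets you started in the right direction.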
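As a rough illustration of what "something more elaborate" could look like, the sketch below turns the raw lines into one dictionary per request. It assumes the files follow the layout shown in the question: directive lines start with `#`, a `#Fields:` line names the columns, and the values are separated by single spaces. The function name `parse_log_lines` is made up for this example.

```python
def parse_log_lines(raw_lines):
    """Turn raw log lines (bytes) into a list of dicts keyed by field name."""
    fields = []
    records = []
    for raw in raw_lines:
        # Lines read from a zip member are bytes; decode before parsing.
        line = raw.decode("utf-8", errors="replace").strip()
        if line.startswith("#Fields:"):
            # The "#Fields:" directive lists the column names.
            fields = line[len("#Fields:"):].split()
        elif line and not line.startswith("#"):
            # Data row: pair each value with its field name.
            records.append(dict(zip(fields, line.split())))
    return records
```

Calling `parse_log_lines(data)` on the list built above would then give records such as `record["cs-uri-stem"]` or `record["sc-status"]`. All values stay strings, so numeric fields still need converting.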

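In more detail, the error message is telling you that the CSV parser looked at the first few lines, found no commas in any of them, and so parsed each one as a single column of text. Then line 5 suddenly contains some commas, which violates the format (every record in a CSV file must have the same number of columns). But if you look at the data, those commas are of course not column separators at all.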
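To see the mismatch for yourself, here is a small sketch that runs the first five lines of the sample log through Python's standard `csv` module and counts the fields found on each line. The `sample_lines` list is simply copied from the question; the pandas parser is a different implementation, so this only illustrates the idea.

```python
import csv

sample_lines = [
    "#Software: Microsoft Internet Information Services 6.0",
    "#Version: 1.0",
    "#Date: 2006-11-01 00:00:08",
    "#Fields: date time s-sitename s-ip cs-method cs-uri-stem cs-uri-query "
    "s-port cs-username c-ip cs(User-Agent) cs(Referer) sc-status "
    "sc-substatus sc-win32-status",
    "2006-11-01 00:00:08 W3SVC1 127.0.0.1 GET /Default.aspx - 80 - 70.80.84.76 "
    "Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1) "
    "http://www.google.com/search?sourceid=navclient&aq=t&ie=UTF-8"
    "&rls=GGLD,GGLD:2005-19,GGLD:en&q=Tulip+hotel 200 0 0",
]

# Split every line on commas, the way a CSV reader would.
field_counts = [len(row) for row in csv.reader(sample_lines)]
print(field_counts)  # [1, 1, 1, 1, 3]
```

The four header lines each come out as one field, while the two commas inside the referrer URL on line 5 split that line into three.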

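The data appear to have a fixed number of columns, so if you skip the first few lines and use delimiter=' ' (the columns are separated by spaces, not commas), you can probably use the CSV reader after all.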
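A minimal sketch of that approach follows. Instead of skipping a fixed number of lines, it drops every line that starts with `#` and takes the column names from the `#Fields:` line. That is an assumption about how the files are laid out, based only on the sample in the question. The helper names `read_iis_log` and `read_logs_from_zip` are hypothetical, and quoting is turned off on the assumption that the values never use CSV-style quotes.

```python
import csv
import fnmatch
import io
from zipfile import ZipFile

import pandas as pd


def read_iis_log(raw_bytes):
    """Parse the bytes of one space-separated log file into a DataFrame."""
    text = raw_bytes.decode("utf-8", errors="replace")
    fields = None
    rows = []
    for line in text.splitlines():
        if line.startswith("#Fields:"):
            # Column names come from the "#Fields:" directive.
            fields = line[len("#Fields:"):].split()
        elif line and not line.startswith("#"):
            rows.append(line)
    if fields is None:
        raise ValueError("no #Fields: line found")
    if not rows:
        return pd.DataFrame(columns=fields)
    return pd.read_csv(
        io.StringIO("\n".join(rows)),
        sep=" ",                 # columns are separated by single spaces
        header=None,             # the header was handled above
        names=fields,
        quoting=csv.QUOTE_NONE,  # treat quote characters as ordinary text
    )


def read_logs_from_zip(zip_path):
    """Combine every .log member of a zip archive into one DataFrame."""
    with ZipFile(zip_path) as archive:
        log_names = fnmatch.filter(archive.namelist(), "*.log")
        frames = [read_iis_log(archive.read(name)) for name in log_names]
    return pd.concat(frames, ignore_index=True)
```

With this in place, `df = read_logs_from_zip("path/HTWebLog_p1.zip")` would stand in for the original list comprehension. Empty values show up as the literal string `-`, so you may want to replace those afterwards.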
