Converting an AWK regular expression into a Python script

Question

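Good morning, everyone,

I have a question for you. I started learning Python only last weekend, after a colleague showed me how much execution time he saved by rewriting a Bash script in Python. I was amazed at how fast it ran, and now I would like to do the same with another one of my scripts.

This script reads a log file, uses AWK to filter certain fields from it, and writes them to a new file. Below is the regular expression the script runs. I want to rewrite it in Python because my current script takes about 1 hour to process a log file of roughly 100,000 lines, and I would like to cut that time as much as possible.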

cat logs/pdu_log_fe.log | awk -F\- '{print $1,$NF}' | awk -F\. '{print $1,$NF}' | awk '{print $1,$4,$5}' | sort | uniq | while read service command status; do echo "Service: $service, Command: $command, Status: $status, Occurrences: `grep $service logs/pdu_log_fe.log | grep $command | grep $status | wc -l | awk '{ print $1 }'`" >> logs/pdu_log_fe_clean.log; done
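One likely reason the loop above is slow is that it re-runs grep over the entire log once for every unique service/command/status combination. A minimal Python sketch of the same tally done in a single pass is shown below. The function names and the `split_three` extractor are hypothetical and only for illustration; a real extractor would have to parse the actual log format. Note also that grep matches substrings anywhere on a line, so the counts from the shell version may not be identical.

```python
from collections import Counter


def count_occurrences(lines, extract):
    """Tally each key returned by extract(line) in one pass over lines."""
    counts = Counter()
    for line in lines:
        key = extract(line)
        if key is not None:  # skip lines the extractor cannot parse
            counts[key] += 1
    return counts


def format_report(counts):
    """Build report lines in the same layout the shell loop writes."""
    return [
        "Service: {}, Command: {}, Status: {}, Occurrences: {}".format(
            service, command, status, n
        )
        for (service, command, status), n in sorted(counts.items())
    ]


# Hypothetical stand-in extractor: treats each line as "service command status".
def split_three(line):
    parts = line.split()
    return tuple(parts) if len(parts) == 3 else None


demo = ["svcA submit_resp: 0", "svcA submit_resp: 0", "svcB deliver: 8", "bad line"]
for row in format_report(count_occurrences(demo, split_three)):
    print(row)
```

Because the counts are kept in memory, the log only needs to be read once, however many distinct combinations it contains.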

The AWK command above takes lines that look like this:

2011-05-16 09:46:22,361 [Thread-4847133] PDU D <G_CC_SMS_SERVICE_51408_656.O_ CC_SMS_SERVICE_51408_656-ServerThread-VASPSessionThread-7ee35fb0-7e87-11e0-a2da-00238bce423b-TRX - 2011-05-16 09:46:22 - OUT - (submit_resp: (pdu: L: 53 ID: 80000004 Status: 0 SN: 25866) 98053090-7f90-11e0-a2da-00238bce423b (opt: ) ) >

and outputs lines like this:

CC_SMS_SERVICE_51408 submit_resp: 0
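For illustration, the mapping from the sample line to the desired output could be sketched with a single Python regular expression, as below. The pattern is inferred only from the one sample line shown here, so it is an assumption that other lines follow the same layout; the names `LINE_RE` and `extract_fields` are hypothetical.

```python
import re

# Pattern inferred from the single sample line; adjust if other lines differ.
LINE_RE = re.compile(
    r"<G_(?P<service>\w+?)_\d+\.O_"   # service name, minus its numeric suffix
    r".*?\((?P<command>\w+):"         # first "(name:" group, e.g. submit_resp
    r".*?Status:\s*(?P<status>\d+)"   # numeric status code
)


def extract_fields(line):
    """Return (service, command, status), or None if the line does not match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    return match.group("service"), match.group("command"), match.group("status")


sample = (
    "2011-05-16 09:46:22,361 [Thread-4847133] PDU D "
    "<G_CC_SMS_SERVICE_51408_656.O_ "
    "CC_SMS_SERVICE_51408_656-ServerThread-VASPSessionThread-"
    "7ee35fb0-7e87-11e0-a2da-00238bce423b-TRX - 2011-05-16 09:46:22 - OUT - "
    "(submit_resp: (pdu: L: 53 ID: 80000004 Status: 0 SN: 25866) "
    "98053090-7f90-11e0-a2da-00238bce423b (opt: ) ) >"
)

fields = extract_fields(sample)
if fields:
    print("{} {}: {}".format(*fields))  # CC_SMS_SERVICE_51408 submit_resp: 0
```

Compiling the pattern once and reusing it for every line avoids spawning a separate process per field, which is what the chain of awk calls does.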

I tried to write the Python script myself, but I got stuck on the regular expression. This is what I have so far:

#!/usr/bin/env python3

# Import RegEx module
import re as regex

# Log file to work on
filetoread = open('/tmp/ pdu_log.log', "r")
# File to write output to
filetowrite = open('/tmp/ pdu_log_clean.log', "w")
# Perform filtering in the log file
linetoread = filetoread.readlines()
filetoread.close()
for line in linetoread:
    # Strip the "<G_" prefix, then turn every "." into a space
    filter0 = regex.sub(r"<G_", "", line)
    filter1 = regex.sub(r"\.", " ", filter0)
    # Write new log file
    filetowrite.write(filter1)
filetowrite.close()
# Read new log and get required fields from it
filtered_log = open('/tmp/ pdu_log_clean.log', "r")
filtered_line = filtered_log.readlines()
filtered_log.close()
for line in filtered_line:
    token = line.split(" ")
    print(token[0], token[1], token[5], token[13], token[20])
print("Done")

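I know this looks terrible, but please keep in mind that I have only been learning Python for two days.

I have been searching this group and the web for code snippets I could use, but so far everything I have found either does not fit my needs or is too complicated for me.

I would be very grateful for any advice or suggestions that could help me get this done.

Also, could you recommend a good, easy-to-follow book for learning Python? I have read "A Byte of Python" by Swaroop C H (a great introductory book!) and am now reading "Dive into Python" by Mark Pilgrim. I am looking for a book that explains things in simple language and gets straight to the point (similar to the writing style of "A Byte of Python").

Thanks in advance!

Regards,

Junior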

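=====Reply to Eli's comment=====

Sorry, everyone. I tried to comment under Eli's answer, but my comment was too long to save. I also tried to answer my own post, but as a new user I cannot do that for 8 hours! So my only option is to add an edit to my post :)

Anyway, in response to Eli's comment:

OK, let's see. My goal is to filter a few fields out of the log file and write them to a new log file. As I mentioned earlier, the current log file has thousands of lines like this one: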

2011-05-16 09:46:22,361 [Thread-4847133] PDU D

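All lines in the log file are similar and have the same length (the same number of fields). Most fields are separated by spaces; only a few are the ones I process with AWK (removing the "

I hope this makes things clearer.

Regards,

Junior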
