获取文件的最后n行,类似于tail
我正在为一个网页应用写一个日志文件查看器,我想要能分页显示日志文件中的内容。文件里的内容是按行排列的,最新的内容在最下面。
所以我需要一个 tail()
方法,可以从底部读取 n
行,并且支持偏移量。以下是我想到的实现:
def tail(f, n, offset=0):
"""Reads a n lines from f with an offset of offset lines."""
avg_line_length = 74
to_read = n + offset
while 1:
try:
f.seek(-(avg_line_length * to_read), 2)
except IOError:
# woops. apparently file is smaller than what we want
# to step back, go to the beginning instead
f.seek(0)
pos = f.tell()
lines = f.read().splitlines()
if len(lines) >= to_read or pos == 0:
return lines[-to_read:offset and -offset or None]
avg_line_length *= 1.3
这样做合理吗?有没有推荐的方式来处理带偏移量的日志文件?
36 个回答
103
假设你在一个类似Unix的系统上使用Python 2,你可以这样做:
import os
def tail(f, n, offset=0):
stdin,stdout = os.popen2("tail -n "+n+offset+" "+f)
stdin.close()
lines = stdout.readlines(); stdout.close()
return lines[:,-offset]
如果你使用的是Python 3,你可以这样做:
import subprocess
def tail(f, n, offset=0):
proc = subprocess.Popen(['tail', '-n', n + offset, f], stdout=subprocess.PIPE)
lines = proc.stdout.readlines()
return lines[:, -offset]
132
这个方法可能比你用的更快。它不假设每行的长度,逐块地向后读取文件,直到找到足够数量的换行符('\n')。
def tail( f, lines=20 ):
total_lines_wanted = lines
BLOCK_SIZE = 1024
f.seek(0, 2)
block_end_byte = f.tell()
lines_to_go = total_lines_wanted
block_number = -1
blocks = [] # blocks of size BLOCK_SIZE, in reverse order starting
# from the end of the file
while lines_to_go > 0 and block_end_byte > 0:
if (block_end_byte - BLOCK_SIZE > 0):
# read the last block we haven't yet read
f.seek(block_number*BLOCK_SIZE, 2)
blocks.append(f.read(BLOCK_SIZE))
else:
# file too small, start from begining
f.seek(0,0)
# only read what was not read
blocks.append(f.read(block_end_byte))
lines_found = blocks[-1].count('\n')
lines_to_go -= lines_found
block_end_byte -= BLOCK_SIZE
block_number -= 1
all_read_text = ''.join(reversed(blocks))
return '\n'.join(all_read_text.splitlines()[-total_lines_wanted:])
我不喜欢对行长度做一些复杂的假设,因为实际上你根本无法知道这些信息。
通常情况下,这个方法会在第一次或第二次循环中找到最后20行。如果你说的74个字符是准确的,那就把块的大小设置为2048,这样你几乎可以立刻获取到最后20行。
另外,我也不想费太多脑筋去调整和操作物理操作系统的块。使用这些高级的输入输出工具,我怀疑你会看到对齐操作对性能有什么影响。如果你使用的是低级的输入输出,可能会有速度上的提升。
更新
对于Python 3.2及以上版本,处理字节时,文本文件(那些没有在模式字符串中加"b"的文件)只允许相对于文件开头进行查找(唯一的例外是可以用seek(0, 2)查找文件的末尾)。:
例如:f = open('C:/.../../apache_logs.txt', 'rb')
def tail(f, lines=20):
total_lines_wanted = lines
BLOCK_SIZE = 1024
f.seek(0, 2)
block_end_byte = f.tell()
lines_to_go = total_lines_wanted
block_number = -1
blocks = []
while lines_to_go > 0 and block_end_byte > 0:
if (block_end_byte - BLOCK_SIZE > 0):
f.seek(block_number*BLOCK_SIZE, 2)
blocks.append(f.read(BLOCK_SIZE))
else:
f.seek(0,0)
blocks.append(f.read(block_end_byte))
lines_found = blocks[-1].count(b'\n')
lines_to_go -= lines_found
block_end_byte -= BLOCK_SIZE
block_number -= 1
all_read_text = b''.join(reversed(blocks))
return b'\n'.join(all_read_text.splitlines()[-total_lines_wanted:])
24
我最后用的代码。我觉得这是目前为止最好的:
def tail(f, n, offset=None):
"""Reads a n lines from f with an offset of offset lines. The return
value is a tuple in the form ``(lines, has_more)`` where `has_more` is
an indicator that is `True` if there are more lines in the file.
"""
avg_line_length = 74
to_read = n + (offset or 0)
while 1:
try:
f.seek(-(avg_line_length * to_read), 2)
except IOError:
# woops. apparently file is smaller than what we want
# to step back, go to the beginning instead
f.seek(0)
pos = f.tell()
lines = f.read().splitlines()
if len(lines) >= to_read or pos == 0:
return lines[-to_read:offset and -offset or None], \
len(lines) > to_read or pos > 0
avg_line_length *= 1.3