Database compaction in Python

Asked 2025-04-16 09:00

I have some hourly logs that look like this:

user1:joined
user2:log out
user1:added pic
user1:added comment
user3:joined

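I want to compact all of these flat files into a single file. The logs cover about 30 million users, and I only want the most recent entry for each user.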

The final log format I want looks like this:

user1:added comment
user2:log out
user3:joined

My first attempt, at small scale, was to build a dictionary like this:

log['user1'] = "added comment"

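If I build a dictionary with 30 million key-value pairs, will it take up a lot of memory? Or should I use something like sqlite to store them, and then write the contents of the sqlite table back out to a file?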


6 Answers

You can also process the log lines in reverse order, using a set to keep track of which users you have already seen:

s = set()

# Note: this piece is inefficient in that it reads all the lines
# into memory in order to reverse them. There are recipes out there
# for reading a file in reverse.
with open('log') as f:
    lines = f.readlines()
lines.reverse()

for line in lines:
    line = line.strip()
    # Split on the first colon only, in case the action text contains one.
    user, op = line.split(':', 1)
    if user not in s:
        # First time we see this user going backwards = their latest entry.
        print(line)
        s.add(user)
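
The comment in the code above mentions recipes for reading a file in reverse. Here is a minimal sketch of one such recipe, not a definitive implementation. The names `reverse_lines` and `latest_per_user` are made up for illustration, and it assumes UTF-8 text with `\n` line endings. It reads fixed-size chunks from the end of the file, so only one chunk (plus the set of seen users) is held in memory at a time.

```python
import os


def reverse_lines(path, chunk_size=64 * 1024):
    """Yield the non-empty lines of a file from last to first.

    Hypothetical helper: assumes UTF-8 text and '\n' line endings.
    """
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        position = f.tell()
        tail = b""  # partial line carried over from the previously read chunk
        while position > 0:
            read_size = min(chunk_size, position)
            position -= read_size
            f.seek(position)
            chunk = f.read(read_size) + tail
            pieces = chunk.split(b"\n")
            # The first piece may be an incomplete line; finish it next round.
            tail = pieces.pop(0)
            for piece in reversed(pieces):
                if piece:
                    yield piece.decode("utf-8")
        if tail:
            yield tail.decode("utf-8")


def latest_per_user(path):
    """Yield (user, action) for each user's most recent entry, newest first."""
    seen = set()
    for line in reverse_lines(path):
        user, sep, op = line.strip().partition(":")
        if sep and user not in seen:
            seen.add(user)
            yield user, op
```

With several hourly files, you would walk the files from newest to oldest and share one `seen` set across them. Note that the set still ends up holding every user name.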

Answered 2025-04-16 by Python大师


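The various dbm modules (dbm in Python 3; anydbm, gdbm, dbhash, and so on in Python 2) let you create simple databases that store key-to-value mappings. They are kept on disk, so they do not take up much memory. If you like, you can even keep them as the log itself.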
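
As a rough sketch of that idea in Python 3, assuming the log files are passed in oldest first: the function names `compact_logs` and `dump_db` and the file paths are hypothetical, and which dbm backend gets used depends on your Python build.

```python
import dbm


def compact_logs(log_paths, db_path):
    """Keep only the latest action per user in an on-disk dbm database.

    Hypothetical helper: log_paths is assumed to be ordered oldest to newest.
    """
    with dbm.open(db_path, "c") as db:  # "c" creates the database if missing
        for path in log_paths:
            with open(path, encoding="utf-8") as log:
                for line in log:
                    user, sep, op = line.strip().partition(":")
                    if sep:
                        # dbm stores bytes; a later line overwrites an earlier one.
                        db[user.encode("utf-8")] = op.encode("utf-8")


def dump_db(db_path, out_path):
    """Write the database back out as 'user:action' lines (order not guaranteed)."""
    with dbm.open(db_path, "r") as db, open(out_path, "w", encoding="utf-8") as out:
        # Caveat: keys() returns all keys as a list, which is itself large
        # for 30 million users; some backends offer incremental iteration.
        for key in db.keys():
            out.write(f"{key.decode('utf-8')}:{db[key].decode('utf-8')}\n")
```

Something like `compact_logs(sorted(paths), "latest.db")` followed by `dump_db("latest.db", "compacted.log")` would then produce the single file, provided the file names sort chronologically.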

Answered 2025-04-16 by Python大师


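If you call intern() on every log entry (in Python 3 it lives at sys.intern()), then no matter how many times a given entry appears, it is represented by a single string object. This can greatly reduce memory use when the same text repeats often.

>>> import sys
>>> a = 'foo'
>>> b = 'foo'
>>> a is b
True
>>> b = ''.join(['f', 'oo'])
>>> a is b
False
>>> a = sys.intern('foo')
>>> b = sys.intern(''.join(['f', 'oo']))
>>> a is b
True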
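
Applied to the dictionary from the question, that might look like the sketch below. The function name `latest_actions` is made up for illustration, and this assumes Python 3. Only the action text is interned here, since user names are unique anyway and would gain nothing from it.

```python
import sys


def latest_actions(lines):
    """Map each user to their latest action, sharing repeated action strings.

    Hypothetical helper: lines are assumed to be in oldest-to-newest order.
    """
    latest = {}
    for line in lines:
        user, sep, op = line.strip().partition(":")
        if sep:
            # Every user whose latest action is "joined" now points at the
            # same string object instead of holding a separate copy.
            latest[user] = sys.intern(op)
    return latest
```

The dictionary itself and the 30 million keys still take memory, so this only helps with the values.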

Answered 2025-04-16 by Python大师
