Comparing 40 million lines in a file with a list of 6 million items in Python

1 vote

3 answers

1357 views

Asked 2025-04-16 10:14

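I have a file with 40 million entries in the format:

#No Username

I also have a list of 6 million usernames.

I want to find, as quickly as possible, which usernames appear in both. This is the code I have so far: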

import os
usernames=[]
common=open('/path/to/filf','w')
f=open('/path/to/6 million','r')
for l in os.listdir('/path/to/directory/with/usernames/'):
    usernames.append(l)
#noOfUsers=len(usernames)
for l in f:
    l=l.split(' ')
    if(l[1] in usernames):
        common.write(l[1]+'\n')
common.close()
f.close()

How can I make this code run faster?

Tags: performance optimization, data processing, file comparison, big data analysis, duplicate detection

3 Answers

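If you build a dictionary with the usernames as keys, checking whether a key exists is much faster than searching a list for an element: a dictionary lookup is a hash lookup that takes roughly constant time on average, whereas a membership test on a list scans the elements one by one.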
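As a minimal sketch of that idea (the function name find_common and the sample data are made up for illustration, and it assumes each line looks like "<number> <username>"), the lookup could be built like this:

```python
def find_common(lines, usernames):
    """Return usernames from `lines` ('<no> <username>') that are also in `usernames`."""
    # dict.fromkeys builds a dictionary whose keys are the usernames; the values are unused.
    lookup = dict.fromkeys(usernames)
    common = []
    for line in lines:
        fields = line.split()
        # Skip blank or malformed lines that have no username field.
        if len(fields) < 2:
            continue
        # `in` on a dict is a hash lookup, so it does not scan every key.
        if fields[1] in lookup:
            common.append(fields[1])
    return common


# Hypothetical sample data standing in for the real file and username list.
sample_lines = ["1 alice\n", "2 bob\n", "3 carol\n", "\n"]
sample_usernames = ["bob", "dave", "alice"]
print(find_common(sample_lines, sample_usernames))
```

A set would work equally well here, since only the keys are used; the dictionary form is shown because it is what this answer suggests.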

Answered 2025-04-16 by Python大师


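If you need to do this many times, I suggest considering threads. Here is a rough outline in code.

First, on Linux, split the file into smaller files of 100,000 lines each: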

> split -l 100000 usernames.txt usernames_
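Where the split command is not available, roughly the same result can be had from Python. The following is only a sketch: split_file is a hypothetical helper, and it names the chunks with numeric suffixes (usernames_0000, usernames_0001, ...) rather than the aa, ab, ... suffixes that split produces.

```python
import itertools


def split_file(path, prefix, lines_per_chunk=100000):
    """Split `path` into files named prefix0000, prefix0001, ... and return their names."""
    chunk_names = []
    with open(path) as source:
        for index in itertools.count():
            # islice pulls at most lines_per_chunk lines without loading the whole file.
            chunk = list(itertools.islice(source, lines_per_chunk))
            if not chunk:
                break
            name = "%s%04d" % (prefix, index)
            with open(name, "w") as out:
                out.writelines(chunk)
            chunk_names.append(name)
    return chunk_names
```

Calling split_file('usernames.txt', 'usernames_') would then play the role of the command above.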

Then start some threads to process these files in parallel.

import glob
import threading

usernames_one = set()
usernames_two = set()
filereaders = []

# Thread that puts every line of one file into the given set
class Filereader(threading.Thread):
    def __init__(self, filename, username_set):
        threading.Thread.__init__(self)
        self.filename = filename
        self.username_set = username_set

    def run(self):
        with open(self.filename) as f:
            for line in f:
                self.username_set.add(line.strip())

# loop through the usernames_ files produced by split and spawn a thread for each
for filename in glob.glob('usernames_*'):
    reader = Filereader(filename, usernames_one)
    filereaders.append(reader)
    reader.start()
# do the same loop for the second group of files, filling usernames_two

# at the end, wait for all threads to complete
for reader in filereaders:
    reader.join()

# then do a simple set intersection (& is intersection; ^ would be symmetric difference)
common_usernames = usernames_one & usernames_two

# then write the common set to a file
with open("common_usernames.txt", 'w') as common_file:
    common_file.write('\n'.join(common_usernames))

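You should check whether adding to a set is thread-safe. If it is not, you can give each thread its own set for the file it processes, keep those sets in a list, merge them at the end, and then take the intersection.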
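The fallback with one set per thread could look roughly like the sketch below. The names read_usernames and load_all are invented for illustration, and the sketch assumes each chunk file holds one username per line.

```python
import threading


def read_usernames(filename, results, index):
    """Read one file into a private set and store it in results[index]."""
    names = set()
    with open(filename) as f:
        for line in f:
            name = line.strip()
            if name:
                names.add(name)
    # Each thread writes only to its own slot, so no set is shared while reading.
    results[index] = names


def load_all(filenames):
    """Read every file in its own thread and return the union of all names."""
    results = [set() for _ in filenames]
    threads = [
        threading.Thread(target=read_usernames, args=(name, results, i))
        for i, name in enumerate(filenames)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Merge the per-thread sets only after every thread has finished.
    return set().union(*results)
```

With two groups of chunk files, the common names would then be load_all(first_group) & load_all(second_group). Because no set is shared between threads, the question of whether set insertion is thread-safe does not arise.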

Answered 2025-04-16 by Python大师


I see two obvious improvements: first, turn the usernames into a set. Second, build a result list and write it to the file in one go with '\n'.join(resultlist).

import os

# A set gives fast membership tests, unlike a list
usernames = set(os.listdir('/path/to/directory/with/usernames/'))

resultlist = []
f = open('/path/to/6 million', 'r')
for l in f:
    # split() with no argument also drops the trailing newline from the last field
    l = l.split()
    if l[1] in usernames:
        resultlist.append(l[1])
f.close()

# Write all results with a single call
common = open('/path/to/filf', 'w')
common.write('\n'.join(resultlist) + '\n')
common.close()

Addendum: suppose you only want to find the most common names:

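import os
from collections import Counter

usernames = set(os.listdir('/path/to/directory/with/usernames/'))

with open('/path/to/6 million') as f:
    # take the username field of each line and count it only if it is a known username
    names = (line.split()[1] for line in f)
    name_counts = Counter(name for name in names if name in usernames)
print(name_counts.most_common())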

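Addendum 2: based on your clarification, here is how to create a file containing the names that appear both among the usernames in the path and in the 6-million-line file: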

import os
usernames = set(os.listdir('/path/to/directory/with/usernames/'))

with open('/path/to/6 million') as f:
    # compare the username field, not a single character of the line
    names = (line.split()[1] for line in f)
    resultlist = [name for name in names if name in usernames]

with open('/path/to/filf', 'w') as common:
    common.write('\n'.join(resultlist) + '\n')
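If the list of matches is itself too large to hold comfortably in memory, one possible variation is to write each match as soon as it is found. This is a sketch under assumptions, not part of the approach above: write_common is a hypothetical helper, and it silently skips lines that have no username field.

```python
def write_common(in_path, out_path, usernames):
    """Write each username from in_path that is in `usernames` to out_path; return the count."""
    usernames = set(usernames)
    written = 0
    with open(in_path) as source, open(out_path, "w") as out:
        for line in source:
            fields = line.split()
            # Only lines with a username field that is a known username are written.
            if len(fields) >= 2 and fields[1] in usernames:
                out.write(fields[1] + "\n")
                written += 1
    return written
```

The trade-off is many small writes instead of one large one; Python buffers file output by default, so this is usually not a bottleneck, but it is worth measuring on the real data.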

Answered 2025-04-16 by Python大师

