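Pickled results saved to disk differ in size when multiprocessing with different numbers of cores
I've looked around but couldn't find an answer to this. I believe my code is correct, but I've noticed that each time I run it and save the result (a dictionary) to disk, the file size varies depending on how many cores I use.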
Using 4 cores results in a file of 48,418 KB
Using 8 cores (hyperthreading) results in a file of 59,880 KB
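The results should be the same (and they appear to be), so I'm just curious what is causing the difference in size.
I briefly inspected the two saved objects, and they contain the same number of items in each dictionary: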
4 cores has 683 keys and 6,015,648 values
8 cores has 683 keys and 6,015,648 values
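I suppose I could check whether the values for each key are exactly identical, but I expect that would take quite a while.
The only code that could be causing this is the part that splits the data into chunks for processing, which is: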
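For reference, a minimal sketch of such a check, assuming both result dictionaries fit in memory at once. The helper names and the file names in the usage comment are hypothetical, not part of the original code. A plain `first == second` answers the yes/no question in a single pass; the helper below additionally reports which keys disagree.

```python
import pickle


def load_pickle(path):
    """Load one pickled object from disk (hypothetical helper)."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def differing_keys(first, second):
    """Return the keys whose values differ, or that exist in only one dict."""
    missing = object()  # sentinel, so a stored None is not mistaken for "absent"
    all_keys = set(first) | set(second)
    return sorted(
        (key for key in all_keys
         if first.get(key, missing) != second.get(key, missing)),
        key=repr,  # sort by repr so mixed key types cannot raise TypeError
    )


# Hypothetical usage; the file names are placeholders:
# four = load_pickle("results_4cores.pickle")
# eight = load_pickle("results_8cores.pickle")
# print(differing_keys(four, eight))
```

An empty list means the two dictionaries compare equal key by key. Note that if the values are lists, the same items in a different order count as a difference.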
import multiprocessing
import pickle
import time

def split_list_multi(listOfLetterCombos, threads=8):
    """Split a list into N parts for use with the multiprocessing module.

    Takes a list (a set must be converted to a list first, since sets cannot be sliced)
    of the letter combinations created using make_letter_combinations().
    Divides the list into N parts (where N is the number of threads), equal in size except
    that the last part also takes any remainder, and returns a tuple of (dict, threads).
    The dict maps the thread number, as a string starting at '0', to a slice of the list.
    With 4 threads and a list of 2000 items, the dict would be {'0': [0:500],
    '1': [500:1000], '2': [1000:1500], '3': [1500:2000]}."""
    fullLength = len(listOfLetterCombos)
    single = fullLength // threads  # chunk size, rounded down
    results = {}
    for counter in range(threads):
        if counter == threads - 1:
            # The last chunk takes everything that is left, including any remainder.
            results[str(counter)] = listOfLetterCombos[single*counter:]
        else:
            results[str(counter)] = listOfLetterCombos[single*counter:single*(counter+1)]
    return results, threads

def main(numOfLetters, numThreads):
    with open(r'd:\download\allwords.pickle', 'rb') as wordFile:
        wordList = pickle.load(wordFile)
    combos = make_letter_combinations(numOfLetters)
    split = split_list_multi(combos, numThreads)
    doneQueue = multiprocessing.Queue()
    jobs = []
    startTime = time.time()
    for num in range(split[1]):
        listLetters = split[0][str(num)]
        thread = multiprocessing.Process(target=worker, args=(listLetters, wordList, doneQueue))
        jobs.append(thread)
        thread.start()
    # Merge the partial dicts in whatever order the workers finish.
    resultdict = {}
    for i in range(split[1]):
        resultdict.update(doneQueue.get())
    for j in jobs:
        j.join()
    with open('d:\\download\\results{}letters.pickle'.format(numOfLetters), 'wb') as outFile:
        pickle.dump(resultdict, outFile)
    endTime = time.time()
    totalTime = (endTime - startTime) / 60
    print("Took {} minutes".format(totalTime))
    return resultdict
1 Answer
3
From:
cPickle - different results pickling the same object
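"There is no guarantee that seemingly identical objects will produce identical pickle strings.
The pickle protocol is a virtual machine, and a pickle string is a program for that virtual machine. For a given object there may exist multiple pickle strings (programs) that will reconstruct that object exactly."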
It really is a gotcha!
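The quoted point can be reproduced with a small experiment. This is a sketch on CPython 3.7+ with made-up data, not the asker's actual dictionary. It shows two effects: pickle writes a short back-reference when the same object appears twice but writes the full data again for an equal, distinct object, and dict insertion order ends up in the byte stream. Whether the first effect explains the 4-core versus 8-core gap is an assumption; it would fit if results returned by more worker processes share fewer objects after passing through the queue.

```python
import pickle

# Build two equal strings at run time so they are distinct objects in memory.
word = "".join(["multi", "processing"])
twin = "".join(["multi", "processing"])

shared = [word, word]     # the same object referenced twice
separate = [word, twin]   # two equal but distinct objects

shared_bytes = pickle.dumps(shared)
separate_bytes = pickle.dumps(separate)

# Equal dicts built in a different order also serialize differently.
forward = pickle.dumps({"x": 1, "y": 2})
backward = pickle.dumps({"y": 2, "x": 1})

print(shared == separate)                         # the objects compare equal
print(len(shared_bytes), len(separate_bytes))     # yet the pickles differ in size
print(forward == backward)                        # same dict contents, different bytes
print(pickle.loads(shared_bytes) == pickle.loads(separate_bytes))
```

So comparing file sizes or raw bytes says little about whether two pickled results are equal; load both and compare the objects instead.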