内存有限的通用排序工具。
tinysort的Python项目详细描述
受Spotify’s luigi framework启发的记忆中实验性Pythonmapreduce。
单词计数示例
目前唯一的mapreduce实现是在内存和串行中,但是 将添加具有并行映射和reduce阶段的实现。
fromcollectionsimportCounterimportjsonimportreimportsysfromtinymr.memoryimportMemMapReduceclassWordCount(MemMapReduce):""" The go-to MapReduce example. Don't worry, a better example is on its way: https://github.com/geowurster/tinymr/issues/11 """# Counting word occurrence does not benefit from sorting post-map or# post-reduce and our `final_reducer()` doesn't care about key order# so we can disable sorting for a speed boost.sort_map=Falsesort_reduce=Falsesort_final_reduce=Falsedef__init__(self):""" Stash a regex to strip off punctuation so we can grab it later. """self.pattern=re.compile('[\W_]+')defmapper(self,line):""" Take a line of text from the input file and figure out how many times each word appears. An alternative, simpler implementation would be: def mapper(self, item): for word in item.split(): word = self.pattern.sub('', word) if word: yield word.lower(), 1 This simpler example is more straightforward, but holds more data in-memory. The implementation below does more work per line but potentially has a smaller memory footprint. Like anything MapReduce the implementation benefits greatly from knowing a lot about the input data. """# Normalize all words to lowercaseline=line.lower().strip()# Strip off punctuationline=[self.pattern.sub('',word)forwordinline]# Filter out empty stringsline=filter(lambdax:x!='',line)# Return an iterable of `(word, count)` pairsreturnCounter(line).items()defreducer(self,key,values):""" Just sum the number of times each word appears in the entire dataset. At this phase `key` is a word and `values` is an iterable producing all of the values for that word from the map phase. Something like: key = 'the' values = (1, 1, 2, 2, 1) The word `the` appeared once on 3 lines and twice on two lines for a total of `7` occurrences, so we yield: ('the', 7) """yieldkey,sum(values)defoutput(self,pairs):""" Normally this phase is where the final dataset is written to disk, but since we're operating in-memory we just want to re-structure as a dictionary. `pairs` is an iterator producing `(key, iter(values))` tuples from the reduce phase, and since we know that we only produced one key from each reduce we want to extract it for easier access later. """return{k:tuple(v)[0]fork,vinpairs}wc=WordCount()withopen('LICENSE.txt')asf:out=wc(f)print(json.dumps(out,indent=4,sort_keys=True))
截断输出:
{"a":1,"above":2,"advised":1,"all":1,"and":8,"andor":1}
字数统计工作流
在内部,工作流如下:
输入数据:
$ head -10 LICENSE.txt New BSD License Copyright (c) 2015, Kevin D. Wurster All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
map
计算每行中每个单词的出现次数。
# Input lineline='Copyright (c) 2015, Kevin D. Wurster'# Sanitized wordswords=['Copyright','c','2015','Kevin','D','Wurster']# Return tuples with word as the first element and count as the secondpairs=[('Copyright',1),('c',1),('2015',1),('Kevin',1),('D',1),('Wurster',1)]
分区
按word组织所有(word, count)对word键是 保存在这一点,以防数据被排序。排序抓住倒数第二个 键,以便数据可以在一个键上进行分区,并在另一个键上使用 (word, sort, count)倒数第二个键用于排序,因此 下面出现的匹配word只因为没有给出sort键。
出现在多行输入文本中的单词有多个 (word, count)对。2中的count表示 在一行中出现两次,但是我们的输入数据没有这个 条件下面的输出被截断字典值是包含 允许排序键的元组,这在别处有解释
{'2015':[(1,)]'above':[(1,)]'all':[(1,)]'and':[(1,),(1,),(1,)]'are':[(1,),(1,)]'binary':[(1,)]'bsd':[(1,)]'c':[(1,)]'code':[(1,)]}
减少
每个word的总和count。
# The ``reducer()`` receives a key and an iterator of valueskey='the'values=(1,1,1)defreducer(key,values):yieldkey,sum(values)
分区
减速机不需要产生与给定的密钥相同的密钥,因此数据 再次按键分区,这对于这个wordcount示例来说是多余的。 再次保存这些键,以防数据被排序并且只匹配word 因为没有提供可选的sort密钥。下面的输出被截断。
{'2015':[(1,)]'above':[(1,)]'all':[(1,)]'and':[(3,)]'are':[(2,)]'binary':[(1,)]'bsd':[(1,)]'c':[(1,)]'code':[(1,)]}
输出
默认实现是从 final_reducer(),看起来像:
values=[('the',(3,)),('in',(1,),]
但是字典更有用,我们知道我们只有一本 reduce阶段中每个word的值,因此我们可以提取该值 出一本字典。
return{k:tuple(v)[0]fork,vinvalues}
由于value键中的数据是 _始终是一个iterable对象,但可以是迭代器呼叫tuple() 扩展iterable并让我们得到第一个元素
开发
$ git clone https://github.com/geowurster/tinymr.git $cd tinymr $ pip install -e .\[dev\]$ py.test tests --cov tinymr --cov-report term-missing
许可证
见LICENSE.txt
变更日志
见CHANGES.md