A general-purpose sorting utility for limited memory.

Python project description for tinysort


Experimental in-memory Python MapReduce inspired by Spotify's luigi framework.


Word Count Example

Currently the only MapReduce implementation is in-memory and serial, but implementations with parallelized map and reduce phases will be added.

from collections import Counter
import json
import re

from tinymr.memory import MemMapReduce


class WordCount(MemMapReduce):

    """
    The go-to MapReduce example.

    Don't worry, a better example is on its way:
    https://github.com/geowurster/tinymr/issues/11
    """

    # Counting word occurrence does not benefit from sorting post-map or
    # post-reduce, and our `final_reducer()` doesn't care about key order,
    # so we can disable sorting for a speed boost.
    sort_map = False
    sort_reduce = False
    sort_final_reduce = False

    def __init__(self):
        """
        Stash a regex to strip off punctuation so we can grab it later.
        """
        self.pattern = re.compile(r'[\W_]+')

    def mapper(self, line):
        """
        Take a line of text from the input file and figure out how many
        times each word appears.

        An alternative, simpler implementation would be:

            def mapper(self, item):
                for word in item.split():
                    word = self.pattern.sub('', word)
                    if word:
                        yield word.lower(), 1

        This simpler example is more straightforward, but holds more data
        in-memory.  The implementation below does more work per line but
        potentially has a smaller memory footprint.  Like anything
        MapReduce, the implementation benefits greatly from knowing a lot
        about the input data.
        """
        # Normalize all words to lowercase
        line = line.lower().strip()

        # Strip off punctuation
        words = [self.pattern.sub('', word) for word in line.split()]

        # Filter out empty strings
        words = filter(lambda x: x != '', words)

        # Return an iterable of `(word, count)` pairs
        return Counter(words).items()

    def reducer(self, key, values):
        """
        Just sum the number of times each word appears in the entire
        dataset.

        At this phase `key` is a word and `values` is an iterable producing
        all of the values for that word from the map phase.  Something like:

            key = 'the'
            values = (1, 1, 2, 2, 1)

        The word `the` appeared once on 3 lines and twice on two lines for
        a total of `7` occurrences, so we yield:

            ('the', 7)
        """
        yield key, sum(values)

    def output(self, pairs):
        """
        Normally this phase is where the final dataset is written to disk,
        but since we're operating in-memory we just want to re-structure as
        a dictionary.

        `pairs` is an iterator producing `(key, iter(values))` tuples from
        the reduce phase, and since we know that we only produced one key
        from each reduce we want to extract it for easier access later.
        """
        return {k: tuple(v)[0] for k, v in pairs}


wc = WordCount()
with open('LICENSE.txt') as f:
    out = wc(f)

print(json.dumps(out, indent=4, sort_keys=True))

Truncated output:

{
    "a": 1,
    "above": 2,
    "advised": 1,
    "all": 1,
    "and": 8,
    "andor": 1
}
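For a sanity check, the same result can be computed with nothing but the standard library. This is a minimal stdlib-only sketch of what the pipeline above produces (the lowercasing and punctuation-stripping regex mirror the `WordCount.mapper()` logic; it is not part of tinymr):

```python
import re
from collections import Counter


def word_count(lines):
    """Stdlib-only equivalent of the WordCount pipeline above."""
    pattern = re.compile(r'[\W_]+')  # same punctuation-stripping regex
    counts = Counter()
    for line in lines:
        for word in line.lower().strip().split():
            word = pattern.sub('', word)
            if word:
                counts[word] += 1
    return dict(counts)


text = [
    'New BSD License',
    'Copyright (c) 2015, Kevin D. Wurster',
]
print(word_count(text))
```

The MapReduce version buys nothing for a dataset this small; its value is that each phase can later be parallelized or moved out of memory.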

Word Count Workflow

Internally, the workflow looks like this:

Input Data

$ head -10 LICENSE.txt

New BSD License

Copyright (c) 2015, Kevin D. Wurster
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this
  list of conditions and the following disclaimer.

Map

Count the occurrences of each word in every line.

# Input line
line = 'Copyright (c) 2015, Kevin D. Wurster'

# Sanitized words
words = ['copyright', 'c', '2015', 'kevin', 'd', 'wurster']

# Return tuples with word as the first element and count as the second
pairs = [('copyright', 1), ('c', 1), ('2015', 1), ('kevin', 1), ('d', 1), ('wurster', 1)]
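The map step above can be sketched as a standalone function. This mirrors the logic of `WordCount.mapper()` without the class machinery (an illustration, not tinymr's internals):

```python
import re
from collections import Counter

# Same punctuation-stripping regex stashed by WordCount.__init__()
pattern = re.compile(r'[\W_]+')


def mapper(line):
    # Normalize to lowercase, strip punctuation, drop empty strings,
    # and count occurrences within this single line.
    words = (pattern.sub('', w) for w in line.lower().strip().split())
    return Counter(w for w in words if w).items()


print(sorted(mapper('Copyright (c) 2015, Kevin D. Wurster')))
```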

Partition

All of the `(word, count)` pairs are organized by `word`.  The keys are preserved at this point in case the data is to be sorted.  Sorting grabs the second-to-last key, so data can be partitioned on one key and sorted on another by emitting `(word, sort, count)`.  Since the second-to-last key is used for sorting, the output below matches `word` only because no `sort` key was given.

Words that appear on multiple lines of the input text have multiple `(word, count)` pairs.  A `count` of `2` would indicate a word that appeared twice on a single line, but our input data does not have that condition.  The output below is truncated.  The dictionary values are tuples in order to allow for sorting keys, which is explained elsewhere.

{
    '2015': [(1,)],
    'above': [(1,)],
    'all': [(1,)],
    'and': [(1,), (1,), (1,)],
    'are': [(1,), (1,)],
    'binary': [(1,)],
    'bsd': [(1,)],
    'c': [(1,)],
    'code': [(1,)]
}
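The partition step can be sketched by grouping `(word, count)` pairs with a `collections.defaultdict`. This is a simplified illustration of the grouping shown above, not tinymr's actual implementation; the counts are wrapped in single-element tuples to match the structure in the output:

```python
from collections import defaultdict


def partition(pairs):
    # Group (word, count) pairs by word; each value becomes a list of
    # single-element tuples, matching the structure shown above.
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append((count,))
    return dict(grouped)


pairs = [('and', 1), ('source', 1), ('and', 1), ('binary', 1)]
print(partition(pairs))
```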

Reduce

Sum the `count` for each `word`.

# The ``reducer()`` receives a key and an iterator of values
key = 'the'
values = (1, 1, 1)

def reducer(key, values):
    yield key, sum(values)

Partition

The reducer is not required to yield the same key it was given, so the data is partitioned by key again, which is redundant for this word count example.  The keys are again preserved in case the data is to be sorted, and only match `word` because the optional `sort` key was not given.  The output below is truncated.

{
    '2015': [(1,)],
    'above': [(1,)],
    'all': [(1,)],
    'and': [(3,)],
    'are': [(2,)],
    'binary': [(1,)],
    'bsd': [(1,)],
    'c': [(1,)],
    'code': [(1,)]
}

Output

The default implementation is to return the `(key, iter(values))` pairs from `final_reducer()`, which would look something like:

values = [
    ('the', (3,)),
    ('in', (1,)),
]

But a dictionary is more useful, and since we know that the reduce phase produced only a single value for each `word`, we can extract that value into a dictionary.

return {k: tuple(v)[0] for k, v in values}

The `tuple()` call is needed because the data in `values` is always an iterable object but may be an iterator.  Calling `tuple()` expands the iterable and lets us grab the first element.
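For example, calling `tuple()` on a one-shot iterator materializes it into a concrete sequence that can be indexed. This is a generic Python illustration, not tinymr-specific:

```python
def first(values):
    # `values` may be any iterable, including a one-shot iterator;
    # tuple() materializes it so we can index element 0.
    return tuple(values)[0]


print(first(iter((7,))))  # works on an iterator
print(first([7]))         # and on a list
```

Indexing the iterator directly (`values[0]`) would raise a `TypeError`, which is why the expansion is needed.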

Developing

$ git clone https://github.com/geowurster/tinymr.git
$ cd tinymr
$ pip install -e .\[dev\]
$ py.test tests --cov tinymr --cov-report term-missing

License

LICENSE.txt

Changelog

CHANGES.md
