Python tinysort包_程序模块 - PyPI

内存有限的通用排序工具。

tinysort的Python项目详细描述

受Spotify’s luigi framework启发的记忆中实验性Pythonmapreduce。

https://travis-ci.org/geowurster/tinymr.svg?branch=master

https://coveralls.io/repos/geowurster/tinymr/badge.svg?branch=master

单词计数示例

目前唯一的mapreduce实现是在内存和串行中，但是将添加具有并行映射和reduce阶段的实现。

fromcollectionsimportCounterimportjsonimportreimportsysfromtinymr.memoryimportMemMapReduceclassWordCount(MemMapReduce):"""
    The go-to MapReduce example.

    Don't worry, a better example is on its way:
    https://github.com/geowurster/tinymr/issues/11
    """# Counting word occurrence does not benefit from sorting post-map or# post-reduce and our `final_reducer()` doesn't care about key order# so we can disable sorting for a speed boost.sort_map=Falsesort_reduce=Falsesort_final_reduce=Falsedef__init__(self):"""
        Stash a regex to strip off punctuation so we can grab it later.
        """self.pattern=re.compile('[\W_]+')defmapper(self,line):"""
        Take a line of text from the input file and figure out how many
        times each word appears.

        An alternative, simpler implementation would be:

            def mapper(self, item):
                for word in item.split():
                    word = self.pattern.sub('', word)
                    if word:
                        yield word.lower(), 1

        This simpler example is more straightforward, but holds more data
        in-memory.  The implementation below does more work per line but
        potentially has a smaller memory footprint.  Like anything
        MapReduce the implementation benefits greatly from knowing a lot
        about the input data.
        """# Normalize all words to lowercaseline=line.lower().strip()# Strip off punctuationline=[self.pattern.sub('',word)forwordinline]# Filter out empty stringsline=filter(lambdax:x!='',line)# Return an iterable of `(word, count)` pairsreturnCounter(line).items()defreducer(self,key,values):"""
        Just sum the number of times each word appears in the entire
        dataset.

        At this phase `key` is a word and `values` is an iterable producing
        all of the values for that word from the map phase.  Something like:

            key = 'the'
            values = (1, 1, 2, 2, 1)

        The word `the` appeared once on 3 lines and twice on two lines for
        a total of `7` occurrences, so we yield:

            ('the', 7)
        """yieldkey,sum(values)defoutput(self,pairs):"""
        Normally this phase is where the final dataset is written to disk,
        but since we're operating in-memory we just want to re-structure as
        a dictionary.

        `pairs` is an iterator producing `(key, iter(values))` tuples from
        the reduce phase, and since we know that we only produced one key
        from each reduce we want to extract it for easier access later.
        """return{k:tuple(v)[0]fork,vinpairs}wc=WordCount()withopen('LICENSE.txt')asf:out=wc(f)print(json.dumps(out,indent=4,sort_keys=True))

截断输出：

{"a":1,"above":2,"advised":1,"all":1,"and":8,"andor":1}

字数统计工作流

在内部，工作流如下：

输入数据：

$ head -10 LICENSE.txt

New BSD License

Copyright (c) 2015, Kevin D. Wurster
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this
  list of conditions and the following disclaimer.

map

计算每行中每个单词的出现次数。

# Input lineline='Copyright (c) 2015, Kevin D. Wurster'# Sanitized wordswords=['Copyright','c','2015','Kevin','D','Wurster']# Return tuples with word as the first element and count as the secondpairs=[('Copyright',1),('c',1),('2015',1),('Kevin',1),('D',1),('Wurster',1)]

分区

按word组织所有(word, count)对word键是保存在这一点，以防数据被排序。排序抓住倒数第二个键，以便数据可以在一个键上进行分区，并在另一个键上使用 (word, sort, count)倒数第二个键用于排序，因此下面出现的匹配word只因为没有给出sort键。

出现在多行输入文本中的单词有多个 (word, count)对。2中的count表示在一行中出现两次，但是我们的输入数据没有这个条件下面的输出被截断字典值是包含允许排序键的元组，这在别处有解释

{'2015':[(1,)]'above':[(1,)]'all':[(1,)]'and':[(1,),(1,),(1,)]'are':[(1,),(1,)]'binary':[(1,)]'bsd':[(1,)]'c':[(1,)]'code':[(1,)]}

减少

每个word的总和count。

# The ``reducer()`` receives a key and an iterator of valueskey='the'values=(1,1,1)defreducer(key,values):yieldkey,sum(values)

分区

减速机不需要产生与给定的密钥相同的密钥，因此数据再次按键分区，这对于这个wordcount示例来说是多余的。再次保存这些键，以防数据被排序并且只匹配word 因为没有提供可选的sort密钥。下面的输出被截断。

{'2015':[(1,)]'above':[(1,)]'all':[(1,)]'and':[(3,)]'are':[(2,)]'binary':[(1,)]'bsd':[(1,)]'c':[(1,)]'code':[(1,)]}

输出

默认实现是从 final_reducer()，看起来像：

values=[('the',(3,)),('in',(1,),]

但是字典更有用，我们知道我们只有一本 reduce阶段中每个word的值，因此我们可以提取该值出一本字典。

return{k:tuple(v)[0]fork,vinvalues}

由于value键中的数据是 _始终是一个iterable对象，但可以是迭代器呼叫tuple() 扩展iterable并让我们得到第一个元素

开发

$ git clone https://github.com/geowurster/tinymr.git
$cd tinymr
$ pip install -e .\[dev\]$ py.test tests --cov tinymr --cov-report term-missing

许可证

见LICENSE.txt

变更日志

见CHANGES.md

欢迎加入QQ群-->： 979659372

tinysort 0.1

tinysort的Python项目详细描述

单词计数示例

字数统计工作流

开发

许可证

变更日志

推荐PyPI第三方库

rkquer

Sweepatic-urlnorm

aio-space-track-api

opus-api

aioga

cloudtoken-plugin.okta

etianen-cms

diapysef

idna_ssl

openlabs_payment_gatewa

wadus

django-todopago

ties-lib

cgettext

k5test

导航栏

项目链接

标签

维护者

最新PyPI项目

最新Python常见问题

tinysort 0.1

tinysort的Python项目详细描述

单词计数示例

字数统计工作流

开发

许可证

变更日志

推荐PyPI第三方库

rkquer

Sweepatic-urlnorm

aio-space-track-api

opus-api

aioga

cloudtoken-plugin.okta

etianen-cms

diapysef

idna_ssl

openlabs_payment_gatewa

wadus

django-todopago

ties-lib

cgettext

k5test

导 航 栏

项目 链接

标 签

维护者

最新PyPI项目

最新Python常见问题

导航栏

项目链接

标签