How to implement the MapReduce pairs pattern in Python

Posted 2024-05-18 23:31:41


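I am trying to implement the MapReduce pairs pattern in Python. I need to check whether a word occurs in a text file, then find the word next to it and yield the two words as a pair. I keep running into: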

neighbors = words[words.index(w) + 1]
ValueError: substring not found


File 工作_试用版.py:

from mrjob.job import MRJob

class MRCountest(MRJob):
    # Word count
    def mapper(self, _, document):
        # Assume document is a list of words.
        #words = []
        words = document.strip()

        w = "the"
        neighbors = words.index(w)
        for word in words:
            #searchword = "the"
            #wor.append(str(word))
            #neighbors = words[words.index(w) + 1]
            yield(w,1)

    def reducer(self, w, values):
        yield(w,sum(values))

if __name__ == '__main__':
    MRCountest.run()

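Edit: I am trying to use the pairs pattern to search a document for every instance of a specific word and, each time, find the word next to it, then yield one pair per instance. For example, find each instance of "the" and the word next to it: [the], [book], [the], [cat], etc.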

from mrjob.job import MRJob

class MRCountest(MRJob):
    # Word count
    def mapper(self, _, document):
        # Assume document is a list of words.
        #words = []
        words = document.split(" ")

        want = "the"
        for w, want in enumerate(words, 1):
            if (w+1) < len(words):
                neighbors = words[w + 1]
                pair = (want, neighbors)
                for u in neighbors:
                    if want is "the":
                        #pair = (want, neighbors)
                        yield(pair),1
        #neighbors = words.index(w)
        #for word in words:

            #searchword = "the"
            #wor.append(str(word))
            #neighbors = words[words.index(w) + 1]
            #yield(w,1)

    #def reducer(self, w, values):
        #yield(w,sum(values))

if __name__ == '__main__':
    MRCountest.run()

As it stands, the output contains every word pair, and the same pair is emitted multiple times.

This image shows the pseudo code I'm trying to implement


1 answer
User
#1 · Posted 2024-05-18 23:31:41

When you use words.index("the"), you only get the first instance of "the" in the list or string and, as you have discovered, you get an error if "the" is not present.
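A minimal sketch of that behavior, using a made-up sample sentence. The helper name find_positions is hypothetical and only here for illustration; it uses enumerate so that every occurrence is visited and a missing word gives an empty list instead of an exception.

```python
def find_positions(words, target):
    """Return the index of every occurrence of target (hypothetical helper)."""
    return [i for i, word in enumerate(words) if word == target]


words = "the cat sat on the mat".split()

# list.index stops at the first match.
first = words.index("the")

# list.index raises ValueError when the word is absent.
try:
    words.index("dog")
    missing = False
except ValueError:
    missing = True

# enumerate finds every match and never raises for a missing word.
all_positions = find_positions(words, "the")
no_positions = find_positions(words, "dog")
```

The same applies to str.index, which is where the "substring not found" message in the question comes from.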

You also mention that you are trying to produce pairs, but you only yield a single word.

I think what you want is something like this:

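def get_word_pairs(words):
    # Yield ((word, neighbor), 1) for each adjacent pair, in both directions.
    for i, word in enumerate(words):
        if i + 1 < len(words):
            # Pair the word with its right-hand neighbor.
            yield (word, words[i + 1]), 1
        if i > 0:
            # Pair the word with its left-hand neighbor (i - 1 >= 0).
            yield (word, words[i - 1]), 1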

This assumes you are interested in neighbors in both directions. (If not, you only need the first yield.)
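The generator above emits a pair for every word. If only pairs that start with one word such as "the" are wanted, as in the question's edit, the idea might be narrowed as sketched below. This is plain Python that imitates the map and reduce steps without mrjob; the names map_pairs and reduce_pairs and the sample lines are assumptions made for illustration.

```python
from collections import defaultdict


def map_pairs(line, target="the"):
    """Mapper sketch: emit ((target, next_word), 1) for each occurrence of target."""
    words = line.split()
    # Stop one short of the end so words[i + 1] always exists.
    for i, word in enumerate(words[:-1]):
        if word == target:
            yield (word, words[i + 1]), 1


def reduce_pairs(mapped):
    """Reducer sketch: sum the counts emitted for each pair."""
    totals = defaultdict(int)
    for pair, count in mapped:
        totals[pair] += count
    return dict(totals)


lines = ["the book is on the table", "the cat saw the book"]
mapped = [kv for line in lines for kv in map_pairs(line)]
counts = reduce_pairs(mapped)
```

In an MRJob class, the body of map_pairs would go in mapper(self, _, line), and the reducer would yield each pair with sum(values), much like the commented-out reducer in the question.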

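Finally, since you use document.strip(), I suspect the document is actually a string rather than a list. If so, you can use words = document.split(" ") to get the list of words, assuming there is no punctuation.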
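If the input does contain punctuation or repeated spaces, a slightly more forgiving tokenizer may help. The sketch below compares three options on a made-up line; the regular expression is only one possible choice and treats a word as a run of ASCII letters or apostrophes, which is an assumption about the data.

```python
import re

line = "The  book, the cat."

# split(" ") yields an empty string for the doubled space and keeps punctuation attached.
naive = line.split(" ")

# split() with no argument collapses any run of whitespace.
collapsed = line.split()

# Lowercasing first lets "The" match "the"; the regex drops commas and periods.
tokens = re.findall(r"[a-z']+", line.lower())
```

With the regex version, a comparison such as word == "the" matches both "The" and "the," in the original text.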
