Creating a group by with MapReduce in App Engine

Posted 2024-04-19 09:12:17


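I'm looking for a way to do a group-by operation on a Datastore query using MapReduce. AFAIK App Engine does not support GROUP BY in GQL, and other developers suggest MapReduce as a good alternative.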

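I downloaded the source code and have been studying the demo code, which I tried to adapt to my case, but without success. Below is what I tried. Maybe everything I did is wrong. If anyone can help me, I'd be grateful.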

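What I want to do: I have a bunch of contacts in the datastore, and each contact has a date. Many contacts are duplicated with the same date. I want a simple group by that collapses identical contacts with the same date into one.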

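For example:

Say I have contacts like these:

Contact name: Foo1, date: 2012-10-01
Contact name: Foo2, date: 2012-05-02
Contact name: Foo1, date: 2012-10-01

After the MapReduce operation the result should look like this:

Contact name: Foo1, date: 2012-10-01
Contact name: Foo2, date: 2012-05-02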

For the group-by functionality, I think the word count demo does what I need.
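To make the word-count analogy concrete, here is a minimal pure-Python sketch of the three phases applied to the example contacts. It does not use the App Engine mapreduce library at all; the names `map_contact`, `shuffle`, and `reduce_group` are made up for illustration, and the `count` field is an addition that the question does not ask for.

```python
from collections import defaultdict

# Sample rows mirroring the example contacts: (name, date).
contacts = [
    ("Foo1", "2012-10-01"),
    ("Foo2", "2012-05-02"),
    ("Foo1", "2012-10-01"),
]


def map_contact(contact):
    """Map step: emit the grouping key (name, date) with a value of 1."""
    name, date = contact
    yield (name, date), 1


def shuffle(pairs):
    """Shuffle step: collect every value emitted under the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_group(key, values):
    """Reduce step: one output row per key, plus the size of the group."""
    name, date = key
    return {"contact_name": name, "date": date, "count": len(values)}


pairs = [pair for contact in contacts for pair in map_contact(contact)]
grouped = [reduce_group(key, values)
           for key, values in sorted(shuffle(pairs).items())]

for row in grouped:
    print(row)
```

Word count is the same pattern with the word as the key; here the key is the (name, date) pair, so duplicates collapse into a single row.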

EDIT

The only thing that shows up in the logs is:

/mapreduce/pipeline/run 200
Running GetContactData.WordCountPipeline((u'2012-02-02',), *{})#da26a9b555e311e19b1e6d324d450c1a

END EDIT

If I'm doing something wrong, or if this is the wrong way to do a group by with MapReduce, please help me understand how to do it properly.

Here is my code:

from Contacts import Contacts
from google.appengine.ext import webapp
from google.appengine.ext.webapp import template
from google.appengine.ext.webapp.util import run_wsgi_app
from google.appengine.api import mail
from google.appengine.ext.db import GqlQuery
from google.appengine.ext import db


from google.appengine.api import taskqueue
from google.appengine.api import users

from mapreduce.lib import files
from mapreduce import base_handler
from mapreduce import mapreduce_pipeline
from mapreduce import operation as op
from mapreduce import shuffler

import simplejson, logging, re


class GetContactData(webapp.RequestHandler):

    # Get the calls based on the user id
    def get(self):
        contactId = self.request.get('contactId')
        query_contacts = Contact.all()
        query_contacts.filter('contact_id =', int(contactId))
        query_contacts.order('-timestamp_')
        contact_data = []
        if query_contacts != None:
            for contact in query_contacts:
                    pipeline = WordCountPipeline(contact.date)
                    pipeline.start()
                    record = { "contact_id":contact.contact_id,
                               "contact_name":contact.contact_name,
                               "contact_number":contact.contact_number,
                               "timestamp":contact.timestamp_,
                               "current_time":contact.current_time_,
                               "type":contact.type_,
                               "current_date":contact.date }
                    contact_data.append(record)

        self.response.headers['Content-Type'] = 'application/json'
        self.response.out.write(simplejson.dumps(contact_data)) 

class WordCountPipeline(base_handler.PipelineBase):
  """A pipeline to run Word count demo.

  Args:
    blobkey: blobkey to process as string. Should be a zip archive with
      text files inside.
  """

  def run(self, date):
    output = yield mapreduce_pipeline.MapreducePipeline(
        "word_count",
        "main.word_count_map",
        "main.word_count_reduce",
        "mapreduce.input_readers.DatastoreInputReader",
        "mapreduce.output_writers.BlobstoreOutputWriter",
        mapper_params={
            "date": date,
        },
        reducer_params={
            "mime_type": "text/plain",
        },
        shards=16)
    yield StoreOutput("WordCount", output)

class StoreOutput(base_handler.PipelineBase):
  """A pipeline to store the result of the MapReduce job in the database.

  Args:
    mr_type: the type of mapreduce job run (e.g., WordCount, Index)
    encoded_key: the DB key corresponding to the metadata of this job
    output: the blobstore location where the output of the job is stored
  """

  def run(self, mr_type, output):
      logging.info(output) # here I should append the grouped duration in JSON


1 Answer

Answer #1 · posted 2024-04-19 09:12:17

I based my solution on the code @autumngard provided in this question, modified it for my purposes, and it worked.
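The linked code is not reproduced on this page. As a hedged sketch of what a mapper/reducer pair for this case might look like, the functions below follow the convention of the library's word count demo as I understand it: the map function receives one entity and yields `(key, value)` pairs, and the reduce function receives a key with its list of values and yields output strings. The names `contact_group_map`, `contact_group_reduce`, and the stand-in `FakeContact` are hypothetical; the fields `contact_name` and `date` are taken from the question's code.

```python
from collections import namedtuple

# Stand-in for the datastore entity so the sketch runs without App Engine.
FakeContact = namedtuple("FakeContact", ["contact_name", "date"])


def contact_group_map(contact):
    """Emit one (key, value) pair per entity; the key defines the group."""
    # Assumes contact names never contain the '|' separator.
    key = "%s|%s" % (contact.contact_name, contact.date)
    yield (key, "")


def contact_group_reduce(key, values):
    """Called once per distinct key; emit a single line for the group."""
    name, date = key.split("|", 1)
    yield "%s,%s,%d\n" % (name, date, len(values))


# Local check of the two functions, without the mapreduce runtime.
sample = [FakeContact("Foo1", "2012-10-01"),
          FakeContact("Foo1", "2012-10-01")]
keys = [k for c in sample for k, _ in contact_group_map(c)]
lines = list(contact_group_reduce(keys[0], [""] * len(keys)))
print(lines)
```

In a real job these functions would be referenced by dotted path, the way the question passes "main.word_count_map". It is probably also worth starting one pipeline for the whole query rather than one per contact inside the loop. As far as I know, DatastoreInputReader also expects an `entity_kind` entry in `mapper_params`, which the question's code does not pass.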
