Bootstrapping libraries on EMR with Python MRJob

1 vote
2 answers
2055 views
Asked 2025-04-18 05:12

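Problem description:

I am trying to run a map-reduce job on Amazon EMR using the Python MRJob library, but I am having trouble installing the required libraries and packages on the nodes.

Details:

Here is the simple Python mrjob code I wrote: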

    import re
    from mrjob.job import MRJob
    from sentClassifier import sentClassify
    import nltk

    .. do something ..

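Some libraries, such as NLTK, need to be imported, along with modules I wrote myself, such as from sentClassifier import sentClassify.

I would like to know the best way to install these modules and packages on the EMR nodes so that the job can use them. My code runs fine on my local machine.

Here is my mrjob.conf file: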

    runners:
      emr:
        aws_access_key_id: ***
        aws_secret_access_key: ***
        ec2_core_instance_type: m1.large
        ec2_key_pair: mykey
        ec2_key_pair_file: mykey.pem
        num_ec2_core_instances: 5
        pool_wait_minutes: 2
        pool_emr_job_flows: true
        ssh_tunnel_is_open: true
        ssh_tunnel_to_job_tracker: true
      hadoop:
        setup:
          - virtualenv venv
          - . venv/bin/activate
          - pip install mr3po simplejson
          - sudo easy_install https://code.google.com/p/nltk/downloads/detail?name=nltk-2.0b9-py2.6.egg&can=2&q=

But the job fails.

I have consulted several resources and tried various approaches, without success.

Error log:

    Scanning SSH logs for probable cause of failure
    Probable cause of failure (from ssh://ec2-54-86-50-115.compute-1.amazonaws.com!172.31.19.60/mnt/var/log/hadoop/userlogs/job_201405030101_0006/attempt_201405030101_0006_m_000002_3/stderr):
    Traceback (most recent call last):
      File "obidroidMR.py", line 5, in <module>
        import nltk
    ImportError: No module named nltk
    (while reading from s3://mrjob-51b9493c1a467671/tmp/obidroidMR.shreyas.20140503.012933.336228/files/STDIN)
    Attempting to terminate job...
    Job appears to have already been terminated
    Killing our SSH tunnel (pid 12909)
    Traceback (most recent call last):
      File "obidroidMR.py", line 107, in <module>
        ObidroidReview.run()
      File "/Users/shreyas/anaconda/envs/obidroid/lib/python2.7/site-packages/mrjob/job.py", line 494, in run
        mr_job.execute()
      File "/Users/shreyas/anaconda/envs/obidroid/lib/python2.7/site-packages/mrjob/job.py", line 512, in execute
        super(MRJob, self).execute()
      File "/Users/shreyas/anaconda/envs/obidroid/lib/python2.7/site-packages/mrjob/launch.py", line 147, in execute
        self.run_job()
      File "/Users/shreyas/anaconda/envs/obidroid/lib/python2.7/site-packages/mrjob/launch.py", line 208, in run_job
        runner.run()
      File "/Users/shreyas/anaconda/envs/obidroid/lib/python2.7/site-packages/mrjob/runner.py", line 458, in run
        self._run()
      File "/Users/shreyas/anaconda/envs/obidroid/lib/python2.7/site-packages/mrjob/emr.py", line 809, in _run
        self._wait_for_job_to_complete()
      File "/Users/shreyas/anaconda/envs/obidroid/lib/python2.7/site-packages/mrjob/emr.py", line 1599, in _wait_for_job_to_complete
        raise Exception(msg)
    Exception: Job on job flow j-2R8G1Q3RIE9ED failed with status WAITING: Waiting after step failed
    Probable cause of failure (from ssh://ec2-54-86-50-115.compute-1.amazonaws.com!172.31.19.60/mnt/var/log/hadoop/userlogs/job_201405030101_0006/attempt_201405030101_0006_m_000002_3/stderr):
    Traceback (most recent call last):
      File "obidroidMR.py", line 5, in <module>
        import nltk
    ImportError: No module named nltk

Any help would be greatly appreciated.

2 Answers

0

Since Amazon Elastic MapReduce uses an AMI based on Amazon Linux, I confirmed that nltk can be installed on Amazon Linux AMI 2014.03.1 - ami-fb8e9292 (64-bit) as follows:

    sudo easy_install -U pip
    sudo easy_install -U distribute
    sudo pip install -U pyyaml nltk

You could try putting these three lines into your mrjob.conf file.

2

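The lines in your mrjob.conf that install packages are probably in the wrong place. For jobs that run on EMR, the relevant settings belong under emr:, not under hadoop: (the latter configures jobs that run on a local Hadoop installation).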
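To make the placement issue concrete, here is a minimal Python sketch that takes an abbreviated, hand-transcribed copy of the question's mrjob.conf as a plain dict and reports which runner sections actually carry install commands. The helper name runners_with_install_cmds is hypothetical and not part of mrjob; the option names checked are just the ones that appear in this thread.

```python
# Abbreviated, hand-transcribed copy of the mrjob.conf from the question.
QUESTION_CONF = {
    "runners": {
        "emr": {
            "ec2_core_instance_type": "m1.large",
            "num_ec2_core_instances": 5,
        },
        "hadoop": {
            "setup": [
                "virtualenv venv",
                ". venv/bin/activate",
                "pip install mr3po simplejson",
            ],
        },
    }
}

# Option names that hold install commands in this thread's configs.
INSTALL_KEYS = ("setup", "bootstrap_cmds")


def runners_with_install_cmds(conf, keys=INSTALL_KEYS):
    """Map each runner name to the install-style options it defines."""
    found = {}
    for runner, opts in (conf.get("runners") or {}).items():
        present = [key for key in keys if (opts or {}).get(key)]
        if present:
            found[runner] = present
    return found


if __name__ == "__main__":
    found = runners_with_install_cmds(QUESTION_CONF)
    print(found)
    if "emr" not in found:
        print("No install commands are configured for the emr runner.")
```

For the config in the question, only the hadoop section shows up, which matches the symptom: nothing tells the EMR nodes to install nltk.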

For simple Linux commands such as pip or apt-get, you can install packages like this:

    runners:
      emr:
        aws_access_key_id: ***
        # ... all the other stuff ...
        bootstrap_cmds:
        - sudo apt-get install -y python-boto
        - sudo pip install simplejson

I haven't tried installing NLTK specifically, so I can't help you there, but you should be able to install it the same way.
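As a rough sketch of what that might look like for NLTK, the Python below assembles an emr section whose bootstrap_cmds are the three install commands quoted in the other answer. Treat it as an illustration only: the commands are untested on EMR here, build_emr_conf and render_conf are hypothetical helper names, and the output is JSON on the assumption that your mrjob version accepts a JSON-formatted config (check its documentation). Newer mrjob releases also offer a bootstrap option, so the right option name may depend on the installed version.

```python
import json

# Install commands quoted in the other answer; untested on EMR here.
NLTK_BOOTSTRAP_CMDS = [
    "sudo easy_install -U pip",
    "sudo easy_install -U distribute",
    "sudo pip install -U pyyaml nltk",
]


def build_emr_conf(bootstrap_cmds, other_emr_opts=None):
    """Return an mrjob.conf-shaped dict with the commands under runners -> emr."""
    emr_opts = dict(other_emr_opts or {})
    emr_opts["bootstrap_cmds"] = list(bootstrap_cmds)
    return {"runners": {"emr": emr_opts}}


def render_conf(conf):
    """Serialize the config dict as indented JSON text."""
    return json.dumps(conf, indent=2, sort_keys=True)


if __name__ == "__main__":
    # ec2_key_pair is copied from the question; add the other emr options as needed.
    conf = build_emr_conf(NLTK_BOOTSTRAP_CMDS, {"ec2_key_pair": "mykey"})
    print(render_conf(conf))
```

The point is only that the commands sit under emr and run in the order listed; credentials and instance settings from the question would go alongside them in the same section.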

If the installation is likely to be more involved, I suggest using the EMR CLI to ssh into your master node:

    $ ./elastic-mapreduce -j JOB_FLOW_ID --ssh

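Then actually try installing the package there. Once you have found a sequence of commands that installs it successfully, you can copy and paste those commands straight into your mrjob.conf file.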
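While logged in on the node, one way to tell whether an install attempt worked is to try the imports the job needs with the same Python interpreter the job will use. A minimal sketch, assuming the module names from the question; missing_modules is a hypothetical helper, not something mrjob provides:

```python
import importlib


def missing_modules(names):
    """Return the names from `names` that fail to import."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Module names taken from the imports in the question's job file.
    required = ["nltk", "mrjob", "sentClassifier"]
    missing = missing_modules(required)
    if missing:
        print("Still missing: " + ", ".join(missing))
    else:
        print("All required modules import cleanly.")
```

If nltk still appears in the output after running your install commands, the job would hit the same ImportError shown in the question's log.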
