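Python Hadoop streaming and importing packages that are not installed on the data nodes
I am trying to import the scikit-image library in a Python Hadoop streaming job. I have read a few StackOverflow posts on this, such as this one and this one, but none of them solved my problem.
What I am really asking is this: even if I use the -file option to ship a zip/mod file of the packaged scikit-image folder to every node, how does the Python script running on a data node know how to extract the package and import it in the code? Note that scikit-image is installed on the master node, and I can run my experiments locally there.
My script is simple: it is the classic word count example for Python streaming, with one extra "import skimage" added to mapper.py. http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python
My command is: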
hadoop jar hadoop-streaming.jar \
-file mapper.py -mapper mapper.py \
-file reducer.py -reducer reducer.py \
-file ./skimage.mod \
-input /user/text/* \
-output /user/textoutput/
Screen output:
packageJobJar: [mapper.py, reducer.py, ./skimage.zip] [/usr/lib/gphd/hadoop-mapreduce-2.0.2_alpha_gphd_2_0_1_0/hadoop-streaming-2.0.2-alpha-gphd-2.0.1.0.jar] /tmp/streamjob6159562120374599467.jar tmpDir=null
14/04/04 18:00:02 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is inited.
14/04/04 18:00:02 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is started.
14/04/04 18:00:03 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is inited.
14/04/04 18:00:03 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is started.
14/04/04 18:00:03 WARN snappy.LoadSnappy: Snappy native library not loaded
14/04/04 18:00:03 INFO mapred.FileInputFormat: Total input paths to process : 1
14/04/04 18:00:03 INFO mapreduce.JobSubmitter: number of splits:2
14/04/04 18:00:03 WARN conf.Configuration: mapred.jar is deprecated. Instead, use mapreduce.job.jar
14/04/04 18:00:03 WARN conf.Configuration: mapred.cache.files is deprecated. Instead, use mapreduce.job.cache.files
14/04/04 18:00:03 WARN conf.Configuration: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
14/04/04 18:00:03 WARN conf.Configuration: mapred.mapoutput.value.class is deprecated. Instead, use mapreduce.map.output.value.class
14/04/04 18:00:03 WARN conf.Configuration: mapred.job.name is deprecated. Instead, use mapreduce.job.name
14/04/04 18:00:03 WARN conf.Configuration: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
14/04/04 18:00:03 WARN conf.Configuration: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
14/04/04 18:00:03 WARN conf.Configuration: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
14/04/04 18:00:03 WARN conf.Configuration: mapred.cache.files.timestamps is deprecated. Instead, use mapreduce.job.cache.files.timestamps
14/04/04 18:00:03 WARN conf.Configuration: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
14/04/04 18:00:03 WARN conf.Configuration: mapred.mapoutput.key.class is deprecated. Instead, use mapreduce.map.output.key.class
14/04/04 18:00:03 WARN conf.Configuration: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
14/04/04 18:00:03 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1384839777050_0106
14/04/04 18:00:04 INFO client.YarnClientImpl: Submitted application application_1384839777050_0106 to ResourceManager at hdm3.gphd.local/172.28.9.252:8032
14/04/04 18:00:04 INFO mapreduce.Job: The url to track the job: http://hdm3.gphd.local:8088/proxy/application_1384839777050_0106/
14/04/04 18:00:04 INFO mapreduce.Job: Running job: job_1384839777050_0106
14/04/04 18:00:08 INFO mapreduce.Job: Job job_1384839777050_0106 running in uber mode : false
14/04/04 18:00:08 INFO mapreduce.Job: map 0% reduce 0%
14/04/04 18:00:12 INFO mapreduce.Job: Task Id : attempt_1384839777050_0106_m_000001_0, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:320)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:533)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
I checked the error logs of the Hadoop job and found that it fails on "import skimage", which means the data nodes cannot find the library.
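To make the question concrete, here is a rough sketch of what I imagine mapper.py would have to do before the import. It rests on assumptions I have not verified: that the archive shipped with -file ends up in the task's current working directory under the name skimage.zip (the name shown in the packageJobJar line), and that the archive has the skimage/ folder at its top level. The helper name add_archive_to_path and the directory name deps are made up for illustration. I also do not know whether this would be enough, since scikit-image contains compiled extensions and depends on other packages that would have to be present and compatible on each node.

```python
import os
import sys
import zipfile


def add_archive_to_path(archive_path, extract_dir="deps"):
    """Unpack a zip archive and make its contents importable.

    Hypothetical helper: extracts archive_path into extract_dir and puts
    that directory at the front of sys.path, so that a package stored at
    the top level of the archive can be imported afterwards.
    """
    extract_dir = os.path.abspath(extract_dir)
    if not os.path.isdir(extract_dir):
        os.makedirs(extract_dir)

    # Extract to disk instead of importing straight from the zip, because
    # compiled extension modules cannot be loaded from inside an archive.
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(extract_dir)

    if extract_dir not in sys.path:
        sys.path.insert(0, extract_dir)
    return extract_dir


# Intended use at the top of mapper.py, before the import that fails today
# (assumes skimage.zip is in the task's working directory):
#     add_archive_to_path("skimage.zip")
#     import skimage
```

Is something along these lines what the streaming job expects me to do by hand, or is there a supported way to have the archive unpacked on the nodes?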