Running MRJob in an IPython notebook
I am trying to run the mrjob example in an IPython notebook:
from mrjob.job import MRJob

class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        yield key, sum(values)
and then run it with this code:
mr_job = MRWordFrequencyCount(args=["testfile.txt"])
with mr_job.make_runner() as runner:
    runner.run()
    for line in runner.stream_output():
        key, value = mr_job.parse_output_line(line)
        print key, value
but I get this error:
TypeError: <module '__main__' (built-in)> is a built-in class
Is there a way to run mrjob in an IPython notebook?
2 Answers
I haven't found a "perfect" way yet, but one thing you can do is create a notebook cell that uses the %%file magic to write the cell's contents to a file:
%%file wordcount.py
from mrjob.job import MRJob

class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        yield key, sum(values)
Then, in a later cell, use mrjob to run that file:
import wordcount
reload(wordcount)  # pick up any edits made to wordcount.py since the last import

mr_job = wordcount.MRWordFrequencyCount(args=['example.txt'])
with mr_job.make_runner() as runner:
    runner.run()
    for line in runner.stream_output():
        key, value = mr_job.parse_output_line(line)
        print key, value
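Note that I named my file wordcount.py and imported the MRWordFrequencyCount class from the wordcount module; the file name and the module name must match. Also, Python caches imported modules: when you change wordcount.py, IPython does not reload the module and keeps using the old cached version. That is why I call reload() here.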
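The caching behaviour described above can be reproduced without mrjob. The following is only a minimal sketch, written for Python 3, where reload() is no longer a builtin and lives in importlib. The module name demo_wordcount and the temporary directory are made up for the illustration, and the plain file write stands in for what the %%file cell does.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Demo only: skip .pyc files so a stale bytecode cache cannot mask the edit.
sys.dont_write_bytecode = True

# Hypothetical scratch directory and module name, used only for this sketch.
workdir = Path(tempfile.mkdtemp())
sys.path.insert(0, str(workdir))
module_path = workdir / "demo_wordcount.py"

# Stand-in for the %%file cell: write source text to a .py file.
module_path.write_text("LABEL = 'old'\n")
import demo_wordcount
first = demo_wordcount.LABEL

# Edit the file, as you would by re-running the %%file cell.
module_path.write_text("LABEL = 'new version'\n")

# A second import is served from sys.modules, so the edit is not seen.
import demo_wordcount
cached = demo_wordcount.LABEL

# reload() re-executes the module source and picks up the change.
importlib.invalidate_caches()
reloaded = importlib.reload(demo_wordcount).LABEL

print(first, cached, reloaded)
```

In a Python 3 notebook, the reload(wordcount) call from the cell above would be written as importlib.reload(wordcount).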
Reference: https://groups.google.com/d/msg/mrjob/CfdAgcEaC-I/8XfJPXCjTvQJ
Update (shorter)
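To make the second notebook cell shorter, you can run mrjob by calling the command line from the notebook. For this to work, the script has to end with the usual "if __name__ == '__main__': MRWordFrequencyCount.run()" block, which the wordcount.py cell above does not include. Do not name the script itself mrjob.py, because it would then shadow the mrjob package when the script imports it.
! python wordcount.py shakespeare.txt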
Reference: http://jupyter.cs.brynmawr.edu/hub/dblank/public/Jupyter%20Magics.ipynb
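If you want the results of the command-line run back in Python variables rather than only printed in the cell, one option is to launch the script with subprocess instead of the ! shortcut. This is a rough sketch, not part of mrjob: run_job_script is a hypothetical helper, and it assumes the job prints one "JSON key, tab, JSON value" pair per line on stdout, which is what mrjob's default output protocol produces.

```python
import json
import subprocess
import sys


def run_job_script(script_path, *job_args):
    """Run a job script in a child process and parse its stdout.

    Hypothetical helper: assumes every non-empty output line has the form
    <JSON key><TAB><JSON value>.
    """
    completed = subprocess.run(
        [sys.executable, script_path] + list(job_args),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,   # keep the job's log output out of the results
        universal_newlines=True,  # decode bytes to str
        check=True,               # raise CalledProcessError on a non-zero exit
    )
    results = []
    for line in completed.stdout.splitlines():
        if not line.strip():
            continue
        key, value = line.split("\t", 1)
        results.append((json.loads(key), json.loads(value)))
    return results
```

With the wordcount.py script from above, a call such as run_job_script("wordcount.py", "shakespeare.txt") would return a list of (key, value) tuples. The list form is used instead of a dict because JSON keys can be lists, which are not hashable in Python.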