<p>I finally implemented a solution using a join. To avoid an exception in Spark, I had to attach a dummy 0 value to each department:</p>
<pre><code>employee = [['Raffery',31], ['Jones',33], ['Heisenberg',33], ['Robinson',34], ['Smith',34]]
department = [31,33]
# invert id and name to get id as the key
employee = sc.parallelize(employee).map(lambda e: (e[1],e[0]))
# add a 0 value to avoid an exception
department = sc.parallelize(department).map(lambda d: (d,0))
employee.join(department).map(lambda e: (e[1][0], e[0])).collect()
# output: [('Jones', 33), ('Heisenberg', 33), ('Raffery', 31)]
</code></pre>
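<p>The dummy value is only needed because <code>join</code> requires key-value pairs on both sides. If all you want is to keep employees whose department appears in the list, the same result can be had by filtering against a set instead; in Spark that would be something like <code>employee.filter(lambda e: e[1] in dept_set)</code>. A minimal pure-Python sketch of that logic (plain lists standing in for the RDDs, just to illustrate the idea):</p>
<pre><code>employee = [['Raffery', 31], ['Jones', 33], ['Heisenberg', 33],
            ['Robinson', 34], ['Smith', 34]]
department = [31, 33]

# membership lookups are O(1) against a set; in Spark you would
# broadcast this set to the workers before filtering
dept_set = set(department)

# keep only employees whose department id is in the department list
result = [(name, dept) for name, dept in employee if dept in dept_set]
print(result)  # [('Raffery', 31), ('Jones', 33), ('Heisenberg', 33)]
</code></pre>
<p>Unlike the join, this preserves the original ordering of <code>employee</code> and avoids building the (id, 0) pairs entirely.</p>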