<p>Here is why your approach does not work. First, you are trying to get an integer from a <a href="https://spark.apache.org/docs/2.3.1/api/python/pyspark.sql.html#pyspark.sql.Row" rel="noreferrer">Row</a> object. The output of your collect looks like this:</p>
<pre><code>>>> mvv_list = mvv_count_df.select('mvv').collect()
>>> mvv_list[0]
Out: Row(mvv=1)
</code></pre>
<p>If you access the field by name, like this:</p>
<pre><code>>>> firstvalue = mvv_list[0].mvv
>>> firstvalue
Out: 1
</code></pre>
<p>You will get the <code>mvv</code> value. If you want all the values of the column as a list, you can do something like this:</p>
<pre><code>>>> mvv_array = [int(row.mvv) for row in mvv_list]
>>> mvv_array
Out: [1,2,3,4]
</code></pre>
<p>But if you try the same for the other column, you get:</p>
<pre><code>>>> mvv_count = [int(row.count) for row in mvv_count_df.collect()]
Out: TypeError: int() argument must be a string or a number, not 'builtin_function_or_method'
</code></pre>
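<p>This happens because <code>count</code> is a built-in method of <code>Row</code> (inherited from <code>tuple</code>), and the column has the same name, so attribute access returns the method instead of the column value. A workaround is to rename the <code>count</code> column to <code>_count</code>:</p>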
<pre><code>>>> renamed_df = mvv_count_df.selectExpr("mvv as mvv", "count as _count")
>>> mvv_count = [int(row._count) for row in renamed_df.collect()]
</code></pre>
<p>However, this workaround is not needed, because you can access the column using dictionary syntax:</p>
<pre><code>>>> mvv_array = [int(row['mvv']) for row in mvv_count_df.collect()]
>>> mvv_count = [int(row['count']) for row in mvv_count_df.collect()]
</code></pre>
<p>And it will finally work!</p>
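<p>To see the name collision in isolation, here is a minimal sketch that builds a single <code>Row</code> by hand. It assumes <code>pyspark</code> is installed; the values are made up for illustration and no cluster or DataFrame is involved:</p>
<pre><code>from pyspark.sql import Row

# Hypothetical row with made-up values; no SparkSession is needed to build a Row.
row = Row(mvv=1, count=5)

# Row subclasses tuple, so attribute access finds the tuple method named count first.
print(callable(row.count))

# Bracket access looks the field up by name and returns the stored value.
print(row['count'])

# asDict() converts the row into a plain dict keyed by column name.
print(row.asDict()['count'])
</code></pre>
<p>The same applies to any column whose name matches an existing attribute of <code>Row</code> or <code>tuple</code> (for example <code>index</code>), so bracket access is probably the safer default.</p>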