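'X' object has no attribute 'functionName' - PySpark / Python
I've just started with PySpark, so I'm learning by experimenting.
I'm trying to write a unit test (unittest), but I'm running into an error. Here is the code: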
def drop_duplicates(df):
    df = df.dropDuplicates(df)
    return df

import unittest

class TestNotebook(unittest.TestCase):
    def test_drop_duplicates(self):
        data = (
            ['1', '2020-01-01 00:00:00', '2020-01-01 01:00:00', '2', '1'],
            ['1', '2020-01-01 00:00:00', '2020-01-01 01:00:00', '3', '1'],
            ['1', '2020-01-01 00:00:00', '2020-01-01 01:00:00', '2', '2'],
            ['2', '2020-01-01 00:00:00', '2020-01-01 01:00:00', '2', '1']
        )
        columns = ["ID", "TimeFrom", "TimeTo", "Serial", "Code"]
        df = spark.createDataFrame(data, columns)
        expected_data = [
            ('1', '2020-01-01 00:00:00', '2020-01-01 01:00:00', '2', '1'),
            ('1', '2020-01-01 00:00:00', '2020-01-01 01:00:00', '2', '2')
        ]
        self.assertEqual(drop_duplicates(df), expected_data)

res = unittest.main(argv=[''], verbosity=2, exit=False)
(The assertion itself may well fail, but I'll find that out once this error is resolved.) For now, this is the only error I get:
  File "/tmp/ipykernel_15937/2907449366.py", line 2, in drop_duplicates
    df = df.dropDuplicates(df)
         ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/spark/python/pyspark/sql/dataframe.py", line 4207, in dropDuplicates
    raise PySparkTypeError(
pyspark.errors.exceptions.base.PySparkTypeError: [NOT_LIST_OR_TUPLE] Argument `subset` should be a list or tuple, got DataFrame.
Am I missing something? I've been reading the documentation for this method, but I still can't figure it out.
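
For reference, the error message says that `subset` should be a list or tuple, which suggests the method expects column names rather than the DataFrame itself. Below is a minimal sketch of how the call might look under that reading. The optional `subset` parameter on `drop_duplicates`, the helpers `rows_as_sorted_tuples` and `demo`, and the two-column sample data are all hypothetical and not part of the code above. The sketch also assumes that collecting rows into plain tuples is an acceptable way to compare a DataFrame against a Python list in a test.

```python
def drop_duplicates(df, subset=None):
    """Drop duplicate rows, optionally comparing only the columns in `subset`."""
    if subset is None:
        # No argument: rows are compared across all columns.
        return df.dropDuplicates()
    # Pass column names (not the DataFrame) as a list.
    return df.dropDuplicates(list(subset))


def rows_as_sorted_tuples(df):
    """Collect a DataFrame into a sorted list of plain tuples (hypothetical test helper)."""
    return sorted(tuple(row) for row in df.collect())


def demo():
    # Imported here so the helpers above can be defined without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").appName("dedup-demo").getOrCreate()
    # Made-up sample data: the first two rows are exact duplicates.
    df = spark.createDataFrame(
        [("1", "a"), ("1", "a"), ("2", "b")],
        ["ID", "Code"],
    )
    print(rows_as_sorted_tuples(drop_duplicates(df)))
    print(rows_as_sorted_tuples(drop_duplicates(df, ["ID"])))
    spark.stop()
```

Calling `demo()` in an environment where PySpark is installed runs both variants. In a unit test, comparing `rows_as_sorted_tuples(result)` with a sorted list of expected tuples avoids comparing a DataFrame object directly to a list, which `assertEqual` would treat as unequal.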