我有以下数据帧:
+--------------------+-------+------------------+--------------------+--------------------+--------------------+----------------+-----------+--------------+----------+
| user| sku| category| query| click_time| query_time| city| country| location|time_spent|
+--------------------+-------+------------------+--------------------+--------------------+--------------------+----------------+-----------+--------------+----------+
|212dffe34129f84ac...|2842092|pcmcat209000050008| Hp tablet|2011-08-21T11:09:...|2011-08-21T11:07:...| Reusel|Netherlands|51.3625,5.1653| 13|
|fc3540111a0aaca56...|2969477|pcmcat247400050000|Toshiba satellite...|2011-09-08T22:22:...|2011-09-08T22:22:...| Reusel|Netherlands|51.3625,5.1653| 60|
|3eb03056dd77ccad8...|6974946| abcat0410012| oltmpus|2011-08-28T19:11:...|2011-08-28T19:08:...| null|Netherlands| 52.3667,4.9| 0|
|0ae72d93c4b9394d5...|9755322|pcmcat247400050001| hp laptop|2011-09-05T15:34:...|2011-09-05T15:33:...|Roelofarendsveen|Netherlands|52.2033,4.6333| 4|
|dd99a827d4790eaf3...|2363054| abcat0101001| Samsung LCD 550|2011-09-01T10:29:...|2011-09-01T10:28:...| null|Netherlands| 52.3667,4.9| 42|
|fa7bf9fb76ef1abe3...|9643151| abcat0404004| memory stick|2011-09-06T17:17:...|2011-09-06T17:15:...| null|Netherlands| 52.3667,4.9| 0|
|037180033c5941c86...|9929926| cat02015| Castle|2011-08-28T22:07:...|2011-08-28T22:06:...| Tarragona| Spain| 41.1167,1.25| 6|
|45bebf83124cef122...|1303083| abcat0508009| microsoft office|2011-08-30T19:33:...|2011-08-30T19:31:...| Tarragona| Spain| 41.1167,1.25| 20|
|86fa480fe4ff0abac...|2842056|pcmcat209000050008|Inventory checker...|2011-09-06T18:17:...|2011-09-06T18:17:...| null| Belgium| 50.85,4.35| 37|
|0d1521a7bbe3973d6...|1014055| abcat0201010| Zune|2011-08-25T12:24:...|2011-08-25T12:23:...| null|Netherlands| 52.3667,4.9| 0|
我希望数据按查询名称分组。但不仅如此,我只想要来自比利时和荷兰的前10名查询名称。首先,我将数据分组如下:
from pyspark.sql.functions import countDistinct, avg, stddev
mydf= df.groupBy(df.query).count()
mydf= mydf.orderBy(mydf["count"].desc())
mydf= mydf.limit(11)
mydf = mydf.dropna()
mydf.show()
+--------------------+-----+
| query|count|
+--------------------+-----+
| Hp touchpad| 271|
| hp touchpad| 193|
| Touchpad| 151|
|LaborDay_Computer...| 138|
| touchpad| 128|
| Hp tablet| 88|
|LaborDay_Televisi...| 86|
| iPad| 65|
| Laptops| 63|
| Ipad| 61|
+--------------------+-----+
mydf= df.groupBy(when(col("country") == "Netherlands", df.groupBy(df.query))).count()
但我不知道如何与荷兰和比利时相匹配
您可以使用
filter
。我从您的数据集中获取了样本数据。但是逻辑也应该适用于您的数据集相关问题 更多 >
编程相关推荐