按两个因素分组的前10个术语

2024-04-28 14:39:15 发布

您现在位置:Python中文网/ 问答频道 /正文

我有以下数据帧:

+--------------------+-------+------------------+--------------------+--------------------+--------------------+----------------+-----------+--------------+----------+
|                user|    sku|          category|               query|          click_time|          query_time|            city|    country|      location|time_spent|
+--------------------+-------+------------------+--------------------+--------------------+--------------------+----------------+-----------+--------------+----------+
|212dffe34129f84ac...|2842092|pcmcat209000050008|           Hp tablet|2011-08-21T11:09:...|2011-08-21T11:07:...|          Reusel|Netherlands|51.3625,5.1653|        13|
|fc3540111a0aaca56...|2969477|pcmcat247400050000|Toshiba satellite...|2011-09-08T22:22:...|2011-09-08T22:22:...|          Reusel|Netherlands|51.3625,5.1653|        60|
|3eb03056dd77ccad8...|6974946|      abcat0410012|             oltmpus|2011-08-28T19:11:...|2011-08-28T19:08:...|            null|Netherlands|   52.3667,4.9|         0|
|0ae72d93c4b9394d5...|9755322|pcmcat247400050001|           hp laptop|2011-09-05T15:34:...|2011-09-05T15:33:...|Roelofarendsveen|Netherlands|52.2033,4.6333|         4|
|dd99a827d4790eaf3...|2363054|      abcat0101001|     Samsung LCD 550|2011-09-01T10:29:...|2011-09-01T10:28:...|            null|Netherlands|   52.3667,4.9|        42|
|fa7bf9fb76ef1abe3...|9643151|      abcat0404004|        memory stick|2011-09-06T17:17:...|2011-09-06T17:15:...|            null|Netherlands|   52.3667,4.9|         0|
|037180033c5941c86...|9929926|          cat02015|              Castle|2011-08-28T22:07:...|2011-08-28T22:06:...|       Tarragona|      Spain|  41.1167,1.25|         6|
|45bebf83124cef122...|1303083|      abcat0508009|    microsoft office|2011-08-30T19:33:...|2011-08-30T19:31:...|       Tarragona|      Spain|  41.1167,1.25|        20|
|86fa480fe4ff0abac...|2842056|pcmcat209000050008|Inventory checker...|2011-09-06T18:17:...|2011-09-06T18:17:...|            null|    Belgium|    50.85,4.35|        37|
|0d1521a7bbe3973d6...|1014055|      abcat0201010|                Zune|2011-08-25T12:24:...|2011-08-25T12:23:...|            null|Netherlands|   52.3667,4.9|         0| 

我希望数据按查询名称分组。但不仅如此,我只想要来自比利时和荷兰的前10名查询名称。首先,我将数据分组如下:

from pyspark.sql.functions import countDistinct, avg, stddev


mydf= df.groupBy(df.query).count()


mydf= mydf.orderBy(mydf["count"].desc())

mydf= mydf.limit(11)

mydf = mydf.dropna()
mydf.show()

+--------------------+-----+
|               query|count|
+--------------------+-----+
|         Hp touchpad|  271|
|         hp touchpad|  193|
|            Touchpad|  151|
|LaborDay_Computer...|  138|
|            touchpad|  128|
|           Hp tablet|   88|
|LaborDay_Televisi...|   86|
|                iPad|   65|
|             Laptops|   63|
|                Ipad|   61|
+--------------------+-----+


mydf= df.groupBy(when(col("country") == "Netherlands", df.groupBy(df.query))).count()


但我不知道如何与荷兰和比利时相匹配


Tags: 数据dftimecountquerycountrynullhp
1条回答
网友
1楼 · 发布于 2024-04-28 14:39:15

您可以使用filter。我从您的数据集中获取了样本数据。但是逻辑也应该适用于您的数据集

    #InputDF
    # +     -+     -+
    # |    country|      query|
    # +     -+     -+
    # |netherlands|hp touchpad|
    # |netherlands|       ipad|
    # |      india|   touchpad|
    # |    belgium|       ipad|
    # +     -+     -+

    filtered_df = df.filter("country in ('netherlands', 'belgium')")

    filtered_df.show()

    # +     -+     -+
    # |    country|      query|
    # +     -+     -+
    # |netherlands|hp touchpad|
    # |netherlands|       ipad|
    # |    belgium|       ipad|
    # +     -+     -+

    filtered_df.groupBy("query").count().orderBy(F.desc("count")).limit(10).show()

    # +     -+  -+
    # |      query|count|
    # +     -+  -+
    # |       ipad|    2|
    # |hp touchpad|    1|
    # +     -+  -+

相关问题 更多 >