按两个因素分组的前10个术语

+--------------------+-------+------------------+--------------------+--------------------+--------------------+----------------+-----------+--------------+----------+ | user| sku| category| query| click_time| query_time| city| country| location|time_spent| +--------------------+-------+------------------+--------------------+--------------------+--------------------+----------------+-----------+--------------+----------+ |212dffe34129f84ac...|2842092|pcmcat209000050008| Hp tablet|2011-08-21T11:09:...|2011-08-21T11:07:...| Reusel|Netherlands|51.3625,5.1653| 13| |fc3540111a0aaca56...|2969477|pcmcat247400050000|Toshiba satellite...|2011-09-08T22:22:...|2011-09-08T22:22:...| Reusel|Netherlands|51.3625,5.1653| 60| |3eb03056dd77ccad8...|6974946| abcat0410012| oltmpus|2011-08-28T19:11:...|2011-08-28T19:08:...| null|Netherlands| 52.3667,4.9| 0| |0ae72d93c4b9394d5...|9755322|pcmcat247400050001| hp laptop|2011-09-05T15:34:...|2011-09-05T15:33:...|Roelofarendsveen|Netherlands|52.2033,4.6333| 4| |dd99a827d4790eaf3...|2363054| abcat0101001| Samsung LCD 550|2011-09-01T10:29:...|2011-09-01T10:28:...| null|Netherlands| 52.3667,4.9| 42| |fa7bf9fb76ef1abe3...|9643151| abcat0404004| memory stick|2011-09-06T17:17:...|2011-09-06T17:15:...| null|Netherlands| 52.3667,4.9| 0| |037180033c5941c86...|9929926| cat02015| Castle|2011-08-28T22:07:...|2011-08-28T22:06:...| Tarragona| Spain| 41.1167,1.25| 6| |45bebf83124cef122...|1303083| abcat0508009| microsoft office|2011-08-30T19:33:...|2011-08-30T19:31:...| Tarragona| Spain| 41.1167,1.25| 20| |86fa480fe4ff0abac...|2842056|pcmcat209000050008|Inventory checker...|2011-09-06T18:17:...|2011-09-06T18:17:...| null| Belgium| 50.85,4.35| 37| |0d1521a7bbe3973d6...|1014055| abcat0201010| Zune|2011-08-25T12:24:...|2011-08-25T12:23:...| null|Netherlands| 52.3667,4.9| 0|

from pyspark.sql.functions import countDistinct, avg, stddev mydf= df.groupBy(df.query).count() mydf= mydf.orderBy(mydf["count"].desc()) mydf= mydf.limit(11) mydf = mydf.dropna() mydf.show() +--------------------+-----+ | query|count| +--------------------+-----+ | Hp touchpad| 271| | hp touchpad| 193| | Touchpad| 151| |LaborDay_Computer...| 138| | touchpad| 128| | Hp tablet| 88| |LaborDay_Televisi...| 86| | iPad| 65| | Laptops| 63| | Ipad| 61| +--------------------+-----+ mydf= df.groupBy(when(col("country") == "Netherlands", df.groupBy(df.query))).count()

1条回答

网友

1楼 · 发布于 2024-04-28 14:39:15

您可以使用filter。我从您的数据集中获取了样本数据。但是逻辑也应该适用于您的数据集

    #InputDF
    # +     -+     -+
    # |    country|      query|
    # +     -+     -+
    # |netherlands|hp touchpad|
    # |netherlands|       ipad|
    # |      india|   touchpad|
    # |    belgium|       ipad|
    # +     -+     -+

    filtered_df = df.filter("country in ('netherlands', 'belgium')")

    filtered_df.show()

    # +     -+     -+
    # |    country|      query|
    # +     -+     -+
    # |netherlands|hp touchpad|
    # |netherlands|       ipad|
    # |    belgium|       ipad|
    # +     -+     -+

    filtered_df.groupBy("query").count().orderBy(F.desc("count")).limit(10).show()

    # +     -+  -+
    # |      query|count|
    # +     -+  -+
    # |       ipad|    2|
    # |hp touchpad|    1|
    # +     -+  -+

相关问题更多 >

编程相关推荐

热门问题

热门文章