
How To Merge Multiple Rows Into Single Cell Based On Id And Then Count?

How to merge multiple rows into single cell based on id using PySpark? I have a dataframe with ids and products. First I want to merge the products with the same id together into a

Solution 1:

You can do this in PySpark with groupby. First group on the id column and merge each id's products into a single sorted string joined with "-". Then group by that merged string and count the ids, which gives the number of ids that share each product combination.

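from pyspark.sql import functions as F

df2 = (df
  .groupby("id")
  # Collect each id's products, sort them, and join them with "-"
  .agg(F.concat_ws("-", F.sort_array(F.collect_list("product"))).alias("products"))
  .groupby("products")
  # Count how many ids share each merged product string; alias the column, not the dataframe
  .agg(F.count("id").alias("count")))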

This should give you a dataframe like this:

+--------------+-----+
|      products|count|
+--------------+-----+
|   HOME-mobile|    2|
|  mobile-watch|    1|
|cd-music-video|    1|
+--------------+-----+
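
If you want to check the logic without a Spark session, the same two steps can be sketched in plain Python. This is only an illustration: the sample rows and the helper name count_product_sets are made up for this example (the original data was not posted), and the rows were chosen so that they reproduce the table above.

```python
from collections import Counter, defaultdict

# Hypothetical sample rows of (id, product); not taken from the original question.
rows = [
    (1, "mobile"), (1, "HOME"),
    (2, "HOME"), (2, "mobile"),
    (3, "watch"), (3, "mobile"),
    (4, "music"), (4, "cd"), (4, "video"),
]

def count_product_sets(rows):
    """Merge products per id into one sorted string, then count equal strings."""
    # Step 1: gather the products of each id (the role of groupby("id") + collect_list).
    by_id = defaultdict(list)
    for id_, product in rows:
        by_id[id_].append(product)
    # Step 2: sort and join each list (the role of sort_array + concat_ws),
    # then count how many ids produced each merged string.
    merged = ("-".join(sorted(products)) for products in by_id.values())
    return Counter(merged)

counts = count_product_sets(rows)
for products, n in counts.items():
    print(products, n)
```

Sorting first is what makes ids 1 and 2 land in the same group even though their products arrive in a different order. The sort is case-sensitive, which is why "HOME" comes before "mobile" in the merged string.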
