How To Merge Multiple Rows Into Single Cell Based On Id And Then Count?
How to merge multiple rows into single cell based on id using PySpark? I have a dataframe with ids and products. First I want to merge the products with the same id together into a
Solution 1:
You can do this in PySpark by using groupby. First group on the id column and merge the products into a single, sorted list, joined into one string. To count how many ids share each merged product string, use groupby again and aggregate with count.
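from pyspark.sql import functions as F

df2 = (
    df
    # Collect each id's products, sort them, and join them into one string
    .groupby("id")
    .agg(F.concat_ws("-", F.sort_array(F.collect_list("product"))).alias("products"))
    # Count how many ids share each merged product string
    .groupby("products")
    .agg(F.count("id").alias("count"))
)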
This should give you a dataframe like this:
+--------------+-----+
|      products|count|
+--------------+-----+
|   HOME-mobile|    2|
|  mobile-watch|    1|
|cd-music-video|    1|
+--------------+-----+
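If you want to check the logic without a Spark session, the same three steps can be sketched in plain Python. This is only an illustration: the sample rows and the helper name count_product_sets are made up here (the original input data is not shown), chosen so the result matches the table above.

```python
from collections import Counter, defaultdict

# Hypothetical (id, product) rows; not taken from the original question.
rows = [
    (1, "mobile"), (1, "HOME"),
    (2, "HOME"), (2, "mobile"),
    (3, "watch"), (3, "mobile"),
    (4, "video"), (4, "cd"), (4, "music"),
]


def count_product_sets(rows):
    """Count how many ids share each sorted, '-'-joined product list."""
    # Step 1: collect products per id (mirrors groupby("id") + collect_list).
    by_id = defaultdict(list)
    for id_, product in rows:
        by_id[id_].append(product)
    # Step 2: sort and join each list (mirrors sort_array + concat_ws("-")).
    merged = ("-".join(sorted(products)) for products in by_id.values())
    # Step 3: count ids per merged string (mirrors the second groupby + count).
    return Counter(merged)


counts = count_product_sets(rows)
for products, count in counts.items():
    print(products, count)
```

Sorting before joining is what makes ids with the same products in a different order (ids 1 and 2 in the sample) land in the same group.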