Identify Partition Key Column From A Table Using Pyspark
I need help to find the unique partitions column names for a Hive table using PySpark. The table might have multiple partition columns and preferable the output should return a lis
Solution 1:
It can be done using desc as shown below:
df=spark.sql("""desc test_dev_db.partition_date_table""")
>>> df.show(truncate=False)
+-----------------------+---------+-------+|col_name |data_type|comment|+-----------------------+---------+-------+|emp_id |int|null||emp_name |string |null||emp_salary |int|null||emp_date |date|null||year|string |null||month|string |null||day|string |null||# Partition Information||||# col_name |data_type|comment||year|string |null||month|string |null||day|string |null|+-----------------------+---------+-------+
Since this table was partitioned, So here you can see the partition column information along with their datatypes.
It seems your are interested in just partition column name and their respective data types. Hence I am creating a list of tuples.
partition_list=df.select(df.col_name,df.data_type).rdd.map(lambda x:(x[0],x[1])).collect()
>>> print partition_list
[(u'emp_id', u'int'), (u'emp_name', u'string'), (u'emp_salary', u'int'), (u'emp_date', u'date'), (u'year', u'string'), (u'month', u'string'), (u'day', u'string'), (u'# Partition Information', u''), (u'# col_name', u'data_type'), (u'year', u'string'), (u'month', u'string'), (u'day', u'string')]
partition_details = [partition_list[index+1:] for index,item inenumerate(partition_list) if item[0]=='# col_name']
>>> print partition_details
[[(u'year', u'string'), (u'month', u'string'), (u'day', u'string')]]
It will return empty list in case table is not partitioned. Hope this helps.
Solution 2:
Another simple method through pyspark script .
from pyspark.sql.typesimport *
import pyspark.sql.functionsas f
from pyspark.sqlimport functions as F
from pyspark.sql.functionsimport col, concat, lit
descschema = StructType([ StructField("col_name", StringType())
,StructField("data_type", StringType())
,StructField("comment", StringType())])
df = spark.sql(f"describe formatted serve.cust_transactions" )
df2=df.where((f.col("col_name")== 'Part 0') | (f.col("col_name")== 'Part 2') | (f.col("col_name")== 'Name')).select(f.col('data_type'))
df3 =df2.toPandas().transpose()
display(df3)
Result would be :
Post a Comment for "Identify Partition Key Column From A Table Using Pyspark"