I've got a DF with columns of different time cycles (1/6, 3/6, 6/6 etc.) and would like to "explode" all the columns to create a new DF in which each row is a 1/6 cycle.
from pyspark import Row
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, arrays_zip, col
spark = SparkSession.builder \
.appName('DataFrame') \
.master('local[*]') \
.getOrCreate()
df = spark.createDataFrame([Row(a=1, b=[1, 2, 3, 4, 5, 6], c=[11, 22, 33], d=['foo'])])
| a| b| c| d|
+---+------------------+------------+-----+
| 1|[1, 2, 3, 4, 5, 6]|[11, 22, 33]|[foo]|
+---+------------------+------------+-----+
I'm doing the explode:
df2 = (df.withColumn("tmp", arrays_zip("b", "c", "d"))
.withColumn("tmp", explode("tmp"))
.select("a", col("tmp.b"), col("tmp.c"), "d"))
But the output is not what I want:
| a| b| c| d|
+---+---+----+-----+
| 1| 1| 11|[foo]|
| 1| 2| 22|[foo]|
| 1| 3| 33|[foo]|
| 1| 4|null|[foo]|
| 1| 5|null|[foo]|
| 1| 6|null|[foo]|
+---+---+----+-----+
I would want it to look like this:
| a| b| c| d|
+---+---+---+---+
| 1| 1| 11|foo|
| | 2| | |
| | 3| 22| |
| | 4| | |
| | 5| 33| |
| | 6| | |
+---+---+---+---+
I am new to Spark and from the start I've got complicated topics ! :)
Update: Maybe someone has a solution without usage of UDFs?
from How to explode multiple columns, different types and different lengths?
No comments:
Post a Comment