Hemant Vishwakarma: How to explode multiple columns, different types and different lengths?

Thursday, 11 July 2019

How to explode multiple columns, different types and different lengths?

I've got a DF with columns of different time cycles (1/6, 3/6, 6/6 etc.) and would like to "explode" all the columns to create a new DF in which each row is a 1/6 cycle.

from pyspark import Row 
from pyspark.sql import SparkSession 
from pyspark.sql.functions import explode, arrays_zip, col

spark = SparkSession.builder \
    .appName('DataFrame') \
    .master('local[*]') \
    .getOrCreate()

df = spark.createDataFrame([Row(a=1, b=[1, 2, 3, 4, 5, 6], c=[11, 22, 33], d=['foo'])])

|  a|                 b|           c|    d|
+---+------------------+------------+-----+
|  1|[1, 2, 3, 4, 5, 6]|[11, 22, 33]|[foo]|
+---+------------------+------------+-----+

I'm doing the explode:

df2 = (df.withColumn("tmp", arrays_zip("b", "c", "d"))
       .withColumn("tmp", explode("tmp"))
       .select("a", col("tmp.b"), col("tmp.c"), "d"))

But the output is not what I want:

|  a|  b|   c|    d|
+---+---+----+-----+
|  1|  1|  11|[foo]|
|  1|  2|  22|[foo]|
|  1|  3|  33|[foo]|
|  1|  4|null|[foo]|
|  1|  5|null|[foo]|
|  1|  6|null|[foo]|
+---+---+----+-----+

I would want it to look like this:

|  a|  b|  c|  d|
+---+---+---+---+
|  1|  1| 11|foo|
|   |  2|   |   |
|   |  3| 22|   |
|   |  4|   |   |
|   |  5| 33|   |
|   |  6|   |   |
+---+---+---+---+

I am new to Spark and from the start I've got complicated topics ! :)

Update: Maybe someone has a solution without usage of UDFs?

from How to explode multiple columns, different types and different lengths?

Hemant Vishwakarma

Thursday, 11 July 2019

How to explode multiple columns, different types and different lengths?

No comments:

Post a Comment