Sunday, 18 September 2022

Spark- get column names as a list from nested data

For the dataframe below, which was generated from an Avro file, I'm trying to get the column names as a list (or some other format) so that I can use them in a select statement. node1 and node2 have the same elements. For example, I understand that I could do df.select(col('data.node1.name')), but I'm not sure 1) how to select all columns at once without hardcoding the column names, and 2) how to handle the nested part. I think that, to make it readable, productvalues and porders should be selected into separate individual dataframes/tables? Many thanks for your help.

Input schema:

root
 |-- metadata: struct
 |...
 |-- data: struct
 |    |-- node1: struct
 |    |    |-- name: string
 |    |    |-- productlist: array
 |    |    |    |-- element: struct
 |    |    |    |    |-- productvalues: array
 |    |    |    |    |    |-- element: struct
 |    |    |    |    |    |    |-- pname: string
 |    |    |    |    |    |    |-- porders: array
 |    |    |    |    |    |    |    |-- element: struct
 |    |    |    |    |    |    |    |    |-- ordernum: int
 |    |    |    |    |    |    |    |    |-- field: string
 |    |-- node2: struct
 |    |    |-- name: string
 |    |    |-- productlist: array
 |    |    |    |-- element: struct
 |    |    |    |    |-- productvalues: array
 |    |    |    |    |    |-- element: struct
 |    |    |    |    |    |    |-- pname: string
 |    |    |    |    |    |    |-- porders: array
 |    |    |    |    |    |    |    |-- element: struct
 |    |    |    |    |    |    |    |    |-- ordernum: int
 |    |    |    |    |    |    |    |    |-- field: string


from Spark- get column names as a list from nested data
