Friday, 17 December 2021

Read Spark data with a column that clashes with a partition name

I have the following partitioned file paths that we read from S3:

prefix/company=abcd/service=xyz/date=2021-01-01/file_01.json
prefix/company=abcd/service=xyz/date=2021-01-01/file_02.json
prefix/company=abcd/service=xyz/date=2021-01-01/file_03.json

When I read these with PySpark:

self.spark \
    .read \
    .option("basePath", 'prefix') \
    .schema(self.schema) \
    .json(['prefix/company=abcd/service=xyz/date=2021-01-01/'])  # paths sit under basePath

All the files share the same schema and are loaded into the table as rows. A file could look something like this:

{"id": "foo", "color": "blue", "date": "2021-12-12"}

The issue is that the files sometimes contain a field whose name clashes with one of my partition columns, in this case date. So I want to know whether it is possible to load the files without the partition columns, rename the clashing JSON date column, and then add the partition columns back (see the sketch after the table below).

The final table would be:

| id  | color | file_date  | company | service | date       |
-------------------------------------------------------------
| foo | blue  | 2021-12-12 | abcd    | xyz     | 2021-01-01 |
| bar | red   | 2021-10-10 | abcd    | xyz     | 2021-01-01 |
| baz | green | 2021-08-08 | abcd    | xyz     | 2021-01-01 |

EDIT:

More information: I sometimes have 5 or 6 partition levels, and date is one of them (not the last). I also need to read multiple date partitions at once. The schema that I pass to Spark also contains the partition columns, which makes it more complicated.

I don't control the input data, so I need to read it as is. I can rename the file columns but not the partition columns.

Would it be possible to alias the file columns, as we would do when joining two dataframes?

Spark 3.1


