Friday, 17 December 2021

Read Spark data with a column that clashes with a partition name

I have the following partitioned file paths that we read from S3:

prefix/company=abcd/service=xyz/date=2021-01-01/file_01.json
prefix/company=abcd/service=xyz/date=2021-01-01/file_02.json
prefix/company=abcd/service=xyz/date=2021-01-01/file_03.json

When I read these with PySpark:

self.spark \
    .read \
    .option("basePath", 'prefix') \
    .schema(self.schema) \
    .json(['prefix/company=abcd/service=xyz/date=2021-01-01/'])  # paths sit under basePath

All the files share the same schema and are loaded into the table as rows. A file could look something like this:

{"id": "foo", "color": "blue", "date": "2021-12-12"}

The issue is that the files sometimes contain a field whose name clashes with one of my partition columns, in this case date. So I want to know whether it is possible to load the files without the partition columns, rename the clashing JSON date column, and then add the partition columns back (see the sketch after the table below).

The final table would be:

| id  | color | file_date  | company | service | date       |
-------------------------------------------------------------
| foo | blue  | 2021-12-12 | abcd    | xyz     | 2021-01-01 |
| bar | red   | 2021-10-10 | abcd    | xyz     | 2021-01-01 |
| baz | green | 2021-08-08 | abcd    | xyz     | 2021-01-01 |

EDIT:

More information: I sometimes have 5 or 6 partition levels, and date is one of them (not the last). I also need to read multiple date partitions at once. The schema that I pass to Spark also contains the partition columns, which makes it more complicated.

I don't control the input data, so I need to read it as is. I can rename the file columns but not the partition columns.

Would it be possible to alias the file columns, as we would do when joining two dataframes?

Spark 3.1


