I have the following partitioned file paths that we read from S3:
prefix/company=abcd/service=xyz/date=2021-01-01/file_01.json
prefix/company=abcd/service=xyz/date=2021-01-01/file_02.json
prefix/company=abcd/service=xyz/date=2021-01-01/file_03.json
When I read these with PySpark (the load path must sit under the basePath for partition discovery to apply):
self.spark \
    .read \
    .option("basePath", "prefix") \
    .schema(self.schema) \
    .json(["prefix/company=abcd/service=xyz/date=2021-01-01/"])
All the files have the same schema and get loaded in the table as rows. A file could be something like this:
{"id": "foo", "color": "blue", "date": "2021-12-12"}
The issue is that sometimes the files contain a date field that clashes with my partition column of the same name. So I want to know whether it is possible to load the files without the partition columns, rename the JSON date column (e.g. to file_date), and then add the partition columns back.
Final table would be:
| id  | color | file_date  | company | service | date       |
|-----|-------|------------|---------|---------|------------|
| foo | blue  | 2021-12-12 | abcd    | xyz     | 2021-01-01 |
| bar | red   | 2021-10-10 | abcd    | xyz     | 2021-01-01 |
| baz | green | 2021-08-08 | abcd    | xyz     | 2021-01-01 |
EDIT:
More information: I sometimes have 5 or 6 partition levels, and date is one of them (not the last). I also need to read multiple date partitions at once. The schema I pass to Spark also contains the partition columns, which makes this more complicated.
I don't control the input data, so I need to read it as is. I can rename the file columns but not the partition columns.
Would it be possible to add an alias to the file columns, as we would when joining two dataframes?
Spark 3.1
from Read spark data with column that clashes with partition name