I am trying to write a PySpark DataFrame
to a BigQuery table. The schema for this table may change between job executions (columns may be added or omitted), so I would like to overwrite the table on each execution.
An example:
df = spark.createDataFrame(data=[(1, "A")], schema=["col1", "col2"])
df.write.format("bigquery") \
    .option("temporaryGcsBucket", "temporary-bucket-name") \
    .mode("overwrite") \
    .save(path="dataset_name.table_name")
When `dataset_name.table_name` doesn't already exist, the above works great and generates a table with columns col1 and col2 containing the row (1, "A").
However, a subsequent job may look like the following:
df.withColumnRenamed("col1", "col3").write.format("bigquery") \
    .option("writeDisposition", "WRITE_TRUNCATE") \
    .option("temporaryGcsBucket", "temporary-bucket-name") \
    .mode("overwrite") \
    .save(path="dataset_name.table_name")
The above job does not generate what I want: col3 does not appear in the resulting table, and col1 still does.
Even more disturbing, I get no error message.
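A quick way to see what actually ends up in BigQuery (just a sketch, assuming the google-cloud-bigquery client library is installed and authenticated; "project_name" is a placeholder) is to print the table's schema after the write:

    from google.cloud import bigquery

    # Sketch: list the column names of the target table to verify its schema.
    # "project_name" is a placeholder for the actual GCP project.
    client = bigquery.Client()
    table = client.get_table("project_name.dataset_name.table_name")
    print([field.name for field in table.schema])  # shows col1 and col2, but no col3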
So, what options should I specify so that the result in BigQuery contains just col2 and col3 with the appropriate data?
Basically, I want to mimic the SQL statement `CREATE OR REPLACE TABLE` from PySpark.
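For illustration, one workaround I can imagine (a rough, non-atomic sketch, assuming the google-cloud-bigquery client library is available and authenticated; "project_name" is a placeholder) is to drop the table explicitly before the Spark write, so the connector recreates it with the DataFrame's schema:

    from google.cloud import bigquery

    # Hypothetical workaround sketch: drop the existing table first so that the
    # spark-bigquery connector recreates it with the new DataFrame's schema.
    client = bigquery.Client()
    client.delete_table("project_name.dataset_name.table_name", not_found_ok=True)

    df.withColumnRenamed("col1", "col3").write.format("bigquery") \
        .option("temporaryGcsBucket", "temporary-bucket-name") \
        .mode("overwrite") \
        .save(path="dataset_name.table_name")

But I'd prefer a pure connector option, if one exists, rather than dropping the table out of band.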