Wednesday, 29 November 2023

How do I overwrite a BigQuery table (data and schema) from PySpark?

I am trying to write a PySpark DataFrame to a BigQuery table. The schema of this table may change between job executions (columns may be added or omitted), so I would like to overwrite the table, both data and schema, on each execution.

An example:

# First job: creates dataset_name.table_name with columns col1 and col2.
df = spark.createDataFrame(data=[(1, "A")], schema=["col1", "col2"])
df.write.format("bigquery")\
    .option("temporaryGcsBucket", "temporary-bucket-name")\
    .mode("overwrite")\
    .save(path="dataset_name.table_name")

When `dataset_name.table_name` doesn't already exist, the above works great and creates a table with columns col1 and col2, as expected.
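
For reference, this is how I inspect the schema the first job produces (a minimal sketch, assuming the google-cloud-bigquery client library is installed and "my-project" is a placeholder project id):

from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project id
table = client.get_table("my-project.dataset_name.table_name")
# After the first job this prints: col1 INTEGER, col2 STRING
for field in table.schema:
    print(field.name, field.field_type)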

However, subsequent jobs may be as below:

# Second job: col1 is renamed to col3; I want the table overwritten
# with the new schema (col2, col3).
df.withColumnRenamed("col1", "col3").write.format("bigquery")\
    .option("writeDisposition", "WRITE_TRUNCATE")\
    .option("temporaryGcsBucket", "temporary-bucket-name")\
    .mode("overwrite")\
    .save(path="dataset_name.table_name")

The above job does not produce what I want: col3 never appears, and col1 is still there.

Even more disturbingly, I get no error message.
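
Reading the table back through the connector confirms that the second write silently left the schema untouched (a sketch, using the same placeholder names as above):

# Read the table back after the second job; the schema is unchanged.
result = spark.read.format("bigquery")\
    .load("dataset_name.table_name")
result.printSchema()  # still col1 and col2 -- no col3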

So, what options should I specify so that the result in BigQuery is just col2 and col3 with appropriate data?

Basically, I want to mimic the SQL statement CREATE OR REPLACE TABLE from PySpark.
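
The only fallback I can think of is to drop the table with the BigQuery client before writing, forcing the connector to recreate it with the new schema; as far as I can tell, connector options such as allowFieldAddition and allowFieldRelaxation only widen an existing schema and would never remove col1. A minimal sketch of that drop-and-rewrite approach (again assuming google-cloud-bigquery and the hypothetical "my-project" id), though I would prefer a pure connector option if one exists:

from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project id
# Drop the old table so the connector must recreate it with the new
# schema; not_found_ok avoids an error on the very first run.
client.delete_table("my-project.dataset_name.table_name", not_found_ok=True)

df.withColumnRenamed("col1", "col3").write.format("bigquery")\
    .option("temporaryGcsBucket", "temporary-bucket-name")\
    .mode("overwrite")\
    .save(path="dataset_name.table_name")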



from How do I overwrite a BigQuery table (data and schema) from PySpark?
