Sunday, 27 October 2019

How to get the table name from Spark SQL Query [PySpark]?

I want to get the table names from a SQL query such as

select *
from table1 as t1
full outer join table2 as t2
  on t1.id = t2.id
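(For quick experiments on simple queries like this one, a naive regex over the FROM/JOIN clauses already recovers the names — a sketch only, which ignores subqueries, CTEs and quoted identifiers:)

```python
import re

def get_tables_naive(query: str):
    """Naively pull out identifiers that follow FROM or JOIN.

    A sketch for simple queries only: it does not handle subqueries,
    CTEs, or quoted identifiers.
    """
    return re.findall(r"\b(?:from|join)\s+(\w+)", query, flags=re.IGNORECASE)

query = """select *
from table1 as t1
full outer join table2 as t2
  on t1.id = t2.id"""
print(get_tables_naive(query))  # → ['table1', 'table2']
```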

I found a Scala solution in How to get table names from SQL query?

def getTables(query: String): Seq[String] = {
  val logicalPlan = spark.sessionState.sqlParser.parsePlan(query)
  import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
  logicalPlan.collect { case r: UnresolvedRelation => r.tableName }
}

which gives me the correct table names when I iterate over the returned sequence with getTables(query).foreach(println):

table1
table2

What would be the equivalent syntax in PySpark? The closest I came across was How to extract column name and column type from SQL in pyspark:

plan = spark_session._jsparkSession.sessionState().sqlParser().parsePlan(query)
print(f"table: {plan.tableDesc().identifier().table()}")

which fails with the traceback:

Py4JError: An error occurred while calling o78.tableDesc. Trace:
py4j.Py4JException: Method tableDesc([]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
    at py4j.Gateway.invoke(Gateway.java:274)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.base/java.lang.Thread.run(Thread.java:835)

I understand that the problem stems from the fact that I need to filter the plan nodes that are of type UnresolvedRelation, but I cannot find an equivalent notation in Python/PySpark.



from How to get the table name from Spark SQL Query [PySpark]?
