I am running the following function to profile a BigQuery query:
# q = "SELECT * FROM bqtable LIMIT 1'''
def run_query(q):
t0 = time.time()
client = bigquery.Client()
t1 = time.time()
res = client.query(q)
t2 = time.time()
results = res.result()
t3 = time.time()
records = [_ for _ in results]
t4 = time.time()
print (records[0])
print ("Initialize BQClient: %.4f | ExecuteQuery: %.4f | FetchResults: %.4f | PrintRecords: %.4f | Total: %.4f | FromCache: %s" % (t1-t0, t2-t1, t3-t2, t4-t3, t4-t0, res.cache_hit))
And I get output like the following:
Initialize BQClient: 0.0007 | ExecuteQuery: 0.2854 | FetchResults: 1.0659 | PrintRecords: 0.0958 | Total: 1.4478 | FromCache: True
I am running this on a GCP machine and it is only fetching ONE result in location US (same region, etc.), so the network transfer should (I hope?) be negligible. What's causing all the overhead here?
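For comparison, here is a sketch (not the code I originally ran) that creates the client once up front, so the one-time auth and connection setup is kept out of the per-query timing; bqtable is still just a placeholder:

import time
from google.cloud import bigquery

# Client created once, outside the timed section
client = bigquery.Client()

def run_query_shared_client(q):
    t0 = time.time()
    res = client.query(q)      # submit the query job
    t1 = time.time()
    rows = list(res.result())  # wait for the job and fetch the rows
    t2 = time.time()
    print("ExecuteQuery: %.4f | FetchResults: %.4f | Total: %.4f | FromCache: %s"
          % (t1 - t0, t2 - t1, t2 - t0, res.cache_hit))
    return rows

# Example: run_query_shared_client("SELECT * FROM bqtable LIMIT 1")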
I also tried this in the GCP console, and it reports that the cache hit takes less than 0.1s to return, but in actuality it's over a second. Here is an example video to illustrate: https://www.youtube.com/watch?v=dONZH1cCiJc.
Notice that the first query, for example, is reported as returning in 0.253s from cache.
However, if you view the above video, the query actually STARTED at 7 seconds and 3 frames and COMPLETED at 8 seconds and 13 frames.
That is well over a second, almost a second and a half! That number is similar to what I get when I execute the query from the command line in Python.
So why does it report that it took only 0.253s when, in actuality, running the query and returning the one result takes over five times that amount?
In other words, it seems like there is about a second of overhead REGARDLESS of the query time, and that overhead is not noted at all in the execution details. Are there any ways to reduce this time?
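One check I can think of (a sketch, assuming the job's started/ended timestamps are what the console reports as execution time) is to compare the server-reported job duration against the wall-clock time seen on the client; whatever is left over would be client-side/HTTP overhead:

import time
from google.cloud import bigquery

client = bigquery.Client()

t0 = time.time()
job = client.query("SELECT * FROM bqtable LIMIT 1")  # bqtable is the same placeholder as above
rows = list(job.result())                            # wait for completion and fetch the single row
wall_clock = time.time() - t0

# started/ended are server-side timestamps reported on the finished QueryJob
server_side = (job.ended - job.started).total_seconds()
print("wall-clock: %.3fs | server-side job time: %.3fs | remaining overhead: %.3fs"
      % (wall_clock, server_side, wall_clock - server_side))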