Sunday, 29 September 2019

BigQuery Storage. Python. Reading multiple streams in parallel issue (multiprocessing)

I am trying to use the BALANCED ShardingStrategy to get more than one stream, and the Python multiprocessing library to read the streams in parallel.

However, when I read the streams in parallel, each one returns the same number of rows and the same data. If I understand correctly, no data is assigned to a stream until it starts reading and is finalized, so two parallel streams try to read the same data and part of the data is never read as a result.

With the LIQUID strategy we can read all the data from a single stream, which cannot be split.

According to the documentation, it is possible to read multiple streams in parallel with the BALANCED strategy. However, I cannot figure out how to read them in parallel and have different data assigned to each stream.

I have the following toy code:

import os

import google.auth
from google.cloud import bigquery_storage_v1beta1
from multiprocessing import Pool

os.environ["GOOGLE_APPLICATION_CREDENTIALS"]='key.json'
credentials, your_project_id = google.auth.default(scopes=["https://www.googleapis.com/auth/cloud-platform"])
bq_storage_client = bigquery_storage_v1beta1.BigQueryStorageClient(credentials=credentials)

table_ref = bigquery_storage_v1beta1.types.TableReference()
table_ref.project_id = "bigquery-public-data"
table_ref.dataset_id = "ethereum_blockchain"
table_ref.table_id = "contracts"

parent = "projects/{}".format(your_project_id)
session = bq_storage_client.create_read_session(
    table_ref,
    parent,
    format_=bigquery_storage_v1beta1.enums.DataFormat.ARROW,
    sharding_strategy=(bigquery_storage_v1beta1.enums.ShardingStrategy.BALANCED),
)

def read_rows(stream_position, session=session):
    # Read one stream of the session into a pandas DataFrame.
    position = bigquery_storage_v1beta1.types.StreamPosition(
        stream=session.streams[stream_position]
    )
    rows = bq_storage_client.read_rows(position, timeout=100000)
    return rows.to_arrow(session).to_pandas()

if __name__ == '__main__':
    # Read the first two streams in two worker processes.
    p = Pool(2)
    output = p.map(read_rows, range(2))
    print(output)

I need help getting multiple streams read in parallel. There is probably a way to assign data to a stream before reading starts. Any code examples, explanations, or tips would be appreciated.
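
One thing that may be worth ruling out: the toy code builds the client once at module level, so on platforms that fork, every worker inherits the same gRPC-backed client object. gRPC clients are generally safe to share across threads but not across forked processes, and the usual advice is to create the client inside each worker after the fork. Below is a minimal sketch of that pattern using the Pool initializer hook. FakeStorageClient, init_worker, read_stream, and the stream names are hypothetical stand-ins so the sketch runs without credentials; whether this actually resolves the duplicated rows is an assumption, not something confirmed here.

```python
import os
from multiprocessing import Pool


class FakeStorageClient:
    """Hypothetical stand-in for BigQueryStorageClient (no network access)."""

    def __init__(self):
        # Remember which process built this client.
        self.owner_pid = os.getpid()


# One client per worker process; set by init_worker, never shared across forks.
_client = None


def init_worker():
    """Runs once in every worker process, after it has been started."""
    global _client
    _client = FakeStorageClient()


def read_stream(stream_name):
    """Pretend to read one stream with this worker's own client.

    Returns (stream name, pid that built the client, pid doing the read).
    """
    return stream_name, _client.owner_pid, os.getpid()


def main():
    # In the real code these would come from session.streams.
    stream_names = ["stream-0", "stream-1"]
    with Pool(processes=2, initializer=init_worker) as pool:
        results = pool.map(read_stream, stream_names)
    for name, client_pid, worker_pid in results:
        # Each read used a client created in the process that performed it.
        assert client_pid == worker_pid
        print(name, "read by worker", worker_pid)


if __name__ == "__main__":
    main()
```

In the real script, init_worker would construct bigquery_storage_v1beta1.BigQueryStorageClient(credentials=credentials), and read_stream would perform the read_rows(...).to_arrow(session).to_pandas() call from the toy code above using that per-process client.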


