I am a beginner with AWS pipelines.
Input: I receive CSV tables every day from ERPs in an S3 bucket, RAW_input; the arrival time can vary. For example, ERP1_folder contains erp1_sales.csv and erp1_customer.csv, and the same applies to ERP2_folder.
Transformation: We then need to apply a tested query (the SQL files are in S3) plus mapping and structure cleaning (Glue jobs), such as integer casts and data format changes, for each table in the buckets, and write the output to the required destination (a rough sketch of what I mean is just below this block).
Destination: The output of the queries is expected in Redshift tables (we already have the clusters and database ready for it).
Requirement: I would like to set up a PySpark Glue job that triggers automatically when any file is uploaded into the S3 bucket.
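
To make the mapping + structure cleaning step concrete, this is roughly the kind of casting I mean; the column names (order_id, order_date, amount, cust_no) are only placeholders, not my real schema:

# Minimal sketch of the "mapping + structure cleaning" step.
# Column names below are placeholders, not the real ERP schema.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.read.option("header", "true").csv("s3://<bucket_name>/RAW_input/ERP1_folder/erp1_sales.csv")

cleaned = (
    df
    .withColumn("order_id", F.col("order_id").cast("int"))            # integer cast
    .withColumn("order_date", F.to_date("order_date", "dd-MM-yyyy"))  # format change
    .withColumn("amount", F.col("amount").cast("decimal(18,2)"))
    .withColumnRenamed("cust_no", "customer_id")                      # mapping to target column names
)
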
Questions
- Can a Glue job alone do all of the work, i.e. the S3 input trigger + mapping + cleaning?
- Do I still need to develop an AWS Lambda function in this process to trigger the Glue job? (A sketch of the kind of function I mean is below this list.)
- Allocating the right query trigger per selected table per job sounds like tough work.
- Note: the clusters in AWS Glue were created manually; I don't know whether that's the only way to do it.
- Any references / GitHub code for PySpark would be helpful.
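
For context on the Lambda question, this is the kind of small trigger function I am asking whether I can avoid (the job name and argument names are placeholders):

# Sketch of a Lambda that starts a Glue job on an S3 "object created" event.
# "my-glue-etl-job" and the argument names are placeholders.
import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]
    # Pass the uploaded file to the job so it knows which table/SQL to process.
    return glue.start_job_run(
        JobName="my-glue-etl-job",
        Arguments={"--s3_bucket": bucket, "--s3_key": key},
    )
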
I can't have user secret access keys, so I have to work only inside the AWS services using IAM roles and policies. No CI/CD solution is needed.
Please leave relevant comments only if you need some explanation.
What I tried: the code below. But I would prefer looping over the different Athena SQL files in S3 for the similar ERPs, for example running erp000.sql over erp000.csv and sending the data to Redshift (roughly what I mean is sketched after the code below).
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# EXTRACT: read the CSV data from S3
df = spark.read.format("csv").load("s3://<bucket_name>/<file_path>/<file_name>.csv")

# TRANSFORM: some transformation
df = df.distinct()

# LOAD: write the data to Redshift over JDBC
# (this uses a username/password, which conflicts with the "no user secret keys" constraint above)
(
    df.write.format("jdbc")
    .option("url", "jdbc:redshift://<host>:5439/<database>")
    .option("dbtable", "<table_schema>.<table_name>")
    .option("user", "<username>")
    .option("password", "<password>")
    .mode("overwrite")
    .save()
)
print("Data loaded to Redshift")