I am a beginner with AWS pipelines.
Input: I receive CSV tables every day from ERPs in an S3 bucket, RAW_input; the arrival time can vary. For example, ERP1_folder contains erp1_sales.csv and erp1_customer.csv, and the same applies to ERP2_folder.
Transformation: We then need to apply a tested query (the SQL files are in S3) plus mapping and structure cleaning (Glue jobs), such as integer casts and data format changes, for each table in the buckets, and write the output to the required destination (a rough sketch of what I mean is just below this block).
Destination: The output of the queries is expected in Redshift tables (we already have the clusters and database ready for it).
Requirement: I would like to set up a PySpark Glue job that triggers automatically when any file is uploaded into the S3 bucket.
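
To make the mapping + structure cleaning step concrete, this is roughly the kind of casting I mean; the column names (order_id, order_date, amount, cust_no) are only placeholders, not my real schema:

# Minimal sketch of the "mapping + structure cleaning" step.
# Column names below are placeholders, not the real ERP schema.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.read.option("header", "true").csv("s3://<bucket_name>/RAW_input/ERP1_folder/erp1_sales.csv")

cleaned = (
    df
    .withColumn("order_id", F.col("order_id").cast("int"))            # integer cast
    .withColumn("order_date", F.to_date("order_date", "dd-MM-yyyy"))  # format change
    .withColumn("amount", F.col("amount").cast("decimal(18,2)"))
    .withColumnRenamed("cust_no", "customer_id")                      # mapping to target column names
)
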
Questions
- Can a Glue job alone do all of the work, i.e. the S3 input trigger + mapping + cleaning?
- Do I still need to develop an AWS Lambda function in this process to trigger the Glue job? (A sketch of the kind of function I mean is below this list.)
- Allocating the right query trigger per selected table per job sounds like tough work.
- Note: the clusters in AWS Glue were created manually; I don't know whether that's the only way to do it.
- Any references / GitHub code for PySpark would be helpful.
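
For context on the Lambda question, this is the kind of small trigger function I am asking whether I can avoid (the job name and argument names are placeholders):

# Sketch of a Lambda that starts a Glue job on an S3 "object created" event.
# "my-glue-etl-job" and the argument names are placeholders.
import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]
    # Pass the uploaded file to the job so it knows which table/SQL to process.
    return glue.start_job_run(
        JobName="my-glue-etl-job",
        Arguments={"--s3_bucket": bucket, "--s3_key": key},
    )
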
I can't have user secret access keys, so I have to work only inside the AWS services using IAM roles and policies. No CI/CD solution is needed.
Please leave relevant comments only if you need some explanation.
What I tried: the code below. But I would prefer looping over the different Athena SQL files in S3 for the similar ERPs, for example running erp000.sql over erp000.csv and sending the data to Redshift (roughly what I mean is sketched after the code below).
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# EXTRACT: read the CSV data from S3
df = spark.read.format("csv").load("s3://<bucket_name>/<file_path>/<file_name>.csv")

# TRANSFORM: some transformation
df = df.distinct()

# LOAD: write the data to Redshift over JDBC
# (this uses a username/password, which conflicts with the "no user secret keys" constraint above)
(
    df.write.format("jdbc")
    .option("url", "jdbc:redshift://<host>:5439/<database>")
    .option("dbtable", "<table_schema>.<table_name>")
    .option("user", "<username>")
    .option("password", "<password>")
    .mode("overwrite")
    .save()
)
print("Data loaded to Redshift")