Let's say I have a data analysis problem (e.g. CSV data like the Iris dataset) where I want to do some data manipulation and processing with pandas and Python. My Python script is already written; each day when I receive a CSV file, I want this data to be processed by my script in the Azure cloud, with the result written to Azure Blob Storage.
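For context, a minimal sketch of such a daily job might look like the following. The aggregation logic, the `process` function, and the connection-string/container names are all placeholders I made up for illustration; the upload step uses the `azure-storage-blob` package:

```python
import io

import pandas as pd


def process(csv_text: str) -> str:
    """Hypothetical processing step: mean of numeric columns per species."""
    df = pd.read_csv(io.StringIO(csv_text))
    result = df.groupby("species").mean(numeric_only=True).reset_index()
    return result.to_csv(index=False)


def upload_result(result_csv: str, conn_str: str, container: str, blob_name: str) -> None:
    """Write the processed CSV to Azure Blob Storage.

    Requires the azure-storage-blob package; in a real deployment the
    connection string would come from an environment variable or Key Vault.
    """
    from azure.storage.blob import BlobServiceClient

    service = BlobServiceClient.from_connection_string(conn_str)
    blob = service.get_blob_client(container=container, blob=blob_name)
    blob.upload_blob(result_csv, overwrite=True)
```

Whichever service runs this (Batch task or Databricks activity), the script itself stays the same; only how it is scheduled, scaled, and monitored differs.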
Now I have come across these links/approaches to solve this:
- Run Python Scripts via Data Factory using Azure Batch
- Run Databricks-Notebook activity in Data Factory
- Run Python Scripts via Azure Databricks Python activity in Data Factory
Does anybody have experience with these approaches to running a Python script as described above, and perhaps recommendations on what to consider (pros/cons)?
Goal of this question: which approach would you choose or prefer, a) Azure Batch service or b) Azure Databricks, and why?
Things to consider for choosing the appropriate service:
- price
- convenience of setting up solution
- monitoring possibilities
- possibilities to scale if data grows or script-logic gets more complex over time
- ease of integration with other services (e.g. storage)
- flexibility with regard to libraries and frameworks (e.g. later on this might become a data science problem and I may want to add some H2O machine learning models to my analysis pipeline)
- (maybe more I did not consider ...?)
from Azure Batch Service vs. Azure Databricks for Python Job