I have a performance issue, and after analyzing the Spark web UI I found what seems to be data skew.
Initially I thought the partitions were not evenly distributed, so I analyzed the row count per partition, but it looks normal, with no outliers (see: how to manually run pyspark's partitioning function for debugging).
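For reference, a minimal sketch of that kind of per-partition row-count check; the session setup and the DataFrame `df` here are placeholders, not the actual job:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import spark_partition_id, count

spark = SparkSession.builder.appName("partition-skew-check").getOrCreate()
df = spark.range(0, 1_000_000)  # stand-in for the real dataset

# Tag every row with the ID of the partition it lives in,
# then count rows per partition and sort to spot outliers.
(df.withColumn("pid", spark_partition_id())
   .groupBy("pid")
   .agg(count("*").alias("row_count"))
   .orderBy("row_count", ascending=False)
   .show())
```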
But the problem persists, and I can see one executor processing most of the data.
So the hypothesis now is that the partitions are not evenly distributed across the executors. The question is: how does Spark distribute partitions to executors, and how can I change this to solve my skew problem?
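For context: Spark's scheduler assigns tasks (one per partition) to executors based on available cores and data locality; there is no API to pin a particular partition to a particular executor. What can be controlled is the partitioning itself. Below is a hedged sketch of two common remedies, assuming Spark 3.x and reusing the placeholder `df` from the snippet above:

```python
# 1) Force a full shuffle into more, smaller partitions so tasks can be
#    spread more evenly across executors (balances tasks, does not pin them).
df_balanced = df.repartition(200)

# 2) Let Adaptive Query Execution split oversized shuffle partitions and
#    skewed join partitions at runtime (Spark 3.x configuration keys).
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
```

If the imbalance comes from a hot key rather than partition placement, salting the key before the shuffle is the usual fix.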
from How spark distributes partitions to executors
