Thursday, 4 August 2022

Calculate Silhouette coefficient for each sample in PySpark

I have a Spark ML pipeline in PySpark that looks like this:

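from pyspark.ml import Pipeline, clustering
from pyspark.ml.feature import PCA, StandardScaler

scaler = StandardScaler(inputCol="features", outputCol="scaled_features")
pca = PCA(inputCol=scaler.getOutputCol(), outputCol="pca_output")
# Note: with no featuresCol set, KMeans reads the default "features" column.
kmeans = clustering.KMeans(seed=2014)

pipeline = Pipeline(stages=[scaler, pca, kmeans])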

After training the model, I want to get the silhouette coefficient for each sample, the way sklearn's silhouette_samples function does.

I know that I can use ClusteringEvaluator to compute a single score for the whole dataset, but I want a score for each sample instead.

How can I achieve this efficiently in PySpark?
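One possible approach, sketched below under stated assumptions: ClusteringEvaluator's default distance measure is squared Euclidean, and for that measure the mean distance from a point to a whole cluster can be derived from three per-cluster aggregates (point count, vector sum, sum of squared norms), so no pairwise distances are needed. The helper names cluster_stats, silhouette_from_stats and add_silhouette_column are hypothetical, not part of PySpark, and the output column name "silhouette" is an assumption. Note that sklearn's silhouette_samples defaults to plain Euclidean distance, so the values only match sklearn when it is called with metric="sqeuclidean".

```python
def cluster_stats(points, labels):
    """Per-cluster (count, vector sum, sum of squared norms) for local data."""
    stats = {}
    for x, c in zip(points, labels):
        n, total, sq = stats.get(c, (0, [0.0] * len(x), 0.0))
        stats[c] = (
            n + 1,
            [t + v for t, v in zip(total, x)],
            sq + sum(v * v for v in x),
        )
    return stats


def silhouette_from_stats(x, label, stats):
    """Silhouette of one point under squared Euclidean distance."""
    if len(stats) < 2:
        raise ValueError("silhouette needs at least two clusters")
    if stats[label][0] <= 1:
        return 0.0  # singleton cluster: scored 0, as sklearn does
    x_sq = sum(v * v for v in x)

    def mean_sq_dist(c, exclude_self):
        n, total, sq = stats[c]
        # sum over y in c of ||x - y||^2 = n*||x||^2 + sum||y||^2 - 2*x.sum(y)
        s = n * x_sq + sq - 2.0 * sum(v * t for v, t in zip(x, total))
        # The point is at distance 0 from itself, so drop it from the count.
        return s / (n - 1 if exclude_self else n)

    a = mean_sq_dist(label, True)
    b = min(mean_sq_dist(c, False) for c in stats if c != label)
    denom = max(a, b)
    return 0.0 if denom == 0 else (b - a) / denom


def add_silhouette_column(predictions, features_col="features",
                          prediction_col="prediction"):
    """Hypothetical helper: append a per-row score to a transformed DataFrame."""
    # Imported lazily so the pure-Python functions above run without Spark.
    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType

    # One pass over the data to build the small per-cluster aggregates.
    pairs = predictions.select(prediction_col, features_col).rdd.map(
        lambda r: (r[0], (1, [float(v) for v in r[1].toArray()],
                          float(r[1].dot(r[1]))))
    )
    stats = pairs.reduceByKey(
        lambda p, q: (p[0] + q[0],
                      [s + t for s, t in zip(p[1], q[1])],
                      p[2] + q[2])
    ).collectAsMap()

    # stats holds one entry per cluster, so shipping it in the UDF closure is cheap.
    score = F.udf(
        lambda v, c: float(silhouette_from_stats(
            [float(t) for t in v.toArray()], c, stats)),
        DoubleType(),
    )
    return predictions.withColumn(
        "silhouette", score(F.col(features_col), F.col(prediction_col))
    )
```

Assuming model = pipeline.fit(df), this would be called as add_silhouette_column(model.transform(df)). The features_col argument should name the column that KMeans actually clustered on, so the scores are measured in the same space as the cluster assignments.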


