Thursday, 27 January 2022

python - matplotlib subplot grid: where to insert row/column arguments

For context, I'm working with scikit-learn's topic extraction example script, which displays the top words for a given fit. My actual issue, though, is with matplotlib.

How to reference sub-plot row/column locations?

Extracting subplot coordinates in Python

The question above asks about the coordinates of subplots, but I can't see how to apply it to my for loop, which is supposed to plot the top words for each input in a list (fitting the model on different data at each iteration and plotting the results in a separate subplot):

tf_list = [cm_array, xb_array, array_3, array_4, array_5, array_6, array_7]

for i in range(len(tf_list)):
    tf = tf_vectorizer.fit_transform(tf_list[i])
    n_components = 1
    lda.fit(tf)
    n_top_words = 20
    tf_feature_names = tf_vectorizer.get_feature_names_out()
    top_word_comparison(lda, tf_feature_names, n_top_words, "Topics in LDA model")

I think this should work in theory, but the trouble is I can't figure out how to change the documentation's plot function to incorporate different fits. The furthest I got (with the help of Alex):

def top_word_comparison(axes, model, feature_names, n_top_words, subplot_title):
    #column logic
    for j in range(len(tf_list)):
        top_features_ind = model.components_.argsort()[: -n_top_words - 1 : -1]
        top_features = [feature_names[i] for i in top_features_ind]
        weights = model.components_[top_features_ind]
        
        #print(len(model.components_))
        print(weights)
        ax = axes[j]
        ax.barh(top_features, weights, height=0.7)
        ax.set_title(subplot_title, fontdict={"fontsize": 30})
        ax.invert_yaxis()
        ax.tick_params(axis="both", which="major", labelsize=20)
        for i in "top right left".split():
            ax.spines[i].set_visible(False)

#tf_list = [cm_array, xb_array]
fig, axes = plt.subplots(2, 5, figsize=(30, 15), sharex=True)
fig.suptitle("Topics in LDA model", fontsize=40)

for i in range(len(tf_list)):
    tf = tf_vectorizer.fit_transform(tf_list[i])
    n_components = 1
    lda.fit(tf)
    n_top_words = 20
    tf_feature_names = tf_vectorizer.get_feature_names_out()
    top_word_comparison(axes[0], lda, tf_feature_names, n_top_words, sector_list[i])

plt.subplots_adjust(top=0.90, bottom=0.05, wspace=0.90, hspace=0.3)
plt.show()

Running this raises the following error:

IndexError: index 735 is out of bounds for axis 0 with size 1

This leads me to think the problem was introduced when I changed:

for topic_idx, topic in enumerate(model.components_):
    top_features_ind = topic.argsort()[: -n_top_words - 1 : -1]
    top_features = [feature_names[i] for i in top_features_ind]
    weights = topic[top_features_ind]

to:

for j in range(len(tf_list)):
    top_features_ind = model.components_.argsort()[: -n_top_words - 1 : -1]
    top_features = [feature_names[i] for i in top_features_ind]
    weights = model.components_[top_features_ind]
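
The difference between the two versions can be reproduced on a small array. This is only a sketch: `components` and its six weights are made-up stand-ins for `model.components_` after a fit with one topic, not output from the real model. It suggests that `topic` is a 1-D row, while `model.components_` is 2-D with shape (1, n_features), so the same slice and the same indexing act on the row axis instead of the feature axis.

```python
import numpy as np

# Hypothetical stand-in for model.components_ with a single topic:
# shape (1, n_features), here with 6 invented weights.
components = np.array([[0.2, 1.5, 0.1, 3.0, 0.7, 2.2]])
n_top_words = 3

# Original pattern: iterating over components yields a 1-D row per topic.
topic = components[0]
top_features_ind = topic.argsort()[: -n_top_words - 1 : -1]
weights = topic[top_features_ind]
print(top_features_ind)  # feature indices 3, 5, 1 (largest weight first)

# Modified pattern: argsort still sorts each row, but the slice now
# selects ROWS, so all 6 column indices survive in a (1, 6) array.
bad_ind = components.argsort()[: -n_top_words - 1 : -1]
print(bad_ind.shape)  # (1, 6)

# Indexing the 2-D array with those values treats them as row numbers,
# and there is only one row, hence the IndexError on axis 0.
error_message = None
try:
    components[bad_ind]
except IndexError as exc:
    error_message = str(exc)
print(error_message)
```

Under these assumptions, taking the single row first (for example `model.components_[0]`) would restore the 1-D behaviour that the original loop relied on.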

Conclusion

Even though each fit has only one component (so `components_` has a single row), it seems I can't simply replace `topic` with `model.components_` wherever it appears. So the trouble is:
  • My LDA model has just one component per run, so we are not plotting one subplot per component as in the documentation.
  • Instead, we are plotting one subplot per model fit, so it makes sense to loop over the fits/data elements in tf_list. When we do so, however, the array indexing breaks.
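
One possible way to get one subplot per fit is to let the plotting function draw a single fit on a single Axes, and to choose the grid cell in the outer loop by flattening the 2x5 array that `plt.subplots` returns. The sketch below assumes that structure. The helper name `plot_top_words`, the random weights in `fits`, the `word…` feature names and the `sector …` titles are all hypothetical placeholders for the real LDA fits, vectorizer vocabulary and `sector_list`.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np


def plot_top_words(ax, weights_row, feature_names, n_top_words, title):
    """Draw the top words of ONE fit on ONE Axes (no loop over fits here)."""
    top_ind = weights_row.argsort()[: -n_top_words - 1 : -1]
    top_features = [feature_names[i] for i in top_ind]
    ax.barh(top_features, weights_row[top_ind], height=0.7)
    ax.set_title(title, fontdict={"fontsize": 30})
    ax.invert_yaxis()
    ax.tick_params(axis="both", which="major", labelsize=20)
    for side in ("top", "right", "left"):
        ax.spines[side].set_visible(False)


# Hypothetical data standing in for seven separate one-topic LDA fits.
rng = np.random.default_rng(0)
feature_names = np.array([f"word{k}" for k in range(50)])
fits = [rng.random((1, 50)) for _ in range(7)]  # each shaped like components_
titles = [f"sector {k}" for k in range(7)]

fig, axes = plt.subplots(2, 5, figsize=(30, 15), sharex=True)
axes_flat = axes.ravel()  # 2x5 grid as a flat sequence of 10 Axes

# One grid cell per fit; components[0] is the single topic row.
for ax, components, title in zip(axes_flat, fits, titles):
    plot_top_words(ax, components[0], feature_names, 20, title)

# Hide the grid cells that received no fit (3 of the 10 here).
for ax in axes_flat[len(fits):]:
    ax.set_visible(False)
```

In the loop from the question, the equivalent call would presumably be `plot_top_words(axes_flat[i], lda.components_[0], tf_feature_names, n_top_words, sector_list[i])` after each `lda.fit(tf)`. Indexing with `axes[row, col]`, where `row, col = divmod(i, 5)`, would select the same cell if explicit row/column arguments are preferred.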

