For context, I'm working with the topic extraction example from scikit-learn's text analysis documentation, which displays the top words for a given fit. But my actual issue is with matplotlib.
How to reference sub-plot row/column locations?
Extracting subplot coordinates in Python
This question asks about coordinates of subplots, but I can't find a way to use this info to help me with my for loop, which is supposed to plot the top words from a list of data inputs (running the model with different data at each iteration and plotting results in a distinct sub plot):
tf_list = [cm_array, xb_array, array_3, array_4, array_5, array_6, array_7]

# range() needs an integer, so iterate over the indices of tf_list
for i in range(len(tf_list)):
    tf = tf_vectorizer.fit_transform(tf_list[i])
    n_components = 1
    lda.fit(tf)
    n_top_words = 20
    tf_feature_names = tf_vectorizer.get_feature_names_out()
    top_word_comparison(lda, tf_feature_names, n_top_words, "Topics in LDA model")
I think this should work in theory, but the trouble is I can't figure out how to change the documentation's plot function to incorporate different fits. The furthest I got (with the help of Alex):
def top_word_comparison(axes, model, feature_names, n_top_words, subplot_title):
    #column logic
    for j in range(len(tf_list)):
        top_features_ind = model.components_.argsort()[: -n_top_words - 1 : -1]
        top_features = [feature_names[i] for i in top_features_ind]
        weights = model.components_[top_features_ind]
        #print(len(model.components_))
        print(weights)
        ax = axes[j]
        ax.barh(top_features, weights, height=0.7)
        ax.set_title(subplot_title, fontdict={"fontsize": 30})
        ax.invert_yaxis()
        ax.tick_params(axis="both", which="major", labelsize=20)
        for i in "top right left".split():
            ax.spines[i].set_visible(False)

#tf_list = [cm_array, xb_array]
fig, axes = plt.subplots(2, 5, figsize=(30, 15), sharex=True)
fig.suptitle("Topics in LDA model", fontsize=40)

for i in range(len(tf_list)):
    tf = tf_vectorizer.fit_transform(tf_list[i])
    n_components = 1
    lda.fit(tf)
    n_top_words = 20
    tf_feature_names = tf_vectorizer.get_feature_names_out()
    top_word_comparison(axes[0], lda, tf_feature_names, n_top_words, sector_list[i])

plt.subplots_adjust(top=0.90, bottom=0.05, wspace=0.90, hspace=0.3)
plt.show()
Getting the error:
IndexError: index 735 is out of bounds for axis 0 with size 1
Which leads me to think that when I changed:
for topic_idx, topic in enumerate(model.components_):
    top_features_ind = topic.argsort()[: -n_top_words - 1 : -1]
    top_features = [feature_names[i] for i in top_features_ind]
    weights = topic[top_features_ind]
to:
for j in range(len(tf_list)):
    top_features_ind = model.components_.argsort()[: -n_top_words - 1 : -1]
    top_features = [feature_names[i] for i in top_features_ind]
    weights = model.components_[top_features_ind]
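One possible reading of the difference between the two versions, sketched with NumPy alone: in the documentation loop, `topic` is a single 1-D row of `model.components_`, whereas `model.components_` itself is 2-D with shape `(n_components, n_features)`, even when there is only one component. The array `components`, its values, and the feature names below are made up for illustration and stand in for a real fit.

```python
import numpy as np

# Stand-in for lda.components_ after a fit with n_components=1:
# a 2-D array with a single row. The numbers are invented.
components = np.array([[0.2, 1.5, 0.1, 0.9, 0.4]])
feature_names = np.array(["alpha", "beta", "gamma", "delta", "epsilon"])
n_top_words = 3

# argsort() on the 2-D array stays 2-D, and the slice then acts on rows,
# so all column indices survive instead of only the top n_top_words.
bad_ind = components.argsort()[: -n_top_words - 1 : -1]
print(bad_ind.shape)

# Using those column indices on axis 0 (which has size 1) raises IndexError.
error_message = ""
try:
    components[bad_ind]
except IndexError as exc:
    error_message = str(exc)
print(error_message)

# Taking the single row first gives back the 1-D "topic" of the original loop.
topic = components[0]
top_ind = topic.argsort()[: -n_top_words - 1 : -1]
top_features = feature_names[top_ind]
weights = topic[top_ind]
print(top_features, weights)
```

If this reading is right, substituting `model.components_[0]` for `topic`, rather than `model.components_`, should keep the indexing 1-D.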
Conclusion
Even though each fit only has `1` component in `components_`, it seems that I can't just replace `topic` with `model.components_` every time it pops up. So, the trouble is:
- My LDA model just has one component for each run, so we are not plotting one subplot per component like we might see in the documentation.
- Instead, we are trying to plot subplots based on entirely new model fits, and for that reason it would make sense to loop over the number of fits/data elements in `tf_list`. However, when we do so, the matrix algebra seems to collapse.
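A minimal sketch of the "one subplot per fit" idea, assuming the plotting function receives a single `Axes` and the outer loop decides which grid cell to use. The helper `plot_top_words`, the random weights, and the `titles` list are hypothetical stand-ins (for `lda.components_[0]` and `sector_list`); only the matplotlib and NumPy calls are real.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np


def plot_top_words(ax, weights, feature_names, n_top_words, title):
    """Draw the top words of one single-topic fit on one Axes."""
    top_ind = np.argsort(weights)[: -n_top_words - 1 : -1]
    ax.barh(feature_names[top_ind], weights[top_ind], height=0.7)
    ax.set_title(title)
    ax.invert_yaxis()
    for side in ("top", "right", "left"):
        ax.spines[side].set_visible(False)


# Invented stand-ins for seven fits: each entry plays the role of
# lda.components_[0] after fitting on one element of tf_list.
rng = np.random.default_rng(0)
feature_names = np.array([f"word{k}" for k in range(30)])
fits = [rng.random(30) for _ in range(7)]
titles = [f"sector {k}" for k in range(7)]  # hypothetical sector_list

fig, axes = plt.subplots(2, 5, figsize=(30, 15))
flat_axes = axes.ravel()  # (2, 5) grid -> 10 Axes in row-major order

# One Axes per fit: no inner loop over tf_list inside the plotting function.
for ax, weights, title in zip(flat_axes, fits, titles):
    plot_top_words(ax, weights, feature_names, 5, title)

# Hide the grid cells that no fit was assigned to.
for ax in flat_axes[len(fits):]:
    ax.set_visible(False)
```

In the real loop, the fit would happen inside the `for` body and `lda.components_[0]` would be passed in place of the random weights. Flattening the grid with `axes.ravel()` sidesteps the row/column bookkeeping; `divmod(i, 5)` would give explicit `(row, col)` positions if those are preferred.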
from python - matplot lib sub-plot grid: where to insert row/column arguments