Monday, 16 November 2020

Find nearest neighbors change algorithm

I am in the process of creating a recommender system that suggests 20 most suitable songs to a user. I've trained my model, I'm ready to recommend songs for a given playlist! However, one issue that I encountered is that I need the embedding of that new playlist in order to find the closest relevant playlists in that embedding space using kmeans.

To recommend songs, I first cluster the learned embeddings for all of the training playlists, and then select "neighbor" playlists for my given test playlist as all of the other playlists in that same cluster. I then take all of the tracks from these playlists and feed the test playlist embedding and these "neighboring" tracks into my model for prediction. This ranks the "neighboring" tracks by how likely they are (under my model) to occur next in the given test playlist.

desired_user_id = 123
model_path = Path(PATH, 'model.h5')
print('using model: %s' % model_path)
model =keras.models.load_model(model_path)
print('Loaded model!')

mlp_user_embedding_weights = (next(iter(filter(lambda x: x.name == 'mlp_user_embedding', model.layers))).get_weights())

# get the latent embedding for your desired user
user_latent_matrix = mlp_user_embedding_weights[0]
one_user_vector = user_latent_matrix[desired_user_id,:]
one_user_vector = np.reshape(one_user_vector, (1,32))

print('\nPerforming kmeans to find the nearest users/playlists...')
# get 100 similar users
kmeans = KMeans(n_clusters=100, random_state=0, verbose=0).fit(user_latent_matrix)
desired_user_label = kmeans.predict(one_user_vector)
user_label = kmeans.labels_
neighbors = []
for user_id, user_label in enumerate(user_label):
    if user_label == desired_user_label:
        neighbors.append(user_id)
print('Found {0} neighbor users/playlists.'.format(len(neighbors)))

tracks = []
for user_id in neighbors:
    tracks += list(df[df['pid'] == int(user_id)]['trackindex'])
print('Found {0} neighbor tracks from these users.'.format(len(tracks))) 

users = np.full(len(tracks), desired_user_id, dtype='int32')
items = np.array(tracks, dtype='int32')

# and predict tracks for my user
results = model.predict([users,items],batch_size=100, verbose=0) 
results = results.tolist()
print('Ranked the tracks!')

results_df = pd.DataFrame(np.nan, index=range(len(results)), columns=['probability','track_name', 'track artist'])
print(results_df.shape)

# loop through and get the probability (of being in the playlist according to my model), the track, and the track's artist 
for i, prob in enumerate(results):
    results_df.loc[i] = [prob[0], df[df['trackindex'] == i].iloc[0]['track_name'], df[df['trackindex'] == i].iloc[0]['artist_name']]
results_df = results_df.sort_values(by=['probability'], ascending=False)

results_df.head(20)

Instead of this code above, I would like to use this https://www.tensorflow.org/recommenders/examples/basic_retrieval#building_a_candidate_ann_index or the official GitHub repository from Spotify https://github.com/spotify/annoy. Unfortunately I don't know exactly how to use this so that the new program gives me the 20 most popular tracks for a user. How do I have to change this?


Edit:

What I tried:

from annoy import AnnoyIndex
import random
desired_user_id = 123
model_path = Path(PATH, 'model.h5')
print('using model: %s' % model_path)
model =keras.models.load_model(model_path)
print('Loaded model!')
    
mlp_user_embedding_weights = (next(iter(filter(lambda x: x.name == 'mlp_user_embedding', model.layers))).get_weights())
    
# get the latent embedding for your desired user
user_latent_matrix = mlp_user_embedding_weights[0]
one_user_vector = user_latent_matrix[desired_user_id,:]
one_user_vector = np.reshape(one_user_vector, (1,32))

t = AnnoyIndex(desired_user_id , one_user_vector)  #Length of item vector that will be indexed
for i in range(1000):
    v = [random.gauss(0, 1) for z in range(f)]
    t.add_item(i, v)

t.build(10) # 10 trees
t.save('test.ann')

u = AnnoyIndex(desired_user_id , one_user_vector)
u.load('test.ann') # super fast, will just mmap the file
print(u.get_nns_by_item(0, 1000)) # will find the 1000 nearest neighbors
# Now how to I get the probability and the values? 


from Find nearest neighbors change algorithm

No comments:

Post a Comment