Friday, 21 December 2018

Second-order cooccurrence of terms in texts

Basically, I want to reimplement this video.

Given a corpus of documents, I want to find the terms that are most similar to each other.

I was able to generate a cooccurrence matrix using this SO thread, and I used the video to generate an association matrix. Next, I would like to generate a second-order cooccurrence matrix.

Problem statement: Consider a matrix where each row corresponds to a term and the entries in that row are the top k terms most similar to that term. Say k = 4 and we have n terms in our dictionary; then the matrix M has n rows and 4 columns.

HAVE:

M = [[18,34,54,65],   # Term IDs similar to Term t_0
     [18,12,54,65],   # Term IDs similar to Term t_1
     ...
     [21,43,55,78]]   # Term IDs similar to Term t_n.

So, M contains, for each term ID, the most similar term IDs. Now, I would like to check how many of those similar terms match. In the example of M above, it seems that terms t_0 and t_1 are quite similar, because three out of four terms match, whereas terms t_0 and t_n are not similar, because no terms match. Let's write M as a series of lists.

M = [list_0,   # Term IDs similar to Term t_0
     list_1,   # Term IDs similar to Term t_1
     ...
     list_n]   # Term IDs similar to Term t_n.
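
To make the "how many terms match" idea concrete, here is a minimal sketch in plain Python. The helper name overlap is my own (hypothetical), and it treats each list as a set, so the order of the IDs is ignored. Only the three rows written out above are used; the elided rows stay elided.

```python
# Rows copied from the example M above; the rows in between are not shown in the post.
t_0 = [18, 34, 54, 65]
t_1 = [18, 12, 54, 65]
t_n = [21, 43, 55, 78]


def overlap(a, b):
    """Count the term IDs that appear in both lists (order is ignored)."""
    return len(set(a) & set(b))


print(overlap(t_0, t_1))  # 3 (IDs 18, 54 and 65 are shared)
print(overlap(t_0, t_n))  # 0 (no shared IDs)
```

A raw count like this depends on k, so it would probably need some normalisation before it is usable as a similarity score.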

WANT:

C = [[f(list_0, list_0), f(list_0, list_1), ..., f(list_0, list_n)],
     [f(list_1, list_0), f(list_1, list_1), ..., f(list_1, list_n)],
     ...
     [f(list_n, list_0), f(list_n, list_1), ..., f(list_n, list_n)]]

I'd like to find the matrix C whose elements are a function f applied to pairs of lists from M. f(a, b) measures the degree of similarity between two lists a and b. Going with the example above, the degree of similarity between t_0 and t_1 should be high, whereas the degree of similarity between t_0 and t_n should be low.
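
As a rough sketch of what I mean by C, the snippet below assumes that f is the Jaccard index (size of the intersection divided by size of the union). That is only one possible choice, and it ignores the order of the IDs within each list. The function name jaccard_matrix is hypothetical, and the three-row M is just the rows written out above.

```python
import numpy as np


def jaccard_matrix(M):
    """Return C with C[i, j] = |row_i & row_j| / |row_i | row_j| (rows as sets)."""
    M = np.asarray(M)
    n_rows = M.shape[0]
    # Map the term IDs onto consecutive column indices 0..n_ids-1.
    ids, cols = np.unique(M, return_inverse=True)
    cols = cols.reshape(M.shape)
    # Binary indicator matrix: B[i, c] == 1 if term ids[c] occurs in row i.
    B = np.zeros((n_rows, ids.size), dtype=np.int64)
    B[np.arange(n_rows)[:, None], cols] = 1
    inter = B @ B.T                       # pairwise intersection sizes
    sizes = B.sum(axis=1)                 # number of distinct IDs per row
    union = sizes[:, None] + sizes[None, :] - inter
    return inter / union


M = [[18, 34, 54, 65],
     [18, 12, 54, 65],
     [21, 43, 55, 78]]
C = jaccard_matrix(M)
print(C)
```

Here C[0, 1] is 3/5, because the two lists share three IDs out of five distinct ones, and C[0, 2] is 0. The indicator matrix B is dense, so for a large dictionary a sparse representation would likely be needed.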

My questions:
1) What is a good choice for comparing the ordering of two lists? That is, what is a good choice for function f?
2) Is there a transformation already available that takes as input a matrix like M and produces a matrix like C? Preferably a Python package?

Thank you, r0f1


